@eryx/unicode Module

The unicode module provides Unicode-aware string operations built on top of Luau's built-in utf8 library. It covers encoding conversion (UTF-8 to and from UTF-16 and UTF-32), BOM detection, normalisation (NFC/NFD/NFKC/NFKD), case conversion and case folding, character name and property lookup, and grapheme-based length and slicing.

All string inputs and outputs use UTF-8 unless otherwise stated.

local unicode = require("@eryx/unicode")

-- Case conversion
print(unicode.toUpperCase("héllo")) -- "HÉLLO"
print(unicode.toLowercase("ÑOÑO")) -- "ñoño"

-- Normalisation
local nfc = unicode.normalise("e\xCC\x81", "NFC") -- é (composed)
print(unicode.isNormalised(nfc, "NFC")) -- true

-- Character name lookup
print(unicode.name("€")) -- "EURO SIGN"
print(unicode.fromName("SNOWMAN")) -- "☃"

-- Grapheme-aware length and slicing
print(unicode.len("héllo")) -- 5
print(unicode.sub("héllo", 2, 4)) -- "éll"

-- Encoding round-trip
local utf16 = unicode.toUtf16("hello")
print(unicode.fromUtf16(utf16)) -- "hello"

Summary

Functions

unicode.fromName(name: string, default: string?) → string?
unicode.toNumber(grapheme: string) → number?
unicode.category(cp: number) → string
unicode.name(grapheme: string) → string
unicode.normalise(source: string, format: "NFC" | "NFKC" | "NFD" | "NFKD") → string
unicode.isNormalised(source: string, format: "NFC" | "NFKC" | "NFD" | "NFKD") → boolean
unicode.foldCase(source: string) → string
unicode.toUpperCase(source: string) → string
unicode.toLowercase(source: string) → string
unicode.toTitleCase(source: string) → string
unicode.graphemes(source: string) → (() → string?)
unicode.detectBom(source: string) → (Endian?, BomKind?, number)
unicode.fromUtf16(source: string, endian: ("little" | "big")?, strict: boolean?, preserveBom: boolean?) → string
unicode.fromUtf32(source: string, endian: ("little" | "big")?, strict: boolean?, preserveBom: boolean?) → string
unicode.toUtf16(source: string, endian: ("little" | "big")?, includeBom: boolean?, strict: boolean?) → string
unicode.toUtf32(source: string, endian: ("little" | "big")?, includeBom: boolean?) → string
unicode.len(source: string) → number
unicode.sub(source: string, start: number, end_: number?) → string

API Reference

Functions

unicode.fromName

Get the UTF-8 encoding of a character from its Unicode name (e.g. LEFT CURLY BRACKET)

unicode.fromName(name: string, default: string?) → string?

Parameters

name: string

The Unicode character name to look up

default: string?

The value to return if the name is not found

Returns

The UTF-8 string for the named character, or default if not found
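As a brief illustration of the lookup and fallback described above, the following sketch resolves one name that exists and one that does not. The name "NO SUCH CHARACTER NAME" and the "?" fallback are illustrative choices made for this example, not part of the module.

```luau
local unicode = require("@eryx/unicode")

-- A known name resolves to the UTF-8 encoding of the character.
local snowman = unicode.fromName("SNOWMAN")
print(snowman) -- "☃"

-- An unknown name (hypothetical, chosen so that it does not exist)
-- should yield the supplied default instead.
local missing = unicode.fromName("NO SUCH CHARACTER NAME", "?")
print(missing) -- "?"
```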

unicode.toNumber

Converts a Unicode grapheme to its numeric value, e.g. "٩" (ARABIC-INDIC DIGIT NINE) -> 9, "½" -> 0.5

unicode.toNumber(grapheme: string) → number?

Parameters

grapheme: string

The UTF-8 grapheme to convert

Returns

The numeric value of the grapheme, or nil if it is not a numeric character
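The two conversions mentioned above, plus a non-numeric input, can be sketched as follows. The choice of "x" as the non-numeric character is only for illustration.

```luau
local unicode = require("@eryx/unicode")

local nine = unicode.toNumber("٩") -- ARABIC-INDIC DIGIT NINE
local half = unicode.toNumber("½") -- VULGAR FRACTION ONE HALF
local none = unicode.toNumber("x") -- a letter carries no numeric value

print(nine) -- 9
print(half) -- 0.5
print(none) -- nil
```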

unicode.category

Returns the Unicode general category for a codepoint.

unicode.category(cp: number) → string

Parameters

cp: number

The Unicode codepoint to inspect

Returns

The two-letter Unicode general category (for example "Lu" or "Nd")
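Because category takes a codepoint rather than a string, a typical call pairs it with utf8.codepoint from Luau's built-in utf8 library. A minimal sketch, assuming the module reports the standard Unicode general categories:

```luau
local unicode = require("@eryx/unicode")

-- utf8.codepoint returns the codepoint of the first character by default.
local upperA = unicode.category(utf8.codepoint("A")) -- U+0041
local digit7 = unicode.category(utf8.codepoint("7")) -- U+0037

print(upperA) -- "Lu" (uppercase letter)
print(digit7) -- "Nd" (decimal digit number)
```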

unicode.name

Get the Unicode name for a grapheme (e.g. LATIN SMALL LETTER A)

unicode.name(grapheme: string) → string

Parameters

grapheme: string

The UTF-8 grapheme to look up

Returns

The Unicode name of the grapheme
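name and fromName are inverses of each other for single named characters, which the following sketch exercises. It assumes both functions behave as documented on this page.

```luau
local unicode = require("@eryx/unicode")

local euroName = unicode.name("€")
print(euroName) -- "EURO SIGN"

-- Feeding the name back into fromName should recover the original character.
local euro = unicode.fromName(euroName)
print(euro == "€")
```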

unicode.normalise

Normalise a UTF-8 string to the given Unicode normalisation form

unicode.normalise(source: string, format: "NFC" | "NFKC" | "NFD" | "NFKD") → string

Parameters

source: string

The UTF-8 string to normalise

format: "NFC" | "NFKC" | "NFD" | "NFKD"

The normalisation form to apply

Returns

The normalised UTF-8 string

unicode.isNormalised

Returns whether a UTF-8 string is already in the given normalisation form

unicode.isNormalised(source: string, format: "NFC" | "NFKC" | "NFD" | "NFKD") → boolean

Parameters

source: string

The UTF-8 string to check

format: "NFC" | "NFKC" | "NFD" | "NFKD"

The normalisation form to check against

Returns

Whether the string is already normalised to the given form
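A common reason to normalise is that canonically equivalent strings can differ byte for byte. The sketch below compares the composed and decomposed spellings of "é", assuming normalise and isNormalised follow the standard Unicode normalisation forms.

```luau
local unicode = require("@eryx/unicode")

local decomposed = "e\xCC\x81" -- "e" followed by U+0301 COMBINING ACUTE ACCENT
local composed = "\xC3\xA9" -- U+00E9 LATIN SMALL LETTER E WITH ACUTE

-- The two spellings render identically but are different byte sequences.
print(decomposed == composed) -- false

-- Normalising both sides to the same form makes them comparable.
local a = unicode.normalise(decomposed, "NFC")
local b = unicode.normalise(composed, "NFC")
print(a == b) -- true

local isNfc = unicode.isNormalised(decomposed, "NFC")
local isNfd = unicode.isNormalised(decomposed, "NFD")
print(isNfc) -- false
print(isNfd) -- true
```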

unicode.foldCase

Apply the Unicode case folding algorithm to a UTF-8 string, producing a canonical lowercase form suitable for case-insensitive comparison

unicode.foldCase(source: string) → string

Parameters

source: string

The UTF-8 string to case-fold

Returns

The case-folded UTF-8 string
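The case-insensitive comparison mentioned above can be sketched with a small helper. equalsIgnoreCase is a hypothetical name introduced for this example; it is not part of the module.

```luau
local unicode = require("@eryx/unicode")

-- Hypothetical helper: compares two UTF-8 strings, ignoring case,
-- by folding both sides before comparing.
local function equalsIgnoreCase(a: string, b: string): boolean
	return unicode.foldCase(a) == unicode.foldCase(b)
end

local same = equalsIgnoreCase("HÉLLO", "héllo")
local different = equalsIgnoreCase("héllo", "hallo")
print(same) -- true
print(different) -- false
```

If the inputs may mix composed and decomposed characters, normalising both sides to the same form before folding is likely to be necessary as well.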

unicode.toUpperCase

Convert a UTF-8 string to uppercase

unicode.toUpperCase(source: string) → string

Parameters

source: string

The UTF-8 string to convert

Returns

The uppercase UTF-8 string

unicode.toLowercase

Convert a UTF-8 string to lowercase

unicode.toLowercase(source: string) → string

Parameters

source: string

The UTF-8 string to convert

Returns

The lowercase UTF-8 string

unicode.toTitleCase

Convert a UTF-8 string to title case, capitalising the first codepoint

unicode.toTitleCase(source: string) → string

Parameters

source: string

The UTF-8 string to convert

Returns

The title-cased UTF-8 string
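To show the three case conversions side by side, the sketch below applies each to the same single-word input. How toTitleCase treats codepoints after the first is not stated here, so the input is already lowercase to avoid relying on that.

```luau
local unicode = require("@eryx/unicode")

local word = "élan"

local upper = unicode.toUpperCase(word)
local lower = unicode.toLowercase(word)
local title = unicode.toTitleCase(word)

print(upper) -- "ÉLAN"
print(lower) -- "élan"
print(title) -- "Élan"
```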

unicode.graphemes

Returns an iterator over all graphemes in a UTF-8 string

unicode.graphemes(source: string) → (() → string?)

Parameters

source: string

The UTF-8 string to iterate over

Returns

(() → string?)

An iterator that yields each grapheme in turn
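Since the returned iterator is a plain function that yields nil when exhausted, it can be used directly in a generic for loop. A minimal sketch, assuming graphemes are segmented as standard extended grapheme clusters:

```luau
local unicode = require("@eryx/unicode")

-- "e" + U+0301 is two codepoints but a single grapheme.
local text = "e\xCC\x81!"

local parts = {}
for grapheme in unicode.graphemes(text) do
	table.insert(parts, grapheme)
end

print(#parts) -- 2
print(parts[1] == "e\xCC\x81") -- true
print(parts[2]) -- "!"
```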

unicode.detectBom

Inspects the leading bytes of a string for a Unicode byte order mark (BOM) and returns the encoding information it indicates. UTF-32 BOMs are checked before UTF-16 to avoid ambiguity (the UTF-32 LE BOM starts with the same two bytes as the UTF-16 LE BOM). When no BOM is present, the first two values are nil and the byte count is 0.

unicode.detectBom(source: string) → (Endian?, BomKind?, number)

Parameters

source: string

The input string to inspect

Returns

The byte order indicated by the BOM, or nil if no BOM was found

The encoding kind indicated by the BOM (utf16 or utf32), or nil if no BOM was found

The number of bytes consumed by the BOM (0 if no BOM was found)
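The three return values, and the UTF-32-before-UTF-16 check order, can be illustrated with hand-written byte strings. The inputs below are constructed for this example; the expected results follow from the description above.

```luau
local unicode = require("@eryx/unicode")

-- UTF-16 LE BOM (FF FE) followed by "h" as one UTF-16 LE code unit.
local endian16, kind16, size16 = unicode.detectBom("\xFF\xFEh\0")
print(endian16, kind16, size16) -- "little", "utf16", 2

-- UTF-32 LE BOM (FF FE 00 00) followed by "h" as one UTF-32 LE code unit.
-- It begins with the same two bytes as the UTF-16 LE BOM.
local endian32, kind32, size32 = unicode.detectBom("\xFF\xFE\0\0h\0\0\0")
print(endian32, kind32, size32) -- "little", "utf32", 4

-- Plain text without a BOM.
local endianNone, kindNone, sizeNone = unicode.detectBom("hello")
print(endianNone, kindNone, sizeNone) -- nil, nil, 0
```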

unicode.fromUtf16

Convert a UTF-16 encoded string to UTF-8

unicode.fromUtf16(source: string, endian: ("little" | "big")?, strict: boolean?, preserveBom: boolean?) → string

Parameters

source: string

The UTF-16 encoded string to convert

endian: ("little" | "big")?

The byte order; if not specified, it is inferred from the BOM, falling back to little-endian when no BOM is present

strict: boolean?

Whether to error on unpaired surrogates; defaults to false (permissive, preserves them)

preserveBom: boolean?

Whether to decode BOM bytes as U+FEFF / U+FFFE rather than using them for endian detection; defaults to false

Returns

The UTF-8 encoded string
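The interaction between the endian parameter and BOM detection can be sketched with raw UTF-16 bytes for "hi". The byte strings are written by hand for this example, and the results assume the defaults documented above.

```luau
local unicode = require("@eryx/unicode")

-- "hi" as raw UTF-16 code units, without a BOM.
local littleBytes = "h\0i\0"
local bigBytes = "\0h\0i"

-- No BOM and no endian argument: little-endian is assumed.
local fromLittle = unicode.fromUtf16(littleBytes)

-- No BOM, explicit byte order.
local fromBig = unicode.fromUtf16(bigBytes, "big")

-- A leading big-endian BOM (FE FF) lets the byte order be inferred.
local fromBom = unicode.fromUtf16("\xFE\xFF" .. bigBytes)

print(fromLittle, fromBig, fromBom) -- "hi", "hi", "hi"
```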

unicode.fromUtf32

Convert a UTF-32 encoded string to UTF-8

unicode.fromUtf32(source: string, endian: ("little" | "big")?, strict: boolean?, preserveBom: boolean?) → string

Parameters

source: string

The UTF-32 encoded string to convert

endian: ("little" | "big")?

The byte order; if not specified, it is inferred from the BOM, falling back to little-endian when no BOM is present

strict: boolean?

Whether to error on surrogate code points; defaults to false (permissive, preserves them)

preserveBom: boolean?

Whether to decode BOM bytes as U+FEFF rather than using them for endian detection; defaults to false

Returns

The UTF-8 encoded string

unicode.toUtf16

Convert a UTF-8 string to UTF-16

unicode.toUtf16(source: string, endian: ("little" | "big")?, includeBom: boolean?, strict: boolean?) → string

Parameters

source: string

The UTF-8 string to convert

endian: ("little" | "big")?

The byte order for the output; defaults to little-endian

includeBom: boolean?

Whether to prepend a byte order mark; defaults to true

strict: boolean?

Whether to error on surrogate code points in the source; defaults to false

Returns

The UTF-16 encoded string
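To make the effect of endian and includeBom concrete, the following sketch compares the output bytes against hand-written UTF-16. It assumes the documented defaults (little-endian, BOM included).

```luau
local unicode = require("@eryx/unicode")

-- Defaults: little-endian with a leading BOM (FF FE).
local withBom = unicode.toUtf16("hi")
print(#withBom) -- 6 (2-byte BOM plus two 2-byte code units)
print(withBom == "\xFF\xFEh\0i\0") -- true

-- Big-endian, BOM suppressed.
local bare = unicode.toUtf16("hi", "big", false)
print(bare == "\0h\0i") -- true

-- A codepoint outside the Basic Multilingual Plane needs a surrogate pair,
-- so it occupies two code units (four bytes). "\xF0\x9F\x98\x80" is U+1F600.
local emoji = unicode.toUtf16("\xF0\x9F\x98\x80", "big", false)
print(#emoji) -- 4
print(emoji == "\xD8\x3D\xDE\0") -- true
```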

unicode.toUtf32

Convert a UTF-8 string to UTF-32

unicode.toUtf32(source: string, endian: ("little" | "big")?, includeBom: boolean?) → string

Parameters

source: string

The UTF-8 string to convert

endian: ("little" | "big")?

The byte order for the output; defaults to little-endian

includeBom: boolean?

Whether to prepend a byte order mark; defaults to false

Returns

The UTF-32 encoded string
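UTF-32 uses exactly four bytes per codepoint, which the sketch below checks before round-tripping through fromUtf32. It assumes the documented default of no BOM.

```luau
local unicode = require("@eryx/unicode")

local source = "h\xC3\xA9" -- "hé": U+0068 followed by U+00E9

-- Big-endian, no BOM by default.
local encoded = unicode.toUtf32(source, "big")
print(#encoded) -- 8 (two codepoints, four bytes each)
print(encoded == "\0\0\0h\0\0\0\xE9") -- true

-- Decoding with the same byte order should recover the original string.
local decoded = unicode.fromUtf32(encoded, "big")
print(decoded == source) -- true
```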

unicode.len

Returns the length of a UTF-8 string measured in graphemes

unicode.len(source: string) → number

Parameters

source: string

The UTF-8 string to measure

Returns

The number of graphemes in the string
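The difference between bytes, codepoints, and graphemes is easiest to see on a string where all three counts differ. The sketch below uses Luau's length operator and the built-in utf8.len alongside unicode.len.

```luau
local unicode = require("@eryx/unicode")

local text = "e\xCC\x81" -- "e" followed by U+0301 COMBINING ACUTE ACCENT

local byteCount = #text
local codepointCount = utf8.len(text)
local graphemeCount = unicode.len(text)

print(byteCount) -- 3
print(codepointCount) -- 2
print(graphemeCount) -- 1
```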

unicode.sub

Returns a substring of a UTF-8 string using grapheme indices rather than byte offsets

unicode.sub(source: string, start: number, end_: number?) → string

Parameters

source: string

The UTF-8 string to slice

start: number

The first grapheme index (1-based; negative values count from the end)

end_: number?

The last grapheme index, inclusive; defaults to the end of the string

Returns

The extracted substring
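The sketch below contrasts grapheme indices with the byte offsets used by the built-in string.sub, and shows an omitted end_ and a negative start. The expected results assume the indexing rules documented above.

```luau
local unicode = require("@eryx/unicode")

local text = "h\xC3\xA9llo" -- "héllo"; the "é" occupies two bytes

-- Grapheme indices 2 through 4.
local middle = unicode.sub(text, 2, 4)
print(middle) -- "éll"

-- end_ omitted: slice runs to the end of the string.
local tail = unicode.sub(text, 4)
print(tail) -- "lo"

-- Negative start counts from the end.
local lastTwo = unicode.sub(text, -2)
print(lastTwo) -- "lo"

-- By contrast, string.sub works on bytes and splits the "é" in half.
local brokenByte = string.sub(text, 2, 2)
print(#brokenByte) -- 1 (only the lead byte \xC3)
```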

Types

Endian

type Endian = "little" | "big"

BomKind

type BomKind = "utf16" | "utf32"