@eryx/unicode Module
The unicode module provides Unicode-aware string operations built on top of Luau's built-in utf8 library.
It covers encoding conversion (UTF-8 <-> UTF-16 <-> UTF-32), normalisation (NFC/NFD/NFKC/NFKD),
case conversion, character name lookup, and grapheme-based string slicing.
All string inputs and outputs use UTF-8 unless otherwise stated.
local unicode = require("@eryx/unicode")
-- Case conversion
print(unicode.toUpperCase("héllo")) -- "HÉLLO"
print(unicode.toLowercase("ÑOÑO")) -- "ñoño"
-- Normalisation
local nfc = unicode.normalise("e\xCC\x81", "NFC") -- é (composed)
print(unicode.isNormalised(nfc, "NFC")) -- true
-- Character name lookup
print(unicode.name("€")) -- "EURO SIGN"
print(unicode.fromName("SNOWMAN")) -- "☃"
-- Grapheme-aware length and slicing
print(unicode.len("héllo")) -- 5
print(unicode.sub("héllo", 2, 4)) -- "éll"
-- Encoding round-trip
local utf16 = unicode.toUtf16("hello")
print(unicode.fromUtf16(utf16)) -- "hello"
API Reference
Functions
unicode.fromName
Get the UTF-8 representation of a character from its Unicode name (e.g. LEFT CURLY BRACKET)
Parameters
The Unicode character name to look up
The value to return if the name is not found
Returns
The UTF-8 string for the named character, or default if not found
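A minimal sketch of a name lookup with a fallback. It assumes the two parameters are passed positionally in the order listed above (name first, then default); the unknown name is made up for illustration.

```luau
local unicode = require("@eryx/unicode")

-- A known name resolves to its UTF-8 string (this pair appears in the overview).
local snowman = unicode.fromName("SNOWMAN")
print(snowman) -- "☃"

-- Assumption: the default is the second positional argument.
-- "NO SUCH CHARACTER NAME" is a hypothetical name that is not a Unicode character name.
local fallback = unicode.fromName("NO SUCH CHARACTER NAME", "?")
print(fallback) -- "?"
```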
unicode.toNumber
Converts a Unicode grapheme to its numeric value, e.g. "٩" (ARABIC-INDIC DIGIT NINE) -> 9, "½" -> 0.5
Parameters
The UTF-8 grapheme to convert
Returns
The numeric value of the grapheme, or nil if it is not a numeric character
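The two conversions mentioned above, plus the nil case, might look like this in use. The helper name numericOr is hypothetical and only illustrates one way to handle the nil return.

```luau
local unicode = require("@eryx/unicode")

print(unicode.toNumber("٩")) -- 9
print(unicode.toNumber("½")) -- 0.5
print(unicode.toNumber("x")) -- nil ("x" is not a numeric character)

-- Hypothetical helper: substitute a fallback when the grapheme is not numeric.
local function numericOr(grapheme, fallback)
    local value = unicode.toNumber(grapheme)
    if value == nil then
        return fallback
    end
    return value
end

print(numericOr("x", 0)) -- 0
```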
unicode.category
Returns the Unicode general category for a codepoint.
Parameters
The Unicode codepoint to inspect
Returns
The two-letter Unicode general category (for example "Lu" or "Nd")
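Because this function takes a codepoint rather than a string, a caller would typically obtain one first. The sketch below uses utf8.codepoint from Luau's built-in utf8 library for that step; the helper name isDecimalDigit is hypothetical.

```luau
local unicode = require("@eryx/unicode")

-- utf8.codepoint returns the codepoint of the first character by default.
local upperA = utf8.codepoint("A") -- 65
print(unicode.category(upperA)) -- "Lu" (uppercase letter)

local seven = utf8.codepoint("7") -- 55
print(unicode.category(seven)) -- "Nd" (decimal digit)

-- Hypothetical helper built on the category lookup.
local function isDecimalDigit(codepoint)
    return unicode.category(codepoint) == "Nd"
end

print(isDecimalDigit(seven)) -- true
```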
unicode.name
Get the Unicode name for a grapheme (e.g. LATIN SMALL LETTER A)
Parameters
The UTF-8 grapheme to look up
Returns
The Unicode name of the grapheme
unicode.normalise
Normalise a UTF-8 string to the given Unicode normalisation form
Parameters
The UTF-8 string to normalise
The normalisation form to apply
Returns
The normalised UTF-8 string
unicode.isNormalised
Returns whether a UTF-8 string is already in the given normalisation form
Parameters
The UTF-8 string to check
The normalisation form to check against
Returns
Whether the string is already normalised to the given form
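A short sketch of how unicode.normalise and unicode.isNormalised fit together. The "NFC" form string appears in the overview; passing "NFD" the same way is an assumption based on the forms listed in the introduction, and canonicallyEqual is a hypothetical helper.

```luau
local unicode = require("@eryx/unicode")

-- "e" followed by U+0301 COMBINING ACUTE ACCENT: the decomposed spelling of é.
local decomposed = "e\xCC\x81"
local composed = unicode.normalise(decomposed, "NFC") -- single codepoint U+00E9

print(#decomposed) -- 3 (bytes)
print(#composed)   -- 2 (bytes)
print(decomposed == composed) -- false: same text, different byte sequences

print(unicode.isNormalised(decomposed, "NFC")) -- false
print(unicode.isNormalised(composed, "NFC"))   -- true

-- Hypothetical helper: compare strings after bringing both to the same form.
local function canonicallyEqual(a, b)
    return unicode.normalise(a, "NFC") == unicode.normalise(b, "NFC")
end

print(canonicallyEqual(decomposed, composed)) -- true
```

Normalising both sides before comparing, or once at the point where text enters the program, avoids mismatches between visually identical strings.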
unicode.foldCase
Apply Unicode case folding to a UTF-8 string, producing a canonical lowercase form suitable for case-insensitive comparison
Parameters
The UTF-8 string to case-fold
Returns
The case-folded UTF-8 string
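Case folding is mainly useful for comparisons, so a typical use is a small equality helper. The function name equalsIgnoreCase is hypothetical.

```luau
local unicode = require("@eryx/unicode")

-- Hypothetical helper: compare two strings without regard to case.
local function equalsIgnoreCase(a, b)
    return unicode.foldCase(a) == unicode.foldCase(b)
end

print(equalsIgnoreCase("ÑOÑO", "ñoño"))   -- true
print(equalsIgnoreCase("HELLO", "hello")) -- true
print(equalsIgnoreCase("hello", "hallo")) -- false
```

Folding both operands, rather than lowercasing one of them, keeps the comparison symmetric.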
unicode.toUpperCase
Convert a UTF-8 string to uppercase
Parameters
The UTF-8 string to convert
Returns
The uppercase UTF-8 string
unicode.toLowercase
Convert a UTF-8 string to lowercase
Parameters
The UTF-8 string to convert
Returns
The lowercase UTF-8 string
unicode.toTitleCase
Convert a UTF-8 string to title case, capitalising the first codepoint
Parameters
The UTF-8 string to convert
Returns
The title-cased UTF-8 string
unicode.graphemes
Returns an iterator over all graphemes in a UTF-8 string
Parameters
The UTF-8 string to iterate over
Returns
An iterator that yields each grapheme in turn
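A minimal sketch of collecting graphemes into a table. It assumes the iterator works with a generic for loop and yields the grapheme string as its first value on each step.

```luau
local unicode = require("@eryx/unicode")

-- "héllo" with a decomposed é: "e" plus U+0301 COMBINING ACUTE ACCENT.
local text = "he\xCC\x81llo"

local parts = {}
for grapheme in unicode.graphemes(text) do
    table.insert(parts, grapheme)
end

print(#text)  -- 7 (bytes)
print(#parts) -- 5 (graphemes: the base letter and its accent stay together)
```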
unicode.detectBom
Inspects the leading bytes of a string for a Unicode byte order mark (BOM) and returns the encoding information that the BOM indicates. UTF-32 BOMs are checked before UTF-16 BOMs to avoid ambiguity (the UTF-32 LE BOM starts with the same two bytes as the UTF-16 LE BOM). Returns nil for all three values when no BOM is present.
Parameters
The input string to inspect
Returns
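As a hedged sketch, the calls below rely only on what the description states: three return values, all nil when there is no BOM. The variable names first, second, and third are placeholders, since the meaning of each value is not listed in this section.

```luau
local unicode = require("@eryx/unicode")

-- No BOM: all three values are nil.
local first, second, third = unicode.detectBom("no BOM here")
print(first, second, third) -- nil nil nil

-- UTF-16 LE BOM (FF FE) followed by "h" as a little-endian code unit.
print(unicode.detectBom("\xFF\xFEh\x00"))

-- UTF-32 LE BOM (FF FE 00 00). Its first two bytes match the UTF-16 LE BOM,
-- which is why the UTF-32 check has to run first.
print(unicode.detectBom("\xFF\xFE\x00\x00h\x00\x00\x00"))
```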
unicode.fromUtf16
Convert a UTF-16 encoded string to UTF-8
Parameters
The UTF-16 encoded string to convert
The byte order; if not specified, it is inferred from the BOM, falling back to little-endian when no BOM is present
Whether to error on unpaired surrogates; defaults to false (permissive, preserves them)
Whether to decode BOM bytes as U+FEFF / U+FFFE rather than using them for endian detection; defaults to false
Returns
The UTF-8 encoded string
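The sketch below decodes two hand-written UTF-16 byte strings, relying only on the default behaviour described above (byte order inferred from the BOM). The optional parameters are left out because their exact types are not shown in this section.

```luau
local unicode = require("@eryx/unicode")

-- Little-endian: BOM FF FE, then "h" and "i" as 16-bit code units, low byte first.
local littleEndian = "\xFF\xFEh\x00i\x00"
print(unicode.fromUtf16(littleEndian)) -- "hi"

-- Big-endian: BOM FE FF, then the same code units, high byte first.
local bigEndian = "\xFE\xFF\x00h\x00i"
print(unicode.fromUtf16(bigEndian)) -- "hi"
```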
unicode.fromUtf32
Convert a UTF-32 encoded string to UTF-8
Parameters
The UTF-32 encoded string to convert
The byte order; if not specified, it is inferred from the BOM, falling back to little-endian when no BOM is present
Whether to error on surrogate code points; defaults to false (permissive, preserves them)
Whether to decode BOM bytes as U+FEFF rather than using them for endian detection; defaults to false
Returns
The UTF-8 encoded string
unicode.toUtf16
Convert a UTF-8 string to UTF-16
Parameters
The UTF-8 string to convert
The byte order for the output; defaults to little-endian
Whether to prepend a byte order mark; defaults to true
Whether to error on surrogate code points in the source; defaults to false
Returns
The UTF-16 encoded string
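To make the defaults concrete, the sketch below inspects the bytes produced with no optional arguments, that is, little-endian output with a BOM prepended. It assumes the result is an ordinary Luau byte string, which the round-trip in the overview suggests.

```luau
local unicode = require("@eryx/unicode")

local utf16 = unicode.toUtf16("hello")

-- 2-byte BOM + 5 code units of 2 bytes each.
print(#utf16) -- 12

-- The little-endian BOM is FF FE.
local bom1, bom2 = string.byte(utf16, 1, 2)
print(bom1, bom2) -- 255 254

-- "h" (U+0068) as a little-endian code unit: low byte first.
local lo, hi = string.byte(utf16, 3, 4)
print(lo, hi) -- 104 0

-- A codepoint outside the BMP (U+1F600) needs a surrogate pair: 2 + 4 bytes.
local astral = unicode.toUtf16("\xF0\x9F\x98\x80")
print(#astral) -- 6
```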
unicode.toUtf32
Convert a UTF-8 string to UTF-32
Parameters
The UTF-8 string to convert
The byte order for the output; defaults to little-endian
Whether to prepend a byte order mark; defaults to false
Returns
The UTF-32 encoded string
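Unlike unicode.toUtf16, this function does not prepend a BOM by default, which the byte count below reflects. The round-trip relies on unicode.fromUtf32 falling back to little-endian when no BOM is present, as its parameter description states.

```luau
local unicode = require("@eryx/unicode")

local utf32 = unicode.toUtf32("hi")

-- 2 codepoints of 4 bytes each, no BOM.
print(#utf32) -- 8

-- "h" (U+0068) as a little-endian 32-bit value.
local b1, b2, b3, b4 = string.byte(utf32, 1, 4)
print(b1, b2, b3, b4) -- 104 0 0 0

-- Decoding without a BOM uses the little-endian default.
local roundTrip = unicode.fromUtf32(utf32)
print(roundTrip) -- "hi"
```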
unicode.len
Returns the length of a UTF-8 string measured in graphemes
Parameters
The UTF-8 string to measure
Returns
The number of graphemes in the string
unicode.sub
Returns a substring of a UTF-8 string using grapheme indices rather than byte offsets
Parameters
The UTF-8 string to slice
The first grapheme index (1-based; negative values count from the end)
The last grapheme index, inclusive; defaults to the end of the string
Returns
The extracted substring
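To show why grapheme indices matter, the sketch below contrasts unicode.sub with the byte-based string.sub and adds a small truncation helper. The helper name truncate is hypothetical, and the example assumes that -1 refers to the last grapheme, mirroring string.sub.

```luau
local unicode = require("@eryx/unicode")

-- "héllo" written with explicit bytes so the precomposed é (C3 A9) is unambiguous.
local s = "h\xC3\xA9llo"

print(#s)             -- 6 (bytes)
print(unicode.len(s)) -- 5 (graphemes)

-- Byte offsets 2..4 cover the two bytes of "é" plus one "l".
print(string.sub(s, 2, 4))  -- "él"
-- Grapheme indices 2..4 cover "é", "l", "l".
print(unicode.sub(s, 2, 4)) -- "éll"

-- Assumption: negative indices count back from the last grapheme.
local tail = unicode.sub(s, -3)
print(tail) -- "llo"

-- Hypothetical helper: keep at most maxGraphemes graphemes, marking any cut.
local function truncate(text, maxGraphemes)
    if unicode.len(text) <= maxGraphemes then
        return text
    end
    return unicode.sub(text, 1, maxGraphemes) .. "..."
end

print(truncate(s, 3))  -- "hél..."
print(truncate(s, 10)) -- "héllo"
```

Slicing by grapheme avoids cutting a multi-byte character in half, which byte-based slicing can do when an offset lands inside a UTF-8 sequence.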