Encodings and Collations
- Last Updated: April 15, 2026
- MarkLogic Server
- Version 11.0
In addition to the language support described in Language Support in MarkLogic Server, MarkLogic Server supports many character encodings and can sort content according to a variety of collations. This chapter describes the MarkLogic Server support for encodings and collations, and includes the following sections:
- Character Encoding
- Collations
- Collations and Character Sets By Language
Character Encoding
MarkLogic Server stores all content in the UTF-8 encoding. If you try to load non-UTF-8 content into MarkLogic Server without translating it to UTF-8, the server throws an exception. If you have non-UTF-8 content, then you can specify the encoding for the content during ingestion, and MarkLogic Server will translate it to UTF-8. If the content cannot be translated, MarkLogic Server throws an exception indicating that there is non-UTF-8 content.
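The translation step can be pictured with a small Python sketch. This is plain Python, not a MarkLogic API; the helper name `to_utf8` and its parameters are hypothetical, and the sketch only assumes that translation means decoding with the declared encoding and re-encoding as UTF-8.

```python
# Illustrative only: plain Python, not a MarkLogic API.
# Bytes for "café" as they would appear in an ISO-8859-1 (Latin-1) file.
latin1_bytes = "café".encode("iso-8859-1")   # b'caf\xe9'


def to_utf8(raw: bytes, declared_encoding: str = "utf-8") -> bytes:
    """Decode raw bytes using the declared encoding, then re-encode as UTF-8.

    Raises UnicodeDecodeError when the bytes are not valid in that encoding.
    """
    return raw.decode(declared_encoding).encode("utf-8")


# With the correct encoding declared, the content is translated to UTF-8.
utf8_bytes = to_utf8(latin1_bytes, "iso-8859-1")

# With no encoding declared, the bytes are treated as UTF-8 and rejected,
# because a lone 0xE9 byte is not a valid UTF-8 sequence.
try:
    to_utf8(latin1_bytes)
    rejected = False
except UnicodeDecodeError:
    rejected = True
```

The same bytes succeed or fail depending only on the declared encoding, which is why non-UTF-8 content needs an explicit encoding at ingestion time.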
You can specify an explicit encoding in the following ways:
- If your content is ingested on behalf of an HTTP request, you can specify an encoding in the HTTP headers, such as setting the charset parameter of the Content-Type header.
- Set the encoding option of the functions listed in the following table.

| XQuery | JavaScript |
|---|---|
| xdmp:document-load | xdmp.documentLoad |
| xdmp:document-get | xdmp.documentGet |
| xdmp:zip-get | xdmp.zipGet |
| xdmp:gunzip | xdmp.gunzip |
| xdmp:xslt-invoke | xdmp.xsltInvoke |

Encoding is determined using the following precedence, from highest to lowest:
- The encoding option of the ingestion function, if set.
- The encoding specified by the HTTP headers, if present.
- Otherwise, assume UTF-8.
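The precedence above amounts to a simple fallback chain. The following Python sketch is an illustration of that chain only; the function `resolve_encoding` and its parameter names are hypothetical and do not correspond to any MarkLogic API.

```python
from typing import Optional


def resolve_encoding(function_option: Optional[str] = None,
                     http_charset: Optional[str] = None) -> str:
    """Pick the encoding using the precedence listed above.

    Both parameter names are hypothetical; they stand in for the ingestion
    function's encoding option and the charset from the HTTP headers.
    """
    if function_option:          # 1. explicit option on the ingestion function
        return function_option
    if http_charset:             # 2. charset from the HTTP headers
        return http_charset
    return "UTF-8"               # 3. otherwise, assume UTF-8
```

For example, `resolve_encoding("ISO-8859-1", "Shift_JIS")` yields the function option, while `resolve_encoding()` falls through to UTF-8.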
If you set the encoding option to auto, then MarkLogic tries to determine the encoding from the document content.
If the encoding is UTF-8 and any non-UTF-8 characters are found, an exception is thrown indicating the content contains non-UTF-8 characters.
MarkLogic Server assumes the character set you specify is actually the character set of the content. If you specify an encoding that is different from the actual content encoding, the result can be unpredictable: You might get an exception in some situations, but you might end up with the wrong characters in other situations.
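Both outcomes of a mismatched encoding can be reproduced outside the server. The sketch below uses Python's own codecs as a stand-in; it is not MarkLogic code, and the exact behavior inside MarkLogic may differ.

```python
# Illustrative only: plain Python, not a MarkLogic API.
original = "côté"

# Case 1: UTF-8 bytes wrongly declared as ISO-8859-1.
# Every byte is valid ISO-8859-1, so no error is raised, but the text is wrong.
wrong_chars = original.encode("utf-8").decode("iso-8859-1")

# Case 2: ISO-8859-1 bytes wrongly declared as UTF-8.
# The bytes 0xF4 and 0xE9 do not form valid UTF-8 sequences here,
# so decoding fails with an exception.
try:
    original.encode("iso-8859-1").decode("utf-8")
    raised = False
except UnicodeDecodeError:
    raised = True
```

Case 1 is the more dangerous one, because the wrong characters are stored without any error being reported.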
For details on the syntax of the encoding option, see the MarkLogic XQuery and XSLT Function Reference.
Collations
This section describes collations in MarkLogic Server. Collations specify the order in which strings are sorted and how they are compared. The section includes the following parts:
- Overview of Collations
- Two Common Collation URIs
- Collation URI Syntax
- Backward Compatibility with 3.1 Range Indexes and Lexicons
- UCA Root Collation
- How Collation Defaults are Determined
- Specifying Collations
Overview of Collations
JavaScript does not have the concept of a prolog; therefore, there is no way to declare a default collation in JavaScript the way it is done in XQuery.
A collation specifies the order for sorting strings. The collation settings determine the order for operations where the order is specified (either implicitly or explicitly) and for operations that use Range Indexes. Examples of operations that specify the order are XQuery statements with an order by clause, XQuery standard functions that compare order (for example, fn:compare, fn:substring-after, fn:substring-before, and so on), and lexicon functions (for example, cts:words, cts:element-word-match, cts:element-values, and so on). Additionally, collations determine uniqueness in string comparisons, so two strings that are equal according to one collation might not be equal according to another.
The codepoint-order collation sorts according to the Unicode codepoint order, which does not take into account any language-specific information. There are other collations that are often used to specify language-specific sorting differences. For example, a code point sort puts all uppercase letters before lower-case letters, so the word Zounds sorts before the word abracadabra. If you use a collation that sorts upper and lower-case letters together (for example, the order A a B b C c, and so on), then abracadabra sorts before Zounds.
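The Zounds and abracadabra example can be reproduced with a short Python sketch. Python's default string comparison is by codepoint; the case-folded sort key is only a rough stand-in for a collation that sorts upper and lowercase letters together, not the actual algorithm MarkLogic uses.

```python
words = ["abracadabra", "Zounds", "zebra", "Apple"]

# Codepoint order: every uppercase ASCII letter (U+0041..U+005A) sorts
# before every lowercase letter (U+0061..U+007A).
codepoint_order = sorted(words)

# Rough stand-in for a collation that sorts upper and lowercase together:
# compare case-folded text first, and use the original string as a tie-breaker.
blended_order = sorted(words, key=lambda w: (w.casefold(), w))

# Equality also depends on the comparison rules in use.
equal_by_codepoint = "Zounds" == "zounds"
equal_ignoring_case = "Zounds".casefold() == "zounds".casefold()
```

Under codepoint order Zounds precedes abracadabra; under the blended key the order is reversed, and the two spellings of "zounds" compare as equal.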
Collations are specified with a URI (for example, http://marklogic.com/collation/). The collation URIs are specific to MarkLogic Server, but they specify collations according to the Unicode collation standards. There are many variations to collations, and many sort orders that are based on preferences and traditions in various languages. The following section describes the syntax of collation URIs. Although there are a huge number of collation URIs possible, most applications will use only a small number of collations. For more information about collations, see http://icu.sourceforge.net/userguide/Collate_Concepts.html.
Two Common Collation URIs
The following are two very common collation URIs used in MarkLogic Server:
http://marklogic.com/collation/
http://marklogic.com/collation/codepoint
The first one is the UCA Root Collation (see UCA Root Collation), and is the system default. The second is the codepoint order collation, and was the default in pre-3.2 releases of MarkLogic Server.
Collation URI Syntax
Collations in MarkLogic Server are specified by a URI. All collations begin with the string http://marklogic.com/collation/. The syntax for collations is as follows:
http://marklogic.com/collation/<locale>[/<attribute>]*
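As a rough illustration of this syntax, the following Python sketch splits a collation URI into its locale and attribute segments. The function name `split_collation_uri` is hypothetical, the check is purely syntactic, and special names such as codepoint are returned as if they were a locale; MarkLogic's own parsing and validation are not shown here.

```python
from typing import List, Tuple

COLLATION_PREFIX = "http://marklogic.com/collation/"


def split_collation_uri(uri: str) -> Tuple[str, List[str]]:
    """Split a collation URI into (locale, [attributes]).

    Purely syntactic: it does not check that the locale or attributes are valid.
    """
    if not uri.startswith(COLLATION_PREFIX):
        raise ValueError("not a MarkLogic collation URI: " + uri)
    rest = uri[len(COLLATION_PREFIX):]
    locale, *attributes = rest.split("/")
    # Drop empty segments such as the one left by a trailing slash.
    return locale, [a for a in attributes if a]
```

For example, `split_collation_uri("http://marklogic.com/collation/en/S1/AS/T00BB")` separates the locale `en` from the three attribute segments, and the bare prefix yields an empty locale with no attributes.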
This section describes the following parts of the syntax:
- Locale Portion of the Collation URI
- Attribute Portion of the Collation URI
Locale Portion of the Collation URI
The <locale> portion of the collation URI must be a valid locale, and is defined as follows:
<locale> ::= <language>[-<script>][_<region>][@(collation=<value>;)+]
For a list of valid language codes, see the following:
http://www.loc.gov/standards/iso639-2/php/code_list.php
For a list of valid script codes, see the following:
http://www.unicode.org/iso15924/iso15924-codes.html
For a list of valid region codes, see the following:
http://www.iso.org/iso/en/prods-services/iso3166ma/02iso-3166-code-lists/list-en1.html
Some languages (for example, German and Chinese) have multiple collations you can specify in the locale. To specify one of these language-specific collation variants, use the @collation=<value> portion of the syntax.
If you do not specify a locale in the collation URI, the UCA Root Collation is used by default (for details, see UCA Root Collation).
While you can specify many valid language, script, or region codes, MarkLogic Server only fully supports those that are relevant to and most commonly used with the supported languages. For a list of supported languages along with their common collations, see Collations and Character Sets By Language.
The following table lists some typical locales, along with a brief description:
Attribute Portion of the Collation URI
There can be zero or more <attribute> portions of the collation URI. Attributes further specify characteristics such as which collation to use, whether to be case sensitive or case insensitive, and so on. You only need to specify attributes if they differ from the defaults for the specified locale. Attributes have the following syntax:
<attribute> ::= <strength> | <case-level> | <case-first> |
<alternate> | <numeric-collation> |
<variable-top> | <normalization-checking> |
<french> | <hiragana>
The following table describes the various attributes. For simplicity, terms like case-sensitive, diacritic-sensitive, and others are used. In actuality, the definitions of these terms for use in collations are somewhat more complicated. For the exact technical meaning of each attribute, see http://icu.sourceforge.net/userguide/Collate_Concepts.html.
| Attribute | Legal Values | Descriptions |
|---|---|---|
| <strength> The level of comparison to use. | S1 | Specifies case and diacritic insensitive. |
| | S2 | Specifies diacritic sensitive and case insensitive. |
| | S3 | Specifies case and diacritic sensitive. |
| | S4 | Specifies punctuation sensitive. |
| | SI | Specifies identity (codepoint differentiated). |
| <case-level> Enable or disable the case-sensitive level, skipping the diacritic-sensitive level. A diacritic-insensitive, case-sensitive comparison therefore combines S1 with EO. | EO | Specifies enable case-level. |
| | EX | Specifies disable case-level. |
| <case-first> Specifies whether uppercase sorts before or after lowercase. | CU | Specifies that uppercase sorts first. |
| | CL | Specifies that lowercase sorts first. |
| | CX | Off. |
| <alternate> Specifies how to handle variable characters (as completely ignorable or as normal characters). | AN | Specifies that all characters are non-ignorable; that is, include all spaces and punctuation characters when sorting characters. |
| | AS | Specifies that variable characters are shifted (ignored) according to the variable-top setting. |
| <numeric-collation> Order numbers as numbers rather than collation order (for example, 20 < 100). | MO | Specifies numeric ordering. |
| | MX | Specifies non-numeric ordering (order according to the collation). |
| <variable-top> Used with the AS (shifted) setting of <alternate>. | T0000 | Specifies that all variable characters (typically whitespace and punctuation) are ignored for sorting variable characters. |
| | T0020 | Specifies that whitespace is ignorable when sorting characters. For example, /T0020/AS means that period (a variable character) would be treated as a regular character but space would be ignorable. Therefore: A B = AB and AB < A.B. |
| | T00BB | Specifies that most punctuation and space characters are ignorable when sorting characters. Specifically, characters whose sort key is less than or equal to 00BB are ignorable. |
| <normalization-checking> Specifies whether to perform Unicode normalization on the input string. | NO | Specifies normalize Unicode. |
| | NX | Specifies do not normalize Unicode. |
| <french> Specifies whether to apply the French accent ordering rule (that is, to reverse the ordering at the secondary, diacritic level). | FO | Specifies French accent ordering. |
| | FX | Specifies normal ordering (according to the collation). |
| <hiragana> Specifies whether to add an additional level to distinguish Hiragana from Katakana. | HO | Hiragana mode on. |
| | HX | Hiragana mode off. |
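To make the strength and numeric-collation attributes more concrete, the following Python sketch approximates them with comparison keys. It is a simplification under stated assumptions: the function names are hypothetical, case folding and removal of combining marks stand in for the real collation levels, and none of this is how MarkLogic or ICU actually computes sort keys.

```python
import re
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def strength_key(text: str, strength: str) -> str:
    """Approximate comparison key for strengths S1, S2, and S3."""
    if strength == "S1":                      # ignore case and diacritics
        return strip_diacritics(text).casefold()
    if strength == "S2":                      # ignore case, keep diacritics
        return unicodedata.normalize("NFC", text.casefold())
    if strength == "S3":                      # keep case and diacritics
        return unicodedata.normalize("NFC", text)
    raise ValueError("unsupported strength: " + strength)


def numeric_key(text: str) -> list:
    """Approximate MO: compare runs of digits by their numeric value."""
    # re.split with a capture group alternates text (even index) and
    # digit runs (odd index), so the key types line up position by position.
    parts = re.split(r"(\d+)", text)
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]
```

With these keys, "Côte" and "cote" compare equal at S1 but differ at S2, "Côte" and "côte" compare equal at S2 but differ at S3, and sorting `["100", "20"]` with `numeric_key` puts 20 before 100.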
Backward Compatibility with 3.1 Range Indexes and Lexicons
Range Indexes and lexicons that were created in MarkLogic Server 3.1 use the Unicode codepoint collation order. To use a different collation for any of these indexes or lexicons, change the collation, re-create the index or lexicon, and then reindex the database (if reindex enable is set to true, reindexing begins automatically).
UCA Root Collation
The Unicode Collation Algorithm (UCA) root collation in MarkLogic Server is used when no default exists. It uses the Unicode codepoint collation with S3 (case and diacritic sensitive) strength, and it has the following URI:
http://marklogic.com/collation/
The UCA root collation adds case and diacritic sensitivity to the Unicode codepoint order, so it produces more sensible sort orders when case and diacritics matter. For more details about the UCA, see http://www.unicode.org/unicode/reports/tr10/.
How Collation Defaults are Determined
The collation used for requests in MarkLogic Server is based on the settings of various parameters in the Admin Interface and on what is specified in your XQuery code. Each App Server has a default collation specified, and that is used in the absence of anything else that overrides it. Note the following about collations and their defaults.
- Collations are specified at the App Server level, on Range Indexes, and on lexicons.
- App Servers, Range Indexes, and lexicons upgraded from 3.1 remain in codepoint order (http://marklogic.com/collation/codepoint).
- New App Servers default to the UCA Root Collation (http://marklogic.com/collation/).
- New Range Indexes and lexicons default to UCA Root Collation (http://marklogic.com/collation/).
- You can specify a default collation in an XQuery prolog, which overrides the App Server default. For example, the following query will use the French collation:

      xquery version "1.0-ml";
      declare default collation "http://marklogic.com/collation/fr";
      for $x in ("côte", "cote", "coté", "côté", "cpte")
      order by $x
      return $x

- The codepoint collation URI is http://marklogic.com/collation/codepoint. The following is an alias to the codepoint collation URI (used with the `1.0` strict XQuery dialect): http://www.w3.org/2005/xpath-functions/collation/codepoint.
- Collation URIs displayed in the Admin Interface are stored and displayed as the canonical representation of the URI entered. The canonical representation is equivalent to the URI entered, but it puts the portions of the collation URI string in a predetermined order and simplifies them. The xdmp:collation-canonical-uri built-in XQuery function returns the canonical URI of any valid collation URI.
- The empty string URI becomes codepoint collation. Therefore, the following returns as shown:

      xdmp:collation-canonical-uri("") => http://marklogic.com/collation/codepoint

- The collation used in an XQuery module is determined on a per-module basis. Therefore, a module might call another module that uses a different collation, as each module determines its collation independent of the module that called it (based on the App Server defaults, collation prolog declaration, and so on).
- When a module is invoked or spawned from another module, or when a request is submitted via an xdmp:eval call from another module, the new request inherits the collation context of the calling module. That context can be overridden in the query (for example, with a declare default collation expression in the prolog), but it will default to the context from the calling module.
- If no other collations are in effect (for example, for scheduled tasks), the codepoint collation is used.
Specifying Collations
You can specify collations in many places. Some common places to specify collations are:
- In the order by clause of a FLWOR expression.
- In an App Server configuration in the Admin Interface.
- In a lexicon or Range Index specification in the Admin Interface.
- In many W3C standard XQuery functions (for example, fn:compare, fn:contains, fn:starts-with, fn:ends-with, fn:substring-after, fn:substring-before, fn:deep-equal, fn:distinct-values, fn:index-of, fn:max, fn:min).
- In the lexicon APIs (cts:words, cts:word-match, cts:element-words, cts:element-values, and so on).
- In the range query constructors (cts:element-range-query, cts:element-attribute-range-query).
Collations and Character Sets By Language
The following table lists the languages for which MarkLogic Server supports language-specific tokenization and stemming. It also lists some common collations and character sets for each language.
Note that some of the listed character set names can be ambiguous. MarkLogic uses the International Components for Unicode (ICU) library for character encoding and conversion. For best accuracy, refer to the ICU converter alias mapping at http://demo.icu-project.org/icu-bin/convexp.
All of the languages except English require a license key to enable. If you do not have the license key for one of the supported languages, that language is treated as a generic language: each word is stemmed to itself, and text is tokenized generically (on whitespace and punctuation characters for non-Asian characters, and on each character for Asian characters). For more information, see Generic Language Support. The language-specific collations are available to all languages, regardless of which languages are enabled in the license key.