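23. Localization

This chapter describes the available localization features from the point of view of the administrator. PostgreSQL supports two localization facilities:

Using the locale features of the operating system to provide locale-specific collation order, number formatting, translated messages, and other aspects. This is covered in Section 23.1 and Section 23.2.
Providing a number of different character sets to support storing text in all kinds of languages, and providing character set translation between client and server. This is covered in Section 23.3.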

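23.1. Locale Support

Version: 11

Locale support refers to an application respecting cultural preferences regarding alphabets, sorting, number formatting, and so on. PostgreSQL uses the standard ISO C and POSIX locale facilities provided by the server operating system. For additional information, refer to the documentation of your system.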

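23.1.1. Overview

Locale support is automatically initialized when a database cluster is created using initdb. By default, initdb initializes the database cluster with the locale setting of its execution environment, so if your system is already set to use the locale that you want in your database cluster, there is nothing else you need to do. If you want to use a different locale (or you are not sure which locale your system is set to), you can tell initdb exactly which locale to use by specifying the --locale option. For example:

initdb --locale=sv_SE

This example for Unix systems sets the locale to Swedish (sv) as spoken in Sweden (SE). Other possibilities might include en_US (U.S. English) and fr_CA (French Canadian). If more than one character set can be used for a locale, the specification can take the form language_territory.codeset. For example, fr_BE.UTF-8 represents the French language (fr) as spoken in Belgium (BE), with a UTF-8 character set encoding.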
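As an illustration of the language_territory.codeset form described above, here is a minimal shell sketch that splits a locale name into its parts. The function name parse_locale is hypothetical and not part of PostgreSQL or the operating system. The sketch assumes names of the plain form shown in the text and does not handle modifiers or the verbose Windows names.

```shell
# Split a locale name of the form language[_territory][.codeset].
# parse_locale is a hypothetical helper used only for illustration.
parse_locale() {
    name=$1
    codeset=
    territory=
    case $name in
        *.*) codeset=${name#*.}; name=${name%%.*} ;;    # text after the first dot
    esac
    case $name in
        *_*) territory=${name#*_}; name=${name%%_*} ;;  # text after the underscore
    esac
    printf 'language=%s territory=%s codeset=%s\n' "$name" "$territory" "$codeset"
}

parse_locale fr_BE.UTF-8
# language=fr territory=BE codeset=UTF-8
```

A name without a codeset, such as sv_SE, simply yields an empty codeset field.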

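What locales are available on your system, and under what names, depends on what was provided by the operating system vendor and what was installed. On most Unix systems, the command locale -a will provide a list of available locales. Windows uses more verbose locale names, such as German_Germany or Swedish_Sweden.1252, but the principles are the same.

Occasionally it is useful to mix rules from several locales, for example to use English collation rules but Spanish messages. To support that, a set of locale subcategories exist that control only certain aspects of the localization rules:

LC_COLLATE: String sort order
LC_CTYPE: Character classification (What is a letter? Its upper-case equivalent?)
LC_MESSAGES: Language of messages
LC_MONETARY: Formatting of currency amounts
LC_NUMERIC: Formatting of numbers
LC_TIME: Formatting of dates and times

The category names translate into names of initdb options to override the locale choice for a specific category. For instance, to set the locale to French Canadian, but use U.S. rules for formatting currency, use initdb --locale=fr_CA --lc-monetary=en_US.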
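The mapping from a category name to an initdb option name can be sketched in shell as follows. The helper lc_option is hypothetical and exists only for this illustration. The sketch prints a command line rather than running initdb.

```shell
# Map a locale category name (e.g. LC_MONETARY) to the matching initdb
# option name (e.g. --lc-monetary). lc_option is a hypothetical helper;
# it only prints the option and executes nothing.
lc_option() {
    printf '%s%s\n' '--' "$(printf '%s' "$1" | LC_ALL=C tr 'A-Z_' 'a-z-')"
}

# Print (do not run) an initdb command line with one per-category override.
printf 'initdb --locale=%s %s=%s\n' fr_CA "$(lc_option LC_MONETARY)" en_US
# initdb --locale=fr_CA --lc-monetary=en_US
```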

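If you want the system to behave as if it had no locale support, use the special locale name C, or equivalently POSIX.

Some locale categories must have their values fixed when the database is created. You can use different settings for different databases, but once a database is created, you cannot change them for that database anymore. LC_COLLATE and LC_CTYPE are these categories. They affect the sort order of indexes, so they must be kept fixed, or indexes on text columns would become corrupt. (But you can alleviate this restriction using collations, as discussed in Section 23.2.) The default values for these categories are determined when initdb is run, and those values are used when new databases are created, unless specified otherwise in the CREATE DATABASE command.

The other locale categories can be changed whenever desired by setting the server configuration parameters that have the same name as the locale categories (see the documentation of those parameters for details). The values that are chosen by initdb are actually only written into the configuration file postgresql.conf to serve as defaults when the server is started. If you remove these assignments from postgresql.conf, the server will inherit the settings from its execution environment.

Note that the locale behavior of the server is determined by the environment variables seen by the server, not by the environment of any client. Therefore, be careful to configure the correct locale settings before starting the server. A consequence of this is that if client and server are set up in different locales, messages might appear in different languages depending on where they originated.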

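Note

When we speak of inheriting the locale from the execution environment, this means the following on most operating systems: for a given locale category, say the collation, the following environment variables are consulted in this order until one is found to be set: LC_ALL, LC_COLLATE (or the variable corresponding to the respective category), LANG. If none of these environment variables are set, the locale defaults to C.

Some message localization libraries also look at the environment variable LANGUAGE, which overrides all other locale settings for the purpose of setting the language of messages. If in doubt, refer to the documentation of your operating system, in particular the documentation about gettext.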
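The lookup order described in the note can be sketched as a small shell function. The name effective_locale is hypothetical. The sketch treats an empty variable as unset and ignores LANGUAGE, which only affects message language in some libraries.

```shell
# Report which locale a category would inherit from the environment,
# following the order LC_ALL, then the category variable, then LANG, then C.
# effective_locale is a hypothetical helper, not an operating system command.
effective_locale() {
    category=$1                                  # e.g. LC_COLLATE
    eval "category_value=\${$category:-}"        # value of the named variable
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "$category_value" ]; then
        printf '%s\n' "$category_value"
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"
    else
        printf 'C\n'
    fi
}

# Show what the current shell environment would give the server for sorting.
effective_locale LC_COLLATE
```

Running this in the environment that starts the server, rather than in a client session, shows the value the server would inherit.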

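To enable messages to be translated to the user's preferred language, NLS must have been selected at build time (configure --enable-nls). All other locale support is built in automatically.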

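23.1.2. Behavior

The locale settings influence the following SQL features:

Sort order in queries using ORDER BY or the standard comparison operators on textual data
The upper, lower, and initcap functions
Pattern matching operators (LIKE, SIMILAR TO, and POSIX-style regular expressions); locales affect both case-insensitive matching and the classification of characters by character-class regular expressions
The to_char family of functions
The ability to use indexes with LIKE clauses

The drawback of using locales other than C or POSIX in PostgreSQL is the performance impact. It slows character handling and prevents ordinary indexes from being used by LIKE. For this reason, use locales only if you actually need them.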

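23.1.3. Problems

If locale support does not work according to the explanation above, check that the locale support in your operating system is correctly configured. To check what locales are installed on your system, you can use the command locale -a if your operating system provides it.

Check that PostgreSQL is actually using the locale that you think it is. The LC_COLLATE and LC_CTYPE settings are determined when a database is created, and cannot be changed except by creating a new database. Other locale settings, including LC_MESSAGES and LC_MONETARY, are initially determined by the environment the server is started in, but can be changed on the fly. You can check the active locale settings using the SHOW command.

The directory src/test/locale in the source distribution contains a test suite for PostgreSQL's locale support.

Client applications that handle server-side errors by parsing the text of the error message will obviously have problems when the server's messages are in a different language. Authors of such applications are advised to use the error code scheme instead.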

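23.3. Character Set Support

The character set support in PostgreSQL allows you to store text in a variety of character sets (also called encodings), including single-byte character sets such as the ISO 8859 series and multiple-byte character sets such as EUC (Extended Unix Code), UTF-8, and Mule internal code. All supported character sets can be used transparently by clients, but a few are not supported for use within the server (that is, as a server-side encoding). The default character set is selected while initializing your PostgreSQL database cluster using initdb. It can be overridden when you create a database, so you can have multiple databases, each with a different character set.

An important restriction, however, is that each database's character set must be compatible with the database's LC_CTYPE (character classification) and LC_COLLATE (string sort order) locale settings. For the C or POSIX locale, any character set is allowed, but for other libc-provided locales there is only one character set that will work correctly. (On Windows, however, UTF-8 encoding can be used with any locale.) If you have ICU support configured, ICU-provided locales can be used with most but not all server-side encodings.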

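23.3.1. Supported Character Sets

Table 23.1 shows the character sets available for use in PostgreSQL.

Table 23.1. PostgreSQL Character Sets

Not all client APIs support all the listed character sets. For example, the PostgreSQL JDBC driver does not support MULE_INTERNAL, LATIN6, LATIN8, and LATIN10.

The SQL_ASCII setting behaves considerably differently from the other settings. When the server character set is SQL_ASCII, the server interprets byte values 0-127 according to the ASCII standard, while byte values 128-255 are taken as uninterpreted characters. No encoding conversion is done when the setting is SQL_ASCII. Thus, this setting is not so much a declaration that a specific encoding is in use as a declaration of ignorance about the encoding. In most cases, if you are working with any non-ASCII data, it is unwise to use the SQL_ASCII setting, because PostgreSQL will be unable to help you by converting or validating non-ASCII characters.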

23.3.2. Setting the Character Set

initdb defines the default character set (encoding) for a PostgreSQL cluster. For example,

initdb -E EUC_JP

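sets the default character set to EUC_JP (Extended Unix Code for Japanese). You can use --encoding instead of -E if you prefer longer option strings. If no -E or --encoding option is given, initdb attempts to determine the appropriate encoding to use based on the specified or default locale.

You can specify a non-default encoding at database creation time, provided that the encoding is compatible with the selected locale: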

createdb -E EUC_KR -T template0 --lc-collate=ko_KR.euckr --lc-ctype=ko_KR.euckr korean

This will create a database named korean that uses the character set EUC_KR and the locale ko_KR. Another way to accomplish this is to use this SQL command:

CREATE DATABASE korean WITH ENCODING 'EUC_KR' LC_COLLATE='ko_KR.euckr' LC_CTYPE='ko_KR.euckr' TEMPLATE=template0;

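Notice that the above commands specify copying the template0 database. When copying any other database, the encoding and locale settings cannot be changed from those of the source database, because that might result in corrupt data. For more information, see Section 22.3.

The encoding for a database is stored in the system catalog pg_database. You can see it by using the psql -l option or the \l command.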

$ psql -l
                                         List of databases
   Name    |  Owner   | Encoding  |  Collation  |    Ctype    |          Access Privileges          
-----------+----------+-----------+-------------+-------------+-------------------------------------
 clocaledb | hlinnaka | SQL_ASCII | C           | C           | 
 englishdb | hlinnaka | UTF8      | en_GB.UTF8  | en_GB.UTF8  | 
 japanese  | hlinnaka | UTF8      | ja_JP.UTF8  | ja_JP.UTF8  | 
 korean    | hlinnaka | EUC_KR    | ko_KR.euckr | ko_KR.euckr | 
 postgres  | hlinnaka | UTF8      | fi_FI.UTF8  | fi_FI.UTF8  | 
 template0 | hlinnaka | UTF8      | fi_FI.UTF8  | fi_FI.UTF8  | {=c/hlinnaka,hlinnaka=CTc/hlinnaka}
 template1 | hlinnaka | UTF8      | fi_FI.UTF8  | fi_FI.UTF8  | {=c/hlinnaka,hlinnaka=CTc/hlinnaka}
(7 rows)

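Note

On most modern operating systems, PostgreSQL can determine which character set is implied by the LC_CTYPE setting, and it will enforce that only the matching database encoding is used. On older systems it is your responsibility to ensure that you use the encoding expected by the locale you have selected. A mistake in this area is likely to lead to strange behavior of locale-dependent operations such as sorting.

PostgreSQL will allow superusers to create databases with SQL_ASCII encoding even when LC_CTYPE is not C or POSIX. As noted above, SQL_ASCII does not enforce that the data stored in the database has any particular encoding, so this choice poses risks of locale-dependent misbehavior. Using this combination of settings is deprecated and may someday be forbidden altogether.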

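23.3.3. Automatic Character Set Conversion Between Server and Client

PostgreSQL supports automatic character set conversion between server and client for certain character set combinations. The conversion information is stored in the pg_conversion system catalog. PostgreSQL comes with some predefined conversions, as shown in Table 23.2. You can create a new conversion using the SQL command CREATE CONVERSION.

Table 23.2. Client/Server Character Set Conversions

To enable automatic character set conversion, you have to tell PostgreSQL the character set (encoding) you would like to use in the client. There are several ways to accomplish this: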

Using the \encoding command in psql. \encoding allows you to change the client encoding on the fly. For example, to change the encoding to SJIS, type:
```
\encoding SJIS
```
libpq (Section 33.10) has functions to control the client encoding.
Using SET client_encoding TO. The client encoding can be set with this SQL command:
```
SET CLIENT_ENCODING TO 'value';
```
You can also use the standard SQL syntax SET NAMES for this purpose:
```
SET NAMES 'value';
```
To query the current client encoding:
```
SHOW client_encoding;
```
To return to the default encoding:
```
RESET client_encoding;
```
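Using PGCLIENTENCODING. If the environment variable PGCLIENTENCODING is defined in the client's environment, that client encoding is automatically selected when a connection to the server is made. (This can subsequently be overridden using any of the other methods mentioned above.)
Using the configuration variable client_encoding. If the client_encoding variable is set, that client encoding is automatically selected when a connection to the server is made. (This can subsequently be overridden using any of the other methods mentioned above.)

If the conversion of a particular character is not possible (suppose you chose EUC_JP for the server and LATIN1 for the client, and some Japanese characters are returned that have no representation in LATIN1), an error is reported.

If the client character set is defined as SQL_ASCII, encoding conversion is disabled, regardless of the server's character set. Just as for the server, use of SQL_ASCII is unwise unless you are working with all-ASCII data.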

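23.3.4. Further Reading

These are good sources to start learning about various kinds of encoding systems.

CJKV Information Processing: Chinese, Japanese, Korean & Vietnamese Computing
- Contains detailed explanations of EUC_JP, EUC_CN, EUC_KR, EUC_TW.
http://www.unicode.org/
- The web site of the Unicode Consortium.
RFC 3629
- UTF-8 (8-bit UCS/Unicode Transformation Format) is defined here.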

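As a workaround to allow PostgreSQL to use indexes with LIKE clauses under a non-C locale, several custom operator classes exist. These allow the creation of an index that performs a strict character-by-character comparison, ignoring locale comparison rules (see the documentation on operator classes for more information). Another approach is to create indexes using the C collation, as discussed in Section 23.2.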

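Maintaining catalogs of message translations requires the ongoing efforts of many volunteers who want to see PostgreSQL speak their preferred language well. If messages in your language are currently not available or not fully translated, your assistance would be appreciated. If you want to help, refer to the native language support documentation or write to the developers' mailing list.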

23.2. Collation Support

The collation feature allows specifying the sort order and character classification behavior of data per-column, or even per-operation. This alleviates the restriction that the LC_COLLATE and LC_CTYPE settings of a database cannot be changed after its creation.

23.2.1. Concepts

Conceptually, every expression of a collatable data type has a collation. (The built-in collatable data types are text, varchar, and char. User-defined base types can also be marked collatable, and of course a domain over a collatable data type is collatable.) If the expression is a column reference, the collation of the expression is the defined collation of the column. If the expression is a constant, the collation is the default collation of the data type of the constant. The collation of a more complex expression is derived from the collations of its inputs, as described below.

The collation of an expression can be the “default” collation, which means the locale settings defined for the database. It is also possible for an expression's collation to be indeterminate. In such cases, ordering operations and other operations that need to know the collation will fail.

When the database system has to perform an ordering or a character classification, it uses the collation of the input expression. This happens, for example, with ORDER BY clauses and function or operator calls such as <. The collation to apply for an ORDER BY clause is simply the collation of the sort key. The collation to apply for a function or operator call is derived from the arguments, as described below. In addition to comparison operators, collations are taken into account by functions that convert between lower and upper case letters, such as lower, upper, and initcap; by pattern matching operators; and by to_char and related functions.

For a function or operator call, the collation that is derived by examining the argument collations is used at run time for performing the specified operation. If the result of the function or operator call is of a collatable data type, the collation is also used at parse time as the defined collation of the function or operator expression, in case there is a surrounding expression that requires knowledge of its collation.

The collation derivation of an expression can be implicit or explicit. This distinction affects how collations are combined when multiple different collations appear in an expression. An explicit collation derivation occurs when a COLLATE clause is used; all other collation derivations are implicit. When multiple collations need to be combined, for example in a function call, the following rules are used:

If any input expression has an explicit collation derivation, then all explicitly derived collations among the input expressions must be the same, otherwise an error is raised. If any explicitly derived collation is present, that is the result of the collation combination.
Otherwise, all input expressions must have the same implicit collation derivation or the default collation. If any non-default collation is present, that is the result of the collation combination. Otherwise, the result is the default collation.
If there are conflicting non-default implicit collations among the input expressions, then the combination is deemed to have indeterminate collation. This is not an error condition unless the particular function being invoked requires knowledge of the collation it should apply. If it does, an error will be raised at run-time.
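The three combination rules above can be modeled with a small shell function. This is a simplified sketch, not PostgreSQL's implementation. The function name combine_collations and the NAME:DERIVATION argument format are assumptions made for this illustration, where DERIVATION is explicit, implicit, or default.

```shell
# Combine the collations of several inputs following the rules above.
# Each argument is NAME:DERIVATION, with DERIVATION one of explicit,
# implicit, or default (NAME is ignored for default).
# combine_collations is a hypothetical helper for illustration only.
combine_collations() {
    explicit=
    implicit=
    conflict=
    for input in "$@"; do
        name=${input%%:*}
        kind=${input#*:}
        case $kind in
            explicit)
                # Rule 1: all explicit collations must agree, else error.
                if [ -n "$explicit" ] && [ "$explicit" != "$name" ]; then
                    echo "error: explicit collation mismatch" >&2
                    return 1
                fi
                explicit=$name ;;
            implicit)
                # Rule 3: differing implicit collations are only a conflict.
                if [ -n "$implicit" ] && [ "$implicit" != "$name" ]; then
                    conflict=yes
                fi
                implicit=$name ;;
        esac
    done
    if [ -n "$explicit" ]; then echo "$explicit"     # explicit always wins
    elif [ -n "$conflict" ]; then echo "indeterminate"
    elif [ -n "$implicit" ]; then echo "$implicit"   # Rule 2: non-default wins
    else echo "default"
    fi
}

combine_collations de_DE:implicit default:default   # de_DE
combine_collations de_DE:implicit fr_FR:explicit    # fr_FR
combine_collations de_DE:implicit es_ES:implicit    # indeterminate
```

An indeterminate result is not an error in itself. It becomes one only when the operation that receives it needs to know the collation.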

For example, consider this table definition:

CREATE TABLE test1 (
    a text COLLATE "de_DE",
    b text COLLATE "es_ES",
    ...
);

Then in

SELECT a < 'foo' FROM test1;

the < comparison is performed according to de_DE rules, because the expression combines an implicitly derived collation with the default collation. But in

SELECT a < ('foo' COLLATE "fr_FR") FROM test1;

the comparison is performed using fr_FR rules, because the explicit collation derivation overrides the implicit one. Furthermore, given

SELECT a < b FROM test1;

the parser cannot determine which collation to apply, since the a and b columns have conflicting implicit collations. Since the < operator does need to know which collation to use, this will result in an error. The error can be resolved by attaching an explicit collation specifier to either input expression, thus:

SELECT a < b COLLATE "de_DE" FROM test1;

or equivalently

SELECT a COLLATE "de_DE" < b FROM test1;

On the other hand, the structurally similar case

SELECT a || b FROM test1;

does not result in an error, because the || operator does not care about collations: its result is the same regardless of the collation.

The collation assigned to a function or operator's combined input expressions is also considered to apply to the function or operator's result, if the function or operator delivers a result of a collatable data type. So, in

SELECT * FROM test1 ORDER BY a || 'foo';

the ordering will be done according to de_DE rules. But this query:

SELECT * FROM test1 ORDER BY a || b;

results in an error, because even though the || operator doesn't need to know a collation, the ORDER BY clause does. As before, the conflict can be resolved with an explicit collation specifier:

SELECT * FROM test1 ORDER BY a || b COLLATE "fr_FR";

23.2.2. Managing Collations

A collation is an SQL schema object that maps an SQL name to locales provided by libraries installed in the operating system. A collation definition has a provider that specifies which library supplies the locale data. One standard provider name is libc, which uses the locales provided by the operating system C library. These are the locales that most tools provided by the operating system use. Another provider is icu, which uses the external ICU library. ICU locales can only be used if support for ICU was configured when PostgreSQL was built.

A collation object provided by icu maps to a named collator provided by the ICU library. ICU does not support separate “collate” and “ctype” settings, so they are always the same. Also, ICU collations are independent of the encoding, so there is always only one ICU collation of a given name in a database.

23.2.2.1. Standard Collations

On all platforms, the collations named default, C, and POSIX are available. Additional collations may be available depending on operating system support. The default collation selects the LC_COLLATE and LC_CTYPE values specified at database creation time. The C and POSIX collations both specify “traditional C” behavior, in which only the ASCII letters “A” through “Z” are treated as letters, and sorting is done strictly by character code byte values.

Additionally, the SQL standard collation name ucs_basic is available for encoding UTF8. It is equivalent to C and sorts by Unicode code point.
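The byte-value ordering of the C and POSIX collations can be observed outside the database with the operating system's own sort utility. The sketch below assumes a POSIX sort command. The wrapper name c_locale_sort is hypothetical.

```shell
# Sort lines by raw byte value, as the C/POSIX locale does.
# c_locale_sort is a hypothetical wrapper around the standard sort utility.
c_locale_sort() {
    LC_ALL=C sort
}

printf '%s\n' banana Apple cherry apple Banana | c_locale_sort
# Apple
# Banana
# apple
# banana
# cherry
```

All upper-case ASCII letters sort before all lower-case ones because their byte values are smaller. A language-specific collation would typically interleave them.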

23.2.2.2. Predefined Collations

If the operating system provides support for using multiple locales within a single program (newlocale and related functions), or if support for ICU is configured, then when a database cluster is initialized, initdb populates the system catalog pg_collation with collations based on all the locales it finds in the operating system at the time.

To inspect the currently available locales, use the query SELECT * FROM pg_collation, or the command \dOS+ in psql.

23.2.2.2.1. libc collations

For example, the operating system might provide a locale named de_DE.utf8. initdb would then create a collation named de_DE.utf8 for encoding UTF8 that has both LC_COLLATE and LC_CTYPE set to de_DE.utf8. It will also create a collation with the .utf8 tag stripped off the name. So you could also use the collation under the name de_DE, which is less cumbersome to write and makes the name less encoding-dependent. Note that, nevertheless, the initial set of collation names is platform-dependent.

Within any particular database, only collations that use that database's encoding are of interest. Other entries in pg_collation are ignored. Thus, a stripped collation name such as de_DE can be considered unique within a given database even though it would not be unique globally. Use of the stripped collation names is recommended, since it will make one less thing you need to change if you decide to change to another database encoding. Note however that the default, C, and POSIX collations can be used regardless of the database encoding.

PostgreSQL considers distinct collation objects to be incompatible even when they have identical properties. Thus for example,

SELECT a COLLATE "C" < b COLLATE "POSIX" FROM test1;

will draw an error even though the C and POSIX collations have identical behaviors. Mixing stripped and non-stripped collation names is therefore not recommended.

23.2.2.2.2. ICU collations

With ICU, it is not sensible to enumerate all possible locale names. ICU uses a particular naming system for locales, but there are many more ways to name a locale than there are actually distinct locales. initdb uses the ICU APIs to extract a set of distinct locales to populate the initial set of collations. Collations provided by ICU are created in the SQL environment with names in BCP 47 language tag format, with a “private use” extension -x-icu appended, to distinguish them from libc locales.

Here are some example collations that might be created:

de-x-icu
- German collation, default variant

de-AT-x-icu
- German collation for Austria, default variant

(There are also, say, de-DE-x-icu or de-CH-x-icu, but as of this writing, they are equivalent to de-x-icu.)

und-x-icu (for “undefined”)
- ICU “root” collation. Use this to get a reasonable language-agnostic sort order.

Some (less frequently used) encodings are not supported by ICU. When the database encoding is one of these, ICU collation entries in pg_collation are ignored. Attempting to use one will draw an error along the lines of “collation "de-x-icu" for encoding "WIN874" does not exist”.

23.2.2.3. Creating New Collation Objects

The standard and predefined collations are in the schema pg_catalog, like all predefined objects. User-defined collations should be created in user schemas. This also ensures that they are saved by pg_dump.

23.2.2.3.1. libc collations

New libc collations can be created like this:

CREATE COLLATION german (provider = libc, locale = 'de_DE');

The exact values that are acceptable for the locale clause in this command depend on the operating system. On Unix-like systems, the command locale -a will show a list.

23.2.2.3.2. ICU collations

Here are some examples:

CREATE COLLATION "de-u-co-phonebk-x-icu" (provider = icu, locale = 'de-u-co-phonebk');
CREATE COLLATION "de-u-co-phonebk-x-icu" (provider = icu, locale = 'de@collation=phonebook');
- German collation with phone book collation type

The first example selects the ICU locale using a “language tag” per BCP 47. The second example uses the traditional ICU-specific locale syntax. The first style is preferred going forward, but it is not supported by older ICU versions.

Note that you can name the collation objects in the SQL environment anything you want. In this example, we follow the naming style that the predefined collations use, which in turn also follow BCP 47, but that is not required for user-defined collations.

CREATE COLLATION "und-u-co-emoji-x-icu" (provider = icu, locale = 'und-u-co-emoji');
CREATE COLLATION "und-u-co-emoji-x-icu" (provider = icu, locale = '@collation=emoji');
- Root collation with Emoji collation type, per Unicode Technical Standard #51

Observe how in the traditional ICU locale naming system, the root locale is selected by an empty string.

CREATE COLLATION digitslast (provider = icu, locale = 'en-u-kr-latn-digit');
CREATE COLLATION digitslast (provider = icu, locale = 'en@colReorder=latn-digit');
- Sort digits after Latin letters. (The default is digits before letters.)

CREATE COLLATION upperfirst (provider = icu, locale = 'en-u-kf-upper');
CREATE COLLATION upperfirst (provider = icu, locale = 'en@colCaseFirst=upper');
- Sort upper-case letters before lower-case letters. (The default is lower-case letters first.)

CREATE COLLATION special (provider = icu, locale = 'en-u-kf-upper-kr-latn-digit');
CREATE COLLATION special (provider = icu, locale = 'en@colCaseFirst=upper;colReorder=latn-digit');
- Combines both of the above options.

CREATE COLLATION numeric (provider = icu, locale = 'en-u-kn-true');
CREATE COLLATION numeric (provider = icu, locale = 'en@colNumeric=yes');
- Numeric ordering, sorts sequences of digits by their numeric value, for example: A-21 < A-123 (also known as natural sort).

Note that while this system allows creating collations that “ignore case” or “ignore accents” or similar (using the ks key), PostgreSQL does not at the moment allow such collations to act in a truly case- or accent-insensitive manner. Any strings that compare equal according to the collation but are not byte-wise equal will be sorted according to their byte values.

Note

By design, ICU will accept almost any string as a locale name and match it to the closest locale it can provide, using the fallback procedure described in its documentation. Thus, there will be no direct feedback if a collation specification is composed using features that the given ICU installation does not actually support. It is therefore recommended to create application-level test cases to check that the collation definitions satisfy one's requirements.

23.2.2.3.3. Copying Collations