PostgreSQL 中的字元集支援允許您將文字以各種字元集(也稱為編碼)儲存,包括單位元組字元集(如 ISO 8859 系列)和多位元組字元集,如 EUC(延伸 Unix 代碼), UTF-8 和 Mule 內部代碼。用戶端可以透通地使用所有支援的字元集,但有一些並不支援在伺服器中使用(即作為伺服器端編碼)。使用 initdb 初始化 PostgreSQL 資料庫叢集時,會選擇預設字元集。建立資料庫時可以覆寫它,因此您可以擁有多個資料庫,每個資料庫具有不同的字元集。
但是,一個重要的限制是每個資料庫的字元集必須與資料庫的 LC_CTYPE(字元分類)和 LC_COLLATE(字串排序順序)語言環境設定相容。對於 C 或 POSIX 語言環境,允許使用任何字元集,但對於其他 libc 提供的語言環境,只有一個字元集可以正常工作。(但在 Windows 上,UTF-8 編碼可以與任何語言環境一起使用。)如果您配置了 ICU 支援,ICU 提供的語言環境可以與大多數但不是所有伺服器端編碼一起使用。
Table 23.1 顯示了可在 PostgreSQL 中使用的字元集。
並非所有用戶端 API 都支援所有列出的字元集。例如,PostgreSQL JDBC 驅動程式就不支援 MULE_INTERNAL,LATIN6,LATIN8 和 LATIN10。
SQL_ASCII 設定與其他設定的行為大不相同。當伺服器字元集是 SQL_ASCII 時,伺服器根據 ASCII 標準解譯位元組值 0-127,而位元組值 128-255 作為未解譯的字元。當設定為 SQL_ASCII 時,不會進行編碼轉換。因此,這個設定並不是使用特定編碼的宣告,而是對編碼的未知宣告。在大多數情況下,如果您使用任何非 ASCII 資料,使用 SQL_ASCII 設定是不明智的,因為 PostgreSQL 將無法透過轉換或驗證非 ASCII 字元來幫助您。
initdb 定義 PostgreSQL 叢集的預設字元集(編碼)。例如,
將預設字元集設定為 EUC_JP(日本語的延伸 Unix 代碼)。如果您喜歡更長的選項字串,則可以使用 --encoding 而不是 -E。如果使用 -E 或 --encoding 選項,initdb 將嘗試根據指定的或預設的語言環境決定要使用的相對應編碼。
您可以在資料庫建立時指定非預設編碼,前提是該編碼與所選語言環境相容:
這將建立一個名為 korean 的資料庫,該資料庫使用字元集 EUC_KR 和語言環境 ko_KR。另一種方法是使用此 SQL 指令:
請注意,上述指令指定複製 template0 資料庫。複製任何其他資料庫時,無法更改原資料庫的編碼和語言環境設定,因為這可能會導致資料損壞。有關更多訊息,請參閱第 22.3 節。
資料庫的編碼儲存在系統目錄 pg_database 中。您可以使用 psql -l 選項或 \l 指令查看。
注意 在大多數現代作業系統上,PostgreSQL 可以確定 LC_CTYPE 設定所隱含的字元集,並強制只使用相符合的資料庫編碼。在較舊的系統上,您有責任確保使用所選區域設定所需的編碼。此區域中的錯誤可能會導致與區域設定相關操作(如排序)的奇怪行為。
即使 LC_CTYPE 不是 C 或 POSIX,PostgreSQL 也允許超級使用者使用 SQL_ASCII 編碼建立資料庫。如上所述,SQL_ASCII 不強制儲存在資料庫中的資料具有任何特定編碼,因此這種選擇會帶來相依於語言環境的錯誤行為風險。不推薦使用這種設定組合,有一天可能會被禁止使用。
PostgreSQL 支援伺服器和用戶端之間針對某些字元集組合的自動字元集轉換。轉換訊息儲存在 pg_conversion 系統目錄中。PostgreSQL 帶有一些預先定義的轉換,如 Table 23.2 所示。您可以使用 SQL 指令 CREATE CONVERSION 建立新的轉換。
要啟用自動字元集轉換,您必須告訴 PostgreSQL 您要在用戶端中使用的字元集(編碼)。有幾種方法可以實現此目的:
在 psql 中使用 \encoding 指令。\encoding 允許您即時更改用戶端編碼。例如,要將編碼更改為 SJIS,請鍵入:
libpq(第 33.10 節)具有控制用戶端編碼的功能。
使用 SET client_encoding TO。可以使用以下 SQL 指令設定用戶端編碼:
您還可以使用標準 SQL 語法 SET NAMES 來達到此目的:
要查詢目前用戶端編碼:
要回傳預設編碼:
使用 PGCLIENTENCODING。如果在用戶端環境中定義了環境變數 PGCLIENTENCODING,則在建立與伺服器的連線時會自動選擇該用戶端編碼。(這可以隨後使用上面提到的任何其他方法覆蓋。)
使用組態變數 client_encoding。如果設定了 client_encoding 變數,則在建立與伺服器的連線時會自動選擇該用戶端編碼。(這可以隨後使用上面提到的任何其他方法覆蓋。)
如果無法轉換特定字元 - 假設您為伺服器選擇了 EUC_JP 而為用戶端選擇了 LATIN1,並且回傳了一些在 LATIN1 中沒有表示的日文字元 - 回報錯誤。
如果用戶端字元集定義為 SQL_ASCII,則無論伺服器的字元集如何,都將停用編碼轉換。就像伺服器一樣,除非使用全 ASCII 資料,否則使用 SQL_ASCII 是不明智的。
這些是開始學習各種編碼系統的好資源。
CJKV 訊息處理:中文,日文,韓文和越南文運算
包含 EUC_JP,EUC_CN,EUC_KR,EUC_TW 的詳細說明。
Unicode Consortium 的網站。
RFC 3629
UTF-8 (8-bit UCS/Unicode Transformation Format) 定義在這裡
版本:11
區域設定支援是指某個應用程序,它提供有關字母、排序、數字格式等文化偏好。PostgreSQL 使用伺服器作業系統提供的標準 ISO C 和 POSIX 區域設定。有關其他訊息,請參閱作業系統文件。
使用 initdb 建立資料庫叢集時,將自動初始化語言環境支援。initdb 將預設使用其執行環境的語言環境設定初始化資料庫叢集。因此,如果您的作業系統已設定為使用資料庫叢集中所需的語言環境,那麼您毌須進行任何額外操作。如果要使用其他語言環境(或者您不確定系統設定的語言環境),可以透過指定 --locale 選項指示 initdb 確切使用哪個語言環境。例如:
Unix 系統的這個範例將語言環境設定為瑞典語(SE)中的瑞典語(sv)。其他可能性可能包括 en_US(美國英語)和 fr_CA(加拿大法語)。如果可以將多個字元集用於語言環境,則規範可採用 language_territory.codeset 形式。例如,fr_BE.UTF-8 表示比利時(BE)中使用的法語(fr),具有 UTF-8 字元集編碼。
系統上可用的區域設定取決於作業系統供應商提供和安裝的內容。在大多數 Unix 系統上,指令 locale -a
將提供可用語言環境的列表。Windows 使用更詳細的區域設定名稱,例如 German_Germany 或 Swedish_Sweden.1252,但原則是相同的。
有時,混合來自多個語言環境的規則很有用,例如,使用英語校對規則,但使用西班牙語訊息。為了支援這一點,可以存在一組區域設定子類別,它們僅控制本地化規則的某些方面:
類別名稱轉換為 initdb 選項的名稱,以覆蓋特定類別的區域設定選項。例如,要將語言環境設定為加拿大法語,但使用美國規則格式化貨幣,請使用 initdb --locale = fr_CA --lc-monetary = en_US
。
如果您希望系統的行為就像它沒有語言環境支援一樣,請使用特殊的語言環境名稱 C 或等效的 POSIX。
建立資料庫時,某些區域設定類別必須固定其值。 您可以對不同的資料庫使用不同的設定,但是一旦建立了資料庫,就無法再為該資料庫更改它們。LC_COLLATE 和 LC_CTYPE 是這些類別。它們會影響索引的排序順序,因此必須保持不變,否則文字欄位上的索引會損壞。(但是您可以使用排序規則來緩解此限制,如中所述。)這些類別的預設值在執行 initdb 時確定,並且在建立新資料庫時使用這些值,除非在 CREATE DATABASE 指令中另行指定。
透過設定與語言環境類別同名的伺服器配置參數,可以隨時更改其他語言環境類別(有關詳細訊息,請參閱)。initdb 選擇的值實際上只寫入配置文件 postgresql.conf,以在伺服器啟動時用作預設值。如果從 postgresql.conf 中刪除這些設定,則伺服器將從其執行環境繼承設定。
請注意,服務器的區域設定行為由伺服器看到的環境變數決定,而不是由任何用戶端的環境確定。因此,在啟動伺服器之前,請務必配置正確的區域設定。這樣做的結果是,如果用戶端和伺服器設定在不同的區域設定中,則訊息可能會以不同的語言顯示,具體取決於它們的來源。
注意 當我們談到從執行環境繼承語言環境時,這意味著在大多數作業系統上都有以下內容:對於給定的語言環境類別,比如排序規則,將按此順序查詢以下環境變數,直到找到一個設定:LC_ALL, LC_COLLATE(或對應於相應類別的變數),LANG。如果未設定這些環境變數,則語言環境預設為 C.
某些訊息的本地化函式庫還會查看環境變數 LANGUAGE,該變數將覆寫所有其他區域設定,以便設定訊息的語言。如有疑問,請參閱作業系統的文件,特別是有關 gettext 的文件。
要使訊息能夠轉換為用戶的偏好語言,必須在編譯時選擇 NLS(configure --enable-nls
)。所有其他語言環境支援都是自動編譯的。
語系設定會影響以下的 SQL 功能:
使用 ORDER BY 或標準比較運算子對查詢中文字排序
upper,lower 和 initcap 功能
樣式匹配運算子(LIKE,SIMILAR TO 和 POSIX 形式的正規表示式);locales 透過字元類的正規表示式影響不區分大小寫的匹配和字元分類
to_char 系列函數
索引可以與 LIKE 子句一起使用
在 PostgreSQL 中使用 C 或 POSIX 以外語言環境的缺點是對效能的影響。它會減慢字元處理速度並阻止 LIKE 使用普通索引。因此,最好只有在實際需要時才進行區域設定。
如果區域設定依上述說明操作卻不起作用的話,請檢查作業系統中的區域設定是否已正確配置。要檢查作業系統上安裝的語言環境,可以使用命令 locale -a(如果作業系統有提供的話)。
檢查 PostgreSQL 實際上是否正在使用您認為的語言環境。LC_COLLATE 和 LC_CTYPE 設定會在建立資料庫時確定,除非建立新的資料庫,否則無法變更。其他區域設定(包括 LC_MESSAGES 和 LC_MONETARY)最初由伺服器啟動的環境決定,但可以即時變更。您可以使用 SHOW 命令檢查當下有效的區域設定。
原始碼發行版中的目錄 src/test/locale 包含了 PostgreSQL 語言環境支援的測試套件。
當伺服器的訊息使用不同的語言時,透過解析錯誤訊息文字來處理伺服器端錯誤的用戶端應用程序顯然會出現問題。建議此類應用程序的作者使用錯誤代碼方案。
作為允許 PostgreSQL 在非 C 語言環境下使用具有 LIKE 子句索引的解決方法,存在多個自訂運算子類。允許建立一個執行嚴格的逐字元比較的索引,忽略區域設定的比較規則。有關更多訊息,請參閱。另一種方法是使用 C collation 建立索引,如中所述。
維護訊息翻譯目錄需要許多志願者的持續努力,他們希望看到 PostgreSQL 能夠順暢地說出他們喜歡的語言。如果您的語言訊息目前無法使用或未完全翻譯,我們將非常歡迎您的協助。如果您想幫助我們,請參閱或寫信給開發人員的郵件列表。
Name
Description
Language
Server?
ICU?
Bytes/Char
Aliases
BIG5
Big Five
Traditional Chinese
No
No
1-2
WIN950
, Windows950
EUC_CN
Extended UNIX Code-CN
Simplified Chinese
Yes
Yes
1-3
EUC_JP
Extended UNIX Code-JP
Japanese
Yes
Yes
1-3
EUC_JIS_2004
Extended UNIX Code-JP, JIS X 0213
Japanese
Yes
No
1-3
EUC_KR
Extended UNIX Code-KR
Korean
Yes
Yes
1-3
EUC_TW
Extended UNIX Code-TW
Traditional Chinese, Taiwanese
Yes
Yes
1-3
GB18030
National Standard
Chinese
No
No
1-4
GBK
Extended National Standard
Simplified Chinese
No
No
1-2
WIN936
, Windows936
ISO_8859_5
ISO 8859-5, ECMA 113
Latin/Cyrillic
Yes
Yes
1
ISO_8859_6
ISO 8859-6, ECMA 114
Latin/Arabic
Yes
Yes
1
ISO_8859_7
ISO 8859-7, ECMA 118
Latin/Greek
Yes
Yes
1
ISO_8859_8
ISO 8859-8, ECMA 121
Latin/Hebrew
Yes
Yes
1
JOHAB
JOHAB
Korean (Hangul)
No
No
1-3
KOI8R
KOI8-R
Cyrillic (Russian)
Yes
Yes
1
KOI8
KOI8U
KOI8-U
Cyrillic (Ukrainian)
Yes
Yes
1
LATIN1
ISO 8859-1, ECMA 94
Western European
Yes
Yes
1
ISO88591
LATIN2
ISO 8859-2, ECMA 94
Central European
Yes
Yes
1
ISO88592
LATIN3
ISO 8859-3, ECMA 94
South European
Yes
Yes
1
ISO88593
LATIN4
ISO 8859-4, ECMA 94
North European
Yes
Yes
1
ISO88594
LATIN5
ISO 8859-9, ECMA 128
Turkish
Yes
Yes
1
ISO88599
LATIN6
ISO 8859-10, ECMA 144
Nordic
Yes
Yes
1
ISO885910
LATIN7
ISO 8859-13
Baltic
Yes
Yes
1
ISO885913
LATIN8
ISO 8859-14
Celtic
Yes
Yes
1
ISO885914
LATIN9
ISO 8859-15
LATIN1 with Euro and accents
Yes
Yes
1
ISO885915
LATIN10
ISO 8859-16, ASRO SR 14111
Romanian
Yes
No
1
ISO885916
MULE_INTERNAL
Mule internal code
Multilingual Emacs
Yes
No
1-4
SJIS
Shift JIS
Japanese
No
No
1-2
Mskanji
, ShiftJIS
, WIN932
, Windows932
SHIFT_JIS_2004
Shift JIS, JIS X 0213
Japanese
No
No
1-2
SQL_ASCII
unspecified (see text)
any
Yes
No
1
UHC
Unified Hangul Code
Korean
No
No
1-2
WIN949
, Windows949
UTF8
Unicode, 8-bit
all
Yes
Yes
1-4
Unicode
WIN866
Windows CP866
Cyrillic
Yes
Yes
1
ALT
WIN874
Windows CP874
Thai
Yes
No
1
WIN1250
Windows CP1250
Central European
Yes
Yes
1
WIN1251
Windows CP1251
Cyrillic
Yes
Yes
1
WIN
WIN1252
Windows CP1252
Western European
Yes
Yes
1
WIN1253
Windows CP1253
Greek
Yes
Yes
1
WIN1254
Windows CP1254
Turkish
Yes
Yes
1
WIN1255
Windows CP1255
Hebrew
Yes
Yes
1
WIN1256
Windows CP1256
Arabic
Yes
Yes
1
WIN1257
Windows CP1257
Baltic
Yes
Yes
1
WIN1258
Windows CP1258
Vietnamese
Yes
Yes
1
ABC
, TCVN
, TCVN5712
, VSCII
Server Character Set
Available Client Character Sets
BIG5
not supported as a server encoding
EUC_CN
EUC_CN, MULE_INTERNAL
, UTF8
EUC_JP
EUC_JP, MULE_INTERNAL
, SJIS
, UTF8
EUC_KR
EUC_KR, MULE_INTERNAL
, UTF8
EUC_TW
EUC_TW, BIG5
, MULE_INTERNAL
, UTF8
GB18030
not supported as a server encoding
GBK
not supported as a server encoding
ISO_8859_5
ISO_8859_5, KOI8R
, MULE_INTERNAL
, UTF8
, WIN866
, WIN1251
ISO_8859_6
ISO_8859_6, UTF8
ISO_8859_7
ISO_8859_7, UTF8
ISO_8859_8
ISO_8859_8, UTF8
JOHAB
JOHAB, UTF8
KOI8R
KOI8R, ISO_8859_5
, MULE_INTERNAL
, UTF8
, WIN866
, WIN1251
KOI8U
KOI8U, UTF8
LATIN1
LATIN1, MULE_INTERNAL
, UTF8
LATIN2
LATIN2, MULE_INTERNAL
, UTF8
, WIN1250
LATIN3
LATIN3, MULE_INTERNAL
, UTF8
LATIN4
LATIN4, MULE_INTERNAL
, UTF8
LATIN5
LATIN5, UTF8
LATIN6
LATIN6, UTF8
LATIN7
LATIN7, UTF8
LATIN8
LATIN8, UTF8
LATIN9
LATIN9, UTF8
LATIN10
LATIN10, UTF8
MULE_INTERNAL
MULE_INTERNAL, BIG5
, EUC_CN
, EUC_JP
, EUC_KR
, EUC_TW
, ISO_8859_5
, KOI8R
, LATIN1
to LATIN4
, SJIS
, WIN866
, WIN1250
, WIN1251
SJIS
not supported as a server encoding
SQL_ASCII
any (no conversion will be performed)
UHC
not supported as a server encoding
UTF8
all supported encodings
WIN866
WIN866, ISO_8859_5
, KOI8R
, MULE_INTERNAL
, UTF8
, WIN1251
WIN874
WIN874, UTF8
WIN1250
WIN1250, LATIN2
, MULE_INTERNAL
, UTF8
WIN1251
WIN1251, ISO_8859_5
, KOI8R
, MULE_INTERNAL
, UTF8
, WIN866
WIN1252
WIN1252, UTF8
WIN1253
WIN1253, UTF8
WIN1254
WIN1254, UTF8
WIN1255
WIN1255, UTF8
WIN1256
WIN1256, UTF8
WIN1257
WIN1257, UTF8
WIN1258
WIN1258, UTF8
| String sort order |
| 字元分類(什麼是字母?它的大寫字母是?) |
| 訊息的語言 |
| 格式化貨幣金額 |
| 格式化數字 |
| 格式化日期和時間 |
The collation feature allows specifying the sort order and character classification behavior of data per-column, or even per-operation. This alleviates the restriction that the LC_COLLATE
and LC_CTYPE
settings of a database cannot be changed after its creation.
Conceptually, every expression of a collatable data type has a collation. (The built-in collatable data types are text
, varchar
, and char
. User-defined base types can also be marked collatable, and of course a domain over a collatable data type is collatable.) If the expression is a column reference, the collation of the expression is the defined collation of the column. If the expression is a constant, the collation is the default collation of the data type of the constant. The collation of a more complex expression is derived from the collations of its inputs, as described below.
The collation of an expression can be the “default” collation, which means the locale settings defined for the database. It is also possible for an expression's collation to be indeterminate. In such cases, ordering operations and other operations that need to know the collation will fail.
When the database system has to perform an ordering or a character classification, it uses the collation of the input expression. This happens, for example, with ORDER BY
clauses and function or operator calls such as <
. The collation to apply for an ORDER BY
clause is simply the collation of the sort key. The collation to apply for a function or operator call is derived from the arguments, as described below. In addition to comparison operators, collations are taken into account by functions that convert between lower and upper case letters, such as lower
, upper
, and initcap
; by pattern matching operators; and by to_char
and related functions.
For a function or operator call, the collation that is derived by examining the argument collations is used at run time for performing the specified operation. If the result of the function or operator call is of a collatable data type, the collation is also used at parse time as the defined collation of the function or operator expression, in case there is a surrounding expression that requires knowledge of its collation.
The collation derivation of an expression can be implicit or explicit. This distinction affects how collations are combined when multiple different collations appear in an expression. An explicit collation derivation occurs when a COLLATE
clause is used; all other collation derivations are implicit. When multiple collations need to be combined, for example in a function call, the following rules are used:
If any input expression has an explicit collation derivation, then all explicitly derived collations among the input expressions must be the same, otherwise an error is raised. If any explicitly derived collation is present, that is the result of the collation combination.
Otherwise, all input expressions must have the same implicit collation derivation or the default collation. If any non-default collation is present, that is the result of the collation combination. Otherwise, the result is the default collation.
If there are conflicting non-default implicit collations among the input expressions, then the combination is deemed to have indeterminate collation. This is not an error condition unless the particular function being invoked requires knowledge of the collation it should apply. If it does, an error will be raised at run-time.
For example, consider this table definition:
Then in
the <
comparison is performed according to de_DE
rules, because the expression combines an implicitly derived collation with the default collation. But in
the comparison is performed using fr_FR
rules, because the explicit collation derivation overrides the implicit one. Furthermore, given
the parser cannot determine which collation to apply, since the a
and b
columns have conflicting implicit collations. Since the <
operator does need to know which collation to use, this will result in an error. The error can be resolved by attaching an explicit collation specifier to either input expression, thus:
or equivalently
On the other hand, the structurally similar case
does not result in an error, because the ||
operator does not care about collations: its result is the same regardless of the collation.
The collation assigned to a function or operator's combined input expressions is also considered to apply to the function or operator's result, if the function or operator delivers a result of a collatable data type. So, in
the ordering will be done according to de_DE
rules. But this query:
results in an error, because even though the ||
operator doesn't need to know a collation, the ORDER BY
clause does. As before, the conflict can be resolved with an explicit collation specifier:
A collation is an SQL schema object that maps an SQL name to locales provided by libraries installed in the operating system. A collation definition has a provider that specifies which library supplies the locale data. One standard provider name is libc
, which uses the locales provided by the operating system C library. These are the locales that most tools provided by the operating system use. Another provider is icu
, which uses the external ICU library. ICU locales can only be used if support for ICU was configured when PostgreSQL was built.
A collation object provided by icu
maps to a named collator provided by the ICU library. ICU does not support separate “collate” and “ctype” settings, so they are always the same. Also, ICU collations are independent of the encoding, so there is always only one ICU collation of a given name in a database.
23.2.2.1. Standard Collations
On all platforms, the collations named default
, C
, and POSIX
are available. Additional collations may be available depending on operating system support. The default
collation selects the LC_COLLATE
and LC_CTYPE
values specified at database creation time. The C
and POSIX
collations both specify “traditional C” behavior, in which only the ASCII letters “A
” through “Z
” are treated as letters, and sorting is done strictly by character code byte values.
Additionally, the SQL standard collation name ucs_basic
is available for encoding UTF8
. It is equivalent to C
and sorts by Unicode code point.
23.2.2.2. Predefined Collations
If the operating system provides support for using multiple locales within a single program (newlocale
and related functions), or if support for ICU is configured, then when a database cluster is initialized, initdb
populates the system catalog pg_collation
with collations based on all the locales it finds in the operating system at the time.
To inspect the currently available locales, use the query SELECT * FROM pg_collation
, or the command \dOS+
in psql.
23.2.2.2.1. libc collations
For example, the operating system might provide a locale named de_DE.utf8
. initdb
would then create a collation named de_DE.utf8
for encoding UTF8
that has both LC_COLLATE
and LC_CTYPE
set to de_DE.utf8
. It will also create a collation with the .utf8
tag stripped off the name. So you could also use the collation under the name de_DE
, which is less cumbersome to write and makes the name less encoding-dependent. Note that, nevertheless, the initial set of collation names is platform-dependent.
Within any particular database, only collations that use that database's encoding are of interest. Other entries in pg_collation
are ignored. Thus, a stripped collation name such as de_DE
can be considered unique within a given database even though it would not be unique globally. Use of the stripped collation names is recommended, since it will make one less thing you need to change if you decide to change to another database encoding. Note however that the default
, C
, and POSIX
collations can be used regardless of the database encoding.
PostgreSQL considers distinct collation objects to be incompatible even when they have identical properties. Thus for example,
will draw an error even though the C
and POSIX
collations have identical behaviors. Mixing stripped and non-stripped collation names is therefore not recommended.
23.2.2.2.2. ICU collations
With ICU, it is not sensible to enumerate all possible locale names. ICU uses a particular naming system for locales, but there are many more ways to name a locale than there are actually distinct locales. initdb
uses the ICU APIs to extract a set of distinct locales to populate the initial set of collations. Collations provided by ICU are created in the SQL environment with names in BCP 47 language tag format, with a “private use” extension -x-icu
appended, to distinguish them from libc locales.
Here are some example collations that might be created:de-x-icu
German collation, default variantde-AT-x-icu
German collation for Austria, default variant
(There are also, say, de-DE-x-icu
or de-CH-x-icu
, but as of this writing, they are equivalent to de-x-icu
.)und-x-icu
(for “undefined”)
ICU “root” collation. Use this to get a reasonable language-agnostic sort order.
Some (less frequently used) encodings are not supported by ICU. When the database encoding is one of these, ICU collation entries in pg_collation
are ignored. Attempting to use one will draw an error along the lines of “collation "de-x-icu" for encoding "WIN874" does not exist”.
23.2.2.3. Creating New Collation Objects
The standard and predefined collations are in the schema pg_catalog
, like all predefined objects. User-defined collations should be created in user schemas. This also ensures that they are saved by pg_dump
.
23.2.2.3.1. libc collations
New libc collations can be created like this:
The exact values that are acceptable for the locale
clause in this command depend on the operating system. On Unix-like systems, the command locale -a
will show a list.
23.2.2.3.2. ICU collations
Here are some examples:CREATE COLLATION "de-u-co-phonebk-x-icu" (provider = icu, locale = 'de-u-co-phonebk');
CREATE COLLATION "de-u-co-phonebk-x-icu" (provider = icu, locale = 'de@collation=phonebook');
German collation with phone book collation type
The first example selects the ICU locale using a “language tag” per BCP 47. The second example uses the traditional ICU-specific locale syntax. The first style is preferred going forward, but it is not supported by older ICU versions.
Note that you can name the collation objects in the SQL environment anything you want. In this example, we follow the naming style that the predefined collations use, which in turn also follow BCP 47, but that is not required for user-defined collations.CREATE COLLATION "und-u-co-emoji-x-icu" (provider = icu, locale = 'und-u-co-emoji');
CREATE COLLATION "und-u-co-emoji-x-icu" (provider = icu, locale = '@collation=emoji');
Root collation with Emoji collation type, per Unicode Technical Standard #51
Observe how in the traditional ICU locale naming system, the root locale is selected by an empty string.CREATE COLLATION digitslast (provider = icu, locale = 'en-u-kr-latn-digit');
CREATE COLLATION digitslast (provider = icu, locale = 'en@colReorder=latn-digit');
Sort digits after Latin letters. (The default is digits before letters.)CREATE COLLATION upperfirst (provider = icu, locale = 'en-u-kf-upper');
CREATE COLLATION upperfirst (provider = icu, locale = 'en@colCaseFirst=upper');
Sort upper-case letters before lower-case letters. (The default is lower-case letters first.)CREATE COLLATION special (provider = icu, locale = 'en-u-kf-upper-kr-latn-digit');
CREATE COLLATION special (provider = icu, locale = 'en@colCaseFirst=upper;colReorder=latn-digit');
Combines both of the above options.CREATE COLLATION numeric (provider = icu, locale = 'en-u-kn-true');
CREATE COLLATION numeric (provider = icu, locale = 'en@colNumeric=yes');
Numeric ordering, sorts sequences of digits by their numeric value, for example: A-21
< A-123
(also known as natural sort).
Note that while this system allows creating collations that “ignore case” or “ignore accents” or similar (using the ks
key), PostgreSQL does not at the moment allow such collations to act in a truly case- or accent-insensitive manner. Any strings that compare equal according to the collation but are not byte-wise equal will be sorted according to their byte values.
By design, ICU will accept almost any string as a locale name and match it to the closest locale it can provide, using the fallback procedure described in its documentation. Thus, there will be no direct feedback if a collation specification is composed using features that the given ICU installation does not actually support. It is therefore recommended to create application-level test cases to check that the collation definitions satisfy one's requirements.
23.2.2.3.3. Copying Collations
A collation object provided by libc
maps to a combination of LC_COLLATE
and LC_CTYPE
settings, as accepted by the setlocale()
system library call. (As the name would suggest, the main purpose of a collation is to set LC_COLLATE
, which controls the sort order. But it is rarely necessary in practice to have an LC_CTYPE
setting that is different from LC_COLLATE
, so it is more convenient to collect these under one concept than to create another infrastructure for setting LC_CTYPE
per expression.) Also, a libc
collation is tied to a character set encoding (see ). The same collation name may exist for different encodings.
The default set of collations provided by libc
map directly to the locales installed in the operating system, which can be listed using the command locale -a
. In case a libc
collation is needed that has different values for LC_COLLATE
and LC_CTYPE
, or if new locales are installed in the operating system after the database system was initialized, then a new collation may be created using the command. New operating system locales can also be imported en masse using the function.
If the standard and predefined collations are not sufficient, users can create their own collation objects using the SQL command .
Since the predefined libc collations already include all collations defined in the operating system when the database instance is initialized, it is not often necessary to manually create new ones. Reasons might be if a different naming system is desired (in which case see also ) or if the operating system has been upgraded to provide new locale definitions (in which case see also ).
ICU allows collations to be customized beyond the basic language+country set that is preloaded by initdb
. Users are encouraged to define their own collation objects that make use of these facilities to suit the sorting behavior to their requirements. See and for information on ICU locale naming. The set of acceptable names and attributes depends on the particular ICU version.
See and for details. The list of possible collation types (co
subtag) can be found in the . The can be used to check the details of a particular locale definition. The examples using the k*
subtags require at least ICU version 54.
The command can also be used to create a new collation from an existing collation, which can be useful to be able to use operating-system-independent collation names in applications, create compatibility names, or use an ICU-provided collation under a more readable name. For example: