66. GIN Indexes

66.1. Introduction

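GIN stands for Generalized Inverted Index. GIN is designed for cases where the items to be indexed are composite values, and the queries handled by the index need to search for element values that appear within those composite items. For example, the items could be documents, and the queries could be searches for documents containing specific words.

We use the word item to refer to a composite value that is to be indexed, and the word key to refer to an element value. GIN always stores and searches for keys, not item values as such.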

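A GIN index stores a set of (key, posting list) pairs, where a posting list is the set of row IDs in which the key occurs. The same row ID can appear in multiple posting lists, since an item can contain more than one key. Each key value is stored only once, so a GIN index is very compact when the same key appears many times.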
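The (key, posting list) layout can be sketched in a few lines of standalone C. This is only an illustration under stated assumptions: keys are plain int values, the structure names (PostingEntry, InvertedIndex) and the fixed capacities are hypothetical, and keys are found by linear search rather than through the B-tree that GIN actually uses.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_KEYS     16   /* capacities chosen only for this sketch */
#define MAX_POSTINGS 16

/* One (key, posting list) pair: the key is stored once, followed by
 * the IDs of every row whose item contains it. */
typedef struct {
    int    key;
    int    row_ids[MAX_POSTINGS];
    size_t nrows;
} PostingEntry;

typedef struct {
    PostingEntry entries[MAX_KEYS];
    size_t       nentries;
} InvertedIndex;

/* Return the entry for key, or NULL if the key is not indexed. */
PostingEntry *find_key(InvertedIndex *idx, int key)
{
    for (size_t i = 0; i < idx->nentries; i++)
        if (idx->entries[i].key == key)
            return &idx->entries[i];
    return NULL;
}

/* Index one item (an array of element values) stored in row row_id. */
void index_item(InvertedIndex *idx, int row_id, const int *elems, size_t nelems)
{
    for (size_t i = 0; i < nelems; i++) {
        PostingEntry *e = find_key(idx, elems[i]);
        if (e == NULL) {
            /* First occurrence of this key anywhere: create its entry. */
            assert(idx->nentries < MAX_KEYS);
            e = &idx->entries[idx->nentries++];
            e->key = elems[i];
            e->nrows = 0;
        }
        /* A key repeated within the same item is recorded only once;
         * the repeat would be the most recently appended row ID. */
        if (e->nrows > 0 && e->row_ids[e->nrows - 1] == row_id)
            continue;
        assert(e->nrows < MAX_POSTINGS);
        e->row_ids[e->nrows++] = row_id;
    }
}
```

Indexing row 1 with elements {10, 20, 10} and row 2 with {20, 30} yields three entries. Key 20 has the posting list [1, 2], so the same row ID appears in more than one posting list while each key is stored once.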

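GIN is generalized in the sense that the GIN access method code does not need to know the specific operations that it accelerates. Instead, it uses custom strategies defined for particular data types. The strategy defines how keys are extracted from indexed items and query conditions, and how to determine whether a row that contains some of the key values in a query actually satisfies the query.

One advantage of GIN is that it allows custom data types with appropriate access methods to be developed by an expert in the domain of the data type, rather than by a database expert. This is much the same advantage as using GiST.

The GIN implementation in PostgreSQL is primarily maintained by Teodor Sigaev and Oleg Bartunov. There is more information about GIN on their website.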

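66.2. Built-in Operator Classes

The core PostgreSQL distribution includes the GIN operator classes shown in Table 64.1. (Some of the optional modules described in Appendix F provide additional GIN operator classes.)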

Table 64.1. Built-in GIN Operator Classes

Of the two operator classes for type jsonb, jsonb_ops is the default. jsonb_path_ops supports fewer operators but offers better performance for those operators. See Section 8.14.4 for details.

66.3. Extensibility

The GIN interface has a high level of abstraction, requiring the access method implementer only to implement the semantics of the data type being accessed. The GIN layer itself takes care of concurrency, logging and searching the tree structure.

All it takes to get a GIN access method working is to implement a few user-defined methods, which define the behavior of keys in the tree and the relationships between keys, indexed items, and indexable queries. In short, GIN combines extensibility with generality, code reuse, and a clean interface.

There are two methods that an operator class for GIN must provide:

Datum *extractValue(Datum itemValue, int32 *nkeys, bool **nullFlags)

Returns a palloc'd array of keys given an item to be indexed. The number of returned keys must be stored into *nkeys. If any of the keys can be null, also palloc an array of *nkeys bool fields, store its address at *nullFlags, and set these null flags as needed. *nullFlags can be left NULL (its initial value) if all keys are non-null. The return value can be NULL if the item contains no keys.

Datum *extractQuery(Datum query, int32 *nkeys, StrategyNumber n, bool **pmatch, Pointer **extra_data, bool **nullFlags, int32 *searchMode)

Returns a palloc'd array of keys given a value to be queried; that is, query is the value on the right-hand side of an indexable operator whose left-hand side is the indexed column. n is the strategy number of the operator within the operator class (see Section 37.14.2). Often, extractQuery will need to consult n to determine the data type of query and the method it should use to extract key values. The number of returned keys must be stored into *nkeys. If any of the keys can be null, also palloc an array of *nkeys bool fields, store its address at *nullFlags, and set these null flags as needed. *nullFlags can be left NULL (its initial value) if all keys are non-null. The return value can be NULL if the query contains no keys.

searchMode is an output argument that allows extractQuery to specify details about how the search will be done. If *searchMode is set to GIN_SEARCH_MODE_DEFAULT (which is the value it is initialized to before call), only items that match at least one of the returned keys are considered candidate matches. If *searchMode is set to GIN_SEARCH_MODE_INCLUDE_EMPTY, then in addition to items containing at least one matching key, items that contain no keys at all are considered candidate matches. (This mode is useful for implementing is-subset-of operators, for example.) If *searchMode is set to GIN_SEARCH_MODE_ALL, then all non-null items in the index are considered candidate matches, whether they match any of the returned keys or not. (This mode is much slower than the other two choices, since it requires scanning essentially the entire index, but it may be necessary to implement corner cases correctly. An operator that needs this mode in most cases is probably not a good candidate for a GIN operator class.) The symbols to use for setting this mode are defined in access/gin.h.

pmatch is an output argument for use when partial match is supported. To use it, extractQuery must allocate an array of *nkeys booleans and store its address at *pmatch. Each element of the array should be set to TRUE if the corresponding key requires partial match, FALSE if not. If *pmatch is set to NULL then GIN assumes partial match is not required. The variable is initialized to NULL before call, so this argument can simply be ignored by operator classes that do not support partial match.

extra_data is an output argument that allows extractQuery to pass additional data to the consistent and comparePartial methods. To use it, extractQuery must allocate an array of *nkeys pointers and store its address at *extra_data, then store whatever it wants to into the individual pointers. The variable is initialized to NULL before call, so this argument can simply be ignored by operator classes that do not require extra data. If *extra_data is set, the whole array is passed to the consistent method, and the appropriate element to the comparePartial method.
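The output-argument contract described earlier for extractValue (a returned key array, a count in *nkeys, and a null-flag array allocated only when needed) can be illustrated with a standalone analog. This is a sketch, not the PostgreSQL API: the Elem type and extract_keys function are hypothetical, plain int stands in for Datum, and malloc/calloc stand in for palloc.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical item element for this sketch: an int that may be null. */
typedef struct {
    int  value;
    bool is_null;
} Elem;

/*
 * Stand-in for extractValue. Returns a heap-allocated array with one key
 * per element and stores the key count in *nkeys. *null_flags is
 * allocated only if at least one key is null; otherwise it stays NULL.
 * Returns NULL when the item contains no keys.
 */
int *extract_keys(const Elem *item, int nelems, int *nkeys, bool **null_flags)
{
    assert(nkeys != NULL && null_flags != NULL);
    *nkeys = 0;
    *null_flags = NULL;   /* GIN itself initializes this before the call */
    if (nelems <= 0)
        return NULL;

    int *keys = malloc((size_t) nelems * sizeof *keys);
    if (keys == NULL)
        abort();          /* palloc would raise an error instead */

    for (int i = 0; i < nelems; i++) {
        keys[i] = item[i].is_null ? 0 : item[i].value;
        if (!item[i].is_null)
            continue;
        if (*null_flags == NULL) {
            /* First null key seen: allocate flags, all initially false. */
            *null_flags = calloc((size_t) nelems, sizeof **null_flags);
            if (*null_flags == NULL)
                abort();
        }
        (*null_flags)[i] = true;
    }
    *nkeys = nelems;
    return keys;
}
```

A caller checks *null_flags for NULL before reading it, which mirrors how an operator class can ignore the null-flag argument entirely when its keys are never null.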

An operator class must also provide a function to check if an indexed item matches the query. It comes in two flavors, a boolean consistent function, and a ternary triConsistent function. triConsistent covers the functionality of both, so providing triConsistent alone is sufficient. However, if the boolean variant is significantly cheaper to calculate, it can be advantageous to provide both. If only the boolean variant is provided, some optimizations that depend on refuting index items before fetching all the keys are disabled.

bool consistent(bool check[], StrategyNumber n, Datum query, int32 nkeys, Pointer extra_data[], bool *recheck, Datum queryKeys[], bool nullFlags[])

Returns TRUE if an indexed item satisfies the query operator with strategy number n (or might satisfy it, if the recheck indication is returned). This function does not have direct access to the indexed item's value, since GIN does not store items explicitly. Rather, what is available is knowledge about which key values extracted from the query appear in a given indexed item. The check array has length nkeys, which is the same as the number of keys previously returned by extractQuery for this query datum. Each element of the check array is TRUE if the indexed item contains the corresponding query key, i.e., if (check[i] == TRUE) the i-th key of the extractQuery result array is present in the indexed item. The original query datum is passed in case the consistent method needs to consult it, and so are the queryKeys[] and nullFlags[] arrays previously returned by extractQuery. extra_data is the extra-data array returned by extractQuery, or NULL if none.

When extractQuery returns a null key in queryKeys[], the corresponding check[] element is TRUE if the indexed item contains a null key; that is, the semantics of check[] are like IS NOT DISTINCT FROM. The consistent function can examine the corresponding nullFlags[] element if it needs to tell the difference between a regular value match and a null match.

On success, *recheck should be set to TRUE if the heap tuple needs to be rechecked against the query operator, or FALSE if the index test is exact. That is, a FALSE return value guarantees that the heap tuple does not match the query; a TRUE return value with *recheck set to FALSE guarantees that the heap tuple does match the query; and a TRUE return value with *recheck set to TRUE means that the heap tuple might match the query, so it needs to be fetched and rechecked by evaluating the query operator directly against the originally indexed item.

GinTernaryValue triConsistent(GinTernaryValue check[], StrategyNumber n, Datum query, int32 nkeys, Pointer extra_data[], Datum queryKeys[], bool nullFlags[])

triConsistent is similar to consistent, but instead of booleans in the check vector, there are three possible values for each key: GIN_TRUE, GIN_FALSE and GIN_MAYBE. GIN_FALSE and GIN_TRUE have the same meaning as regular boolean values, while GIN_MAYBE means that the presence of that key is not known. When GIN_MAYBE values are present, the function should only return GIN_TRUE if the item certainly matches whether or not the index item contains the corresponding query keys. Likewise, the function must return GIN_FALSE only if the item certainly does not match, whether or not it contains the GIN_MAYBE keys. If the result depends on the GIN_MAYBE entries, i.e., the match cannot be confirmed or refuted based on the known query keys, the function must return GIN_MAYBE.

When there are no GIN_MAYBE values in the check vector, a GIN_MAYBE return value is the equivalent of setting the recheck flag in the boolean consistent function.

In addition, GIN must have a way to sort the key values stored in the index. The operator class can define the sort ordering by specifying a comparison method:

int compare(Datum a, Datum b)

Compares two keys (not indexed items!) and returns an integer less than zero, zero, or greater than zero, indicating whether the first key is less than, equal to, or greater than the second. Null keys are never passed to this function.

Alternatively, if the operator class does not provide a compare method, GIN will look up the default btree operator class for the index key data type, and use its comparison function. It is recommended to specify the comparison function in a GIN operator class that is meant for just one data type, as looking up the btree operator class costs a few cycles. However, polymorphic GIN operator classes (such as array_ops) typically cannot specify a single comparison function.
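The three-way contract of a compare method can be shown with two small standalone functions. These are illustrative only: the names are hypothetical, and plain int and C strings stand in for Datum keys.

```c
#include <assert.h>
#include <string.h>

/* Three-way comparison of two int keys: negative, zero, or positive. */
int compare_int_keys(int a, int b)
{
    /* (a > b) - (a < b) avoids the overflow that a - b could cause. */
    return (a > b) - (a < b);
}

/* Byte-wise ordering for string keys. Null keys are never passed to a
 * compare method, which the assertion documents for this sketch. */
int compare_text_keys(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strcmp(a, b);
}
```

Only the sign of the result matters to the caller, so returning strcmp's value directly is sufficient.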

Optionally, an operator class for GIN can supply the following method:

int comparePartial(Datum partial_key, Datum key, StrategyNumber n, Pointer extra_data)

Compare a partial-match query key to an index key. Returns an integer whose sign indicates the result: less than zero means the index key does not match the query, but the index scan should continue; zero means that the index key does match the query; greater than zero indicates that the index scan should stop because no more matches are possible. The strategy number n of the operator that generated the partial match query is provided, in case its semantics are needed to determine when to end the scan. Also, extra_data is the corresponding element of the extra-data array made by extractQuery, or NULL if none. Null keys are never passed to this function.

To support “partial match” queries, an operator class must provide the comparePartial method, and its extractQuery method must set the pmatch parameter when a partial-match query is encountered. See Section 64.4.2 for details.

The actual data types of the various Datum values mentioned above vary depending on the operator class. The item values passed to extractValue are always of the operator class's input type, and all key values must be of the class's STORAGE type. The type of the query argument passed to extractQuery, consistent and triConsistent is whatever is the right-hand input type of the class member operator identified by the strategy number. This need not be the same as the indexed type, so long as key values of the correct type can be extracted from it. However, it is recommended that the SQL declarations of these three support functions use the opclass's indexed data type for the query argument, even though the actual type might be something else depending on the operator.

66.4. Implementation

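Internally, a GIN index contains a B-tree index constructed over keys, where each key is an element of one or more indexed items (a member of an array, for example) and where each tuple in a leaf page contains either a pointer to a B-tree of heap pointers (a “posting tree”), or a simple list of heap pointers (a “posting list”) when the list is small enough to fit into a single index tuple along with the key value.

As of PostgreSQL 9.1, null key values can be included in the index. Also, placeholder nulls are included in the index for indexed items that are null or that contain no keys according to extractValue. This allows searches that should find empty items to do so.

Multicolumn GIN indexes are implemented by building a single B-tree over composite values (column number, key value). The key values for different columns can be of different types.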

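64.4.1. GIN Fast Update Technique

Updating a GIN index tends to be slow because of the intrinsic nature of inverted indexes: inserting or updating one heap row can cause many inserts into the index (one for each key extracted from the indexed item). As of PostgreSQL 8.4, GIN is capable of postponing much of this work by inserting new tuples into a temporary, unsorted list of pending entries. When the table is vacuumed or autoanalyzed, when the gin_clean_pending_list function is called, or when the pending list grows larger than its configured size limit, the entries are moved to the main GIN data structure using the same bulk insert techniques used during initial index creation. This greatly improves GIN index update speed, even counting the additional vacuum overhead. Moreover, the overhead work can be accomplished by a background process instead of in foreground query processing.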

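The main disadvantage of this approach is that searches must scan the list of pending entries in addition to searching the regular index, so a large list of pending entries will slow searches significantly. Another disadvantage is that, while most updates are fast, an update that causes the pending list to become “too large” will incur an immediate cleanup cycle and thus be much slower than other updates. Proper use of autovacuum can minimize both of these problems.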
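The trade-off can be seen in a toy model of the pending list. The sketch below is not GIN's implementation: the main structure is a sorted int array rather than a B-tree, and the FastUpdateIndex type, function names, and the PENDING_LIMIT value are all hypothetical. It only shows the shape of the technique: cheap appends, an occasional batch cleanup paid for by one unlucky insert, and searches that must look in both places.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define MAIN_CAP      64   /* capacities and limit are arbitrary sketch values */
#define PENDING_LIMIT 4

typedef struct {
    int    main_keys[MAIN_CAP];          /* sorted: the "main" structure  */
    size_t nmain;
    int    pending[PENDING_LIMIT + 1];   /* unsorted recent insertions    */
    size_t npending;
    int    flushes;                      /* number of cleanups performed  */
} FastUpdateIndex;

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *) a, y = *(const int *) b;
    return (x > y) - (x < y);
}

/* Move all pending entries into the sorted main array in one batch. */
void flush_pending(FastUpdateIndex *idx)
{
    assert(idx->nmain + idx->npending <= MAIN_CAP);
    for (size_t i = 0; i < idx->npending; i++)
        idx->main_keys[idx->nmain++] = idx->pending[i];
    qsort(idx->main_keys, idx->nmain, sizeof(int), cmp_int);
    idx->npending = 0;
    idx->flushes++;
}

/* Cheap insert: append to the pending list, cleaning up only when the
 * list grows past the limit. */
void insert_key(FastUpdateIndex *idx, int key)
{
    idx->pending[idx->npending++] = key;
    if (idx->npending > PENDING_LIMIT)
        flush_pending(idx);   /* this particular insert pays for the cleanup */
}

/* A search must consult both the main structure and the pending list. */
bool contains_key(const FastUpdateIndex *idx, int key)
{
    if (bsearch(&key, idx->main_keys, idx->nmain, sizeof(int), cmp_int) != NULL)
        return true;
    for (size_t i = 0; i < idx->npending; i++)
        if (idx->pending[i] == key)
            return true;
    return false;
}
```

With a limit of 4, the first four inserts only append to the pending list, and the fifth triggers a flush. A long pending list makes the linear scan in contains_key dominate search cost, which is the first disadvantage named above.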

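If consistent response time is more important than update speed, use of pending entries can be disabled by turning off the fastupdate storage parameter for a GIN index. For details, see .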

64.4.2. Partial Match Algorithm

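GIN can support “partial match” queries, in which the query does not determine an exact match for one or more keys, but the possible matches fall within a reasonably narrow range of key values (within the key sorting order determined by the compare support method). The extractQuery method, instead of returning a key value to be matched exactly, returns a key value that is the lower bound of the range to be searched, and sets the pmatch flag true. The key range is then scanned using the comparePartial method. comparePartial must return zero for a matching index key, less than zero for a non-match that is still within the range to be searched, or greater than zero if the index key is past the range that could match.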
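The scan protocol can be sketched with a standalone example that exercises all three return values. Everything here is hypothetical: int keys in a sorted array replace the B-tree, the RangeFilter struct plays the role of the extra data, and the match rule (multiples of step that do not exceed upper) is invented purely so that in-range non-matches exist.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical extra data: a key matches when it is a multiple of
 * step and does not exceed upper. */
typedef struct {
    int upper;
    int step;
} RangeFilter;

/* Stand-in for comparePartial. */
int compare_partial(int key, const RangeFilter *f)
{
    assert(f != NULL && f->step > 0);
    if (key > f->upper)
        return 1;    /* past the range: the scan should stop   */
    if (key % f->step == 0)
        return 0;    /* match                                  */
    return -1;       /* in range but no match: keep scanning   */
}

/*
 * Scan sorted keys starting at the first key >= lower (the lower bound
 * that extractQuery would return) and copy matches into out, which must
 * have room for nkeys entries. Returns the number of matches.
 */
size_t partial_match_scan(const int *keys, size_t nkeys, int lower,
                          const RangeFilter *f, int *out)
{
    size_t nout = 0;
    size_t i = 0;

    /* A real index would descend the B-tree to reach this position. */
    while (i < nkeys && keys[i] < lower)
        i++;

    for (; i < nkeys; i++) {
        int c = compare_partial(keys[i], f);
        if (c > 0)
            break;
        if (c == 0)
            out[nout++] = keys[i];
    }
    return nout;
}
```

For the sorted keys {1, 4, 5, 6, 9, 10, 12, 15, 40} with lower bound 5, upper 12 and step 3, the scan skips 5 and 10, collects 6, 9 and 12, and stops at 15 without ever looking at 40.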

66.5. GIN Tips and Tricks

Create vs. insert

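Insertion into a GIN index can be slow because many keys are likely to be inserted for each item. So, for bulk insertions into a table it is advisable to drop the GIN index and recreate it after the bulk insertion has finished.

As of PostgreSQL 8.4, this advice is less necessary since delayed indexing is used (see Section 64.4.1 for details). But for very large updates it may still be best to drop and recreate the index.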

maintenance_work_mem

Build time for a GIN index is very sensitive to the maintenance_work_mem setting; it doesn't pay to skimp on work memory during index creation.

gin_pending_list_limit

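During a series of insertions into an existing GIN index that has fastupdate enabled, the system will clean up the pending-entry list whenever the list grows larger than gin_pending_list_limit. To avoid fluctuations in observed response time, it is desirable to have pending-list cleanup occur in the background (i.e., via autovacuum). Foreground cleanup operations can be avoided by increasing gin_pending_list_limit or making autovacuum more aggressive. However, enlarging the threshold of the cleanup operation means that if a foreground cleanup does occur, it will take even longer.

gin_pending_list_limit can be overridden for individual GIN indexes by changing storage parameters, which allows each GIN index to have its own cleanup threshold. For example, it is possible to increase the threshold only for a GIN index that is updated heavily, and decrease it otherwise.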

gin_fuzzy_search_limit

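The primary goal of developing GIN indexes was to support highly scalable full-text search in PostgreSQL, and there are often situations when a full-text search returns a very large set of results. Moreover, this often happens when the query contains very frequent words, so that the large result set is not even useful. Since reading many tuples from disk and sorting them could take a lot of time, this is unacceptable for production. (Note that the index search itself is very fast.)

To facilitate controlled execution of such queries, GIN has a configurable soft upper limit on the number of rows returned: the gin_fuzzy_search_limit configuration parameter. It is set to 0 (meaning no limit) by default. If a non-zero limit is set, then the returned set is a subset of the whole result set, chosen at random.

“Soft” means that the actual number of returned results could differ somewhat from the specified limit, depending on the query and the quality of the system's random number generator.

From experience, values in the thousands (e.g., 5000 to 20000) work well.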

66.6. Limitations

GIN assumes that indexable operators are strict. This means that extractValue will not be called at all on a null item value (instead, a placeholder index entry is created automatically), and extractQuery will not be called on a null query value either (instead, the query is presumed to be unsatisfiable). Note however that null key values contained within a non-null composite item or query value are supported.

66.7. Examples

The core PostgreSQL distribution includes the GIN operator classes previously shown in . The following contrib modules also contain GIN operator classes:

btree_gin

B-tree equivalent functionality for several data types

hstore

Module for storing (key, value) pairs

intarray

Enhanced support for int[]

pg_trgm

Text similarity using trigram matching
