This part contains assorted information that might be of use to PostgreSQL developers.
The catalog pg_auth_members shows the membership relations between roles. Any non-circular set of relationships is allowed.
Because role identities are cluster-wide, pg_auth_members is shared across all databases of a cluster: there is only one copy of pg_auth_members per cluster, not one per database.
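As a quick illustration, the membership relations can be read back with ordinary SQL; a minimal sketch that joins pg_authid twice to resolve the OIDs to role names:

SELECT r.rolname AS role,
       m.rolname AS member,
       am.admin_option
FROM pg_auth_members am
JOIN pg_authid r ON r.oid = am.roleid
JOIN pg_authid m ON m.oid = am.member;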
Table 51.9. pg_auth_members
Columns
The catalog pg_am stores information about relation access methods. There is one row for each access method supported by the system. Currently, only indexes have access methods. The requirements for index access methods are discussed in detail in Chapter 60.
Table 51.3. pg_am
Columns
Before PostgreSQL 9.6, pg_am contained many additional columns representing properties of index access methods. That data is now only directly visible at the C code level. However, pg_index_column_has_property() and related functions have been added to allow SQL queries to inspect index access method properties; see Table 9.63.
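For example, the registered index access methods and one of these property-inspection functions can be queried as follows (the index name my_index is hypothetical):

SELECT amname FROM pg_am WHERE amtype = 'i';
SELECT pg_index_column_has_property('my_index'::regclass, 1, 'asc');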
The catalog pg_class catalogs tables and most everything else that has columns or is otherwise similar to a table. This includes indexes (but see also pg_index), sequences (but see also pg_sequence), views, materialized views, composite types, and TOAST tables; see relkind. Below, when we mean all of these kinds of objects we speak of "relations". Not all columns are meaningful for all relation types.
pg_class Columns
Several of the Boolean flags in pg_class are maintained lazily: they are guaranteed to be true if that's the correct state, but may not be reset to false immediately when the condition is no longer true. For example, relhasindex is set by CREATE INDEX, but it is never cleared by DROP INDEX. Instead, VACUUM clears relhasindex if it finds the table has no indexes. This arrangement avoids race conditions and improves concurrency.
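For instance, the relkind codes can be used to list the relations in a schema; a minimal sketch (the schema name public is just the usual default):

SELECT relname, relkind, relhasindex
FROM pg_class
WHERE relnamespace = 'public'::regnamespace;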
The catalog pg_authid contains information about database authorization identifiers (roles). A role subsumes the concepts of "users" and "groups". A user is essentially just a role with the rolcanlogin flag set. Any role (with or without rolcanlogin) can have other roles as members; see .
Since this catalog contains passwords, it must not be publicly readable. pg_roles is a publicly readable view on pg_authid that blanks out the password field.
See for more information about user and privilege management.
Since user identities are cluster-wide, pg_authid is shared across all databases of a cluster: there is only one copy of pg_authid per cluster, not one per database.
Table 51.8. pg_authid
Columns
For an MD5 encrypted password, the rolpassword column will begin with the string md5 followed by a 32-character hexadecimal MD5 hash. The MD5 hash will be of the user's password concatenated to their user name. For example, if user joe has password xyzzy, PostgreSQL will store the md5 hash of xyzzyjoe.
If the password is encrypted with SCRAM-SHA-256, it has the format:
SCRAM-SHA-256$<iteration count>:<salt>$<StoredKey>:<ServerKey>
where salt, StoredKey and ServerKey are in Base64 encoded format. This format is the same as that specified by RFC 5803.
A password that does not follow either of those formats is assumed to be unencrypted.
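The MD5 construction can be reproduced with ordinary SQL, which is a handy way to check the description above (the user name and password are of course illustrative):

SELECT 'md5' || md5('xyzzy' || 'joe');  -- the rolpassword stored for user joe with password xyzzy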
| Name | Type | References | Description |
| --- | --- | --- | --- |
| roleid | oid | pg_authid.oid | ID of a role that has a member |
| member | oid | pg_authid.oid | ID of a role that is a member of roleid |
| grantor | oid | pg_authid.oid | ID of the role that granted this membership |
| admin_option | bool | | True if member can grant membership in roleid to others |
| Name | Type | References | Description |
| --- | --- | --- | --- |
| oid | oid | | Row identifier (hidden attribute; must be explicitly selected) |
| amname | name | | Name of the access method |
| amhandler | regproc | pg_proc.oid | OID of a handler function that is responsible for supplying information about the access method |
| amtype | char | | Currently always i to indicate an index access method; other values may be allowed in future |
The system catalogs are the place where a relational database management system stores schema metadata, such as information about tables and columns, and internal bookkeeping information. PostgreSQL's system catalogs are regular tables. You can drop and recreate the tables, add columns, insert and update values, and severely mess up your system that way. Normally, one should not change the system catalogs by hand; there are normally SQL commands to do that. (For example, CREATE DATABASE inserts a row into the pg_database catalog, and actually creates the database on disk.) There are some exceptions for particularly esoteric operations, but many of those have been made available as SQL commands over time, so the need for direct manipulation of the system catalogs is ever decreasing.
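For example, the row created by CREATE DATABASE can be inspected right away with an ordinary query; a small sketch with an illustrative database name:

CREATE DATABASE demo_db;
SELECT datname, encoding FROM pg_database WHERE datname = 'demo_db';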
PostgreSQL uses a message-based protocol for communication between frontends and backends (clients and servers). The protocol is supported over TCP/IP and also over Unix-domain sockets. Port number 5432 has been registered with IANA as the customary TCP port number for servers supporting this protocol, but in practice any non-privileged port number can be used.
This document describes version 3.0 of the protocol, implemented in PostgreSQL 7.4 and later. For descriptions of the earlier protocol versions, see previous releases of the PostgreSQL documentation. A single server can support multiple protocol versions. The initial startup-request message tells the server which protocol version the client is attempting to use, and then the server follows that protocol if it is able.
In order to serve multiple clients efficiently, the server launches a new “backend” process for each client. In the current implementation, a new child process is created immediately after an incoming connection is detected. This is transparent to the protocol, however. For purposes of the protocol, the terms “backend” and “server” are interchangeable; likewise “frontend” and “client” are interchangeable.
The PostgreSQL source distribution includes several examples of index methods implemented using GiST. The core system currently provides text search support (indexing for tsvector and tsquery) as well as R-Tree equivalent functionality for some of the built-in geometric data types (see src/backend/access/gist/gistproc.c). The following contrib modules also contain GiST operator classes:

btree_gist: B-tree equivalent functionality for several data types
cube: Indexing for multidimensional cubes
hstore: Module for storing (key, value) pairs
intarray: RD-Tree for one-dimensional array of int4 values
ltree: Indexing for tree-like structures
pg_trgm: Text similarity using trigram matching
seg: Indexing for “float ranges”
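As a sketch of using one of these modules, btree_gist makes ordinary scalar types indexable by GiST (the table and column names are illustrative):

CREATE EXTENSION btree_gist;
CREATE TABLE events (happened_at timestamptz);
CREATE INDEX events_gist_idx ON events USING gist (happened_at);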
GIN assumes that indexable operators are strict. This means that extractValue will not be called at all on a null item value (instead, a placeholder index entry is created automatically), and extractQuery will not be called on a null query value either (instead, the query is presumed to be unsatisfiable). Note however that null key values contained within a non-null composite item or query value are supported.
| Name | Type | References | Description |
| --- | --- | --- | --- |
| oid | oid | | Row identifier (hidden attribute; must be explicitly selected) |
| relname | name | | Name of the table, index, view, etc. |
| relnamespace | oid | pg_namespace.oid | The OID of the namespace that contains this relation |
| reltype | oid | pg_type.oid | The OID of the data type that corresponds to this table's row type, if any (zero for indexes, which have no pg_type entry) |
| reloftype | oid | pg_type.oid | For typed tables, the OID of the underlying composite type; zero for all other relations |
| relowner | oid | pg_authid.oid | Owner of the relation |
| relam | oid | pg_am.oid | If this is an index, the access method used (B-tree, hash, etc.) |
| relfilenode | oid | | Name of the on-disk file of this relation; zero means this is a "mapped" relation whose disk file name is determined by low-level state |
| reltablespace | oid | pg_tablespace.oid | The tablespace in which this relation is stored. If zero, the database's default tablespace is implied. (Not meaningful if the relation has no on-disk file.) |
| relpages | int4 | | Size of the on-disk representation of this table in pages (of size BLCKSZ). This is only an estimate used by the planner. It is updated by VACUUM, ANALYZE, and a few DDL commands such as CREATE INDEX. |
| reltuples | float4 | | Number of rows in the table. This is only an estimate used by the planner. It is updated by VACUUM, ANALYZE, and a few DDL commands such as CREATE INDEX. |
| relallvisible | int4 | | Number of pages that are marked all-visible in the table's visibility map. This is only an estimate used by the planner. It is updated by VACUUM, ANALYZE, and a few DDL commands such as CREATE INDEX. |
| reltoastrelid | oid | pg_class.oid | OID of the TOAST table associated with this table, 0 if none. The TOAST table stores large attributes "out of line" in a secondary table. |
| relhasindex | bool | | True if this is a table and it has (or recently had) any indexes |
| relisshared | bool | | True if this table is shared across all databases in the cluster. Only certain system catalogs (such as pg_database) are shared. |
| relpersistence | char | | p = permanent table, u = unlogged table, t = temporary table |
| relkind | char | | r = ordinary table, i = index, S = sequence, t = TOAST table, v = view, m = materialized view, c = composite type, f = foreign table, p = partitioned table |
| relnatts | int2 | | Number of user columns in the relation (system columns not counted). There must be this many corresponding entries in pg_attribute. See also pg_attribute.attnum. |
| relchecks | int2 | | Number of CHECK constraints on the table; see the pg_constraint catalog |
| relhasoids | bool | | True if we generate an OID for each row of the relation |
| relhaspkey | bool | | True if the table has (or once had) a primary key |
| relhasrules | bool | | True if table has (or once had) rules |
| relhastriggers | bool | | True if table has (or once had) triggers |
| relhassubclass | bool | | True if table has (or once had) any inheritance children |
| relrowsecurity | bool | | True if table has row level security enabled |
| relforcerowsecurity | bool | | True if row level security (when enabled) will also apply to the table owner |
| relispopulated | bool | | True if relation is populated (this is true for all relations other than some materialized views) |
| relreplident | char | | Columns used to form "replica identity" for rows: d = default (primary key, if any), n = nothing, f = all columns, i = index with indisreplident set, or default |
| relispartition | bool | | True if table is a partition |
| relfrozenxid | xid | | All transaction IDs before this one have been replaced with a permanent ("frozen") transaction ID in this table. This is used to track whether the table needs to be vacuumed in order to prevent transaction ID wraparound or to allow pg_xact to be shrunk. Zero (InvalidTransactionId) if the relation is not a table. |
| relminmxid | xid | | All multixact IDs before this one have been replaced by a transaction ID in this table. This is used to track whether the table needs to be vacuumed in order to prevent multixact ID wraparound or to allow pg_multixact to be shrunk. Zero (InvalidMultiXactId) if the relation is not a table. |
| relacl | aclitem[] | | Access privileges |
| reloptions | text[] | | Access-method-specific options, as "keyword=value" strings |
| relpartbound | pg_node_tree | | If table is a partition (see relispartition), internal representation of the partition bound |
| Name | Type | Description |
| --- | --- | --- |
| oid | oid | Row identifier (hidden attribute; must be explicitly selected) |
| rolname | name | Role name |
| rolsuper | bool | Role has superuser privileges |
| rolinherit | bool | Role automatically inherits privileges of roles it is a member of |
| rolcreaterole | bool | Role can create more roles |
| rolcreatedb | bool | Role can create databases |
| rolcanlogin | bool | Role can log in. That is, this role can be given as the initial session authorization identifier |
| rolreplication | bool | Role is a replication role. A replication role can initiate replication connections and create and drop replication slots. |
| rolbypassrls | bool | Role bypasses every row level security policy |
| rolconnlimit | int4 | For roles that can log in, this sets the maximum number of concurrent connections this role can make. -1 means no limit. |
| rolpassword | text | Password (possibly encrypted); null if none. The format depends on the form of encryption used. |
| rolvaliduntil | timestamptz | Password expiry time (only used for password authentication); null if no expiration |
The catalog pg_index contains part of the information about indexes. The rest is mostly in pg_class.
Table 51.26. pg_index
Columns
The catalog pg_language registers languages in which you can write functions or stored procedures. See CREATE LANGUAGE and Chapter 41 for more information about language handlers.
pg_language Columns
The catalog pg_collation describes the available collations, which are essentially mappings from an SQL name to operating system locale categories. See Section 23.2 for more information.
Table 51.12. pg_collation
Columns
Note that the unique key on this catalog is (collname, collencoding, collnamespace), not just (collname, collnamespace). PostgreSQL generally ignores all collations that do not have collencoding equal to either the current database's encoding or -1, and creation of new entries with the same name as an entry with collencoding = -1 is forbidden. Therefore it is sufficient to use a qualified SQL name (schema.name) to identify a collation, even though this is not unique according to the catalog definition. The reason for defining the catalog this way is that initdb fills it in at cluster initialization time with entries for all locales available on the system, so it must be able to hold entries for all encodings that might ever be used in the cluster.
In the template0 database, it could be useful to create collations whose encoding does not match the database encoding, since they could match the encodings of databases later cloned from template0. This would currently have to be done manually.
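A collation can therefore be looked up by qualified name alone; for example (the collation name is illustrative):

SELECT collname, collencoding, collnamespace::regnamespace AS schema
FROM pg_collation
WHERE collname = 'de_DE';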
The catalog pg_database stores information about the available databases. Databases are created with the CREATE DATABASE command. Consult Chapter 22 for details about the meaning of some of the parameters.
Unlike most system catalogs, pg_database is shared across all databases of a cluster: there is only one copy of pg_database per cluster, not one per database.
Table 51.15. pg_database
Columns
The catalog pg_attribute stores information about table columns. There will be exactly one pg_attribute row for every column in every table in the database. (There will also be attribute entries for indexes, and indeed all objects that have pg_class entries.)
The term attribute is equivalent to column and is used for historical reasons.
Table 51.7. pg_attribute
Columns
In a dropped column's pg_attribute entry, atttypid is reset to zero, but attlen and the other fields copied from pg_type are still valid. This arrangement is needed to cope with the situation where the dropped column's data type was later dropped, and so there is no pg_type row anymore. attlen and the other fields can be used to interpret the contents of a row of the table.
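A dropped column's entry can be observed directly; in this sketch the table and column names are hypothetical:

ALTER TABLE my_table DROP COLUMN old_col;
SELECT attname, atttypid, attisdropped
FROM pg_attribute
WHERE attrelid = 'my_table'::regclass;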
The catalog pg_opclass defines index access method operator classes. Each operator class defines semantics for index columns of a particular data type and a particular index access method. An operator class essentially specifies that a particular operator family is applicable to a particular indexable column data type. The set of operators from the family that are actually usable with the indexed column are whichever ones accept the column's data type as their left-hand input.
Operator classes are described at length in Section 37.14.
Table 51.33. pg_opclass
Columns
An operator class's opcmethod must match the opfmethod of its containing operator family. Also, there must be no more than one pg_opclass row having opcdefault true for any given combination of opcmethod and opcintype.
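For example, the default operator class for each access method and input type can be listed like this:

SELECT am.amname, opc.opcname, opc.opcintype::regtype
FROM pg_opclass opc
JOIN pg_am am ON am.oid = opc.opcmethod
WHERE opc.opcdefault;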
The catalog pg_proc stores information about functions, procedures, aggregate functions, and window functions (collectively also known as routines). See CREATE FUNCTION, CREATE PROCEDURE, and Section 38.3 for more information.
If prokind indicates that the entry is for an aggregate function, there should be a matching row in pg_aggregate.
Table 52.39. pg_proc
Columns
For compiled functions, both built-in and dynamically loaded, prosrc contains the function's C-language name (link symbol). For all other currently-known language types, prosrc contains the function's source text. probin is unused except for dynamically-loaded C functions, for which it gives the name of the shared library file containing the function.
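For example, prokind distinguishes the routine kinds (f = normal function, p = procedure, a = aggregate, w = window function):

SELECT proname, prokind FROM pg_proc WHERE proname IN ('lower', 'sum');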
The catalog pg_constraint stores check, primary key, unique, foreign key, and exclusion constraints on tables. (Column constraints are not treated specially. Every column constraint is equivalent to some table constraint.) Not-null constraints are represented in the pg_attribute catalog, not here.
User-defined constraint triggers (created with CREATE CONSTRAINT TRIGGER) also give rise to an entry in this table.
Check constraints on domains are stored here, too.
Table 51.13. pg_constraint
Columns
In the case of an exclusion constraint, conkey is only useful for constraint elements that are simple column references. For other cases, a zero appears in conkey and the associated index must be consulted to discover the expression that is constrained. (conkey thus has the same contents as pg_index.indkey for the index.)
consrc is not updated when referenced objects change; for example, it won't track renaming of columns. Rather than relying on this field, it's best to use pg_get_constraintdef() to extract the definition of a check constraint.
pg_class.relchecks needs to agree with the number of check-constraint entries found in this table for each relation.
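Following that advice, a constraint's definition is best obtained like this (the table name is hypothetical):

SELECT conname, pg_get_constraintdef(oid)
FROM pg_constraint
WHERE conrelid = 'my_table'::regclass;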
The view pg_roles provides access to information about database roles. This is simply a publicly readable view of pg_authid that blanks out the password field.
This view explicitly exposes the OID column of the underlying table, since that is needed to do joins to other catalogs.
Table 51.82. pg_roles
Columns
The catalog pg_policy stores row level security policies for tables. A policy includes the kind of command that it applies to (possibly all commands), the roles that it applies to, the expression to be added as a security-barrier qualification to queries that include the table, and the expression to be added as a WITH CHECK option for queries that attempt to add new records to the table.
Table 51.38. pg_policy
Columns
Policies stored in pg_policy are applied only when pg_class.relrowsecurity is set for their table.
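A sketch of how rows get into pg_policy (the table, column, and policy names are illustrative); note that ENABLE ROW LEVEL SECURITY is what sets relrowsecurity:

ALTER TABLE accounts ENABLE ROW LEVEL SECURITY;
CREATE POLICY own_rows ON accounts
    USING (owner = current_user);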
The catalog pg_statistic stores statistical data about the contents of the database. Entries are created by ANALYZE and subsequently used by the query planner. Note that all the statistical data is inherently approximate, even assuming that it is up-to-date.
Normally there is one entry, with stainherit = false, for each table column that has been analyzed. If the table has inheritance children, a second entry with stainherit = true is also created. This row represents the column's statistics over the inheritance tree, i.e., statistics for the data you'd see with SELECT column FROM table*, whereas the stainherit = false row represents the results of SELECT column FROM ONLY table.
pg_statistic also stores statistical data about the values of index expressions. These are described as if they were actual data columns; in particular, starelid references the index. No entry is made for an ordinary non-expression index column, however, since it would be redundant with the entry for the underlying table column. Currently, entries for index expressions always have stainherit = false.
Since different kinds of statistics might be appropriate for different kinds of data, pg_statistic is designed not to assume very much about what sort of statistics it stores. Only extremely general statistics (such as nullness) are given dedicated columns in pg_statistic. Everything else is stored in “slots”, which are groups of associated columns whose content is identified by a code number in one of the slot's columns. For more information see src/include/catalog/pg_statistic.h.
pg_statistic should not be readable by the public, since even statistical information about a table's contents might be considered sensitive. (Example: minimum and maximum values of a salary column might be quite interesting.) pg_stats is a publicly readable view on pg_statistic that only exposes information about those tables that are readable by the current user.
Table 51.50. pg_statistic
Columns
The catalog pg_tablespace stores information about the available tablespaces. Tables can be placed in particular tablespaces to aid administration of disk layout.
Unlike most system catalogs, pg_tablespace is shared across all databases of a cluster: there is only one copy of pg_tablespace per cluster, not one per database.
Table 51.54. pg_tablespace
Columns
The view pg_settings provides access to run-time parameters of the server. It is essentially an alternative interface to the SHOW and SET commands. It also provides access to some facts that are not available from SHOW, such as minimum and maximum values.
pg_settings Columns
There are several possible values of context. In order of decreasing difficulty of changing the setting, they are:
internal
These settings cannot be changed directly; they reflect internally determined values. Some of them may be adjustable by rebuilding the server with different configuration options, or by changing options supplied to initdb.
postmaster
These settings can only be applied when the server starts, so any change requires restarting the server. Values for these settings are typically stored in the postgresql.conf file, or passed on the command line when starting the server. Of course, settings with any of the lower context types can also be set at server start time.
sighup
Changes to these settings can be made in postgresql.conf without restarting the server. Send a SIGHUP signal to the postmaster to cause it to re-read postgresql.conf and apply the changes. The postmaster also forwards the SIGHUP signal to its child processes, so that they all pick up the new value.
superuser-backend
Changes to these settings can be made in postgresql.conf without restarting the server. They can also be set for a particular session in the connection request packet (for example, via libpq's PGOPTIONS environment variable), but only if the connecting user is a superuser. However, these settings never change in a session after it is started. If you change them in postgresql.conf, send a SIGHUP signal to the postmaster to cause it to re-read postgresql.conf. The new values will only affect subsequently-launched sessions.
backend
Changes to these settings can be made in postgresql.conf without restarting the server. They can also be set for a particular session in the connection request packet (for example, via libpq's PGOPTIONS environment variable); any user can make such a change for their session. However, these settings never change in a session after it is started. If you change them in postgresql.conf, send a SIGHUP signal to the postmaster to cause it to re-read postgresql.conf. The new values will only affect subsequently-launched sessions.
superuser
These settings can be set from postgresql.conf, or within a session with the SET command; but only superusers can change them via SET. Changes in postgresql.conf will affect existing sessions only if no session-local value has been established with SET.
user
These settings can be set from postgresql.conf, or within a session with the SET command. Any user is allowed to change their session-local value. Changes in postgresql.conf will affect existing sessions only if no session-local value has been established with SET.
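As a quick sketch, the context of any parameter can be checked directly; for instance, shared_buffers reports postmaster (restart required) while work_mem reports user:

SELECT name, context FROM pg_settings
WHERE name IN ('shared_buffers', 'work_mem');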
The catalog pg_trigger stores triggers on tables and views. See for more information.
Table 51.56. pg_trigger
Columns
Currently, column-specific triggering is supported only for UPDATE events, and so tgattr is relevant only for that event type. tgtype might contain bits for other event types as well, but those are presumed to be table-wide regardless of what is in tgattr.
When tgconstraint is nonzero, tgconstrrelid, tgconstrindid, tgdeferrable, and tginitdeferred are largely redundant with the referenced pg_constraint entry. However, it is possible for a non-deferrable trigger to be associated with a deferrable constraint: foreign key constraints can have some deferrable and some non-deferrable triggers.
pg_class.relhastriggers must be true if a relation has any triggers in this catalog.
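For example, user-defined (non-internal) triggers can be listed like this:

SELECT tgname, tgrelid::regclass
FROM pg_trigger
WHERE NOT tgisinternal;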
The pg_replication_origin_status view contains information about how far replay for a certain origin has progressed. For more on replication origins see .
Table 51.80. pg_replication_origin_status
Columns
The catalog pg_type stores information about data types. Base types and enum types (scalar types) are created with CREATE TYPE, and domains with CREATE DOMAIN. A composite type is automatically created for each table in the database, to represent the row structure of the table. It is also possible to create composite types with CREATE TYPE AS.
Table 51.62. pg_type
Columns
Table 51.63. typcategory
Codes
Source code formatting uses 4 column tab spacing, with tabs preserved (i.e., tabs are not expanded to spaces). Each logical indentation level is one additional tab stop.
Layout rules (brace positioning, etc) follow BSD conventions. In particular, curly braces for the controlled blocks of if, while, switch, etc go on their own lines.
Limit line lengths so that the code is readable in an 80-column window. (This doesn't mean that you must never go past 80 columns. For instance, breaking a long error message string in arbitrary places just to keep the code within 80 columns is probably not a net gain in readability.)
Do not use C++ style comments (// comments). Strict ANSI C compilers do not accept them. For the same reason, do not use C++ extensions such as declaring new variables mid-block.
The preferred style for multi-line comment blocks is
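/*
 * comment text begins here
 * and continues here
 */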
Note that comment blocks that begin in column 1 will be preserved as-is by pgindent, but it will re-flow indented comment blocks as though they were plain text. If you want to preserve the line breaks in an indented block, add dashes like this:
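/*----------
 * comment text begins here
 * and continues here
 *----------
 */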
While submitted patches do not absolutely have to follow these formatting rules, it's a good idea to do so. Your code will get run through pgindent before the next release, so there's no point in making it look nice under some other set of formatting conventions. A good rule of thumb for patches is “make the new code look like the existing code around it”.
The src/tools directory contains sample settings files that can be used with the emacs, xemacs or vim editors to help ensure that they format code according to these conventions.
The text browsing tools more and less can be invoked as:
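more -x4
less -x4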
to make them show tabs appropriately.
This style guide is offered in the hope of maintaining a consistent, user-friendly style throughout all the messages generated by PostgreSQL.
The primary message should be short, factual, and avoid reference to implementation details such as specific function names. “Short” means “should fit on one line under normal conditions”. Use a detail message if needed to keep the primary message short, or if you feel a need to mention implementation details such as the particular system call that failed. Both primary and detail messages should be factual. Use a hint message for suggestions about what to do to fix the problem, especially if the suggestion might not always be applicable.
For example, instead of:
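IpcMemoryCreate: shmget(key=%d, size=%u, 0%o) failed: %m
(plus a long addendum that is basically a hint)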
write:
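Primary:    could not create shared memory segment: %m
Detail:     Failed system call was shmget(key=%d, size=%u, 0%o).
Hint:       the addendum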
Rationale: keeping the primary message short helps keep it to the point, and lets clients lay out screen space on the assumption that one line is enough for error messages. Detail and hint messages can be relegated to a verbose mode, or perhaps a pop-up error-details window. Also, details and hints would normally be suppressed from the server log to save space. Reference to implementation details is best avoided since users aren't expected to know the details.
Don't put any specific assumptions about formatting into the message texts. Expect clients and the server log to wrap lines to fit their own needs. In long messages, newline characters (\n) can be used to indicate suggested paragraph breaks. Don't end a message with a newline. Don't use tabs or other formatting characters. (In error context displays, newlines are automatically added to separate levels of context such as function calls.)
Rationale: Messages are not necessarily displayed on terminal-type displays. In GUI displays or browsers these formatting instructions are at best ignored.
English text should use double quotes when quoting is appropriate. Text in other languages should consistently use one kind of quotes that is consistent with publishing customs and computer output of other programs.
Rationale: The choice of double quotes over single quotes is somewhat arbitrary, but tends to be the preferred use. Some have suggested choosing the kind of quotes depending on the type of object according to SQL conventions (namely, strings single quoted, identifiers double quoted). But this is a language-internal technical issue that many users aren't even familiar with, it won't scale to other kinds of quoted terms, it doesn't translate to other languages, and it's pretty pointless, too.
Use quotes always to delimit file names, user-supplied identifiers, and other variables that might contain words. Do not use them to mark up variables that will not contain words (for example, operator names).
There are functions in the backend that will double-quote their own output at need (for example, format_type_be()
). Do not put additional quotes around the output of such functions.
Rationale: Objects can have names that create ambiguity when embedded in a message. Be consistent about denoting where a plugged-in name starts and ends. But don't clutter messages with unnecessary or duplicate quote marks.
The rules are different for primary error messages and for detail/hint messages:
Primary error messages: Do not capitalize the first letter. Do not end a message with a period. Do not even think about ending a message with an exclamation point.
Detail and hint messages: Use complete sentences, and end each with a period. Capitalize the first word of sentences. Put two spaces after the period if another sentence follows (for English text; might be inappropriate in other languages).
Error context strings: Do not capitalize the first letter and do not end the string with a period. Context strings should normally not be complete sentences.
Rationale: Avoiding punctuation makes it easier for client applications to embed the message into a variety of grammatical contexts. Often, primary messages are not grammatically complete sentences anyway. (And if they're long enough to be more than one sentence, they should be split into primary and detail parts.) However, detail and hint messages are longer and might need to include multiple sentences. For consistency, they should follow complete-sentence style even when there's only one sentence.
Use lower case for message wording, including the first letter of a primary error message. Use upper case for SQL commands and key words if they appear in the message.
Rationale: It's easier to make everything look more consistent this way, since some messages are complete sentences and some not.
Use the active voice. Use complete sentences when there is an acting subject (“A could not do B”). Use telegram style without subject if the subject would be the program itself; do not use “I” for the program.
Rationale: The program is not human. Don't pretend otherwise.
Use past tense if an attempt to do something failed, but could perhaps succeed next time (perhaps after fixing some problem). Use present tense if the failure is certainly permanent.
There is a nontrivial semantic difference between sentences of the form:
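could not open file "%s": %m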
and:
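cannot open file "%s"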
The first one means that the attempt to open the file failed. The message should give a reason, such as “disk full” or “file doesn't exist”. The past tense is appropriate because next time the disk might not be full anymore or the file in question might exist.
The second form indicates that the functionality of opening the named file does not exist at all in the program, or that it's conceptually impossible. The present tense is appropriate because the condition will persist indefinitely.
Rationale: Granted, the average user will not be able to draw great conclusions merely from the tense of the message, but since the language provides us with a grammar we should use it correctly.
When citing the name of an object, state what kind of object it is.
Rationale: Otherwise no one will know what “foo.bar.baz” refers to.
Square brackets are only to be used (1) in command synopses to denote optional arguments, or (2) to denote an array subscript.
Rationale: Anything else does not correspond to widely-known customary usage and will confuse people.
When a message includes text that is generated elsewhere, embed it in this style:
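could not open file %s: %m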
Rationale: It would be difficult to account for all possible error codes to paste this into a single smooth sentence, so some sort of punctuation is needed. Putting the embedded text in parentheses has also been suggested, but it's unnatural if the embedded text is likely to be the most important part of the message, as is often the case.
Messages should always state the reason why an error occurred. For example:
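BAD:    could not open file %s
BETTER: could not open file %s (I/O failure)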
If no reason is known you better fix the code.
Don't include the name of the reporting routine in the error text. We have other mechanisms for finding that out when needed, and for most users it's not helpful information. If the error text doesn't make as much sense without the function name, reword it.
Avoid mentioning called function names, either; instead say what the code was trying to do:
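BAD:    open() failed: %m
BETTER: could not open file %s: %m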
If it really seems necessary, mention the system call in the detail message. (In some cases, providing the actual values passed to the system call might be appropriate information for the detail message.)
Rationale: Users don't know what all those functions do.
Unable. “Unable” is nearly the passive voice. Better use “cannot” or “could not”, as appropriate.
Bad. Error messages like “bad result” are really hard to interpret intelligently. It's better to write why the result is “bad”, e.g., “invalid format”.
Illegal. “Illegal” stands for a violation of the law, the rest is “invalid”. Better yet, say why it's invalid.
Unknown. Try to avoid “unknown”. Consider “error: unknown response”. If you don't know what the response is, how do you know it's erroneous? “Unrecognized” is often a better choice. Also, be sure to include the value being complained of.
Find vs. Exists. If the program uses a nontrivial algorithm to locate a resource (e.g., a path search) and that algorithm fails, it is fair to say that the program couldn't “find” the resource. If, on the other hand, the expected location of the resource is known but the program cannot access it there then say that the resource doesn't “exist”. Using “find” in this case sounds weak and confuses the issue.
May vs. Can vs. Might. “May” suggests permission (e.g., "You may borrow my rake."), and has little use in documentation or error messages. “Can” suggests ability (e.g., "I can lift that log."), and “might” suggests possibility (e.g., "It might rain today."). Using the proper word clarifies meaning and assists translation.
Contractions. Avoid contractions, like “can't”; use “cannot” instead.
Spell out words in full. For instance, avoid:
spec
stats
parens
auth
xact
Rationale: This will improve consistency.
The view pg_stats provides access to the information stored in the pg_statistic catalog. This view allows access only to rows of pg_statistic that correspond to tables the user has permission to read, and therefore it is safe to allow public read access to this view.
pg_stats is also designed to present the information in a more readable format than the underlying catalog, at the cost that its schema must be extended whenever new slot types are defined for pg_statistic.
Table 51.88. pg_stats
Columns
The maximum number of entries in the array fields can be controlled on a column-by-column basis using the ALTER TABLE SET STATISTICS command, or globally by setting the default_statistics_target run-time parameter.
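For instance (hypothetical table and column names), to raise the statistics target for one column:

ALTER TABLE my_table ALTER COLUMN my_col SET STATISTICS 500;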
GiST stands for Generalized Search Tree. It is a balanced, tree-structured access method, that acts as a base template in which to implement arbitrary indexing schemes. B-trees, R-trees and many other indexing schemes can be implemented in GiST.
One advantage of GiST is that it allows the development of custom data types with the appropriate access methods, by an expert in the domain of the data type, rather than a database expert.
Some of the information here is derived from the University of California at Berkeley's GiST Indexing Project and Marcel Kornacker's thesis, . The GiST implementation in PostgreSQL is primarily maintained by Teodor Sigaev and Oleg Bartunov, and there is more information on their web site.
Code in PostgreSQL should only rely on language features available in the C89 standard. That means a conforming C89 compiler has to be able to compile postgres, at least aside from a few platform dependent pieces. Features from later revision of the C standard or compiler specific features can be used, if a fallback is provided.
For example static inline and _StaticAssert() are currently used, even though they are from newer revisions of the C standard. If not available we respectively fall back to defining the functions without inline, and to using a C89 compatible replacement that performs the same checks, but emits rather cryptic messages.
Both, macros with arguments and static inline functions, may be used. The latter are preferable if there are multiple-evaluation hazards when written as a macro, as e.g. the case with
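#define Max(x, y)       ((x) > (y) ? (x) : (y))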
or when the macro would be very long. In other cases it's only possible to use macros, or at least easier. For example because expressions of various types need to be passed to the macro.
When the definition of an inline function references symbols (i.e. variables, functions) that are only available as part of the backend, the function may not be visible when included from frontend code.
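A sketch of the pattern, using the backend's MemoryContextSwitchTo() as the example:

#ifndef FRONTEND
static inline MemoryContext
MemoryContextSwitchTo(MemoryContext context)
{
    MemoryContext old = CurrentMemoryContext;

    CurrentMemoryContext = context;
    return old;
}
#endif   /* FRONTEND */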
In this example CurrentMemoryContext, which is only available in the backend, is referenced and the function thus hidden with a #ifndef FRONTEND. This rule exists because some compilers emit references to symbols contained in inline functions even if the function is not used.
To be suitable to run inside a signal handler code has to be written very carefully. The fundamental problem is that, unless blocked, a signal handler can interrupt code at any time. If code inside the signal handler uses the same state as code outside chaos may ensue. As an example consider what happens if a signal handler tries to acquire a lock that's already held in the interrupted code.
Barring special arrangements code in signal handlers may only call async-signal safe functions (as defined in POSIX) and access variables of type volatile sig_atomic_t. A few functions in postgres are also deemed signal safe, importantly SetLatch().
In most cases signal handlers should do nothing more than note that a signal has arrived, and wake up code running outside of the handler using a latch. An example of such a handler is the following:
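static void
handle_sighup(SIGNAL_ARGS)
{
    int         save_errno = errno;

    got_SIGHUP = true;
    SetLatch(MyLatch);

    errno = save_errno;
}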
errno is saved and restored because SetLatch() might change it. If that were not done interrupted code that's currently inspecting errno might see the wrong value.
For clarity, it is preferred to explicitly dereference a function pointer when calling the pointed-to function if the pointer is a simple variable, for example:
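(*emit_log_hook) (edata);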
(even though emit_log_hook(edata) would also work). When the function pointer is part of a structure, then the extra punctuation can and usually should be omitted, for example:
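paramInfo->paramFetch(paramInfo, paramId);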
Building large GiST indexes by simply inserting all the tuples tends to be slow, because if the index tuples are scattered across the index and the index is large enough to not fit in cache, the insertions need to perform a lot of random I/O. Beginning in version 9.2, PostgreSQL supports a more efficient method to build GiST indexes based on buffering, which can dramatically reduce the number of random I/Os needed for non-ordered data sets. For well-ordered data sets the benefit is smaller or non-existent, because only a small number of pages receive new tuples at a time, and those pages fit in cache even if the index as whole does not.
However, buffering index build needs to call the penalty function more often, which consumes some extra CPU resources. Also, the buffers used in the buffering build need temporary disk space, up to the size of the resulting index. Buffering can also influence the quality of the resulting index, in both positive and negative directions. That influence depends on various factors, like the distribution of the input data and the operator class implementation.
By default, a GiST index build switches to the buffering method when the index size reaches effective_cache_size. It can be manually turned on or off by the buffering parameter to the CREATE INDEX command. The default behavior is good for most cases, but turning buffering off might speed up the build somewhat if the input data is ordered.
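For example, to force the non-buffered build for well-ordered input (the index, table, and column names are illustrative):

CREATE INDEX pts_idx ON pts USING gist (p) WITH (buffering = off);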
This section describes the logical replication protocol, which is the message flow started by the START_REPLICATION SLOT slot_name LOGICAL replication command.
The logical streaming replication protocol builds on the primitives of the physical streaming replication protocol.
The logical replication START_REPLICATION command accepts following parameters:

proto_version
Protocol version. Currently only version 1 is supported.

publication_names
Comma separated list of publication names for which to subscribe (receive changes). The individual publication names are treated as standard objects names and can be quoted the same as needed.
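A sketch of a complete command as issued on a replication connection (the slot and publication names are illustrative):

START_REPLICATION SLOT sub1 LOGICAL 0/0 (proto_version '1', publication_names '"mypub"')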
The individual protocol messages are discussed in the following subsections. Individual messages are described in .
All top-level protocol messages begin with a message type byte. While represented in code as a character, this is a signed byte with no associated encoding.
Since the streaming replication protocol supplies a message length there is no need for top-level protocol messages to embed a length in their header.
With the exception of the START_REPLICATION command and the replay progress messages, all information flows only from the backend to the frontend.
The logical replication protocol sends individual transactions one by one. This means that all messages between a pair of Begin and Commit messages belong to the same transaction.
Every sent transaction contains zero or more DML messages (Insert, Update, Delete). In case of a cascaded setup it can also contain Origin messages. The origin message indicates that the transaction originated on a different replication node. Since a replication node in the scope of logical replication protocol can be pretty much anything, the only identifier is the origin name. It's the downstream's responsibility to handle this as needed (if needed). The Origin message is always sent before any DML messages in the transaction.
Every DML message contains an arbitrary relation ID, which can be mapped to an ID in the Relation messages. The Relation messages describe the schema of the given relation. The Relation message is sent for a given relation either because it is the first time we send a DML message for the given relation in the current session, or because the relation definition has changed since the last Relation message was sent for it. The protocol assumes that the client is capable of caching the metadata for as many relations as needed.
All calls to functions that are written in a language other than the current “version 1” interface for compiled languages (this includes functions in user-defined procedural languages and functions written in SQL) go through a call handler function for the specific language. It is the responsibility of the call handler to execute the function in a meaningful way, such as by interpreting the supplied source text. This chapter outlines how a new procedural language's call handler can be written.
The call handler for a procedural language is a “normal” function that must be written in a compiled language such as C, using the version-1 interface, and registered with PostgreSQL as taking no arguments and returning the type language_handler. This special pseudo-type identifies the function as a call handler and prevents it from being called directly in SQL commands. For more details on C language calling conventions and dynamic loading, see .
The call handler is called in the same way as any other function: It receives a pointer to a FunctionCallInfoData struct containing argument values and information about the called function, and it is expected to return a Datum result (and possibly set the isnull field of the FunctionCallInfoData structure, if it wishes to return an SQL null result). The difference between a call handler and an ordinary callee function is that the flinfo->fn_oid field of the FunctionCallInfoData structure will contain the OID of the actual function to be called, not of the call handler itself. The call handler must use this field to determine which function to execute. Also, the passed argument list has been set up according to the declaration of the target function, not of the call handler.
It's up to the call handler to fetch the entry of the function from the pg_proc system catalog and to analyze the argument and return types of the called function. The AS clause from the CREATE FUNCTION command for the function will be found in the prosrc column of the pg_proc row. This is commonly source text in the procedural language, but in theory it could be something else, such as a path name to a file, or anything else that tells the call handler what to do in detail.
Often, the same function is called many times per SQL statement. A call handler can avoid repeated lookups of information about the called function by using the flinfo->fn_extra field. This will initially be NULL, but can be set by the call handler to point at information about the called function. On subsequent calls, if flinfo->fn_extra is already non-NULL then it can be used and the information lookup step skipped. The call handler must make sure that flinfo->fn_extra is made to point at memory that will live at least until the end of the current query, since an FmgrInfo data structure could be kept that long. One way to do this is to allocate the extra data in the memory context specified by flinfo->fn_mcxt; such data will normally have the same lifespan as the FmgrInfo itself. But the handler could also choose to use a longer-lived memory context so that it can cache function definition information across queries.
When a procedural-language function is invoked as a trigger, no arguments are passed in the usual way, but the FunctionCallInfoData's context field points at a TriggerData structure, rather than being NULL as it is in a plain function call. A language handler should provide mechanisms for procedural-language functions to get at the trigger information.
This is a template for a procedural-language handler written in C:
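#include "postgres.h"
#include "executor/spi.h"
#include "commands/trigger.h"
#include "fmgr.h"
#include "access/heapam.h"
#include "utils/syscache.h"
#include "catalog/pg_proc.h"
#include "catalog/pg_type.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(plsample_call_handler);

Datum
plsample_call_handler(PG_FUNCTION_ARGS)
{
    Datum          retval;

    if (CALLED_AS_TRIGGER(fcinfo))
    {
        /*
         * Called as a trigger function
         */
        TriggerData    *trigdata = (TriggerData *) fcinfo->context;

        retval = ...
    }
    else
    {
        /*
         * Called as a function
         */

        retval = ...
    }

    return retval;
}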
Only a few thousand lines of code have to be added instead of the dots to complete the call handler.
If a validator is provided by a procedural language, it must be declared as a function taking a single parameter of type oid. The validator's result is ignored, so it is customarily declared to return void. The validator will be called at the end of a CREATE FUNCTION command that has created or updated a function written in the procedural language. The passed-in OID is the OID of the function's pg_proc row. The validator must fetch this row in the usual way, and do whatever checking is appropriate. First, call CheckFunctionValidatorAccess() to diagnose explicit calls to the validator that the user could not achieve through CREATE FUNCTION. Typical checks then include verifying that the function's argument and result types are supported by the language, and that the function's body is syntactically correct in the language. If the validator finds the function to be okay, it should just return. If it finds an error, it should report that via the normal ereport() error reporting mechanism. Throwing an error will force a transaction rollback and thus prevent the incorrect function definition from being committed.
If an inline handler is provided by a procedural language, it must be declared as a function taking a single parameter of type internal. The inline handler's result is ignored, so it is customarily declared to return void. The inline handler will be called when a DO statement is executed specifying the procedural language. The parameter actually passed is a pointer to an InlineCodeBlock struct, which contains information about the DO statement's parameters, in particular the text of the anonymous code block to be executed. The inline handler should execute this code and return.
The core PostgreSQL distribution includes the GiST operator classes shown in . (Some of the optional modules described in provide additional GiST operator classes.)
For historical reasons, the inet_ops operator class is not the default class for types inet and cidr. To use it, mention the class name in CREATE INDEX, for example
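CREATE INDEX ON my_table USING GIST (my_inet_column inet_ops);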
Traditionally, implementing a new index access method meant a lot of difficult work. It was necessary to understand the inner workings of the database, such as the lock manager and Write-Ahead Log. The GiST interface has a high level of abstraction, requiring the access method implementer only to implement the semantics of the data type being accessed. The GiST layer itself takes care of concurrency, logging and searching the tree structure.
This extensibility should not be confused with the extensibility of the other standard search trees in terms of the data they can handle. For example, PostgreSQL supports extensible B-trees and hash indexes. That means that you can use PostgreSQL to build a B-tree or hash over any data type you want. But B-trees only support range predicates (<, =, >), and hash indexes only support equality queries.
So if you index, say, an image collection with a PostgreSQL B-tree, you can only issue queries such as “is imagex equal to imagey”, “is imagex less than imagey” and “is imagex greater than imagey”. Depending on how you define “equals”, “less than” and “greater than” in this context, this could be useful. However, by using a GiST based index, you could create ways to ask domain-specific questions, perhaps “find all images of horses” or “find all over-exposed images”.
All it takes to get a GiST access method up and running is to implement several user-defined methods, which define the behavior of keys in the tree. Of course these methods have to be pretty fancy to support fancy queries, but for all the standard queries (B-trees, R-trees, etc.) they're relatively straightforward. In short, GiST combines extensibility along with generality, code reuse, and a clean interface.
There are seven methods that an index operator class for GiST must provide, and two that are optional. Correctness of the index is ensured by proper implementation of the same, consistent and union methods, while efficiency (size and speed) of the index will depend on the penalty and picksplit methods. The remaining two basic methods are compress and decompress, which allow an index to have internal tree data of a different type than the data it indexes. The leaves are to be of the indexed data type, while the other tree nodes can be of any C struct (but you still have to follow PostgreSQL data type rules here, see about varlena for variable sized data). If the tree's internal data type exists at the SQL level, the STORAGE option of the CREATE OPERATOR CLASS command can be used. The optional eighth method is distance, which is needed if the operator class wishes to support ordered scans (nearest-neighbor searches). The optional ninth method fetch is needed if the operator class wishes to support index-only scans.

consistent

Given an index entry p and a query value q, this function determines whether the index entry is “consistent” with the query; that is, could the predicate “indexed_column indexable_operator q” be true for any row represented by the index entry? For a leaf index entry this is equivalent to testing the indexable condition, while for an internal tree node this determines whether it is necessary to scan the subtree of the index represented by the tree node. When the result is true, a recheck flag must also be returned. This indicates whether the predicate is certainly true or only possibly true. If recheck = false then the index has tested the predicate condition exactly, whereas if recheck = true the row is only a candidate match. In that case the system will automatically evaluate the indexable_operator against the actual row value to see if it is really a match. This convention allows GiST to support both lossless and lossy index structures.
The SQL declaration of the function must look like this:
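CREATE OR REPLACE FUNCTION my_consistent(internal, data_type, smallint, oid, internal)
RETURNS bool
AS 'MODULE_PATHNAME'
LANGUAGE C STRICT;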
And the matching code in the C module could then follow this skeleton:
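PG_FUNCTION_INFO_V1(my_consistent);

Datum
my_consistent(PG_FUNCTION_ARGS)
{
    GISTENTRY  *entry = (GISTENTRY *) PG_GETARG_POINTER(0);
    data_type  *query = PG_GETARG_DATA_TYPE_P(1);
    StrategyNumber strategy = (StrategyNumber) PG_GETARG_UINT16(2);
    /* Oid subtype = PG_GETARG_OID(3); */
    bool       *recheck = (bool *) PG_GETARG_POINTER(4);
    data_type  *key = DatumGetDataType(entry->key);
    bool        retval;

    /*
     * determine return value as a function of strategy, key and query.
     *
     * Use GIST_LEAF(entry) to know where you're called in the index tree,
     * which comes handy when supporting the = operator for example (you
     * could check for non empty union() in non-leaf nodes and equality in
     * leaf nodes).
     */

    *recheck = true;        /* or false if check is exact */

    PG_RETURN_BOOL(retval);
}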
Here, key is an element in the index and query the value being looked up in the index. The StrategyNumber parameter indicates which operator of your operator class is being applied; it matches one of the operator numbers in the CREATE OPERATOR CLASS command.

Depending on which operators you have included in the class, the data type of query could vary with the operator, since it will be whatever type is on the righthand side of the operator, which might be different from the indexed data type appearing on the lefthand side. (The above code skeleton assumes that only one type is possible; if not, fetching the query argument value would have to depend on the operator.) It is recommended that the SQL declaration of the consistent function use the opclass's indexed data type for the query argument, even though the actual type might be something else depending on the operator.

union

This method consolidates information in the tree. Given a set of entries, this function generates a new index entry that represents all the given entries.
The SQL declaration of the function must look like this:
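CREATE OR REPLACE FUNCTION my_union(internal, internal)
RETURNS storage_type
AS 'MODULE_PATHNAME'
LANGUAGE C STRICT;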
And the matching code in the C module could then follow this skeleton:
As you can see, in this skeleton we're dealing with a data type where union(X, Y, Z) = union(union(X, Y), Z). It's easy enough to support data types where this is not the case, by implementing the proper union algorithm in this GiST support method.

The result of the union function must be a value of the index's storage type, whatever that is (it might or might not be different from the indexed column's type). The union function should return a pointer to newly palloc()'d memory. You can't just return the input value as-is, even if there is no type change.

As shown above, the union function's first internal argument is actually a GistEntryVector pointer. The second argument is a pointer to an integer variable, which can be ignored. (It used to be required that the union function store the size of its result value into that variable, but this is no longer necessary.)

compress

Converts the data item into a format suitable for physical storage in an index page.
The SQL declaration of the function must look like this:
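CREATE OR REPLACE FUNCTION my_compress(internal)
RETURNS internal
AS 'MODULE_PATHNAME'
LANGUAGE C STRICT;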
And the matching code in the C module could then follow this skeleton:
You have to adapt compressed_data_type to the specific type you're converting to in order to compress your leaf nodes, of course.

decompress

The reverse of the compress method. Converts the index representation of the data item into a format that can be manipulated by the other GiST methods in the operator class.
The SQL declaration of the function must look like this:
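CREATE OR REPLACE FUNCTION my_decompress(internal)
RETURNS internal
AS 'MODULE_PATHNAME'
LANGUAGE C STRICT;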
And the matching code in the C module could then follow this skeleton:
The above skeleton is suitable for the case where no decompression is needed.

penalty

Returns a value indicating the “cost” of inserting the new entry into a particular branch of the tree. Items will be inserted down the path of least penalty in the tree. Values returned by penalty should be non-negative. If a negative value is returned, it will be treated as zero.
The SQL declaration of the function must look like this:
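CREATE OR REPLACE FUNCTION my_penalty(internal, internal, internal)
RETURNS internal
AS 'MODULE_PATHNAME'
LANGUAGE C STRICT;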
And the matching code in the C module could then follow this skeleton:
For historical reasons, the penalty function doesn't just return a float result; instead it has to store the value at the location indicated by the third argument. The return value per se is ignored, though it's conventional to pass back the address of that argument.

The penalty function is crucial to good performance of the index. It'll get used at insertion time to determine which branch to follow when choosing where to add the new entry in the tree. At query time, the more balanced the index, the quicker the lookup.

picksplit

When an index page split is necessary, this function decides which entries on the page are to stay on the old page, and which are to move to the new page.
The SQL declaration of the function must look like this:
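CREATE OR REPLACE FUNCTION my_picksplit(internal, internal)
RETURNS internal
AS 'MODULE_PATHNAME'
LANGUAGE C STRICT;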
And the matching code in the C module could then follow this skeleton:
Notice that the picksplit function's result is delivered by modifying the passed-in v structure. The return value per se is ignored, though it's conventional to pass back the address of v.

Like penalty, the picksplit function is crucial to good performance of the index. Designing suitable penalty and picksplit implementations is where the challenge of implementing well-performing GiST indexes lies.

same

Returns true if two index entries are identical, false otherwise. (An “index entry” is a value of the index's storage type, not necessarily the original indexed column's type.)
The SQL declaration of the function must look like this:
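CREATE OR REPLACE FUNCTION my_same(storage_type, storage_type, internal)
RETURNS internal
AS 'MODULE_PATHNAME'
LANGUAGE C STRICT;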
And the matching code in the C module could then follow this skeleton:
For historical reasons, the same function doesn't just return a Boolean result; instead it has to store the flag at the location indicated by the third argument. The return value per se is ignored, though it's conventional to pass back the address of that argument.

distance

Given an index entry p and a query value q, this function determines the index entry's “distance” from the query value. This function must be supplied if the operator class contains any ordering operators. A query using the ordering operator will be implemented by returning index entries with the smallest “distance” values first, so the results must be consistent with the operator's semantics. For a leaf index entry the result just represents the distance to the index entry; for an internal tree node, the result must be the smallest distance that any child entry could have.
The SQL declaration of the function must look like this:
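CREATE OR REPLACE FUNCTION my_distance(internal, data_type, smallint, oid, internal)
RETURNS float8
AS 'MODULE_PATHNAME'
LANGUAGE C STRICT;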
And the matching code in the C module could then follow this skeleton:
The arguments to the distance function are identical to the arguments of the consistent function.

Some approximation is allowed when determining the distance, so long as the result is never greater than the entry's actual distance. Thus, for example, distance to a bounding box is usually sufficient in geometric applications. For an internal tree node, the distance returned must not be greater than the distance to any of the child nodes. If the returned distance is not exact, the function must set *recheck to true. (This is not necessary for internal tree nodes; for them, the calculation is always assumed to be inexact.) In this case the executor will calculate the accurate distance after fetching the tuple from the heap, and reorder the tuples if necessary.

If the distance function returns *recheck = true for any leaf node, the original ordering operator's return type must be float8 or float4, and the distance function's result values must be comparable to those of the original ordering operator, since the executor will sort using both distance function results and recalculated ordering-operator results. Otherwise, the distance function's result values can be any finite float8 values, so long as the relative order of the result values matches the order returned by the ordering operator. (Infinity and minus infinity are used internally to handle cases such as nulls, so it is not recommended that distance functions return these values.)

fetch

Converts the compressed index representation of a data item into the original data type, for index-only scans. The returned data must be an exact, non-lossy copy of the originally indexed value.
The SQL declaration of the function must look like this:
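CREATE OR REPLACE FUNCTION my_fetch(internal)
RETURNS internal
AS 'MODULE_PATHNAME'
LANGUAGE C STRICT;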
The argument is a pointer to a GISTENTRY struct. On entry, its key field contains a non-NULL leaf datum in compressed form. The return value is another GISTENTRY struct, whose key field contains the same datum in its original, uncompressed form. If the opclass's compress function does nothing for leaf entries, the fetch method can return the argument as-is.
The matching code in the C module could then follow this skeleton:
If the compress method is lossy for leaf entries, the operator class cannot support index-only scans, and must not define a fetch function.
All the GiST support methods are normally called in short-lived memory contexts; that is, CurrentMemoryContext will get reset after each tuple is processed. It is therefore not very important to worry about pfree'ing everything you palloc. However, in some cases it's useful for a support method to cache data across repeated calls. To do that, allocate the longer-lived data in fcinfo->flinfo->fn_mcxt, and keep a pointer to it in fcinfo->flinfo->fn_extra. Such data will survive for the life of the index operation (e.g., a single GiST index scan, index build, or index tuple insertion). Be careful to pfree the previous value when replacing a fn_extra value, or the leak will accumulate for the duration of the operation.
Error, warning, and log messages generated within the server code should be created using ereport, or its older cousin elog. The use of this function is complex enough to require some explanation.

There are two required elements for every message: a severity level (ranging from DEBUG to PANIC) and a primary message text. In addition there are optional elements, the most common of which is an error identifier code that follows the SQL spec's SQLSTATE conventions. ereport itself is just a shell function, that exists mainly for the syntactic convenience of making message generation look like a function call in the C source code. The only parameter accepted directly by ereport is the severity level. The primary message text and any optional message elements are generated by calling auxiliary functions, such as errmsg, within the ereport call.
A typical call to ereport might look like this:
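ereport(ERROR,
        (errcode(ERRCODE_DIVISION_BY_ZERO),
         errmsg("division by zero")));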
This specifies error severity level ERROR (a run-of-the-mill error). The errcode call specifies the SQLSTATE error code using a macro defined in src/include/utils/errcodes.h. The errmsg call provides the primary message text. Notice the extra set of parentheses surrounding the auxiliary function calls; these are annoying but syntactically necessary.
Here is a more complex example:
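ereport(ERROR,
        (errcode(ERRCODE_AMBIGUOUS_FUNCTION),
         errmsg("function %s is not unique",
                func_signature_string(funcname, nargs,
                                      NIL, actual_arg_types)),
         errhint("Could not choose a best candidate function. "
                 "You might need to add explicit type casts.")));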
This illustrates the use of format codes to embed run-time values into a message text. Also, an optional “hint” message is provided.
If the severity level is ERROR or higher, ereport aborts the execution of the user-defined function and does not return to the caller. If the severity level is lower than ERROR, ereport returns normally.
The available auxiliary routines for ereport
are:
errcode(sqlerrcode)
specifies the SQLSTATE error identifier code for the condition. If this routine is not called, the error identifier defaults to ERRCODE_INTERNAL_ERROR
when the error severity level is ERROR
or higher, ERRCODE_WARNING
when the error level is WARNING
, otherwise (for NOTICE
and below) ERRCODE_SUCCESSFUL_COMPLETION
. While these defaults are often convenient, always think whether they are appropriate before omitting the errcode()
call.
errmsg(const char *msg, ...)
specifies the primary error message text, and possibly run-time values to insert into it. Insertions are specified by sprintf
-style format codes. In addition to the standard format codes accepted by sprintf
, the format code %m
can be used to insert the error message returned by strerror
for the current value of errno
. %m
does not require any corresponding entry in the parameter list for errmsg
. Note that the message string will be run through gettext
for possible localization before format codes are processed.
errmsg_internal(const char *msg, ...)
is the same as errmsg
, except that the message string will not be translated nor included in the internationalization message dictionary. This should be used for “cannot happen” cases that are probably not worth expending translation effort on.
errmsg_plural(const char *fmt_singular, const char *fmt_plural, unsigned long n, ...)
is like errmsg
, but with support for various plural forms of the message. fmt_singular
is the English singular format, fmt_plural
is the English plural format, n
is the integer value that determines which plural form is needed, and the remaining arguments are formatted according to the selected format string. For more information see .
errdetail(const char *msg, ...)
supplies an optional “detail” message; this is to be used when there is additional information that seems inappropriate to put in the primary message. The message string is processed in just the same way as for errmsg
.
errdetail_internal(const char *msg, ...)
is the same as errdetail
, except that the message string will not be translated nor included in the internationalization message dictionary. This should be used for detail messages that are not worth expending translation effort on, for instance because they are too technical to be useful to most users.
errdetail_plural(const char *fmt_singular, const char *fmt_plural, unsigned long n, ...)
is like errdetail
, but with support for various plural forms of the message. For more information see .
errdetail_log(const char *msg, ...)
is the same as errdetail
except that this string goes only to the server log, never to the client. If both errdetail
(or one of its equivalents above) and errdetail_log
are used then one string goes to the client and the other to the log. This is useful for error details that are too security-sensitive or too bulky to include in the report sent to the client.
errdetail_log_plural(const char *fmt_singular, const char *fmt_plural, unsigned long n, ...)
is like errdetail_log
, but with support for various plural forms of the message. For more information see .
errhint(const char *msg, ...)
supplies an optional “hint” message; this is to be used when offering suggestions about how to fix the problem, as opposed to factual details about what went wrong. The message string is processed in just the same way as for errmsg
.
errcontext(const char *msg, ...)
is not normally called directly from an ereport
message site; rather it is used in error_context_stack
callback functions to provide information about the context in which an error occurred, such as the current location in a PL function. The message string is processed in just the same way as for errmsg
. Unlike the other auxiliary functions, this can be called more than once per ereport
call; the successive strings thus supplied are concatenated with separating newlines.
errposition(int cursorpos)
specifies the textual location of an error within a query string. Currently it is only useful for errors detected in the lexical and syntactic analysis phases of query processing.
errtable(Relation rel)
specifies a relation whose name and schema name should be included as auxiliary fields in the error report.
errtablecol(Relation rel, int attnum)
specifies a column whose name, table name, and schema name should be included as auxiliary fields in the error report.
errtableconstraint(Relation rel, const char *conname)
specifies a table constraint whose name, table name, and schema name should be included as auxiliary fields in the error report. Indexes should be considered to be constraints for this purpose, whether or not they have an associated pg_constraint
entry. Be careful to pass the underlying heap relation, not the index itself, as rel
.
errdatatype(Oid datatypeOid)
specifies a data type whose name and schema name should be included as auxiliary fields in the error report.
errdomainconstraint(Oid datatypeOid, const char *conname)
specifies a domain constraint whose name, domain name, and schema name should be included as auxiliary fields in the error report.
errcode_for_file_access()
is a convenience function that selects an appropriate SQLSTATE error identifier for a failure in a file-access-related system call. It uses the saved errno
to determine which error code to generate. Usually this should be used in combination with %m
in the primary error message text.
errcode_for_socket_access()
is a convenience function that selects an appropriate SQLSTATE error identifier for a failure in a socket-related system call.
errhidestmt(bool hide_stmt)
can be called to specify suppression of the STATEMENT:
portion of a message in the postmaster log. Generally this is appropriate if the message text includes the current statement already.
errhidecontext(bool hide_ctx)
can be called to specify suppression of the CONTEXT:
portion of a message in the postmaster log. This should only be used for verbose debugging messages where the repeated inclusion of context would bloat the log volume too much.
At most one of the functions errtable
, errtablecol
, errtableconstraint
, errdatatype
, or errdomainconstraint
should be used in an ereport
call. These functions exist to allow applications to extract the name of a database object associated with the error condition without having to examine the potentially-localized error message text. These functions should be used in error reports for which it's likely that applications would wish to have automatic error handling. As of PostgreSQL 9.3, complete coverage exists only for errors in SQLSTATE class 23 (integrity constraint violation), but this is likely to be expanded in future.
There is an older function elog
that is still heavily used. An elog
call:
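elog(level, "format string", ...);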
is exactly equivalent to:
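ereport(level, (errmsg_internal("format string", ...)));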
Notice that the SQLSTATE error code is always defaulted, and the message string is not subject to translation. Therefore, elog
should be used only for internal errors and low-level debug logging. Any message that is likely to be of interest to ordinary users should go through ereport
. Nonetheless, there are enough internal “cannot happen” error checks in the system that elog
is still widely used; it is preferred for those messages for its notational simplicity.
The core PostgreSQL distribution includes the SP-GiST operator classes shown in .
Of the two operator classes for type point
, quad_point_ops
is the default. kd_point_ops
supports the same operators but uses a different index data structure which may offer better performance in some applications.
This section covers implementation details and other tricks that are useful for implementers of SP-GiST operator classes to know.
Individual leaf tuples and inner tuples must fit on a single index page (8kB by default). Therefore, when indexing values of variable-length data types, long values can only be supported by methods such as radix trees, in which each level of the tree includes a prefix that is short enough to fit on a page, and the final leaf level includes a suffix also short enough to fit on a page. The operator class should set longValuesOK
to TRUE only if it is prepared to arrange for this to happen. Otherwise, the SP-GiST core will reject any request to index a value that is too large to fit on an index page.
Likewise, it is the operator class's responsibility that inner tuples do not grow too large to fit on an index page; this limits the number of child nodes that can be used in one inner tuple, as well as the maximum size of a prefix value.
Another limitation is that when an inner tuple's node points to a set of leaf tuples, those tuples must all be in the same index page. (This is a design decision to reduce seeking and save space in the links that chain such tuples together.) If the set of leaf tuples grows too large for a page, a split is performed and an intermediate inner tuple is inserted. For this to fix the problem, the new inner tuple must divide the set of leaf values into more than one node group. If the operator class's picksplit
function fails to do that, the SP-GiST core resorts to extraordinary measures described in .
Some tree algorithms use a fixed set of nodes for each inner tuple; for example, in a quad-tree there are always exactly four nodes corresponding to the four quadrants around the inner tuple's centroid point. In such a case the code typically works with the nodes by number, and there is no need for explicit node labels. To suppress node labels (and thereby save some space), the picksplit
function can return NULL for the nodeLabels
array, and likewise the choose
function can return NULL for the prefixNodeLabels
array during a spgSplitTuple
action. This will in turn result in nodeLabels
being NULL during subsequent calls to choose
and inner_consistent
. In principle, node labels could be used for some inner tuples and omitted for others in the same index.
When working with an inner tuple having unlabeled nodes, it is an error for choose
to return spgAddNode
, since the set of nodes is supposed to be fixed in such cases.
The SP-GiST core can override the results of the operator class's picksplit
function when picksplit
fails to divide the supplied leaf values into at least two node categories. When this happens, the new inner tuple is created with multiple nodes that each have the same label (if any) that picksplit
gave to the one node it did use, and the leaf values are divided at random among these equivalent nodes. The allTheSame
flag is set on the inner tuple to warn the choose
and inner_consistent
functions that the tuple does not have the node set that they might otherwise expect.
When dealing with an allTheSame
tuple, a choose
result of spgMatchNode
is interpreted to mean that the new value can be assigned to any of the equivalent nodes; the core code will ignore the supplied nodeN
value and descend into one of the nodes at random (so as to keep the tree balanced). It is an error for choose
to return spgAddNode
, since that would make the nodes not all equivalent; the spgSplitTuple
action must be used if the value to be inserted doesn't match the existing nodes.
When dealing with an allTheSame
tuple, the inner_consistent
function should return either all or none of the nodes as targets for continuing the index search, since they are all equivalent. This may or may not require any special-case code, depending on how much the inner_consistent
function normally assumes about the meaning of the nodes.
The PostgreSQL source distribution includes several examples of index operator classes for SP-GiST, as described in . Look into src/backend/access/spgist/
and src/backend/utils/adt/
to see the code.
Create vs. insert
Insertion into a GIN index can be slow, because many index keys may be inserted for each item. So, for bulk insertions into a table, it is advisable to drop the GIN index and recreate it once the bulk insertion is done.

As of PostgreSQL 8.4, this advice is less necessary, since delayed indexing is used (see for details). But for very large updates it may still be best to drop and recreate the index.

Build time for a GIN index is very sensitive to the maintenance_work_mem setting; it doesn't pay to skimp on work memory during index creation.

During a series of insertions into an existing GIN index that has fastupdate enabled, the system will clean up the pending-entry list whenever the list grows larger than gin_pending_list_limit. To avoid fluctuations in observed response time, it's desirable to have pending-list cleanup occur in the background (i.e., via autovacuum). Foreground cleanup operations can be reduced by increasing gin_pending_list_limit or making autovacuum more aggressive. However, enlarging the threshold of the cleanup operation means that if a foreground cleanup does occur, it will take even longer.

gin_pending_list_limit can be overridden for individual GIN indexes by changing storage parameters, which allows each GIN index to have its own cleanup threshold. For example, the threshold can be increased only for a GIN index that is heavily updated, and decreased otherwise.
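For example (the index name is hypothetical; the value is in kilobytes):

ALTER INDEX my_heavily_updated_gin_idx SET (gin_pending_list_limit = 8192);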
The primary goal of developing GIN indexes was to create support for highly scalable full-text search in PostgreSQL, and full-text searches often return very large result sets. Moreover, this frequently happens when the query contains very common words, so that the large result set is not even useful. Since reading many tuples from disk and sorting them could take a lot of time, this is unacceptable in production. (Note that the index search itself is very fast.)

To facilitate controlled execution of such queries, GIN has a configurable soft upper limit on the number of rows returned: the gin_fuzzy_search_limit configuration parameter. It is set to 0 (meaning no limit) by default. If a non-zero limit is set, then the returned set is a subset of the whole result set, chosen at random.

"Soft" means that the actual number of returned results could differ somewhat from the specified limit, depending on the query and the quality of the system's random number generator.

From experience, values in the thousands (e.g., 5000 - 20000) work well.
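For example, to cap result sets at roughly ten thousand rows for the current session (the value is illustrative):

SET gin_fuzzy_search_limit = 10000;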
GIN stands for Generalized Inverted Index. GIN is designed for handling cases where the items to be indexed are composite values, and the queries to be handled by the index need to search for element values that appear within the composite items. For example, the items could be documents, and the queries could be searches for documents containing specific words.

We use the word item to refer to a composite value that is to be indexed, and the word key to refer to an element value. GIN always stores and searches for keys, not item values per se.

A GIN index stores a set of (key, posting list) pairs, where a posting list is a set of row IDs in which the key occurs. The same row ID can appear in multiple posting lists, since an item can contain more than one key. Each key value is stored only once, so a GIN index is very compact for cases where the same key appears many times.

GIN is generalized in the sense that the GIN access method code does not need to know the specific operations that it accelerates. Instead, it uses custom strategies defined for particular data types. The strategy defines how keys are extracted from indexed items and query conditions, and how to determine whether a row that contains some of the key values in a query actually satisfies the query.

One advantage of GIN is that it allows the development of custom data types with the appropriate access methods, by an expert in the domain of the data type, rather than a database expert. This is much the same advantage as using GiST.

The GIN implementation in PostgreSQL is primarily maintained by Teodor Sigaev and Oleg Bartunov. There is more information about GIN on their website.
SP-GiST offers an interface with a high level of abstraction, requiring the access method developer to implement only methods specific to a given data type. The SP-GiST core is responsible for efficient disk mapping and searching the tree structure. It also takes care of concurrency and logging considerations.
Leaf tuples of an SP-GiST tree contain values of the same data type as the indexed column. Leaf tuples at the root level will always contain the original indexed data value, but leaf tuples at lower levels might contain only a compressed representation, such as a suffix. In that case the operator class support functions must be able to reconstruct the original value using information accumulated from the inner tuples that are passed through to reach the leaf level.
Inner tuples are more complex, since they are branching points in the search tree. Each inner tuple contains a set of one or more nodes, which represent groups of similar leaf values. A node contains a downlink that leads either to another, lower-level inner tuple, or to a short list of leaf tuples that all lie on the same index page. Each node normally has a label that describes it; for example, in a radix tree the node label could be the next character of the string value. (Alternatively, an operator class can omit the node labels, if it works with a fixed set of nodes for all inner tuples; see .) Optionally, an inner tuple can have a prefix value that describes all its members. In a radix tree this could be the common prefix of the represented strings. The prefix value is not necessarily really a prefix, but can be any data needed by the operator class; for example, in a quad-tree it can store the central point that the four quadrants are measured with respect to. A quad-tree inner tuple would then also contain four nodes corresponding to the quadrants around this central point.
Some tree algorithms require knowledge of level (or depth) of the current tuple, so the SP-GiST core provides the possibility for operator classes to manage level counting while descending the tree. There is also support for incrementally reconstructing the represented value when that is needed, and for passing down additional data (called traverse values) during a tree descent.
The SP-GiST core code takes care of null entries. Although SP-GiST indexes do store entries for nulls in indexed columns, this is hidden from the index operator class code: no null index entries or search conditions will ever be passed to the operator class methods. (It is assumed that SP-GiST operators are strict and so cannot succeed for null values.) Null values are therefore not discussed further here.
There are five user-defined methods that an index operator class for SP-GiST must provide. All five follow the convention of accepting two internal
arguments, the first of which is a pointer to a C struct containing input values for the support method, while the second argument is a pointer to a C struct where output values must be placed. Four of the methods just return void
, since all their results appear in the output struct; but leaf_consistent
additionally returns a boolean
result. The methods must not modify any fields of their input structs. In all cases, the output struct is initialized to zeroes before calling the user-defined method.
The five user-defined methods are:

config
Returns static information about the index implementation, including the data type OIDs of the prefix and node label data types.
The SQL declaration of the function must look like this:
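(A reconstruction; the original block was lost. my_config is a placeholder name, and the AS and LANGUAGE clauses follow the usual convention for C support functions.)

CREATE FUNCTION my_config(internal, internal)
RETURNS void
AS 'MODULE_PATHNAME'
LANGUAGE C IMMUTABLE STRICT;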
The first argument is a pointer to a spgConfigIn
C struct, containing input data for the function. The second argument is a pointer to a spgConfigOut
C struct, which the function must fill with result data.
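For reference, the two structs are declared in src/include/access/spgist.h essentially as follows:

typedef struct spgConfigIn
{
    Oid         attType;        /* Data type to be indexed */
} spgConfigIn;

typedef struct spgConfigOut
{
    Oid         prefixType;     /* Data type of inner-tuple prefixes */
    Oid         labelType;      /* Data type of inner-tuple node labels */
    bool        canReturnData;  /* Opclass can reconstruct original data */
    bool        longValuesOK;   /* Opclass can cope with values > 1 page */
} spgConfigOut;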
attType
is passed in order to support polymorphic index operator classes; for ordinary fixed-data-type operator classes, it will always have the same value and so can be ignored.
Chooses a method for inserting a new value into an inner tuple.
The SQL declaration of the function must look like this:
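(A reconstruction under the same convention as for config; my_choose is a placeholder name.)

CREATE FUNCTION my_choose(internal, internal)
RETURNS void
AS 'MODULE_PATHNAME'
LANGUAGE C IMMUTABLE STRICT;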
The first argument is a pointer to a spgChooseIn
C struct, containing input data for the function. The second argument is a pointer to a spgChooseOut
C struct, which the function must fill with result data.
The choose
function can determine either that the new value matches one of the existing child nodes, or that a new child node must be added, or that the new value is inconsistent with the tuple prefix and so the inner tuple must be split to create a less restrictive prefix.
If the new value matches one of the existing child nodes, set resultType
to spgMatchNode
. Set nodeN
to the index (from zero) of that node in the node array. Set levelAdd
to the increment in level
caused by descending through that node, or leave it as zero if the operator class does not use levels. Set restDatum
to equal datum
if the operator class does not modify datums from one level to the next, or otherwise set it to the modified value to be used as leafDatum
at the next level.
If a new child node must be added, set resultType
to spgAddNode
. Set nodeLabel
to the label to be used for the new node, and set nodeN
to the index (from zero) at which to insert the node in the node array. After the node has been added, the choose
function will be called again with the modified inner tuple; that call should result in an spgMatchNode
result.
If the new value is inconsistent with the tuple prefix, set resultType
to spgSplitTuple
. This action moves all the existing nodes into a new lower-level inner tuple, and replaces the existing inner tuple with a tuple having a single downlink pointing to the new lower-level inner tuple. Set prefixHasPrefix
to indicate whether the new upper tuple should have a prefix, and if so set prefixPrefixDatum
to the prefix value. This new prefix value must be sufficiently less restrictive than the original to accept the new value to be indexed. Set prefixNNodes
to the number of nodes needed in the new tuple, and set prefixNodeLabels
to a palloc'd array holding their labels, or to NULL if node labels are not required. Note that the total size of the new upper tuple must be no more than the total size of the tuple it is replacing; this constrains the lengths of the new prefix and new labels. Set childNodeN
to the index (from zero) of the node that will downlink to the new lower-level inner tuple. Set postfixHasPrefix
to indicate whether the new lower-level inner tuple should have a prefix, and if so set postfixPrefixDatum
to the prefix value. The combination of these two prefixes and the downlink node's label (if any) must have the same meaning as the original prefix, because there is no opportunity to alter the node labels that are moved to the new lower-level tuple, nor to change any child index entries. After the node has been split, the choose
function will be called again with the replacement inner tuple. That call may return an spgAddNode
result, if no suitable node was created by the spgSplitTuple
action. Eventually choose
must return spgMatchNode
to allow the insertion to descend to the next level.

picksplit
Decides how to create a new inner tuple over a set of leaf tuples.
The SQL declaration of the function must look like this:
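(A reconstruction under the same convention; my_picksplit is a placeholder name.)

CREATE FUNCTION my_picksplit(internal, internal)
RETURNS void
AS 'MODULE_PATHNAME'
LANGUAGE C IMMUTABLE STRICT;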
The first argument is a pointer to a spgPickSplitIn
C struct, containing input data for the function. The second argument is a pointer to a spgPickSplitOut
C struct, which the function must fill with result data.
nTuples
is the number of leaf tuples provided. datums
is an array of their datum values. level
is the current level that all the leaf tuples share, which will become the level of the new inner tuple.
Set hasPrefix
to indicate whether the new inner tuple should have a prefix, and if so set prefixDatum
to the prefix value. Set nNodes
to indicate the number of nodes that the new inner tuple will contain, and set nodeLabels
to an array of their label values, or to NULL if node labels are not required. Set mapTuplesToNodes
to an array that gives the index (from zero) of the node that each leaf tuple should be assigned to. Set leafTupleDatums
to an array of the values to be stored in the new leaf tuples (these will be the same as the input datums
if the operator class does not modify datums from one level to the next). Note that the picksplit
function is responsible for palloc'ing the nodeLabels
,mapTuplesToNodes
and leafTupleDatums
arrays.
Returns set of nodes (branches) to follow during tree search.
The SQL declaration of the function must look like this:
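(A reconstruction under the same convention; my_inner_consistent is a placeholder name.)

CREATE FUNCTION my_inner_consistent(internal, internal)
RETURNS void
AS 'MODULE_PATHNAME'
LANGUAGE C IMMUTABLE STRICT;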
The first argument is a pointer to a spgInnerConsistentIn
C struct, containing input data for the function. The second argument is a pointer to a spgInnerConsistentOut
C struct, which the function must fill with result data.
nNodes
must be set to the number of child nodes that need to be visited by the search, and nodeNumbers
must be set to an array of their indexes. If the operator class keeps track of levels, set levelAdds
to an array of the level increments required when descending to each node to be visited. (Often these increments will be the same for all the nodes, but that's not necessarily so, so an array is used.) If value reconstruction is needed, set reconstructedValues
to an array of the values reconstructed for each child node to be visited; otherwise, leave reconstructedValues
as NULL. If it is desired to pass down additional out-of-band information (“traverse values”) to lower levels of the tree search, set traversalValues
to an array of the appropriate traverse values, one for each child node to be visited; otherwise, leave traversalValues
as NULL. Note that the inner_consistent
function is responsible for palloc'ing the nodeNumbers
, levelAdds
, reconstructedValues
, and traversalValues
arrays in the current memory context. However, any output traverse values pointed to by the traversalValues
array should be allocated in traversalMemoryContext
. Each traverse value must be a single palloc'd chunk.

leaf_consistent
Returns true if a leaf tuple satisfies a query.
The SQL declaration of the function must look like this:
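(A reconstruction under the same convention; note that this function returns bool rather than void. my_leaf_consistent is a placeholder name.)

CREATE FUNCTION my_leaf_consistent(internal, internal)
RETURNS bool
AS 'MODULE_PATHNAME'
LANGUAGE C IMMUTABLE STRICT;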
The first argument is a pointer to a spgLeafConsistentIn
C struct, containing input data for the function. The second argument is a pointer to a spgLeafConsistentOut
C struct, which the function must fill with result data.
The array scankeys
, of length nkeys
, describes the index search condition(s). These conditions are combined with AND — only index entries that satisfy all of them satisfy the query. (Note that nkeys
= 0 implies that all index entries satisfy the query.) Usually the consistent function only cares about the sk_strategy
and sk_argument
fields of each array entry, which respectively give the indexable operator and comparison value. In particular it is not necessary to check sk_flags
to see if the comparison value is NULL, because the SP-GiST core code will filter out such conditions. reconstructedValue
is the value reconstructed for the parent tuple; it is (Datum) 0
at the root level or if the inner_consistent
function did not provide a value at the parent level. traversalValue
is a pointer to any traverse data passed down from the previous call of inner_consistent
on the parent index tuple, or NULL at the root level. level
is the current leaf tuple's level, starting at zero for the root level. returnData
is true
if reconstructed data is required for this query; this will only be so if the config
function asserted canReturnData
. leafDatum
is the key value stored in the current leaf tuple.
The function must return true
if the leaf tuple matches the query, or false
if not. In the true
case, if returnData
is true
then leafValue
must be set to the value originally supplied to be indexed for this leaf tuple. Also, recheck
may be set to true
if the match is uncertain and so the operator(s) must be re-applied to the actual heap tuple to verify the match.
All the SP-GiST support methods are normally called in a short-lived memory context; that is, CurrentMemoryContext
will be reset after processing of each tuple. It is therefore not very important to worry about pfree'ing everything you palloc. (The config
method is an exception: it should try to avoid leaking memory. But usually the config
method need do nothing but assign constants into the passed parameter struct.)
If the indexed column is of a collatable data type, the index collation will be passed to all the support methods, using the standard PG_GET_COLLATION()
mechanism.
Internally, a GIN index contains a B-tree index constructed over keys, where each key is an element of one or more indexed items (a member of an array, for example) and where each tuple in a leaf page contains either a pointer to a B-tree of heap pointers (a "posting tree"), or a simple list of heap pointers (a "posting list") when the list is small enough to fit into a single index tuple along with the key value.

As of PostgreSQL 9.1, null key values can be included in the index. Also, placeholder nulls are included in the index for indexed items that are null or contain no keys according to extractValue. This allows searches that should find empty items to do so.

Multicolumn GIN indexes are implemented by building a single B-tree over composite values (column number, key value). The key values for different columns can be of different types.

Updating a GIN index tends to be slow because of the intrinsic nature of inverted indexes: inserting or updating one heap row can cause many inserts into the index (one for each key extracted from the indexed item). As of PostgreSQL 8.4, GIN is capable of postponing much of this work by inserting new tuples into a temporary, unsorted list of pending entries. When the table is vacuumed or autoanalyzed, or when the gin_clean_pending_list function is called, or if the pending list becomes larger than gin_pending_list_limit, the entries are moved to the main GIN data structure using the same bulk insert techniques used during initial index creation. This greatly improves GIN index update speed, even counting the additional vacuum overhead. Moreover, the overhead work can be done by a background process instead of in foreground query processing.

The main disadvantage of this approach is that searches must scan the list of pending entries in addition to searching the regular index, so a large list of pending entries will slow searches significantly. Another disadvantage is that, while most updates are fast, an update that causes the pending list to become "too large" will incur an immediate cleanup cycle and thus be much slower than other updates. Proper use of autovacuum can minimize both of these problems.

If consistent response time is more important than update speed, use of pending entries can be disabled by turning off the fastupdate storage parameter for a GIN index. See for details.

GIN can support "partial match" queries, in which the query does not determine an exact match for one or more keys, but the possible matches fall within a reasonably narrow range of key values (within the key sorting order determined by the compare support method). The extractQuery method, instead of returning a key value to be matched exactly, returns a key value that is the lower bound of the range to be searched, and sets the pmatch flag true. The key range is then scanned using the comparePartial method. comparePartial must return zero for a matching index key, less than zero for a non-match that is still within the range to be searched, or greater than zero if the index key is past the range that could match.
The GIN interface has a high level of abstraction, requiring the access method implementer only to implement the semantics of the data type being accessed. The GIN layer itself takes care of concurrency, logging and searching the tree structure.
All it takes to get a GIN access method working is to implement a few user-defined methods, which define the behavior of keys in the tree and the relationships between keys, indexed items, and indexable queries. In short, GIN combines extensibility with generality, code reuse, and a clean interface.
There are two methods that an operator class for GIN must provide:

Datum *extractValue(Datum itemValue, int32 *nkeys, bool **nullFlags)
Returns a palloc'd array of keys given an item to be indexed. The number of returned keys must be stored into *nkeys
. If any of the keys can be null, also palloc an array of *nkeys
bool
fields, store its address at *nullFlags
, and set these null flags as needed. *nullFlags
can be left NULL
(its initial value) if all keys are non-null. The return value can be NULL
if the item contains no keys.

Datum *extractQuery(Datum query, int32 *nkeys, StrategyNumber n, bool **pmatch, Pointer **extra_data, bool **nullFlags, int32 *searchMode)
Returns a palloc'd array of keys given a value to be queried; that is, query
is the value on the right-hand side of an indexable operator whose left-hand side is the indexed column. n
is the strategy number of the operator within the operator class (see ). Often, extractQuery
will need to consult n
to determine the data type of query
and the method it should use to extract key values. The number of returned keys must be stored into *nkeys
. If any of the keys can be null, also palloc an array of *nkeys
bool
fields, store its address at *nullFlags
, and set these null flags as needed. *nullFlags
can be left NULL
(its initial value) if all keys are non-null. The return value can be NULL
if the query
contains no keys.
searchMode
is an output argument that allows extractQuery
to specify details about how the search will be done. If *searchMode
is set to GIN_SEARCH_MODE_DEFAULT
(which is the value it is initialized to before call), only items that match at least one of the returned keys are considered candidate matches. If *searchMode
is set to GIN_SEARCH_MODE_INCLUDE_EMPTY
, then in addition to items containing at least one matching key, items that contain no keys at all are considered candidate matches. (This mode is useful for implementing is-subset-of operators, for example.) If *searchMode
is set to GIN_SEARCH_MODE_ALL
, then all non-null items in the index are considered candidate matches, whether they match any of the returned keys or not. (This mode is much slower than the other two choices, since it requires scanning essentially the entire index, but it may be necessary to implement corner cases correctly. An operator that needs this mode in most cases is probably not a good candidate for a GIN operator class.) The symbols to use for setting this mode are defined in access/gin.h
.
pmatch
is an output argument for use when partial match is supported. To use it, extractQuery
must allocate an array of *nkeys
booleans and store its address at *pmatch
. Each element of the array should be set to TRUE if the corresponding key requires partial match, FALSE if not. If *pmatch
is set to NULL
then GIN assumes partial match is not required. The variable is initialized to NULL
before call, so this argument can simply be ignored by operator classes that do not support partial match.
extra_data
is an output argument that allows extractQuery
to pass additional data to the consistent
and comparePartial
methods. To use it, extractQuery
must allocate an array of *nkeys
pointers and store its address at *extra_data
, then store whatever it wants to into the individual pointers. The variable is initialized to NULL
before call, so this argument can simply be ignored by operator classes that do not require extra data. If *extra_data
is set, the whole array is passed to the consistent
method, and the appropriate element to the comparePartial
method.
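As a concrete illustration, here is a minimal sketch (not taken from the PostgreSQL source) of an extractValue implementation for a hypothetical opclass that treats each element of an int4 array as a key:

#include "postgres.h"
#include "fmgr.h"
#include "catalog/pg_type.h"
#include "utils/array.h"

PG_FUNCTION_INFO_V1(my_extract_value);

Datum
my_extract_value(PG_FUNCTION_ARGS)
{
    ArrayType  *item = PG_GETARG_ARRAYTYPE_P(0);
    int32      *nkeys = (int32 *) PG_GETARG_POINTER(1);
    bool      **nullFlags = (bool **) PG_GETARG_POINTER(2);
    Datum      *keys;
    bool       *nulls;
    int         nelems;

    /* one key per array element; deconstruct_array pallocs the output arrays */
    deconstruct_array(item, INT4OID, sizeof(int32), true, 'i',
                      &keys, &nulls, &nelems);

    *nkeys = nelems;
    *nullFlags = nulls;     /* may contain true entries for null elements */

    PG_RETURN_POINTER(keys);
}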
An operator class must also provide a function to check if an indexed item matches the query. It comes in two flavors, a boolean consistent
function, and a ternary triConsistent
function. triConsistent
covers the functionality of both, so providing triConsistent
alone is sufficient. However, if the boolean variant is significantly cheaper to calculate, it can be advantageous to provide both. If only the boolean variant is provided, some optimizations that depend on refuting index items before fetching all the keys are disabled.

bool consistent(bool check[], StrategyNumber n, Datum query, int32 nkeys, Pointer extra_data[], bool *recheck, Datum queryKeys[], bool nullFlags[])
Returns TRUE if an indexed item satisfies the query operator with strategy number n
(or might satisfy it, if the recheck indication is returned). This function does not have direct access to the indexed item's value, since GIN does not store items explicitly. Rather, what is available is knowledge about which key values extracted from the query appear in a given indexed item. The check
array has length nkeys
, which is the same as the number of keys previously returned by extractQuery
for this query
datum. Each element of the check
array is TRUE if the indexed item contains the corresponding query key, i.e., if (check[i] == TRUE) the i-th key of the extractQuery
result array is present in the indexed item. The original query
datum is passed in case the consistent
method needs to consult it, and so are the queryKeys[]
and nullFlags[]
arrays previously returned by extractQuery
. extra_data
is the extra-data array returned by extractQuery
, or NULL
if none.
When extractQuery
returns a null key in queryKeys[]
, the corresponding check[]
element is TRUE if the indexed item contains a null key; that is, the semantics of check[]
are like IS NOT DISTINCT FROM
. The consistent
function can examine the corresponding nullFlags[]
element if it needs to tell the difference between a regular value match and a null match.
On success, *recheck
should be set to TRUE if the heap tuple needs to be rechecked against the query operator, or FALSE if the index test is exact. That is, a FALSE return value guarantees that the heap tuple does not match the query; a TRUE return value with *recheck
set to FALSE guarantees that the heap tuple does match the query; and a TRUE return value with *recheck
set to TRUE means that the heap tuple might match the query, so it needs to be fetched and rechecked by evaluating the query operator directly against the originally indexed item.

GinTernaryValue triConsistent(GinTernaryValue check[], StrategyNumber n, Datum query, int32 nkeys, Pointer extra_data[], Datum queryKeys[], bool nullFlags[])
triConsistent
is similar to consistent
, but instead of booleans in the check
vector, there are three possible values for each key: GIN_TRUE
, GIN_FALSE
and GIN_MAYBE
. GIN_FALSE
and GIN_TRUE
have the same meaning as regular boolean values, while GIN_MAYBE
means that the presence of that key is not known. When GIN_MAYBE
values are present, the function should only return GIN_TRUE
if the item certainly matches whether or not the index item contains the corresponding query keys. Likewise, the function must return GIN_FALSE
only if the item certainly does not match, whether or not it contains the GIN_MAYBE
keys. If the result depends on the GIN_MAYBE
entries, i.e., the match cannot be confirmed or refuted based on the known query keys, the function must return GIN_MAYBE
.
When there are no GIN_MAYBE
values in the check
vector, a GIN_MAYBE
return value is the equivalent of setting the recheck
flag in the boolean consistent
function.
In addition, GIN must have a way to sort the key values stored in the index. The operator class can define the sort ordering by specifying a comparison method:

int compare(Datum a, Datum b)
Compares two keys (not indexed items!) and returns an integer less than zero, zero, or greater than zero, indicating whether the first key is less than, equal to, or greater than the second. Null keys are never passed to this function.
Alternatively, if the operator class does not provide a compare
method, GIN will look up the default btree operator class for the index key data type, and use its comparison function. It is recommended to specify the comparison function in a GIN operator class that is meant for just one data type, as looking up the btree operator class costs a few cycles. However, polymorphic GIN operator classes (such as array_ops
) typically cannot specify a single comparison function.
Optionally, an operator class for GIN can supply the following method:

int comparePartial(Datum partial_key, Datum key, StrategyNumber n, Pointer extra_data)
Compare a partial-match query key to an index key. Returns an integer whose sign indicates the result: less than zero means the index key does not match the query, but the index scan should continue; zero means that the index key does match the query; greater than zero indicates that the index scan should stop because no more matches are possible. The strategy number n
of the operator that generated the partial match query is provided, in case its semantics are needed to determine when to end the scan. Also, extra_data
is the corresponding element of the extra-data array made by extractQuery
, or NULL
if none. Null keys are never passed to this function.
The actual data types of the various Datum
values mentioned above vary depending on the operator class. The item values passed to extractValue
are always of the operator class's input type, and all key values must be of the class's STORAGE
type. The type of the query
argument passed to extractQuery
, consistent
and triConsistent
is whatever is the right-hand input type of the class member operator identified by the strategy number. This need not be the same as the indexed type, so long as key values of the correct type can be extracted from it. However, it is recommended that the SQL declarations of these three support functions use the opclass's indexed data type for the query
argument, even though the actual type might be something else depending on the operator.
The core PostgreSQL distribution includes the GIN operator classes previously shown in . The following contrib
modules also contain GIN operator classes:
btree_gin
B-tree equivalent functionality for several data types
hstore
Module for storing (key, value) pairs
intarray
Enhanced support for int[]
pg_trgm
Text similarity using trigram matching
The core PostgreSQL distribution includes the BRIN operator classes shown in .
The minmax operator classes store the minimum and the maximum values appearing in the indexed column within the range. The inclusion operator classes store a value which includes the values in the indexed column within the range.
This section provides an overview of TOAST (The Oversized-Attribute Storage Technique).

PostgreSQL uses a fixed page size (commonly 8 kB), and does not allow tuples to span multiple pages. Therefore, it is not possible to store very large field values directly. To overcome this limitation, large field values are compressed and/or broken up into multiple physical rows. This happens transparently to the user, with only small impact on most of the backend code. The technique is affectionately known as TOAST (or "the best thing since sliced bread"). The TOAST infrastructure is also used to improve handling of large data values in memory.

Only certain data types support TOAST; there is no need to impose the overhead on data types that cannot produce large field values. To support TOAST, a data type must have a variable-length (varlena) representation, in which, ordinarily, the first four bytes of any stored value contain the total length of the value in bytes (including itself). TOAST does not constrain the rest of the data type's representation. The special representations collectively called TOASTed values work by modifying or reinterpreting this initial length word. Therefore, the C-level functions supporting a TOAST-able data type must be careful about how they handle potentially TOASTed input values: an input might not actually consist of a four-byte length word and contents until after it has been detoasted. (This is normally done by invoking PG_DETOAST_DATUM before doing anything with an input value, but in some cases more efficient approaches are possible. See for more detail.)

TOAST usurps two bits of the varlena length word (the high-order bits on big-endian machines, the low-order bits on little-endian machines), thereby limiting the logical size of any value of a TOAST-able data type to 1 GB. When both bits are zero, the value is an ordinary un-TOASTed value of the data type, and the remaining bits of the length word give the total datum size (including the length word) in bytes. When the highest-order or lowest-order bit is set, the value has only a single-byte header instead of the normal four-byte header, and the remaining bits of that byte give the total datum size (including the length byte) in bytes. This alternative supports space-efficient storage of values shorter than 127 bytes, while still allowing the data type to grow to 1 GB at need. Values with single-byte headers are not aligned on any particular boundary, whereas values with four-byte headers are aligned on at least a four-byte boundary; this omission of alignment padding provides additional space savings that is significant compared to short values. As a special case, if the remaining bits of a single-byte header are all zero (which would be impossible for a self-inclusive length), the value is a pointer to out-of-line data, with several possible alternatives as described below. The type and size of such a TOAST pointer are determined by a code stored in the second byte of the datum. Lastly, when the highest-order or lowest-order bit is clear but the adjacent bit is set, the content of the datum has been compressed and must be decompressed before use. In this case the remaining bits of the four-byte length word give the total size of the compressed datum, not the original data. Note that compression is also possible for out-of-line data, but the varlena header does not tell whether it has occurred; the content of the TOAST pointer tells that, instead.

As mentioned, there are multiple types of TOAST pointer datums. The oldest and most common type is a pointer to out-of-line data stored in a TOAST table, which is a separate table associated with, but distinct from, the table containing the TOAST pointer datum itself. These on-disk pointer datums are created by the TOAST management code (in access/heap/tuptoaster.c) when a tuple to be stored on disk is too large to be stored as-is. Further details appear in . Alternatively, a TOAST pointer datum can contain a pointer to out-of-line data that appears elsewhere in memory. Such datums are necessarily short-lived, and will never appear on disk, but they are very useful for avoiding copying and redundant processing of large data values. Further details appear in .

The compression technique used for compressed data is a fairly simple and very fast member of the LZ family of compression techniques. See src/common/pg_lzcompress.c for the details.

If any of the columns of a table are TOAST-able, the table will have an associated TOAST table, whose OID is stored in the table's pg_class.reltoastrelid entry. On-disk TOASTed values are kept in the TOAST table, as described in more detail below.

Out-of-line values are divided (after compression if used) into chunks of at most TOAST_MAX_CHUNK_SIZE bytes (by default this value is chosen so that four chunk rows will fit on a page, making it about 2000 bytes). Each chunk is stored as a separate row in the TOAST table belonging to the owning table. Every TOAST table has the columns chunk_id (an OID identifying the particular TOASTed value), chunk_seq (a sequence number for the chunk within its value), and chunk_data (the actual data of the chunk). A unique index on chunk_id and chunk_seq provides fast retrieval of the values. A pointer datum representing an out-of-line on-disk TOASTed value therefore needs to store the OID of the TOAST table in which to look and the OID of the specific value (its chunk_id). For convenience, pointer datums also store the logical datum size (original uncompressed data length) and the physical stored size (different if compression was applied). Allowing for the varlena header bytes, the total size of an on-disk TOAST pointer datum is therefore 18 bytes regardless of the actual size of the represented value.

The TOAST management code is triggered only when a row value to be stored in a table is wider than TOAST_TUPLE_THRESHOLD bytes (normally 2 kB). The TOAST code will compress and/or move field values out-of-line until the row value is shorter than TOAST_TUPLE_TARGET bytes (also normally 2 kB) or no more gains can be had. During an UPDATE operation, values of unchanged fields are normally preserved as-is; so an UPDATE of a row with out-of-line values incurs no TOAST costs if none of the out-of-line values change.

The TOAST management code recognizes four different strategies for storing TOAST-able columns on disk:

PLAIN prevents either compression or out-of-line storage; furthermore it disables use of single-byte headers for varlena types. This is the only possible strategy for columns of non-TOAST-able data types.

EXTENDED allows both compression and out-of-line storage. This is the default for most TOAST-able data types. Compression will be attempted first, then out-of-line storage if the row is still too big.

EXTERNAL allows out-of-line storage but not compression. Use of EXTERNAL will make substring operations on wide text and bytea columns faster (at the penalty of increased storage space) because these operations are optimized to fetch only the required parts of the out-of-line value when it is not compressed.

MAIN allows compression but not out-of-line storage. (Actually, out-of-line storage will still be performed for such columns, but only as a last resort when there is no other way to make the row small enough to fit on a page.)

Each TOAST-able data type specifies a default strategy for columns of that data type, but the strategy for a given table column can be altered with ALTER TABLE ... SET STORAGE.
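For example (the table and column names are hypothetical):

ALTER TABLE blobs ALTER COLUMN payload SET STORAGE EXTERNAL;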
This scheme has a number of advantages compared to a more straightforward approach such as allowing row values to span pages. Assuming that queries are usually qualified by comparisons against relatively small key values, most of the work of the executor will be done using the main row entry. The big values of TOASTed attributes will only be pulled out (if selected at all) at the time the result set is sent to the client. Thus, the main table is much smaller and more of its rows fit in the shared buffer cache than would be the case without any out-of-line storage. Sort sets shrink also, and sorts will more often be done entirely in memory. A little test showed that a table containing typical HTML pages and their URLs was stored in about half of the raw data size, including the TOAST table, and that the main table contained only about 10% of the entire data (the URLs and some small HTML pages). There was no run time difference compared to an un-TOASTed comparison table, in which all the HTML pages were cut down to 7 kB to fit.
TOAST pointers can point to data that is not on disk, but is elsewhere in the memory of the current server process. Such pointers obviously cannot be long-lived, but they are nonetheless useful. There are currently two sub-cases: pointers to indirect data and pointers to expanded data.
Indirect TOAST pointers simply point at a non-indirect varlena value stored somewhere in memory. This case was originally created merely as a proof of concept, but it is currently used during logical decoding to avoid possibly having to create physical tuples exceeding 1 GB (as pulling all out-of-line field values into the tuple might do). The case is of limited use since the creator of the pointer datum is entirely responsible that the referenced data survives for as long as the pointer could exist, and there is no infrastructure to help with this.
Expanded TOAST pointers are useful for complex data types whose on-disk representation is not especially suited for computational purposes. As an example, the standard varlena representation of a PostgreSQL array includes dimensionality information, a nulls bitmap if there are any null elements, then the values of all the elements in order. When the element type itself is variable-length, the only way to find the N
'th element is to scan through all the preceding elements. This representation is appropriate for on-disk storage because of its compactness, but for computations with the array it's much nicer to have an “expanded” or “deconstructed” representation in which all the element starting locations have been identified. The TOAST pointer mechanism supports this need by allowing a pass-by-reference Datum to point to either a standard varlena value (the on-disk representation) or a TOAST pointer that points to an expanded representation somewhere in memory. The details of this expanded representation are up to the data type, though it must have a standard header and meet the other API requirements given in src/include/utils/expandeddatum.h
. C-level functions working with the data type can choose to handle either representation. Functions that do not know about the expanded representation, but simply apply PG_DETOAST_DATUM
to their inputs, will automatically receive the traditional varlena representation; so support for an expanded representation can be introduced incrementally, one function at a time.
TOAST pointers to expanded values are further broken down into read-write and read-only pointers. The pointed-to representation is the same either way, but a function that receives a read-write pointer is allowed to modify the referenced value in-place, whereas one that receives a read-only pointer must not; it must first create a copy if it wants to make a modified version of the value. This distinction and some associated conventions make it possible to avoid unnecessary copying of expanded values during query execution.
For all types of in-memory TOAST pointer, the TOAST management code ensures that no such pointer datum can accidentally get stored on disk. In-memory TOAST pointers are automatically expanded to normal in-line varlena values before storage — and then possibly converted to on-disk TOAST pointers, if the containing tuple would otherwise be too big.
Number of CHECK constraints on the table; see the pg_constraint catalog
True if the table has (or once had) rules; see the pg_rewrite catalog
True if the table has (or once had) triggers; see the pg_trigger catalog
True if the table has row-level security enabled; see the pg_policy catalog
True if row-level security (when enabled) also applies to the table owner; see the pg_policy catalog
Access privileges; see GRANT and REVOKE for details
Role bypasses every row-level security policy; see for details.
See for more information about the various ways to change these parameters.
The pg_settings view cannot be inserted into or deleted from, but it can be updated. An UPDATE applied to a row of pg_settings is equivalent to executing the SET command on that named parameter. The change only affects the value used by the current session. If an UPDATE is issued within a transaction that is later aborted, the effects of the UPDATE command disappear when the transaction is rolled back. Once the surrounding transaction is committed, the effects will persist until the end of the session, unless overridden by another UPDATE or SET.
lists the system-defined values of typcategory
. Any future additions to this list will also be upper-case ASCII letters. All other ASCII characters are reserved for user-defined categories.
Keep in mind that error message texts need to be translated into other languages. Follow the guidelines in to avoid making life difficult for translators.
After having compiled the handler function into a loadable module (see ), the following commands then register the sample procedural language:
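(Reconstructed along the lines of the standard example; plsample is the sample language name, and the library path assumes the module was installed as plsample.)

CREATE FUNCTION plsample_call_handler() RETURNS language_handler
    AS '$libdir/plsample'
    LANGUAGE C;
CREATE LANGUAGE plsample
    HANDLER plsample_call_handler;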
Although providing a call handler is sufficient to create a minimal procedural language, there are two other functions that can optionally be provided to make the language more convenient to use. These are a validator and an inline handler. A validator can be provided to allow language-specific checking to be done during . An inline handler can be provided to allow the language to support anonymous code blocks executed via the command.
Validator functions should typically honor the parameter: if it is turned off then any expensive or context-sensitive checking should be skipped. If the language provides for code execution at compilation time, the validator must suppress checks that would induce such execution. In particular, this parameter is turned off by pg_dump so that it can load procedural language functions without worrying about side effects or dependencies of the function bodies on other database objects. (Because of this requirement, the call handler should avoid assuming that the validator has fully checked the function. The point of having a validator is not to let the call handler omit checks, but to notify the user immediately if there are obvious errors in a CREATE FUNCTION
command.) While the choice of exactly what to check is mostly left to the discretion of the validator function, note that the core CREATE FUNCTION
code only executes SET
clauses attached to a function when check_function_bodies
is on. Therefore, checks whose results might be affected by GUC parameters definitely should be skipped when check_function_bodies
is off, to avoid false failures when reloading a dump.
It's recommended that you wrap all these function declarations, as well as the CREATE LANGUAGE
command itself, into an extension so that a simple CREATE EXTENSION
command is sufficient to install the language. See for information about writing extensions.
The procedural languages included in the standard distribution are good references when trying to write your own language handler. Look into the src/pl
subdirectory of the source tree. The reference page also has some useful details.
Advice about writing good error messages can be found in .
That is, the value that was current when the ereport
call was reached; changes of errno
within the auxiliary reporting routines will not affect it. That would not be true if you were to write strerror(errno)
explicitly in errmsg
's parameter list; accordingly, do not do so.
For operator classes that do not use prefixes, prefixType
can be set to VOIDOID
. Likewise, for operator classes that do not use node labels, labelType
can be set to VOIDOID
. canReturnData
should be set true if the operator class is capable of reconstructing the originally-supplied index value. longValuesOK
should be set true only when the attType
is of variable length and the operator class is capable of segmenting long values by repeated suffixing (see ).

choose
datum
is the original datum that was to be inserted into the index. leafDatum
is initially the same as datum
, but can change at lower levels of the tree if the choose
or picksplit
methods change it. When the insertion search reaches a leaf page, the current value of leafDatum
is what will be stored in the newly created leaf tuple. level
is the current inner tuple's level, starting at zero for the root level. allTheSame
is true if the current inner tuple is marked as containing multiple equivalent nodes (see ). hasPrefix
is true if the current inner tuple contains a prefix; if so, prefixDatum
is its value. nNodes
is the number of child nodes contained in the inner tuple, and nodeLabels
is an array of their label values, or NULL if there are no labels.
If more than one leaf tuple is supplied, it is expected that the picksplit
function will classify them into more than one node; otherwise it is not possible to split the leaf tuples across multiple pages, which is the ultimate purpose of this operation. Therefore, if the picksplit
function ends up placing all the leaf tuples in the same node, the core SP-GiST code will override that decision and generate an inner tuple in which the leaf tuples are assigned at random to several identically-labeled nodes. Such a tuple is marked allTheSame
to signify that this has happened. The choose
and inner_consistent
functions must take suitable care with such inner tuples. See for more information.
picksplit
can be applied to a single leaf tuple only in the case that the config
function set longValuesOK
to true and a larger-than-a-page input value has been supplied. In this case the point of the operation is to strip off a prefix and produce a new, shorter leaf datum value. The call will be repeated until a leaf datum short enough to fit on a page has been produced. See for more information.

inner_consistent
The array scankeys
, of length nkeys
, describes the index search condition(s). These conditions are combined with AND — only index entries that satisfy all of them are interesting. (Note that nkeys
= 0 implies that all index entries satisfy the query.) Usually the consistent function only cares about the sk_strategy
and sk_argument
fields of each array entry, which respectively give the indexable operator and comparison value. In particular it is not necessary to check sk_flags
to see if the comparison value is NULL, because the SP-GiST core code will filter out such conditions. reconstructedValue
is the value reconstructed for the parent tuple; it is (Datum) 0
at the root level or if the inner_consistent
function did not provide a value at the parent level. traversalValue
is a pointer to any traverse data passed down from the previous call of inner_consistent
on the parent index tuple, or NULL at the root level. traversalMemoryContext
is the memory context in which to store output traverse values (see below). level
is the current inner tuple's level, starting at zero for the root level. returnData
is true
if reconstructed data is required for this query; this will only be so if the config
function asserted canReturnData
. allTheSame
is true if the current inner tuple is marked “all-the-same”; in this case all the nodes have the same label (if any) and so either all or none of them match the query (see ).hasPrefix
is true if the current inner tuple contains a prefix; if so, prefixDatum
is its value. nNodes
is the number of child nodes contained in the inner tuple, and nodeLabels
is an array of their label values, or NULL if the nodes do not have labels.
To support “partial match” queries, an operator class must provide the comparePartial
method, and its extractQuery
method must set the pmatch
parameter when a partial-match query is encountered. See for details.
Name | Type | References | Description
indexrelid | oid | pg_class.oid | The OID of the pg_class entry for this index
indrelid | oid | pg_class.oid | The OID of the pg_class entry for the table this index is for
indnatts | int2 | | The number of columns in the index (duplicates pg_class.relnatts)
indisunique | bool | | If true, this is a unique index
indisprimary | bool | | If true, this index represents the primary key of the table (indisunique should always be true when this is true)
indisexclusion | bool | | If true, this index supports an exclusion constraint
indimmediate | bool | | If true, the uniqueness check is enforced immediately on insertion (irrelevant if indisunique is not true)
indisclustered | bool | | If true, the table was last clustered on this index
indisvalid | bool | | If true, the index is currently valid for queries. False means the index is possibly incomplete: it must still be modified by INSERT/UPDATE operations, but it cannot safely be used for queries. If it is unique, the uniqueness property is not guaranteed true either.
indcheckxmin | bool | | If true, queries must not use the index until the xmin of this pg_index row is below their TransactionXmin event horizon, because the table may contain broken HOT chains with incompatible rows that they can see
indisready | bool | | If true, the index is currently ready for inserts. False means the index must be ignored by INSERT/UPDATE operations.
indislive | bool | | If false, the index is in process of being dropped, and should be ignored for all purposes (including HOT-safety decisions)
indisreplident | bool | | If true, this index has been chosen as the "replica identity" using ALTER TABLE ... REPLICA IDENTITY ...
indkey | int2vector | pg_attribute.attnum | This is an array of indnatts values that indicate which table columns this index indexes. For example, a value of 1 3 would mean that the first and the third table columns make up the index key. A zero in this array indicates that the corresponding index attribute is an expression over the table columns, rather than a simple column reference.
indcollation | oidvector | pg_collation.oid | For each column in the index key, this contains the OID of the collation to use for the index, or zero if the column is not of a collatable data type.
indclass | oidvector | pg_opclass.oid | For each column in the index key, this contains the OID of the operator class to use. See pg_opclass for details.
indoption | int2vector | | This is an array of indnatts values that store per-column flag bits. The meaning of the bits is defined by the index's access method.
indexprs | pg_node_tree | | Expression trees (in nodeToString() representation) for index attributes that are not simple column references. This is a list with one element for each zero entry in indkey. Null if all index attributes are simple references.
indpred | pg_node_tree | | Expression tree (in nodeToString() representation) for partial index predicate. Null if not a partial index.
Name | Type | References | Description
oid | oid | | Row identifier (hidden attribute; must be explicitly selected)
lanname | name | | Name of the language
lanowner | oid | pg_authid.oid | Owner of the language
lanispl | bool | | This is false for internal languages (such as SQL) and true for user-defined languages. Currently, pg_dump still uses this to determine which languages need to be dumped, but this might be replaced by a different mechanism in the future.
lanpltrusted | bool | | True if this is a trusted language, which means that it is believed not to grant access to anything outside the normal SQL execution environment. Only superusers can create functions in untrusted languages.
lanplcallfoid | oid | pg_proc.oid | For noninternal languages this references the language handler, which is a special function that is responsible for executing all functions that are written in the particular language.
laninline | oid | pg_proc.oid | This references a function that is responsible for executing "inline" anonymous code blocks (DO blocks). Zero if inline blocks are not supported.
lanvalidator | oid | pg_proc.oid | This references a language validator function that is responsible for checking the syntax and validity of new functions when they are created. Zero if no validator is provided.
lanacl | aclitem[] | |
Name | Type | References | Description
oid | oid | | Row identifier (hidden attribute; must be explicitly selected)
collname | name | | Collation name (unique per namespace and encoding)
collnamespace | oid | pg_namespace.oid | The OID of the namespace that contains this collation
collowner | oid | pg_authid.oid | Owner of the collation
collprovider | char | | Provider of the collation: d = database default, c = libc, i = icu
collencoding | int4 | | Encoding in which the collation is applicable, or -1 if it works for any encoding
collcollate | name | | LC_COLLATE for this collation object
collctype | name | | LC_CTYPE for this collation object
collversion | text | | Provider-specific version of the collation. This is recorded when the collation is created and then checked when it is used, to detect changes in the collation definition that could lead to data corruption.
Name | Type | References | Description
oid | oid | | Row identifier (hidden attribute; must be explicitly selected)
datname | name | | Database name
datdba | oid | pg_authid.oid | Owner of the database, usually the user who created it
encoding | int4 | | Character encoding for this database (pg_encoding_to_char() can translate this number to the encoding name)
datcollate | name | | LC_COLLATE for this database
datctype | name | | LC_CTYPE for this database
datistemplate | bool | | If true, then this database can be cloned by any user with CREATEDB privileges; if false, then only superusers or the owner of the database can clone it.
datallowconn | bool | | If false then no one can connect to this database. This is used to protect the template0 database from being altered.
datconnlimit | int4 | | Sets the maximum number of concurrent connections that can be made to this database. -1 means no limit.
datlastsysoid | oid | | Last system OID in the database; useful particularly to pg_dump
datfrozenxid | xid | | All transaction IDs before this one have been replaced with a permanent ("frozen") transaction ID in this database. This is used to track whether the database needs to be vacuumed in order to prevent transaction ID wraparound or to allow pg_xact to be shrunk. It is the minimum of the per-table pg_class.relfrozenxid values.
datminmxid | xid | | All multixact IDs before this one have been replaced with a transaction ID in this database. This is used to track whether the database needs to be vacuumed in order to prevent multixact ID wraparound or to allow pg_multixact to be shrunk. It is the minimum of the per-table pg_class.relminmxid values.
dattablespace | oid | pg_tablespace.oid | The default tablespace for the database. Within this database, all tables for which pg_class.reltablespace is zero will be stored in this tablespace; in particular, all the non-shared system catalogs will be there.
datacl | aclitem[] | |
Name | Type | References | Description
attrelid | oid | pg_class.oid | The table this column belongs to
attname | name | | The column name
atttypid | oid | pg_type.oid | The data type of this column
attstattarget | int4 | | attstattarget controls the level of detail of statistics accumulated for this column by ANALYZE. A zero value indicates that no statistics should be collected. A negative value says to use the system default statistics target. The exact meaning of positive values is data type-dependent. For scalar data types, attstattarget is both the target number of "most common values" to collect, and the target number of histogram bins to create.
attlen | int2 | | A copy of pg_type.typlen of this column's type
attnum | int2 | | The number of the column. Ordinary columns are numbered from 1 up. System columns, such as oid, have (arbitrary) negative numbers.
attndims | int4 | | Number of dimensions, if the column is an array type; otherwise 0. (Presently, the number of dimensions of an array is not enforced, so any nonzero value effectively means "it's an array".)
attcacheoff | int4 | | Always -1 in storage, but when loaded into a row descriptor in memory this might be updated to cache the offset of the attribute within the row
atttypmod | int4 | | atttypmod records type-specific data supplied at table creation time (for example, the maximum length of a varchar column). It is passed to type-specific input functions and length coercion functions. The value will generally be -1 for types that do not need atttypmod.
attbyval | bool | | A copy of pg_type.typbyval of this column's type
attstorage | char | | Normally a copy of pg_type.typstorage of this column's type. For TOAST-able data types, this can be altered after column creation to control storage policy.
attalign | char | | A copy of pg_type.typalign of this column's type
attnotnull | bool | | This represents a not-null constraint.
atthasdef | bool | | This column has a default value, in which case there will be a corresponding entry in the pg_attrdef catalog that actually defines the value.
attidentity | char | | If a zero byte (''), then not an identity column. Otherwise, a = generated always, d = generated by default.
attisdropped | bool | | This column has been dropped and is no longer valid. A dropped column is still physically present in the table, but is ignored by the parser and so cannot be accessed via SQL.
attislocal | bool | | This column is defined locally in the relation. Note that a column can be locally defined and inherited simultaneously.
attinhcount | int4 | | The number of direct ancestors this column has. A column with a nonzero number of ancestors cannot be dropped nor renamed.
attcollation | oid | pg_collation.oid | The defined collation of the column, or zero if the column is not of a collatable data type.
attacl | aclitem[] | | Column-level access privileges, if any have been granted specifically on this column
attoptions | text[] | | Attribute-level options, as "keyword=value" strings
attfdwoptions | text[] | | Attribute-level foreign data wrapper options, as "keyword=value" strings
| Name | Type | References | Description |
|---|---|---|---|
| oid | oid | | Row identifier (hidden attribute; must be explicitly selected) |
| rulename | name | | Rule name |
| ev_class | oid | pg_class.oid | The table this rule is for |
| ev_type | char | | Event type that the rule is for: 1 = SELECT, 2 = UPDATE, 3 = INSERT, 4 = DELETE |
| ev_enabled | char | | Controls in which session_replication_role modes the rule fires. O = rule fires in "origin" and "local" modes, D = rule is disabled, R = rule fires in "replica" mode, A = rule fires always. |
| is_instead | bool | | True if the rule is an INSTEAD rule |
| ev_qual | pg_node_tree | | Expression tree (in the form of a nodeToString() representation) for the rule's qualifying condition |
| ev_action | pg_node_tree | | Query tree (in the form of a nodeToString() representation) for the rule's action |
| Name | Type | References | Description |
|---|---|---|---|
| oid | oid | | Row identifier (hidden attribute; must be explicitly selected) |
| opcmethod | oid | pg_am.oid | Index access method operator class is for |
| opcname | name | | Name of this operator class |
| opcnamespace | oid | pg_namespace.oid | Namespace of this operator class |
| opcowner | oid | pg_authid.oid | Owner of the operator class |
| opcfamily | oid | pg_opfamily.oid | Operator family containing the operator class |
| opcintype | oid | pg_type.oid | Data type that the operator class indexes |
| opcdefault | bool | | True if this operator class is the default for opcintype |
| opckeytype | oid | pg_type.oid | Type of data stored in index, or zero if same as opcintype |
| Name | Type | References | Description |
|---|---|---|---|
| oid | oid | | Row identifier (hidden attribute; must be explicitly selected) |
| proname | name | | Name of the function |
| pronamespace | oid | pg_namespace.oid | The OID of the namespace that contains this function |
| proowner | oid | pg_authid.oid | Owner of the function |
| prolang | oid | pg_language.oid | Implementation language or call interface of this function |
| procost | float4 | | Estimated execution cost (in units of cpu_operator_cost); if proretset, this is cost per row returned |
| prorows | float4 | | Estimated number of result rows (zero if not proretset) |
| provariadic | oid | pg_type.oid | Data type of the variadic array parameter's elements, or zero if the function does not have a variadic parameter |
| protransform | regproc | pg_proc.oid | Calls to this function can be simplified by this other function (see Section 38.10.10) |
| prokind | char | | f for a normal function, p for a procedure, a for an aggregate function, or w for a window function |
| prosecdef | bool | | Function is a security definer (i.e., a "setuid" function) |
| proleakproof | bool | | The function has no side effects. No information about the arguments is conveyed except via the return value. Any function that might throw an error depending on the values of its arguments is not leak-proof. |
| proisstrict | bool | | Function returns null if any call argument is null. In that case the function won't actually be called at all. Functions that are not "strict" must be prepared to handle null inputs. |
| proretset | bool | | Function returns a set (i.e., multiple values of the specified data type) |
| provolatile | char | | provolatile tells whether the function's result depends only on its input arguments, or is affected by outside factors. It is i for "immutable" functions, which always deliver the same result for the same inputs. It is s for "stable" functions, whose results (for fixed inputs) do not change within a scan. It is v for "volatile" functions, whose results might change at any time. (Use v also for functions with side-effects, so that calls to them cannot get optimized away.) |
| proparallel | char | | proparallel tells whether the function can be safely run in parallel mode. It is s for functions which are safe to run in parallel mode without restriction. It is r for functions which can be run in parallel mode, but their execution is restricted to the parallel group leader; parallel worker processes cannot invoke these functions. It is u for functions which are unsafe in parallel mode; the presence of such a function forces a serial execution plan. |
| pronargs | int2 | | Number of input arguments |
| pronargdefaults | int2 | | Number of arguments that have defaults |
| prorettype | oid | pg_type.oid | Data type of the return value |
| proargtypes | oidvector | pg_type.oid | An array with the data types of the function arguments. This includes only input arguments (including INOUT and VARIADIC arguments), and thus represents the call signature of the function. |
| proallargtypes | oid[] | pg_type.oid | An array with the data types of the function arguments. This includes all arguments (including OUT and INOUT arguments); however, if all the arguments are IN arguments, this field will be null. Note that subscripting is 1-based, whereas for historical reasons proargtypes is subscripted from 0. |
| proargmodes | char[] | | An array with the modes of the function arguments, encoded as i for IN arguments, o for OUT arguments, b for INOUT arguments, v for VARIADIC arguments, t for TABLE arguments. If all the arguments are IN arguments, this field will be null. Note that subscripts correspond to positions of proallargtypes not proargtypes. |
| proargnames | text[] | | An array with the names of the function arguments. Arguments without a name are set to empty strings in the array. If none of the arguments have a name, this field will be null. Note that subscripts correspond to positions of proallargtypes not proargtypes. |
| proargdefaults | pg_node_tree | | Expression trees (in nodeToString() representation) for default values. This is a list with pronargdefaults elements, corresponding to the last N input arguments (i.e., the last N proargtypes positions). If none of the arguments have defaults, this field will be null. |
| protrftypes | oid[] | | Data type OIDs for which to apply transforms. |
| prosrc | text | | This tells the function handler how to invoke the function. It might be the actual source code of the function for interpreted languages, a link symbol, a file name, or just about anything else, depending on the implementation language/call convention. |
| probin | text | | Additional information about how to invoke the function. Again, the interpretation is language-specific. |
| proconfig | text[] | | Function's local settings for run-time configuration variables |
| proacl | aclitem[] | | Access privileges |
| Name | Type | References | Description |
|---|---|---|---|
| oid | oid | | Row identifier (hidden attribute; must be explicitly selected) |
| conname | name | | Constraint name (not necessarily unique!) |
| connamespace | oid | pg_namespace.oid | The OID of the namespace that contains this constraint |
| contype | char | | c = check constraint, f = foreign key constraint, p = primary key constraint, u = unique constraint, t = constraint trigger, x = exclusion constraint |
| condeferrable | bool | | Is the constraint deferrable? |
| condeferred | bool | | Is the constraint deferred by default? |
| convalidated | bool | | Has the constraint been validated? Currently, can only be false for foreign keys and CHECK constraints |
| conrelid | oid | pg_class.oid | The table this constraint is on; 0 if not a table constraint |
| contypid | oid | pg_type.oid | The domain this constraint is on; 0 if not a domain constraint |
| conindid | oid | pg_class.oid | The index supporting this constraint, if it's a unique, primary key, foreign key, or exclusion constraint; else 0 |
| confrelid | oid | pg_class.oid | If a foreign key, the referenced table; else 0 |
| confupdtype | char | | Foreign key update action code: a = no action, r = restrict, c = cascade, n = set null, d = set default |
| confdeltype | char | | Foreign key deletion action code: a = no action, r = restrict, c = cascade, n = set null, d = set default |
| confmatchtype | char | | Foreign key match type: f = full, p = partial, s = simple |
| conislocal | bool | | This constraint is defined locally for the relation. Note that a constraint can be locally defined and inherited simultaneously. |
| coninhcount | int4 | | The number of direct inheritance ancestors this constraint has. A constraint with a nonzero number of ancestors cannot be dropped nor renamed. |
| connoinherit | bool | | This constraint is defined locally for the relation. It is a non-inheritable constraint. |
| conkey | int2[] | pg_attribute.attnum | If a table constraint (including foreign keys, but not constraint triggers), list of the constrained columns |
| confkey | int2[] | pg_attribute.attnum | If a foreign key, list of the referenced columns |
| conpfeqop | oid[] | pg_operator.oid | If a foreign key, list of the equality operators for PK = FK comparisons |
| conppeqop | oid[] | pg_operator.oid | If a foreign key, list of the equality operators for PK = PK comparisons |
| conffeqop | oid[] | pg_operator.oid | If a foreign key, list of the equality operators for FK = FK comparisons |
| conexclop | oid[] | pg_operator.oid | If an exclusion constraint, list of the per-column exclusion operators |
| conbin | pg_node_tree | | If a check constraint, an internal representation of the expression |
| consrc | text | | If a check constraint, a human-readable representation of the expression |
| Name | Type | References | Description |
|---|---|---|---|
| rolname | name | | Role name |
| rolsuper | bool | | Role has superuser privileges |
| rolinherit | bool | | Role automatically inherits privileges of roles it is a member of |
| rolcreaterole | bool | | Role can create more roles |
| rolcreatedb | bool | | Role can create databases |
| rolcanlogin | bool | | Role can log in. That is, this role can be given as the initial session authorization identifier |
| rolreplication | bool | | Role is a replication role. A replication role can initiate replication connections and create and drop replication objects. |
| rolconnlimit | int4 | | For roles that can log in, this sets the maximum number of concurrent connections this role can make. -1 means no limit. |
| rolpassword | text | | Not the password (always reads as ********) |
| rolvaliduntil | timestamptz | | Password expiry time (only used for password authentication); null if no expiration |
| rolbypassrls | bool | | Role bypasses every row-level security policy, see Section 5.7 for more information. |
| rolconfig | text[] | | Role-specific defaults for run-time configuration variables |
| oid | oid | pg_authid.oid | ID of role |
| Name | Type | References | Description |
|---|---|---|---|
| polname | name | | The name of the policy |
| polrelid | oid | pg_class.oid | The table to which the policy applies |
| polcmd | char | | The command type to which the policy is applied: r for SELECT, a for INSERT, w for UPDATE, d for DELETE, or * for all |
| polpermissive | boolean | | Is the policy permissive or restrictive? |
| polroles | oid[] | pg_authid.oid | The roles to which the policy is applied |
| polqual | pg_node_tree | | The expression tree to be added to the security barrier qualifications for queries that use the table |
| polwithcheck | pg_node_tree | | The expression tree to be added to the WITH CHECK qualifications for queries that attempt to add rows to the table |
| Name | Type | References | Description |
|---|---|---|---|
| starelid | oid | pg_class.oid | The table or index that the described column belongs to |
| staattnum | int2 | pg_attribute.attnum | The number of the described column |
| stainherit | bool | | If true, the stats include inheritance child columns, not just the values in the specified relation |
| stanullfrac | float4 | | The fraction of the column's entries that are null |
| stawidth | int4 | | The average stored width, in bytes, of nonnull entries |
| stadistinct | float4 | | The number of distinct nonnull data values in the column. A value greater than zero is the actual number of distinct values. A value less than zero is the negative of a multiplier for the number of rows in the table; for example, a column in which about 80% of the values are nonnull and each nonnull value appears about twice on average could be represented by stadistinct = -0.4. A zero value means the number of distinct values is unknown. |
| stakindN | int2 | | A code number indicating the kind of statistics stored in the Nth "slot" of the pg_statistic row. |
| staopN | oid | pg_operator.oid | An operator used to derive the statistics stored in the Nth "slot". For example, a histogram slot would show the < operator that defines the sort order of the data. |
| stanumbersN | float4[] | | Numerical statistics of the appropriate kind for the Nth "slot", or null if the slot kind does not involve numerical values |
| stavaluesN | anyarray | | Column data values of the appropriate kind for the Nth "slot", or null if the slot kind does not store any data values. Each array's element values are actually of the specific column's data type, or a related type such as an array's element type, so there is no way to define these columns' type more specifically than anyarray. |
| Name | Type | References | Description |
|---|---|---|---|
| oid | oid | | Row identifier (hidden attribute; must be explicitly selected) |
| spcname | name | | Tablespace name |
| spcowner | oid | pg_authid.oid | Owner of the tablespace, usually the user who created it |
| spcacl | aclitem[] | | Access privileges |
| spcoptions | text[] | | Tablespace-level options, as "keyword=value" strings |
| Code | Category |
|---|---|
| A | Array types |
| B | Boolean types |
| C | Composite types |
| D | Date/time types |
| E | Enum types |
| G | Geometric types |
| I | Network address types |
| N | Numeric types |
| P | Pseudo-types |
| R | Range types |
| S | String types |
| T | Timespan types |
| U | User-defined types |
| V | Bit-string types |
| X | unknown type |
| Name | Type | Description |
|---|---|---|
| name | text | Run-time configuration parameter name |
| setting | text | Current value of the parameter |
| unit | text | Implicit unit of the parameter |
| category | text | Logical group of the parameter |
| short_desc | text | A brief description of the parameter |
| extra_desc | text | Additional, more detailed, description of the parameter |
| context | text | Context required to set the parameter's value (see below) |
| vartype | text | Parameter type (bool, enum, integer, real, or string) |
| source | text | Source of the current parameter value |
| min_val | text | Minimum allowed value of the parameter (null for non-numeric values) |
| max_val | text | Maximum allowed value of the parameter (null for non-numeric values) |
| enumvals | text[] | Allowed values of enum parameters (null for non-enum parameters) |
| boot_val | text | Parameter value assumed at server startup if the parameter is not otherwise set |
| reset_val | text | Value that RESET would reset the parameter to in the current session |
| sourcefile | text | Configuration file the current value was set in (null for values set from sources other than configuration files, or when examined by a user who is neither a superuser nor a member of pg_read_all_settings); helpful when using include directives in configuration files |
| sourceline | int4 | Line number within the configuration file the current value was set at (null for values set from sources other than configuration files, or when examined by a user who is neither a superuser nor a member of pg_read_all_settings) |
| pending_restart | bool | true if the value has been changed in the configuration file but needs a restart; false otherwise |
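For example, an individual parameter can be inspected with an ordinary query (work_mem here is just an illustrative parameter name):

```sql
SELECT name, setting, unit, context
FROM pg_settings
WHERE name = 'work_mem';
```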
[Table: Name | Indexed Data Type | Indexable Operators | Ordering Operators — entries not recovered]
[Table: Name | Indexed Data Type | Indexable Operators — entries not recovered]
[Table: Name | Indexed Data Type | Indexable Operators — entries not recovered]
[Table: Name | Indexed Data Type | Indexable Operators — entries not recovered]
Multivariate correlation can be demonstrated with a very simple data set — a table with two columns, both containing the same values:
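A minimal sketch of such a data set (the table name t, the modulus, and the row count are illustrative, but match the discussion that follows):

```sql
CREATE TABLE t (a INT, b INT);
-- Both columns always receive the same value, so they are perfectly correlated.
INSERT INTO t SELECT i % 100, i % 100 FROM generate_series(1, 10000) s(i);
ANALYZE t;
```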
As explained in Section 14.2, the planner can determine the cardinality of t using the number of pages and rows obtained from pg_class:
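For example, a query along these lines (against the sketch table t above):

```sql
SELECT relpages, reltuples FROM pg_class WHERE relname = 't';
```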
The data distribution is very simple; there are only 100 distinct values in each column, uniformly distributed.
The following example shows the result of estimating a WHERE condition on the a column:
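With the sketch table above, such an estimate can be inspected like this:

```sql
EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1;
```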
The planner examines the condition and determines the selectivity of this clause to be 1%. By comparing this estimate and the actual number of rows, we see that the estimate is very accurate (in fact exact, as the table is very small). Changing the WHERE condition to use the b column, an identical plan is generated. But observe what happens if we apply the same condition on both columns, combining them with AND:
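For example:

```sql
EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1 AND b = 1;
```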
The planner estimates the selectivity for each condition individually, arriving at the same 1% estimates as above. Then it assumes that the conditions are independent, and so it multiplies their selectivities, producing a final selectivity estimate of just 0.01%. This is a significant underestimate, as the actual number of rows matching the conditions (100) is two orders of magnitude higher.
This problem can be fixed by creating a statistics object that directs ANALYZE to calculate functional-dependency multivariate statistics on the two columns:
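A sketch of such a statistics object (the name stts is illustrative):

```sql
CREATE STATISTICS stts (dependencies) ON a, b FROM t;
ANALYZE t;
```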
A similar problem occurs with estimation of the cardinality of sets of multiple columns, such as the number of groups that would be generated by a GROUP BY clause. When GROUP BY lists a single column, the n-distinct estimate (which is visible as the estimated number of rows returned by the HashAggregate node) is very accurate:
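For example:

```sql
EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM t GROUP BY a;
```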
But without multivariate statistics, the estimate for the number of groups in a query with two columns in GROUP BY, as in the following example, is off by an order of magnitude:
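```sql
EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM t GROUP BY a, b;
```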
By redefining the statistics object to include n-distinct counts for the two columns, the estimate is much improved:
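Continuing the sketch above:

```sql
DROP STATISTICS stts;
CREATE STATISTICS stts (dependencies, ndistinct) ON a, b FROM t;
ANALYZE t;
```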
BRIN stands for Block Range Index. BRIN is designed for handling very large tables in which certain columns have some natural correlation with their physical location within the table. A block range is a group of pages that are physically adjacent in the table; for each block range, some summary info is stored by the index. For example, a table storing a store's sale orders might have a date column on which each order was placed, and most of the time the entries for earlier orders will appear earlier in the table as well; a table storing a ZIP code column might have all codes for a city grouped together naturally.
BRIN indexes can satisfy queries via regular bitmap index scans, and will return all tuples in all pages within each range if the summary info stored by the index is consistent with the query conditions. The query executor is in charge of rechecking these tuples and discarding those that do not match the query conditions — in other words, these indexes are lossy. Because a BRIN index is very small, scanning the index adds little overhead compared to a sequential scan, but may avoid scanning large parts of the table that are known not to contain matching tuples.
The specific data that a BRIN index will store, as well as the specific queries that the index will be able to satisfy, depend on the operator class selected for each column of the index. Data types having a linear sort order can have operator classes that store the minimum and maximum value within each block range, for instance; geometrical types might store the bounding box for all the objects in the block range.
The size of the block range is determined at index creation time by the pages_per_range storage parameter. The number of index entries will be equal to the size of the relation in pages divided by the selected value for pages_per_range. Therefore, the smaller the number, the larger the index becomes (because of the need to store more index entries), but at the same time the summary data stored can be more precise and more data blocks can be skipped during an index scan.
At the time of creation, all existing heap pages are scanned and a summary index tuple is created for each range, including the possibly-incomplete range at the end. As new pages are filled with data, page ranges that are already summarized will cause the summary information to be updated with data from the new tuples. When a new page is created that does not fall within the last summarized range, that range does not automatically acquire a summary tuple; those tuples remain unsummarized until a summarization run is invoked later, creating initial summaries. This process can be invoked manually using the brin_summarize_range(regclass, bigint) or brin_summarize_new_values(regclass) functions; automatically when VACUUM processes the table; or by automatic summarization executed by autovacuum, as insertions occur. (This last trigger is disabled by default and can be enabled with the autosummarize parameter.) Conversely, a range can be de-summarized using the brin_desummarize_range(regclass, bigint) function, which is useful when the index tuple is no longer a very good representation because the existing values have changed.
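A sketch tying these pieces together (table, column, and index names are illustrative):

```sql
CREATE TABLE orders (order_date date, amount numeric);
-- A smaller pages_per_range yields a larger but more precise index.
CREATE INDEX orders_date_brin ON orders USING brin (order_date)
    WITH (pages_per_range = 32, autosummarize = on);
-- Summarize page ranges not yet covered by the index:
SELECT brin_summarize_new_values('orders_date_brin');
-- Discard the summary for the range starting at block 0:
SELECT brin_desummarize_range('orders_date_brin', 0);
```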
When autosummarization is enabled, each time a page range is filled a request is sent to autovacuum for it to execute a targeted summarization for that range, to be fulfilled at the end of the next worker run on the same database. If the request queue is full, the request is not recorded and a message is sent to the server log:
When this happens, the range will be summarized normally during the next regular vacuum of the table.
Each heap relation has a Visibility Map (VM) to keep track of which pages contain only tuples that are known to be visible to all active transactions; it also keeps track of which pages contain only frozen tuples. It's stored alongside the main relation data in a separate relation fork, named after the filenode number of the relation, plus a _vm suffix. For example, if the filenode of a relation is 12345, the VM is stored in a file called 12345_vm, in the same directory as the main relation file. Note that indexes do not have VMs.
The visibility map stores two bits per heap page. The first bit, if set, indicates that the page is all-visible, or in other words that the page does not contain any tuples that need to be vacuumed. This information can also be used by index-only scans to answer queries using only the index tuple. The second bit, if set, means that all tuples on the page have been frozen. That means that even an anti-wraparound vacuum does not need to revisit the page.
The map is conservative in the sense that we make sure that whenever a bit is set, we know the condition to be true, but if a bit is not set, it might or might not be true. Visibility map bits are only set by vacuum, but can be cleared by any data-modifying operation on a page.
The pg_visibility module can be used to examine the information stored in the visibility map.
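For example (a minimal sketch; mytable is an illustrative name):

```sql
CREATE EXTENSION pg_visibility;
-- One row per heap page, showing the two visibility map bits:
SELECT blkno, all_visible, all_frozen FROM pg_visibility_map('mytable');
```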
SP-GiST is an abbreviation for space-partitioned GiST. SP-GiST supports partitioned search trees, which facilitate development of a wide range of different non-balanced data structures, such as quad-trees, k-d trees, and radix trees (tries). The common feature of these structures is that they repeatedly divide the search space into partitions that need not be of equal size. Searches that are well matched to the partitioning rule can be very fast.
These popular data structures were originally developed for in-memory usage. In main memory, they are usually designed as a set of dynamically allocated nodes linked by pointers. This is not suitable for direct storing on disk, since these chains of pointers can be rather long which would require too many disk accesses. In contrast, disk-based data structures should have a high fanout to minimize I/O. The challenge addressed by SP-GiST is to map search tree nodes to disk pages in such a way that a search need access only a few disk pages, even if it traverses many nodes.
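For example, the in-core quad-tree operator class for the point type can be used without any special setup (table and data are illustrative):

```sql
CREATE TABLE landmarks (name text, location point);
CREATE INDEX landmarks_location_idx ON landmarks USING spgist (location);
-- A containment search the quad-tree can answer with few page reads:
SELECT name FROM landmarks WHERE location <@ box '((0,0),(100,100))';
```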
Like GiST, SP-GiST is meant to allow the development of custom data types with the appropriate access methods, by an expert in the domain of the data type, rather than a database expert.
Some of the information here is derived from Purdue University's SP-GiST Indexing Project web site. The SP-GiST implementation in PostgreSQL is primarily maintained by Teodor Sigaev and Oleg Bartunov, and there is more information on their web site.
This chapter provides an overview of the physical storage format used by PostgreSQL databases.
This chapter builds on the material covered in Section 14.1 and Section 14.2 to show some additional details about how the planner uses the system statistics to estimate the number of rows each part of a query might return. This is a significant part of the planning process, providing much of the raw material for cost calculation.
The intent of this chapter is not to document the code in detail, but to present an overview of how it works. This will perhaps ease the learning curve for someone who subsequently wishes to read the code.
| Name | Type | References | Description |
|---|---|---|---|
| oid | oid | | Row identifier (hidden attribute; must be explicitly selected) |
| tgrelid | oid | pg_class.oid | The table this trigger is on |
| tgname | name | | Trigger name (must be unique among triggers of same table) |
| tgfoid | oid | pg_proc.oid | The function to be called |
| tgtype | int2 | | Bit mask identifying trigger firing conditions |
| tgenabled | char | | Controls in which session_replication_role modes the trigger fires. O = trigger fires in "origin" and "local" modes, D = trigger is disabled, R = trigger fires in "replica" mode, A = trigger fires always. |
| tgisinternal | bool | | True if trigger is internally generated (usually, to enforce the constraint identified by tgconstraint) |
| tgconstrrelid | oid | pg_class.oid | The table referenced by a referential integrity constraint |
| tgconstrindid | oid | pg_class.oid | The index supporting a unique, primary key, referential integrity, or exclusion constraint |
| tgconstraint | oid | pg_constraint.oid | The pg_constraint entry associated with the trigger, if any |
| tgdeferrable | bool | | True if constraint trigger is deferrable |
| tginitdeferred | bool | | True if constraint trigger is initially deferred |
| tgnargs | int2 | | Number of argument strings passed to trigger function |
| tgattr | int2vector | pg_attribute.attnum | Column numbers, if trigger is column-specific; otherwise an empty array |
| tgargs | bytea | | Argument strings to pass to trigger, each NULL-terminated |
| tgqual | pg_node_tree | | Expression tree (in nodeToString() representation) for the trigger's WHEN condition, or null if none |
| tgoldtable | name | | REFERENCING clause name for OLD TABLE, or null if none |
| tgnewtable | name | | REFERENCING clause name for NEW TABLE, or null if none |
| Name | Type | References | Description |
|---|---|---|---|
| oid | oid | | Row identifier (hidden attribute; must be explicitly selected) |
| typname | name | | Data type name |
| typnamespace | oid | pg_namespace.oid | The OID of the namespace that contains this type |
| typowner | oid | pg_authid.oid | Owner of the type |
| typlen | int2 | | For a fixed-size type, typlen is the number of bytes in the internal representation of the type. But for a variable-length type, typlen is negative: -1 indicates a "varlena" type (one that has a length word), -2 indicates a null-terminated C string. |
| typbyval | bool | | typbyval determines whether internal routines pass a value of this type by value or by reference. Variable-length types are always passed by reference. |
| typtype | char | | typtype is b for a base type, c for a composite type, d for a domain, e for an enum type, p for a pseudo-type, or r for a range type |
| typcategory | char | | typcategory is an arbitrary classification of data types that is used by the parser to determine which implicit casts should be "preferred". |
| typispreferred | bool | | True if the type is a preferred cast target within its typcategory |
| typisdefined | bool | | True if the type is defined, false if this is a placeholder entry for a not-yet-defined type. When typisdefined is false, nothing except the type name, namespace, and OID can be relied on. |
| typdelim | char | | Character that separates two values of this type when parsing array input. Note that the delimiter is associated with the array element data type, not the array data type. |
| typrelid | oid | pg_class.oid | If this is a composite type (see typtype), then this column points to the pg_class entry that defines the corresponding table. Zero for non-composite types. |
| typelem | oid | pg_type.oid | If typelem is not 0 then it identifies another row in pg_type: this type can then be subscripted like an array yielding values of type typelem |
| typarray | oid | pg_type.oid | If typarray is not 0 then it identifies another row in pg_type, which is the array type having this type as element |
| typinput | regproc | pg_proc.oid | Input conversion function (text format) |
| typoutput | regproc | pg_proc.oid | Output conversion function (text format) |
| typreceive | regproc | pg_proc.oid | Input conversion function (binary format), or 0 if none |
| typsend | regproc | pg_proc.oid | Output conversion function (binary format), or 0 if none |
| typmodin | regproc | pg_proc.oid | Type modifier input function, or 0 if type does not support modifiers |
| typmodout | regproc | pg_proc.oid | Type modifier output function, or 0 to use the standard format |
| typanalyze | regproc | pg_proc.oid | Custom ANALYZE function, or 0 to use the standard function |
| typalign | char | | Alignment required when storing a value of this type (c = char, s = short, i = int, d = double) |
| typstorage | char | | typstorage tells for varlena types whether the type is prepared for toasting (p = plain, e = external, m = main, x = extended) |
| typnotnull | bool | | typnotnull represents a not-null constraint on a type. Used for domains only. |
| typbasetype | oid | pg_type.oid | If this is a domain (see typtype), then typbasetype identifies the type that this one is based on. Zero if this type is not a domain. |
| typtypmod | int4 | | Domains use typtypmod to record the typmod to be applied to their base type (-1 if base type does not use a typmod). -1 if this type is not a domain. |
| typndims | int4 | | typndims is the number of array dimensions for a domain over an array. Zero for types other than domains over array types. |
| typcollation | oid | pg_collation.oid | typcollation specifies the collation of the type. Zero if the type does not support collations. |
| typdefaultbin | pg_node_tree | | If typdefaultbin is not null, it is the nodeToString() representation of a default expression for the type. This is only used for domains. |
| typdefault | text | | typdefault is null if the type has no associated default value; otherwise it contains a human-readable version of the default expression. |
| typacl | aclitem[] | | Access privileges |
The BRIN interface has a high level of abstraction, requiring the access method implementer only to implement the semantics of the data type being accessed. The BRIN layer itself takes care of concurrency, logging and searching the index structure.
All it takes to get a BRIN access method working is to implement a few user-defined methods, which define the behavior of summary values stored in the index and the way they interact with scan keys. In short, BRIN combines extensibility with generality, code reuse, and a clean interface.
There are four methods that an operator class for BRIN must provide:

BrinOpcInfo *opcInfo(Oid type_oid)
Returns internal information about the indexed columns' summary data. The return value must point to a palloc'd BrinOpcInfo, which has this definition:
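The definition, as it appears in the PostgreSQL source tree (src/include/access/brin_internal.h):

```c
typedef struct BrinOpcInfo
{
    /* Number of columns stored in an index column of this opclass */
    uint16      oi_nstored;

    /* Opaque pointer for the opclass' private use */
    void       *oi_opaque;

    /* Type cache entries of the stored columns */
    TypeCacheEntry *oi_typcache[FLEXIBLE_ARRAY_MEMBER];
} BrinOpcInfo;
```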
BrinOpcInfo.oi_opaque can be used by the operator class routines to pass information between support procedures during an index scan.

bool consistent(BrinDesc *bdesc, BrinValues *column, ScanKey key)
Returns whether the ScanKey is consistent with the given indexed values for a range. The attribute number to use is passed as part of the scan key.

bool addValue(BrinDesc *bdesc, BrinValues *column, Datum newval, bool isnull)
Given an index tuple and an indexed value, modifies the indicated attribute of the tuple so that it additionally represents the new value. If any modification was done to the tuple, true is returned.

bool unionTuples(BrinDesc *bdesc, BrinValues *a, BrinValues *b)
Consolidates two index tuples. Given two index tuples, modifies the indicated attribute of the first of them so that it represents both tuples. The second tuple is not modified.
The core distribution includes support for two types of operator classes: minmax and inclusion. Operator class definitions using them are shipped for in-core data types as appropriate. Additional operator classes can be defined by the user for other data types using equivalent definitions, without having to write any source code; appropriate catalog entries being declared is enough. Note that assumptions about the semantics of operator strategies are embedded in the support procedures' source code.
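As a sketch of what such catalog entries look like, a minmax operator class for a hypothetical type mytype (which must already have the usual comparison operators) could be declared like this; the class name and type are illustrative:

```sql
-- Illustrative only: mytype and mytype_minmax_ops are hypothetical names.
CREATE OPERATOR CLASS mytype_minmax_ops
DEFAULT FOR TYPE mytype USING brin AS
    OPERATOR 1 <,
    OPERATOR 2 <=,
    OPERATOR 3 =,
    OPERATOR 4 >=,
    OPERATOR 5 >,
    -- Reuse the generic minmax support procedures shipped in core:
    FUNCTION 1 brin_minmax_opcinfo(internal),
    FUNCTION 2 brin_minmax_add_value(internal, internal, internal, internal),
    FUNCTION 3 brin_minmax_consistent(internal, internal, internal),
    FUNCTION 4 brin_minmax_union(internal, internal, internal);
```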
Operator classes that implement completely different semantics are also possible, provided implementations of the four main support procedures described above are written. Note that backwards compatibility across major releases is not guaranteed: for example, additional support procedures might be required in later releases.
To write an operator class for a data type that implements a totally ordered set, it is possible to use the minmax support procedures alongside the corresponding operators, as shown in Table 65.2. All operator class members (procedures and operators) are mandatory.
Table 65.2. Procedure and Support Numbers for Minmax Operator Classes
To write an operator class for a complex data type which has values included within another type, it's possible to use the inclusion support procedures alongside the corresponding operators, as shown in Table 65.3. It requires only a single additional function, which can be written in any language. More functions can be defined for additional functionality. All operators are optional. Some operators require other operators, as shown as dependencies on the table.
Table 65.3. Procedure and Support Numbers for Inclusion Operator Classes
Support procedure numbers 1-10 are reserved for the BRIN internal functions, so the SQL level functions start with number 11. Support function number 11 is the main function required to build the index. It should accept two arguments with the same data type as the operator class, and return the union of them. The inclusion operator class can store union values with different data types if it is defined with the STORAGE parameter. The return value of the union function should match the STORAGE data type.
Support procedure numbers 12 and 14 are provided to support irregularities of built-in data types. Procedure number 12 is used to support network addresses from different families which are not mergeable. Procedure number 14 is used to support empty ranges. Procedure number 13 is an optional but recommended one, which allows the new value to be checked before it is passed to the union function. As the BRIN framework can shortcut some operations when the union is not changed, using this function can improve index performance.
Both minmax and inclusion operator classes support cross-data-type operators, though with these the dependencies become more complicated. The minmax operator class requires a full set of operators to be defined with both arguments having the same data type. It allows additional data types to be supported by defining extra sets of operators. Inclusion operator class operator strategies are dependent on another operator strategy as shown in Table 65.3, or the same operator strategy as themselves. They require the dependency operator to be defined with the STORAGE data type as the left-hand-side argument and the other supported data type to be the right-hand-side argument of the supported operator. See float4_minmax_ops as an example of minmax, and box_inclusion_ops as an example of inclusion.
| Name | Type | References | Description |
|---|---|---|---|
| oid | oid | | Row identifier (hidden attribute; must be explicitly selected) |
| nspname | name | | Name of the namespace |
| nspowner | oid | pg_authid.oid | Owner of the namespace |
| nspacl | aclitem[] | | Access privileges |
| Name | Type | References | Description |
|---|---|---|---|
| stxrelid | oid | pg_class.oid | The table containing the columns described by this object |
| stxname | name | | Name of the statistics object |
| stxnamespace | oid | pg_namespace.oid | The OID of the namespace that contains this statistics object |
| stxowner | oid | pg_authid.oid | Owner of the statistics object |
| stxkeys | int2vector | pg_attribute.attnum | An array of attribute numbers, indicating which table columns are covered by this statistics object; for example, a value of 1 3 would mean that the first and the third table columns are covered |
| stxkind | char[] | | An array containing codes for the enabled statistic types; valid values are: d for n-distinct statistics, f for functional dependency statistics |
| stxndistinct | pg_ndistinct | | N-distinct counts, serialized as pg_ndistinct type |
| stxdependencies | pg_dependencies | | Functional dependency statistics, serialized as pg_dependencies type |
| Name | Type | References | Description |
|---|---|---|---|
| local_id | oid | pg_replication_origin.roident | internal node identifier |
| external_id | text | pg_replication_origin.roname | external node identifier |
| remote_lsn | pg_lsn | | The origin node's LSN up to which data has been replicated. |
| local_lsn | pg_lsn | | This node's LSN at which remote_lsn has been replicated. |
| Name | Type | References | Description |
|---|---|---|---|
| schemaname | name | pg_namespace.nspname | Name of schema containing table |
| tablename | name | pg_class.relname | Name of table |
| attname | name | pg_attribute.attname | Name of the column described by this row |
| inherited | bool | | If true, this row includes inheritance child columns, not just the values in the specified table |
| null_frac | real | | Fraction of column entries that are null |
| avg_width | integer | | Average width in bytes of column's entries |
| n_distinct | real | | If greater than zero, the estimated number of distinct values in the column. If less than zero, the negative of the number of distinct values divided by the number of rows. (The negated form is used when ANALYZE believes that the number of distinct values is likely to increase as the table grows.) |
| most_common_vals | anyarray | | A list of the most common values in the column. (Null if no values seem to be more common than any others.) |
| most_common_freqs | real[] | | A list of the frequencies of the most common values, i.e., number of occurrences of each divided by total number of rows. (Null when most_common_vals is.) |
| histogram_bounds | anyarray | | A list of values that divide the column's values into groups of approximately equal population. The values in most_common_vals, if present, are omitted from this histogram calculation. (Null if the column data type does not have a < operator or if the most_common_vals list accounts for the entire population.) |
| correlation | real | | Statistical correlation between physical row ordering and logical ordering of the column values. This ranges from -1 to +1. When the value is near -1 or +1, an index scan on the column will be estimated to be cheaper than when it is near zero, due to reduction of random access to the disk. (This column is null if the column data type does not have a < operator.) |
| most_common_elems | anyarray | | A list of non-null element values most often appearing within values of the column. (Null for scalar types.) |
| most_common_elem_freqs | real[] | | A list of the frequencies of the most common element values, i.e., the fraction of rows containing at least one instance of the given value. Two or three additional values follow the per-element frequencies; these are the minimum and maximum of the preceding per-element frequencies, and optionally the frequency of null elements. (Null when most_common_elems is.) |
| elem_count_histogram | real[] | | A histogram of the counts of distinct non-null element values within the values of the column, followed by the average number of distinct non-null elements. (Null for scalar types.) |
| Operator class member | Object |
|---|---|
| Support Procedure 1 | internal function brin_minmax_opcinfo() |
| Support Procedure 2 | internal function brin_minmax_add_value() |
| Support Procedure 3 | internal function brin_minmax_consistent() |
| Support Procedure 4 | internal function brin_minmax_union() |
| Operator Strategy 1 | operator less-than |
| Operator Strategy 2 | operator less-than-or-equal-to |
| Operator Strategy 3 | operator equal-to |
| Operator Strategy 4 | operator greater-than-or-equal-to |
| Operator Strategy 5 | operator greater-than |
| Operator class member | Object | Dependency |
|---|---|---|
| Support Procedure 1 | internal function brin_inclusion_opcinfo() | |
| Support Procedure 2 | internal function brin_inclusion_add_value() | |
| Support Procedure 3 | internal function brin_inclusion_consistent() | |
| Support Procedure 4 | internal function brin_inclusion_union() | |
| Support Procedure 11 | function to merge two elements | |
| Support Procedure 12 | optional function to check whether two elements are mergeable | |
| Support Procedure 13 | optional function to check if an element is contained within another | |
| Support Procedure 14 | optional function to check whether an element is empty | |
| Operator Strategy 1 | operator left-of | Operator Strategy 4 |
| Operator Strategy 2 | operator does-not-extend-to-the-right-of | Operator Strategy 5 |
| Operator Strategy 3 | operator overlaps | |
| Operator Strategy 4 | operator does-not-extend-to-the-left-of | Operator Strategy 1 |
| Operator Strategy 5 | operator right-of | Operator Strategy 2 |
| Operator Strategy 6, 18 | operator same-as-or-equal-to | Operator Strategy 7 |
| Operator Strategy 7, 13, 16, 24, 25 | operator contains-or-equal-to | |
| Operator Strategy 8, 14, 26, 27 | operator is-contained-by-or-equal-to | Operator Strategy 3 |
| Operator Strategy 9 | operator does-not-extend-above | Operator Strategy 11 |
| Operator Strategy 10 | operator is-below | Operator Strategy 12 |
| Operator Strategy 11 | operator is-above | Operator Strategy 9 |
| Operator Strategy 12 | operator does-not-extend-below | Operator Strategy 10 |
| Operator Strategy 20 | operator less-than | Operator Strategy 5 |
| Operator Strategy 21 | operator less-than-or-equal-to | Operator Strategy 5 |
| Operator Strategy 22 | operator greater-than | Operator Strategy 1 |
| Operator Strategy 23 | operator greater-than-or-equal-to | Operator Strategy 1 |