Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
PostgreSQL is implemented using a simple “process per user” client/server model. In this model there is one client process connected to exactly one server process. As we do not know ahead of time how many connections will be made, we have to use a master process that spawns a new server process every time a connection is requested. This master process is called postgres
and listens at a specified TCP/IP port for incoming connections. Whenever a request for a connection is detected the postgres
process spawns a new server process. The server tasks communicate with each other using semaphores and shared memory to ensure data integrity throughout concurrent data access.
The client process can be any program that understands the PostgreSQL protocol described in Chapter 52. Many clients are based on the C-language library libpq, but several independent implementations of the protocol exist, such as the Java JDBC driver.
Once a connection is established the client process can send a query to the backend (server). The query is transmitted using plain text, i.e., there is no parsing done in the frontend (client). The server parses the query, creates an execution plan, executes the plan and returns the retrieved rows to the client by transmitting them over the established connection.
本部分包含可能用於 PostgreSQL 研發人員的各種內容。
The parser stage consists of two parts:
The parser defined in gram.y
and scan.l
is built using the Unix tools bison and flex.
The transformation process does modifications and augmentations to the data structures returned by the parser.
The parser has to check the query string (which arrives as plain text) for valid syntax. If the syntax is correct a parse tree is built up and handed back; otherwise an error is returned. The parser and lexer are implemented using the well-known Unix tools bison and flex.
The lexer is defined in the file scan.l
and is responsible for recognizing identifiers, the SQL key words etc. For every key word or identifier that is found, a token is generated and handed to the parser.
The parser is defined in the file gram.y
and consists of a set of grammar rules and actions that are executed whenever a rule is fired. The code of the actions (which is actually C code) is used to build up the parse tree.
The file scan.l
is transformed to the C source file scan.c
using the program flex and gram.y
is transformed to gram.c
using bison. After these transformations have taken place a normal C compiler can be used to create the parser. Never make any changes to the generated C files as they will be overwritten the next time flex or bison is called.
The mentioned transformations and compilations are normally done automatically using the makefiles shipped with the PostgreSQL source distribution.
A detailed description of bison or the grammar rules given in gram.y
would be beyond the scope of this paper. There are many books and documents dealing with flex and bison. You should be familiar with bison before you start to study the grammar given in gram.y
otherwise you won't understand what happens there.
The parser stage creates a parse tree using only fixed rules about the syntactic structure of SQL. It does not make any lookups in the system catalogs, so there is no possibility to understand the detailed semantics of the requested operations. After the parser completes, the transformation process takes the tree handed back by the parser as input and does the semantic interpretation needed to understand which tables, functions, and operators are referenced by the query. The data structure that is built to represent this information is called the query tree.
The reason for separating raw parsing from semantic analysis is that system catalog lookups can only be done within a transaction, and we do not wish to start a transaction immediately upon receiving a query string. The raw parsing stage is sufficient to identify the transaction control commands (BEGIN
, ROLLBACK
, etc), and these can then be correctly executed without any further analysis. Once we know that we are dealing with an actual query (such as SELECT
or UPDATE
), it is okay to start a transaction if we're not already in one. Only then can the transformation process be invoked.
The query tree created by the transformation process is structurally similar to the raw parse tree in most places, but it has many differences in detail. For example, a FuncCall
node in the parse tree represents something that looks syntactically like a function call. This might be transformed to either a FuncExpr
or Aggref
node depending on whether the referenced name turns out to be an ordinary function or an aggregate function. Also, information about the actual data types of columns and expression results is added to the query tree.
本章概述了 PostgreSQL 的內部架構。閱讀以下各節後,你應該可以了解一個查詢是如何被處理的。本章的目的不是詳細描述 PostgreSQL 的內部操作,因為那樣的說明太過於詳盡。本章旨在幫助讀者理解資料庫後端內部發生的一些操作程序,從接收查詢的開始到將結果回傳給用戶端之間所發生的事。
PostgreSQL supports a powerful rule system for the specification of views and ambiguous view updates. Originally the PostgreSQL rule system consisted of two implementations:
The first one worked using row level processing and was implemented deep in the executor. The rule system was called whenever an individual row had been accessed. This implementation was removed in 1995 when the last official release of the Berkeley Postgres project was transformed into Postgres95.
The second implementation of the rule system is a technique called query rewriting. The rewrite system is a module that exists between the parser stage and the planner/optimizer. This technique is still implemented.
The query rewriter is discussed in some detail in Chapter 40, so there is no need to cover it here. We will only point out that both the input and the output of the rewriter are query trees, that is, there is no change in the representation or level of semantic detail in the trees. Rewriting can be thought of as a form of macro expansion.
在這裡,我們簡單概述查詢必須經過某些階段才能得到結果。
必須建立從應用程式到 PostgreSQL 伺服器的連線。應用程式將查詢語句發送到伺服器,並等待接收伺服器送回的結果。
查詢解析程序(parser)階段檢查應用程式所發送的查詢語句是否語法正確,並建立查詢樹(query tree)。
查詢改寫(rewrite)系統採用查詢解析程序階段所建立的查詢樹,查詢要使用於查詢樹的所有規則(記錄在系統目錄中)。它執行規則內容所敘述的改寫轉換方式。 改寫系統的其中一種用途是檢視表(view)的實作。每當針對檢視表(或虛擬資料表)進行查詢時,改寫系統都會將使用者的查詢覆寫為存取檢視表定義中所宣告的基本資料表查詢。
計劃程序或最佳化程序採用已改寫的查詢樹並建立一個查詢計劃,該計劃將作為執行程序的輸入。
為此,首先建立所有可能產生相同結果的查詢路徑。例如,如果要掃描的關連上有索引,則有兩條掃描路徑。一種可能性是簡單的循序掃描,另一種可能性是使用索引。接下來,估計每條路徑的執行成本,並選擇最便宜的路徑。最便宜的路徑將被產生用於執行者可以使用的完整計劃。
執行程序以遞迴方式走遍計劃樹並以計劃規劃的方式檢索資料。執行程序利用儲存系統掃描資料關連、執行排序和集合、過濾條件,在最後回傳所產生的資料內容。
在以下各節中,我們將更詳細地介紹上述每個項目,以更好地理解 PostgreSQL 的內部控制和資料結構。
Name
Type
Description
name
name
Extension name
default_version
text
預設版本的名稱;如果未指定,則為 NULL
installed_version
text
目前安裝的延伸功能版本,如果未安裝,則為 NULL
comment
text
延伸功能的控制檔案中的註解文字內容
The catalog pg_am
stores information about relation access methods. There is one row for each access method supported by the system. Currently, only indexes have access methods. The requirements for index access methods are discussed in detail in Chapter 60.
Table 51.3. pg_am
Columns
Name
Type
References
Description
oid
oid
Row identifier (hidden attribute; must be explicitly selected)
amname
name
Name of the access method
amhandler
regproc
.oid
OID of a handler function that is responsible for supplying information about the access method
amtype
char
Currently always i
to indicate an index access method; other values may be allowed in future
Before PostgreSQL 9.6, pg_am
contained many additional columns representing properties of index access methods. That data is now only directly visible at the C code level. However,pg_index_column_has_property()
and related functions have been added to allow SQL queries to inspect index access method properties; see Table 9.63.
The executor takes the plan created by the planner/optimizer and recursively processes it to extract the required set of rows. This is essentially a demand-pull pipeline mechanism. Each time a plan node is called, it must deliver one more row, or report that it is done delivering rows.
To provide a concrete example, assume that the top node is a MergeJoin
node. Before any merge can be done two rows have to be fetched (one from each subplan). So the executor recursively calls itself to process the subplans (it starts with the subplan attached to lefttree
). The new top node (the top node of the left subplan) is, let's say, a Sort
node and again recursion is needed to obtain an input row. The child node of the Sort
might be a SeqScan
node, representing actual reading of a table. Execution of this node causes the executor to fetch a row from the table and return it up to the calling node. The Sort
node will repeatedly call its child to obtain all the rows to be sorted. When the input is exhausted (as indicated by the child node returning a NULL instead of a row), the Sort
code performs the sort, and finally is able to return its first output row, namely the first one in sorted order. It keeps the remaining rows stored so that it can deliver them in sorted order in response to later demands.
The MergeJoin
node similarly demands the first row from its right subplan. Then it compares the two rows to see if they can be joined; if so, it returns a join row to its caller. On the next call, or immediately if it cannot join the current pair of inputs, it advances to the next row of one table or the other (depending on how the comparison came out), and again checks for a match. Eventually, one subplan or the other is exhausted, and the MergeJoin
node returns NULL to indicate that no more join rows can be formed.
Complex queries can involve many levels of plan nodes, but the general approach is the same: each node computes and returns its next output row each time it is called. Each node is also responsible for applying any selection or projection expressions that were assigned to it by the planner.
The executor mechanism is used to evaluate all four basic SQL query types: SELECT
, INSERT
, UPDATE
, and DELETE
. For SELECT
, the top-level executor code only needs to send each row returned by the query plan tree off to the client. For INSERT
, each returned row is inserted into the target table specified for the INSERT
. This is done in a special top-level plan node called ModifyTable
. (A simple INSERT ... VALUES
command creates a trivial plan tree consisting of a single Result
node, which computes just one result row, and ModifyTable
above it to perform the insertion. But INSERT ... SELECT
can demand the full power of the executor mechanism.) For UPDATE
, the planner arranges that each computed row includes all the updated column values, plus the TID (tuple ID, or row ID) of the original target row; this data is fed into a ModifyTable
node, which uses the information to create a new updated row and mark the old row deleted. For DELETE
, the only column that is actually returned by the plan is the TID, and the ModifyTable
node simply uses the TID to visit each target row and mark it deleted.
系統目錄(system catalog)是記錄資料庫管理系統儲存結構原始資料的地方,例如關於資料表和欄位的訊息以及內部日誌記錄訊息。PostgreSQL 的系統目錄是一般的資料表。您可以刪除並重新建立資料表、增加欄位、插入和更新內容,並以這種方式嚴重混淆您的系統。當然,通常情況下,不應該手動更改系統目錄,通常有 SQL 命令來執行此操作。(例如,CREATE DATABASE 向 pg_database 系統目錄插入一行 - 實際上是在磁碟上建立數據庫)。對於特別深奧的操作有一些例外,但其中很多已經隨著時間的推移而變為 SQL 命令,因此需要系統目錄的直接操作正在不斷減少。
目錄 pg_auth_members 顯示角色之間的成員資格關連。允許任何非循環的關連。
由於角色身份是叢集範圍的,因此 pg_auth_members 在叢集的所有資料庫之間共享:每個叢集只有一個 pg_auth_members 副本,而不是每個資料庫一個副本。
Table 51.9. pg_auth_members
Columns
Name
Type
References
Description
roleid
oid
``.oid
具有成員的角色 ID
member
oid
.oid
作為 roleid 成員的角色 ID
grantor
oid
``.oid
授予此成員資格的角色 ID
admin_option
bool
如果成員可以將 roleid 的成員資格授予其他人,則為 True
The catalog pg_attribute
stores information about table columns. There will be exactly one pg_attribute
row for every column in every table in the database. (There will also be attribute entries for indexes, and indeed all objects that have pg_class
entries.)
The term attribute is equivalent to column and is used for historical reasons.
pg_attribute
ColumnsName
Type
References
Description
attrelid
oid
The table this column belongs to
attname
name
The column name
atttypid
oid
The data type of this column
attstattarget
int4
attlen
int2
A copy of pg_type.typlen
of this column's type
attnum
int2
The number of the column. Ordinary columns are numbered from 1 up. System columns, such as oid
, have (arbitrary) negative numbers.
attndims
int4
Number of dimensions, if the column is an array type; otherwise 0. (Presently, the number of dimensions of an array is not enforced, so any nonzero value effectively means “it's an array”.)
attcacheoff
int4
Always -1 in storage, but when loaded into a row descriptor in memory this might be updated to cache the offset of the attribute within the row
atttypmod
int4
atttypmod
records type-specific data supplied at table creation time (for example, the maximum length of a varchar
column). It is passed to type-specific input functions and length coercion functions. The value will generally be -1 for types that do not need atttypmod
.
attbyval
bool
A copy of pg_type.typbyval
of this column's type
attstorage
char
Normally a copy of pg_type.typstorage
of this column's type. For TOAST-able data types, this can be altered after column creation to control storage policy.
attalign
char
A copy of pg_type.typalign
of this column's type
attnotnull
bool
This represents a not-null constraint.
atthasdef
bool
This column has a default value, in which case there will be a corresponding entry in the pg_attrdef
catalog that actually defines the value.
attidentity
char
If a zero byte (''
), then not an identity column. Otherwise, a
= generated always, d
= generated by default.
attisdropped
bool
This column has been dropped and is no longer valid. A dropped column is still physically present in the table, but is ignored by the parser and so cannot be accessed via SQL.
attislocal
bool
This column is defined locally in the relation. Note that a column can be locally defined and inherited simultaneously.
attinhcount
int4
The number of direct ancestors this column has. A column with a nonzero number of ancestors cannot be dropped nor renamed.
attcollation
oid
The defined collation of the column, or zero if the column is not of a collatable data type.
attacl
aclitem[]
Column-level access privileges, if any have been granted specifically on this column
attoptions
text[]
Attribute-level options, as “keyword=value” strings
attfdwoptions
text[]
Attribute-level foreign data wrapper options, as “keyword=value” strings
In a dropped column's pg_attribute
entry, atttypid
is reset to zero, but attlen
and the other fields copied from pg_type
are still valid. This arrangement is needed to cope with the situation where the dropped column's data type was later dropped, and so there is no pg_type
row anymore. attlen
and the other fields can be used to interpret the contents of a row of the table.
.oid
.oid
attstattarget
controls the level of detail of statistics accumulated for this column by . A zero value indicates that no statistics should be collected. A negative value says to use the system default statistics target. The exact meaning of positive values is data type-dependent. For scalar data types, attstattarget
is both the target number of “most common values” to collect, and the target number of histogram bins to create.
.oid
The catalog pg_authid
contains information about database authorization identifiers (roles). A role subsumes the concepts of “users” and “groups”. A user is essentially just a role with the rolcanlogin
flag set. Any role (with or without rolcanlogin
) can have other roles as members; see pg_auth_members
.
Since this catalog contains passwords, it must not be publicly readable. pg_roles
is a publicly readable view on pg_authid
that blanks out the password field.
Chapter 21 contains detailed information about user and privilege management.
Because user identities are cluster-wide, pg_authid
is shared across all databases of a cluster: there is only one copy of pg_authid
per cluster, not one per database.
pg_authid
ColumnsColumn Type
Description
oid
oid
Row identifier
rolname
name
Role name
rolsuper
bool
Role has superuser privileges
rolinherit
bool
Role automatically inherits privileges of roles it is a member of
rolcreaterole
bool
Role can create more roles
rolcreatedb
bool
Role can create databases
rolcanlogin
bool
Role can log in. That is, this role can be given as the initial session authorization identifier
rolreplication
bool
Role is a replication role. A replication role can initiate replication connections and create and drop replication slots.
rolbypassrls
bool
rolconnlimit
int4
For roles that can log in, this sets maximum number of concurrent connections this role can make. -1 means no limit.
rolpassword
text
Password (possibly encrypted); null if none. The format depends on the form of encryption used.
rolvaliduntil
timestamptz
Password expiry time (only used for password authentication); null if no expiration
For an MD5 encrypted password, rolpassword
column will begin with the string md5
followed by a 32-character hexadecimal MD5 hash. The MD5 hash will be of the user's password concatenated to their user name. For example, if user joe
has password xyzzy
, PostgreSQL will store the md5 hash of xyzzyjoe
.
If the password is encrypted with SCRAM-SHA-256, it has the format:
where salt
, StoredKey
and ServerKey
are in Base64 encoded format. This format is the same as that specified by RFC 5803.
A password that does not follow either of those formats is assumed to be unencrypted.
The catalog pg_class
catalogs tables and most everything else that has columns or is otherwise similar to a table. This includes indexes (but see also pg_index
), sequences (but see also pg_sequence
), views, materialized views, composite types, and TOAST tables; see relkind
. Below, when we mean all of these kinds of objects we speak of “relations”. Not all columns are meaningful for all relation types.
pg_class
ColumnsColumn Type
Description
oid
oid
Row identifier
relname
name
Name of the table, index, view, etc.
The OID of the namespace that contains this relation
The OID of the data type that corresponds to this table's row type, if any (zero for indexes, which have no pg_type
entry)
For typed tables, the OID of the underlying composite type, zero for all other relations
Owner of the relation
If this is a table or an index, the access method used (heap, B-tree, hash, etc.)
relfilenode
oid
Name of the on-disk file of this relation; zero means this is a “mapped” relation whose disk file name is determined by low-level state
The tablespace in which this relation is stored. If zero, the database's default tablespace is implied. (Not meaningful if the relation has no on-disk file.)
relpages
int4
Size of the on-disk representation of this table in pages (of size BLCKSZ
). This is only an estimate used by the planner. It is updated by VACUUM
, ANALYZE
, and a few DDL commands such as CREATE INDEX
.
reltuples
float4
Number of live rows in the table. This is only an estimate used by the planner. It is updated by VACUUM
, ANALYZE
, and a few DDL commands such as CREATE INDEX
.
relallvisible
int4
Number of pages that are marked all-visible in the table's visibility map. This is only an estimate used by the planner. It is updated by VACUUM
, ANALYZE
, and a few DDL commands such as CREATE INDEX
.
OID of the TOAST table associated with this table, 0 if none. The TOAST table stores large attributes “out of line” in a secondary table.
relhasindex
bool
True if this is a table and it has (or recently had) any indexes
relisshared
bool
True if this table is shared across all databases in the cluster. Only certain system catalogs (such as pg_database
) are shared.
relpersistence
char
p
= permanent table, u
= unlogged table, t
= temporary table
relkind
char
r
= ordinary table, i
= index, S
= sequence, t
= TOAST table, v
= view, m
= materialized view, c
= composite type, f
= foreign table, p
= partitioned table, I
= partitioned index
relnatts
int2
Number of user columns in the relation (system columns not counted). There must be this many corresponding entries in pg_attribute
. See also pg_attribute.attnum
.
relchecks
int2
relhasrules
bool
relhastriggers
bool
relhassubclass
bool
True if table or index has (or once had) any inheritance children
relrowsecurity
bool
relforcerowsecurity
bool
relispopulated
bool
True if relation is populated (this is true for all relations other than some materialized views)
relreplident
char
Columns used to form “replica identity” for rows: d
= default (primary key, if any), n
= nothing, f
= all columns, i
= index with indisreplident
set (same as nothing if the index used has been dropped)
relispartition
bool
True if table or index is a partition
For new relations being written during a DDL operation that requires a table rewrite, this contains the OID of the original relation; otherwise 0. That state is only visible internally; this field should never contain anything other than 0 for a user-visible relation.
relfrozenxid
xid
All transaction IDs before this one have been replaced with a permanent (“frozen”) transaction ID in this table. This is used to track whether the table needs to be vacuumed in order to prevent transaction ID wraparound or to allow pg_xact
to be shrunk. Zero (InvalidTransactionId
) if the relation is not a table.
relminmxid
xid
All multixact IDs before this one have been replaced by a transaction ID in this table. This is used to track whether the table needs to be vacuumed in order to prevent multixact ID wraparound or to allow pg_multixact
to be shrunk. Zero (InvalidMultiXactId
) if the relation is not a table.
relacl
aclitem[]
reloptions
text[]
Access-method-specific options, as “keyword=value” strings
relpartbound
pg_node_tree
If table is a partition (see relispartition
), internal representation of the partition bound
Several of the Boolean flags in pg_class
are maintained lazily: they are guaranteed to be true if that's the correct state, but may not be reset to false immediately when the condition is no longer true. For example, relhasindex
is set by CREATE INDEX
, but it is never cleared by DROP INDEX
. Instead, VACUUM
clears relhasindex
if it finds the table has no indexes. This arrangement avoids race conditions and improves concurrency.
The catalog pg_constraint
stores check, primary key, unique, foreign key, and exclusion constraints on tables. (Column constraints are not treated specially. Every column constraint is equivalent to some table constraint.) Not-null constraints are represented in the pg_attribute
catalog, not here.
User-defined constraint triggers (created with CREATE CONSTRAINT TRIGGER
) also give rise to an entry in this table.
Check constraints on domains are stored here, too.
Table 51.13. pg_constraint
Columns
Name
Type
References
Description
oid
oid
Row identifier (hidden attribute; must be explicitly selected)
conname
name
Constraint name (not necessarily unique!)
connamespace
oid
The OID of the namespace that contains this constraint
contype
char
c
= check constraint, f
= foreign key constraint, p
= primary key constraint, u
= unique constraint, t
= constraint trigger, x
= exclusion constraint
condeferrable
bool
Is the constraint deferrable?
condeferred
bool
Is the constraint deferred by default?
convalidated
bool
Has the constraint been validated? Currently, can only be false for foreign keys and CHECK constraints
conrelid
oid
The table this constraint is on; 0 if not a table constraint
contypid
oid
The domain this constraint is on; 0 if not a domain constraint
conindid
oid
The index supporting this constraint, if it's a unique, primary key, foreign key, or exclusion constraint; else 0
confrelid
oid
If a foreign key, the referenced table; else 0
confupdtype
char
Foreign key update action code: a
= no action, r
= restrict, c
= cascade, n
= set null, d
= set default
confdeltype
char
Foreign key deletion action code: a
= no action, r
= restrict, c
= cascade, n
= set null, d
= set default
confmatchtype
char
Foreign key match type: f
= full, p
= partial, s
= simple
conislocal
bool
This constraint is defined locally for the relation. Note that a constraint can be locally defined and inherited simultaneously.
coninhcount
int4
The number of direct inheritance ancestors this constraint has. A constraint with a nonzero number of ancestors cannot be dropped nor renamed.
connoinherit
bool
This constraint is defined locally for the relation. It is a non-inheritable constraint.
conkey
int2[]
If a table constraint (including foreign keys, but not constraint triggers), list of the constrained columns
confkey
int2[]
If a foreign key, list of the referenced columns
conpfeqop
oid[]
If a foreign key, list of the equality operators for PK = FK comparisons
conppeqop
oid[]
If a foreign key, list of the equality operators for PK = PK comparisons
conffeqop
oid[]
If a foreign key, list of the equality operators for FK = FK comparisons
conexclop
oid[]
If an exclusion constraint, list of the per-column exclusion operators
conbin
pg_node_tree
If a check constraint, an internal representation of the expression
consrc
text
If a check constraint, a human-readable representation of the expression
In the case of an exclusion constraint, conkey
is only useful for constraint elements that are simple column references. For other cases, a zero appears in conkey
and the associated index must be consulted to discover the expression that is constrained. (conkey
thus has the same contents as pg_index
.indkey
for the index.)
consrc
is not updated when referenced objects change; for example, it won't track renaming of columns. Rather than relying on this field, it's best to use pg_get_constraintdef()
to extract the definition of a check constraint.
pg_class.relchecks
needs to agree with the number of check-constraint entries found in this table for each relation.
系統目錄 pg_cast 儲存資料型別的轉換方式,包括了內建的和使用者定義的。
需要注意的是,pg_cast 並不代表系統知道如何執行的每一種型別轉換;只有那些不能從某些通用規則中推導出來的。例如,domain 和它的基本型別之間的轉換在 pg_cast 中就沒有明確表示。另一個重要的例外是「自動 I/O 強制轉換」,即使用資料型別自己的 I/O 函數執行的轉換為 text 或其他字串型別或從 text 及其他字串型別轉換的那些,在 pg_cast 中沒有明確表示。
pg_cast
ColumnsColumn Type
Description
oid
oid
Row identifier
castsource
oid
(references .oid
)
OID of the source data type
casttarget
oid
(references .oid
)
OID of the target data type
castfunc
oid
(references .oid
)
The OID of the function to use to perform this cast. Zero is stored if the cast method doesn't require a function.
castcontext
char
Indicates what contexts the cast can be invoked in. e
means only as an explicit cast (using CAST
or ::
syntax). a
means implicitly in assignment to a target column, as well as explicitly. i
means implicitly in expressions, as well as the other cases.
castmethod
char
Indicates how the cast is performed. f
means that the function specified in the castfunc
field is used. i
means that the input/output functions are used. b
means that the types are binary-coercible, thus no conversion is required.
The cast functions listed in pg_cast
must always take the cast source type as their first argument type, and return the cast destination type as their result type. A cast function can have up to three arguments. The second argument, if present, must be type integer
; it receives the type modifier associated with the destination type, or -1 if there is none. The third argument, if present, must be type boolean
; it receives true
if the cast is an explicit cast, false
otherwise.
It is legitimate to create a pg_cast
entry in which the source and target types are the same, if the associated function takes more than one argument. Such entries represent “length coercion functions” that coerce values of the type to be legal for a particular type modifier value.
When a pg_cast
entry has different source and target types and a function that takes more than one argument, it represents converting from one type to another and applying a length coercion in a single step. When no such entry is available, coercion to a type that uses a type modifier involves two steps, one to convert between data types and a second to apply the modifier.
The catalog pg_collation
describes the available collations, which are essentially mappings from an SQL name to operating system locale categories. See Section 23.2 for more information.
Table 51.12. pg_collation
Columns
Name
Type
References
Description
oid
oid
Row identifier (hidden attribute; must be explicitly selected)
collname
name
Collation name (unique per namespace and encoding)
collnamespace
oid
.oid
The OID of the namespace that contains this collation
collowner
oid
.oid
Owner of the collation
collprovider
char
Provider of the collation: d
= database default, c
= libc, i
= icu
collencoding
int4
Encoding in which the collation is applicable, or -1 if it works for any encoding
collcollate
name
LC_COLLATE
for this collation object
collctype
name
LC_CTYPE
for this collation object
collversion
text
Provider-specific version of the collation. This is recorded when the collation is created and then checked when it is used, to detect changes in the collation definition that could lead to data corruption.
Note that the unique key on this catalog is (collname
, collencoding
, collnamespace
) not just (collname
, collnamespace
). PostgreSQL generally ignores all collations that do not have collencoding
equal to either the current database's encoding or -1, and creation of new entries with the same name as an entry with collencoding
= -1 is forbidden. Therefore it is sufficient to use a qualified SQL name (schema
.name
) to identify a collation, even though this is not unique according to the catalog definition. The reason for defining the catalog this way is that initdb fills it in at cluster initialization time with entries for all locales available on the system, so it must be able to hold entries for all encodings that might ever be used in the cluster.
In the template0
database, it could be useful to create collations whose encoding does not match the database encoding, since they could match the encodings of databases later cloned from template0
. This would currently have to be done manually.
目錄 pg_database 儲存有關資料庫一些可用的訊息。資料庫是使用 CREATE DATABASE 命令建立的。關於某些參數的含義的詳細訊息,請參閱第 22 章。
與大多數系統目錄不同,pg_database 在叢集的所有資料庫之間共享:每個叢集只有一個 pg_database 副本,而不是每個資料庫一個副本。
Table 51.15. pg_database
欄位
名稱
型別
參閱
說明
oid
oid
資料列識別指標(隱藏屬性;必須明確選擇)
datname
name
資料庫名稱
datdba
oid
.oid
資料庫的擁有者,通常是建立它的使用者
encoding
int4
此資料庫的字元編碼(pg_encoding_to_char()可將此數字轉換為編碼名稱)
datcollate
name
這個資料庫的 LC_COLLATE
datctype
name
這個資料庫的 LC_CTYPE
datistemplate
bool
如果為 true,則該資料庫可以由具有 CREATEDB 權限的任何使用者複製;如果為 false,則只有超級使用者或資料庫的擁有者才能複製它。
datallowconn
bool
如果為 false,則沒有人可以連線到該資料庫。這用於保護 template0 資料庫免遭更改。
datconnlimit
int4
設定可以對此資料庫執行的最大同時連線數。-1 意味著沒有限制。
datlastsysoid
oid
資料庫中的最後一個系統 OID;特別適用於 pg_dump
datfrozenxid
xid
在這個事務 ID 之前在此資料庫中的所有事務 ID,已被替換為永久(「 frozen」)。這用於追踪是否需要清理資料庫以防止事務 ID 重覆或允許縮減 pg_xact。它是每個資料表 pg_class.relfrozenxid 的最小值。
datminmxid
xid
此資料庫中的所有 multixact ID 已被替換為該資料庫中的事務 ID。這用於追踪資料庫是否需要清理,以防止 multixact ID 重覆或允許縮減 pg_multixact。它是每個資料表 pg_class.relminmxid 的最小值。
dattablespace
oid
.oid
資料庫預設的資料表空間。在此資料庫中,pg_class.reltablespace 為零的所有資料表都將儲存在此資料表空間中;特別是所有非共享系統目錄都將在那裡。
datacl
aclitem[]
存取權限;詳情請參閱 和
The catalog pg_event_trigger
stores event triggers. See Chapter 39 for more information.
pg_event_trigger
ColumnsColumn Type
Description
oid
oid
Row identifier
evtname
name
Trigger name (must be unique)
evtevent
name
Identifies the event for which this trigger fires
evtowner
oid
(references .oid
)
Owner of the event trigger
evtfoid
oid
(references .oid
)
The function to be called
evtenabled
char
Controls in which modes the event trigger fires. O
= trigger fires in “origin” and “local” modes, D
= trigger is disabled, R
= trigger fires in “replica” mode, A
= trigger fires always.
evttags
text[]
Command tags for which this trigger will fire. If NULL, the firing of this trigger is not restricted on the basis of the command tag.
The task of the planner/optimizer is to create an optimal execution plan. A given SQL query (and hence, a query tree) can be actually executed in a wide variety of different ways, each of which will produce the same set of results. If it is computationally feasible, the query optimizer will examine each of these possible execution plans, ultimately selecting the execution plan that is expected to run the fastest.
In some situations, examining each possible way in which a query can be executed would take an excessive amount of time and memory space. In particular, this occurs when executing queries involving large numbers of join operations. In order to determine a reasonable (not necessarily optimal) query plan in a reasonable amount of time, PostgreSQL uses a Genetic Query Optimizer (see ) when the number of joins exceeds a threshold (see ).
The planner's search procedure actually works with data structures called paths, which are simply cut-down representations of plans containing only as much information as the planner needs to make its decisions. After the cheapest path is determined, a full-fledged plan tree is built to pass to the executor. This represents the desired execution plan in sufficient detail for the executor to run it. In the rest of this section we'll ignore the distinction between paths and plans.
The planner/optimizer starts by generating plans for scanning each individual relation (table) used in the query. The possible plans are determined by the available indexes on each relation. There is always the possibility of performing a sequential scan on a relation, so a sequential scan plan is always created. Assume an index is defined on a relation (for example a B-tree index) and a query contains the restriction relation.attribute OPR constant
. If relation.attribute
happens to match the key of the B-tree index and OPR
is one of the operators listed in the index's operator class, another plan is created using the B-tree index to scan the relation. If there are further indexes present and the restrictions in the query happen to match a key of an index, further plans will be considered. Index scan plans are also generated for indexes that have a sort ordering that can match the query's ORDER BY
clause (if any), or a sort ordering that might be useful for merge joining (see below).
If the query requires joining two or more relations, plans for joining relations are considered after all feasible plans have been found for scanning single relations. The three available join strategies are:
nested loop join: The right relation is scanned once for every row found in the left relation. This strategy is easy to implement but can be very time consuming. (However, if the right relation can be scanned with an index scan, this can be a good strategy. It is possible to use values from the current row of the left relation as keys for the index scan of the right.)
merge join: Each relation is sorted on the join attributes before the join starts. Then the two relations are scanned in parallel, and matching rows are combined to form join rows. This kind of join is more attractive because each relation has to be scanned only once. The required sorting might be achieved either by an explicit sort step, or by scanning the relation in the proper order using an index on the join key.
hash join: the right relation is first scanned and loaded into a hash table, using its join attributes as hash keys. Next the left relation is scanned and the appropriate values of every row found are used as hash keys to locate the matching rows in the table.
When the query involves more than two relations, the final result must be built up by a tree of join steps, each with two inputs. The planner examines different possible join sequences to find the cheapest one.
If the query uses fewer than relations, a near-exhaustive search is conducted to find the best join sequence. The planner preferentially considers joins between any two relations for which there exist a corresponding join clause in the WHERE
qualification (i.e., for which a restriction like where rel1.attr1=rel2.attr2
exists). Join pairs with no join clause are considered only when there is no other choice, that is, a particular relation has no available join clauses to any other relation. All possible plans are generated for every join pair considered by the planner, and the one that is (estimated to be) the cheapest is chosen.
When geqo_threshold
is exceeded, the join sequences considered are determined by heuristics, as described in . Otherwise the process is the same.
The finished plan tree consists of sequential or index scans of the base relations, plus nested-loop, merge, or hash join nodes as needed, plus any auxiliary steps needed, such as sort nodes or aggregate-function calculation nodes. Most of these plan node types have the additional ability to do selection (discarding rows that do not meet a specified Boolean condition) and projection (computation of a derived column set based on given column values, that is, evaluation of scalar expressions where needed). One of the responsibilities of the planner is to attach selection conditions from the WHERE
clause and computation of required output expressions to the most appropriate nodes of the plan tree.
Role bypasses every row level security policy, see for more information.
relnamespace
oid
(references .oid
)
reltype
oid
(references .oid
)
reloftype
oid
(references .oid
)
relowner
oid
(references .oid
)
relam
oid
(references .oid
)
reltablespace
oid
(references .oid
)
reltoastrelid
oid
(references .oid
)
Number of CHECK
constraints on the table; see catalog
True if table has (or once had) rules; see catalog
True if table has (or once had) triggers; see catalog
True if table has row level security enabled; see catalog
True if row level security (when enabled) will also apply to table owner; see catalog
relrewrite
oid
(references .oid
)
Access privileges; see for details
.oid
.oid
.oid
.oid
.oid
.attnum
.attnum
.oid
.oid
.oid
.oid
The catalog pg_index
contains part of the information about indexes. The rest is mostly in pg_class
.
pg_index
ColumnsColumn Type
Description
The OID of the pg_class
entry for this index
The OID of the pg_class
entry for the table this index is for
indnatts
int2
The total number of columns in the index (duplicates pg_class.relnatts
); this number includes both key and included attributes
indnkeyatts
int2
The number of key columns in the index, not counting any included columns, which are merely stored and do not participate in the index semantics
indisunique
bool
If true, this is a unique index
indisprimary
bool
If true, this index represents the primary key of the table (indisunique
should always be true when this is true)
indisexclusion
bool
If true, this index supports an exclusion constraint
indimmediate
bool
If true, the uniqueness check is enforced immediately on insertion (irrelevant if indisunique
is not true)
indisclustered
bool
If true, the table was last clustered on this index
indisvalid
bool
If true, the index is currently valid for queries. False means the index is possibly incomplete: it must still be modified by INSERT
/UPDATE
operations, but it cannot safely be used for queries. If it is unique, the uniqueness property is not guaranteed true either.
indcheckxmin
bool
If true, queries must not use the index until the xmin
of this pg_index
row is below their TransactionXmin
event horizon, because the table may contain broken HOT chains with incompatible rows that they can see
indisready
bool
If true, the index is currently ready for inserts. False means the index must be ignored by INSERT
/UPDATE
operations.
indislive
bool
If false, the index is in process of being dropped, and should be ignored for all purposes (including HOT-safety decisions)
indisreplident
bool
If true this index has been chosen as “replica identity” using ALTER TABLE ... REPLICA IDENTITY USING INDEX ...
This is an array of indnatts
values that indicate which table columns this index indexes. For example a value of 1 3
would mean that the first and the third table columns make up the index entries. Key columns come before non-key (included) columns. A zero in this array indicates that the corresponding index attribute is an expression over the table columns, rather than a simple column reference.
For each column in the index key (indnkeyatts
values), this contains the OID of the collation to use for the index, or zero if the column is not of a collatable data type.
indoption
int2vector
This is an array of indnkeyatts
values that store per-column flag bits. The meaning of the bits is defined by the index's access method.
indexprs
pg_node_tree
Expression trees (in nodeToString()
representation) for index attributes that are not simple column references. This is a list with one element for each zero entry in indkey
. Null if all index attributes are simple references.
indpred
pg_node_tree
Expression tree (in nodeToString()
representation) for partial index predicate. Null if not a partial index.
目錄 pg_extension 儲存有關已安裝延伸功能的資訊。有關延伸功能的詳細資訊,請參閱第 37.17 節。
Table 51.22. pg_extension
Columns
Name
Type
References
Description
oid
oid
資料列指標ID
extname
name
延伸功能名稱
extowner
oid
.oid
延伸功能的擁有者
extnamespace
oid
.oid
延伸功能之中所導出物件的綱要名稱
extrelocatable
bool
如果延伸功能可以接受重新定位到另一個綱要之中,則為 True
extversion
text
延伸功能的版本名稱
extconfig
oid[]
.oid
延伸功能組態資料表的 regclass OID 陣列,如果沒有,則為 NULL
extcondition
text[]
延伸功能組態資料表的 WHERE 子句過濾條件陣列,如果沒有,則為 NULL
請注意,與大多數帶有「namespace」欄位的目錄不同,extnamespace 並不暗指該延伸功能屬於該綱要(schema)。延伸功能並不在任何綱要之中。不過 extnamespace 指示包含大多數或所有延伸功能所屬物件的綱要。如果 extrelocatable 為 true,則此綱要實際上必須包含屬於該延伸功能的所有需要綱要的物件。
The catalog pg_opclass
defines index access method operator classes. Each operator class defines semantics for index columns of a particular data type and a particular index access method. An operator class essentially specifies that a particular operator family is applicable to a particular indexable column data type. The set of operators from the family that are actually usable with the indexed column are whichever ones accept the column's data type as their left-hand input.
Operator classes are described at length in Section 37.14.
Table 51.33. pg_opclass
Columns
Name
Type
References
Description
oid
oid
Row identifier (hidden attribute; must be explicitly selected)
opcmethod
oid
.oid
Index access method operator class is for
opcname
name
Name of this operator class
opcnamespace
oid
.oid
Namespace of this operator class
opcowner
oid
.oid
Owner of the operator class
opcfamily
oid
.oid
Operator family containing the operator class
opcintype
oid
.oid
Data type that the operator class indexes
opcdefault
bool
True if this operator class is the default for opcintype
opckeytype
oid
.oid
Type of data stored in index, or zero if same as opcintype
An operator class's opcmethod
must match the opfmethod
of its containing operator family. Also, there must be no more than one pg_opclass
row having opcdefault
true for any given combination of opcmethod
and opcintype
.
The catalog pg_policy
stores row level security policies for tables. A policy includes the kind of command that it applies to (possibly all commands), the roles that it applies to, the expression to be added as a security-barrier qualification to queries that include the table, and the expression to be added as a WITH CHECK
option for queries that attempt to add new records to the table.
Table 51.38. pg_policy
Columns
Name
Type
References
Description
polname
name
The name of the policy
polrelid
oid
.oid
The table to which the policy applies
polcmd
char
The command type to which the policy is applied: r
for SELECT
, a
for INSERT
, w
for UPDATE
, d
for DELETE
, or *
for all
polpermissive
boolean
Is the policy permissive or restrictive?
polroles
oid[]
.oid
The roles to which the policy is applied
polqual
pg_node_tree
The expression tree to be added to the security barrier qualifications for queries that use the table
polwithcheck
pg_node_tree
The expression tree to be added to the WITH CHECK qualifications for queries that attempt to add rows to the table
Policies stored in pg_policy
are applied only when pg_class
.relrowsecurity
is set for their table.
目錄 pg_language 註冊了可以撰寫函數或 stored procedure 的語言。有關語言處理程序的更多訊息,請參閱 CREATE LANGUAGE 和第 41 章。
pg_language
ColumnsName
Type
References
Description
oid
oid
資料列識別指標(隱藏屬性;必須明確選擇)
lanname
name
語言名稱
lanowner
oid
``.oid
語言的所有者
lanispl
bool
對於內部語言(例如 SQL)而言這是 false 的,對於使用者定義的語言則是 true。目前,pg_dump 仍然使用它來決定需要轉存哪些語言,但將來可能會被不同的機制所取代。
lanpltrusted
bool
如果這是一種受信任的語言,則為 True,這意味著它被認為不會授予對正常 SQL 執行環境之外任何內容的存取權限。只有超級使用者才能以不受信任的語言建立函數。
lanplcallfoid
oid
``.oid
對於非內部語言,這引用了語言處理程序,它是一個特殊的函數,負責執行使用特定語言所編寫的所有函數。
laninline
oid
``.oid
這引用了一個負責執行 “inline” 匿名程式區塊( 區塊)的函數。如果不支援 inline 區塊,則為零。
lanvalidator
oid
``.oid
這引用了一個語言驗證器函數,該函數負責在建立新函數時檢查它們的語法和有效性。如果未提供驗證器,則為零。
lanacl
aclitem[]
存取權限;有關詳細訊息,請參閱 和
The catalog pg_namespace
stores namespaces. A namespace is the structure underlying SQL schemas: each namespace can have a separate collection of relations, types, etc. without name conflicts.
pg_namespace
ColumnsColumn Type
Description
oid
oid
Row identifier
nspname
name
Name of the namespace
nspowner
oid
(references .oid
)
Owner of the namespace
nspacl
aclitem[]
存取權限; 詳見
indexrelid
oid
(references .oid
)
indrelid
oid
(references .oid
)
indkey
int2vector
(references .attnum
)
indcollation
oidvector
(references .oid
)
indclass
oidvector
(references .oid
)
For each column in the index key (indnkeyatts
values), this contains the OID of the operator class to use. See for details.
目錄 pg_proc 儲存有關函數、程序函數、彙總函數和窗函數(或統稱為 routines)的資訊。 有關更多資訊,請參閱 CREATE FUNCTION,CREATE PROCEDURE 和第 37.3 節。
如果 prokind 指示該項目用於彙總函數,則 pg_aggregate 中應有相對應的資料列。
pg_proc
ColumnsName
Type
References
Description
oid
oid
Row identifier (hidden attribute; must be explicitly selected)
proname
name
Name of the function
pronamespace
oid
The OID of the namespace that contains this function
proowner
oid
Owner of the function
prolang
oid
Implementation language or call interface of this function
procost
float4
prorows
float4
Estimated number of result rows (zero if not proretset
)
provariadic
oid
Data type of the variadic array parameter's elements, or zero if the function does not have a variadic parameter
protransform
regproc
prokind
char
f
for a normal function, p
for a procedure, a
for an aggregate function, or w
for a window function
prosecdef
bool
Function is a security definer (i.e., a “setuid” function)
proleakproof
bool
The function has no side effects. No information about the arguments is conveyed except via the return value. Any function that might throw an error depending on the values of its arguments is not leak-proof.
proisstrict
bool
Function returns null if any call argument is null. In that case the function won't actually be called at all. Functions that are not “strict” must be prepared to handle null inputs.
proretset
bool
Function returns a set (i.e., multiple values of the specified data type)
provolatile
char
provolatile
tells whether the function's result depends only on its input arguments, or is affected by outside factors. It is i
for “immutable” functions, which always deliver the same result for the same inputs. It is s
for “stable” functions, whose results (for fixed inputs) do not change within a scan. It is v
for “volatile” functions, whose results might change at any time. (Use v
also for functions with side-effects, so that calls to them cannot get optimized away.)
proparallel
char
proparallel
tells whether the function can be safely run in parallel mode. It is s
for functions which are safe to run in parallel mode without restriction. It is r
for functions which can be run in parallel mode, but their execution is restricted to the parallel group leader; parallel worker processes cannot invoke these functions. It is u
for functions which are unsafe in parallel mode; the presence of such a function forces a serial execution plan.
pronargs
int2
Number of input arguments
pronargdefaults
int2
Number of arguments that have defaults
prorettype
oid
Data type of the return value
proargtypes
oidvector
An array with the data types of the function arguments. This includes only input arguments (including INOUT
and VARIADIC
arguments), and thus represents the call signature of the function.
proallargtypes
oid[]
An array with the data types of the function arguments. This includes all arguments (including OUT
and INOUT
arguments); however, if all the arguments are IN
arguments, this field will be null. Note that subscripting is 1-based, whereas for historical reasons proargtypes
is subscripted from 0.
proargmodes
char[]
An array with the modes of the function arguments, encoded as i
for IN
arguments, o
for OUT
arguments, b
for INOUT
arguments, v
for VARIADIC
arguments, t
for TABLE
arguments. If all the arguments are IN
arguments, this field will be null. Note that subscripts correspond to positions of proallargtypes
not proargtypes
.
proargnames
text[]
An array with the names of the function arguments. Arguments without a name are set to empty strings in the array. If none of the arguments have a name, this field will be null. Note that subscripts correspond to positions of proallargtypes
not proargtypes
.
proargdefaults
pg_node_tree
Expression trees (in nodeToString()
representation) for default values. This is a list with pronargdefaults
elements, corresponding to the last N
input arguments (i.e., the last N
proargtypes
positions). If none of the arguments have defaults, this field will be null.
protrftypes
oid[]
Data type OIDs for which to apply transforms.
prosrc
text
This tells the function handler how to invoke the function. It might be the actual source code of the function for interpreted languages, a link symbol, a file name, or just about anything else, depending on the implementation language/call convention.
probin
text
Additional information about how to invoke the function. Again, the interpretation is language-specific.
proconfig
text[]
Function's local settings for run-time configuration variables
proacl
aclitem[]
對於內建和動態載入的已編譯函數,prosrc 包含函數的 C 語言名稱(link symbol)。 對於所有其他目前已知的語言類型,prosrc 包含函數的原始碼。除了動態載入的 C 函數外,probin 均未使用,因為它用於記錄該函數的共享函式庫檔案的名稱。
目錄 pg_subscription_rel 包含每個訂閱中每個複寫關係的狀態。這是多對多的關連情況。
在執行 CREATE SUBSCRIPTION 或 ALTER SUBSCRIPTION ... REFRESH PUBLICATION 之後才會更新,此目錄僅會包含已知的訂閱資料表。
pg_subscription_rel
ColumnsColumn Type
Description
srsubid
oid
(references .oid
)
Reference to subscription
srrelid
oid
(references .oid
)
Reference to relation
srsubstate
char
State code: i
= initialize, d
= data is being copied, s
= synchronized, r
= ready (normal replication)
srsublsn
pg_lsn
Remote LSN of the state change used for synchronization coordination when in s
or r
states, otherwise null
目錄 pg_subscription 包含所有現有訂閱的邏輯複寫。有關邏輯複寫的相關資訊,請參閱第 30 章。
與大多數系統目錄不同,pg_subscription 在叢集的所有資料庫之間是共享的:每個叢集只有一個 pg_subscription,而不是每個資料庫一個。
普通使用者沒有權限讀取 subconninfo 欄位,因為該欄位可能包含純文字密碼。
pg_subscription
ColumnsColumn Type
Description
oid
oid
Row identifier
subdbid
oid
(references .oid
)
OID of the database which the subscription resides in
subname
name
Name of the subscription
subowner
oid
(references .oid
)
Owner of the subscription
subenabled
bool
If true, the subscription is enabled and should be replicating.
subconninfo
text
Connection string to the upstream database
subslotname
name
Name of the replication slot in the upstream database (also used for the local replication origin name); null represents NONE
subsynccommit
text
Contains the value of the synchronous_commit
setting for the subscription workers.
subpublications
text[]
Array of subscribed publication names. These reference the publications on the publisher server. For more on publications see .
The catalog pg_statistic
stores statistical data about the contents of the database. Entries are created by ANALYZE and subsequently used by the query planner. Note that all the statistical data is inherently approximate, even assuming that it is up-to-date.
Normally there is one entry, with stainherit
= false
, for each table column that has been analyzed. If the table has inheritance children, a second entry with stainherit
= true
is also created. This row represents the column's statistics over the inheritance tree, i.e., statistics for the data you'd see with SELECT
column
FROM table
*, whereas the stainherit
= false
row represents the results of SELECT
column
FROM ONLY table
.
pg_statistic
also stores statistical data about the values of index expressions. These are described as if they were actual data columns; in particular, starelid
references the index. No entry is made for an ordinary non-expression index column, however, since it would be redundant with the entry for the underlying table column. Currently, entries for index expressions always have stainherit
= false
.
Since different kinds of statistics might be appropriate for different kinds of data, pg_statistic
is designed not to assume very much about what sort of statistics it stores. Only extremely general statistics (such as nullness) are given dedicated columns in pg_statistic
. Everything else is stored in “slots”, which are groups of associated columns whose content is identified by a code number in one of the slot's columns. For more information see src/include/catalog/pg_statistic.h
.
pg_statistic
should not be readable by the public, since even statistical information about a table's contents might be considered sensitive. (Example: minimum and maximum values of a salary column might be quite interesting.) pg_stats
is a publicly readable view on pg_statistic
that only exposes information about those tables that are readable by the current user.
pg_statistic
ColumnsColumn Type
Description
starelid
oid
(references .oid
)
The table or index that the described column belongs to
staattnum
int2
(references .attnum
)
The number of the described column
stainherit
bool
If true, the stats include inheritance child columns, not just the values in the specified relation
stanullfrac
float4
The fraction of the column's entries that are null
stawidth
int4
The average stored width, in bytes, of nonnull entries
stadistinct
float4
The number of distinct nonnull data values in the column. A value greater than zero is the actual number of distinct values. A value less than zero is the negative of a multiplier for the number of rows in the table; for example, a column in which about 80% of the values are nonnull and each nonnull value appears about twice on average could be represented by stadistinct
= -0.4. A zero value means the number of distinct values is unknown.
stakind
N
int2
A code number indicating the kind of statistics stored in the N
th “slot” of the pg_statistic
row.
staop
N
oid
(references .oid
)
An operator used to derive the statistics stored in the N
th “slot”. For example, a histogram slot would show the <
operator that defines the sort order of the data.
stacoll
N
oid
(references .oid
)
The collation used to derive the statistics stored in the N
th “slot”. For example, a histogram slot for a collatable column would show the collation that defines the sort order of the data. Zero for noncollatable data.
stanumbers
N
float4[]
Numerical statistics of the appropriate kind for the N
th “slot”, or null if the slot kind does not involve numerical values
stavalues
N
anyarray
Column data values of the appropriate kind for the N
th “slot”, or null if the slot kind does not store any data values. Each array's element values are actually of the specific column's data type, or a related type such as an array's element type, so there is no way to define these columns' type more specifically than anyarray
.
The catalog pg_rewrite
stores rewrite rules for tables and views.
Table 51.44. pg_rewrite
Columns
Name
Type
References
Description
oid
oid
Row identifier (hidden attribute; must be explicitly selected)
rulename
name
Rule name
ev_class
oid
.oid
The table this rule is for
ev_type
char
Event type that the rule is for: 1 = SELECT
, 2 = UPDATE
, 3 = INSERT
, 4 = DELETE
ev_enabled
char
Controls in which modes the rule fires. O
= rule fires in “origin” and “local” modes, D
= rule is disabled, R
= rule fires in “replica” mode, A
= rule fires always.
is_instead
bool
True if the rule is an INSTEAD
rule
ev_qual
pg_node_tree
Expression tree (in the form of a nodeToString()
representation) for the rule's qualifying condition
ev_action
pg_node_tree
Query tree (in the form of a nodeToString()
representation) for the rule's action
pg_class.relhasrules
must be true if a table has any rules in this catalog.
The catalog pg_statistic_ext
holds definitions of extended planner statistics. Each row in this catalog corresponds to a statistics object created with CREATE STATISTICS.
pg_statistic_ext
ColumnsColumn Type
Description
oid
oid
Row identifier
stxrelid
oid
(references .oid
)
Table containing the columns described by this object
stxname
name
Name of the statistics object
stxnamespace
oid
(references .oid
)
The OID of the namespace that contains this statistics object
stxowner
oid
(references .oid
)
Owner of the statistics object
stxstattarget
int4
stxstattarget
controls the level of detail of statistics accumulated for this statistics object by . A zero value indicates that no statistics should be collected. A negative value says to use the maximum of the statistics targets of the referenced columns, if set, or the system default statistics target. Positive values of stxstattarget
determine the target number of “most common values” to collect.
stxkeys
int2vector
(references .attnum
)
An array of attribute numbers, indicating which table columns are covered by this statistics object; for example a value of 1 3
would mean that the first and the third table columns are covered
stxkind
char[]
An array containing codes for the enabled statistic kinds; valid values are: d
for n-distinct statistics, f
for functional dependency statistics, and m
for most common values (MCV) list statistics
The pg_statistic_ext
entry is filled in completely during CREATE STATISTICS
, but the actual statistical values are not computed then. Subsequent ANALYZE
commands compute the desired values and populate an entry in the pg_statistic_ext_data
catalog.
.oid
.oid
.oid
Estimated execution cost (in units of ); if proretset
, this is cost per row returned
.oid
.oid
Calls to this function can be simplified by this other function (see )
.oid
.oid
.oid
Access privileges; see and for details
The view pg_hba_file_rules
provides a summary of the contents of the client authentication configuration file, pg_hba.conf
. A row appears in this view for each non-empty, non-comment line in the file, with annotations indicating whether the rule could be applied successfully.
This view can be helpful for checking whether planned changes in the authentication configuration file will work, or for diagnosing a previous failure. Note that this view reports on the current contents of the file, not on what was last loaded by the server.
By default, the pg_hba_file_rules
view can be read only by superusers.
pg_hba_file_rules
ColumnsColumn Type
Description
line_number
int4
Line number of this rule in pg_hba.conf
type
text
Type of connection
database
text[]
List of database name(s) to which this rule applies
user_name
text[]
List of user and group name(s) to which this rule applies
address
text
Host name or IP address, or one of all
, samehost
, or samenet
, or null for local connections
netmask
text
IP address mask, or null if not applicable
auth_method
text
Authentication method
options
text[]
Options specified for authentication method, if any
error
text
If not null, an error message indicating why this line could not be processed
Usually, a row reflecting an incorrect entry will have values for only the line_number
and error
fields.
使用者身份驗證設定的相關資訊,請參閱第 20 章。
pg_available_extension_versions 檢視圖列出可用於安裝的特定延伸功能版本。另請參閱 pg_extension 目錄,該目錄列出了目前已安裝的延伸功能。
pg_available_extension_versions
ColumnsName
Type
Description
name
name
延伸功能名稱
version
text
版本名稱
installed
bool
如果目前已安裝此延伸功能的此版本,則為 True
superuser
bool
如果僅允許超級使用者安裝此延伸功能,則為 True
relocatable
bool
如果延伸功能可以接受重新定位到另一個綱要,則為 True
schema
name
延伸功能必須安裝到的綱要名稱,如果可部分或完全重新定位,則為 NULL
requires
name[]
必須預先安裝的延伸功能名稱,如果沒有則為 NULL
comment
text
延伸功能控制檔案中的註解文字內容
pg_available_extension_versions 檢視表是唯讀的。
The catalog pg_trigger
stores triggers on tables and views. See CREATE TRIGGER for more information.
Table 51.56. pg_trigger
Columns
Name
Type
References
Description
oid
oid
Row identifier (hidden attribute; must be explicitly selected)
tgrelid
oid
The table this trigger is on
tgname
name
Trigger name (must be unique among triggers of same table)
tgfoid
oid
The function to be called
tgtype
int2
Bit mask identifying trigger firing conditions
tgenabled
char
tgisinternal
bool
True if trigger is internally generated (usually, to enforce the constraint identified by tgconstraint
)
tgconstrrelid
oid
The table referenced by a referential integrity constraint
tgconstrindid
oid
The index supporting a unique, primary key, referential integrity, or exclusion constraint
tgconstraint
oid
The pg_constraint
entry associated with the trigger, if any
tgdeferrable
bool
True if constraint trigger is deferrable
tginitdeferred
bool
True if constraint trigger is initially deferred
tgnargs
int2
Number of argument strings passed to trigger function
tgattr
int2vector
Column numbers, if trigger is column-specific; otherwise an empty array
tgargs
bytea
Argument strings to pass to trigger, each NULL-terminated
tgqual
pg_node_tree
Expression tree (in nodeToString()
representation) for the trigger's WHEN
condition, or null if none
tgoldtable
name
REFERENCING
clause name for OLD TABLE
, or null if none
tgnewtable
name
REFERENCING
clause name for NEW TABLE
, or null if none
Currently, column-specific triggering is supported only for UPDATE
events, and so tgattr
is relevant only for that event type. tgtype
might contain bits for other event types as well, but those are presumed to be table-wide regardless of what is in tgattr
.
When tgconstraint
is nonzero, tgconstrrelid
, tgconstrindid
, tgdeferrable
, and tginitdeferred
are largely redundant with the referenced pg_constraint
entry. However, it is possible for a non-deferrable trigger to be associated with a deferrable constraint: foreign key constraints can have some deferrable and some non-deferrable triggers.
pg_class.relhastriggers
must be true if a relation has any triggers in this catalog.
The catalog pg_type
stores information about data types. Base types and enum types (scalar types) are created with CREATE TYPE, and domains with CREATE DOMAIN. A composite type is automatically created for each table in the database, to represent the row structure of the table. It is also possible to create composite types with CREATE TYPE AS
.
Table 51.62. pg_type
Columns
Name
Type
References
Description
oid
oid
Row identifier (hidden attribute; must be explicitly selected)
typname
name
Data type name
typnamespace
oid
The OID of the namespace that contains this type
typowner
oid
Owner of the type
typlen
int2
For a fixed-size type, typlen
is the number of bytes in the internal representation of the type. But for a variable-length type, typlen
is negative. -1 indicates a “varlena” type (one that has a length word), -2 indicates a null-terminated C string.
typbyval
bool
typbyval
determines whether internal routines pass a value of this type by value or by reference. typbyval
had better be false if typlen
is not 1, 2, or 4 (or 8 on machines where Datum is 8 bytes). Variable-length types are always passed by reference. Note that typbyval
can be false even if the length would allow pass-by-value.
typtype
char
typtype
is b
for a base type, c
for a composite type (e.g., a table's row type), d
for a domain, e
for an enum type, p
for a pseudo-type, or r
for a range type. See also typrelid
and typbasetype
.
typcategory
char
typispreferred
bool
True if the type is a preferred cast target within its typcategory
typisdefined
bool
True if the type is defined, false if this is a placeholder entry for a not-yet-defined type. When typisdefined
is false, nothing except the type name, namespace, and OID can be relied on.
typdelim
char
Character that separates two values of this type when parsing array input. Note that the delimiter is associated with the array element data type, not the array data type.
typrelid
oid
If this is a composite type (see typtype
), then this column points to the pg_class
entry that defines the corresponding table. (For a free-standing composite type, the pg_class
entry doesn't really represent a table, but it is needed anyway for the type'spg_attribute
entries to link to.) Zero for non-composite types.
typelem
oid
If typelem
is not 0 then it identifies another row in pg_type
. The current type can then be subscripted like an array yielding values of type typelem
. A “true” array type is variable length (typlen
= -1), but some fixed-length (typlen
> 0) types also have nonzero typelem
, for example name
and point
. If a fixed-length type has a typelem
then its internal representation must be some number of values of the typelem
data type with no other data. Variable-length array types have a header defined by the array subroutines.
typarray
oid
If typarray
is not 0 then it identifies another row in pg_type
, which is the “true” array type having this type as element
typinput
regproc
Input conversion function (text format)
typoutput
regproc
Output conversion function (text format)
typreceive
regproc
Input conversion function (binary format), or 0 if none
typsend
regproc
Output conversion function (binary format), or 0 if none
typmodin
regproc
Type modifier input function, or 0 if type does not support modifiers
typmodout
regproc
Type modifier output function, or 0 to use the standard format
typanalyze
regproc
Custom ANALYZE
function, or 0 to use the standard function
typalign
char
typalign
is the alignment required when storing a value of this type. It applies to storage on disk as well as most representations of the value inside PostgreSQL. When multiple values are stored consecutively, such as in the representation of a complete row on disk, padding is inserted before a datum of this type so that it begins on the specified boundary. The alignment reference is the beginning of the first datum in the sequence.Possible values are:c
= char
alignment, i.e., no alignment needed.s
= short
alignment (2 bytes on most machines).i
= int
alignment (4 bytes on most machines).d
= double
alignment (8 bytes on many machines, but by no means all).NoteFor types used in system tables, it is critical that the size and alignment defined in pg_type
agree with the way that the compiler will lay out the column in a structure representing a table row.
typstorage
char
typstorage
tells for varlena types (those with typlen
= -1) if the type is prepared for toasting and what the default strategy for attributes of this type should be. Possible values arep
: Value must always be stored plain.e
: Value can be stored in a “secondary” relation (if relation has one, see pg_class.reltoastrelid
).m
: Value can be stored compressed inline.x
: Value can be stored compressed inline or stored in “secondary” storage.Note that m
columns can also be moved out to secondary storage, but only as a last resort (e
and x
columns are moved first).
typnotnull
bool
typnotnull
represents a not-null constraint on a type. Used for domains only.
typbasetype
oid
If this is a domain (see typtype
), then typbasetype
identifies the type that this one is based on. Zero if this type is not a domain.
typtypmod
int4
Domains use typtypmod
to record the typmod
to be applied to their base type (-1 if base type does not use a typmod
). -1 if this type is not a domain.
typndims
int4
typndims
is the number of array dimensions for a domain over an array (that is, typbasetype
is an array type). Zero for types other than domains over array types.
typcollation
oid
typcollation
specifies the collation of the type. If the type does not support collations, this will be zero. A base type that supports collations will have DEFAULT_COLLATION_OID
here. A domain over a collatable type can have some other collation OID, if one was specified for the domain.
typdefaultbin
pg_node_tree
If typdefaultbin
is not null, it is the nodeToString()
representation of a default expression for the type. This is only used for domains.
typdefault
text
typdefault
is null if the type has no associated default value. If typdefaultbin
is not null, typdefault
must contain a human-readable version of the default expression represented by typdefaultbin
. If typdefaultbin
is null and typdefault
is not, then typdefault
is the external representation of the type's default value, which can be fed to the type's input converter to produce a constant.
typacl
aclitem[]
Table 51.63 lists the system-defined values of typcategory
. Any future additions to this list will also be upper-case ASCII letters. All other ASCII characters are reserved for user-defined categories.
Table 51.63. typcategory
Codes
Code
Category
A
Array types
B
Boolean types
C
Composite types
D
Date/time types
E
Enum types
G
Geometric types
I
Network address types
N
Numeric types
P
Pseudo-types
R
Range types
S
String types
T
Timespan types
U
User-defined types
V
Bit-string types
X
unknown
type
The catalog pg_tablespace
stores information about the available tablespaces. Tables can be placed in particular tablespaces to aid administration of disk layout.
Unlike most system catalogs, pg_tablespace
is shared across all databases of a cluster: there is only one copy of pg_tablespace
per cluster, not one per database.
pg_tablespace
ColumnsColumn Type
Description
oid
oid
Row identifier
spcname
name
Tablespace name
spcowner
oid
(references .oid
)
Owner of the tablespace, usually the user who created it
spcacl
aclitem[]
Access privileges; see for details
spcoptions
text[]
Tablespace-level options, as “keyword=value” strings
.oid
.oid
Controls in which modes the trigger fires. O
= trigger fires in “origin” and “local” modes, D
= trigger is disabled, R
= trigger fires in “replica” mode, A
= trigger fires always.
.oid
.oid
.oid
.attnum
.oid
.oid
typcategory
is an arbitrary classification of data types that is used by the parser to determine which implicit casts should be “preferred”. See .
.oid
.oid
.oid
.oid
.oid
.oid
.oid
.oid
.oid
.oid
.oid
.oid
Access privileges; see and for details
The view pg_locks
provides access to information about the locks held by active processes within the database server. See Chapter 13 for more discussion of locking.
pg_locks
contains one row per active lockable object, requested lock mode, and relevant process. Thus, the same lockable object might appear many times, if multiple processes are holding or waiting for locks on it. However, an object that currently has no locks on it will not appear at all.
There are several distinct types of lockable objects: whole relations (e.g., tables), individual pages of relations, individual tuples of relations, transaction IDs (both virtual and permanent IDs), and general database objects (identified by class OID and object OID, in the same way as in pg_description
or pg_depend
). Also, the right to extend a relation is represented as a separate lockable object, as is the right to update pg_database
.datfrozenxid
. Also, “advisory” locks can be taken on numbers that have user-defined meanings.
pg_locks
ColumnsColumn Type
Description
locktype
text
OID of the database in which the lock target exists, or zero if the target is a shared object, or null if the target is a transaction ID
OID of the relation targeted by the lock, or null if the target is not a relation or part of a relation
page
int4
Page number targeted by the lock within the relation, or null if the target is not a relation page or tuple
tuple
int2
Tuple number targeted by the lock within the page, or null if the target is not a tuple
virtualxid
text
Virtual ID of the transaction targeted by the lock, or null if the target is not a virtual transaction ID
transactionid
xid
ID of the transaction targeted by the lock, or null if the target is not a transaction ID
OID of the system catalog containing the lock target, or null if the target is not a general database object
objid
oid
(references any OID column)
OID of the lock target within its system catalog, or null if the target is not a general database object
objsubid
int2
Column number targeted by the lock (the classid
and objid
refer to the table itself), or zero if the target is some other general database object, or null if the target is not a general database object
virtualtransaction
text
Virtual ID of the transaction that is holding or awaiting this lock
pid
int4
Process ID of the server process holding or awaiting this lock, or null if the lock is held by a prepared transaction
mode
text
granted
bool
True if lock is held, false if lock is awaited
fastpath
bool
True if lock was taken via fast path, false if taken via main lock table
granted 為 true 的話,代表此鎖定由該筆資料的程序所持有。False 表示此程序目前正在等待取得鎖定,這意味著至少一個其他程序正持有或等待同一可鎖定物件上在鎖定模式有衝突。等待的程序將會一直休眠,直到另一個鎖定被釋放(或檢測到 deadlock 情況)為止。一個程序等待最多只可以取得一個鎖定。
在整個交易事務執行過程中,伺服器程序對事務的虛擬事務 ID 持有排他鎖定(exclusive lock)。如果將永久性 ID 分配給事務(通常僅在事務變更資料庫狀態時才會發生),它還會對事務的永久性事務 ID 持有排他鎖定,直到結束。當一個程序發現有必要專門等待另一個事務結束時,它透過嘗試獲取另一個事務的 ID(取決於情況的虛擬 ID 或永久 ID)上的共享鎖定(share lock)來做到這一點。僅當另一個事務結束並釋放其鎖定時,該操作才會成功。
儘管 tuple 是可鎖定的物件型別,但是有關資料列級鎖定的資訊是儲存在磁碟上,而不是儲存在記憶體之中,因此資料列級的鎖定通常不會出現在此檢視表中。如果程序正在等待資料列級的鎖定,則它通常在檢視表中顯示為正在等待該資料列鎖定目前持有者的永久事務 ID。
Advisory locks can be acquired on keys consisting of either a single bigint
value or two integer values. A bigint
key is displayed with its high-order half in the classid
column, its low-order half in the objid
column, and objsubid
equal to 1. The original bigint
value can be reassembled with the expression (classid::bigint << 32) | objid::bigint
. Integer keys are displayed with the first key in the classid
column, the second key in the objid
column, and objsubid
equal to 2. The actual meaning of the keys is up to the user. Advisory locks are local to each database, so the database
column is meaningful for an advisory lock.
pg_locks
provides a global view of all locks in the database cluster, not only those relevant to the current database. Although its relation
column can be joined against pg_class
.oid
to identify locked relations, this will only work correctly for relations in the current database (those for which the database
column is either the current database's OID or zero).
The pid
column can be joined to the pid
column of the pg_stat_activity
view to get more information on the session holding or awaiting each lock, for example
Also, if you are using prepared transactions, the virtualtransaction
column can be joined to the transaction
column of the pg_prepared_xacts
view to get more information on prepared transactions that hold locks. (A prepared transaction can never be waiting for a lock, but it continues to hold the locks it acquired while running.) For example:
While it is possible to obtain information about which processes block which other processes by joining pg_locks
against itself, this is very difficult to get right in detail. Such a query would have to encode knowledge about which lock modes conflict with which others. Worse, the pg_locks
view does not expose information about which processes are ahead of which others in lock wait queues, nor information about which processes are parallel workers running on behalf of which other client sessions. It is better to use the pg_blocking_pids()
function (see Table 9.63) to identify which process(es) a waiting process is blocked behind.
The pg_locks
view displays data from both the regular lock manager and the predicate lock manager, which are separate systems; in addition, the regular lock manager subdivides its locks into regular and fast-path locks. This data is not guaranteed to be entirely consistent. When the view is queried, data on fast-path locks (with fastpath
= true
) is gathered from each backend one at a time, without freezing the state of the entire lock manager, so it is possible for locks to be taken or released while information is gathered. Note, however, that these locks are known not to conflict with any other lock currently in place. After all backends have been queried for fast-path locks, the remainder of the regular lock manager is locked as a unit, and a consistent snapshot of all remaining locks is collected as an atomic action. After unlocking the regular lock manager, the predicate lock manager is similarly locked and all predicate locks are collected as an atomic action. Thus, with the exception of fast-path locks, each lock manager will deliver a consistent set of results, but as we do not lock both lock managers simultaneously, it is possible for locks to be taken or released after we interrogate the regular lock manager and before we interrogate the predicate lock manager.
Locking the regular and/or predicate lock manager could have some impact on database performance if this view is very frequently accessed. The locks are held only for the minimum amount of time necessary to obtain data from the lock managers, but this does not completely eliminate the possibility of a performance impact.
The view pg_prepared_xacts
displays information about transactions that are currently prepared for two-phase commit (see PREPARE TRANSACTION for details).
pg_prepared_xacts
contains one row per prepared transaction. An entry is removed when the transaction is committed or rolled back.
pg_prepared_xacts
ColumnsColumn Type
Description
transaction
xid
Numeric transaction identifier of the prepared transaction
gid
text
Global transaction identifier that was assigned to the transaction
prepared
timestamptz
Time at which the transaction was prepared for commit
owner
name
(references .rolname
)
Name of the user that executed the transaction
database
name
(references .datname
)
Name of the database in which the transaction was executed
When the pg_prepared_xacts
view is accessed, the internal transaction manager data structures are momentarily locked, and a copy is made for the view to display. This ensures that the view produces a consistent set of results, while not blocking normal operations longer than necessary. Nonetheless there could be some impact on database performance if this view is frequently accessed.
檢視表 pg_indexes 提供資料庫中每一個索引的資訊。
pg_indexes
ColumnsColumn Type
Description
schemaname
name
(references .nspname
)
Name of schema containing table and index
tablename
name
(references .relname
)
Name of table the index is for
indexname
name
(references .relname
)
Name of index
tablespace
name
(references .spcname
)
Name of tablespace containing index (null if default for database)
indexdef
text
Index definition (a reconstructed CREATE INDEX
command)
pg_roles 這個 view 提供對資料庫角色的資訊。 這只是 pg_authid 的一個公共可讀的 view,它將密碼部份予以屏蔽。
此 view 列出底層的 OID 欄位,因此可能需要執行與其他目錄的交叉查詢。
Table 51.83. pg_roles
欄位
Name
Type
References
Description
rolname
name
角色名稱
rolsuper
bool
角色具有超級使用者權限
rolinherit
bool
角色自動繼承它所屬角色的權限
rolcreaterole
bool
角色可以建立更多角色
rolcreatedb
bool
角色可以建立資料庫
rolcanlogin
bool
角色可以登入。也就是說,可以將此角色作為初始連線認證使用
rolreplication
bool
角色是可以進行資料複寫的角色。複寫角色表示可以啟動資料複寫連線並建立和刪除複寫對象。
rolconnlimit
int4
對於可以登入的角色,這個設定此角色可以建立的最大同時連線數。 -1 意味著沒有限制。
rolpassword
text
不是密碼(讀出來都是********)
rolvaliduntil
timestamptz
密碼到期時間(僅用於密碼驗證); 如果沒有到期時間,則顯示 null
rolbypassrls
bool
角色繞過每一個資料列層級的安全原則,參閱了解更多訊息。
rolconfig
text[]
執行環境時用於角色的組態預設值
oid
oid
.oid
角色的 ID
The pg_replication_origin_status
view contains information about how far replay for a certain origin has progressed. For more on replication origins see Chapter 49.
Table 51.80. pg_replication_origin_status
Columns
Name
Type
References
Description
local_id
Oid
.roident
internal node identifier
external_id
text
.roname
external node identifier
remote_lsn
pg_lsn
The origin node's LSN up to which data has been replicated.
local_lsn
pg_lsn
This node's LSN at which remote_lsn
has been replicated. Used to flush commit records before persisting data to disk when using asynchronous commits.
The pg_replication_slots
view provides a listing of all replication slots that currently exist on the database cluster, along with their current state.
For more on replication slots, see Section 26.2.6 and Chapter 48.
pg_replication_slots
ColumnsColumn Type
Description
slot_name
name
A unique, cluster-wide identifier for the replication slot
plugin
name
The base name of the shared object containing the output plugin this logical slot is using, or null for physical slots.
slot_type
text
The slot type: physical
or logical
datoid
oid
(references .oid
)
The OID of the database this slot is associated with, or null. Only logical slots have an associated database.
database
name
(references .datname
)
The name of the database this slot is associated with, or null. Only logical slots have an associated database.
temporary
bool
True if this is a temporary replication slot. Temporary slots are not saved to disk and are automatically dropped on error or when the session has finished.
active
bool
True if this slot is currently actively being used
active_pid
int4
The process ID of the session using this slot if the slot is currently actively being used. NULL
if inactive.
xmin
xid
The oldest transaction that this slot needs the database to retain. VACUUM
cannot remove tuples deleted by any later transaction.
catalog_xmin
xid
The oldest transaction affecting the system catalogs that this slot needs the database to retain. VACUUM
cannot remove catalog tuples deleted by any later transaction.
restart_lsn
pg_lsn
The address (LSN
) of oldest WAL which still might be required by the consumer of this slot and thus won't be automatically removed during checkpoints unless this LSN gets behind more than from the current LSN. NULL
if the LSN
of this slot has never been reserved.
confirmed_flush_lsn
pg_lsn
The address (LSN
) up to which the logical slot's consumer has confirmed receiving data. Data older than this is not available anymore. NULL
for physical slots.
wal_status
text
Availability of WAL files claimed by this slot. Possible values are:
reserved
means that the claimed files are within max_wal_size
.
extended
means that max_wal_size
is exceeded but the files are still retained, either by the replication slot or by wal_keep_size
.
unreserved
means that the slot no longer retains the required WAL files and some of them are to be removed at the next checkpoint. This state can return to reserved
or extended
.
lost
means that some required WAL files have been removed and this slot is no longer usable.
The last two states are seen only when is non-negative. If restart_lsn
is NULL, this field is null.
safe_wal_size
int8
The number of bytes that can be written to WAL such that this slot is not in danger of getting in state "lost". It is NULL for lost slots, as well as if max_slot_wal_keep_size
is -1
.
The view pg_stats
provides access to the information stored in the catalog. This view allows access only to rows of pg_statistic
that correspond to tables the user has permission to read, and therefore it is safe to allow public read access to this view.
pg_stats
is also designed to present the information in a more readable format than the underlying catalog — at the cost that its schema must be extended whenever new slot types are defined for pg_statistic
.
pg_stats
ColumnsThe maximum number of entries in the array fields can be controlled on a column-by-column basis using the ALTER TABLE SET STATISTICS
command, or globally by setting the run-time parameter.
檢視表 pg_settings 提供對伺服器的執行時參數的存取。它本質上是 和 指令的替代介面。它也提供 SHOW 無法直接獲得的一些資訊存取,例如最小值和最大值。
pg_settings
Columns設定內容有幾種可能的值,是為了降低變更組態的複雜度,它們是:
internal
這些設定無法直接更改;它們反映了內部所決定的值,其中一些可以透過使用不同的組態選項重建伺服器,或透過更改提供給 initdb 的選項來調整。
postmaster
這些設定只能在伺服器啟動時套用,因此任何變更都需要重新啟動伺服器。這些設定的值通常儲存在 postgresql.conf 檔案中,或在啟動伺服器時在命令列中給予。當然,也可以在伺服器啟動時設定任何層級較低的設定。
sighup
可以在 postgresql.conf 中對這些設定進行變更,而毌須重新啟動伺服器。只要向 postmaster 發送一個 SIGHUP 信號,使其重新讀取 postgresql.conf 並套用變更。postmaster 還會將 SIGHUP 信號轉發給其子程序,以便它們都獲取新值。
superuser-backend
可以在 postgresql.conf 中對這些設定進行變更,且毌須重新啟動伺服器。它們也可以在連線要求的封包中設定為特別連線(例如,透過 libpq 的 PGOPTIONS 環境變數),但前提是連線使用者是超級使用者。但是,這些設定在啟動後的連線中永遠不會變更。如果你在 postgresql.conf 中更改它們,請向 postmaster 發送一個 SIGHUP 信號,使其重新讀取 postgresql.conf。新值只會影響隨後啟動的連線。
backend
可以在 postgresql.conf 中對這些設定進行變更,而毌須重新啟動伺服器。它們也可以在連線請求封包中設定為特別連線(例如,透過 libpq 的 PGOPTIONS 環境變數);任何使用者都可以為他們的連線進行這樣的變更。但是,這些設定在啟動後的連線中永遠無法變更。如果你在 postgresql.conf 中更改它們,請向 postmaster 發送一個 SIGHUP 信號,使其重新讀取 postgresql.conf。新值只會影響隨後啟動的連線。
superuser
這些設定可以從 postgresql.conf 設定,也可以透過 SET 指令在連線中設定;但只有超級使用者可以透過 SET 來更改。僅當沒有使用 SET 建立連線專用的值時,postgresql.conf 中的變更才會影響現有連線。
user
這些設定可以從 postgresql.conf 設定,也可以透過 SET 指令在連線中設定。允許任何使用者變更其連線中所使用的值。僅當未使用 SET 未建立連線專用值時,postgresql.conf 中的變更才會影響現有連線。
The pg_shmem_allocations
view shows allocations made from the server's main shared memory segment. This includes both memory allocated by postgres itself and memory allocated by extensions using the mechanisms detailed in .
Note that this view does not include memory allocated using the dynamic shared memory infrastructure.
pg_shmem_allocations
ColumnsAnonymous allocations are allocations that have been made with ShmemAlloc()
directly, rather than via ShmemInitStruct()
or ShmemInitHash()
.
By default, the pg_shmem_allocations
view can be read only by superusers.
Type of the lockable object: relation
, extend
, frozenid
, page
, tuple
, transactionid
, virtualxid
, spectoken
, object
, userlock
, or advisory
. (See also.)
database
oid
(references .oid
)
relation
oid
(references .oid
)
classid
oid
(references .oid
)
Name of the lock mode held or desired by this process (see and )
有關變更這些參數的各種方法和更多資訊,請參閱。
pg_settings 檢視表無法INSERT 或 DELETE,但可以 UPDATE。套用於一行 pg_settings 的 UPDATE 相當於對該參數執行 指令。此變更僅影響目前連線所使用的值。如果在稍後中止的交易事務中發出 UPDATE,則在回溯事務時 UPDATE 指令的效果會消失。一旦提交了相關的事務,則效果將持續到連線結束,除非被另一個 UPDATE 或 SET 覆蓋。
Column Type
Description
schemaname
name
(references pg_namespace
.nspname
)
檢視表的綱要名稱
viewname
name
(references pg_class
.relname
)
檢視表名稱
viewowner
name
(references pg_authid
.rolname
)
檢視圖的擁有者名稱
definition
text
檢視表的定義(重新建構的 SELECT 查詢)
Column Type
Description
schemaname
name
(references pg_namespace
.nspname
)
Name of schema containing table
tablename
name
(references pg_class
.relname
)
Name of table
attname
name
(references pg_attribute
.attname
)
Name of the column described by this row
inherited
bool
If true, this row includes inheritance child columns, not just the values in the specified table
null_frac
float4
Fraction of column entries that are null
avg_width
int4
Average width in bytes of column's entries
n_distinct
float4
If greater than zero, the estimated number of distinct values in the column. If less than zero, the negative of the number of distinct values divided by the number of rows. (The negated form is used when ANALYZE
believes that the number of distinct values is likely to increase as the table grows; the positive form is used when the column seems to have a fixed number of possible values.) For example, -1 indicates a unique column in which the number of distinct values is the same as the number of rows.
most_common_vals
anyarray
A list of the most common values in the column. (Null if no values seem to be more common than any others.)
most_common_freqs
float4[]
A list of the frequencies of the most common values, i.e., number of occurrences of each divided by total number of rows. (Null when most_common_vals
is.)
histogram_bounds
anyarray
A list of values that divide the column's values into groups of approximately equal population. The values in most_common_vals
, if present, are omitted from this histogram calculation. (This column is null if the column data type does not have a <
operator or if the most_common_vals
list accounts for the entire population.)
correlation
float4
Statistical correlation between physical row ordering and logical ordering of the column values. This ranges from -1 to +1. When the value is near -1 or +1, an index scan on the column will be estimated to be cheaper than when it is near zero, due to reduction of random access to the disk. (This column is null if the column data type does not have a <
operator.)
most_common_elems
anyarray
A list of non-null element values most often appearing within values of the column. (Null for scalar types.)
most_common_elem_freqs
float4[]
A list of the frequencies of the most common element values, i.e., the fraction of rows containing at least one instance of the given value. Two or three additional values follow the per-element frequencies; these are the minimum and maximum of the preceding per-element frequencies, and optionally the frequency of null elements. (Null when most_common_elems
is.)
elem_count_histogram
float4[]
A histogram of the counts of distinct non-null element values within the values of the column, followed by the average number of distinct non-null elements. (Null for scalar types.)
Column Type
Description
usename
name
User name
usesysid
oid
ID of this user
usecreatedb
bool
User can create databases
usesuper
bool
User is a superuser
userepl
bool
User can initiate streaming replication and put the system in and out of backup mode.
usebypassrls
bool
User bypasses every row level security policy, see Section 5.8 for more information.
passwd
text
Not the password (always reads as ********
)
valuntil
timestamptz
Password expiry time (only used for password authentication)
useconfig
text[]
Session defaults for run-time configuration variables
Column Type
Description
name
text
The name of the shared memory allocation. NULL for unused memory and <anonymous>
for anonymous allocations.
off
int8
The offset at which the allocation starts. NULL for anonymous allocations, since details related to them are not known.
size
int8
Size of the allocation
allocated_size
int8
Size of the allocation including padding. For anonymous allocations, no information about padding is available, so the size
and allocated_size
columns will always be equal. Padding is not meaningful for free memory, so the columns will be equal in that case also.
Column Type
Description
name
text
執行階段的組態參數名稱
setting
text
參數的現值
unit
text
參數隱含的單位
category
text
參數的邏輯分類
short_desc
text
參數的簡要說明
extra_desc
text
更進一步的詳細參數說明
context
text
組態參數值的必要內容(詳見下文)
vartype
text
參數型別(bool、enum、integer、real 或 string)
source
text
目前參數值的來源
min_val
text
參數的最小允許值(非數字型別為 null)
max_val
text
參數的最大允許值(非數字型別為 null)
enumvals
text[]
列舉參數的允許值(非列舉型別為 null)
boot_val
text
如果未另行設定參數,則在伺服器啟動時預先給予參數值
reset_val
text
RESET 將參數重置為目前連線中的值
sourcefile
text
組態檔案目前設定為何(對於從組態檔案以外來源設定的值,或者由非超級使用者也不是 pg_read_all_settings 的成員所給予,為null);在組態檔案中使用 include 指令時會很有幫助。
sourceline
int4
組態檔案中目前設定所在的行號(對於從組態檔案以外來源所設定的值,或者由非超級使用者,也不是 pg_read_all_settings 成員所給予的值,則為 null)。
pending_restart
bool
如果組態檔案中的值已更改但需要重新啟動,則為 true;否則為 false。
To initiate streaming replication, the frontend sends the replication
parameter in the startup message. A Boolean value of true
(or on
, yes
, 1
) tells the backend to go into physical replication walsender mode, wherein a small set of replication commands, shown below, can be issued instead of SQL statements.
Passing database
as the value for the replication
parameter instructs the backend to go into logical replication walsender mode, connecting to the database specified in the dbname
parameter. In logical replication walsender mode, the replication commands shown below as well as normal SQL commands can be issued.
In either physical replication or logical replication walsender mode, only the simple query protocol can be used.
For the purpose of testing replication commands, you can make a replication connection via psql or any other libpq-using tool with a connection string including the replication
option, e.g.:
However, it is often more useful to use pg_receivewal (for physical replication) or pg_recvlogical (for logical replication).
Replication commands are logged in the server log when log_replication_commands is enabled.
The commands accepted in replication mode are:
IDENTIFY_SYSTEM
Requests the server to identify itself. Server replies with a result set of a single row, containing four fields:
systemid
(text
)
The unique system identifier identifying the cluster. This can be used to check that the base backup used to initialize the standby came from the same cluster.
timeline
(int4
)
Current timeline ID. Also useful to check that the standby is consistent with the master.
xlogpos
(text
)
Current WAL flush location. Useful to get a known location in the write-ahead log where streaming can start.
dbname
(text
)
Database connected to or null.
SHOW
name
Requests the server to send the current setting of a run-time parameter. This is similar to the SQL command SHOW.
name
The name of a run-time parameter. Available parameters are documented in Chapter 19.
TIMELINE_HISTORY
tli
Requests the server to send over the timeline history file for timeline tli
. Server replies with a result set of a single row, containing two fields:
filename
(text
)
File name of the timeline history file, e.g., 00000002.history
.
content
(bytea
)
Contents of the timeline history file.
CREATE_REPLICATION_SLOT
slot_name
[ TEMPORARY
] { PHYSICAL
[ RESERVE_WAL
] | LOGICAL
output_plugin
[ EXPORT_SNAPSHOT
| NOEXPORT_SNAPSHOT
| USE_SNAPSHOT
] }
Create a physical or logical replication slot. See Section 26.2.6 for more about replication slots.
slot_name
The name of the slot to create. Must be a valid replication slot name (see Section 26.2.6.1).
output_plugin
The name of the output plugin used for logical decoding (see Section 48.6).
TEMPORARY
Specify that this replication slot is a temporary one. Temporary slots are not saved to disk and are automatically dropped on error or when the session has finished.
RESERVE_WAL
Specify that this physical replication slot reserves WAL immediately. Otherwise, WAL is only reserved upon connection from a streaming replication client.
EXPORT_SNAPSHOT
NOEXPORT_SNAPSHOT
USE_SNAPSHOT
Decides what to do with the snapshot created during logical slot initialization. EXPORT_SNAPSHOT
, which is the default, will export the snapshot for use in other sessions. This option can't be used inside a transaction. USE_SNAPSHOT
will use the snapshot for the current transaction executing the command. This option must be used in a transaction, and CREATE_REPLICATION_SLOT
must be the first command run in that transaction. Finally, NOEXPORT_SNAPSHOT
will just use the snapshot for logical decoding as normal but won't do anything else with it.
In response to this command, the server will send a one-row result set containing the following fields:
slot_name
(text
)
The name of the newly-created replication slot.
consistent_point
(text
)
The WAL location at which the slot became consistent. This is the earliest location from which streaming can start on this replication slot.
snapshot_name
(text
)
The identifier of the snapshot exported by the command. The snapshot is valid until a new command is executed on this connection or the replication connection is closed. Null if the created slot is physical.
output_plugin
(text
)
The name of the output plugin used by the newly-created replication slot. Null if the created slot is physical.
START_REPLICATION
[ SLOT
slot_name
] [ PHYSICAL
] XXX/XXX
[ TIMELINE
tli
]
Instructs server to start streaming WAL, starting at WAL location XXX/XXX
. If TIMELINE
option is specified, streaming starts on timeline tli
; otherwise, the server's current timeline is selected. The server can reply with an error, for example if the requested section of WAL has already been recycled. On success, server responds with a CopyBothResponse message, and then starts to stream WAL to the frontend.
If a slot's name is provided via slot_name
, it will be updated as replication progresses so that the server knows which WAL segments, and if hot_standby_feedback
is on which transactions, are still needed by the standby.
If the client requests a timeline that's not the latest but is part of the history of the server, the server will stream all the WAL on that timeline starting from the requested start point up to the point where the server switched to another timeline. If the client requests streaming at exactly the end of an old timeline, the server responds immediately with CommandComplete without entering COPY mode.
After streaming all the WAL on a timeline that is not the latest one, the server will end streaming by exiting the COPY mode. When the client acknowledges this by also exiting COPY mode, the server sends a result set with one row and two columns, indicating the next timeline in this server's history. The first column is the next timeline's ID (type int8
), and the second column is the WAL location where the switch happened (type text
). Usually, the switch position is the end of the WAL that was streamed, but there are corner cases where the server can send some WAL from the old timeline that it has not itself replayed before promoting. Finally, the server sends two CommandComplete messages (one that ends the CopyData and the other ends the START_REPLICATION
itself), and is ready to accept a new command.
WAL data is sent as a series of CopyData messages. (This allows other information to be intermixed; in particular the server can send an ErrorResponse message if it encounters a failure after beginning to stream.) The payload of each CopyData message from server to the client contains a message of one of the following formats:
XLogData (B)
Byte1('w')
Identifies the message as WAL data.
Int64
The starting point of the WAL data in this message.
Int64
The current end of WAL on the server.
Int64
The server's system clock at the time of transmission, as microseconds since midnight on 2000-01-01.
Byten
A section of the WAL data stream.
A single WAL record is never split across two XLogData messages. When a WAL record crosses a WAL page boundary, and is therefore already split using continuation records, it can be split at the page boundary. In other words, the first main WAL record and its continuation records can be sent in different XLogData messages.
Primary keepalive message (B)
Byte1('k')
Identifies the message as a sender keepalive.
Int64
The current end of WAL on the server.
Int64
The server's system clock at the time of transmission, as microseconds since midnight on 2000-01-01.
Byte1
1 means that the client should reply to this message as soon as possible, to avoid a timeout disconnect. 0 otherwise.
The receiving process can send replies back to the sender at any time, using one of the following message formats (also in the payload of a CopyData message):
Standby status update (F)
Byte1('r')
Identifies the message as a receiver status update.
Int64
The location of the last WAL byte + 1 received and written to disk in the standby.
Int64
The location of the last WAL byte + 1 flushed to disk in the standby.Int64
The location of the last WAL byte + 1 applied in the standby.Int64
The client's system clock at the time of transmission, as microseconds since midnight on 2000-01-01.Byte1
If 1, the client requests the server to reply to this message immediately. This can be used to ping the server, to test if the connection is still healthy.Hot Standby feedback message (F)Byte1('h')
Identifies the message as a Hot Standby feedback message.Int64
The client's system clock at the time of transmission, as microseconds since midnight on 2000-01-01.Int32
The standby's current global xmin, excluding the catalog_xmin from any replication slots. If both this value and the following catalog_xmin are 0 this is treated as a notification that Hot Standby feedback will no longer be sent on this connection. Later non-zero messages may reinitiate the feedback mechanism.Int32
The epoch of the global xmin xid on the standby.Int32
The lowest catalog_xmin of any replication slots on the standby. Set to 0 if no catalog_xmin exists on the standby or if hot standby feedback is being disabled.Int32
The epoch of the catalog_xmin xid on the standby.START_REPLICATION
SLOT
slot_name
LOGICAL
XXX/XXX
[ ( option_name
[ option_value
] [, ...] ) ]
Instructs server to start streaming WAL for logical replication, starting at WAL location XXX/XXX
. The server can reply with an error, for example if the requested section of WAL has already been recycled. On success, server responds with a CopyBothResponse message, and then starts to stream WAL to the frontend.
The messages inside the CopyBothResponse messages are of the same format documented for START_REPLICATION ... PHYSICAL
, including two CommandComplete messages.
The output plugin associated with the selected slot is used to process the output for streaming.SLOT
slot_name
The name of the slot to stream changes from. This parameter is required, and must correspond to an existing logical replication slot created with CREATE_REPLICATION_SLOT
in LOGICAL
mode.XXX/XXX
The WAL location to begin streaming at.option_name
The name of an option passed to the slot's logical decoding plugin.option_value
Optional value, in the form of a string constant, associated with the specified option.DROP_REPLICATION_SLOT
slot_name
[ WAIT
]
Drops a replication slot, freeing any reserved server-side resources. If the slot is a logical slot that was created in a database other than the database the walsender is connected to, this command fails.slot_name
The name of the slot to drop.WAIT
This option causes the command to wait if the slot is active until it becomes inactive, instead of the default behavior of raising an error.BASE_BACKUP
[ LABEL
'label'
] [ PROGRESS
] [ FAST
] [ WAL
] [ NOWAIT
] [ MAX_RATE
rate
] [ TABLESPACE_MAP
] [ NOVERIFY_CHECKSUMS
] [ MANIFEST
manifest_option
] [ MANIFEST_CHECKSUMS
checksum_algorithm
]
Instructs the server to start streaming a base backup. The system will automatically be put in backup mode before the backup is started, and taken out of it when the backup is complete. The following options are accepted:LABEL
'label'
Sets the label of the backup. If none is specified, a backup label of base backup
will be used. The quoting rules for the label are the same as a standard SQL string with standard_conforming_strings turned on.PROGRESS
Request information required to generate a progress report. This will send back an approximate size in the header of each tablespace, which can be used to calculate how far along the stream is done. This is calculated by enumerating all the file sizes once before the transfer is even started, and might as such have a negative impact on the performance. In particular, it might take longer before the first data is streamed. Since the database files can change during the backup, the size is only approximate and might both grow and shrink between the time of approximation and the sending of the actual files.FAST
Request a fast checkpoint.WAL
Include the necessary WAL segments in the backup. This will include all the files between start and stop backup in the pg_wal
directory of the base directory tar file.NOWAIT
By default, the backup will wait until the last required WAL segment has been archived, or emit a warning if log archiving is not enabled. Specifying NOWAIT
disables both the waiting and the warning, leaving the client responsible for ensuring the required log is available.MAX_RATE
rate
Limit (throttle) the maximum amount of data transferred from server to client per unit of time. The expected unit is kilobytes per second. If this option is specified, the value must either be equal to zero or it must fall within the range from 32 kB through 1 GB (inclusive). If zero is passed or the option is not specified, no restriction is imposed on the transfer.TABLESPACE_MAP
Include information about symbolic links present in the directory pg_tblspc
in a file named tablespace_map
. The tablespace map file includes each symbolic link name as it exists in the directory pg_tblspc/
and the full path of that symbolic link.NOVERIFY_CHECKSUMS
By default, checksums are verified during a base backup if they are enabled. Specifying NOVERIFY_CHECKSUMS
disables this verification.MANIFEST
manifest_option
When this option is specified with a value of yes
or force-encode
, a backup manifest is created and sent along with the backup. The manifest is a list of every file present in the backup with the exception of any WAL files that may be included. It also stores the size, last modification time, and optionally a checksum for each file. A value of force-encode
forces all filenames to be hex-encoded; otherwise, this type of encoding is performed only for files whose names are non-UTF8 octet sequences. force-encode
is intended primarily for testing purposes, to be sure that clients which read the backup manifest can handle this case. For compatibility with previous releases, the default is MANIFEST 'no'
.MANIFEST_CHECKSUMS
checksum_algorithm
Specifies the checksum algorithm that should be applied to each file included in the backup manifest. Currently, the available algorithms are NONE
, CRC32C
, SHA224
, SHA256
, SHA384
, and SHA512
. The default is CRC32C
.
When the backup is started, the server will first send two ordinary result sets, followed by one or more CopyResponse results.
The first ordinary result set contains the starting position of the backup, in a single row with two columns. The first column contains the start position given in XLogRecPtr format, and the second column contains the corresponding timeline ID.
The second ordinary result set has one row for each tablespace. The fields in this row are:spcoid
(oid
)
The OID of the tablespace, or null if it's the base directory.spclocation
(text
)
The full path of the tablespace directory, or null if it's the base directory.size
(int8
)
The approximate size of the tablespace, in kilobytes (1024 bytes), if progress report has been requested; otherwise it's null.
After the second regular result set, one or more CopyResponse results will be sent, one for the main data directory and one for each additional tablespace other than pg_default
and pg_global
. The data in the CopyResponse results will be a tar format (following the “ustar interchange format” specified in the POSIX 1003.1-2008 standard) dump of the tablespace contents, except that the two trailing blocks of zeroes specified in the standard are omitted. After the tar data is complete, and if a backup manifest was requested, another CopyResponse result is sent, containing the manifest data for the current base backup. In any case, a final ordinary result set will be sent, containing the WAL end position of the backup, in the same format as the start position.
The tar archive for the data directory and each tablespace will contain all files in the directories, regardless of whether they are PostgreSQL files or other files added to the same directory. The only excluded files are:
postmaster.pid
postmaster.opts
pg_internal.init
(found in multiple directories)
Various temporary files and directories created during the operation of the PostgreSQL server, such as any file or directory beginning with pgsql_tmp
and temporary relations.
Unlogged relations, except for the init fork which is required to recreate the (empty) unlogged relation on recovery.
pg_wal
, including subdirectories. If the backup is run with WAL files included, a synthesized version of pg_wal
will be included, but it will only contain the files necessary for the backup to work, not the rest of the contents.
pg_dynshmem
, pg_notify
, pg_replslot
, pg_serial
, pg_snapshots
, pg_stat_tmp
, and pg_subtrans
are copied as empty directories (even if they are symbolic links).
Files other than regular files and directories, such as symbolic links (other than for the directories listed above) and special device files, are skipped. (Symbolic links in pg_tblspc
are maintained.)
Owner, group, and file mode are set if the underlying file system on the server supports it.
This section describes the logical replication protocol, which is the message flow started by the START_REPLICATION
SLOT
slot_name
LOGICAL
replication command.
The logical streaming replication protocol builds on the primitives of the physical streaming replication protocol.
The logical replication START_REPLICATION
command accepts following parameters:proto_version
Protocol version. Currently only version 1
is supported.publication_names
Comma separated list of publication names for which to subscribe (receive changes). The individual publication names are treated as standard objects names and can be quoted the same as needed.
The individual protocol messages are discussed in the following subsections. Individual messages are described in Section 52.9.
All top-level protocol messages begin with a message type byte. While represented in code as a character, this is a signed byte with no associated encoding.
Since the streaming replication protocol supplies a message length there is no need for top-level protocol messages to embed a length in their header.
With the exception of the START_REPLICATION
command and the replay progress messages, all information flows only from the backend to the frontend.
The logical replication protocol sends individual transactions one by one. This means that all messages between a pair of Begin and Commit messages belong to the same transaction.
Every sent transaction contains zero or more DML messages (Insert, Update, Delete). In case of a cascaded setup it can also contain Origin messages. The origin message indicated that the transaction originated on different replication node. Since a replication node in the scope of logical replication protocol can be pretty much anything, the only identifier is the origin name. It's downstream's responsibility to handle this as needed (if needed). The Origin message is always sent before any DML messages in the transaction.
Every DML message contains an arbitrary relation ID, which can be mapped to an ID in the Relation messages. The Relation messages describe the schema of the given relation. The Relation message is sent for a given relation either because it is the first time we send a DML message for given relation in the current session or because the relation definition has changed since the last Relation message was sent for it. The protocol assumes that the client is capable of caching the metadata for as many relations as needed.
PostgreSQLuses a message-based protocol for communication between frontends and backends (clients and servers). The protocol is supported overTCP/IPand also over Unix-domain sockets. Port number 5432 has been registered with IANA as the customary TCP port number for servers supporting this protocol, but in practice any non-privileged port number can be used.
This document describes version 3.0 of the protocol, implemented inPostgreSQL7.4 and later. For descriptions of the earlier protocol versions, see previous releases of thePostgreSQLdocumentation. A single server can support multiple protocol versions. The initial startup-request message tells the server which protocol version the client is attempting to use, and then the server follows that protocol if it is able.
In order to serve multiple clients efficiently, the server launches a new“backend”process for each client. In the current implementation, a new child process is created immediately after an incoming connection is detected. This is transparent to the protocol, however. For purposes of the protocol, the terms“backend”and“server”are interchangeable; likewise“frontend”and“client”are interchangeable.
The protocol has separate phases for startup and normal operation. In the startup phase, the frontend opens a connection to the server and authenticates itself to the satisfaction of the server. (This might involve a single message, or multiple messages depending on the authentication method being used.) If all goes well, the server then sends status information to the frontend, and finally enters normal operation. Except for the initial startup-request message, this part of the protocol is driven by the server.
During normal operation, the frontend sends queries and other commands to the backend, and the backend sends back query results and other responses. There are a few cases (such as NOTIFY
) wherein the backend will send unsolicited messages, but for the most part this portion of a session is driven by frontend requests.
Termination of the session is normally by frontend choice, but can be forced by the backend in certain cases. In any case, when the backend closes the connection, it will roll back any open (incomplete) transaction before exiting.
Within normal operation, SQL commands can be executed through either of two sub-protocols. In the “simple query” protocol, the frontend just sends a textual query string, which is parsed and immediately executed by the backend. In the “extended query” protocol, processing of queries is separated into multiple steps: parsing, binding of parameter values, and execution. This offers flexibility and performance benefits, at the cost of extra complexity.
Normal operation has additional sub-protocols for special operations such as COPY
.
All communication is through a stream of messages. The first byte of a message identifies the message type, and the next four bytes specify the length of the rest of the message (this length count includes itself, but not the message-type byte). The remaining contents of the message are determined by the message type. For historical reasons, the very first message sent by the client (the startup message) has no initial message-type byte.
To avoid losing synchronization with the message stream, both servers and clients typically read an entire message into a buffer (using the byte count) before attempting to process its contents. This allows easy recovery if an error is detected while processing the contents. In extreme situations (such as not having enough memory to buffer the message), the receiver can use the byte count to determine how much input to skip before it resumes reading messages.
Conversely, both servers and clients must take care never to send an incomplete message. This is commonly done by marshaling the entire message in a buffer before beginning to send it. If a communications failure occurs partway through sending or receiving a message, the only sensible response is to abandon the connection, since there is little hope of recovering message-boundary synchronization.
In the extended-query protocol, execution of SQL commands is divided into multiple steps. The state retained between steps is represented by two types of objects: prepared statements and portals. A prepared statement represents the result of parsing and semantic analysis of a textual query string. A prepared statement is not in itself ready to execute, because it might lack specific values for parameters. A portal represents a ready-to-execute or already-partially-executed statement, with any missing parameter values filled in. (For SELECT
statements, a portal is equivalent to an open cursor, but we choose to use a different term since cursors don't handle non-SELECT
statements.)
The overall execution cycle consists of a parse step, which creates a prepared statement from a textual query string; a bind step, which creates a portal given a prepared statement and values for any needed parameters; and an execute step that runs a portal's query. In the case of a query that returns rows (SELECT
, SHOW
, etc), the execute step can be told to fetch only a limited number of rows, so that multiple execute steps might be needed to complete the operation.
The backend can keep track of multiple prepared statements and portals (but note that these exist only within a session, and are never shared across sessions). Existing prepared statements and portals are referenced by names assigned when they were created. In addition, an “unnamed” prepared statement and portal exist. Although these behave largely the same as named objects, operations on them are optimized for the case of executing a query only once and then discarding it, whereas operations on named objects are optimized on the expectation of multiple uses.
Data of a particular data type might be transmitted in any of several different formats. As of PostgreSQL 7.4 the only supported formats are “text” and “binary”, but the protocol makes provision for future extensions. The desired format for any value is specified by a format code. Clients can specify a format code for each transmitted parameter value and for each column of a query result. Text has format code zero, binary has format code one, and all other format codes are reserved for future definition.
The text representation of values is whatever strings are produced and accepted by the input/output conversion functions for the particular data type. In the transmitted representation, there is no trailing null character; the frontend must add one to received values if it wants to process them as C strings. (The text format does not allow embedded nulls, by the way.)
Binary representations for integers use network byte order (most significant byte first). For other data types consult the documentation or source code to learn about the binary representation. Keep in mind that binary representations for complex data types might change across server versions; the text format is usually the more portable choice.
The view pg_tables
provides access to useful information about each table in the database.
pg_tables
ColumnsName
Type
References
Description
schemaname
name
.nspname
Name of schema containing table
tablename
name
.relname
Name of table
tableowner
name
.rolname
Name of table's owner
tablespace
name
.spcname
Name of tablespace containing table (null if default for database)
hasindexes
boolean
.relhasindex
True if table has (or recently had) any indexes
hasrules
boolean
.relhasrules
True if table has (or once had) rules
hastriggers
boolean
.relhastriggers
True if table has (or once had) triggers
rowsecurity
boolean
.relrowsecurity
True if row security is enabled on the table
Source code formatting uses 4 column tab spacing, with tabs preserved (i.e., tabs are not expanded to spaces). Each logical indentation level is one additional tab stop.
Layout rules (brace positioning, etc) follow BSD conventions. In particular, curly braces for the controlled blocks of if
, while
, switch
, etc go on their own lines.
Limit line lengths so that the code is readable in an 80-column window. (This doesn't mean that you must never go past 80 columns. For instance, breaking a long error message string in arbitrary places just to keep the code within 80 columns is probably not a net gain in readability.)
Do not use C++ style comments (//
comments). Strict ANSI C compilers do not accept them. For the same reason, do not use C++ extensions such as declaring new variables mid-block.
The preferred style for multi-line comment blocks is
Note that comment blocks that begin in column 1 will be preserved as-is by pgindent, but it will re-flow indented comment blocks as though they were plain text. If you want to preserve the line breaks in an indented block, add dashes like this:
While submitted patches do not absolutely have to follow these formatting rules, it's a good idea to do so. Your code will get run through pgindent before the next release, so there's no point in making it look nice under some other set of formatting conventions. A good rule of thumb for patches is “make the new code look like the existing code around it”.
The src/tools
directory contains sample settings files that can be used with the emacs, xemacs or vim editors to help ensure that they format code according to these conventions.
The text browsing tools more and less can be invoked as:
to make them show tabs appropriately.
Error, warning, and log messages generated within the server code should be created using ereport
, or its older cousin elog
. The use of this function is complex enough to require some explanation.
There are two required elements for every message: a severity level (ranging from DEBUG
to PANIC
) and a primary message text. In addition there are optional elements, the most common of which is an error identifier code that follows the SQL spec's SQLSTATE conventions. ereport
itself is just a shell function, that exists mainly for the syntactic convenience of making message generation look like a function call in the C source code. The only parameter accepted directly by ereport
is the severity level. The primary message text and any optional message elements are generated by calling auxiliary functions, such as errmsg
, within the ereport
call.
A typical call to ereport
might look like this:
This specifies error severity level ERROR
(a run-of-the-mill error). The errcode
call specifies the SQLSTATE error code using a macro defined in src/include/utils/errcodes.h
. The errmsg
call provides the primary message text. Notice the extra set of parentheses surrounding the auxiliary function calls — these are annoying but syntactically necessary.
Here is a more complex example:
This illustrates the use of format codes to embed run-time values into a message text. Also, an optional “hint” message is provided.
If the severity level is ERROR
or higher, ereport
aborts the execution of the user-defined function and does not return to the caller. If the severity level is lower than ERROR
, ereport
returns normally.
The available auxiliary routines for ereport
are:
errcode(sqlerrcode)
specifies the SQLSTATE error identifier code for the condition. If this routine is not called, the error identifier defaults to ERRCODE_INTERNAL_ERROR
when the error severity level is ERROR
or higher, ERRCODE_WARNING
when the error level is WARNING
, otherwise (for NOTICE
and below) ERRCODE_SUCCESSFUL_COMPLETION
. While these defaults are often convenient, always think whether they are appropriate before omitting the errcode()
call.
errmsg(const char *msg, ...)
specifies the primary error message text, and possibly run-time values to insert into it. Insertions are specified by sprintf
-style format codes. In addition to the standard format codes accepted by sprintf
, the format code %m
can be used to insert the error message returned by strerror
for the current value of errno
. [13] %m
does not require any corresponding entry in the parameter list for errmsg
. Note that the message string will be run through gettext
for possible localization before format codes are processed.
errmsg_internal(const char *msg, ...)
is the same as errmsg
, except that the message string will not be translated nor included in the internationalization message dictionary. This should be used for “cannot happen” cases that are probably not worth expending translation effort on.
errmsg_plural(const char *fmt_singular, const char *fmt_plural, unsigned long n, ...)
is like errmsg
, but with support for various plural forms of the message. fmt_singular
is the English singular format, fmt_plural
is the English plural format, n
is the integer value that determines which plural form is needed, and the remaining arguments are formatted according to the selected format string. For more information see Section 55.2.2.
errdetail(const char *msg, ...)
supplies an optional “detail” message; this is to be used when there is additional information that seems inappropriate to put in the primary message. The message string is processed in just the same way as for errmsg
.
errdetail_internal(const char *msg, ...)
is the same as errdetail
, except that the message string will not be translated nor included in the internationalization message dictionary. This should be used for detail messages that are not worth expending translation effort on, for instance because they are too technical to be useful to most users.
errdetail_plural(const char *fmt_singular, const char *fmt_plural, unsigned long n, ...)
is like errdetail
, but with support for various plural forms of the message. For more information see Section 55.2.2.
errdetail_log(const char *msg, ...)
is the same as errdetail
except that this string goes only to the server log, never to the client. If both errdetail
(or one of its equivalents above) and errdetail_log
are used then one string goes to the client and the other to the log. This is useful for error details that are too security-sensitive or too bulky to include in the report sent to the client.
errdetail_log_plural(const char *fmt_singular, const char *fmt_plural, unsigned long n, ...)
is like errdetail_log
, but with support for various plural forms of the message. For more information see Section 55.2.2.
errhint(const char *msg, ...)
supplies an optional “hint” message; this is to be used when offering suggestions about how to fix the problem, as opposed to factual details about what went wrong. The message string is processed in just the same way as for errmsg
.
errcontext(const char *msg, ...)
is not normally called directly from an ereport
message site; rather it is used in error_context_stack
callback functions to provide information about the context in which an error occurred, such as the current location in a PL function. The message string is processed in just the same way as for errmsg
. Unlike the other auxiliary functions, this can be called more than once per ereport
call; the successive strings thus supplied are concatenated with separating newlines.
errposition(int cursorpos)
specifies the textual location of an error within a query string. Currently it is only useful for errors detected in the lexical and syntactic analysis phases of query processing.
errtable(Relation rel)
specifies a relation whose name and schema name should be included as auxiliary fields in the error report.
errtablecol(Relation rel, int attnum)
specifies a column whose name, table name, and schema name should be included as auxiliary fields in the error report.
errtableconstraint(Relation rel, const char *conname)
specifies a table constraint whose name, table name, and schema name should be included as auxiliary fields in the error report. Indexes should be considered to be constraints for this purpose, whether or not they have an associated pg_constraint
entry. Be careful to pass the underlying heap relation, not the index itself, as rel
.
errdatatype(Oid datatypeOid)
specifies a data type whose name and schema name should be included as auxiliary fields in the error report.
errdomainconstraint(Oid datatypeOid, const char *conname)
specifies a domain constraint whose name, domain name, and schema name should be included as auxiliary fields in the error report.
errcode_for_file_access()
is a convenience function that selects an appropriate SQLSTATE error identifier for a failure in a file-access-related system call. It uses the saved errno
to determine which error code to generate. Usually this should be used in combination with %m
in the primary error message text.
errcode_for_socket_access()
is a convenience function that selects an appropriate SQLSTATE error identifier for a failure in a socket-related system call.
errhidestmt(bool hide_stmt)
can be called to specify suppression of the STATEMENT:
portion of a message in the postmaster log. Generally this is appropriate if the message text includes the current statement already.
errhidecontext(bool hide_ctx)
can be called to specify suppression of the CONTEXT:
portion of a message in the postmaster log. This should only be used for verbose debugging messages where the repeated inclusion of context would bloat the log volume too much.
At most one of the functions errtable
, errtablecol
, errtableconstraint
, errdatatype
, or errdomainconstraint
should be used in an ereport
call. These functions exist to allow applications to extract the name of a database object associated with the error condition without having to examine the potentially-localized error message text. These functions should be used in error reports for which it's likely that applications would wish to have automatic error handling. As of PostgreSQL 9.3, complete coverage exists only for errors in SQLSTATE class 23 (integrity constraint violation), but this is likely to be expanded in future.
There is an older function elog
that is still heavily used. An elog
call:
is exactly equivalent to:
Notice that the SQLSTATE error code is always defaulted, and the message string is not subject to translation. Therefore, elog
should be used only for internal errors and low-level debug logging. Any message that is likely to be of interest to ordinary users should go through ereport
. Nonetheless, there are enough internal “cannot happen” error checks in the system that elog
is still widely used; it is preferred for those messages for its notational simplicity.
Advice about writing good error messages can be found in Section 54.3.
[13] That is, the value that was current when the ereport
call was reached; changes of errno
within the auxiliary reporting routines will not affect it. That would not be true if you were to write strerror(errno)
explicitly in errmsg
's parameter list; accordingly, do not do so.
Code in PostgreSQL should only rely on language features available in the C89 standard. That means a conforming C89 compiler has to be able to compile postgres, at least aside from a few platform dependent pieces. Features from later revision of the C standard or compiler specific features can be used, if a fallback is provided.
For example static inline
and _StaticAssert()
are currently used, even though they are from newer revisions of the C standard. If not available we respectively fall back to defining the functions without inline, and to using a C89 compatible replacement that performs the same checks, but emits rather cryptic messages.
Both, macros with arguments and static inline
functions, may be used. The latter are preferable if there are multiple-evaluation hazards when written as a macro, as e.g. the case with
or when the macro would be very long. In other cases it's only possible to use macros, or at least easier. For example because expressions of various types need to be passed to the macro.
When the definition of an inline function references symbols (i.e. variables, functions) that are only available as part of the backend, the function may not be visible when included from frontend code.
In this example CurrentMemoryContext
, which is only available in the backend, is referenced and the function thus hidden with a #ifndef FRONTEND
. This rule exists because some compilers emit references to symbols contained in inline functions even if the function is not used.
To be suitable to run inside a signal handler code has to be written very carefully. The fundamental problem is that, unless blocked, a signal handler can interrupt code at any time. If code inside the signal handler uses the same state as code outside chaos may ensue. As an example consider what happens if a signal handler tries to acquire a lock that's already held in the interrupted code.
Barring special arrangements code in signal handlers may only call async-signal safe functions (as defined in POSIX) and access variables of type volatile sig_atomic_t
. A few functions in postgres
are also deemed signal safe, importantly SetLatch()
.
In most cases signal handlers should do nothing more than note that a signal has arrived, and wake up code running outside of the handler using a latch. An example of such a handler is the following:
errno
is saved and restored because SetLatch()
might change it. If that were not done interrupted code that's currently inspecting errno
might see the wrong value.
For clarity, it is preferred to explicitly dereference a function pointer when calling the pointed-to function if the pointer is a simple variable, for example:
(even though emit_log_hook(edata)
would also work). When the function pointer is part of a structure, then the extra punctuation can and usually should be omitted, for example:
This section describes how to implement native language support in a program or library that is part of the PostgreSQL distribution. Currently, it only applies to C programs.
Adding NLS Support to a Program
Insert this code into the start-up sequence of the program:
(The progname
can actually be chosen freely.)
Wherever a message that is a candidate for translation is found, a call to gettext()
needs to be inserted. E.g.:
would be changed to:
(gettext
is defined as a no-op if NLS support is not configured.)
This tends to add a lot of clutter. One common shortcut is to use:
Another solution is feasible if the program does much of its communication through one or a few functions, such as ereport()
in the backend. Then you make this function call gettext
internally on all input strings.
Add a file nls.mk
in the directory with the program sources. This file will be read as a makefile. The following variable assignments need to be made here:CATALOG_NAME
The program name, as provided in the textdomain()
call.AVAIL_LANGUAGES
List of provided translations — initially empty.GETTEXT_FILES
List of files that contain translatable strings, i.e., those marked with gettext
or an alternative solution. Eventually, this will include nearly all source files of the program. If this list gets too long you can make the first “file” be a +
and the second word be a file that contains one file name per line.GETTEXT_TRIGGERS
The tools that generate message catalogs for the translators to work on need to know what function calls contain translatable strings. By default, only gettext()
calls are known. If you used _
or other identifiers you need to list them here. If the translatable string is not the first argument, the item needs to be of the form func:2
(for the second argument). If you have a function that supports pluralized messages, the item should look like func:1,2
(identifying the singular and plural message arguments).
The build system will automatically take care of building and installing the message catalogs.
Here are some guidelines for writing messages that are easily translatable.
Do not construct sentences at run-time, like:
The word order within the sentence might be different in other languages. Also, even if you remember to call gettext()
on each fragment, the fragments might not translate well separately. It's better to duplicate a little code so that each message to be translated is a coherent whole. Only numbers, file names, and such-like run-time variables should be inserted at run time into a message text.
For similar reasons, this won't work:
because it assumes how the plural is formed. If you figured you could solve it like this:
then be disappointed. Some languages have more than two forms, with some peculiar rules. It's often best to design the message to avoid the issue altogether, for instance like this:
If you really want to construct a properly pluralized message, there is support for this, but it's a bit awkward. When generating a primary or detail error message in ereport()
, you can write something like this:
The first argument is the format string appropriate for English singular form, the second is the format string appropriate for English plural form, and the third is the integer control value that determines which plural form to use. Subsequent arguments are formatted per the format string as usual. (Normally, the pluralization control value will also be one of the values to be formatted, so it has to be written twice.) In English it only matters whether n
is 1 or not 1, but in other languages there can be many different plural forms. The translator sees the two English forms as a group and has the opportunity to supply multiple substitute strings, with the appropriate one being selected based on the run-time value of n
.
If you need to pluralize a message that isn't going directly to an errmsg
or errdetail
report, you have to use the underlying function ngettext
. See the gettext documentation.
If you want to communicate something to the translator, such as about how a message is intended to line up with other output, precede the occurrence of the string with a comment that starts with translator
, e.g.:
These comments are copied to the message catalog files so that the translators can see them.
This style guide is offered in the hope of maintaining a consistent, user-friendly style throughout all the messages generated by PostgreSQL.
The primary message should be short, factual, and avoid reference to implementation details such as specific function names. “Short” means “should fit on one line under normal conditions”. Use a detail message if needed to keep the primary message short, or if you feel a need to mention implementation details such as the particular system call that failed. Both primary and detail messages should be factual. Use a hint message for suggestions about what to do to fix the problem, especially if the suggestion might not always be applicable.
For example, instead of:
write:
Rationale: keeping the primary message short helps keep it to the point, and lets clients lay out screen space on the assumption that one line is enough for error messages. Detail and hint messages can be relegated to a verbose mode, or perhaps a pop-up error-details window. Also, details and hints would normally be suppressed from the server log to save space. Reference to implementation details is best avoided since users aren't expected to know the details.
Don't put any specific assumptions about formatting into the message texts. Expect clients and the server log to wrap lines to fit their own needs. In long messages, newline characters (\n) can be used to indicate suggested paragraph breaks. Don't end a message with a newline. Don't use tabs or other formatting characters. (In error context displays, newlines are automatically added to separate levels of context such as function calls.)
Rationale: Messages are not necessarily displayed on terminal-type displays. In GUI displays or browsers these formatting instructions are at best ignored.
English text should use double quotes when quoting is appropriate. Text in other languages should consistently use one kind of quotes that is consistent with publishing customs and computer output of other programs.
Rationale: The choice of double quotes over single quotes is somewhat arbitrary, but tends to be the preferred use. Some have suggested choosing the kind of quotes depending on the type of object according to SQL conventions (namely, strings single quoted, identifiers double quoted). But this is a language-internal technical issue that many users aren't even familiar with, it won't scale to other kinds of quoted terms, it doesn't translate to other languages, and it's pretty pointless, too.
Use quotes always to delimit file names, user-supplied identifiers, and other variables that might contain words. Do not use them to mark up variables that will not contain words (for example, operator names).
There are functions in the backend that will double-quote their own output at need (for example, format_type_be()
). Do not put additional quotes around the output of such functions.
Rationale: Objects can have names that create ambiguity when embedded in a message. Be consistent about denoting where a plugged-in name starts and ends. But don't clutter messages with unnecessary or duplicate quote marks.
The rules are different for primary error messages and for detail/hint messages:
Primary error messages: Do not capitalize the first letter. Do not end a message with a period. Do not even think about ending a message with an exclamation point.
Detail and hint messages: Use complete sentences, and end each with a period. Capitalize the first word of sentences. Put two spaces after the period if another sentence follows (for English text; might be inappropriate in other languages).
Error context strings: Do not capitalize the first letter and do not end the string with a period. Context strings should normally not be complete sentences.
Rationale: Avoiding punctuation makes it easier for client applications to embed the message into a variety of grammatical contexts. Often, primary messages are not grammatically complete sentences anyway. (And if they're long enough to be more than one sentence, they should be split into primary and detail parts.) However, detail and hint messages are longer and might need to include multiple sentences. For consistency, they should follow complete-sentence style even when there's only one sentence.
Use lower case for message wording, including the first letter of a primary error message. Use upper case for SQL commands and key words if they appear in the message.
Rationale: It's easier to make everything look more consistent this way, since some messages are complete sentences and some not.
Use the active voice. Use complete sentences when there is an acting subject (“A could not do B”). Use telegram style without subject if the subject would be the program itself; do not use “I” for the program.
Rationale: The program is not human. Don't pretend otherwise.
Use past tense if an attempt to do something failed, but could perhaps succeed next time (perhaps after fixing some problem). Use present tense if the failure is certainly permanent.
There is a nontrivial semantic difference between sentences of the form:
and:
The first one means that the attempt to open the file failed. The message should give a reason, such as “disk full” or “file doesn't exist”. The past tense is appropriate because next time the disk might not be full anymore or the file in question might exist.
The second form indicates that the functionality of opening the named file does not exist at all in the program, or that it's conceptually impossible. The present tense is appropriate because the condition will persist indefinitely.
Rationale: Granted, the average user will not be able to draw great conclusions merely from the tense of the message, but since the language provides us with a grammar we should use it correctly.
When citing the name of an object, state what kind of object it is.
Rationale: Otherwise no one will know what “foo.bar.baz” refers to.
Square brackets are only to be used (1) in command synopses to denote optional arguments, or (2) to denote an array subscript.
Rationale: Anything else does not correspond to widely-known customary usage and will confuse people.
When a message includes text that is generated elsewhere, embed it in this style:
Rationale: It would be difficult to account for all possible error codes to paste this into a single smooth sentence, so some sort of punctuation is needed. Putting the embedded text in parentheses has also been suggested, but it's unnatural if the embedded text is likely to be the most important part of the message, as is often the case.
Messages should always state the reason why an error occurred. For example:
If no reason is known you better fix the code.
Don't include the name of the reporting routine in the error text. We have other mechanisms for finding that out when needed, and for most users it's not helpful information. If the error text doesn't make as much sense without the function name, reword it.
Avoid mentioning called function names, either; instead say what the code was trying to do:
If it really seems necessary, mention the system call in the detail message. (In some cases, providing the actual values passed to the system call might be appropriate information for the detail message.)
Rationale: Users don't know what all those functions do.
Unable. “Unable” is nearly the passive voice. Better use “cannot” or “could not”, as appropriate.
Bad. Error messages like “bad result” are really hard to interpret intelligently. It's better to write why the result is “bad”, e.g., “invalid format”.
Illegal. “Illegal” stands for a violation of the law, the rest is “invalid”. Better yet, say why it's invalid.
Unknown. Try to avoid “unknown”. Consider “error: unknown response”. If you don't know what the response is, how do you know it's erroneous? “Unrecognized” is often a better choice. Also, be sure to include the value being complained of.
Find vs. Exists. If the program uses a nontrivial algorithm to locate a resource (e.g., a path search) and that algorithm fails, it is fair to say that the program couldn't “find” the resource. If, on the other hand, the expected location of the resource is known but the program cannot access it there then say that the resource doesn't “exist”. Using “find” in this case sounds weak and confuses the issue.
May vs. Can vs. Might. “May” suggests permission (e.g., "You may borrow my rake."), and has little use in documentation or error messages. “Can” suggests ability (e.g., "I can lift that log."), and “might” suggests possibility (e.g., "It might rain today."). Using the proper word clarifies meaning and assists translation.
Contractions. Avoid contractions, like “can't”; use “cannot” instead.
Spell out words in full. For instance, avoid:
spec
stats
parens
auth
xact
Rationale: This will improve consistency.
Keep in mind that error message texts need to be translated into other languages. Follow the guidelines in Section 55.2.2 to avoid making life difficult for translators.
All operations on a foreign table are handled through its foreign data wrapper, which consists of a set of functions that the core server calls. The foreign data wrapper is responsible for fetching data from the remote data source and returning it to the PostgreSQL executor. If updating foreign tables is to be supported, the wrapper must handle that, too. This chapter outlines how to write a new foreign data wrapper.
The foreign data wrappers included in the standard distribution are good references when trying to write your own. Look into the contrib
subdirectory of the source tree. The CREATE FOREIGN DATA WRAPPER reference page also has some useful details.
The SQL standard specifies an interface for writing foreign data wrappers. However, PostgreSQL does not implement that API, because the effort to accommodate it into PostgreSQL would be large, and the standard API hasn't gained wide adoption anyway.
使用目前編譯語言的「version 1」介面以外的語言撰寫的函數(包括使用者定義的程序語言函數和用 SQL 撰寫的函數)在被呼叫時,都將透過特定語言的呼叫處理程序函數。呼叫處理程序有責任以有意義的方式執行功能,例如透過解譯程式原始碼。本章概述如何撰寫新程序語言的呼叫處理程序。
程序語言的呼叫處理程序是「一般」函數,必須使用「version 1」介面以編譯語言(例如 C)撰寫,並在 PostgreSQL 中註冊為不帶任何參數且回傳 language_handler 型別。這種特殊的偽型別將函數標識為呼叫處理程序,並防止在 SQL 指令中直接呼叫該函數。有關 C 語言呼叫約定和動態載入的更多詳細訊,請參閱第 37.10 節。
呼叫處理程序的呼叫方式與其他任何函數相同:它會接收到一個指向 FunctionCallInfoBaseData 結構的指標,該結構包含參數值和有關被呼叫函數的資訊,並且期待回傳 Datum 結果(也可能設定 FunctionCallInfoBaseData 結構中 isnull 的欄位,如果該函數希望回傳 SQL NULL 的結果)。呼叫處理程序和普通被呼叫函數之間的區別在於 FunctionCallInfoBaseData 結構的 flinfo->fn_oid 欄位會包含要呼叫實際函數的 OID,而不包含呼叫處理程序本身的 OID。呼叫處理程序必須使用此欄位來決定所要執行的功能。同樣地,傳遞的參數列表是根據目標函數而不是呼叫處理程序的宣告設定。
呼叫處理程序要從 pg_proc 系統目錄中取得函數的項目,並分析被呼叫函數的參數和回傳型別。該函數的 CREATE FUNCTION 命令的 AS 子句可以在 pg_proc 行的 prosrc 欄位中看到。這通常是程序語言的原始碼,但是從理論上說,它也可能是其他內容,例如檔案的路徑名稱,或告訴呼叫處理程序詳細操作的其他任何內容。
通常,每個 SQL 語句會多次呼叫同一函數。 呼叫處理程序可以透過使用 flinfo->fn_extra 欄位來避免重複查詢有關被呼叫函數的資訊。最初將為 NULL,但可以由呼叫處理程序設定為指向有關被呼叫函數的資訊。在後續的呼叫中,如果 flinfo->fn_extra 已經為非 NULL,則可以使用它,並且跳過資訊查詢步驟。呼叫處理程序必須確保使 flinfo->fn_extra 指向至少可以保存到目前查詢結束的快取,因為 FmgrInfo 資料結構可以保留那麼長時間。一種方法是在 flinfo->fn_mcxt 指定的快取內容中分配額外的資料。此類資料通常與 FmgrInfo 本身俱有相同的壽命。但是處理程序還可以選擇使用壽命更長的快取,以便它可以跨查詢快取函數定義資訊。
當將程序語言函數作為事件觸發器呼叫時,不會以通常的方式傳遞任何參數,但是 FunctionCallInfoBaseData 的 context 欄位指向 TriggerData 結構,而不是像在普通函數呼叫中那樣為 NULL。語言處理程序應提供程序語言函數的機制,以取得所觸發資訊。
This is a template for a procedural-language handler written in C:
Only a few thousand lines of code have to be added instead of the dots to complete the call handler.
After having compiled the handler function into a loadable module (see Section 37.10.5), the following commands then register the sample procedural language:
Although providing a call handler is sufficient to create a minimal procedural language, there are two other functions that can optionally be provided to make the language more convenient to use. These are a validator and an inline handler. A validator can be provided to allow language-specific checking to be done during CREATE FUNCTION. An inline handler can be provided to allow the language to support anonymous code blocks executed via the DO command.
If a validator is provided by a procedural language, it must be declared as a function taking a single parameter of type oid
. The validator's result is ignored, so it is customarily declared to return void
. The validator will be called at the end of a CREATE FUNCTION
command that has created or updated a function written in the procedural language. The passed-in OID is the OID of the function's pg_proc
row. The validator must fetch this row in the usual way, and do whatever checking is appropriate. First, call CheckFunctionValidatorAccess()
to diagnose explicit calls to the validator that the user could not achieve through CREATE FUNCTION
. Typical checks then include verifying that the function's argument and result types are supported by the language, and that the function's body is syntactically correct in the language. If the validator finds the function to be okay, it should just return. If it finds an error, it should report that via the normal ereport()
error reporting mechanism. Throwing an error will force a transaction rollback and thus prevent the incorrect function definition from being committed.
Validator functions should typically honor the check_function_bodies parameter: if it is turned off then any expensive or context-sensitive checking should be skipped. If the language provides for code execution at compilation time, the validator must suppress checks that would induce such execution. In particular, this parameter is turned off by pg_dump so that it can load procedural language functions without worrying about side effects or dependencies of the function bodies on other database objects. (Because of this requirement, the call handler should avoid assuming that the validator has fully checked the function. The point of having a validator is not to let the call handler omit checks, but to notify the user immediately if there are obvious errors in a CREATE FUNCTION
command.) While the choice of exactly what to check is mostly left to the discretion of the validator function, note that the core CREATE FUNCTION
code only executes SET
clauses attached to a function when check_function_bodies
is on. Therefore, checks whose results might be affected by GUC parameters definitely should be skipped when check_function_bodies
is off, to avoid false failures when reloading a dump.
If an inline handler is provided by a procedural language, it must be declared as a function taking a single parameter of type internal
. The inline handler's result is ignored, so it is customarily declared to return void
. The inline handler will be called when a DO
statement is executed specifying the procedural language. The parameter actually passed is a pointer to an InlineCodeBlock
struct, which contains information about the DO
statement's parameters, in particular the text of the anonymous code block to be executed. The inline handler should execute this code and return.
It's recommended that you wrap all these function declarations, as well as the CREATE LANGUAGE
command itself, into an extension so that a simple CREATE EXTENSION
command is sufficient to install the language. See Section 37.17 for information about writing extensions.
The procedural languages included in the standard distribution are good references when trying to write your own language handler. Look into the src/pl
subdirectory of the source tree. The CREATE LANGUAGE reference page also has some useful details.
The FDW author needs to implement a handler function, and optionally a validator function. Both functions must be written in a compiled language such as C, using the version-1 interface. For details on C language calling conventions and dynamic loading, see Section 37.10.
The handler function simply returns a struct of function pointers to callback functions that will be called by the planner, executor, and various maintenance commands. Most of the effort in writing an FDW is in implementing these callback functions. The handler function must be registered with PostgreSQL as taking no arguments and returning the special pseudo-type fdw_handler
. The callback functions are plain C functions and are not visible or callable at the SQL level. The callback functions are described in Section 56.2.
The validator function is responsible for validating options given in CREATE
and ALTER
commands for its foreign data wrapper, as well as foreign servers, user mappings, and foreign tables using the wrapper. The validator function must be registered as taking two arguments, a text array containing the options to be validated, and an OID representing the type of object the options are associated with (in the form of the OID of the system catalog the object would be stored in, either ForeignDataWrapperRelationId
, ForeignServerRelationId
, UserMappingRelationId
, or ForeignTableRelationId
). If no validator function is supplied, options are not checked at object creation time or object alteration time.
Written by Martin Utesch (<
utesch@aut.tu-freiberg.de
>
) for the Institute of Automatic Control at the University of Mining and Technology in Freiberg, Germany.
The FDW callback functions GetForeignRelSize
, GetForeignPaths
, GetForeignPlan
, PlanForeignModify
, GetForeignJoinPaths
, GetForeignUpperPaths
, and PlanDirectModify
must fit into the workings of the PostgreSQL planner. Here are some notes about what they must do.
The information in root
and baserel
can be used to reduce the amount of information that has to be fetched from the foreign table (and therefore reduce the cost). baserel->baserestrictinfo
is particularly interesting, as it contains restriction quals (WHERE
clauses) that should be used to filter the rows to be fetched. (The FDW itself is not required to enforce these quals, as the core executor can check them instead.) baserel->reltarget->exprs
can be used to determine which columns need to be fetched; but note that it only lists columns that have to be emitted by the ForeignScan
plan node, not columns that are used in qual evaluation but not output by the query.
Various private fields are available for the FDW planning functions to keep information in. Generally, whatever you store in FDW private fields should be palloc'd, so that it will be reclaimed at the end of planning.
baserel->fdw_private
is a void
pointer that is available for FDW planning functions to store information relevant to the particular foreign table. The core planner does not touch it except to initialize it to NULL when the RelOptInfo
node is created. It is useful for passing information forward from GetForeignRelSize
to GetForeignPaths
and/or GetForeignPaths
to GetForeignPlan
, thereby avoiding recalculation.
GetForeignPaths
can identify the meaning of different access paths by storing private information in the fdw_private
field of ForeignPath
nodes. fdw_private
is declared as a List
pointer, but could actually contain anything since the core planner does not touch it. However, best practice is to use a representation that's dumpable by nodeToString
, for use with debugging support available in the backend.
GetForeignPlan
can examine the fdw_private
field of the selected ForeignPath
node, and can generate fdw_exprs
and fdw_private
lists to be placed in the ForeignScan
plan node, where they will be available at execution time. Both of these lists must be represented in a form that copyObject
knows how to copy. The fdw_private
list has no other restrictions and is not interpreted by the core backend in any way. The fdw_exprs
list, if not NIL, is expected to contain expression trees that are intended to be executed at run time. These trees will undergo post-processing by the planner to make them fully executable.
In GetForeignPlan
, generally the passed-in target list can be copied into the plan node as-is. The passed scan_clauses
list contains the same clauses as baserel->baserestrictinfo
, but may be re-ordered for better execution efficiency. In simple cases the FDW can just strip RestrictInfo
nodes from the scan_clauses
list (using extract_actual_clauses
) and put all the clauses into the plan node's qual list, which means that all the clauses will be checked by the executor at run time. More complex FDWs may be able to check some of the clauses internally, in which case those clauses can be removed from the plan node's qual list so that the executor doesn't waste time rechecking them.
As an example, the FDW might identify some restriction clauses of the form foreign_variable
=
sub_expression
, which it determines can be executed on the remote server given the locally-evaluated value of the sub_expression
. The actual identification of such a clause should happen during GetForeignPaths
, since it would affect the cost estimate for the path. The path's fdw_private
field would probably include a pointer to the identified clause's RestrictInfo
node. Then GetForeignPlan
would remove that clause from scan_clauses
, but add the sub_expression
to fdw_exprs
to ensure that it gets massaged into executable form. It would probably also put control information into the plan node's fdw_private
field to tell the execution functions what to do at run time. The query transmitted to the remote server would involve something like WHERE
foreign_variable
= $1, with the parameter value obtained at run time from evaluation of the fdw_exprs
expression tree.
Any clauses removed from the plan node's qual list must instead be added to fdw_recheck_quals
or rechecked by RecheckForeignScan
in order to ensure correct behavior at the READ COMMITTED
isolation level. When a concurrent update occurs for some other table involved in the query, the executor may need to verify that all of the original quals are still satisfied for the tuple, possibly against a different set of parameter values. Using fdw_recheck_quals
is typically easier than implementing checks inside RecheckForeignScan
, but this method will be insufficient when outer joins have been pushed down, since the join tuples in that case might have some fields go to NULL without rejecting the tuple entirely.
Another ForeignScan
field that can be filled by FDWs is fdw_scan_tlist
, which describes the tuples returned by the FDW for this plan node. For simple foreign table scans this can be set to NIL
, implying that the returned tuples have the row type declared for the foreign table. A non-NIL
value must be a target list (list of TargetEntry
s) containing Vars and/or expressions representing the returned columns. This might be used, for example, to show that the FDW has omitted some columns that it noticed won't be needed for the query. Also, if the FDW can compute expressions used by the query more cheaply than can be done locally, it could add those expressions to fdw_scan_tlist
. Note that join plans (created from paths made by GetForeignJoinPaths
) must always supply fdw_scan_tlist
to describe the set of columns they will return.
The FDW should always construct at least one path that depends only on the table's restriction clauses. In join queries, it might also choose to construct path(s) that depend on join clauses, for example foreign_variable
=
local_variable
. Such clauses will not be found in baserel->baserestrictinfo
but must be sought in the relation's join lists. A path using such a clause is called a “parameterized path”. It must identify the other relations used in the selected join clause(s) with a suitable value of param_info
; use get_baserel_parampathinfo
to compute that value. In GetForeignPlan
, the local_variable
portion of the join clause would be added to fdw_exprs
, and then at run time the case works the same as for an ordinary restriction clause.
If an FDW supports remote joins, GetForeignJoinPaths
should produce ForeignPath
s for potential remote joins in much the same way as GetForeignPaths
works for base tables. Information about the intended join can be passed forward to GetForeignPlan
in the same ways described above. However, baserestrictinfo
is not relevant for join relations; instead, the relevant join clauses for a particular join are passed to GetForeignJoinPaths
as a separate parameter (extra->restrictlist
).
An FDW might additionally support direct execution of some plan actions that are above the level of scans and joins, such as grouping or aggregation. To offer such options, the FDW should generate paths and insert them into the appropriate upper relation. For example, a path representing remote aggregation should be inserted into the UPPERREL_GROUP_AGG
relation, using add_path
. This path will be compared on a cost basis with local aggregation performed by reading a simple scan path for the foreign relation (note that such a path must also be supplied, else there will be an error at plan time). If the remote-aggregation path wins, which it usually would, it will be converted into a plan in the usual way, by calling GetForeignPlan
. The recommended place to generate such paths is in the GetForeignUpperPaths
callback function, which is called for each upper relation (i.e., each post-scan/join processing step), if all the base relations of the query come from the same FDW.
PlanForeignModify
and the other callbacks described in are designed around the assumption that the foreign relation will be scanned in the usual way and then individual row updates will be driven by a local ModifyTable
plan node. This approach is necessary for the general case where an update requires reading local tables as well as foreign tables. However, if the operation could be executed entirely by the foreign server, the FDW could generate a path representing that and insert it into the UPPERREL_FINAL
upper relation, where it would compete against the ModifyTable
approach. This approach could also be used to implement remote SELECT FOR UPDATE
, rather than using the row locking callbacks described in . Keep in mind that a path inserted into UPPERREL_FINAL
is responsible for implementing all behavior of the query.
When planning an UPDATE
or DELETE
, PlanForeignModify
and PlanDirectModify
can look up the RelOptInfo
struct for the foreign table and make use of the baserel->fdw_private
data previously created by the scan-planning functions. However, in INSERT
the target table is not scanned so there is no RelOptInfo
for it. The List
returned by PlanForeignModify
has the same restrictions as the fdw_private
list of a ForeignScan
plan node, that is it must contain only structures that copyObject
knows how to copy.
INSERT
with an ON CONFLICT
clause does not support specifying the conflict target, as unique constraints or exclusion constraints on remote tables are not locally known. This in turn implies that ON CONFLICT DO UPDATE
is not supported, since the specification is mandatory there.
If an FDW's underlying storage mechanism has a concept of locking individual rows to prevent concurrent updates of those rows, it is usually worthwhile for the FDW to perform row-level locking with as close an approximation as practical to the semantics used in ordinary PostgreSQL tables. There are multiple considerations involved in this.
One key decision to be made is whether to perform early locking or late locking. In early locking, a row is locked when it is first retrieved from the underlying store, while in late locking, the row is locked only when it is known that it needs to be locked. (The difference arises because some rows may be discarded by locally-checked restriction or join conditions.) Early locking is much simpler and avoids extra round trips to a remote store, but it can cause locking of rows that need not have been locked, resulting in reduced concurrency or even unexpected deadlocks. Also, late locking is only possible if the row to be locked can be uniquely re-identified later. Preferably the row identifier should identify a specific version of the row, as PostgreSQL TIDs do.
By default, PostgreSQL ignores locking considerations when interfacing to FDWs, but an FDW can perform early locking without any explicit support from the core code. The API functions described in Section 56.2.5, which were added in PostgreSQL 9.5, allow an FDW to use late locking if it wishes.
An additional consideration is that in READ COMMITTED
isolation mode, PostgreSQL may need to re-check restriction and join conditions against an updated version of some target tuple. Rechecking join conditions requires re-obtaining copies of the non-target rows that were previously joined to the target tuple. When working with standard PostgreSQL tables, this is done by including the TIDs of the non-target tables in the column list projected through the join, and then re-fetching non-target rows when required. This approach keeps the join data set compact, but it requires inexpensive re-fetch capability, as well as a TID that can uniquely identify the row version to be re-fetched. By default, therefore, the approach used with foreign tables is to include a copy of the entire row fetched from a foreign table in the column list projected through the join. This puts no special demands on the FDW but can result in reduced performance of merge and hash joins. An FDW that is capable of meeting the re-fetch requirements can choose to do it the first way.
For an UPDATE
or DELETE
on a foreign table, it is recommended that the ForeignScan
operation on the target table perform early locking on the rows that it fetches, perhaps via the equivalent of SELECT FOR UPDATE
. An FDW can detect whether a table is an UPDATE
/DELETE
target at plan time by comparing its relid to root->parse->resultRelation
, or at execution time by using ExecRelationIsTargetRelation()
. An alternative possibility is to perform late locking within the ExecForeignUpdate
or ExecForeignDelete
callback, but no special support is provided for this.
For foreign tables that are specified to be locked by a SELECT FOR UPDATE/SHARE
command, the ForeignScan
operation can again perform early locking by fetching tuples with the equivalent of SELECT FOR UPDATE/SHARE
. To perform late locking instead, provide the callback functions defined in Section 56.2.5. In GetForeignRowMarkType
, select rowmark option ROW_MARK_EXCLUSIVE
, ROW_MARK_NOKEYEXCLUSIVE
, ROW_MARK_SHARE
, or ROW_MARK_KEYSHARE
depending on the requested lock strength. (The core code will act the same regardless of which of these four options you choose.) Elsewhere, you can detect whether a foreign table was specified to be locked by this type of command by using get_plan_rowmark
at plan time, or ExecFindRowMark
at execution time; you must check not only whether a non-null rowmark struct is returned, but that its strength
field is not LCS_NONE
.
Lastly, for foreign tables that are used in an UPDATE
, DELETE
or SELECT FOR UPDATE/SHARE
command but are not specified to be row-locked, you can override the default choice to copy entire rows by having GetForeignRowMarkType
select option ROW_MARK_REFERENCE
when it sees lock strength LCS_NONE
. This will cause RefetchForeignRow
to be called with that value for markType
; it should then re-fetch the row without acquiring any new lock. (If you have a GetForeignRowMarkType
function but don't wish to re-fetch unlocked rows, select option ROW_MARK_COPY
for LCS_NONE
.)
See src/include/nodes/lockoptions.h
, the comments for RowMarkType
and PlanRowMark
in src/include/nodes/plannodes.h
, and the comments for ExecRowMark
in src/include/nodes/execnodes.h
for additional information.
Several helper functions are exported from the core server so that authors of foreign data wrappers can get easy access to attributes of FDW-related objects, such as FDW options. To use any of these functions, you need to include the header file foreign/foreign.h
in your source file. That header also defines the struct types that are returned by these functions.
This function returns a ForeignDataWrapper
object for the foreign-data wrapper with the given OID. A ForeignDataWrapper
object contains properties of the FDW (see foreign/foreign.h
for details). flags
is a bitwise-or'd bit mask indicating an extra set of options. It can take the value FDW_MISSING_OK
, in which case a NULL
result is returned to the caller instead of an error for an undefined object.
This function returns a ForeignDataWrapper
object for the foreign-data wrapper with the given OID. A ForeignDataWrapper
object contains properties of the FDW (see foreign/foreign.h
for details).
This function returns a ForeignServer
object for the foreign server with the given OID. A ForeignServer
object contains properties of the server (see foreign/foreign.h
for details). flags
is a bitwise-or'd bit mask indicating an extra set of options. It can take the value FSV_MISSING_OK
, in which case a NULL
result is returned to the caller instead of an error for an undefined object.
This function returns a ForeignServer
object for the foreign server with the given OID. A ForeignServer
object contains properties of the server (see foreign/foreign.h
for details).
This function returns a UserMapping
object for the user mapping of the given role on the given server. (If there is no mapping for the specific user, it will return the mapping for PUBLIC
, or throw error if there is none.) A UserMapping
object contains properties of the user mapping (see foreign/foreign.h
for details).
This function returns a ForeignTable
object for the foreign table with the given OID. A ForeignTable
object contains properties of the foreign table (see foreign/foreign.h
for details).
This function returns the per-column FDW options for the column with the given foreign table OID and attribute number, in the form of a list of DefElem
. NIL is returned if the column has no options.
Some object types have name-based lookup functions in addition to the OID-based ones:
This function returns a ForeignDataWrapper
object for the foreign-data wrapper with the given name. If the wrapper is not found, return NULL if missing_ok is true, otherwise raise an error.
This function returns a ForeignServer
object for the foreign server with the given name. If the server is not found, return NULL if missing_ok is true, otherwise raise an error.
The FDW handler function returns a palloc'd FdwRoutine
struct containing pointers to the callback functions described below. The scan-related functions are required, the rest are optional.
The FdwRoutine
struct type is declared in src/include/foreign/fdwapi.h
, which see for additional details.
Obtain relation size estimates for a foreign table. This is called at the beginning of planning for a query that scans a foreign table. root
is the planner's global information about the query; baserel
is the planner's information about this table; and foreigntableid
is the pg_class
OID of the foreign table. (foreigntableid
could be obtained from the planner data structures, but it's passed explicitly to save effort.)
This function should update baserel->rows
to be the expected number of rows returned by the table scan, after accounting for the filtering done by the restriction quals. The initial value of baserel->rows
is just a constant default estimate, which should be replaced if at all possible. The function may also choose to update baserel->width
if it can compute a better estimate of the average result row width.
See Section 56.4 for additional information.
Create possible access paths for a scan on a foreign table. This is called during query planning. The parameters are the same as for GetForeignRelSize
, which has already been called.
This function must generate at least one access path (ForeignPath
node) for a scan on the foreign table and must call add_path
to add each such path to baserel->pathlist
. It's recommended to use create_foreignscan_path
to build the ForeignPath
nodes. The function can generate multiple access paths, e.g., a path which has valid pathkeys
to represent a pre-sorted result. Each access path must contain cost estimates, and can contain any FDW-private information that is needed to identify the specific scan method intended.
See Section 56.4 for additional information.
Create a ForeignScan
plan node from the selected foreign access path. This is called at the end of query planning. The parameters are as for GetForeignRelSize
, plus the selected ForeignPath
(previously produced by GetForeignPaths
, GetForeignJoinPaths
, or GetForeignUpperPaths
), the target list to be emitted by the plan node, the restriction clauses to be enforced by the plan node, and the outer subplan of the ForeignScan
, which is used for rechecks performed by RecheckForeignScan
. (If the path is for a join rather than a base relation, foreigntableid
is InvalidOid
.)
This function must create and return a ForeignScan
plan node; it's recommended to use make_foreignscan
to build the ForeignScan
node.
See Section 56.4 for additional information.
Begin executing a foreign scan. This is called during executor startup. It should perform any initialization needed before the scan can start, but not start executing the actual scan (that should be done upon the first call to IterateForeignScan
). The ForeignScanState
node has already been created, but its fdw_state
field is still NULL. Information about the table to scan is accessible through the ForeignScanState
node (in particular, from the underlying ForeignScan
plan node, which contains any FDW-private information provided by GetForeignPlan
). eflags
contains flag bits describing the executor's operating mode for this plan node.
Note that when (eflags & EXEC_FLAG_EXPLAIN_ONLY)
is true, this function should not perform any externally-visible actions; it should only do the minimum required to make the node state valid for ExplainForeignScan
and EndForeignScan
.
Fetch one row from the foreign source, returning it in a tuple table slot (the node's ScanTupleSlot
should be used for this purpose). Return NULL if no more rows are available. The tuple table slot infrastructure allows either a physical or virtual tuple to be returned; in most cases the latter choice is preferable from a performance standpoint. Note that this is called in a short-lived memory context that will be reset between invocations. Create a memory context in BeginForeignScan
if you need longer-lived storage, or use the es_query_cxt
of the node's EState
.
The rows returned must match the fdw_scan_tlist
target list if one was supplied, otherwise they must match the row type of the foreign table being scanned. If you choose to optimize away fetching columns that are not needed, you should insert nulls in those column positions, or else generate a fdw_scan_tlist
list with those columns omitted.
Note that PostgreSQL's executor doesn't care whether the rows returned violate any constraints that were defined on the foreign table — but the planner does care, and may optimize queries incorrectly if there are rows visible in the foreign table that do not satisfy a declared constraint. If a constraint is violated when the user has declared that the constraint should hold true, it may be appropriate to raise an error (just as you would need to do in the case of a data type mismatch).
Restart the scan from the beginning. Note that any parameters the scan depends on may have changed value, so the new scan does not necessarily return exactly the same rows.
End the scan and release resources. It is normally not important to release palloc'd memory, but for example open files and connections to remote servers should be cleaned up.
If an FDW supports performing foreign joins remotely (rather than by fetching both tables' data and doing the join locally), it should provide this callback function:
Create possible access paths for a join of two (or more) foreign tables that all belong to the same foreign server. This optional function is called during query planning. As with GetForeignPaths
, this function should generate ForeignPath
path(s) for the supplied joinrel
(use create_foreign_join_path
to build them), and call add_path
to add these paths to the set of paths considered for the join. But unlike GetForeignPaths
, it is not necessary that this function succeed in creating at least one path, since paths involving local joining are always possible.
Note that this function will be invoked repeatedly for the same join relation, with different combinations of inner and outer relations; it is the responsibility of the FDW to minimize duplicated work.
If a ForeignPath
path is chosen for the join, it will represent the entire join process; paths generated for the component tables and subsidiary joins will not be used. Subsequent processing of the join path proceeds much as it does for a path scanning a single foreign table. One difference is that the scanrelid
of the resulting ForeignScan
plan node should be set to zero, since there is no single relation that it represents; instead, the fs_relids
field of the ForeignScan
node represents the set of relations that were joined. (The latter field is set up automatically by the core planner code, and need not be filled by the FDW.) Another difference is that, because the column list for a remote join cannot be found from the system catalogs, the FDW must fill fdw_scan_tlist
with an appropriate list of TargetEntry
nodes, representing the set of columns it will supply at run time in the tuples it returns.
See Section 56.4 for additional information.
If an FDW supports performing remote post-scan/join processing, such as remote aggregation, it should provide this callback function:
Create possible access paths for upper relation processing, which is the planner's term for all post-scan/join query processing, such as aggregation, window functions, sorting, and table updates. This optional function is called during query planning. Currently, it is called only if all base relation(s) involved in the query belong to the same FDW. This function should generate ForeignPath
path(s) for any post-scan/join processing that the FDW knows how to perform remotely (use create_foreign_upper_path
to build them), and call add_path
to add these paths to the indicated upper relation. As with GetForeignJoinPaths
, it is not necessary that this function succeed in creating any paths, since paths involving local processing are always possible.
The stage
parameter identifies which post-scan/join step is currently being considered. output_rel
is the upper relation that should receive paths representing computation of this step, and input_rel
is the relation representing the input to this step. The extra
parameter provides additional details, currently, it is set only for UPPERREL_PARTIAL_GROUP_AGG
or UPPERREL_GROUP_AGG
, in which case it points to a GroupPathExtraData
structure; or for UPPERREL_FINAL
, in which case it points to a FinalPathExtraData
structure. (Note that ForeignPath
paths added to output_rel
would typically not have any direct dependency on paths of the input_rel
, since their processing is expected to be done externally. However, examining paths previously generated for the previous processing step can be useful to avoid redundant planning work.)
See Section 56.4 for additional information.
If an FDW supports writable foreign tables, it should provide some or all of the following callback functions depending on the needs and capabilities of the FDW:
UPDATE
and DELETE
operations are performed against rows previously fetched by the table-scanning functions. The FDW may need extra information, such as a row ID or the values of primary-key columns, to ensure that it can identify the exact row to update or delete. To support that, this function can add extra hidden, or “junk”, target columns to the list of columns that are to be retrieved from the foreign table during an UPDATE
or DELETE
.
To do that, add TargetEntry
items to parsetree->targetList
, containing expressions for the extra values to be fetched. Each such entry must be marked resjunk
= true
, and must have a distinct resname
that will identify it at execution time. Avoid using names matching ctid
N
, wholerow
, or wholerow
N
, as the core system can generate junk columns of these names. If the extra expressions are more complex than simple Vars, they must be run through eval_const_expressions
before adding them to the targetlist.
Although this function is called during planning, the information provided is a bit different from that available to other planning routines. parsetree
is the parse tree for the UPDATE
or DELETE
command, while target_rte
and target_relation
describe the target foreign table.
If the AddForeignUpdateTargets
pointer is set to NULL
, no extra target expressions are added. (This will make it impossible to implement DELETE
operations, though UPDATE
may still be feasible if the FDW relies on an unchanging primary key to identify rows.)
Perform any additional planning actions needed for an insert, update, or delete on a foreign table. This function generates the FDW-private information that will be attached to the ModifyTable
plan node that performs the update action. This private information must have the form of a List
, and will be delivered to BeginForeignModify
during the execution stage.
root
is the planner's global information about the query. plan
is the ModifyTable
plan node, which is complete except for the fdwPrivLists
field. resultRelation
identifies the target foreign table by its range table index. subplan_index
identifies which target of the ModifyTable
plan node this is, counting from zero; use this if you want to index into plan->plans
or other substructure of the plan
node.
See Section 56.4 for additional information.
If the PlanForeignModify
pointer is set to NULL
, no additional plan-time actions are taken, and the fdw_private
list delivered to BeginForeignModify
will be NIL.
Begin executing a foreign table modification operation. This routine is called during executor startup. It should perform any initialization needed prior to the actual table modifications. Subsequently, ExecForeignInsert
, ExecForeignUpdate
or ExecForeignDelete
will be called for each tuple to be inserted, updated, or deleted.
mtstate
is the overall state of the ModifyTable
plan node being executed; global data about the plan and execution state is available via this structure. rinfo
is the ResultRelInfo
struct describing the target foreign table. (The ri_FdwState
field of ResultRelInfo
is available for the FDW to store any private state it needs for this operation.) fdw_private
contains the private data generated by PlanForeignModify
, if any. subplan_index
identifies which target of the ModifyTable
plan node this is. eflags
contains flag bits describing the executor's operating mode for this plan node.
Note that when (eflags & EXEC_FLAG_EXPLAIN_ONLY)
is true, this function should not perform any externally-visible actions; it should only do the minimum required to make the node state valid for ExplainForeignModify
and EndForeignModify
.
If the BeginForeignModify
pointer is set to NULL
, no action is taken during executor startup.
Insert one tuple into the foreign table. estate
is global execution state for the query. rinfo
is the ResultRelInfo
struct describing the target foreign table. slot
contains the tuple to be inserted; it will match the row-type definition of the foreign table. planSlot
contains the tuple that was generated by the ModifyTable
plan node's subplan; it differs from slot
in possibly containing additional “junk” columns. (The planSlot
is typically of little interest for INSERT
cases, but is provided for completeness.)
The return value is either a slot containing the data that was actually inserted (this might differ from the data supplied, for example as a result of trigger actions), or NULL if no row was actually inserted (again, typically as a result of triggers). The passed-in slot
can be re-used for this purpose.
The data in the returned slot is used only if the INSERT
statement has a RETURNING
clause or involves a view WITH CHECK OPTION
; or if the foreign table has an AFTER ROW
trigger. Triggers require all columns, but the FDW could choose to optimize away returning some or all columns depending on the contents of the RETURNING
clause or WITH CHECK OPTION
constraints. Regardless, some slot must be returned to indicate success, or the query's reported row count will be wrong.
If the ExecForeignInsert
pointer is set to NULL
, attempts to insert into the foreign table will fail with an error message.
Note that this function is also called when inserting routed tuples into a foreign-table partition or executing COPY FROM
on a foreign table, in which case it is called in a different way than it is in the INSERT
case. See the callback functions described below that allow the FDW to support that.
Update one tuple in the foreign table. estate
is global execution state for the query. rinfo
is the ResultRelInfo
struct describing the target foreign table. slot
contains the new data for the tuple; it will match the row-type definition of the foreign table. planSlot
contains the tuple that was generated by the ModifyTable
plan node's subplan; it differs from slot
in possibly containing additional “junk” columns. In particular, any junk columns that were requested by AddForeignUpdateTargets
will be available from this slot.
The return value is either a slot containing the row as it was actually updated (this might differ from the data supplied, for example as a result of trigger actions), or NULL if no row was actually updated (again, typically as a result of triggers). The passed-in slot
can be re-used for this purpose.
The data in the returned slot is used only if the UPDATE
statement has a RETURNING
clause or involves a view WITH CHECK OPTION
; or if the foreign table has an AFTER ROW
trigger. Triggers require all columns, but the FDW could choose to optimize away returning some or all columns depending on the contents of the RETURNING
clause or WITH CHECK OPTION
constraints. Regardless, some slot must be returned to indicate success, or the query's reported row count will be wrong.
If the ExecForeignUpdate
pointer is set to NULL
, attempts to update the foreign table will fail with an error message.
Delete one tuple from the foreign table. estate
is global execution state for the query. rinfo
is the ResultRelInfo
struct describing the target foreign table. slot
contains nothing useful upon call, but can be used to hold the returned tuple. planSlot
contains the tuple that was generated by the ModifyTable
plan node's subplan; in particular, it will carry any junk columns that were requested by AddForeignUpdateTargets
. The junk column(s) must be used to identify the tuple to be deleted.
The return value is either a slot containing the row that was deleted, or NULL if no row was deleted (typically as a result of triggers). The passed-in slot
can be used to hold the tuple to be returned.
The data in the returned slot is used only if the DELETE
query has a RETURNING
clause or the foreign table has an AFTER ROW
trigger. Triggers require all columns, but the FDW could choose to optimize away returning some or all columns depending on the contents of the RETURNING
clause. Regardless, some slot must be returned to indicate success, or the query's reported row count will be wrong.
If the ExecForeignDelete
pointer is set to NULL
, attempts to delete from the foreign table will fail with an error message.
End the table update and release resources. It is normally not important to release palloc'd memory, but for example open files and connections to remote servers should be cleaned up.
If the EndForeignModify
pointer is set to NULL
, no action is taken during executor shutdown.
Tuples inserted into a partitioned table by INSERT
or COPY FROM
are routed to partitions. If an FDW supports routable foreign-table partitions, it should also provide the following callback functions. These functions are also called when COPY FROM
is executed on a foreign table.
Begin executing an insert operation on a foreign table. This routine is called right before the first tuple is inserted into the foreign table in both cases when it is the partition chosen for tuple routing and the target specified in a COPY FROM
command. It should perform any initialization needed prior to the actual insertion. Subsequently, ExecForeignInsert
will be called for each tuple to be inserted into the foreign table.
mtstate
is the overall state of the ModifyTable
plan node being executed; global data about the plan and execution state is available via this structure. rinfo
is the ResultRelInfo
struct describing the target foreign table. (The ri_FdwState
field of ResultRelInfo
is available for the FDW to store any private state it needs for this operation.)
When this is called by a COPY FROM
command, the plan-related global data in mtstate
is not provided and the planSlot
parameter of ExecForeignInsert
subsequently called for each inserted tuple is NULL
, whether the foreign table is the partition chosen for tuple routing or the target specified in the command.
If the BeginForeignInsert
pointer is set to NULL
, no action is taken for the initialization.
Note that if the FDW does not support routable foreign-table partitions and/or executing COPY FROM
on foreign tables, this function or ExecForeignInsert
subsequently called must throw error as needed.
End the insert operation and release resources. It is normally not important to release palloc'd memory, but for example open files and connections to remote servers should be cleaned up.
If the EndForeignInsert
pointer is set to NULL
, no action is taken for the termination.
Report which update operations the specified foreign table supports. The return value should be a bit mask of rule event numbers indicating which operations are supported by the foreign table, using the CmdType
enumeration; that is, (1 << CMD_UPDATE) = 4
for UPDATE
, (1 << CMD_INSERT) = 8
for INSERT
, and (1 << CMD_DELETE) = 16
for DELETE
.
If the IsForeignRelUpdatable
pointer is set to NULL
, foreign tables are assumed to be insertable, updatable, or deletable if the FDW provides ExecForeignInsert
, ExecForeignUpdate
, or ExecForeignDelete
respectively. This function is only needed if the FDW supports some tables that are updatable and some that are not. (Even then, it's permissible to throw an error in the execution routine instead of checking in this function. However, this function is used to determine updatability for display in the information_schema
views.)
Some inserts, updates, and deletes to foreign tables can be optimized by implementing an alternative set of interfaces. The ordinary interfaces for inserts, updates, and deletes fetch rows from the remote server and then modify those rows one at a time. In some cases, this row-by-row approach is necessary, but it can be inefficient. If it is possible for the foreign server to determine which rows should be modified without actually retrieving them, and if there are no local structures which would affect the operation (row-level local triggers, stored generated columns, or WITH CHECK OPTION
constraints from parent views), then it is possible to arrange things so that the entire operation is performed on the remote server. The interfaces described below make this possible.
Decide whether it is safe to execute a direct modification on the remote server. If so, return true
after performing planning actions needed for that. Otherwise, return false
. This optional function is called during query planning. If this function succeeds, BeginDirectModify
, IterateDirectModify
and EndDirectModify
will be called at the execution stage, instead. Otherwise, the table modification will be executed using the table-updating functions described above. The parameters are the same as for PlanForeignModify
.
To execute the direct modification on the remote server, this function must rewrite the target subplan with a ForeignScan
plan node that executes the direct modification on the remote server. The operation
field of the ForeignScan
must be set to the CmdType
enumeration appropriately; that is, CMD_UPDATE
for UPDATE
, CMD_INSERT
for INSERT
, and CMD_DELETE
for DELETE
.
See Section 56.4 for additional information.
If the PlanDirectModify
pointer is set to NULL
, no attempts to execute a direct modification on the remote server are taken.
Prepare to execute a direct modification on the remote server. This is called during executor startup. It should perform any initialization needed prior to the direct modification (that should be done upon the first call to IterateDirectModify
). The ForeignScanState
node has already been created, but its fdw_state
field is still NULL. Information about the table to modify is accessible through the ForeignScanState
node (in particular, from the underlying ForeignScan
plan node, which contains any FDW-private information provided by PlanDirectModify
). eflags
contains flag bits describing the executor's operating mode for this plan node.
Note that when (eflags & EXEC_FLAG_EXPLAIN_ONLY)
is true, this function should not perform any externally-visible actions; it should only do the minimum required to make the node state valid for ExplainDirectModify
and EndDirectModify
.
If the BeginDirectModify
pointer is set to NULL
, no attempts to execute a direct modification on the remote server are taken.
When the INSERT
, UPDATE
or DELETE
query doesn't have a RETURNING
clause, just return NULL after a direct modification on the remote server. When the query has the clause, fetch one result containing the data needed for the RETURNING
calculation, returning it in a tuple table slot (the node's ScanTupleSlot
should be used for this purpose). The data that was actually inserted, updated or deleted must be stored in the es_result_relation_info->ri_projectReturning->pi_exprContext->ecxt_scantuple
of the node's EState
. Return NULL if no more rows are available. Note that this is called in a short-lived memory context that will be reset between invocations. Create a memory context in BeginDirectModify
if you need longer-lived storage, or use the es_query_cxt
of the node's EState
.
The rows returned must match the fdw_scan_tlist
target list if one was supplied, otherwise they must match the row type of the foreign table being updated. If you choose to optimize away fetching columns that are not needed for the RETURNING
calculation, you should insert nulls in those column positions, or else generate a fdw_scan_tlist
list with those columns omitted.
Whether the query has the clause or not, the query's reported row count must be incremented by the FDW itself. When the query doesn't have the clause, the FDW must also increment the row count for the ForeignScanState
node in the EXPLAIN ANALYZE
case.
If the IterateDirectModify
pointer is set to NULL
, no attempts to execute a direct modification on the remote server are taken.
Clean up following a direct modification on the remote server. It is normally not important to release palloc'd memory, but for example open files and connections to the remote server should be cleaned up.
If the EndDirectModify
pointer is set to NULL
, no attempts to execute a direct modification on the remote server are taken.
If an FDW wishes to support late row locking (as described in Section 56.5), it must provide the following callback functions:
Report which row-marking option to use for a foreign table. rte
is the RangeTblEntry
node for the table and strength
describes the lock strength requested by the relevant FOR UPDATE/SHARE
clause, if any. The result must be a member of the RowMarkType
enum type.
This function is called during query planning for each foreign table that appears in an UPDATE
, DELETE
, or SELECT FOR UPDATE/SHARE
query and is not the target of UPDATE
or DELETE
.
If the GetForeignRowMarkType
pointer is set to NULL
, the ROW_MARK_COPY
option is always used. (This implies that RefetchForeignRow
will never be called, so it need not be provided either.)
See Section 56.5 for more information.
Re-fetch one tuple slot from the foreign table, after locking it if required. estate
is global execution state for the query. erm
is the ExecRowMark
struct describing the target foreign table and the row lock type (if any) to acquire. rowid
identifies the tuple to be fetched. slot
contains nothing useful upon call, but can be used to hold the returned tuple. updated
is an output parameter.
This function should store the tuple into the provided slot, or clear it if the row lock couldn't be obtained. The row lock type to acquire is defined by erm->markType
, which is the value previously returned by GetForeignRowMarkType
. (ROW_MARK_REFERENCE
means to just re-fetch the tuple without acquiring any lock, and ROW_MARK_COPY
will never be seen by this routine.)
In addition, *updated
should be set to true
if what was fetched was an updated version of the tuple rather than the same version previously obtained. (If the FDW cannot be sure about this, always returning true
is recommended.)
Note that by default, failure to acquire a row lock should result in raising an error; returning with an empty slot is only appropriate if the SKIP LOCKED
option is specified by erm->waitPolicy
.
The rowid
is the ctid
value previously read for the row to be re-fetched. Although the rowid
value is passed as a Datum
, it can currently only be a tid
. The function API is chosen in hopes that it may be possible to allow other data types for row IDs in future.
If the RefetchForeignRow
pointer is set to NULL
, attempts to re-fetch rows will fail with an error message.
See Section 56.5 for more information.
Recheck that a previously-returned tuple still matches the relevant scan and join qualifiers, and possibly provide a modified version of the tuple. For foreign data wrappers which do not perform join pushdown, it will typically be more convenient to set this to NULL
and instead set fdw_recheck_quals
appropriately. When outer joins are pushed down, however, it isn't sufficient to reapply the checks relevant to all the base tables to the result tuple, even if all needed attributes are present, because failure to match some qualifier might result in some attributes going to NULL, rather than in no tuple being returned. RecheckForeignScan
can recheck qualifiers and return true if they are still satisfied and false otherwise, but it can also store a replacement tuple into the supplied slot.
To implement join pushdown, a foreign data wrapper will typically construct an alternative local join plan which is used only for rechecks; this will become the outer subplan of the ForeignScan
. When a recheck is required, this subplan can be executed and the resulting tuple can be stored in the slot. This plan need not be efficient since no base table will return more than one row; for example, it may implement all joins as nested loops. The function GetExistingLocalJoinPath
may be used to search existing paths for a suitable local join path, which can be used as the alternative local join plan. GetExistingLocalJoinPath
searches for an unparameterized path in the path list of the specified join relation. (If it does not find such a path, it returns NULL, in which case a foreign data wrapper may build the local path by itself or may choose not to create access paths for that join.)
EXPLAIN
Print additional EXPLAIN
output for a foreign table scan. This function can call ExplainPropertyText
and related functions to add fields to the EXPLAIN
output. The flag fields in es
can be used to determine what to print, and the state of the ForeignScanState
node can be inspected to provide run-time statistics in the EXPLAIN ANALYZE
case.
If the ExplainForeignScan
pointer is set to NULL
, no additional information is printed during EXPLAIN
.
Print additional EXPLAIN
output for a foreign table update. This function can call ExplainPropertyText
and related functions to add fields to the EXPLAIN
output. The flag fields in es
can be used to determine what to print, and the state of the ModifyTableState
node can be inspected to provide run-time statistics in the EXPLAIN ANALYZE
case. The first four arguments are the same as for BeginForeignModify
.
If the ExplainForeignModify
pointer is set to NULL
, no additional information is printed during EXPLAIN
.
Print additional EXPLAIN
output for a direct modification on the remote server. This function can call ExplainPropertyText
and related functions to add fields to the EXPLAIN
output. The flag fields in es
can be used to determine what to print, and the state of the ForeignScanState
node can be inspected to provide run-time statistics in the EXPLAIN ANALYZE
case.
If the ExplainDirectModify
pointer is set to NULL
, no additional information is printed during EXPLAIN
.
ANALYZE
This function is called when ANALYZE is executed on a foreign table. If the FDW can collect statistics for this foreign table, it should return true
, and provide a pointer to a function that will collect sample rows from the table in func
, plus the estimated size of the table in pages in totalpages
. Otherwise, return false
.
If the FDW does not support collecting statistics for any tables, the AnalyzeForeignTable
pointer can be set to NULL
.
If provided, the sample collection function must have the signature
A random sample of up to targrows
rows should be collected from the table and stored into the caller-provided rows
array. The actual number of rows collected must be returned. In addition, store estimates of the total numbers of live and dead rows in the table into the output parameters totalrows
and totaldeadrows
. (Set totaldeadrows
to zero if the FDW does not have any concept of dead rows.)
IMPORT FOREIGN SCHEMA
Obtain a list of foreign table creation commands. This function is called when executing IMPORT FOREIGN SCHEMA, and is passed the parse tree for that statement, as well as the OID of the foreign server to use. It should return a list of C strings, each of which must contain a CREATE FOREIGN TABLE command. These strings will be parsed and executed by the core server.
Within the ImportForeignSchemaStmt
struct, remote_schema
is the name of the remote schema from which tables are to be imported. list_type
identifies how to filter table names: FDW_IMPORT_SCHEMA_ALL
means that all tables in the remote schema should be imported (in this case table_list
is empty), FDW_IMPORT_SCHEMA_LIMIT_TO
means to include only tables listed in table_list
, and FDW_IMPORT_SCHEMA_EXCEPT
means to exclude the tables listed in table_list
. options
is a list of options used for the import process. The meanings of the options are up to the FDW. For example, an FDW could use an option to define whether the NOT NULL
attributes of columns should be imported. These options need not have anything to do with those supported by the FDW as database object options.
The FDW may ignore the local_schema
field of the ImportForeignSchemaStmt
, because the core server will automatically insert that name into the parsed CREATE FOREIGN TABLE
commands.
The FDW does not have to concern itself with implementing the filtering specified by list_type
and table_list
, either, as the core server will automatically skip any returned commands for tables excluded according to those options. However, it's often useful to avoid the work of creating commands for excluded tables in the first place. The function IsImportableForeignTable()
may be useful to test whether a given foreign-table name will pass the filter.
If the FDW does not support importing table definitions, the ImportForeignSchema
pointer can be set to NULL
.
A ForeignScan
node can, optionally, support parallel execution. A parallel ForeignScan
will be executed in multiple processes and must return each row exactly once across all cooperating processes. To do this, processes can coordinate through fixed-size chunks of dynamic shared memory. This shared memory is not guaranteed to be mapped at the same address in every process, so it must not contain pointers. The following functions are all optional, but most are required if parallel execution is to be supported.
Test whether a scan can be performed within a parallel worker. This function will only be called when the planner believes that a parallel plan might be possible, and should return true if it is safe for that scan to run within a parallel worker. This will generally not be the case if the remote data source has transaction semantics, unless the worker's connection to the data can somehow be made to share the same transaction context as the leader.
If this function is not defined, it is assumed that the scan must take place within the parallel leader. Note that returning true does not mean that the scan itself can be done in parallel, only that the scan can be performed within a parallel worker. Therefore, it can be useful to define this method even when parallel execution is not supported.
Estimate the amount of dynamic shared memory that will be required for parallel operation. This may be higher than the amount that will actually be used, but it must not be lower. The return value is in bytes. This function is optional, and can be omitted if not needed; but if it is omitted, the next three functions must be omitted as well, because no shared memory will be allocated for the FDW's use.
Initialize the dynamic shared memory that will be required for parallel operation. coordinate
points to a shared memory area of size equal to the return value of EstimateDSMForeignScan
. This function is optional, and can be omitted if not needed.
Re-initialize the dynamic shared memory required for parallel operation when the foreign-scan plan node is about to be re-scanned. This function is optional, and can be omitted if not needed. Recommended practice is that this function reset only shared state, while the ReScanForeignScan
function resets only local state. Currently, this function will be called before ReScanForeignScan
, but it's best not to rely on that ordering.
Initialize a parallel worker's local state based on the shared state set up by the leader during InitializeDSMForeignScan
. This function is optional, and can be omitted if not needed.
Release resources when it is anticipated the node will not be executed to completion. This is not called in all cases; sometimes, EndForeignScan
may be called without this function having been called first. Since the DSM segment used by parallel query is destroyed just after this callback is invoked, foreign data wrappers that wish to take some action before the DSM segment goes away should implement this method.
This function is called while converting a path parameterized by the top-most parent of the given child relation child_rel
to be parameterized by the child relation. The function is used to reparameterize any paths or translate any expression nodes saved in the given fdw_private
member of a ForeignPath
. The callback may use reparameterize_path_by_child
, adjust_appendrel_attrs
or adjust_appendrel_attrs_multilevel
as required.
Among all relational operators the most difficult one to process and optimize is the join. The number of possible query plans grows exponentially with the number of joins in the query. Further optimization effort is caused by the support of a variety of join methods (e.g., nested loop, hash join, merge join in PostgreSQL) to process individual joins and a diversity of indexes (e.g., B-tree, hash, GiST and GIN in PostgreSQL) as access paths for relations.
The normal PostgreSQL query optimizer performs a near-exhaustive search over the space of alternative strategies. This algorithm, first introduced in IBM's System R database, produces a near-optimal join order, but can take an enormous amount of time and memory space when the number of joins in the query grows large. This makes the ordinary PostgreSQL query optimizer inappropriate for queries that join a large number of tables.
The Institute of Automatic Control at the University of Mining and Technology, in Freiberg, Germany, encountered some problems when it wanted to use PostgreSQL as the backend for a decision support knowledge based system for the maintenance of an electrical power grid. The DBMS needed to handle large join queries for the inference machine of the knowledge based system. The number of joins in these queries made using the normal query optimizer infeasible.
In the following we describe the implementation of a genetic algorithm to solve the join ordering problem in a manner that is efficient for queries involving large numbers of joins.
This chapter defines the interface between the core PostgreSQL system and index access methods, which manage individual index types. The core system knows nothing about indexes beyond what is specified here, so it is possible to develop entirely new index types by writing add-on code.
All indexes in PostgreSQL are what are known technically as secondary indexes; that is, the index is physically separate from the table file that it describes. Each index is stored as its own physical relation and so is described by an entry in the pg_class
catalog. The contents of an index are entirely under the control of its index access method. In practice, all index access methods divide indexes into standard-size pages so that they can use the regular storage manager and buffer manager to access the index contents. (All the existing index access methods furthermore use the standard page layout described in , and most use the same format for index tuple headers; but these decisions are not forced on an access method.)
An index is effectively a mapping from some data key values to tuple identifiers, or TIDs, of row versions (tuples) in the index's parent table. A TID consists of a block number and an item number within that block (see ). This is sufficient information to fetch a particular row version from the table. Indexes are not directly aware that under MVCC, there might be multiple extant versions of the same logical row; to an index, each tuple is an independent object that needs its own index entry. Thus, an update of a row always creates all-new index entries for the row, even if the key values did not change. (HOT tuples are an exception to this statement; but indexes do not deal with those, either.) Index entries for dead tuples are reclaimed (by vacuuming) when the dead tuples themselves are reclaimed.
The genetic algorithm (GA) is a heuristic optimization method which operates through randomized search. The set of possible solutions for the optimization problem is considered as a population of individuals. The degree of adaptation of an individual to its environment is specified by its fitness.
The coordinates of an individual in the search space are represented by chromosomes, in essence a set of character strings. A gene is a subsection of a chromosome which encodes the value of a single parameter being optimized. Typical encodings for a gene could be binary or integer.
Through simulation of the evolutionary operations recombination, mutation, and selection new generations of search points are found that show a higher average fitness than their ancestors.
According to the comp.ai.genetic FAQ it cannot be stressed too strongly that a GA is not a pure random search for a solution to a problem. A GA uses stochastic processes, but the result is distinctly non-random (better than random).
Figure 59.1. Structured Diagram of a Genetic Algorithm
The following resources contain additional information about genetic algorithms:
, (FAQ for )
, by Craig Reynolds
The GEQO module approaches the query optimization problem as though it were the well-known traveling salesman problem (TSP). Possible query plans are encoded as integer strings. Each string represents the join order from one relation of the query to the next. For example, the join tree
is encoded by the integer string '4-1-3-2', which means, first join relation '4' and '1', then '3', and then '2', where 1, 2, 3, 4 are relation IDs within the PostgreSQL optimizer.
Specific characteristics of the GEQO implementation in PostgreSQL are:
Usage of a steady state GA (replacement of the least fit individuals in a population, not whole-generational replacement) allows fast convergence towards improved query plans. This is essential for query handling with reasonable time;
Usage of edge recombination crossover which is especially suited to keep edge losses low for the solution of the TSP by means of a GA;
Mutation as genetic operator is deprecated so that no repair mechanisms are needed to generate legal TSP tours.
Parts of the GEQO module are adapted from D. Whitley's Genitor algorithm.
The GEQO module allows the PostgreSQL query optimizer to support large join queries effectively through non-exhaustive search.
The GEQO planning process uses the standard planner code to generate plans for scans of individual relations. Then join plans are developed using the genetic approach. As shown above, each candidate join plan is represented by a sequence in which to join the base relations. In the initial stage, the GEQO code simply generates some possible join sequences at random. For each join sequence considered, the standard planner code is invoked to estimate the cost of performing the query using that join sequence. (For each step of the join sequence, all three possible join strategies are considered; and all the initially-determined relation scan plans are available. The estimated cost is the cheapest of these possibilities.) Join sequences with lower estimated cost are considered “more fit” than those with higher cost. The genetic algorithm discards the least fit candidates. Then new candidates are generated by combining genes of more-fit candidates — that is, by using randomly-chosen portions of known low-cost join sequences to create new sequences for consideration. This process is repeated until a preset number of join sequences have been considered; then the best one found at any time during the search is used to generate the finished plan.
This process is inherently nondeterministic, because of the randomized choices made during both the initial population selection and subsequent “mutation” of the best candidates. To avoid surprising changes of the selected plan, each run of the GEQO algorithm restarts its random number generator with the current parameter setting. As long as geqo_seed
and the other GEQO parameters are kept fixed, the same plan will be generated for a given query (and other planner inputs such as statistics). To experiment with different search paths, try changing geqo_seed
.
Work is still needed to improve the genetic algorithm parameter settings. In file src/backend/optimizer/geqo/geqo_main.c
, routines gimme_pool_size
and gimme_number_generations
, we have to find a compromise for the parameter settings to satisfy two competing demands:
Optimality of the query plan
Computing time
In the current implementation, the fitness of each candidate join sequence is estimated by running the standard planner's join selection and cost estimation code from scratch. To the extent that different candidates use similar sub-sequences of joins, a great deal of work will be repeated. This could be made significantly faster by retaining cost estimates for sub-joins. The problem is to avoid expending unreasonable amounts of memory on retaining that state.
At a more basic level, it is not clear that solving query optimization with a GA algorithm designed for TSP is appropriate. In the TSP case, the cost associated with any substring (partial tour) is independent of the rest of the tour, but this is certainly not true for query optimization. Thus it is questionable whether edge recombination crossover is the most effective mutation procedure.
This chapter explains the interface between the core PostgreSQL system and table access methods, which manage the storage for tables. The core system knows little about these access methods beyond what is specified here, so it is possible to develop entirely new access method types by writing add-on code.
Each table access method is described by a row in the system catalog. The pg_am
entry specifies a name and a handler function for the table access method. These entries can be created and deleted using the and SQL commands.
A table access method handler function must be declared to accept a single argument of type internal
and to return the pseudo-type table_am_handler
. The argument is a dummy value that simply serves to prevent handler functions from being called directly from SQL commands. The result of the function must be a pointer to a struct of type TableAmRoutine
, which contains everything that the core code needs to know to make use of the table access method. The return value needs to be of server lifetime, which is typically achieved by defining it as a static const
variable in global scope. The TableAmRoutine
struct, also called the access method's API struct, defines the behavior of the access method using callbacks. These callbacks are pointers to plain C functions and are not visible or callable at the SQL level. All the callbacks and their behavior is defined in the TableAmRoutine
structure (with comments inside the struct defining the requirements for callbacks). Most callbacks have wrapper functions, which are documented from the point of view of a user (rather than an implementor) of the table access method. For details, please refer to the file.
To implement an access method, an implementor will typically need to implement an AM-specific type of tuple table slot (see ), which allows code outside the access method to hold references to tuples of the AM, and to access the columns of the tuple.
Currently, the way an AM actually stores data is fairly unconstrained. For example, it's possible, but not required, to use postgres' shared buffer cache. In case it is used, it likely makes sense to use PostgreSQL's standard page layout as described in .
One fairly large constraint of the table access method API is that, currently, if the AM wants to support modifications and/or indexes, it is necessary for each tuple to have a tuple identifier (TID) consisting of a block number and an item number (see also ). It is not strictly necessary that the sub-parts of TIDs have the same meaning they e.g., have for heap
, but if bitmap scan support is desired (it is optional), the block number needs to provide locality.
For crash safety, an AM can use postgres' , or a custom implementation. If WAL is chosen, either can be used, or a new type of WAL records can be implemented. Generic WAL Records are easy, but imply higher WAL volume. Implementation of a new type of WAL record currently requires modifications to core code (specifically, src/include/access/rmgrlist.h
).
To implement transactional support in a manner that allows different table access methods be accessed within a single transaction, it likely is necessary to closely integrate with the machinery in src/backend/access/transam/xlog.c
.
Any developer of a new table access method
can refer to the existing heap
implementation present in src/backend/access/heap/heapam_handler.c
for details of its implementation.
P(t)
generation of ancestors at a time t
P''(t)
generation of descendants at a time t
Although all built-in WAL-logged modules have their own types of WAL records, there is also a generic WAL record type, which describes changes to pages in a generic way. This is useful for extensions that provide custom access methods, because they cannot register their own WAL redo routines.
The API for constructing generic WAL records is defined in access/generic_xlog.h
and implemented in access/transam/generic_xlog.c
.
To perform a WAL-logged data update using the generic WAL record facility, follow these steps:
state = GenericXLogStart(relation)
— start construction of a generic WAL record for the given relation.
page = GenericXLogRegisterBuffer(state, buffer, flags)
— register a buffer to be modified within the current generic WAL record. This function returns a pointer to a temporary copy of the buffer's page, where modifications should be made. (Do not modify the buffer's contents directly.) The third argument is a bit mask of flags applicable to the operation. Currently the only such flag is GENERIC_XLOG_FULL_IMAGE
, which indicates that a full-page image rather than a delta update should be included in the WAL record. Typically this flag would be set if the page is new or has been rewritten completely. GenericXLogRegisterBuffer
can be repeated if the WAL-logged action needs to modify multiple pages.
Apply modifications to the page images obtained in the previous step.
GenericXLogFinish(state)
— apply the changes to the buffers and emit the generic WAL record.
WAL record construction can be canceled between any of the above steps by calling GenericXLogAbort(state)
. This will discard all changes to the page image copies.
Please note the following points when using the generic WAL record facility:
No direct modifications of buffers are allowed! All modifications must be done in copies acquired from GenericXLogRegisterBuffer()
. In other words, code that makes generic WAL records should never call BufferGetPage()
for itself. However, it remains the caller's responsibility to pin/unpin and lock/unlock the buffers at appropriate times. Exclusive lock must be held on each target buffer from before GenericXLogRegisterBuffer()
until after GenericXLogFinish()
.
Registrations of buffers (step 2) and modifications of page images (step 3) can be mixed freely, i.e., both steps may be repeated in any sequence. Keep in mind that buffers should be registered in the same order in which locks are to be obtained on them during replay.
The maximum number of buffers that can be registered for a generic WAL record is MAX_GENERIC_XLOG_PAGES
. An error will be thrown if this limit is exceeded.
Generic WAL assumes that the pages to be modified have standard layout, and in particular that there is no useful data between pd_lower
and pd_upper
.
Since you are modifying copies of buffer pages, GenericXLogStart()
does not start a critical section. Thus, you can safely do memory allocation, error throwing, etc. between GenericXLogStart()
and GenericXLogFinish()
. The only actual critical section is present inside GenericXLogFinish()
. There is no need to worry about calling GenericXLogAbort()
during an error exit, either.
GenericXLogFinish()
takes care of marking buffers dirty and setting their LSNs. You do not need to do this explicitly.
For unlogged relations, everything works the same except that no actual WAL record is emitted. Thus, you typically do not need to do any explicit checks for unlogged relations.
The generic WAL redo function will acquire exclusive locks to buffers in the same order as they were registered. After redoing all changes, the locks will be released in the same order.
If GENERIC_XLOG_FULL_IMAGE
is not specified for a registered buffer, the generic WAL record contains a delta between the old and the new page images. This delta is based on byte-by-byte comparison. This is not very compact for the case of moving data within a page, and might be improved in the future.
GiST stands for Generalized Search Tree. It is a balanced, tree-structured access method, that acts as a base template in which to implement arbitrary indexing schemes. B-trees, R-trees and many other indexing schemes can be implemented in GiST.
One advantage of GiST is that it allows the development of custom data types with the appropriate access methods, by an expert in the domain of the data type, rather than a database expert.
Some of the information here is derived from the University of California at Berkeley's GiST Indexing Project web site and Marcel Kornacker's thesis, Access Methods for Next-Generation Database Systems. The GiST implementation in PostgreSQL is primarily maintained by Teodor Sigaev and Oleg Bartunov, and there is more information on their web site.
PostgreSQL programs (server and client) can issue their messages in your favorite language — if the messages have been translated. Creating and maintaining translated message sets needs the help of people who speak their own language well and want to contribute to the PostgreSQL effort. You do not have to be a programmer at all to do this. This section explains how to help.
We won't judge your language skills — this section is about software tools. Theoretically, you only need a text editor. But this is only in the unlikely event that you do not want to try out your translated messages. When you configure your source tree, be sure to use the --enable-nls
option. This will also check for the libintl library and the msgfmt
program, which all end users will need anyway. To try out your work, follow the applicable portions of the installation instructions.
If you want to start a new translation effort or want to do a message catalog merge (described later), you will need the programs xgettext
and msgmerge
, respectively, in a GNU-compatible implementation. Later, we will try to arrange it so that if you use a packaged source distribution, you won't need xgettext
. (If working from Git, you will still need it.) GNU Gettext 0.10.36 or later is currently recommended.
Your local gettext implementation should come with its own documentation. Some of that is probably duplicated in what follows, but for additional details you should look there.
The pairs of original (English) messages and their (possibly) translated equivalents are kept in message catalogs, one for each program (although related programs can share a message catalog) and for each target language. There are two file formats for message catalogs: The first is the “PO” file (for Portable Object), which is a plain text file with special syntax that translators edit. The second is the “MO” file (for Machine Object), which is a binary file generated from the respective PO file and is used while the internationalized program is run. Translators do not deal with MO files; in fact hardly anyone does.
The extension of the message catalog file is to no surprise either .po
or .mo
. The base name is either the name of the program it accompanies, or the language the file is for, depending on the situation. This is a bit confusing. Examples are psql.po
(PO file for psql) or fr.mo
(MO file in French).
The file format of the PO files is illustrated here:
The msgid's are extracted from the program source. (They need not be, but this is the most common way.) The msgstr lines are initially empty and are filled in with useful strings by the translator. The strings can contain C-style escape characters and can be continued across lines as illustrated. (The next line must start at the beginning of the line.)
The # character introduces a comment. If whitespace immediately follows the # character, then this is a comment maintained by the translator. There can also be automatic comments, which have a non-whitespace character immediately following the #. These are maintained by the various tools that operate on the PO files and are intended to aid the translator.
The #. style comments are extracted from the source file where the message is used. Possibly the programmer has inserted information for the translator, such as about expected alignment. The #: comment indicates the exact location(s) where the message is used in the source. The translator need not look at the program source, but can if there is doubt about the correct translation. The #, comments contain flags that describe the message in some way. There are currently two flags: fuzzy
is set if the message has possibly been outdated because of changes in the program source. The translator can then verify this and possibly remove the fuzzy flag. Note that fuzzy messages are not made available to the end user. The other flag is c-format
, which indicates that the message is a printf
-style format template. This means that the translation should also be a format string with the same number and type of placeholders. There are tools that can verify this, which key off the c-format flag.
OK, so how does one create a “blank” message catalog? First, go into the directory that contains the program whose messages you want to translate. If there is a file nls.mk
, then this program has been prepared for translation.
If there are already some .po
files, then someone has already done some translation work. The files are named language
.po, where language
is the ISO 639-1 two-letter language code (in lower case), e.g., fr.po
for French. If there is really a need for more than one translation effort per language then the files can also be named language
_region
.po where region
is the ISO 3166-1 two-letter country code (in upper case), e.g., pt_BR.po
for Portuguese in Brazil. If you find the language you wanted you can just start working on that file.
If you need to start a new translation effort, then first run the command:
This will create a file progname
.pot. (.pot
to distinguish it from PO files that are “in production”. The T
stands for “template”.) Copy this file to language
.po and edit it. To make it known that the new language is available, also edit the file nls.mk
and add the language (or language and country) code to the line that looks like:
(Other languages can appear, of course.)
As the underlying program or library changes, messages might be changed or added by the programmers. In this case you do not need to start from scratch. Instead, run the command:
which will create a new blank message catalog file (the pot file you started with) and will merge it with the existing PO files. If the merge algorithm is not sure about a particular message it marks it “fuzzy” as explained above. The new PO file is saved with a .po.new
extension.
The PO files can be edited with a regular text editor. The translator should only change the area between the quotes after the msgstr directive, add comments, and alter the fuzzy flag. There is (unsurprisingly) a PO mode for Emacs, which I find quite useful.
The PO files need not be completely filled in. The software will automatically fall back to the original string if no translation (or an empty translation) is available. It is no problem to submit incomplete translations for inclusions in the source tree; that gives room for other people to pick up your work. However, you are encouraged to give priority to removing fuzzy entries after doing a merge. Remember that fuzzy entries will not be installed; they only serve as reference for what might be the right translation.
Here are some things to keep in mind while editing the translations:
Make sure that if the original ends with a newline, the translation does, too. Similarly for tabs, etc.
If the original is a printf
format string, the translation also needs to be. The translation also needs to have the same format specifiers in the same order. Sometimes the natural rules of the language make this impossible or at least awkward. In that case you can modify the format specifiers like this:
Then the first placeholder will actually use the second argument from the list. The digits
$ needs to follow the % immediately, before any other format manipulators. (This feature really exists in the printf
family of functions. You might not have heard of it before because there is little use for it outside of message internationalization.)
If the original string contains a linguistic mistake, report that (or fix it yourself in the program source) and translate normally. The corrected string can be merged in when the program sources have been updated. If the original string contains a factual mistake, report that (or fix it yourself) and do not translate it. Instead, you can mark the string with a comment in the PO file.
Maintain the style and tone of the original string. Specifically, messages that are not sentences (cannot open file %s
) should probably not start with a capital letter (if your language distinguishes letter case) or end with a period (if your language uses punctuation marks). It might help to read Section 53.3.
If you don't know what a message means, or if it is ambiguous, ask on the developers' mailing list. Chances are that English speaking end users might also not understand it or find it ambiguous, so it's best to improve the message.
As shown in , a btree operator class must provide five comparison operators, <
, <=
, =
, >=
and >
. One might expect that <>
should also be part of the operator class, but it is not, because it would almost never be useful to use a <>
WHERE clause in an index search. (For some purposes, the planner treats <>
as associated with a btree operator class; but it finds that operator via the =
operator's negator link, rather than from pg_amop
.)
When several data types share near-identical sorting semantics, their operator classes can be grouped into an operator family. Doing so is advantageous because it allows the planner to make deductions about cross-type comparisons. Each operator class within the family should contain the single-type operators (and associated support functions) for its input data type, while cross-type comparison operators and support functions are “loose” in the family. It is recommendable that a complete set of cross-type operators be included in the family, thus ensuring that the planner can represent any comparison conditions that it deduces from transitivity.
There are some basic assumptions that a btree operator family must satisfy:
An =
operator must be an equivalence relation; that is, for all non-null values A
, B
, C
of the data type:
A
=
A
is true (reflexive law)
if A
=
B
, then B
=
A
(symmetric law)
if A
=
B
and B
=
C
, then A
=
C
(transitive law)
A <
operator must be a strong ordering relation; that is, for all non-null values A
, B
, C
:
A
<
A
is false (irreflexive law)
if A
<
B
and B
<
C
, then A
<
C
(transitive law)
Furthermore, the ordering is total; that is, for all non-null values A
, B
:
exactly one of A
<
B
, A
=
B
, and B
<
A
is true (trichotomy law)
(The trichotomy law justifies the definition of the comparison support function, of course.)
The other three operators are defined in terms of =
and <
in the obvious way, and must act consistently with them.
For an operator family supporting multiple data types, the above laws must hold when A
, B
, C
are taken from any data types in the family. The transitive laws are the trickiest to ensure, as in cross-type situations they represent statements that the behaviors of two or three different operators are consistent. As an example, it would not work to put float8
and numeric
into the same operator family, at least not with the current semantics that numeric
values are converted to float8
for comparison to a float8
. Because of the limited accuracy of float8
, this means there are distinct numeric
values that will compare equal to the same float8
value, and thus the transitive law would fail.
Another requirement for a multiple-data-type family is that any implicit or binary-coercion casts that are defined between data types included in the operator family must not change the associated sort ordering.
It should be fairly clear why a btree index requires these laws to hold within a single data type: without them there is no ordering to arrange the keys with. Also, index searches using a comparison key of a different data type require comparisons to behave sanely across two data types. The extensions to three or more data types within a family are not strictly required by the btree index mechanism itself, but the planner relies on them for optimization purposes.
PostgreSQL includes an implementation of the standard btree (multi-way balanced tree) index data structure. Any data type that can be sorted into a well-defined linear order can be indexed by a btree index. The only limitation is that an index entry cannot exceed approximately one-third of a page (after TOAST compression, if applicable).
Because each btree operator class imposes a sort order on its data type, btree operator classes (or, really, operator families) have come to be used as PostgreSQL's general representation and understanding of sorting semantics. Therefore, they've acquired some features that go beyond what would be needed just to support btree indexes, and parts of the system that are quite distant from the btree AM make use of them.
The core PostgreSQL distribution includes the GiST operator classes shown in Table 64.1. (Some of the optional modules described in Appendix F provide additional GiST operator classes.)
Name
Indexed Data Type
Indexable Operators
Ordering Operators
box_ops
box
&&
&>
&<
&<|
>>
<<
<<|
<@
@>
@
|&>
|>>
~
~=
circle_ops
circle
&&
&>
&<
&<|
>>
<<
<<|
<@
@>
@
|&>
|>>
~
~=
<->
inet_ops
inet
, cidr
&&
>>
>>=
>
>=
<>
<<
<<=
<
<=
=
point_ops
point
>>
>^
<<
<@
<@
<@
<^
~=
<->
poly_ops
polygon
&&
&>
&<
&<|
>>
<<
<<|
<@
@>
@
|&>
|>>
~
~=
<->
range_ops
any range type
&&
&>
&<
>>
<<
<@
-|-
=
@>
@>
tsquery_ops
tsquery
<@
@>
tsvector_ops
tsvector
@@
For historical reasons, the inet_ops
operator class is not the default class for types inet
and cidr
. To use it, mention the class name in CREATE INDEX
, for example
Building large GiST indexes by simply inserting all the tuples tends to be slow, because if the index tuples are scattered across the index and the index is large enough to not fit in cache, the insertions need to perform a lot of random I/O. Beginning in version 9.2, PostgreSQL supports a more efficient method to build GiST indexes based on buffering, which can dramatically reduce the number of random I/Os needed for non-ordered data sets. For well-ordered data sets the benefit is smaller or non-existent, because only a small number of pages receive new tuples at a time, and those pages fit in cache even if the index as whole does not.
However, buffering index build needs to call the penalty
function more often, which consumes some extra CPU resources. Also, the buffers used in the buffering build need temporary disk space, up to the size of the resulting index. Buffering can also influence the quality of the resulting index, in both positive and negative directions. That influence depends on various factors, like the distribution of the input data and the operator class implementation.
By default, a GiST index build switches to the buffering method when the index size reaches effective_cache_size. It can be manually turned on or off by the buffering
parameter to the CREATE INDEX command. The default behavior is good for most cases, but turning buffering off might speed up the build somewhat if the input data is ordered.
This section covers B-Tree index implementation details that may be of use to advanced users. See src/backend/access/nbtree/README
in the source distribution for a much more detailed, internals-focused description of the B-Tree implementation.
PostgreSQL B-Tree indexes are multi-level tree structures, where each level of the tree can be used as a doubly-linked list of pages. A single metapage is stored in a fixed position at the start of the first segment file of the index. All other pages are either leaf pages or internal pages. Leaf pages are the pages on the lowest level of the tree. All other levels consist of internal pages. Each leaf page contains tuples that point to table rows. Each internal page contains tuples that point to the next level down in the tree. Typically, over 99% of all pages are leaf pages. Both internal pages and leaf pages use the standard page format described in Section 68.6.
New leaf pages are added to a B-Tree index when an existing leaf page cannot fit an incoming tuple. A page split operation makes room for items that originally belonged on the overflowing page by moving a portion of the items to a new page. Page splits must also insert a new downlink to the new page in the parent page, which may cause the parent to split in turn. Page splits “cascade upwards” in a recursive fashion. When the root page finally cannot fit a new downlink, a root page split operation takes place. This adds a new level to the tree structure by creating a new root page that is one level above the original root page.
A duplicate is a leaf page tuple (a tuple that points to a table row) where all indexed key columns have values that match corresponding column values from at least one other leaf page tuple in the same index. Duplicate tuples are quite common in practice. B-Tree indexes can use a special, space-efficient representation for duplicates when an optional technique is enabled: deduplication.
Deduplication works by periodically merging groups of duplicate tuples together, forming a single posting list tuple for each group. The column key value(s) only appear once in this representation. This is followed by a sorted array of TIDs that point to rows in the table. This significantly reduces the storage size of indexes where each value (or each distinct combination of column values) appears several times on average. The latency of queries can be reduced significantly. Overall query throughput may increase significantly. The overhead of routine index vacuuming may also be reduced significantly.
B-Tree 的重複資料處理對包含 NULL 值的「重複內容」同樣有效,即使根據所有 B-Tree 運算子類的 = 成員,NULL 值不表示彼此相等。就實作中磁碟上的 B-Tree 結構而言,NULL 也只是索引值域中的其中一個值。
The deduplication process occurs lazily, when a new item is inserted that cannot fit on an existing leaf page. This prevents (or at least delays) leaf page splits. Unlike GIN posting list tuples, B-Tree posting list tuples do not need to expand every time a new duplicate is inserted; they are merely an alternative physical representation of the original logical contents of the leaf page. This design prioritizes consistent performance with mixed read-write workloads. Most client applications will at least see a moderate performance benefit from using deduplication. Deduplication is enabled by default.
CREATE INDEX
and REINDEX
apply deduplication to create posting list tuples, though the strategy they use is slightly different. Each group of duplicate ordinary tuples encountered in the sorted input taken from the table is merged into a posting list tuple before being added to the current pending leaf page. Individual posting list tuples are packed with as many TIDs as possible. Leaf pages are written out in the usual way, without any separate deduplication pass. This strategy is well-suited to CREATE INDEX
and REINDEX
because they are once-off batch operations.
Write-heavy workloads that don't benefit from deduplication due to having few or no duplicate values in indexes will incur a small, fixed performance penalty (unless deduplication is explicitly disabled). The deduplicate_items
storage parameter can be used to disable deduplication within individual indexes. There is never any performance penalty with read-only workloads, since reading posting list tuples is at least as efficient as reading the standard tuple representation. Disabling deduplication isn't usually helpful.
B-Tree indexes are not directly aware that under MVCC, there might be multiple extant versions of the same logical table row; to an index, each tuple is an independent object that needs its own index entry. “Version duplicates” may sometimes accumulate and adversely affect query latency and throughput. This typically occurs with UPDATE
-heavy workloads where most individual updates cannot apply the HOT optimization (often because at least one indexed column gets modified, necessitating a new set of index tuple versions — one new tuple for each and every index). In effect, B-Tree deduplication ameliorates index bloat caused by version churn. Note that even the tuples from a unique index are not necessarily physically unique when stored on disk due to version churn. The deduplication optimization is selectively applied within unique indexes. It targets those pages that appear to have version duplicates. The high level goal is to give VACUUM
more time to run before an “unnecessary” page split caused by version churn can take place.
A special heuristic is applied to determine whether a deduplication pass in a unique index should take place. It can often skip straight to splitting a leaf page, avoiding a performance penalty from wasting cycles on unhelpful deduplication passes. If you're concerned about the overhead of deduplication, consider setting deduplicate_items = off
selectively. Leaving deduplication enabled in unique indexes has little downside.
Deduplication cannot be used in all cases due to implementation-level restrictions. Deduplication safety is determined when CREATE INDEX
or REINDEX
is run.
Note that deduplication is deemed unsafe and cannot be used in the following cases involving semantically significant differences among equal datums:
text
, varchar
, and char
cannot use deduplication when a nondeterministic collation is used. Case and accent differences must be preserved among equal datums.
numeric
cannot use deduplication. Numeric display scale must be preserved among equal datums.
jsonb
cannot use deduplication, since the jsonb
B-Tree operator class uses numeric
internally.
float4
and float8
cannot use deduplication. These types have distinct representations for -0
and 0
, which are nevertheless considered equal. This difference must be preserved.
There is one further implementation-level restriction that may be lifted in a future version of PostgreSQL:
Container types (such as composite types, arrays, or range types) cannot use deduplication.
There is one further implementation-level restriction that applies regardless of the operator class or collation used:
INCLUDE
indexes can never use deduplication.
SP-GiST is an abbreviation for space-partitioned GiST. SP-GiST supports partitioned search trees, which facilitate development of a wide range of different non-balanced data structures, such as quad-trees, k-d trees, and radix trees (tries). The common feature of these structures is that they repeatedly divide the search space into partitions that need not be of equal size. Searches that are well matched to the partitioning rule can be very fast.
These popular data structures were originally developed for in-memory usage. In main memory, they are usually designed as a set of dynamically allocated nodes linked by pointers. This is not suitable for direct storing on disk, since these chains of pointers can be rather long which would require too many disk accesses. In contrast, disk-based data structures should have a high fanout to minimize I/O. The challenge addressed by SP-GiST is to map search tree nodes to disk pages in such a way that a search need access only a few disk pages, even if it traverses many nodes.
Like GiST, SP-GiST is meant to allow the development of custom data types with the appropriate access methods, by an expert in the domain of the data type, rather than a database expert.
Some of the information here is derived from Purdue University's SP-GiST Indexing Project web site. The SP-GiST implementation in PostgreSQL is primarily maintained by Teodor Sigaev and Oleg Bartunov, and there is more information on their web site.
The PostgreSQL source distribution includes several examples of index methods implemented using GiST. The core system currently provides text search support (indexing for tsvector
and tsquery
) as well as R-Tree equivalent functionality for some of the built-in geometric data types (see src/backend/access/gist/gistproc.c
). The following contrib
modules also contain GiST operator classes:
btree_gist
B-tree equivalent functionality for several data types
cube
Indexing for multidimensional cubes
hstore
Module for storing (key, value) pairs
intarray
RD-Tree for one-dimensional array of int4 values
ltree
Indexing for tree-like structures
pg_trgm
Text similarity using trigram matching
seg
Indexing for “float ranges”
The core PostgreSQL distribution includes the SP-GiST operator classes shown in Table 63.1.
Name
Indexed Data Type
Indexable Operators
kd_point_ops
point
<<
<@
<^
>>
>^
~=
quad_point_ops
point
<<
<@
<^
>>
>^
~=
range_ops
any range type
&&
&<
&>
-|-
<<
<@
=
>>
@>
box_ops
box
<<
&<
&&
&>
>>
~=
@>
<@
&<|
<<|
|>>
|&>
text_ops
text
<
<=
=
>
>=
~<=~
~<~
~>=~
~>~
inet_ops
inet
, cidr
&&
>>
>>=
>
>=
<>
<<
<<=
<
<=
=
Of the two operator classes for type point
, quad_point_ops
is the default. kd_point_ops
supports the same operators but uses a different index data structure which may offer better performance in some applications.
Traditionally, implementing a new index access method meant a lot of difficult work. It was necessary to understand the inner workings of the database, such as the lock manager and Write-Ahead Log. The GiST interface has a high level of abstraction, requiring the access method implementer only to implement the semantics of the data type being accessed. The GiST layer itself takes care of concurrency, logging and searching the tree structure.
This extensibility should not be confused with the extensibility of the other standard search trees in terms of the data they can handle. For example, PostgreSQL supports extensible B-trees and hash indexes. That means that you can use PostgreSQL to build a B-tree or hash over any data type you want. But B-trees only support range predicates (<
, =
, >
), and hash indexes only support equality queries.
So if you index, say, an image collection with a PostgreSQL B-tree, you can only issue queries such as “is imagex equal to imagey”, “is imagex less than imagey” and “is imagex greater than imagey”. Depending on how you define “equals”, “less than” and “greater than” in this context, this could be useful. However, by using a GiST based index, you could create ways to ask domain-specific questions, perhaps “find all images of horses” or “find all over-exposed images”.
All it takes to get a GiST access method up and running is to implement several user-defined methods, which define the behavior of keys in the tree. Of course these methods have to be pretty fancy to support fancy queries, but for all the standard queries (B-trees, R-trees, etc.) they're relatively straightforward. In short, GiST combines extensibility along with generality, code reuse, and a clean interface.
There are five methods that an index operator class for GiST must provide, and four that are optional. Correctness of the index is ensured by proper implementation of the same
, consistent
and union
methods, while efficiency (size and speed) of the index will depend on the penalty
and picksplit
methods. Two optional methods are compress
and decompress
, which allow an index to have internal tree data of a different type than the data it indexes. The leaves are to be of the indexed data type, while the other tree nodes can be of any C struct (but you still have to follow PostgreSQL data type rules here, see about varlena
for variable sized data). If the tree's internal data type exists at the SQL level, the STORAGE
option of the CREATE OPERATOR CLASS
command can be used. The optional eighth method is distance
, which is needed if the operator class wishes to support ordered scans (nearest-neighbor searches). The optional ninth method fetch
is needed if the operator class wishes to support index-only scans, except when the compress
method is omitted.
consistent
Given an index entry p
and a query value q
, this function determines whether the index entry is “consistent” with the query; that is, could the predicate “indexed_column
indexable_operator
q
” be true for any row represented by the index entry? For a leaf index entry this is equivalent to testing the indexable condition, while for an internal tree node this determines whether it is necessary to scan the subtree of the index represented by the tree node. When the result is true
, a recheck
flag must also be returned. This indicates whether the predicate is certainly true or only possibly true. If recheck
= false
then the index has tested the predicate condition exactly, whereas if recheck
= true
the row is only a candidate match. In that case the system will automatically evaluate the indexable_operator
against the actual row value to see if it is really a match. This convention allows GiST to support both lossless and lossy index structures.
The SQL declaration of the function must look like this:
And the matching code in the C module could then follow this skeleton:
Here, key
is an element in the index and query
the value being looked up in the index. The StrategyNumber
parameter indicates which operator of your operator class is being applied — it matches one of the operator numbers in the CREATE OPERATOR CLASS
command.
Depending on which operators you have included in the class, the data type of query
could vary with the operator, since it will be whatever type is on the righthand side of the operator, which might be different from the indexed data type appearing on the lefthand side. (The above code skeleton assumes that only one type is possible; if not, fetching the query
argument value would have to depend on the operator.) It is recommended that the SQL declaration of the consistent
function use the opclass's indexed data type for the query
argument, even though the actual type might be something else depending on the operator.
union
This method consolidates information in the tree. Given a set of entries, this function generates a new index entry that represents all the given entries.
The SQL declaration of the function must look like this:
And the matching code in the C module could then follow this skeleton:
As you can see, in this skeleton we're dealing with a data type where union(X, Y, Z) = union(union(X, Y), Z)
. It's easy enough to support data types where this is not the case, by implementing the proper union algorithm in this GiST support method.
The result of the union
function must be a value of the index's storage type, whatever that is (it might or might not be different from the indexed column's type). The union
function should return a pointer to newly palloc()
ed memory. You can't just return the input value as-is, even if there is no type change.
As shown above, the union
function's first internal
argument is actually a GistEntryVector
pointer. The second argument is a pointer to an integer variable, which can be ignored. (It used to be required that the union
function store the size of its result value into that variable, but this is no longer necessary.)
compress
Converts a data item into a format suitable for physical storage in an index page. If the compress
method is omitted, data items are stored in the index without modification.
The SQL declaration of the function must look like this:
And the matching code in the C module could then follow this skeleton:
You have to adapt compressed_data_type
to the specific type you're converting to in order to compress your leaf nodes, of course.
decompress
Converts the stored representation of a data item into a format that can be manipulated by the other GiST methods in the operator class. If the decompress
method is omitted, it is assumed that the other GiST methods can work directly on the stored data format. (decompress
is not necessarily the reverse of the compress
method; in particular, if compress
is lossy then it's impossible for decompress
to exactly reconstruct the original data. decompress
is not necessarily equivalent to fetch
, either, since the other GiST methods might not require full reconstruction of the data.)
The SQL declaration of the function must look like this:
And the matching code in the C module could then follow this skeleton:
The above skeleton is suitable for the case where no decompression is needed. (But, of course, omitting the method altogether is even easier, and is recommended in such cases.)
penalty
Returns a value indicating the “cost” of inserting the new entry into a particular branch of the tree. Items will be inserted down the path of least penalty
in the tree. Values returned by penalty
should be non-negative. If a negative value is returned, it will be treated as zero.
The SQL declaration of the function must look like this:
And the matching code in the C module could then follow this skeleton:
For historical reasons, the penalty
function doesn't just return a float
result; instead it has to store the value at the location indicated by the third argument. The return value per se is ignored, though it's conventional to pass back the address of that argument.
The penalty
function is crucial to good performance of the index. It'll get used at insertion time to determine which branch to follow when choosing where to add the new entry in the tree. At query time, the more balanced the index, the quicker the lookup.
picksplit
When an index page split is necessary, this function decides which entries on the page are to stay on the old page, and which are to move to the new page.
The SQL declaration of the function must look like this:
And the matching code in the C module could then follow this skeleton:
Notice that the picksplit
function's result is delivered by modifying the passed-in v
structure. The return value per se is ignored, though it's conventional to pass back the address of v
.
Like penalty
, the picksplit
function is crucial to good performance of the index. Designing suitable penalty
and picksplit
implementations is where the challenge of implementing well-performing GiST indexes lies.
same
Returns true if two index entries are identical, false otherwise. (An “index entry” is a value of the index's storage type, not necessarily the original indexed column's type.)
The SQL declaration of the function must look like this:
And the matching code in the C module could then follow this skeleton:
For historical reasons, the same
function doesn't just return a Boolean result; instead it has to store the flag at the location indicated by the third argument. The return value per se is ignored, though it's conventional to pass back the address of that argument.
distance
Given an index entry p
and a query value q
, this function determines the index entry's “distance” from the query value. This function must be supplied if the operator class contains any ordering operators. A query using the ordering operator will be implemented by returning index entries with the smallest “distance” values first, so the results must be consistent with the operator's semantics. For a leaf index entry the result just represents the distance to the index entry; for an internal tree node, the result must be the smallest distance that any child entry could have.
The SQL declaration of the function must look like this:
And the matching code in the C module could then follow this skeleton:
The arguments to the distance
function are identical to the arguments of the consistent
function.
Some approximation is allowed when determining the distance, so long as the result is never greater than the entry's actual distance. Thus, for example, distance to a bounding box is usually sufficient in geometric applications. For an internal tree node, the distance returned must not be greater than the distance to any of the child nodes. If the returned distance is not exact, the function must set *recheck
to true. (This is not necessary for internal tree nodes; for them, the calculation is always assumed to be inexact.) In this case the executor will calculate the accurate distance after fetching the tuple from the heap, and reorder the tuples if necessary.
If the distance function returns *recheck = true
for any leaf node, the original ordering operator's return type must be float8
or float4
, and the distance function's result values must be comparable to those of the original ordering operator, since the executor will sort using both distance function results and recalculated ordering-operator results. Otherwise, the distance function's result values can be any finite float8
values, so long as the relative order of the result values matches the order returned by the ordering operator. (Infinity and minus infinity are used internally to handle cases such as nulls, so it is not recommended that distance
functions return these values.)
fetch
Converts the compressed index representation of a data item into the original data type, for index-only scans. The returned data must be an exact, non-lossy copy of the originally indexed value.
The SQL declaration of the function must look like this:
The argument is a pointer to a GISTENTRY
struct. On entry, its key
field contains a non-NULL leaf datum in compressed form. The return value is another GISTENTRY
struct, whose key
field contains the same datum in its original, uncompressed form. If the opclass's compress function does nothing for leaf entries, the fetch
method can return the argument as-is. Or, if the opclass does not have a compress function, the fetch
method can be omitted as well, since it would necessarily be a no-op.
The matching code in the C module could then follow this skeleton:
If the compress method is lossy for leaf entries, the operator class cannot support index-only scans, and must not define a fetch
function.
All the GiST support methods are normally called in short-lived memory contexts; that is, CurrentMemoryContext
will get reset after each tuple is processed. It is therefore not very important to worry about pfree'ing everything you palloc. However, in some cases it's useful for a support method to cache data across repeated calls. To do that, allocate the longer-lived data in fcinfo->flinfo->fn_mcxt
, and keep a pointer to it in fcinfo->flinfo->fn_extra
. Such data will survive for the life of the index operation (e.g., a single GiST index scan, index build, or index tuple insertion). Be careful to pfree the previous value when replacing a fn_extra
value, or the leak will accumulate for the duration of the operation.
SP-GiST offers an interface with a high level of abstraction, requiring the access method developer to implement only methods specific to a given data type. The SP-GiST core is responsible for efficient disk mapping and searching the tree structure. It also takes care of concurrency and logging considerations.
Leaf tuples of an SP-GiST tree contain values of the same data type as the indexed column. Leaf tuples at the root level will always contain the original indexed data value, but leaf tuples at lower levels might contain only a compressed representation, such as a suffix. In that case the operator class support functions must be able to reconstruct the original value using information accumulated from the inner tuples that are passed through to reach the leaf level.
Inner tuples are more complex, since they are branching points in the search tree. Each inner tuple contains a set of one or more nodes, which represent groups of similar leaf values. A node contains a downlink that leads either to another, lower-level inner tuple, or to a short list of leaf tuples that all lie on the same index page. Each node normally has a label that describes it; for example, in a radix tree the node label could be the next character of the string value. (Alternatively, an operator class can omit the node labels, if it works with a fixed set of nodes for all inner tuples; see Section 63.4.2.) Optionally, an inner tuple can have a prefix value that describes all its members. In a radix tree this could be the common prefix of the represented strings. The prefix value is not necessarily really a prefix, but can be any data needed by the operator class; for example, in a quad-tree it can store the central point that the four quadrants are measured with respect to. A quad-tree inner tuple would then also contain four nodes corresponding to the quadrants around this central point.
Some tree algorithms require knowledge of level (or depth) of the current tuple, so the SP-GiST core provides the possibility for operator classes to manage level counting while descending the tree. There is also support for incrementally reconstructing the represented value when that is needed, and for passing down additional data (called traverse values) during a tree descent.
The SP-GiST core code takes care of null entries. Although SP-GiST indexes do store entries for nulls in indexed columns, this is hidden from the index operator class code: no null index entries or search conditions will ever be passed to the operator class methods. (It is assumed that SP-GiST operators are strict and so cannot succeed for null values.) Null values are therefore not discussed further here.
There are five user-defined methods that an index operator class for SP-GiST must provide. All five follow the convention of accepting two internal
arguments, the first of which is a pointer to a C struct containing input values for the support method, while the second argument is a pointer to a C struct where output values must be placed. Four of the methods just return void
, since all their results appear in the output struct; but leaf_consistent
additionally returns a boolean
result. The methods must not modify any fields of their input structs. In all cases, the output struct is initialized to zeroes before calling the user-defined method.
The five user-defined methods are:config
Returns static information about the index implementation, including the data type OIDs of the prefix and node label data types.
The SQL declaration of the function must look like this:
The first argument is a pointer to a spgConfigIn
C struct, containing input data for the function. The second argument is a pointer to a spgConfigOut
C struct, which the function must fill with result data.
attType
is passed in order to support polymorphic index operator classes; for ordinary fixed-data-type operator classes, it will always have the same value and so can be ignored.
For operator classes that do not use prefixes, prefixType
can be set to VOIDOID
. Likewise, for operator classes that do not use node labels, labelType
can be set to VOIDOID
. canReturnData
should be set true if the operator class is capable of reconstructing the originally-supplied index value. longValuesOK
should be set true only when the attType
is of variable length and the operator class is capable of segmenting long values by repeated suffixing (see Section 63.4.1).choose
Chooses a method for inserting a new value into an inner tuple.
The SQL declaration of the function must look like this:
The first argument is a pointer to a spgChooseIn
C struct, containing input data for the function. The second argument is a pointer to a spgChooseOut
C struct, which the function must fill with result data.
datum
is the original datum that was to be inserted into the index. leafDatum
is initially the same as datum
, but can change at lower levels of the tree if the choose
or picksplit
methods change it. When the insertion search reaches a leaf page, the current value of leafDatum
is what will be stored in the newly created leaf tuple. level
is the current inner tuple's level, starting at zero for the root level. allTheSame
is true if the current inner tuple is marked as containing multiple equivalent nodes (see Section 63.4.3). hasPrefix
is true if the current inner tuple contains a prefix; if so, prefixDatum
is its value. nNodes
is the number of child nodes contained in the inner tuple, and nodeLabels
is an array of their label values, or NULL if there are no labels.
The choose
function can determine either that the new value matches one of the existing child nodes, or that a new child node must be added, or that the new value is inconsistent with the tuple prefix and so the inner tuple must be split to create a less restrictive prefix.
If the new value matches one of the existing child nodes, set resultType
to spgMatchNode
. Set nodeN
to the index (from zero) of that node in the node array. Set levelAdd
to the increment in level
caused by descending through that node, or leave it as zero if the operator class does not use levels. Set restDatum
to equal datum
if the operator class does not modify datums from one level to the next, or otherwise set it to the modified value to be used as leafDatum
at the next level.
If a new child node must be added, set resultType
to spgAddNode
. Set nodeLabel
to the label to be used for the new node, and set nodeN
to the index (from zero) at which to insert the node in the node array. After the node has been added, the choose
function will be called again with the modified inner tuple; that call should result in an spgMatchNode
result.
If the new value is inconsistent with the tuple prefix, set resultType
to spgSplitTuple
. This action moves all the existing nodes into a new lower-level inner tuple, and replaces the existing inner tuple with a tuple having a single downlink pointing to the new lower-level inner tuple. Set prefixHasPrefix
to indicate whether the new upper tuple should have a prefix, and if so set prefixPrefixDatum
to the prefix value. This new prefix value must be sufficiently less restrictive than the original to accept the new value to be indexed. Set prefixNNodes
to the number of nodes needed in the new tuple, and set prefixNodeLabels
to a palloc'd array holding their labels, or to NULL if node labels are not required. Note that the total size of the new upper tuple must be no more than the total size of the tuple it is replacing; this constrains the lengths of the new prefix and new labels. Set childNodeN
to the index (from zero) of the node that will downlink to the new lower-level inner tuple. Set postfixHasPrefix
to indicate whether the new lower-level inner tuple should have a prefix, and if so set postfixPrefixDatum
to the prefix value. The combination of these two prefixes and the downlink node's label (if any) must have the same meaning as the original prefix, because there is no opportunity to alter the node labels that are moved to the new lower-level tuple, nor to change any child index entries. After the node has been split, the choose
function will be called again with the replacement inner tuple. That call may return an spgAddNode
result, if no suitable node was created by the spgSplitTuple
action. Eventually choose
must return spgMatchNode
to allow the insertion to descend to the next level.picksplit
Decides how to create a new inner tuple over a set of leaf tuples.
The SQL declaration of the function must look like this:
The first argument is a pointer to a spgPickSplitIn
C struct, containing input data for the function. The second argument is a pointer to a spgPickSplitOut
C struct, which the function must fill with result data.
nTuples
is the number of leaf tuples provided. datums
is an array of their datum values. level
is the current level that all the leaf tuples share, which will become the level of the new inner tuple.
Set hasPrefix
to indicate whether the new inner tuple should have a prefix, and if so set prefixDatum
to the prefix value. Set nNodes
to indicate the number of nodes that the new inner tuple will contain, and set nodeLabels
to an array of their label values, or to NULL if node labels are not required. Set mapTuplesToNodes
to an array that gives the index (from zero) of the node that each leaf tuple should be assigned to. Set leafTupleDatums
to an array of the values to be stored in the new leaf tuples (these will be the same as the input datums
if the operator class does not modify datums from one level to the next). Note that the picksplit
function is responsible for palloc'ing the nodeLabels
,mapTuplesToNodes
and leafTupleDatums
arrays.
If more than one leaf tuple is supplied, it is expected that the picksplit
function will classify them into more than one node; otherwise it is not possible to split the leaf tuples across multiple pages, which is the ultimate purpose of this operation. Therefore, if the picksplit
function ends up placing all the leaf tuples in the same node, the core SP-GiST code will override that decision and generate an inner tuple in which the leaf tuples are assigned at random to several identically-labeled nodes. Such a tuple is marked allTheSame
to signify that this has happened. The choose
and inner_consistent
functions must take suitable care with such inner tuples. See Section 63.4.3 for more information.
picksplit
can be applied to a single leaf tuple only in the case that the config
function set longValuesOK
to true and a larger-than-a-page input value has been supplied. In this case the point of the operation is to strip off a prefix and produce a new, shorter leaf datum value. The call will be repeated until a leaf datum short enough to fit on a page has been produced. See Section 63.4.1 for more information.inner_consistent
Returns set of nodes (branches) to follow during tree search.
The SQL declaration of the function must look like this:
The first argument is a pointer to a spgInnerConsistentIn
C struct, containing input data for the function. The second argument is a pointer to a spgInnerConsistentOut
C struct, which the function must fill with result data.
The array scankeys
, of length nkeys
, describes the index search condition(s). These conditions are combined with AND — only index entries that satisfy all of them are interesting. (Note that nkeys
= 0 implies that all index entries satisfy the query.) Usually the consistent function only cares about the sk_strategy
and sk_argument
fields of each array entry, which respectively give the indexable operator and comparison value. In particular it is not necessary to check sk_flags
to see if the comparison value is NULL, because the SP-GiST core code will filter out such conditions. reconstructedValue
is the value reconstructed for the parent tuple; it is (Datum) 0
at the root level or if the inner_consistent
function did not provide a value at the parent level. traversalValue
is a pointer to any traverse data passed down from the previous call of inner_consistent
on the parent index tuple, or NULL at the root level. traversalMemoryContext
is the memory context in which to store output traverse values (see below). level
is the current inner tuple's level, starting at zero for the root level. returnData
is true
if reconstructed data is required for this query; this will only be so if the config
function asserted canReturnData
. allTheSame
is true if the current inner tuple is marked “all-the-same”; in this case all the nodes have the same label (if any) and so either all or none of them match the query (see Section 63.4.3).hasPrefix
is true if the current inner tuple contains a prefix; if so, prefixDatum
is its value. nNodes
is the number of child nodes contained in the inner tuple, and nodeLabels
is an array of their label values, or NULL if the nodes do not have labels.
nNodes
must be set to the number of child nodes that need to be visited by the search, and nodeNumbers
must be set to an array of their indexes. If the operator class keeps track of levels, set levelAdds
to an array of the level increments required when descending to each node to be visited. (Often these increments will be the same for all the nodes, but that's not necessarily so, so an array is used.) If value reconstruction is needed, set reconstructedValues
to an array of the values reconstructed for each child node to be visited; otherwise, leave reconstructedValues
as NULL. If it is desired to pass down additional out-of-band information (“traverse values”) to lower levels of the tree search, settraversalValues
to an array of the appropriate traverse values, one for each child node to be visited; otherwise, leave traversalValues
as NULL. Note that the inner_consistent
function is responsible for palloc'ing the nodeNumbers
, levelAdds
, reconstructedValues
, and traversalValues
arrays in the current memory context. However, any output traverse values pointed to by the traversalValues
array should be allocated in traversalMemoryContext
. Each traverse value must be a single palloc'd chunk.leaf_consistent
Returns true if a leaf tuple satisfies a query.
The SQL declaration of the function must look like this:
The first argument is a pointer to a spgLeafConsistentIn
C struct, containing input data for the function. The second argument is a pointer to a spgLeafConsistentOut
C struct, which the function must fill with result data.
The array scankeys
, of length nkeys
, describes the index search condition(s). These conditions are combined with AND — only index entries that satisfy all of them satisfy the query. (Note that nkeys
= 0 implies that all index entries satisfy the query.) Usually the consistent function only cares about the sk_strategy
and sk_argument
fields of each array entry, which respectively give the indexable operator and comparison value. In particular it is not necessary to check sk_flags
to see if the comparison value is NULL, because the SP-GiST core code will filter out such conditions. reconstructedValue
is the value reconstructed for the parent tuple; it is (Datum) 0
at the root level or if the inner_consistent
function did not provide a value at the parent level. traversalValue
is a pointer to any traverse data passed down from the previous call of inner_consistent
on the parent index tuple, or NULL at the root level. level
is the current leaf tuple's level, starting at zero for the root level. returnData
is true
if reconstructed data is required for this query; this will only be so if the config
function asserted canReturnData
. leafDatum
is the key value stored in the current leaf tuple.
The function must return true
if the leaf tuple matches the query, or false
if not. In the true
case, if returnData
is true
then leafValue
must be set to the value originally supplied to be indexed for this leaf tuple. Also, recheck
may be set to true
if the match is uncertain and so the operator(s) must be re-applied to the actual heap tuple to verify the match.
All the SP-GiST support methods are normally called in a short-lived memory context; that is, CurrentMemoryContext
will be reset after processing of each tuple. It is therefore not very important to worry about pfree'ing everything you palloc. (The config
method is an exception: it should try to avoid leaking memory. But usually the config
method need do nothing but assign constants into the passed parameter struct.)
If the indexed column is of a collatable data type, the index collation will be passed to all the support methods, using the standard PG_GET_COLLATION()
mechanism.
This section covers implementation details and other tricks that are useful for implementers of SP-GiST operator classes to know.
Individual leaf tuples and inner tuples must fit on a single index page (8kB by default). Therefore, when indexing values of variable-length data types, long values can only be supported by methods such as radix trees, in which each level of the tree includes a prefix that is short enough to fit on a page, and the final leaf level includes a suffix also short enough to fit on a page. The operator class should set longValuesOK
to TRUE only if it is prepared to arrange for this to happen. Otherwise, the SP-GiST core will reject any request to index a value that is too large to fit on an index page.
Likewise, it is the operator class's responsibility that inner tuples do not grow too large to fit on an index page; this limits the number of child nodes that can be used in one inner tuple, as well as the maximum size of a prefix value.
Another limitation is that when an inner tuple's node points to a set of leaf tuples, those tuples must all be in the same index page. (This is a design decision to reduce seeking and save space in the links that chain such tuples together.) If the set of leaf tuples grows too large for a page, a split is performed and an intermediate inner tuple is inserted. For this to fix the problem, the new inner tuple must divide the set of leaf values into more than one node group. If the operator class's picksplit
function fails to do that, the SP-GiST core resorts to extraordinary measures described in Section 63.4.3.
Some tree algorithms use a fixed set of nodes for each inner tuple; for example, in a quad-tree there are always exactly four nodes corresponding to the four quadrants around the inner tuple's centroid point. In such a case the code typically works with the nodes by number, and there is no need for explicit node labels. To suppress node labels (and thereby save some space), the picksplit
function can return NULL for the nodeLabels
array, and likewise the choose
function can return NULL for the prefixNodeLabels
array during a spgSplitTuple
action. This will in turn result in nodeLabels
being NULL during subsequent calls to choose
and inner_consistent
. In principle, node labels could be used for some inner tuples and omitted for others in the same index.
When working with an inner tuple having unlabeled nodes, it is an error for choose
to return spgAddNode
, since the set of nodes is supposed to be fixed in such cases.
The SP-GiST core can override the results of the operator class's picksplit
function when picksplit
fails to divide the supplied leaf values into at least two node categories. When this happens, the new inner tuple is created with multiple nodes that each have the same label (if any) that picksplit
gave to the one node it did use, and the leaf values are divided at random among these equivalent nodes. The allTheSame
flag is set on the inner tuple to warn the choose
and inner_consistent
functions that the tuple does not have the node set that they might otherwise expect.
When dealing with an allTheSame
tuple, a choose
result of spgMatchNode
is interpreted to mean that the new value can be assigned to any of the equivalent nodes; the core code will ignore the supplied nodeN
value and descend into one of the nodes at random (so as to keep the tree balanced). It is an error for choose
to return spgAddNode
, since that would make the nodes not all equivalent; the spgSplitTuple
action must be used if the value to be inserted doesn't match the existing nodes.
When dealing with an allTheSame
tuple, the inner_consistent
function should return either all or none of the nodes as targets for continuing the index search, since they are all equivalent. This may or may not require any special-case code, depending on how much the inner_consistent
function normally assumes about the meaning of the nodes.
As shown in Table 37.9, btree defines one required and four optional support functions. The five user-defined methods are:order
For each combination of data types that a btree operator family provides comparison operators for, it must provide a comparison support function, registered in pg_amproc
with support function number 1 and amproclefttype
/amprocrighttype
equal to the left and right data types for the comparison (i.e., the same data types that the matching operators are registered with in pg_amop
). The comparison function must take two non-null values A
and B
and return an int32
value that is <
0
, 0
, or >
0
when A
<
B
, A
=
B
, or A
>
B
, respectively. A null result is disallowed: all values of the data type must be comparable. See src/backend/access/nbtree/nbtcompare.c
for examples.
If the compared values are of a collatable data type, the appropriate collation OID will be passed to the comparison support function, using the standard PG_GET_COLLATION()
mechanism.sortsupport
Optionally, a btree operator family may provide sort support function(s), registered under support function number 2. These functions allow implementing comparisons for sorting purposes in a more efficient way than naively calling the comparison support function. The APIs involved in this are defined in src/include/utils/sortsupport.h
.in_range
Optionally, a btree operator family may provide in_range support function(s), registered under support function number 3. These are not used during btree index operations; rather, they extend the semantics of the operator family so that it can support window clauses containing the RANGE
offset
PRECEDING
and RANGE
offset
FOLLOWING
frame bound types (see Section 4.2.8). Fundamentally, the extra information provided is how to add or subtract an offset
value in a way that is compatible with the family's data ordering.
An in_range
function must have the signature
val
and base
must be of the same type, which is one of the types supported by the operator family (i.e., a type for which it provides an ordering). However, offset
could be of a different type, which might be one otherwise unsupported by the family. An example is that the built-in time_ops
family provides an in_range
function that has offset
of type interval
. A family can provide in_range
functions for any of its supported types and one or more offset
types. Each in_range
function should be entered in pg_amproc
with amproclefttype
equal to type1
and amprocrighttype
equal to type2
.
The essential semantics of an in_range
function depend on the two Boolean flag parameters. It should add or subtract base
and offset
, then compare val
to the result, as follows:
if !
sub
and !
less
, return val
>=
(base
+
offset
)
if !
sub
and less
, return val
<=
(base
+
offset
)
if sub
and !
less
, return val
>=
(base
-
offset
)
if sub
and less
, return val
<=
(base
-
offset
)
Before doing so, the function should check the sign of offset
: if it is less than zero, raise error ERRCODE_INVALID_PRECEDING_OR_FOLLOWING_SIZE
(22013) with error text like “invalid preceding or following size in window function”. (This is required by the SQL standard, although nonstandard operator families might perhaps choose to ignore this restriction, since there seems to be little semantic necessity for it.) This requirement is delegated to the in_range
function so that the core code needn't understand what “less than zero” means for a particular data type.
An additional expectation is that in_range
functions should, if practical, avoid throwing an error if base
+
offset
or base
-
offset
would overflow. The correct comparison result can be determined even if that value would be out of the data type's range. Note that if the data type includes concepts such as “infinity” or “NaN”, extra care may be needed to ensure that in_range
's results agree with the normal sort order of the operator family.
The results of the in_range
function must be consistent with the sort ordering imposed by the operator family. To be precise, given any fixed values of offset
and sub
, then:
If in_range
with less
= true is true for some val1
and base
, it must be true for every val2
<=
val1
with the same base
.
If in_range
with less
= true is false for some val1
and base
, it must be false for every val2
>=
val1
with the same base
.
If in_range
with less
= true is true for some val
and base1
, it must be true for every base2
>=
base1
with the same val
.
If in_range
with less
= true is false for some val
and base1
, it must be false for every base2
<=
base1
with the same val
.
Analogous statements with inverted conditions hold when less
= false.
If the type being ordered (type1
) is collatable, the appropriate collation OID will be passed to the in_range
function, using the standard PG_GET_COLLATION() mechanism.
in_range
functions need not handle NULL inputs, and typically will be marked strict.equalimage
Optionally, a btree operator family may provide equalimage
(“equality implies image equality”) support functions, registered under support function number 4. These functions allow the core code to determine when it is safe to apply the btree deduplication optimization. Currently, equalimage
functions are only called when building or rebuilding an index.
An equalimage
function must have the signature
The return value is static information about an operator class and collation. Returning true
indicates that the order
function for the operator class is guaranteed to only return 0
(“arguments are equal”) when its A
and B
arguments are also interchangeable without any loss of semantic information. Not registering an equalimage
function or returning false
indicates that this condition cannot be assumed to hold.
The opcintype
argument is the pg_type
.oid of the data type that the operator class indexes. This is a convenience that allows reuse of the same underlying equalimage
function across operator classes. If opcintype
is a collatable data type, the appropriate collation OID will be passed to the equalimage
function, using the standard PG_GET_COLLATION()
mechanism.
As far as the operator class is concerned, returning true
indicates that deduplication is safe (or safe for the collation whose OID was passed to its equalimage
function). However, the core code will only deem deduplication safe for an index when every indexed column uses an operator class that registers an equalimage
function, and each function actually returns true
when called.
Image equality is almost the same condition as simple bitwise equality. There is one subtle difference: When indexing a varlena data type, the on-disk representation of two image equal datums may not be bitwise equal due to inconsistent application of TOAST compression on input. Formally, when an operator class's equalimage
function returns true
, it is safe to assume that the datum_image_eq()
C function will always agree with the operator class's order
function (provided that the same collation OID is passed to both the equalimage
and order
functions).
The core code is fundamentally unable to deduce anything about the “equality implies image equality” status of an operator class within a multiple-data-type family based on details from other operator classes in the same family. Also, it is not sensible for an operator family to register a cross-type equalimage
function, and attempting to do so will result in an error. This is because “equality implies image equality” status does not just depend on sorting/equality semantics, which are more or less defined at the operator family level. In general, the semantics that one particular data type implements must be considered separately.
The convention followed by the operator classes included with the core PostgreSQL distribution is to register a stock, generic equalimage
function. Most operator classes register btequalimage()
, which indicates that deduplication is safe unconditionally. Operator classes for collatable data types such as text
register btvarstrequalimage()
, which indicates that deduplication is safe with deterministic collations. Best practice for third-party extensions is to register their own custom function to retain control.
options
Optionally, a B-tree operator family may provide options
(“operator class specific options”) support functions, registered under support function number 5. These functions define a set of user-visible parameters that control operator class behavior.
An options
support function must have the signature
The function is passed a pointer to a local_relopts
struct, which needs to be filled with a set of operator class specific options. The options can be accessed from other support functions using the PG_HAS_OPCLASS_OPTIONS()
and PG_GET_OPCLASS_OPTIONS()
macros.
Currently, no B-Tree operator class has an options
support function. B-tree doesn't allow flexible representation of keys like GiST, SP-GiST, GIN and BRIN do. So, options
probably doesn't have much application in the current B-tree index access method. Nevertheless, this support function was added to B-tree for uniformity, and will probably find uses during further evolution of B-tree in PostgreSQL.
The PostgreSQL source distribution includes several examples of index operator classes for SP-GiST, as described in . Look into src/backend/access/spgist/
and src/backend/utils/adt/
to see the code.