1 of 100

VII. 資料庫進階

本部分包含可能用於 PostgreSQL 研發人員的各種內容。

50. PostgreSQL 的內部架構

Author

This chapter originated as part of [sim98], Stefan Simkovics' Master's Thesis prepared at Vienna University of Technology under the direction of O.Univ.Prof.Dr. Georg Gottlob and Univ.Ass. Mag. Katrin Seyr.

本章概述了 PostgreSQL 的內部架構。閱讀以下各節後，你應該可以了解一個查詢是如何被處理的。本章的目的不是詳細描述 PostgreSQL 的內部操作，因為那樣的說明太過於詳盡。本章旨在幫助讀者理解資料庫後端內部發生的一些操作程序，從接收查詢的開始到將結果回傳給用戶端之間所發生的事。

50.1. The Path of a Query

Here we give a short overview of the stages a query has to pass in order to obtain a result.

A connection from an application program to the PostgreSQL server has to be established. The application program transmits a query to the server and waits to receive the results sent back by the server.
The parser stage checks the query transmitted by the application program for correct syntax and creates a query tree.
The rewrite system takes the query tree created by the parser stage and looks for any rules (stored in the system catalogs) to apply to the query tree. It performs the transformations given in the rule bodies.
One application of the rewrite system is in the realization of views. Whenever a query against a view (i.e., a virtual table) is made, the rewrite system rewrites the user's query to a query that accesses the base tables given in the view definition instead.
The planner/optimizer takes the (rewritten) query tree and creates a query plan that will be the input to the executor.
It does so by first creating all possible paths leading to the same result. For example if there is an index on a relation to be scanned, there are two paths for the scan. One possibility is a simple sequential scan and the other possibility is to use the index. Next the cost for the execution of each path is estimated and the cheapest path is chosen. The cheapest path is expanded into a complete plan that the executor can use.
The executor recursively steps through the plan tree and retrieves rows in the way represented by the plan. The executor makes use of the storage system while scanning relations, performs sorts and joins, evaluates qualifications and finally hands back the rows derived.

In the following sections we will cover each of the above listed items in more detail to give a better understanding of PostgreSQL's internal control and data structures.

50.2. How Connections Are Established

PostgreSQL is implemented using a simple “process per user” client/server model. In this model there is one client process connected to exactly one server process. As we do not know ahead of time how many connections will be made, we have to use a master process that spawns a new server process every time a connection is requested. This master process is called postgres and listens at a specified TCP/IP port for incoming connections. Whenever a request for a connection is detected the postgres process spawns a new server process. The server tasks communicate with each other using semaphores and shared memory to ensure data integrity throughout concurrent data access.

The client process can be any program that understands the PostgreSQL protocol described in Chapter 52. Many clients are based on the C-language library libpq, but several independent implementations of the protocol exist, such as the Java JDBC driver.

Once a connection is established the client process can send a query to the backend (server). The query is transmitted using plain text, i.e., there is no parsing done in the frontend (client). The server parses the query, creates an execution plan, executes the plan and returns the retrieved rows to the client by transmitting them over the established connection.

50.3. The Parser Stage

The parser stage consists of two parts:

The parser defined in gram.y and scan.l is built using the Unix tools bison and flex.
The transformation process does modifications and augmentations to the data structures returned by the parser.

50.3.1. Parser

The parser has to check the query string (which arrives as plain text) for valid syntax. If the syntax is correct a parse tree is built up and handed back; otherwise an error is returned. The parser and lexer are implemented using the well-known Unix tools bison and flex.

The lexer is defined in the file scan.l and is responsible for recognizing identifiers, the SQL key words etc. For every key word or identifier that is found, a token is generated and handed to the parser.

The parser is defined in the file gram.y and consists of a set of grammar rules and actions that are executed whenever a rule is fired. The code of the actions (which is actually C code) is used to build up the parse tree.

The file scan.l is transformed to the C source file scan.c using the program flex and gram.y is transformed to gram.c using bison. After these transformations have taken place a normal C compiler can be used to create the parser. Never make any changes to the generated C files as they will be overwritten the next time flex or bison is called.

Note

The mentioned transformations and compilations are normally done automatically using the makefiles shipped with the PostgreSQL source distribution.

A detailed description of bison or the grammar rules given in gram.y would be beyond the scope of this paper. There are many books and documents dealing with flex and bison. You should be familiar with bison before you start to study the grammar given in gram.y otherwise you won't understand what happens there.

50.3.2. Transformation Process

The parser stage creates a parse tree using only fixed rules about the syntactic structure of SQL. It does not make any lookups in the system catalogs, so there is no possibility to understand the detailed semantics of the requested operations. After the parser completes, the transformation process takes the tree handed back by the parser as input and does the semantic interpretation needed to understand which tables, functions, and operators are referenced by the query. The data structure that is built to represent this information is called the query tree.

The reason for separating raw parsing from semantic analysis is that system catalog lookups can only be done within a transaction, and we do not wish to start a transaction immediately upon receiving a query string. The raw parsing stage is sufficient to identify the transaction control commands (BEGIN, ROLLBACK, etc), and these can then be correctly executed without any further analysis. Once we know that we are dealing with an actual query (such as SELECT or UPDATE), it is okay to start a transaction if we're not already in one. Only then can the transformation process be invoked.

The query tree created by the transformation process is structurally similar to the raw parse tree in most places, but it has many differences in detail. For example, a FuncCall node in the parse tree represents something that looks syntactically like a function call. This might be transformed to either a FuncExpr or Aggref node depending on whether the referenced name turns out to be an ordinary function or an aggregate function. Also, information about the actual data types of columns and expression results is added to the query tree.

50.4. The PostgreSQL Rule System

PostgreSQL supports a powerful rule system for the specification of views and ambiguous view updates. Originally the PostgreSQL rule system consisted of two implementations:

The first one worked using row level processing and was implemented deep in the executor. The rule system was called whenever an individual row had been accessed. This implementation was removed in 1995 when the last official release of the Berkeley Postgres project was transformed into Postgres95.
The second implementation of the rule system is a technique called query rewriting. The rewrite system is a module that exists between the parser stage and the planner/optimizer. This technique is still implemented.

The query rewriter is discussed in some detail in Chapter 40, so there is no need to cover it here. We will only point out that both the input and the output of the rewriter are query trees, that is, there is no change in the representation or level of semantic detail in the trees. Rewriting can be thought of as a form of macro expansion.

50.5. Planner/Optimizer

The task of the planner/optimizer is to create an optimal execution plan. A given SQL query (and hence, a query tree) can be actually executed in a wide variety of different ways, each of which will produce the same set of results. If it is computationally feasible, the query optimizer will examine each of these possible execution plans, ultimately selecting the execution plan that is expected to run the fastest.

Note

In some situations, examining each possible way in which a query can be executed would take an excessive amount of time and memory space. In particular, this occurs when executing queries involving large numbers of join operations. In order to determine a reasonable (not necessarily optimal) query plan in a reasonable amount of time, PostgreSQL uses a Genetic Query Optimizer (see Chapter 59) when the number of joins exceeds a threshold (see geqo_threshold).

The planner's search procedure actually works with data structures called paths, which are simply cut-down representations of plans containing only as much information as the planner needs to make its decisions. After the cheapest path is determined, a full-fledged plan tree is built to pass to the executor. This represents the desired execution plan in sufficient detail for the executor to run it. In the rest of this section we'll ignore the distinction between paths and plans.

50.5.1. Generating Possible Plans

The planner/optimizer starts by generating plans for scanning each individual relation (table) used in the query. The possible plans are determined by the available indexes on each relation. There is always the possibility of performing a sequential scan on a relation, so a sequential scan plan is always created. Assume an index is defined on a relation (for example a B-tree index) and a query contains the restriction relation.attribute OPR constant. If relation.attribute happens to match the key of the B-tree index and OPR is one of the operators listed in the index's operator class, another plan is created using the B-tree index to scan the relation. If there are further indexes present and the restrictions in the query happen to match a key of an index, further plans will be considered. Index scan plans are also generated for indexes that have a sort ordering that can match the query's ORDER BY clause (if any), or a sort ordering that might be useful for merge joining (see below).

If the query requires joining two or more relations, plans for joining relations are considered after all feasible plans have been found for scanning single relations. The three available join strategies are:

nested loop join: The right relation is scanned once for every row found in the left relation. This strategy is easy to implement but can be very time consuming. (However, if the right relation can be scanned with an index scan, this can be a good strategy. It is possible to use values from the current row of the left relation as keys for the index scan of the right.)
merge join: Each relation is sorted on the join attributes before the join starts. Then the two relations are scanned in parallel, and matching rows are combined to form join rows. This kind of join is more attractive because each relation has to be scanned only once. The required sorting might be achieved either by an explicit sort step, or by scanning the relation in the proper order using an index on the join key.
hash join: the right relation is first scanned and loaded into a hash table, using its join attributes as hash keys. Next the left relation is scanned and the appropriate values of every row found are used as hash keys to locate the matching rows in the table.

When the query involves more than two relations, the final result must be built up by a tree of join steps, each with two inputs. The planner examines different possible join sequences to find the cheapest one.

If the query uses fewer than geqo_threshold relations, a near-exhaustive search is conducted to find the best join sequence. The planner preferentially considers joins between any two relations for which there exist a corresponding join clause in the WHERE qualification (i.e., for which a restriction like where rel1.attr1=rel2.attr2 exists). Join pairs with no join clause are considered only when there is no other choice, that is, a particular relation has no available join clauses to any other relation. All possible plans are generated for every join pair considered by the planner, and the one that is (estimated to be) the cheapest is chosen.

When geqo_threshold is exceeded, the join sequences considered are determined by heuristics, as described in Chapter 59. Otherwise the process is the same.

The finished plan tree consists of sequential or index scans of the base relations, plus nested-loop, merge, or hash join nodes as needed, plus any auxiliary steps needed, such as sort nodes or aggregate-function calculation nodes. Most of these plan node types have the additional ability to do selection (discarding rows that do not meet a specified Boolean condition) and projection (computation of a derived column set based on given column values, that is, evaluation of scalar expressions where needed). One of the responsibilities of the planner is to attach selection conditions from the WHERE clause and computation of required output expressions to the most appropriate nodes of the plan tree.

50.6. Executor

The executor takes the plan created by the planner/optimizer and recursively processes it to extract the required set of rows. This is essentially a demand-pull pipeline mechanism. Each time a plan node is called, it must deliver one more row, or report that it is done delivering rows.

To provide a concrete example, assume that the top node is a MergeJoin node. Before any merge can be done two rows have to be fetched (one from each subplan). So the executor recursively calls itself to process the subplans (it starts with the subplan attached to lefttree). The new top node (the top node of the left subplan) is, let's say, a Sort node and again recursion is needed to obtain an input row. The child node of the Sort might be a SeqScan node, representing actual reading of a table. Execution of this node causes the executor to fetch a row from the table and return it up to the calling node. The Sort node will repeatedly call its child to obtain all the rows to be sorted. When the input is exhausted (as indicated by the child node returning a NULL instead of a row), the Sort code performs the sort, and finally is able to return its first output row, namely the first one in sorted order. It keeps the remaining rows stored so that it can deliver them in sorted order in response to later demands.

The MergeJoin node similarly demands the first row from its right subplan. Then it compares the two rows to see if they can be joined; if so, it returns a join row to its caller. On the next call, or immediately if it cannot join the current pair of inputs, it advances to the next row of one table or the other (depending on how the comparison came out), and again checks for a match. Eventually, one subplan or the other is exhausted, and the MergeJoin node returns NULL to indicate that no more join rows can be formed.

Complex queries can involve many levels of plan nodes, but the general approach is the same: each node computes and returns its next output row each time it is called. Each node is also responsible for applying any selection or projection expressions that were assigned to it by the planner.

The executor mechanism is used to evaluate all four basic SQL query types: SELECT, INSERT, UPDATE, and DELETE. For SELECT, the top-level executor code only needs to send each row returned by the query plan tree off to the client. For INSERT, each returned row is inserted into the target table specified for the INSERT. This is done in a special top-level plan node called ModifyTable. (A simple INSERT ... VALUES command creates a trivial plan tree consisting of a single Result node, which computes just one result row, and ModifyTable above it to perform the insertion. But INSERT ... SELECT can demand the full power of the executor mechanism.) For UPDATE, the planner arranges that each computed row includes all the updated column values, plus the TID (tuple ID, or row ID) of the original target row; this data is fed into a ModifyTable node, which uses the information to create a new updated row and mark the old row deleted. For DELETE, the only column that is actually returned by the plan is the TID, and the ModifyTable node simply uses the TID to visit each target row and mark it deleted.

51. 系統目錄

系統目錄（system catalog）是記錄資料庫管理系統儲存結構原始資料的地方，例如關於資料表和欄位的訊息以及內部日誌記錄訊息。PostgreSQL 的系統目錄是一般的資料表。您可以刪除並重新建立資料表、增加欄位、插入和更新內容，並以這種方式嚴重混淆您的系統。當然，通常情況下，不應該手動更改系統目錄，通常有 SQL 命令來執行此操作。（例如，CREATE DATABASE 向 pg_database 系統目錄插入一行 - 實際上是在磁碟上建立數據庫）。對於特別深奧的操作有一些例外，但其中很多已經隨著時間的推移而變為 SQL 命令，因此需要系統目錄的直接操作正在不斷減少。

51.3. pg_am

The catalog pg_am stores information about relation access methods. There is one row for each access method supported by the system. Currently, only indexes have access methods. The requirements for index access methods are discussed in detail in Chapter 60.

Table 51.3. pg_am Columns

Note

Before PostgreSQL 9.6, pg_am contained many additional columns representing properties of index access methods. That data is now only directly visible at the C code level. However,pg_index_column_has_property() and related functions have been added to allow SQL queries to inspect index access method properties; see Table 9.63.

51.7. pg_attribute

The catalog pg_attribute stores information about table columns. There will be exactly one pg_attribute row for every column in every table in the database. (There will also be attribute entries for indexes, and indeed all objects that have pg_class entries.)

The term attribute is equivalent to column and is used for historical reasons.

Table 51.7. `pg_attribute` Columns

In a dropped column's pg_attribute entry, atttypid is reset to zero, but attlen and the other fields copied from pg_type are still valid. This arrangement is needed to cope with the situation where the dropped column's data type was later dropped, and so there is no pg_type row anymore. attlen and the other fields can be used to interpret the contents of a row of the table.

51.8. pg_authid

目錄 pg_authid 包含有關資料庫認證識別（角色）的資訊。角色包含「使用者」和「群組」的概念。使用者基本上只是設置了 rolcanlogin 識別的角色。任何角色（無論有或沒有 rolcanlogin）都可以擁有其他角色作為成員；詳見。

由於此目錄包含密碼，因此不得公開讀取。是 pg_authid 上的一個公開可讀的檢視表，它隱藏了密碼字串。

包含有關使用者和權限管理的詳細訊息。

由於使用者身份是叢集範圍的，因此 pg_authid 在叢集的所有資料庫之間共享：每個叢集只有一個 pg_authid 副本，而不是每個資料庫一個副本。

Table 51.8. pg_authid Columns

對於 MD5 加密密碼，rolpassword 欄位將以字串 md5 開頭，之後跟 32 個字元的十六進位 MD5 hash。MD5 hash 將是使用者的密碼連接到他們的使用者名稱。例如，如果使用者 joe 的密碼為 xyzzy，則 PostgreSQL 將儲存 xyzzyjoe 的 md5 hash 值。

如果使用 SCRAM-SHA-256 加密密碼，則其格式為：

其中 salt，StoredKey 和 ServerKey 採用 Base64 編碼格式。此格式與 RFC 5803 指定的格式相同。

未遵循這些格式之一的密碼就會被認為是未加密的。

51.9. pg_auth_members

目錄 pg_auth_members 顯示角色之間的成員資格關連。允許任何非循環的關連。

由於角色身份是叢集範圍的，因此 pg_auth_members 在叢集的所有資料庫之間共享：每個叢集只有一個 pg_auth_members 副本，而不是每個資料庫一個副本。

Table 51.9. pg_auth_members Columns

51.11 pg_class

目錄 pg_class 對資料表和大多數具有欄位或其他類似於資料表的內容進行彙整。包括索引（但也參閱 pg_index）、序列（但請參閱 pg_sequence）、檢視表、具體化檢視表、複合型別和 TOAST 資料表；另請查看 relkind 欄位。以下，當我們指的是所有這些類型的物件時，我們都會說「關連（relation）」。並非所有欄位對所有關連類型都有意義。

Table 52.11. `pg_class` Columns

pg_class 中的幾個布林欄位的維護是鬆散的：如果這是正確的狀態，那它們保證為 true，但是當條件不再為真時，可能不會立即重置為 false。例如，relhasindex 由CREATE INDEX 設定，但它永遠不會被 DROP INDEX 清除。相反地，如果 VACUUM 發現資料表沒有索引，則清除 relhasindex。這種安排避免了競爭條件並改善了一致性。

51.12. pg_collation

The catalog pg_collation describes the available collations, which are essentially mappings from an SQL name to operating system locale categories. See Section 23.2 for more information.

Table 51.12. pg_collation Columns

Note that the unique key on this catalog is (collname, collencoding, collnamespace) not just (collname, collnamespace). PostgreSQL generally ignores all collations that do not have collencoding equal to either the current database's encoding or -1, and creation of new entries with the same name as an entry with collencoding = -1 is forbidden. Therefore it is sufficient to use a qualified SQL name (schema.name) to identify a collation, even though this is not unique according to the catalog definition. The reason for defining the catalog this way is that initdb fills it in at cluster initialization time with entries for all locales available on the system, so it must be able to hold entries for all encodings that might ever be used in the cluster.

In the template0 database, it could be useful to create collations whose encoding does not match the database encoding, since they could match the encodings of databases later cloned from template0. This would currently have to be done manually.

51.13. pg_constraint

The catalog pg_constraint stores check, primary key, unique, foreign key, and exclusion constraints on tables. (Column constraints are not treated specially. Every column constraint is equivalent to some table constraint.) Not-null constraints are represented in the pg_attribute catalog, not here.

User-defined constraint triggers (created with CREATE CONSTRAINT TRIGGER) also give rise to an entry in this table.

Check constraints on domains are stored here, too.

Table 51.13. pg_constraint Columns

In the case of an exclusion constraint, conkey is only useful for constraint elements that are simple column references. For other cases, a zero appears in conkey and the associated index must be consulted to discover the expression that is constrained. (conkey thus has the same contents as pg_index.indkey for the index.)

Note

consrc is not updated when referenced objects change; for example, it won't track renaming of columns. Rather than relying on this field, it's best to use pg_get_constraintdef() to extract the definition of a check constraint.

Note

pg_class.relchecks needs to agree with the number of check-constraint entries found in this table for each relation.

51.15 pg_database

目錄 pg_database 儲存有關資料庫一些可用的訊息。資料庫是使用 CREATE DATABASE 命令建立的。關於某些參數的含義的詳細訊息，請參閱第 22 章。

與大多數系統目錄不同，pg_database 在叢集的所有資料庫之間共享：每個叢集只有一個 pg_database 副本，而不是每個資料庫一個副本。

Table 51.15. pg_database 欄位

51.22. pg_extension

目錄 pg_extension 儲存有關已安裝延伸功能的資訊。有關延伸功能的詳細資訊，請參閱第 37.17 節。

Table 51.22. pg_extension Columns

請注意，與大多數帶有「namespace」欄位的目錄不同，extnamespace 並不暗指該延伸功能屬於該綱要(schema)。延伸功能並不在任何綱要之中。不過 extnamespace 指示包含大多數或所有延伸功能所屬物件的綱要。如果 extrelocatable 為 true，則此綱要實際上必須包含屬於該延伸功能的所有需要綱要的物件。

51.26 pg_index

目錄 pg_index 包含有關索引的部分信息。其餘的大多數是在 pg_class 中。

Table 51.26. pg_index Columns

51.29. pg_language

目錄 pg_language 註冊了可以撰寫函數或 stored procedure 的語言。有關語言處理程序的更多訊息，請參閱 CREATE LANGUAGE 和第 41 章。

Table 51.29. `pg_language` Columns

51.32. pg_namespace

目錄 pg_namespace 儲存命名空間。命名空間是 SQL 綱要的基礎結構：每個命名空間可以有一個獨立的關連，型別等集合，而不會有名稱衝突。

Table 52.32. `pg_namespace` Columns

51.33. pg_opclass

The catalog pg_opclass defines index access method operator classes. Each operator class defines semantics for index columns of a particular data type and a particular index access method. An operator class essentially specifies that a particular operator family is applicable to a particular indexable column data type. The set of operators from the family that are actually usable with the indexed column are whichever ones accept the column's data type as their left-hand input.

Operator classes are described at length in Section 37.14.

Table 51.33. pg_opclass Columns

An operator class's opcmethod must match the opfmethod of its containing operator family. Also, there must be no more than one pg_opclass row having opcdefault true for any given combination of opcmethod and opcintype.

51.38. pg_policy

The catalog pg_policy stores row level security policies for tables. A policy includes the kind of command that it applies to (possibly all commands), the roles that it applies to, the expression to be added as a security-barrier qualification to queries that include the table, and the expression to be added as a WITH CHECK option for queries that attempt to add new records to the table.

Table 51.38. pg_policy Columns

Note

Policies stored in pg_policy are applied only when pg_class.relrowsecurity is set for their table.

51.39. pg_proc

版本：11

目錄 pg_proc 儲存有關函數、程序函數、彙總函數和窗函數（或統稱為 routines）的資訊。有關更多資訊，請參閱 CREATE FUNCTION，CREATE PROCEDURE 和第 37.3 節。

如果 prokind 指示該項目用於彙總函數，則 pg_aggregate 中應有相對應的資料列。

Table 51.39. `pg_proc` Columns

對於內建和動態載入的已編譯函數，prosrc 包含函數的 C 語言名稱（link symbol）。對於所有其他目前已知的語言類型，prosrc 包含函數的原始碼。除了動態載入的 C 函數外，probin 均未使用，因為它用於記錄該函數的共享函式庫檔案的名稱。

51.44. pg_rewrite

The catalog pg_rewrite stores rewrite rules for tables and views.

Table 51.44. pg_rewrite Columns

Note

pg_class.relhasrules must be true if a table has any rules in this catalog.

51.50. pg_statistic

系統目錄 pg_statistic 儲存有關資料庫內容的統計數據。內容項目是由 ANALYZE 所建立的，隨後由查詢計劃程序使用。請注意，所有統計數據本質上都是趨近的，即使假設上它是最新的數據。

Normally there is one entry, with stainherit = false, for each table column that has been analyzed. If the table has inheritance children, a second entry with stainherit = true is also created. This row represents the column's statistics over the inheritance tree, i.e., statistics for the data you'd see with SELECT column FROM table*, whereas the stainherit = false row represents the results of SELECT column FROM ONLY table.

pg_statistic also stores statistical data about the values of index expressions. These are described as if they were actual data columns; in particular, starelid references the index. No entry is made for an ordinary non-expression index column, however, since it would be redundant with the entry for the underlying table column. Currently, entries for index expressions always have stainherit = false.

Since different kinds of statistics might be appropriate for different kinds of data, pg_statistic is designed not to assume very much about what sort of statistics it stores. Only extremely general statistics (such as nullness) are given dedicated columns in pg_statistic. Everything else is stored in “slots”, which are groups of associated columns whose content is identified by a code number in one of the slot's columns. For more information see src/include/catalog/pg_statistic.h.

pg_statistic should not be readable by the public, since even statistical information about a table's contents might be considered sensitive. (Example: minimum and maximum values of a salary column might be quite interesting.) pg_stats is a publicly readable view on pg_statistic that only exposes information about those tables that are readable by the current user.

Table 51.50. `pg_statistic` Columns

51.51. pg_statistic_ext

目錄 pg_statistic_ext 包含了延伸的計劃程序統計資訊。此目錄中的每一個資料列相對應於使用 CREATE STATISTICS 所建立的統計資訊物件。

Table 52.51. `pg_statistic_ext` Columns

stxkind 欄位會在建立統計物件時填入，指示需要哪種統計類型。在它之後的欄位最初為 NULL，僅在 ANALYZE 計算出相對應的統計量時才填入內容。

51.54. pg_tablespace

系統目錄 pg_tablespace 儲存有關可用資料表空間的訊息。可以將資料表放置在特定的資料表空間中以幫助管理磁碟規畫。

與大多數系統目錄不同，pg_tablespace 在叢集的所有資料庫之間共享：每個叢集只有一個 pg_tablespace 副本，而不是每個資料庫一個副本。

Table 51.54. pg_tablespace 欄位

51.56. pg_trigger

The catalog pg_trigger stores triggers on tables and views. See CREATE TRIGGER for more information.

Table 51.56. pg_trigger Columns

Currently, column-specific triggering is supported only for UPDATE events, and so tgattr is relevant only for that event type. tgtype might contain bits for other event types as well, but those are presumed to be table-wide regardless of what is in tgattr.

Note

When tgconstraint is nonzero, tgconstrrelid, tgconstrindid, tgdeferrable, and tginitdeferred are largely redundant with the referenced pg_constraint entry. However, it is possible for a non-deferrable trigger to be associated with a deferrable constraint: foreign key constraints can have some deferrable and some non-deferrable triggers.

Note

pg_class.relhastriggers must be true if a relation has any triggers in this catalog.

51.62. pg_type

The catalog pg_type stores information about data types. Base types and enum types (scalar types) are created with CREATE TYPE, and domains with CREATE DOMAIN. A composite type is automatically created for each table in the database, to represent the row structure of the table. It is also possible to create composite types with CREATE TYPE AS.

Table 51.62. pg_type Columns

Table 51.63 lists the system-defined values of typcategory. Any future additions to this list will also be upper-case ASCII letters. All other ASCII characters are reserved for user-defined categories.

Table 51.63. typcategory Codes

51.66. pg_available_extensions

pg_available_extensions 檢視表列出可以安裝的延伸功能。另請參閱 pg_extension 系統目錄，該目錄顯示了目前已安裝的延伸功能。

Table 51.67. `pg_available_extensions` Columns

pg_available_extensions 檢視表是唯讀的。

51.67. pg_available_extension_versions

pg_available_extension_versions 檢視圖列出可用於安裝的特定延伸功能版本。另請參閱 pg_extension 目錄，該目錄列出了目前已安裝的延伸功能。

Table 51.68. `pg_available_extension_versions` Columns

pg_available_extension_versions 檢視表是唯讀的。

51.72. pg_hba_file_rules

The view pg_hba_file_rules provides a summary of the contents of the client authentication configuration file, pg_hba.conf. A row appears in this view for each non-empty, non-comment line in the file, with annotations indicating whether the rule could be applied successfully.

This view can be helpful for checking whether planned changes in the authentication configuration file will work, or for diagnosing a previous failure. Note that this view reports on the current contents of the file, not on what was last loaded by the server.

By default, the pg_hba_file_rules view can be read only by superusers.

Table 51.73. `pg_hba_file_rules` Columns

Usually, a row reflecting an incorrect entry will have values for only the line_number and error fields.

有關用戶端身份驗證設定的更多資訊，請參閱第 20 章。

51.73. pg_indexes

The view pg_indexes provides access to useful information about each index in the database.

Table 51.74. `pg_indexes` Columns

51.74. pg_locks

The view pg_locks provides access to information about the locks held by active processes within the database server. See Chapter 13 for more discussion of locking.

pg_locks contains one row per active lockable object, requested lock mode, and relevant process. Thus, the same lockable object might appear many times, if multiple processes are holding or waiting for locks on it. However, an object that currently has no locks on it will not appear at all.

There are several distinct types of lockable objects: whole relations (e.g., tables), individual pages of relations, individual tuples of relations, transaction IDs (both virtual and permanent IDs), and general database objects (identified by class OID and object OID, in the same way as in pg_description or pg_depend). Also, the right to extend a relation is represented as a separate lockable object. Also, “advisory” locks can be taken on numbers that have user-defined meanings.

Table 51.75. `pg_locks` Columns

granted is true in a row representing a lock held by the indicated process. False indicates that this process is currently waiting to acquire this lock, which implies that at least one other process is holding or waiting for a conflicting lock mode on the same lockable object. The waiting process will sleep until the other lock is released (or a deadlock situation is detected). A single process can be waiting to acquire at most one lock at a time.

Throughout running a transaction, a server process holds an exclusive lock on the transaction's virtual transaction ID. If a permanent ID is assigned to the transaction (which normally happens only if the transaction changes the state of the database), it also holds an exclusive lock on the transaction's permanent transaction ID until it ends. When a process finds it necessary to wait specifically for another transaction to end, it does so by attempting to acquire share lock on the other transaction's ID (either virtual or permanent ID depending on the situation). That will succeed only when the other transaction terminates and releases its locks.

Although tuples are a lockable type of object, information about row-level locks is stored on disk, not in memory, and therefore row-level locks normally do not appear in this view. If a process is waiting for a row-level lock, it will usually appear in the view as waiting for the permanent transaction ID of the current holder of that row lock.

Advisory locks can be acquired on keys consisting of either a single bigint value or two integer values. A bigint key is displayed with its high-order half in the classid column, its low-order half in the objid column, and objsubid equal to 1. The original bigint value can be reassembled with the expression (classid::bigint << 32) | objid::bigint. Integer keys are displayed with the first key in the classid column, the second key in the objid column, and objsubid equal to 2. The actual meaning of the keys is up to the user. Advisory locks are local to each database, so the database column is meaningful for an advisory lock.

pg_locks provides a global view of all locks in the database cluster, not only those relevant to the current database. Although its relation column can be joined against pg_class.oid to identify locked relations, this will only work correctly for relations in the current database (those for which the database column is either the current database's OID or zero).

The pid column can be joined to the pid column of the pg_stat_activity view to get more information on the session holding or awaiting each lock, for example

SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa
    ON pl.pid = psa.pid;

Also, if you are using prepared transactions, the virtualtransaction column can be joined to the transaction column of the pg_prepared_xacts view to get more information on prepared transactions that hold locks. (A prepared transaction can never be waiting for a lock, but it continues to hold the locks it acquired while running.) For example:

SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
    ON pl.virtualtransaction = '-1/' || ppx.transaction;

While it is possible to obtain information about which processes block which other processes by joining pg_locks against itself, this is very difficult to get right in detail. Such a query would have to encode knowledge about which lock modes conflict with which others. Worse, the pg_locks view does not expose information about which processes are ahead of which others in lock wait queues, nor information about which processes are parallel workers running on behalf of which other client sessions. It is better to use the pg_blocking_pids() function (see Table 9.63) to identify which process(es) a waiting process is blocked behind.

The pg_locks view displays data from both the regular lock manager and the predicate lock manager, which are separate systems; in addition, the regular lock manager subdivides its locks into regular and fast-path locks. This data is not guaranteed to be entirely consistent. When the view is queried, data on fast-path locks (with fastpath = true) is gathered from each backend one at a time, without freezing the state of the entire lock manager, so it is possible for locks to be taken or released while information is gathered. Note, however, that these locks are known not to conflict with any other lock currently in place. After all backends have been queried for fast-path locks, the remainder of the regular lock manager is locked as a unit, and a consistent snapshot of all remaining locks is collected as an atomic action. After unlocking the regular lock manager, the predicate lock manager is similarly locked and all predicate locks are collected as an atomic action. Thus, with the exception of fast-path locks, each lock manager will deliver a consistent set of results, but as we do not lock both lock managers simultaneously, it is possible for locks to be taken or released after we interrogate the regular lock manager and before we interrogate the predicate lock manager.

Locking the regular and/or predicate lock manager could have some impact on database performance if this view is very frequently accessed. The locks are held only for the minimum amount of time necessary to obtain data from the lock managers, but this does not completely eliminate the possibility of a performance impact.

51.79. pg_replication_origin_status

The pg_replication_origin_status view contains information about how far replay for a certain origin has progressed. For more on replication origins see Chapter 49.

Table 51.80. pg_replication_origin_status Columns

51.82 pg_roles

pg_roles 這個 view 提供對資料庫角色的資訊。這只是 pg_authid 的一個公共可讀的 view，它將密碼部份予以屏蔽。

此 view 列出底層的 OID 欄位，因此可能需要執行與其他目錄的交叉查詢。

Table 51.83. pg_roles 欄位

51.85. pg_settings

檢視表 pg_settings 提供對伺服器的執行時參數的存取。它本質上是 SHOW 和 SET 指令的替代介面。它也提供 SHOW 無法直接獲得的一些資訊存取，例如最小值和最大值。

Table 52.86. `pg_settings` Columns

設定內容有幾種可能的值，是為了降低變更組態的複雜度，它們是：

internal

這些設定無法直接更改；它們反映了內部所決定的值，其中一些可以透過使用不同的組態選項重建伺服器，或透過更改提供給 initdb 的選項來調整。

postmaster

這些設定只能在伺服器啟動時套用，因此任何變更都需要重新啟動伺服器。這些設定的值通常儲存在 postgresql.conf 檔案中，或在啟動伺服器時在命令列中給予。當然，也可以在伺服器啟動時設定任何層級較低的設定。

sighup

可以在 postgresql.conf 中對這些設定進行變更，而毌須重新啟動伺服器。只要向 postmaster 發送一個 SIGHUP 信號，使其重新讀取 postgresql.conf 並套用變更。postmaster 還會將 SIGHUP 信號轉發給其子程序，以便它們都獲取新值。

superuser-backend

可以在 postgresql.conf 中對這些設定進行變更，而毌須重新啟動伺服器。它們也可以在連線要求的封包中設定為特別連線（例如，透過 libpq 的 PGOPTIONS 環境變數），但前提是連線使用者是超級使用者。但是，這些設定在啟動後的連線中永遠不會變更。如果你在 postgresql.conf 中更改它們，請向 postmaster 發送一個 SIGHUP 信號，使其重新讀取 postgresql.conf。新值只會影響隨後啟動的連線。

backend

可以在 postgresql.conf 中對這些設定進行變更，而毌須重新啟動伺服器。它們也可以在連線請求封包中設定為特別連線（例如，透過 libpq 的 PGOPTIONS 環境變數）；任何使用者都可以為他們的連線進行這樣的變更。但是，這些設定在啟動後的連線中永遠無法變更。如果你在 postgresql.conf 中更改它們，請向 postmaster 發送一個 SIGHUP 信號，使其重新讀取 postgresql.conf。新值只會影響隨後啟動的連線。

superuser

這些設定可以從 postgresql.conf 設定，也可以透過 SET 指令在連線中設定；但只有超級使用者可以透過 SET 來更改。僅當沒有使用 SET 建立連線專用的值時，postgresql.conf 中的變更才會影響現有連線。

user

這些設定可以從 postgresql.conf 設定，也可以透過 SET 指令在連線中設定。允許任何使用者變更其連線中所使用的值。僅當未使用 SET 未建立連線專用值時，postgresql.conf 中的變更才會影響現有連線。

有關變更這些參數的各種方法和更多資訊，請參閱第 19.1 節。

pg_settings 檢視表無法INSERT 或 DELETE，但可以 UPDATE。套用於一行 pg_settings 的 UPDATE 相當於對該參數執行 SET 指令。此變更僅影響目前連線所使用的值。如果在稍後中止的交易事務中發出 UPDATE，則在回溯事務時 UPDATE 指令的效果會消失。一旦提交了相關的事務，則效果將持續到連線結束，除非被另一個 UPDATE 或 SET 覆蓋。

51.87. pg_stats

The view pg_stats provides access to the information stored in the pg_statistic catalog. This view allows access only to rows of pg_statistic that correspond to tables the user has permission to read, and therefore it is safe to allow public read access to this view.

pg_stats is also designed to present the information in a more readable format than the underlying catalog — at the cost that its schema must be extended whenever new slot types are defined for pg_statistic.

Table 51.88. pg_stats Columns

The maximum number of entries in the array fields can be controlled on a column-by-column basis using the ALTER TABLE SET STATISTICS command, or globally by setting the default_statistics_target run-time parameter.

51.90. pg_tables

The view pg_tables provides access to useful information about each table in the database.

Table 51.91. `pg_tables` Columns

51.95. pg_views

檢視表 pg_views 提供對資料庫中每個檢視表有用的資訊。

Table 51.96. `pg_views` Columns

52. Frontend/Backend Protocol

PostgreSQLuses a message-based protocol for communication between frontends and backends (clients and servers). The protocol is supported overTCP/IPand also over Unix-domain sockets. Port number 5432 has been registered with IANA as the customary TCP port number for servers supporting this protocol, but in practice any non-privileged port number can be used.

This document describes version 3.0 of the protocol, implemented inPostgreSQL7.4 and later. For descriptions of the earlier protocol versions, see previous releases of thePostgreSQLdocumentation. A single server can support multiple protocol versions. The initial startup-request message tells the server which protocol version the client is attempting to use, and then the server follows that protocol if it is able.

In order to serve multiple clients efficiently, the server launches a new“backend”process for each client. In the current implementation, a new child process is created immediately after an incoming connection is detected. This is transparent to the protocol, however. For purposes of the protocol, the terms“backend”and“server”are interchangeable; likewise“frontend”and“client”are interchangeable.

52.1. Overview

The protocol has separate phases for startup and normal operation. In the startup phase, the frontend opens a connection to the server and authenticates itself to the satisfaction of the server. (This might involve a single message, or multiple messages depending on the authentication method being used.) If all goes well, the server then sends status information to the frontend, and finally enters normal operation. Except for the initial startup-request message, this part of the protocol is driven by the server.

During normal operation, the frontend sends queries and other commands to the backend, and the backend sends back query results and other responses. There are a few cases (such as NOTIFY) wherein the backend will send unsolicited messages, but for the most part this portion of a session is driven by frontend requests.

Termination of the session is normally by frontend choice, but can be forced by the backend in certain cases. In any case, when the backend closes the connection, it will roll back any open (incomplete) transaction before exiting.

Within normal operation, SQL commands can be executed through either of two sub-protocols. In the “simple query” protocol, the frontend just sends a textual query string, which is parsed and immediately executed by the backend. In the “extended query” protocol, processing of queries is separated into multiple steps: parsing, binding of parameter values, and execution. This offers flexibility and performance benefits, at the cost of extra complexity.

Normal operation has additional sub-protocols for special operations such as COPY.

52.1.1. Messaging Overview

All communication is through a stream of messages. The first byte of a message identifies the message type, and the next four bytes specify the length of the rest of the message (this length count includes itself, but not the message-type byte). The remaining contents of the message are determined by the message type. For historical reasons, the very first message sent by the client (the startup message) has no initial message-type byte.

To avoid losing synchronization with the message stream, both servers and clients typically read an entire message into a buffer (using the byte count) before attempting to process its contents. This allows easy recovery if an error is detected while processing the contents. In extreme situations (such as not having enough memory to buffer the message), the receiver can use the byte count to determine how much input to skip before it resumes reading messages.

Conversely, both servers and clients must take care never to send an incomplete message. This is commonly done by marshaling the entire message in a buffer before beginning to send it. If a communications failure occurs partway through sending or receiving a message, the only sensible response is to abandon the connection, since there is little hope of recovering message-boundary synchronization.

52.1.2. Extended Query Overview

In the extended-query protocol, execution of SQL commands is divided into multiple steps. The state retained between steps is represented by two types of objects: prepared statements and portals. A prepared statement represents the result of parsing and semantic analysis of a textual query string. A prepared statement is not in itself ready to execute, because it might lack specific values for parameters. A portal represents a ready-to-execute or already-partially-executed statement, with any missing parameter values filled in. (For SELECT statements, a portal is equivalent to an open cursor, but we choose to use a different term since cursors don't handle non-SELECT statements.)

The overall execution cycle consists of a parse step, which creates a prepared statement from a textual query string; a bind step, which creates a portal given a prepared statement and values for any needed parameters; and an execute step that runs a portal's query. In the case of a query that returns rows (SELECT, SHOW, etc), the execute step can be told to fetch only a limited number of rows, so that multiple execute steps might be needed to complete the operation.

The backend can keep track of multiple prepared statements and portals (but note that these exist only within a session, and are never shared across sessions). Existing prepared statements and portals are referenced by names assigned when they were created. In addition, an “unnamed” prepared statement and portal exist. Although these behave largely the same as named objects, operations on them are optimized for the case of executing a query only once and then discarding it, whereas operations on named objects are optimized on the expectation of multiple uses.

52.1.3. Formats and Format Codes

Data of a particular data type might be transmitted in any of several different formats. As of PostgreSQL 7.4 the only supported formats are “text” and “binary”, but the protocol makes provision for future extensions. The desired format for any value is specified by a format code. Clients can specify a format code for each transmitted parameter value and for each column of a query result. Text has format code zero, binary has format code one, and all other format codes are reserved for future definition.

The text representation of values is whatever strings are produced and accepted by the input/output conversion functions for the particular data type. In the transmitted representation, there is no trailing null character; the frontend must add one to received values if it wants to process them as C strings. (The text format does not allow embedded nulls, by the way.)

Binary representations for integers use network byte order (most significant byte first). For other data types consult the documentation or source code to learn about the binary representation. Keep in mind that binary representations for complex data types might change across server versions; the text format is usually the more portable choice.

52.2. Message Flow

52.3. SASL Authentication

52.4. Streaming Replication Protocol

52.5. Logical Streaming Replication Protocol

This section describes the logical replication protocol, which is the message flow started by the START_REPLICATION SLOT slot_name LOGICAL replication command.

The logical streaming replication protocol builds on the primitives of the physical streaming replication protocol.

52.5.1. Logical Streaming Replication Parameters

The logical replication START_REPLICATION command accepts following parameters:proto_version

Protocol version. Currently only version 1 is supported.publication_names

Comma separated list of publication names for which to subscribe (receive changes). The individual publication names are treated as standard objects names and can be quoted the same as needed.

52.5.2. Logical Replication Protocol Messages

The individual protocol messages are discussed in the following subsections. Individual messages are described in Section 52.9.

All top-level protocol messages begin with a message type byte. While represented in code as a character, this is a signed byte with no associated encoding.

Since the streaming replication protocol supplies a message length there is no need for top-level protocol messages to embed a length in their header.

52.5.3. Logical Replication Protocol Message Flow

With the exception of the START_REPLICATION command and the replay progress messages, all information flows only from the backend to the frontend.

The logical replication protocol sends individual transactions one by one. This means that all messages between a pair of Begin and Commit messages belong to the same transaction.

Every sent transaction contains zero or more DML messages (Insert, Update, Delete). In case of a cascaded setup it can also contain Origin messages. The origin message indicated that the transaction originated on different replication node. Since a replication node in the scope of logical replication protocol can be pretty much anything, the only identifier is the origin name. It's downstream's responsibility to handle this as needed (if needed). The Origin message is always sent before any DML messages in the transaction.

Every DML message contains an arbitrary relation ID, which can be mapped to an ID in the Relation messages. The Relation messages describe the schema of the given relation. The Relation message is sent for a given relation either because it is the first time we send a DML message for given relation in the current session or because the relation definition has changed since the last Relation message was sent for it. The protocol assumes that the client is capable of caching the metadata for as many relations as needed.

52.6. Message Data Types

52.7. Message Formats

52.8. Error and Notice Message Fields

52.9. Logical Replication Message Formats

52.10. Summary of Changes since Protocol 2.0

53. PostgreSQL 程式撰寫慣例

53.1. Formatting

Source code formatting uses 4 column tab spacing, with tabs preserved (i.e., tabs are not expanded to spaces). Each logical indentation level is one additional tab stop.

Layout rules (brace positioning, etc) follow BSD conventions. In particular, curly braces for the controlled blocks of if, while, switch, etc go on their own lines.

Limit line lengths so that the code is readable in an 80-column window. (This doesn't mean that you must never go past 80 columns. For instance, breaking a long error message string in arbitrary places just to keep the code within 80 columns is probably not a net gain in readability.)

Do not use C++ style comments (// comments). Strict ANSI C compilers do not accept them. For the same reason, do not use C++ extensions such as declaring new variables mid-block.

The preferred style for multi-line comment blocks is

/*
 * comment text begins here
 * and continues here
 */

Note that comment blocks that begin in column 1 will be preserved as-is by pgindent, but it will re-flow indented comment blocks as though they were plain text. If you want to preserve the line breaks in an indented block, add dashes like this:

    /*----------
     * comment text begins here
     * and continues here
     *----------
     */

While submitted patches do not absolutely have to follow these formatting rules, it's a good idea to do so. Your code will get run through pgindent before the next release, so there's no point in making it look nice under some other set of formatting conventions. A good rule of thumb for patches is “make the new code look like the existing code around it”.

The src/tools directory contains sample settings files that can be used with the emacs, xemacs or vim editors to help ensure that they format code according to these conventions.

The text browsing tools more and less can be invoked as:

more -x4
less -x4

to make them show tabs appropriately.

53.2. Reporting Errors Within the Server

Error, warning, and log messages generated within the server code should be created using ereport, or its older cousin elog. The use of this function is complex enough to require some explanation.

There are two required elements for every message: a severity level (ranging from DEBUG to PANIC) and a primary message text. In addition there are optional elements, the most common of which is an error identifier code that follows the SQL spec's SQLSTATE conventions. ereport itself is just a shell function, that exists mainly for the syntactic convenience of making message generation look like a function call in the C source code. The only parameter accepted directly by ereport is the severity level. The primary message text and any optional message elements are generated by calling auxiliary functions, such as errmsg, within the ereport call.

A typical call to ereport might look like this:

ereport(ERROR,
        (errcode(ERRCODE_DIVISION_BY_ZERO),
         errmsg("division by zero")));

This specifies error severity level ERROR (a run-of-the-mill error). The errcode call specifies the SQLSTATE error code using a macro defined in src/include/utils/errcodes.h. The errmsg call provides the primary message text. Notice the extra set of parentheses surrounding the auxiliary function calls — these are annoying but syntactically necessary.

Here is a more complex example:

ereport(ERROR,
        (errcode(ERRCODE_AMBIGUOUS_FUNCTION),
         errmsg("function %s is not unique",
                func_signature_string(funcname, nargs,
                                      NIL, actual_arg_types)),
         errhint("Unable to choose a best candidate function. "
                 "You might need to add explicit typecasts.")));

This illustrates the use of format codes to embed run-time values into a message text. Also, an optional “hint” message is provided.

If the severity level is ERROR or higher, ereport aborts the execution of the user-defined function and does not return to the caller. If the severity level is lower than ERROR, ereport returns normally.

The available auxiliary routines for ereport are:

errcode(sqlerrcode) specifies the SQLSTATE error identifier code for the condition. If this routine is not called, the error identifier defaults to ERRCODE_INTERNAL_ERROR when the error severity level is ERROR or higher, ERRCODE_WARNING when the error level is WARNING, otherwise (for NOTICE and below) ERRCODE_SUCCESSFUL_COMPLETION. While these defaults are often convenient, always think whether they are appropriate before omitting the errcode() call.
errmsg(const char *msg, ...) specifies the primary error message text, and possibly run-time values to insert into it. Insertions are specified by sprintf-style format codes. In addition to the standard format codes accepted by sprintf, the format code %m can be used to insert the error message returned by strerror for the current value of errno. [13] %m does not require any corresponding entry in the parameter list for errmsg. Note that the message string will be run through gettext for possible localization before format codes are processed.
errmsg_internal(const char *msg, ...) is the same as errmsg, except that the message string will not be translated nor included in the internationalization message dictionary. This should be used for “cannot happen” cases that are probably not worth expending translation effort on.
errmsg_plural(const char *fmt_singular, const char *fmt_plural, unsigned long n, ...) is like errmsg, but with support for various plural forms of the message. fmt_singular is the English singular format, fmt_plural is the English plural format, n is the integer value that determines which plural form is needed, and the remaining arguments are formatted according to the selected format string. For more information see Section 55.2.2.
errdetail(const char *msg, ...) supplies an optional “detail” message; this is to be used when there is additional information that seems inappropriate to put in the primary message. The message string is processed in just the same way as for errmsg.
errdetail_internal(const char *msg, ...) is the same as errdetail, except that the message string will not be translated nor included in the internationalization message dictionary. This should be used for detail messages that are not worth expending translation effort on, for instance because they are too technical to be useful to most users.
errdetail_plural(const char *fmt_singular, const char *fmt_plural, unsigned long n, ...) is like errdetail, but with support for various plural forms of the message. For more information see Section 55.2.2.
errdetail_log(const char *msg, ...) is the same as errdetail except that this string goes only to the server log, never to the client. If both errdetail (or one of its equivalents above) and errdetail_log are used then one string goes to the client and the other to the log. This is useful for error details that are too security-sensitive or too bulky to include in the report sent to the client.
errdetail_log_plural(const char *fmt_singular, const char *fmt_plural, unsigned long n, ...) is like errdetail_log, but with support for various plural forms of the message. For more information see Section 55.2.2.
errhint(const char *msg, ...) supplies an optional “hint” message; this is to be used when offering suggestions about how to fix the problem, as opposed to factual details about what went wrong. The message string is processed in just the same way as for errmsg.
errcontext(const char *msg, ...) is not normally called directly from an ereport message site; rather it is used in error_context_stack callback functions to provide information about the context in which an error occurred, such as the current location in a PL function. The message string is processed in just the same way as for errmsg. Unlike the other auxiliary functions, this can be called more than once per ereport call; the successive strings thus supplied are concatenated with separating newlines.
errposition(int cursorpos) specifies the textual location of an error within a query string. Currently it is only useful for errors detected in the lexical and syntactic analysis phases of query processing.
errtable(Relation rel) specifies a relation whose name and schema name should be included as auxiliary fields in the error report.
errtablecol(Relation rel, int attnum) specifies a column whose name, table name, and schema name should be included as auxiliary fields in the error report.
errtableconstraint(Relation rel, const char *conname) specifies a table constraint whose name, table name, and schema name should be included as auxiliary fields in the error report. Indexes should be considered to be constraints for this purpose, whether or not they have an associated pg_constraint entry. Be careful to pass the underlying heap relation, not the index itself, as rel.
errdatatype(Oid datatypeOid) specifies a data type whose name and schema name should be included as auxiliary fields in the error report.
errdomainconstraint(Oid datatypeOid, const char *conname) specifies a domain constraint whose name, domain name, and schema name should be included as auxiliary fields in the error report.
errcode_for_file_access() is a convenience function that selects an appropriate SQLSTATE error identifier for a failure in a file-access-related system call. It uses the saved errno to determine which error code to generate. Usually this should be used in combination with %m in the primary error message text.
errcode_for_socket_access() is a convenience function that selects an appropriate SQLSTATE error identifier for a failure in a socket-related system call.
errhidestmt(bool hide_stmt) can be called to specify suppression of the STATEMENT: portion of a message in the postmaster log. Generally this is appropriate if the message text includes the current statement already.
errhidecontext(bool hide_ctx) can be called to specify suppression of the CONTEXT: portion of a message in the postmaster log. This should only be used for verbose debugging messages where the repeated inclusion of context would bloat the log volume too much.

Note

At most one of the functions errtable, errtablecol, errtableconstraint, errdatatype, or errdomainconstraint should be used in an ereport call. These functions exist to allow applications to extract the name of a database object associated with the error condition without having to examine the potentially-localized error message text. These functions should be used in error reports for which it's likely that applications would wish to have automatic error handling. As of PostgreSQL 9.3, complete coverage exists only for errors in SQLSTATE class 23 (integrity constraint violation), but this is likely to be expanded in future.

There is an older function elog that is still heavily used. An elog call:

elog(level, "format string", ...);

is exactly equivalent to:

ereport(level, (errmsg_internal("format string", ...)));

Notice that the SQLSTATE error code is always defaulted, and the message string is not subject to translation. Therefore, elog should be used only for internal errors and low-level debug logging. Any message that is likely to be of interest to ordinary users should go through ereport. Nonetheless, there are enough internal “cannot happen” error checks in the system that elog is still widely used; it is preferred for those messages for its notational simplicity.

Advice about writing good error messages can be found in Section 54.3.

[13] That is, the value that was current when the ereport call was reached; changes of errno within the auxiliary reporting routines will not affect it. That would not be true if you were to write strerror(errno) explicitly in errmsg's parameter list; accordingly, do not do so.

53.3. Error Message Style Guide

This style guide is offered in the hope of maintaining a consistent, user-friendly style throughout all the messages generated by PostgreSQL.

What Goes Where

The primary message should be short, factual, and avoid reference to implementation details such as specific function names. “Short” means “should fit on one line under normal conditions”. Use a detail message if needed to keep the primary message short, or if you feel a need to mention implementation details such as the particular system call that failed. Both primary and detail messages should be factual. Use a hint message for suggestions about what to do to fix the problem, especially if the suggestion might not always be applicable.

For example, instead of:

IpcMemoryCreate: shmget(key=%d, size=%u, 0%o) failed: %m
(plus a long addendum that is basically a hint)

write:

Primary:    could not create shared memory segment: %m
Detail:     Failed syscall was shmget(key=%d, size=%u, 0%o).
Hint:       the addendum

Rationale: keeping the primary message short helps keep it to the point, and lets clients lay out screen space on the assumption that one line is enough for error messages. Detail and hint messages can be relegated to a verbose mode, or perhaps a pop-up error-details window. Also, details and hints would normally be suppressed from the server log to save space. Reference to implementation details is best avoided since users aren't expected to know the details.

Formatting

Don't put any specific assumptions about formatting into the message texts. Expect clients and the server log to wrap lines to fit their own needs. In long messages, newline characters (\n) can be used to indicate suggested paragraph breaks. Don't end a message with a newline. Don't use tabs or other formatting characters. (In error context displays, newlines are automatically added to separate levels of context such as function calls.)

Rationale: Messages are not necessarily displayed on terminal-type displays. In GUI displays or browsers these formatting instructions are at best ignored.

Quotation Marks

English text should use double quotes when quoting is appropriate. Text in other languages should consistently use one kind of quotes that is consistent with publishing customs and computer output of other programs.

Rationale: The choice of double quotes over single quotes is somewhat arbitrary, but tends to be the preferred use. Some have suggested choosing the kind of quotes depending on the type of object according to SQL conventions (namely, strings single quoted, identifiers double quoted). But this is a language-internal technical issue that many users aren't even familiar with, it won't scale to other kinds of quoted terms, it doesn't translate to other languages, and it's pretty pointless, too.

Use of Quotes

Use quotes always to delimit file names, user-supplied identifiers, and other variables that might contain words. Do not use them to mark up variables that will not contain words (for example, operator names).

There are functions in the backend that will double-quote their own output at need (for example, format_type_be()). Do not put additional quotes around the output of such functions.

Rationale: Objects can have names that create ambiguity when embedded in a message. Be consistent about denoting where a plugged-in name starts and ends. But don't clutter messages with unnecessary or duplicate quote marks.

Grammar and Punctuation

The rules are different for primary error messages and for detail/hint messages:

Primary error messages: Do not capitalize the first letter. Do not end a message with a period. Do not even think about ending a message with an exclamation point.

Detail and hint messages: Use complete sentences, and end each with a period. Capitalize the first word of sentences. Put two spaces after the period if another sentence follows (for English text; might be inappropriate in other languages).

Error context strings: Do not capitalize the first letter and do not end the string with a period. Context strings should normally not be complete sentences.

Rationale: Avoiding punctuation makes it easier for client applications to embed the message into a variety of grammatical contexts. Often, primary messages are not grammatically complete sentences anyway. (And if they're long enough to be more than one sentence, they should be split into primary and detail parts.) However, detail and hint messages are longer and might need to include multiple sentences. For consistency, they should follow complete-sentence style even when there's only one sentence.

Upper Case vs. Lower Case

Use lower case for message wording, including the first letter of a primary error message. Use upper case for SQL commands and key words if they appear in the message.

Rationale: It's easier to make everything look more consistent this way, since some messages are complete sentences and some not.

Avoid Passive Voice

Use the active voice. Use complete sentences when there is an acting subject (“A could not do B”). Use telegram style without subject if the subject would be the program itself; do not use “I” for the program.

Rationale: The program is not human. Don't pretend otherwise.

Present vs. Past Tense

Use past tense if an attempt to do something failed, but could perhaps succeed next time (perhaps after fixing some problem). Use present tense if the failure is certainly permanent.

There is a nontrivial semantic difference between sentences of the form:

could not open file "%s": %m

and:

cannot open file "%s"

The first one means that the attempt to open the file failed. The message should give a reason, such as “disk full” or “file doesn't exist”. The past tense is appropriate because next time the disk might not be full anymore or the file in question might exist.

The second form indicates that the functionality of opening the named file does not exist at all in the program, or that it's conceptually impossible. The present tense is appropriate because the condition will persist indefinitely.

Rationale: Granted, the average user will not be able to draw great conclusions merely from the tense of the message, but since the language provides us with a grammar we should use it correctly.

Type of the Object

When citing the name of an object, state what kind of object it is.

Rationale: Otherwise no one will know what “foo.bar.baz” refers to.

Brackets

Square brackets are only to be used (1) in command synopses to denote optional arguments, or (2) to denote an array subscript.

Rationale: Anything else does not correspond to widely-known customary usage and will confuse people.

Assembling Error Messages

When a message includes text that is generated elsewhere, embed it in this style:

could not open file %s: %m

Rationale: It would be difficult to account for all possible error codes to paste this into a single smooth sentence, so some sort of punctuation is needed. Putting the embedded text in parentheses has also been suggested, but it's unnatural if the embedded text is likely to be the most important part of the message, as is often the case.

Reasons for Errors

Messages should always state the reason why an error occurred. For example:

BAD:    could not open file %s
BETTER: could not open file %s (I/O failure)

If no reason is known you better fix the code.

Function Names

Don't include the name of the reporting routine in the error text. We have other mechanisms for finding that out when needed, and for most users it's not helpful information. If the error text doesn't make as much sense without the function name, reword it.

BAD:    pg_atoi: error in "z": cannot parse "z"
BETTER: invalid input syntax for integer: "z"

Avoid mentioning called function names, either; instead say what the code was trying to do:

BAD:    open() failed: %m
BETTER: could not open file %s: %m

If it really seems necessary, mention the system call in the detail message. (In some cases, providing the actual values passed to the system call might be appropriate information for the detail message.)

Rationale: Users don't know what all those functions do.

Tricky Words to Avoid

Unable. “Unable” is nearly the passive voice. Better use “cannot” or “could not”, as appropriate.

Bad. Error messages like “bad result” are really hard to interpret intelligently. It's better to write why the result is “bad”, e.g., “invalid format”.

Illegal. “Illegal” stands for a violation of the law, the rest is “invalid”. Better yet, say why it's invalid.

Unknown. Try to avoid “unknown”. Consider “error: unknown response”. If you don't know what the response is, how do you know it's erroneous? “Unrecognized” is often a better choice. Also, be sure to include the value being complained of.

BAD:    unknown node type
BETTER: unrecognized node type: 42

Find vs. Exists. If the program uses a nontrivial algorithm to locate a resource (e.g., a path search) and that algorithm fails, it is fair to say that the program couldn't “find” the resource. If, on the other hand, the expected location of the resource is known but the program cannot access it there then say that the resource doesn't “exist”. Using “find” in this case sounds weak and confuses the issue.

May vs. Can vs. Might. “May” suggests permission (e.g., "You may borrow my rake."), and has little use in documentation or error messages. “Can” suggests ability (e.g., "I can lift that log."), and “might” suggests possibility (e.g., "It might rain today."). Using the proper word clarifies meaning and assists translation.

Contractions. Avoid contractions, like “can't”; use “cannot” instead.

Proper Spelling

Spell out words in full. For instance, avoid:

spec
stats
parens
auth
xact

Rationale: This will improve consistency.

Localization

Keep in mind that error message texts need to be translated into other languages. Follow the guidelines in Section 55.2.2 to avoid making life difficult for translators.

53.4. Miscellaneous Coding Conventions

C Standard

Code in PostgreSQL should only rely on language features available in the C89 standard. That means a conforming C89 compiler has to be able to compile postgres, at least aside from a few platform dependent pieces. Features from later revision of the C standard or compiler specific features can be used, if a fallback is provided.

For example static inline and _StaticAssert() are currently used, even though they are from newer revisions of the C standard. If not available we respectively fall back to defining the functions without inline, and to using a C89 compatible replacement that performs the same checks, but emits rather cryptic messages.

Function-Like Macros and Inline Functions

Both, macros with arguments and static inline functions, may be used. The latter are preferable if there are multiple-evaluation hazards when written as a macro, as e.g. the case with

#define Max(x, y)       ((x) > (y) ? (x) : (y))

or when the macro would be very long. In other cases it's only possible to use macros, or at least easier. For example because expressions of various types need to be passed to the macro.

When the definition of an inline function references symbols (i.e. variables, functions) that are only available as part of the backend, the function may not be visible when included from frontend code.

#ifndef FRONTEND
static inline MemoryContext
MemoryContextSwitchTo(MemoryContext context)
{
    MemoryContext old = CurrentMemoryContext;

    CurrentMemoryContext = context;
    return old;
}
#endif   /* FRONTEND */

In this example CurrentMemoryContext, which is only available in the backend, is referenced and the function thus hidden with a #ifndef FRONTEND. This rule exists because some compilers emit references to symbols contained in inline functions even if the function is not used.

Writing Signal Handlers

To be suitable to run inside a signal handler code has to be written very carefully. The fundamental problem is that, unless blocked, a signal handler can interrupt code at any time. If code inside the signal handler uses the same state as code outside chaos may ensue. As an example consider what happens if a signal handler tries to acquire a lock that's already held in the interrupted code.

Barring special arrangements code in signal handlers may only call async-signal safe functions (as defined in POSIX) and access variables of type volatile sig_atomic_t. A few functions in postgres are also deemed signal safe, importantly SetLatch().

In most cases signal handlers should do nothing more than note that a signal has arrived, and wake up code running outside of the handler using a latch. An example of such a handler is the following:

static void
handle_sighup(SIGNAL_ARGS)
{
    int         save_errno = errno;

    got_SIGHUP = true;
    SetLatch(MyLatch);

    errno = save_errno;
}

errno is saved and restored because SetLatch() might change it. If that were not done interrupted code that's currently inspecting errno might see the wrong value.

Calling Function Pointers

For clarity, it is preferred to explicitly dereference a function pointer when calling the pointed-to function if the pointer is a simple variable, for example:

(*emit_log_hook) (edata);

(even though emit_log_hook(edata) would also work). When the function pointer is part of a structure, then the extra punctuation can and usually should be omitted, for example:

paramInfo->paramFetch(paramInfo, paramId);

54. Native Language Support

54.1. For the Translator

PostgreSQL programs (server and client) can issue their messages in your favorite language — if the messages have been translated. Creating and maintaining translated message sets needs the help of people who speak their own language well and want to contribute to the PostgreSQL effort. You do not have to be a programmer at all to do this. This section explains how to help.

54.1.1. Requirements

We won't judge your language skills — this section is about software tools. Theoretically, you only need a text editor. But this is only in the unlikely event that you do not want to try out your translated messages. When you configure your source tree, be sure to use the --enable-nls option. This will also check for the libintl library and the msgfmt program, which all end users will need anyway. To try out your work, follow the applicable portions of the installation instructions.

If you want to start a new translation effort or want to do a message catalog merge (described later), you will need the programs xgettext and msgmerge, respectively, in a GNU-compatible implementation. Later, we will try to arrange it so that if you use a packaged source distribution, you won't need xgettext. (If working from Git, you will still need it.) GNU Gettext 0.10.36 or later is currently recommended.

Your local gettext implementation should come with its own documentation. Some of that is probably duplicated in what follows, but for additional details you should look there.

54.1.2. Concepts

The pairs of original (English) messages and their (possibly) translated equivalents are kept in message catalogs, one for each program (although related programs can share a message catalog) and for each target language. There are two file formats for message catalogs: The first is the “PO” file (for Portable Object), which is a plain text file with special syntax that translators edit. The second is the “MO” file (for Machine Object), which is a binary file generated from the respective PO file and is used while the internationalized program is run. Translators do not deal with MO files; in fact hardly anyone does.

The extension of the message catalog file is to no surprise either .po or .mo. The base name is either the name of the program it accompanies, or the language the file is for, depending on the situation. This is a bit confusing. Examples are psql.po (PO file for psql) or fr.mo (MO file in French).

The file format of the PO files is illustrated here:

# comment

msgid "original string"
msgstr "translated string"

msgid "more original"
msgstr "another translated"
"string can be broken up like this"

...

The msgid's are extracted from the program source. (They need not be, but this is the most common way.) The msgstr lines are initially empty and are filled in with useful strings by the translator. The strings can contain C-style escape characters and can be continued across lines as illustrated. (The next line must start at the beginning of the line.)

The # character introduces a comment. If whitespace immediately follows the # character, then this is a comment maintained by the translator. There can also be automatic comments, which have a non-whitespace character immediately following the #. These are maintained by the various tools that operate on the PO files and are intended to aid the translator.

#. automatic comment
#: filename.c:1023
#, flags, flags

The #. style comments are extracted from the source file where the message is used. Possibly the programmer has inserted information for the translator, such as about expected alignment. The #: comment indicates the exact location(s) where the message is used in the source. The translator need not look at the program source, but can if there is doubt about the correct translation. The #, comments contain flags that describe the message in some way. There are currently two flags: fuzzy is set if the message has possibly been outdated because of changes in the program source. The translator can then verify this and possibly remove the fuzzy flag. Note that fuzzy messages are not made available to the end user. The other flag is c-format, which indicates that the message is a printf-style format template. This means that the translation should also be a format string with the same number and type of placeholders. There are tools that can verify this, which key off the c-format flag.

54.1.3. Creating and Maintaining Message Catalogs

OK, so how does one create a “blank” message catalog? First, go into the directory that contains the program whose messages you want to translate. If there is a file nls.mk, then this program has been prepared for translation.

If there are already some .po files, then someone has already done some translation work. The files are named language.po, where language is the ISO 639-1 two-letter language code (in lower case), e.g., fr.po for French. If there is really a need for more than one translation effort per language then the files can also be named language_region.po where region is the ISO 3166-1 two-letter country code (in upper case), e.g., pt_BR.po for Portuguese in Brazil. If you find the language you wanted you can just start working on that file.

If you need to start a new translation effort, then first run the command:

make init-po

This will create a file progname.pot. (.pot to distinguish it from PO files that are “in production”. The T stands for “template”.) Copy this file to language.po and edit it. To make it known that the new language is available, also edit the file nls.mk and add the language (or language and country) code to the line that looks like:

AVAIL_LANGUAGES := de fr

(Other languages can appear, of course.)

As the underlying program or library changes, messages might be changed or added by the programmers. In this case you do not need to start from scratch. Instead, run the command:

make update-po

which will create a new blank message catalog file (the pot file you started with) and will merge it with the existing PO files. If the merge algorithm is not sure about a particular message it marks it “fuzzy” as explained above. The new PO file is saved with a .po.new extension.

54.1.4. Editing the PO Files

The PO files can be edited with a regular text editor. The translator should only change the area between the quotes after the msgstr directive, add comments, and alter the fuzzy flag. There is (unsurprisingly) a PO mode for Emacs, which I find quite useful.

The PO files need not be completely filled in. The software will automatically fall back to the original string if no translation (or an empty translation) is available. It is no problem to submit incomplete translations for inclusions in the source tree; that gives room for other people to pick up your work. However, you are encouraged to give priority to removing fuzzy entries after doing a merge. Remember that fuzzy entries will not be installed; they only serve as reference for what might be the right translation.

Here are some things to keep in mind while editing the translations:

Make sure that if the original ends with a newline, the translation does, too. Similarly for tabs, etc.
If the original is a printf format string, the translation also needs to be. The translation also needs to have the same format specifiers in the same order. Sometimes the natural rules of the language make this impossible or at least awkward. In that case you can modify the format specifiers like this:
```
msgstr "Die Datei %2$s hat %1$u Zeichen."
```
Then the first placeholder will actually use the second argument from the list. The digits$ needs to follow the % immediately, before any other format manipulators. (This feature really exists in the printf family of functions. You might not have heard of it before because there is little use for it outside of message internationalization.)
If the original string contains a linguistic mistake, report that (or fix it yourself in the program source) and translate normally. The corrected string can be merged in when the program sources have been updated. If the original string contains a factual mistake, report that (or fix it yourself) and do not translate it. Instead, you can mark the string with a comment in the PO file.
Maintain the style and tone of the original string. Specifically, messages that are not sentences (cannot open file %s) should probably not start with a capital letter (if your language distinguishes letter case) or end with a period (if your language uses punctuation marks). It might help to read Section 53.3.
If you don't know what a message means, or if it is ambiguous, ask on the developers' mailing list. Chances are that English speaking end users might also not understand it or find it ambiguous, so it's best to improve the message.

54.2. For the Programmer

54.2.1. Mechanics

This section describes how to implement native language support in a program or library that is part of the PostgreSQL distribution. Currently, it only applies to C programs.

Adding NLS Support to a Program

Insert this code into the start-up sequence of the program:

#ifdef ENABLE_NLS
#include <locale.h>
#endif

...

#ifdef ENABLE_NLS
setlocale(LC_ALL, "");
bindtextdomain("progname", LOCALEDIR);
textdomain("progname");
#endif

(The progname can actually be chosen freely.)

Wherever a message that is a candidate for translation is found, a call to gettext() needs to be inserted. E.g.:
```
fprintf(stderr, "panic level %d\n", lvl);
```
would be changed to:
```
fprintf(stderr, gettext("panic level %d\n"), lvl);
```
(gettext is defined as a no-op if NLS support is not configured.)
This tends to add a lot of clutter. One common shortcut is to use:
```
#define _(x) gettext(x)
```
Another solution is feasible if the program does much of its communication through one or a few functions, such as ereport() in the backend. Then you make this function call gettext internally on all input strings.
Add a file nls.mk in the directory with the program sources. This file will be read as a makefile. The following variable assignments need to be made here:CATALOG_NAME
The program name, as provided in the textdomain() call.AVAIL_LANGUAGES
List of provided translations — initially empty.GETTEXT_FILES
List of files that contain translatable strings, i.e., those marked with gettext or an alternative solution. Eventually, this will include nearly all source files of the program. If this list gets too long you can make the first “file” be a + and the second word be a file that contains one file name per line.GETTEXT_TRIGGERS
The tools that generate message catalogs for the translators to work on need to know what function calls contain translatable strings. By default, only gettext() calls are known. If you used _ or other identifiers you need to list them here. If the translatable string is not the first argument, the item needs to be of the form func:2 (for the second argument). If you have a function that supports pluralized messages, the item should look like func:1,2 (identifying the singular and plural message arguments).

The build system will automatically take care of building and installing the message catalogs.

54.2.2. Message-Writing Guidelines

Here are some guidelines for writing messages that are easily translatable.

Do not construct sentences at run-time, like:
```
printf("Files were %s.\n", flag ? "copied" : "removed");
```
The word order within the sentence might be different in other languages. Also, even if you remember to call gettext() on each fragment, the fragments might not translate well separately. It's better to duplicate a little code so that each message to be translated is a coherent whole. Only numbers, file names, and such-like run-time variables should be inserted at run time into a message text.
For similar reasons, this won't work:
```
printf("copied %d file%s", n, n!=1 ? "s" : "");
```
because it assumes how the plural is formed. If you figured you could solve it like this:
```
if (n==1)
    printf("copied 1 file");
else
    printf("copied %d files", n):
```
then be disappointed. Some languages have more than two forms, with some peculiar rules. It's often best to design the message to avoid the issue altogether, for instance like this:
```
printf("number of copied files: %d", n);
```
If you really want to construct a properly pluralized message, there is support for this, but it's a bit awkward. When generating a primary or detail error message in ereport(), you can write something like this:
```
errmsg_plural("copied %d file",
              "copied %d files",
              n,
              n)
```
The first argument is the format string appropriate for English singular form, the second is the format string appropriate for English plural form, and the third is the integer control value that determines which plural form to use. Subsequent arguments are formatted per the format string as usual. (Normally, the pluralization control value will also be one of the values to be formatted, so it has to be written twice.) In English it only matters whether n is 1 or not 1, but in other languages there can be many different plural forms. The translator sees the two English forms as a group and has the opportunity to supply multiple substitute strings, with the appropriate one being selected based on the run-time value of n.
If you need to pluralize a message that isn't going directly to an errmsg or errdetail report, you have to use the underlying function ngettext. See the gettext documentation.
If you want to communicate something to the translator, such as about how a message is intended to line up with other output, precede the occurrence of the string with a comment that starts with translator, e.g.:
```
/* translator: This message is not what it seems to be. */
```
These comments are copied to the message catalog files so that the translators can see them.

55. 撰寫程序語言的處理程序

使用目前編譯語言的「version 1」介面以外的語言撰寫的函數（包括使用者定義的程序語言函數和用 SQL 撰寫的函數）在被呼叫時，都將透過特定語言的呼叫處理程序函數。呼叫處理程序有責任以有意義的方式執行功能，例如透過解譯程式原始碼。本章概述如何撰寫新程序語言的呼叫處理程序。

程序語言的呼叫處理程序是「一般」函數，必須使用「version 1」介面以編譯語言（例如 C）撰寫，並在 PostgreSQL 中註冊為不帶任何參數且回傳 language_handler 型別。這種特殊的偽型別將函數標識為呼叫處理程序，並防止在 SQL 指令中直接呼叫該函數。有關 C 語言呼叫約定和動態載入的更多詳細訊，請參閱第 37.10 節。

呼叫處理程序的呼叫方式與其他任何函數相同：它會接收到一個指向 FunctionCallInfoBaseData 結構的指標，該結構包含參數值和有關被呼叫函數的資訊，並且期待回傳 Datum 結果（也可能設定 FunctionCallInfoBaseData 結構中 isnull 的欄位，如果該函數希望回傳 SQL NULL 的結果）。呼叫處理程序和普通被呼叫函數之間的區別在於 FunctionCallInfoBaseData 結構的 flinfo->fn_oid 欄位會包含要呼叫實際函數的 OID，而不包含呼叫處理程序本身的 OID。呼叫處理程序必須使用此欄位來決定所要執行的功能。同樣地，傳遞的參數列表是根據目標函數而不是呼叫處理程序的宣告設定。

呼叫處理程序要從 pg_proc 系統目錄中取得函數的項目，並分析被呼叫函數的參數和回傳型別。該函數的 CREATE FUNCTION 命令的 AS 子句可以在 pg_proc 行的 prosrc 欄位中看到。這通常是程序語言的原始碼，但是從理論上說，它也可能是其他內容，例如檔案的路徑名稱，或告訴呼叫處理程序詳細操作的其他任何內容。

通常，每個 SQL 語句會多次呼叫同一函數。呼叫處理程序可以透過使用 flinfo->fn_extra 欄位來避免重複查詢有關被呼叫函數的資訊。最初將為 NULL，但可以由呼叫處理程序設定為指向有關被呼叫函數的資訊。在後續的呼叫中，如果 flinfo->fn_extra 已經為非 NULL，則可以使用它，並且跳過資訊查詢步驟。呼叫處理程序必須確保使 flinfo->fn_extra 指向至少可以保存到目前查詢結束的快取，因為 FmgrInfo 資料結構可以保留那麼長時間。一種方法是在 flinfo->fn_mcxt 指定的快取內容中分配額外的資料。此類資料通常與 FmgrInfo 本身俱有相同的壽命。但是處理程序還可以選擇使用壽命更長的快取，以便它可以跨查詢快取函數定義資訊。

當將程序語言函數作為事件觸發器呼叫時，不會以通常的方式傳遞任何參數，但是 FunctionCallInfoBaseData 的 context 欄位指向 TriggerData 結構，而不是像在普通函數呼叫中那樣為 NULL。語言處理程序應提供程序語言函數的機制，以取得所觸發資訊。

This is a template for a procedural-language handler written in C:

#include "postgres.h"
#include "executor/spi.h"
#include "commands/trigger.h"
#include "fmgr.h"
#include "access/heapam.h"
#include "utils/syscache.h"
#include "catalog/pg_proc.h"
#include "catalog/pg_type.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(plsample_call_handler);

Datum
plsample_call_handler(PG_FUNCTION_ARGS)
{
    Datum          retval;

    if (CALLED_AS_TRIGGER(fcinfo))
    {
        /*
         * Called as a trigger function
         */
        TriggerData    *trigdata = (TriggerData *) fcinfo->context;

        retval = ...
    }
    else
    {
        /*
         * Called as a function
         */

        retval = ...
    }

    return retval;
}

Only a few thousand lines of code have to be added instead of the dots to complete the call handler.

After having compiled the handler function into a loadable module (see Section 37.10.5), the following commands then register the sample procedural language:

CREATE FUNCTION plsample_call_handler() RETURNS language_handler
    AS 'filename'
    LANGUAGE C;
CREATE LANGUAGE plsample
    HANDLER plsample_call_handler;

Although providing a call handler is sufficient to create a minimal procedural language, there are two other functions that can optionally be provided to make the language more convenient to use. These are a validator and an inline handler. A validator can be provided to allow language-specific checking to be done during CREATE FUNCTION. An inline handler can be provided to allow the language to support anonymous code blocks executed via the DO command.

If a validator is provided by a procedural language, it must be declared as a function taking a single parameter of type oid. The validator's result is ignored, so it is customarily declared to return void. The validator will be called at the end of a CREATE FUNCTION command that has created or updated a function written in the procedural language. The passed-in OID is the OID of the function's pg_proc row. The validator must fetch this row in the usual way, and do whatever checking is appropriate. First, call CheckFunctionValidatorAccess() to diagnose explicit calls to the validator that the user could not achieve through CREATE FUNCTION. Typical checks then include verifying that the function's argument and result types are supported by the language, and that the function's body is syntactically correct in the language. If the validator finds the function to be okay, it should just return. If it finds an error, it should report that via the normal ereport() error reporting mechanism. Throwing an error will force a transaction rollback and thus prevent the incorrect function definition from being committed.

Validator functions should typically honor the check_function_bodies parameter: if it is turned off then any expensive or context-sensitive checking should be skipped. If the language provides for code execution at compilation time, the validator must suppress checks that would induce such execution. In particular, this parameter is turned off by pg_dump so that it can load procedural language functions without worrying about side effects or dependencies of the function bodies on other database objects. (Because of this requirement, the call handler should avoid assuming that the validator has fully checked the function. The point of having a validator is not to let the call handler omit checks, but to notify the user immediately if there are obvious errors in a CREATE FUNCTION command.) While the choice of exactly what to check is mostly left to the discretion of the validator function, note that the core CREATE FUNCTION code only executes SET clauses attached to a function when check_function_bodies is on. Therefore, checks whose results might be affected by GUC parameters definitely should be skipped when check_function_bodies is off, to avoid false failures when reloading a dump.

If an inline handler is provided by a procedural language, it must be declared as a function taking a single parameter of type internal. The inline handler's result is ignored, so it is customarily declared to return void. The inline handler will be called when a DO statement is executed specifying the procedural language. The parameter actually passed is a pointer to an InlineCodeBlock struct, which contains information about the DO statement's parameters, in particular the text of the anonymous code block to be executed. The inline handler should execute this code and return.

It's recommended that you wrap all these function declarations, as well as the CREATE LANGUAGE command itself, into an extension so that a simple CREATE EXTENSION command is sufficient to install the language. See Section 37.17 for information about writing extensions.

The procedural languages included in the standard distribution are good references when trying to write your own language handler. Look into the src/pl subdirectory of the source tree. The CREATE LANGUAGE reference page also has some useful details.

56. Writing a Foreign Data Wrapper

All operations on a foreign table are handled through its foreign data wrapper, which consists of a set of functions that the core server calls. The foreign data wrapper is responsible for fetching data from the remote data source and returning it to the PostgreSQL executor. If updating foreign tables is to be supported, the wrapper must handle that, too. This chapter outlines how to write a new foreign data wrapper.

The foreign data wrappers included in the standard distribution are good references when trying to write your own. Look into the contrib subdirectory of the source tree. The CREATE FOREIGN DATA WRAPPER reference page also has some useful details.

Note

The SQL standard specifies an interface for writing foreign data wrappers. However, PostgreSQL does not implement that API, because the effort to accommodate it into PostgreSQL would be large, and the standard API hasn't gained wide adoption anyway.

56.1. Foreign Data Wrapper Functions

The FDW author needs to implement a handler function, and optionally a validator function. Both functions must be written in a compiled language such as C, using the version-1 interface. For details on C language calling conventions and dynamic loading, see Section 37.10.

The handler function simply returns a struct of function pointers to callback functions that will be called by the planner, executor, and various maintenance commands. Most of the effort in writing an FDW is in implementing these callback functions. The handler function must be registered with PostgreSQL as taking no arguments and returning the special pseudo-type fdw_handler. The callback functions are plain C functions and are not visible or callable at the SQL level. The callback functions are described in Section 56.2.

The validator function is responsible for validating options given in CREATE and ALTER commands for its foreign data wrapper, as well as foreign servers, user mappings, and foreign tables using the wrapper. The validator function must be registered as taking two arguments, a text array containing the options to be validated, and an OID representing the type of object the options are associated with (in the form of the OID of the system catalog the object would be stored in, either ForeignDataWrapperRelationId, ForeignServerRelationId, UserMappingRelationId, or ForeignTableRelationId). If no validator function is supplied, options are not checked at object creation time or object alteration time.

56.2. Foreign Data Wrapper Callback Routines

The FDW handler function returns a palloc'd FdwRoutine struct containing pointers to the callback functions described below. The scan-related functions are required, the rest are optional.

The FdwRoutine struct type is declared in src/include/foreign/fdwapi.h, which see for additional details.

56.2.1. FDW Routines for Scanning Foreign Tables

void
GetForeignRelSize(PlannerInfo *root,
                  RelOptInfo *baserel,
                  Oid foreigntableid);

Obtain relation size estimates for a foreign table. This is called at the beginning of planning for a query that scans a foreign table. root is the planner's global information about the query; baserel is the planner's information about this table; and foreigntableid is the pg_class OID of the foreign table. (foreigntableid could be obtained from the planner data structures, but it's passed explicitly to save effort.)

This function should update baserel->rows to be the expected number of rows returned by the table scan, after accounting for the filtering done by the restriction quals. The initial value of baserel->rows is just a constant default estimate, which should be replaced if at all possible. The function may also choose to update baserel->width if it can compute a better estimate of the average result row width.

See Section 56.4 for additional information.

void
GetForeignPaths(PlannerInfo *root,
                RelOptInfo *baserel,
                Oid foreigntableid);

Create possible access paths for a scan on a foreign table. This is called during query planning. The parameters are the same as for GetForeignRelSize, which has already been called.

This function must generate at least one access path (ForeignPath node) for a scan on the foreign table and must call add_path to add each such path to baserel->pathlist. It's recommended to use create_foreignscan_path to build the ForeignPath nodes. The function can generate multiple access paths, e.g., a path which has valid pathkeys to represent a pre-sorted result. Each access path must contain cost estimates, and can contain any FDW-private information that is needed to identify the specific scan method intended.

See Section 56.4 for additional information.

ForeignScan *
GetForeignPlan(PlannerInfo *root,
               RelOptInfo *baserel,
               Oid foreigntableid,
               ForeignPath *best_path,
               List *tlist,
               List *scan_clauses,
               Plan *outer_plan);

Create a ForeignScan plan node from the selected foreign access path. This is called at the end of query planning. The parameters are as for GetForeignRelSize, plus the selected ForeignPath (previously produced by GetForeignPaths, GetForeignJoinPaths, or GetForeignUpperPaths), the target list to be emitted by the plan node, the restriction clauses to be enforced by the plan node, and the outer subplan of the ForeignScan, which is used for rechecks performed by RecheckForeignScan. (If the path is for a join rather than a base relation, foreigntableid is InvalidOid.)

This function must create and return a ForeignScan plan node; it's recommended to use make_foreignscan to build the ForeignScan node.

See Section 56.4 for additional information.

void
BeginForeignScan(ForeignScanState *node,
                 int eflags);

Begin executing a foreign scan. This is called during executor startup. It should perform any initialization needed before the scan can start, but not start executing the actual scan (that should be done upon the first call to IterateForeignScan). The ForeignScanState node has already been created, but its fdw_state field is still NULL. Information about the table to scan is accessible through the ForeignScanState node (in particular, from the underlying ForeignScan plan node, which contains any FDW-private information provided by GetForeignPlan). eflags contains flag bits describing the executor's operating mode for this plan node.

Note that when (eflags & EXEC_FLAG_EXPLAIN_ONLY) is true, this function should not perform any externally-visible actions; it should only do the minimum required to make the node state valid for ExplainForeignScan and EndForeignScan.

TupleTableSlot *
IterateForeignScan(ForeignScanState *node);

Fetch one row from the foreign source, returning it in a tuple table slot (the node's ScanTupleSlot should be used for this purpose). Return NULL if no more rows are available. The tuple table slot infrastructure allows either a physical or virtual tuple to be returned; in most cases the latter choice is preferable from a performance standpoint. Note that this is called in a short-lived memory context that will be reset between invocations. Create a memory context in BeginForeignScan if you need longer-lived storage, or use the es_query_cxt of the node's EState.

The rows returned must match the fdw_scan_tlist target list if one was supplied, otherwise they must match the row type of the foreign table being scanned. If you choose to optimize away fetching columns that are not needed, you should insert nulls in those column positions, or else generate a fdw_scan_tlist list with those columns omitted.

Note that PostgreSQL's executor doesn't care whether the rows returned violate any constraints that were defined on the foreign table — but the planner does care, and may optimize queries incorrectly if there are rows visible in the foreign table that do not satisfy a declared constraint. If a constraint is violated when the user has declared that the constraint should hold true, it may be appropriate to raise an error (just as you would need to do in the case of a data type mismatch).

void
ReScanForeignScan(ForeignScanState *node);

Restart the scan from the beginning. Note that any parameters the scan depends on may have changed value, so the new scan does not necessarily return exactly the same rows.

void
EndForeignScan(ForeignScanState *node);

End the scan and release resources. It is normally not important to release palloc'd memory, but for example open files and connections to remote servers should be cleaned up.

56.2.2. FDW Routines for Scanning Foreign Joins

If an FDW supports performing foreign joins remotely (rather than by fetching both tables' data and doing the join locally), it should provide this callback function:

void
GetForeignJoinPaths(PlannerInfo *root,
                    RelOptInfo *joinrel,
                    RelOptInfo *outerrel,
                    RelOptInfo *innerrel,
                    JoinType jointype,
                    JoinPathExtraData *extra);

Create possible access paths for a join of two (or more) foreign tables that all belong to the same foreign server. This optional function is called during query planning. As with GetForeignPaths, this function should generate ForeignPath path(s) for the supplied joinrel (use create_foreign_join_path to build them), and call add_path to add these paths to the set of paths considered for the join. But unlike GetForeignPaths, it is not necessary that this function succeed in creating at least one path, since paths involving local joining are always possible.

Note that this function will be invoked repeatedly for the same join relation, with different combinations of inner and outer relations; it is the responsibility of the FDW to minimize duplicated work.

If a ForeignPath path is chosen for the join, it will represent the entire join process; paths generated for the component tables and subsidiary joins will not be used. Subsequent processing of the join path proceeds much as it does for a path scanning a single foreign table. One difference is that the scanrelid of the resulting ForeignScan plan node should be set to zero, since there is no single relation that it represents; instead, the fs_relids field of the ForeignScan node represents the set of relations that were joined. (The latter field is set up automatically by the core planner code, and need not be filled by the FDW.) Another difference is that, because the column list for a remote join cannot be found from the system catalogs, the FDW must fill fdw_scan_tlist with an appropriate list of TargetEntry nodes, representing the set of columns it will supply at run time in the tuples it returns.

See Section 56.4 for additional information.

56.2.3. FDW Routines for Planning Post-Scan/Join Processing

If an FDW supports performing remote post-scan/join processing, such as remote aggregation, it should provide this callback function:

void
GetForeignUpperPaths(PlannerInfo *root,
                     UpperRelationKind stage,
                     RelOptInfo *input_rel,
                     RelOptInfo *output_rel,
                     void *extra);

Create possible access paths for upper relation processing, which is the planner's term for all post-scan/join query processing, such as aggregation, window functions, sorting, and table updates. This optional function is called during query planning. Currently, it is called only if all base relation(s) involved in the query belong to the same FDW. This function should generate ForeignPath path(s) for any post-scan/join processing that the FDW knows how to perform remotely (use create_foreign_upper_path to build them), and call add_path to add these paths to the indicated upper relation. As with GetForeignJoinPaths, it is not necessary that this function succeed in creating any paths, since paths involving local processing are always possible.

The stage parameter identifies which post-scan/join step is currently being considered. output_rel is the upper relation that should receive paths representing computation of this step, and input_rel is the relation representing the input to this step. The extra parameter provides additional details, currently, it is set only for UPPERREL_PARTIAL_GROUP_AGG or UPPERREL_GROUP_AGG, in which case it points to a GroupPathExtraData structure; or for UPPERREL_FINAL, in which case it points to a FinalPathExtraData structure. (Note that ForeignPath paths added to output_rel would typically not have any direct dependency on paths of the input_rel, since their processing is expected to be done externally. However, examining paths previously generated for the previous processing step can be useful to avoid redundant planning work.)

See Section 56.4 for additional information.

56.2.4. FDW Routines for Updating Foreign Tables

If an FDW supports writable foreign tables, it should provide some or all of the following callback functions depending on the needs and capabilities of the FDW:

void
AddForeignUpdateTargets(Query *parsetree,
                        RangeTblEntry *target_rte,
                        Relation target_relation);

UPDATE and DELETE operations are performed against rows previously fetched by the table-scanning functions. The FDW may need extra information, such as a row ID or the values of primary-key columns, to ensure that it can identify the exact row to update or delete. To support that, this function can add extra hidden, or “junk”, target columns to the list of columns that are to be retrieved from the foreign table during an UPDATE or DELETE.

To do that, add TargetEntry items to parsetree->targetList, containing expressions for the extra values to be fetched. Each such entry must be marked resjunk = true, and must have a distinct resname that will identify it at execution time. Avoid using names matching ctidN, wholerow, or wholerowN, as the core system can generate junk columns of these names. If the extra expressions are more complex than simple Vars, they must be run through eval_const_expressions before adding them to the targetlist.

Although this function is called during planning, the information provided is a bit different from that available to other planning routines. parsetree is the parse tree for the UPDATE or DELETE command, while target_rte and target_relation describe the target foreign table.

If the AddForeignUpdateTargets pointer is set to NULL, no extra target expressions are added. (This will make it impossible to implement DELETE operations, though UPDATE may still be feasible if the FDW relies on an unchanging primary key to identify rows.)

List *
PlanForeignModify(PlannerInfo *root,
                  ModifyTable *plan,
                  Index resultRelation,
                  int subplan_index);

Perform any additional planning actions needed for an insert, update, or delete on a foreign table. This function generates the FDW-private information that will be attached to the ModifyTable plan node that performs the update action. This private information must have the form of a List, and will be delivered to BeginForeignModify during the execution stage.

root is the planner's global information about the query. plan is the ModifyTable plan node, which is complete except for the fdwPrivLists field. resultRelation identifies the target foreign table by its range table index. subplan_index identifies which target of the ModifyTable plan node this is, counting from zero; use this if you want to index into plan->plans or other substructure of the plan node.

See Section 56.4 for additional information.

If the PlanForeignModify pointer is set to NULL, no additional plan-time actions are taken, and the fdw_private list delivered to BeginForeignModify will be NIL.

void
BeginForeignModify(ModifyTableState *mtstate,
                   ResultRelInfo *rinfo,
                   List *fdw_private,
                   int subplan_index,
                   int eflags);

Begin executing a foreign table modification operation. This routine is called during executor startup. It should perform any initialization needed prior to the actual table modifications. Subsequently, ExecForeignInsert, ExecForeignUpdate or ExecForeignDelete will be called for each tuple to be inserted, updated, or deleted.

mtstate is the overall state of the ModifyTable plan node being executed; global data about the plan and execution state is available via this structure. rinfo is the ResultRelInfo struct describing the target foreign table. (The ri_FdwState field of ResultRelInfo is available for the FDW to store any private state it needs for this operation.) fdw_private contains the private data generated by PlanForeignModify, if any. subplan_index identifies which target of the ModifyTable plan node this is. eflags contains flag bits describing the executor's operating mode for this plan node.

If the BeginForeignModify pointer is set to NULL, no action is taken during executor startup.

TupleTableSlot *
ExecForeignInsert(EState *estate,
                  ResultRelInfo *rinfo,
                  TupleTableSlot *slot,
                  TupleTableSlot *planSlot);

Insert one tuple into the foreign table. estate is global execution state for the query. rinfo is the ResultRelInfo struct describing the target foreign table. slot contains the tuple to be inserted; it will match the row-type definition of the foreign table. planSlot contains the tuple that was generated by the ModifyTable plan node's subplan; it differs from slot in possibly containing additional “junk” columns. (The planSlot is typically of little interest for INSERT cases, but is provided for completeness.)

The return value is either a slot containing the data that was actually inserted (this might differ from the data supplied, for example as a result of trigger actions), or NULL if no row was actually inserted (again, typically as a result of triggers). The passed-in slot can be re-used for this purpose.

The data in the returned slot is used only if the INSERT statement has a RETURNING clause or involves a view WITH CHECK OPTION; or if the foreign table has an AFTER ROW trigger. Triggers require all columns, but the FDW could choose to optimize away returning some or all columns depending on the contents of the RETURNING clause or WITH CHECK OPTION constraints. Regardless, some slot must be returned to indicate success, or the query's reported row count will be wrong.

If the ExecForeignInsert pointer is set to NULL, attempts to insert into the foreign table will fail with an error message.

Note that this function is also called when inserting routed tuples into a foreign-table partition or executing COPY FROM on a foreign table, in which case it is called in a different way than it is in the INSERT case. See the callback functions described below that allow the FDW to support that.

TupleTableSlot *
ExecForeignUpdate(EState *estate,
                  ResultRelInfo *rinfo,
                  TupleTableSlot *slot,
                  TupleTableSlot *planSlot);

Update one tuple in the foreign table. estate is global execution state for the query. rinfo is the ResultRelInfo struct describing the target foreign table. slot contains the new data for the tuple; it will match the row-type definition of the foreign table. planSlot contains the tuple that was generated by the ModifyTable plan node's subplan; it differs from slot in possibly containing additional “junk” columns. In particular, any junk columns that were requested by AddForeignUpdateTargets will be available from this slot.

The return value is either a slot containing the row as it was actually updated (this might differ from the data supplied, for example as a result of trigger actions), or NULL if no row was actually updated (again, typically as a result of triggers). The passed-in slot can be re-used for this purpose.

The data in the returned slot is used only if the UPDATE statement has a RETURNING clause or involves a view WITH CHECK OPTION; or if the foreign table has an AFTER ROW trigger. Triggers require all columns, but the FDW could choose to optimize away returning some or all columns depending on the contents of the RETURNING clause or WITH CHECK OPTION constraints. Regardless, some slot must be returned to indicate success, or the query's reported row count will be wrong.

If the ExecForeignUpdate pointer is set to NULL, attempts to update the foreign table will fail with an error message.

TupleTableSlot *
ExecForeignDelete(EState *estate,
                  ResultRelInfo *rinfo,
                  TupleTableSlot *slot,
                  TupleTableSlot *planSlot);

Delete one tuple from the foreign table. estate is global execution state for the query. rinfo is the ResultRelInfo struct describing the target foreign table. slot contains nothing useful upon call, but can be used to hold the returned tuple. planSlot contains the tuple that was generated by the ModifyTable plan node's subplan; in particular, it will carry any junk columns that were requested by AddForeignUpdateTargets. The junk column(s) must be used to identify the tuple to be deleted.

The return value is either a slot containing the row that was deleted, or NULL if no row was deleted (typically as a result of triggers). The passed-in slot can be used to hold the tuple to be returned.

The data in the returned slot is used only if the DELETE query has a RETURNING clause or the foreign table has an AFTER ROW trigger. Triggers require all columns, but the FDW could choose to optimize away returning some or all columns depending on the contents of the RETURNING clause. Regardless, some slot must be returned to indicate success, or the query's reported row count will be wrong.

If the ExecForeignDelete pointer is set to NULL, attempts to delete from the foreign table will fail with an error message.

void
EndForeignModify(EState *estate,
                 ResultRelInfo *rinfo);

End the table update and release resources. It is normally not important to release palloc'd memory, but for example open files and connections to remote servers should be cleaned up.

If the EndForeignModify pointer is set to NULL, no action is taken during executor shutdown.

Tuples inserted into a partitioned table by INSERT or COPY FROM are routed to partitions. If an FDW supports routable foreign-table partitions, it should also provide the following callback functions. These functions are also called when COPY FROM is executed on a foreign table.

void
BeginForeignInsert(ModifyTableState *mtstate,
                   ResultRelInfo *rinfo);

Begin executing an insert operation on a foreign table. This routine is called right before the first tuple is inserted into the foreign table in both cases when it is the partition chosen for tuple routing and the target specified in a COPY FROM command. It should perform any initialization needed prior to the actual insertion. Subsequently, ExecForeignInsert will be called for each tuple to be inserted into the foreign table.

When this is called by a COPY FROM command, the plan-related global data in mtstate is not provided and the planSlot parameter of ExecForeignInsert subsequently called for each inserted tuple is NULL, whether the foreign table is the partition chosen for tuple routing or the target specified in the command.

If the BeginForeignInsert pointer is set to NULL, no action is taken for the initialization.

Note that if the FDW does not support routable foreign-table partitions and/or executing COPY FROM on foreign tables, this function or ExecForeignInsert subsequently called must throw error as needed.

void
EndForeignInsert(EState *estate,
                 ResultRelInfo *rinfo);

End the insert operation and release resources. It is normally not important to release palloc'd memory, but for example open files and connections to remote servers should be cleaned up.

If the EndForeignInsert pointer is set to NULL, no action is taken for the termination.

int
IsForeignRelUpdatable(Relation rel);

Report which update operations the specified foreign table supports. The return value should be a bit mask of rule event numbers indicating which operations are supported by the foreign table, using the CmdType enumeration; that is, (1 << CMD_UPDATE) = 4 for UPDATE, (1 << CMD_INSERT) = 8 for INSERT, and (1 << CMD_DELETE) = 16 for DELETE.

If the IsForeignRelUpdatable pointer is set to NULL, foreign tables are assumed to be insertable, updatable, or deletable if the FDW provides ExecForeignInsert, ExecForeignUpdate, or ExecForeignDelete respectively. This function is only needed if the FDW supports some tables that are updatable and some that are not. (Even then, it's permissible to throw an error in the execution routine instead of checking in this function. However, this function is used to determine updatability for display in the information_schema views.)

Some inserts, updates, and deletes to foreign tables can be optimized by implementing an alternative set of interfaces. The ordinary interfaces for inserts, updates, and deletes fetch rows from the remote server and then modify those rows one at a time. In some cases, this row-by-row approach is necessary, but it can be inefficient. If it is possible for the foreign server to determine which rows should be modified without actually retrieving them, and if there are no local structures which would affect the operation (row-level local triggers, stored generated columns, or WITH CHECK OPTION constraints from parent views), then it is possible to arrange things so that the entire operation is performed on the remote server. The interfaces described below make this possible.

bool
PlanDirectModify(PlannerInfo *root,
                 ModifyTable *plan,
                 Index resultRelation,
                 int subplan_index);

Decide whether it is safe to execute a direct modification on the remote server. If so, return true after performing planning actions needed for that. Otherwise, return false. This optional function is called during query planning. If this function succeeds, BeginDirectModify, IterateDirectModify and EndDirectModify will be called at the execution stage, instead. Otherwise, the table modification will be executed using the table-updating functions described above. The parameters are the same as for PlanForeignModify.

To execute the direct modification on the remote server, this function must rewrite the target subplan with a ForeignScan plan node that executes the direct modification on the remote server. The operation field of the ForeignScan must be set to the CmdType enumeration appropriately; that is, CMD_UPDATE for UPDATE, CMD_INSERT for INSERT, and CMD_DELETE for DELETE.

See Section 56.4 for additional information.

If the PlanDirectModify pointer is set to NULL, no attempts to execute a direct modification on the remote server are taken.

void
BeginDirectModify(ForeignScanState *node,
                  int eflags);

Prepare to execute a direct modification on the remote server. This is called during executor startup. It should perform any initialization needed prior to the direct modification (that should be done upon the first call to IterateDirectModify). The ForeignScanState node has already been created, but its fdw_state field is still NULL. Information about the table to modify is accessible through the ForeignScanState node (in particular, from the underlying ForeignScan plan node, which contains any FDW-private information provided by PlanDirectModify). eflags contains flag bits describing the executor's operating mode for this plan node.

If the BeginDirectModify pointer is set to NULL, no attempts to execute a direct modification on the remote server are taken.

TupleTableSlot *
IterateDirectModify(ForeignScanState *node);

When the INSERT, UPDATE or DELETE query doesn't have a RETURNING clause, just return NULL after a direct modification on the remote server. When the query has the clause, fetch one result containing the data needed for the RETURNING calculation, returning it in a tuple table slot (the node's ScanTupleSlot should be used for this purpose). The data that was actually inserted, updated or deleted must be stored in the es_result_relation_info->ri_projectReturning->pi_exprContext->ecxt_scantuple of the node's EState. Return NULL if no more rows are available. Note that this is called in a short-lived memory context that will be reset between invocations. Create a memory context in BeginDirectModify if you need longer-lived storage, or use the es_query_cxt of the node's EState.

The rows returned must match the fdw_scan_tlist target list if one was supplied, otherwise they must match the row type of the foreign table being updated. If you choose to optimize away fetching columns that are not needed for the RETURNING calculation, you should insert nulls in those column positions, or else generate a fdw_scan_tlist list with those columns omitted.

Whether the query has the clause or not, the query's reported row count must be incremented by the FDW itself. When the query doesn't have the clause, the FDW must also increment the row count for the ForeignScanState node in the EXPLAIN ANALYZE case.

If the IterateDirectModify pointer is set to NULL, no attempts to execute a direct modification on the remote server are taken.

void
EndDirectModify(ForeignScanState *node);

Clean up following a direct modification on the remote server. It is normally not important to release palloc'd memory, but for example open files and connections to the remote server should be cleaned up.

If the EndDirectModify pointer is set to NULL, no attempts to execute a direct modification on the remote server are taken.

56.2.5. FDW Routines for Row Locking

If an FDW wishes to support late row locking (as described in Section 56.5), it must provide the following callback functions:

RowMarkType
GetForeignRowMarkType(RangeTblEntry *rte,
                      LockClauseStrength strength);

Report which row-marking option to use for a foreign table. rte is the RangeTblEntry node for the table and strength describes the lock strength requested by the relevant FOR UPDATE/SHARE clause, if any. The result must be a member of the RowMarkType enum type.

This function is called during query planning for each foreign table that appears in an UPDATE, DELETE, or SELECT FOR UPDATE/SHARE query and is not the target of UPDATE or DELETE.

If the GetForeignRowMarkType pointer is set to NULL, the ROW_MARK_COPY option is always used. (This implies that RefetchForeignRow will never be called, so it need not be provided either.)

See Section 56.5 for more information.

void
RefetchForeignRow(EState *estate,
                  ExecRowMark *erm,
                  Datum rowid,
                  TupleTableSlot *slot,
                  bool *updated);

Re-fetch one tuple slot from the foreign table, after locking it if required. estate is global execution state for the query. erm is the ExecRowMark struct describing the target foreign table and the row lock type (if any) to acquire. rowid identifies the tuple to be fetched. slot contains nothing useful upon call, but can be used to hold the returned tuple. updated is an output parameter.

This function should store the tuple into the provided slot, or clear it if the row lock couldn't be obtained. The row lock type to acquire is defined by erm->markType, which is the value previously returned by GetForeignRowMarkType. (ROW_MARK_REFERENCE means to just re-fetch the tuple without acquiring any lock, and ROW_MARK_COPY will never be seen by this routine.)

In addition, *updated should be set to true if what was fetched was an updated version of the tuple rather than the same version previously obtained. (If the FDW cannot be sure about this, always returning true is recommended.)

Note that by default, failure to acquire a row lock should result in raising an error; returning with an empty slot is only appropriate if the SKIP LOCKED option is specified by erm->waitPolicy.

The rowid is the ctid value previously read for the row to be re-fetched. Although the rowid value is passed as a Datum, it can currently only be a tid. The function API is chosen in hopes that it may be possible to allow other data types for row IDs in future.

If the RefetchForeignRow pointer is set to NULL, attempts to re-fetch rows will fail with an error message.

See Section 56.5 for more information.

bool
RecheckForeignScan(ForeignScanState *node,
                   TupleTableSlot *slot);

Recheck that a previously-returned tuple still matches the relevant scan and join qualifiers, and possibly provide a modified version of the tuple. For foreign data wrappers which do not perform join pushdown, it will typically be more convenient to set this to NULL and instead set fdw_recheck_quals appropriately. When outer joins are pushed down, however, it isn't sufficient to reapply the checks relevant to all the base tables to the result tuple, even if all needed attributes are present, because failure to match some qualifier might result in some attributes going to NULL, rather than in no tuple being returned. RecheckForeignScan can recheck qualifiers and return true if they are still satisfied and false otherwise, but it can also store a replacement tuple into the supplied slot.

To implement join pushdown, a foreign data wrapper will typically construct an alternative local join plan which is used only for rechecks; this will become the outer subplan of the ForeignScan. When a recheck is required, this subplan can be executed and the resulting tuple can be stored in the slot. This plan need not be efficient since no base table will return more than one row; for example, it may implement all joins as nested loops. The function GetExistingLocalJoinPath may be used to search existing paths for a suitable local join path, which can be used as the alternative local join plan. GetExistingLocalJoinPath searches for an unparameterized path in the path list of the specified join relation. (If it does not find such a path, it returns NULL, in which case a foreign data wrapper may build the local path by itself or may choose not to create access paths for that join.)

56.2.6. FDW Routines for `EXPLAIN`

void
ExplainForeignScan(ForeignScanState *node,
                   ExplainState *es);

Print additional EXPLAIN output for a foreign table scan. This function can call ExplainPropertyText and related functions to add fields to the EXPLAIN output. The flag fields in es can be used to determine what to print, and the state of the ForeignScanState node can be inspected to provide run-time statistics in the EXPLAIN ANALYZE case.

If the ExplainForeignScan pointer is set to NULL, no additional information is printed during EXPLAIN.

void
ExplainForeignModify(ModifyTableState *mtstate,
                     ResultRelInfo *rinfo,
                     List *fdw_private,
                     int subplan_index,
                     struct ExplainState *es);

Print additional EXPLAIN output for a foreign table update. This function can call ExplainPropertyText and related functions to add fields to the EXPLAIN output. The flag fields in es can be used to determine what to print, and the state of the ModifyTableState node can be inspected to provide run-time statistics in the EXPLAIN ANALYZE case. The first four arguments are the same as for BeginForeignModify.

If the ExplainForeignModify pointer is set to NULL, no additional information is printed during EXPLAIN.

void
ExplainDirectModify(ForeignScanState *node,
                    ExplainState *es);

Print additional EXPLAIN output for a direct modification on the remote server. This function can call ExplainPropertyText and related functions to add fields to the EXPLAIN output. The flag fields in es can be used to determine what to print, and the state of the ForeignScanState node can be inspected to provide run-time statistics in the EXPLAIN ANALYZE case.

If the ExplainDirectModify pointer is set to NULL, no additional information is printed during EXPLAIN.

56.2.7. FDW Routines for `ANALYZE`

bool
AnalyzeForeignTable(Relation relation,
                    AcquireSampleRowsFunc *func,
                    BlockNumber *totalpages);

This function is called when ANALYZE is executed on a foreign table. If the FDW can collect statistics for this foreign table, it should return true, and provide a pointer to a function that will collect sample rows from the table in func, plus the estimated size of the table in pages in totalpages. Otherwise, return false.

If the FDW does not support collecting statistics for any tables, the AnalyzeForeignTable pointer can be set to NULL.

If provided, the sample collection function must have the signature

int
AcquireSampleRowsFunc(Relation relation,
                      int elevel,
                      HeapTuple *rows,
                      int targrows,
                      double *totalrows,
                      double *totaldeadrows);

A random sample of up to targrows rows should be collected from the table and stored into the caller-provided rows array. The actual number of rows collected must be returned. In addition, store estimates of the total numbers of live and dead rows in the table into the output parameters totalrows and totaldeadrows. (Set totaldeadrows to zero if the FDW does not have any concept of dead rows.)

56.2.8. FDW Routines for `IMPORT FOREIGN SCHEMA`

List *
ImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid);

Obtain a list of foreign table creation commands. This function is called when executing IMPORT FOREIGN SCHEMA, and is passed the parse tree for that statement, as well as the OID of the foreign server to use. It should return a list of C strings, each of which must contain a CREATE FOREIGN TABLE command. These strings will be parsed and executed by the core server.

Within the ImportForeignSchemaStmt struct, remote_schema is the name of the remote schema from which tables are to be imported. list_type identifies how to filter table names: FDW_IMPORT_SCHEMA_ALL means that all tables in the remote schema should be imported (in this case table_list is empty), FDW_IMPORT_SCHEMA_LIMIT_TO means to include only tables listed in table_list, and FDW_IMPORT_SCHEMA_EXCEPT means to exclude the tables listed in table_list. options is a list of options used for the import process. The meanings of the options are up to the FDW. For example, an FDW could use an option to define whether the NOT NULL attributes of columns should be imported. These options need not have anything to do with those supported by the FDW as database object options.

The FDW may ignore the local_schema field of the ImportForeignSchemaStmt, because the core server will automatically insert that name into the parsed CREATE FOREIGN TABLE commands.

The FDW does not have to concern itself with implementing the filtering specified by list_type and table_list, either, as the core server will automatically skip any returned commands for tables excluded according to those options. However, it's often useful to avoid the work of creating commands for excluded tables in the first place. The function IsImportableForeignTable() may be useful to test whether a given foreign-table name will pass the filter.

If the FDW does not support importing table definitions, the ImportForeignSchema pointer can be set to NULL.

56.2.9. FDW Routines for Parallel Execution

A ForeignScan node can, optionally, support parallel execution. A parallel ForeignScan will be executed in multiple processes and must return each row exactly once across all cooperating processes. To do this, processes can coordinate through fixed-size chunks of dynamic shared memory. This shared memory is not guaranteed to be mapped at the same address in every process, so it must not contain pointers. The following functions are all optional, but most are required if parallel execution is to be supported.

bool
IsForeignScanParallelSafe(PlannerInfo *root, RelOptInfo *rel,
                          RangeTblEntry *rte);

Test whether a scan can be performed within a parallel worker. This function will only be called when the planner believes that a parallel plan might be possible, and should return true if it is safe for that scan to run within a parallel worker. This will generally not be the case if the remote data source has transaction semantics, unless the worker's connection to the data can somehow be made to share the same transaction context as the leader.

If this function is not defined, it is assumed that the scan must take place within the parallel leader. Note that returning true does not mean that the scan itself can be done in parallel, only that the scan can be performed within a parallel worker. Therefore, it can be useful to define this method even when parallel execution is not supported.

Size
EstimateDSMForeignScan(ForeignScanState *node, ParallelContext *pcxt);

Estimate the amount of dynamic shared memory that will be required for parallel operation. This may be higher than the amount that will actually be used, but it must not be lower. The return value is in bytes. This function is optional, and can be omitted if not needed; but if it is omitted, the next three functions must be omitted as well, because no shared memory will be allocated for the FDW's use.

void
InitializeDSMForeignScan(ForeignScanState *node, ParallelContext *pcxt,
                         void *coordinate);

Initialize the dynamic shared memory that will be required for parallel operation. coordinate points to a shared memory area of size equal to the return value of EstimateDSMForeignScan. This function is optional, and can be omitted if not needed.

void
ReInitializeDSMForeignScan(ForeignScanState *node, ParallelContext *pcxt,
                           void *coordinate);

Re-initialize the dynamic shared memory required for parallel operation when the foreign-scan plan node is about to be re-scanned. This function is optional, and can be omitted if not needed. Recommended practice is that this function reset only shared state, while the ReScanForeignScan function resets only local state. Currently, this function will be called before ReScanForeignScan, but it's best not to rely on that ordering.

void
InitializeWorkerForeignScan(ForeignScanState *node, shm_toc *toc,
                            void *coordinate);

Initialize a parallel worker's local state based on the shared state set up by the leader during InitializeDSMForeignScan. This function is optional, and can be omitted if not needed.

void
ShutdownForeignScan(ForeignScanState *node);

Release resources when it is anticipated the node will not be executed to completion. This is not called in all cases; sometimes, EndForeignScan may be called without this function having been called first. Since the DSM segment used by parallel query is destroyed just after this callback is invoked, foreign data wrappers that wish to take some action before the DSM segment goes away should implement this method.

56.2.10. FDW Routines for Reparameterization of Paths

List *
ReparameterizeForeignPathByChild(PlannerInfo *root, List *fdw_private,
                                 RelOptInfo *child_rel);

This function is called while converting a path parameterized by the top-most parent of the given child relation child_rel to be parameterized by the child relation. The function is used to reparameterize any paths or translate any expression nodes saved in the given fdw_private member of a ForeignPath. The callback may use reparameterize_path_by_child, adjust_appendrel_attrs or adjust_appendrel_attrs_multilevel as required.

56.3. Foreign Data Wrapper Helper Functions

Several helper functions are exported from the core server so that authors of foreign data wrappers can get easy access to attributes of FDW-related objects, such as FDW options. To use any of these functions, you need to include the header file foreign/foreign.h in your source file. That header also defines the struct types that are returned by these functions.

ForeignDataWrapper *
GetForeignDataWrapperExtended(Oid fdwid, bits16 flags);

This function returns a ForeignDataWrapper object for the foreign-data wrapper with the given OID. A ForeignDataWrapper object contains properties of the FDW (see foreign/foreign.h for details). flags is a bitwise-or'd bit mask indicating an extra set of options. It can take the value FDW_MISSING_OK, in which case a NULL result is returned to the caller instead of an error for an undefined object.

ForeignDataWrapper *
GetForeignDataWrapper(Oid fdwid);

ForeignServer *
GetForeignServerExtended(Oid serverid, bits16 flags);

This function returns a ForeignServer object for the foreign server with the given OID. A ForeignServer object contains properties of the server (see foreign/foreign.h for details). flags is a bitwise-or'd bit mask indicating an extra set of options. It can take the value FSV_MISSING_OK, in which case a NULL result is returned to the caller instead of an error for an undefined object.

ForeignServer *
GetForeignServer(Oid serverid);

This function returns a ForeignServer object for the foreign server with the given OID. A ForeignServer object contains properties of the server (see foreign/foreign.h for details).

UserMapping *
GetUserMapping(Oid userid, Oid serverid);

This function returns a UserMapping object for the user mapping of the given role on the given server. (If there is no mapping for the specific user, it will return the mapping for PUBLIC, or throw error if there is none.) A UserMapping object contains properties of the user mapping (see foreign/foreign.h for details).

ForeignTable *
GetForeignTable(Oid relid);

This function returns a ForeignTable object for the foreign table with the given OID. A ForeignTable object contains properties of the foreign table (see foreign/foreign.h for details).

List *
GetForeignColumnOptions(Oid relid, AttrNumber attnum);

This function returns the per-column FDW options for the column with the given foreign table OID and attribute number, in the form of a list of DefElem. NIL is returned if the column has no options.

Some object types have name-based lookup functions in addition to the OID-based ones:

ForeignDataWrapper *
GetForeignDataWrapperByName(const char *name, bool missing_ok);

This function returns a ForeignDataWrapper object for the foreign-data wrapper with the given name. If the wrapper is not found, return NULL if missing_ok is true, otherwise raise an error.

ForeignServer *
GetForeignServerByName(const char *name, bool missing_ok);

This function returns a ForeignServer object for the foreign server with the given name. If the server is not found, return NULL if missing_ok is true, otherwise raise an error.

56.4. Foreign Data Wrapper Query Planning

The FDW callback functions GetForeignRelSize, GetForeignPaths, GetForeignPlan, PlanForeignModify, GetForeignJoinPaths, GetForeignUpperPaths, and PlanDirectModify must fit into the workings of the PostgreSQL planner. Here are some notes about what they must do.

The information in root and baserel can be used to reduce the amount of information that has to be fetched from the foreign table (and therefore reduce the cost). baserel->baserestrictinfo is particularly interesting, as it contains restriction quals (WHERE clauses) that should be used to filter the rows to be fetched. (The FDW itself is not required to enforce these quals, as the core executor can check them instead.) baserel->reltarget->exprs can be used to determine which columns need to be fetched; but note that it only lists columns that have to be emitted by the ForeignScan plan node, not columns that are used in qual evaluation but not output by the query.

Various private fields are available for the FDW planning functions to keep information in. Generally, whatever you store in FDW private fields should be palloc'd, so that it will be reclaimed at the end of planning.

baserel->fdw_private is a void pointer that is available for FDW planning functions to store information relevant to the particular foreign table. The core planner does not touch it except to initialize it to NULL when the RelOptInfo node is created. It is useful for passing information forward from GetForeignRelSize to GetForeignPaths and/or GetForeignPaths to GetForeignPlan, thereby avoiding recalculation.

GetForeignPaths can identify the meaning of different access paths by storing private information in the fdw_private field of ForeignPath nodes. fdw_private is declared as a List pointer, but could actually contain anything since the core planner does not touch it. However, best practice is to use a representation that's dumpable by nodeToString, for use with debugging support available in the backend.

GetForeignPlan can examine the fdw_private field of the selected ForeignPath node, and can generate fdw_exprs and fdw_private lists to be placed in the ForeignScan plan node, where they will be available at execution time. Both of these lists must be represented in a form that copyObject knows how to copy. The fdw_private list has no other restrictions and is not interpreted by the core backend in any way. The fdw_exprs list, if not NIL, is expected to contain expression trees that are intended to be executed at run time. These trees will undergo post-processing by the planner to make them fully executable.

In GetForeignPlan, generally the passed-in target list can be copied into the plan node as-is. The passed scan_clauses list contains the same clauses as baserel->baserestrictinfo, but may be re-ordered for better execution efficiency. In simple cases the FDW can just strip RestrictInfo nodes from the scan_clauses list (using extract_actual_clauses) and put all the clauses into the plan node's qual list, which means that all the clauses will be checked by the executor at run time. More complex FDWs may be able to check some of the clauses internally, in which case those clauses can be removed from the plan node's qual list so that the executor doesn't waste time rechecking them.

As an example, the FDW might identify some restriction clauses of the form foreign_variable = sub_expression, which it determines can be executed on the remote server given the locally-evaluated value of the sub_expression. The actual identification of such a clause should happen during GetForeignPaths, since it would affect the cost estimate for the path. The path's fdw_private field would probably include a pointer to the identified clause's RestrictInfo node. Then GetForeignPlan would remove that clause from scan_clauses, but add the sub_expression to fdw_exprs to ensure that it gets massaged into executable form. It would probably also put control information into the plan node's fdw_private field to tell the execution functions what to do at run time. The query transmitted to the remote server would involve something like WHERE foreign_variable = $1, with the parameter value obtained at run time from evaluation of the fdw_exprs expression tree.

Any clauses removed from the plan node's qual list must instead be added to fdw_recheck_quals or rechecked by RecheckForeignScan in order to ensure correct behavior at the READ COMMITTED isolation level. When a concurrent update occurs for some other table involved in the query, the executor may need to verify that all of the original quals are still satisfied for the tuple, possibly against a different set of parameter values. Using fdw_recheck_quals is typically easier than implementing checks inside RecheckForeignScan, but this method will be insufficient when outer joins have been pushed down, since the join tuples in that case might have some fields go to NULL without rejecting the tuple entirely.

Another ForeignScan field that can be filled by FDWs is fdw_scan_tlist, which describes the tuples returned by the FDW for this plan node. For simple foreign table scans this can be set to NIL, implying that the returned tuples have the row type declared for the foreign table. A non-NIL value must be a target list (list of TargetEntrys) containing Vars and/or expressions representing the returned columns. This might be used, for example, to show that the FDW has omitted some columns that it noticed won't be needed for the query. Also, if the FDW can compute expressions used by the query more cheaply than can be done locally, it could add those expressions to fdw_scan_tlist. Note that join plans (created from paths made by GetForeignJoinPaths) must always supply fdw_scan_tlist to describe the set of columns they will return.

The FDW should always construct at least one path that depends only on the table's restriction clauses. In join queries, it might also choose to construct path(s) that depend on join clauses, for example foreign_variable = local_variable. Such clauses will not be found in baserel->baserestrictinfo but must be sought in the relation's join lists. A path using such a clause is called a “parameterized path”. It must identify the other relations used in the selected join clause(s) with a suitable value of param_info; use get_baserel_parampathinfo to compute that value. In GetForeignPlan, the local_variable portion of the join clause would be added to fdw_exprs, and then at run time the case works the same as for an ordinary restriction clause.

If an FDW supports remote joins, GetForeignJoinPaths should produce ForeignPaths for potential remote joins in much the same way as GetForeignPaths works for base tables. Information about the intended join can be passed forward to GetForeignPlan in the same ways described above. However, baserestrictinfo is not relevant for join relations; instead, the relevant join clauses for a particular join are passed to GetForeignJoinPaths as a separate parameter (extra->restrictlist).

An FDW might additionally support direct execution of some plan actions that are above the level of scans and joins, such as grouping or aggregation. To offer such options, the FDW should generate paths and insert them into the appropriate upper relation. For example, a path representing remote aggregation should be inserted into the UPPERREL_GROUP_AGG relation, using add_path. This path will be compared on a cost basis with local aggregation performed by reading a simple scan path for the foreign relation (note that such a path must also be supplied, else there will be an error at plan time). If the remote-aggregation path wins, which it usually would, it will be converted into a plan in the usual way, by calling GetForeignPlan. The recommended place to generate such paths is in the GetForeignUpperPaths callback function, which is called for each upper relation (i.e., each post-scan/join processing step), if all the base relations of the query come from the same FDW.

PlanForeignModify and the other callbacks described in Section 56.2.4 are designed around the assumption that the foreign relation will be scanned in the usual way and then individual row updates will be driven by a local ModifyTable plan node. This approach is necessary for the general case where an update requires reading local tables as well as foreign tables. However, if the operation could be executed entirely by the foreign server, the FDW could generate a path representing that and insert it into the UPPERREL_FINAL upper relation, where it would compete against the ModifyTable approach. This approach could also be used to implement remote SELECT FOR UPDATE, rather than using the row locking callbacks described in Section 56.2.5. Keep in mind that a path inserted into UPPERREL_FINAL is responsible for implementing all behavior of the query.

When planning an UPDATE or DELETE, PlanForeignModify and PlanDirectModify can look up the RelOptInfo struct for the foreign table and make use of the baserel->fdw_private data previously created by the scan-planning functions. However, in INSERT the target table is not scanned so there is no RelOptInfo for it. The List returned by PlanForeignModify has the same restrictions as the fdw_private list of a ForeignScan plan node, that is it must contain only structures that copyObject knows how to copy.

INSERT with an ON CONFLICT clause does not support specifying the conflict target, as unique constraints or exclusion constraints on remote tables are not locally known. This in turn implies that ON CONFLICT DO UPDATE is not supported, since the specification is mandatory there.

56.5. Row Locking in Foreign Data Wrappers

If an FDW's underlying storage mechanism has a concept of locking individual rows to prevent concurrent updates of those rows, it is usually worthwhile for the FDW to perform row-level locking with as close an approximation as practical to the semantics used in ordinary PostgreSQL tables. There are multiple considerations involved in this.

One key decision to be made is whether to perform early locking or late locking. In early locking, a row is locked when it is first retrieved from the underlying store, while in late locking, the row is locked only when it is known that it needs to be locked. (The difference arises because some rows may be discarded by locally-checked restriction or join conditions.) Early locking is much simpler and avoids extra round trips to a remote store, but it can cause locking of rows that need not have been locked, resulting in reduced concurrency or even unexpected deadlocks. Also, late locking is only possible if the row to be locked can be uniquely re-identified later. Preferably the row identifier should identify a specific version of the row, as PostgreSQL TIDs do.

By default, PostgreSQL ignores locking considerations when interfacing to FDWs, but an FDW can perform early locking without any explicit support from the core code. The API functions described in Section 56.2.5, which were added in PostgreSQL 9.5, allow an FDW to use late locking if it wishes.

An additional consideration is that in READ COMMITTED isolation mode, PostgreSQL may need to re-check restriction and join conditions against an updated version of some target tuple. Rechecking join conditions requires re-obtaining copies of the non-target rows that were previously joined to the target tuple. When working with standard PostgreSQL tables, this is done by including the TIDs of the non-target tables in the column list projected through the join, and then re-fetching non-target rows when required. This approach keeps the join data set compact, but it requires inexpensive re-fetch capability, as well as a TID that can uniquely identify the row version to be re-fetched. By default, therefore, the approach used with foreign tables is to include a copy of the entire row fetched from a foreign table in the column list projected through the join. This puts no special demands on the FDW but can result in reduced performance of merge and hash joins. An FDW that is capable of meeting the re-fetch requirements can choose to do it the first way.

For an UPDATE or DELETE on a foreign table, it is recommended that the ForeignScan operation on the target table perform early locking on the rows that it fetches, perhaps via the equivalent of SELECT FOR UPDATE. An FDW can detect whether a table is an UPDATE/DELETE target at plan time by comparing its relid to root->parse->resultRelation, or at execution time by using ExecRelationIsTargetRelation(). An alternative possibility is to perform late locking within the ExecForeignUpdate or ExecForeignDelete callback, but no special support is provided for this.

For foreign tables that are specified to be locked by a SELECT FOR UPDATE/SHARE command, the ForeignScan operation can again perform early locking by fetching tuples with the equivalent of SELECT FOR UPDATE/SHARE. To perform late locking instead, provide the callback functions defined in Section 56.2.5. In GetForeignRowMarkType, select rowmark option ROW_MARK_EXCLUSIVE, ROW_MARK_NOKEYEXCLUSIVE, ROW_MARK_SHARE, or ROW_MARK_KEYSHARE depending on the requested lock strength. (The core code will act the same regardless of which of these four options you choose.) Elsewhere, you can detect whether a foreign table was specified to be locked by this type of command by using get_plan_rowmark at plan time, or ExecFindRowMark at execution time; you must check not only whether a non-null rowmark struct is returned, but that its strength field is not LCS_NONE.

Lastly, for foreign tables that are used in an UPDATE, DELETE or SELECT FOR UPDATE/SHARE command but are not specified to be row-locked, you can override the default choice to copy entire rows by having GetForeignRowMarkType select option ROW_MARK_REFERENCE when it sees lock strength LCS_NONE. This will cause RefetchForeignRow to be called with that value for markType; it should then re-fetch the row without acquiring any new lock. (If you have a GetForeignRowMarkType function but don't wish to re-fetch unlocked rows, select option ROW_MARK_COPY for LCS_NONE.)

See src/include/nodes/lockoptions.h, the comments for RowMarkType and PlanRowMark in src/include/nodes/plannodes.h, and the comments for ExecRowMark in src/include/nodes/execnodes.h for additional information.

59. Genetic Query Optimizer

Author

Written by Martin Utesch (<utesch@aut.tu-freiberg.de>) for the Institute of Automatic Control at the University of Mining and Technology in Freiberg, Germany.

59.1. Query Handling as a Complex Optimization Problem

Among all relational operators the most difficult one to process and optimize is the join. The number of possible query plans grows exponentially with the number of joins in the query. Further optimization effort is caused by the support of a variety of join methods (e.g., nested loop, hash join, merge join in PostgreSQL) to process individual joins and a diversity of indexes (e.g., B-tree, hash, GiST and GIN in PostgreSQL) as access paths for relations.

The normal PostgreSQL query optimizer performs a near-exhaustive search over the space of alternative strategies. This algorithm, first introduced in IBM's System R database, produces a near-optimal join order, but can take an enormous amount of time and memory space when the number of joins in the query grows large. This makes the ordinary PostgreSQL query optimizer inappropriate for queries that join a large number of tables.

The Institute of Automatic Control at the University of Mining and Technology, in Freiberg, Germany, encountered some problems when it wanted to use PostgreSQL as the backend for a decision support knowledge based system for the maintenance of an electrical power grid. The DBMS needed to handle large join queries for the inference machine of the knowledge based system. The number of joins in these queries made using the normal query optimizer infeasible.

In the following we describe the implementation of a genetic algorithm to solve the join ordering problem in a manner that is efficient for queries involving large numbers of joins.

59.2. Genetic Algorithms

The genetic algorithm (GA) is a heuristic optimization method which operates through randomized search. The set of possible solutions for the optimization problem is considered as a population of individuals. The degree of adaptation of an individual to its environment is specified by its fitness.

The coordinates of an individual in the search space are represented by chromosomes, in essence a set of character strings. A gene is a subsection of a chromosome which encodes the value of a single parameter being optimized. Typical encodings for a gene could be binary or integer.

Through simulation of the evolutionary operations recombination, mutation, and selection new generations of search points are found that show a higher average fitness than their ancestors.

According to the comp.ai.genetic FAQ it cannot be stressed too strongly that a GA is not a pure random search for a solution to a problem. A GA uses stochastic processes, but the result is distinctly non-random (better than random).

Figure 59.1. Structured Diagram of a Genetic Algorithm

+=========================================+
|>>>>>>>>>>>  Algorithm GA  <<<<<<<<<<<<<<|
+=========================================+
| INITIALIZE t := 0                       |
+=========================================+
| INITIALIZE P(t)                         |
+=========================================+
| evaluate FITNESS of P(t)                |
+=========================================+
| while not STOPPING CRITERION do         |
|   +-------------------------------------+
|   | P'(t)  := RECOMBINATION{P(t)}       |
|   +-------------------------------------+
|   | P''(t) := MUTATION{P'(t)}           |
|   +-------------------------------------+
|   | P(t+1) := SELECTION{P''(t) + P(t)}  |
|   +-------------------------------------+
|   | evaluate FITNESS of P''(t)          |
|   +-------------------------------------+
|   | t := t + 1                          |
+===+=====================================+

59.3. Genetic Query Optimization (GEQO) in PostgreSQL

The GEQO module approaches the query optimization problem as though it were the well-known traveling salesman problem (TSP). Possible query plans are encoded as integer strings. Each string represents the join order from one relation of the query to the next. For example, the join tree

   /\
  /\ 2
 /\ 3
4  1

is encoded by the integer string '4-1-3-2', which means, first join relation '4' and '1', then '3', and then '2', where 1, 2, 3, 4 are relation IDs within the PostgreSQL optimizer.

Specific characteristics of the GEQO implementation in PostgreSQL are:

Usage of a steady state GA (replacement of the least fit individuals in a population, not whole-generational replacement) allows fast convergence towards improved query plans. This is essential for query handling with reasonable time;
Usage of edge recombination crossover which is especially suited to keep edge losses low for the solution of the TSP by means of a GA;
Mutation as genetic operator is deprecated so that no repair mechanisms are needed to generate legal TSP tours.

Parts of the GEQO module are adapted from D. Whitley's Genitor algorithm.

The GEQO module allows the PostgreSQL query optimizer to support large join queries effectively through non-exhaustive search.

59.3.1. Generating Possible Plans with GEQO

The GEQO planning process uses the standard planner code to generate plans for scans of individual relations. Then join plans are developed using the genetic approach. As shown above, each candidate join plan is represented by a sequence in which to join the base relations. In the initial stage, the GEQO code simply generates some possible join sequences at random. For each join sequence considered, the standard planner code is invoked to estimate the cost of performing the query using that join sequence. (For each step of the join sequence, all three possible join strategies are considered; and all the initially-determined relation scan plans are available. The estimated cost is the cheapest of these possibilities.) Join sequences with lower estimated cost are considered “more fit” than those with higher cost. The genetic algorithm discards the least fit candidates. Then new candidates are generated by combining genes of more-fit candidates — that is, by using randomly-chosen portions of known low-cost join sequences to create new sequences for consideration. This process is repeated until a preset number of join sequences have been considered; then the best one found at any time during the search is used to generate the finished plan.

This process is inherently nondeterministic, because of the randomized choices made during both the initial population selection and subsequent “mutation” of the best candidates. To avoid surprising changes of the selected plan, each run of the GEQO algorithm restarts its random number generator with the current geqo_seed parameter setting. As long as geqo_seed and the other GEQO parameters are kept fixed, the same plan will be generated for a given query (and other planner inputs such as statistics). To experiment with different search paths, try changing geqo_seed.

59.3.2. Future Implementation Tasks for PostgreSQL GEQO

Work is still needed to improve the genetic algorithm parameter settings. In file src/backend/optimizer/geqo/geqo_main.c, routines gimme_pool_size and gimme_number_generations, we have to find a compromise for the parameter settings to satisfy two competing demands:

Optimality of the query plan
Computing time

In the current implementation, the fitness of each candidate join sequence is estimated by running the standard planner's join selection and cost estimation code from scratch. To the extent that different candidates use similar sub-sequences of joins, a great deal of work will be repeated. This could be made significantly faster by retaining cost estimates for sub-joins. The problem is to avoid expending unreasonable amounts of memory on retaining that state.

At a more basic level, it is not clear that solving query optimization with a GA algorithm designed for TSP is appropriate. In the TSP case, the cost associated with any substring (partial tour) is independent of the rest of the tour, but this is certainly not true for query optimization. Thus it is questionable whether edge recombination crossover is the most effective mutation procedure.

59.4. Further Reading

The following resources contain additional information about genetic algorithms:

The Hitch-Hiker's Guide to Evolutionary Computation, (FAQ for news://comp.ai.genetic)
Evolutionary Computation and its application to art and design, by Craig Reynolds
[elma04]
[fong]

64. GiST Indexes

64.1. Introduction

GiST stands for Generalized Search Tree. It is a balanced, tree-structured access method, that acts as a base template in which to implement arbitrary indexing schemes. B-trees, R-trees and many other indexing schemes can be implemented in GiST.

One advantage of GiST is that it allows the development of custom data types with the appropriate access methods, by an expert in the domain of the data type, rather than a database expert.

Some of the information here is derived from the University of California at Berkeley's GiST Indexing Project web site and Marcel Kornacker's thesis, Access Methods for Next-Generation Database Systems. The GiST implementation in PostgreSQL is primarily maintained by Teodor Sigaev and Oleg Bartunov, and there is more information on their web site.

64.2. Built-in Operator Classes

The core PostgreSQL distribution includes the GiST operator classes shown in Table 64.1. (Some of the optional modules described in Appendix F provide additional GiST operator classes.)

Table 64.1. Built-in GiST Operator Classes

For historical reasons, the inet_ops operator class is not the default class for types inet and cidr. To use it, mention the class name in CREATE INDEX, for example

CREATE INDEX ON my_table USING GIST (my_inet_column inet_ops);

64.3. Extensibility

Traditionally, implementing a new index access method meant a lot of difficult work. It was necessary to understand the inner workings of the database, such as the lock manager and Write-Ahead Log. The GiST interface has a high level of abstraction, requiring the access method implementer only to implement the semantics of the data type being accessed. The GiST layer itself takes care of concurrency, logging and searching the tree structure.

This extensibility should not be confused with the extensibility of the other standard search trees in terms of the data they can handle. For example, PostgreSQL supports extensible B-trees and hash indexes. That means that you can use PostgreSQL to build a B-tree or hash over any data type you want. But B-trees only support range predicates (<, =, >), and hash indexes only support equality queries.

So if you index, say, an image collection with a PostgreSQL B-tree, you can only issue queries such as “is imagex equal to imagey”, “is imagex less than imagey” and “is imagex greater than imagey”. Depending on how you define “equals”, “less than” and “greater than” in this context, this could be useful. However, by using a GiST based index, you could create ways to ask domain-specific questions, perhaps “find all images of horses” or “find all over-exposed images”.

All it takes to get a GiST access method up and running is to implement several user-defined methods, which define the behavior of keys in the tree. Of course these methods have to be pretty fancy to support fancy queries, but for all the standard queries (B-trees, R-trees, etc.) they're relatively straightforward. In short, GiST combines extensibility along with generality, code reuse, and a clean interface.

There are five methods that an index operator class for GiST must provide, and four that are optional. Correctness of the index is ensured by proper implementation of the same, consistent and union methods, while efficiency (size and speed) of the index will depend on the penalty and picksplit methods. Two optional methods are compress and decompress, which allow an index to have internal tree data of a different type than the data it indexes. The leaves are to be of the indexed data type, while the other tree nodes can be of any C struct (but you still have to follow PostgreSQL data type rules here, see about varlena for variable sized data). If the tree's internal data type exists at the SQL level, the STORAGE option of the CREATE OPERATOR CLASS command can be used. The optional eighth method is distance, which is needed if the operator class wishes to support ordered scans (nearest-neighbor searches). The optional ninth method fetch is needed if the operator class wishes to support index-only scans, except when the compress method is omitted.

consistent

Given an index entry p and a query value q, this function determines whether the index entry is “consistent” with the query; that is, could the predicate “indexed_column indexable_operator q” be true for any row represented by the index entry? For a leaf index entry this is equivalent to testing the indexable condition, while for an internal tree node this determines whether it is necessary to scan the subtree of the index represented by the tree node. When the result is true, a recheck flag must also be returned. This indicates whether the predicate is certainly true or only possibly true. If recheck = false then the index has tested the predicate condition exactly, whereas if recheck = true the row is only a candidate match. In that case the system will automatically evaluate the indexable_operator against the actual row value to see if it is really a match. This convention allows GiST to support both lossless and lossy index structures.

The SQL declaration of the function must look like this:

CREATE OR REPLACE FUNCTION my_consistent(internal, data_type, smallint, oid, internal)
RETURNS bool
AS 'MODULE_PATHNAME'
LANGUAGE C STRICT;

And the matching code in the C module could then follow this skeleton:

PG_FUNCTION_INFO_V1(my_consistent);

Datum
my_consistent(PG_FUNCTION_ARGS)
{
    GISTENTRY  *entry = (GISTENTRY *) PG_GETARG_POINTER(0);
    data_type  *query = PG_GETARG_DATA_TYPE_P(1);
    StrategyNumber strategy = (StrategyNumber) PG_GETARG_UINT16(2);
    /* Oid subtype = PG_GETARG_OID(3); */
    bool       *recheck = (bool *) PG_GETARG_POINTER(4);
    data_type  *key = DatumGetDataType(entry->key);
    bool        retval;

    /*
     * determine return value as a function of strategy, key and query.
     *
     * Use GIST_LEAF(entry) to know where you're called in the index tree,
     * which comes handy when supporting the = operator for example (you could
     * check for non empty union() in non-leaf nodes and equality in leaf
     * nodes).
     */

    *recheck = true;        /* or false if check is exact */

    PG_RETURN_BOOL(retval);
}

Here, key is an element in the index and query the value being looked up in the index. The StrategyNumber parameter indicates which operator of your operator class is being applied — it matches one of the operator numbers in the CREATE OPERATOR CLASS command.

Depending on which operators you have included in the class, the data type of query could vary with the operator, since it will be whatever type is on the righthand side of the operator, which might be different from the indexed data type appearing on the lefthand side. (The above code skeleton assumes that only one type is possible; if not, fetching the query argument value would have to depend on the operator.) It is recommended that the SQL declaration of the consistent function use the opclass's indexed data type for the query argument, even though the actual type might be something else depending on the operator.

union

This method consolidates information in the tree. Given a set of entries, this function generates a new index entry that represents all the given entries.

The SQL declaration of the function must look like this:

CREATE OR REPLACE FUNCTION my_union(internal, internal)
RETURNS storage_type
AS 'MODULE_PATHNAME'
LANGUAGE C STRICT;

And the matching code in the C module could then follow this skeleton:

PG_FUNCTION_INFO_V1(my_union);

Datum
my_union(PG_FUNCTION_ARGS)
{
    GistEntryVector *entryvec = (GistEntryVector *) PG_GETARG_POINTER(0);
    GISTENTRY  *ent = entryvec->vector;
    data_type  *out,
               *tmp,
               *old;
    int         numranges,
                i = 0;

    numranges = entryvec->n;
    tmp = DatumGetDataType(ent[0].key);
    out = tmp;

    if (numranges == 1)
    {
        out = data_type_deep_copy(tmp);

        PG_RETURN_DATA_TYPE_P(out);
    }

    for (i = 1; i < numranges; i++)
    {
        old = out;
        tmp = DatumGetDataType(ent[i].key);
        out = my_union_implementation(out, tmp);
    }

    PG_RETURN_DATA_TYPE_P(out);
}

As you can see, in this skeleton we're dealing with a data type where union(X, Y, Z) = union(union(X, Y), Z). It's easy enough to support data types where this is not the case, by implementing the proper union algorithm in this GiST support method.

The result of the union function must be a value of the index's storage type, whatever that is (it might or might not be different from the indexed column's type). The union function should return a pointer to newly palloc()ed memory. You can't just return the input value as-is, even if there is no type change.

As shown above, the union function's first internal argument is actually a GistEntryVector pointer. The second argument is a pointer to an integer variable, which can be ignored. (It used to be required that the union function store the size of its result value into that variable, but this is no longer necessary.)

compress

Converts a data item into a format suitable for physical storage in an index page. If the compress method is omitted, data items are stored in the index without modification.

The SQL declaration of the function must look like this:

CREATE OR REPLACE FUNCTION my_compress(internal)
RETURNS internal
AS 'MODULE_PATHNAME'
LANGUAGE C STRICT;

And the matching code in the C module could then follow this skeleton:

PG_FUNCTION_INFO_V1(my_compress);

Datum
my_compress(PG_FUNCTION_ARGS)
{
    GISTENTRY  *entry = (GISTENTRY *) PG_GETARG_POINTER(0);
    GISTENTRY  *retval;

    if (entry->leafkey)
    {
        /* replace entry->key with a compressed version */
        compressed_data_type *compressed_data = palloc(sizeof(compressed_data_type));

        /* fill *compressed_data from entry->key ... */

        retval = palloc(sizeof(GISTENTRY));
        gistentryinit(*retval, PointerGetDatum(compressed_data),
                      entry->rel, entry->page, entry->offset, FALSE);
    }
    else
    {
        /* typically we needn't do anything with non-leaf entries */
        retval = entry;
    }

    PG_RETURN_POINTER(retval);
}

You have to adapt compressed_data_type to the specific type you're converting to in order to compress your leaf nodes, of course.

decompress

Converts the stored representation of a data item into a format that can be manipulated by the other GiST methods in the operator class. If the decompress method is omitted, it is assumed that the other GiST methods can work directly on the stored data format. (decompress is not necessarily the reverse of the compress method; in particular, if compress is lossy then it's impossible for decompress to exactly reconstruct the original data. decompress is not necessarily equivalent to fetch, either, since the other GiST methods might not require full reconstruction of the data.)

The SQL declaration of the function must look like this:

CREATE OR REPLACE FUNCTION my_decompress(internal)
RETURNS internal
AS 'MODULE_PATHNAME'
LANGUAGE C STRICT;

And the matching code in the C module could then follow this skeleton:

PG_FUNCTION_INFO_V1(my_decompress);

Datum
my_decompress(PG_FUNCTION_ARGS)
{
    PG_RETURN_POINTER(PG_GETARG_POINTER(0));
}

The above skeleton is suitable for the case where no decompression is needed. (But, of course, omitting the method altogether is even easier, and is recommended in such cases.)

penalty

Returns a value indicating the “cost” of inserting the new entry into a particular branch of the tree. Items will be inserted down the path of least penalty in the tree. Values returned by penalty should be non-negative. If a negative value is returned, it will be treated as zero.

The SQL declaration of the function must look like this:

CREATE OR REPLACE FUNCTION my_penalty(internal, internal, internal)
RETURNS internal
AS 'MODULE_PATHNAME'
LANGUAGE C STRICT;  -- in some cases penalty functions need not be strict

And the matching code in the C module could then follow this skeleton:

PG_FUNCTION_INFO_V1(my_penalty);

Datum
my_penalty(PG_FUNCTION_ARGS)
{
    GISTENTRY  *origentry = (GISTENTRY *) PG_GETARG_POINTER(0);
    GISTENTRY  *newentry = (GISTENTRY *) PG_GETARG_POINTER(1);
    float      *penalty = (float *) PG_GETARG_POINTER(2);
    data_type  *orig = DatumGetDataType(origentry->key);
    data_type  *new = DatumGetDataType(newentry->key);

    *penalty = my_penalty_implementation(orig, new);
    PG_RETURN_POINTER(penalty);
}

For historical reasons, the penalty function doesn't just return a float result; instead it has to store the value at the location indicated by the third argument. The return value per se is ignored, though it's conventional to pass back the address of that argument.

The penalty function is crucial to good performance of the index. It'll get used at insertion time to determine which branch to follow when choosing where to add the new entry in the tree. At query time, the more balanced the index, the quicker the lookup.

picksplit

When an index page split is necessary, this function decides which entries on the page are to stay on the old page, and which are to move to the new page.

The SQL declaration of the function must look like this:

CREATE OR REPLACE FUNCTION my_picksplit(internal, internal)
RETURNS internal
AS 'MODULE_PATHNAME'
LANGUAGE C STRICT;

And the matching code in the C module could then follow this skeleton:

PG_FUNCTION_INFO_V1(my_picksplit);

Datum
my_picksplit(PG_FUNCTION_ARGS)
{
    GistEntryVector *entryvec = (GistEntryVector *) PG_GETARG_POINTER(0);
    GIST_SPLITVEC *v = (GIST_SPLITVEC *) PG_GETARG_POINTER(1);
    OffsetNumber maxoff = entryvec->n - 1;
    GISTENTRY  *ent = entryvec->vector;
    int         i,
                nbytes;
    OffsetNumber *left,
               *right;
    data_type  *tmp_union;
    data_type  *unionL;
    data_type  *unionR;
    GISTENTRY **raw_entryvec;

    maxoff = entryvec->n - 1;
    nbytes = (maxoff + 1) * sizeof(OffsetNumber);

    v->spl_left = (OffsetNumber *) palloc(nbytes);
    left = v->spl_left;
    v->spl_nleft = 0;

    v->spl_right = (OffsetNumber *) palloc(nbytes);
    right = v->spl_right;
    v->spl_nright = 0;

    unionL = NULL;
    unionR = NULL;

    /* Initialize the raw entry vector. */
    raw_entryvec = (GISTENTRY **) malloc(entryvec->n * sizeof(void *));
    for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
        raw_entryvec[i] = &(entryvec->vector[i]);

    for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
    {
        int         real_index = raw_entryvec[i] - entryvec->vector;

        tmp_union = DatumGetDataType(entryvec->vector[real_index].key);
        Assert(tmp_union != NULL);

        /*
         * Choose where to put the index entries and update unionL and unionR
         * accordingly. Append the entries to either v->spl_left or
         * v->spl_right, and care about the counters.
         */

        if (my_choice_is_left(unionL, curl, unionR, curr))
        {
            if (unionL == NULL)
                unionL = tmp_union;
            else
                unionL = my_union_implementation(unionL, tmp_union);

            *left = real_index;
            ++left;
            ++(v->spl_nleft);
        }
        else
        {
            /*
             * Same on the right
             */
        }
    }

    v->spl_ldatum = DataTypeGetDatum(unionL);
    v->spl_rdatum = DataTypeGetDatum(unionR);
    PG_RETURN_POINTER(v);
}

Notice that the picksplit function's result is delivered by modifying the passed-in v structure. The return value per se is ignored, though it's conventional to pass back the address of v.

Like penalty, the picksplit function is crucial to good performance of the index. Designing suitable penalty and picksplit implementations is where the challenge of implementing well-performing GiST indexes lies.

same

Returns true if two index entries are identical, false otherwise. (An “index entry” is a value of the index's storage type, not necessarily the original indexed column's type.)

The SQL declaration of the function must look like this:

CREATE OR REPLACE FUNCTION my_same(storage_type, storage_type, internal)
RETURNS internal
AS 'MODULE_PATHNAME'
LANGUAGE C STRICT;

And the matching code in the C module could then follow this skeleton:

PG_FUNCTION_INFO_V1(my_same);

Datum
my_same(PG_FUNCTION_ARGS)
{
    prefix_range *v1 = PG_GETARG_PREFIX_RANGE_P(0);
    prefix_range *v2 = PG_GETARG_PREFIX_RANGE_P(1);
    bool       *result = (bool *) PG_GETARG_POINTER(2);

    *result = my_eq(v1, v2);
    PG_RETURN_POINTER(result);
}

For historical reasons, the same function doesn't just return a Boolean result; instead it has to store the flag at the location indicated by the third argument. The return value per se is ignored, though it's conventional to pass back the address of that argument.

distance

Given an index entry p and a query value q, this function determines the index entry's “distance” from the query value. This function must be supplied if the operator class contains any ordering operators. A query using the ordering operator will be implemented by returning index entries with the smallest “distance” values first, so the results must be consistent with the operator's semantics. For a leaf index entry the result just represents the distance to the index entry; for an internal tree node, the result must be the smallest distance that any child entry could have.

The SQL declaration of the function must look like this:

CREATE OR REPLACE FUNCTION my_distance(internal, data_type, smallint, oid, internal)
RETURNS float8
AS 'MODULE_PATHNAME'
LANGUAGE C STRICT;

And the matching code in the C module could then follow this skeleton:

PG_FUNCTION_INFO_V1(my_distance);

Datum
my_distance(PG_FUNCTION_ARGS)
{
    GISTENTRY  *entry = (GISTENTRY *) PG_GETARG_POINTER(0);
    data_type  *query = PG_GETARG_DATA_TYPE_P(1);
    StrategyNumber strategy = (StrategyNumber) PG_GETARG_UINT16(2);
    /* Oid subtype = PG_GETARG_OID(3); */
    /* bool *recheck = (bool *) PG_GETARG_POINTER(4); */
    data_type  *key = DatumGetDataType(entry->key);
    double      retval;

    /*
     * determine return value as a function of strategy, key and query.
     */

    PG_RETURN_FLOAT8(retval);
}

The arguments to the distance function are identical to the arguments of the consistent function.

Some approximation is allowed when determining the distance, so long as the result is never greater than the entry's actual distance. Thus, for example, distance to a bounding box is usually sufficient in geometric applications. For an internal tree node, the distance returned must not be greater than the distance to any of the child nodes. If the returned distance is not exact, the function must set *recheck to true. (This is not necessary for internal tree nodes; for them, the calculation is always assumed to be inexact.) In this case the executor will calculate the accurate distance after fetching the tuple from the heap, and reorder the tuples if necessary.

If the distance function returns *recheck = true for any leaf node, the original ordering operator's return type must be float8 or float4, and the distance function's result values must be comparable to those of the original ordering operator, since the executor will sort using both distance function results and recalculated ordering-operator results. Otherwise, the distance function's result values can be any finite float8 values, so long as the relative order of the result values matches the order returned by the ordering operator. (Infinity and minus infinity are used internally to handle cases such as nulls, so it is not recommended that distance functions return these values.)

fetch

Converts the compressed index representation of a data item into the original data type, for index-only scans. The returned data must be an exact, non-lossy copy of the originally indexed value.

The SQL declaration of the function must look like this:

CREATE OR REPLACE FUNCTION my_fetch(internal)
RETURNS internal
AS 'MODULE_PATHNAME'
LANGUAGE C STRICT;

The argument is a pointer to a GISTENTRY struct. On entry, its key field contains a non-NULL leaf datum in compressed form. The return value is another GISTENTRY struct, whose key field contains the same datum in its original, uncompressed form. If the opclass's compress function does nothing for leaf entries, the fetch method can return the argument as-is. Or, if the opclass does not have a compress function, the fetch method can be omitted as well, since it would necessarily be a no-op.

The matching code in the C module could then follow this skeleton:

PG_FUNCTION_INFO_V1(my_fetch);

Datum
my_fetch(PG_FUNCTION_ARGS)
{
    GISTENTRY  *entry = (GISTENTRY *) PG_GETARG_POINTER(0);
    input_data_type *in = DatumGetPointer(entry->key);
    fetched_data_type *fetched_data;
    GISTENTRY  *retval;

    retval = palloc(sizeof(GISTENTRY));
    fetched_data = palloc(sizeof(fetched_data_type));

    /*
     * Convert 'fetched_data' into the a Datum of the original datatype.
     */

    /* fill *retval from fetched_data. */
    gistentryinit(*retval, PointerGetDatum(converted_datum),
                  entry->rel, entry->page, entry->offset, FALSE);

    PG_RETURN_POINTER(retval);
}

If the compress method is lossy for leaf entries, the operator class cannot support index-only scans, and must not define a fetch function.

All the GiST support methods are normally called in short-lived memory contexts; that is, CurrentMemoryContext will get reset after each tuple is processed. It is therefore not very important to worry about pfree'ing everything you palloc. However, in some cases it's useful for a support method to cache data across repeated calls. To do that, allocate the longer-lived data in fcinfo->flinfo->fn_mcxt, and keep a pointer to it in fcinfo->flinfo->fn_extra. Such data will survive for the life of the index operation (e.g., a single GiST index scan, index build, or index tuple insertion). Be careful to pfree the previous value when replacing a fn_extra value, or the leak will accumulate for the duration of the operation.

64.4. Implementation

64.4.1. GiST Buffering Build

Building large GiST indexes by simply inserting all the tuples tends to be slow, because if the index tuples are scattered across the index and the index is large enough to not fit in cache, the insertions need to perform a lot of random I/O. Beginning in version 9.2, PostgreSQL supports a more efficient method to build GiST indexes based on buffering, which can dramatically reduce the number of random I/Os needed for non-ordered data sets. For well-ordered data sets the benefit is smaller or non-existent, because only a small number of pages receive new tuples at a time, and those pages fit in cache even if the index as whole does not.

However, buffering index build needs to call the penalty function more often, which consumes some extra CPU resources. Also, the buffers used in the buffering build need temporary disk space, up to the size of the resulting index. Buffering can also influence the quality of the resulting index, in both positive and negative directions. That influence depends on various factors, like the distribution of the input data and the operator class implementation.

By default, a GiST index build switches to the buffering method when the index size reaches effective_cache_size. It can be manually turned on or off by the buffering parameter to the CREATE INDEX command. The default behavior is good for most cases, but turning buffering off might speed up the build somewhat if the input data is ordered.

64.5. Examples

The PostgreSQL source distribution includes several examples of index methods implemented using GiST. The core system currently provides text search support (indexing for tsvector and tsquery) as well as R-Tree equivalent functionality for some of the built-in geometric data types (see src/backend/access/gist/gistproc.c). The following contrib modules also contain GiST operator classes:

btree_gist

B-tree equivalent functionality for several data types

cube

Indexing for multidimensional cubes

hstore

Module for storing (key, value) pairs

intarray

RD-Tree for one-dimensional array of int4 values

ltree

Indexing for tree-like structures

pg_trgm

Text similarity using trigram matching

seg

Indexing for “float ranges”

65. SP-GiST Indexes

65.1. Introduction

SP-GiST is an abbreviation for space-partitioned GiST. SP-GiST supports partitioned search trees, which facilitate development of a wide range of different non-balanced data structures, such as quad-trees, k-d trees, and radix trees (tries). The common feature of these structures is that they repeatedly divide the search space into partitions that need not be of equal size. Searches that are well matched to the partitioning rule can be very fast.

These popular data structures were originally developed for in-memory usage. In main memory, they are usually designed as a set of dynamically allocated nodes linked by pointers. This is not suitable for direct storing on disk, since these chains of pointers can be rather long which would require too many disk accesses. In contrast, disk-based data structures should have a high fanout to minimize I/O. The challenge addressed by SP-GiST is to map search tree nodes to disk pages in such a way that a search need access only a few disk pages, even if it traverses many nodes.

Like GiST, SP-GiST is meant to allow the development of custom data types with the appropriate access methods, by an expert in the domain of the data type, rather than a database expert.

Some of the information here is derived from Purdue University's SP-GiST Indexing Project web site. The SP-GiST implementation in PostgreSQL is primarily maintained by Teodor Sigaev and Oleg Bartunov, and there is more information on their web site.

65.2. Built-in Operator Classes

The core PostgreSQL distribution includes the SP-GiST operator classes shown in Table 63.1.

Table 63.1. Built-in SP-GiST Operator Classes

Of the two operator classes for type point, quad_point_ops is the default. kd_point_ops supports the same operators but uses a different index data structure which may offer better performance in some applications.

65.3. Extensibility

SP-GiST offers an interface with a high level of abstraction, requiring the access method developer to implement only methods specific to a given data type. The SP-GiST core is responsible for efficient disk mapping and searching the tree structure. It also takes care of concurrency and logging considerations.

Leaf tuples of an SP-GiST tree contain values of the same data type as the indexed column. Leaf tuples at the root level will always contain the original indexed data value, but leaf tuples at lower levels might contain only a compressed representation, such as a suffix. In that case the operator class support functions must be able to reconstruct the original value using information accumulated from the inner tuples that are passed through to reach the leaf level.

Inner tuples are more complex, since they are branching points in the search tree. Each inner tuple contains a set of one or more nodes, which represent groups of similar leaf values. A node contains a downlink that leads either to another, lower-level inner tuple, or to a short list of leaf tuples that all lie on the same index page. Each node normally has a label that describes it; for example, in a radix tree the node label could be the next character of the string value. (Alternatively, an operator class can omit the node labels, if it works with a fixed set of nodes for all inner tuples; see Section 63.4.2.) Optionally, an inner tuple can have a prefix value that describes all its members. In a radix tree this could be the common prefix of the represented strings. The prefix value is not necessarily really a prefix, but can be any data needed by the operator class; for example, in a quad-tree it can store the central point that the four quadrants are measured with respect to. A quad-tree inner tuple would then also contain four nodes corresponding to the quadrants around this central point.

Some tree algorithms require knowledge of level (or depth) of the current tuple, so the SP-GiST core provides the possibility for operator classes to manage level counting while descending the tree. There is also support for incrementally reconstructing the represented value when that is needed, and for passing down additional data (called traverse values) during a tree descent.

Note

The SP-GiST core code takes care of null entries. Although SP-GiST indexes do store entries for nulls in indexed columns, this is hidden from the index operator class code: no null index entries or search conditions will ever be passed to the operator class methods. (It is assumed that SP-GiST operators are strict and so cannot succeed for null values.) Null values are therefore not discussed further here.

There are five user-defined methods that an index operator class for SP-GiST must provide. All five follow the convention of accepting two internal arguments, the first of which is a pointer to a C struct containing input values for the support method, while the second argument is a pointer to a C struct where output values must be placed. Four of the methods just return void, since all their results appear in the output struct; but leaf_consistent additionally returns a boolean result. The methods must not modify any fields of their input structs. In all cases, the output struct is initialized to zeroes before calling the user-defined method.

The five user-defined methods are:config

Returns static information about the index implementation, including the data type OIDs of the prefix and node label data types.

The SQL declaration of the function must look like this:

CREATE FUNCTION my_config(internal, internal) RETURNS void ...

The first argument is a pointer to a spgConfigIn C struct, containing input data for the function. The second argument is a pointer to a spgConfigOut C struct, which the function must fill with result data.

typedef struct spgConfigIn
{
    Oid         attType;        /* Data type to be indexed */
} spgConfigIn;

typedef struct spgConfigOut
{
    Oid         prefixType;     /* Data type of inner-tuple prefixes */
    Oid         labelType;      /* Data type of inner-tuple node labels */
    bool        canReturnData;  /* Opclass can reconstruct original data */
    bool        longValuesOK;   /* Opclass can cope with values > 1 page */
} spgConfigOut;

attType is passed in order to support polymorphic index operator classes; for ordinary fixed-data-type operator classes, it will always have the same value and so can be ignored.

For operator classes that do not use prefixes, prefixType can be set to VOIDOID. Likewise, for operator classes that do not use node labels, labelType can be set to VOIDOID. canReturnData should be set true if the operator class is capable of reconstructing the originally-supplied index value. longValuesOK should be set true only when the attType is of variable length and the operator class is capable of segmenting long values by repeated suffixing (see Section 63.4.1).choose

Chooses a method for inserting a new value into an inner tuple.

The SQL declaration of the function must look like this:

CREATE FUNCTION my_choose(internal, internal) RETURNS void ...

The first argument is a pointer to a spgChooseIn C struct, containing input data for the function. The second argument is a pointer to a spgChooseOut C struct, which the function must fill with result data.

typedef struct spgChooseIn
{
    Datum       datum;          /* original datum to be indexed */
    Datum       leafDatum;      /* current datum to be stored at leaf */
    int         level;          /* current level (counting from zero) */

    /* Data from current inner tuple */
    bool        allTheSame;     /* tuple is marked all-the-same? */
    bool        hasPrefix;      /* tuple has a prefix? */
    Datum       prefixDatum;    /* if so, the prefix value */
    int         nNodes;         /* number of nodes in the inner tuple */
    Datum      *nodeLabels;     /* node label values (NULL if none) */
} spgChooseIn;

typedef enum spgChooseResultType
{
    spgMatchNode = 1,           /* descend into existing node */
    spgAddNode,                 /* add a node to the inner tuple */
    spgSplitTuple               /* split inner tuple (change its prefix) */
} spgChooseResultType;

typedef struct spgChooseOut
{
    spgChooseResultType resultType;     /* action code, see above */
    union
    {
        struct                  /* results for spgMatchNode */
        {
            int         nodeN;      /* descend to this node (index from 0) */
            int         levelAdd;   /* increment level by this much */
            Datum       restDatum;  /* new leaf datum */
        }           matchNode;
        struct                  /* results for spgAddNode */
        {
            Datum       nodeLabel;  /* new node's label */
            int         nodeN;      /* where to insert it (index from 0) */
        }           addNode;
        struct                  /* results for spgSplitTuple */
        {
            /* Info to form new upper-level inner tuple with one child tuple */
            bool        prefixHasPrefix;    /* tuple should have a prefix? */
            Datum       prefixPrefixDatum;  /* if so, its value */
            int         prefixNNodes;       /* number of nodes */
            Datum      *prefixNodeLabels;   /* their labels (or NULL for
                                             * no labels) */
            int         childNodeN;         /* which node gets child tuple */

            /* Info to form new lower-level inner tuple with all old nodes */
            bool        postfixHasPrefix;   /* tuple should have a prefix? */
            Datum       postfixPrefixDatum; /* if so, its value */
        }           splitTuple;
    }           result;
} spgChooseOut;

datum is the original datum that was to be inserted into the index. leafDatum is initially the same as datum, but can change at lower levels of the tree if the choose or picksplitmethods change it. When the insertion search reaches a leaf page, the current value of leafDatum is what will be stored in the newly created leaf tuple. level is the current inner tuple's level, starting at zero for the root level. allTheSame is true if the current inner tuple is marked as containing multiple equivalent nodes (see Section 63.4.3). hasPrefix is true if the current inner tuple contains a prefix; if so, prefixDatum is its value. nNodes is the number of child nodes contained in the inner tuple, and nodeLabels is an array of their label values, or NULL if there are no labels.

The choose function can determine either that the new value matches one of the existing child nodes, or that a new child node must be added, or that the new value is inconsistent with the tuple prefix and so the inner tuple must be split to create a less restrictive prefix.

If the new value matches one of the existing child nodes, set resultType to spgMatchNode. Set nodeN to the index (from zero) of that node in the node array. Set levelAdd to the increment in level caused by descending through that node, or leave it as zero if the operator class does not use levels. Set restDatum to equal datum if the operator class does not modify datums from one level to the next, or otherwise set it to the modified value to be used as leafDatum at the next level.

If a new child node must be added, set resultType to spgAddNode. Set nodeLabel to the label to be used for the new node, and set nodeN to the index (from zero) at which to insert the node in the node array. After the node has been added, the choose function will be called again with the modified inner tuple; that call should result in an spgMatchNode result.

If the new value is inconsistent with the tuple prefix, set resultType to spgSplitTuple. This action moves all the existing nodes into a new lower-level inner tuple, and replaces the existing inner tuple with a tuple having a single downlink pointing to the new lower-level inner tuple. Set prefixHasPrefix to indicate whether the new upper tuple should have a prefix, and if so set prefixPrefixDatum to the prefix value. This new prefix value must be sufficiently less restrictive than the original to accept the new value to be indexed. Set prefixNNodes to the number of nodes needed in the new tuple, and set prefixNodeLabels to a palloc'd array holding their labels, or to NULL if node labels are not required. Note that the total size of the new upper tuple must be no more than the total size of the tuple it is replacing; this constrains the lengths of the new prefix and new labels. Set childNodeN to the index (from zero) of the node that will downlink to the new lower-level inner tuple. Set postfixHasPrefix to indicate whether the new lower-level inner tuple should have a prefix, and if so set postfixPrefixDatum to the prefix value. The combination of these two prefixes and the downlink node's label (if any) must have the same meaning as the original prefix, because there is no opportunity to alter the node labels that are moved to the new lower-level tuple, nor to change any child index entries. After the node has been split, the choose function will be called again with the replacement inner tuple. That call may return an spgAddNode result, if no suitable node was created by the spgSplitTuple action. Eventually choose must return spgMatchNode to allow the insertion to descend to the next level.picksplit

Decides how to create a new inner tuple over a set of leaf tuples.

The SQL declaration of the function must look like this:

CREATE FUNCTION my_picksplit(internal, internal) RETURNS void ...

The first argument is a pointer to a spgPickSplitIn C struct, containing input data for the function. The second argument is a pointer to a spgPickSplitOut C struct, which the function must fill with result data.

typedef struct spgPickSplitIn
{
    int         nTuples;        /* number of leaf tuples */
    Datum      *datums;         /* their datums (array of length nTuples) */
    int         level;          /* current level (counting from zero) */
} spgPickSplitIn;

typedef struct spgPickSplitOut
{
    bool        hasPrefix;      /* new inner tuple should have a prefix? */
    Datum       prefixDatum;    /* if so, its value */

    int         nNodes;         /* number of nodes for new inner tuple */
    Datum      *nodeLabels;     /* their labels (or NULL for no labels) */

    int        *mapTuplesToNodes;   /* node index for each leaf tuple */
    Datum      *leafTupleDatums;    /* datum to store in each new leaf tuple */
} spgPickSplitOut;

nTuples is the number of leaf tuples provided. datums is an array of their datum values. level is the current level that all the leaf tuples share, which will become the level of the new inner tuple.

Set hasPrefix to indicate whether the new inner tuple should have a prefix, and if so set prefixDatum to the prefix value. Set nNodes to indicate the number of nodes that the new inner tuple will contain, and set nodeLabels to an array of their label values, or to NULL if node labels are not required. Set mapTuplesToNodes to an array that gives the index (from zero) of the node that each leaf tuple should be assigned to. Set leafTupleDatums to an array of the values to be stored in the new leaf tuples (these will be the same as the input datums if the operator class does not modify datums from one level to the next). Note that the picksplit function is responsible for palloc'ing the nodeLabels,mapTuplesToNodes and leafTupleDatums arrays.

If more than one leaf tuple is supplied, it is expected that the picksplit function will classify them into more than one node; otherwise it is not possible to split the leaf tuples across multiple pages, which is the ultimate purpose of this operation. Therefore, if the picksplit function ends up placing all the leaf tuples in the same node, the core SP-GiST code will override that decision and generate an inner tuple in which the leaf tuples are assigned at random to several identically-labeled nodes. Such a tuple is marked allTheSame to signify that this has happened. The choose and inner_consistent functions must take suitable care with such inner tuples. See Section 63.4.3 for more information.

picksplit can be applied to a single leaf tuple only in the case that the config function set longValuesOK to true and a larger-than-a-page input value has been supplied. In this case the point of the operation is to strip off a prefix and produce a new, shorter leaf datum value. The call will be repeated until a leaf datum short enough to fit on a page has been produced. See Section 63.4.1 for more information.inner_consistent

Returns set of nodes (branches) to follow during tree search.

The SQL declaration of the function must look like this:

CREATE FUNCTION my_inner_consistent(internal, internal) RETURNS void ...

The first argument is a pointer to a spgInnerConsistentIn C struct, containing input data for the function. The second argument is a pointer to a spgInnerConsistentOut C struct, which the function must fill with result data.

typedef struct spgInnerConsistentIn
{
    ScanKey     scankeys;       /* array of operators and comparison values */
    int         nkeys;          /* length of array */

    Datum       reconstructedValue;     /* value reconstructed at parent */
    void       *traversalValue; /* opclass-specific traverse value */
    MemoryContext traversalMemoryContext;   /* put new traverse values here */
    int         level;          /* current level (counting from zero) */
    bool        returnData;     /* original data must be returned? */

    /* Data from current inner tuple */
    bool        allTheSame;     /* tuple is marked all-the-same? */
    bool        hasPrefix;      /* tuple has a prefix? */
    Datum       prefixDatum;    /* if so, the prefix value */
    int         nNodes;         /* number of nodes in the inner tuple */
    Datum      *nodeLabels;     /* node label values (NULL if none) */
} spgInnerConsistentIn;

typedef struct spgInnerConsistentOut
{
    int         nNodes;         /* number of child nodes to be visited */
    int        *nodeNumbers;    /* their indexes in the node array */
    int        *levelAdds;      /* increment level by this much for each */
    Datum      *reconstructedValues;    /* associated reconstructed values */
    void      **traversalValues;        /* opclass-specific traverse values */
} spgInnerConsistentOut;

The array scankeys, of length nkeys, describes the index search condition(s). These conditions are combined with AND — only index entries that satisfy all of them are interesting. (Note that nkeys = 0 implies that all index entries satisfy the query.) Usually the consistent function only cares about the sk_strategy and sk_argument fields of each array entry, which respectively give the indexable operator and comparison value. In particular it is not necessary to check sk_flags to see if the comparison value is NULL, because the SP-GiST core code will filter out such conditions. reconstructedValue is the value reconstructed for the parent tuple; it is (Datum) 0 at the root level or if the inner_consistent function did not provide a value at the parent level. traversalValue is a pointer to any traverse data passed down from the previous call of inner_consistent on the parent index tuple, or NULL at the root level. traversalMemoryContext is the memory context in which to store output traverse values (see below). level is the current inner tuple's level, starting at zero for the root level. returnData is true if reconstructed data is required for this query; this will only be so if the config function asserted canReturnData. allTheSame is true if the current inner tuple is marked “all-the-same”; in this case all the nodes have the same label (if any) and so either all or none of them match the query (see Section 63.4.3).hasPrefix is true if the current inner tuple contains a prefix; if so, prefixDatum is its value. nNodes is the number of child nodes contained in the inner tuple, and nodeLabels is an array of their label values, or NULL if the nodes do not have labels.

nNodes must be set to the number of child nodes that need to be visited by the search, and nodeNumbers must be set to an array of their indexes. If the operator class keeps track of levels, set levelAdds to an array of the level increments required when descending to each node to be visited. (Often these increments will be the same for all the nodes, but that's not necessarily so, so an array is used.) If value reconstruction is needed, set reconstructedValues to an array of the values reconstructed for each child node to be visited; otherwise, leave reconstructedValues as NULL. If it is desired to pass down additional out-of-band information (“traverse values”) to lower levels of the tree search, settraversalValues to an array of the appropriate traverse values, one for each child node to be visited; otherwise, leave traversalValues as NULL. Note that the inner_consistentfunction is responsible for palloc'ing the nodeNumbers, levelAdds, reconstructedValues, and traversalValues arrays in the current memory context. However, any output traverse values pointed to by the traversalValues array should be allocated in traversalMemoryContext. Each traverse value must be a single palloc'd chunk.leaf_consistent

Returns true if a leaf tuple satisfies a query.

The SQL declaration of the function must look like this:

CREATE FUNCTION my_leaf_consistent(internal, internal) RETURNS bool ...

The first argument is a pointer to a spgLeafConsistentIn C struct, containing input data for the function. The second argument is a pointer to a spgLeafConsistentOut C struct, which the function must fill with result data.

typedef struct spgLeafConsistentIn
{
    ScanKey     scankeys;       /* array of operators and comparison values */
    int         nkeys;          /* length of array */

    Datum       reconstructedValue;     /* value reconstructed at parent */
    void       *traversalValue; /* opclass-specific traverse value */
    int         level;          /* current level (counting from zero) */
    bool        returnData;     /* original data must be returned? */

    Datum       leafDatum;      /* datum in leaf tuple */
} spgLeafConsistentIn;

typedef struct spgLeafConsistentOut
{
    Datum       leafValue;      /* reconstructed original data, if any */
    bool        recheck;        /* set true if operator must be rechecked */
} spgLeafConsistentOut;

The array scankeys, of length nkeys, describes the index search condition(s). These conditions are combined with AND — only index entries that satisfy all of them satisfy the query. (Note that nkeys = 0 implies that all index entries satisfy the query.) Usually the consistent function only cares about the sk_strategy and sk_argument fields of each array entry, which respectively give the indexable operator and comparison value. In particular it is not necessary to check sk_flags to see if the comparison value is NULL, because the SP-GiST core code will filter out such conditions. reconstructedValue is the value reconstructed for the parent tuple; it is (Datum) 0 at the root level or if the inner_consistentfunction did not provide a value at the parent level. traversalValue is a pointer to any traverse data passed down from the previous call of inner_consistent on the parent index tuple, or NULL at the root level. level is the current leaf tuple's level, starting at zero for the root level. returnData is true if reconstructed data is required for this query; this will only be so if the config function asserted canReturnData. leafDatum is the key value stored in the current leaf tuple.

The function must return true if the leaf tuple matches the query, or false if not. In the true case, if returnData is true then leafValue must be set to the value originally supplied to be indexed for this leaf tuple. Also, recheck may be set to true if the match is uncertain and so the operator(s) must be re-applied to the actual heap tuple to verify the match.

All the SP-GiST support methods are normally called in a short-lived memory context; that is, CurrentMemoryContext will be reset after processing of each tuple. It is therefore not very important to worry about pfree'ing everything you palloc. (The config method is an exception: it should try to avoid leaking memory. But usually the config method need do nothing but assign constants into the passed parameter struct.)

If the indexed column is of a collatable data type, the index collation will be passed to all the support methods, using the standard PG_GET_COLLATION() mechanism.

65.4. Implementation

This section covers implementation details and other tricks that are useful for implementers of SP-GiST operator classes to know.

63.4.1. SP-GiST Limits

Individual leaf tuples and inner tuples must fit on a single index page (8kB by default). Therefore, when indexing values of variable-length data types, long values can only be supported by methods such as radix trees, in which each level of the tree includes a prefix that is short enough to fit on a page, and the final leaf level includes a suffix also short enough to fit on a page. The operator class should set longValuesOK to TRUE only if it is prepared to arrange for this to happen. Otherwise, the SP-GiST core will reject any request to index a value that is too large to fit on an index page.

Likewise, it is the operator class's responsibility that inner tuples do not grow too large to fit on an index page; this limits the number of child nodes that can be used in one inner tuple, as well as the maximum size of a prefix value.

Another limitation is that when an inner tuple's node points to a set of leaf tuples, those tuples must all be in the same index page. (This is a design decision to reduce seeking and save space in the links that chain such tuples together.) If the set of leaf tuples grows too large for a page, a split is performed and an intermediate inner tuple is inserted. For this to fix the problem, the new inner tuple must divide the set of leaf values into more than one node group. If the operator class's picksplit function fails to do that, the SP-GiST core resorts to extraordinary measures described in Section 63.4.3.

63.4.2. SP-GiST Without Node Labels

Some tree algorithms use a fixed set of nodes for each inner tuple; for example, in a quad-tree there are always exactly four nodes corresponding to the four quadrants around the inner tuple's centroid point. In such a case the code typically works with the nodes by number, and there is no need for explicit node labels. To suppress node labels (and thereby save some space), the picksplit function can return NULL for the nodeLabels array, and likewise the choose function can return NULL for the prefixNodeLabels array during a spgSplitTupleaction. This will in turn result in nodeLabels being NULL during subsequent calls to choose and inner_consistent. In principle, node labels could be used for some inner tuples and omitted for others in the same index.

When working with an inner tuple having unlabeled nodes, it is an error for choose to return spgAddNode, since the set of nodes is supposed to be fixed in such cases.

63.4.3. “All-the-same” Inner Tuples

The SP-GiST core can override the results of the operator class's picksplit function when picksplit fails to divide the supplied leaf values into at least two node categories. When this happens, the new inner tuple is created with multiple nodes that each have the same label (if any) that picksplit gave to the one node it did use, and the leaf values are divided at random among these equivalent nodes. The allTheSame flag is set on the inner tuple to warn the choose and inner_consistent functions that the tuple does not have the node set that they might otherwise expect.

When dealing with an allTheSame tuple, a choose result of spgMatchNode is interpreted to mean that the new value can be assigned to any of the equivalent nodes; the core code will ignore the supplied nodeN value and descend into one of the nodes at random (so as to keep the tree balanced). It is an error for choose to return spgAddNode, since that would make the nodes not all equivalent; the spgSplitTuple action must be used if the value to be inserted doesn't match the existing nodes.

When dealing with an allTheSame tuple, the inner_consistent function should return either all or none of the nodes as targets for continuing the index search, since they are all equivalent. This may or may not require any special-case code, depending on how much the inner_consistent function normally assumes about the meaning of the nodes.

65.5. Examples

The PostgreSQL source distribution includes several examples of index operator classes for SP-GiST, as described in Table 63.1. Look into src/backend/access/spgist/ and src/backend/utils/adt/ to see the code.

66. GIN 索引

66.1. 簡介

GIN 代表 Generalized Inverted Index。GIN 設計用於處理要被索引的項目是複合值的情況，並且由索引處理的查詢需要搜索出現在複合項目內的元素值。例如，這些項目可能是文件，查詢可能是搜索包含特定單詞的文件。

我們使用單詞 item 來引用要編入索引的複合值，並使用單詞索引鍵（word key）來引用元素值。GIN 總是儲存和搜索索引鍵，而不是其值。

GIN 索引儲存一組（key, posting list）對，其中 posting list 是 key 對應的一組資料列 ID。同一資料列 ID 可以出現在多個 posting list 當中，因為一個項目可以包含多個關鍵字。每個索引鍵值只儲存一次，因此對於相同鍵出現多次的情況，GIN 索引非常會緊湊。

GIN 是泛用的，因為 GIN 存取方法的語法不需要知道它加速的具體操作。相反地，它使用為特定資料型別定義的自訂策略。該策略定義瞭瞭如何從索引項目和查詢條件中提取關鍵字，以及如何確定包含查詢中某些關鍵值的資料列能夠實際滿足查詢。

GIN 的一個優點是，它允許由資料型別領域的專家而不是資料庫專家使用適當的存取方法開發自訂的資料型別。這與使用 GiST 的優點相同。

PostgreSQL 中的 GIN 實現偏主要由 Teodor Sigaev 和 Oleg Bartunov 維護。在他們的網站上有更多關於 GIN 的訊息。

66.2. 內建運算子類

主要的 PostgreSQL 版本包括 Table 64.1 中所示的 GIN 運算子類。（附錄 F 中描述的一些選用套件提供了額外的 GIN 運算子類。）

Table 66.1. Built-in GIN Operator Classes

在 jsonb 型別的兩個運算子類中，jsonb_ops 是預設值。jsonb_path_ops 支援較少的運算子，但為這些運算子提供了更好的效能。有關詳細訊息，請參閱第 8.14.4 節。

66.3. 延伸介面

The GIN interface has a high level of abstraction, requiring the access method implementer only to implement the semantics of the data type being accessed. The GIN layer itself takes care of concurrency, logging and searching the tree structure.

All it takes to get a GIN access method working is to implement a few user-defined methods, which define the behavior of keys in the tree and the relationships between keys, indexed items, and indexable queries. In short, GIN combines extensibility with generality, code reuse, and a clean interface.

There are two methods that an operator class for GIN must provide:

Datum *extractValue(Datum itemValue, int32 *nkeys, bool **nullFlags)

Returns a palloc'd array of keys given an item to be indexed. The number of returned keys must be stored into *nkeys. If any of the keys can be null, also palloc an array of *nkeys bool fields, store its address at *nullFlags, and set these null flags as needed. *nullFlags can be left NULL (its initial value) if all keys are non-null. The return value can be NULL if the item contains no keys.

Datum *extractQuery(Datum query, int32 *nkeys, StrategyNumber n, bool **pmatch, Pointer **extra_data, bool **nullFlags, int32 *searchMode)

Returns a palloc'd array of keys given a value to be queried; that is, query is the value on the right-hand side of an indexable operator whose left-hand side is the indexed column. n is the strategy number of the operator within the operator class (see Section 37.16.2). Often, extractQuery will need to consult n to determine the data type of query and the method it should use to extract key values. The number of returned keys must be stored into *nkeys. If any of the keys can be null, also palloc an array of *nkeys bool fields, store its address at *nullFlags, and set these null flags as needed. *nullFlags can be left NULL (its initial value) if all keys are non-null. The return value can be NULL if the query contains no keys.

searchMode is an output argument that allows extractQuery to specify details about how the search will be done. If *searchMode is set to GIN_SEARCH_MODE_DEFAULT (which is the value it is initialized to before call), only items that match at least one of the returned keys are considered candidate matches. If *searchMode is set to GIN_SEARCH_MODE_INCLUDE_EMPTY, then in addition to items containing at least one matching key, items that contain no keys at all are considered candidate matches. (This mode is useful for implementing is-subset-of operators, for example.) If *searchMode is set to GIN_SEARCH_MODE_ALL, then all non-null items in the index are considered candidate matches, whether they match any of the returned keys or not. (This mode is much slower than the other two choices, since it requires scanning essentially the entire index, but it may be necessary to implement corner cases correctly. An operator that needs this mode in most cases is probably not a good candidate for a GIN operator class.) The symbols to use for setting this mode are defined in access/gin.h.

pmatch is an output argument for use when partial match is supported. To use it, extractQuery must allocate an array of *nkeys bools and store its address at *pmatch. Each element of the array should be set to true if the corresponding key requires partial match, false if not. If *pmatch is set to NULL then GIN assumes partial match is not required. The variable is initialized to NULL before call, so this argument can simply be ignored by operator classes that do not support partial match.

extra_data is an output argument that allows extractQuery to pass additional data to the consistent and comparePartial methods. To use it, extractQuery must allocate an array of *nkeys pointers and store its address at *extra_data, then store whatever it wants to into the individual pointers. The variable is initialized to NULL before call, so this argument can simply be ignored by operator classes that do not require extra data. If *extra_data is set, the whole array is passed to the consistent method, and the appropriate element to the comparePartial method.

An operator class must also provide a function to check if an indexed item matches the query. It comes in two flavors, a Boolean consistent function, and a ternary triConsistent function. triConsistent covers the functionality of both, so providing triConsistent alone is sufficient. However, if the Boolean variant is significantly cheaper to calculate, it can be advantageous to provide both. If only the Boolean variant is provided, some optimizations that depend on refuting index items before fetching all the keys are disabled.

bool consistent(bool check[], StrategyNumber n, Datum query, int32 nkeys, Pointer extra_data[], bool *recheck, Datum queryKeys[], bool nullFlags[])

Returns true if an indexed item satisfies the query operator with strategy number n (or might satisfy it, if the recheck indication is returned). This function does not have direct access to the indexed item's value, since GIN does not store items explicitly. Rather, what is available is knowledge about which key values extracted from the query appear in a given indexed item. The check array has length nkeys, which is the same as the number of keys previously returned by extractQuery for this query datum. Each element of the check array is true if the indexed item contains the corresponding query key, i.e., if (check[i] == true) the i-th key of the extractQuery result array is present in the indexed item. The original query datum is passed in case the consistent method needs to consult it, and so are the queryKeys[] and nullFlags[] arrays previously returned by extractQuery. extra_data is the extra-data array returned by extractQuery, or NULL if none.

When extractQuery returns a null key in queryKeys[], the corresponding check[] element is true if the indexed item contains a null key; that is, the semantics of check[] are like IS NOT DISTINCT FROM. The consistent function can examine the corresponding nullFlags[] element if it needs to tell the difference between a regular value match and a null match.

On success, *recheck should be set to true if the heap tuple needs to be rechecked against the query operator, or false if the index test is exact. That is, a false return value guarantees that the heap tuple does not match the query; a true return value with *recheck set to false guarantees that the heap tuple does match the query; and a true return value with *recheck set to true means that the heap tuple might match the query, so it needs to be fetched and rechecked by evaluating the query operator directly against the originally indexed item.

GinTernaryValue triConsistent(GinTernaryValue check[], StrategyNumber n, Datum query, int32 nkeys, Pointer extra_data[], Datum queryKeys[], bool nullFlags[])

triConsistent is similar to consistent, but instead of Booleans in the check vector, there are three possible values for each key: GIN_TRUE, GIN_FALSE and GIN_MAYBE. GIN_FALSE and GIN_TRUE have the same meaning as regular Boolean values, while GIN_MAYBE means that the presence of that key is not known. When GIN_MAYBE values are present, the function should only return GIN_TRUE if the item certainly matches whether or not the index item contains the corresponding query keys. Likewise, the function must return GIN_FALSE only if the item certainly does not match, whether or not it contains the GIN_MAYBE keys. If the result depends on the GIN_MAYBE entries, i.e., the match cannot be confirmed or refuted based on the known query keys, the function must return GIN_MAYBE.

When there are no GIN_MAYBE values in the check vector, a GIN_MAYBE return value is the equivalent of setting the recheck flag in the Boolean consistent function.

In addition, GIN must have a way to sort the key values stored in the index. The operator class can define the sort ordering by specifying a comparison method:

int compare(Datum a, Datum b)

Compares two keys (not indexed items!) and returns an integer less than zero, zero, or greater than zero, indicating whether the first key is less than, equal to, or greater than the second. Null keys are never passed to this function.

Alternatively, if the operator class does not provide a compare method, GIN will look up the default btree operator class for the index key data type, and use its comparison function. It is recommended to specify the comparison function in a GIN operator class that is meant for just one data type, as looking up the btree operator class costs a few cycles. However, polymorphic GIN operator classes (such as array_ops) typically cannot specify a single comparison function.

Optionally, an operator class for GIN can supply the following method:

int comparePartial(Datum partial_key, Datum key, StrategyNumber n, Pointer extra_data)

Compare a partial-match query key to an index key. Returns an integer whose sign indicates the result: less than zero means the index key does not match the query, but the index scan should continue; zero means that the index key does match the query; greater than zero indicates that the index scan should stop because no more matches are possible. The strategy number n of the operator that generated the partial match query is provided, in case its semantics are needed to determine when to end the scan. Also, extra_data is the corresponding element of the extra-data array made by extractQuery, or NULL if none. Null keys are never passed to this function.

To support “partial match” queries, an operator class must provide the comparePartial method, and its extractQuery method must set the pmatch parameter when a partial-match query is encountered. See Section 66.4.2 for details.

The actual data types of the various Datum values mentioned above vary depending on the operator class. The item values passed to extractValue are always of the operator class's input type, and all key values must be of the class's STORAGE type. The type of the query argument passed to extractQuery, consistent and triConsistent is whatever is the right-hand input type of the class member operator identified by the strategy number. This need not be the same as the indexed type, so long as key values of the correct type can be extracted from it. However, it is recommended that the SQL declarations of these three support functions use the opclass's indexed data type for the query argument, even though the actual type might be something else depending on the operator.

66.4. 實作說明

在內部，GIN 索引包含在索引鍵上建構的 B-tree 索引，其中每個索引鍵是一個或多個索引項目（例如，陣列的成員）元素，並且葉結點頁面中的每個 tuple 包含指向 heap 指標的 B-tree（“posting tree”）或 heap 指標的簡易列表（“posting list”），其列表足夠小以能與其索引鍵值放進單個索引 tuple。

從 PostgreSQL 9.1 開始，null 索引鍵值可以包含在索引中。此外，placeholder null 值也包含在索引項目中，該索引項目根據 extractValue 的結果判斷為 null 或不包含任何鍵值。這可以進行空項目的搜索。

透過在複合值（欄位號碼，鍵值）上建構單個 B-tree 來實現多欄位 GIN 索引。不同欄位的鍵值可以是不同型別。

Figure 66.1. GIN Internals

64.4.1. GIN 快速更新技術

由於反向索引的固有特性，更新 GIN 索引往往會很慢：插入或更新一個 heap 資料列會導致許多項目插入到索引中（每個索引鍵從索引項目中提取一個）。從 PostgreSQL 8.4 開始，GIN 能夠透過將新的 tuple 插入臨時的未排序待處理條目列表來延遲大部分工作。當資料表被清理或自動分析時，或者當呼叫 gin_clean_pending_list 函數時，又或者待處理列表變得大於 gin_pending_list_limit 時，使用在初始索引建立期間使用的相同批次插入技術將項目移動到主要的 GIN 資料結構。這大大提升了 GIN 索引更新速度，甚至可以計算額外的清理開銷。此外，額外的工作可以通過背景程序而不是前端查詢處理來完成。

這種方法的主要缺點是除了搜尋一般索引之外，還必須掃描待處理項目列表，因此大量待處理項目將顯著拖慢搜尋速度。另一個缺點是，雖然大多數更新都很快，但是導致待處理列表變得「太大」的更新將導致觸發立即性清理工作，因此比其他更新慢得多。正確使用 autovacuum 可以儘可能地減少這些問題。

如果一致回應時間比更新速度更重要，則可以透過關閉 GIN 索引的 fastupdate 儲存參數來停用待處理項目的使用。有關詳細訊息，請參閱 CREATE INDEX。

64.4.2. Partial Match Algorithm

GIN **可以支援「部分匹配」查詢，其中查詢不確定一個或多個索引鍵能完全匹配，但可能的匹配屬於相當窄的索引鍵值範圍（以比較支持法確定的索引鍵排序順序））。 extractQuery 方法不是回傳要精確匹配的索引鍵值，而是回傳某鍵值，該鍵值是搜尋範圍的下限，並將 pmatch 標示設定為 true。然後使用 comparePartial 方法掃描關鍵的範圍。comparePartial 讓能匹配的索引鍵回傳零，對於仍在搜尋範圍內的非匹配小於零，或者如果索引鍵超出可匹配的範圍則大於零。

66.5. GIN 小巧技

Create vs. insert

由於可能為每個項目插入了許多索引鍵，因此插入 GIN 索引可能會很慢。因此，對於批次插入表格，建議在完成批次插入後刪除 GIN 索引並重新建立。

從 PostgreSQL 8.4 開始，由於使用了延遲索引，因此此建議不太必要（詳見第 66.4.1 節）。但對於非常大的更新，還是最好刪除並重新建立索引。

maintenance_work_mem

GIN 索引的建構時間對 maintenance_work_mem 設定非常敏感；在建立索引期間，不需要花費工作記憶體。

gin_pending_list_limit

在一系列插入已啟用 fastupdate 的現有 GIN 索引期間，只要列表大於 gin_pending_list_limit，系統就會清理待處理項目列表。為了避免觀察的回應時間波動，希望在背景進行待處理列表清理（即透過 autovacuum）。透過增加 gin_pending_list_limit 或使 autovacuum 更積極，可以減少手動清理操作。但是，擴大清理操作的閾值意味著如果確實發生了手動清理，則需要更長時間。

可以透過變更儲存參數來覆蓋各個 GIN 索引的 gin_pending_list_limit，並允許每個 GIN 索引具有自己的清理閾值。例如，可以僅為可以大量更新的 GIN 索引增加閾值，否則可以減少閾值。

gin_fuzzy_search_limit

開發 GIN 索引的主要目標是在 PostgreSQL 中建立對可高度延展的全文檢索支援，並且通常情況下全文檢索會回傳非常大的結果集合。然而，當查詢包含非常頻繁的單詞時，通常會發生這種情況，因此大型結果集甚至不起作用。由於從磁碟讀取許多 tuple 並對它們進行排序可能需要花費大量時間，因此這對於產品環境來說是不可接受的。（請注意，索引搜尋本身非常快。）

為了便於控制執行此類查詢，GIN 對回傳的資料列數量有一個可配置的軟性上限：gin_fuzzy_search_limit 配置參數。預設設定為 0（表示無限制）。如果設定了非零的限制，則回傳的集合是整個結果集合的子集，隨機選擇。

「軟性上限」表示回傳結果的實際數量可能與指定的限制略有不同，具體取決於查詢和系統隨機數産生器的情況。

從經驗來看，數千以上的值（例如 5000 - 20000）是比較好的範圍。

66.6. 限制

GIN 假定可索引運算子是嚴格的。這意味著將根本不對 null 值呼叫 extractValue（相反地，將自動建立 placeholder 索引項目），並且也將不對 null 查詢值呼叫 extractQuery（相反地，假定查詢將無法滿足要求）。但是請注意，支援包含在非 null 組合項目或查詢值中的 null 鍵值。

66.7. 範例

PostgreSQL 的主要發行版包括先前在 Table 66.1 中所列出的 GIN 運算子類。以下是在 contrib 套件中包含的 GIN 運算子類：

btree_gin

B-tree equivalent functionality for several data types

hstore

Module for storing (key, value) pairs

intarray

Enhanced support for int[]

pg_trgm

Text similarity using trigram matching

67. BRIN Indexes

67.1. Introduction

BRIN 意思是 Block Range Index。BRIN 設計用於處理非常大的資料表，其中某些欄位與其在資料表中的物理位置具有某些自然的相關性。區域範圍是在資料表中是物理上相鄰的一組頁面。對於每個區域範圍，索引都會儲存一些摘要資訊。例如，儲存商店銷售訂單的資料表中可能有一個日期欄位，每個訂單都放置在該欄位上，大多數時候，較早訂單的項目也會在資料表中更早出現。儲存郵遞區號欄的資料表可能會將某個城市的所有編碼自然地組合在一起。

BRIN 索引可以透過一般的 bitmap 索引掃描來滿足查詢，如果索引儲存的摘要資訊與查詢條件一致，則 BRIN 索引將回傳每個範圍內所有頁面中的所有資料列。查詢執行程序負責重新檢查這些資料列，並丟棄不符合查詢條件的資料列 - 換句話說，這些索引是失真的。由於 BRIN 索引非常小，因此與順序掃描相比，掃描索引幾乎不會增加成本，但是可以避免掃描資料表多數不符合條件的部分。

BRIN 索引將儲存的特定資料，使該索引能夠滿足的特定查詢，其取決於為索引的每一欄位選擇的運算子類。例如，具有線性排序順序的資料型別可以具有儲存每個區域範圍內的最小值和最大值的運算子類。幾何型別可以儲存區域範圍內所有物件的邊界。

區域範圍的大小在索引建立時由 pages_per_range 儲存參數決定。索引項目的數量等於頁面中關係的大小除以 pages_per_range 的值。因此，數字越小，索引就越大（由於需要儲存更多的索引項目），但是同時儲存的摘要資料可以更精確，並且在索引掃描期間可以跳過更多的資料區塊。

67.1.1. Index Maintenance

在建立的時候，將會掃描所有現有的 heap 頁面，並為每個資料範圍（包括最後可能不完整的範圍）建立一個摘要索引資料。當新的頁面充滿資料時，已經彙總的頁面範圍將會讓彙總資訊被來自新資料的資訊更新。當建立的新頁面不在上一個彙總範圍內時，該範圍就不會自動獲取彙總資訊；這些資料保持未摘要狀態，直到稍後呼叫摘要重整以建立初始摘要。可以使用 brin_summarize_range(regclass, bigint) 或 brin_summarize_new_values(regclass) 函數手動呼叫此程序。 VACUUM 時會自動處理資料表；或透過 autovacuum 執行自動彙總（INSERT 指令時）。（最後一個觸發器預設情況下處於停用狀態，可以使用 autosummarize 參數啟用。）相反地，可以使用 brin_desummarize_range(regclass, bigint) 函數對範圍進行反彙總，當索引資料不再是一個很效的索引時，此函數會很有用，因為現有值已經變更。

啟用 autosummarization 後，每次頁面範圍被填滿時，都會發送一個請求到 autovacuum，以對其執行針對該範圍的目標彙總，該請求將在下一個在同一資料庫上執行的工作程序結束時完成。如果請求佇列已滿，則不會記錄該請求，並且會將訊息發送到伺服器日誌：

LOG:  request for BRIN range summarization for index "brin_wi_idx" page 128 was not recorded

發生這種情況時，該範圍將在資料表的下一次日常 vacuum 期間修正彙總資訊。

67.2. Built-in Operator Classes

The core PostgreSQL distribution includes the BRIN operator classes shown in Table 67.1.

The minmax operator classes store the minimum and the maximum values appearing in the indexed column within the range. The inclusion operator classes store a value which includes the values in the indexed column within the range.

Table 67.1. Built-in BRIN Operator Classes

67.3. Extensibility

The BRIN interface has a high level of abstraction, requiring the access method implementer only to implement the semantics of the data type being accessed. The BRIN layer itself takes care of concurrency, logging and searching the index structure.

All it takes to get a BRIN access method working is to implement a few user-defined methods, which define the behavior of summary values stored in the index and the way they interact with scan keys. In short, BRIN combines extensibility with generality, code reuse, and a clean interface.

There are four methods that an operator class for BRIN must provide:

BrinOpcInfo *opcInfo(Oid type_oid)

Returns internal information about the indexed columns' summary data. The return value must point to a palloc'd BrinOpcInfo, which has this definition:

typedef struct BrinOpcInfo
{
    /* Number of columns stored in an index column of this opclass */
    uint16      oi_nstored;

    /* Opaque pointer for the opclass' private use */
    void       *oi_opaque;

    /* Type cache entries of the stored columns */
    TypeCacheEntry *oi_typcache[FLEXIBLE_ARRAY_MEMBER];
} BrinOpcInfo;

BrinOpcInfo.oi_opaque can be used by the operator class routines to pass information between support functions during an index scan.

bool consistent(BrinDesc *bdesc, BrinValues *column, ScanKey key)

Returns whether the ScanKey is consistent with the given indexed values for a range. The attribute number to use is passed as part of the scan key.

bool addValue(BrinDesc *bdesc, BrinValues *column, Datum newval, bool isnull)

Given an index tuple and an indexed value, modifies the indicated attribute of the tuple so that it additionally represents the new value. If any modification was done to the tuple, true is returned.

bool unionTuples(BrinDesc *bdesc, BrinValues *a, BrinValues *b)

Consolidates two index tuples. Given two index tuples, modifies the indicated attribute of the first of them so that it represents both tuples. The second tuple is not modified.

The core distribution includes support for two types of operator classes: minmax and inclusion. Operator class definitions using them are shipped for in-core data types as appropriate. Additional operator classes can be defined by the user for other data types using equivalent definitions, without having to write any source code; appropriate catalog entries being declared is enough. Note that assumptions about the semantics of operator strategies are embedded in the support functions' source code.

Operator classes that implement completely different semantics are also possible, provided implementations of the four main support functions described above are written. Note that backwards compatibility across major releases is not guaranteed: for example, additional support functions might be required in later releases.

To write an operator class for a data type that implements a totally ordered set, it is possible to use the minmax support functions alongside the corresponding operators, as shown in Table 67.2. All operator class members (functions and operators) are mandatory.

Table 67.2. Function and Support Numbers for Minmax Operator Classes

To write an operator class for a complex data type which has values included within another type, it's possible to use the inclusion support functions alongside the corresponding operators, as shown in Table 67.3. It requires only a single additional function, which can be written in any language. More functions can be defined for additional functionality. All operators are optional. Some operators require other operators, as shown as dependencies on the table.

Table 67.3. Function and Support Numbers for Inclusion Operator Classes

Support function numbers 1-10 are reserved for the BRIN internal functions, so the SQL level functions start with number 11. Support function number 11 is the main function required to build the index. It should accept two arguments with the same data type as the operator class, and return the union of them. The inclusion operator class can store union values with different data types if it is defined with the STORAGE parameter. The return value of the union function should match the STORAGE data type.

Support function numbers 12 and 14 are provided to support irregularities of built-in data types. Function number 12 is used to support network addresses from different families which are not mergeable. Function number 14 is used to support empty ranges. Function number 13 is an optional but recommended one, which allows the new value to be checked before it is passed to the union function. As the BRIN framework can shortcut some operations when the union is not changed, using this function can improve index performance.

Both minmax and inclusion operator classes support cross-data-type operators, though with these the dependencies become more complicated. The minmax operator class requires a full set of operators to be defined with both arguments having the same data type. It allows additional data types to be supported by defining extra sets of operators. Inclusion operator class operator strategies are dependent on another operator strategy as shown in Table 67.3, or the same operator strategy as themselves. They require the dependency operator to be defined with the STORAGE data type as the left-hand-side argument and the other supported data type to be the right-hand-side argument of the supported operator. See float4_minmax_ops as an example of minmax, and box_inclusion_ops as an example of inclusion.

68. 資料庫實體儲存格式

本章概述了 PostgreSQL 資料庫使用的實體儲存格式。

68.1. Database File Layout

This section describes the storage format at the level of files and directories.

Traditionally, the configuration and data files used by a database cluster are stored together within the cluster's data directory, commonly referred to as PGDATA (after the name of the environment variable that can be used to define it). A common location for PGDATA is /var/lib/pgsql/data. Multiple clusters, managed by different server instances, can exist on the same machine.

The PGDATA directory contains several subdirectories and control files, as shown in Table 68.1. In addition to these required items, the cluster configuration files postgresql.conf, pg_hba.conf, and pg_ident.conf are traditionally stored in PGDATA, although it is possible to place them elsewhere.

Table 68.1. Contents of PGDATA

For each database in the cluster there is a subdirectory within PGDATA/base, named after the database's OID in pg_database. This subdirectory is the default location for the database's files; in particular, its system catalogs are stored there.

Note that the following sections describe the behavior of the builtin heap table access method, and the builtin index access methods. Due to the extensible nature of PostgreSQL, other access methods might work differently.

Each table and index is stored in a separate file. For ordinary relations, these files are named after the table or index's filenode number, which can be found in pg_class.relfilenode. But for temporary relations, the file name is of the form tBBB_FFF, where BBB is the backend ID of the backend which created the file, and FFF is the filenode number. In either case, in addition to the main file (a/k/a main fork), each table and index has a free space map (see Section 68.3), which stores information about free space available in the relation. The free space map is stored in a file named with the filenode number plus the suffix _fsm. Tables also have a visibility map, stored in a fork with the suffix _vm, to track which pages are known to have no dead tuples. The visibility map is described further in Section 68.4. Unlogged tables and indexes have a third fork, known as the initialization fork, which is stored in a fork with the suffix _init (see Section 68.5).

Caution

Note that while a table's filenode often matches its OID, this is not necessarily the case; some operations, like TRUNCATE, REINDEX, CLUSTER and some forms of ALTER TABLE, can change the filenode while preserving the OID. Avoid assuming that filenode and table OID are the same. Also, for certain system catalogs including pg_class itself, pg_class.relfilenode contains zero. The actual filenode number of these catalogs is stored in a lower-level data structure, and can be obtained using the pg_relation_filenode() function.

When a table or index exceeds 1 GB, it is divided into gigabyte-sized segments. The first segment's file name is the same as the filenode; subsequent segments are named filenode.1, filenode.2, etc. This arrangement avoids problems on platforms that have file size limitations. (Actually, 1 GB is just the default segment size. The segment size can be adjusted using the configuration option --with-segsize when building PostgreSQL.) In principle, free space map and visibility map forks could require multiple segments as well, though this is unlikely to happen in practice.

A table that has columns with potentially large entries will have an associated TOAST table, which is used for out-of-line storage of field values that are too large to keep in the table rows proper. pg_class.reltoastrelid links from a table to its TOAST table, if any. See Section 68.2 for more information.

The contents of tables and indexes are discussed further in Section 68.6.

Tablespaces make the scenario more complicated. Each user-defined tablespace has a symbolic link inside the PGDATA/pg_tblspc directory, which points to the physical tablespace directory (i.e., the location specified in the tablespace's CREATE TABLESPACE command). This symbolic link is named after the tablespace's OID. Inside the physical tablespace directory there is a subdirectory with a name that depends on the PostgreSQL server version, such as PG_9.0_201008051. (The reason for using this subdirectory is so that successive versions of the database can use the same CREATE TABLESPACE location value without conflicts.) Within the version-specific subdirectory, there is a subdirectory for each database that has elements in the tablespace, named after the database's OID. Tables and indexes are stored within that directory, using the filenode naming scheme. The pg_default tablespace is not accessed through pg_tblspc, but corresponds to PGDATA/base. Similarly, the pg_global tablespace is not accessed through pg_tblspc, but corresponds to PGDATA/global.

The pg_relation_filepath() function shows the entire path (relative to PGDATA) of any relation. It is often useful as a substitute for remembering many of the above rules. But keep in mind that this function just gives the name of the first segment of the main fork of the relation — you may need to append a segment number and/or _fsm, _vm, or _init to find all the files associated with the relation.

Temporary files (for operations such as sorting more data than can fit in memory) are created within PGDATA/base/pgsql_tmp, or within a pgsql_tmp subdirectory of a tablespace directory if a tablespace other than pg_default is specified for them. The name of a temporary file has the form pgsql_tmpPPP.NNN, where PPP is the PID of the owning backend and NNN distinguishes different temporary files of that backend.

68.2. TOAST

本節概述了 TOAST（The Oversized-Attribute Storage Technique，超大型屬性儲存技術）。

PostgreSQL 使用固定的頁面大小（通常為 8 kB），並且不允許 tuple 跨越多個頁面。因此，不可能直接儲存非常大的字串。為了克服這個限制，將大字串壓縮和分解成多個實體資料列。這對使用者而言是無感的，對大多數後端程式碼的影響很小。該技術被親切地稱為 TOAST（或「切片麵包以來最好的東西」）。TOAST 基礎結構還用於改進記憶體中大資料值的處理。

只有某些資料型別支援 TOAST - 不需要對無法産生大字串的資料型別增加成本。為了支援 TOAST，資料型別必須具有可變長度（varlena）表示，其中，通常，任何儲存值的第一個 4 bytes 包含以 byte 為單位的總長度（包括其自身）。TOAST不會限制資料型別表示的其餘部分。統稱為 TOASTed 的特殊值表示透過修改或重新解釋此初始長度字來起作用。因此，支援 TOAST-able 資料型別的 C 語言函數必須注意它們如何處理可能的 TOASTed 輸入值：輸入實際上可能不包含 4 bytes 長度的字和內容，直到它被解除 TOAST。（這通常透過在對輸入值執行任何操作之前呼叫 PG_DETOAST_DATUM 來完成，但在某些情況下可以採用更有效的方法。有關更多詳細訊息，請參閱第 37.11.1 節。）

TOAST 使用 varlena 長度的兩位元（big-endian 機器上的高位元，little-endian 機器上的低位元），從而將 TOAST-able 資料型別的任何值的邏輯大小限制為 1 GB。當兩個位元都為零時，該值是資料型別的普通值非 TOAST，長度位元組的其餘位以位元組為單位記錄總資料大小（包括長度位元組）。當設定最高位或最低位時，該值只有一個單位元組標頭而不是普通的四位元組標頭，該位元組的其餘位元表示以位元組為單位的總資料大小（包括長度位元組）。此額外的方案支援空間高效率儲存短於 127 位元組的值，同時仍允許資料型別在需要時增長到 1 GB。具有單位元組標頭的值不在任何特定邊界上對齊，而具有四位元組標頭的值在至少四位元組邊界上對齊；與短值相比，這種省略對齊填充提供了額外的空間節省。作為特殊情況，如果單位元組標頭的剩餘位全部為零（對於自包含長度而言這是不可能的），則該值是指向外部資料的指標，具有如所描述的幾種可能的替代方案，如下所示。這種 TOAST 指標的型別和大小由儲存在資料的第二個位元組中的代碼決定。最後，當最高位元或最低位元清除為零但相鄰位置時，資料的內容已被壓縮，必須在使用前解壓縮。在這種情況下，四位元組長度字的剩餘位表示壓縮資料的總大小，而不是原始資料。請注意，對於外部資料也可以進行壓縮，但 varlena 標頭不會告訴它是否已經發生 - 而 TOAST 指標的內容則說明這件事。

如上所述，有多種類型的 TOAST 指標基準。最舊和最常見的類型是指向儲存在 TOAST 資料表中的外部資料的指標，該資料表與包含 TOAST 指標資料本身的資料表分開但與之相關聯。當要儲存在磁碟上的 tuple 太大而無法按原樣儲存時，這些磁碟指標基準由 TOAST 管理代碼（在 access/heap/tuptoaster.c 中）建立。更多細節見第 66.2.1 節。或者，TOAST 指標資料可以包含指向出現在記憶體中其他位置外部資料的指標。這些資料必然是短暫的，並且永遠不會出現在磁碟上，但它們對於避免複製和冗餘處理大量資料值非常有用。更多細節見第 66.2.2 節。

用於壓縮資料的壓縮技術是 LZ 系列壓縮技術中相當簡單且非常快速的方法。有關詳細訊息，請參閱 src/common/pg_lzcompress.c。

66.2.1. Out-of-line, on-disk TOAST storage

如果資料表的任何欄位都是可以 TOAST 的，則該資料表將擁有關連的 TOAST 資料表，其 OID 儲存在資料表的 pg_class.reltoastrelid 項目中。磁盤上 TOAST 後的值保留在 TOAST 資料表中，下面將有更詳細的描述。

將 out-of-line 的內容（在壓縮後使用）分割為最多 TOAST_MAX_CHUNK_SIZE 個字元的區塊（預設情況下，選擇此值使得四個區塊的資料列行剛好放進一個 page，大約為 2000 個字元）。每個區塊都屬於其所有資料表的 TOAST 資料表中單獨的資料列來儲存。每個 TOAST 資料表都有欄位的 chunk_id（識別特定有 TOAST 值的 OID），chunk_seq（其值中區塊的序列號）和 chunk_data（區塊的實際資料）。chunk_id 和 chunk_seq 上的唯一索引提供了對內容的快速檢索。表示線上磁碟 TOAST 值的指標資料需要儲存要查看的 TOAST 資料表 OID 以及特定值的 OID（其chunk_id）。為方便起見，指標 datum 還儲存邏輯上的 datum 大小（原始未壓縮字串長度）和實際上的儲存大小（如果套用了壓縮則會不同）。因此，允許 varlena 標頭字元，磁碟 TOAST 指標資料的總大小為 18 個位元組，不論其所表示字串大小。

僅當要儲存在資料表中的資料列內容大於 TOAST_TUPLE_THRESHOLD 字元（通常為2 kB）時，才會觸發 TOAST 機制。TOAST 程式將會壓縮或移動字串內容，直到資料列小於 TOAST_TUPLE_TARGET 個字元（通常也是 2 kB）或者不能再獲得更多的增益。在 UPDATE 操作期間，未變更字串的內容通常就保持原樣；因此，如果沒有任何 out-of-line 需要變更，則具有 out-of-line 的資料列更新就不會產生任何 TOAST 成本。

TOAST 機制識別用於在磁碟上儲存可 TOAST 欄位有四種不同策略：

PLAIN 可防止壓縮或 out-of-line 儲存方式；此外，它禁止使用 varlena 類型的單字元標頭。對於非 TOAST-capable 資料型別欄位，這是唯一可行的策略。
EXTENDED 允許壓縮和 out-of-line 儲存。這是大多數 TOAST-capable 資料型別的預設方式。首先嘗試壓縮，然後在資料列仍然太大的情況下進行 out-of-line 儲存。
EXTERNAL 允許 out-of-line 儲存但不允許壓縮。使用 EXTERNAL 將使大量文字和 bytea 欄位上的子字串操作更快（以增加的儲存空間為代價），因為這些操作被最佳化為在未壓縮時僅獲取 out-of-line 內容所需的部分。
MAIN 允許壓縮但不允許 out-of-line 儲存。（實際上，仍然會為這些欄位執行 out-of-line 儲存，但只有在沒有其他方法使資料列足夠小到適合頁面時才做的最後手段。）

每個 TOAST-able 資料型別會為該型別的欄位指定預設策略，但是可以使用 ALTER TABLE ... SET STORAGE 變更指定資料表欄位的策略。

與更直覺的方法（例如允許資料列內容跨越頁面）相比，此方案具有許多優點。假設查詢通常透過與相對較小的鍵值進行比較來過濾，執行程序的大部分工作將使用主要欄位完成。 TOASTed 屬性的大量內容只會在結果集發送到用戶端時被取出（如果選中的話）。因此，與沒有任何外部儲存的情況相比，主要資料表更小並且其更多資料列置於共享緩衝區高速處理。排序集合也會縮小，而排序通常完全在記憶體中完成。一個小小的測試顯示，包含典型 HTML 頁面及其 URL 的資料儲存在大約一半的原始資料大小（包括 TOAST 資料表）中，並且主要資料表僅包含大約 10％的內容（URL 和一些小的 HTML）。與未轉換的相比，並沒有執行時間差異，其中所有 HTML 頁面都被削減到 7 kB 以適應頁面。

66.2.2. Out-of-line, in-memory TOAST storage

TOAST pointers can point to data that is not on disk, but is elsewhere in the memory of the current server process. Such pointers obviously cannot be long-lived, but they are nonetheless useful. There are currently two sub-cases: pointers to indirect data and pointers to expanded data.

Indirect TOAST pointers simply point at a non-indirect varlena value stored somewhere in memory. This case was originally created merely as a proof of concept, but it is currently used during logical decoding to avoid possibly having to create physical tuples exceeding 1 GB (as pulling all out-of-line field values into the tuple might do). The case is of limited use since the creator of the pointer datum is entirely responsible that the referenced data survives for as long as the pointer could exist, and there is no infrastructure to help with this.

Expanded TOAST pointers are useful for complex data types whose on-disk representation is not especially suited for computational purposes. As an example, the standard varlena representation of a PostgreSQL array includes dimensionality information, a nulls bitmap if there are any null elements, then the values of all the elements in order. When the element type itself is variable-length, the only way to find the N'th element is to scan through all the preceding elements. This representation is appropriate for on-disk storage because of its compactness, but for computations with the array it's much nicer to have an “expanded” or “deconstructed” representation in which all the element starting locations have been identified. The TOAST pointer mechanism supports this need by allowing a pass-by-reference Datum to point to either a standard varlena value (the on-disk representation) or a TOAST pointer that points to an expanded representation somewhere in memory. The details of this expanded representation are up to the data type, though it must have a standard header and meet the other API requirements given in src/include/utils/expandeddatum.h. C-level functions working with the data type can choose to handle either representation. Functions that do not know about the expanded representation, but simply apply PG_DETOAST_DATUMto their inputs, will automatically receive the traditional varlena representation; so support for an expanded representation can be introduced incrementally, one function at a time.

TOAST pointers to expanded values are further broken down into read-write and read-only pointers. The pointed-to representation is the same either way, but a function that receives a read-write pointer is allowed to modify the referenced value in-place, whereas one that receives a read-only pointer must not; it must first create a copy if it wants to make a modified version of the value. This distinction and some associated conventions make it possible to avoid unnecessary copying of expanded values during query execution.

For all types of in-memory TOAST pointer, the TOAST management code ensures that no such pointer datum can accidentally get stored on disk. In-memory TOAST pointers are automatically expanded to normal in-line varlena values before storage — and then possibly converted to on-disk TOAST pointers, if the containing tuple would otherwise be too big.

68.3. Free Space Map

Each heap and index relation, except for hash indexes, has a Free Space Map (FSM) to keep track of available space in the relation. It's stored alongside the main relation data in a separate relation fork, named after the filenode number of the relation, plus a _fsm suffix. For example, if the filenode of a relation is 12345, the FSM is stored in a file called 12345_fsm, in the same directory as the main relation file.

The Free Space Map is organized as a tree of FSM pages. The bottom level FSM pages store the free space available on each heap (or index) page, using one byte to represent each such page. The upper levels aggregate information from the lower levels.

Within each FSM page is a binary tree, stored in an array with one byte per node. Each leaf node represents a heap page, or a lower level FSM page. In each non-leaf node, the higher of its children's values is stored. The maximum value in the leaf nodes is therefore stored at the root.

See src/backend/storage/freespace/README for more details on how the FSM is structured, and how it's updated and searched. The pg_freespacemap module can be used to examine the information stored in free space maps.

VII. 資料庫進階

50. PostgreSQL 的內部架構

Author

50.1. The Path of a Query

50.2. How Connections Are Established

50.3. The Parser Stage

50.3.1. Parser

Note

50.3.2. Transformation Process

50.4. The PostgreSQL Rule System

50.5. Planner/Optimizer

Note

50.5.1. Generating Possible Plans

50.6. Executor

51. 系統目錄

51.3. pg_am

Note

51.7. pg_attribute

Table 51.7. pg_attribute Columns

51.8. pg_authid

51.9. pg_auth_members

51.11 pg_class

Table 52.11. pg_class Columns

51.12. pg_collation

51.13. pg_constraint

Note

Note

51.15 pg_database

51.22. pg_extension

51.26 pg_index

51.29. pg_language

Table 51.29. pg_language Columns

51.32. pg_namespace

Table 52.32. pg_namespace Columns

51.33. pg_opclass

51.38. pg_policy

Note

51.39. pg_proc

Table 51.39. pg_proc Columns

51.44. pg_rewrite

Note

51.50. pg_statistic

Table 51.50. pg_statistic Columns

51.51. pg_statistic_ext

Table 52.51. pg_statistic_ext Columns

51.54. pg_tablespace

51.56. pg_trigger

Note

Note

51.62. pg_type

51.66. pg_available_extensions

Table 51.67. pg_available_extensions Columns

51.67. pg_available_extension_versions

Table 51.68. pg_available_extension_versions Columns

51.72. pg_hba_file_rules

Table 51.73. pg_hba_file_rules Columns

51.73. pg_indexes

Table 51.74. pg_indexes Columns

51.74. pg_locks

Table 51.75. pg_locks Columns

51.79. pg_replication_origin_status

51.82 pg_roles

51.85. pg_settings

Table 52.86. pg_settings Columns

51.87. pg_stats

51.90. pg_tables

Table 51.91. pg_tables Columns

51.95. pg_views

Table 51.96. pg_views Columns

52. Frontend/Backend Protocol

52.1. Overview

52.1.1. Messaging Overview

52.1.2. Extended Query Overview

52.1.3. Formats and Format Codes

52.2. Message Flow

52.3. SASL Authentication

52.4. Streaming Replication Protocol

52.5. Logical Streaming Replication Protocol

52.5.1. Logical Streaming Replication Parameters

52.5.2. Logical Replication Protocol Messages

Table 51.7. `pg_attribute` Columns

Table 52.11. `pg_class` Columns

Table 51.29. `pg_language` Columns

Table 52.32. `pg_namespace` Columns

Table 51.39. `pg_proc` Columns

Table 51.50. `pg_statistic` Columns

Table 52.51. `pg_statistic_ext` Columns

Table 51.67. `pg_available_extensions` Columns

Table 51.68. `pg_available_extension_versions` Columns

Table 51.73. `pg_hba_file_rules` Columns

Table 51.74. `pg_indexes` Columns

Table 51.75. `pg_locks` Columns

Table 52.86. `pg_settings` Columns

Table 51.91. `pg_tables` Columns

Table 51.96. `pg_views` Columns

56.2.6. FDW Routines for `EXPLAIN`

56.2.7. FDW Routines for `ANALYZE`

56.2.8. FDW Routines for `IMPORT FOREIGN SCHEMA`