1 of 100

VII. 資料庫進階

本部分包含可能用於 PostgreSQL 研發人員的各種內容。

52. PostgreSQL 的內部架構

Author

This chapter originated as part of [sim98], Stefan Simkovics' Master's Thesis prepared at Vienna University of Technology under the direction of O.Univ.Prof.Dr. Georg Gottlob and Univ.Ass. Mag. Katrin Seyr.

本章概述了 PostgreSQL 的內部架構。閱讀以下各節後，你應該可以了解一個查詢是如何被處理的。本章的目的不是詳細描述 PostgreSQL 的內部操作，因為那樣的說明太過於詳盡。本章旨在幫助讀者理解資料庫後端內部發生的一些操作程序，從接收查詢的開始到將結果回傳給用戶端之間所發生的事。\

52.1. 處理查詢語句的流程

在這裡，我們簡單概述查詢必須經過某些階段才能得到結果。

必須建立從應用程式到 PostgreSQL 伺服器的連線。應用程式將查詢語句發送到伺服器，並等待接收伺服器送回的結果。
查詢解析程序(parser)階段檢查應用程式所發送的查詢語句是否語法正確，並建立查詢樹(query tree)。
查詢改寫(rewrite)系統採用查詢解析程序階段所建立的查詢樹，查詢要使用於查詢樹的所有規則（記錄在系統目錄中）。它執行規則內容所敘述的改寫轉換方式。改寫系統的其中一種用途是檢視表(view)的實作。每當針對檢視表（或虛擬資料表）進行查詢時，改寫系統都會將使用者的查詢覆寫為存取檢視表定義中所宣告的基本資料表查詢。
計劃程序或最佳化程序採用已改寫的查詢樹並建立一個查詢計劃，該計劃將作為執行程序的輸入。
為此，首先建立所有可能產生相同結果的查詢路徑。例如，如果要掃描的關連上有索引，則有兩條掃描路徑。一種可能性是簡單的循序掃描，另一種可能性是使用索引。接下來，估計每條路徑的執行成本，並選擇最便宜的路徑。最便宜的路徑將被產生用於執行者可以使用的完整計劃。
執行程序以遞迴方式走遍計劃樹並以計劃規劃的方式檢索資料。執行程序利用儲存系統掃描資料關連、執行排序和集合、過濾條件，在最後回傳所產生的資料內容。

在以下各節中，我們將更詳細地介紹上述每個項目，以更好地理解 PostgreSQL 的內部控制和資料結構。

52.2. 連線是如何被建立的

連線是如何被建立的

PostgreSQL採用了一種“每個使用者一個程序”的客戶端/伺服器模型。在這種模型中，每個客戶端程序(client process)只連接到一個後端程序(backend process)。由於我們事先不知道會有多少連線，所以我們必須使用一個「監督程序」，每次收到連線請求時，它就產生一個新的後端程序。這個監督程序叫做postmaster，它在指定的 TCP/IP 連線埠上監聽傳入的連線。每當它檢測到一個連線請求，它就產生一個新的後端程序。這些後端程序之間以及與實例(instance)的其他程序使用 semaphores 和共享記憶體來溝通，以確保在同時間進行資料存取中的資料完整性

客戶端程序可以是任何理解 PostgreSQL 協定的程式，該協定說明在第 55 章中。許多客戶端是基於 C 語言函式庫 libpq，但也有一些是獨立實作該協定的程式，例如 Java 的 JDBC 驅動程式。

只要連線建立之後，該客戶端就能夠送查詢到對應的後端程序。查詢會以明文的方式傳送過去，客戶端不需要做解析的操作。而對應的後端程序會解析該查詢並建立一個執行計畫(execution plan)，接著執行該計畫並透過對應的連線回傳查詢到的每筆(row)資料。

52.3. 解析器階段

解析器階段

解析器階段由兩部分組合：

解析器是透過 Unix 工具 bison 跟 flex 實作出來的並定義在 gram.y 和 scan.l 中。
轉換處理(transformation process) 負責將解析器產生出來的資料結構進行修改和加強。

52.3.1. 解析器

語法解析器必須檢查送過來的明文查詢字串是否語法正確。如果語法正確會建立一個解析樹( parser tree)並回傳，否則將回傳一個錯誤。語法解析器與詞法解析器是透過著名的 Unix 工具 bison 跟 flex 實作的。

詞法解析器定義在 scan.l 檔案裡，負責解析 identifiers、SQL keywords 等等。對於每一個找到的 identifier、keyword 會產生一個 token 並回傳給語法解析器。

語法解析器定義在 gram.y 檔案裡由一組 grammer rules 和 actions 所組成，每當滿足一個 rule 的時候就會觸發對應的 action (由 C 語言實作) 並建立出對應的解析樹。

flex 程式將檔案 scan.l 轉換成 scan.c C 語言檔案，bison 將檔案 gram.y 轉換成 gram.c C 語言檔案。轉換結束後，C 編譯器就能編譯這些檔案並建立解析器。不應該對這些產生出來的 C 檔案做任何修改，因為每次 flex 或 bison 都會改寫這些檔案。

註記

前面提到的轉換和編譯是由定義在 PostgreSQL source code 內的 makefiles 所執行。

對於 bison 或是 gram.y 中的語法規則的介紹超出了本文件的教學範圍。有許多相關的書本或是文件都在介紹 flex 和 bison。建議在學習 gram.y 中的語法前，先了解 bison 的相關原理，才不會很難理解。

52.3.2.

52.4. The PostgreSQL Rule System

PostgreSQL supports a powerful rule system for the specification of views and ambiguous view updates. Originally the PostgreSQL rule system consisted of two implementations:

The first one worked using row level processing and was implemented deep in the executor. The rule system was called whenever an individual row had been accessed. This implementation was removed in 1995 when the last official release of the Berkeley Postgres project was transformed into Postgres95.
The second implementation of the rule system is a technique called query rewriting. The rewrite system is a module that exists between the parser stage and the planner/optimizer. This technique is still implemented.

The query rewriter is discussed in some detail in Chapter 40, so there is no need to cover it here. We will only point out that both the input and the output of the rewriter are query trees, that is, there is no change in the representation or level of semantic detail in the trees. Rewriting can be thought of as a form of macro expansion.

52.5. Planner/Optimizer

The task of the planner/optimizer is to create an optimal execution plan. A given SQL query (and hence, a query tree) can be actually executed in a wide variety of different ways, each of which will produce the same set of results. If it is computationally feasible, the query optimizer will examine each of these possible execution plans, ultimately selecting the execution plan that is expected to run the fastest.

Note

In some situations, examining each possible way in which a query can be executed would take an excessive amount of time and memory space. In particular, this occurs when executing queries involving large numbers of join operations. In order to determine a reasonable (not necessarily optimal) query plan in a reasonable amount of time, PostgreSQL uses a Genetic Query Optimizer (see Chapter 59) when the number of joins exceeds a threshold (see geqo_threshold).

The planner's search procedure actually works with data structures called paths, which are simply cut-down representations of plans containing only as much information as the planner needs to make its decisions. After the cheapest path is determined, a full-fledged plan tree is built to pass to the executor. This represents the desired execution plan in sufficient detail for the executor to run it. In the rest of this section we'll ignore the distinction between paths and plans.

50.5.1. Generating Possible Plans

The planner/optimizer starts by generating plans for scanning each individual relation (table) used in the query. The possible plans are determined by the available indexes on each relation. There is always the possibility of performing a sequential scan on a relation, so a sequential scan plan is always created. Assume an index is defined on a relation (for example a B-tree index) and a query contains the restriction relation.attribute OPR constant. If relation.attribute happens to match the key of the B-tree index and OPR is one of the operators listed in the index's operator class, another plan is created using the B-tree index to scan the relation. If there are further indexes present and the restrictions in the query happen to match a key of an index, further plans will be considered. Index scan plans are also generated for indexes that have a sort ordering that can match the query's ORDER BY clause (if any), or a sort ordering that might be useful for merge joining (see below).

If the query requires joining two or more relations, plans for joining relations are considered after all feasible plans have been found for scanning single relations. The three available join strategies are:

nested loop join: The right relation is scanned once for every row found in the left relation. This strategy is easy to implement but can be very time consuming. (However, if the right relation can be scanned with an index scan, this can be a good strategy. It is possible to use values from the current row of the left relation as keys for the index scan of the right.)
merge join: Each relation is sorted on the join attributes before the join starts. Then the two relations are scanned in parallel, and matching rows are combined to form join rows. This kind of join is more attractive because each relation has to be scanned only once. The required sorting might be achieved either by an explicit sort step, or by scanning the relation in the proper order using an index on the join key.
hash join: the right relation is first scanned and loaded into a hash table, using its join attributes as hash keys. Next the left relation is scanned and the appropriate values of every row found are used as hash keys to locate the matching rows in the table.

When the query involves more than two relations, the final result must be built up by a tree of join steps, each with two inputs. The planner examines different possible join sequences to find the cheapest one.

If the query uses fewer than geqo_threshold relations, a near-exhaustive search is conducted to find the best join sequence. The planner preferentially considers joins between any two relations for which there exist a corresponding join clause in the WHERE qualification (i.e., for which a restriction like where rel1.attr1=rel2.attr2 exists). Join pairs with no join clause are considered only when there is no other choice, that is, a particular relation has no available join clauses to any other relation. All possible plans are generated for every join pair considered by the planner, and the one that is (estimated to be) the cheapest is chosen.

When geqo_threshold is exceeded, the join sequences considered are determined by heuristics, as described in Chapter 59. Otherwise the process is the same.

The finished plan tree consists of sequential or index scans of the base relations, plus nested-loop, merge, or hash join nodes as needed, plus any auxiliary steps needed, such as sort nodes or aggregate-function calculation nodes. Most of these plan node types have the additional ability to do selection (discarding rows that do not meet a specified Boolean condition) and projection (computation of a derived column set based on given column values, that is, evaluation of scalar expressions where needed). One of the responsibilities of the planner is to attach selection conditions from the WHERE clause and computation of required output expressions to the most appropriate nodes of the plan tree.

52.6. Executor

The executor takes the plan created by the planner/optimizer and recursively processes it to extract the required set of rows. This is essentially a demand-pull pipeline mechanism. Each time a plan node is called, it must deliver one more row, or report that it is done delivering rows.

To provide a concrete example, assume that the top node is a MergeJoin node. Before any merge can be done two rows have to be fetched (one from each subplan). So the executor recursively calls itself to process the subplans (it starts with the subplan attached to lefttree). The new top node (the top node of the left subplan) is, let's say, a Sort node and again recursion is needed to obtain an input row. The child node of the Sort might be a SeqScan node, representing actual reading of a table. Execution of this node causes the executor to fetch a row from the table and return it up to the calling node. The Sort node will repeatedly call its child to obtain all the rows to be sorted. When the input is exhausted (as indicated by the child node returning a NULL instead of a row), the Sort code performs the sort, and finally is able to return its first output row, namely the first one in sorted order. It keeps the remaining rows stored so that it can deliver them in sorted order in response to later demands.

The MergeJoin node similarly demands the first row from its right subplan. Then it compares the two rows to see if they can be joined; if so, it returns a join row to its caller. On the next call, or immediately if it cannot join the current pair of inputs, it advances to the next row of one table or the other (depending on how the comparison came out), and again checks for a match. Eventually, one subplan or the other is exhausted, and the MergeJoin node returns NULL to indicate that no more join rows can be formed.

Complex queries can involve many levels of plan nodes, but the general approach is the same: each node computes and returns its next output row each time it is called. Each node is also responsible for applying any selection or projection expressions that were assigned to it by the planner.

The executor mechanism is used to evaluate all four basic SQL query types: SELECT, INSERT, UPDATE, and DELETE. For SELECT, the top-level executor code only needs to send each row returned by the query plan tree off to the client. For INSERT, each returned row is inserted into the target table specified for the INSERT. This is done in a special top-level plan node called ModifyTable. (A simple INSERT ... VALUES command creates a trivial plan tree consisting of a single Result node, which computes just one result row, and ModifyTable above it to perform the insertion. But INSERT ... SELECT can demand the full power of the executor mechanism.) For UPDATE, the planner arranges that each computed row includes all the updated column values, plus the TID (tuple ID, or row ID) of the original target row; this data is fed into a ModifyTable node, which uses the information to create a new updated row and mark the old row deleted. For DELETE, the only column that is actually returned by the plan is the TID, and the ModifyTable node simply uses the TID to visit each target row and mark it deleted.

53. 系統資訊目錄

系統目錄（system catalog）是記錄資料庫管理系統儲存結構原始資料的地方，例如關於資料表和欄位的訊息以及內部日誌記錄訊息。PostgreSQL 的系統目錄是一般的資料表。您可以刪除並重新建立資料表、增加欄位、插入和更新內容，並以這種方式嚴重混淆您的系統。當然，通常情況下，不應該手動更改系統目錄，通常有 SQL 命令來執行此操作。（例如，CREATE DATABASE 向 pg_database 系統目錄插入一行 - 實際上是在磁碟上建立數據庫）。對於特別深奧的操作有一些例外，但其中很多已經隨著時間的推移而變為 SQL 命令，因此需要系統目錄的直接操作正在不斷減少。

51.3. pg_am

The catalog pg_am stores information about relation access methods. There is one row for each access method supported by the system. Currently, only tables and indexes have access methods. The requirements for table and index access methods are discussed in detail in Chapter 63 and Chapter 64 respectively.

Table 53.3. `pg_am` Columns

Note

Before PostgreSQL 9.6, pg_am contained many additional columns representing properties of index access methods. That data is now only directly visible at the C code level. However, pg_index_column_has_property() and related functions have been added to allow SQL queries to inspect index access method properties; see Table 9.71.

51.7. pg_attribute

The catalog pg_attribute stores information about table columns. There will be exactly one pg_attribute row for every column in every table in the database. (There will also be attribute entries for indexes, and indeed all objects that have pg_class entries.)

The term attribute is equivalent to column and is used for historical reasons.

Table 51.7. `pg_attribute` Columns

Name

Type

References

Description

In a dropped column's pg_attribute entry, atttypid is reset to zero, but attlen and the other fields copied from pg_type are still valid. This arrangement is needed to cope with the situation where the dropped column's data type was later dropped, and so there is no pg_type row anymore. attlen and the other fields can be used to interpret the contents of a row of the table.

51.8. pg_authid

The catalog pg_authid contains information about database authorization identifiers (roles). A role subsumes the concepts of “users” and “groups”. A user is essentially just a role with the rolcanlogin flag set. Any role (with or without rolcanlogin) can have other roles as members; see pg_auth_members.

Since this catalog contains passwords, it must not be publicly readable. pg_roles is a publicly readable view on pg_authid that blanks out the password field.

Chapter 21 contains detailed information about user and privilege management.

Because user identities are cluster-wide, pg_authid is shared across all databases of a cluster: there is only one copy of pg_authid per cluster, not one per database.

Table 51.8. `pg_authid` Columns

For an MD5 encrypted password, rolpassword column will begin with the string md5 followed by a 32-character hexadecimal MD5 hash. The MD5 hash will be of the user's password concatenated to their user name. For example, if user joe has password xyzzy, PostgreSQL will store the md5 hash of xyzzyjoe.

If the password is encrypted with SCRAM-SHA-256, it has the format:

SCRAM-SHA-256$<iteration count>:<salt>$<StoredKey>:<ServerKey>

where salt, StoredKey and ServerKey are in Base64 encoded format. This format is the same as that specified by RFC 5803.

A password that does not follow either of those formats is assumed to be unencrypted.

51.9. pg_auth_members

目錄 pg_auth_members 顯示角色之間的成員資格關連。允許任何非循環的關連。

由於角色身份是叢集範圍的，因此 pg_auth_members 在叢集的所有資料庫之間共享：每個叢集只有一個 pg_auth_members 副本，而不是每個資料庫一個副本。

Table 51.9. pg_auth_members Columns

Name

Type

References

Description

51.10. pg_cast

系統目錄 pg_cast 儲存資料型別的轉換方式，包括了內建的和使用者定義的。

需要注意的是，pg_cast 並不代表系統知道如何執行的每一種型別轉換；只有那些不能從某些通用規則中推導出來的。例如，domain 和它的基本型別之間的轉換在 pg_cast 中就沒有明確表示。另一個重要的例外是「自動 I/O 強制轉換」，即使用資料型別自己的 I/O 函數執行的轉換為 text 或其他字串型別或從 text 及其他字串型別轉換的那些，在 pg_cast 中沒有明確表示。

Table 51.10. `pg_cast` Columns

The cast functions listed in pg_cast must always take the cast source type as their first argument type, and return the cast destination type as their result type. A cast function can have up to three arguments. The second argument, if present, must be type integer; it receives the type modifier associated with the destination type, or -1 if there is none. The third argument, if present, must be type boolean; it receives true if the cast is an explicit cast, false otherwise.

It is legitimate to create a pg_cast entry in which the source and target types are the same, if the associated function takes more than one argument. Such entries represent “length coercion functions” that coerce values of the type to be legal for a particular type modifier value.

When a pg_cast entry has different source and target types and a function that takes more than one argument, it represents converting from one type to another and applying a length coercion in a single step. When no such entry is available, coercion to a type that uses a type modifier involves two steps, one to convert between data types and a second to apply the modifier.

51.11 pg_class

版本：11

The catalog pg_class catalogs tables and most everything else that has columns or is otherwise similar to a table. This includes indexes (but see also pg_index), sequences (but see also pg_sequence), views, materialized views, composite types, and TOAST tables; see relkind. Below, when we mean all of these kinds of objects we speak of “relations”. Not all columns are meaningful for all relation types.

Table 51.11. `pg_class` Columns

Several of the Boolean flags in pg_class are maintained lazily: they are guaranteed to be true if that's the correct state, but may not be reset to false immediately when the condition is no longer true. For example, relhasindex is set by CREATE INDEX, but it is never cleared by DROP INDEX. Instead, VACUUM clears relhasindex if it finds the table has no indexes. This arrangement avoids race conditions and improves concurrency.\

51.12. pg_collation

The catalog pg_collation describes the available collations, which are essentially mappings from an SQL name to operating system locale categories. See for more information.

Table 51.12. pg_collation Columns

Name

Type

References

Description

Note that the unique key on this catalog is (collname, collencoding, collnamespace) not just (collname, collnamespace). PostgreSQL generally ignores all collations that do not have collencoding equal to either the current database's encoding or -1, and creation of new entries with the same name as an entry with collencoding = -1 is forbidden. Therefore it is sufficient to use a qualified SQL name (schema.name) to identify a collation, even though this is not unique according to the catalog definition. The reason for defining the catalog this way is that initdb fills it in at cluster initialization time with entries for all locales available on the system, so it must be able to hold entries for all encodings that might ever be used in the cluster.

In the template0 database, it could be useful to create collations whose encoding does not match the database encoding, since they could match the encodings of databases later cloned from template0. This would currently have to be done manually.

51.13. pg_constraint

The catalog pg_constraint stores check, primary key, unique, foreign key, and exclusion constraints on tables. (Column constraints are not treated specially. Every column constraint is equivalent to some table constraint.) Not-null constraints are represented in the pg_attribute catalog, not here.

User-defined constraint triggers (created with CREATE CONSTRAINT TRIGGER) also give rise to an entry in this table.

Check constraints on domains are stored here, too.

Table 51.13. pg_constraint Columns

Name

Type

References

Description

In the case of an exclusion constraint, conkey is only useful for constraint elements that are simple column references. For other cases, a zero appears in conkey and the associated index must be consulted to discover the expression that is constrained. (conkey thus has the same contents as pg_index.indkey for the index.)

Note

consrc is not updated when referenced objects change; for example, it won't track renaming of columns. Rather than relying on this field, it's best to use pg_get_constraintdef() to extract the definition of a check constraint.

Note

pg_class.relchecks needs to agree with the number of check-constraint entries found in this table for each relation.

51.21. pg_event_trigger

The catalog pg_event_trigger stores event triggers. See Chapter 39 for more information.

Table 51.21. `pg_event_trigger` Columns

51.22. pg_extension

目錄 pg_extension 儲存有關已安裝延伸功能的資訊。有關延伸功能的詳細資訊，請參閱。

Table 51.22. pg_extension Columns

Name

Type

References

Description

請注意，與大多數帶有「namespace」欄位的目錄不同，extnamespace 並不暗指該延伸功能屬於該綱要(schema)。延伸功能並不在任何綱要之中。不過 extnamespace 指示包含大多數或所有延伸功能所屬物件的綱要。如果 extrelocatable 為 true，則此綱要實際上必須包含屬於該延伸功能的所有需要綱要的物件。\

51.26 pg_index

The catalog pg_index contains part of the information about indexes. The rest is mostly in pg_class.

Table 51.26. `pg_index` Columns

51.29. pg_language

版本：11

目錄 pg_language 註冊了可以撰寫函數或 stored procedure 的語言。有關語言處理程序的更多訊息，請參閱 CREATE LANGUAGE 和第 41 章。

Table 51.29. `pg_language` Columns

Name

Type

References

Description

51.32. pg_namespace

版本：11

The catalog pg_namespace stores namespaces. A namespace is the structure underlying SQL schemas: each namespace can have a separate collection of relations, types, etc. without name conflicts.

Table 51.32. `pg_namespace` Columns

51.33. pg_opclass

The catalog pg_opclass defines index access method operator classes. Each operator class defines semantics for index columns of a particular data type and a particular index access method. An operator class essentially specifies that a particular operator family is applicable to a particular indexable column data type. The set of operators from the family that are actually usable with the indexed column are whichever ones accept the column's data type as their left-hand input.

Operator classes are described at length in .

Table 51.33. pg_opclass Columns

Name

Type

References

Description

An operator class's opcmethod must match the opfmethod of its containing operator family. Also, there must be no more than one pg_opclass row having opcdefault true for any given combination of opcmethod and opcintype.

51.52. pg_subscription

目錄 pg_subscription 包含所有現有訂閱的邏輯複寫。有關邏輯複寫的相關資訊，請參閱。

與大多數系統目錄不同，pg_subscription 在叢集的所有資料庫之間是共享的：每個叢集只有一個 pg_subscription，而不是每個資料庫一個。

普通使用者沒有權限讀取 subconninfo 欄位，因為該欄位可能包含純文字密碼。

Table 51.52. `pg_subscription` Columns

51.72. pg_indexes

檢視表 pg_indexes 提供資料庫中每一個索引的資訊。

Table 51.73. `pg_indexes` Columns

54.35. pg_views

The view pg_views provides access to useful information about each view in the database.

Table 54.35. `pg_views` Columns

52.2. Message Flow

52.3. SASL Authentication

52.6. Message Data Types

52.7. Message Formats

52.8. Error and Notice Message Fields

52.9. Logical Replication Message Formats

52.10. Summary of Changes since Protocol 2.0

56. PostgreSQL 程式撰寫慣例

57. Native Language Support

67. B-Tree Indexes

67.2. Behavior of B-Tree Operator Classes

As shown in , a btree operator class must provide five comparison operators, <, <=, =, >= and >. One might expect that <> should also be part of the operator class, but it is not, because it would almost never be useful to use a <> WHERE clause in an index search. (For some purposes, the planner treats <> as associated with a btree operator class; but it finds that operator via the = operator's negator link, rather than from pg_amop.)

When several data types share near-identical sorting semantics, their operator classes can be grouped into an operator family. Doing so is advantageous because it allows the planner to make deductions about cross-type comparisons. Each operator class within the family should contain the single-type operators (and associated support functions) for its input data type, while cross-type comparison operators and support functions are “loose” in the family. It is recommendable that a complete set of cross-type operators be included in the family, thus ensuring that the planner can represent any comparison conditions that it deduces from transitivity.

There are some basic assumptions that a btree operator family must satisfy:

An = operator must be an equivalence relation; that is, for all non-null values A, B, C of the data type:
- A = A is true (reflexive law)
- if A = B, then B = A (symmetric law)
- if A = B and B = C, then A = C (transitive law)
A < operator must be a strong ordering relation; that is, for all non-null values A, B, C:
- A < A is false (irreflexive law)
- if A < B and B < C, then A < C (transitive law)
Furthermore, the ordering is total; that is, for all non-null values A, B:
- exactly one of A < B, A = B, and B < A is true (trichotomy law)
(The trichotomy law justifies the definition of the comparison support function, of course.)

The other three operators are defined in terms of = and < in the obvious way, and must act consistently with them.

For an operator family supporting multiple data types, the above laws must hold when A, B, C are taken from any data types in the family. The transitive laws are the trickiest to ensure, as in cross-type situations they represent statements that the behaviors of two or three different operators are consistent. As an example, it would not work to put float8 and numeric into the same operator family, at least not with the current semantics that numeric values are converted to float8 for comparison to a float8. Because of the limited accuracy of float8, this means there are distinct numeric values that will compare equal to the same float8 value, and thus the transitive law would fail.

Another requirement for a multiple-data-type family is that any implicit or binary-coercion casts that are defined between data types included in the operator family must not change the associated sort ordering.

It should be fairly clear why a btree index requires these laws to hold within a single data type: without them there is no ordering to arrange the keys with. Also, index searches using a comparison key of a different data type require comparisons to behave sanely across two data types. The extensions to three or more data types within a family are not strictly required by the btree index mechanism itself, but the planner relies on them for optimization purposes.

68. GiST Indexes

52.4. Streaming Replication Protocol

To initiate streaming replication, the frontend sends the replication parameter in the startup message. A Boolean value of true (or on, yes, 1) tells the backend to go into physical replication walsender mode, wherein a small set of replication commands, shown below, can be issued instead of SQL statements.

Passing database as the value for the replication parameter instructs the backend to go into logical replication walsender mode, connecting to the database specified in the dbname parameter. In logical replication walsender mode, the replication commands shown below as well as normal SQL commands can be issued.

In either physical replication or logical replication walsender mode, only the simple query protocol can be used.

For the purpose of testing replication commands, you can make a replication connection via psql or any other libpq-using tool with a connection string including the replication option, e.g.:

psql "dbname=postgres replication=database" -c "IDENTIFY_SYSTEM;"

However, it is often more useful to use pg_receivewal (for physical replication) or pg_recvlogical (for logical replication).

Replication commands are logged in the server log when log_replication_commands is enabled.

The commands accepted in replication mode are:

IDENTIFY_SYSTEM

Requests the server to identify itself. Server replies with a result set of a single row, containing four fields:

systemid (text)

The unique system identifier identifying the cluster. This can be used to check that the base backup used to initialize the standby came from the same cluster.

timeline (int4)

Current timeline ID. Also useful to check that the standby is consistent with the master.

xlogpos (text)

Current WAL flush location. Useful to get a known location in the write-ahead log where streaming can start.

dbname (text)

Database connected to or null.

SHOW name

Requests the server to send the current setting of a run-time parameter. This is similar to the SQL command SHOW.

name

The name of a run-time parameter. Available parameters are documented in Chapter 19.

TIMELINE_HISTORY tli

Requests the server to send over the timeline history file for timeline tli. Server replies with a result set of a single row, containing two fields:

filename (text)

File name of the timeline history file, e.g., 00000002.history.

content (bytea)

Contents of the timeline history file.

CREATE_REPLICATION_SLOT slot_name [ TEMPORARY ] { PHYSICAL [ RESERVE_WAL ] | LOGICAL output_plugin [ EXPORT_SNAPSHOT | NOEXPORT_SNAPSHOT | USE_SNAPSHOT ] }

Create a physical or logical replication slot. See Section 26.2.6 for more about replication slots.

slot_name

The name of the slot to create. Must be a valid replication slot name (see Section 26.2.6.1).

output_plugin

The name of the output plugin used for logical decoding (see Section 48.6).

TEMPORARY

Specify that this replication slot is a temporary one. Temporary slots are not saved to disk and are automatically dropped on error or when the session has finished.

RESERVE_WAL

Specify that this physical replication slot reserves WAL immediately. Otherwise, WAL is only reserved upon connection from a streaming replication client.

EXPORT_SNAPSHOT NOEXPORT_SNAPSHOT USE_SNAPSHOT

Decides what to do with the snapshot created during logical slot initialization. EXPORT_SNAPSHOT, which is the default, will export the snapshot for use in other sessions. This option can't be used inside a transaction. USE_SNAPSHOT will use the snapshot for the current transaction executing the command. This option must be used in a transaction, and CREATE_REPLICATION_SLOT must be the first command run in that transaction. Finally, NOEXPORT_SNAPSHOT will just use the snapshot for logical decoding as normal but won't do anything else with it.

In response to this command, the server will send a one-row result set containing the following fields:

slot_name (text)

The name of the newly-created replication slot.

consistent_point (text)

The WAL location at which the slot became consistent. This is the earliest location from which streaming can start on this replication slot.

snapshot_name (text)

The identifier of the snapshot exported by the command. The snapshot is valid until a new command is executed on this connection or the replication connection is closed. Null if the created slot is physical.

output_plugin (text)

The name of the output plugin used by the newly-created replication slot. Null if the created slot is physical.

START_REPLICATION [ SLOT slot_name ] [ PHYSICAL ] XXX/XXX [ TIMELINE tli ]

Instructs server to start streaming WAL, starting at WAL location XXX/XXX. If TIMELINE option is specified, streaming starts on timeline tli; otherwise, the server's current timeline is selected. The server can reply with an error, for example if the requested section of WAL has already been recycled. On success, server responds with a CopyBothResponse message, and then starts to stream WAL to the frontend.

If a slot's name is provided via slot_name, it will be updated as replication progresses so that the server knows which WAL segments, and if hot_standby_feedback is on which transactions, are still needed by the standby.

If the client requests a timeline that's not the latest but is part of the history of the server, the server will stream all the WAL on that timeline starting from the requested start point up to the point where the server switched to another timeline. If the client requests streaming at exactly the end of an old timeline, the server responds immediately with CommandComplete without entering COPY mode.

After streaming all the WAL on a timeline that is not the latest one, the server will end streaming by exiting the COPY mode. When the client acknowledges this by also exiting COPY mode, the server sends a result set with one row and two columns, indicating the next timeline in this server's history. The first column is the next timeline's ID (type int8), and the second column is the WAL location where the switch happened (type text). Usually, the switch position is the end of the WAL that was streamed, but there are corner cases where the server can send some WAL from the old timeline that it has not itself replayed before promoting. Finally, the server sends two CommandComplete messages (one that ends the CopyData and the other ends the START_REPLICATION itself), and is ready to accept a new command.

WAL data is sent as a series of CopyData messages. (This allows other information to be intermixed; in particular the server can send an ErrorResponse message if it encounters a failure after beginning to stream.) The payload of each CopyData message from server to the client contains a message of one of the following formats:

XLogData (B)

Byte1('w')

Identifies the message as WAL data.

Int64

The starting point of the WAL data in this message.

Int64

The current end of WAL on the server.

Int64

The server's system clock at the time of transmission, as microseconds since midnight on 2000-01-01.

Byte_n_

A section of the WAL data stream.

A single WAL record is never split across two XLogData messages. When a WAL record crosses a WAL page boundary, and is therefore already split using continuation records, it can be split at the page boundary. In other words, the first main WAL record and its continuation records can be sent in different XLogData messages.

Primary keepalive message (B)

Byte1('k')

Identifies the message as a sender keepalive.

Int64

The current end of WAL on the server.

Int64

The server's system clock at the time of transmission, as microseconds since midnight on 2000-01-01.

Byte1

1 means that the client should reply to this message as soon as possible, to avoid a timeout disconnect. 0 otherwise.

The receiving process can send replies back to the sender at any time, using one of the following message formats (also in the payload of a CopyData message):

Standby status update (F)

Byte1('r')

Identifies the message as a receiver status update.

Int64

The location of the last WAL byte + 1 received and written to disk in the standby.

Int64

The location of the last WAL byte + 1 flushed to disk in the standby.Int64

The location of the last WAL byte + 1 applied in the standby.Int64

The client's system clock at the time of transmission, as microseconds since midnight on 2000-01-01.Byte1

If 1, the client requests the server to reply to this message immediately. This can be used to ping the server, to test if the connection is still healthy.Hot Standby feedback message (F)Byte1('h')

Identifies the message as a Hot Standby feedback message.Int64

The client's system clock at the time of transmission, as microseconds since midnight on 2000-01-01.Int32

The standby's current global xmin, excluding the catalog_xmin from any replication slots. If both this value and the following catalog_xmin are 0 this is treated as a notification that Hot Standby feedback will no longer be sent on this connection. Later non-zero messages may reinitiate the feedback mechanism.Int32

The epoch of the global xmin xid on the standby.Int32

The lowest catalog_xmin of any replication slots on the standby. Set to 0 if no catalog_xmin exists on the standby or if hot standby feedback is being disabled.Int32

The epoch of the catalog_xmin xid on the standby.START_REPLICATION SLOT slot_name LOGICAL XXX/XXX [ ( option_name [ option_value ] [, ...] ) ]

Instructs server to start streaming WAL for logical replication, starting at WAL location XXX/XXX. The server can reply with an error, for example if the requested section of WAL has already been recycled. On success, server responds with a CopyBothResponse message, and then starts to stream WAL to the frontend.

The messages inside the CopyBothResponse messages are of the same format documented for START_REPLICATION ... PHYSICAL, including two CommandComplete messages.

The output plugin associated with the selected slot is used to process the output for streaming.SLOT slot_name

The name of the slot to stream changes from. This parameter is required, and must correspond to an existing logical replication slot created with CREATE_REPLICATION_SLOT in LOGICAL mode.XXX/XXX

The WAL location to begin streaming at.option_name

The name of an option passed to the slot's logical decoding plugin.option_value

Optional value, in the form of a string constant, associated with the specified option.DROP_REPLICATION_SLOT slot_name [ WAIT ]

Drops a replication slot, freeing any reserved server-side resources. If the slot is a logical slot that was created in a database other than the database the walsender is connected to, this command fails.slot_name

The name of the slot to drop.WAIT

This option causes the command to wait if the slot is active until it becomes inactive, instead of the default behavior of raising an error.BASE_BACKUP [ LABEL 'label' ] [ PROGRESS ] [ FAST ] [ WAL ] [ NOWAIT ] [ MAX_RATE rate ] [ TABLESPACE_MAP ] [ NOVERIFY_CHECKSUMS ] [ MANIFEST manifest_option ] [ MANIFEST_CHECKSUMS checksum_algorithm ]

Instructs the server to start streaming a base backup. The system will automatically be put in backup mode before the backup is started, and taken out of it when the backup is complete. The following options are accepted:LABEL 'label'

Sets the label of the backup. If none is specified, a backup label of base backup will be used. The quoting rules for the label are the same as a standard SQL string with standard_conforming_strings turned on.PROGRESS

Request information required to generate a progress report. This will send back an approximate size in the header of each tablespace, which can be used to calculate how far along the stream is done. This is calculated by enumerating all the file sizes once before the transfer is even started, and might as such have a negative impact on the performance. In particular, it might take longer before the first data is streamed. Since the database files can change during the backup, the size is only approximate and might both grow and shrink between the time of approximation and the sending of the actual files.FAST

Request a fast checkpoint.WAL

Include the necessary WAL segments in the backup. This will include all the files between start and stop backup in the pg_wal directory of the base directory tar file.NOWAIT

By default, the backup will wait until the last required WAL segment has been archived, or emit a warning if log archiving is not enabled. Specifying NOWAIT disables both the waiting and the warning, leaving the client responsible for ensuring the required log is available.MAX_RATE rate

Limit (throttle) the maximum amount of data transferred from server to client per unit of time. The expected unit is kilobytes per second. If this option is specified, the value must either be equal to zero or it must fall within the range from 32 kB through 1 GB (inclusive). If zero is passed or the option is not specified, no restriction is imposed on the transfer.TABLESPACE_MAP

Include information about symbolic links present in the directory pg_tblspc in a file named tablespace_map. The tablespace map file includes each symbolic link name as it exists in the directory pg_tblspc/ and the full path of that symbolic link.NOVERIFY_CHECKSUMS

By default, checksums are verified during a base backup if they are enabled. Specifying NOVERIFY_CHECKSUMS disables this verification.MANIFEST manifest_option

When this option is specified with a value of yes or force-encode, a backup manifest is created and sent along with the backup. The manifest is a list of every file present in the backup with the exception of any WAL files that may be included. It also stores the size, last modification time, and optionally a checksum for each file. A value of force-encode forces all filenames to be hex-encoded; otherwise, this type of encoding is performed only for files whose names are non-UTF8 octet sequences. force-encode is intended primarily for testing purposes, to be sure that clients which read the backup manifest can handle this case. For compatibility with previous releases, the default is MANIFEST 'no'.MANIFEST_CHECKSUMS checksum_algorithm

Specifies the checksum algorithm that should be applied to each file included in the backup manifest. Currently, the available algorithms are NONE, CRC32C, SHA224, SHA256, SHA384, and SHA512. The default is CRC32C.

When the backup is started, the server will first send two ordinary result sets, followed by one or more CopyResponse results.

The first ordinary result set contains the starting position of the backup, in a single row with two columns. The first column contains the start position given in XLogRecPtr format, and the second column contains the corresponding timeline ID.

The second ordinary result set has one row for each tablespace. The fields in this row are:spcoid (oid)

The OID of the tablespace, or null if it's the base directory.spclocation (text)

The full path of the tablespace directory, or null if it's the base directory.size (int8)

The approximate size of the tablespace, in kilobytes (1024 bytes), if progress report has been requested; otherwise it's null.

After the second regular result set, one or more CopyResponse results will be sent, one for the main data directory and one for each additional tablespace other than pg_default and pg_global. The data in the CopyResponse results will be a tar format (following the “ustar interchange format” specified in the POSIX 1003.1-2008 standard) dump of the tablespace contents, except that the two trailing blocks of zeroes specified in the standard are omitted. After the tar data is complete, and if a backup manifest was requested, another CopyResponse result is sent, containing the manifest data for the current base backup. In any case, a final ordinary result set will be sent, containing the WAL end position of the backup, in the same format as the start position.

The tar archive for the data directory and each tablespace will contain all files in the directories, regardless of whether they are PostgreSQL files or other files added to the same directory. The only excluded files are:

postmaster.pid
postmaster.opts
pg_internal.init (found in multiple directories)
Various temporary files and directories created during the operation of the PostgreSQL server, such as any file or directory beginning with pgsql_tmp and temporary relations.
Unlogged relations, except for the init fork which is required to recreate the (empty) unlogged relation on recovery.
pg_wal, including subdirectories. If the backup is run with WAL files included, a synthesized version of pg_wal will be included, but it will only contain the files necessary for the backup to work, not the rest of the contents.
pg_dynshmem, pg_notify, pg_replslot, pg_serial, pg_snapshots, pg_stat_tmp, and pg_subtrans are copied as empty directories (even if they are symbolic links).
Files other than regular files and directories, such as symbolic links (other than for the directories listed above) and special device files, are skipped. (Symbolic links in pg_tblspc are maintained.)

Owner, group, and file mode are set if the underlying file system on the server supports it.

53.2. Reporting Errors Within the Server

Error, warning, and log messages generated within the server code should be created using ereport, or its older cousin elog. The use of this function is complex enough to require some explanation.

There are two required elements for every message: a severity level (ranging from DEBUG to PANIC) and a primary message text. In addition there are optional elements, the most common of which is an error identifier code that follows the SQL spec's SQLSTATE conventions. ereport itself is just a shell function, that exists mainly for the syntactic convenience of making message generation look like a function call in the C source code. The only parameter accepted directly by ereport is the severity level. The primary message text and any optional message elements are generated by calling auxiliary functions, such as errmsg, within the ereport call.

A typical call to ereport might look like this:

ereport(ERROR,
        (errcode(ERRCODE_DIVISION_BY_ZERO),
         errmsg("division by zero")));

This specifies error severity level ERROR (a run-of-the-mill error). The errcode call specifies the SQLSTATE error code using a macro defined in src/include/utils/errcodes.h. The errmsg call provides the primary message text. Notice the extra set of parentheses surrounding the auxiliary function calls — these are annoying but syntactically necessary.

Here is a more complex example:

ereport(ERROR,
        (errcode(ERRCODE_AMBIGUOUS_FUNCTION),
         errmsg("function %s is not unique",
                func_signature_string(funcname, nargs,
                                      NIL, actual_arg_types)),
         errhint("Unable to choose a best candidate function. "
                 "You might need to add explicit typecasts.")));

This illustrates the use of format codes to embed run-time values into a message text. Also, an optional “hint” message is provided.

If the severity level is ERROR or higher, ereport aborts the execution of the user-defined function and does not return to the caller. If the severity level is lower than ERROR, ereport returns normally.

The available auxiliary routines for ereport are:

errcode(sqlerrcode) specifies the SQLSTATE error identifier code for the condition. If this routine is not called, the error identifier defaults to ERRCODE_INTERNAL_ERROR when the error severity level is ERROR or higher, ERRCODE_WARNING when the error level is WARNING, otherwise (for NOTICE and below) ERRCODE_SUCCESSFUL_COMPLETION. While these defaults are often convenient, always think whether they are appropriate before omitting the errcode() call.
errmsg(const char *msg, ...) specifies the primary error message text, and possibly run-time values to insert into it. Insertions are specified by sprintf-style format codes. In addition to the standard format codes accepted by sprintf, the format code %m can be used to insert the error message returned by strerror for the current value of errno. [13] %m does not require any corresponding entry in the parameter list for errmsg. Note that the message string will be run through gettext for possible localization before format codes are processed.
errmsg_internal(const char *msg, ...) is the same as errmsg, except that the message string will not be translated nor included in the internationalization message dictionary. This should be used for “cannot happen” cases that are probably not worth expending translation effort on.
errmsg_plural(const char *fmt_singular, const char *fmt_plural, unsigned long n, ...) is like errmsg, but with support for various plural forms of the message. fmt_singular is the English singular format, fmt_plural is the English plural format, n is the integer value that determines which plural form is needed, and the remaining arguments are formatted according to the selected format string. For more information see Section 55.2.2.
errdetail(const char *msg, ...) supplies an optional “detail” message; this is to be used when there is additional information that seems inappropriate to put in the primary message. The message string is processed in just the same way as for errmsg.
errdetail_internal(const char *msg, ...) is the same as errdetail, except that the message string will not be translated nor included in the internationalization message dictionary. This should be used for detail messages that are not worth expending translation effort on, for instance because they are too technical to be useful to most users.
errdetail_plural(const char *fmt_singular, const char *fmt_plural, unsigned long n, ...) is like errdetail, but with support for various plural forms of the message. For more information see Section 55.2.2.
errdetail_log(const char *msg, ...) is the same as errdetail except that this string goes only to the server log, never to the client. If both errdetail (or one of its equivalents above) and errdetail_log are used then one string goes to the client and the other to the log. This is useful for error details that are too security-sensitive or too bulky to include in the report sent to the client.
errdetail_log_plural(const char *fmt_singular, const char *fmt_plural, unsigned long n, ...) is like errdetail_log, but with support for various plural forms of the message. For more information see Section 55.2.2.
errhint(const char *msg, ...) supplies an optional “hint” message; this is to be used when offering suggestions about how to fix the problem, as opposed to factual details about what went wrong. The message string is processed in just the same way as for errmsg.
errcontext(const char *msg, ...) is not normally called directly from an ereport message site; rather it is used in error_context_stack callback functions to provide information about the context in which an error occurred, such as the current location in a PL function. The message string is processed in just the same way as for errmsg. Unlike the other auxiliary functions, this can be called more than once per ereport call; the successive strings thus supplied are concatenated with separating newlines.
errposition(int cursorpos) specifies the textual location of an error within a query string. Currently it is only useful for errors detected in the lexical and syntactic analysis phases of query processing.
errtable(Relation rel) specifies a relation whose name and schema name should be included as auxiliary fields in the error report.
errtablecol(Relation rel, int attnum) specifies a column whose name, table name, and schema name should be included as auxiliary fields in the error report.
errtableconstraint(Relation rel, const char *conname) specifies a table constraint whose name, table name, and schema name should be included as auxiliary fields in the error report. Indexes should be considered to be constraints for this purpose, whether or not they have an associated pg_constraint entry. Be careful to pass the underlying heap relation, not the index itself, as rel.
errdatatype(Oid datatypeOid) specifies a data type whose name and schema name should be included as auxiliary fields in the error report.
errdomainconstraint(Oid datatypeOid, const char *conname) specifies a domain constraint whose name, domain name, and schema name should be included as auxiliary fields in the error report.
errcode_for_file_access() is a convenience function that selects an appropriate SQLSTATE error identifier for a failure in a file-access-related system call. It uses the saved errno to determine which error code to generate. Usually this should be used in combination with %m in the primary error message text.
errcode_for_socket_access() is a convenience function that selects an appropriate SQLSTATE error identifier for a failure in a socket-related system call.
errhidestmt(bool hide_stmt) can be called to specify suppression of the STATEMENT: portion of a message in the postmaster log. Generally this is appropriate if the message text includes the current statement already.
errhidecontext(bool hide_ctx) can be called to specify suppression of the CONTEXT: portion of a message in the postmaster log. This should only be used for verbose debugging messages where the repeated inclusion of context would bloat the log volume too much.

Note

At most one of the functions errtable, errtablecol, errtableconstraint, errdatatype, or errdomainconstraint should be used in an ereport call. These functions exist to allow applications to extract the name of a database object associated with the error condition without having to examine the potentially-localized error message text. These functions should be used in error reports for which it's likely that applications would wish to have automatic error handling. As of PostgreSQL 9.3, complete coverage exists only for errors in SQLSTATE class 23 (integrity constraint violation), but this is likely to be expanded in future.

There is an older function elog that is still heavily used. An elog call:

elog(level, "format string", ...);

is exactly equivalent to:

ereport(level, (errmsg_internal("format string", ...)));

Notice that the SQLSTATE error code is always defaulted, and the message string is not subject to translation. Therefore, elog should be used only for internal errors and low-level debug logging. Any message that is likely to be of interest to ordinary users should go through ereport. Nonetheless, there are enough internal “cannot happen” error checks in the system that elog is still widely used; it is preferred for those messages for its notational simplicity.

Advice about writing good error messages can be found in Section 54.3.\

[13] That is, the value that was current when the ereport call was reached; changes of errno within the auxiliary reporting routines will not affect it. That would not be true if you were to write strerror(errno) explicitly in errmsg's parameter list; accordingly, do not do so.

56.4. Foreign Data Wrapper Query Planning

The FDW callback functions GetForeignRelSize, GetForeignPaths, GetForeignPlan, PlanForeignModify, GetForeignJoinPaths, GetForeignUpperPaths, and PlanDirectModify must fit into the workings of the PostgreSQL planner. Here are some notes about what they must do.

The information in root and baserel can be used to reduce the amount of information that has to be fetched from the foreign table (and therefore reduce the cost). baserel->baserestrictinfo is particularly interesting, as it contains restriction quals (WHERE clauses) that should be used to filter the rows to be fetched. (The FDW itself is not required to enforce these quals, as the core executor can check them instead.) baserel->reltarget->exprs can be used to determine which columns need to be fetched; but note that it only lists columns that have to be emitted by the ForeignScan plan node, not columns that are used in qual evaluation but not output by the query.

Various private fields are available for the FDW planning functions to keep information in. Generally, whatever you store in FDW private fields should be palloc'd, so that it will be reclaimed at the end of planning.

baserel->fdw_private is a void pointer that is available for FDW planning functions to store information relevant to the particular foreign table. The core planner does not touch it except to initialize it to NULL when the RelOptInfo node is created. It is useful for passing information forward from GetForeignRelSize to GetForeignPaths and/or GetForeignPaths to GetForeignPlan, thereby avoiding recalculation.

GetForeignPaths can identify the meaning of different access paths by storing private information in the fdw_private field of ForeignPath nodes. fdw_private is declared as a List pointer, but could actually contain anything since the core planner does not touch it. However, best practice is to use a representation that's dumpable by nodeToString, for use with debugging support available in the backend.

GetForeignPlan can examine the fdw_private field of the selected ForeignPath node, and can generate fdw_exprs and fdw_private lists to be placed in the ForeignScan plan node, where they will be available at execution time. Both of these lists must be represented in a form that copyObject knows how to copy. The fdw_private list has no other restrictions and is not interpreted by the core backend in any way. The fdw_exprs list, if not NIL, is expected to contain expression trees that are intended to be executed at run time. These trees will undergo post-processing by the planner to make them fully executable.

In GetForeignPlan, generally the passed-in target list can be copied into the plan node as-is. The passed scan_clauses list contains the same clauses as baserel->baserestrictinfo, but may be re-ordered for better execution efficiency. In simple cases the FDW can just strip RestrictInfo nodes from the scan_clauses list (using extract_actual_clauses) and put all the clauses into the plan node's qual list, which means that all the clauses will be checked by the executor at run time. More complex FDWs may be able to check some of the clauses internally, in which case those clauses can be removed from the plan node's qual list so that the executor doesn't waste time rechecking them.

As an example, the FDW might identify some restriction clauses of the form foreign_variable = sub_expression, which it determines can be executed on the remote server given the locally-evaluated value of the sub_expression. The actual identification of such a clause should happen during GetForeignPaths, since it would affect the cost estimate for the path. The path's fdw_private field would probably include a pointer to the identified clause's RestrictInfo node. Then GetForeignPlan would remove that clause from scan_clauses, but add the sub_expression to fdw_exprs to ensure that it gets massaged into executable form. It would probably also put control information into the plan node's fdw_private field to tell the execution functions what to do at run time. The query transmitted to the remote server would involve something like WHERE foreign_variable = $1, with the parameter value obtained at run time from evaluation of the fdw_exprs expression tree.

Any clauses removed from the plan node's qual list must instead be added to fdw_recheck_quals or rechecked by RecheckForeignScan in order to ensure correct behavior at the READ COMMITTED isolation level. When a concurrent update occurs for some other table involved in the query, the executor may need to verify that all of the original quals are still satisfied for the tuple, possibly against a different set of parameter values. Using fdw_recheck_quals is typically easier than implementing checks inside RecheckForeignScan, but this method will be insufficient when outer joins have been pushed down, since the join tuples in that case might have some fields go to NULL without rejecting the tuple entirely.

Another ForeignScan field that can be filled by FDWs is fdw_scan_tlist, which describes the tuples returned by the FDW for this plan node. For simple foreign table scans this can be set to NIL, implying that the returned tuples have the row type declared for the foreign table. A non-NIL value must be a target list (list of TargetEntrys) containing Vars and/or expressions representing the returned columns. This might be used, for example, to show that the FDW has omitted some columns that it noticed won't be needed for the query. Also, if the FDW can compute expressions used by the query more cheaply than can be done locally, it could add those expressions to fdw_scan_tlist. Note that join plans (created from paths made by GetForeignJoinPaths) must always supply fdw_scan_tlist to describe the set of columns they will return.

The FDW should always construct at least one path that depends only on the table's restriction clauses. In join queries, it might also choose to construct path(s) that depend on join clauses, for example foreign_variable = local_variable. Such clauses will not be found in baserel->baserestrictinfo but must be sought in the relation's join lists. A path using such a clause is called a “parameterized path”. It must identify the other relations used in the selected join clause(s) with a suitable value of param_info; use get_baserel_parampathinfo to compute that value. In GetForeignPlan, the local_variable portion of the join clause would be added to fdw_exprs, and then at run time the case works the same as for an ordinary restriction clause.

If an FDW supports remote joins, GetForeignJoinPaths should produce ForeignPaths for potential remote joins in much the same way as GetForeignPaths works for base tables. Information about the intended join can be passed forward to GetForeignPlan in the same ways described above. However, baserestrictinfo is not relevant for join relations; instead, the relevant join clauses for a particular join are passed to GetForeignJoinPaths as a separate parameter (extra->restrictlist).

An FDW might additionally support direct execution of some plan actions that are above the level of scans and joins, such as grouping or aggregation. To offer such options, the FDW should generate paths and insert them into the appropriate upper relation. For example, a path representing remote aggregation should be inserted into the UPPERREL_GROUP_AGG relation, using add_path. This path will be compared on a cost basis with local aggregation performed by reading a simple scan path for the foreign relation (note that such a path must also be supplied, else there will be an error at plan time). If the remote-aggregation path wins, which it usually would, it will be converted into a plan in the usual way, by calling GetForeignPlan. The recommended place to generate such paths is in the GetForeignUpperPaths callback function, which is called for each upper relation (i.e., each post-scan/join processing step), if all the base relations of the query come from the same FDW.

PlanForeignModify and the other callbacks described in Section 56.2.4 are designed around the assumption that the foreign relation will be scanned in the usual way and then individual row updates will be driven by a local ModifyTable plan node. This approach is necessary for the general case where an update requires reading local tables as well as foreign tables. However, if the operation could be executed entirely by the foreign server, the FDW could generate a path representing that and insert it into the UPPERREL_FINAL upper relation, where it would compete against the ModifyTable approach. This approach could also be used to implement remote SELECT FOR UPDATE, rather than using the row locking callbacks described in Section 56.2.5. Keep in mind that a path inserted into UPPERREL_FINAL is responsible for implementing all behavior of the query.

When planning an UPDATE or DELETE, PlanForeignModify and PlanDirectModify can look up the RelOptInfo struct for the foreign table and make use of the baserel->fdw_private data previously created by the scan-planning functions. However, in INSERT the target table is not scanned so there is no RelOptInfo for it. The List returned by PlanForeignModify has the same restrictions as the fdw_private list of a ForeignScan plan node, that is it must contain only structures that copyObject knows how to copy.

INSERT with an ON CONFLICT clause does not support specifying the conflict target, as unique constraints or exclusion constraints on remote tables are not locally known. This in turn implies that ON CONFLICT DO UPDATE is not supported, since the specification is mandatory there.

56.2. Foreign Data Wrapper Callback Routines

The FDW handler function returns a palloc'd FdwRoutine struct containing pointers to the callback functions described below. The scan-related functions are required, the rest are optional.

The FdwRoutine struct type is declared in src/include/foreign/fdwapi.h, which see for additional details.

56.2.1. FDW Routines for Scanning Foreign Tables

void
GetForeignRelSize(PlannerInfo *root,
                  RelOptInfo *baserel,
                  Oid foreigntableid);

Obtain relation size estimates for a foreign table. This is called at the beginning of planning for a query that scans a foreign table. root is the planner's global information about the query; baserel is the planner's information about this table; and foreigntableid is the pg_class OID of the foreign table. (foreigntableid could be obtained from the planner data structures, but it's passed explicitly to save effort.)

This function should update baserel->rows to be the expected number of rows returned by the table scan, after accounting for the filtering done by the restriction quals. The initial value of baserel->rows is just a constant default estimate, which should be replaced if at all possible. The function may also choose to update baserel->width if it can compute a better estimate of the average result row width.

See Section 56.4 for additional information.

void
GetForeignPaths(PlannerInfo *root,
                RelOptInfo *baserel,
                Oid foreigntableid);

Create possible access paths for a scan on a foreign table. This is called during query planning. The parameters are the same as for GetForeignRelSize, which has already been called.

This function must generate at least one access path (ForeignPath node) for a scan on the foreign table and must call add_path to add each such path to baserel->pathlist. It's recommended to use create_foreignscan_path to build the ForeignPath nodes. The function can generate multiple access paths, e.g., a path which has valid pathkeys to represent a pre-sorted result. Each access path must contain cost estimates, and can contain any FDW-private information that is needed to identify the specific scan method intended.

See Section 56.4 for additional information.

ForeignScan *
GetForeignPlan(PlannerInfo *root,
               RelOptInfo *baserel,
               Oid foreigntableid,
               ForeignPath *best_path,
               List *tlist,
               List *scan_clauses,
               Plan *outer_plan);

Create a ForeignScan plan node from the selected foreign access path. This is called at the end of query planning. The parameters are as for GetForeignRelSize, plus the selected ForeignPath (previously produced by GetForeignPaths, GetForeignJoinPaths, or GetForeignUpperPaths), the target list to be emitted by the plan node, the restriction clauses to be enforced by the plan node, and the outer subplan of the ForeignScan, which is used for rechecks performed by RecheckForeignScan. (If the path is for a join rather than a base relation, foreigntableid is InvalidOid.)

This function must create and return a ForeignScan plan node; it's recommended to use make_foreignscan to build the ForeignScan node.

See Section 56.4 for additional information.

void
BeginForeignScan(ForeignScanState *node,
                 int eflags);

Begin executing a foreign scan. This is called during executor startup. It should perform any initialization needed before the scan can start, but not start executing the actual scan (that should be done upon the first call to IterateForeignScan). The ForeignScanState node has already been created, but its fdw_state field is still NULL. Information about the table to scan is accessible through the ForeignScanState node (in particular, from the underlying ForeignScan plan node, which contains any FDW-private information provided by GetForeignPlan). eflags contains flag bits describing the executor's operating mode for this plan node.

Note that when (eflags & EXEC_FLAG_EXPLAIN_ONLY) is true, this function should not perform any externally-visible actions; it should only do the minimum required to make the node state valid for ExplainForeignScan and EndForeignScan.

TupleTableSlot *
IterateForeignScan(ForeignScanState *node);

Fetch one row from the foreign source, returning it in a tuple table slot (the node's ScanTupleSlot should be used for this purpose). Return NULL if no more rows are available. The tuple table slot infrastructure allows either a physical or virtual tuple to be returned; in most cases the latter choice is preferable from a performance standpoint. Note that this is called in a short-lived memory context that will be reset between invocations. Create a memory context in BeginForeignScan if you need longer-lived storage, or use the es_query_cxt of the node's EState.

The rows returned must match the fdw_scan_tlist target list if one was supplied, otherwise they must match the row type of the foreign table being scanned. If you choose to optimize away fetching columns that are not needed, you should insert nulls in those column positions, or else generate a fdw_scan_tlist list with those columns omitted.

Note that PostgreSQL's executor doesn't care whether the rows returned violate any constraints that were defined on the foreign table — but the planner does care, and may optimize queries incorrectly if there are rows visible in the foreign table that do not satisfy a declared constraint. If a constraint is violated when the user has declared that the constraint should hold true, it may be appropriate to raise an error (just as you would need to do in the case of a data type mismatch).

void
ReScanForeignScan(ForeignScanState *node);

Restart the scan from the beginning. Note that any parameters the scan depends on may have changed value, so the new scan does not necessarily return exactly the same rows.

void
EndForeignScan(ForeignScanState *node);

End the scan and release resources. It is normally not important to release palloc'd memory, but for example open files and connections to remote servers should be cleaned up.

56.2.2. FDW Routines for Scanning Foreign Joins

If an FDW supports performing foreign joins remotely (rather than by fetching both tables' data and doing the join locally), it should provide this callback function:

void
GetForeignJoinPaths(PlannerInfo *root,
                    RelOptInfo *joinrel,
                    RelOptInfo *outerrel,
                    RelOptInfo *innerrel,
                    JoinType jointype,
                    JoinPathExtraData *extra);

Create possible access paths for a join of two (or more) foreign tables that all belong to the same foreign server. This optional function is called during query planning. As with GetForeignPaths, this function should generate ForeignPath path(s) for the supplied joinrel (use create_foreign_join_path to build them), and call add_path to add these paths to the set of paths considered for the join. But unlike GetForeignPaths, it is not necessary that this function succeed in creating at least one path, since paths involving local joining are always possible.

Note that this function will be invoked repeatedly for the same join relation, with different combinations of inner and outer relations; it is the responsibility of the FDW to minimize duplicated work.

If a ForeignPath path is chosen for the join, it will represent the entire join process; paths generated for the component tables and subsidiary joins will not be used. Subsequent processing of the join path proceeds much as it does for a path scanning a single foreign table. One difference is that the scanrelid of the resulting ForeignScan plan node should be set to zero, since there is no single relation that it represents; instead, the fs_relids field of the ForeignScan node represents the set of relations that were joined. (The latter field is set up automatically by the core planner code, and need not be filled by the FDW.) Another difference is that, because the column list for a remote join cannot be found from the system catalogs, the FDW must fill fdw_scan_tlist with an appropriate list of TargetEntry nodes, representing the set of columns it will supply at run time in the tuples it returns.

See Section 56.4 for additional information.

56.2.3. FDW Routines for Planning Post-Scan/Join Processing

If an FDW supports performing remote post-scan/join processing, such as remote aggregation, it should provide this callback function:

void
GetForeignUpperPaths(PlannerInfo *root,
                     UpperRelationKind stage,
                     RelOptInfo *input_rel,
                     RelOptInfo *output_rel,
                     void *extra);

Create possible access paths for upper relation processing, which is the planner's term for all post-scan/join query processing, such as aggregation, window functions, sorting, and table updates. This optional function is called during query planning. Currently, it is called only if all base relation(s) involved in the query belong to the same FDW. This function should generate ForeignPath path(s) for any post-scan/join processing that the FDW knows how to perform remotely (use create_foreign_upper_path to build them), and call add_path to add these paths to the indicated upper relation. As with GetForeignJoinPaths, it is not necessary that this function succeed in creating any paths, since paths involving local processing are always possible.

The stage parameter identifies which post-scan/join step is currently being considered. output_rel is the upper relation that should receive paths representing computation of this step, and input_rel is the relation representing the input to this step. The extra parameter provides additional details, currently, it is set only for UPPERREL_PARTIAL_GROUP_AGG or UPPERREL_GROUP_AGG, in which case it points to a GroupPathExtraData structure; or for UPPERREL_FINAL, in which case it points to a FinalPathExtraData structure. (Note that ForeignPath paths added to output_rel would typically not have any direct dependency on paths of the input_rel, since their processing is expected to be done externally. However, examining paths previously generated for the previous processing step can be useful to avoid redundant planning work.)

See Section 56.4 for additional information.

56.2.4. FDW Routines for Updating Foreign Tables

If an FDW supports writable foreign tables, it should provide some or all of the following callback functions depending on the needs and capabilities of the FDW:

void
AddForeignUpdateTargets(Query *parsetree,
                        RangeTblEntry *target_rte,
                        Relation target_relation);

UPDATE and DELETE operations are performed against rows previously fetched by the table-scanning functions. The FDW may need extra information, such as a row ID or the values of primary-key columns, to ensure that it can identify the exact row to update or delete. To support that, this function can add extra hidden, or “junk”, target columns to the list of columns that are to be retrieved from the foreign table during an UPDATE or DELETE.

To do that, add TargetEntry items to parsetree->targetList, containing expressions for the extra values to be fetched. Each such entry must be marked resjunk = true, and must have a distinct resname that will identify it at execution time. Avoid using names matching ctidN, wholerow, or wholerowN, as the core system can generate junk columns of these names. If the extra expressions are more complex than simple Vars, they must be run through eval_const_expressions before adding them to the targetlist.

Although this function is called during planning, the information provided is a bit different from that available to other planning routines. parsetree is the parse tree for the UPDATE or DELETE command, while target_rte and target_relation describe the target foreign table.

If the AddForeignUpdateTargets pointer is set to NULL, no extra target expressions are added. (This will make it impossible to implement DELETE operations, though UPDATE may still be feasible if the FDW relies on an unchanging primary key to identify rows.)

List *
PlanForeignModify(PlannerInfo *root,
                  ModifyTable *plan,
                  Index resultRelation,
                  int subplan_index);

Perform any additional planning actions needed for an insert, update, or delete on a foreign table. This function generates the FDW-private information that will be attached to the ModifyTable plan node that performs the update action. This private information must have the form of a List, and will be delivered to BeginForeignModify during the execution stage.

root is the planner's global information about the query. plan is the ModifyTable plan node, which is complete except for the fdwPrivLists field. resultRelation identifies the target foreign table by its range table index. subplan_index identifies which target of the ModifyTable plan node this is, counting from zero; use this if you want to index into plan->plans or other substructure of the plan node.

See Section 56.4 for additional information.

If the PlanForeignModify pointer is set to NULL, no additional plan-time actions are taken, and the fdw_private list delivered to BeginForeignModify will be NIL.

void
BeginForeignModify(ModifyTableState *mtstate,
                   ResultRelInfo *rinfo,
                   List *fdw_private,
                   int subplan_index,
                   int eflags);

Begin executing a foreign table modification operation. This routine is called during executor startup. It should perform any initialization needed prior to the actual table modifications. Subsequently, ExecForeignInsert, ExecForeignUpdate or ExecForeignDelete will be called for each tuple to be inserted, updated, or deleted.

mtstate is the overall state of the ModifyTable plan node being executed; global data about the plan and execution state is available via this structure. rinfo is the ResultRelInfo struct describing the target foreign table. (The ri_FdwState field of ResultRelInfo is available for the FDW to store any private state it needs for this operation.) fdw_private contains the private data generated by PlanForeignModify, if any. subplan_index identifies which target of the ModifyTable plan node this is. eflags contains flag bits describing the executor's operating mode for this plan node.

If the BeginForeignModify pointer is set to NULL, no action is taken during executor startup.

TupleTableSlot *
ExecForeignInsert(EState *estate,
                  ResultRelInfo *rinfo,
                  TupleTableSlot *slot,
                  TupleTableSlot *planSlot);

Insert one tuple into the foreign table. estate is global execution state for the query. rinfo is the ResultRelInfo struct describing the target foreign table. slot contains the tuple to be inserted; it will match the row-type definition of the foreign table. planSlot contains the tuple that was generated by the ModifyTable plan node's subplan; it differs from slot in possibly containing additional “junk” columns. (The planSlot is typically of little interest for INSERT cases, but is provided for completeness.)

The return value is either a slot containing the data that was actually inserted (this might differ from the data supplied, for example as a result of trigger actions), or NULL if no row was actually inserted (again, typically as a result of triggers). The passed-in slot can be re-used for this purpose.

The data in the returned slot is used only if the INSERT statement has a RETURNING clause or involves a view WITH CHECK OPTION; or if the foreign table has an AFTER ROW trigger. Triggers require all columns, but the FDW could choose to optimize away returning some or all columns depending on the contents of the RETURNING clause or WITH CHECK OPTION constraints. Regardless, some slot must be returned to indicate success, or the query's reported row count will be wrong.

If the ExecForeignInsert pointer is set to NULL, attempts to insert into the foreign table will fail with an error message.

Note that this function is also called when inserting routed tuples into a foreign-table partition or executing COPY FROM on a foreign table, in which case it is called in a different way than it is in the INSERT case. See the callback functions described below that allow the FDW to support that.

TupleTableSlot *
ExecForeignUpdate(EState *estate,
                  ResultRelInfo *rinfo,
                  TupleTableSlot *slot,
                  TupleTableSlot *planSlot);

Update one tuple in the foreign table. estate is global execution state for the query. rinfo is the ResultRelInfo struct describing the target foreign table. slot contains the new data for the tuple; it will match the row-type definition of the foreign table. planSlot contains the tuple that was generated by the ModifyTable plan node's subplan; it differs from slot in possibly containing additional “junk” columns. In particular, any junk columns that were requested by AddForeignUpdateTargets will be available from this slot.

The return value is either a slot containing the row as it was actually updated (this might differ from the data supplied, for example as a result of trigger actions), or NULL if no row was actually updated (again, typically as a result of triggers). The passed-in slot can be re-used for this purpose.

The data in the returned slot is used only if the UPDATE statement has a RETURNING clause or involves a view WITH CHECK OPTION; or if the foreign table has an AFTER ROW trigger. Triggers require all columns, but the FDW could choose to optimize away returning some or all columns depending on the contents of the RETURNING clause or WITH CHECK OPTION constraints. Regardless, some slot must be returned to indicate success, or the query's reported row count will be wrong.

If the ExecForeignUpdate pointer is set to NULL, attempts to update the foreign table will fail with an error message.

TupleTableSlot *
ExecForeignDelete(EState *estate,
                  ResultRelInfo *rinfo,
                  TupleTableSlot *slot,
                  TupleTableSlot *planSlot);

Delete one tuple from the foreign table. estate is global execution state for the query. rinfo is the ResultRelInfo struct describing the target foreign table. slot contains nothing useful upon call, but can be used to hold the returned tuple. planSlot contains the tuple that was generated by the ModifyTable plan node's subplan; in particular, it will carry any junk columns that were requested by AddForeignUpdateTargets. The junk column(s) must be used to identify the tuple to be deleted.

The return value is either a slot containing the row that was deleted, or NULL if no row was deleted (typically as a result of triggers). The passed-in slot can be used to hold the tuple to be returned.

The data in the returned slot is used only if the DELETE query has a RETURNING clause or the foreign table has an AFTER ROW trigger. Triggers require all columns, but the FDW could choose to optimize away returning some or all columns depending on the contents of the RETURNING clause. Regardless, some slot must be returned to indicate success, or the query's reported row count will be wrong.

If the ExecForeignDelete pointer is set to NULL, attempts to delete from the foreign table will fail with an error message.

void
EndForeignModify(EState *estate,
                 ResultRelInfo *rinfo);

End the table update and release resources. It is normally not important to release palloc'd memory, but for example open files and connections to remote servers should be cleaned up.

If the EndForeignModify pointer is set to NULL, no action is taken during executor shutdown.

Tuples inserted into a partitioned table by INSERT or COPY FROM are routed to partitions. If an FDW supports routable foreign-table partitions, it should also provide the following callback functions. These functions are also called when COPY FROM is executed on a foreign table.

void
BeginForeignInsert(ModifyTableState *mtstate,
                   ResultRelInfo *rinfo);

Begin executing an insert operation on a foreign table. This routine is called right before the first tuple is inserted into the foreign table in both cases when it is the partition chosen for tuple routing and the target specified in a COPY FROM command. It should perform any initialization needed prior to the actual insertion. Subsequently, ExecForeignInsert will be called for each tuple to be inserted into the foreign table.

When this is called by a COPY FROM command, the plan-related global data in mtstate is not provided and the planSlot parameter of ExecForeignInsert subsequently called for each inserted tuple is NULL, whether the foreign table is the partition chosen for tuple routing or the target specified in the command.

If the BeginForeignInsert pointer is set to NULL, no action is taken for the initialization.

Note that if the FDW does not support routable foreign-table partitions and/or executing COPY FROM on foreign tables, this function or ExecForeignInsert subsequently called must throw error as needed.

void
EndForeignInsert(EState *estate,
                 ResultRelInfo *rinfo);

End the insert operation and release resources. It is normally not important to release palloc'd memory, but for example open files and connections to remote servers should be cleaned up.

If the EndForeignInsert pointer is set to NULL, no action is taken for the termination.

int
IsForeignRelUpdatable(Relation rel);

Report which update operations the specified foreign table supports. The return value should be a bit mask of rule event numbers indicating which operations are supported by the foreign table, using the CmdType enumeration; that is, (1 << CMD_UPDATE) = 4 for UPDATE, (1 << CMD_INSERT) = 8 for INSERT, and (1 << CMD_DELETE) = 16 for DELETE.

If the IsForeignRelUpdatable pointer is set to NULL, foreign tables are assumed to be insertable, updatable, or deletable if the FDW provides ExecForeignInsert, ExecForeignUpdate, or ExecForeignDelete respectively. This function is only needed if the FDW supports some tables that are updatable and some that are not. (Even then, it's permissible to throw an error in the execution routine instead of checking in this function. However, this function is used to determine updatability for display in the information_schema views.)

Some inserts, updates, and deletes to foreign tables can be optimized by implementing an alternative set of interfaces. The ordinary interfaces for inserts, updates, and deletes fetch rows from the remote server and then modify those rows one at a time. In some cases, this row-by-row approach is necessary, but it can be inefficient. If it is possible for the foreign server to determine which rows should be modified without actually retrieving them, and if there are no local structures which would affect the operation (row-level local triggers, stored generated columns, or WITH CHECK OPTION constraints from parent views), then it is possible to arrange things so that the entire operation is performed on the remote server. The interfaces described below make this possible.

bool
PlanDirectModify(PlannerInfo *root,
                 ModifyTable *plan,
                 Index resultRelation,
                 int subplan_index);

Decide whether it is safe to execute a direct modification on the remote server. If so, return true after performing planning actions needed for that. Otherwise, return false. This optional function is called during query planning. If this function succeeds, BeginDirectModify, IterateDirectModify and EndDirectModify will be called at the execution stage, instead. Otherwise, the table modification will be executed using the table-updating functions described above. The parameters are the same as for PlanForeignModify.

To execute the direct modification on the remote server, this function must rewrite the target subplan with a ForeignScan plan node that executes the direct modification on the remote server. The operation field of the ForeignScan must be set to the CmdType enumeration appropriately; that is, CMD_UPDATE for UPDATE, CMD_INSERT for INSERT, and CMD_DELETE for DELETE.

See Section 56.4 for additional information.

If the PlanDirectModify pointer is set to NULL, no attempts to execute a direct modification on the remote server are taken.

void
BeginDirectModify(ForeignScanState *node,
                  int eflags);

Prepare to execute a direct modification on the remote server. This is called during executor startup. It should perform any initialization needed prior to the direct modification (that should be done upon the first call to IterateDirectModify). The ForeignScanState node has already been created, but its fdw_state field is still NULL. Information about the table to modify is accessible through the ForeignScanState node (in particular, from the underlying ForeignScan plan node, which contains any FDW-private information provided by PlanDirectModify). eflags contains flag bits describing the executor's operating mode for this plan node.

If the BeginDirectModify pointer is set to NULL, no attempts to execute a direct modification on the remote server are taken.

TupleTableSlot *
IterateDirectModify(ForeignScanState *node);

When the INSERT, UPDATE or DELETE query doesn't have a RETURNING clause, just return NULL after a direct modification on the remote server. When the query has the clause, fetch one result containing the data needed for the RETURNING calculation, returning it in a tuple table slot (the node's ScanTupleSlot should be used for this purpose). The data that was actually inserted, updated or deleted must be stored in the es_result_relation_info->ri_projectReturning->pi_exprContext->ecxt_scantuple of the node's EState. Return NULL if no more rows are available. Note that this is called in a short-lived memory context that will be reset between invocations. Create a memory context in BeginDirectModify if you need longer-lived storage, or use the es_query_cxt of the node's EState.

The rows returned must match the fdw_scan_tlist target list if one was supplied, otherwise they must match the row type of the foreign table being updated. If you choose to optimize away fetching columns that are not needed for the RETURNING calculation, you should insert nulls in those column positions, or else generate a fdw_scan_tlist list with those columns omitted.

Whether the query has the clause or not, the query's reported row count must be incremented by the FDW itself. When the query doesn't have the clause, the FDW must also increment the row count for the ForeignScanState node in the EXPLAIN ANALYZE case.

If the IterateDirectModify pointer is set to NULL, no attempts to execute a direct modification on the remote server are taken.

void
EndDirectModify(ForeignScanState *node);

Clean up following a direct modification on the remote server. It is normally not important to release palloc'd memory, but for example open files and connections to the remote server should be cleaned up.

If the EndDirectModify pointer is set to NULL, no attempts to execute a direct modification on the remote server are taken.

56.2.5. FDW Routines for Row Locking

If an FDW wishes to support late row locking (as described in Section 56.5), it must provide the following callback functions:

RowMarkType
GetForeignRowMarkType(RangeTblEntry *rte,
                      LockClauseStrength strength);

Report which row-marking option to use for a foreign table. rte is the RangeTblEntry node for the table and strength describes the lock strength requested by the relevant FOR UPDATE/SHARE clause, if any. The result must be a member of the RowMarkType enum type.

This function is called during query planning for each foreign table that appears in an UPDATE, DELETE, or SELECT FOR UPDATE/SHARE query and is not the target of UPDATE or DELETE.

If the GetForeignRowMarkType pointer is set to NULL, the ROW_MARK_COPY option is always used. (This implies that RefetchForeignRow will never be called, so it need not be provided either.)

See Section 56.5 for more information.

void
RefetchForeignRow(EState *estate,
                  ExecRowMark *erm,
                  Datum rowid,
                  TupleTableSlot *slot,
                  bool *updated);

Re-fetch one tuple slot from the foreign table, after locking it if required. estate is global execution state for the query. erm is the ExecRowMark struct describing the target foreign table and the row lock type (if any) to acquire. rowid identifies the tuple to be fetched. slot contains nothing useful upon call, but can be used to hold the returned tuple. updated is an output parameter.

This function should store the tuple into the provided slot, or clear it if the row lock couldn't be obtained. The row lock type to acquire is defined by erm->markType, which is the value previously returned by GetForeignRowMarkType. (ROW_MARK_REFERENCE means to just re-fetch the tuple without acquiring any lock, and ROW_MARK_COPY will never be seen by this routine.)

In addition, *updated should be set to true if what was fetched was an updated version of the tuple rather than the same version previously obtained. (If the FDW cannot be sure about this, always returning true is recommended.)

Note that by default, failure to acquire a row lock should result in raising an error; returning with an empty slot is only appropriate if the SKIP LOCKED option is specified by erm->waitPolicy.

The rowid is the ctid value previously read for the row to be re-fetched. Although the rowid value is passed as a Datum, it can currently only be a tid. The function API is chosen in hopes that it may be possible to allow other data types for row IDs in future.

If the RefetchForeignRow pointer is set to NULL, attempts to re-fetch rows will fail with an error message.

See Section 56.5 for more information.

bool
RecheckForeignScan(ForeignScanState *node,
                   TupleTableSlot *slot);

Recheck that a previously-returned tuple still matches the relevant scan and join qualifiers, and possibly provide a modified version of the tuple. For foreign data wrappers which do not perform join pushdown, it will typically be more convenient to set this to NULL and instead set fdw_recheck_quals appropriately. When outer joins are pushed down, however, it isn't sufficient to reapply the checks relevant to all the base tables to the result tuple, even if all needed attributes are present, because failure to match some qualifier might result in some attributes going to NULL, rather than in no tuple being returned. RecheckForeignScan can recheck qualifiers and return true if they are still satisfied and false otherwise, but it can also store a replacement tuple into the supplied slot.

To implement join pushdown, a foreign data wrapper will typically construct an alternative local join plan which is used only for rechecks; this will become the outer subplan of the ForeignScan. When a recheck is required, this subplan can be executed and the resulting tuple can be stored in the slot. This plan need not be efficient since no base table will return more than one row; for example, it may implement all joins as nested loops. The function GetExistingLocalJoinPath may be used to search existing paths for a suitable local join path, which can be used as the alternative local join plan. GetExistingLocalJoinPath searches for an unparameterized path in the path list of the specified join relation. (If it does not find such a path, it returns NULL, in which case a foreign data wrapper may build the local path by itself or may choose not to create access paths for that join.)

56.2.6. FDW Routines for `EXPLAIN`

void
ExplainForeignScan(ForeignScanState *node,
                   ExplainState *es);

Print additional EXPLAIN output for a foreign table scan. This function can call ExplainPropertyText and related functions to add fields to the EXPLAIN output. The flag fields in es can be used to determine what to print, and the state of the ForeignScanState node can be inspected to provide run-time statistics in the EXPLAIN ANALYZE case.

If the ExplainForeignScan pointer is set to NULL, no additional information is printed during EXPLAIN.

void
ExplainForeignModify(ModifyTableState *mtstate,
                     ResultRelInfo *rinfo,
                     List *fdw_private,
                     int subplan_index,
                     struct ExplainState *es);

Print additional EXPLAIN output for a foreign table update. This function can call ExplainPropertyText and related functions to add fields to the EXPLAIN output. The flag fields in es can be used to determine what to print, and the state of the ModifyTableState node can be inspected to provide run-time statistics in the EXPLAIN ANALYZE case. The first four arguments are the same as for BeginForeignModify.

If the ExplainForeignModify pointer is set to NULL, no additional information is printed during EXPLAIN.

void
ExplainDirectModify(ForeignScanState *node,
                    ExplainState *es);

Print additional EXPLAIN output for a direct modification on the remote server. This function can call ExplainPropertyText and related functions to add fields to the EXPLAIN output. The flag fields in es can be used to determine what to print, and the state of the ForeignScanState node can be inspected to provide run-time statistics in the EXPLAIN ANALYZE case.

If the ExplainDirectModify pointer is set to NULL, no additional information is printed during EXPLAIN.

56.2.7. FDW Routines for `ANALYZE`

bool
AnalyzeForeignTable(Relation relation,
                    AcquireSampleRowsFunc *func,
                    BlockNumber *totalpages);

This function is called when ANALYZE is executed on a foreign table. If the FDW can collect statistics for this foreign table, it should return true, and provide a pointer to a function that will collect sample rows from the table in func, plus the estimated size of the table in pages in totalpages. Otherwise, return false.

If the FDW does not support collecting statistics for any tables, the AnalyzeForeignTable pointer can be set to NULL.

If provided, the sample collection function must have the signature

int
AcquireSampleRowsFunc(Relation relation,
                      int elevel,
                      HeapTuple *rows,
                      int targrows,
                      double *totalrows,
                      double *totaldeadrows);

A random sample of up to targrows rows should be collected from the table and stored into the caller-provided rows array. The actual number of rows collected must be returned. In addition, store estimates of the total numbers of live and dead rows in the table into the output parameters totalrows and totaldeadrows. (Set totaldeadrows to zero if the FDW does not have any concept of dead rows.)

56.2.8. FDW Routines for `IMPORT FOREIGN SCHEMA`

List *
ImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid);

Obtain a list of foreign table creation commands. This function is called when executing IMPORT FOREIGN SCHEMA, and is passed the parse tree for that statement, as well as the OID of the foreign server to use. It should return a list of C strings, each of which must contain a CREATE FOREIGN TABLE command. These strings will be parsed and executed by the core server.

Within the ImportForeignSchemaStmt struct, remote_schema is the name of the remote schema from which tables are to be imported. list_type identifies how to filter table names: FDW_IMPORT_SCHEMA_ALL means that all tables in the remote schema should be imported (in this case table_list is empty), FDW_IMPORT_SCHEMA_LIMIT_TO means to include only tables listed in table_list, and FDW_IMPORT_SCHEMA_EXCEPT means to exclude the tables listed in table_list. options is a list of options used for the import process. The meanings of the options are up to the FDW. For example, an FDW could use an option to define whether the NOT NULL attributes of columns should be imported. These options need not have anything to do with those supported by the FDW as database object options.

The FDW may ignore the local_schema field of the ImportForeignSchemaStmt, because the core server will automatically insert that name into the parsed CREATE FOREIGN TABLE commands.

The FDW does not have to concern itself with implementing the filtering specified by list_type and table_list, either, as the core server will automatically skip any returned commands for tables excluded according to those options. However, it's often useful to avoid the work of creating commands for excluded tables in the first place. The function IsImportableForeignTable() may be useful to test whether a given foreign-table name will pass the filter.

If the FDW does not support importing table definitions, the ImportForeignSchema pointer can be set to NULL.

56.2.9. FDW Routines for Parallel Execution

A ForeignScan node can, optionally, support parallel execution. A parallel ForeignScan will be executed in multiple processes and must return each row exactly once across all cooperating processes. To do this, processes can coordinate through fixed-size chunks of dynamic shared memory. This shared memory is not guaranteed to be mapped at the same address in every process, so it must not contain pointers. The following functions are all optional, but most are required if parallel execution is to be supported.

bool
IsForeignScanParallelSafe(PlannerInfo *root, RelOptInfo *rel,
                          RangeTblEntry *rte);

Test whether a scan can be performed within a parallel worker. This function will only be called when the planner believes that a parallel plan might be possible, and should return true if it is safe for that scan to run within a parallel worker. This will generally not be the case if the remote data source has transaction semantics, unless the worker's connection to the data can somehow be made to share the same transaction context as the leader.

If this function is not defined, it is assumed that the scan must take place within the parallel leader. Note that returning true does not mean that the scan itself can be done in parallel, only that the scan can be performed within a parallel worker. Therefore, it can be useful to define this method even when parallel execution is not supported.

Size
EstimateDSMForeignScan(ForeignScanState *node, ParallelContext *pcxt);

Estimate the amount of dynamic shared memory that will be required for parallel operation. This may be higher than the amount that will actually be used, but it must not be lower. The return value is in bytes. This function is optional, and can be omitted if not needed; but if it is omitted, the next three functions must be omitted as well, because no shared memory will be allocated for the FDW's use.

void
InitializeDSMForeignScan(ForeignScanState *node, ParallelContext *pcxt,
                         void *coordinate);

Initialize the dynamic shared memory that will be required for parallel operation. coordinate points to a shared memory area of size equal to the return value of EstimateDSMForeignScan. This function is optional, and can be omitted if not needed.

void
ReInitializeDSMForeignScan(ForeignScanState *node, ParallelContext *pcxt,
                           void *coordinate);

Re-initialize the dynamic shared memory required for parallel operation when the foreign-scan plan node is about to be re-scanned. This function is optional, and can be omitted if not needed. Recommended practice is that this function reset only shared state, while the ReScanForeignScan function resets only local state. Currently, this function will be called before ReScanForeignScan, but it's best not to rely on that ordering.

void
InitializeWorkerForeignScan(ForeignScanState *node, shm_toc *toc,
                            void *coordinate);

Initialize a parallel worker's local state based on the shared state set up by the leader during InitializeDSMForeignScan. This function is optional, and can be omitted if not needed.

void
ShutdownForeignScan(ForeignScanState *node);

Release resources when it is anticipated the node will not be executed to completion. This is not called in all cases; sometimes, EndForeignScan may be called without this function having been called first. Since the DSM segment used by parallel query is destroyed just after this callback is invoked, foreign data wrappers that wish to take some action before the DSM segment goes away should implement this method.

56.2.10. FDW Routines for Reparameterization of Paths

List *
ReparameterizeForeignPathByChild(PlannerInfo *root, List *fdw_private,
                                 RelOptInfo *child_rel);

This function is called while converting a path parameterized by the top-most parent of the given child relation child_rel to be parameterized by the child relation. The function is used to reparameterize any paths or translate any expression nodes saved in the given fdw_private member of a ForeignPath. The callback may use reparameterize_path_by_child, adjust_appendrel_attrs or adjust_appendrel_attrs_multilevel as required.

67.4. Implementation

This section covers B-Tree index implementation details that may be of use to advanced users. See src/backend/access/nbtree/README in the source distribution for a much more detailed, internals-focused description of the B-Tree implementation.

67.4.1. B-Tree Structure

PostgreSQL B-Tree indexes are multi-level tree structures, where each level of the tree can be used as a doubly-linked list of pages. A single metapage is stored in a fixed position at the start of the first segment file of the index. All other pages are either leaf pages or internal pages. Leaf pages are the pages on the lowest level of the tree. All other levels consist of internal pages. Each leaf page contains tuples that point to table rows. Each internal page contains tuples that point to the next level down in the tree. Typically, over 99% of all pages are leaf pages. Both internal pages and leaf pages use the standard page format described in Section 73.6.

New leaf pages are added to a B-Tree index when an existing leaf page cannot fit an incoming tuple. A page split operation makes room for items that originally belonged on the overflowing page by moving a portion of the items to a new page. Page splits must also insert a new downlink to the new page in the parent page, which may cause the parent to split in turn. Page splits “cascade upwards” in a recursive fashion. When the root page finally cannot fit a new downlink, a root page split operation takes place. This adds a new level to the tree structure by creating a new root page that is one level above the original root page.

67.4.2. Bottom-up Index Deletion

B-Tree indexes are not directly aware that under MVCC, there might be multiple extant versions of the same logical table row; to an index, each tuple is an independent object that needs its own index entry. “Version churn” tuples may sometimes accumulate and adversely affect query latency and throughput. This typically occurs with UPDATE-heavy workloads where most individual updates cannot apply the HOT optimization. Changing the value of only one column covered by one index during an UPDATE always necessitates a new set of index tuples — one for each and every index on the table. Note in particular that this includes indexes that were not “logically modified” by the UPDATE. All indexes will need a successor physical index tuple that points to the latest version in the table. Each new tuple within each index will generally need to coexist with the original “updated” tuple for a short period of time (typically until shortly after the UPDATE transaction commits).

B-Tree indexes incrementally delete version churn index tuples by performing bottom-up index deletion passes. Each deletion pass is triggered in reaction to an anticipated “version churn page split”. This only happens with indexes that are not logically modified by UPDATE statements, where concentrated build up of obsolete versions in particular pages would occur otherwise. A page split will usually be avoided, though it's possible that certain implementation-level heuristics will fail to identify and delete even one garbage index tuple (in which case a page split or deduplication pass resolves the issue of an incoming new tuple not fitting on a leaf page). The worst-case number of versions that any index scan must traverse (for any single logical row) is an important contributor to overall system responsiveness and throughput. A bottom-up index deletion pass targets suspected garbage tuples in a single leaf page based on qualitative distinctions involving logical rows and versions. This contrasts with the “top-down” index cleanup performed by autovacuum workers, which is triggered when certain quantitative table-level thresholds are exceeded (see Section 25.1.6).

Note

Not all deletion operations that are performed within B-Tree indexes are bottom-up deletion operations. There is a distinct category of index tuple deletion: simple index tuple deletion. This is a deferred maintenance operation that deletes index tuples that are known to be safe to delete (those whose item identifier's LP_DEAD bit is already set). Like bottom-up index deletion, simple index deletion takes place at the point that a page split is anticipated as a way of avoiding the split.

Simple deletion is opportunistic in the sense that it can only take place when recent index scans set the LP_DEAD bits of affected items in passing. Prior to PostgreSQL 14, the only category of B-Tree deletion was simple deletion. The main differences between it and bottom-up deletion are that only the former is opportunistically driven by the activity of passing index scans, while only the latter specifically targets version churn from UPDATEs that do not logically modify indexed columns.

Bottom-up index deletion performs the vast majority of all garbage index tuple cleanup for particular indexes with certain workloads. This is expected with any B-Tree index that is subject to significant version churn from UPDATEs that rarely or never logically modify the columns that the index covers. The average and worst-case number of versions per logical row can be kept low purely through targeted incremental deletion passes. It's quite possible that the on-disk size of certain indexes will never increase by even one single page/block despite constant version churn from UPDATEs. Even then, an exhaustive “clean sweep” by a VACUUM operation (typically run in an autovacuum worker process) will eventually be required as a part of collective cleanup of the table and each of its indexes.

Unlike VACUUM, bottom-up index deletion does not provide any strong guarantees about how old the oldest garbage index tuple may be. No index can be permitted to retain “floating garbage” index tuples that became dead prior to a conservative cutoff point shared by the table and all of its indexes collectively. This fundamental table-level invariant makes it safe to recycle table TIDs. This is how it is possible for distinct logical rows to reuse the same table TID over time (though this can never happen with two logical rows whose lifetimes span the same VACUUM cycle).

67.4.3. Deduplication

A duplicate is a leaf page tuple (a tuple that points to a table row) where all indexed key columns have values that match corresponding column values from at least one other leaf page tuple in the same index. Duplicate tuples are quite common in practice. B-Tree indexes can use a special, space-efficient representation for duplicates when an optional technique is enabled: deduplication.

Deduplication works by periodically merging groups of duplicate tuples together, forming a single posting list tuple for each group. The column key value(s) only appear once in this representation. This is followed by a sorted array of TIDs that point to rows in the table. This significantly reduces the storage size of indexes where each value (or each distinct combination of column values) appears several times on average. The latency of queries can be reduced significantly. Overall query throughput may increase significantly. The overhead of routine index vacuuming may also be reduced significantly.

Note

B-Tree deduplication is just as effective with “duplicates” that contain a NULL value, even though NULL values are never equal to each other according to the = member of any B-Tree operator class. As far as any part of the implementation that understands the on-disk B-Tree structure is concerned, NULL is just another value from the domain of indexed values.

The deduplication process occurs lazily, when a new item is inserted that cannot fit on an existing leaf page, though only when index tuple deletion could not free sufficient space for the new item (typically deletion is briefly considered and then skipped over). Unlike GIN posting list tuples, B-Tree posting list tuples do not need to expand every time a new duplicate is inserted; they are merely an alternative physical representation of the original logical contents of the leaf page. This design prioritizes consistent performance with mixed read-write workloads. Most client applications will at least see a moderate performance benefit from using deduplication. Deduplication is enabled by default.

CREATE INDEX and REINDEX apply deduplication to create posting list tuples, though the strategy they use is slightly different. Each group of duplicate ordinary tuples encountered in the sorted input taken from the table is merged into a posting list tuple before being added to the current pending leaf page. Individual posting list tuples are packed with as many TIDs as possible. Leaf pages are written out in the usual way, without any separate deduplication pass. This strategy is well-suited to CREATE INDEX and REINDEX because they are once-off batch operations.

Write-heavy workloads that don't benefit from deduplication due to having few or no duplicate values in indexes will incur a small, fixed performance penalty (unless deduplication is explicitly disabled). The deduplicate_items storage parameter can be used to disable deduplication within individual indexes. There is never any performance penalty with read-only workloads, since reading posting list tuples is at least as efficient as reading the standard tuple representation. Disabling deduplication isn't usually helpful.

It is sometimes possible for unique indexes (as well as unique constraints) to use deduplication. This allows leaf pages to temporarily “absorb” extra version churn duplicates. Deduplication in unique indexes augments bottom-up index deletion, especially in cases where a long-running transaction holds a snapshot that blocks garbage collection. The goal is to buy time for the bottom-up index deletion strategy to become effective again. Delaying page splits until a single long-running transaction naturally goes away can allow a bottom-up deletion pass to succeed where an earlier deletion pass failed.

Tip

A special heuristic is applied to determine whether a deduplication pass in a unique index should take place. It can often skip straight to splitting a leaf page, avoiding a performance penalty from wasting cycles on unhelpful deduplication passes. If you're concerned about the overhead of deduplication, consider setting deduplicate_items = off selectively. Leaving deduplication enabled in unique indexes has little downside.

Deduplication cannot be used in all cases due to implementation-level restrictions. Deduplication safety is determined when CREATE INDEX or REINDEX is run.

Note that deduplication is deemed unsafe and cannot be used in the following cases involving semantically significant differences among equal datums:

text, varchar, and char cannot use deduplication when a nondeterministic collation is used. Case and accent differences must be preserved among equal datums.
numeric cannot use deduplication. Numeric display scale must be preserved among equal datums.
jsonb cannot use deduplication, since the jsonb B-Tree operator class uses numeric internally.
float4 and float8 cannot use deduplication. These types have distinct representations for -0 and 0, which are nevertheless considered equal. This difference must be preserved.

There is one further implementation-level restriction that may be lifted in a future version of PostgreSQL:

Container types (such as composite types, arrays, or range types) cannot use deduplication.

There is one further implementation-level restriction that applies regardless of the operator class or collation used:

INCLUDE indexes can never use deduplication.

64.3. Extensibility

Traditionally, implementing a new index access method meant a lot of difficult work. It was necessary to understand the inner workings of the database, such as the lock manager and Write-Ahead Log. The GiST interface has a high level of abstraction, requiring the access method implementer only to implement the semantics of the data type being accessed. The GiST layer itself takes care of concurrency, logging and searching the tree structure.

This extensibility should not be confused with the extensibility of the other standard search trees in terms of the data they can handle. For example, PostgreSQL supports extensible B-trees and hash indexes. That means that you can use PostgreSQL to build a B-tree or hash over any data type you want. But B-trees only support range predicates (<, =, >), and hash indexes only support equality queries.

So if you index, say, an image collection with a PostgreSQL B-tree, you can only issue queries such as “is imagex equal to imagey”, “is imagex less than imagey” and “is imagex greater than imagey”. Depending on how you define “equals”, “less than” and “greater than” in this context, this could be useful. However, by using a GiST based index, you could create ways to ask domain-specific questions, perhaps “find all images of horses” or “find all over-exposed images”.

All it takes to get a GiST access method up and running is to implement several user-defined methods, which define the behavior of keys in the tree. Of course these methods have to be pretty fancy to support fancy queries, but for all the standard queries (B-trees, R-trees, etc.) they're relatively straightforward. In short, GiST combines extensibility along with generality, code reuse, and a clean interface.

There are five methods that an index operator class for GiST must provide, and four that are optional. Correctness of the index is ensured by proper implementation of the same, consistent and union methods, while efficiency (size and speed) of the index will depend on the penalty and picksplit methods. Two optional methods are compress and decompress, which allow an index to have internal tree data of a different type than the data it indexes. The leaves are to be of the indexed data type, while the other tree nodes can be of any C struct (but you still have to follow PostgreSQL data type rules here, see about varlena for variable sized data). If the tree's internal data type exists at the SQL level, the STORAGE option of the CREATE OPERATOR CLASS command can be used. The optional eighth method is distance, which is needed if the operator class wishes to support ordered scans (nearest-neighbor searches). The optional ninth method fetch is needed if the operator class wishes to support index-only scans, except when the compress method is omitted.

consistent

Given an index entry p and a query value q, this function determines whether the index entry is “consistent” with the query; that is, could the predicate “indexed_column indexable_operator q” be true for any row represented by the index entry? For a leaf index entry this is equivalent to testing the indexable condition, while for an internal tree node this determines whether it is necessary to scan the subtree of the index represented by the tree node. When the result is true, a recheck flag must also be returned. This indicates whether the predicate is certainly true or only possibly true. If recheck = false then the index has tested the predicate condition exactly, whereas if recheck = true the row is only a candidate match. In that case the system will automatically evaluate the indexable_operator against the actual row value to see if it is really a match. This convention allows GiST to support both lossless and lossy index structures.

The SQL declaration of the function must look like this:

CREATE OR REPLACE FUNCTION my_consistent(internal, data_type, smallint, oid, internal)
RETURNS bool
AS 'MODULE_PATHNAME'
LANGUAGE C STRICT;

And the matching code in the C module could then follow this skeleton:

PG_FUNCTION_INFO_V1(my_consistent);

Datum
my_consistent(PG_FUNCTION_ARGS)
{
    GISTENTRY  *entry = (GISTENTRY *) PG_GETARG_POINTER(0);
    data_type  *query = PG_GETARG_DATA_TYPE_P(1);
    StrategyNumber strategy = (StrategyNumber) PG_GETARG_UINT16(2);
    /* Oid subtype = PG_GETARG_OID(3); */
    bool       *recheck = (bool *) PG_GETARG_POINTER(4);
    data_type  *key = DatumGetDataType(entry->key);
    bool        retval;

    /*
     * determine return value as a function of strategy, key and query.
     *
     * Use GIST_LEAF(entry) to know where you're called in the index tree,
     * which comes handy when supporting the = operator for example (you could
     * check for non empty union() in non-leaf nodes and equality in leaf
     * nodes).
     */

    *recheck = true;        /* or false if check is exact */

    PG_RETURN_BOOL(retval);
}

Here, key is an element in the index and query the value being looked up in the index. The StrategyNumber parameter indicates which operator of your operator class is being applied — it matches one of the operator numbers in the CREATE OPERATOR CLASS command.

Depending on which operators you have included in the class, the data type of query could vary with the operator, since it will be whatever type is on the righthand side of the operator, which might be different from the indexed data type appearing on the lefthand side. (The above code skeleton assumes that only one type is possible; if not, fetching the query argument value would have to depend on the operator.) It is recommended that the SQL declaration of the consistent function use the opclass's indexed data type for the query argument, even though the actual type might be something else depending on the operator.

union

This method consolidates information in the tree. Given a set of entries, this function generates a new index entry that represents all the given entries.

The SQL declaration of the function must look like this:

CREATE OR REPLACE FUNCTION my_union(internal, internal)
RETURNS storage_type
AS 'MODULE_PATHNAME'
LANGUAGE C STRICT;

And the matching code in the C module could then follow this skeleton:

PG_FUNCTION_INFO_V1(my_union);

Datum
my_union(PG_FUNCTION_ARGS)
{
    GistEntryVector *entryvec = (GistEntryVector *) PG_GETARG_POINTER(0);
    GISTENTRY  *ent = entryvec->vector;
    data_type  *out,
               *tmp,
               *old;
    int         numranges,
                i = 0;

    numranges = entryvec->n;
    tmp = DatumGetDataType(ent[0].key);
    out = tmp;

    if (numranges == 1)
    {
        out = data_type_deep_copy(tmp);

        PG_RETURN_DATA_TYPE_P(out);
    }

    for (i = 1; i < numranges; i++)
    {
        old = out;
        tmp = DatumGetDataType(ent[i].key);
        out = my_union_implementation(out, tmp);
    }

    PG_RETURN_DATA_TYPE_P(out);
}

As you can see, in this skeleton we're dealing with a data type where union(X, Y, Z) = union(union(X, Y), Z). It's easy enough to support data types where this is not the case, by implementing the proper union algorithm in this GiST support method.

The result of the union function must be a value of the index's storage type, whatever that is (it might or might not be different from the indexed column's type). The union function should return a pointer to newly palloc()ed memory. You can't just return the input value as-is, even if there is no type change.

As shown above, the union function's first internal argument is actually a GistEntryVector pointer. The second argument is a pointer to an integer variable, which can be ignored. (It used to be required that the union function store the size of its result value into that variable, but this is no longer necessary.)

compress

Converts a data item into a format suitable for physical storage in an index page. If the compress method is omitted, data items are stored in the index without modification.

The SQL declaration of the function must look like this:

CREATE OR REPLACE FUNCTION my_compress(internal)
RETURNS internal
AS 'MODULE_PATHNAME'
LANGUAGE C STRICT;

And the matching code in the C module could then follow this skeleton:

PG_FUNCTION_INFO_V1(my_compress);

Datum
my_compress(PG_FUNCTION_ARGS)
{
    GISTENTRY  *entry = (GISTENTRY *) PG_GETARG_POINTER(0);
    GISTENTRY  *retval;

    if (entry->leafkey)
    {
        /* replace entry->key with a compressed version */
        compressed_data_type *compressed_data = palloc(sizeof(compressed_data_type));

        /* fill *compressed_data from entry->key ... */

        retval = palloc(sizeof(GISTENTRY));
        gistentryinit(*retval, PointerGetDatum(compressed_data),
                      entry->rel, entry->page, entry->offset, FALSE);
    }
    else
    {
        /* typically we needn't do anything with non-leaf entries */
        retval = entry;
    }

    PG_RETURN_POINTER(retval);
}

You have to adapt compressed_data_type to the specific type you're converting to in order to compress your leaf nodes, of course.

decompress

Converts the stored representation of a data item into a format that can be manipulated by the other GiST methods in the operator class. If the decompress method is omitted, it is assumed that the other GiST methods can work directly on the stored data format. (decompress is not necessarily the reverse of the compress method; in particular, if compress is lossy then it's impossible for decompress to exactly reconstruct the original data. decompress is not necessarily equivalent to fetch, either, since the other GiST methods might not require full reconstruction of the data.)

The SQL declaration of the function must look like this:

CREATE OR REPLACE FUNCTION my_decompress(internal)
RETURNS internal
AS 'MODULE_PATHNAME'
LANGUAGE C STRICT;

And the matching code in the C module could then follow this skeleton:

PG_FUNCTION_INFO_V1(my_decompress);

Datum
my_decompress(PG_FUNCTION_ARGS)
{
    PG_RETURN_POINTER(PG_GETARG_POINTER(0));
}

The above skeleton is suitable for the case where no decompression is needed. (But, of course, omitting the method altogether is even easier, and is recommended in such cases.)

penalty

Returns a value indicating the “cost” of inserting the new entry into a particular branch of the tree. Items will be inserted down the path of least penalty in the tree. Values returned by penalty should be non-negative. If a negative value is returned, it will be treated as zero.

The SQL declaration of the function must look like this:

CREATE OR REPLACE FUNCTION my_penalty(internal, internal, internal)
RETURNS internal
AS 'MODULE_PATHNAME'
LANGUAGE C STRICT;  -- in some cases penalty functions need not be strict

And the matching code in the C module could then follow this skeleton:

PG_FUNCTION_INFO_V1(my_penalty);

Datum
my_penalty(PG_FUNCTION_ARGS)
{
    GISTENTRY  *origentry = (GISTENTRY *) PG_GETARG_POINTER(0);
    GISTENTRY  *newentry = (GISTENTRY *) PG_GETARG_POINTER(1);
    float      *penalty = (float *) PG_GETARG_POINTER(2);
    data_type  *orig = DatumGetDataType(origentry->key);
    data_type  *new = DatumGetDataType(newentry->key);

    *penalty = my_penalty_implementation(orig, new);
    PG_RETURN_POINTER(penalty);
}

For historical reasons, the penalty function doesn't just return a float result; instead it has to store the value at the location indicated by the third argument. The return value per se is ignored, though it's conventional to pass back the address of that argument.

The penalty function is crucial to good performance of the index. It'll get used at insertion time to determine which branch to follow when choosing where to add the new entry in the tree. At query time, the more balanced the index, the quicker the lookup.

picksplit

When an index page split is necessary, this function decides which entries on the page are to stay on the old page, and which are to move to the new page.

The SQL declaration of the function must look like this:

CREATE OR REPLACE FUNCTION my_picksplit(internal, internal)
RETURNS internal
AS 'MODULE_PATHNAME'
LANGUAGE C STRICT;

And the matching code in the C module could then follow this skeleton:

PG_FUNCTION_INFO_V1(my_picksplit);

Datum
my_picksplit(PG_FUNCTION_ARGS)
{
    GistEntryVector *entryvec = (GistEntryVector *) PG_GETARG_POINTER(0);
    GIST_SPLITVEC *v = (GIST_SPLITVEC *) PG_GETARG_POINTER(1);
    OffsetNumber maxoff = entryvec->n - 1;
    GISTENTRY  *ent = entryvec->vector;
    int         i,
                nbytes;
    OffsetNumber *left,
               *right;
    data_type  *tmp_union;
    data_type  *unionL;
    data_type  *unionR;
    GISTENTRY **raw_entryvec;

    maxoff = entryvec->n - 1;
    nbytes = (maxoff + 1) * sizeof(OffsetNumber);

    v->spl_left = (OffsetNumber *) palloc(nbytes);
    left = v->spl_left;
    v->spl_nleft = 0;

    v->spl_right = (OffsetNumber *) palloc(nbytes);
    right = v->spl_right;
    v->spl_nright = 0;

    unionL = NULL;
    unionR = NULL;

    /* Initialize the raw entry vector. */
    raw_entryvec = (GISTENTRY **) malloc(entryvec->n * sizeof(void *));
    for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
        raw_entryvec[i] = &(entryvec->vector[i]);

    for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
    {
        int         real_index = raw_entryvec[i] - entryvec->vector;

        tmp_union = DatumGetDataType(entryvec->vector[real_index].key);
        Assert(tmp_union != NULL);

        /*
         * Choose where to put the index entries and update unionL and unionR
         * accordingly. Append the entries to either v->spl_left or
         * v->spl_right, and care about the counters.
         */

        if (my_choice_is_left(unionL, curl, unionR, curr))
        {
            if (unionL == NULL)
                unionL = tmp_union;
            else
                unionL = my_union_implementation(unionL, tmp_union);

            *left = real_index;
            ++left;
            ++(v->spl_nleft);
        }
        else
        {
            /*
             * Same on the right
             */
        }
    }

    v->spl_ldatum = DataTypeGetDatum(unionL);
    v->spl_rdatum = DataTypeGetDatum(unionR);
    PG_RETURN_POINTER(v);
}

Notice that the picksplit function's result is delivered by modifying the passed-in v structure. The return value per se is ignored, though it's conventional to pass back the address of v.

Like penalty, the picksplit function is crucial to good performance of the index. Designing suitable penalty and picksplit implementations is where the challenge of implementing well-performing GiST indexes lies.

same

Returns true if two index entries are identical, false otherwise. (An “index entry” is a value of the index's storage type, not necessarily the original indexed column's type.)

The SQL declaration of the function must look like this:

CREATE OR REPLACE FUNCTION my_same(storage_type, storage_type, internal)
RETURNS internal
AS 'MODULE_PATHNAME'
LANGUAGE C STRICT;

And the matching code in the C module could then follow this skeleton:

PG_FUNCTION_INFO_V1(my_same);

Datum
my_same(PG_FUNCTION_ARGS)
{
    prefix_range *v1 = PG_GETARG_PREFIX_RANGE_P(0);
    prefix_range *v2 = PG_GETARG_PREFIX_RANGE_P(1);
    bool       *result = (bool *) PG_GETARG_POINTER(2);

    *result = my_eq(v1, v2);
    PG_RETURN_POINTER(result);
}

For historical reasons, the same function doesn't just return a Boolean result; instead it has to store the flag at the location indicated by the third argument. The return value per se is ignored, though it's conventional to pass back the address of that argument.

distance

Given an index entry p and a query value q, this function determines the index entry's “distance” from the query value. This function must be supplied if the operator class contains any ordering operators. A query using the ordering operator will be implemented by returning index entries with the smallest “distance” values first, so the results must be consistent with the operator's semantics. For a leaf index entry the result just represents the distance to the index entry; for an internal tree node, the result must be the smallest distance that any child entry could have.

The SQL declaration of the function must look like this:

CREATE OR REPLACE FUNCTION my_distance(internal, data_type, smallint, oid, internal)
RETURNS float8
AS 'MODULE_PATHNAME'
LANGUAGE C STRICT;

And the matching code in the C module could then follow this skeleton:

PG_FUNCTION_INFO_V1(my_distance);

Datum
my_distance(PG_FUNCTION_ARGS)
{
    GISTENTRY  *entry = (GISTENTRY *) PG_GETARG_POINTER(0);
    data_type  *query = PG_GETARG_DATA_TYPE_P(1);
    StrategyNumber strategy = (StrategyNumber) PG_GETARG_UINT16(2);
    /* Oid subtype = PG_GETARG_OID(3); */
    /* bool *recheck = (bool *) PG_GETARG_POINTER(4); */
    data_type  *key = DatumGetDataType(entry->key);
    double      retval;

    /*
     * determine return value as a function of strategy, key and query.
     */

    PG_RETURN_FLOAT8(retval);
}

The arguments to the distance function are identical to the arguments of the consistent function.

Some approximation is allowed when determining the distance, so long as the result is never greater than the entry's actual distance. Thus, for example, distance to a bounding box is usually sufficient in geometric applications. For an internal tree node, the distance returned must not be greater than the distance to any of the child nodes. If the returned distance is not exact, the function must set *recheck to true. (This is not necessary for internal tree nodes; for them, the calculation is always assumed to be inexact.) In this case the executor will calculate the accurate distance after fetching the tuple from the heap, and reorder the tuples if necessary.

If the distance function returns *recheck = true for any leaf node, the original ordering operator's return type must be float8 or float4, and the distance function's result values must be comparable to those of the original ordering operator, since the executor will sort using both distance function results and recalculated ordering-operator results. Otherwise, the distance function's result values can be any finite float8 values, so long as the relative order of the result values matches the order returned by the ordering operator. (Infinity and minus infinity are used internally to handle cases such as nulls, so it is not recommended that distance functions return these values.)

fetch

Converts the compressed index representation of a data item into the original data type, for index-only scans. The returned data must be an exact, non-lossy copy of the originally indexed value.

The SQL declaration of the function must look like this:

CREATE OR REPLACE FUNCTION my_fetch(internal)
RETURNS internal
AS 'MODULE_PATHNAME'
LANGUAGE C STRICT;

The argument is a pointer to a GISTENTRY struct. On entry, its key field contains a non-NULL leaf datum in compressed form. The return value is another GISTENTRY struct, whose key field contains the same datum in its original, uncompressed form. If the opclass's compress function does nothing for leaf entries, the fetch method can return the argument as-is. Or, if the opclass does not have a compress function, the fetch method can be omitted as well, since it would necessarily be a no-op.

The matching code in the C module could then follow this skeleton:

PG_FUNCTION_INFO_V1(my_fetch);

Datum
my_fetch(PG_FUNCTION_ARGS)
{
    GISTENTRY  *entry = (GISTENTRY *) PG_GETARG_POINTER(0);
    input_data_type *in = DatumGetPointer(entry->key);
    fetched_data_type *fetched_data;
    GISTENTRY  *retval;

    retval = palloc(sizeof(GISTENTRY));
    fetched_data = palloc(sizeof(fetched_data_type));

    /*
     * Convert 'fetched_data' into the a Datum of the original datatype.
     */

    /* fill *retval from fetched_data. */
    gistentryinit(*retval, PointerGetDatum(converted_datum),
                  entry->rel, entry->page, entry->offset, FALSE);

    PG_RETURN_POINTER(retval);
}

If the compress method is lossy for leaf entries, the operator class cannot support index-only scans, and must not define a fetch function.

All the GiST support methods are normally called in short-lived memory contexts; that is, CurrentMemoryContext will get reset after each tuple is processed. It is therefore not very important to worry about pfree'ing everything you palloc. However, in some cases it's useful for a support method to cache data across repeated calls. To do that, allocate the longer-lived data in fcinfo->flinfo->fn_mcxt, and keep a pointer to it in fcinfo->flinfo->fn_extra. Such data will survive for the life of the index operation (e.g., a single GiST index scan, index build, or index tuple insertion). Be careful to pfree the previous value when replacing a fn_extra value, or the leak will accumulate for the duration of the operation.

67.3. B-Tree Support Functions

As shown in Table 38.9, btree defines one required and four optional support functions. The five user-defined methods are:

order

For each combination of data types that a btree operator family provides comparison operators for, it must provide a comparison support function, registered in pg_amproc with support function number 1 and amproclefttype/amprocrighttype equal to the left and right data types for the comparison (i.e., the same data types that the matching operators are registered with in pg_amop). The comparison function must take two non-null values A and B and return an int32 value that is < 0, 0, or > 0 when A < B, A = B, or A > B, respectively. A null result is disallowed: all values of the data type must be comparable. See src/backend/access/nbtree/nbtcompare.c for examples.

If the compared values are of a collatable data type, the appropriate collation OID will be passed to the comparison support function, using the standard PG_GET_COLLATION() mechanism.

sortsupport

Optionally, a btree operator family may provide sort support function(s), registered under support function number 2. These functions allow implementing comparisons for sorting purposes in a more efficient way than naively calling the comparison support function. The APIs involved in this are defined in src/include/utils/sortsupport.h.

in_range

Optionally, a btree operator family may provide in_range support function(s), registered under support function number 3. These are not used during btree index operations; rather, they extend the semantics of the operator family so that it can support window clauses containing the RANGE offset PRECEDING and RANGE offset FOLLOWING frame bound types (see Section 4.2.8). Fundamentally, the extra information provided is how to add or subtract an offset value in a way that is compatible with the family's data ordering.

An in_range function must have the signature

in_range(val type1, base type1, offset type2, sub bool, less bool)
returns bool

val and base must be of the same type, which is one of the types supported by the operator family (i.e., a type for which it provides an ordering). However, offset could be of a different type, which might be one otherwise unsupported by the family. An example is that the built-in time_ops family provides an in_range function that has offset of type interval. A family can provide in_range functions for any of its supported types and one or more offset types. Each in_range function should be entered in pg_amproc with amproclefttype equal to type1 and amprocrighttype equal to type2.

The essential semantics of an in_range function depend on the two Boolean flag parameters. It should add or subtract base and offset, then compare val to the result, as follows:

if !sub and !less, return val >= (base + offset)
if !sub and less, return val <= (base + offset)
if sub and !less, return val >= (base - offset)
if sub and less, return val <= (base - offset)

Before doing so, the function should check the sign of offset: if it is less than zero, raise error ERRCODE_INVALID_PRECEDING_OR_FOLLOWING_SIZE (22013) with error text like “invalid preceding or following size in window function”. (This is required by the SQL standard, although nonstandard operator families might perhaps choose to ignore this restriction, since there seems to be little semantic necessity for it.) This requirement is delegated to the in_range function so that the core code needn't understand what “less than zero” means for a particular data type.

An additional expectation is that in_range functions should, if practical, avoid throwing an error if base + offset or base - offset would overflow. The correct comparison result can be determined even if that value would be out of the data type's range. Note that if the data type includes concepts such as “infinity” or “NaN”, extra care may be needed to ensure that in_range's results agree with the normal sort order of the operator family.

The results of the in_range function must be consistent with the sort ordering imposed by the operator family. To be precise, given any fixed values of offset and sub, then:

If in_range with less = true is true for some val1 and base, it must be true for every val2 <= val1 with the same base.
If in_range with less = true is false for some val1 and base, it must be false for every val2 >= val1 with the same base.
If in_range with less = true is true for some val and base1, it must be true for every base2 >= base1 with the same val.
If in_range with less = true is false for some val and base1, it must be false for every base2 <= base1 with the same val.

Analogous statements with inverted conditions hold when less = false.

If the type being ordered (type1) is collatable, the appropriate collation OID will be passed to the in_range function, using the standard PG_GET_COLLATION() mechanism.

in_range functions need not handle NULL inputs, and typically will be marked strict.

equalimage

Optionally, a btree operator family may provide equalimage (“equality implies image equality”) support functions, registered under support function number 4. These functions allow the core code to determine when it is safe to apply the btree deduplication optimization. Currently, equalimage functions are only called when building or rebuilding an index.

An equalimage function must have the signature

equalimage(opcintype oid) returns bool

The return value is static information about an operator class and collation. Returning true indicates that the order function for the operator class is guaranteed to only return 0 (“arguments are equal”) when its A and B arguments are also interchangeable without any loss of semantic information. Not registering an equalimage function or returning false indicates that this condition cannot be assumed to hold.

The opcintype argument is the pg_type.oid of the data type that the operator class indexes. This is a convenience that allows reuse of the same underlying equalimage function across operator classes. If opcintype is a collatable data type, the appropriate collation OID will be passed to the equalimage function, using the standard PG_GET_COLLATION() mechanism.

As far as the operator class is concerned, returning true indicates that deduplication is safe (or safe for the collation whose OID was passed to its equalimage function). However, the core code will only deem deduplication safe for an index when every indexed column uses an operator class that registers an equalimage function, and each function actually returns true when called.

Image equality is almost the same condition as simple bitwise equality. There is one subtle difference: When indexing a varlena data type, the on-disk representation of two image equal datums may not be bitwise equal due to inconsistent application of TOAST compression on input. Formally, when an operator class's equalimage function returns true, it is safe to assume that the datum_image_eq() C function will always agree with the operator class's order function (provided that the same collation OID is passed to both the equalimage and order functions).

The core code is fundamentally unable to deduce anything about the “equality implies image equality” status of an operator class within a multiple-data-type family based on details from other operator classes in the same family. Also, it is not sensible for an operator family to register a cross-type equalimage function, and attempting to do so will result in an error. This is because “equality implies image equality” status does not just depend on sorting/equality semantics, which are more or less defined at the operator family level. In general, the semantics that one particular data type implements must be considered separately.

The convention followed by the operator classes included with the core PostgreSQL distribution is to register a stock, generic equalimage function. Most operator classes register btequalimage(), which indicates that deduplication is safe unconditionally. Operator classes for collatable data types such as text register btvarstrequalimage(), which indicates that deduplication is safe with deterministic collations. Best practice for third-party extensions is to register their own custom function to retain control.

options

Optionally, a B-tree operator family may provide options (“operator class specific options”) support functions, registered under support function number 5. These functions define a set of user-visible parameters that control operator class behavior.

An options support function must have the signature

options(relopts local_relopts *) returns void

The function is passed a pointer to a local_relopts struct, which needs to be filled with a set of operator class specific options. The options can be accessed from other support functions using the PG_HAS_OPCLASS_OPTIONS() and PG_GET_OPCLASS_OPTIONS() macros.

Currently, no B-Tree operator class has an options support function. B-tree doesn't allow flexible representation of keys like GiST, SP-GiST, GIN and BRIN do. So, options probably doesn't have much application in the current B-tree index access method. Nevertheless, this support function was added to B-tree for uniformity, and will probably find uses during further evolution of B-tree in PostgreSQL.

VII. 資料庫進階

52. PostgreSQL 的內部架構

52.1. 處理查詢語句的流程

52.2. 連線是如何被建立的

52.3. 解析器階段

52.3.1. 解析器

註記

52.3.2.

52.4. The PostgreSQL Rule System

52.5. Planner/Optimizer

Note

50.5.1. Generating Possible Plans

52.6. Executor

53. 系統資訊目錄

51.3. pg_am

Table 53.3. pg_am Columns

Note

51.7. pg_attribute

Table 51.7. pg_attribute Columns

51.8. pg_authid

Table 51.8. pg_authid Columns

51.9. pg_auth_members

51.10. pg_cast

Table 51.10. pg_cast Columns

51.11 pg_class

Table 51.11. pg_class Columns

51.12. pg_collation

51.13. pg_constraint

Note

Note

51.21. pg_event_trigger

Table 51.21. pg_event_trigger Columns

51.22. pg_extension

51.26 pg_index

Table 51.26. pg_index Columns

51.29. pg_language

Table 51.29. pg_language Columns

51.32. pg_namespace

Table 51.32. pg_namespace Columns

51.33. pg_opclass

51.52. pg_subscription

Table 51.52. pg_subscription Columns

51.72. pg_indexes

Table 51.73. pg_indexes Columns

54.35. pg_views

Table 54.35. pg_views Columns

52.2. Message Flow

52.3. SASL Authentication

52.6. Message Data Types

52.7. Message Formats

52.8. Error and Notice Message Fields

52.9. Logical Replication Message Formats

52.10. Summary of Changes since Protocol 2.0

56. PostgreSQL 程式撰寫慣例

57. Native Language Support

67. B-Tree Indexes

67.2. Behavior of B-Tree Operator Classes

68. GiST Indexes

52. PostgreSQL 的內部架構

51.3. pg_am

Table 53.3. pg_am Columns

Note

51.11 pg_class

Table 51.11. pg_class Columns

52.3. 解析器階段

52.3.1. 解析器

註記

52.3.2.

VII. 資料庫進階

51.72. pg_indexes

Table 51.73. pg_indexes Columns

54.35. pg_views

Table 54.35. pg_views Columns

52.1. 處理查詢語句的流程

52.5. Planner/Optimizer

Note

50.5.1. Generating Possible Plans

52.4. The PostgreSQL Rule System

52.2. 連線是如何被建立的

51.9. pg_auth_members

Table 53.3. `pg_am` Columns

Table 51.7. `pg_attribute` Columns

Table 51.8. `pg_authid` Columns

Table 51.10. `pg_cast` Columns

Table 51.11. `pg_class` Columns

Table 51.21. `pg_event_trigger` Columns

Table 51.26. `pg_index` Columns

Table 51.29. `pg_language` Columns

Table 51.32. `pg_namespace` Columns

Table 51.52. `pg_subscription` Columns

Table 51.73. `pg_indexes` Columns

Table 54.35. `pg_views` Columns

Table 53.3. `pg_am` Columns

Table 51.11. `pg_class` Columns

Table 51.73. `pg_indexes` Columns

Table 54.35. `pg_views` Columns

Table 51.7. `pg_attribute` Columns

Table 51.8. `pg_authid` Columns

Table 51.10. `pg_cast` Columns

Table 51.21. `pg_event_trigger` Columns

Table 51.29. `pg_language` Columns

Table 51.26. `pg_index` Columns

Table 51.52. `pg_subscription` Columns

Table 51.32. `pg_namespace` Columns

Table 51.49. `pg_statistic` Columns

Table 51.50. `pg_statistic_ext` Columns

Table 51.53. `pg_subscription_rel` Columns

Table 51.67. `pg_available_extensions` Columns

Table 51.39. `pg_proc` Columns

Table 51.72. `pg_hba_file_rules` Columns

Table 51.54. `pg_tablespace` Columns

Table 51.68. `pg_available_extension_versions` Columns

Table 51.74. `pg_locks` Columns

Table 51.78. `pg_prepared_xacts` Columns

Table 54.24. `pg_settings` Columns

Table 54.25. `pg_shadow` Columns

Table 54.31. `pg_timezone_abbrevs` Columns

Table 54.27. `pg_stats` Columns

Table 54.30. `pg_tables` Columns