Author
This chapter originated as part of [sim98], Stefan Simkovics' Master's Thesis prepared at Vienna University of Technology under the direction of O.Univ.Prof.Dr. Georg Gottlob and Univ.Ass. Mag. Katrin Seyr.
This chapter gives an overview of the internal structure of PostgreSQL. After reading the following sections you should have an idea of how a query is processed. This chapter is not intended to be a detailed description of the internal operation of PostgreSQL, as such a description would be far too extensive. Rather, it is intended to help the reader understand the general sequence of operations that occur within the backend from the point at which a query is received, to the point at which the results are returned to the client.
The executor takes the plan created by the planner/optimizer and recursively processes it to extract the required set of rows. This is essentially a demand-pull pipeline mechanism. Each time a plan node is called, it must deliver one more row, or report that it is done delivering rows.
To provide a concrete example, assume that the top node is a MergeJoin node. Before any merge can be done two rows have to be fetched (one from each subplan). So the executor recursively calls itself to process the subplans (it starts with the subplan attached to lefttree). The new top node (the top node of the left subplan) is, let's say, a Sort node and again recursion is needed to obtain an input row. The child node of the Sort might be a SeqScan node, representing actual reading of a table. Execution of this node causes the executor to fetch a row from the table and return it up to the calling node. The Sort node will repeatedly call its child to obtain all the rows to be sorted. When the input is exhausted (as indicated by the child node returning a NULL instead of a row), the Sort code performs the sort, and finally is able to return its first output row, namely the first one in sorted order. It keeps the remaining rows stored so that it can deliver them in sorted order in response to later demands.
The MergeJoin node similarly demands the first row from its right subplan. Then it compares the two rows to see if they can be joined; if so, it returns a join row to its caller. On the next call, or immediately if it cannot join the current pair of inputs, it advances to the next row of one table or the other (depending on how the comparison came out), and again checks for a match. Eventually, one subplan or the other is exhausted, and the MergeJoin node returns NULL to indicate that no more join rows can be formed.
Complex queries can involve many levels of plan nodes, but the general approach is the same: each node computes and returns its next output row each time it is called. Each node is also responsible for applying any selection or projection expressions that were assigned to it by the planner.
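As an illustration, here is a small sketch of what such a plan can look like when viewed with EXPLAIN. The table names are made up, and the plan shown in the comments is only one possibility; the planner's actual choice depends on table sizes, statistics, and configuration.

```sql
-- Made-up tables; the plan in the comments is only one possible outcome.
CREATE TABLE a (id int, payload text);
CREATE TABLE b (id int, note text);

SET enable_hashjoin = off;   -- nudge the planner toward a merge join
SET enable_nestloop = off;

EXPLAIN SELECT * FROM a JOIN b USING (id);
-- A possible plan, matching the walk-through above:
--  Merge Join
--    Merge Cond: (a.id = b.id)
--    ->  Sort
--          Sort Key: a.id
--          ->  Seq Scan on a
--    ->  Sort
--          Sort Key: b.id
--          ->  Seq Scan on b
```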
The executor mechanism is used to evaluate all four basic SQL query types: SELECT, INSERT, UPDATE, and DELETE. For SELECT, the top-level executor code only needs to send each row returned by the query plan tree off to the client. For INSERT, each returned row is inserted into the target table specified for the INSERT. This is done in a special top-level plan node called ModifyTable. (A simple INSERT ... VALUES command creates a trivial plan tree consisting of a single Result node, which computes just one result row, and ModifyTable above it to perform the insertion. But INSERT ... SELECT can demand the full power of the executor mechanism.) For UPDATE, the planner arranges that each computed row includes all the updated column values, plus the TID (tuple ID, or row ID) of the original target row; this data is fed into a ModifyTable node, which uses the information to create a new updated row and mark the old row deleted. For DELETE, the only column that is actually returned by the plan is the TID, and the ModifyTable node simply uses the TID to visit each target row and mark it deleted.
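The following sketch (with a made-up table name) shows how such ModifyTable plans typically appear in EXPLAIN output; the exact node labels can vary between PostgreSQL versions.

```sql
-- Made-up table; plan display is only illustrative.
CREATE TABLE t (id int, val text);

EXPLAIN INSERT INTO t VALUES (1, 'x');
-- A simple INSERT ... VALUES gives a trivial tree: the ModifyTable node
-- (displayed as "Insert on t") on top of a single Result node:
--  Insert on t
--    ->  Result

EXPLAIN UPDATE t SET val = 'y' WHERE id = 1;
-- The scan feeds each matching row (plus its TID) into the ModifyTable node:
--  Update on t
--    ->  Seq Scan on t
--          Filter: (id = 1)
```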
The Parser Stage
The parser stage consists of two parts:
The parser, defined in gram.y and scan.l, is built using the Unix tools bison and flex.
The transformation process does modifications and augmentations to the data structures returned by the parser.
The parser has to check the query string (which arrives as plain text) for valid syntax. If the syntax is correct, a parse tree is built up and handed back; otherwise an error is returned. The parser and lexer are implemented using the well-known Unix tools bison and flex.
The lexer is defined in the file scan.l and is responsible for recognizing identifiers, the SQL key words, etc. For every identifier or key word found, a token is generated and handed to the parser.
The parser is defined in the file gram.y and consists of a set of grammar rules and actions that are executed whenever a rule is fired. The code of the actions (which is actually C code) is used to build up the parse tree.
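For example, a statement that violates the grammar is rejected already at this stage, before any tables or columns are even looked up. The error text shown in the comment below is only indicative and may differ slightly between versions.

```sql
-- A grammatically valid statement is accepted by the parser (whether the
-- referenced objects exist is checked later, not here):
SELECT 1 AS answer;

-- A statement that violates the grammar is rejected by the parser stage;
-- the exact message may differ between versions:
SELECT * FROM;
-- ERROR:  syntax error at or near ";"
```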
The file scan.l is transformed by the program flex into the C source file scan.c, and gram.y is transformed by bison into gram.c. After these transformations have taken place, a normal C compiler can be used to create the parser. Never make any changes to the generated C files, as they will be overwritten the next time flex or bison is called.
The mentioned transformations and compilations are normally done automatically using the makefiles shipped with the PostgreSQL source distribution.
A detailed description of bison or the grammar rules given in gram.y would be beyond the scope of this document. There are many books and other documentation dealing with flex and bison. You should be familiar with bison before you start to study the grammar given in gram.y, otherwise it will be hard to understand what happens there.
The task of the planner/optimizer is to create an optimal execution plan. A given SQL query (and hence, a query tree) can be actually executed in a wide variety of different ways, each of which will produce the same set of results. If it is computationally feasible, the query optimizer will examine each of these possible execution plans, ultimately selecting the execution plan that is expected to run the fastest.
In some situations, examining each possible way in which a query can be executed would take an excessive amount of time and memory space. In particular, this occurs when executing queries involving large numbers of join operations. In order to determine a reasonable (not necessarily optimal) query plan in a reasonable amount of time, PostgreSQL uses a Genetic Query Optimizer (see Chapter 59) when the number of joins exceeds a threshold (see geqo_threshold).
The planner's search procedure actually works with data structures called paths, which are simply cut-down representations of plans containing only as much information as the planner needs to make its decisions. After the cheapest path is determined, a full-fledged plan tree is built to pass to the executor. This represents the desired execution plan in sufficient detail for the executor to run it. In the rest of this section we'll ignore the distinction between paths and plans.
The planner/optimizer starts by generating plans for scanning each individual relation (table) used in the query. The possible plans are determined by the available indexes on each relation. There is always the possibility of performing a sequential scan on a relation, so a sequential scan plan is always created. Assume an index is defined on a relation (for example a B-tree index) and a query contains the restriction relation.attribute OPR constant. If relation.attribute happens to match the key of the B-tree index and OPR is one of the operators listed in the index's operator class, another plan is created using the B-tree index to scan the relation. If there are further indexes present and the restrictions in the query happen to match a key of an index, further plans will be considered. Index scan plans are also generated for indexes that have a sort ordering that can match the query's ORDER BY clause (if any), or a sort ordering that might be useful for merge joining (see below).
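An example, using a made-up table and index: whether the planner actually prefers the index scan over the sequential scan depends on table size and statistics, so the plans in the comments are just one possible outcome.

```sql
-- Made-up table and index; plan choices depend on statistics.
CREATE TABLE accounts (id int, balance numeric);
CREATE INDEX accounts_id_idx ON accounts (id);

EXPLAIN SELECT * FROM accounts WHERE id = 42;
-- The restriction accounts.id = 42 matches the index key, so an index scan
-- plan is generated in addition to the always-available sequential scan:
--  Index Scan using accounts_id_idx on accounts
--    Index Cond: (id = 42)

EXPLAIN SELECT * FROM accounts WHERE balance > 0;
-- No index matches this restriction, so only a sequential scan remains:
--  Seq Scan on accounts
--    Filter: (balance > '0'::numeric)
```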
If the query requires joining two or more relations, plans for joining relations are considered after all feasible plans have been found for scanning single relations. The three available join strategies are:
nested loop join: The right relation is scanned once for every row found in the left relation. This strategy is easy to implement but can be very time consuming. (However, if the right relation can be scanned with an index scan, this can be a good strategy. It is possible to use values from the current row of the left relation as keys for the index scan of the right.)
merge join: Each relation is sorted on the join attributes before the join starts. Then the two relations are scanned in parallel, and matching rows are combined to form join rows. This kind of join is more attractive because each relation has to be scanned only once. The required sorting might be achieved either by an explicit sort step, or by scanning the relation in the proper order using an index on the join key.
hash join: the right relation is first scanned and loaded into a hash table, using its join attributes as hash keys. Next the left relation is scanned and the appropriate values of every row found are used as hash keys to locate the matching rows in the table.
When the query involves more than two relations, the final result must be built up by a tree of join steps, each with two inputs. The planner examines different possible join sequences to find the cheapest one.
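For instance, a three-way join is always executed as a tree of two-input joins. The sketch below uses invented tables; the join order and strategies shown in the comments are merely one plan the planner might settle on.

```sql
-- Invented tables; the plan in the comments is only one possibility.
CREATE TABLE customers (cust_id int PRIMARY KEY, name text);
CREATE TABLE orders    (order_id int PRIMARY KEY, cust_id int, total numeric);
CREATE TABLE items     (order_id int, product text);

EXPLAIN
SELECT c.name, i.product
FROM customers c
JOIN orders o ON o.cust_id = c.cust_id
JOIN items  i ON i.order_id = o.order_id;
-- One possible result: a tree of two-input join nodes over base-relation scans.
--  Hash Join
--    Hash Cond: (i.order_id = o.order_id)
--    ->  Seq Scan on items i
--    ->  Hash
--          ->  Hash Join
--                Hash Cond: (o.cust_id = c.cust_id)
--                ->  Seq Scan on orders o
--                ->  Hash
--                      ->  Seq Scan on customers c
```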
If the query uses fewer than geqo_threshold relations, a near-exhaustive search is conducted to find the best join sequence. The planner preferentially considers joins between any two relations for which there exists a corresponding join clause in the WHERE qualification (i.e., for which a restriction like where rel1.attr1=rel2.attr2 exists). Join pairs with no join clause are considered only when there is no other choice, that is, a particular relation has no available join clauses to any other relation. All possible plans are generated for every join pair considered by the planner, and the one that is (estimated to be) the cheapest is chosen.
When geqo_threshold is exceeded, the join sequences considered are determined by heuristics, as described in Chapter 59. Otherwise the process is the same.
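The threshold itself is an ordinary configuration parameter and can be inspected or changed per session; the default shown in the comment is the usual one but may differ in your installation.

```sql
-- geqo_threshold is an ordinary configuration parameter; 12 is the usual
-- default, but check your installation.
SHOW geqo_threshold;
-- Queries with at least this many FROM items are planned with GEQO:
SET geqo_threshold = 14;
```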
The finished plan tree consists of sequential or index scans of the base relations, plus nested-loop, merge, or hash join nodes as needed, plus any auxiliary steps needed, such as sort nodes or aggregate-function calculation nodes. Most of these plan node types have the additional ability to do selection (discarding rows that do not meet a specified Boolean condition) and projection (computation of a derived column set based on given column values, that is, evaluation of scalar expressions where needed). One of the responsibilities of the planner is to attach selection conditions from the WHERE clause and computation of required output expressions to the most appropriate nodes of the plan tree.
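EXPLAIN VERBOSE makes this visible: the Filter line shows the selection attached to a node and the Output line shows its projection. The example below assumes the accounts table from the earlier sketch, and the plan in the comments is only illustrative.

```sql
-- Assumes the accounts table from the earlier sketch; plan is illustrative.
EXPLAIN (VERBOSE true)
SELECT id, balance * 1.1 FROM accounts WHERE balance > 0;
-- The Filter line is the selection attached to the scan node, the Output
-- line is its projection:
--  Seq Scan on public.accounts
--    Output: id, (balance * 1.1)
--    Filter: (accounts.balance > '0'::numeric)
```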
Here we give a short overview of the stages a query has to pass through in order to obtain a result.
A connection from an application program to the PostgreSQL server has to be established. The application program transmits a query to the server and waits to receive the results sent back by the server.
The parser stage checks the query transmitted by the application program for correct syntax and creates a query tree.
The rewrite system takes the query tree created by the parser stage and looks for any rules (stored in the system catalogs) to apply to the query tree. It performs the transformations given in the rule bodies. One application of the rewrite system is the realization of views. Whenever a query against a view (i.e., a virtual table) is made, the rewrite system rewrites the user's query to a query that accesses the base tables given in the view definition instead.
The planner/optimizer takes the (rewritten) query tree and creates a query plan that will be the input to the executor.
It does so by first creating all possible paths leading to the same result. For example, if there is an index on a relation to be scanned, there are two paths for the scan. One possibility is a simple sequential scan and the other possibility is to use the index. Next the cost for the execution of each path is estimated and the cheapest path is chosen. The cheapest path is expanded into a complete plan that the executor can use.
The executor recursively steps through the plan tree and retrieves rows in the way represented by the plan. The executor makes use of the storage system while scanning relations, performs sorts and joins, evaluates qualifications, and finally hands back the rows derived.
In the following sections we will cover each of the above listed items in more detail to give a better understanding of PostgreSQL's internal control and data structures.
PostgreSQL supports a powerful rule system for the specification of views and ambiguous view updates. Originally the PostgreSQL rule system consisted of two implementations:
The first one worked using row level processing and was implemented deep in the executor. The rule system was called whenever an individual row had been accessed. This implementation was removed in 1995 when the last official release of the Berkeley Postgres project was transformed into Postgres95.
The second implementation of the rule system is a technique called query rewriting. The rewrite system is a module that exists between the parser stage and the planner/optimizer. This technique is still implemented.
The query rewriter is discussed in some detail in Chapter 40, so there is no need to cover it here. We will only point out that both the input and the output of the rewriter are query trees, that is, there is no change in the representation or level of semantic detail in the trees. Rewriting can be thought of as a form of macro expansion.
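A quick illustration with invented names: after the rewriter has expanded the view, the plan produced for a query against the view references only the base table.

```sql
-- Invented names; after rewriting, the plan for a query against the view
-- references only the base table.
CREATE TABLE employees (id int, name text, salary numeric);
CREATE VIEW well_paid AS
    SELECT id, name FROM employees WHERE salary > 100000;

EXPLAIN SELECT * FROM well_paid;
-- A possible plan; note that the view itself does not appear:
--  Seq Scan on employees
--    Filter: (salary > '100000'::numeric)
```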
How Connections Are Established
PostgreSQL implements a "process per user" client/server model. In this model, every client process connects to exactly one backend process. As we do not know ahead of time how many connections will be made, we have to use a "supervisor process" that spawns a new backend process every time a connection is requested. This supervisor process is called postmaster and listens at a specified TCP/IP port for incoming connections. Whenever it detects a request for a connection, it spawns a new backend process. Those backend processes communicate with each other and with other processes of the instance using semaphores and shared memory to ensure data integrity throughout concurrent data access.
The client process can be any program that understands the PostgreSQL protocol described in Chapter 55. Many clients are based on the C-language library libpq, but several independent implementations of the protocol exist, such as the Java JDBC driver.
Once a connection is established, the client process can send a query to the backend process it is connected to. The query is transmitted using plain text, i.e., there is no parsing done in the client. The backend process parses the query, creates an execution plan, executes the plan, and returns the retrieved rows to the client by transmitting them over the established connection.