如果主伺服器發生故障,則備用伺服器應該開始故障轉移程序。
如果備用伺服器發生故障,則毌須進行故障轉移。如果備用伺服器可以重新啟動,即使是在某個時間點之後,也可以利用可重新啟動的還原功能立即重新啟動還原程序。如果備用伺服器無法重新啟動,則應該重新建立一個完整的備用伺服器。
如果主伺服器發生故障,並且備用伺服器成為新的主伺服器,然後舊的主伺服器重新啟動,則必須具有一種機制,通知舊的主伺服器不再是主伺服器。這有時被稱為 STONITH(Shoot The Other Node In The Head),這是避免兩個系統都認為它們是主要系統所必須要做的事,這種情況會導致混亂並導致資料損毁。
Many failover systems use just two systems, the primary and the standby, connected by some kind of heartbeat mechanism to continually verify the connectivity between the two and the viability of the primary. It is also possible to use a third system (called a witness server) to prevent some cases of inappropriate failover, but the additional complexity might not be worthwhile unless it is set up with sufficient care and rigorous testing.
PostgreSQL does not provide the system software required to identify a failure on the primary and notify the standby database server. Many such tools exist and are well integrated with the operating system facilities required for successful failover, such as IP address migration.
Once failover to the standby occurs, there is only a single server in operation. This is known as a degenerate state. The former standby is now the primary, but the former primary is down and might stay down. To return to normal operation, a standby server must be recreated, either on the former primary system when it comes up, or on a third, possibly new, system. The pg_rewind utility can be used to speed up this process on large clusters. Once complete, the primary and standby can be considered to have switched roles. Some people choose to use a third server to provide backup for the new primary until the new standby server is recreated, though clearly this complicates the system configuration and operational processes.
So, switching from primary to standby server can be fast but requires some time to re-prepare the failover cluster. Regular switching from primary to standby is useful, since it allows regular downtime on each system for maintenance. This also serves as a test of the failover mechanism to ensure that it will really work when you need it. Written administration procedures are advised.
要觸發日誌傳送備用伺服器的故障轉移,請執行 pg_ctl promote
、呼叫 pg_promote
或建立一個事件觸發的執行腳本檔案,該檔案名稱及路徑由 promot_trigger_file 指定。如果您打算使用 pg_ctl promote
或呼叫 pg_promote
進行故障轉移,則不需要 promote_trigger_file。 如果要設定僅用於從主伺服器唯讀查詢(而不是出於高可用性目的)的報表伺服器,則毌須進行故障轉移。
An alternative to the built-in standby mode described in the previous sections is to use a restore_command
that polls the archive location. This was the only option available in versions 8.4 and below. See the pg_standby module for a reference implementation of this.
Note that in this mode, the server will apply WAL one file at a time, so if you use the standby server for queries (see Hot Standby), there is a delay between an action in the master and when the action becomes visible in the standby, corresponding the time it takes to fill up the WAL file. archive_timeout
can be used to make that delay shorter. Also note that you can't combine streaming replication with this method.
The operations that occur on both primary and standby servers are normal continuous archiving and recovery tasks. The only point of contact between the two database servers is the archive of WAL files that both share: primary writing to the archive, standby reading from the archive. Care must be taken to ensure that WAL archives from separate primary servers do not become mixed together or confused. The archive need not be large if it is only required for standby operation.
The magic that makes the two loosely coupled servers work together is simply a restore_command
used on the standby that, when asked for the next WAL file, waits for it to become available from the primary. Normal recovery processing would request a file from the WAL archive, reporting failure if the file was unavailable. For standby processing it is normal for the next WAL file to be unavailable, so the standby must wait for it to appear. For files ending in .history
there is no need to wait, and a non-zero return code must be returned. A waiting restore_command
can be written as a custom script that loops after polling for the existence of the next WAL file. There must also be some way to trigger failover, which should interrupt the restore_command
, break the loop and return a file-not-found error to the standby server. This ends recovery and the standby will then come up as a normal server.
Pseudocode for a suitable restore_command
is:
A working example of a waiting restore_command
is provided in the pg_standby module. It should be used as a reference on how to correctly implement the logic described above. It can also be extended as needed to support specific configurations and environments.
The method for triggering failover is an important part of planning and design. One potential option is the restore_command
command. It is executed once for each WAL file, but the process running the restore_command
is created and dies for each file, so there is no daemon or server process, and signals or a signal handler cannot be used. Therefore, the restore_command
is not suitable to trigger failover. It is possible to use a simple timeout facility, especially if used in conjunction with a known archive_timeout
setting on the primary. However, this is somewhat error prone since a network problem or busy primary server might be sufficient to initiate failover. A notification mechanism such as the explicit creation of a trigger file is ideal, if this can be arranged.
The short procedure for configuring a standby server using this alternative method is as follows. For full details of each step, refer to previous sections as noted.
Set up primary and standby systems as nearly identical as possible, including two identical copies of PostgreSQL at the same release level.
Set up continuous archiving from the primary to a WAL archive directory on the standby server. Ensure that archive_mode, archive_command and archive_timeout are set appropriately on the primary (see Section 25.3.1).
Make a base backup of the primary server (see Section 25.3.2), and load this data onto the standby.
Begin recovery on the standby server from the local WAL archive, using restore_command
that waits as described previously (see Section 25.3.4).
Recovery treats the WAL archive as read-only, so once a WAL file has been copied to the standby system it can be copied to tape at the same time as it is being read by the standby database server. Thus, running a standby server for high availability can be performed at the same time as files are stored for longer term disaster recovery purposes.
For testing purposes, it is possible to run both primary and standby servers on the same system. This does not provide any worthwhile improvement in server robustness, nor would it be described as HA.
It is also possible to implement record-based log shipping using this alternative method, though this requires custom development, and changes will still only become visible to hot standby queries after a full WAL file has been shipped.
An external program can call the pg_walfile_name_offset()
function (see Section 9.27) to find out the file name and the exact byte offset within it of the current end of WAL. It can then access the WAL file directly and copy the data from the last known end of WAL through the current end over to the standby servers. With this approach, the window for data loss is the polling cycle time of the copying program, which can be very small, and there is no wasted bandwidth from forcing partially-used segment files to be archived. Note that the standby servers' restore_command
scripts can only deal with whole WAL files, so the incrementally copied data is not ordinarily made available to the standby servers. It is of use only when the primary dies — then the last partial WAL file is fed to the standby before allowing it to come up. The correct implementation of this process requires cooperation of the restore_command
script with the data copying program.
Starting with PostgreSQL version 9.0, you can use streaming replication (see Section 26.2.5) to achieve the same benefits with less effort.
共享磁碟的故障轉移透過僅使用一份資料庫檔案來避免同步程序的花費。它使用由多個伺服器共享的同一個磁碟陣列。如果主要資料庫伺服器發生故障,備用伺服器就可以掛載並啟動資料庫,就好像它正在從資料庫崩潰後恢復一樣。這樣可以進行快速的故障轉移而不會遺失資料。
共享硬體的功能在網路儲存設備中很常見。儘管必須注意檔案系統具有完整的 POSIX 行為,但也可以使用網路檔案系統(請參閱 18.2.2.1 節)。此方法的一個重要限制是,如果共享磁碟陣列發生故障或損壞,則主要伺服器和備用伺服器都將無法運作。另一個問題是,在主要伺服器運行時,備用伺服器永遠都不應存取共享的資料庫檔案。
共享硬碟功能的另一種版本是檔案系統複製,其中對檔案系統的所有變更都將鏡像同步到另一台主機上的檔案系統。唯一的限制是必須確保備用伺服器具有檔案系統一致副本的方式進行鏡像-特別是,寫入備用資料庫的順序必須與主要伺服器上的順序相同。DRBD 是針對 Linux 常用檔案系統複製的解決方案。
透過讀取預寫日誌(WAL)記錄串流,可以使熱備用伺服器保持最新狀態。如果主要伺服器發生故障,則備用資料庫幾乎將包含主要伺服器的所有資料,並且可以迅速成為新的主要資料庫伺服器。這可以是同步或非同步的,不過只能以整個資料庫伺服器為單位來實行。
備用伺服器可以使用基於檔案的日誌傳送(第 26.2 節)或串流式複寫(請參閱第 26.2.5 節)或兩者的結合來實現。有關熱備用伺服器的說明,請參閱第 26.5 節。
邏輯複寫讓資料庫伺服器可以將資料修改的過程發送到另一台伺服器。 PostgreSQL 的邏輯複寫會從 WAL 產生邏輯資料修改串流。邏輯複寫允許複寫單個資料表中的資料變更。邏輯複寫不需要將特定的伺服器指定為主要伺服器或備用伺服器,而是允許資料沿多個方向流動。有關邏輯複寫的更多資訊,請參閱第 30 章。透過邏輯解譯介面(第 48 章),第三方延伸套件也有機會提供類似的功能。
A master-standby replication setup sends all data modification queries to the master server. The master server asynchronously sends data changes to the standby server. The standby can answer read-only queries while the master server is running. The standby server is ideal for data warehouse queries.
Slony-I is an example of this type of replication, with per-table granularity, and support for multiple standby servers. Because it updates the standby server asynchronously (in batches), there is possible data loss during fail over.
With SQL-based replication middleware, a program intercepts every SQL query and sends it to one or all servers. Each server operates independently. Read-write queries must be sent to all servers, so that every server receives any changes. But read-only queries can be sent to just one server, allowing the read workload to be distributed among them.
If queries are simply broadcast unmodified, functions like random()
, CURRENT_TIMESTAMP
, and sequences can have different values on different servers. This is because each server operates independently, and because SQL queries are broadcast (and not actual modified rows). If this is unacceptable, either the middleware or the application must query such values from a single server and then use those values in write queries. Another option is to use this replication option with a traditional master-standby setup, i.e., data modification queries are sent only to the master and are propagated to the standby servers via master-standby replication, not by the replication middleware. Care must also be taken that all transactions either commit or abort on all servers, perhaps using two-phase commit (PREPARE TRANSACTION and COMMIT PREPARED). Pgpool-II and Continuent Tungsten are examples of this type of replication.
對於不定期連線或通訊網路速度較慢的伺服器(例如筆記型電腦或遠端伺服器),保持伺服器之間的資料一致性是一個挑戰。使用 Asynchronous Multimaster Replication,每個伺服器獨立工作,並定期與其他伺服器溝通以識別衝突的交易事務。可以透過使用者或訂定規則來解決衝突。Bucardo 是這種複寫方式的一個範例。
在 Synchronous Multimaster Replication 中,每個伺服器都可以接受寫入請求,並且在每個事務提交之前,已修改的資料將從原始伺服器傳輸到所有其他伺服器之中。繁重的寫入活動可能導致過多的鎖定和提交延遲,從而導致效能下降。讀取請求可以發送到任何一個伺服器。有一些實作方案使用共享磁碟來減少通訊成本。同步多重主要伺服器複寫最適合大多數為讀取工作的負載情況,儘管它的最大優點是任何伺服器都可以接受寫入請求-毋須在主要伺服器和備用伺服器之間區分工作負載,並且因為資料變更是由一台伺服器發送到另一台伺服器的,像 random() 這樣的不確定結果的函數也沒有問題。
PostgreSQL 並不提供這種複寫機制,儘管可以使用 PostgreSQL 兩階段提交(PREPARE TRANSACTION 和 COMMIT PREPARED)在應用程式或中間層服務中實作這種複寫方式。
Table 26.1 總結了上面各種解決方案的功能。
There are a few solutions that do not fit into the above categories:
Data partitioning splits tables into data sets. Each set can be modified by only one server. For example, data can be partitioned by offices, e.g., London and Paris, with a server in each office. If queries combining London and Paris data are necessary, an application can query both servers, or master/standby replication can be used to keep a read-only copy of the other office's data on each server.
Many of the above solutions allow multiple servers to handle multiple queries, but none allow a single query to use multiple servers to complete faster. This solution allows multiple servers to work concurrently on a single query. It is usually accomplished by splitting the data among servers and having each server execute its part of the query and return results to a central server where they are combined and returned to the user. This can be implemented using the PL/Proxy tool set.
It should also be noted that because PostgreSQL is open source and easily extended, a number of companies have taken PostgreSQL and created commercial closed-source solutions with unique failover, replication, and load balancing capabilities. These are not discussed here.
Feature | Shared Disk | File System Repl. | Write-Ahead Log Shipping | Logical Repl. | Trigger-Based Repl. | SQL Repl. Middle-ware | Async. MM Repl. | Sync. MM Repl. |
---|---|---|---|---|---|---|---|---|
Popular examples
NAS
DRBD
built-in streaming repl.
built-in logical repl., pglogical
Londiste, Slony
pgpool-II
Bucardo
Comm. method
shared disk
disk blocks
WAL
logical decoding
table rows
SQL
table rows
table rows and row locks
No special hardware required
•
•
•
•
•
•
•
Allows multiple master servers
•
•
•
•
No master server overhead
•
•
•
•
No waiting for multiple servers
•
with sync off
with sync off
•
•
Master failure will never lose data
•
•
with sync on
with sync on
•
•
Replicas accept read-only queries
with hot standby
•
•
•
•
•
Per-table granularity
•
•
•
•
No conflict resolution necessary
•
•
•
•
•
•
資料庫伺服器可以協同工作,以便在主要伺服器故障時允許第二台伺服器快速的接管(高可用性 High Availability),或者允許多台伺服器提供相同的資料(負載平衡 Loading Balancing)。理想狀況下,資料伺服器可以無縫接軌地協同工作。網頁伺服器提供靜態網頁可以被相當簡單的組合,僅透過負載平衡把網頁請求分配到多台機器上。事實上,只提供讀取的資料庫伺服器也可以相對容易地被組合。不幸地是大多數地資料庫伺服器具有讀/寫請求的組合,可是具備讀/寫請求資料伺服器被組合起來是相當地困難。這是因為儘管只供讀取的資料只被放進每台伺服器一次,但是必須將對任何伺服器寫入的資料傳播到所有的伺服器中,以便將來對這些伺服器發送讀取請求能夠返回一致的結果。
這種同步化的問題算是伺服器協同工作上的基本難題。由於沒有單一得解決方案可以消除所有使用案例同步問題的影響,因此有多許多種解決方案。每種解決方案都以不同的方式解決問題,並最小化該問題對特定工作負載的影響。
有些解決方案處理同步化是藉由只讓單一伺服器可以修改資料。可以修改資料伺服器被稱之為 read/write、master或primary的伺服器。 可以追蹤master伺服器改變的伺服器我們稱為standby或secondary伺服器。standby伺服器不能接受連線上直到他被提升為maaster伺服器才能連線的伺服器稱之為warm standby伺服器,另一種可以接受連線且只提供其他伺服器作讀取查詢的稱之為hot standby伺服器。
一些解決方案是同步的,代表說一個資料修改的交易是不被認為提交直到所有伺服器都已經提交這些交易。這保證故障轉移不會遺失掉任何資料,和不論哪一台資料庫伺服器被查詢時,所有負載平衡的伺服器都可以返回一致的結果。相反地,非同步解決方案允許在提交交易時間和傳播到其他伺服器之間存在些許延遲,從而可能會在切換到備份伺服器時遺失某些交易,且負載平衡伺服器可能會返回一些稍微過時的結果。非同步解決方案被運用在當同步解決方案太慢的時候。
解決方案也可以被依照規模分類,某些解決方案只能處理整個資料庫伺服器,然而其他的解決方案允許處理控制在每個表或每個資料庫等級。
做任何選擇都必須考慮到其性能。通常必須在功能和性能之間取其權衡。例如一個完整的同步解決方案可能會讓性能降低一半以上,而異步解決方案可能會對性能有比較小的影響。
本節的其餘部分概述了各種故障轉移、複製和負載平衡解決方案。
版本:11
持續性歸檔可用於建構高可用性(HA)的叢集配置,其中一個或多個備用伺服器準備好在主伺服器發生故障時接管操作。此功能被廣泛稱為熱備份(warm standby)或日誌轉送(Log-Shipping)。
伺服器們是人為的相依,由主伺服器和備用伺服器協同工作以提供此功能。主伺服器以持續性歸檔模式運行,而每個備用伺服器以連續恢復模式運行,從主伺服器讀取 WAL 檔案。毌須更改資料庫的資料表即可啟用此功能,因此與其他一些複寫解決方案相比,它可以提供較低的管理成本。此配置對主伺服器的效能影響也相對較低。
直接將 WAL 記錄從一個資料庫伺服器移動到另一個資料庫伺服器通常被稱為日誌轉送。PostgreSQL 透過一次傳輸 WAL 記錄一個檔案(WAL 段落)來實現基於檔案的日誌轉送。WAL 檔案(16MB)可以在任何距離上輕鬆便宜地運輸,無論是相鄰系統,同一站點的另一個系統,還是地球另一端的其他系統。此技術所需的頻寬依主伺服器的事務速率而變化。基於記錄的日誌傳送更精細,並且通過網路連連逐步更改 WAL(請參閱第 26.2.5 節)。
應該注意的是,日誌輸送是非同步的,即 WAL 記錄在事務提交之後被傳送。因此,如果主伺服器遭受災難性故障,則存在資料遺失的可能性;尚未提交的交易將會失去。基於檔案的日誌轉送中的資料遺失的大小可以透過使用 archive_timeout 參數來限制,該參數可以設定低至數秒鐘。然而,這種低的設定將大大增加檔案傳送所需的頻寬。 串流複寫(參閱第 26.2.5 節)允許更小的資料遺失大小。
回復的效率很高,一旦備用轉為主要,備用資料庫通常只需要幾分鐘即可完全可用。因此,這稱為熱備用配置,可提供高可用性。從歸檔的基本備份和回溯還原伺服器將花費相當長的時間,因此該技術僅提供災難恢復的解決方案,而不是高可用性。備用伺服器也可用於唯讀查詢,在這種情況下,它稱為熱備份伺服器。有關更多訊息,請參閱第 26.5 節。
建立主伺服器和備用伺服器通常是好的規畫,使它們可以盡可能相似,至少從資料庫伺服器的角度來看。特別是,與資料表空間關聯的路徑名稱將在未修改的情況下傳遞。因此,如果使用此功能,主伺服器和備用伺服器必須具有相同的資料表空間的安裝路徑。請記住,如果在主伺服器上執行 CREATE TABLESPACE,則必須在執行命令之前在主伺服器和所有備用伺服器上建立所需的所有新安裝點。硬體不需要完全相同,但經驗上,維護兩個相同的系統會比在應用系統的生命週期內維護兩個不同的系統更容易。不過在硬體架構則必須相同 - 例如,從 32 位元到 64 位元系統的搭配則無法運作。
一般來說,無法在不同主要 PostgreSQL 版本的伺服器之間進行日誌傳送。PostgreSQL 全球開發團隊的原則是不要在次要版本升級期間更改磁碟格式,因此在主伺服器和備用伺服器上使用不同的次要版本可能會成功執行。 但是,並沒有保證正式支持,建議您盡可能將主伺服器和備用伺服器保持在同一版本。更新到新的次要版本時,最安全的策略是先更新備用伺服器 - 新的次要版本更有可能從先前的次要版本讀取 WAL 檔案,反過來則不一定。
在備用模式下,伺服器連續套用從主要伺服器所接收的 WAL。備用伺服器可以透過 TCP 連線(串流複寫)從 WAL 歸檔(請參閱 restore_command)。備用伺服器也會嘗試恢復在備用集群的 pg_wal 目錄中能找到的任何 WAL。這通常發生在伺服器重新啟動之後,當備用資料庫再次重新執行在重新啟動之前從主服務器串流傳輸的 WAL 時,您也可以隨時手動將檔案複製到 pg_wal 以重新執行它們。
在啟動時,備用資料庫首先恢復存檔路徑中的所有可用的 WAL,然後呼叫 restore_command。一旦達到 WAL 可用的尾端並且 restore_command 失敗,它就會嘗試恢復 pg_wal 目錄中可用的任何WAL。如果失敗,並且已啟用串流複寫,則備用資料庫會嘗試連到主伺服器,並從 archive 或 pg_wal 中找到的最後一個有效記錄開始串流傳輸 WAL。 如果失敗或未啟用串流複寫,或者稍後中斷連線,則備用資料庫將返回步驟 1 並嘗試再次從存檔中還原交易。pg_wal 和串流複寫的重試循環一直持續到伺服器停止或觸發故障轉移為止。
退出備用模式,當執行 pg_ctl promote 或找到觸發器檔案(trigger_file)時,伺服器將切換到正常操作。在故障轉移之前,將恢復存檔或 pg_wal 中立即可用的 WAL,但不會嘗試連線到主要伺服器。
Set up continuous archiving on the primary to an archive directory accessible from the standby, as described in Section 25.3. The archive location should be accessible from the standby even when the master is down, i.e. it should reside on the standby server itself or another trusted server, not on the master server.
If you want to use streaming replication, set up authentication on the primary server to allow replication connections from the standby server(s); that is, create a role and provide a suitable entry or entries in pg_hba.conf
with the database field set to replication
. Also ensure max_wal_senders
is set to a sufficiently large value in the configuration file of the primary server. If replication slots will be used, ensure that max_replication_slots
is set sufficiently high as well.
Take a base backup as described in Section 25.3.2 to bootstrap the standby server.
To set up the standby server, restore the base backup taken from primary server (see Section 25.3.4). Create a recovery command file recovery.conf
in the standby's cluster data directory, and turn on standby_mode
. Set restore_command
to a simple command to copy files from the WAL archive. If you plan to have multiple standby servers for high availability purposes, set recovery_target_timeline
to latest
, to make the standby server follow the timeline change that occurs at failover to another standby.
Do not use pg_standby or similar tools with the built-in standby mode described here. restore_command
should return immediately if the file does not exist; the server will retry the command again if necessary. See Section 26.4 for using tools like pg_standby.
If you want to use streaming replication, fill in primary_conninfo
with a libpq connection string, including the host name (or IP address) and any additional details needed to connect to the primary server. If the primary needs a password for authentication, the password needs to be specified in primary_conninfo
as well.
If you're setting up the standby server for high availability purposes, set up WAL archiving, connections and authentication like the primary server, because the standby server will work as a primary server after failover.
If you're using a WAL archive, its size can be minimized using the archive_cleanup_command parameter to remove files that are no longer required by the standby server. The pg_archivecleanup utility is designed specifically to be used with archive_cleanup_command
in typical single-standby configurations, see pg_archivecleanup. Note however, that if you're using the archive for backup purposes, you need to retain files needed to recover from at least the latest base backup, even if they're no longer needed by the standby.
A simple example of a recovery.conf
is:
You can have any number of standby servers, but if you use streaming replication, make sure you set max_wal_senders
high enough in the primary to allow them to be connected simultaneously.
Streaming replication allows a standby server to stay more up-to-date than is possible with file-based log shipping. The standby connects to the primary, which streams WAL records to the standby as they're generated, without waiting for the WAL file to be filled.
Streaming replication is asynchronous by default (see Section 26.2.8), in which case there is a small delay between committing a transaction in the primary and the changes becoming visible in the standby. This delay is however much smaller than with file-based log shipping, typically under one second assuming the standby is powerful enough to keep up with the load. With streaming replication, archive_timeout
is not required to reduce the data loss window.
If you use streaming replication without file-based continuous archiving, the server might recycle old WAL segments before the standby has received them. If this occurs, the standby will need to be reinitialized from a new base backup. You can avoid this by setting wal_keep_segments
to a value large enough to ensure that WAL segments are not recycled too early, or by configuring a replication slot for the standby. If you set up a WAL archive that's accessible from the standby, these solutions are not required, since the standby can always use the archive to catch up provided it retains enough segments.
To use streaming replication, set up a file-based log-shipping standby server as described in Section 26.2. The step that turns a file-based log-shipping standby into streaming replication standby is setting primary_conninfo
setting in the recovery.conf
file to point to the primary server. Set listen_addresses and authentication options (see pg_hba.conf
) on the primary so that the standby server can connect to the replication
pseudo-database on the primary server (see Section 26.2.5.1).
On systems that support the keepalive socket option, setting tcp_keepalives_idle, tcp_keepalives_interval and tcp_keepalives_count helps the primary promptly notice a broken connection.
Set the maximum number of concurrent connections from the standby servers (see max_wal_senders for details).
When the standby is started and primary_conninfo
is set correctly, the standby will connect to the primary after replaying all WAL files available in the archive. If the connection is established successfully, you will see a walreceiver process in the standby, and a corresponding walsender process in the primary.
It is very important that the access privileges for replication be set up so that only trusted users can read the WAL stream, because it is easy to extract privileged information from it. Standby servers must authenticate to the primary as a superuser or an account that has the REPLICATION
privilege. It is recommended to create a dedicated user account with REPLICATION
and LOGIN
privileges for replication. While REPLICATION
privilege gives very high permissions, it does not allow the user to modify any data on the primary system, which the SUPERUSER
privilege does.
Client authentication for replication is controlled by a pg_hba.conf
record specifying replication
in the database
field. For example, if the standby is running on host IP 192.168.1.100
and the account name for replication is foo
, the administrator can add the following line to the pg_hba.conf
file on the primary:
The host name and port number of the primary, connection user name, and password are specified in the recovery.conf
file. The password can also be set in the ~/.pgpass
file on the standby (specify replication
in the database
field). For example, if the primary is running on host IP 192.168.1.50
, port 5432
, the account name for replication is foo
, and the password is foopass
, the administrator can add the following line to the recovery.conf
file on the standby:
An important health indicator of streaming replication is the amount of WAL records generated in the primary, but not yet applied in the standby. You can calculate this lag by comparing the current WAL write location on the primary with the last WAL location received by the standby. These locations can be retrieved using pg_current_wal_lsn
on the primary and pg_last_wal_receive_lsn
on the standby, respectively (see Table 9.79 and Table 9.80 for details). The last WAL receive location in the standby is also displayed in the process status of the WAL receiver process, displayed using the ps
command (see Section 28.1 for details).
You can retrieve a list of WAL sender processes via the pg_stat_replication
view. Large differences between pg_current_wal_lsn
and the view's sent_lsn
field might indicate that the master server is under heavy load, while differences between sent_lsn
and pg_last_wal_receive_lsn
on the standby might indicate network delay, or that the standby is under heavy load.
Replication slots provide an automated way to ensure that the master does not remove WAL segments until they have been received by all standbys, and that the master does not remove rows which could cause a recovery conflict even when the standby is disconnected.
In lieu of using replication slots, it is possible to prevent the removal of old WAL segments using wal_keep_size, or by storing the segments in an archive using archive_command. However, these methods often result in retaining more WAL segments than required, whereas replication slots retain only the number of segments known to be needed. On the other hand, replication slots can retain so many WAL segments that they fill up the space allocated for pg_wal
; max_slot_wal_keep_size limits the size of WAL files retained by replication slots.
Similarly, hot_standby_feedback and vacuum_defer_cleanup_age provide protection against relevant rows being removed by vacuum, but the former provides no protection during any time period when the standby is not connected, and the latter often needs to be set to a high value to provide adequate protection. Replication slots overcome these disadvantages.
Each replication slot has a name, which can contain lower-case letters, numbers, and the underscore character.
Existing replication slots and their state can be seen in the pg_replication_slots
view.
Slots can be created and dropped either via the streaming replication protocol (see Section 52.4) or via SQL functions (see Section 9.27.6).
You can create a replication slot like this:
To configure the standby to use this slot, primary_slot_name
should be configured on the standby. Here is a simple example:
The cascading replication feature allows a standby server to accept replication connections and stream WAL records to other standbys, acting as a relay. This can be used to reduce the number of direct connections to the master and also to minimize inter-site bandwidth overheads.
A standby acting as both a receiver and a sender is known as a cascading standby. Standbys that are more directly connected to the master are known as upstream servers, while those standby servers further away are downstream servers. Cascading replication does not place limits on the number or arrangement of downstream servers, though each standby connects to only one upstream server which eventually links to a single master/primary server.
A cascading standby sends not only WAL records received from the master but also those restored from the archive. So even if the replication connection in some upstream connection is terminated, streaming replication continues downstream for as long as new WAL records are available.
Cascading replication is currently asynchronous. Synchronous replication (see Section 26.2.8) settings have no effect on cascading replication at present.
Hot Standby feedback propagates upstream, whatever the cascaded arrangement.
If an upstream standby server is promoted to become new master, downstream servers will continue to stream from the new master if recovery_target_timeline
is set to 'latest'
(the default).
To use cascading replication, set up the cascading standby so that it can accept replication connections (that is, set max_wal_senders and hot_standby, and configure host-based authentication). You will also need to set primary_conninfo
in the downstream standby to point to the cascading standby.
PostgreSQL streaming replication is asynchronous by default. If the primary server crashes then some transactions that were committed may not have been replicated to the standby server, causing data loss. The amount of data loss is proportional to the replication delay at the time of failover.
Synchronous replication offers the ability to confirm that all changes made by a transaction have been transferred to one or more synchronous standby servers. This extends that standard level of durability offered by a transaction commit. This level of protection is referred to as 2-safe replication in computer science theory, and group-1-safe (group-safe and 1-safe) when synchronous_commit
is set to remote_write
.
When requesting synchronous replication, each commit of a write transaction will wait until confirmation is received that the commit has been written to the write-ahead log on disk of both the primary and standby server. The only possibility that data can be lost is if both the primary and the standby suffer crashes at the same time. This can provide a much higher level of durability, though only if the sysadmin is cautious about the placement and management of the two servers. Waiting for confirmation increases the user's confidence that the changes will not be lost in the event of server crashes but it also necessarily increases the response time for the requesting transaction. The minimum wait time is the round-trip time between primary to standby.
Read only transactions and transaction rollbacks need not wait for replies from standby servers. Subtransaction commits do not wait for responses from standby servers, only top-level commits. Long running actions such as data loading or index building do not wait until the very final commit message. All two-phase commit actions require commit waits, including both prepare and commit.
A synchronous standby can be a physical replication standby or a logical replication subscriber. It can also be any other physical or logical WAL replication stream consumer that knows how to send the appropriate feedback messages. Besides the built-in physical and logical replication systems, this includes special programs such as pg_receivewal
and pg_recvlogical
as well as some third-party replication systems and custom programs. Check the respective documentation for details on synchronous replication support.
Once streaming replication has been configured, configuring synchronous replication requires only one additional configuration step: synchronous_standby_names must be set to a non-empty value. synchronous_commit
must also be set to on
, but since this is the default value, typically no change is required. (See Section 19.5.1 and Section 19.6.2.) This configuration will cause each commit to wait for confirmation that the standby has written the commit record to durable storage. synchronous_commit
can be set by individual users, so it can be configured in the configuration file, for particular users or databases, or dynamically by applications, in order to control the durability guarantee on a per-transaction basis.
After a commit record has been written to disk on the primary, the WAL record is then sent to the standby. The standby sends reply messages each time a new batch of WAL data is written to disk, unless wal_receiver_status_interval
is set to zero on the standby. In the case that synchronous_commit
is set to remote_apply
, the standby sends reply messages when the commit record is replayed, making the transaction visible. If the standby is chosen as a synchronous standby, according to the setting of synchronous_standby_names
on the primary, the reply messages from that standby will be considered along with those from other synchronous standbys to decide when to release transactions waiting for confirmation that the commit record has been received. These parameters allow the administrator to specify which standby servers should be synchronous standbys. Note that the configuration of synchronous replication is mainly on the master. Named standbys must be directly connected to the master; the master knows nothing about downstream standby servers using cascaded replication.
Setting synchronous_commit
to remote_write
will cause each commit to wait for confirmation that the standby has received the commit record and written it out to its own operating system, but not for the data to be flushed to disk on the standby. This setting provides a weaker guarantee of durability than on
does: the standby could lose the data in the event of an operating system crash, though not a PostgreSQL crash. However, it's a useful setting in practice because it can decrease the response time for the transaction. Data loss could only occur if both the primary and the standby crash and the database of the primary gets corrupted at the same time.
Setting synchronous_commit
to remote_apply
will cause each commit to wait until the current synchronous standbys report that they have replayed the transaction, making it visible to user queries. In simple cases, this allows for load balancing with causal consistency.
Users will stop waiting if a fast shutdown is requested. However, as when using asynchronous replication, the server will not fully shutdown until all outstanding WAL records are transferred to the currently connected standby servers.
Synchronous replication supports one or more synchronous standby servers; transactions will wait until all the standby servers which are considered as synchronous confirm receipt of their data. The number of synchronous standbys that transactions must wait for replies from is specified in synchronous_standby_names
. This parameter also specifies a list of standby names and the method (FIRST
and ANY
) to choose synchronous standbys from the listed ones.
The method FIRST
specifies a priority-based synchronous replication and makes transaction commits wait until their WAL records are replicated to the requested number of synchronous standbys chosen based on their priorities. The standbys whose names appear earlier in the list are given higher priority and will be considered as synchronous. Other standby servers appearing later in this list represent potential synchronous standbys. If any of the current synchronous standbys disconnects for whatever reason, it will be replaced immediately with the next-highest-priority standby.
An example of synchronous_standby_names
for a priority-based multiple synchronous standbys is:
In this example, if four standby servers s1
, s2
, s3
and s4
are running, the two standbys s1
and s2
will be chosen as synchronous standbys because their names appear early in the list of standby names. s3
is a potential synchronous standby and will take over the role of synchronous standby when either of s1
or s2
fails. s4
is an asynchronous standby since its name is not in the list.
The method ANY
specifies a quorum-based synchronous replication and makes transaction commits wait until their WAL records are replicated to at least the requested number of synchronous standbys in the list.
An example of synchronous_standby_names
for a quorum-based multiple synchronous standbys is:
In this example, if four standby servers s1
, s2
, s3
and s4
are running, transaction commits will wait for replies from at least any two standbys of s1
, s2
and s3
. s4
is an asynchronous standby since its name is not in the list.
The synchronous states of standby servers can be viewed using the pg_stat_replication
view.
Synchronous replication usually requires carefully planned and placed standby servers to ensure applications perform acceptably. Waiting doesn't utilize system resources, but transaction locks continue to be held until the transfer is confirmed. As a result, incautious use of synchronous replication will reduce performance for database applications because of increased response times and higher contention.
PostgreSQL allows the application developer to specify the durability level required via replication. This can be specified for the system overall, though it can also be specified for specific users or connections, or even individual transactions.
For example, an application workload might consist of: 10% of changes are important customer details, while 90% of changes are less important data that the business can more easily survive if it is lost, such as chat messages between users.
With synchronous replication options specified at the application level (on the primary) we can offer synchronous replication for the most important changes, without slowing down the bulk of the total workload. Application level options are an important and practical tool for allowing the benefits of synchronous replication for high performance applications.
You should consider that the network bandwidth must be higher than the rate of generation of WAL data.
synchronous_standby_names
specifies the number and names of synchronous standbys that transaction commits made when synchronous_commit
is set to on
, remote_apply
or remote_write
will wait for responses from. Such transaction commits may never be completed if any one of synchronous standbys should crash.
The best solution for high availability is to ensure you keep as many synchronous standbys as requested. This can be achieved by naming multiple potential synchronous standbys using synchronous_standby_names
.
In a priority-based synchronous replication, the standbys whose names appear earlier in the list will be used as synchronous standbys. Standbys listed after these will take over the role of synchronous standby if one of current ones should fail.
In a quorum-based synchronous replication, all the standbys appearing in the list will be used as candidates for synchronous standbys. Even if one of them should fail, the other standbys will keep performing the role of candidates of synchronous standby.
When a standby first attaches to the primary, it will not yet be properly synchronized. This is described as catchup
mode. Once the lag between standby and primary reaches zero for the first time we move to real-time streaming
state. The catch-up duration may be long immediately after the standby has been created. If the standby is shut down, then the catch-up period will increase according to the length of time the standby has been down. The standby is only able to become a synchronous standby once it has reached streaming
state. This state can be viewed using the pg_stat_replication
view.
If primary restarts while commits are waiting for acknowledgement, those waiting transactions will be marked fully committed once the primary database recovers. There is no way to be certain that all standbys have received all outstanding WAL data at time of the crash of the primary. Some transactions may not show as committed on the standby, even though they show as committed on the primary. The guarantee we offer is that the application will not receive explicit acknowledgement of the successful commit of a transaction until the WAL data is known to be safely received by all the synchronous standbys.
If you really cannot keep as many synchronous standbys as requested then you should decrease the number of synchronous standbys that transaction commits must wait for responses from in synchronous_standby_names
(or disable it) and reload the configuration file on the primary server.
If the primary is isolated from remaining standby servers you should fail over to the best candidate of those other remaining standby servers.
If you need to re-create a standby server while transactions are waiting, make sure that the commands pg_start_backup() and pg_stop_backup() are run in a session with synchronous_commit
= off
, otherwise those requests will wait forever for the standby to appear.
When continuous WAL archiving is used in a standby, there are two different scenarios: the WAL archive can be shared between the primary and the standby, or the standby can have its own WAL archive. When the standby has its own WAL archive, set archive_mode
to always
, and the standby will call the archive command for every WAL segment it receives, whether it's by restoring from the archive or by streaming replication. The shared archive can be handled similarly, but the archive_command
must test if the file being archived exists already, and if the existing file has identical contents. This requires more care in the archive_command
, as it must be careful to not overwrite an existing file with different contents, but return success if the exactly same file is archived twice. And all that must be done free of race conditions, if two servers attempt to archive the same file at the same time.
If archive_mode
is set to on
, the archiver is not enabled during recovery or standby mode. If the standby server is promoted, it will start archiving after the promotion, but will not archive any WAL it did not generate itself. To get a complete series of WAL files in the archive, you must ensure that all WAL is archived, before it reaches the standby. This is inherently true with file-based log shipping, as the standby can only restore files that are found in the archive, but not if streaming replication is enabled. When a server is not in recovery mode, there is no difference between on
and always
modes.
Hot Standby is the term used to describe the ability to connect to the server and run read-only queries while the server is in archive recovery or standby mode. This is useful both for replication purposes and for restoring a backup to a desired state with great precision. The term Hot Standby also refers to the ability of the server to move from recovery through to normal operation while users continue running queries and/or keep their connections open.
Running queries in hot standby mode is similar to normal query operation, though there are several usage and administrative differences explained below.
When the hot_standby parameter is set to true on a standby server, it will begin accepting connections once the recovery has brought the system to a consistent state. All such connections are strictly read-only; not even temporary tables may be written.
The data on the standby takes some time to arrive from the primary server so there will be a measurable delay between primary and standby. Running the same query nearly simultaneously on both primary and standby might therefore return differing results. We say that data on the standby is eventually consistent with the primary. Once the commit record for a transaction is replayed on the standby, the changes made by that transaction will be visible to any new snapshots taken on the standby. Snapshots may be taken at the start of each query or at the start of each transaction, depending on the current transaction isolation level. For more details, see Section 13.2.
Transactions started during hot standby may issue the following commands:
Query access: SELECT
, COPY TO
Cursor commands: DECLARE
, FETCH
, CLOSE
Settings: SHOW
, SET
, RESET
Transaction management commands:
BEGIN
, END
, ABORT
, START TRANSACTION
SAVEPOINT
, RELEASE
, ROLLBACK TO SAVEPOINT
EXCEPTION
blocks and other internal subtransactions
LOCK TABLE
, though only when explicitly in one of these modes: ACCESS SHARE
, ROW SHARE
or ROW EXCLUSIVE
.
Plans and resources: PREPARE
, EXECUTE
, DEALLOCATE
, DISCARD
Plugins and extensions: LOAD
UNLISTEN
Transactions started during hot standby will never be assigned a transaction ID and cannot write to the system write-ahead log. Therefore, the following actions will produce error messages:
Data Manipulation Language (DML): INSERT
, UPDATE
, DELETE
, COPY FROM
, TRUNCATE
. Note that there are no allowed actions that result in a trigger being executed during recovery. This restriction applies even to temporary tables, because table rows cannot be read or written without assigning a transaction ID, which is currently not possible in a Hot Standby environment.
Data Definition Language (DDL): CREATE
, DROP
, ALTER
, COMMENT
. This restriction applies even to temporary tables, because carrying out these operations would require updating the system catalog tables.
SELECT ... FOR SHARE | UPDATE
, because row locks cannot be taken without updating the underlying data files.
Rules on SELECT
statements that generate DML commands.
LOCK
that explicitly requests a mode higher than ROW EXCLUSIVE MODE
.
LOCK
in short default form, since it requests ACCESS EXCLUSIVE MODE
.
Transaction management commands that explicitly set non-read-only state:
BEGIN READ WRITE
, START TRANSACTION READ WRITE
SET TRANSACTION READ WRITE
, SET SESSION CHARACTERISTICS AS TRANSACTION READ WRITE
SET transaction_read_only = off
Two-phase commit commands: PREPARE TRANSACTION
, COMMIT PREPARED
, ROLLBACK PREPARED
because even read-only transactions need to write WAL in the prepare phase (the first phase of two phase commit).
Sequence updates: nextval()
, setval()
LISTEN
, NOTIFY
In normal operation, “read-only” transactions are allowed to use LISTEN
and NOTIFY
, so Hot Standby sessions operate under slightly tighter restrictions than ordinary read-only sessions. It is possible that some of these restrictions might be loosened in a future release.
During hot standby, the parameter transaction_read_only
is always true and may not be changed. But as long as no attempt is made to modify the database, connections during hot standby will act much like any other database connection. If failover or switchover occurs, the database will switch to normal processing mode. Sessions will remain connected while the server changes mode. Once hot standby finishes, it will be possible to initiate read-write transactions (even from a session begun during hot standby).
Users will be able to tell whether their session is read-only by issuing SHOW transaction_read_only
. In addition, a set of functions (Table 9.86) allow users to access information about the standby server. These allow you to write programs that are aware of the current state of the database. These can be used to monitor the progress of recovery, or to allow you to write complex programs that restore the database to particular states.
The primary and standby servers are in many ways loosely connected. Actions on the primary will have an effect on the standby. As a result, there is potential for negative interactions or conflicts between them. The easiest conflict to understand is performance: if a huge data load is taking place on the primary then this will generate a similar stream of WAL records on the standby, so standby queries may contend for system resources, such as I/O.
There are also additional types of conflict that can occur with Hot Standby. These conflicts are hard conflicts in the sense that queries might need to be canceled and, in some cases, sessions disconnected to resolve them. The user is provided with several ways to handle these conflicts. Conflict cases include:
Access Exclusive locks taken on the primary server, including both explicit LOCK
commands and various DDL actions, conflict with table accesses in standby queries.
Dropping a tablespace on the primary conflicts with standby queries using that tablespace for temporary work files.
Dropping a database on the primary conflicts with sessions connected to that database on the standby.
Application of a vacuum cleanup record from WAL conflicts with standby transactions whose snapshots can still “see” any of the rows to be removed.
Application of a vacuum cleanup record from WAL conflicts with queries accessing the target page on the standby, whether or not the data to be removed is visible.
On the primary server, these cases simply result in waiting; and the user might choose to cancel either of the conflicting actions. However, on the standby there is no choice: the WAL-logged action already occurred on the primary so the standby must not fail to apply it. Furthermore, allowing WAL application to wait indefinitely may be very undesirable, because the standby's state will become increasingly far behind the primary's. Therefore, a mechanism is provided to forcibly cancel standby queries that conflict with to-be-applied WAL records.
An example of the problem situation is an administrator on the primary server running DROP TABLE
on a table that is currently being queried on the standby server. Clearly the standby query cannot continue if the DROP TABLE
is applied on the standby. If this situation occurred on the primary, the DROP TABLE
would wait until the other query had finished. But when DROP TABLE
is run on the primary, the primary doesn't have information about what queries are running on the standby, so it will not wait for any such standby queries. The WAL change records come through to the standby while the standby query is still running, causing a conflict. The standby server must either delay application of the WAL records (and everything after them, too) or else cancel the conflicting query so that the DROP TABLE
can be applied.
When a conflicting query is short, it's typically desirable to allow it to complete by delaying WAL application for a little bit; but a long delay in WAL application is usually not desirable. So the cancel mechanism has parameters, max_standby_archive_delay and max_standby_streaming_delay, that define the maximum allowed delay in WAL application. Conflicting queries will be canceled once it has taken longer than the relevant delay setting to apply any newly-received WAL data. There are two parameters so that different delay values can be specified for the case of reading WAL data from an archive (i.e., initial recovery from a base backup or “catching up” a standby server that has fallen far behind) versus reading WAL data via streaming replication.
In a standby server that exists primarily for high availability, it's best to set the delay parameters relatively short, so that the server cannot fall far behind the primary due to delays caused by standby queries. However, if the standby server is meant for executing long-running queries, then a high or even infinite delay value may be preferable. Keep in mind however that a long-running query could cause other sessions on the standby server to not see recent changes on the primary, if it delays application of WAL records.
Once the delay specified by max_standby_archive_delay
or max_standby_streaming_delay
has been exceeded, conflicting queries will be canceled. This usually results just in a cancellation error, although in the case of replaying a DROP DATABASE
the entire conflicting session will be terminated. Also, if the conflict is over a lock held by an idle transaction, the conflicting session is terminated (this behavior might change in the future).
Canceled queries may be retried immediately (after beginning a new transaction, of course). Since query cancellation depends on the nature of the WAL records being replayed, a query that was canceled may well succeed if it is executed again.
Keep in mind that the delay parameters are compared to the elapsed time since the WAL data was received by the standby server. Thus, the grace period allowed to any one query on the standby is never more than the delay parameter, and could be considerably less if the standby has already fallen behind as a result of waiting for previous queries to complete, or as a result of being unable to keep up with a heavy update load.
The most common reason for conflict between standby queries and WAL replay is “early cleanup”. Normally, PostgreSQL allows cleanup of old row versions when there are no transactions that need to see them to ensure correct visibility of data according to MVCC rules. However, this rule can only be applied for transactions executing on the master. So it is possible that cleanup on the master will remove row versions that are still visible to a transaction on the standby.
Experienced users should note that both row version cleanup and row version freezing will potentially conflict with standby queries. Running a manual VACUUM FREEZE
is likely to cause conflicts even on tables with no updated or deleted rows.
Users should be clear that tables that are regularly and heavily updated on the primary server will quickly cause cancellation of longer running queries on the standby. In such cases the setting of a finite value for max_standby_archive_delay
or max_standby_streaming_delay
can be considered similar to setting statement_timeout
.
Remedial possibilities exist if the number of standby-query cancellations is found to be unacceptable. The first option is to set the parameter hot_standby_feedback
, which prevents VACUUM
from removing recently-dead rows and so cleanup conflicts do not occur. If you do this, you should note that this will delay cleanup of dead rows on the primary, which may result in undesirable table bloat. However, the cleanup situation will be no worse than if the standby queries were running directly on the primary server, and you are still getting the benefit of off-loading execution onto the standby. If standby servers connect and disconnect frequently, you might want to make adjustments to handle the period when hot_standby_feedback
feedback is not being provided. For example, consider increasing max_standby_archive_delay
so that queries are not rapidly canceled by conflicts in WAL archive files during disconnected periods. You should also consider increasing max_standby_streaming_delay
to avoid rapid cancellations by newly-arrived streaming WAL entries after reconnection.
Another option is to increase vacuum_defer_cleanup_age on the primary server, so that dead rows will not be cleaned up as quickly as they normally would be. This will allow more time for queries to execute before they are canceled on the standby, without having to set a high max_standby_streaming_delay
. However it is difficult to guarantee any specific execution-time window with this approach, since vacuum_defer_cleanup_age
is measured in transactions executed on the primary server.
The number of query cancels and the reason for them can be viewed using the pg_stat_database_conflicts
system view on the standby server. The pg_stat_database
system view also contains summary information.
If hot_standby
is on
in postgresql.conf
(the default value) and there is a standby.signal
file present, the server will run in Hot Standby mode. However, it may take some time for Hot Standby connections to be allowed, because the server will not accept connections until it has completed sufficient recovery to provide a consistent state against which queries can run. During this period, clients that attempt to connect will be refused with an error message. To confirm the server has come up, either loop trying to connect from the application, or look for these messages in the server logs:
Consistency information is recorded once per checkpoint on the primary. It is not possible to enable hot standby when reading WAL written during a period when wal_level
was not set to replica
or logical
on the primary. Reaching a consistent state can also be delayed in the presence of both of these conditions:
A write transaction has more than 64 subtransactions
Very long-lived write transactions
If you are running file-based log shipping ("warm standby"), you might need to wait until the next WAL file arrives, which could be as long as the archive_timeout
setting on the primary.
The setting of some parameters on the standby will need reconfiguration if they have been changed on the primary. For these parameters, the value on the standby must be equal to or greater than the value on the primary. Therefore, if you want to increase these values, you should do so on all standby servers first, before applying the changes to the primary server. Conversely, if you want to decrease these values, you should do so on the primary server first, before applying the changes to all standby servers. If these parameters are not set high enough then the standby will refuse to start. Higher values can then be supplied and the server restarted to begin recovery again. These parameters are:
max_connections
max_prepared_transactions
max_locks_per_transaction
max_wal_senders
max_worker_processes
It is important that the administrator select appropriate settings for max_standby_archive_delay and max_standby_streaming_delay. The best choices vary depending on business priorities. For example if the server is primarily tasked as a High Availability server, then you will want low delay settings, perhaps even zero, though that is a very aggressive setting. If the standby server is tasked as an additional server for decision support queries then it might be acceptable to set the maximum delay values to many hours, or even -1 which means wait forever for queries to complete.
Transaction status "hint bits" written on the primary are not WAL-logged, so data on the standby will likely re-write the hints again on the standby. Thus, the standby server will still perform disk writes even though all users are read-only; no changes occur to the data values themselves. Users will still write large sort temporary files and re-generate relcache info files, so no part of the database is truly read-only during hot standby mode. Note also that writes to remote databases using dblink module, and other operations outside the database using PL functions will still be possible, even though the transaction is read-only locally.
The following types of administration commands are not accepted during recovery mode:
Data Definition Language (DDL): e.g., CREATE INDEX
Privilege and Ownership: GRANT
, REVOKE
, REASSIGN
Maintenance commands: ANALYZE
, VACUUM
, CLUSTER
, REINDEX
Again, note that some of these commands are actually allowed during "read only" mode transactions on the primary.
As a result, you cannot create additional indexes that exist solely on the standby, nor statistics that exist solely on the standby. If these administration commands are needed, they should be executed on the primary, and eventually those changes will propagate to the standby.
pg_cancel_backend()
and pg_terminate_backend()
will work on user backends, but not the Startup process, which performs recovery. pg_stat_activity
does not show recovering transactions as active. As a result, pg_prepared_xacts
is always empty during recovery. If you wish to resolve in-doubt prepared transactions, view pg_prepared_xacts
on the primary and issue commands to resolve transactions there or resolve them after the end of recovery.
pg_locks
will show locks held by backends, as normal. pg_locks
also shows a virtual transaction managed by the Startup process that owns all AccessExclusiveLocks
held by transactions being replayed by recovery. Note that the Startup process does not acquire locks to make database changes, and thus locks other than AccessExclusiveLocks
do not show in pg_locks
for the Startup process; they are just presumed to exist.
The Nagios plugin check_pgsql will work, because the simple information it checks for exists. The check_postgres monitoring script will also work, though some reported values could give different or confusing results. For example, last vacuum time will not be maintained, since no vacuum occurs on the standby. Vacuums running on the primary do still send their changes to the standby.
WAL file control commands will not work during recovery, e.g., pg_start_backup
, pg_switch_wal
etc.
Dynamically loadable modules work, including pg_stat_statements
.
Advisory locks work normally in recovery, including deadlock detection. Note that advisory locks are never WAL logged, so it is impossible for an advisory lock on either the primary or the standby to conflict with WAL replay. Nor is it possible to acquire an advisory lock on the primary and have it initiate a similar advisory lock on the standby. Advisory locks relate only to the server on which they are acquired.
Trigger-based replication systems such as Slony, Londiste and Bucardo won't run on the standby at all, though they will run happily on the primary server as long as the changes are not sent to standby servers to be applied. WAL replay is not trigger-based so you cannot relay from the standby to any system that requires additional database writes or relies on the use of triggers.
New OIDs cannot be assigned, though some UUID generators may still work as long as they do not rely on writing new status to the database.
Currently, temporary table creation is not allowed during read only transactions, so in some cases existing scripts will not run correctly. This restriction might be relaxed in a later release. This is both a SQL Standard compliance issue and a technical issue.
DROP TABLESPACE
can only succeed if the tablespace is empty. Some standby users may be actively using the tablespace via their temp_tablespaces
parameter. If there are temporary files in the tablespace, all active queries are canceled to ensure that temporary files are removed, so the tablespace can be removed and WAL replay can continue.
Running DROP DATABASE
or ALTER DATABASE ... SET TABLESPACE
on the primary will generate a WAL entry that will cause all users connected to that database on the standby to be forcibly disconnected. This action occurs immediately, whatever the setting of max_standby_streaming_delay
. Note that ALTER DATABASE ... RENAME
does not disconnect users, which in most cases will go unnoticed, though might in some cases cause a program confusion if it depends in some way upon database name.
In normal (non-recovery) mode, if you issue DROP USER
or DROP ROLE
for a role with login capability while that user is still connected then nothing happens to the connected user — they remain connected. The user cannot reconnect however. This behavior applies in recovery also, so a DROP USER
on the primary does not disconnect that user on the standby.
The statistics collector is active during recovery. All scans, reads, blocks, index usage, etc., will be recorded normally on the standby. Replayed actions will not duplicate their effects on primary, so replaying an insert will not increment the Inserts column of pg_stat_user_tables. The stats file is deleted at the start of recovery, so stats from primary and standby will differ; this is considered a feature, not a bug.
Autovacuum is not active during recovery. It will start normally at the end of recovery.
The checkpointer process and the background writer process are active during recovery. The checkpointer process will perform restartpoints (similar to checkpoints on the primary) and the background writer process will perform normal block cleaning activities. This can include updates of the hint bit information stored on the standby server. The CHECKPOINT
command is accepted during recovery, though it performs a restartpoint rather than a new checkpoint.
Various parameters have been mentioned above in Section 26.5.2 and Section 26.5.3.
On the primary, parameters wal_level and vacuum_defer_cleanup_age can be used. max_standby_archive_delay and max_standby_streaming_delay have no effect if set on the primary.
On the standby, parameters hot_standby, max_standby_archive_delay and max_standby_streaming_delay can be used. vacuum_defer_cleanup_age has no effect as long as the server remains in standby mode, though it will become relevant if the standby becomes primary.
There are several limitations of Hot Standby. These can and probably will be fixed in future releases:
Full knowledge of running transactions is required before snapshots can be taken. Transactions that use large numbers of subtransactions (currently greater than 64) will delay the start of read only connections until the completion of the longest running write transaction. If this situation occurs, explanatory messages will be sent to the server log.
Valid starting points for standby queries are generated at each checkpoint on the master. If the standby is shut down while the master is in a shutdown state, it might not be possible to re-enter Hot Standby until the primary is started up, so that it generates further starting points in the WAL logs. This situation isn't a problem in the most common situations where it might happen. Generally, if the primary is shut down and not available anymore, that's likely due to a serious failure that requires the standby being converted to operate as the new primary anyway. And in situations where the primary is being intentionally taken down, coordinating to make sure the standby becomes the new primary smoothly is also standard procedure.
At the end of recovery, AccessExclusiveLocks
held by prepared transactions will require twice the normal number of lock table entries. If you plan on running either a large number of concurrent prepared transactions that normally take AccessExclusiveLocks
, or you plan on having one large transaction that takes many AccessExclusiveLocks
, you are advised to select a larger value of max_locks_per_transaction
, perhaps as much as twice the value of the parameter on the primary server. You need not consider this at all if your setting of max_prepared_transactions
is 0.
The Serializable transaction isolation level is not yet available in hot standby. (See Section 13.2.3 and Section 13.4.1 for details.) An attempt to set a transaction to the serializable isolation level in hot standby mode will generate an error.