Each heap and index relation, except for hash indexes, has a Free Space Map (FSM) to keep track of available space in the relation. It's stored alongside the main relation data in a separate relation fork, named after the filenode number of the relation, plus a _fsm suffix. For example, if the filenode of a relation is 12345, the FSM is stored in a file called 12345_fsm, in the same directory as the main relation file.
The Free Space Map is organized as a tree of FSM pages. The bottom level FSM pages store the free space available on each heap (or index) page, using one byte to represent each such page. The upper levels aggregate information from the lower levels.
Within each FSM page is a binary tree, stored in an array with one byte per node. Each leaf node represents a heap page, or a lower level FSM page. In each non-leaf node, the higher of its children's values is stored. The maximum value in the leaf nodes is therefore stored at the root.
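The per-page tree described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the leaf count is padded to a power of two, and the search simply descends from the root, which is simpler than what the real FSM code does.

```python
def build_fsm_tree(leaves):
    """Build an array-backed binary tree over the given leaf byte values.

    Leaves are padded with zeros up to a power of two. Node i has children
    2*i + 1 and 2*i + 2, and every non-leaf node stores the higher of its
    children's values, so the root holds the maximum.
    """
    n = 1
    while n < len(leaves):
        n *= 2
    tree = [0] * (n - 1) + list(leaves) + [0] * (n - len(leaves))
    for i in range(n - 2, -1, -1):
        tree[i] = max(tree[2 * i + 1], tree[2 * i + 2])
    return tree


def search_fsm_tree(tree, needed):
    """Return the leaf slot number of a page with value >= needed, or None."""
    if tree[0] < needed:
        return None  # the root is the maximum, so nothing can satisfy the request
    n_leaves = (len(tree) + 1) // 2
    i = 0
    while i < n_leaves - 1:  # keep descending while i is a non-leaf node
        left = 2 * i + 1
        i = left if tree[left] >= needed else left + 1
    return i - (n_leaves - 1)
```

Because the root always carries the maximum, a request that cannot be satisfied is rejected after looking at a single byte.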
See src/backend/storage/freespace/README for more details on how the FSM is structured, and how it's updated and searched. The pg_freespacemap module can be used to examine the information stored in free space maps.
Each unlogged table, and each index on an unlogged table, has an initialization fork. The initialization fork is an empty table or index of the appropriate type. When an unlogged table must be reset to empty due to a crash, the initialization fork is copied over the main fork, and any other forks are erased (they will be recreated automatically as needed).
This chapter provides an overview of the physical storage format used by PostgreSQL databases.
Version: 11
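This section provides an overview of TOAST (The Oversized-Attribute Storage Technique).
PostgreSQL uses a fixed page size (commonly 8 kB), and does not allow tuples to span multiple pages. Therefore, it is not possible to store very large field values directly. To overcome this limitation, large field values are compressed and/or broken up into multiple physical rows. This happens transparently to the user, with only small impact on most of the backend code. The technique is affectionately known as TOAST (or “the best thing since sliced bread”). The TOAST infrastructure is also used to improve handling of large data values in-memory.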
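Only certain data types support TOAST; there is no need to impose the overhead on data types that cannot produce large field values. To support TOAST, a data type must have a variable-length (varlena) representation, in which, ordinarily, the first four-byte word of any stored value contains the total length of the value in bytes (including itself). TOAST does not constrain the rest of the data type's representation. The special representations collectively called TOASTed values work by modifying or reinterpreting this initial length word. Therefore, the C-level functions supporting a TOAST-able data type must be careful about how they handle potentially TOASTed input values: an input might not actually consist of a four-byte length word and contents until after it has been detoasted. (This is normally done by invoking PG_DETOAST_DATUM before doing anything with an input value, but in some cases more efficient approaches are possible. See Section 37.13.1 for more detail.)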
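TOAST usurps two bits of the varlena length word (the high-order bits on big-endian machines, the low-order bits on little-endian machines), thereby limiting the logical size of any value of a TOAST-able data type to 1 GB. When both bits are zero, the value is an ordinary un-TOASTed value of the data type, and the remaining bits of the length word give the total datum size (including the length word) in bytes. When the highest-order or lowest-order bit is set, the value has only a single-byte header instead of the normal four-byte header, and the remaining bits of that byte give the total datum size (including the length byte) in bytes. This alternative supports space-efficient storage of values shorter than 127 bytes, while still allowing the data type to grow to 1 GB at need. Values with single-byte headers aren't aligned on any particular boundary, whereas values with four-byte headers are aligned on at least a four-byte boundary; this omission of alignment padding provides additional space savings that are significant compared to short values. As a special case, if the remaining bits of a single-byte header are all zero (which would be impossible for a self-inclusive length), the value is a pointer to out-of-line data, with several possible alternatives as described below. The type and size of such a TOAST pointer are determined by a code stored in the second byte of the datum. Lastly, when the highest-order or lowest-order bit is clear but the adjacent bit is set, the content of the datum has been compressed and must be decompressed before use. In this case the remaining bits of the four-byte length word give the total size of the compressed datum, not the original data. Note that compression is also possible for out-of-line data, but the varlena header does not tell whether it has occurred; the content of the TOAST pointer tells that instead.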
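The header cases described above can be illustrated with a small Python classifier. This is a sketch that assumes the little-endian layout only (flag bits in the low-order bits of the first byte); the function name and the returned labels are hypothetical, and the meaning of the tag byte of a TOAST pointer is left uninterpreted.

```python
import struct


def classify_varlena(buf):
    """Classify the start of a varlena datum (little-endian layout assumed)."""
    first = buf[0]
    if first == 0x01:
        # Single-byte header whose remaining bits are all zero: a TOAST
        # pointer. Its type and size are given by a code in the second byte.
        return ("toast_pointer", {"tag": buf[1]})
    if first & 0x01:
        # Single-byte header: the other 7 bits hold the total size,
        # including the header byte itself.
        return ("short", {"total_size": first >> 1})
    # Four-byte header: the two low-order bits are flags, the other 30 bits
    # hold the total size including the length word.
    (word,) = struct.unpack_from("<I", buf, 0)
    size = word >> 2
    if word & 0x03 == 0x00:
        return ("plain", {"total_size": size})
    # Lowest bit clear, adjacent bit set: compressed in-line data.
    return ("compressed", {"total_size": size})
```

For example, a three-byte string stored with a single-byte header has a total size of 4, so its header byte is `(4 << 1) | 1`.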
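As mentioned, there are multiple types of TOAST pointer datums. The oldest and most common type is a pointer to out-of-line data stored in a TOAST table that is separate from, but associated with, the table containing the TOAST pointer datum itself. These on-disk pointer datums are created by the TOAST management code (in access/heap/tuptoaster.c) when a tuple to be stored on disk is too large to be stored as-is. Further details appear in Section 68.2.1. Alternatively, a TOAST pointer datum can contain a pointer to out-of-line data that appears elsewhere in memory. Such datums are necessarily short-lived, and will never appear on disk, but they are very useful for avoiding copying and redundant processing of large data values. Further details appear in Section 68.2.2.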
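The compression technique used for compressed data is a fairly simple and very fast member of the LZ family of compression techniques. See src/common/pg_lzcompress.c for the details.
If any of the columns of a table are TOAST-able, the table will have an associated TOAST table, whose OID is stored in the table's pg_class.reltoastrelid entry. On-disk TOASTed values are kept in the TOAST table, as described in more detail below.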
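Out-of-line values are divided (after compression, if used) into chunks of at most TOAST_MAX_CHUNK_SIZE bytes (by default this value is chosen so that four chunk rows will fit on a page, making it about 2000 bytes). Each chunk is stored as a separate row in the TOAST table belonging to the owning table. Every TOAST table has the columns chunk_id (an OID identifying the particular TOASTed value), chunk_seq (a sequence number for the chunk within its value), and chunk_data (the actual data of the chunk). A unique index on chunk_id and chunk_seq provides fast retrieval of the values. A pointer datum representing an out-of-line on-disk TOASTed value therefore needs to store the OID of the TOAST table in which to look and the OID of the specific value (its chunk_id). For convenience, pointer datums also store the logical datum size (the original uncompressed data length) and the physical stored size (different if compression was applied). Allowing for the varlena header bytes, the total size of an on-disk TOAST pointer datum is therefore 18 bytes regardless of the actual size of the represented value.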
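The chunking scheme can be sketched as follows. The chunk size of 1996 bytes is an assumption for 8 kB pages (the text only says "about 2000"), and the function names are hypothetical; the sketch models the TOAST table rows as plain tuples.

```python
TOAST_MAX_CHUNK_SIZE = 1996  # assumed value for 8 kB pages; "about 2000" in the text


def toast_chunks(chunk_id, data, chunk_size=TOAST_MAX_CHUNK_SIZE):
    """Yield (chunk_id, chunk_seq, chunk_data) rows for one out-of-line value."""
    for seq, start in enumerate(range(0, len(data), chunk_size)):
        yield (chunk_id, seq, data[start:start + chunk_size])


def reassemble(rows):
    """Rebuild the value by concatenating its chunks in chunk_seq order."""
    return b"".join(chunk for _, _, chunk in sorted(rows, key=lambda r: r[1]))
```

A 5000-byte value would produce three rows under these assumptions: two full chunks and a final one of 1008 bytes.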
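The TOAST management code is triggered only when a row value to be stored in a table is wider than TOAST_TUPLE_THRESHOLD bytes (normally 2 kB). The TOAST code will compress and/or move field values out-of-line until the row value is shorter than TOAST_TUPLE_TARGET bytes (also normally 2 kB) or no more gains can be had. During an UPDATE operation, values of unchanged fields are normally preserved as-is; so an UPDATE of a row with out-of-line values incurs no TOAST costs if none of the out-of-line values change.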
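The TOAST management code recognizes four different strategies for storing TOAST-able columns on disk:
PLAIN prevents either compression or out-of-line storage; furthermore, it disables use of single-byte headers for varlena types. This is the only possible strategy for columns of non-TOAST-able data types.
EXTENDED allows both compression and out-of-line storage. This is the default for most TOAST-able data types. Compression will be attempted first, then out-of-line storage if the row is still too big.
EXTERNAL allows out-of-line storage but not compression. Use of EXTERNAL will make substring operations on wide text and bytea columns faster (at the penalty of increased storage space), because these operations are optimized to fetch only the required parts of the out-of-line value when it is not compressed.
MAIN allows compression but not out-of-line storage. (Actually, out-of-line storage will still be performed for such columns, but only as a last resort when there is no other way to make the row small enough to fit on a page.)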
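Each TOAST-able data type specifies a default strategy for columns of that data type, but the strategy for a given table column can be altered with ALTER TABLE ... SET STORAGE.
TOAST_TUPLE_TARGET can be adjusted for each table using ALTER TABLE ... SET (toast_tuple_target = N).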
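This scheme has a number of advantages compared to a more straightforward approach such as allowing row values to span pages. Assuming that queries are usually qualified by comparisons against relatively small key values, most of the work of the executor will be done using the main row entry. The big values of TOASTed attributes will only be pulled out (if selected at all) at the time the result set is sent to the client. Thus, the main table is much smaller and more of its rows fit in the shared buffer cache than would be the case without any out-of-line storage. Sort sets shrink also, and sorts will more often be done entirely in memory. A little test showed that a table containing typical HTML pages and their URLs was stored in about half of the raw data size including the TOAST table, and that the main table contained only about 10% of the entire data (the URLs and some small HTML pages). There was no run time difference compared to an un-TOASTed comparison table, in which all the HTML pages were cut down to 7 kB to fit.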
TOAST pointers can point to data that is not on disk, but is elsewhere in the memory of the current server process. Such pointers obviously cannot be long-lived, but they are nonetheless useful. There are currently two sub-cases: pointers to indirect data and pointers to expanded data.
Indirect TOAST pointers simply point at a non-indirect varlena value stored somewhere in memory. This case was originally created merely as a proof of concept, but it is currently used during logical decoding to avoid possibly having to create physical tuples exceeding 1 GB (as pulling all out-of-line field values into the tuple might do). The case is of limited use since the creator of the pointer datum is entirely responsible that the referenced data survives for as long as the pointer could exist, and there is no infrastructure to help with this.
Expanded TOAST pointers are useful for complex data types whose on-disk representation is not especially suited for computational purposes. As an example, the standard varlena representation of a PostgreSQL array includes dimensionality information, a nulls bitmap if there are any null elements, then the values of all the elements in order. When the element type itself is variable-length, the only way to find the N'th element is to scan through all the preceding elements. This representation is appropriate for on-disk storage because of its compactness, but for computations with the array it's much nicer to have an “expanded” or “deconstructed” representation in which all the element starting locations have been identified. The TOAST pointer mechanism supports this need by allowing a pass-by-reference Datum to point to either a standard varlena value (the on-disk representation) or a TOAST pointer that points to an expanded representation somewhere in memory. The details of this expanded representation are up to the data type, though it must have a standard header and meet the other API requirements given in src/include/utils/expandeddatum.h. C-level functions working with the data type can choose to handle either representation. Functions that do not know about the expanded representation, but simply apply PG_DETOAST_DATUM to their inputs, will automatically receive the traditional varlena representation; so support for an expanded representation can be introduced incrementally, one function at a time.
TOAST pointers to expanded values are further broken down into read-write and read-only pointers. The pointed-to representation is the same either way, but a function that receives a read-write pointer is allowed to modify the referenced value in-place, whereas one that receives a read-only pointer must not; it must first create a copy if it wants to make a modified version of the value. This distinction and some associated conventions make it possible to avoid unnecessary copying of expanded values during query execution.
For all types of in-memory TOAST pointer, the TOAST management code ensures that no such pointer datum can accidentally get stored on disk. In-memory TOAST pointers are automatically expanded to normal in-line varlena values before storage — and then possibly converted to on-disk TOAST pointers, if the containing tuple would otherwise be too big.
This section describes the storage format at the level of files and directories.
Traditionally, the configuration and data files used by a database cluster are stored together within the cluster's data directory, commonly referred to as PGDATA (after the name of the environment variable that can be used to define it). A common location for PGDATA is /var/lib/pgsql/data. Multiple clusters, managed by different server instances, can exist on the same machine.
The PGDATA directory contains several subdirectories and control files, as shown in Table 68.1. In addition to these required items, the cluster configuration files postgresql.conf, pg_hba.conf, and pg_ident.conf are traditionally stored in PGDATA, although it is possible to place them elsewhere.
Table 68.1. Contents of PGDATA

| Item | Description |
| --- | --- |
| PG_VERSION | A file containing the major version number of PostgreSQL |
| base | Subdirectory containing per-database subdirectories |
| current_logfiles | File recording the log file(s) currently written to by the logging collector |
| global | Subdirectory containing cluster-wide tables, such as pg_database |
| pg_commit_ts | Subdirectory containing transaction commit timestamp data |
| pg_dynshmem | Subdirectory containing files used by the dynamic shared memory subsystem |
| pg_logical | Subdirectory containing status data for logical decoding |
| pg_multixact | Subdirectory containing multitransaction status data (used for shared row locks) |
| pg_notify | Subdirectory containing LISTEN/NOTIFY status data |
| pg_replslot | Subdirectory containing replication slot data |
| pg_serial | Subdirectory containing information about committed serializable transactions |
| pg_snapshots | Subdirectory containing exported snapshots |
| pg_stat | Subdirectory containing permanent files for the statistics subsystem |
| pg_stat_tmp | Subdirectory containing temporary files for the statistics subsystem |
| pg_subtrans | Subdirectory containing subtransaction status data |
| pg_tblspc | Subdirectory containing symbolic links to tablespaces |
| pg_twophase | Subdirectory containing state files for prepared transactions |
| pg_wal | Subdirectory containing WAL (Write Ahead Log) files |
| pg_xact | Subdirectory containing transaction commit status data |
| postgresql.auto.conf | A file used for storing configuration parameters that are set by ALTER SYSTEM |
| postmaster.opts | A file recording the command-line options the server was last started with |
| postmaster.pid | A lock file recording the current postmaster process ID (PID), cluster data directory path, postmaster start timestamp, port number, Unix-domain socket directory path (empty on Windows), first valid listen_address (IP address or *, or empty if not listening on TCP), and shared memory segment ID (this file is not present after server shutdown) |
For each database in the cluster there is a subdirectory within PGDATA/base, named after the database's OID in pg_database. This subdirectory is the default location for the database's files; in particular, its system catalogs are stored there.
Note that the following sections describe the behavior of the builtin heap table access method, and the builtin index access methods. Due to the extensible nature of PostgreSQL, other access methods might work differently.
Each table and index is stored in a separate file. For ordinary relations, these files are named after the table or index's filenode number, which can be found in pg_class.relfilenode. But for temporary relations, the file name is of the form tBBB_FFF, where BBB is the backend ID of the backend which created the file, and FFF is the filenode number. In either case, in addition to the main file (also known as the main fork), each table and index has a free space map (see Section 68.3), which stores information about free space available in the relation. The free space map is stored in a file named with the filenode number plus the suffix _fsm. Tables also have a visibility map, stored in a fork with the suffix _vm, to track which pages are known to have no dead tuples. The visibility map is described further in Section 68.4. Unlogged tables and indexes have a third fork, known as the initialization fork, which is stored in a fork with the suffix _init (see Section 68.5).
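The naming rules above can be condensed into a small helper. This is a sketch only: the function and dictionary names are hypothetical, and it assumes that forks of temporary relations take the same suffixes as forks of ordinary relations.

```python
# Fork name -> file name suffix, as listed in the text.
FORK_SUFFIXES = {"main": "", "fsm": "_fsm", "vm": "_vm", "init": "_init"}


def relation_file_name(filenode, fork="main", temp_backend_id=None):
    """Return the file name of the first segment of the given fork.

    Ordinary relations are named after the filenode; temporary relations use
    the form tBBB_FFF, where BBB is the creating backend's ID.
    """
    if temp_backend_id is None:
        base = str(filenode)
    else:
        base = f"t{temp_backend_id}_{filenode}"
    return base + FORK_SUFFIXES[fork]
```

With this helper, `relation_file_name(12345, "fsm")` gives `12345_fsm`, matching the example used in the free space map description.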
Note that while a table's filenode often matches its OID, this is not necessarily the case; some operations, like TRUNCATE, REINDEX, CLUSTER and some forms of ALTER TABLE, can change the filenode while preserving the OID. Avoid assuming that filenode and table OID are the same. Also, for certain system catalogs including pg_class itself, pg_class.relfilenode contains zero. The actual filenode number of these catalogs is stored in a lower-level data structure, and can be obtained using the pg_relation_filenode() function.
When a table or index exceeds 1 GB, it is divided into gigabyte-sized segments. The first segment's file name is the same as the filenode; subsequent segments are named filenode.1, filenode.2, etc. This arrangement avoids problems on platforms that have file size limitations. (Actually, 1 GB is just the default segment size. The segment size can be adjusted using the configuration option --with-segsize when building PostgreSQL.) In principle, free space map and visibility map forks could require multiple segments as well, though this is unlikely to happen in practice.
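As a rough illustration of the segment naming, the following Python sketch lists the segment files for a main fork of a given size and maps a block number to its segment. It assumes the default 1 GB segment size and 8 kB pages; the function names are hypothetical.

```python
BLCKSZ = 8192            # assumed page size (the usual default)
SEGMENT_SIZE = 1 << 30   # default segment size of 1 GB


def segment_file_names(filenode, size_bytes, segment_size=SEGMENT_SIZE):
    """List segment file names for a fork holding size_bytes of data."""
    count = max(1, -(-size_bytes // segment_size))  # ceiling division, at least one file
    return [str(filenode) if n == 0 else f"{filenode}.{n}" for n in range(count)]


def locate_block(block_number, segment_size=SEGMENT_SIZE, block_size=BLCKSZ):
    """Return (segment number, byte offset within that segment) for a block."""
    blocks_per_segment = segment_size // block_size
    return (block_number // blocks_per_segment,
            (block_number % blocks_per_segment) * block_size)
```

Under these assumptions each segment holds 131072 blocks, so block 131072 is the first block of the file named filenode.1.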
A table that has columns with potentially large entries will have an associated TOAST table, which is used for out-of-line storage of field values that are too large to keep in the table rows proper. pg_class.reltoastrelid links from a table to its TOAST table, if any. See Section 68.2 for more information.
The contents of tables and indexes are discussed further in Section 68.6.
Tablespaces make the scenario more complicated. Each user-defined tablespace has a symbolic link inside the PGDATA/pg_tblspc directory, which points to the physical tablespace directory (i.e., the location specified in the tablespace's CREATE TABLESPACE command). This symbolic link is named after the tablespace's OID. Inside the physical tablespace directory there is a subdirectory with a name that depends on the PostgreSQL server version, such as PG_9.0_201008051. (The reason for using this subdirectory is so that successive versions of the database can use the same CREATE TABLESPACE location value without conflicts.) Within the version-specific subdirectory, there is a subdirectory for each database that has elements in the tablespace, named after the database's OID. Tables and indexes are stored within that directory, using the filenode naming scheme. The pg_default tablespace is not accessed through pg_tblspc, but corresponds to PGDATA/base. Similarly, the pg_global tablespace is not accessed through pg_tblspc, but corresponds to PGDATA/global.
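The directory rules for the three kinds of tablespace can be summarized in a short Python sketch. The function and parameter names are hypothetical, the version directory must be supplied by the caller, and in practice pg_relation_filepath() is the reliable way to obtain these paths.

```python
def relation_path(filenode, db_oid=None, tablespace_oid=None, version_dir=None):
    """Build a path relative to PGDATA for the first segment of a main fork.

    - db_oid is None: treat the relation as living in pg_global.
    - tablespace_oid is None: treat it as living in pg_default.
    - otherwise: go through the symbolic link in pg_tblspc, then the
      version-specific subdirectory, then the per-database subdirectory.
    """
    if db_oid is None:
        return f"global/{filenode}"
    if tablespace_oid is None:
        return f"base/{db_oid}/{filenode}"
    if version_dir is None:
        raise ValueError("version_dir is required for user-defined tablespaces")
    return f"pg_tblspc/{tablespace_oid}/{version_dir}/{db_oid}/{filenode}"
```

For instance, with the version directory name quoted in the text, a relation in a user-defined tablespace would resolve to a path such as pg_tblspc/16500/PG_9.0_201008051/16384/16450 (the OIDs here are made up for illustration).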
The pg_relation_filepath() function shows the entire path (relative to PGDATA) of any relation. It is often useful as a substitute for remembering many of the above rules. But keep in mind that this function just gives the name of the first segment of the main fork of the relation; you may need to append a segment number and/or _fsm, _vm, or _init to find all the files associated with the relation.
Temporary files (for operations such as sorting more data than can fit in memory) are created within PGDATA/base/pgsql_tmp, or within a pgsql_tmp subdirectory of a tablespace directory if a tablespace other than pg_default is specified for them. The name of a temporary file has the form pgsql_tmpPPP.NNN, where PPP is the PID of the owning backend and NNN distinguishes different temporary files of that backend.
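Each heap relation has a Visibility Map (VM) to keep track of which pages contain only tuples that are known to be visible to all active transactions; it also keeps track of which pages contain only frozen tuples. It's stored alongside the main relation data in a separate relation fork, named after the filenode number of the relation, plus a _vm suffix. For example, if the filenode of a relation is 12345, the VM is stored in a file called 12345_vm, in the same directory as the main relation file. Note that indexes do not have VMs.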
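The visibility map stores two bits per heap page. The first bit, if set, indicates that the page is all-visible, or in other words that the page does not contain any tuples that need to be vacuumed. This information can also be used by index-only scans to answer queries using only the index tuple. The second bit, if set, means that all tuples on the page have been frozen. That means that even an anti-wraparound vacuum need not revisit the page.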
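With two bits per heap page, one map byte covers four heap pages. The Python sketch below locates the bits for a given heap block. The constants are assumptions: 8 kB pages, a 24-byte page header on each map page, bits packed starting from the low-order end of each byte, and flag values 0x01 (all-visible) and 0x02 (all-frozen). The function names are hypothetical.

```python
BLCKSZ = 8192                         # assumed page size
PAGE_HEADER_SIZE = 24                 # assumed header size on each map page
BITS_PER_HEAPBLOCK = 2
HEAPBLOCKS_PER_BYTE = 8 // BITS_PER_HEAPBLOCK
MAPSIZE = BLCKSZ - PAGE_HEADER_SIZE   # usable map bytes per page
HEAPBLOCKS_PER_PAGE = MAPSIZE * HEAPBLOCKS_PER_BYTE

ALL_VISIBLE = 0x01  # assumed flag value for the first bit
ALL_FROZEN = 0x02   # assumed flag value for the second bit


def vm_location(heap_block):
    """Return (map page, byte within the map area, bit shift) for a heap block."""
    map_page = heap_block // HEAPBLOCKS_PER_PAGE
    map_byte = (heap_block % HEAPBLOCKS_PER_PAGE) // HEAPBLOCKS_PER_BYTE
    shift = (heap_block % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK
    return map_page, map_byte, shift


def vm_flags(map_bytes, heap_block):
    """Extract the two status bits for heap_block from one map page's byte area."""
    _, map_byte, shift = vm_location(heap_block)
    return (map_bytes[map_byte] >> shift) & 0x03
```

Under these assumptions a single map page describes 32672 heap pages, which is roughly 255 MB of heap.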
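The map is conservative in the sense that we make sure that whenever a bit is set, we know the condition is true, but if a bit is not set, it might or might not be true. Visibility map bits are only set by vacuum, but are cleared by any data-modifying operations on a page.
The pg_visibility module can be used to examine the information stored in the visibility map.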
This section provides an overview of the page format used within PostgreSQL tables and indexes.[15] Sequences and TOAST tables are formatted just like a regular table.
In the following explanation, a byte is assumed to contain 8 bits. In addition, the term item refers to an individual data value that is stored on a page. In a table, an item is a row; in an index, an item is an index entry.
Every table and index is stored as an array of pages of a fixed size (usually 8 kB, although a different page size can be selected when compiling the server). In a table, all the pages are logically equivalent, so a particular item (row) can be stored in any page. In indexes, the first page is generally reserved as a metapage holding control information, and there can be different types of pages within the index, depending on the index access method.
Table 68.2 shows the overall layout of a page. There are five parts to each page.
Table 68.2. Overall Page Layout

| Item | Description |
| --- | --- |
| PageHeaderData | 24 bytes long. Contains general information about the page, including free space pointers. |
| ItemIdData | Array of item identifiers pointing to the actual items. Each entry is an (offset,length) pair. 4 bytes per item. |
| Free space | The unallocated space. New item identifiers are allocated from the start of this area, new items from the end. |
| Items | The actual items themselves. |
| Special space | Index access method specific data. Different methods store different data. Empty in ordinary tables. |
The first 24 bytes of each page consist of a page header (PageHeaderData). Its format is detailed in Table 68.3. The first field tracks the most recent WAL entry related to this page. The second field contains the page checksum if data checksums are enabled. Next is a 2-byte field containing flag bits. This is followed by three 2-byte integer fields (pd_lower, pd_upper, and pd_special). These contain byte offsets from the page start to the start of unallocated space, to the end of unallocated space, and to the start of the special space. The next 2 bytes of the page header, pd_pagesize_version, store both the page size and a version indicator. Beginning with PostgreSQL 8.3 the version number is 4; PostgreSQL 8.1 and 8.2 used version number 3; PostgreSQL 8.0 used version number 2; PostgreSQL 7.3 and 7.4 used version number 1; prior releases used version number 0. (The basic page layout and header format has not changed in most of these versions, but the layout of heap row headers has.) The page size is basically only present as a cross-check; there is no support for having more than one page size in an installation. The last field is a hint that shows whether pruning the page is likely to be profitable: it tracks the oldest un-pruned XMAX on the page.
Table 68.3. PageHeaderData Layout

| Field | Type | Length | Description |
| --- | --- | --- | --- |
| pd_lsn | PageXLogRecPtr | 8 bytes | LSN: next byte after last byte of WAL record for last change to this page |
| pd_checksum | uint16 | 2 bytes | Page checksum |
| pd_flags | uint16 | 2 bytes | Flag bits |
| pd_lower | LocationIndex | 2 bytes | Offset to start of free space |
| pd_upper | LocationIndex | 2 bytes | Offset to end of free space |
| pd_special | LocationIndex | 2 bytes | Offset to start of special space |
| pd_pagesize_version | uint16 | 2 bytes | Page size and layout version number information |
| pd_prune_xid | TransactionId | 4 bytes | Oldest unpruned XMAX on page, or zero if none |
All the details can be found in src/include/storage/bufpage.h.
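To make the header layout concrete, here is a minimal Python sketch that unpacks the 24-byte header from a raw page image. It assumes a little-endian machine, that pd_lsn is stored as two 32-bit halves with the high half first, and that pd_pagesize_version keeps the page size in its high-order byte and the layout version in its low-order byte. The function name and the derived keys free_space and item_count are hypothetical.

```python
import struct

# pd_lsn (2 x uint32), pd_checksum, pd_flags, pd_lower, pd_upper, pd_special,
# pd_pagesize_version (6 x uint16), pd_prune_xid (uint32): 24 bytes in total.
PAGE_HEADER = struct.Struct("<IIHHHHHHI")
ITEM_ID_SIZE = 4  # each item identifier takes four bytes


def parse_page_header(page):
    """Decode PageHeaderData from the first 24 bytes of a page image."""
    (lsn_hi, lsn_lo, checksum, flags, lower, upper,
     special, pagesize_version, prune_xid) = PAGE_HEADER.unpack_from(page, 0)
    return {
        "pd_lsn": (lsn_hi << 32) | lsn_lo,
        "pd_checksum": checksum,
        "pd_flags": flags,
        "pd_lower": lower,
        "pd_upper": upper,
        "pd_special": special,
        "page_size": pagesize_version & 0xFF00,      # assumed encoding
        "layout_version": pagesize_version & 0x00FF,  # assumed encoding
        "pd_prune_xid": prune_xid,
        # Derived values: unallocated bytes and number of item identifiers.
        "free_space": upper - lower,
        "item_count": (lower - PAGE_HEADER.size) // ITEM_ID_SIZE,
    }
```

The derived item_count follows from the fact that item identifiers start right after the header and pd_lower marks where they end.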
Following the page header are item identifiers (ItemIdData), each requiring four bytes. An item identifier contains a byte-offset to the start of an item, its length in bytes, and a few attribute bits which affect its interpretation. New item identifiers are allocated as needed from the beginning of the unallocated space. The number of item identifiers present can be determined by looking at pd_lower, which is increased to allocate a new identifier. Because an item identifier is never moved until it is freed, its index can be used on a long-term basis to reference an item, even when the item itself is moved around on the page to compact free space. In fact, every pointer to an item (ItemPointer, also known as CTID) created by PostgreSQL consists of a page number and the index of an item identifier.
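A four-byte item identifier can be decoded along the following lines. The exact bit split is an assumption, not something stated in the text: the sketch takes 15 bits of offset, 2 attribute bits, and 15 bits of length, packed from the low-order end on a little-endian machine. The function names are hypothetical, and item indexes are taken to start at 1.

```python
import struct

PAGE_HEADER_SIZE = 24
ITEM_ID_SIZE = 4


def decode_item_id(raw):
    """Split a 4-byte item identifier into offset, attribute bits, and length."""
    (word,) = struct.unpack("<I", raw)
    return {
        "offset": word & 0x7FFF,          # byte offset of the item on the page
        "flags": (word >> 15) & 0x3,      # attribute bits
        "length": (word >> 17) & 0x7FFF,  # item length in bytes
    }


def read_item(page, index):
    """Return the bytes of the item referenced by identifier number index (1-based)."""
    start = PAGE_HEADER_SIZE + (index - 1) * ITEM_ID_SIZE
    item_id = decode_item_id(page[start:start + ITEM_ID_SIZE])
    return bytes(page[item_id["offset"]:item_id["offset"] + item_id["length"]])
```

Because lookups always go through the identifier array, the item bytes can move within the page without invalidating an existing (page number, index) pointer.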
The items themselves are stored in space allocated backwards from the end of unallocated space. The exact structure varies depending on what the table is to contain. Tables and sequences both use a structure named HeapTupleHeaderData, described below.
The final section is the “special section” which can contain anything the access method wishes to store. For example, b-tree indexes store links to the page's left and right siblings, as well as some other data relevant to the index structure. Ordinary tables do not use a special section at all (indicated by setting pd_special to equal the page size).
Figure 68.1 illustrates how these parts are laid out in a page.
Figure 68.1. Page Layout
All table rows are structured in the same way. There is a fixed-size header (occupying 23 bytes on most machines), followed by an optional null bitmap, an optional object ID field, and the user data. The header is detailed in Table 68.4. The actual user data (columns of the row) begins at the offset indicated by t_hoff, which must always be a multiple of the MAXALIGN distance for the platform. The null bitmap is only present if the HEAP_HASNULL bit is set in t_infomask. If it is present it begins just after the fixed header and occupies enough bytes to have one bit per data column (that is, the number of bits that equals the attribute count in t_infomask2). In this list of bits, a 1 bit indicates not-null, a 0 bit is a null. When the bitmap is not present, all columns are assumed not-null. The object ID is only present if the HEAP_HASOID_OLD bit is set in t_infomask. If present, it appears just before the t_hoff boundary. Any padding needed to make t_hoff a MAXALIGN multiple will appear between the null bitmap and the object ID. (This in turn ensures that the object ID is suitably aligned.)
Table 68.4. HeapTupleHeaderData Layout

| Field | Type | Length | Description |
| --- | --- | --- | --- |
| t_xmin | TransactionId | 4 bytes | insert XID stamp |
| t_xmax | TransactionId | 4 bytes | delete XID stamp |
| t_cid | CommandId | 4 bytes | insert and/or delete CID stamp (overlays with t_xvac) |
| t_xvac | TransactionId | 4 bytes | XID for VACUUM operation moving a row version |
| t_ctid | ItemPointerData | 6 bytes | current TID of this or newer row version |
| t_infomask2 | uint16 | 2 bytes | number of attributes, plus various flag bits |
| t_infomask | uint16 | 2 bytes | various flag bits |
| t_hoff | uint8 | 1 byte | offset to user data |
All the details can be found in src/include/access/htup_details.h.
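The rules for t_hoff and the null bitmap can be worked through in a short Python sketch. It assumes a MAXALIGN distance of 8 bytes, a 4-byte object ID, and that the bitmap assigns columns to bits starting from the low-order bit of each byte; the function names are hypothetical.

```python
FIXED_HEADER_SIZE = 23  # fixed-size part of the row header
MAXALIGN = 8            # assumed alignment distance for the platform
OID_SIZE = 4            # assumed size of the optional object ID


def maxalign(n):
    """Round n up to the next multiple of MAXALIGN."""
    return (n + MAXALIGN - 1) // MAXALIGN * MAXALIGN


def tuple_header_size(natts, has_nulls, has_oid=False):
    """Compute t_hoff: fixed header, optional null bitmap, padding, optional OID."""
    size = FIXED_HEADER_SIZE
    if has_nulls:
        size += (natts + 7) // 8  # one bit per column, rounded up to whole bytes
    if has_oid:
        size += OID_SIZE
    return maxalign(size)


def is_null(bitmap, column):
    """Check column (0-based) in the null bitmap: a 0 bit means NULL."""
    return not (bitmap[column // 8] & (1 << (column % 8)))
```

Under these assumptions a row with up to eight columns gets its null bitmap for free: 23 + 1 = 24 bytes, which is already a multiple of 8, the same t_hoff as a row with no bitmap at all.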
Interpreting the actual data can only be done with information obtained from other tables, mostly pg_attribute. The key values needed to identify field locations are attlen and attalign. There is no way to directly get a particular attribute, except when there are only fixed width fields and no null values. All this trickery is wrapped up in the functions heap_getattr, fastgetattr and heap_getsysattr.
To read the data you need to examine each attribute in turn. First check whether the field is NULL according to the null bitmap. If it is, go to the next. Then make sure you have the right alignment. If the field is a fixed width field, then all the bytes are simply placed. If it's a variable length field (attlen = -1) then it's a bit more complicated. All variable-length data types share the common header structure struct varlena, which includes the total length of the stored value and some flag bits. Depending on the flags, the data can be either inline or in a TOAST table; it might be compressed, too (see Section 68.2).
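For the simple case mentioned above (only fixed-width fields and no nulls), the column positions can be computed directly from attlen and attalign. The Python sketch below assumes the usual alignment codes 'c', 's', 'i', and 'd' standing for 1, 2, 4, and 8 bytes; the function name is hypothetical and the offsets are relative to the start of the user data at t_hoff.

```python
# Assumed mapping from attalign codes to alignment in bytes.
ALIGNMENT = {"c": 1, "s": 2, "i": 4, "d": 8}


def fixed_width_offsets(columns):
    """Compute data offsets for a row of fixed-width, non-null columns.

    columns is a list of (attlen, attalign) pairs in column order. Each column
    is first aligned to its boundary, then its attlen bytes are consumed.
    """
    offsets = []
    position = 0
    for attlen, attalign in columns:
        if attlen < 0:
            raise ValueError("variable-length columns need a sequential scan")
        align = ALIGNMENT[attalign]
        position = (position + align - 1) // align * align
        offsets.append(position)
        position += attlen
    return offsets
```

As a usage example, a row of (int4, "char", float8, int2) columns would be laid out at offsets 0, 4, 8, and 16 under these assumptions, with three bytes of padding before the float8 value.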
[15] Actually, use of this page format is not required for either table or index access methods. The heap table access method always uses this format. All the existing index methods also use the basic format, but the data kept on index metapages usually doesn't follow the item layout rules.