WWW Interdb JP PG pgsql01 HTML
WWW Interdb JP PG pgsql01 HTML
WWW Interdb JP PG pgsql01 HTML
Chapter 1
T
his chapter and the next chapter summarize the basic knowledge of PostgreSQL to help to read the subsequent chapters. In this chapter, following topics are
described:
If you are already familiar with them, you may skip over this chapter.
All the database objects in PostgreSQL are internally managed by respective object identifiers (OIDs), which are unsigned 4-byte integers. The relations between database
objects and the respective OIDs are stored in appropriate system catalogs, depending on the type of objects. For example, OIDs of databases and heap tables are stored in
pg_database and pg_class respectively, so you can find out the OIDs you want to know by issuing the queries such as the following:
datname | oid
----------+-------
sampledb | 16384
(1 row)
relname | oid
-----------+-------
sampletbl | 18740
(1 row)
In the following subsections, the layout of a database cluster, databases, files associated with tables and indexes, and the tablespace in PostgreSQL are described.
table 1.1: Layout of files and subdirectories under the base directory (From the official document)
files description
postgresql.auto.conf A file used for storing configuration parameters that are set in ALTER SYSTEM (version 9.4 or later)
postmaster.opts A file recording the command line options the server was last started with
subdirectories description
pg_clog/ (Version 9.6 or Subdirectory containing transaction commit state data. It is renamed to pg_xact in Version 10. CLOG will be described in Section 5.4.
earlier)
pg_dynshmem/ Subdirectory containing files used by the dynamic shared memory subsystem. Version 9.4 or later.
pg_logical/ Subdirectory containing status data for logical decoding. Version 9.4 or later.
pg_multixact/ Subdirectory containing multitransaction status data (used for shared row locks)
pg_serial/ Subdirectory containing information about committed serializable transactions (version 9.1 or later)
pg_wal/ (Version 10 or Subdirectory containing WAL (Write Ahead Logging) segment files. It is renamed from pg_xlog in Version 10.
later)
pg_xact/ (Version 10 or Subdirectory containing transaction commit state data. It is renamed from pg_clog in Version 10.
CLOG will be described in Section 5.4.
later)
pg_xlog/ (Version 9.6 or Subdirectory containing WAL (Write Ahead Logging) segment files. It is renamed to pg_wal in Version 10.
earlier)
$ ls -ld base/16384
-----------+-------+-------------
(1 row)
From the result above, you can see that both oid and relfilenode values are equal. You can also see that the data file path of the table sampletbl is 'base/16384/18740'.
$ cd $PGDATA
$ ls -la base/16384/18740
The relfilenode values of tables and indexes are changed by issuing some commands (e.g., TRUNCATE, REINDEX, CLUSTER). For example, if we truncate the table
sampletbl, PostgreSQL assigns a new relfilenode (18812) to the table, removes the old data file (18740), and creates a new one (18812).
sampledb=# TRUNCATE sampletbl;
TRUNCATE TABLE
sampledb=# SELECT relname, oid, relfilenode FROM pg_class WHERE relname = 'sampletbl';
-----------+-------+-------------
(1 row)
In version 9.0 or later, the built-in function pg_relation_filepath is useful as this function returns the file path name of the relation with the specified OID or name.
pg_relation_filepath
----------------------
base/16384/18812
(1 row)
When the file size of tables and indexes exceeds 1GB, PostgreSQL creates a new file named like relfilenode.1 and uses it. If the new file has been filled up, next new file
named like relfilenode.2 will be created, and so on.
$ cd $PGDATA
$ ls -la -h base/16384/19427*
...
The maximum file size of tables and indexes can be changed using the configuration, option --with-segsize when building PostgreSQL.
Looking carefully at the database subdirectories, you will find out that each table has two associated files suffixed respectively with '_fsm' and '_vm'. Those are referred to
as free space map and visibility map, storing the information of the free space capacity and the visibility on each page within the table file, respectively (see more detail in
Section 5.3.4 and Section 6.2). Indexes only have individual free space maps and don't have visibility map.
$ cd $PGDATA
$ ls -la base/16384/18751*
They may also be internally referred to as the forks of each relation; the free space map is the first fork of the table/index data file (the fork number is 1), the visibility map
the second fork of the table's data file (the fork number is 2). The fork number of the data file is 0.
1.2.4. Tablespaces
A tablespace in PostgreSQL is an additional data area outside the base directory. This function has been implemented in version 8.0.
Figure 1.3 shows the internal layout of a tablespace, and the relationship with the main data area.
A tablespace is created under the directory specified when you issue CREATE TABLESPACE statement, and under that directory, the version-specific subdirectory (e.g.,
PG_14_202011044) will be created. The naming method for version-specific one is shown below.
PG _ 'Major version' _ 'Catalogue version number'
For example, if you create a tablespace 'new_tblspc' at '/home/postgres/tblspc', whose oid is 16386, a subdirectory such as 'PG_14_202011044' would be created under the
tablespace.
$ ls -l /home/postgres/tblspc/
total 4
The tablespace directory is addressed by a symbolic link from the pg_tblspc subdirectory, and the link name is the same as the OID value of tablespace.
$ ls -l $PGDATA/pg_tblspc/
total 0
If you create a new database (OID is 16387) under the tablespace, its directory is created under the version-specific subdirectory.
$ ls -l /home/postgres/tblspc/PG_14_202011044/
total 4
If you create a new table which belongs to the database created under the base directory, first, the new directory, whose name is the same as the existing database OID, is
created under the version specific subdirectory, and then the new table file is placed under the created directory.
pg_relation_filepath
---------------------------------------------
pg_tblspc/16386/PG_14_202011044/16384/18894
Internal layout of pages depends on the data file types. In this section, the table layout is described as the information will be required in the following chapters.
1. heap tuple(s) –
A heap tuple is a record data itself. They are stacked in order from the bottom of the page. The internal structure of tuple is described in Section 5.2 and
Chapter 9 as the knowledge of both Concurrency Control(CC) and WAL in PostgreSQL are required.
2. line pointer(s) –
A line pointer is 4 byte long and holds a pointer to each heap tuple. It is also called an item pointer.
Line pointers form a simple array, which plays the role of index to the tuples. Each index is numbered sequentially from 1, and called offset number. When a new
tuple is added to the page, a new line pointer is also pushed onto the array to point to the new one.
3. header data –
A header data defined by the structure PageHeaderData is allocated in the beginning of the page. It is 24 byte long and contains general information
about the page. The major variables of the structure are described below.
pd_lsn –
This variable stores the LSN of XLOG record written by the last change of this page. It is an 8-byte unsigned integer, related to the WAL (Write-Ahead
Logging) mechanism. The details are described in Chapter 9.
pd_checksum – This variable stores the checksum value of this page. (Note that this variable is supported in version 9.3 or later; in earlier versions, this part had
stored the timelineId of the page.)
pd_lower, pd_upper – pd_lower points to the end of line pointers, and pd_upper to the beginning of the newest heap tuple.
pd_special –
This variable is for indexes. In the page within tables, it points to the end of the page. (In the page within indexes, it points to the beginning of special
space which is the data area held only by indexes and contains the particular data according to the kind of index types such as B-tree, GiST, GiN, etc.)
An empty space between the end of line pointers and the beginning of the newest tuple is referred to as free space or hole.
To identify a tuple within the table, tuple identifier (TID) is internally used. A TID comprises a pair of values: the block number of the page that contains the tuple, and the
offset number of the line pointer that points to the tuple. A typical example of its usage is index. See more detail in Section 1.4.2.
In the field of computer science, this type of page is called a slotted page,
and the line pointers correspond to a slot array.
In addition, heap tuple whose size is greater than about 2 KB (about 1/4 of 8 KB) is stored and managed using a method called TOAST (The Oversized-Attribute Storage
Technique). Refer PostgreSQL documentation for details.
When the second tuple is inserted, it is placed after the first one. The second line pointer is pushed onto the first one, and it points to the second tuple. The pd_lower changes
to point to the second line pointer, and the pd_upper to the second heap tuple. See Fig. 1.5(b). Other header data within this page (e.g., pd_lsn, pg_checksum, pg_flag) are
also rewritten to appropriate values; more details are described in Section 5.3 and Chapter 9.
Sequential scan –
All tuples in all pages are sequentially read by scanning all line pointers in each page. See Fig. 1.6(a).
B-tree index scan –
An index file contains index tuples, each of which is composed of an index key and a TID pointing to the target heap tuple. If the index tuple with
the key that you are looking for has been found, PostgreSQL reads the desired heap tuple using the obtained TID value. (The description of the way to find the index
tuples in B-tree index is not explained here as it is very common and the space here is limited. See the relevant materials.) For example, in Fig. 1.6(b), TID value of the
obtained index tuple is ‘(block = 7, Offset = 2)’.
It means that the target heap tuple is 2nd tuple in the 7th page within the table, so PostgreSQL can read the desired
heap tuple without unnecessary scanning in the pages.
Indexes Internals
This document does not explain indexes in details. To understand them, I recommend to read the valuable posts shown below:
Indexes in PostgreSQL — 1
Indexes in PostgreSQL — 2
Indexes in PostgreSQL — 3 (Hash)
Indexes in PostgreSQL — 4 (Btree)
Indexes in PostgreSQL — 5 (GiST)
Indexes in PostgreSQL — 6 (SP-GiST)
Indexes in PostgreSQL — 7 (GIN)
Indexes in PostgreSQL — 9 (BRIN)
TID-Scan is a method that accesses a tuple directly by using TID of the desired tuple. For example, to find the 1st tuple in the 0-th page within the table, issue the following query:
-------+-----------
(0,1) | AAAAAAAAA
(1 row)
Index-Only-Scan will be described in details in Chapter 7.