Optimizing PostgreSQL: Mastering Speed and Efficiency in Databases
Imagine you have a table named Employees in your PostgreSQL database containing four entries. During a transaction, you need to update two of these entries. By the end of the transaction, how many records will remain in the Employees table? If you guessed four, you’re about to discover something interesting.
When you update a record, PostgreSQL does not simply overwrite it. Instead, it marks the old record as a dead tuple (similar to a soft delete) and creates a new row version with the updated values. Consequently, the table now physically holds six tuples: the four live rows plus two dead ones.
Here’s an example of creating and updating the Employees table:
CREATE TABLE employees (id SERIAL PRIMARY KEY, name TEXT, salary INTEGER);
INSERT INTO employees (name, salary) VALUES ('Alice', 50000), ('Bob', 60000), ('Matt', 42000), ('Chris', 86000);
-- Inspect the hidden system columns before updating
SELECT xmin, xmax, ctid, * FROM employees;
UPDATE employees SET salary = 65000 WHERE id = 2;
UPDATE employees SET name = 'Bob' WHERE id = 4;
-- Run the same SELECT again: the updated rows reappear with new xmin and ctid values
SELECT xmin, xmax, ctid, * FROM employees;
PostgreSQL tables include hidden columns such as xmin, xmax, and ctid.
Here's what each of those columns signifies:
- xmin: The ID of the transaction (TXID) that inserted this row version. When a row is updated, the new version carries the updating transaction's ID as its xmin.
- xmax: The TXID of the transaction that deleted or superseded this row version. A value of 0 indicates that the row version has never been deleted or updated (it can also briefly hold the ID of a transaction that merely locked the row).
- ctid: The physical location of the record within the table, represented by two values.
For instance, a ctid value of (0,5) signifies that the row is located in block 0 (often referred to as a page) as the fifth row within that block. But why does PostgreSQL operate this way?
This behavior arises from PostgreSQL's multi-version concurrency control (MVCC) mechanism.
What’s the purpose of retaining those dead tuples?
Multi-Version Concurrency Control
Multi-Version Concurrency Control (MVCC) is a technique employed by PostgreSQL that enables multiple transactions to access the same data concurrently without interfering with one another. MVCC is implemented using snapshot isolation and row versioning.
In MVCC, each transaction views a snapshot of the database as of a particular point in time, rather than its live state. A snapshot is not a physical copy of the data: it is a compact record of which transactions had committed when the snapshot was taken. As a transaction reads, PostgreSQL compares each row version's xmin and xmax against that snapshot to decide which version the transaction is allowed to see.
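A quick way to see snapshot isolation in action is to open two psql sessions. Here is a sketch using the employees table from earlier, at the REPEATABLE READ level (the updated value 55000 is illustrative):
-- Session 1
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT salary FROM employees WHERE id = 1;  -- sees Alice's salary: 50000
-- Session 2 (meanwhile, in its own connection)
UPDATE employees SET salary = 55000 WHERE id = 1;  -- creates a new row version and commits
-- Session 1, still inside its transaction
SELECT salary FROM employees WHERE id = 1;  -- still sees 50000; the old version remains visible to its snapshot
COMMIT;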
In MySQL's InnoDB engine, plain reads are also non-blocking; the contention appears with locking reads. If Transaction A holds a row lock (for example, via SELECT ... FOR UPDATE) while Transaction B writes to Table X, both proceed without blocking each other as long as they operate on different rows. If they target the same row, Transaction B must wait for Transaction A to release its lock before it can proceed.
Conversely, in PostgreSQL, Transaction A can read data from Table X without hindering Transaction B, even if they both involve the same row. This is due to PostgreSQL’s MVCC, which allows multiple transactions to read the same data concurrently while ensuring consistency and preventing conflicts. Transaction B can secure an exclusive lock on the row it intends to change, while other transactions can still access the previous version until Transaction B finalizes its modifications.
MVCC in PostgreSQL is an effective mechanism that supports high concurrency and scalability in a multi-user environment. However, it does require additional storage and processing power since multiple versions of each row are stored in the database.
By now, you should understand why SELECT COUNT(*) can be slow in PostgreSQL: it must scan the row versions and check each one for visibility against the current snapshot. If an exact count is not necessary, consider the following estimate instead. You can delve deeper into Slow Counting and Count Estimate.
SELECT reltuples AS estimate FROM pg_class WHERE relname = 'table_name';
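Keep in mind that reltuples is only an estimate: it is refreshed by VACUUM, ANALYZE, and a few DDL commands such as CREATE INDEX, so it can lag behind the true count on a busy table.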
> When it comes to storage, PostgreSQL resembles a teenager’s room — filled with clutter and bloat that needs occasional cleanup.
Understanding PostgreSQL Bloat
In PostgreSQL, bloat refers to storage inefficiencies that may arise as tables and indexes expand over time. Various types of bloat can manifest in PostgreSQL:
- Dead Rows: As previously mentioned, when rows are updated or deleted, the old versions remain in the table as "dead" tuples until a VACUUM operation reclaims them. Tables can accumulate numerous dead rows, consuming disk space and slowing query performance (the queries after this list show how to measure this).
- Index Bloat: Indexes can become inefficient and occupy excessive disk space as they grow. Fragmentation, where index pages are only partially filled, or duplicate entries can contribute to this bloat.
- Table Bloat: A table bloats when a large share of its pages holds dead rows or reusable empty space, increasing disk usage and slowing sequential scans; stale planner statistics compound the problem by producing poor query plans.
- Transaction ID Wraparound: PostgreSQL employs a 32-bit transaction ID to track database changes. If the counter were allowed to wrap around, old rows would suddenly appear to be "in the future" and vanish from query results. VACUUM guards against this by freezing old tuples; if freezing falls too far behind, PostgreSQL stops accepting new transactions until a manual VACUUM is performed.
- Toast Table Bloat: Large values in TEXT, VARCHAR, and BYTEA columns are stored out of line in a separate "toast" table. Frequent updates or very large values can lead to bloat in the toast table, slowing down queries.
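If you want to measure dead-row and wraparound pressure yourself, the built-in statistics views can help. A sketch (the ordering and LIMIT are just for readability):
-- Tables with the most dead tuples, i.e., the best vacuum candidates
SELECT relname, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
-- How close each database is to transaction ID wraparound
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;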
To verify if a table utilizes the toast table in PostgreSQL, you can run the following query. The pg_class catalog table holds information about all tables, including their attributes and storage options.
SELECT relname, relkind, reltoastrelid
FROM pg_class
WHERE relname = 'tableName';
If the reltoastrelid column is nonzero, the table has an associated toast table (a value of 0 means it has none). The relkind column will display 'r' for a regular table or 't' for a toast table.
Note that a table may toast some columns but not others. You can inspect the attstorage column in the pg_attribute catalog table to see each column's storage strategy. A value of 'x' (extended) means the column's values may be compressed and moved out of line into the toast table when they grow large, while 'p' (plain) means they never will be.
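For example, this query lists the storage strategy of every user column in the employees table from earlier:
SELECT attname, attstorage
FROM pg_attribute
WHERE attrelid = 'employees'::regclass
  AND attnum > 0
  AND NOT attisdropped;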
So, what can we do about this bloat?
Vacuum: The Cleanup Solution
Vacuum is a maintenance operation in PostgreSQL that recovers storage space and enhances database performance by removing dead rows, updating statistics, and compacting pages. Vacuum is crucial for avoiding performance decline and ensuring the database runs efficiently.
- Reclaims Disk Space: When data is updated or deleted, the space occupied by the old row versions is not immediately released. Plain VACUUM marks that space as reusable for future rows in the same table (it returns space to the operating system only when whole pages at the end of the file become empty), while VACUUM FULL rewrites the table and actually shrinks the file.
- Improves Query Performance: VACUUM ANALYZE refreshes statistics regarding data distribution, which the query planner uses to formulate better plans, resulting in faster queries. VACUUM also updates the visibility map, which makes index-only scans possible.
- Prevents Transaction ID Wraparound: Transaction IDs are essential for tracking database changes, and the 32-bit counter will eventually wrap. Vacuum prevents problems by freezing old tuples so their transaction IDs can be safely reused.
- Compacts Tables: Over time, data can become scattered across many partially filled pages. Plain VACUUM frees space within pages, while VACUUM FULL rewrites the table into the minimum number of pages, reducing the disk I/O needed to read the data; see the command variants sketched below.
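Concretely, the common variants look like this (sketched against the employees table; only the third one rewrites the file):
VACUUM employees;          -- reclaim dead tuples for reuse; runs alongside normal reads and writes
VACUUM ANALYZE employees;  -- additionally refresh planner statistics
VACUUM FULL employees;     -- rewrite the table compactly; takes an ACCESS EXCLUSIVE lock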
In summary, vacuum is the answer to bloat in PostgreSQL. But why did we discuss all of this?
While vacuuming is a crucial maintenance task in PostgreSQL, it does come with potential drawbacks and trade-offs.
Here are some downsides to consider:
- Resource Usage: Vacuuming can be resource-intensive, especially for large tables with numerous dead rows. It may consume significant CPU, memory, and I/O resources, potentially affecting the performance of other processes.
- Locking: Plain VACUUM takes only a lightweight lock and runs alongside normal reads and writes, but VACUUM FULL requires an exclusive lock, preventing other sessions from reading or writing the table until the operation concludes. This can create contention and delays if the table is frequently accessed.
- Index Maintenance: Vacuuming also scans every index on the table to remove entries that point at dead tuples. This can be costly, especially for large tables with numerous indexes, potentially extending the vacuum duration.
- Slowdowns during Peak Usage: Executing vacuuming on a large table during peak hours can hinder application performance. Scheduling vacuuming during off-peak times is advisable to minimize user impact.
- Increased Storage Requirements: VACUUM FULL writes a fresh copy of the table without the dead rows before dropping the old file, so the operation temporarily needs roughly double the table's disk space. This can be a concern for systems with limited capacity.
PostgreSQL acquires an AccessExclusiveLock during VACUUM FULL, which prevents any other session from operating on the table until the rewrite completes.
AccessExclusiveLock is the strongest table-level lock: it conflicts with every other lock mode, including shared locks, so no other session can even read the table until it is released. Plain VACUUM, by contrast, takes a ShareUpdateExclusiveLock, which still allows ordinary reads and writes.
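You can observe this from another session while a VACUUM FULL is running by querying the pg_locks view; a sketch for the employees table:
SELECT locktype, relation::regclass AS rel, mode, granted
FROM pg_locks
WHERE relation = 'employees'::regclass;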
It is crucial to balance the advantages of vacuuming against its downsides and to meticulously plan and schedule vacuum operations to mitigate potential system impact.
Typically, if a workload sees relatively few updates but thousands or millions of inserts daily, it can be preferable to disable autovacuum on the affected tables rather than let it kick in mid-bulk-operation and slow everything down, and to run a scheduled `VACUUM ANALYZE` via a cron job instead, as Zerodha, the largest stock broker in India, does. For further insights, refer to their tech team's article.
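A minimal sketch of that setup, assuming a hypothetical insert-heavy table named events:
ALTER TABLE events SET (autovacuum_enabled = off);
-- Then schedule VACUUM ANALYZE; from cron (or a similar scheduler) during off-peak hours.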
To conclude, while there is much more to explore regarding MVCC, bloat, and vacuum in PostgreSQL, we aimed to keep this blog straightforward and digestible.
For more content like this, please follow me on Medium and subscribe to my newsletter. You can also find me on Twitter, Instagram, and LinkedIn.