This post will first take you through Change Data Capture (CDC) and Postgres as standalone entities before explaining the various forms of Postgres CDC along with their pros and cons.
Change Data Capture (CDC) Explained
Change Data Capture (CDC) is a software pattern that tracks changes made to a database so that some action can be taken later based on those changes. With Postgres CDC, only those changes are extracted and loaded into a data storage repository or a data warehouse, which can reduce data warehousing costs compared with reloading entire tables.
The benefit of CDC is that it considers only the changes made to a database, also known as incremental data. Hence, there is no need to repeatedly re-read entire tables to find out what has changed.
Postgres Open-Source Relational Database Explained
PostgreSQL, generally called Postgres, is among the most popular and widely used open-source relational databases in the world. It is a very common platform for several activities ranging from data warehousing and analytics to undertaking Online Transaction Processing (OLTP) workloads. Businesses typically use relational databases to handle transactional activities and then carry out combined analytics and reporting through separate data warehouses.
While using PostgreSQL, it is also important to make sure that a data storage repository such as a data warehouse stays up to date with the transactional database. Otherwise, time-bound reporting will be badly affected, since periodic batch loads, whether hourly or daily, leave the two databases out of sync between runs. Here, Change Data Capture for Postgres keeps the databases continuously in sync.
There are several benefits that Postgres CDC brings to the table.
The first and most important is that Postgres CDC captures changes in a database in real time. The result is that downstream systems and data warehouses stay in sync with PostgreSQL. Additionally, Postgres CDC processes only the changes made to a database and hence, the workload on PostgreSQL is substantially lower than with repeated full extracts. Finally, Postgres Change Data Capture facilitates use cases that require access to changes made in PostgreSQL. This is done without the need to modify the application code.
Forms of Postgres CDC – Their Advantages and Drawbacks Explained
Let us now make an in-depth analysis of the various forms of Postgres CDC and understand the benefits and drawbacks of each of them.
Trigger-based Postgres CDC
In this form of Postgres CDC, triggers on the table of interest fire on every Insert, Update, or Delete. A changelog is built by inserting a row into a change table for each change made. With the widely used audit trigger, for example, all changes are stored in the audit.logged_actions table. Because the changes are stored only inside Postgres, users must repeatedly query the change table to copy them to another database system or data warehouse.
This trigger-based mode of Postgres CDC is supported by PostgreSQL 9.1 and later.
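The changelog idea can be sketched in a few lines. The example below is only an illustration under stated assumptions: it uses SQLite from Python's standard library as a stand-in so that it runs without a server, and the table names `customers` and `customers_changes` are hypothetical. In Postgres the same pattern would be written as a PL/pgSQL trigger function attached to the table, with different syntax.

```python
import sqlite3

# Assumption: SQLite stands in for Postgres so the sketch runs anywhere.
# The trigger syntax below is SQLite's, not Postgres's.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);

-- Hypothetical change table: one row per captured change.
CREATE TABLE customers_changes (
    change_id  INTEGER PRIMARY KEY AUTOINCREMENT,
    operation  TEXT NOT NULL,
    row_id     INTEGER NOT NULL,
    old_name   TEXT,
    new_name   TEXT,
    changed_at TEXT DEFAULT CURRENT_TIMESTAMP
);

CREATE TRIGGER customers_ins AFTER INSERT ON customers BEGIN
    INSERT INTO customers_changes (operation, row_id, new_name)
    VALUES ('INSERT', NEW.id, NEW.name);
END;

CREATE TRIGGER customers_upd AFTER UPDATE ON customers BEGIN
    INSERT INTO customers_changes (operation, row_id, old_name, new_name)
    VALUES ('UPDATE', NEW.id, OLD.name, NEW.name);
END;

CREATE TRIGGER customers_del AFTER DELETE ON customers BEGIN
    INSERT INTO customers_changes (operation, row_id, old_name)
    VALUES ('DELETE', OLD.id, OLD.name);
END;
""")

# Ordinary application writes; the triggers record each one as a side effect.
conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Ada')")
conn.execute("UPDATE customers SET name = 'Ada L.' WHERE id = 1")
conn.execute("DELETE FROM customers WHERE id = 1")

# A downstream pipeline would read the change table in change_id order.
changelog = conn.execute(
    "SELECT operation, row_id, old_name, new_name "
    "FROM customers_changes ORDER BY change_id"
).fetchall()
for entry in changelog:
    print(entry)
```

Note that every write to `customers` now performs a second write to the change table inside the same statement, which is where the extra execution time of this approach comes from.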
Advantages of Trigger-based Postgres CDC
- Postgres CDC immediately captures all change events in real-time.
- The trigger function can automatically add useful metadata to each change event, such as the operation type and the time of the change.
Drawbacks of Trigger-based Postgres CDC
- Triggers increase the execution time of the original statement, thereby adversely affecting the performance of PostgreSQL.
- A separate data pipeline must be built to read the change table that the trigger function fills and to move those changes downstream.
- Triggers must be created in the monitored database, so this approach requires changes to the Postgres database itself.
Query-based Postgres CDC
In this form of Postgres CDC, the monitored table must have a timestamp column that records when each row was last modified, and Postgres is queried repeatedly using that column. Each query asks for the rows whose timestamp is later than the time of the previous query. Hence, users get all the changes made since the last time that Postgres was queried. This form of Postgres CDC can capture only Insert and Update changes, not Delete, because a deleted row no longer exists to be returned by the query.
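A minimal polling sketch makes both the mechanism and the Delete limitation visible. As an assumption for the sake of a runnable example, it uses SQLite from Python's standard library in place of Postgres, integer ticks in place of real timestamps, and a hypothetical `orders` table with an `updated_at` column; `poll_changes` is an illustrative helper, not part of any library.

```python
import sqlite3

# Assumption: SQLite stands in for Postgres; integers stand in for timestamps.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, updated_at INTEGER)"
)


def poll_changes(connection, last_seen):
    """Return rows modified after last_seen, plus the new high-water mark."""
    rows = connection.execute(
        "SELECT id, status, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at, id",
        (last_seen,),
    ).fetchall()
    # If nothing changed, keep the previous mark so the next poll starts there.
    new_mark = max((row[2] for row in rows), default=last_seen)
    return rows, new_mark


# Two inserts, then the first poll picks up both rows.
conn.execute("INSERT INTO orders VALUES (1, 'new', 100)")
conn.execute("INSERT INTO orders VALUES (2, 'new', 101)")
first_batch, mark = poll_changes(conn, 0)

# One update and one delete, then the second poll.
conn.execute("UPDATE orders SET status = 'shipped', updated_at = 102 WHERE id = 1")
conn.execute("DELETE FROM orders WHERE id = 2")
second_batch, mark = poll_changes(conn, mark)

# A third poll with no intervening changes returns nothing.
third_batch, mark = poll_changes(conn, mark)

print(first_batch)
print(second_batch)
print(third_batch)
```

The second poll returns only the updated row for order 1. The deletion of order 2 leaves no trace, and the third poll does work on the database even though nothing has changed.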
Advantages of Query-based Postgres CDC
- Query-based CDC can be deployed without making changes to PostgreSQL, provided the schema already has a timestamp column showing when each row was last modified.
Drawbacks of Query-based Postgres CDC
- PostgreSQL faces an additional workload as this form of Postgres CDC extracts data through the query layer.
- It requires continuous polling of the monitored table, leading to wasted resources if no change has been made to the data. Moreover, if the table does not already have one, an additional column must be added to track when each row was last changed.
- Delete change cannot be captured by query-based Postgres CDC.
Logical Replication-based Postgres CDC
The logical replication-based form of Postgres CDC was introduced with version 9.4 of PostgreSQL. It is an efficient model that can replicate data to other PostgreSQL instances running on separate systems. It works by reading the write-ahead log (WAL), which records all data changes including Insert, Update, and Delete.
Logical replication is not enabled by default; it has to be set up by changing the configuration file, in particular by setting wal_level to logical. An output (decoding) plugin then turns the log into a stream of changes. PostgreSQL 10 and later ship with the pgoutput plugin, whereas earlier versions require a plugin to be installed manually. Most managed PostgreSQL services, such as AWS RDS, Google Cloud SQL, and Azure Database, support logical replication.
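What a downstream consumer does with the resulting stream can be sketched without a database at all. The example below simulates only the consumer side in plain Python. The event shape (`op`, `table`, `key`, `row`) is a hypothetical simplification chosen for illustration and is not the output format of any real decoding plugin; `apply_event` is likewise an illustrative helper.

```python
# Hypothetical, simplified change events in commit order.
# Real decoding plugins define their own formats; this shape is an assumption.
events = [
    {"op": "insert", "table": "customers", "key": 1, "row": {"id": 1, "name": "Ada"}},
    {"op": "insert", "table": "customers", "key": 2, "row": {"id": 2, "name": "Grace"}},
    {"op": "update", "table": "customers", "key": 1, "row": {"id": 1, "name": "Ada L."}},
    {"op": "delete", "table": "customers", "key": 2, "row": None},
]


def apply_event(replica, event):
    """Apply one change event to an in-memory replica keyed by table and row key."""
    table = replica.setdefault(event["table"], {})
    if event["op"] in ("insert", "update"):
        # Inserts and updates both carry the full new row in this sketch.
        table[event["key"]] = event["row"]
    elif event["op"] == "delete":
        # Deletes are visible in the stream, unlike with timestamp polling.
        table.pop(event["key"], None)
    else:
        raise ValueError(f"unknown operation: {event['op']}")


replica = {}
for event in events:
    apply_event(replica, event)

print(replica)
```

Because the events arrive in commit order and include deletes, applying them one by one leaves the replica holding the same rows as the source table.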
Advantages of Logical Replication-based Postgres CDC
- This log-based CDC captures data in real time, giving downstream applications continual access to current data from Postgres.
- Insert, Update, and Delete changes are captured by this form of Postgres CDC.
- Capturing changes has little effect on the performance of PostgreSQL, since logical replication-based Postgres CDC reads them from the write-ahead log rather than through the query layer.
Drawback of Logical Replication-based Postgres CDC
- This mode of Postgres CDC is not supported by versions of PostgreSQL earlier than 9.4.
Of the three forms of Postgres CDC, the logical replication-based one is the most technologically advanced.