Change Data Capture
Change data capture (CDC) is the process of capturing changes made at the data source and applying them throughout the enterprise. CDC minimizes the resources required for ETL ( extract, transform, load ) processes because it only deals with data changes. The goal of CDC is to ensure data synchronicity.
There are four methods to handle Change Data Capture(CDC):
1. Timestamp-based CDC
Timestamp column in the source table is used to capture the date and time of the last change, whether it’s a new entry or an update to an existing row
* Simple Method
* Cannot identify deleted records
2. Trigger-based CDC
Database triggers are added to the source tables so all changes (inserts, updates, deletes) are replicated in a second set of tables specifically used for the CDC process. Only the “changed” records that are captured in the CDC tables are used to update the data warehouse during the ETL process
* Complex
* Can identify new, updated, and/or deleted records
* Suitable for low-latency (near real-time) data warehouse updates
3. Snapshot Merge
This is a simple technique where regularly scheduled table exports (“snapshots”) or staging tables are used to identify changed records. By calculating the difference between the current and previous snapshots, all new, updated or deleted records can be captured and loaded into the data warehouse
* Can identify new, updated, and/or deleted records
* Has database design and performance implications
* Not practical for very large tables unless using efficient snapshot systems
* Not suitable for low-latency data warehouse updates
4. Log Scraping
Database applications can be configured to track all activity in log files. For CDC purposes, those application log files can be scanned and parsed (“scraped”) to identify when changes occur and capture those records
* Low impact on database
* No impact on source application
* Can deliver “real-time” (very low-latency) updates to data warehouse
* Complex log formats make parsing difficult
* Log formats are different between database applications and often change
There are four methods to handle Change Data Capture(CDC):
1. Timestamp-based CDC
Timestamp column in the source table is used to capture the date and time of the last change, whether it’s a new entry or an update to an existing row
* Simple Method
* Cannot identify deleted records
2. Trigger-based CDC
Database triggers are added to the source tables so all changes (inserts, updates, deletes) are replicated in a second set of tables specifically used for the CDC process. Only the “changed” records that are captured in the CDC tables are used to update the data warehouse during the ETL process
* Complex
* Can identify new, updated, and/or deleted records
* Suitable for low-latency (near real-time) data warehouse updates
3. Snapshot Merge
This is a simple technique where regularly scheduled table exports (“snapshots”) or staging tables are used to identify changed records. By calculating the difference between the current and previous snapshots, all new, updated or deleted records can be captured and loaded into the data warehouse
* Can identify new, updated, and/or deleted records
* Has database design and performance implications
* Not practical for very large tables unless using efficient snapshot systems
* Not suitable for low-latency data warehouse updates
4. Log Scraping
Database applications can be configured to track all activity in log files. For CDC purposes, those application log files can be scanned and parsed (“scraped”) to identify when changes occur and capture those records
* Low impact on database
* No impact on source application
* Can deliver “real-time” (very low-latency) updates to data warehouse
* Complex log formats make parsing difficult
* Log formats are different between database applications and often change
Comments
Post a Comment