Achieving High Availability and Disaster Recovery with IBM DB2

IT systems do fail. It’s not a question of if, but rather when. Being prepared for such failures is necessary in today’s enterprise landscape, where data is critical to operations. IBM® DB2® for Linux, UNIX and Windows provides multiple ways to prevent interruptions to data availability. This article looks at the High Availability Disaster Recovery (HADR) feature of DB2, which includes new capabilities in the latest DB2 release, version 10.1.

Building on a solid foundation

All of the most recent DB2 releases have included the HADR feature. It’s a proven technology that many businesses use to achieve a higher level of database availability. The principle behind HADR is the synchronization of data between a primary (hot) server and a standby (cold) server. With HADR, a DBA can switch manually to the standby server in case of failure, or use cluster software (such as IBM Tivoli® System Automation or a different failover cluster product) to automatically detect the failure and switch connections to the standby server. In DB2 9.7.1, IBM introduced the ability to serve read operations from the standby server, increasing the utilization of the cluster. This capability lets workloads such as reports run against the now-warm standby server, sparing the primary database server from that load.

DB2 10.1 now supports up to three standby servers, which not only facilitates high availability within the same data center but also improves the capabilities for disaster recovery configurations across multiple sites.

Instead of using HADR only to deliver high availability and another solution for disaster recovery, you can use it to handle both, simplifying the software stack. DBAs can deploy the principal standby server in the same location as the primary database server for quick failovers and local network transfer rates. Two additional standby servers (called auxiliary standbys) can be placed in remote locations to protect against a larger disaster affecting the entire site. In case of a sitewide outage that affects both the primary and the principal standby servers, the DBA can issue a takeover command on either of the auxiliary servers, which then becomes the new primary server; the other auxiliary server can then serve as its principal standby. All standby servers, whether principal or auxiliary, support read operations.
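As a rough illustration of the topology described above, the following Python sketch assembles the DB2 commands a DBA might run. It assumes the hadr_target_list database configuration parameter takes a pipe-separated list of host:port pairs with the principal standby first; the database name, host names, and ports are hypothetical placeholders, and the helper functions are not part of any DB2 tooling.

```python
from typing import List, Tuple


def target_list(standbys: List[Tuple[str, int]]) -> str:
    """Build a hadr_target_list value; the first entry is the principal standby."""
    if not 1 <= len(standbys) <= 3:
        raise ValueError("expected one to three standby servers")
    return "|".join(f"{host}:{port}" for host, port in standbys)


def update_target_list_command(db: str, standbys: List[Tuple[str, int]]) -> str:
    """Return the CLP command that sets hadr_target_list (quoted because of '|')."""
    return f'db2 "UPDATE DB CFG FOR {db} USING HADR_TARGET_LIST {target_list(standbys)}"'


def takeover_command(db: str, force: bool = False) -> str:
    """Return the takeover command to run on the standby that should become primary."""
    cmd = f"db2 TAKEOVER HADR ON DB {db}"
    return cmd + " BY FORCE" if force else cmd


if __name__ == "__main__":
    # Hypothetical layout: one local principal standby, two remote auxiliaries.
    standbys = [("local-standby", 50001), ("remote-a", 50001), ("remote-b", 50001)]
    print(update_target_list_command("HADRDB", standbys))
    # After a sitewide outage the old primary is unreachable, so the takeover
    # on an auxiliary standby would typically be forced.
    print(takeover_command("HADRDB", force=True))
```

Generating the commands as strings keeps the sketch runnable without a DB2 installation; in practice the same commands would be entered directly at the DB2 command line.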

Providing protection against application errors

Occasionally, applications produce errors that affect data. This problem is compounded if those errors are replicated to standby databases. To avoid replicating errors immediately, HADR in DB2 10.1 introduces the delayed replay feature, which helps shield data from application errors. By setting the hadr_replay_delay configuration parameter on the standby server, the DBA can delay the replay of changes on that standby (for example, by 24 hours), providing enough time to discover any problems and recover the data as it was at an earlier point in time.

The delayed replay compares time stamps in the log stream, which is generated on the primary server, with the current time on the standby server. Therefore, the time on both the primary and the standby servers must always remain synchronized.

A transaction commit is replayed on the standby server only when the following condition holds:

(current time on the standby - value of the hadr_replay_delay configuration parameter) >= time stamp of the committed log record
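The condition above can be expressed as a small predicate. This is only a sketch of the comparison, not DB2's implementation; the function name is hypothetical, and it assumes hadr_replay_delay is given in seconds.

```python
from datetime import datetime, timedelta


def commit_is_replayable(standby_now: datetime,
                         commit_timestamp: datetime,
                         replay_delay_seconds: int) -> bool:
    """True when (standby time - delay) >= time stamp of the committed log record."""
    return standby_now - timedelta(seconds=replay_delay_seconds) >= commit_timestamp


if __name__ == "__main__":
    delay = 24 * 60 * 60  # a 24-hour delay expressed in seconds
    now = datetime(2012, 6, 2, 12, 0, 0)
    old_commit = datetime(2012, 6, 1, 11, 0, 0)     # 25 hours before "now"
    recent_commit = datetime(2012, 6, 2, 9, 0, 0)   # 3 hours before "now"
    print(commit_is_replayable(now, old_commit, delay))
    print(commit_is_replayable(now, recent_commit, delay))
```

Because the commit time stamp comes from the primary and the current time from the standby, any clock drift between the two servers shifts the effective delay by the same amount.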

It is a good idea to set the hadr_replay_delay parameter to a large-enough value that lets you detect any errant transactions on the primary server and react in time. Because DB2 10.1 allows you to have multiple standby servers, you can now keep one standby server current with the primary server for high-availability purposes and one standby using the delayed replay feature to protect against data errors.

Preventing spikes in throughput with log spooling

Depending on the synchronization mode configured for the cluster, situations can occur in which the primary server has to wait for the standby server to catch up before processing on the primary server can continue. HADR log spooling is a new feature in DB2 10.1 that lets the DBA specify additional space where logs can be spooled on the standby server. This capability helps avoid back-pressure issues on the primary server caused by sudden spikes in logging activity on the standby server.

You enable log spooling by using the hadr_spool_limit database configuration parameter, which sets an upper limit on how much data is written—or “spooled”—to disk if the log receive buffer fills up.

The log replay feature on the standby server can later read the log data from disk, which allows transactions on the primary server to progress without having to wait for the log replay on the standby server.
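To make the back-pressure effect concrete, here is a deliberately simplified toy model, not a description of DB2 internals. It assumes log data arrives in pages per time step, the standby replays at a fixed rate, and the standby can hold at most the receive buffer plus the spool limit; all names and numbers are hypothetical.

```python
from typing import Iterable


def ticks_primary_waits(arrivals: Iterable[int],
                        replay_rate: int,
                        buffer_pages: int,
                        spool_limit_pages: int = 0) -> int:
    """Count the time steps in which the primary cannot ship all of its log pages."""
    capacity = buffer_pages + spool_limit_pages
    backlog = 0   # pages received by the standby but not yet replayed
    pending = 0   # pages the primary still has to ship
    waits = 0
    for incoming in arrivals:
        # The standby replays what it can, freeing space for new log data.
        backlog = max(0, backlog - replay_rate)
        pending += incoming
        # The standby accepts only as much as its free space allows.
        accepted = min(pending, capacity - backlog)
        backlog += accepted
        pending -= accepted
        if pending > 0:
            waits += 1  # the primary is held back during this step
    return waits


if __name__ == "__main__":
    burst = [10, 10, 100, 100, 10, 10]  # a spike in logging activity
    print(ticks_primary_waits(burst, replay_rate=20, buffer_pages=50))
    print(ticks_primary_waits(burst, replay_rate=20, buffer_pages=50,
                              spool_limit_pages=1000))
```

In this model the spool only absorbs the burst; the standby still needs time to replay the backlog, which is the trade-off to keep in mind when sizing hadr_spool_limit.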

Log spooling does not compromise the high availability and disaster recovery protection provided by the DB2 HADR feature. The data shipped from the primary database is still replicated to the standby using the specified synchronization mode; it just takes time for the standby server to replay the spooled log data and apply it to its table spaces.

Building in high availability and business continuity

The HADR feature in DB2 has come a long way. From a simple replication solution, it has developed into a full-fledged high availability and disaster recovery solution that spans multiple servers and even remote sites, protecting data against both local failures and sitewide disasters.

How will you use the new HADR features to keep data available and reduce the impact of disasters? Let us know in the comments.

This article was also published in IBM data management Magazine
