Oracle RAC Continuous Availability with SRDF/Metro – Part I (the why)

I’ve divided this blog into two parts: in the first I go over the reasons for using Oracle Extended RAC and why it makes sense to deploy it with SRDF/Metro, and in the second I discuss the setup steps. I’m including a demo link showing SRDF/Metro and Oracle RAC behavior when the Biased (“winner”) side is “lost”, and how SRDF/Metro uses the Witness to resume operations on the correct side.

So, let’s get started…

Background — The problem with “failover” solutions

Every mission-critical database requires some notion of High Availability (HA) and Disaster Recovery (DR). These come in all shapes and forms, and for Oracle databases we’re looking at two main solutions – HA using Oracle RAC (which provides automatic failover and load balancing), and DR using remote replications, driven either by the database, such as Oracle Data Guard, or by the storage, such as Dell EMC SRDF.

RAC allows full read and write access to data across all nodes (database instances). If a node fails, the cluster goes through a short period of reconfiguration to remove the failed node and to recover its transactions so they are correctly reflected in the data files. Once that’s done, the workload proceeds. RAC also allows user sessions from the failed node to automatically reconnect to surviving nodes.
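As a side illustration of that reconnect behavior from the application’s point of view, here is a minimal Python sketch using the python-oracledb driver. The host name, service name, and credentials are hypothetical placeholders, and in practice features such as FAN or Application Continuity can make the reconnect transparent; this simple retry loop only shows the idea.

```python
# Minimal sketch: re-establish a session against a RAC service while the
# cluster reconfigures after a node failure. All names below are placeholders.
import time
import oracledb  # python-oracledb driver

DSN = "rac-scan.example.com:1521/oltp_svc"  # SCAN address resolves to surviving nodes

def get_connection(retries=10, delay=3):
    """Try to connect, retrying while RAC removes the failed node."""
    for attempt in range(1, retries + 1):
        try:
            return oracledb.connect(user="app_user", password="app_pw", dsn=DSN)
        except oracledb.DatabaseError as exc:
            print(f"attempt {attempt} failed ({exc}); retrying in {delay}s")
            time.sleep(delay)
    raise RuntimeError("could not reach any surviving RAC instance")

conn = get_connection()
```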

In other words, RAC allows workloads to continue even if some of the cluster nodes become unavailable. However, RAC relies on access to local shared storage. If access to that storage goes away, all the nodes go down and the database has to fail over to a DR site.

DR solutions protect against a failure of the source database by keeping a remote database copy up to date using storage or database replications. If a failure occurs at the source database, operations fail over to the replication’s target database. In other words, DR solutions protect from unplanned localized database or storage unavailability, but they have to deal with a failover, or migration, of database activity.

An unplanned failover of the database is not an event anyone desires. Users’ transactions are dropped, a short period of downtime is involved, and often the applications need to fail over as well (to remain in close proximity to the database).

Oracle Extended RAC is a way to increase database resiliency by stretching the cluster across (typically) two locations, while still benefiting from the cluster’s ability to recover from failed nodes without incurring an actual database failover.

As a side note, keep in mind that Oracle Extended RAC is an improved HA solution, but not a replacement for a longer-distance DR strategy, which is why it is still recommended to use asynchronous DR replications such as SRDF/A when deploying Oracle Extended RAC.

Oracle Extended RAC and Continuous Availability

In an Oracle Extended (or “stretched”) RAC deployment, instead of relying on a single, local shared storage system for the database, both source and target storage systems are used simultaneously for both reads and writes. Some cluster nodes are co-located next to one storage system, and others next to the other. All database writes are mirrored to both storage systems, and reads are served from the storage system closest to the node requesting them.

Oracle Extended RAC can extend the distance between storage systems with their co-located cluster nodes to the next aisle, room, building, or data center, as long as the latencies incurred by the distance allow. All cluster nodes can be active in both locations, with full read-write access to data, fully utilizing all servers and storage systems.

In case of unplanned downtime in one location, the other location continues operations (hence the term “continuous availability”). In addition, user sessions from failed nodes can automatically reconnect to surviving nodes, just like in normal RAC behavior. While users may experience a short freeze, their sessions are not disconnected.

Keep in mind that Oracle Extended RAC is not new and has been offered for many years. The main reasons we don’t see a higher deployment rate are performance concerns, server resource overhead, and the complexity of the solution.

Let’s review some of the reasons why legacy Oracle Extended RAC solutions haven’t been very popular:

  • They are typically based on host-based mirroring, where all database writes are mirrored by logical volume manager (LVM) or ASM software on the host.
  • Host resources are consumed by the data mirroring. For example, all write I/Os are doubled due to the host-based mirroring, and host CPU cycles are consumed by the increased write activity.
  • The number of devices visible to each cluster node doubles, to include the target storage system devices, so the host-based mirroring software can write to them as well.
  • Management of the host-based mirroring is not necessarily easy, as each cluster node has its own agent responsible for mirroring and health monitoring.
  • In most host-based Oracle Extended RAC solutions there is a requirement for an “arbitrator” site to determine the surviving site in case of a disaster. That adds complexity to the host-based mirroring software setup, and adds to the responsibilities of the DBAs, as they need to carefully place GI quorum files in each of the sites, including the arbitrator.
  • The added latency as distance increases is just the physics of mirroring I/Os. For some databases any added latency is not an option; for others, downtime isn’t.

Therefore, while Oracle Extended RAC improves HA, legacy implementations had many challenges and, as such, are not often used.

Enter SRDF/Metro…

SRDF/Metro approach to Oracle Extended RAC

In the picture below we see a typical SRDF/Metro deployment example, with RAC nodes co-located next to each of the replicating storage systems. ASM disk groups are mounted on all cluster nodes (using ASM external redundancy, i.e. no ASM mirroring), and the database instances are active on all cluster nodes, performing read and write operations.

SRDF/Metro brings a way to deploy Oracle Extended RAC that overcomes many of the challenges mentioned above. It does so by leveraging a few factors:

SRDF/Metro creates true active/active storage replications

  1. SRDF/Metro extends the capabilities of SRDF/Synchronous (SRDF/S). Typically, in an SRDF/S replication between two storage systems, only the replicated source devices (referred to as R1 devices) provide read/write data access. The replicated target devices (referred to as R2 devices) are only accessible for read/write after a failover. SRDF/Metro changes this paradigm. While still relying on synchronous replications, it allows both R1 and R2 devices to have full read/write data access while synchronized. To do so, it replicates writes in both directions, while reads are served from the local storage system.
  2. Each PowerMax storage device has two SCSI personalities: internal (its true WWN and geometry), and external (which usually matches the internal, but can be ‘spoofed’ to match another device). When SRDF/Metro paired devices are synchronized, the R2 devices’ external SCSI personality is set to match the R1 devices’. As a result, to the host, each synchronized R1/R2 device pair looks just like multiple paths to a single storage device.
  3. SRDF/Metro fully protects from the possibility of a “split-brain” situation (explained in more detail later), always preserving data consistency if replications drop and the cluster gets partitioned.

Since any of the local and remote SRDF/Metro paired devices can accept reads and writes, maintain data consistency, and look identical to the host, we can use SRDF/Metro for an Oracle Extended RAC deployment.
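One simple way to see this identity federation from the host is to look at the multipath view of the devices. The following Python sketch assumes a Linux host running dm-multipath with cross-links configured; it just parses the output of multipath -ll and counts how many paths each WWID exposes. With cross-links, paths from both arrays appear under a single WWID.

```python
# Sketch: count paths per multipath WWID. With SRDF/Metro cross-links, paths
# to both arrays show up under one WWID, because the R2 devices present the
# R1 external SCSI personality. The parsing is simplified and may need
# adjusting for your multipath configuration.
import re
import subprocess

out = subprocess.run(["multipath", "-ll"], capture_output=True, text=True).stdout

paths_per_wwid = {}
current = None
for line in out.splitlines():
    m = re.match(r"^\S+\s+\((\w+)\)", line)      # map line, e.g. "mpatha (36000097...)"
    if m:
        current = m.group(1)
        paths_per_wwid[current] = 0
    elif current and re.search(r"\bsd[a-z]+\b", line):   # a path line, e.g. "... sdb ..."
        paths_per_wwid[current] += 1

for wwid, n in paths_per_wwid.items():
    print(f"{wwid}: {n} paths (local and cross-link paths share one identity)")
```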

SRDF/Metro simplifies Oracle Extended RAC deployment

That is because SRDF/Metro is based on synchronous storage replications. As a result:

  • It doesn’t compete with the database for host I/Os and CPU resources.
  • It can leverage “cross-links” in the deployment; however, cross-links are not required. Cross-links mean that each cluster node has visibility to both storage systems. That allows a server that lost connectivity to its nearest storage system to continue processing transactions using the other storage system. To keep the configuration simple, cross-links can be omitted, in which case the nodes only need visibility to the closest storage system. However, even when cross-links are used, the deployment is simpler than pure host-based mirroring:
    • Each host considers the remote devices as additional paths rather than additional devices (they both have the same external SCSI personality). There are a few implications. First, there is no need for a new ‘agent’ or mirroring software on the database host; the multipathing software manages these paths just like any other paths. Second, the multipathing software can be set to not use the cross-linked paths unless all the main paths to the nearest storage system have failed (see the sketch after this list). That means that host I/Os will not double, and I/O latencies will remain minimal, as I/Os are directed to the local storage while it is available.
  • The overall management and deployment is also simplified: we only need to monitor the replication health across the two storage systems, and there is no need for agents on every database server. Also, since the R1 and R2 devices appear identical to the host, GI is not even aware that the storage is extended or stretched. As a result, during GI setup we don’t need to worry about where the quorum files are placed.
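To illustrate the path-preference policy mentioned above, here is a minimal Python sketch. It is a conceptual model only, not the multipathing software itself, and the array serial numbers are made up.

```python
# Conceptual model: keep cross-link paths passive while local paths are healthy.
from dataclasses import dataclass

LOCAL_ARRAY = "000197600238"   # hypothetical serial of the nearby PowerMax

@dataclass
class Path:
    name: str       # OS device name, e.g. "sdb"
    array: str      # serial of the array this path leads to
    alive: bool

def pick_active_paths(paths):
    """Prefer paths to the local array; use cross-links only when all local paths are down."""
    local = [p for p in paths if p.array == LOCAL_ARRAY and p.alive]
    if local:
        return local
    return [p for p in paths if p.alive]   # fall back to the cross-link paths

paths = [
    Path("sdb", LOCAL_ARRAY, alive=True),
    Path("sdc", LOCAL_ARRAY, alive=True),
    Path("sdd", "000197600239", alive=True),   # cross-link to the remote array
]
print([p.name for p in pick_active_paths(paths)])   # -> ['sdb', 'sdc']
```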

The interaction between SRDF/Metro and Oracle Grid Infrastructure

It is important to understand that SRDF/Metro doesn’t require an integration with GI that would complicate RAC and GI configuration and management. By simply stopping I/Os on one storage system and allowing them to resume on the other, RAC will reconfigure on its own accordingly: nodes that can perform I/Os to the quorum files remain up and running, and nodes that can’t are evicted from the cluster. Keep it simple.

For this to work, it is essential that SRDF/Metro resumes I/Os faster than RAC’s long disk timeout, to avoid a race condition in which, by the time the storage resumes I/Os, RAC has already determined that its quorum files can’t be reached anywhere across the cluster and brings all nodes down. This doesn’t happen, because SRDF/Metro resumes I/Os within a few seconds of a replication failure, whereas RAC’s long disk timeout defaults to 200 seconds.
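As a quick sanity check, the GI long disk timeout can be read with crsctl. Below is a small Python sketch; it assumes it runs on a cluster node with crsctl in the PATH and that the command output contains the numeric value (the exact message text can vary between GI versions), and the "few seconds" figure is a conservative assumption rather than a published number.

```python
# Sketch: verify the CSS long disk timeout leaves ample margin over the few
# seconds SRDF/Metro needs to resume I/Os. METRO_RESUME_SECONDS is a
# conservative assumption, not a published figure.
import re
import subprocess

METRO_RESUME_SECONDS = 5

out = subprocess.run(["crsctl", "get", "css", "disktimeout"],
                     capture_output=True, text=True).stdout
match = re.search(r"disktimeout\D*(\d+)", out)
disktimeout = int(match.group(1)) if match else 200   # 200s is the documented default

if disktimeout > METRO_RESUME_SECONDS * 10:
    print(f"OK: disktimeout={disktimeout}s leaves plenty of margin")
else:
    print(f"WARNING: disktimeout={disktimeout}s is unusually low for this setup")
```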

A few more details:

When SRDF/Metro replications stop unexpectedly, both R1 and R2 devices freeze all I/Os temporarily, while SRDF/Metro determines which side of the replication is allowed to resume I/Os. All this takes just a few seconds, as we only have to deal with the two storage systems (and the Witness, as described later), unlike host-based solutions that have to consider the state of each and every cluster node.

The SRDF/Metro devices on the side that resumes I/Os become (or remain) R1 devices, take an RW state (available for host read and write I/O operations), and maintain the same external SCSI personality as during the active/active replications. The SRDF devices on the side that stopped servicing I/Os become (or remain) R2 devices, take a WD (write-disabled) state, and no longer present the external SCSI personality they had during the replications: just another safety mechanism.
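The following Python sketch is a conceptual model of that behavior (not the actual PowerMaxOS logic): the winning side resumes I/Os as R1 in an RW state and keeps the federated identity, while the losing side is write-disabled and reverts to its own WWN.

```python
# Conceptual model only: state of the two Metro sides after replication drops
# and a winner is chosen.
from dataclasses import dataclass

@dataclass
class MetroSide:
    native_wwn: str
    external_wwn: str   # federated identity while active/active
    role: str           # "R1" or "R2"
    state: str          # "RW" or "WD"

def resolve_failure(winner: MetroSide, loser: MetroSide):
    # Winner resumes I/Os as R1, read-write, keeping the federated identity.
    winner.role, winner.state = "R1", "RW"
    # Loser is write-disabled and drops the federated identity (reverts to its
    # native WWN) as an extra safety mechanism.
    loser.role, loser.state = "R2", "WD"
    loser.external_wwn = loser.native_wwn

side_a = MetroSide("wwn-A", "wwn-A", "R1", "RW")
side_b = MetroSide("wwn-B", "wwn-A", "R2", "RW")   # spoofed to match side A
resolve_failure(winner=side_b, loser=side_a)
print(side_a)
print(side_b)
```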

SRDF/Metro adds performance to Oracle Extended RAC deployment

Of course, when we need to maintain consistent data across a distance, performance is one of the main concerns. SRDF/Metro helps maintain high performance for the solution in a few ways:

  • It utilizes the storage system closest to the Oracle cluster node for all I/Os. Even if cross-links are used, they are in passive (standby) mode as long as the paths to the local storage system are working.
  • It optimizes database read I/O latencies, as PowerMax has a large capacity cache. When the required data is in the PowerMax cache, reads are satisfied directly from the cache. Otherwise, the data is fetched from the local storage NVMe flash media, utilizing features such as Optimized Read Miss to make the transfer even faster.
  • It optimizes database write I/O latencies, as writes are sent to the local storage system’s cache (no need to wait for writes to destage to the flash media, as PowerMax cache is considered persistent). From there, they are replicated synchronously to the remote storage system’s cache (again, no need to wait for writes to destage to the flash media) and get acknowledged.
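To put the synchronous replication leg in perspective, here is a back-of-envelope Python sketch of the round-trip propagation delay added per write by distance alone. It ignores switch and protocol overhead and assumes roughly 200 km/ms for light in fiber.

```python
# Back-of-envelope only: extra write latency from synchronous replication
# across distance, counting propagation delay alone.
LIGHT_IN_FIBER_KM_PER_MS = 200.0   # roughly 2/3 the speed of light in vacuum

def added_rtt_ms(distance_km: float) -> float:
    """Round-trip propagation delay between the two storage systems."""
    return 2 * distance_km / LIGHT_IN_FIBER_KM_PER_MS

for km in (1, 10, 50, 100):
    print(f"{km:>4} km -> ~{added_rtt_ms(km):.2f} ms added per replicated write")
# roughly 1 ms of extra round-trip time per 100 km of distance
```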

As a result of the factors described above, SRDF/Metro is well suited for Oracle Extended RAC deployments, and is fast and easy to use.

In the next section I cover SRDF/Metro protection from a “split-brain” situation. Reviewing all the possible failure scenarios and how SRDF/Metro deals with each is too long to cover here. Instead, I’ll cover the basics of how SRDF/Metro is designed to handle them. I encourage you to use the links in the reference section at the end for more details.

SRDF/Metro protection from a “split-brain” situation

A “split-brain” happens when two or more nodes in a cluster can’t communicate, yet each believes it is the only survivor and should keep writing to storage, essentially corrupting the data.

SRDF/Metro takes a deterministic approach that does not allow for a “split-brain” possibility in case of cluster partitioning (i.e. the two storage systems can’t communicate). SRDF/Metro protection against a “split-brain” situation is based on two complementary methods: Bias-rules and Witness-rules.

SRDF/Metro Bias rules

Under Bias-rules, one side of the SRDF/Metro paired devices (usually where the R1 devices are) is pre-determined to “win”, or resume I/Os, if replications stop unexpectedly. The other side (usually where the R2 devices are) is pre-determined to immediately stop servicing I/Os. Note that with PowerMaxOS 5978, SRDF/Metro regularly takes additional factors into account to determine the Bias side, such as which of the storage systems has a synchronized SRDF/A leg (SRDF/A in a Consistent state), available storage directors, etc. That means that the Bias may change dynamically during the replications to reflect these factors.

SRDF/Metro synchronized device-pairs protected by Bias-rules show a state of ‘ActiveBias’.

While using Bias-rules is a bullet-proof method for preventing a “split-brain” (only one side can ever resume I/Os), it is not flexible, as a “disaster” may occur on the Bias side, requiring manual intervention to make the non-Biased side available. That’s where the SRDF/Metro Witness comes in.

SRDF/Metro Witness-rules

SRDF/Metro Witness is an added component that serves as a real-time arbitrator when an active/active SRDF group unexpectedly stops replicating. SRDF/Metro, with the help of a Witness, quickly determines in real time which storage system is best suited to continue servicing I/Os for that SRDF group, based on the situation and failure conditions.

There are a few important factors to know when working with Witness:

  • A Bias is still set for every SRDF group in an SRDF/Metro configuration, even if Witness is used. That is in case the Witness is removed or can’t be reached. However, if a Witness is configured, it overrides the Bias.
  • A Witness can be array-based (a “physical” Witness), in which case another VMAX or PowerMax storage system’s SRDF links are used just for the purpose of health communication (no data is sent across them). A Witness can also be virtual (a vWitness), in which case one or more VMware virtual appliances (vApps) communicate with SRDF/Metro over IP about the cluster health. These are Solutions Enabler or Unisphere for VMAX or PowerMax vApps, which already include a vWitness component.
  • Multiple Witnesses should be configured, but only one Witness can be in effect at a time. SRDF/Metro will only start using a Witness when both storage systems can reach it and agree to use it. If both physical and virtual Witnesses are configured, the physical Witness takes precedence.
  • Since SRDF/Metro replication granularity is at the SRDF group level (a collection of paired R1/R2 devices), the choice of a Witness is also made at that granularity, per SRDF group. Therefore, different SRDF groups in an active/active state may use different Witnesses, or a single Witness may serve multiple SRDF groups.
  • As mentioned earlier, if a Witness is not configured, or communication with the last Witness is lost while SRDF/Metro is already in active/active state, SRDF/Metro reverts to using Bias-rules.

SRDF devices protected by Witness-rules show a state of “ActiveActive”.
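Putting Bias-rules and Witness-rules together, here is a simplified Python sketch of the decision precedence. It is illustrative only; the real SRDF/Metro arbitration weighs additional conditions (such as which side still reaches the Witness, SRDF/A legs, and director health).

```python
# Simplified precedence sketch: Witness-rules when a Witness is in effect,
# otherwise fall back to Bias-rules. Not the actual arbitration algorithm.
def choose_winner(bias_side, witness_in_effect, witness_vote=None):
    """Return which side ('A' or 'B') is allowed to resume I/Os."""
    if witness_in_effect and witness_vote is not None:
        return witness_vote      # ActiveActive: the Witness overrides the Bias
    return bias_side             # ActiveBias: the pre-set Bias side wins

# Example matching the demo below: the Bias side 'A' is isolated, so the
# Witness-assisted decision lets 'B' resume I/Os instead.
print(choose_winner(bias_side="A", witness_in_effect=True, witness_vote="B"))  # -> B
```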

That’s it for now for the theory. Following is a video showing an example of a failure scenario where the R1 Biased storage system, or “pre-determined winner”, loses visibility of both its R2 peer and the Witness (but not the database servers). In a matter of seconds, I/Os are resumed on the R2 system, followed by Oracle RAC automatic reconfiguration and user session reconnects, without any user intervention.

The slide explaining the demo setup:

 

Link to demo video:

To read about how-to deploy SRDF/Metro with Oracle Extended RAC, follow part II of the blog.

References:

SRDF/Metro Overview and Best Practices Technical Note

SRDF/Metro vWitness Configuration Guide

Drew’s blog and white paper (multiple posts about SRDF/Metro deployment under vSphere)
