“RAC Aware Software Development” Discussion

If you ask Oracle employees whether there is such a concept as “RAC Aware Software Development”, they will not accept it, since RAC is positioned as completely application transparent. So this is another chance for me to enjoy the comfort of not being an Oracle employee :)

First of all, you pay extra for the RAC option, so it is of course rational to want to get the most out of it. Also, RAC is only an HA solution for the Oracle database tier (although it provides features that help application HA). HA on the other layers must also be analyzed and solved, so an important question to answer is “How long can you tolerate a service interruption? How much money do you lose when your service is down for n minutes?” The resulting architecture can change dramatically depending on the answer: each HA solution provides a certain level of availability, and each comes at a certain cost. This costing is primarily a business decision rather than a technical one; the technical side is easy once the business decision is made and the budget is accepted.
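To make that business question concrete, here is a minimal sketch of the arithmetic behind it. The availability targets and the cost per minute are made-up illustration values, not figures from any real project, and the function names are my own.

```python
# Back-of-the-envelope downtime budget for a given availability target.
# All input figures used below are hypothetical illustration values.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_pct: float) -> float:
    """Minutes of downtime per year that an availability target permits."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


def yearly_downtime_cost(availability_pct: float, cost_per_minute: float) -> float:
    """Money lost per year if the whole downtime budget is consumed."""
    return allowed_downtime_minutes(availability_pct) * cost_per_minute


if __name__ == "__main__":
    cost_per_minute = 1_000.0  # assumed loss per minute of outage
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        cost = yearly_downtime_cost(target, cost_per_minute)
        print(f"{target}% -> {minutes:.1f} min/year, about {cost:,.0f} lost/year")
```

A 99.9% target, for example, leaves roughly 525.6 minutes of downtime per year. Comparing the yearly loss at each target with the price of the HA solution that reaches it is the business decision I mean above.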

Another important question to answer is “Why do you need RAC?” The most likely answers are High Availability and/or Horizontal Scalability, and these are strong motivations if you really need them, strong enough that you may be willing to take on this major change/migration in your environment. But what if all your migration effort ends in failure, meaning your applications still fail when a node fails and/or cannot scale across multiple nodes? This kind of result may harm your position in your organization if you are the trusted Oracle guy around there, so it is important to communicate the positioning of this Oracle option for your environment to your managers.

It is always best to hit problems or unexpected behaviour during the testing phase, not in production where your ongoing business runs. But the hardest part of testing is creating a workload similar to the production load; your test environment and scenarios must be as close as possible to the real workload. To achieve this, the RAT (Real Application Testing) option (an 11g new feature) provides a way to capture the real workload and replay it on a test system, but depending on your application (for example, lots of distributed Oracle database calls over dblinks) the product may not be mature enough yet.

Oracle believes that a single node environment cannot be more highly available than its RAC alternative, even without any application change, but for me this is a claim to take with care. The internal mechanisms that play a role in a RAC environment are more complex than those of a single instance environment. As a result, if the DBA group responsible for managing this new environment is not experienced, or is afraid of it, this will cause additional downtime (I have experienced this scenario several times with ASM and RMAN). Oracle also believes that most of the problems related to RAC migrations are caused by misconfiguration or wrong sizing of hardware and/or software, but in a recent 11g, 5 node Linux consolidation project we found that even when the sizing is more than enough and all best practices are followed, you may still hit bugs. When these problems popped up our project was already in production. We contacted the Oracle developers immediately and got fixes within one day of the diagnosis (we were not able to catch these bugs during testing since RAT itself also had several bugs). We all need to accept that there can always be bugs or limitations in any kind of software, even if it is the Oracle database :)

Also, and this is not specific to RAC migrations, any big change in a system (changing the OS version, adding new hardware, changing a driver version, etc.) can magnify existing problems. So one of the initial steps of a RAC migration project may be checking the tuning possibilities in the single node environment (for high-load SQLs: SQL query tuning, avoiding parsing overhead, partitioning, indexing, clustering, etc.). All these strategies, together with the additional development and testing they require, will decrease the interconnect traffic dramatically.

In order to increase the HA of your applications (not of the database), you need to develop special connection methods and exception handling, so that you can detect a node failure and, for example, refresh the connection pools at the mid-tier. But the exceptions you would need to handle may not be available to your technology stack if you are not using the Oracle JDBC, OCI or ODP.NET drivers.
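As a rough illustration of the idea in the paragraph above, here is a minimal, driver-independent sketch. Everything in it is an assumption made for the example: `DatabaseError`, `FakePool`, `run_with_failover` and `query` are hypothetical names, not part of any Oracle driver, and the set of ORA- codes treated as "node failure" is only an illustrative selection, not an official or complete list.

```python
import time

# ORA- codes that commonly indicate a lost instance or connection.
# Illustrative assumption only, not an exhaustive or official list.
NODE_FAILURE_CODES = {1033, 1034, 1089, 3113, 3114, 12541}


class DatabaseError(Exception):
    """Hypothetical stand-in for a driver exception exposing an ORA- code."""

    def __init__(self, code, message=""):
        super().__init__(f"ORA-{code:05d}: {message}")
        self.code = code


def run_with_failover(work, pool, max_attempts=3, delay_seconds=0.0):
    """Run work(conn); on a node-failure error refresh the pool and retry."""
    for attempt in range(1, max_attempts + 1):
        conn = pool.acquire()
        try:
            return work(conn)
        except DatabaseError as exc:
            # Re-raise anything that is not a node failure, or the last failure.
            if exc.code not in NODE_FAILURE_CODES or attempt == max_attempts:
                raise
            pool.refresh()  # discard stale connections to the failed node
            time.sleep(delay_seconds)


class FakePool:
    """Hypothetical pool simulating a node that fails a few times."""

    def __init__(self, failures_before_success):
        self.remaining_failures = failures_before_success
        self.refresh_count = 0

    def acquire(self):
        return self  # the pool doubles as its own connection for brevity

    def refresh(self):
        self.refresh_count += 1


def query(conn):
    """Simulated unit of work that fails while the fake node is down."""
    if conn.remaining_failures > 0:
        conn.remaining_failures -= 1
        raise DatabaseError(3113, "end-of-file on communication channel")
    return "ok"


if __name__ == "__main__":
    pool = FakePool(failures_before_success=2)
    print(run_with_failover(query, pool), "after", pool.refresh_count, "refreshes")
```

Blindly re-running work like this is only safe when the work is idempotent or the failed transaction is known to have been rolled back, so in a real application the retry boundary has to be chosen per transaction.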

Conclusions: if you follow the shared best practices (I share my favorite links below), architect your systems according to your needs and do careful load testing, you will minimize the risk of having problems. Having a close relationship with Oracle is also an important success factor here. As a result, from my perspective, in order to get the most out of the RAC option you need to develop RAC aware software! :)

Oracle Real Application Clusters Sample Code

Oracle® Database Oracle Clusterware and Oracle Real Application Clusters Administration and Deployment Guide
Chapter 6 Introduction to Workload Management

OC4J Data Sources: Implicit Connection Caching and Fast Connection Failover by Frances Zhao

Oracle RAC Tuning Tips by Joel Goodman

Understanding RAC Internals by Barb Lundhild