
Database

The year 2010 was a thrilling one for the Database Competence Centre (DCC): would all of the preparations, especially for the recording of controls configuration and logging data for the accelerator itself and for the recording and export of conditions data for the experiments, be up to the challenges of the first “production” run of the LHC and the data-taking and analysis for the experiments? A year later, the answer is a resounding “Yes”. As shown in the figures, the replication rate for conditions data from the ATLAS experiment to the different WLCG Tier-1 sites and the statistics on data volumes logged for the LHC accelerator both reflect the change to production running for the LHC during 2010.

Securing constant access to the physics data is paramount. As a consequence, priority is given to service stability in terms of database requirements. This, together with the developing multi-year schedule for accelerator operations, led to a postponement of the deployment of Oracle Database 11g Release 2 for large-scale production. Hence, the numerous new features offered by this release, reported on previously by the DCC team, do not yet benefit users. Nevertheless, interesting development and evaluation work has continued, notably in the areas of database replication, virtualisation and monitoring.

Database replication

Background

Oracle Streams is an essential tool for the distribution of experiment metadata to the Tier-1 sites as well as within CERN and has been the focus of significant development effort within CERN openlab. However, in CERN’s loosely controlled environment, user changes to the source schema can easily disrupt replication unless the equivalent changes are applied manually to the target(s). Since two new options for replication, Oracle Active Data Guard and Oracle GoldenGate, are now available and look to be more robust against user errors, it was natural for these to be the subject of the DCC team’s study this year. 

Oracle Active Data Guard is an extension in Oracle 11g to the previous Data Guard replication tool. Whilst Data Guard replicas are inactive secondary copies of a database system, waiting to be called into service in the event of a failure on the primary system, Active Data Guard replicas support read access. Active Data Guard could thus potentially address the needs of CERN and the Worldwide LHC Computing Grid (WLCG) for the distribution of metadata, as only read access is required at the Tier-1 sites. However, discussions with Oracle replication experts confirmed the need for a tight coupling of database patching and upgrade schedules at the source and target sites. As this is not achievable in our multi-organisation collaboration, Active Data Guard cannot be exploited in this way. Nevertheless, it remains of definite interest for the replication of data within CERN, for example from database systems at the experiment sites to systems housed in the computer centre.

Fortunately, the other newly available replication technology, GoldenGate, does not require such tight coupling between the source and target sites—indeed, it is even designed to enable replication between different RDBMS (Relational Database Management System) implementations, for example between Oracle and MySQL—and has thus been investigated extensively over the past year.

Oracle GoldenGate evaluation

A dedicated testbed configuration was established to enable tests of two different GoldenGate versions against two database releases. Initial performance tests with the default software configurations established that GoldenGate performs slightly better than Oracle Database 10g Streams but that it cannot reach the throughput offered by the latest version of Streams provided with Oracle Database 11g. However, exploiting GoldenGate’s “BatchSQL” optimisation mode enabled the team to demonstrate throughput similar to that of Streams 11g for generic data. Tests with the latest GoldenGate version revealed significant improvements in data delivery parallelism, resulting in higher overall throughput, as seen in the related figure.
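The general idea behind a batching optimisation such as “BatchSQL” can be pictured as grouping consecutive changes that share the same statement and applying each group as one array operation. The sketch below is a conceptual illustration only and does not reflect GoldenGate internals: it uses Python's standard sqlite3 module, and the function name `apply_batched`, the `conditions` table and its columns are invented for the example.

```python
import sqlite3
from itertools import groupby


def apply_batched(conn, ops):
    """Apply (sql, params) operations in their original order, sending each
    run of consecutive operations with identical SQL text as one executemany
    call. Returns the number of batches executed."""
    batches = 0
    # groupby only merges *consecutive* items, so the change order is preserved.
    for sql, group in groupby(ops, key=lambda op: op[0]):
        conn.executemany(sql, [params for _, params in group])
        batches += 1
    conn.commit()
    return batches


# Hypothetical target table and change stream, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE conditions (id INTEGER PRIMARY KEY, value TEXT)")

insert = "INSERT INTO conditions (id, value) VALUES (?, ?)"
update = "UPDATE conditions SET value = ? WHERE id = ?"
ops = [
    (insert, (1, "a")),
    (insert, (2, "b")),
    (insert, (3, "c")),
    (update, ("B", 2)),
    (update, ("C", 3)),
]

batch_count = apply_batched(conn, ops)
rows = conn.execute("SELECT id, value FROM conditions ORDER BY id").fetchall()
print(batch_count, rows)
```

Five row changes are applied here in two batches. A workload whose statements rarely repeat consecutively would gain much less from this kind of grouping.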

As well as raw replication performance, stability and reliability are essential. Here, a long-term stability test gave remarkable results: the team achieved over four months of continuous data propagation without any negative impact on the source database. Additionally, GoldenGate’s recording of data changes from the master database (in files called trail files) delivers great advantages for data recovery. As trail files are copied to all replicas, these local trail files can be used to re-apply any lost transactions after the loss of a slave (replica) database, restoring full consistency between master and replica. No additional data retransmission from the source database is needed, a very beneficial feature for such a widely distributed environment.
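The recovery property described above can be illustrated with a toy model: a replica restored from an older state catches up by replaying the changes recorded in its local copy of the trail, without contacting the source. This is a minimal sketch under stated assumptions, not GoldenGate's trail format or API; the `Change` and `Replica` classes, the sequence numbers and the `replay` function are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Change:
    seq: int               # position of the change in the trail (assumed monotonic)
    key: str
    value: Optional[str]   # None marks a delete


@dataclass
class Replica:
    rows: Dict[str, str] = field(default_factory=dict)
    last_applied: int = 0  # highest trail sequence number already applied

    def apply(self, change: Change) -> None:
        if change.value is None:
            self.rows.pop(change.key, None)
        else:
            self.rows[change.key] = change.value
        self.last_applied = change.seq


def replay(replica: Replica, trail: List[Change]) -> int:
    """Re-apply, in sequence order, every change the replica has not yet
    seen. Returns the number of changes applied."""
    applied = 0
    for change in sorted(trail, key=lambda c: c.seq):
        if change.seq > replica.last_applied:
            replica.apply(change)
            applied += 1
    return applied


# Local copy of the trail held at the replica site (illustrative data).
trail = [
    Change(1, "run", "A"),
    Change(2, "run", "B"),
    Change(3, "tag", "x"),
    Change(4, "tag", None),
]

# Replica restored from a backup taken after change 1 was applied.
replica = Replica(rows={"run": "A"}, last_applied=1)
recovered = replay(replica, trail)
print(recovered, replica.rows, replica.last_applied)
```

Because `replay` skips everything at or below `last_applied`, running it a second time applies nothing, which is what makes replaying from a local trail safe to repeat.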

Whilst GoldenGate delivers fast and reliable replication of data where the structure of the database schema is static, which represents the majority of industry use cases, replication of schema structure updates is of interest to CERN. Indeed, the LHC experiment supervisory control and data acquisition systems have data management optimisations which update the schema structure in order to improve data collection performance. Unfortunately, the team was unable to achieve the required performance when simulating a real LHC workload, as this use of schema updates prevents the metadata collection applications from leveraging optimisations within the GoldenGate product. Nevertheless, both CERN and Oracle remain hopeful that this may evolve in the future, with Jagdev Dhillon, Oracle GoldenGate’s Senior Director of Product Development, saying:

"We appreciate CERN’s feedback on the GoldenGate product and look forward to continuing our joint efforts to further improve the performance of Oracle GoldenGate with the LHC application workload."

Based on expertise gained during the development of monitoring tools for Streams, the CERN openlab team also gave some feedback on Director, the GoldenGate monitoring interface, which may help to extend the utility of this effective and intuitive interface when used in a complex distributed environment.

Virtualisation

Previous work in openlab has delivered the seamless integration of Oracle’s virtualisation platform, Oracle VM, with CERN’s Extremely Large Fabric management system, ELFms. In particular, the DCC team demonstrated the uniform treatment of virtual images and physical machines by the provisioning systems at the end of 2009.

Building on this initial work, in 2010 the focus switched to delivering a production-quality environment with, notably, support for Gigabit Ethernet, Ethernet bonding and high availability through the provision of machine pools for different purposes (development, test, production) and the exploitation of live migration. Specifically, a command line interface combining Xen and Oracle VM commands was developed to simplify complex operations. As a result, it takes just one command to create a virtual machine instance from scratch, and monitoring ongoing instance behaviour has been greatly simplified. Advances such as these have helped to increase the utilisation and reliability of the Oracle WebLogic Server on JRockit Virtual Edition environment, now approaching eight months of continuous uptime.

The virtualisation team hosted two openlab summer students. The first student explored memory ballooning, a technology that allows a host to run virtual machines whose total allocated memory exceeds the host's physical memory. These tests showed that it is possible to allocate, and stress, up to 30 dual-core virtual machines with 4 GiB of Random Access Memory (RAM) each on an eight-core host with 48 GiB of RAM, whereas no more than 12 such virtual machines could have been allocated without memory ballooning. In cases where the applications running on the virtual machines made little use of their memory allocation, the team was even able to operate with up to 120 virtual machines running on the server.
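The figures quoted above can be checked with a few lines of arithmetic. The sketch below uses only the numbers given in the text (48 GiB host, 4 GiB per virtual machine, eight host cores, 30 dual-core virtual machines); the function names are illustrative, and memory reserved for the hypervisor itself is ignored as a simplifying assumption.

```python
def max_vms_without_overcommit(host_ram_gib: int, vm_ram_gib: int) -> int:
    """Number of VMs that fit if every VM's allocation is backed by real RAM."""
    return host_ram_gib // vm_ram_gib


def overcommit_ratio(n_units: int, per_unit: float, capacity: float) -> float:
    """Total allocated resource divided by the physical capacity of the host."""
    return n_units * per_unit / capacity


HOST_RAM_GIB = 48
VM_RAM_GIB = 4
HOST_CORES = 8
VM_CORES = 2
N_VMS = 30

baseline = max_vms_without_overcommit(HOST_RAM_GIB, VM_RAM_GIB)
memory_ratio = overcommit_ratio(N_VMS, VM_RAM_GIB, HOST_RAM_GIB)
cpu_ratio = overcommit_ratio(N_VMS, VM_CORES, HOST_CORES)

print(baseline)      # 12 VMs fit without ballooning
print(memory_ratio)  # 2.5 times the physical memory is allocated to 30 VMs
print(cpu_ratio)     # 7.5 virtual cores per physical core
```

With 30 virtual machines the host therefore has 120 GiB allocated against 48 GiB of physical RAM, a memory overcommit of 2.5.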

The task of the second summer student was to integrate Oracle WebLogic Server on JRockit Virtual Edition into the management and distribution system. This work led to a presentation of this technology at Oracle OpenWorld and to a press release. The DCC team also had the pleasure of hosting a visit from the Oracle WebLogic/JRockit team at CERN.

Monitoring

The work in the monitoring area during the year aptly reflected CERN openlab’s motto: “You make it, we break it”. Practical, careful testing prior to deployment cannot guarantee a zero rate for incidents in the production phase. Some anomalies were reported after the deployment of the 11g release of Oracle Enterprise Manager, but the close collaboration between the Oracle Enterprise Manager Development team and CERN engineers, enabled through CERN openlab contacts, made a swift solution possible. The investigations successfully identified a small memory leak, the effects of which were vastly magnified at CERN—partly due to the deployment scale and partly due to the use of a wide range of Web browsers for connecting to Enterprise Manager. 

Once the problem was solved, the DCC team could fully exploit the fact that Enterprise Manager 11g runs on, and provides monitoring for, WebLogic, Oracle’s new application server family. In fact, this feature was a strong motivation for deploying the 11g release in the context of the migration from Oracle iAS to WebLogic. Indeed, the team had developed a very effective automatic discovery process for WebLogic domains using the Enterprise Manager 11g command line interface. Furthermore, the monitoring and profiling capabilities of Oracle JRockit Mission Control proved very useful to CERN application developers, helping them resolve memory leaks at the Java Virtual Machine level.

A third CERN openlab summer student joined the DCC team and tested the Corrective Actions feature of Enterprise Manager, which enables the automatic execution of corrective scripts when exceptions are detected. This feature proved to be very useful during the system instability described above: the memory leak invariably caused problems during the night, which delayed the debugging. By exploiting the Corrective Actions feature, the information needed to debug the issue could be made available automatically.
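The pattern at work here, running a script automatically when a monitored metric crosses a threshold so that diagnostic information is captured even when nobody is watching, can be sketched generically. The code below is not the Enterprise Manager API: the `Rule` class, the `evaluate` function, the metric names and the threshold are all hypothetical and serve only to illustrate the idea.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Sample = Dict[str, float]


@dataclass
class Rule:
    metric: str                        # name of the monitored metric
    threshold: float                   # the action fires when the value exceeds this
    action: Callable[[Sample], str]    # corrective or diagnostic script


def evaluate(rules: List[Rule], sample: Sample) -> List[str]:
    """Run the action of every rule whose metric exceeds its threshold and
    return the outputs of the actions that fired."""
    outputs = []
    for rule in rules:
        value = sample.get(rule.metric)
        if value is not None and value > rule.threshold:
            outputs.append(rule.action(sample))
    return outputs


def capture_diagnostics(sample: Sample) -> str:
    # A real script might dump heap statistics or collect logs; this
    # stand-in simply records the metric values seen at trigger time.
    details = ", ".join(f"{name}={value}" for name, value in sorted(sample.items()))
    return "diagnostics: " + details


# Hypothetical rule: capture diagnostics when heap usage exceeds 900 MB.
rules = [Rule("heap_used_mb", 900.0, capture_diagnostics)]

night_sample = {"heap_used_mb": 950.0, "sessions": 40.0}
fired = evaluate(rules, night_sample)
print(fired)
```

Keeping the action separate from the rule makes it straightforward to attach the same diagnostic capture to several metrics.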

This ability to take corrective action automatically is related to an idea which has often been discussed in the openlab context: linking Enterprise Manager to external, non-database events. The DCC team made progress on this topic during the year by enabling the import of information related to the LHC beam energy into Enterprise Manager. This should facilitate the analysis and correlation of LHC database performance and workload with the state of the machine, and thus make it possible to act automatically and appropriately depending on the machine operation phase. More developments in this area are expected in the coming year and as we look ahead to openlab IV.
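As a rough illustration of acting on the machine operation phase, the sketch below derives a coarse phase from a beam energy reading and looks up a response. It assumes the 2010 LHC values of 450 GeV at injection and 3.5 TeV per beam at top energy. The phase names, the tolerance and the policy table are hypothetical and do not describe the DCC team's implementation.

```python
INJECTION_GEV = 450.0     # LHC injection energy per beam
TOP_ENERGY_GEV = 3500.0   # per-beam energy of the 2010 run


def machine_phase(beam_energy_gev: float, tolerance_gev: float = 5.0) -> str:
    """Classify a beam energy reading into a coarse operation phase."""
    if beam_energy_gev < tolerance_gev:
        return "no_beam"
    if abs(beam_energy_gev - INJECTION_GEV) <= tolerance_gev:
        return "injection"
    if beam_energy_gev >= TOP_ENERGY_GEV - tolerance_gev:
        return "top_energy"
    return "ramping"  # energy is between the plateaus, going up or down


# Hypothetical policy: which automatic response is acceptable in each phase.
POLICY = {
    "no_beam": "allow_maintenance",
    "injection": "defer_maintenance",
    "ramping": "defer_maintenance",
    "top_energy": "alert_only",
}


def choose_action(beam_energy_gev: float) -> str:
    """Pick the response configured for the current machine phase."""
    return POLICY[machine_phase(beam_energy_gev)]


for energy in (0.0, 450.0, 2000.0, 3500.0):
    print(energy, machine_phase(energy), choose_action(energy))
```

The same phase label could also be stored alongside database performance samples, so that workload can later be compared between phases.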
