Thursday, March 29, 2018

Optimizing the Performance & Scalability of Java Applications that use an RDBMS

Preamble

There is an abundant literature on Java performance (books, articles, blogs, websites, and so on); a Google search returns more than 5 millions hits. To name a few, the Effective Java programming language guide, Java Performance the definitive guide, Java performance tuning newsletter and associated http://www.javaperformancetuning.com website. This is not the purpose of this post.

The goal of this post is to revisit the known best practices for speeding up and scaling database operations for Java applications then discuss database proxy and the upcoming standard Java API for asynchronous database access (ADBA).

Even those familiar with Java optimization techniques will learn new tips!

Speeding up Java applications that use an RDBMS


Optimizing database operations for Java applications includes: speeding up database connectivity, speeding up SQL statements processing, optimizing network traffic, and in-place processing. 

Speeding up Database Connectivity

Connection establishment is the most expensive database operation; the obvious optimization that
Java developers have been using for ages is connection pooling which avoid creating connections at runtime. 



Client-side Connection Pools

Java connection pools such as the Apache Commons DBCP, C3P0, as well as the Oracle  
Universal Connection Pool (UCP) and many others,  run as part of your stand-alone Java/JDBC
applications  along the JDBC libraries or as part of Java EE containers datasources (e.g.,
Tomcat, Weblogic, WebSphere and others). Java EE containers usually furnish their own
connection pools but they also allow replacing theirs with 3rd party pools (see using
UCP with Tomcat, UCP with Weblogic). 

Most Java developers use these client-side or mid-tier connection pools for sustaining small and
 medium workloads however, these connection pools are confined to the JRE/JDK instance
(i.e., can't be shared beyond the boundary of the JRE/JDK) and unpractical when deploying
thens of thousands of mid-tiers or Web servers. Even with very small pool size each, the RDBMS
 server  is overwhelmed by thens of thousands of pre-allocated connections that are idle (more
than 90% of the time).



Proxy Connection Pools

Proxy connection pools such as MySQL Router, Oracle Database Connection Manager in Traffic
 Director Mode (CMAN-TDM), and others, are part of proxy servers that sit between the database
 clients (i.e., Java apps) and the RDBMS. These allow thousands of database clients to share a 
common connection pool. I will discuss this a bit more, near the end of this post.

The Oracle database also furnishes database-side connection pools such as  the Shared Servers, and the Database Resident Connection Pool (DRCP). We will not discuss those in this post.

Other connection optimization features include: deferring connection health check and the 
de-prioritization of failed nodes.
   Deferring Connection Health Check

The ability of a connection pool such as Oracle's Universal Connection Pool (UCP) to avoid
checking the health of connections for a defined period of time, improves the latency of
connection check-out (i.e., getConnection() returns faster).

De-prioritization of Failed Nodes
In a multi-instances clustered database environment such as Oracle RAC,  this JDBC feature assigns
a low priority to a failed instance for a user-defined period of time thereby reducing the connection
establishment latency (iow, avoid attempting to get connections from the failed instance).


Optimizing Statements Processing

The default COMMIT mode with JDBC is Auto-COMMIT; unless this corresponds to your desire, 
you should explicitly disable Auto-COMMIT on the connection object.

conn.setAutoCommit(false);


Processing a SQL statement requires several steps including: parsing, binding variables, executing,
fetching resultSets (if a query), and COMMITting or ROLLBACKing the transaction (if a DML
i.e., Insert, Update, or Delete). 



Java developers have several options for optimizing SQL statements processing including: 
Prepared Statements, Statements Caching, ResultSets caching with change notification.


Prepared Statements

Parsing (i.e., hard parsing) is the most expensive operation during the processing of a SQL statement.
The best practice consists in avoiding parsing by using Prepared Statements which are parsed only once then reused many times on subsequent invocations, after binding variables. A security
 byproduct of Prepared Statements is to prevent SQL injection.

https://docs.oracle.com/javase/tutorial/jdbc/basics/prepared.html
Statements Caching

Statement caching significantly improves performance.  The JDBC driver caches the SQL statements 
(PreparedStatements and CallableStatements) on close, using an LRU algorithm then refers
the RDBMS to the parsed form in its library cache (i.e., "use statement #2)during subsequent
invocations of the same statement.  Enabled by setting  Implicit statement caching  to true and
 allocating a statement cache in the driver memory (i.e., an array per physical connection).

  
OracleDataSource ods = new OracleDataSource(); 
... 
ods.setImplicitCachingEnabled( true ); 

ods.setStmtCacheSize(nn);

...

ResultSets Caching with Change Notification - the Hard Way (JDBC-OCI)

Caching JDBC result sets avoids re-executing the corresponding SQL query, resulting in dramatic
Java applications performance. RDBMSes allow caching ResultSet at the server side but the 
applications needs a roundtrip to the database to get these. Optimizing further, these result set can be
pushed to the drivers (Java, C/C++, PHP, C#, and so on) and grabbed by the applications without
 database roundtrips. 
Then what if the ResultSets become stale, out of sync with the actual RDBMS data? RDBMSes 
furnish mechanisms to maintain the ResultSets, up to date. For example, the Oracle database's Query
Change Notifications allows registering a SQL query with the RDBMS and receiving notifications
when committed DMLs from other threads render the ResultSets out of sync. 
Java applications may explicitly implement ResultSet caching with change notification through the
 following steps:

Prerequisite: grant CHANGE NOTIFICATION to the schema (i.e., database user); 

 grant change notification to HR;  // might need your DBA's help.

1) Create a registration


OracleConnection conn = ods.getConnection();
Properties prop = new Properties();
prop.setProperty(OracleConnection.DCN_NOTIFY_ROWIDS, "true");
prop.setProperty(OracleConnection.DCN_QUERY_CHANGE_NOTIFICATION,"true");
   ...
    DatabaseChangeRegistration dcr = conn.registerDatabaseChangeNotifictaion(prop);
    ...

2) Associate a query with the registration
  
  Statement stmt = conn.createStatement(); 
  // associating the query with the registration
  ((OracleStatement)stmt).setDatabaseChangeRegistration(dcr); 
  // any query that will be executed with the 'stmt' object will be associated with
  // the registration 'dcr' until 'stmt' is closed or 
  // '((OracleStatement)stmt).setDatabaseChangeRegistration(null);' is executed.
 ...



3) Listen to the notification
 ...
 // Attach the listener to the registration. 
 // Note: DCNListener is a custom listener and not a predefined or standard 
 // listener
 DCNListener list = new DCNListener(); dcr.addListener(list); 
 ...

 catch(SQLException ex) { 
 // if an exception occurs, we need to close the registration in order 
 // to interrupt the thread otherwise it will be hanging around. 
  if(conn != null) 
     conn.unregisterDatabaseChangeNotification(dcr); 
  throw ex; 
 }



ResultSets Caching with Change Notification - the Easy Way (JDBC Thin or OCI with DB 18c)

You may also enable ResultSet caching with invalidation, in a much easier way, using the following
 steps (once JDBC-Thin in Oracle database 18c is available on-premise).

1) Set the following database parameters in the database configuration file also known as
 INIT.ORA.
CLIENT_RESULT_CACHE_SIZE=100M // e.g., maximum cache size, in bytes
CLIENT_RESULT_CACHE_LAG=1000 // maximum delay for refreshing the cache (msec) 



2) Set the JDBC connection property oracle.jdbc.enableQueryResultCache to true (the default).



3) add the following hint to the SQL query string  "/*+ RESULT_CACHE */"

Example "SELECT /*+ RESULT_CACHE */ product_name, unit_price 
             FROM PRODUCTS WHERE unit_price > 100"

If changing the Java/JDBC source code to add the SQL hint is not an option, you can instruct the
 RDBMS to cache the ResultSets of all queries related to a specific table, either at table creation 
(default mode) or later (force mode); this is called Table annotation.

Examples              

CREATE TABLE products (...) RESULT_CACHE (MODE DEFAULT);
   ALTER TABLE products RESULT_CACHE (MODE FORCE);

The RDBMS furnishes views such as the V$RESULT_CACHE_STATISTICS and
CLIENT_RESULT_CACHE_STATS$ table for monitoring the effectiveness of ResultSet caching.
See section 15  in the performance tuning guide for more details on configuring the server-side 
result set cache

Array Fetch

Array fetching is an absolute necessity when retrieving a large number of rows from a ResultSet.
The fetch size can be specified on Statement, PreparedStatement, CallableStatement, and 
ResultSet objects.
Example: pstmt.setFetchSize(20);



When using the Oracle database, this array size is capped by the RDBMS's internal buffer known as
Session Data Unit (SDU). The SDU buffer is used  for transferring data from the tables to the client, 
over the network. The size of this buffer, in bytes, can be specified in JDBC URL

      jdbc:oracle:thin:@(DESCRIPTION=(SDU=10240)
                     (ADDRESS=(PROTOCOL=tcp)(HOST=myhost-vip)(PORT=1521))
                     (CONNECT_DATA=(SERVICE_NAME=myorcldbservicename)))



or at the service level in Net Services configuration files sqlnet.ora and tnsnames.ora.
There is a hard limit depending on the RDBMS release: 2MB with DB 12c, 64K with DB 11.2,
 and 32K with DB pre-11.2.
In summary, even if you set the array fetch to a large number, it cannot retrieve more data than
the SDU permits.

Array DML (Update Batch)

The JDBC specification defines array operations as sending a batch of the same DML operations
(i.e.,  array INSERTs, array UPDATEs, array DELETE) for sequential execution at the server, 
thereby reducing network round-trips.

Update Batching consists in explicitly invoking the addBatch methods which adds a statement to
 an array operation then explicitly calling executeBatch method. 

...
 PreparedStatement pstmt =
  conn.prepareStatement("INSERT INTO employees VALUES(?, ?)");
pstmt.setInt(1, 2000);
pstmt.setString(2, "Milo Mumford");
pstmt.addBatch();
pstmt.setInt(1, 3000);
pstmt.setString(2, "Sulu Simpson");
pstmt.addBatch();
int[] updateCounts = pstmt.executeBatch();

...

Optimizing Network Traffic


Network Data Compression

The ability to compress data transmitted between the Java applications and the RDBMS over LAN or WAN reduces the volume of data, the transfert time and the number of roundtrips.

// Enabling Network Compression 
prop.setProperty("oracle.net.networkCompression","on"); 
// Optional configuration for setting the client compression threshold.
prop.setProperty("oracle.net.networkCompressionThreshold","1024"); ds.setConnectionProperties(prop); 
ds.setURL(url); 
Connection conn = ds.getConnection(); 
...

Sessions Multiplexing 


The Oracle database Connection Manager a.k.a. CMAN, furnishes the ability to funnel multiple database connections over a single network connection thereby saving OS resources.
See more details in the Net Services Admin guide.

In-Place Processing


As we have seen earlier, SQL statements execution involves a number of roundtrips between a database client i.e., Java mid-tier/web-server and the RDBMS; this is the rationales for using stored procedures. Even modern data processing such as Hadoop or Spark, collocate the processing and data for low latency.
All RDBMSes furnish stored procedures in various languages including proprietary procedural language such as Oracles PL/SQL but also Java, JavaScript, even PHP, Perl, Python, and TCL.
I discussed the pros and cons of stored procedures in chapter 1 of my book.
I'd add that in a modern micro-services based architecture, stored procedures are perfect for designing data-bound services.
The Oracle database furnishes Java and PL/SQL stored procedures. Java in the database is one of the best Oracle database gem; see some code samples on GitHub.

Scaling Out Java Applications that use an RDBMS


In this section, I will discuss scaling Java applications using Sharded databases, Multitenant databases, database proxy and the upcoming asynchronous Java database access API.

Horizontal Scaling of Java applications with Sharded Databases


Sharded database have been around for a while; think of shards as horizontal partitioning of tables across several databases (iow, partitions on steroids!).
The main impact for developers is that Java application must be Shard-aware; iow, the requirement to: (i) define which fields serve as sharding key, (ii) set the binding values and build the sharding key (and optionally, the super sharding key) before requesting a connection to the datasource. RDBMS vendors are actively working on a routing capability which will remove shard-awareness (see database proxy, later in this post).

Java SE 9 furnishes the standard APIs for building the sharding and supersharding keys.

DataSource ds = new MyDataSource();
     ShardingKey shardingKey = ds.createShardingKeyBuilder()
                           .subkey("abc", JDBCType.VARCHAR)
                           .subkey(94002, JDBCType.INTEGER)
                           .build();
 ...
 Connection con = ds.createConnectionBuilder()
                           .shardingKey(shardingKey)
                           .build();

Depending on the RDBMS implementation, the map of shards keys across all shard, also know as "shard topology" is maintained by an external mechanism known as the "Shard Director" (in  Oracle database implementation). Without further optimization, all connection requests (with a mandatory sharding key) go to the Shard Director which finds the corresponding shard then a connection is established with that shard.

A Shared Pool for Sharded DBs

The Oracle Universal Connection Pool (UCP) furnishes a shared single pool for all shards.
UCP has been enhanced to transparently suck the shard map (i.e., all the keys that map to a specific shard), from the Shard Director, during the first connection to a specific shard. Once UCP gets the keys range, it no longer needs to go to the Shard Director for subsequent connections requests related to that shard. After a little while, assuming your Java application randomly accesses all shards, UCP will get the entire shard topology from the Shard Director. A high availability byproduct of UCP acting as the Shard Director is that shard-aware Java applications can work even if the Shard Director is down.

Scaling Java Applications with Multi-Tenant Databases


Multi-tenancy is a key business requirement for enterprise Java applications. It could be simulated at the application level but true Multi-tenancy requires a Multi-tenant RDBMS where each tenant has it's own database.
Multi-tenant RDBMS scale by managing thousands of databases with one of very few database instances (an instance being the set of processes and memory structures necessary for managing a database), thereby reducing drastically the required computing resources.

How would Java applications scale with Multi-Tenant RDBMS?

A non Multi-tenant aware connection pool would allocate a pool per database, defeating the purpose.  UCP has been enhanced to use a single shared pool for all pluggable databases -- a.k.a. PDB (a PDB is the tenant specific database in Oracle's Multi-tenant architecture).
Upon a connection request to a specific PDB, if there is no free/available connection attached to that tenant database, UCP transparently repurposes an idle connection in the pool, which was attached to another PDB to be re-attached to this one, thereby allowing to use a small set of pooled connections to service all tenants while avoiding new connection creation (remember, this is very expensive!) and preserving system resources.
See the UCP doc for more details on using one datasource per tenant or a single datasource for all tenants.

Database proxy 


Proxies are man-in-the-middle software running between the database and its clients e.g., Java applications. There are several proxy offerings on the market; to name a few: MySQL Router, the Oracle Database Connection Manager in Traffic Director Mode (CMAN-TDM), ProxySQL, and so on.
The Oracle CMAN-TDM is new in Oracle database 18c; it is an extension of the existing Oracle Connection Manager a.k.a. CMAN and furnishes these new following capabilities
  • Fully transparent to applications
  • Routes database traffic to right instance (planned)
  • Hides database planned and unplanned outages to support zero application downtime
  • Optimizes database session usage and application performance 
  • Enhances database security
CMAN-TDM is client agnostic, iow, it supports all database clients applications including: Java, C, C++, DotNET, Node.js, Python, Ruby, R.
Java applications would connect to CMAN-TDM which, in its turn, connects to the database using the latest driver and libraries then transparently furnish the Quality of Service that the application would get only if it was using the latest driver and APIs

See more details in the CMAN landing page and the Net Services documentions linked from the landing page.

Asynchronous Java Database Access API (ADBA)


The existing JDBC API leads to blocked threads, threads scheduling, and contention; it is not suitable for reactive applications or high throughput and large-scale deployments. There exist non-standard asynchronous Java database access APIs but the Java community needs a standard one where user threads never block. User threads submit database operations and return; the API implementation takes care of executing the operations, independently of user threads.
This new API proposal is not intended to be an extension to, or a replacement for, JDBC but, rather, an entirely separate API that provides completely nonblocking access to the same databases as JDBC.


The new API proposal relies on the java.util.concurrent.CompletionStage interface; it is available for download from the OpenJDK sandbox @  http://tinyurl.com/java-async-db.
You can sed some examples in the latest JavaOne presentation @ http://bit.ly/2wi948k.

There was a suggestion on the mailing list http://mail.openjdk.java.net/pipermail/jdbc-spec-discuss/ to rather base he API on the Reactive Streams class java.util.concurrent.Flow; you can follow that discussion in the mailing list.

I would encourage all the readers of this blog to review the API and get involved in the discussion.

ADBA over JDBC

In order to help the community get a feel of ADBA, an alpha version of it that runs over the vanilla/synchronous JDBC -- that we are calling AoJ for ADBA over JDBC -- @ https://github.com/oracle/oracle-db-examples/tree/master/java/AoJ

Resources