Summary
While servers with CPUs supporting multiple cores and multi-gigabytes of main memory have become common in recent years, few applications have yet taken advantage of that increased computing power. In a recent OnJava.com article, Michael Juntao Yuan and Dave Jaffe illustrate how to scale Java applications to multi-core CPUs.
While computers with multiple CPUs have been available for many years in both server and desktop forms, in the past twelve months, the major CPU manufacturers started shipping multi-core CPUs in large volumes. The result is that in a short time, multi-core machines will become the norm for running server as well as desktop applications.
In addition to multiple cores, recent CPU architectures have moved to 64-bit instruction sets, with the primary benefit of making a lot more memory available to applications.
While the twin trends of multi-core CPUs and a larger available memory space have received a lot of attention, relatively few applications have yet taken advantage of them. The reason is partly a lack of general understanding of how to exploit such processors, and partly the architectural changes required of existing applications, according to a recent article by Michael Juntao Yuan and Dave Jaffe, Scaling Enterprise Java on 64-bit Multi-Core X86-Based Servers.
Speaking of application architecture, Yuan and Jaffe note that:
Traditionally, Java EE had been designed for the multitiered architecture. This architecture envisions that the web server, servlet container, EJB server, and database server each runs on its own physical computer, and those computers are tied together through remote call protocols on the local network.
But with the new generation of more powerful server hardware, a single computer is powerful enough to run all those components for a medium-sized website. Running everything on the same physical machine is much more efficient than the distributed architecture described above. All communications are now interthread communications that can be handled efficiently by the same operating system or even inside the same JVM in many cases. It eliminates the expensive object serialization requirements and high network latency associated with remote calls. Furthermore, since different components tend to use different kinds of server resources (e.g., the database is heavy on disk usage while Java EE is CPU-intensive), the integrated stack helps us to balance the server usage and reduce overall contention points.
Even if all application tiers are colocated on the same server, a server's CPU(s) may still sit idle, waiting for other tasks to complete. Idle CPUs are a key indication of an application that does not properly scale to multi-core systems, according to Yuan and Jaffe:
When the application is fully loaded [during load testing], the CPU should run between 80% and 100% of its capacity. If the CPU usage is substantially lower, you should look for other bottlenecks, such as whether the network or disk I/O is saturated... An underutilized CPU could also indicate contention points inside the application. For instance... if there is a synchronized block on the critical path of multiple threads (e.g., a code block frequently accessed by most requests), the multiple CPUs would not be fully utilized.
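To make that contention point concrete, here is a minimal, hypothetical sketch (the class and method names are invented for illustration, not taken from the article). If every request thread must pass through the same synchronized block, the threads execute that section one at a time, and additional cores spend their time waiting on the lock:

    // Hypothetical illustration of a hot lock on the request path.
    public class HitCounter {
        private final Object lock = new Object();
        private long hits = 0;

        // Called by every request-handling thread.
        public long recordHit() {
            synchronized (lock) {   // contention point: threads serialize here
                return ++hits;
            }
        }

        // A lower-contention alternative from java.util.concurrent.
        private final java.util.concurrent.atomic.AtomicLong atomicHits =
                new java.util.concurrent.atomic.AtomicLong();

        public long recordHitAtomic() {
            return atomicHits.incrementAndGet();   // lock-free update
        }
    }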
Yuan and Jaffe note that, fortunately, the Java platform offers a wealth of development tools and techniques to ensure that a machine's capacity is fully utilized by a Java program. They mention three aspects of JDK 1.5 and above that are particularly relevant to scalable application development:
The concurrency utility library... simplifies the Java thread API and provides a thread-safe set of Collection implementations. For instance, the new ConcurrentHashMap is a thread-safe HashMap and you can read/write it without a synchronized block...
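As a rough illustration of the usage the authors describe (the cache class below is invented for this example), a ConcurrentHashMap can be read and updated from many threads without wrapping the calls in a synchronized block:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical sketch: a shared lookup table touched by many request threads.
    public class SessionCache {
        private final Map<String, Long> lastSeen =
                new ConcurrentHashMap<String, Long>();

        // Safe to call concurrently; no explicit locking needed.
        public void touch(String sessionId) {
            lastSeen.put(sessionId, Long.valueOf(System.currentTimeMillis()));
        }

        public Long lookup(String sessionId) {
            return lastSeen.get(sessionId);
        }
    }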
The NIO... library... allows multiple threads to share one physical connection (e.g., a socket) to the hard disk or network. A thread no longer needs to block the I/O socket to read or write data... The NIO is especially useful in multi-CPU computers where CPUs often wait on the I/O, and where there are many threads.
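The sketch below, assuming a simple echo service on an arbitrary port, illustrates the non-blocking style the authors refer to: a single thread multiplexes many connections through a Selector instead of blocking on each socket. It is illustrative only and is not code from the article:

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.nio.channels.ServerSocketChannel;
    import java.nio.channels.SocketChannel;
    import java.util.Iterator;

    public class NioEchoServer {
        public static void main(String[] args) throws IOException {
            Selector selector = Selector.open();
            ServerSocketChannel server = ServerSocketChannel.open();
            server.socket().bind(new InetSocketAddress(8080)); // port is arbitrary
            server.configureBlocking(false);
            server.register(selector, SelectionKey.OP_ACCEPT);

            ByteBuffer buffer = ByteBuffer.allocate(1024);
            while (true) {
                selector.select();  // the thread waits here, not on any one socket
                Iterator<SelectionKey> keys = selector.selectedKeys().iterator();
                while (keys.hasNext()) {
                    SelectionKey key = keys.next();
                    keys.remove();
                    if (key.isAcceptable()) {
                        SocketChannel client = server.accept();
                        client.configureBlocking(false);
                        client.register(selector, SelectionKey.OP_READ);
                    } else if (key.isReadable()) {
                        SocketChannel client = (SocketChannel) key.channel();
                        buffer.clear();
                        int n = client.read(buffer);
                        if (n < 0) {
                            client.close();        // client went away
                        } else {
                            buffer.flip();
                            client.write(buffer);  // echo the bytes back
                        }
                    }
                }
            }
        }
    }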
The logging library provides a convenient API to log information from the application to the console, logfiles, or network destinations... [and allows you to] configure the logging output by changing the logging level at runtime via configuration files. This helps us to reduce logging—which involves slow I/O operation and is a major cause for CPU waiting—at runtime, without recompiling the application code.
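A minimal sketch of that pattern with java.util.logging follows (the logger name and class are invented for illustration). The FINE-level output can later be switched off through a logging configuration file, with no recompilation:

    import java.util.logging.Level;
    import java.util.logging.Logger;

    public class OrderService {
        private static final Logger log =
                Logger.getLogger("com.example.OrderService");

        public void placeOrder(String orderId) {
            // Guard detailed messages so the string is not even built
            // when FINE logging is turned off in production.
            if (log.isLoggable(Level.FINE)) {
                log.fine("Placing order " + orderId);
            }
            // ... business logic ...
            log.info("Order accepted: " + orderId);
        }
    }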
In the article, Yuan and Jaffe also highlight JVM tuning parameters that take advantage of all of a system's available memory:
You should allocate as much memory as possible to the JVM using the -Xms (minimum memory) and -Xmx (maximum memory) flags. For instance, the -Xms1g -Xmx1g flags allocate 1GB of RAM to the JVM. If you don't specify a memory size in the JVM startup flags, the JVM would limit the heap memory to 64MB (512MB on Linux), no matter how much physical memory you have on the server...
On 64-bit systems, the call stack for each thread is allocated 1MB of memory space. Most threads do not use that much space. Using the -XX:ThreadStackSize=256k flag, you can decrease the stack size to 256k to allow more threads...
Use the -XX:+DisableExplicitGC flag to ignore explicit application calls to System.gc()....
The -Xmn flag lets you manually set the size of the "young generation" memory space for short-lived objects. If your application generates lots of new objects, you might improve GCs dramatically by increasing this value. The "young generation" size should almost never be more than 50% of heap.
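Taken together, a hypothetical startup command combining the flags quoted above might look like the following (the -Xmn value and the application jar name are illustrative, not recommendations from the article):

    java -Xms1g -Xmx1g -Xmn256m -XX:ThreadStackSize=256k -XX:+DisableExplicitGC -jar myapp.jar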
Finally, the authors provide tips on fine-tuning the garbage collector for large-memory systems.
What is your experience using Java on multi-core systems? What additional techniques have you used in order to scale an application so that it can take full advantage of available hardware resources?
1) Java might be a "virtual machine," but that virtual machine is a 32-bit machine. The 64-bit Java VMs run great on 64-bit multi-core servers (we have 64-bit AMD Opteron dual cores, Intel dual core Xeons, and Intel quad core Xeons in our test lab), but Java arrays are indexed by 31-bit values, Java NIO uses 31-bit values for offsets and lengths, etc.
2) Multiple cores require cheaper coordination models than the Java monitor (the "synchronized" keyword, which generates MONITORENTER and MONITOREXIT bytecodes). The efforts of Doug Lea and the rest of the JSR 166 team are very important here, as they pay off even with dual cores, and with quad-core (or multi-socket, multi-core) systems they are hugely "in the lead" performance-wise. It's not that you replace "synchronized" with the concurrent APIs, but rather that the concurrent APIs are simply better (more appropriate) for 60-70% of the problems one faces when implementing those high-contention points in software.