Summary
ParallelArray is a new class, proposed for Java 7, that makes it easy to automatically parallelize aggregate operations on in-memory data. Brian Goetz's latest IBM developerWorks article introduces this class with code examples.
ParallelArray uses the fork-join framework to run aggregate operations over in-memory data in parallel:
The main operations are to apply some procedure to each element, to map each element to a new element, to replace each element, to select a subset of elements based on matching a predicate or ranges of indices, and to reduce all elements into a single value such as a sum.
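To make these operations concrete, here is a minimal sketch of element-wise replacement followed by a reduction. It is written against the jsr166y/extra166y preview packages in which ParallelArray was being developed; package locations, class names, and method signatures varied between snapshots (and ParallelArray did not ultimately ship with Java 7), so treat the exact API shown here as an assumption rather than the final form.

import jsr166y.ForkJoinPool;
import extra166y.Ops;
import extra166y.ParallelDoubleArray;

public class ElementWiseDemo {
    public static void main(String[] args) {
        // Worker pool; by default sized to the number of available processors.
        ForkJoinPool pool = new ForkJoinPool();
        double[] values = { 1.0, 2.5, 3.5, 7.0 };

        // Hand the raw array off to a ParallelDoubleArray backed by the pool.
        ParallelDoubleArray pa = ParallelDoubleArray.createUsingHandoff(values, pool);

        // Replace each element with a new value (here, doubling it) in parallel.
        pa.replaceWithMapping(new Ops.DoubleOp() {
            public double op(double d) { return d * 2; }
        });

        // Reduce all elements to a single value.
        double total = pa.sum();
        System.out.println("total = " + total);
    }
}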
In his article, Goetz explains that:
Fork-join is a technique that makes it easy to express divide-and-conquer parallel algorithms in a way that admits efficient execution on a wide range of hardware, without code changes...
Fork-join embodies the technique of divide-and-conquer; take a problem and recursively break it down into subproblems until the subproblems are small enough that they can be more effectively solved sequentially. The recursive step involves dividing a problem into two or more subproblems, queueing the subproblems for solution (the fork step), waiting for the results of the subproblems (the join step), and merging the results...
The principal benefit of using the fork-join technique is that it affords a portable means of coding algorithms for parallel execution. The programmer does not have to be aware of how many CPUs will be available in deployment; the runtime can do a good job of balancing work across available workers, yielding reasonable results across a wide range of hardware.
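The divide-and-conquer pattern described above maps directly onto the RecursiveTask/ForkJoinPool API, which lived in the jsr166y preview package at the time of the article and later became part of java.util.concurrent in Java 7. The following sketch sums an array this way; the threshold value is chosen arbitrarily for illustration.

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Recursively sums an array: split until a chunk is small enough to handle
// sequentially, fork one half, compute the other, then join and merge.
class SumTask extends RecursiveTask<Long> {
    private static final int THRESHOLD = 10000;   // sequential cutoff; a tuning parameter
    private final long[] data;
    private final int lo, hi;

    SumTask(long[] data, int lo, int hi) {
        this.data = data;
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    protected Long compute() {
        if (hi - lo <= THRESHOLD) {                // small enough: solve sequentially
            long sum = 0;
            for (int i = lo; i < hi; i++) {
                sum += data[i];
            }
            return sum;
        }
        int mid = (lo + hi) >>> 1;
        SumTask left = new SumTask(data, lo, mid);
        SumTask right = new SumTask(data, mid, hi);
        left.fork();                               // queue one subproblem (the fork step)
        long rightSum = right.compute();           // work on the other half directly
        long leftSum = left.join();                // wait for the forked half (the join step)
        return leftSum + rightSum;                 // merge the results
    }
}

public class ForkJoinSumDemo {
    public static void main(String[] args) {
        long[] data = new long[1000000];
        java.util.Arrays.fill(data, 1L);
        long total = new ForkJoinPool().invoke(new SumTask(data, 0, data.length));
        System.out.println("total = " + total);    // prints 1000000
    }
}

Goetz goes on to note that such parallel decompositions are common enough to warrant direct library support: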
One area that is ripe for parallelization is sorting and searching in large data sets. It's easy to express such problems using fork-join... But because these problems are so common, the [concurrency] class library [proposed for Java 7] provides an even easier way—ParallelArray...
Goetz then enumerates the main benefits of ParallelArray:
The idea is that a ParallelArray represents a collection of structurally similar data items, and you use the methods on ParallelArray to create a description of how you want to slice and dice the data. You then use the description to actually execute the array operations (which uses the fork-join framework under the hood) in parallel...
This approach has the effect of letting you declaratively specify data selection, transformation, and post-processing operations, and letting the framework figure out a reasonable parallel execution plan, just as database systems allow you to specify data operations in SQL and hide the mechanics of how the operations are implemented. Several implementations of ParallelArray are available for different data types and sizes, including for arrays of objects and for arrays of various primitives...
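A sketch of this declarative style follows, along the lines of the student/GPA example used in Goetz's article: select a subset with a predicate, project a double value out of each selected element, and reduce to a maximum. The package names and helper details below are assumptions based on the jsr166y/extra166y preview releases, which changed between snapshots and never shipped with Java 7.

import jsr166y.ForkJoinPool;
import extra166y.Ops;
import extra166y.ParallelArray;

public class BestGpaDemo {
    static class Student {
        final String name;
        final int graduationYear;
        final double gpa;

        Student(String name, int graduationYear, double gpa) {
            this.name = name;
            this.graduationYear = graduationYear;
            this.gpa = gpa;
        }
    }

    public static void main(String[] args) {
        Student[] students = {
            new Student("Alice", 2008, 3.9),
            new Student("Bob", 2009, 3.4),
            new Student("Carol", 2008, 3.7)
        };
        ForkJoinPool pool = new ForkJoinPool();
        ParallelArray<Student> pa = ParallelArray.createUsingHandoff(students, pool);

        // Describe the query declaratively; the framework plans and runs it in parallel.
        double bestGpa = pa
            .withFilter(new Ops.Predicate<Student>() {       // select: seniors only
                public boolean op(Student s) { return s.graduationYear == 2008; }
            })
            .withMapping(new Ops.ObjectToDouble<Student>() { // transform: Student -> GPA
                public double op(Student s) { return s.gpa; }
            })
            .max();                                          // post-process: reduce to a maximum

        System.out.println("best senior GPA = " + bestGpa);  // 3.9 for this data
    }
}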
The remainder of Goetz's article explains the basic operations supported by ParallelArray: filtering, application, mapping, replacement, and summarization. Goetz also provides results from a simple benchmark that estimates the speedup from using ParallelArray, noting that:
It is possible to get reasonable parallelism using high-level, portable mechanisms without tuning.
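For the summarization operations in particular, the preview API exposed a summary() method that computes several aggregate statistics over the data in a single parallel pass. A brief sketch, again assuming the extra166y preview classes and method names:

import jsr166y.ForkJoinPool;
import extra166y.ParallelDoubleArray;

public class SummaryDemo {
    public static void main(String[] args) {
        double[] scores = { 2.8, 3.4, 3.9, 3.1 };
        ParallelDoubleArray pa =
            ParallelDoubleArray.createUsingHandoff(scores, new ForkJoinPool());

        // Compute min, max, average, and count in one parallel pass over the data.
        ParallelDoubleArray.SummaryStatistics summary = pa.summary();
        System.out.println("min = " + summary.min());
        System.out.println("max = " + summary.max());
        System.out.println("average = " + summary.average());
        System.out.println("size = " + summary.size());
    }
}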