This starts at a pretty early stage of development; usually, the basic logic of the application should be influenced because it drives the asymptotic parallelism behaviors. Consider a common pattern of optimization we’ve seen in single core tuning: the use of locally adaptive algorithms to heuristically reduce the computation time. By definition, this introduces dependencies in the computation that are beneficial in the single core case but limit parallelism for multi-core. Similar choices are made about libraries and programming languages that optimize for single core performance (or even small-way parallelism), but sacrifice long-term scalability...
Front-loading at least some of this transition is often less costly in the long run and positions them to more competitively reap the benefits of our silicon innovations over time. It’s not quite as simple as this binary choice, but you get the basic idea…program for as many cores as possible, even if it is more cores than are currently in shipping products.
Ghuloum recalls that just a few years ago developers wanting to benefit from parallelism expected that tools, such as parallelizing compilers, would do most of the work. Now, however, developers are beginning to realize that such tools will only go so far, and that basic ideas about code design will have to be re-thought:
Increasingly, we are discussing how to scale performance to core counts that we aren’t yet shipping (but in some cases we’ve hinted heavily that we’re heading in this direction). Dozens, hundreds, and even thousands of cores are not unusual design points around which the conversations meander...
[Taking advantage of that] requires at least some degree of going back to the algorithmic drawing board and rethinking some of the core methods they implement. This also presents the “opportunity” for a major refactoring of their code base, including changes in languages, libraries, and engineering methodologies and conventions they’ve adhered to for (often) most of their software’s existence.
Ghuloum says that this is actually unpleasant news for many developers and organizations, because it means they have to adapt some well-established practices, including basic algorithms and designs, to that new reality. At the same time, those quick enough to restructure their code that way will enjoy an advantage over their slower competitors.
To what extent do you consider your application's scalability on many CPU cores?
I expect the overwhelming majority of developers to do little or nothing about this impending excess of cores. In most cases their competitors will also do nothing, so the competitive pressure will be equally small.
I am far from convinced there is a role for a 100-core processor in a general-purpose desktop, and possibly not in many servers either.
I use only local state and immutability in my Java code in an attempt to minimize the difficulty of scaling it across multiple processors. The environments I'm coding for have been multi-processor for years, so it's not just a possibility that I'm preparing for.
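To make that concrete, here's a minimal sketch of the style I mean (the class and field names are purely illustrative): final fields, no setters, and "updates" that build a new object, so instances can be handed to any thread without locks.

// Minimal sketch: an immutable value class whose instances can be shared
// across threads with no synchronization, because nothing can change.
public final class Quote {
    private final String symbol;
    private final long priceInCents;

    public Quote(String symbol, long priceInCents) {
        this.symbol = symbol;
        this.priceInCents = priceInCents;
    }

    public String symbol()     { return symbol; }
    public long priceInCents() { return priceInCents; }

    // "Updates" return a fresh object instead of mutating shared state.
    public Quote withPrice(long newPriceInCents) {
        return new Quote(symbol, newPriceInCents);
    }
}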
The problem is the (confused) idea that never considering what might happen is the best approach. Real discussion about the relative risks and rewards of different approaches is short-circuited, and people charge forward at full speed towards the cliff. In a few years we'll see a lot of crying: "How could I have known? YAGNI! KISS! Arrghh!"
We need to modernize some old sayings:
An ounce of prevention is worth nothing, idiot. A stitch in time is dumb, you stupid fool. Haste makes you more productive on paper.
Baseless Assertion #1: Imperative programming will always be with us.
Baseless Assertion #2: Most programmers on most projects will not adopt parallel-friendly programming architectures early in the development cycle because they are not a natural way to program. They tend to have awkward syntax, to front load a bunch of abstraction work on the developer before the domain is understood well enough to do a good job of it, etc.
Conclusion: By the inductive hypothesis (which I think I'm well justified in using here), closures are the only thing that is going to save us.
Given this:
function findEmployeesSortedBySalary() : List<Employee> {
  var employees = ...
  return employees.sortBy( \ emp -> emp.Salary )
}
even an idiot like myself can turn that into this:
function findEmployeesSortedBySalary() : List<Employee> {
  var employees = ...
  return employees.parallelSortBy( \ emp -> emp.Salary )
}
That's parallelism I can get my head around as a day-to-day dev. No locks. No actors. No adopting some funky framework for all my code. Just call a different method.
Until parallelism is that easy, us average schmucks aren't going to produce much of it.
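To be fair, modern Java streams get close to this. Here's a rough analogue of the two functions above (Employee is just an illustrative type, not any real API); the only difference between the two methods is which stream you ask for.

import java.util.Comparator;
import java.util.List;
import static java.util.stream.Collectors.toList;

// Rough analogue of sortBy vs. parallelSortBy: one word changes.
class EmployeeQueries {

    // Illustrative type; not part of any real API.
    static class Employee {
        final String name;
        final int salary;
        Employee(String name, int salary) { this.name = name; this.salary = salary; }
        int salary() { return salary; }
        String name() { return name; }
    }

    static List<Employee> findEmployeesSortedBySalary(List<Employee> employees) {
        return employees.stream()
                        .sorted(Comparator.comparingInt(Employee::salary))
                        .collect(toList());
    }

    static List<Employee> findEmployeesSortedBySalaryParallel(List<Employee> employees) {
        return employees.parallelStream()   // the one-word change
                        .sorted(Comparator.comparingInt(Employee::salary))
                        .collect(toList());
    }
}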
Mark and others are correct: not much will be done to the code we write. It's not just average "schmucks" who find parallelism in software difficult. It's also difficult for people who focus on it as a career. Much of the work regarding parallel algorithms is archived in books (Misra and Chandy's work comes to mind), many of which have sat on my bookshelf unused for years. I just don't find the need to code up a parallel Euler tour every day.
<blockquote>Now, however, developers are beginning to realize that such tools will only go so far, and that basic ideas about code design will have to be re-thought</blockquote>
People have thought and thought and thought about parallel algorithms. People just need to read the existing work that's gone largely ignored.
<blockquote>the use of locally adaptive algorithms to heuristically reduce the computation time.</blockquote>
That's a good indicator it's time to stop reading. Let's leave Ghuloum to his precious this time. We all know about the trouble it brought Frodo.
map reduce is in heavy use at google. I think everyone is surprised at how commonly map-reduce can be put into play. I heard that it is now being used by something like 500 different programs. Probably because it is a pretty basic distributed parallel algorithm.
Could a library of distributed map reduce be developed for use on regular file systems like linux? You know something that would make it possible to run distributed grep/find/sort/myprogram on Ubuntu in 2010. Would python or ruby or java be able to call such a library? Probably not anytime soon so it looks like I will be buying that new erlang book.
> map reduce is in heavy use at google. I think everyone is
> surprised at how commonly map-reduce can be put into play.
> I heard that it is now being used by something like 500
> different programs. Probably because it is a pretty basic
> distributed parallel algorithm.
>
> Could a library of distributed map reduce be developed for
> use on regular file systems like linux? You know something
> that would make it possible to run distributed
> grep/find/sort/myprogram on Ubuntu in 2010. Would python
> or ruby or java be able to call such a library? Probably
> not anytime soon so it looks like I will be buying that
> new erlang book.
Map-reduce is great but isn't it overkill on a single system with multiple cores?
I also don't really understand the question since there are already map-reduce implementations written in Java and surely also in Python and Ruby.
map reduce is an "easy" way to let programmers take advantage of parallel processing in single systems and also distributed systems. If I had a single system with multiple cores, I'd want to run distributed grep on the linux filesystem. I would hope that the grep programmer provided a way to do that. That programmer could use a distributed map-reduce library. The question is how should that library be written?
map-reduce is not just for large data sets. It is useful in any case where the coordination penalty doesn't outweigh the time savings from going parallel.
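On a single multi-core box the shape of it can be pretty small, too. Here's a toy word-count sketch in Java (nothing to do with Google's library; the class name is made up): the "map" phase splits lines into words in parallel, and the "reduce" phase sums the count for each word.

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Toy map-reduce on one multi-core machine: map lines to words in
// parallel, then reduce by counting occurrences of each word.
class WordCount {
    static Map<String, Long> count(List<String> lines) {
        return lines.parallelStream()                                 // "map" phase
                    .flatMap(line -> Arrays.stream(line.split("\\s+")))
                    .filter(word -> !word.isEmpty())
                    .collect(Collectors.groupingByConcurrent(         // "reduce" phase
                        Function.identity(), Collectors.counting()));
    }
}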
> Because it is absent. `parallelSortBy` is just a name.
Well, it isn't just a name if that method takes the closure I've passed in and parallelizes the operation without me ever having to worry about synchronization or programming to some bizarre API.
> As a programmer it might happen that you have to implement
> such a method occasionally and add it to your library.
Well, with enough parallelized operations (map, sort, findAll/filter, etc.) I'd hope not to have to add many, since I should be able to combine them to get done what I need to. Exponential composition and all that.
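As a sketch of the kind of composition I mean (the class and method names here are invented): chaining a parallel filter, sort, and map needs no new parallel code from me at all.

import java.util.Comparator;
import java.util.List;
import static java.util.stream.Collectors.toList;

// Composing existing parallel building blocks: filter, sort and map
// chained together, with no new parallel code written by the caller.
class ReportSketch {
    static List<String> longWordsByLength(List<String> words, int minLength) {
        return words.parallelStream()
                    .filter(w -> w.length() >= minLength)            // parallel findAll/filter
                    .sorted(Comparator.comparingInt(String::length)) // parallel(izable) sort
                    .map(String::toUpperCase)                        // parallel map
                    .collect(toList());
    }
}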
Of course, someone has to write the parallelized methods, but thankfully there are smart people in the world like Doug Lea:
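His fork/join framework (which eventually landed in java.util.concurrent) is exactly this kind of thing. A minimal sketch of how a task gets written against it; the array-sum example is invented purely for illustration:

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Recursively split an array sum into subtasks until the chunks are
// small enough to compute directly, then combine the partial results.
class SumTask extends RecursiveTask<Long> {
    private static final int THRESHOLD = 10_000;  // arbitrary cutoff for this sketch
    private final long[] data;
    private final int from, to;

    SumTask(long[] data, int from, int to) {
        this.data = data;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {
            long sum = 0;
            for (int i = from; i < to; i++) sum += data[i];
            return sum;
        }
        int mid = (from + to) >>> 1;
        SumTask left = new SumTask(data, from, mid);
        SumTask right = new SumTask(data, mid, to);
        left.fork();                          // run the left half asynchronously
        return right.compute() + left.join(); // compute the right half here, then join
    }

    public static void main(String[] args) {
        long[] data = new long[1_000_000];
        java.util.Arrays.fill(data, 1L);
        long total = new ForkJoinPool().invoke(new SumTask(data, 0, data.length));
        System.out.println(total); // prints 1000000
    }
}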
Yah. Microsoft did that with the Parallel FX library. The Task Parallel Library (TPL) is most interesting, especially considering it leverages closures in the latest C# .NET release.
> If I had a single system with
> multiple cores, I'd want to run distributed grep on the
> linux filesystem.
I realize it's just an example, but it's just a really bad one. The result would just be to have 4-8 threads waiting for the disk to serve up the file, instead of just one. I've only once managed to do a grep that wasn't IO limited, and that one scanned for over 2000 patterns simultaneously.
This does perfectly illustrate that you need to apply parallelism intelligently, not blindly everywhere.
For many of the tasks we program, current single-threaded execution is perfectly fast enough, and there is no need to consider parallelism within the task. For the grep example above, it's almost always going to be IO limited in speed, so there's little point in making other parts faster since IO time will dominate the total running time anyway.
For server-type applications, the architecture needs to allow multiple tasks to be executed in parallel (most already do), and we need to avoid code where one task can block one or more others. This sort of work is already in place in many cases; it will just become more important.
In general, this sort of parallelism is more an architecture issue than an issue that most of us will have to deal with within individual pieces of code.
The sort of tasks that can benefit greatly internally from parallelism are usually things like sorting or heavy calculations. There is a lot of research done in these areas, and there are usually libraries developed as part of that research that can assist with the particulars.
In closing, the warning is to avoid counting on MHz/GHz increases to solve your performance problems, but there's no reason to panic, as I don't expect the current performance per core to start dropping. If your system works fine with the current CPUs, it will still be fine with the future multi-cores.
> no reason to panic, as I don't expect the current
> performance per core to start dropping. If your system
> works fine with the current CPUs, it will still be fine
> with the future multi-cores.
Actually I suspect many processor designers would very much like to reduce the per core performance. If there was widespread adoption of multithreading it would be possible without overall performance loss. As this is unlikely it acts as a brake on that CPU design choice. Perhaps the "Unwelcome Advice" article was an attempt to gain the designers more freedom in this area. Compare with GPU designers who are able to implement large numbers of very simple cores on a die.
> Actually I suspect many processor designers would very
> much like to reduce the per core performance. If there was
> widespread adoption of multithreading it would be possible
> without overall performance loss. As this is unlikely it
> acts as a brake on that CPU design choice. Perhaps the
> "Unwelcome Advice" article was an attempt to gain the
> designers more freedom in this area.
> Compare with GPU designers who are able to implement large
> numbers of very simple cores on a die.
I don't really think that this is an attempt to gain space for design simplification. There is too much legacy software that won't be changed, and it would be unacceptable to many to see it run slower on new hardware. Additionally, there are quite a few algorithms that are highly resistant to parallelization. I cannot really imagine a parallel implementation for the Fibonacci sequence for instance.
As you mention, GPUs use a different approach, and that's why it's so interesting that GPUs are now also being used for generic computational work, not just for 3d/video. That gives a developer (or operating system) options to run computations on the most suitable hardware, and to tune systems by supplying the right combination of (C|G)PUs.