Summary
In a recent blog post, Assaf Arkin compares threads and independent processes, suggesting that most Java developers turn to threads to scale their applications, whereas developers working with PHP, Ruby, or other LAMP languages use processes. He argues that processes scale better.
Assaf Arkin's recent blog post, Why Processes Scale Better Than Threads, contrasts the ways in which LAMP developers and Java developers build complex applications:
In the LAMP world, processes are everything. If you want to pull out data from a file, sort it, and e-mail the result, you pipe several programs together. You’re building a solution by assembling processes.
And for more complex tasks you add even more processes. Want to do things on a schedule? Fire them up with cron. Need to improve throughput? Start up a cache process. Monitor uptime? That’s another process for you.
By contrast, Java developers would run just one JVM process, and call into various APIs to accomplish those same tasks:
In Java you don’t scan files with grep, you use a library. You don’t pipe e-mails to sendmail, you use a library. All the features you need are folded into the VM.
Which turned a snappy VM into a huge behemoth that takes a couple of minutes to boot, as it’s setting up libraries, frameworks and containers. You don’t want to startup the JVM more than once.
To accomplish multiple concurrent tasks, Java developers would use threads, not independent processes. Arkin believes that these approaches result in different scalability characteristics of an application:
Multi-threaded developers tend to scale through objects, libraries and frameworks. When you focus on the components around you, you don’t pay much attention to anything outside the sandbox. The level of abstraction is the API.
Multi-process developers scale by assembling programs together, chaining them or running them in parallel. If it’s not in the framework, you look for a program (or combination of) that does what you need. The level of abstraction is the task...
The more independent processes you have, the easier they are to combine into new and interesting uses.
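The thread-based style Arkin describes can be sketched in a few lines. The following is a minimal, illustrative example (the class and method names are this summary's own, not from the article): several concurrent tasks submitted to a thread pool inside a single JVM, sharing memory rather than running as separate processes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ThreadedTasks {
    // The single-JVM style: several concurrent tasks run as threads in one
    // process, sharing memory, instead of as separate OS processes.
    public static int sumOfSquares(List<Integer> inputs) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (int n : inputs) {
                final int value = n;
                results.add(pool.submit(() -> value * value)); // each task runs on a pool thread
            }
            int total = 0;
            for (Future<Integer> f : results) total += f.get(); // gather results in-process
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(sumOfSquares(List.of(1, 2, 3, 4))); // prints 30
    }
}
```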
Because processes can easily be distributed across multiple servers, Arkin believes that solutions that center around the multi-process approach scale better horizontally (incorporating more servers), whereas the multithreading solution scales better vertically, and is able to take better advantage of a more powerful server.
Arkin's concluding point is that horizontal scaling—distributing workload across many less powerful servers—can result in more overall scale than distributing load to more threads in a single process on one powerful server. Potentially, the horizontal scaling approach is also more economical.
Do you agree with Arkin's conclusion that the multi-process approach scales better? And, if so, how do you architect Java applications to distribute workload among multiple processes?
It's certainly not black and white. I think the author has some good points, and I believe he knows a number of use cases in which doing it any other way (than as he suggested) would be painful overkill.
> To accomplish multiple concurrent tasks, Java developers would use threads, not independent processes. Arkin believes that these approaches result in different scalability characteristics of an application:
>
> > Multi-threaded developers tend to scale through objects, libraries and frameworks. When you focus on the components around you, you don’t pay much attention to anything outside the sandbox. The level of abstraction is the API.
An application server nowadays plays the same role as an O/S: it manages processes.
Essentially all the effort that has gone into application servers simply repeats what has been done in previous decades regarding operating systems.
> > Multi-process developers scale by assembling programs together, chaining them or running them in parallel. If it’s not in the framework, you look for a program (or combination of) that does what you need. The level of abstraction is the task...
But there is no typed interface between tasks, and that is a great problem. Using tasks is like using a dynamically typed programming language: you never know whether it is going to work until you execute it.
> > The more independent processes you have, the easier they are to combine into new and interesting uses.
Low coupling provides better reuse; that is common sense. There was a discussion a few moons ago here on Artima about whether frameworks are better than libraries. The outcome was that libraries are better due to lower coupling.
> Arkin's concluding point is that horizontal scaling—distributing workload across many less powerful servers—can result in more overall scale than distributing load to more threads in a single process on one powerful server. Potentially, the horizontal scaling approach is also more economical.
Since processes can be distributed better than threads, it goes without saying that processes scale better.
But what if threads could be distributed as well? That would certainly turn the case in favor of threads.
> But there is no typed interface between tasks, and that is a great problem. Using tasks is like using a dynamically typed programming language: you never know what is going to work, until you execute it.
Is it a great problem in your actual practical experience?
'cause in mine, it isn't. I have had very occasional problems with tasks created from pipes etc. breaking down with OS and software updates because there wasn't a defined API to work with. This is a very, very small part of my day-to-day issues, though, and I have much greater problems with APIs - including APIs with type checking.
> > But there is no typed interface between tasks, and that
>
> Is it a great problem in your actual practical experience?
More an irritation; the most frequent problem is variations in separators (e.g. CSV with , or ;, quoting differences). Another is one process writing values (dates or decimals) in a national-language form while the next process expects a different form.
As for processes vs threads, I think many applications will need to use both and the best mix is likely to depend on the operating system(s) involved.
> By contrast, Java developers would run just one JVM process, and call into various APIs to accomplish those same tasks:
False Dichotomy. There's nothing stopping a Java developer from scaling with multiple processes that are multi-threaded. I just wrote such an application about a month ago.
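A minimal sketch of that hybrid shape (names are illustrative, not from any real application; the "worker" here is just the current JVM's own binary printing its version, to keep the example self-contained): a Java parent launches several child processes, each of which could itself be multi-threaded.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class MultiProcess {
    // Launch several worker processes from one Java parent -- each child is
    // itself free to be multi-threaded. The worker command is the JVM's own
    // binary running "-version", so no external tools are assumed.
    public static int launchWorkers(int count) throws Exception {
        String javaBin = System.getProperty("java.home")
                + File.separator + "bin" + File.separator + "java";
        List<Process> workers = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            workers.add(new ProcessBuilder(javaBin, "-version")
                    .inheritIO()                      // share the parent's stdout/stderr
                    .start());
        }
        int failures = 0;
        for (Process p : workers) {
            if (p.waitFor() != 0) failures++;         // wait for each child to exit
        }
        return failures;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("failed workers: " + launchWorkers(3));
    }
}
```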
> > In Java you don’t scan files with grep, you use a library. You don’t pipe e-mails to sendmail, you use a library. All the features you need are folded into the VM.
This isn't true.
> > Which turned a snappy VM into a huge behemoth that takes a couple of minutes to boot, as it’s setting up libraries, frameworks and containers. You don’t want to startup the JVM more than once.
I write Java apps all the time that are as fast as any other program. This is just nonsense.
> > Multi-threaded developers tend to scale through objects, libraries and frameworks. When you focus on the components around you, you don’t pay much attention to anything outside the sandbox. The level of abstraction is the API.
>
> > Multi-process developers scale by assembling programs together, chaining them or running them in parallel. If it’s not in the framework, you look for a program (or combination of) that does what you need. The level of abstraction is the task...
He's ignoring the problems of resource sharing and synchronization. These problems are simple to solve in a single multi-threaded app in Java but require IO in multi-process architectures.
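A minimal illustration of the in-process case (names are illustrative): threads in one process share memory directly, so a shared counter needs only an atomic variable, with no files, sockets, or other IPC.

```java
import java.util.concurrent.atomic.AtomicLong;

public class SharedCounter {
    // Threads in one process share memory directly: coordinating on a shared
    // counter takes a single atomic variable, no inter-process IO at all.
    public static long countTo(int threads, int perThread) throws InterruptedException {
        AtomicLong counter = new AtomicLong();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) counter.incrementAndGet();
            });
            workers[i].start();
        }
        for (Thread t : workers) t.join();   // wait for every worker thread
        return counter.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(countTo(8, 100_000)); // prints 800000
    }
}
```

The equivalent coordination between separate processes would need shared memory, a file, a socket, or a message broker.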
> > The more independent processes you have, the easier they are to combine into new and interesting uses.
Sorry, but I think this is total BS. Compared to well-designed OO (of which there is admittedly little), processes are monolithic. How do I reuse just a portion of a process' logic?
> Because processes can easily be distributed across multiple servers, Arkin believes that solutions that center around the multi-process approach scale better horizontally (incorporating more servers), whereas the multithreading solution scales better vertically, and is able to take better advantage of a more powerful server.
Java can do both, and there are good reasons to do so. For example, a Java application can run 5 threads on 5 machines. A comparable multi-process architecture would have 5 processes on 5 machines. The Java process needs only 5 caches. The multi-process architecture requires 25, if it caches at all. In addition, the Java architecture can actually have a smaller footprint on each machine, depending on a number of factors.
> Do you agree with Arkin's conclusion that the multi-process approach scales better? And, if so, how do you architect Java applications to distribute workload among multiple processes?
Sure, in a lot of cases it's much better. If you want high availability, you need the multi-process model. One of the easiest ways to do this is to use a messaging architecture such as what any JMS provider sells.
For optimum scalability you want both. I am using an ETL tool that works with the dataflow paradigm. The tool breaks every step of the flow into a separate process and then spawns several threads for each process. I would call the pipelining approach depth parallelism and the superscalar approach breadth parallelism. The former is easier to use and improves throughput, but often makes latency worse. The latter requires that the data for each thread be independent of the others, and that is often tricky. Witness also this paper http://www.e.u-tokyo.ac.jp/cirje/research/dp/2006/2006cf397.pdf about Japanese industry moving some processes from conveyor-belt to work-cell assemblies.
Since I went through the fields of electronic engineering, computer science and business management already I will stop now :-).
> False Dichotomy. There's nothing stopping a Java developer from scaling with multiple processes
But it's not typical.
> He's ignoring the problems of resource sharing and synchronization. These problems are simple to solve in a single multi-threaded app in Java but require IO in multi-process architectures.
These problems are hard - in Java or any other language. There's a reason Erlang has processes (separate memory spaces communicating via queues) rather than threads. It's not at all hard to argue that Erlang's model is vastly superior to Java's for concurrent processing.
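That share-nothing style can be approximated in Java itself. Here is a rough sketch (names are illustrative, and this is an analogy, not how Erlang is implemented): a worker thread owns its state and communicates only through queues, never through shared mutable data.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class MessagePassing {
    // Share-nothing concurrency in the Erlang style, approximated with Java
    // threads: the worker owns its running total and talks to the outside
    // world only via queues. Messages are assumed nonnegative; -1 means stop.
    public static int sumViaQueues(int... values) throws InterruptedException {
        BlockingQueue<Integer> inbox = new ArrayBlockingQueue<>(64);
        BlockingQueue<Integer> outbox = new ArrayBlockingQueue<>(1);
        Thread worker = new Thread(() -> {
            try {
                int total = 0;                     // state private to this worker
                int msg;
                while ((msg = inbox.take()) >= 0) total += msg;
                outbox.put(total);                 // reply with the final sum
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.start();
        for (int v : values) inbox.put(v);         // send messages
        inbox.put(-1);                             // send the stop signal
        return outbox.take();                      // receive the reply
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(sumViaQueues(1, 2, 3)); // prints 6
    }
}
```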
> processes are monolithic. How do I reuse just a portion of a process' logic?
Your process is too big, then. Unix is the counterexample.
> > For example, a Java application can run 5 threads on 5 machines. A comparable multi-process architecture would have 5 processes on 5 machines. The Java process needs only 5 caches.
With much more complex cache coordination logic...
> > The multi-process architecture requires 25 if it caches at all.
Bogus - caching should be at the service interface level.
> > If you want high availability, you need the multi-process model.
Yep - it's generally faster overall - thread context switches aren't free, after all. Process context switches you're going to get anyhow.
> > False Dichotomy. There's nothing stopping a Java developer from scaling with multiple processes
>
> But it's not typical.
That still doesn't explain how Java having the ability to do both makes it inferior.
> > He's ignoring the problems of resource sharing and synchronization. These problems are simple to solve in a single multi-threaded app in Java but require IO in multi-process architectures.
>
> These problems are hard - in Java or any other language. There's a reason Erlang has processes (separate memory spaces communicating via queues) rather than threads. It's not at all hard to argue that Erlang's model is vastly superior to Java's for concurrent processing.
In terms of what? Performance? Usability? In what way?
> Here's a timely article on the topic that was on digg today.
> http://www.computer.org/portal/site/computer/menuitem.5d61c1d591162e4b0ef1bd108bcd45f3/index.jsp?&pName=computer_level1_article&TheCat=1005&path=computer/homepage/0506&file=cover.xml&xsl=article.xsl
>
> > processes are monolithic. How do I reuse just a portion of a process' logic?
>
> Your process is too big, then. Unix is the counterexample.
There are Unix services that I would like to use pieces of. I understand the concept. It's just that well-designed classes are more reusable than processes.
> > For example, a Java application can run 5 threads on 5 machines. A comparable multi-process architecture would have 5 processes on 5 machines. The Java process needs only 5 caches.
>
> With much more complex cache coordination logic...
What's complex about it? I don't even have to think about it when I write my application.
> > The multi-process architecture requires 25 if it caches at all.
>
> Bogus - caching should be at the service interface level.
OK. I agree that it could be, but is it? Suppose there is a local file used as a resource. In order to avoid loading it into memory for every process, you need to use a shared memory model. Are you suggesting this is easier than using threads in Java?
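To make the in-process case concrete, here is a minimal sketch (names are illustrative, and the "load" is simulated by a counter rather than real file IO): with threads, `ConcurrentHashMap.computeIfAbsent` guarantees the expensive load runs exactly once per key, and every thread shares the result.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class SharedCache {
    // In one multi-threaded process, an expensive resource is loaded once and
    // shared by every thread; N separate processes would each load their own
    // copy, or need shared memory / an external cache process.
    static final AtomicInteger loads = new AtomicInteger();
    static final Map<String, String> cache = new ConcurrentHashMap<>();

    static String get(String key) {
        // computeIfAbsent is atomic: the loader runs at most once per key,
        // even when many threads ask for the same key concurrently.
        return cache.computeIfAbsent(key, k -> {
            loads.incrementAndGet();          // stands in for an expensive file load
            return "contents-of-" + k;
        });
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] readers = new Thread[5];
        for (int i = 0; i < readers.length; i++) {
            readers[i] = new Thread(() -> get("config.xml"));
            readers[i].start();
        }
        for (Thread t : readers) t.join();
        System.out.println("loads: " + loads.get()); // prints loads: 1, not 5
    }
}
```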
> > If you want high availability, you need the multi-process model.
>
> Yep - it's generally faster overall - thread context switches aren't free, after all.
I assume you mean on a single CPU machine.
> Process context switches you're going to get anyhow.
> Here's a timely article on the topic that was on digg today.
> http://www.computer.org/portal/site/computer/menuitem.5d61c1d591162e4b0ef1bd108bcd45f3/index.jsp?&pName=computer_level1_article&TheCat=1005&path=computer/homepage/0506&file=cover.xml&xsl=article.xsl
His basic argument here is that threading is non-deterministic and therefore too hard.
I love this part:
To offer another analogy, a folk definition of insanity is to do the same thing over and over again and expect the results to be different. By this definition, we in fact require that programmers of multithreaded systems be insane. Were they sane, they could not understand their programs.
Someone needs to invite Mr. Lee into the real world. First of all, the folk definition is nonsense. There's a whole branch of physics called quantum mechanics that's based in part on this exact expectation. It also implies an extremely ignorant idea that one can expect the same results based on what has been seen before. More to the point, very few real-world applications (i.e. anything that deals with IO) are fully deterministic, whether threaded or not. Lee's use of blatant rhetoric here puts him on very shaky ground with me. Such arguments are usually used by those with preconceived notions that they wish to rationalize.
The other thing is that the assumption that running parallel processes solves these problems in itself is also wrong. You still need to deal with communication between processes and resource sharing.
Perhaps if you think that threads are too difficult for you or the developers you work with, then maybe they are not for you. I don't find them to be all that difficult. Programming with multiple threads is not fundamentally different from writing multi-process code, and threads provide many more options.
> > If you want high availability, you need the multi-process model.
>
> Yep - it's generally faster overall - thread context switches aren't free, after all.
When I said this, I meant that you need more than one machine. Really you need more than one machine at more than one geographic location. I'm not sure if that was clear.
> Process context switches you're going to get anyhow.
I think I misinterpreted this before. You mean that the machine the process is running on will have context switches regardless of how many instances of your app are running. Assuming the CPUs are fewer than the processes (and maybe even otherwise), this is granted. But in an IO-bound application, multi-process systems should generally have more process switches than an equivalent multi-threaded application.
I think a better title for the article would be "Why LAMP is better than Java".
It seems the author is mostly talking about piping data sequentially through a set of standard Linux commands vs. the use of Java threads. This certainly is not the scenario I think of when comparing multi-process programming with multi-threading programming.
> Is it a great problem in your actual practical experience?
>
> 'cause in mine, it isn't. I have had very occasional problems with tasks created from pipes etc breaking down with OS and software updates because there wasn't a defined API to work with. This is a very very small part of my day to day issues, though, and a much greater problem with APIs - including APIs with type checking.
>
> Eivind.
My experience is different; in many occasions, errors were silently ignored and only revealed much later.
> > But there is no typed interface between tasks, and that is a great problem. Using tasks is like using a dynamically typed programming language: you never know what is going to work, until you execute it.
>
> Is it a great problem in your actual practical experience?
>
> 'cause in mine, it isn't. I have had very occasional problems with tasks created from pipes etc breaking down with OS and software updates because there wasn't a defined API to work with.
The biggest issue I have with pipes is when something besides the last process in the pipe has a problem. You can have a very hard time trying to find out why a process is not getting any output from a pipeline, especially when it is okay for there to be no output.
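One mitigation, sketched here in Java rather than shell (names are illustrative; assumes Java 9+ for `ProcessBuilder.startPipeline` and a POSIX system with `sh` and `cat` on the PATH): check the exit status of every stage of the pipeline, not just the last one.

```java
import java.util.ArrayList;
import java.util.List;

public class PipelineCheck {
    // Inspect the exit status of every stage of a pipeline, not just the last.
    // As described above, an upstream stage can die while the final stage
    // still exits 0 with empty output, hiding the failure.
    public static List<Integer> stageExitCodes() throws Exception {
        List<Process> stages = ProcessBuilder.startPipeline(List.of(
                new ProcessBuilder("sh", "-c", "exit 3"), // upstream stage fails silently
                new ProcessBuilder("cat")));              // downstream sees EOF, exits 0
        List<Integer> codes = new ArrayList<>();
        for (Process p : stages) codes.add(p.waitFor()); // wait for and record each stage
        return codes;
    }

    public static void main(String[] args) throws Exception {
        // Only checking both stages reveals that the pipeline actually failed.
        System.out.println(stageExitCodes()); // prints [3, 0]
    }
}
```

This is the in-program analogue of the shell's `set -o pipefail` / `PIPESTATUS` idiom.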
Another issue is that I've noticed more recently that
ulimit -c 0
has suddenly become popular, so you won't even see the telltale sign of a core file for a really poorly behaving process.
Many people can write reliable software. It's the ones that can't, or don't, that I worry about using pipes and processes.