Smalltalk threads, scalability, and all that
An interesting discussion of scalability started up in this comment thread after I posted on Smalltalk threads. One of the commenters pointed out that BottomFeeder had trouble importing his feeds - he had 5100 of them. The interesting thing is that he immediately assumed that it was due to the spawning of 5100 threads - he saw CPU spiking and a crash.
Now first off, I'll admit that I've never considered the use case of 5100 subscriptions :) Having said that, threads really aren't the problem here - memory usage is. Go open the settings in BottomFeeder, and look at the Memory page. There's a soft limit on memory, and a hard limit. Once BottomFeeder hits the hard limit, any request for additional memory will result in garbage collection - and eventually, if the problem persists, an out of memory exception. Now, it can get ugly at that point - it takes memory to put up a dialog stating "out of memory", so it's possible to have a complete crash. You can customize the memory policy to deal with that (assuming that actual real or virtual memory is available, of course).
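If you'd rather poke at those limits from a workspace than the settings page, it looks roughly like this. This is a minimal sketch, assuming the MemoryPolicy protocol described in the Application Developer's Guide - the selectors here (currentMemoryPolicy, memoryUpperBound:) are from memory and may differ in your VisualWorks release:

    | policy |
    "Grab the active memory policy - the object that decides when to
     grow memory, when to collect, and when to raise the out-of-memory
     exception. Selector assumed."
    policy := ObjectMemory currentMemoryPolicy.

    "Inspect the current hard limit, in bytes."
    Transcript showCR: policy memoryUpperBound printString.

    "Raise the hard limit so a big import has headroom - 500 MB here."
    policy memoryUpperBound: 500 * 1024 * 1024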
Here's what's going on - the base BottomFeeder application consumes 30MB before you load any feeds. Yes, that's fairly "hefty", but I have what amounts to a "kitchen sink" image with lots of libraries pre-loaded - it's made some of the additional features I've added to the app much, much simpler. I subscribe to 300 feeds, and they are all kept in memory. As I'm sitting here, Bf is consuming an extra 30MB over its starting base. To figure out why, you have to understand three things:
How VisualWorks memory management works
How BottomFeeder keeps feed and item data around
What happens during the BottomFeeder update cycle
There's fairly extensive documentation on the memory model in our Application Developer's Guide - for our purposes here, I'll explain the relevant parts. Three areas of memory matter to this discussion:
New Space - where new objects are "born"
Survivor Spaces - two memory zones where objects that survive a first-pass GC (a "scavenge") of New Space end up
Old Space - as the Survivor Spaces fill, objects "tenure" into Old Space. Unless there's manual intervention or a hard memory limit, Old Space simply grows and is not scavenged (the snippet after this list makes the lifecycle concrete)
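To make that lifecycle concrete, here's a tiny illustrative snippet - plain Smalltalk, nothing BottomFeeder-specific; the FeedCache class at the end is a hypothetical stand-in for any long-lived structure:

    | transient |
    "Born in New Space. Once the reference is dropped, the next
     scavenge reclaims the string - it never reaches Old Space."
    transient := String new: 100000.
    transient := nil.

    "By contrast, an object held from a long-lived structure survives
     scavenge after scavenge, bounces between the two Survivor Spaces,
     and eventually tenures into Old Space - where it sits until a
     full GC or a compaction."
    FeedCache default items add: (String new: 100000)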
Now, let's look at an update cycle. BottomFeeder loops over the N subscriptions, and if threaded updates are on, spawns N threads. Each thread looks at the feed in question and does the following (there's a code sketch of this after the list):
1. Is it time to check for an update? Check the metadata in the feed (if any). If not, answer no and generate no update
2. If the answer was yes, issue a conditional HTTP GET (assuming the feed supports that). If the server answers "nothing new", the thread ends
3. If there is new stuff, we now have an HTTP response object (consuming memory)
4. We now parse the XML we got into an XML document (consuming memory)
5. We now do an XML-to-object conversion, creating a set of items for the feed (consuming memory)
6. Those items are merged into the existing feed, and the thread terminates
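Here's roughly what one of those per-feed threads looks like. To be clear, this is a sketch of the shape of the code, not BottomFeeder's actual source - selectors like isStale, conditionalGet, isNotModified, itemsFromDocument: and mergeItems: are invented names, and the XML parser invocation is schematic:

    updateFeed: aFeed
        "Spawned once per subscription when threaded updates are on."
        [| response doc items |
        "(1) Honor the feed's own update metadata (ttl, skip hours)."
        aFeed isStale ifTrue:
            ["(2) Conditional HTTP GET (If-Modified-Since / ETag);
              a 304 Not Modified means the thread simply ends."
            response := aFeed conditionalGet.
            response isNotModified ifFalse:
                ["(3-5) The response body, the parse tree, and the new
                  item objects are all freshly allocated - this is the
                  transient garbage that should die in New Space."
                doc := XML.XMLParser parse: response contents readStream.
                items := aFeed itemsFromDocument: doc.
                "(6) Merge; only the merged items should live on."
                aFeed mergeItems: items]]]
            forkAt: Processor userBackgroundPriority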
That's a noticeable amount of memory for each update that comes back from the server, although much of it is transient objects - i.e., objects that should never make it into Old Space. In older revs of Bf, I didn't have New Space set big enough, so memory usage tended to grow after the first set of updates and stabilize high - too many objects tenured. I now have that handled, at least for subscription counts in the neighborhood of mine and smaller. For 5100 though? I suspect that a lot of tenuring is happening, and that memory usage is subsequently slamming into the hard upper limit - not to mention that the persistent objects alone likely consume enough space to stress the default upper bound.
So let's return to the example - 5100 threads answer back with well over 1000 responses, generating an equal number of XML documents and a large number of items. That likely stresses both the size of New Space and the default memory bound - causing a scaling problem. Note how threads - which are very, very cheap in Smalltalk - don't really enter into it. Creating a thread pool for that many feeds might cut down on some of the CPU spiking, but CPU isn't really the issue in the problem at hand.
In the case at hand, setting the memory upper bound higher - to something like 400 or 500 MB - would allow Bf to handle that number of subscriptions. The hard part is the tenuring. While Old Space size limits can be configured on the fly, New Space and the Survivor Spaces can't be - to change them, you have to set the new sizes, save the image, and restart (something that Bf does not do in its deployed state; there's a sketch of that below). I need to ask our VM guys how feasible it would be to allow runtime resizing of those spaces.
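For reference, that save-and-restart sizing goes through something like ObjectMemory sizesAtStartup: - an array of multipliers applied to the default space sizes at the next image startup. I'm assuming that protocol here, and the slot positions below are guesses; check the Application Developer's Guide for the layout in your release:

    | sizes |
    sizes := ObjectMemory sizesAtStartup copy.
    sizes at: 1 put: 2.0.    "Eden / New Space multiplier (assumed slot)"
    sizes at: 2 put: 2.0.    "Survivor Space multiplier (assumed slot)"
    ObjectMemory sizesAtStartup: sizes.
    "Save the image and restart for the new sizes to take effect."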
In any case, a thread pool isn't really the issue here. Having said that, I created one as an optional thing for BottomFeeder last night. I noted that a commenter pointed out that Java developers don't need to roll their own, since they can just download one. Be that as it may, I was able to create one in about an hour while most of my attention was on "Desperate Housewives" (a true guilty pleasure if ever there was one). A general-purpose thread pool would have been more trouble anyway - I started with a standalone pool implementation and discovered that the hard way :) A sketch of the basic idea follows.
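For the curious, here's a minimal sketch of the idea - not the code that went into Bf, just the shape of it, built on the standard SharedQueue. Assume a ThreadPool class with instance variables 'queue' and 'workers':

    ThreadPool>>startWorkers: anInteger
        "Spawn a fixed number of worker processes. Each one blocks on
         the shared queue and evaluates whatever zero-argument block it
         is handed, so at most anInteger updates run at once - that cap
         is what smooths out the CPU spikes."
        queue := SharedQueue new.
        workers := (1 to: anInteger) collect:
            [:i | [[true] whileTrue: [queue next value]]
                    forkAt: Processor userBackgroundPriority]

    ThreadPool>>schedule: aBlock
        "Queue the work rather than forking a fresh process per feed."
        queue nextPut: aBlock

The update loop then queues work instead of spawning N processes - the worker count (10 here) is arbitrary:

    pool := ThreadPool new.
    pool startWorkers: 10.
    feeds do: [:each | pool schedule: [self updateFeed: each]]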