This post originated from an RSS feed registered with Ruby Buzz
by Patrick Lenz.
Original Post: Killing me softly: Keeping dispatchers alive
Feed Title: poocs.net
Feed URL: http://feeds.feedburner.com/poocsnet
Feed Description: Personal weblog about free and open source software, personal development projects and random geek buzz.
This is an interim post ahead of my long-promised, in-depth review of my attempts to scale a site to a million dynamic page impressions a day on Rails.
When the site in question finally stabilized somewhat, a new problem crept up that I've been unable to fully resolve over the past weeks. The net effect is that my FastCGI dispatchers become unresponsive after a while, possibly following a huge traffic spike. They sit there doing nothing, and lighttpd is unable to talk to them.
The site is powered by 4 application servers running 7 dispatchers each (28 in total) and a dedicated lighttpd proxy. After a while, half of those dispatchers are unresponsive and no longer serve any requests. Page load times slow to a crawl.
Currently, I'm on Ruby 1.8.4, lighttpd 1.4.10 and Rails 1.0 on Linux 2.6.14.
I've tried everything from upgrading Ruby and all gems, to checking whether TCP connection limits were being exceeded on my servers, to talking to weigon, the brains behind lighttpd. To no avail.
The weird thing is, it doesn't matter which end I restart: be it the dispatcher *or* lighttpd, everything goes back to normal. Because of that, I cannot even tell for sure whether Ruby or my application is to blame. It could just as well be lighttpd or my local machine configuration.
Since I was in desperate need of an operational site, I whipped up a script that probes all the available dispatchers for responsiveness and kills them by brute force if they don't respond. I'm using the process scripts, namely the spinner/spawner duo that comes with Rails. As such, a killed dispatcher is immediately restarted and becomes available for lighttpd to send requests to within a couple of seconds.
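To illustrate the probing idea, here is a minimal sketch. It is not the script from this post. The host names, the base port and the two-second timeout are placeholders chosen for the example, and the check only verifies that a dispatcher's socket accepts a TCP connection in time.

```ruby
require 'socket'

# Hypothetical layout: these hosts and ports are placeholders, not from the post.
APP_HOSTS = %w[app1 app2 app3 app4]
BASE_PORT = 8000
PER_HOST  = 7

# Build every [host, port] pair: 4 hosts x 7 dispatchers = 28 endpoints.
def dispatcher_endpoints(hosts = APP_HOSTS, base_port = BASE_PORT, per_host = PER_HOST)
  hosts.flat_map { |host| (0...per_host).map { |i| [host, base_port + i] } }
end

# Returns true if a TCP connection to host:port can be opened within
# +seconds+, false if it is refused, times out or the host cannot be resolved.
def responsive?(host, port, seconds = 2)
  Socket.tcp(host, port, connect_timeout: seconds) { true }
rescue SystemCallError, SocketError, IOError
  false
end

# Example use (commented out so that loading this file touches no network):
#   hung = dispatcher_endpoints.reject { |host, port| responsive?(host, port) }
#   hung.each { |host, port| puts "unresponsive: #{host}:#{port}" }
```

A connect-only check is the weakest useful probe: a process that is stuck may still have connections accepted on its behalf by the kernel's listen backlog. A stricter probe would send a real request and wait for a reply before the deadline.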
As this is obviously more of a band-aid than anything else, this script is provided as-is, with no claims being made about it working for anyone else, being pretty, being well documented or not eating your cat. You absolutely need Net::SSH installed in order to kill dispatchers that are not running on localhost. I'm running the script inside a screen session to keep an eye on what's happening with my dispatchers and how often they get killed. Your mileage may vary.
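For readers wondering where Net::SSH comes in, the following sketch shows one way a remote kill could be done. It is an illustration under assumptions, not the code from this post: the `deploy` user is hypothetical, the target hosts are assumed to have the `fuser` utility (which sends SIGKILL by default when given `-k`), and the `Net::SSH.start` / `exec!` calls are those of current net-ssh releases, which differ from the library version that was available when this was written.

```ruby
# Build the shell command that force-kills whatever process holds TCP +port+.
# Assumption: `fuser` (from psmisc) is installed on the target machine.
def kill_command(port)
  "fuser -k -n tcp #{Integer(port)}"
end

# Kill the dispatcher on host:port, locally or over SSH.
# The 'deploy' user is a placeholder for this example.
def kill_dispatcher(host, port, user: 'deploy', local_hosts: %w[localhost 127.0.0.1])
  command = kill_command(port)
  if local_hosts.include?(host)
    system(command)
  else
    require 'net/ssh' # net-ssh gem, only needed for remote hosts
    Net::SSH.start(host, user) { |ssh| ssh.exec!(command) }
  end
end

# Example use (commented out so that loading this file kills nothing):
#   kill_dispatcher('app2', 8003)
```

Killing by port rather than by PID keeps the sketch independent of how the dispatchers were spawned. Whatever supervises them is then expected to start a replacement.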
In case you're having similar issues with your Rails application, feel free to leave a comment. The script only takes care of dispatchers that are already hung. It is by no means meant as a final cure and I'm more than eager to find out what's causing the freezes in the first place.
The script is available in the body of this article or as a download here.