This post originated from an RSS feed registered with .NET Buzz
by Steve Hebert.
Original Post: MS Windows Services for Unix + Client for NFS + EMC = Kernel Memory Leak
Feed Title: Steve Hebert's Development Blog
Feed Description: .Steve's .Blog - Including .Net, SQL Server, .Math and everything in between
Here's a topic I thought I'd lend a little Google juice to, since Microsoft has created a hotfix. This problem is nasty - difficult to diagnose and difficult to track down.
We have an application that shares a NAS device with Unix
servers. In our pre-production and production environments, a
group of Windows 2000 boxes would go belly up after a week of use.
These machines would gradually become slow and then suddenly be
unable to communicate on the network. The event logs showed the
system losing network communication and then logging repeated
errors. It looked like someone tripped over a network cable
whenever these systems went down. To make diagnosis worse, I did
not have physical access to the boxes - only VNC access. Here's
the path we took to diagnose the problem; perhaps this will save
someone else some time.
After a few crashes we saw that socket creation was being denied
because resources were low. This led us to look at System PTEs
(page table entries). We monitored free System PTEs in perfmon and
saw that the leak didn't start for the first 4 hours, but then the
count declined steadily at a rate loosely tied to traffic volume.
Without any traffic the PTEs decreased at a rate of ~5/hour; with
traffic the rate ranged from 60 to 100 PTEs per hour. The PTEs
always decremented in blocks of 10.
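For anyone repeating this measurement, here is a minimal sketch of how a leak rate can be estimated from logged counter samples. It is not the tooling we used; the function name and the sample values are hypothetical, and it assumes you have already exported (hours elapsed, free System PTEs) pairs from a perfmon log. It simply fits a least-squares line through the samples and reports the slope.

```python
from typing import Sequence, Tuple


def leak_rate_per_hour(samples: Sequence[Tuple[float, float]]) -> float:
    """Return the least-squares slope of free PTEs versus time in hours.

    A negative result means the free PTE count is shrinking (a leak).
    Each sample is (hours_elapsed, free_system_ptes).
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must cover more than one point in time")
    return num / den


# Hypothetical hourly readings, invented for illustration only.
readings = [(0, 20000), (1, 19920), (2, 19840), (3, 19760)]
print(leak_rate_per_hour(readings))  # -80.0
```

A slope in the range of -60 to -100 per hour under load would match what we observed; a slope near zero during the first few hours would be consistent with the delayed start.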
At this point we weren't sure what was causing it. The usual
suspect is a driver, because drivers consume kernel memory. After
spending a couple of days trying to track this down, we found that
Windows Services for Unix was at fault. We contacted MS support
and they shipped us the hotfix. The problem disappeared and we
haven't seen the behavior since.
I find it hard to believe that Microsoft has had this product in the
field for so long and is only now seeing a leak this critical. For
some reason we only saw the problem with our EMC/NFS connection.
We have a Solaris/NFS connection in development that has never
exhibited the problem. I guess Microsoft doesn't test SFU against
small 3rd party vendors like EMC. </sarcasm> We spent a ton of
time tracking this problem and questioned everything on these
systems. It's interesting to note that the problem also happens in
Windows 2003, but because 2K3 always has significantly more System
PTEs than Win2k, the box will take much longer to fail.
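To put rough numbers on that, here is a back-of-the-envelope sketch. The 4-hour delay and the 60-100 PTEs/hour rates come from the measurements above; the pool sizes in the second half are hypothetical placeholders, not real Win2k or 2K3 figures, and the model assumes a constant leak rate with the box failing only when the pool reaches zero.

```python
LEAK_DELAY_HOURS = 4.0  # from the post: the leak did not start for 4 hours


def ptes_leaked(uptime_hours: float, rate_per_hour: float,
                delay_hours: float = LEAK_DELAY_HOURS) -> float:
    """PTEs lost after a given uptime, assuming a constant leak rate."""
    return max(0.0, uptime_hours - delay_hours) * rate_per_hour


def uptime_until_failure(free_at_start: float, rate_per_hour: float,
                         delay_hours: float = LEAK_DELAY_HOURS) -> float:
    """Hours of uptime before the free PTE pool is used up (simplified)."""
    if rate_per_hour <= 0:
        raise ValueError("rate_per_hour must be positive")
    return delay_hours + free_at_start / rate_per_hour


# One week of uptime (168 hours) at the observed rates under traffic.
print(ptes_leaked(168, 60))   # 9840.0
print(ptes_leaked(168, 100))  # 16400.0

# Hypothetical pool sizes: a pool four times larger lasts about four times as long.
print(uptime_until_failure(10_000, 100))  # 104.0
print(uptime_until_failure(40_000, 100))  # 404.0
```

So a week of steady traffic eats somewhere around 10,000 to 16,000 PTEs, and a box that starts with a larger pool fails the same way, just later.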