This post originated from an RSS feed registered with Ruby Buzz
by Matt Bauer.
Original Post: A Survey of Gem Naming
Feed Title: blogmmmultiworks
Feed URL: http://blog.mmmultiworks.com/feed/rss.xml
Feed Description: Thoughts on Ruby, hosting and more
I've been working to build a gems data warehouse based on the RubyForge mirror downloads. A beta version should be done by RailsConf. One of the first steps in building a data warehouse is the Extract, Transform and Load (ETL) of data. The source for the gems data warehouse is the Apache logs. They typically look like:
From this it's very easy to tell we have a download of why's Hpricot version 0.5 gem. It's not always that easy, though. Some gems have an OS designation like:
The above all follow a nice pattern, and writing a regex for it isn't too bad. They all use a hyphen to separate the important parts, in the form name-version-arch-os[version]. Then there's the following, without a consistent use of hyphens.
Notice first that it uses win32 and not mswin32 as the OS designation. Second, it includes the Ruby version with a release tag, which really plays havoc with the regex. So how do you write a nice regex to parse all this? You do it like this:
Now, I could make this more concise, and feel free to leave your solution in the comments. I just chose not to, as this one is at least readable. Some interesting things to note about this regex: there is no amd64 or sparc designation, since I don't see any gems specifically designated for those architectures. In the same breath, there are no gems with the i486 or i686 designation either; I just left those in, in case it happens one day. There are also no gems designated for Solaris or any *BSD (other than darwin).
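To give a flavor of the approach, here is a minimal Ruby sketch covering only the regular name-version-arch-os[version] case. It is not the regex from this post: the constant `GEM_FILE`, the helper `parse_gem_filename`, the exact architecture and OS alternatives, and every sample filename other than hpricot-0.5.gem are assumptions made for illustration. It also makes no attempt at the irregular filenames that embed a Ruby version.

```ruby
# A sketch, not the post's regex. The arch and OS lists are assumptions
# based on the designations mentioned in the text.
GEM_FILE = /
  \A
  (?<name>.+?)                          # gem name, which may itself contain hyphens
  -(?<version>\d+(?:\.\d+)*)            # dotted numeric version such as 0.5 or 1.2.1
  (?:-(?<arch>i[3-6]86|powerpc))?       # optional architecture
  (?:-(?<os>mswin32|win32|darwin|linux) # optional OS designation
     (?<os_version>\d+(?:\.\d+)*)?      # optional OS version right after the OS
  )?
  \.gem
  \z
/x

# Returns a hash of the parts, or nil when the filename does not fit the pattern.
def parse_gem_filename(filename)
  m = GEM_FILE.match(filename)
  return nil unless m

  {
    name: m[:name],
    version: m[:version],
    arch: m[:arch],
    os: m[:os],
    os_version: m[:os_version]
  }
end

# Hypothetical filenames, apart from hpricot-0.5.gem which the post mentions.
%w[
  hpricot-0.5.gem
  sqlite3-ruby-1.2.1-mswin32.gem
  example-1.0.0-i386-linux.gem
  example-1.0.0-powerpc-darwin8.0.gem
].each do |file|
  p parse_gem_filename(file)
end
```

The name group is lazy, so a hyphenated name such as sqlite3-ruby stays intact: the name only ends where a hyphen followed by a dotted numeric version begins. Anything that does not fit returns nil, which makes it easy to log the oddballs and decide case by case whether the pattern should grow.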
Now some of you smart kids in the audience might say, "Use the gem spec to figure this all out." To which my response would be, "Yeah, right!" I looked at the gem specs initially to do all this work, but all I found was despair. Nearly all the released gems have incomplete gem specs, which makes parsing the filename the most effective and correct way to extract information.
One final word: I'm not for enforcing a naming scheme via the API, since there are always reasons to go outside the scheme. I'd instead prefer that the community regulate itself. I don't blame people for not filling out their gem specs. There's no reason to, since no process uses most of the information in them. Once someone builds something that makes use of the gem spec, people will update their gem specs on their own. It just so happens I'm building such a beast. Stay tuned, or find me at RailsConf for a demo.