This post originated from an RSS feed registered with Ruby Buzz
by Christian Neukirchen.
Original Post: Querying databases using Ruby
Feed Title: chris blogs: Ruby stuff
Feed URL: http://chneukirchen.org/blog/category/ruby.atom
Feed Description: a weblog by christian neukirchen - Ruby stuff
I have, like many others, problems using databases. For the biggest
part, these problems relate to the usage of SQL, which is IMO an
extremely ugly language that hides the beauty of the relational model
underlying.
Even more interesting, there is no real alternative for SQL to access
relational databases. Therefore, I have been thinking about
formulating database queries directly in Ruby, and effectively
executing them there too. This sounds likely like a big performance
loss, but it has the great advantage that you actually know what the
database it doing (that is, exactly what you tell it) without reading
big and complex ANALYZE output. Also, note that—in the
end—efficient queries always run faster than the unoptimized ones.
Therefore, I think a pure-Ruby database query language scales good
enough for now, optimization is left for later.
Now, what is the basic idea of database querying? In most cases, a
database query can be written in Unix pipe style (Joins will come
later). For example, take this SQL:
SELECT name, address FROM customers WHERE city = "New York" LIMIT 5
We could rewrite this like the following in Unix (In fact, 4gl(?)
works a lot like that):
Unfortunately, the Ruby query is not lazy, as the SQL and Unix
queries are—they only look until they found five results. We could
rewrite the query with Lazy
Streams, then it would
look like that:
That looks much better, actually. Unfortunately, this method possibly
needs lots of method calls and lambdas, so let’s rewrite it
imperatively (ugh):
result = []
count = 0
Customers.each { |c|
next if c.city != "New York"
break if count >= 5
result << [c.name, c.address]
count += 1
}
In fact, I think I can rewrite almost each SQL query in a loop like
that. Note, how we could optimize the query, if we, say had a
secondary key on city (this approach forces you to cleverly make use
of indexes, as linear tablescans take much longer than your usual
database, which is probably written in C).
Customers.each(:city, "New York") { ... }
Now, queries on a single table are fairly easy; how can we join
tables? There are three kinds of joins, 1:1, 1:n, and m:n.
1:1 joins are trivial, we just grab the data we need:
result << [c.amount, Customers[c.custid].name]
An interesting idea left open would be to directly store a reference
to the customer, so you could write [c.amount, c.customer.name].
You can do this if referential transparency is guaranteed.
For 1:n joins, we need an inner loop (assume useful indexes),
outer joins can be done similarily:
SELECT customer.name, payment.amount
FROM customers, payments
WHERE customer.id = payment.customer
Finally, m:n joins can be done with a table and two 1:n joins, just
like in SQL.
A further feature, that SQL provides are aggregates. SQL can easily
compute the count, sum, average, and maximum and minimum of
a set of rows. In Ruby, we can do that too, of course:
sum = 0
Payments.each { |p| sum += p.amount }
Should we need an aggregate inside a query, we either need several
table scans (ugh) or some kind reference to future values.
As you can see, almost all SQL queries can be written in standard,
iterative Ruby code that should not be slower (in theory, at least) if
the queries are written well. However, that kind of Ruby code is
rather tiresome to write. Again, have a look at our first piece of
SQL:
SELECT name, address FROM customers WHERE city = "New York" LIMIT 5
Wouldn’t it be nice if we could write this in a more functional style?
That looks very convenient, no? How does it work? Essentially, we
are building a chain again, like in the example with the Unix pipes.
You can see the source for the full details,
but look how Limit is implemented to get an idea of how it works:
class Limit < Statement
def initialize(limit)
super()
@limit = limit
@count = 0
end
def call(row)
if @count < @limit
@count += 1
pass_on row
end
end
end
This way, we can combine all operators needed in a powerful way
without any implicit loops (one could even imagine an
peephole-optimizer to work on the chain…) in our source, but still
full access to Ruby’s methods. Unfortunately, this way of querying
needs, in worst case, (Number of rows) * (Number of statements in the
chain) calls, which are comparatively expensive. It would be
interesting to rewrite the example in raw C that is connectable with
Ruby…