CPU power vs. bandwidth
Andrew Dalke
I posted this idea a couple years ago on the biopython mailing list.
(Can't find it in the archive after a very cursory search.)
I'm a consultant, working for myself. When I started I only had
dial-up access to the Internet. I tried to get DSL but "there weren't
enough lines in the central office." This is in part because I live
in Santa Fe, New Mexico, which isn't known for its telecommunications
infrastructure. Lots of land, few people, fewer with money, and,
Santa Fe excepted, not known for its high-tech companies. (For a city
of under 70,000 people, Santa Fe has a surprising number of chemical
informatics and complex systems companies.) We are likely within a
few miles of the main Internet connection for Los Alamos National Labs
including the Internet2, but locals can't simply tap into that line.
The feds might frown on it if you tried.
I bought a house a few years ago and when I first looked at it I was
glad to see it already had DSL. Bandwidth is much nicer now, though
still not as good as the 10 Mbit/sec I could get from school a decade
ago. By comparison, the laptop I used to type this
essay is at least an order of magnitude better than the high-end
server machines back then.
I know all the stories about there being a lot of dark fiber in the
main backbone because communications technology is improving faster
than demands on it, but let's face it, getting that bandwidth to the
edge of the network is hard. Maybe not for a US national center or a
large company, but it is for a small consulting house in Santa Fe, a
startup in Lake City, Florida, or a research group in Cape Town,
South Africa. Many startups in the late 1990s flopped because they
expected otherwise.
Disk space, CPU power, bioinformatics data, and bandwidth are all
increasing exponentially. The doubling periods I've heard are 1 year,
18 months, and 2 years for the first three. The last is hard to pin
down. Gilder's Law says it's 6 months, but that's for the fiber-optic
backbone, not the last mile. Call it every 4 years, based on my
personal experience.
Let's assume for now this trend continues. At some point (I get about
10 years) it will be impossible to download all of the new biological
data over a 10 Mbit/second link. As a matter of personal philosophy,
I want small groups and even a single researcher to have effective use
of the data. My extrapolation says that unless we change things, only
large or well-financed groups will be able to tackle certain types of
research problems, simply because of bandwidth limitations.
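To make that extrapolation concrete, here's a back-of-the-envelope
sketch in Python. The starting volume of new data and the doubling
period are illustrative assumptions, not measurements; the point is
only that a fixed 10 Mbit/sec link falls behind exponential data
growth within roughly a decade.

```python
# Back-of-the-envelope extrapolation: how long until a year's worth of
# new sequence data no longer fits through a 10 Mbit/sec link running
# flat out?  The starting size and doubling period are assumptions.

data_doubling_years = 1.5          # assume data doubles every ~18 months
new_data_tb = 1.0                  # assume ~1 TB of new data this year

link_mbit_per_sec = 10.0           # the fixed 10 Mbit/sec link
seconds_per_year = 365 * 24 * 3600
link_tb_per_year = link_mbit_per_sec * 1e6 * seconds_per_year / 8 / 1e12

years = 0
while new_data_tb <= link_tb_per_year:
    years += 1
    new_data_tb *= 2 ** (1.0 / data_doubling_years)

print("the link falls behind after about %d years" % years)
```

With these made-up numbers it comes out to roughly 8 years; tweak the
assumptions and you land in the same neighborhood as the decade I
mentioned.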
What can be done to change that? I have several ideas. I've
expounded on BitTorrent but that
helps the server and not so much the client. I'll talk about proxies
and smart fetching in the future, but I have another idea which is
more intriguing: User Mode Linux.
I mean that as a synecdoche for any sort of virtual machine. Once
upon a time, in the simpler days of the Internet, everyone was friendly
and it was easy to get accounts on different machines. For example,
the group I was in in grad school had some special compute hardware
for doing molecular dynamics. They published a paper describing the
system and included a user name and password for the machine so anyone
could try it out themselves.
It would be nice if everyone were nice. Think about what you could do
if you could log on to a machine at NCBI with all the data on a local
disk or database, and with lots of scratch space and CPU power. You
could very easily make your own BLAST-able subset of GenBank, or
create a specialized index of some properties you found interesting.
Only the results of a search would need to be sent back to you, and
almost certainly that will be much smaller than the whole database.
And when NCBI updates there are no real bandwidth problems, since they
could propagate the data over a local high-speed network.
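As a sketch of what that might look like in practice, here is roughly
what you could run on the remote machine, next to the data. It assumes
the classic NCBI tools (formatdb and blastall) are installed there; the
file and database names are made up for illustration.

```python
# Build a BLAST-able database from a subset of the local GenBank copy,
# search it, and keep only the (small) result file to copy home.
# Assumes the classic NCBI toolkit (formatdb/blastall) is on the path;
# all file names below are placeholders.
import subprocess

subset = "my_subset.fasta"     # sequences extracted from the local copy
query = "query.fasta"          # your query, uploaded over the slow link
results = "hits.txt"           # the only file that travels back to you

# format the subset as a nucleotide BLAST database
subprocess.check_call(["formatdb", "-i", subset, "-p", "F",
                       "-n", "my_subset"])

# search it; only the hits need to cross the network afterwards
subprocess.check_call(["blastall", "-p", "blastn", "-d", "my_subset",
                       "-i", query, "-e", "1e-5", "-o", results])
```

The query travels up and the hits travel back; the gigabytes of
sequence data never cross the slow link.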
Not everyone is nice. People hijack others' computers for distributed
denial of service attacks, for sending junk email, for spying, and
simply for fun. Making my idea work requires some way to let foreign
code operate safely on a local machine, with some severe
restrictions on what it can do. One option is Java's security sandbox
but that requires everything be written in Java, and there is too much
existing code in Perl, C, C++, Python, Fortran and other languages to
make that a real solution. Another is a chroot'ed jail, but it's hard
to get a rich Unix environment working that way. (E.g., how do you set
up a cron job?)
Instead, what if NCBI/Ensembl/whoever ran a virtual operating system
independent of the main OS, and let each user have root access to an
instance of that virtual OS? The main OS could be in charge of
resource limits (network connections, disk space, CPU time) but
otherwise leave the users free to install pretty much anything. Want
to upgrade to the CVS version of Python? Go ahead! Want to tweak the
system install of BLAST? Not a problem.
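As one small illustration of the host side of that bargain, here is a
Python sketch of per-process limits a provider might enforce before
handing control to a guest job. A real virtual-OS setup would impose
this at the VM level; the specific numbers and the job script name are
assumptions.

```python
# Sketch of host-enforced resource limits (CPU time, file size, open
# files) applied to a guest job before it runs.  Unix-only; the limits
# and the script name are placeholders.
import resource
import subprocess

def set_limits():
    resource.setrlimit(resource.RLIMIT_CPU, (3600, 3600))                # 1 CPU-hour
    resource.setrlimit(resource.RLIMIT_FSIZE, (10 * 2**30, 10 * 2**30))  # 10 GB files
    resource.setrlimit(resource.RLIMIT_NOFILE, (256, 256))               # open files

# run the guest's job under those limits
subprocess.check_call(["./run_guest_job.sh"], preexec_fn=set_limits)
```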
That gets the program closer to the data, but to be useful the compute
provider also needs to organize the data for consistent access. At
the start each site will have documents describing the filesystem
layout, database schemas, and internal services. For portability,
people will develop a naming scheme and translation layer so a program
can fetch needed data from anywhere. This could happen even without
my proposal; the O|B|F made an attempt during the hackathon. But I
think currently the need isn't pressing enough.
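A hypothetical sketch of what that translation layer could look like:
a program asks for a data set by a portable name, and each site maps
the name onto its own layout. All of the names and paths here are
invented for illustration.

```python
# Hypothetical translation layer: portable data-set names on the left,
# each site's own filesystem layout on the right.  Everything here is
# made up for illustration.
SITE_LAYOUTS = {
    "ncbi":    {"genbank/nt": "/data/blast/nt"},
    "ensembl": {"genbank/nt": "/mirror/ncbi/blast/nt"},
}

def resolve(site, dataset):
    """Translate a portable data-set name into this site's local path."""
    try:
        return SITE_LAYOUTS[site][dataset]
    except KeyError:
        raise ValueError("site %r does not publish %r" % (site, dataset))

# the same call works no matter which compute provider the code runs at
print(resolve("ncbi", "genbank/nt"))
```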
There I go again, deep into the details. The overall idea is to make
it easier for people to create and run data-intensive programs. The
administrative work is handled by the compute service provider and the
intermediate bandwidth limitations have much less impact.
It's not likely to happen though. If there's anonymous access then
it'll be used by people to find Mersenne primes or crack passwords.
Instead the compute providers will need to issue accounts. A virtual
OS does provide more defense-in-depth and lets users reconfigure the
OS to fit personal needs. I just don't see those as being big enough
advantages to warrant the administrative overhead, at least for
the next 5 years.
Topics to think about:
- Is the idea of using virtual OSes a distraction? What does it add
  that a secure OS like FreeBSD can't provide? (Support for software
  that really, really wants to be installed under /usr, like binary
  RPMs?)
- Who now has bandwidth problems? Is it really getting worse?
- Will people really want all the world's sequence data?
- At some point the sequence data will stop growing exponentially and
  bend into an S-curve. At some point we will have sequenced every
  organism on the planet. When will the inflection point be?
- What software developer can't set up a local database mirror?
  (Students and those using people-friendly languages like Python?)