This post originated from an RSS feed registered with Python Buzz
by Andrew Dalke.
Original Post: Visual dataflow programming
Feed Title: Andrew Dalke's writings
Feed URL: http://www.dalkescientific.com/writings/diary/diary-rss.xml
Feed Description: Writings from the software side of bioinformatics and chemical informatics, with a heaping of Python thrown in for good measure.
I've been thinking about Michel Sanner's
ViPEr system. It's a
visual dataflow system for structural biology, akin in some respects to
general purpose tools like
AVS, IBM's
Data Explorer or domain
specific tools like
Pipeline Pilot
for chemistry. I started from a point of skepticism and it seems I'm
still there.
I first got interested in dataflow systems back in college. I
remember talking to one of my professors about it, and I tried
sketching out on paper possible ways to make it work, but I never
managed to get good control flow in the system. It turns out there
are ways to do it, but they end up looking rather complicated.
I played around with IBM's Data Explorer a bit in the mid-90s.
Cornell Theory Center had an add-in package for doing structure
visualization. I went there for a two-day workshop with Dorina,
another grad student in the group. She's not a programmer but she can
write scripts. She was able to make some very nice depictions that I
couldn't touch with
VMD.
The main reason was its support for constructive solid geometry, which
is rare in structure programs but pretty common in other tools, but
part of it was the ease of changing the dataflow.
(There was an interesting demo at the CRBM workshop showing that CSG
really should be more common in structure visualization codes.)
Afterwards I looked for free codes for data-flow coding. The only one
at the time was
Khoros,
then at UNM and now distributed
through a company. It was designed for visualizations which can be
done on an array, like 2D images and 3D fluid flows, and couldn't
easily support data structures appropriate to molecules. (Michel
says it's still that way.)
Data Explorer went open source around 1998/1999 and I looked at it a
bit; got it to compile under IRIX and looked at the docs. I didn't
have anything to test it out on so that's where I stopped.
I also read some reports of people who used dataflow systems.
One striking critique was the difficulty of scaling. A small
system is easy to understand. When there are only a few nodes -
those built during demos - it's easy to see what's going on.
When a system gets large there are a few problems: it's hard
to distinguish the transformation nodes, the connection lines
dominate the canvas, and the graph simply gets too large to
display on a single canvas.
There are solutions, or perhaps just workarounds, for these. Each
node could be given a distinct shape, making it easier to see.
Pipeline Pilot does this - it looks like they paid a decent graphical
designer for their nodes. When the graph gets too large, a subgraph
can be collapsed into a single node, much like a function. (But what
shape is the new node? How can it be made distinctive?) This also
reduces the number of lines on the canvas, though if a subgraph
could be collapsed into a node then there probably weren't many
overlaps between its internal connections and the external ones to
begin with.
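Collapsing a subgraph into a node is the visual analogue of extracting a function in textual code. A toy Python illustration of the parallel (hypothetical code, not taken from any dataflow system):

```python
# Three chained steps written out inline -- the textual analogue of a
# canvas full of individual nodes:
values = [4, 9, 16]
roots = [v ** 0.5 for v in values]      # node 1: square root
scaled = [r * 2 for r in roots]         # node 2: scale by 2
total = sum(scaled)                     # node 3: sum

# "Collapsing the subgraph into one node" is just extracting a function:
def scaled_root_sum(values):
    return sum((v ** 0.5) * 2 for v in values)

assert total == scaled_root_sum(values)
print(total)  # 18.0
```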
When the graph gets larger the layout algorithms for the connections
become more complex. As I recall, Data Explorer reused algorithms
from circuit layouts, to try and reduce overlaps, but I still needed
to tweak them to make them easier to understand. Another solution is
to use transmitter/receiver nodes to connect regions even on different
canvases without drawing a line between them. This helps make the
system more module-like, but only somewhat. Also, if the canvas is
scaled to show everything at once then it's hard to make anything
out, and if you've zoomed in close enough to read it then you lose
track of what's in the connections that come in from off-screen.
There are other problems too, like standardizing how certain things
are arranged. You can think of this as making the transition from
spaghetti code to structured code. Indeed, many of these complaints
above have parallels to textual programming.
You could also argue that there are large-scale data-flow based
systems: circuits. My Dad did video broadcast engineering and could
look at a circuit board and tell what the different parts did, just
because of the standard way things were arranged and how they looked.
He designed studio layouts by drawing nodes and connecting them with
lines; originally on a sheet of paper and later using AutoCAD.
These are both suggestive, but I think that's as far as it goes. Even
in the first few years of computer languages, people were able to
write interesting programs. I know there's been decades of work on
visual programming languages, and believe that if they were good for
programming then people would be using them for some tasks; if only
from sheer obstinacy.
I take that back. In looking up research in visual programming
and workflows, I came across
an
undergrad review paper which concludes:
Despite the move toward graphical displays and interactions embodied
by VPLs, a survey of the field quickly shows that it is not worthwhile
to eschew text entirely. While many VPLs could represent all aspects
of a program visually, such programs are generally harder to read and
work with than those that use text for labels and some atomic
operations. For example, although an operation like addition can be
represented graphically in VIPR, doing so results in a rather dense,
cluttered display. On the other hand, using text to represent such an
atomic operation produces a less complicated display without losing
the overall visual metaphor.
You might also be interested in
some comments on the Joel on software site.
Michel's point though is that people don't need to or even want to
learn to program, so the inability to "represent all aspects of a
program visually" isn't a problem. Why introduce something with the
complexities needed for general programming?
I think that's an interesting viewpoint, but I don't know how
realistic it is. There are ways to introduce people to simple
programming, as with people who enter things in Excel, then write
formulas, then write simple VB functions. This has the advantage that
it doesn't dead-end like dataflow systems seem to do.
The dataflow user interface is also quite different than most apps
people use. In a normal UI you see the controls and some indication
of how they relate, perhaps from grouping or some sort of high-level
schematic. The innards are one big black box. With a dataflow
system, you change parameters by editing nodes in the network. You
can directly see the way those fields interact. The black box has
transparent walls. Is seeing that detail really helpful? Perhaps,
but I'm not convinced. (I can think of a system where the
nodes can be dragged into a GUI builder, which might help.)
I also believe most people who can't program do know someone who can
help configure systems, write macros, etc. This might be another
graduate student, a coworker, or technical support staff. (I've been
all of these. :) The helper needn't be a programmer; there are
plenty of places which have a house expert for making Excel macros
who isn't a software developer. So I don't think the lack of
programming skills in a given user is necessarily a problem.
But visual programming does appear tantalizing for some domains. It
seems to work best when it's strongly data-flow oriented, with only
a few different data types but many possible transformations, yet where
only a few (a dozen or two) transformations are used at a time. Image
analysis is the most obvious one, and most of the standard dataflow
applications have libraries for that.
The question though is the appropriateness of this paradigm for
chemistry or biology. I've seen several data-flow/pipeline projects
for bioinformatics, which has many similarities to the style of
analyses in chemical informatics. One is
Piper, from
bioinformatics.org,
which started in 1998 but is now discontinued.
When I was at ISMB in Edmonton a couple years ago, I saw
several different companies with dataflow products for
bioinformatics. I picked up the literature, and I'm leafing
through them now. Let's see:
Übertool from science-factory.com - web site
now says "... discontinue its operations as of August 31, 2003 due to
insufficient funding and the inability to attract new investors". Too
bad. I rather liked this one. It looked pretty and implemented quite
a few algorithms.
Hyperthesis in gRNA from Helixense - doesn't seem to have
done much since ISMB 2002, according to their web site. At the
conference they said they had 23 employees then, so I'm suspecting
they aren't doing well. Oh, and they used Jython underneath.
There was another, but I don't see it in my notes.
When I think about the topic some more, I realize that the style of
doing analyses in chemical and bioinformatics is really not all
that different from other fields, at least when expressed in a visual
programming style. This suggests that if it's useful then there
should be more generic libraries for this, both open-source and
proprietary. I've been looking around and I can find very few.
Freshmeat told me about Taverna and xFloWS, but after an hour or two
all told I couldn't find much more. I really did expect to find a
commercial library for Windows using COM. I also looked for academic
work, but it seems to have been done in the 1980s and early 1990s, as
there's very little available from the last 10 years. The newsgroups
are also remarkably empty - a couple of posts a year!
I'm left with the conclusion that dataflow visual programming isn't
really that effective, despite what Pipeline Pilot and Michel argue.
But I could be wrong about that. I haven't used either system directly
nor seen people use Pipeline Pilot, so this whole argument is based
just on my general knowledge. Suppose I wanted to develop code like
Pipeline Pilot, that is, a visual dataflow system for chemical
informatics. It would need some way to read and manipulate chemical
structures. OpenEye's OEChem is a good choice for that, as is
Daylight, if some format conversion tools are available. It needs the
GUI framework. Michel's code should do that, but I must say it isn't
as aesthetically pleasing as PP's is. (PP looks very pretty!) It
needs ways to talk to different servers, but Python code can handle
that just fine. I don't think it would be a hard thing; perhaps a
couple of months, depending on what you want. However, the result
would not, IMO, be commercially viable.
It may be appropriate as an in-house or open source project. The
first really depends on a company's needs and the second, well, I just
don't have time for the second so it would require volunteers, and
most programmers like programming using text, not pictures.
BTW, there are some other alternatives to highly visual programming
languages. One is to use simple text. Start with the example graphic,
shown at http://www.scripps.edu/~sanner/images/work/ViperIntro.jpg. I'll
write code for a hypothetical Python API which produces the same
result.
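A minimal sketch of what such a hypothetical API might look like. Every name below is invented for illustration, and the node bodies are placeholders rather than real structure-handling code:

```python
# A minimal sketch of a hypothetical text-based dataflow API.
# Every name here is invented for illustration; the node functions
# are placeholders rather than real structure-handling code.

class Node:
    """One processing step: named input ports, a single output."""
    def __init__(self, name, func):
        self.name = name
        self.func = func
        self.inputs = {}              # input port name -> upstream Node

    def connect(self, port, upstream):
        self.inputs[port] = upstream

    def run(self):
        # Pull-style evaluation: run upstream nodes first, then this one.
        args = {port: node.run() for port, node in self.inputs.items()}
        return self.func(**args)

# Wiring up the same shape of network a visual canvas would show:
# a source node feeding two downstream transformations.
source = Node("read", lambda: [1.0, 2.0, 3.0])
scale = Node("scale", lambda data: [x * 10 for x in data])
scale.connect("data", source)
total = Node("sum", lambda data: sum(data))
total.connect("data", scale)

print(total.run())  # 60.0
```

The point isn't this particular API, only that the wiring a canvas shows as lines becomes a couple of `connect` calls in text.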
Personally, I find the text easier to understand, but that's in part
because I've been doing that for a couple decades. What my text
version doesn't do is provide a GUI. Something along the lines of PythonCard might be
a useful way to add that.
I've not used PythonCard, but the idea behind it is to make it easy to
develop the sorts of GUIs people expect, and something that's easy for
beginning programmers to use. And unlike ViPEr, the resulting UI
looks normal.
In any case, both of these are interesting projects and something I
would like to work on. If you're also interested in these ideas
and can fund us to help out,
contact me. We are
available for both consulting and custom software development.