Summary
When the information you need already exists, but it's scattered here and there around the web, you have an option. You can create a small, super-lightweight web app to put it together--a mashup. It's not quite as easy as falling off a log, but it's gotten to the point that end users can create their own applications.
Mashups are super-lightweight web apps that are created in minimum time,
with minimum code, from information and services that already exist on
the web. You want your email, RSS feeds, and calendar all in one page?
Smash the pieces into a web page that has the interface you want. Have
a web-accessible GPS locator in your company's delivery trucks? Combine
that feed with a map service like Google Maps to display each truck's
location in real time.
The possibilities are cool, and the prospect of doing them with very
little work is even cooler. So it was great to get an overview of
the technologies that make it all work.
Andreas Krohn of Kapow Technologies gave a great survey of the
technology landscape. This post summarizes his talk, mashing it together
with a talk given by Sean Brydon, Greg Murray, and
Mark Basier of Sun Microsystems, as well as one given by Dave Johnson
(also from Sun). (The latter talks went deeper into the technologies
and provided useful insights into security considerations. But mostly
they introduced the most important buzzwords to know.)
Software development tends to be an expensive, time-consuming process.
So only the most critical projects get implemented. There are a
limited number of them, but since they are used by many people, they
justify the investment in a system that enhances reliability and scalability,
like a Service-Oriented Architecture (SOA)--even if it is more complex
and harder to use.
On the other hand, many people have a need for small, single-purpose
applications that may not do much more than put information together
for their purposes. Those small apps rarely get developed, because
coding resources are scarce. There are very few users for each of
those apps, but there are a large number of possible applications. When
plotted on a curve, the number of possible applications continues
to infinity as the number of users diminishes to one.
Andreas pointed out that mashups help to address that "long tail"
at the end of the graph. All development enhancements do that
to some degree, of course. But the goal of mashup technology is
to get to the point that users can do it for themselves.
Let's say you want to put your hotel reservations, a map
of their locations, and a calendar with your travel dates,
all on one page. The goal is to see the information you
need, all in one place, gathering it from wherever it
happens to exist.
To create a mashup, you need:
Web page GUI components and a GUI Builder
A communication mechanism and a data format
One or more data sources you can access
Optionally, a data repository you can interact with
The GUI components display data and give you a way
to make selections. The communication mechanism
goes out to the web, delivering a package of information
in a given data format. The data sources deliver
one-way information, while the optional repository
gives you a way to store information you want to save.
In a moment, we'll look at the most common technologies
used in each area. First, let's take a look at some
all-in-one mashup builders that let you create a mashup
without writing a line of code.
These are a fast way to see what a mashup could be.
Google Gadgets
Google Gadgets are mostly display-only widgets, some of which let you specify
filtering criteria or add information (like the calendar). They're pre-connected
to a data service, and they provide bits of JavaScript you can drop into a web
page. (So far, I haven't figured out how to keep them from overwriting each
other, but I'm sure I'll figure it out eventually.)
Yahoo Pipes
With Yahoo Pipes, you drag and drop components onto a page, identify
the RSS feeds you want to use for your data source, and then specify sorting
and selection criteria (which you can attach to fields and other GUI components)
to control the information you see.
You can also configure your mashup to publish the information it gathers,
delivering information in other formats such as Atom and JSON.
The Pipes system is limited to RSS feeds, so if you want to access additional
data sources, or perhaps add special functionality (like dragging items
to the calendar), then you'll need to add some code.
Teqlo
This award-winning system uses Java technology. It only works with Firefox
2.0, but it provides multiple kinds of widgets--including RSS readers, todo
lists, a calendar, and others. Perhaps even better than the GUI building tool
is the fact that the widgets can talk to each other, so you can drag and
drop information from one widget to another. (You could drag an entry into
the calendar, for example, or drag a calendar entry into the map to see
where it's located.)
QEDWiki
I stumbled across this item while researching the others. IBM's QEDWiki
is a PHP-based system that combines Wiki building and mashup construction
in a single system that both coders and end users can customize to get the
behaviors they want.
If you want to start playing with these technologies and see what kind of mashups
you can construct, head over to the Resources section
now. To find out how to turn normal web pages into data sources you can use
in your mashups, read the next section. Following that, you'll find more information
on the underlying technologies.
When the information you need is on the web, but it isn't
in a form a mashup can use, there is a solution: use a
Mashup Enabler to convert the information into
usable form.
Microformats
Embedding microformat tags in an HTML page makes it possible to extract
its contents in XML form. This kind of enabling requires cooperation from
the information producer, but it's relatively easy to do, and the data tagging
will remain valid even when the page layout changes.
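For example, here's a sketch of an hCard, one of the common microformats.
The class names carry the semantics, so a parser can extract the contact
data no matter how the page is styled (the name and company are made up):

    <div class="vcard">
      <span class="fn">Bob Smith</span>,
      <span class="org">Example Corp</span>
    </div>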
Kapow and OpenKapow
Web scrapers like Kapow and OpenKapow pull data from an HTML page and
turn it into a data feed a mashup can use. (OpenKapow is the free version.)
Andreas Krohn of Kapow Technologies demonstrated the process:
Download and run openkapow
Tell it to build a new service
Specify the URL the data comes from
Tell it what to search for in the page
Tell it which items to include in the output
Do an initial search
Specify a loop to output multiple items
Use menu items to extract pieces
The risk with web scraping, of course, is that the data format you're
scraping could change. But the reward is that you get the app you want.
The risk/reward ratio depends on the time required to create such an
app. With the all-in-one mashup-building systems that serve "the long
tail", the ratio becomes favorable to the point that it's worth setting
up a web scraper to access critical bits of data, even if the scraper has
to change once in a while.
Note: Other services in the web scraping category include
Dapper, Google Data, and the Java Mozilla HTML parser.
What follows is a whirlwind tour of the technology buzzwords mentioned in
the talks. My notes are sketchy at points, but should serve as a decent guide
to the process.
In the old days, you did a lot of programming to create a GUI. But in the
web era, you assemble pre-built components, wiring them to data feeds. The
code you write--if any--is minimal.
The components themselves are built using AJAX, of course. That means JavaScript.
But the tricky bit is the differences in the Document Object Model (DOM)
that different browsers create for their web pages.
AJAX component libraries attempt to account for those differences. The
degree to which they're successful determines how robust and reliable they
are.
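For instance, even creating the request object used to mean branching on the
browser. Here's a sketch of the classic idiom that the libraries hide for you:

    // Classic cross-browser idiom for creating the AJAX request object
    function createRequest() {
        if (window.XMLHttpRequest) {
            return new XMLHttpRequest();                    // Firefox, Safari, Opera, IE7
        } else {
            return new ActiveXObject("Microsoft.XMLHTTP");  // older IE
        }
    }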
Libraries built of AJAX components include:
Prototype
script.aculo.us
Dojo
jMaki, where the "j" stands for JavaScript, and "Maki"
is Japanese for "wrapper", or "container".
But even when you have the best of libraries, it takes a fair amount of
work to wire them up and lay them out. That's where GUI builders come in.
For a serious enterprise app that will have many users on different browsers,
one of the commercial mashup builders may make sense, for the sake of increased
reliability and timely support:
BackBase
NexaWeb
On the other hand, when you're creating something for yourself, one of
the open source builders may work well enough to do what you need on the
browser you use regularly:
Google Web Toolkit (GWT): Java classes that generate
JavaScript, so you can write cleaner code and use the compiler to help
detect errors. With GWT you can browse the libraries, take advantage
of code completion, refactor with the IDE, do unit testing, and rapidly
cycle between coding and testing. Perhaps more importantly, the GWT
libraries were designed to optimize the end-user experience first, and only
then, where possible, optimize the developer experience. (There's a sketch
of a GWT entry point below.)
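As a sketch of what that looks like, here's a minimal GWT entry point
(the class name and widget choices are mine, not from the talks):

    import com.google.gwt.core.client.EntryPoint;
    import com.google.gwt.user.client.ui.Button;
    import com.google.gwt.user.client.ui.Label;
    import com.google.gwt.user.client.ui.RootPanel;

    // Compiled to JavaScript by GWT; you write and test it as ordinary Java.
    public class MashupEntryPoint implements EntryPoint {
        public void onModuleLoad() {
            Label status = new Label("No feed loaded yet");
            Button load = new Button("Load feed");   // wire a click handler here
            RootPanel.get().add(load);
            RootPanel.get().add(status);
        }
    }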
An XML microformat embedded in an HTML page is one kind of data format,
of course, as are the more widely known RSS and Atom formats. This section
compares those formats, along with Javascript Object Notation (JSON).
RSS
The standard syndication formats, intended for one feed and many listeners,
are RSS and Atom. Of the two, RSS is the more widely known, but it has
many difficulties:
RSS 1.0 and RSS 2.0, despite having similar names, are backed by
two entirely different groups, and are fundamentally incompatible.
Several variants of RSS 1.0's predecessors are still in common use,
as well, from 0.92 through 0.94.
RSS is a very loose standard, so successfully parsing one feed
doesn't guarantee success with a different one. (For example, the spec doesn't
specify which fields could contain HTML escaped into text form, and
there is no support for summaries.)
RSS only covers the transmission of text. It doesn't handle multimedia,
binaries, or HTML.
Atom
The Atom specification rectifies those problems:
Atom 1.0 is a single, well-specified IETF standard.
It handles binaries and HTML, as well as text.
There is a publishing protocol that can be used to create and update
feeds, as well as a client protocol to consume them.
JSON
Individual XML formats and standard variants like RSS and Atom are
great for data exchange, but they add a fair amount of overhead for a
simple message. JavaScript Object Notation is a much simpler mechanism
for sending small amounts of data. It's a lot easier to parse--all it
takes is one line of JavaScript, which makes it ideal for an AJAX client.
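That one line is typically an eval. (A sketch; jsonText is assumed to hold
the response text, and eval executes whatever it's given, so this is only
for data sources you trust--more on that in the security notes below.)

    var obj = eval("(" + jsonText + ")");  // parens make the object literal an expression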
A client that expects JSON-format data uses the following URL convention
to express that preference:
http://someURL?format=json
The structure that comes back consists of:
Name/value pairs, separated by colons, where strings are surrounded
by quotes:
"id": 5
"name": "Bob"
Multiple pairs in an object, separated by commas, where the object
is defined by braces:
{"id": 5, "name": "bob" }
Multiple objects in an array, separated by commas, where the array
is defined by square brackets:
    [ { ... },
      { ... } ]
When delivering JSON, the server sets the MIME type with a call like
this one:
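    // A sketch, servlet-style. "text/json" matches the example later in
    // this post, though "application/json" is also widely used.
    response.setContentType("text/json");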
Within strings, special characters like quotes, backslashes, slashes,
and control characters need to be escaped.
JSONP
In an interesting augmentation of basic JSON functionality, the client
can append a callback name as a parameter:
http://...?jsonp=someFunction
When the server sees that parameter, it wraps the JSON data in
a call to the named function, returning:
someFunction(...JSON data...)
So instead of doing an XMLHttpRequest, the client can use JSONP.
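In practice, "using JSONP" means injecting a script tag. A sketch (the URL
is hypothetical; someFunction is the callback from the example above):

    // The server responds with someFunction({...}), which the browser executes.
    function someFunction(data) {
        alert("Got: " + data.name);    // handle the delivered data here
    }
    var script = document.createElement("script");
    script.src = "http://example.com/data?jsonp=someFunction";
    document.getElementsByTagName("head")[0].appendChild(script);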
Note on innerHTML:
Once you've obtained objects from your data source, the easiest way to insert
them into the web page so they can be viewed is to set the innerHTML property
of the element you want to modify:
Easier for cross-browser code than DOM manipulation, which has major differences
between IE and other browsers
Handles embedded HTML
Simpler to create/maintain
You can create a string and assign it to an element's innerHTML property.
For example:
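    // A sketch: "items" is assumed to hold objects parsed from your data
    // source, and "results" is a hypothetical element in the page.
    var html = "<ul>";
    for (var i = 0; i < items.length; i++) {
        html += "<li>" + items[i].name + "</li>";
    }
    html += "</ul>";
    document.getElementById("results").innerHTML = html;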
The standard communication mechanisms are REST and the Atom Publishing Protocol
(APP), which publishes Atom--along with straight HTTP requests to get
RSS, JSON, and microformat data.
REST
REST stands for "REpresentational State Transfer". (Huh?
What does that mean?) The standard definition goes on to state, "representations
embody state". (Sorry, still no clue.) But at bottom, REST is really
simple, and almost obvious. So it's worth knowing about. To help, here's
an acronym that should be easier to remember:
Really Easy System for Transmitting (web data)
All REST is, in the end, is standard HTTP requests. HTTP requests
have been around for a long time, in fact. But someone finally decided
to use all of them, not just GET and POST.
With REST, you always specify a resource with a URI. You then implement
standard CRUD functions (create, read, update, delete) with the equivalent
HTTP requests:
HTTP POST (create)
HTTP GET (read)
HTTP PUT (update)
HTTP DELETE (delete)
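To make that concrete, here's a sketch of the "read" case in Java
(the resource URI is hypothetical):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class RestRead {
        public static void main(String[] args) throws Exception {
            // The resource is named by a URI; "read" is an ordinary HTTP GET
            URL url = new URL("http://example.com/entries/5");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("GET");
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()));
            for (String line; (line = in.readLine()) != null; ) {
                System.out.println(line);
            }
            in.close();
        }
    }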
When you combine REST with Javascript callback functions and JSON data,
you get a lot of power at very low cost, even if the server is just
pushing data to your client.
But when the server is ready to listen to POST, PUT, and DELETE requests,
as well as GETs, you get even more power. And that is the basis of the
Atom Publishing Protocol.
Atom Publishing Protocol (APP)
The APP is a REST-based publishing mechanism for Atom format data. In
addition to providing CRUD functions with POST, GET, PUT, and DELETE, APP
provides for user authentication, which is critical in any such scenario.
Atom is a generic data format, of course. It's not just for blogs. But
in addition to allowing any kind of data in an entry, Atom also allows
for CRUD operations on collections of entries. In fact, it even includes
next/previous links in its collections.
To publish an entry, you post to a collection URI. To read one, you use
GET. Then you edit and use PUT to replace it.
Libraries that support APP include:
ROME: A strong Java library that handles RSS and
Atom (discussed below)
Abdera: A StAX-based, Atom-only parser in Java
Google Data API: Google's library.
Note: Other entries in this category include the Universal
Feed Parser (Python) and the Windows-only Windows RSS Platform built into
IE7.
Now that you've seen the choices for data formats and communication mechanisms,
the possibilities for a data source are fairly easy to understand (any of
which could be generated by a mashup enabler):
RESTful service
RSS or Atom feed (RSSBus, Grazr, Java ROME)
The choices for a data repository are equally simple:
RESTful service (especially one that supports APP)
The libraries that make the most sense for coders on the Java platform include
JSR 311, ROME, and ROME Propono.
JSR 311
This JSR is a Java API for RESTful web services. It's intended to make
coding simpler and encourage good RESTful style.
ROME
This toolkit is a DOM-based parser/generator in Java. Arguably the most
capable toolkit for RSS and Atom on the Java platform, it parses and generates
all forms of RSS and Atom. It's based on JDOM, and is both pluggable and
extensible.
To consume an arbitrary feed with ROME, create a SyndFeed object, read
the data, and then convert it to Atom (or RSS, if you must). The code
to create a new SyndFeed looks something like this:
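    import com.sun.syndication.feed.synd.SyndFeed;
    import com.sun.syndication.io.SyndFeedInput;
    import com.sun.syndication.io.XmlReader;

    // A sketch of ROME's classic API; "feedUrl" is a java.net.URL you supply.
    SyndFeed feed = new SyndFeedInput().build(new XmlReader(feedUrl));
    feed.setFeedType("atom_1.0");   // normalize to Atom 1.0 (or an RSS type, if you must)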
In general, you want to think syndication, not coordination. In other words,
provide a RESTful feed, and let others mash it up the way they want to. (Instead
of having long meetings where you make sure the provider and consumer interfaces
match up.)
You want to provide one or more of:
APIs
A RESTful web service
Atom or RSS feeds
But to make sure that clients can make use of your service, you also want
to provide:
Client-side javascript libraries
A client-side widget
Client-side CSS
Documentation for the API
An example showing how it all fits together
You also want to set up a server-side proxy-style service, where:
The browser gets a web page from the host server
The web page gets a JavaScript file and CSS from the mashup server
The JavaScript gets data from the mashup database server
(The client can also use cross-site scripting to pull together data from
multiple sources.)
Ideally, the server should also be prepared to deliver data in multiple formats,
including JSON, JSONP, and XML:
if ("jsonp".equals(format) ...
response.write("callback" + jsonp_value + content)
else if ("json".equals(format) ...
response.setContentType("text/json")
else
response.setContentType "text/xml")
write ...
For performance, you'll also want to do as much caching as you can:
Client-side cache via HTTP Conditional GET (sketched after this list)
Proxy server cache via HTTP headers
Server side cache using your favorite cache technology
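As a sketch of that first item, a servlet can honor a conditional GET like
this (assuming you track a lastModified timestamp for the data):

    // If the client's cached copy is current, send 304 and no body.
    long since = request.getDateHeader("If-Modified-Since");  // -1 if absent
    if (since != -1 && since >= lastModified) {
        response.setStatus(HttpServletResponse.SC_NOT_MODIFIED);
        return;
    }
    response.setDateHeader("Last-Modified", lastModified);
    // ...then write the response body as usual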
And you'll also want to set things up so that applications and browsers like
Firefox, Safari, and IE can auto-discover your feeds, using HTML that looks
roughly like this:
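    <!-- A sketch; the titles and hrefs are hypothetical -->
    <link rel="alternate" type="application/atom+xml"
          title="My feed (Atom)" href="http://example.com/feed.atom" />
    <link rel="alternate" type="application/rss+xml"
          title="My feed (RSS)" href="http://example.com/feed.rss" />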
The information in this section came from the Java Blueprints folks--Sean
Brydon, Greg Murray, and Mark Basier--who turn out to have a fair amount of
expertise on the subject. Any inaccuracies here resulted from my limitations,
not theirs.
To start with, use the data format that is appropriate for your level of
exposure:
JSON: When you're in the same domain as the data provider.
(You could get arbitrary Javascript, and that Javascript will execute,
so you want to be inside your firewall.)
JSONP: When you are inside or outside of your domain.
(You're specifying the javascript to execute with this format, so it's
safer. That makes this the easiest and most portable option.)
XML: When you're getting data from outside your domain,
or when the client isn't using Javascript, which would let it easily parse
JSON data.
Securing JSON:
When using JSON, any javascript could be coming your way, so
you need a certain level of trust in your data source, or added levels
of security protection:
Use a namespace for your javascript commands
Use CSS for customization
Don't add to the prototype of common objects
Securing your services:
Create a token:
Create a file containing the API key. (It's secure because it can't be edited
with JavaScript.)
Use a session-based hash to make sure that a session has been established
(so the request isn't coming in at random):
Create an API key generated from the URL, using a one-way hash
The user registers to get the API key (a very long string)
The key is attached to the GET request
The host name is mapped against the hash for access. (A sketch of such a
one-way hash follows.)
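Here's that one-way hash sketched in Java. (The algorithm choice and the idea
of mixing in a server-side secret are my assumptions; the talk didn't specify.)

    import java.security.MessageDigest;

    public class ApiKeys {
        // Derive an API key by one-way hashing the host name plus a secret.
        public static String apiKey(String host, String secret) throws Exception {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] digest = md.digest((host + secret).getBytes("UTF-8"));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));   // hex-encode each byte
            }
            return hex.toString();
        }
    }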