Cool Tools and Other Stuff
JavaOne 2007, #3: Mashup a Web App with Little or No Code
by Eric Armstrong
May 14, 2007

Summary
When the information you need already exists, but it's scattered here and there around the web, you have an option. You can create a small, super-lightweight web app to put it together--a mashup. It's not quite as easy as falling off a log, but it's gotten to the point that end users can create their own applications.

Contents

What's a Mashup?

The Long Tail

Structure of a Mashup

Mashup Builders: Google Gadgets, Yahoo Pipes, Teqlo, QEDWiki

Mashup Enablers: Microformats & Web Scrapers (OpenKapow, Kapow, Dapper)

Mashup Technologies

Web page GUI components and GUI Builders: GWT, Xap, NetBeans

Data Formats: Microformats, RSS, Atom, JSON, JSONP

Communication Mechanisms: REST, APP

Data Sources and Repositories

Coding Libraries: JSR 311, ROME, ROME Propono

Service-Building Strategies

Security Considerations

Resources

What's a Mashup?

Mashups are super-lightweight web apps that are created in minimum time, with minimum code, from information and services that already exist on the web. You want your email, RSS feeds, and calendar all in one page? Smash the pieces into a web page that has the interface you want. Have a web-accessible GPS locator in your company's delivery trucks? Combine that feed with a a map service like Google maps to display the truck's location in real time,

The possibilities are cool, and the prospect of doing them with very little work is even cooler. so it was great to get an overview of the technologies that make it all work.

Andreas Krohn of Kapow Technologies gave a great survey of the tchnology landscape. This post summarizes his talk, mashing it together with a talk given by Sean Brydon, Greg Murray, and Mark Basier of Sun Microsystems, as well as one given by Dave Johnson (also from Sun). (The latter talks went deeper into the technologies and provided useful insights into security considerations. But mostly they introduced the most important buzzwords to know.)

The Long Tail

Software development tends to be an expensive, time-consuming process. So only the most critical projects get implemented. There are a limited number of them, but since they are used by many people, they justify the investment in a system that enhances reliability and scalability, like the Service-Oriented Architecture (SOA)--even if it is more complex and harder to use.

On the other hand, many people have a need for small, single-purpose applications that may not do much more than put information together for their purposes. Those small apps rarely get developed, because coding resources are scarce. There are a very small number of users for those apps, but there are a large number of applications. When plotted on a curve, there the number of possible applications continues to infinity as the number of users diminishes to one:

      | *
  ^   |  *
  |   |   *
Users |    *
      |     *
      |       *
      |         *
      |            *
      |                *
      |                    *
      |                         *
      |                              *
      |                                   *   *   *
      +--------------------------------------------
                   Applications -->

Andreas pointed out that mashups help to address that "long tail" at the end of the graph. All development enhancements do that to some degree, of course. But the goal of mashup technology is to get to the point that users can do it for themselves.

Structure of a Mashup

Let's say you want to put your hotel reservations, map of their locations, and a calender with your travel dates, all on one page. The goal is to see the information you need, all in one place, gathering it from whereever it happens to exist.

To create a mashup, you need:

Web page GUI components and a GUI Builder

A communication mechanism and a data format

One or more data sources you can access

Optionally, a data repository you can interact with

The GUI components display data and give you a way to provide selection. The coommunication mechanism goes out to the web, delivering a package of information in a given data format. The data sources deliver one-way information, while the optional repository gives you a way to store information you want to save.

In a moment, We'll look at The most common technologies used in each area. First, let's take a look at some all-in-one mashup builders that let you create a mashup without writing a line of code.

Mashup Builders

Google Gadgets

These are fast way to see what a mashup could be. Google Gadgets are mostly display-only widgets, some of which let you specify filtering criteria or add information (like the calendar). They're pre-connected to a data service, and they provide bits of javascript you can drop into a web page. (So far, I haven't figured out how to keep them from overwriting each other, but I'm sure I'll figure it out, eventually)

Yahoo Pipes

With Yahoo Pipes, you drag and drop components onto a page, identify the RSS feeds you want to use for your data source, and then specify sorting and selection criteria (which you can attach to fields and other GUI components) to control the information you see.

You can also configure your mashup to publish the information it gathers, delivering information in other formats such as Atom and JSON.

The Pipes system is limited to RSS feeds, so if you want to access additional data sources, or perhaps add special functionality (like dragging items to the calendar), then you'll need to add some code.

Teqlo

This award-winning system uses Java technology. It only works with Firefox 2.0, but it provides multiple kinds of widgets--including RSS readers, todo lists, calendar, and others. Perhaps even more than the GUI building tool is the fact that the widgets can talk to each other, so you can drag and drop information from one widget to another. (You could drag an entry into the calendar, for example, or drag a calendar entry into the map to see where it's located.)

QEDWiki

I stumbled across this item while researching the others. IBM's QEDWiki is a PHP-based system that combines Wiki building and mashup construction in a single system that both coders and end users can customize to get the behaviors they want.

If you want to start playing with these technologies and see what kind of mashups you can construct, head over to the Resources section now. To find out how to turn normal web pages into data sources you can use in your mashups, read the next section. Following that, you'll find more information on the underlying technologies.

Mashup Enablers

When the information you is on the web, but it isn't in a form a mashup can, there is a solution: Use a Mashup Enabler to convert the information into usable form.

Microformats

Embedding microformat tags in an HTML page makes it possible to extract its contents in XML form. This kind of enabling requires cooperation from the information producer, but it's relatively easy to do, and the data tagging will remain valid even when the page layout changes.

Kapow and OpenKapow

Web Scrapers like Kapow and OpenKapow pull data from an HTML page and turns it into a data feed a mashup can use. (OpenKapow is the free version.) Andreas Krohn of Kapow Technologies demonstrated the process:

Download and run openkapow

Tell it to build a new service

Specify the URL the data comes from

Tell it what to search for in the page

Tell it which items to include in the output

Do an initial search

Specify a loop to output multiple items

Use menu items to extract pieces

The risk with web scraping, of course, is that the data format you're scraping could change. But the reward is that you get the app you want.

The risk/reward ratio depends on the time required to create such an app. With the all-in-one mashup build systems that serve "the long tail", the ratio becomes favorable to the point that it's worth setting up a web scraper to access critical bits of data, even if it has to change once in a while.

Note: Other services in the web scraping category include Dapper, Google Data, and the Java Mozilla HTML parser.

Mashup Technologies

What follows is a whirlwind tour of the technology buzzwords mentioned in the talks. My notes are sketchy at points, but should serve as a decent guide to the process.

Web page GUI components and GUI Builders

In the old days, you did a lot of programming to create a GUI. But in the web era, you assemble pre-built components, wiring them to data feeds. The code you write--if any--is minimal.

The components themselves are built using AJAX, of course. That means Javascript. But the tricky bit is the differences in the document object structure (DOM) that different browsers create for their web pages.

AJAX component libraries attempt to account for those differences. The degree to which they're successful determines how robust and reliable they are.

Libraries built of AJAX components include:

prototype

scrip.aculo.us

Dojo

jMaki, where the "j" stands for JavaScript, and "Maki" is Japanese for "wrapper", or "container".

But even when you have the best of libraries, it takes a fair amount of work to wire them up and lay them out. That's where GUI builders come in.

For a serious enterprise app that will have many users on different browsers, one of the commercial mashup builders may make sense, for the sake of increased reliability and timely support:

BackBase

NexaWeb

On the other hand, when you're creating something for yourself, one of the open source builders may work well enough to do what you need on the browser you use regularly:

Google web Toolkit (GWT): Java classes that generate Javascript, so you can write cleaner code and use the compiler to help detect errors. With GWT you can browse the libraries, take advantage of code completion, refactor with the IDE, do unit testing, and rapidly cycle between coding and testing. Perhaps more importantly, the GWT libraries were designed to optimize the end user experience and only then, where possible, optimize the developer experience.

XAP: The open source version of NexaWeb

NetBeans: for jMaki

Data Formats

Microformats

An XML microformat embedded in an HTML page is one kind of data format, of course, as are the more widely known RSS and Atom formats. This section compares those formats, along with Javascript Object Notation (JSON).

RSS

The standard syndication formats, intended for one feed and many listeners, are RSS and Atom. Of the two, RSS is the more widely known, but it has many difficulties:

RSS 1.0 and RSS 2.0, despite having similar names, are backed by two entirely different groups, and are fundamentally incompatible.
Several variants of RSS 1.0's predecessors are still in common use, as well, from 0.92 through 0.94.
RSS is a very loose standard, so succesfully parsing one standard doesn't guarantee success with a different one. (For example, it doesn't specify which fields could contain HTML escaped into text form, and there is no support for summares.)
RSS only covers the transmission of text. It doesn't handle multimedia, binaries, or HTML.

Atom

The Atom specification rectifies those problems:

ATOM 1.0 is a single, well-specified IETF standard.
It handles binaries and HTML, as well as text.
There is a publishing protocol that can be used to create and update feeds, as well as a client protocol to consume them.

JSON

Individual XML formats and standard variants like RSS and Atom are great for data exchange, but they add a fair amount of overhead for a simple message. JavaScript Object Notation is a much simpler mechanism for sending small amounts of data. It's a lot easier to parse--all it takes is one line of Javascript, which makes it ideal for an AJAX client.

A client that expects JSON format data uses tbe following URL convention to express that preference:

http://someURL?format=json

The structure that comes back consists of:

Name/value pairs, separated by colons, where strings are surrounded by quotes:

"id": 5
"name": "Bob"

Multiple pairs in an object, separated by commas, where the object is defined by braces:

{"id": 5, "name": "bob" }

Multiple objects in an array, separated by commas, where the array is defined by square brackets:
```
[ {...},
  {....
  }
]
```

When delivering JSON, the server sets the MIME type with a call like this one:

response.setContentTye("appication/json;charset=UTF-8")

Within strings, special characters like quotes, backslashes, slash characters, and control chars need to be escaped.

JSONP

In an interesting augmentation of basic JSON functionality, the client can append a callback name as parameter:

http://...?jsonp=someFunction

When the server sees that command, it surrounds the JSON data with a call to the function, returning:

someFunction(...JSON data...)

So instead of doing an XMLHttpRequest, the client can use JSONP.

Note on innerHTML:
Once you've obtained objects from your data source, the easist way to insert them into the web page so they can be viewed is by using the innerHTML function call on the element you want to modify:

Easier for cross-platform browsers than DOM, which has major differences between IE and other browsers
Handles embedded HTML
Simpler to create/maintain
You can create a string and assign it to an iterational property. For example:
```
<span onmouseover='alert(document.cookie);'>Data</span>
```

Communication Mechanisms

The standard communication mechansims are REST, the Atom Publishing Protocol (APP) to publish Atom, RSS, JSON, and microformats, as well as straight HTTP requests to get such data.

REST

REST stands for "REpresentational State Transfer". (Huh? What does that mean?) The standard definition goes on to state, "representations embody state". (Sorry, still no clue.) But at bottom, REST is really simple, and almost obvious. So it's worth knowing about. To help, here's an acronym that should be easier to remember:

Really Easy System for Transmitting (web data)

All REST is, in the end, is standard HTTP requests. The HTTP requests have been around for a long time, in fact. But someone finally decided to use them.

With REST, you always specify a resource with a URI. You then implement standard CRUD functions (create, read, update, delete) with the equivalent HTTP requests:

Http POST

Http GET

Http PUT

Http DELETE

When you combine REST with Javascript callback functions and JSON data, you get a lot of power at very low cost, even if the server is just pushing data to your client.

But when the server is ready to listen to post, put, and delete requests, as well as GETs, you get even more power. And that is the basis of the Atom Publishing Protocol.

Atom Publishing Protocol (APP)

The APP is a REST-based publishing mechanism for Atom format data. In additon to providing CRUD functions with post, get, put, and delete, APP provides for user authentication, which is critical in any such scenario.

Atom is a generic data format, of course. It's not just for blogs. But in addition to allowing any kind of data in an entry, Atom also allows for CRUD operations on collections of entries. In fact, it even includes next/previous links in its collections.

To publish an entry, you post to a collection URI. To read one, you use GET. Then you edit and use PUT to replace it.

Libraries that support APP include:

ROME: A strong Java library that handles RSS and Atom (discussed below)

Abdera: STAX-based Atom-only parser in Java

Google Data API: Google's library.

Note: Other entries in this category include the Universal Feed Parser (Python) and the Windows-only Windows RSS Platform built into IE7.

Data Sources and Repositories

Now that you've see the choices for data formats and communication mechanisms, the possibilities for a data source are fairly easy to understand (any of which could be generated by masup enabler:

RESTful service

RSS or Atom feed (RSSBus, Grazr, Java ROME)

The choices for a data repository are equally simple:

RESTful service (especially one that supports APP)

Local storage (for a rich client implementation)

Coding Libraries

The libraries that make the most sense for coders on the Java platform include JSR 311, ROME, and ROME Propono.

JSR 311

This JSR is a Java API for RESTful web services. It's intended to make coding simpler and encourage good RESTful style.

ROME

This toolkit is a DOM-based parser/generator in Java. Arguably the most capable toolkit for RSS and Atom on the java platform, it parses and generates all forms of RSS and Atom. It's based on JDOM, and is both pluggable and extenbsible.

To consume an arbitrary feed with ROME, create a SynFeed object, read the data, and then convert it to Atom (or RSS, if you must). The command to create a new SynFeed looks something like this:

Synfeed feed = input.build(new InputStreamReader(inputStream));

When polling, take the following steps to maximize performance and minimize network traffic:

Use HTTP conditional GET or Etags
Don't poll too often
Use ROME's Fetcher, which includes a caching feed-store (as do other such utilities)

To publish with ROME:

Create a synfeed
Add entries to it

Set the content type and deliver it:

application/rss+xml
application/atom+xml

ROME Propono: ROME Propono is a client/server library that makes it easy to build a client APP, and makes it easy to add RESTful services to an existing web app.

Service-Building Strategies

In general, you want to think syndication, not coordination. In other words, provide a RESTful feed, and let others mash it up the way they want to. (Instead of having long meetings where you make sure the provider and consumer interfaces match up.)

You want to provide one or more of:

APIs

A RESTful web service

Atom or RSS feeds

But to make sure that clients can make use of your service, you also want to provide:

Client-side javascript libraries

A client-side widget

Client-side CSS

Documentation for the API

An example showing how it all fits together

You also want to set up a server-side proxy-style service, where:

The browser gets a web page from the host server

The web page gets Javascript file and CSS from the mashup server

The javascript gets data from the mashup database server

(The client can also use cross-site scripting to pull together data from multiple sources.)

Ideally, the server should also be prepared to deliver data in mulitple formats, including JSON, JSONP, and XML:

if ("jsonp".equals(format) ...
   response.write("callback" + jsonp_value + content)
else if ("json".equals(format) ...
   response.setContentType("text/json")
else
   response.setContentType "text/xml")
   write ...

For performance, you'll also want to do as much caching as you can:

Client-side cache via HTTP Conditional GET

Proxy server cache via HTTP headers

Server side cache using your favorite cache technology

And you'll also want to set things up so that applications and browsers like Firefox, Safari, and IE can auto-discover your feeds, using HTML that looks roughly like this:

<link rel="alternate", "title ="My Feed (RSS)",
      href="feeds/myfeed?format=RSS">

You'll also want to ensure that you're generating a valid feed, with properly escaped HTML and well-formed XML:

feedvalidator.org (works on all formats)

Other service-building strategies include:

DWR: A combination of servlets on the server and Javascript on client--good for a tightly integrated application

JavaServer Faces to wrap Ajax components

In summary,

Consider json for data interchanges

Strive for RESTful design

Access the server using a server-side proxy stragegy

Build a RESTful service so others can mashup your site

Security Considerations

The information in this section came from the Java Blueprints folks-Sean Brydon, Greg Murray, and Mark Basier--who turn out to have a fair amount of expertise on the subject. Any inaccuracies here resulted from my limitations, not theirs.

To start with, use the data format that is appropriate for your level of exposure:

JSON: When you're in the same domain as the data provider. (You could get arbitrary Javascript, and that Javascript will execute, so you want to be inside your firewall.)

JSONP: When you are inside or outside of your domain. (You're specifying the javascript to execute with this format, so it's safer. That makes this the easiest and most portable option.)

XML: When you're getting data from outside your domain, or when the client isn't using Javascript, which would let it easily parse JSON data.

Securing JSON:

When using JSON, any javascript could be coming your way, so you need a certain level of trust in your data source, or added levels of security protection:

Use a namespace for your javascript commands

Use CSS for customization

Don't add to the prototype of common objects

Securing your services:

Create a token:

Create a file with apikey (It's secure because it can't be edited with javascript.)

Use a session-based hash to make sure that a session has been established (so the request isn't coming in at random):

Create an API key generated from the URL, using a one-way hash

The user registers to get the API key (a very long string)

The key is attached to the GET request

The host name is mapped against the hash for access.

Other security notes:

Don't change state with HTTP GET

Add a security token to your forms

Don't just use cookies to validate the user

When rendering query string data, verify it

Resources

All-in-one mashup builders:

BackBase: http://www.backbase.com/ (Unfortunately, this page has a redirect that defeats the back button.)

NexaWeb: http://www.nexaweb.com/

QEDWiki (IBM): http://services.alphaworks.ibm.com/qedwiki/

Pipes (Yahoo): http://pipes.yahoo.com/pipes/

Teqlo: http://www.teqlo.com/

Mashup Enablers

Dapper: http://www.dapper.net/

Google Gadgets: http://desktop.google.com/plugins/

Kapow Technologies: http://www.kapowtech.com/

Microformats: http://microformats.org/

OpenKapow: http://www.openkapow.com/

GUI-building technologies:

Dojo: http://dojotoolkit.org/

DWR: http://getahead.org/dwr/

Google Web Tookkit (GWT): http://code.google.com/webtoolkit/

Java Server Faces: http://java.sun.com/javaee/javaserverfaces/

jMaki: https://ajax.dev.java.net/

XAP: http://incubator.apache.org/xap/

Communications and data formats:

Abdera (Apache): http://incubator.apache.org/abdera/

Atom: - Spec: http://www.ietf.org/rfc/rfc4287.txt - Overview: http://www-128.ibm.com/developerworks/xml/library/x-atom10.html

Google Data: http://code.google.com/apis/gdata/index.html

JSON / JSONP: http://www.json.org/

REST: http://rest.blueoxen.net/cgi-bin/wiki.pl

ROME / Propono: - http://wiki.java.net/bin/view/Javawsxml/Rome - http://wiki.java.net/bin/view/Javawsxml/RomePropono

RSS 1.0: http://web.resource.org/rss/1.0/

RSS 2.0: http://feedvalidator.org/docs/rss2.html

Feed Validator:

http://www.feedvalidator.org

Talk Back!

Have an opinion? Readers have already posted 2 comments about this weblog entry. Why not add yours?

RSS Feed

If you'd like to be notified whenever Eric Armstrong adds a new entry to his weblog, subscribe to his RSS feed.

Digg |

del.icio.us |

About the Blogger

Eric Armstrong has been programming and writing professionally since before there were personal computers. His production experience includes artificial intelligence (AI) programs, system libraries, real-time programs, and business applications in a variety of languages. He works as a writer and software consultant in the San Francisco Bay Area. He wrote The JBuilder2 Bible and authored the Java/XML programming tutorial available at http://java.sun.com. Eric is also involved in efforts to design knowledge-based collaboration systems.


	Web Artima.com