Adam Bosworth, with his 4P criteria for mass adoption on the web, seems to have clarified the worse-is-better debate over web data formats as only he can. Mike Champion had an interesting comment on RDF and Atom's slopworthiness as compared to RSS over on David Megginson's related entry 'RSS as the HTML for data':

"Bosworth's presentation is very well worth studying in this respect. He says that successful Web-scale technologies tend to be simple (for users), sloppy, standardized (widely deployed in a more or less interoperable way, irrespective of formal status), and scalable. I don't think Atom or RDF meet these criteria. Atom's main value over RSS is supposed to be its FORMAL standardization, but apparently nobody really cares. (Tim Bray's 'Mr. Safe' has not appeared, but RSS interop and even extensibility is happening and making it boss-friendly in practice). RDF is not simple for ordinary mortals, and its scalability is unproven. (I have been informed that actual RDF systems handle sloppiness well, even though one would think that its basis in formal logic would make it brittle... I don't know how to evaluate that)."

Slop

I can only assume that Mike's referring to recent discussion on xml-dev, among other things. Over there I agreed that RDF has simplicity and comprehension issues, but pointed out with a few simple examples that RDF is a lot more tolerant of partial and missing information than some people realise. For example, Daniel Steinberg, also commenting on Adam Bosworth's keynote, thinks that total agreement is a requirement:

"Bosworth predicts that RSS 2.0 and Atom will be the lingua franca that will be used to consume all data from everywhere. These are simple formats that are sloppily extensible. Anyone who wants to can use these formats to consume content or to author content. Contrast this with the Semantic Web, which requires that you get a large group of people to agree on the schema of everything."

In reality what Daniel said is wrong, and specifically wrong about RDF - RDF was designed with the unexpected in mind. Still, it's understandable how one might come to that conclusion, and how Mike is having trouble reconciling loose and fast with RDF. A lot of this has to do with early hype about the Semantic Web, which has at times sounded suspiciously like AI reborn. It also has to do with the way the benefits of RDF have been couched - critically, when WS technology adoption was on the up and up, the emphasis of Semantic Web standardization within the W3C was on formalization of the technology rather than useful applications. If all I heard from people raving about Python and MySQL were Turing Machines and Relational Algebra, I might end up not appreciating their usefulness, and I would certainly be bored to tears. RDF gets its robustness and flexibility from its design, and two design properties stand out.

Graphs

First is the graph model RDF is based on. All RDF data is organized as a graph - different from XML's tree-based document structures, and vaguely like a relational database, but without the idea of tables. The beauty of the graph model is that it is 'additive': you can keep merging new items onto the graph without having to create new data structures to support new information. It's extensible and uniform in much the same way a hashmap data structure is in a programming language. All RSS variants have ended up basing themselves on a slotted, dictionary-style data structure; Atom makes no bones about it.
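To make the 'additive' point concrete, here's a minimal sketch using the Python rdflib library; the example.org vocabulary and data are invented for illustration:

    from rdflib import Graph, Literal, Namespace

    EX = Namespace("http://example.org/feed/")

    # One source tells us the entry's title...
    g1 = Graph()
    g1.add((EX.entry1, EX.title, Literal("Worse is better")))

    # ...another source, independently, tells us its author.
    g2 = Graph()
    g2.add((EX.entry1, EX.author, Literal("Bill")))

    # Merging is set union over triples: no schema change,
    # no ALTER TABLE, just a bigger graph.
    merged = g1 + g2
    print(len(merged))  # 2 triples, one shared subject

Neither graph had to be told about the other's 'columns' beforehand; the merge is the whole integration step.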
The dictionary structure is one reason why, despite there being over 9 formats, it's possible to support all of them in a single program's object model. It's even been suggested that Atom offers little value because people have not had to significantly adjust their program models to accommodate it, but this says little about Atom and volumes about the data structure all feed formats have settled on - the dictionary. Aside from who retains editorial control, the syndication format war arguments have been over which slots are mandatory and how to do ad-hoc extensions for things like links and metadata tagging - which is where the one-deep dictionary structure starts to break down and you edge toward a graph.

The graph model is also 'subtractive', which means you can take data out of the graph and leave a smaller graph behind, just the same way you'd remove an item from a hashmap, but without the hassle of doing something like dropping a column or table in a database. In the developer trenches, adding or dropping database columns is often the stuff of nightmares, as is trying to restore data where the integrity constraints were left to the application to manage. This is not so problematic with RDF. Using RDF as the data model, queries and merging operations end up producing new graphs as their results, in much the same way SQL result sets are also tables. More importantly, it makes for elegant and uniform programming against the data and reduces the prior agreements an application has to have (i.e. knowing how a domain is mapped onto specific tables and relations). If RSS APIs look like dictionaries, then RDF APIs tend to look like hooks into tuplespaces.
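A second hedged sketch with rdflib, showing the 'subtractive' side and a query whose result is itself a graph (the SPARQL and vocabulary reuse the invented example.org names from above):

    from rdflib import Graph, Literal, Namespace

    EX = Namespace("http://example.org/feed/")
    g = Graph()
    g.add((EX.entry1, EX.title, Literal("Worse is better")))
    g.add((EX.entry1, EX.author, Literal("Bill")))
    g.add((EX.entry2, EX.title, Literal("Slop")))

    # Subtractive: drop everything known about entry2, as easily as
    # deleting a key from a hashmap - no cascading constraints to appease.
    g.remove((EX.entry2, None, None))

    # A CONSTRUCT query returns a new graph, much the way an SQL
    # result set is itself a table.
    titles = g.query("""
        PREFIX ex: <http://example.org/feed/>
        CONSTRUCT { ?e ex:title ?t } WHERE { ?e ex:title ?t }
    """).graph

That closure property - queries over graphs yield graphs - is what keeps the programming model uniform.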
From a scalability consideration, the fact that we can break up large graphs of data into smaller, possibly overlapping graphs without logically losing integrity or meaning allows us to physically distribute a dataset in ways that are clumsy, unworkable or downright expensive with RDBMSes and filesystems. Being naturally distributed is also the basis for 3rd party metadata and annotation; I do not need physical access or joins to your data store to make assertions about your data.

The most interesting slides in Adam Bosworth's presentation are not about the 4Ps, but the diagrams which show queries being divvied up across servers (sorry, I've lost the link to the powerpoint). While it's known that Google break out their indices across a cluster into what they call 'shards', Bosworth's model looks like the late Gene Kan's Infrasearch query router, now part of the JXTA project. As a counterpoint, Doug Cutting of Lucene and Nutch fame has said, more or less, that there's no great advantage yet to distributed queries across the web in this way over downloading and centralizing the indexes:

"Widely distributed search is interesting, but I'm not sure it can yet be done and keep things as fast as they need to be. A faster search engine is a better search engine. When folks can quickly revise queries then they more frequently find what they're looking for before they get impatient. But building a widely distributed search system that can search billions of pages in a fraction of a second is difficult, since network latencies are high. Most of the half-second or so that Google takes to perform a search is network latency within a single datacenter. If you were to spread that same system over a bunch of PCs in people's houses, even connected by DSL and cable modems, the latencies are much higher and searches would probably take several seconds or longer. And hence it wouldn't be as good of a search engine."

But you have to question whether downloading the web into a cluster for indexing is the way to go; or at least whether it's the only way to go. If the amount of data being generated exceeds our ability to centralize it, at some point Jim Gray's distributed computing economics might flip in favour of sending the query out after the data rather than trying to localize the data. William Grosso has wondered whether Gray's model invalidates semantic web precepts:

"Now along comes Gray, making an argument that, when you think about it, implies that the semantic web, as currently conceived, might just be all wrong. His basic point is that it's far cheaper to vend high-level apis than give access to the data (because the cost of shipping large amounts of data around is prohibitive). Since the semantic web is basically a data web, one wonders: why doesn't Gray's argument apply?"

Worlds

Second is the "open world" assumption of RDF. What that means is that with RDF, the fact that you can't find the answer to your query doesn't make the answer false (in languages like SQL and Prolog, failure to get a result usually means false, which would be the 'closed world' assumption). For example, if you were searching for an Atom entry's summary and didn't find anything, concluding there's no summary for that entry would be a closed world assumption. But in RSS 1.0, which is RDF based, you'd conclude you don't have a summary to hand, not that it doesn't exist - the data might be incomplete at the time of asking. Dan Brickley describes this as 'missing isn't broken':

"Developers who come to the Semantic Web effort via XML technology often make an understandable mistake. They assume that missing is broken when it comes to the contents of RDF/XML documents, that if you omit some piece of information from an RDF file, you have in some formal, technical sense 'done something wrong' and should be punished. RDF doesn't work like that. Missing isn't broken. In the general case, you are free to say as much, or as little, in your RDF document as you like. RDF vocabularies such as FOAF, Dublin Core, MusicBrainz, RDF-Wordnet don't get to tell you what to do, what to write, what to say. Instead, they serve as an interconnected dictionary documenting the meaning of the terms you're using in your RDF documents."
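In code, 'missing isn't broken' is cheap to honour. A sketch, again with rdflib and the invented example.org vocabulary:

    from rdflib import Graph, Literal, Namespace

    EX = Namespace("http://example.org/feed/")
    g = Graph()
    g.add((EX.entry1, EX.title, Literal("Worse is better")))

    # No summary has been asserted for entry1. Under the open world
    # assumption, None means 'not known here', not 'does not exist'.
    summary = g.value(EX.entry1, EX.summary)
    if summary is None:
        summary = Literal("")  # fall back, fetch more data, or carry on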
Those coming to RDF from fields like description logics (DL) and ontology engineering have worried about the consequences of RDF's 'open world' for the Semantic Web formats, as it has all sorts of unpleasant mathematical and computational consequences, such as being able to ask questions that would take aeons to answer, or perhaps can't be answered at all. But in practice this doesn't bite hard.

Balance

The case of the DL and ontology crowd coming to the Semantic Web and fussing over questions that will blow up in their engines is much like the case of the enterprise crowd coming to the Web fussing over the type systems and discovery languages needed for their tools. The likeness is not fleeting - both the Semantic Web and Web Services advocates have been busy building competing technology stacks in the last decade. They have valid points and good technology, but they have vastly, vastly overestimated not just the need for such precision in the Web context, but the willingness of people to invest any time in caring an iota about it. As Pat Hayes put it:

"It is fundamentally unnecessary. The semantic web doesn't need all these DL guards and limitations, because it doesn't need to provide the industrial-quality guarantees of inferential performance. Using DLs as a semantic web content markup standard is a failure of imagination: it presumes that the Web is going to be something like a giant corporation, with the same requirements of predictability and provable performance. In fact (if the SW ever becomes a reality) it will be quite different from current industrial ontology practice in many ways. It will be far 'scruffier', for a start; people will use ingenious tricks to scrape partly-ill-formed content from ill-structured sources, and there is no point in trying to prevent them doing so, or tutting with disapproval. But aside from that, it will be on a scale that will completely defeat any attempt to restrict inference to manageable bounds. If one is dealing with 10^9 assertions, the difference between a polynomial complexity class and something worse is largely irrelevant."

Pat Hayes is an interesting person to have said that. He's a legend in the world of AI in the way Adam Bosworth is a legend as a software developer. Both have concluded in their own ways that the 'neat' orthodoxies implicit in Web Services and the uppercase Semantic Web are futile. Cleaning up the Web is infeasible. Whether data is to be considered open or closed, local or remote - these are architectural criteria, analogous in some ways to Deutsch's first fallacy of assuming the network is reliable. Once you decide that the closed world is a bad assumption (or that the network will not always be there), that changes your approach.

If you come from an SQL/XML background, the open world idea of everything being effectively optional is going to seem weird and unworkable, but what it really means is that every addition of data is an act of extension - extensibility is intrinsic to the RDF way of doing things, not something that gets bolted on as with mustUnderstand/mustIgnore. The same intrinsic nature goes for distribution and the breaking up of datasets. RDF databases can be distributed across any number of nodes. The technical challenge then is not so much scaling the database across clusters - the model already supports that. The real challenge is routing and distributing the queries. Query routing is a special case of the kind of packet routing problems that occupy telecoms, peer-to-peer and internet engineers. Bosworth is right, we need Really Simple Querying, but it's too early to rule out RDF as a good fit for returning the results or managing scale issues....
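To close the loop on that query routing point, here's a toy scatter-gather sketch: one CONSTRUCT query is fanned out to a couple of SPARQL endpoints and the answers are merged locally. The endpoint URLs are hypothetical, and a real query router would need discovery, timeouts and ranking; the merge step is only safe and simple because of the additive graph property described above.

    from urllib.parse import urlencode
    from urllib.request import Request, urlopen

    from rdflib import Graph

    # Hypothetical endpoints - stand-ins for Bosworth's divvied-up servers.
    ENDPOINTS = [
        "http://one.example.org/sparql",
        "http://two.example.org/sparql",
    ]

    QUERY = """
        PREFIX ex: <http://example.org/feed/>
        CONSTRUCT { ?e ex:title ?t } WHERE { ?e ex:title ?t }
    """

    merged = Graph()
    for endpoint in ENDPOINTS:
        req = Request(endpoint + "?" + urlencode({"query": QUERY}),
                      headers={"Accept": "application/rdf+xml"})
        with urlopen(req) as resp:
            # Each endpoint answers with a graph; merging the answers
            # needs no coordination between the endpoints.
            merged += Graph().parse(data=resp.read(), format="xml")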