Sponsored Link •
|
Summary
An enormous amount of data can be conveniently represented as a labelled tree. This is the basis of many markup languages such as SGML, XML, HTML, and others. Herein I propose an alternative, Labelled S-Expressions and provide code for a working parser in C++.
Advertisement
|
S-Expressions are a well known way of representing hierarchical data-structures in text form, but they lack the extra information and structure which is gained by associating a label with a data element.
Consider the following S-Expression:
(cd (title Maggie May) (artist Rod Stewart) (country UK) (company Pickwick) (price 8.50) (year 1990) )One drawback of the format is that the label is only identified as being the first element in the S-Expression. The other problem is that white-space is significant as being a delimiter between labels and the rest of the data. For textual data, it is important that white-space not be part of the grammar, otherwise subtle errors can easily arise. There is no standardized S-Expression format, and each has different ways of grouping data. XML on the other hand is more structured. The following is what the example data structure would look like as XML.
<cd> <title>Maggie May</title> <artist>Rod Stewart</artist> <country>UK</country> <company>Pickwick</company> <price>8.50</price> <year>1990</year> </cd>XML however is very complex and slow and painful to parse. It is also verbose. Labelled S-Expressions provide the best of both formats, while being extremely easy to parse.
(cd: (title:Maggie May) (artist:Rod Stewart) (country:UK) (company:Pickwick) (price:8.50) (year:1990) )
document ::= tagged_content tagged_content ::= "(" text ":" content ")" content ::= text? (tagged_content | text)* text ::= (^("(" | ":" | ")" | "'") | escaped_char)+ escaped_char ::= "'" .And that is it, only five production rules! Encoding is UTF-8.
// Public Domain code by Christopher Diggins struct lsx_node { lsx_node* sibling; lsx_node* child; const char* text; }; lsx_node* lsx_parse(const char** pp) { if (**pp == ')') { ++*pp; return NULL; } lsx_node* node = (lsx_node*)calloc(sizeof(lsx_node), 1); node->text = *pp; if (**pp == '(') { ++*pp; while (*(*pp)++ != ':') { if (**pp == '\'') ++*pp; } node->child = lsx_parse(pp); lsx_node* child = node->child; while (child != NULL) { child->sibling = lsx_parse(pp); child = child->sibling; } } else { while ((**pp != '(') && (**pp != ')')) { if (**pp == '\'') ++*pp; ++*pp; } } return node; }
// Public Domain code by Christopher Diggins // returns the number of lsx_nodes // and returns 0 if document is ill-formed int lsx_node_count(const char* p) { int ret = 0; int match = 0; if (*p == NULL) return 0; while (*p != NULL) { switch (*p) { case '(' : ++ret; ++match; while (*p != ':') { if (*p == NULL) return 0; if (*p == '\'') ++p; if (*p == NULL) return 0; ++p; } ++p; break; case ')' : --match; ++p; break; case ':' : return 0; default: ++ret; while ((*p != '(') && (*p != ')') && (*p != ':')) { if (*p == NULL) return 0; if (*p == '\'') ++p; if (*p == NULL) return 0; ++p; } break; } } if (*p != NULL) return 0; if (match != 0) return 0; return ret; }
// Public Domain code by Christopher Diggins bool lsx_is_label(lsx_node* x) { return (*x->text == '('); } std::string lsx_get_label(lsx_node* x) { assert(lsx_is_label(x)); const char* p = x->text; while (*(++p) != ':') { if (*p == '\'') ++p; } return std::string(x->text + 1, p); } std::string lsx_get_text(lsx_node* x) { assert(!lsx_is_label(x)); const char* p = x->text; while (*p != ')') { if (*p == '\'') ++p; ++p; } return std::string(x->text, p); } void lsx_output_as_xml(lsx_node* x) { if (x == NULL) return; if (!lsx_is_label(x)) { std::cout << lsx_get_text(x) << std::endl; } else { std::cout << "<" << lsx_get_label(x) << ">" << std::endl; lsx_output_as_xml(x->child); std::cout << "</" << lsx_get_label(x) << ">" << std::endl; } lsx_output_as_xml(x->sibling); };
<?xml version="1.0" encoding="ISO-8859-1"?> <!-- public domain by Christopher Diggins http://www.cdiggins.com --> <t:transform version="1.0" xmlns:t="http://www.w3.org/1999/XSL/Transform"> <t:output method = "html" version = "1.0" encoding = "ISO-8859-1" omit-xml-declaration = "no" standalone = "yes" indent = "yes" /> <t:template match="/"> <html> <body> <dl><dt>(lsx:</dt> <dl><t:apply-templates/></dl> <dt>)</dt></dl> </body> </html> </t:template> <t:template match="node()"> <dt>(<t:value-of select="name()"/>:</dt> <dl><t:apply-templates select="attribute::*"/></dl> <dl><t:apply-templates select="child::node()"/></dl> <dt>)</dt> </t:template> <t:template match="attribute::*"> (<t:value-of select="name()"/>:<t:value-of select="."/>) </t:template> <t:template match="text()"><t:value-of select="."/></t:template> </t:transform>
Have an opinion? Readers have already posted 13 comments about this weblog entry. Why not add yours?
If you'd like to be notified whenever Christopher Diggins adds a new entry to his weblog, subscribe to his RSS feed.
Christopher Diggins is a software developer and freelance writer. Christopher loves programming, but is eternally frustrated by the shortcomings of modern programming languages. As would any reasonable person in his shoes, he decided to quit his day job to write his own ( www.heron-language.com ). Christopher is the co-author of the C++ Cookbook from O'Reilly. Christopher can be reached through his home page at www.cdiggins.com. |
Sponsored Links
|