Summary
An enormous amount of data can be conveniently represented as a labelled tree. This is the basis of many markup languages such as SGML, XML, HTML, and others. Herein I propose an alternative, Labelled S-Expressions and provide code for a working parser in C++.
Advertisement
S-Expressions are a well known way of representing hierarchical
data-structures in text form, but they lack the extra information and structure which is gained by associating a label with a
data element.
One drawback of the format is that the label is only identified as being the first element in the S-Expression.
The other problem is that white-space is significant as being a delimiter between labels and the rest of the data.
For textual data, it is important that white-space not be part of the grammar, otherwise subtle errors can easily arise. There is no standardized
S-Expression format, and each has different ways of grouping data. XML on the other hand is more structured.
The following is what the example data structure would look like as XML.
XML however is very complex and slow and painful to parse. It is also verbose. Labelled S-Expressions provide
the best of both formats, while being extremely easy to parse.
The precise syntax of the Labelled S-Expression format was inspired by
Derek Parnell's unnamed markup language. Here is the Grammar for Labelled S-Expressions in Extended Backus-Naur Form (EBNF):
And that is it, only five production rules! Encoding is UTF-8.
Design Goals
Labelled S-Expressions was designed so that it could be parsed as quickly as possible, and be as simple as possible,
while still maintaining usefulness as a portable text-based data format.
Parser
The following is a snippet of C++ code for parsing labelled S-Expressions. It breaks a buffer up into nodes, where a node can either point to a label: "(xxx:" or a textual node. The input must already be validated to be a well-formed document.
// Public Domain code by Christopher Diggins
struct lsx_node {
lsx_node* sibling;
lsx_node* child;
const char* text;
};
lsx_node* lsx_parse(const char** pp) {
if (**pp == ')') {
++*pp;
return NULL;
}
lsx_node* node = (lsx_node*)calloc(sizeof(lsx_node), 1);
node->text = *pp;
if (**pp == '(') {
++*pp;
while (*(*pp)++ != ':') {
if (**pp == '\'') ++*pp;
}
node->child = lsx_parse(pp);
lsx_node* child = node->child;
while (child != NULL) {
child->sibling = lsx_parse(pp);
child = child->sibling;
}
}
else {
while ((**pp != '(') && (**pp != ')')) {
if (**pp == '\'') ++*pp;
++*pp;
}
}
return node;
}
Checking Well-Formedness
Here is a function for checking well-formedness and returns the number of nodes in the data:
// Public Domain code by Christopher Diggins
// returns the number of lsx_nodes
// and returns 0 if document is ill-formed
int lsx_node_count(const char* p) {
int ret = 0;
int match = 0;
if (*p == NULL) return 0;
while (*p != NULL) {
switch (*p) {
case '(' :
++ret;
++match;
while (*p != ':') {
if (*p == NULL) return 0;
if (*p == '\'') ++p;
if (*p == NULL) return 0;
++p;
}
++p;
break;
case ')' :
--match;
++p;
break;
case ':' :
return 0;
default:
++ret;
while ((*p != '(') && (*p != ')') && (*p != ':')) {
if (*p == NULL) return 0;
if (*p == '\'') ++p;
if (*p == NULL) return 0;
++p;
}
break;
}
}
if (*p != NULL) return 0;
if (match != 0) return 0;
return ret;
}
There are a multitude of markup languages floating around,
http://www.pault.com/pault/pxml/xmlalternatives.html has a nice list of markup languages. I don't believe any even
come close to the simplicity and efficiency of Labelled S-Expressions.
Your code may be fast, but I'd rather have slow code that works. I'll give you the benefit of the doubt and assume the code is C99 (since it sure isn't C89 standard). But even then, it doesn't quite work as intended. I have the following for a print routine:
void lsx_print(struct lsx_node *n,int l) { while(n != NULL) { if (n->text != NULL) printf("%*.*c%s\n",l,l,n->text); if (n->child) lsx_print(n->child,l+1); n = n->sibling; } }
I feed your parser the following:
('(cd'::blah)
I get the following:
(cd'::blah) :blah)
Which I don't think is correct (never mind the fact that as written, the entire labelled S expression has to exist as one line).
I'm guessing you probably knocked this code out without really testing it to show just how easy it is to parse, but the code as is, doesn't really work as you intended.
I think you'll find lots of prior art. This looks just like nested tuples/lists/dictionaries in Python, or PHP's nested arrays. Perl and Ruby have similar constructions. I wrote a web application a couple of years ago that stores inventory data in a text file as nested PHP arrays. The entire inventory can be loaded into PHP using the eval function to parse it:
The code to store inventory in a file is just as simple with PHP's var_export function.
I've seen the same kind of thing many many times, and back before XML had any traction.
If you've looked at YAML you've noticed that it is pretty much the same thing without the parentheses.
> The following is a snippet of C code for parsing labelled > S-Expressions (warning: it is much faster than what you are > used to!)
It may be faster but I'm used to code that compiles and works. Did you get that code to run? Hold off bragging about how fast it is until it works, is robust enough to handle human-edited inputs, can deal with syntax errors, doesn't store read-only pointers into the original text buffer, etc. Code snippets used for illustrating a technique should be a lot more readable than this.
> Your code may be fast, but I'd rather have slow code that > works.
Well then I invite you to fix it!
> I'll give you the benefit of the doubt and assume > the code is C99 (since it sure isn't C89 standard).
It's actually C++. I tried to make it as close to C as possible, but I probably screwed up.
> I'm guessing you probably knocked this code out without > really testing it to show just how easy it is to parse, > but the code as is, doesn't really work as you intended.
I did test it some, but apparently I made some mistakes. Do you ever make mistakes in your code? Or do you spend all your time making criticisms, without offering alternatives?
YAML is a lot more complicated, and whitespace is significant.
> It may be faster but I'm used to code that compiles and > works. Did you get that code to run?
It's C++.
> Hold off bragging > about how fast it is until it works, is robust enough to > handle human-edited inputs, can deal with syntax errors,
Here is the code for validating well-formedness (you have to call this first):
// returns the number of lsx_nodes // and returns 0 if document is ill-formed int lsx_node_count(const char* p) { int ret = 0; int match = 0; if (*p == NULL) return 0; while (*p != NULL) { switch (*p) { case '(' : ++ret; ++match; while (*p != ':') { if (*p == NULL) return 0; if (*p == '\'') ++p; if (*p == NULL) return 0; ++p; } ++p; break; case ')' : --match; ++p; break; case ':' : return 0; default: ++ret; while ((*p != '(') && (*p != ')') && (*p != ':')) { if (*p == NULL) return 0; if (*p == '\'') ++p; if (*p == NULL) return 0; ++p; } break; }
> doesn't store read-only pointers into the original text > buffer, etc.
That is an implementation detail. You should be able to figure it out yourself, if you actuallly are a programmer.
> Code snippets used for illustrating a > technique should be a lot more readable than this.
Then I invite you to rewrite it. It shouldn't be hard for you, since you are obviously so much smarter than everyone else.
It turns out that your print function is simply insufficient. You have to parse the text when outputting it. Here is a more complete output code in C++:
> I did test it some, but apparently I made some mistakes. > Do you ever make mistakes in your code? Or do you spend > all your time making criticisms, without offering > alternatives?
I am sorry Sean, this was uncalled for. I realize that you were trying to be helpful. To be honest, my frustration was more at Jorgenson. Lately he has been ignorantly criticizing everything I post. When you are subject to a constant barrage of criticism, you can become rather thin-skinned.
> Out of curiosity, is your parser open-source and available online?
It was not available online because it's only a personal test. (and i'm not a professional programer). I've just put it online: http://sax.scriptsdb.org/birdy.hpp
YARD and biscuit (mb2sync) were the main inspirations!
> > Out of curiosity, is your parser open-source and > available online? > > It was not available online because it's only a personal > test. (and i'm not a professional programer). > I've just put it online: > http://sax.scriptsdb.org/birdy.hpp > > YARD and biscuit (mb2sync) were the main inspirations!
Very nice code! It has inspired me to look into simplifying the YARD code. Thanks for sharing it.
> You should be able to figure it out yourself, > if you actuallly are a programmer. > ... > Then I invite you to rewrite it. It shouldn't be > hard for you, since you are obviously so much > smarter than everyone else. > ... > To be honest, my frustration was more at Jorgenson. > Lately he has been ignorantly criticizing everything > I post. When you are subject to a constant barrage of > criticism, you can become rather thin-skinned.
I'm not interested in rewriting your code. You are the one posting code samples, so you shouldn't be too surprised when people criticize your code, especially after they've actually bothered to try to compile and run it. If you don't like your code or ideas criticized stop posting a new article every day in a forum visited by programmers.
You were thin-skinned before I ever responded to your articles . You routinely insult people who try to engage you in discussion, and you bristle at any criticism or attempt to get you to explain yourself. You present yourself as a language designer and C++ expert, so you should expect readers to hold you to a high standard. If you post vague statements about design flaws or performance problems or how such-and-such is unsuitable for your purposes, and then refuse to explain yourself or engage in a professional discussion about your ideas, how can anyone in this forum take you seriously?
As a long-time Artima reader I am a little upset at how you've spammed the site to promote yourself and to try to build a buzz around the non-existent Heron language. Even your Artima profile is essentially fishing for a job offer. I'm done ignorantly responding to your posts, but I do enjoy reading the responses -- I have actually learned a few things from astute readers who have taken the time to refute or expand on your topics. Unfortunately you eventually insult or bore them away.