Weblogs Forum - Trees and Labelled S-Expressions

Articles |
News |
Weblogs |
Books |
Forums

Artima Forums | Articles | Weblogs | Java Answers | News

Grammar

The precise syntax of the Labelled S-Expression format was inspired by Derek Parnell's unnamed markup language. Here is the Grammar for Labelled S-Expressions in Extended Backus-Naur Form (EBNF):

  document ::=
    tagged_content

  tagged_content ::=
    "(" text ":" content ")"

  content ::=
    text? (tagged_content | text)*

  text ::=
    (^("(" | ":" | ")" | "'") | escaped_char)+

  escaped_char ::=
    "'" .

And that is it, only five production rules! Encoding is UTF-8.

Design Goals

Labelled S-Expressions was designed so that it could be parsed as quickly as possible, and be as simple as possible, while still maintaining usefulness as a portable text-based data format.

Parser

The following is a snippet of C++ code for parsing labelled S-Expressions. It breaks a buffer up into nodes, where a node can either point to a label: "(xxx:" or a textual node. The input must already be validated to be a well-formed document.

// Public Domain code by Christopher Diggins

struct lsx_node {
  lsx_node* sibling;
  lsx_node* child;
  const char* text;
};

lsx_node* lsx_parse(const char** pp) {
  if (**pp == ')') {
    ++*pp;
    return NULL;
  }
  lsx_node* node = (lsx_node*)calloc(sizeof(lsx_node), 1);
  node->text = *pp;
  if (**pp == '(') {
    ++*pp;
    while (*(*pp)++ != ':') {
      if (**pp == '\'') ++*pp;
    }
    node->child = lsx_parse(pp);
    lsx_node* child = node->child;
    while (child != NULL) {
      child->sibling = lsx_parse(pp);
      child = child->sibling;
    }
  }
  else {
    while ((**pp != '(') && (**pp != ')')) {
      if (**pp == '\'') ++*pp;
      ++*pp;
    }
  }
  return node;
}

Checking Well-Formedness

Here is a function for checking well-formedness and returns the number of nodes in the data:

// Public Domain code by Christopher Diggins
// returns the number of lsx_nodes
// and returns 0 if document is ill-formed

int lsx_node_count(const char* p) {
  int ret = 0;
  int match = 0;
  if (*p == NULL) return 0;
  while (*p != NULL) {
    switch (*p) {
      case '(' :
        ++ret;
        ++match;
        while (*p != ':') {
          if (*p == NULL) return 0;
          if (*p == '\'') ++p;
          if (*p == NULL) return 0;
          ++p;
        }
        ++p;
        break;
      case ')' :
        --match;
        ++p;
        break;
      case ':' :
        return 0;
      default:
        ++ret;
        while ((*p != '(') && (*p != ')') && (*p != ':'))  {
          if (*p == NULL) return 0;
          if (*p == '\'') ++p;
          if (*p == NULL) return 0;
          ++p;
        }
        break;
    }
  }
  if (*p != NULL) return 0;
  if (match != 0) return 0;
  return ret;
}

XML Converter

Here is a rudimentary XML converter for LSX data:

// Public Domain code by Christopher Diggins

bool lsx_is_label(lsx_node* x) {
  return (*x->text == '(');
}

std::string lsx_get_label(lsx_node* x) {
  assert(lsx_is_label(x));
  const char* p = x->text;
  while (*(++p) != ':') {
    if (*p == '\'') ++p;
  }
  return std::string(x->text + 1, p);
}

std::string lsx_get_text(lsx_node* x) {
  assert(!lsx_is_label(x));
  const char* p = x->text;
  while (*p != ')') {
    if (*p == '\'') ++p;
    ++p;
  }
  return std::string(x->text, p);
}

void lsx_output_as_xml(lsx_node* x) {
  if (x == NULL)
    return;
  if (!lsx_is_label(x)) {
    std::cout << lsx_get_text(x) << std::endl;
  }
  else {
    std::cout << "<" << lsx_get_label(x) << ">" << std::endl;
    lsx_output_as_xml(x->child);
    std::cout << "</" << lsx_get_label(x) << ">" << std::endl;
  }
  lsx_output_as_xml(x->sibling);
};

XSLT

Xml can be mapped to Labelled S-Expressions as evidenced by the following XSLT stylesheet for transforming XML into Labelled S-Expressions.

<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- public domain by Christopher Diggins http://www.cdiggins.com -->
<t:transform version="1.0" xmlns:t="http://www.w3.org/1999/XSL/Transform">
  <t:output method = "html" version = "1.0" encoding = "ISO-8859-1" omit-xml-declaration = "no" standalone = "yes" indent = "yes" />

  <t:template match="/">
    <html>
    <body>
    <dl><dt>(lsx:</dt>
    <dl><t:apply-templates/></dl>
    <dt>)</dt></dl>
    </body>
    </html>
  </t:template>

  <t:template match="node()">
    <dt>(<t:value-of select="name()"/>:</dt>
      <dl><t:apply-templates select="attribute::*"/></dl>
      <dl><t:apply-templates select="child::node()"/></dl>
    <dt>)</dt>
  </t:template>

  <t:template match="attribute::*">
    (<t:value-of select="name()"/>:<t:value-of select="."/>)
  </t:template>

  <t:template match="text()"><t:value-of select="."/></t:template>
</t:transform>

Similar Work

There are a multitude of markup languages floating around, http://www.pault.com/pault/pxml/xmlalternatives.html has a nice list of markup languages. I don't believe any even come close to the simplicity and efficiency of Labelled S-Expressions.

Sean Conner

Posts: 19
Nickname: spc476
Registered: Aug, 2005

Re: Trees and Labelled S-Expressions

Posted: Nov 3, 2005 8:39 PM

Your code may be fast, but I'd rather have slow code that works. I'll give you the benefit of the doubt and assume the code is C99 (since it sure isn't C89 standard). But even then, it doesn't quite work as intended. I have the following for a print routine:


void lsx_print(struct lsx_node *n,int l)
{
  while(n != NULL)
  {
    if (n->text != NULL)
      printf("%*.*c%s\n",l,l,n->text);
    if (n->child)
      lsx_print(n->child,l+1);
    n = n->sibling;
  }
}

I feed your parser the following:

('(cd'::blah)

I get the following:


 (cd'::blah)
 :blah)

Which I don't think is correct (never mind the fact that as written, the entire labelled S expression has to exist as one line).

I'm guessing you probably knocked this code out without really testing it to show just how easy it is to parse, but the code as is, doesn't really work as you intended.

Greg Jorgensen

Posts: 65
Nickname: gregjor
Registered: Feb, 2004

Re: Trees and Labelled S-Expressions

Posted: Nov 3, 2005 11:08 PM

I think you'll find lots of prior art. This looks just like nested tuples/lists/dictionaries in Python, or PHP's nested arrays. Perl and Ruby have similar constructions. I wrote a web application a couple of years ago that stores inventory data in a text file as nested PHP arrays. The entire inventory can be loaded into PHP using the eval function to parse it:


array (
  1009 => 
  array (
    'type' => 'n',
    'price' => '',
    'year' => '2004',
    'make' => 'FOREST RIVER',
    'model' => 'LEXINGTON',
    'length' => '27',
    'status' => 'Available',
    'stock' => '101',
  ),
  1017 => 
  array (
    'type' => 'n',
    'price' => '',
    'year' => '2004',
    'make' => 'CHINOOK',
    'model' => 'CONCOURSE',
    'length' => '21',
    'status' => 'Available',
    'stock' => '173',
  ),
  ...
)

The parser looks like this:


$f = fopen($filename, 'rb');
$s = fread($f, filesize($filename));
eval('$inventory=' . $s . ';');
// display inventory
print_r($inventory);

The code to store inventory in a file is just as simple with PHP's var_export function.

I've seen the same kind of thing many many times, and back before XML had any traction.

If you've looked at YAML you've noticed that it is pretty much the same thing without the parentheses.

> The following is a snippet of C code for parsing labelled
> S-Expressions (warning: it is much faster than what you are
> used to!)

It may be faster but I'm used to code that compiles and works. Did you get that code to run? Hold off bragging about how fast it is until it works, is robust enough to handle human-edited inputs, can deal with syntax errors, doesn't store read-only pointers into the original text buffer, etc. Code snippets used for illustrating a technique should be a lot more readable than this.

Christopher Diggins

Posts: 1215
Nickname: cdiggins
Registered: Feb, 2004

Re: Trees and Labelled S-Expressions

Posted: Nov 4, 2005 6:10 AM

> Your code may be fast, but I'd rather have slow code that
> works.

Well then I invite you to fix it!

> I'll give you the benefit of the doubt and assume
> the code is C99 (since it sure isn't C89 standard).

It's actually C++. I tried to make it as close to C as possible, but I probably screwed up.

> I'm guessing you probably knocked this code out without
> really testing it to show just how easy it is to parse,
> but the code as is, doesn't really work as you intended.

I did test it some, but apparently I made some mistakes. Do you ever make mistakes in your code? Or do you spend all your time making criticisms, without offering alternatives?

Christopher Diggins

Posts: 1215
Nickname: cdiggins
Registered: Feb, 2004

Re: Trees and Labelled S-Expressions

Posted: Nov 4, 2005 6:18 AM

> I think you'll find lots of prior art.

Which I linked to!!

> If you've looked at YAML you've noticed that it is pretty
> much the same thing without the parentheses.

I don't know how you make that connection:
http://yaml.org/spec/current.html

YAML is a lot more complicated, and whitespace is significant.

> It may be faster but I'm used to code that compiles and
> works. Did you get that code to run?

It's C++.

> Hold off bragging
> about how fast it is until it works, is robust enough to
> handle human-edited inputs, can deal with syntax errors,

Here is the code for validating well-formedness (you have to call this first):


// returns the number of lsx_nodes 
// and returns 0 if document is ill-formed 
int lsx_node_count(const char* p) {
  int ret = 0;
  int match = 0;
  if (*p == NULL) return 0;  
  while (*p != NULL) {       
    switch (*p) {
      case '(' :      
        ++ret;
        ++match;        
        while (*p != ':') {       
          if (*p == NULL) return 0;
          if (*p == '\'') ++p; 
          if (*p == NULL) return 0;
          ++p;
        }                        
        ++p;    
        break;
      case ')' :
        --match;
        ++p;    
        break;                
      case ':' :
        return 0;  
      default: 
        ++ret;
        while ((*p != '(') && (*p != ')') && (*p != ':'))  {       
          if (*p == NULL) return 0;
          if (*p == '\'') ++p; 
          if (*p == NULL) return 0;
          ++p;
        }
        break;      
    }

> doesn't store read-only pointers into the original text
> buffer, etc.

That is an implementation detail. You should be able to figure it out yourself, if you actuallly are a programmer.

> Code snippets used for illustrating a
> technique should be a lot more readable than this.

Then I invite you to rewrite it. It shouldn't be hard for you, since you are obviously so much smarter than everyone else.

Christopher Diggins

Posts: 1215
Nickname: cdiggins
Registered: Feb, 2004

Re: Trees and Labelled S-Expressions

Posted: Nov 4, 2005 6:38 AM

It turns out that your print function is simply insufficient. You have to parse the text when outputting it. Here is a more complete output code in C++:

string lsx_get_label(lsx_node* x) {
  assert(lsx_is_label(x));
  const char* p = x->text;
  while (*(++p) != ':') {
    if (*p == '\'') ++p;
  }
  return string(x->text + 1, p);
}
 
string lsx_get_text(lsx_node* x) {
  assert(!lsx_is_label(x));
  const char* p = x->text;
  while (*p != ')') {
    if (*p == '\'') ++p;
    ++p;
  }
  return string(x->text, p);
}
 
void lsx_output_as_text(lsx_node* x) {
  if (x == NULL)
    return;
  if (!lsx_is_label(x)) {
    cout << lsx_get_text(x) << endl;
  }
  else {
    cout << "(" << lsx_get_label(x) << ":" <<
    lsx_output_as_text(x->child);
    cout << ")" << endl;
  }
  lsx_output_as_text(x->sibling);
}

Christopher Diggins

Posts: 1215
Nickname: cdiggins
Registered: Feb, 2004

Re: Trees and Labelled S-Expressions

Posted: Nov 4, 2005 7:14 AM

> I did test it some, but apparently I made some mistakes.
> Do you ever make mistakes in your code? Or do you spend
> all your time making criticisms, without offering
> alternatives?

I am sorry Sean, this was uncalled for. I realize that you were trying to be helpful. To be honest, my frustration was more at Jorgenson. Lately he has been ignorantly criticizing everything I post. When you are subject to a constant barrage of criticism, you can become rather thin-skinned.

Reno C.

Posts: 10
Nickname: saxml
Registered: Oct, 2005

Re: Trees and Labelled S-Expressions

Posted: Nov 4, 2005 7:45 AM

> content ::= text? (tagged_content | text?)*

Shouldn't it be

text? (tagged_content | text)*

instead ?

I'm using my own YARD-like parser and I've got to remove this optional quantifier.

Christopher Diggins

Posts: 1215
Nickname: cdiggins
Registered: Feb, 2004

Re: Trees and Labelled S-Expressions

Posted: Nov 4, 2005 7:54 AM

> > content ::= text? (tagged_content | text?)*
>
> Shouldn't it be

text? (tagged_content | text)*

> instead ?
>
> I'm using my own YARD-like parser and I've got to remove
> this optional quantifier.

Yes! You are right, my bad. That rule, as it was, would match anything. Thanks for pointing that out.

Out of curiosity, is your parser open-source and available online?

Reno C.

Posts: 10
Nickname: saxml
Registered: Oct, 2005

Re: Trees and Labelled S-Expressions

Posted: Nov 4, 2005 8:24 AM

> Out of curiosity, is your parser open-source and available online?

It was not available online because it's only a personal test. (and i'm not a professional programer).
I've just put it online: http://sax.scriptsdb.org/birdy.hpp

YARD and biscuit (mb2sync) were the main inspirations!

Christopher Diggins

Posts: 1215
Nickname: cdiggins
Registered: Feb, 2004

Re: Trees and Labelled S-Expressions

Posted: Nov 4, 2005 8:41 AM

> > Out of curiosity, is your parser open-source and
> available online?
>
> It was not available online because it's only a personal
> test. (and i'm not a professional programer).
> I've just put it online:
> http://sax.scriptsdb.org/birdy.hpp
>
> YARD and biscuit (mb2sync) were the main inspirations!

Very nice code! It has inspired me to look into simplifying the YARD code. Thanks for sharing it.

Greg Jorgensen

Posts: 65
Nickname: gregjor
Registered: Feb, 2004

Re: Trees and Labelled S-Expressions

Posted: Nov 4, 2005 10:41 AM

> You should be able to figure it out yourself,
> if you actuallly are a programmer.
> ...
> Then I invite you to rewrite it. It shouldn't be
> hard for you, since you are obviously so much
> smarter than everyone else.
> ...
> To be honest, my frustration was more at Jorgenson.
> Lately he has been ignorantly criticizing everything
> I post. When you are subject to a constant barrage of
> criticism, you can become rather thin-skinned.

I'm not interested in rewriting your code. You are the one posting code samples, so you shouldn't be too surprised when people criticize your code, especially after they've actually bothered to try to compile and run it. If you don't like your code or ideas criticized stop posting a new article every day in a forum visited by programmers.

You were thin-skinned before I ever responded to your articles . You routinely insult people who try to engage you in discussion, and you bristle at any criticism or attempt to get you to explain yourself. You present yourself as a language designer and C++ expert, so you should expect readers to hold you to a high standard. If you post vague statements about design flaws or performance problems or how such-and-such is unsuitable for your purposes, and then refuse to explain yourself or engage in a professional discussion about your ideas, how can anyone in this forum take you seriously?

As a long-time Artima reader I am a little upset at how you've spammed the site to promote yourself and to try to build a buzz around the non-existent Heron language. Even your Artima profile is essentially fishing for a job offer. I'm done ignorantly responding to your posts, but I do enjoy reading the responses -- I have actually learned a few things from astute readers who have taken the time to refute or expand on your topics. Unfortunately you eventually insult or bore them away.

Sean Conner

Posts: 19
Nickname: spc476
Registered: Aug, 2005

Re: Trees and Labelled S-Expressions

Posted: Nov 4, 2005 5:04 PM

Using your validator and wrapping some code around it, I got the following:


GOOD: ('(a'::b)
GOOD: ( : )
GOOD: (:)
GOOD: (a:b)
GOOD: (a:)
GOOD: (:b)
BAD: (a:b:c)
GOOD: (:(:))
BAD: (a:(b)
BAD: (a:())
BAD: ((a:b):b)
GOOD: (a:b(c:d))
GOOD: (a : b)
GOOD: ( a : b)
GOOD: ('::':)
GOOD: ('':b)
BAD: ()
BAD: (:()()())
GOOD: (:(:)(:)(:))
BAD: (::)

So, which are actually good and which are not?

Christopher Diggins

Posts: 1215
Nickname: cdiggins
Registered: Feb, 2004

Re: Trees and Labelled S-Expressions

Posted: Nov 12, 2005 6:33 PM

> Using your validator and wrapping some code around it, I
> got the following:
>
>


> GOOD: ('(a'::b)
> GOOD: ( : )
> GOOD: (:)
> GOOD: (a:b)
> GOOD: (a:)
> GOOD: (:b)
> BAD: (a:b:c)
> GOOD: (:(:))
> BAD: (a:(b)
> BAD: (a:())
> BAD: ((a:b):b)
> GOOD: (a:b(c:d))
> GOOD: (a : b)
> GOOD: ( a : b)
> GOOD: ('::':)
> GOOD: ('':b)
> BAD: ()
> BAD: (:()()())
> GOOD: (:(:)(:)(:))
> BAD: (::)
>

>
> So, which are actually good and which are not?

AFAIK that labelling is entirely correct according to the grammar I posted.

Flat View: This topic has 13 replies on 1 page

Previous Topic

Next Topic