PHP Buzz Forum - 52 Making simple things easy, and difficult things possible. yet another html parser. 15

Articles |
News |
Weblogs |
Books |
Forums

Artima Forums | Articles | Weblogs | Java Answers | News

Sponsored Link •

PHP Buzz Forum
52 Making simple things easy, and difficult things possible. yet another html parser. 15

0 replies on 1 page.

Welcome Guest
Sign In

Back to Topic List

Reply to this Topic

Search Forum

Threaded View


Previous Topic		Next Topic

Flat View: This topic has 0 replies on 1 page

Alan Knowles

Posts: 390
Nickname: alank
Registered: Sep, 2004

Alan Knowles is Freelance Developer, works on PHP extensions and PEAR.

52 Making simple things easy, and difficult things possible. yet another html parser. 15

Posted: Nov 1, 2004 7:41 AM

This post originated from an RSS feed registered with PHP Buzz by Alan Knowles.
Original Post: 52 Making simple things easy, and difficult things possible. yet another html parser. 15 Feed Title: Smoking toooo much PHP Feed URL: http://www.akbkhome.com/blog.php/RSS.xml Feed Description: More than just a blog :)	Latest PHP Buzz Posts Latest PHP Buzz Posts by Alan Knowles Latest Posts From Smoking toooo much PHP

13a5 When John released his bindings to html tidy, I joked with him, that it would have been far more interesting (as a project), to write a proper HTML lexer, rather than bind to an existing library. (mainly cause having written one in PHP, I didnt think it would be that difficult), and I have a strange idea of fun...

Well, over the weekend, I was re-pondering this. Partly due to the fact I had used the Flexy Parser to try and parse HTML from a web site, and found the tokenizer in Flexy was getting slower with age (5seconds on average to parse a page). While this is not a huge issue normally, as this parsing is cached during the compiling phase of template engine. It is a huge issue if you are pulling pages down, parsing out the forms, and reposting the forms in a web test script.

So over the weekend after a little google search and discover trip, I ran across a little w3c project, "A Lexical Analyzer for HTML and SGML", It looked interesting, but it wasnt until I pulled the code down, untared and built it, that I realized it could be used to write a really fast, and simple HTML tokenizer. (not only that, it could easily form the basis of a C based backend for Flexy.)

To create an extension that used the code (not a library, but just pulled in the C code into a PHP extension), and parse a string of HTML took about 30 minutes.. - It took an extra 3 hours, on and off over a few days, to make it return a array of tokens (with attributes sorted into a sensible structure.)

So now I have a cute extension that has 1 function, and 1 result, KISS at it's best..

<?php
print_r(
flexyparser_tokenize(
file_get_contents("..some file...")
));

Outputs:


    [0] => Array
        (
            [0] => 14 // token type (look up the source)
            [1] =>    // data (tag name or string)
            [2] => 1  // line number
            [3] => 0  // character position
        )

    [1] => Array
        (
            [0] => 1
            [1] =>

            [2] => 2
            [3] => 50
        )

    [2] => Array
        (
            [0] => 2
            [1] => HTML
            [2] => 2
            [3] => 51
        )

    [3] => Array
        (
            [0] => 2
            [1] => HEAD
            [2] => 2
            [3] => 57
        )
.....
......
     [15] => Array
         (
            [0] => 2
            [1] => A
            [2] => 6
            [3] => 212
            [4] => Array  // array of attributes
                (
                    [HREF] => "/pub/WWW/Consortium/"
                )

        )

    [16] => Array
        (
            [0] => 2
            [1] => IMG
            [2] => 7
            [3] => 243
            [4] => Array
                (
                    [align] => bottom

                    [src] => "/pub/WWW/Icons/WWW/w3c_48x48"
                )

        )

the code is in my svn server, under akpear/flexyparser, works perfectly with PHP5 and PHP4 at the moment.

I really want to do a tree version of this, that loads data into a user defined object: eg.
<?php
$tree = flexyparser_toTree($data, new MyClass);

so it can be used 'how you want it...'

20

Read: 52 Making simple things easy, and difficult things possible. yet another html parser. 15

Previous Topic

Next Topic


	Web Artima.com