The Artima Developer Community

PHP Buzz Forum
Making simple things easy, and difficult things possible. yet another html parser.

Alan Knowles

Posts: 390
Nickname: alank
Registered: Sep, 2004

Alan Knowles is a freelance developer who works on PHP extensions and PEAR.
Posted: Nov 1, 2004 7:41 AM

This post originated from an RSS feed registered with PHP Buzz by Alan Knowles.
Original Post: Making simple things easy, and difficult things possible. yet another html parser.
Feed Title: Smoking toooo much PHP
Feed URL: http://www.akbkhome.com/blog.php/RSS.xml
Feed Description: More than just a blog :)
When John released his bindings to HTML Tidy, I joked with him that it would have been far more interesting (as a project) to write a proper HTML lexer rather than bind to an existing library (mainly because, having written one in PHP, I didn't think it would be that difficult, and I have a strange idea of fun...).

Well, over the weekend, I was re-pondering this, partly because I had used the Flexy parser to try to parse HTML from a web site and found that the tokenizer in Flexy was getting slower with age (5 seconds on average to parse a page). Normally this is not a huge issue, as the parsing is cached during the compiling phase of the template engine. It is a huge issue if you are pulling pages down, parsing out the forms, and reposting the forms in a web test script.

So over the weekend, after a little Google search-and-discover trip, I ran across a little W3C project, "A Lexical Analyzer for HTML and SGML". It looked interesting, but it wasn't until I pulled the code down, untarred it and built it that I realized it could be used to write a really fast and simple HTML tokenizer. (Not only that, it could easily form the basis of a C-based backend for Flexy.)

Creating an extension that used the code (not as a library; I just pulled the C code into a PHP extension) and parsed a string of HTML took about 30 minutes. It took an extra 3 hours, on and off over a few days, to make it return an array of tokens (with attributes sorted into a sensible structure).

So now I have a cute extension that has 1 function and 1 result, KISS at its best.

<?php
print_r(
    flexyparser_tokenize(
        file_get_contents("..some file...")
    )
);

Outputs:

[0] => Array
    (
        [0] => 14 // token type (look up the source)
        [1] => // data (tag name or string)
        [2] => 1 // line number
        [3] => 0 // character position
    )

[1] => Array
    (
        [0] => 1
        [1] =>

        [2] => 2
        [3] => 50
    )

[2] => Array
    (
        [0] => 2
        [1] => HTML
        [2] => 2
        [3] => 51
    )

[3] => Array
    (
        [0] => 2
        [1] => HEAD
        [2] => 2
        [3] => 57
    )
.....
......
[15] => Array
    (
        [0] => 2
        [1] => A
        [2] => 6
        [3] => 212
        [4] => Array // array of attributes
            (
                [HREF] => "/pub/WWW/Consortium/"
            )

    )

[16] => Array
    (
        [0] => 2
        [1] => IMG
        [2] => 7
        [3] => 243
        [4] => Array
            (
                [align] => bottom

                [src] => "/pub/WWW/Icons/WWW/w3c_48x48"
            )

    )

The code is in my svn server, under akpear/flexyparser, and works perfectly with PHP5 and PHP4 at the moment.
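
To give a feel for how the token array might be consumed (for instance when pulling links or form fields out of a page), here is a minimal sketch. It is not part of the extension: `sketch_find_tags()` and `SKETCH_START_TAG` are hypothetical names, the value 2 for "start tag" is only inferred from the sample output above, and the syntax assumes a PHP 7 or later interpreter rather than the PHP4/PHP5 of the extension itself. The tokens are hand-copied from the sample so the snippet runs without the extension installed.

```php
<?php
// Token layout as shown in the sample output:
// [0] type, [1] data (tag name or string), [2] line, [3] character position,
// [4] optional array of attributes.
// ASSUMPTION: type 2 marks a start tag. This is inferred from the sample
// output only, so check the extension source for the real values.
const SKETCH_START_TAG = 2;

/**
 * Collect the attribute arrays of every start tag with the given name.
 * Hypothetical helper, not provided by the extension.
 */
function sketch_find_tags(array $tokens, string $name): array
{
    $found = [];
    foreach ($tokens as $token) {
        if ((int) $token[0] !== SKETCH_START_TAG) {
            continue;
        }
        // Tag names in the sample are upper case; compare case-insensitively anyway.
        if (strcasecmp((string) $token[1], $name) !== 0) {
            continue;
        }
        // Tags without attributes have no index 4 in the sample.
        $found[] = $token[4] ?? [];
    }
    return $found;
}

// Hand-copied from the sample output, so this runs without the extension.
$tokens = [
    [2, 'HTML', 2, 51],
    [2, 'A', 6, 212, ['HREF' => '"/pub/WWW/Consortium/"']],
    [2, 'IMG', 7, 243, ['align' => 'bottom', 'src' => '"/pub/WWW/Icons/WWW/w3c_48x48"']],
];

foreach (sketch_find_tags($tokens, 'a') as $attributes) {
    // Attribute values keep their quotes in the sample, so strip them here.
    echo trim($attributes['HREF'] ?? '', '"'), "\n";
}
```

With the real extension, `$tokens` would presumably come straight from `flexyparser_tokenize()`. Note that attribute names appear in whatever case the page used (HREF vs. src in the sample), so a real caller would probably want to normalize the keys first.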

I really want to do a tree version of this that loads data into a user-defined object, e.g.:
<?php
$tree = flexyparser_toTree($data, new MyClass);

so it can be used 'how you want it...'
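
Until something like that exists in C, the same idea can be roughed out in userland on top of the flat token list. The sketch below is only that, a sketch: `sketch_to_tree()`, `SketchNode` and its `addChild()` method are hypothetical, the end-tag type value is a pure placeholder (the sample output only suggests that 2 is a start tag), the tokens and their line/character positions are made up for illustration, and it takes an already tokenized array rather than raw HTML. It assumes PHP 7.1 or later.

```php
<?php
// ASSUMPTIONS, not taken from the extension: the sample output only suggests
// that type 2 is a start tag. The end-tag value is a placeholder, and
// addChild() is a hypothetical convention for the user-defined object.
const SKETCH_TYPE_START = 2;
const SKETCH_TYPE_END   = 3; // placeholder value, check the source
// Elements that never have a closing tag, so they must not stay "open".
const SKETCH_VOID = ['IMG', 'BR', 'HR', 'INPUT', 'META', 'LINK'];

class SketchNode
{
    public $name;
    public $attributes;
    public $children = [];

    public function __construct(string $name = '#root', array $attributes = [])
    {
        $this->name = $name;
        $this->attributes = $attributes;
    }

    public function addChild(SketchNode $child): void
    {
        $this->children[] = $child;
    }
}

/**
 * Fold a flat token list into a tree using a stack of open elements.
 */
function sketch_to_tree(array $tokens, SketchNode $root): SketchNode
{
    $stack = [$root];
    foreach ($tokens as $token) {
        $type = (int) $token[0];
        $name = (string) $token[1];
        if ($type === SKETCH_TYPE_START) {
            $node = new SketchNode($name, $token[4] ?? []);
            end($stack)->addChild($node);
            if (!in_array(strtoupper($name), SKETCH_VOID, true)) {
                $stack[] = $node;
            }
        } elseif ($type === SKETCH_TYPE_END) {
            // Pop back to the nearest open element with the same name, if any.
            for ($i = count($stack) - 1; $i > 0; $i--) {
                if (strcasecmp($stack[$i]->name, $name) === 0) {
                    $stack = array_slice($stack, 0, $i);
                    break;
                }
            }
        }
        // Other token types (text, comments, ...) are ignored in this sketch.
    }
    return $root;
}

// Made-up tokens in the same layout as the sample output.
$tokens = [
    [SKETCH_TYPE_START, 'HTML', 1, 0],
    [SKETCH_TYPE_START, 'BODY', 1, 6],
    [SKETCH_TYPE_START, 'A', 2, 0, ['HREF' => '"/pub/WWW/Consortium/"']],
    [SKETCH_TYPE_START, 'IMG', 2, 30, ['align' => 'bottom']],
    [SKETCH_TYPE_END, 'A', 2, 60],
    [SKETCH_TYPE_END, 'BODY', 3, 0],
    [SKETCH_TYPE_END, 'HTML', 3, 7],
];

$tree = sketch_to_tree($tokens, new SketchNode());
print_r($tree);
```

Unmatched end tags are silently dropped and text tokens are skipped, which is fine for poking at page structure but nowhere near a real HTML tree builder.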


