The Artima Developer Community
.NET Buzz Forum
Regex, HTML, and my sanity

Eric Gunnerson


Eric Gunnerson is a program manager on the Visual C# team
Regex, HTML, and my sanity Posted: Feb 26, 2004 8:56 AM

This post originated from an RSS feed registered with .NET Buzz by Eric Gunnerson.
Original Post: Regex, HTML, and my sanity
Feed Title: Eric Gunnerson's C# Compendium
Feed Description: Eric comments on C#, programming and dotnet in general, and the aerodynamic characteristics of the red-nosed flying squirrel of the Lesser Antilles
The answer I came up with is at the bottom. But first, a brief digression.

There were several responses to my regex puzzle. They can be grouped into:

  1. Here's how you do it
  2. Here's how you do it without using regex
  3. Using regex on this problem will cause the Dow to drop and the end of Western civilization as we know it (Question: Would that mean that eastern civilization would take over? Discuss this in your group, and be ready to present to the larger group when we get back together)
  4. How *do* you do that?

#1 was the kind of response I expected. My original idea was to highlight a regex technique that made this a lot easier and more robust than the code I had seen suggested.

#2 is interesting. Clearly, if you can find a good library - and it's not more effort to prove that it is good than to create your own (remembering that you always underestimate how much effort it is to do it yourself) - you should use it. But that really wasn't the point of my post - my question was “how would you do it using regex?”

Which brings us to #3. While I agree with Raymond that there are cases where regex is more trouble than it's worth - something like brace matching comes to mind - I'm not sure that I agree in this case. You can't use an XML approach because HTML isn't required to be well-formed, which means you're either using a library or writing custom code. I'm not convinced that custom code is going to be more robust than a well-written regex without a fair bit of testing, and I do know that it would be just as easy (perhaps easier) to write custom code that isn't robust as it would be to write a regex that isn't robust.

On to our solution. Note that I'm not claiming that this is a robust and tested solution - I'm more interested in showing off a regex technique. If you want to use it for real, be sure to test it well.

Conventional regex systems would require us to enumerate every tag that we want to replace. In that direction lies madness, as it's pretty likely you won't get it right. The example I saw, for instance, didn't even replace “<script>”. But .NET regex (and current Perl syntax, IIRC...) allows you to use zero-width assertions to specify what you don't want to match.

The first step is to create something that matches an XML-style tag. The simplest version is:

<.+?>

which works great if there are no embedded “>” characters inside the tag. To be able to handle a quoted attribute such as

 <button text="<Hello>">

we'll need to modify the regex to handle that case specifically. Here's the regex to do it:

(                    # group
[^"]+?                 # One or more non-" chars. Matches tag with no quotes. non-greedy
|                      # or
                       # match something like <fred a="<5>">
.+?                     # Everything up to ", non-greedy
"                       # literal "
.*?                     # zero or more characters inside the quotes, non-greedy
"                       # literal "
.*?                     # zero or more characters after quote, non-greedy
)
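To make the difference concrete, here is a minimal sketch that runs both patterns against the quoted-attribute example. It uses Python's `re` module rather than .NET, on the assumption that this syntax behaves the same in both; `re.VERBOSE` plays the role of .NET's `RegexOptions.IgnorePatternWhitespace`. The names `SIMPLE_TAG` and `QUOTE_AWARE_TAG` are made up for illustration, and the group is wrapped in `<` and `>` so it can be tried on its own.

```python
import re

# The simplest version from the text.
SIMPLE_TAG = re.compile(r"<.+?>")

# The group from the text, wrapped in < and > (hypothetical name).
QUOTE_AWARE_TAG = re.compile(
    r"""
    <
    (
      [^"]+?        # tag with no quotes, non-greedy
      |
      .+?".*?".*?   # tag with one quoted attribute value
    )
    >
    """,
    re.VERBOSE,
)

sample = '<button text="<Hello>">'

simple_match = SIMPLE_TAG.match(sample).group(0)
quote_aware_match = QUOTE_AWARE_TAG.match(sample).group(0)

print(simple_match)       # stops at the > inside the attribute value
print(quote_aware_match)  # covers the whole tag
```

The simple pattern stops at the first “>”, which sits inside the attribute value, while the quote-aware version carries on to the real end of the tag.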

Now that we have that, we have to tell it what tags not to match. We can do that with a negative lookahead:

(?!br|/br|p|/p)

The key to lookaheads and lookbehinds is that they don't consume any characters. So this says: it's okay to match at this point unless the text that follows is one of “br”, “/br”, “p”, or “/p” (yes, you'd need to use a case-insensitive match to cover both upper and lowercase versions).
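A small sketch of the zero-width behavior, again in Python's `re` as a stand-in for .NET (assuming this syntax behaves the same in both). `NOT_KEPT` and `WHOLE_NAME` are hypothetical names. The second pattern is not from the post: it is one possible tightening, added here only to show that the alternatives as written are prefix tests, so a tag such as `<pre>` is also excluded.

```python
import re

# "<" followed by the negative lookahead from the text.
# IGNORECASE covers <BR>, <P>, and so on.
NOT_KEPT = re.compile(r"<(?!br|/br|p|/p)", re.IGNORECASE)

m = NOT_KEPT.match("<script>")
consumed = m.end()                       # only "<" was consumed

br_match = NOT_KEPT.match("<br>")        # None: lookahead rejects it
close_p_match = NOT_KEPT.match("</P>")   # None: case-insensitive
pre_match = NOT_KEPT.match("<pre>")      # None as well: "p" is a prefix of "pre"

# Assumed variant (not the author's): require the tag name to end there.
WHOLE_NAME = re.compile(r"<(?!/?(?:br|p)\b)", re.IGNORECASE)
pre_match_strict = WHOLE_NAME.match("<pre>")   # matches "<"
p_match_strict = WHOLE_NAME.match("<p>")       # None

print(consumed, br_match, pre_match, pre_match_strict is not None)
```

Whether the stricter form is wanted depends on which tags should survive; the post itself makes no claim about `<pre>`.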

Lookahead is a great feature to have if you're trying to do more than one thing in a regex. Here's the full regex.

<                    # opening < of the tag
(?!br|/br|p|/p)      # negative lookahead. Match will fail if any of these are present
(                    # group
[^"]+?                 # One or more non-" chars. Matches tag with no quotes. non-greedy
|                      # or
                       # match something like <fred a="<5>">
.+?                     # Everything up to ", non-greedy
"                       # literal "
.*?                     # zero or more characters inside the quotes, non-greedy
"                       # literal "
.*?                     # zero or more characters after quote, non-greedy
)
>                   # close of tag
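As a rough usage sketch, the full pattern can be dropped into a replace call. This again uses Python's `re` instead of .NET, on the assumption that this syntax behaves the same in both: `re.VERBOSE` and `re.IGNORECASE` stand in for `RegexOptions.IgnorePatternWhitespace` and `RegexOptions.IgnoreCase`. The function name `strip_tags` and the sample markup are invented for illustration, and, as noted above, this is not a tested or robust solution.

```python
import re

STRIP_TAGS = re.compile(
    r"""
    <                    # opening < of the tag
    (?!br|/br|p|/p)      # negative lookahead: leave these tags alone
    (
      [^"]+?             # tag with no quotes, non-greedy
      |
      .+?".*?".*?        # tag with a quoted attribute value
    )
    >                    # close of tag
    """,
    re.VERBOSE | re.IGNORECASE,
)

def strip_tags(html):
    """Remove every tag the pattern matches; <br> and <p> tags are kept."""
    return STRIP_TAGS.sub("", html)

html = ('<p>Hi <b>there</b><br><script src="a.js"></script> '
        '<button text="<Hello>">ok</button></p>')

print(strip_tags(html))
```

Only the tags are removed; any text between them (including the body of a script element) is left in place, which is one more reason to test carefully before using this for real.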
