The Artima Developer Community

Java Buzz Forum
Google News: Complicated algorithms, but not the simple ones

dion


Dion Almaer is the Editor-in-Chief for TheServerSide.com, and is an enterprise Java evangelist
Google News: Complicated algorithms, but not the simple ones Posted: Feb 18, 2006 2:22 PM

This post originated from an RSS feed registered with Java Buzz by dion.
Original Post: Google News: Complicated algorithms, but not the simple ones
Feed Title: techno.blog(Dion)
Feed URL: http://feeds.feedburner.com/dion
Feed Description: blogging about life the universe and everything tech

I have been working with a company that recently got added to Google News, which is great.

I assumed that Google would do a fantastic job of grokking the news from the site.

I was unfortunately wrong.

In time, we started to see our content appear in Google News, but the headlines were all screwed up. Elements from the right sidebar of the site would show up as the headline for an article, and it all seemed very random. Very strange indeed.

We contacted the Google News team, assuming that content providers could use some kind of microformat to help the Google document parser.

We would be very willing to say something like:

<h1 class="googlenews-headline header">Headline</h1>

or:

<h1 rel="headline">Headline</h1>

No such thing existed. They thought that one of the problems was that the headline had a link within it (as it acts as a permalink to itself). They assume that a headline cannot also be a link, so they ignore it.
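
For what it's worth, a link inside a headline need not be an obstacle. The following is a minimal Python sketch, not Google's parser (whose workings are not described here). It assumes the hypothetical `googlenews-headline` class from the markup above, and `HeadlineParser` and `extract_headlines` are illustrative names. It collects the text of marked `<h1>` elements and simply reads through any nested `<a>` tag.

```python
from html.parser import HTMLParser


class HeadlineParser(HTMLParser):
    """Collects the text of every <h1> that carries a marker class.

    The marker class is hypothetical: it is the one proposed in the post,
    not something Google News is known to support.
    """

    def __init__(self, marker="googlenews-headline"):
        super().__init__()
        self.marker = marker
        self.in_headline = False  # True while inside a marked <h1>
        self.parts = []           # text fragments of the current headline
        self.headlines = []       # finished headlines, in document order

    def handle_starttag(self, tag, attrs):
        if tag != "h1":
            return  # nested tags such as <a> are skipped, their text is kept
        classes = (dict(attrs).get("class") or "").split()
        if self.marker in classes:
            self.in_headline = True
            self.parts = []

    def handle_data(self, data):
        if self.in_headline:
            self.parts.append(data)

    def handle_endtag(self, tag):
        if tag == "h1" and self.in_headline:
            self.headlines.append("".join(self.parts).strip())
            self.in_headline = False


def extract_headlines(html_text, marker="googlenews-headline"):
    """Return the text of all <h1> elements marked with the given class."""
    parser = HeadlineParser(marker)
    parser.feed(html_text)
    parser.close()
    return parser.headlines
```

With markup like this, an unmarked `<h1>` in a sidebar would be ignored, while a marked headline that wraps its own permalink would still be picked up by its text.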

As we went back and forth on this, I thought: wait a minute. Why are they bothering to screen-scrape our HTML when we have RSS feeds for everything?

Surely it would be simpler to grok our feed than scrape our HTML? *sigh*
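
To make the comparison concrete, here is a minimal sketch of how little it takes to pull headlines and permalinks out of an RSS 2.0 feed with Python's standard library. The sample feed contents and the function name `headlines_from_rss` are made up for illustration, and the sketch assumes a plain RSS 2.0 feed with un-namespaced `<item>`, `<title>`, and `<link>` elements.

```python
import xml.etree.ElementTree as ET

# Made-up sample feed, standing in for a site's real RSS 2.0 feed.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Site</title>
    <link>http://example.com/</link>
    <item>
      <title>First headline</title>
      <link>http://example.com/first</link>
    </item>
    <item>
      <title>Second headline</title>
      <link>http://example.com/second</link>
    </item>
  </channel>
</rss>"""


def headlines_from_rss(xml_text):
    """Return a list of (title, link) pairs, one per <item> in the feed."""
    root = ET.fromstring(xml_text)
    return [
        (
            item.findtext("title", default="").strip(),
            item.findtext("link", default="").strip(),
        )
        for item in root.iter("item")
    ]
```

There is no guessing about which element on the page is the headline: the feed already says so explicitly for every item.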




Copyright © 1996-2019 Artima, Inc. All Rights Reserved. - Privacy Policy - Terms of Use