Java Buzz Forum - Google News: Complicated algorithms, but not the simple ones

dion

Posts: 5028
Nickname: dion
Registered: Feb, 2003

Dion Almaer is the Editor-in-Chief for TheServerSide.com, and is an enterprise Java evangelist

Google News: Complicated algorithms, but not the simple ones

Posted: Feb 18, 2006 2:22 PM

This post originated from an RSS feed registered with Java Buzz by dion.
Original Post: Google News: Complicated algorithms, but not the simple ones
Feed Title: techno.blog(Dion)
Feed URL: http://feeds.feedburner.com/dion
Feed Description: blogging about life the universe and everything tech

I have been working with a company that recently got added to Google News, which is great.

I assumed that Google would do a fantastic job of grokking the news from the site.

I was unfortunately wrong.

In time, we started to see our content appear in Google News, but the headlines were all screwed up. Elements from the right-hand sidebar of the site would show up as the headline for an article. It all seemed very random, too. Very strange indeed.

We contacted the Google News team, assuming that content providers could use some kind of microformat to help the Google document parser.

We would be very willing to mark up our headlines with something like either of these:

<h1 class="googlenews-headline header">Headline</h1>
<h1 rel="headline">Headline</h1>

No such thing existed. They thought that one of the problems was that our headline has a link within it (it acts as a permalink to the article itself). Their parser assumes that a headline cannot also be a link, so it ignores it.
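To make the idea concrete, here is a minimal sketch of how a scraper could honor a headline hint like the one above, written in Python with the standard library's html.parser. The class name googlenews-headline is purely hypothetical (as noted, no such convention existed), and HeadlineExtractor and extract_headlines are names made up for this illustration. Nothing here reflects how Google News actually parses pages.

```python
from html.parser import HTMLParser


class HeadlineExtractor(HTMLParser):
    """Collect the text of <h1> elements that carry a hypothetical hint class."""

    HINT_CLASS = "googlenews-headline"  # hypothetical: no such convention existed

    def __init__(self):
        super().__init__()
        self._in_headline = False
        self._parts = []
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            # A class attribute may be missing or valueless, hence the "or".
            classes = (dict(attrs).get("class") or "").split()
            if self.HINT_CLASS in classes:
                self._in_headline = True
                self._parts = []
        # Nested tags such as <a> are not treated specially, so a headline
        # that is also a permalink still yields its text.

    def handle_data(self, data):
        if self._in_headline:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "h1" and self._in_headline:
            self.headlines.append("".join(self._parts).strip())
            self._in_headline = False


def extract_headlines(html):
    """Return the text of every hinted headline found in the HTML string."""
    parser = HeadlineExtractor()
    parser.feed(html)
    parser.close()
    return parser.headlines
```

With this approach an unmarked h1 in a sidebar is skipped, while a marked headline wrapped in a link is still picked up, which is the behavior we were hoping to be able to ask for.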

As we went back and forth on this, I thought: wait a minute. Why are they bothering to screen-scrape our HTML when we have RSS feeds for everything?

Surely it would be simpler to grok our feed than scrape our HTML? *sigh*
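For comparison, a rough sketch of what grokking a feed involves, assuming a plain RSS 2.0 document and using Python's standard xml.etree.ElementTree. The function name feed_entries, the sample feed, and the example.com URLs are all invented for illustration.

```python
import xml.etree.ElementTree as ET


def feed_entries(rss_xml):
    """Return (title, link) pairs for each <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    entries = []
    for item in root.iter("item"):
        # findtext returns None when the child element is absent.
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        entries.append((title, link))
    return entries


# Made-up sample feed, not taken from any real site.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example site</title>
    <link>http://example.com/</link>
    <item>
      <title>First headline</title>
      <link>http://example.com/1</link>
    </item>
    <item>
      <title>Second headline</title>
      <link>http://example.com/2</link>
    </item>
  </channel>
</rss>"""

if __name__ == "__main__":
    for title, link in feed_entries(SAMPLE_FEED):
        print(title, link)
```

The headline is an explicit field in the feed, so there is no guessing about which element on the page is the title.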
