April 08, 2005
Google starts parsing language

Google has begun actually parsing the language on web pages so that it can answer simple factual questions by quoting those pages. As an example, here's the population of Denmark. I think Peter Norvig gave a talk on this at a recent O'Reilly conference, where he talked about using Google's vast amounts of data to beat traditional AI approaches to language parsing.

If my guess about how this works is correct, then Google's approach rhymes perfectly with some of my own, sadly unrealized, ideas on how to build parsers, and also with Jeff Hawkins' ideas on brain function as described in On Intelligence. The interesting idea is to use our sensory experience to build good prediction models for further sensory experience. In Google's case the sensory data consists of text utterances on millions of web pages. Google only lists one answer to a question, but I would be very surprised if that answer weren't in fact just the most likely one based on statistics derived from Google's database of text data.
In short, I'd be surprised if this wasn't just Google's own implementation of Googlism.
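
To make the guess above a bit more concrete, here is a minimal Python sketch of "pick the most likely answer from the statistics of actual utterances". Everything in it is an assumption made up for illustration: the toy PAGES corpus, the naive "X is Y." regular expression, and the function names build_is_table and most_likely_answer. It says nothing about how Google actually does this.

```python
import re
from collections import Counter, defaultdict

# Hypothetical toy corpus standing in for text utterances on web pages.
PAGES = [
    "Denmark is a country in Scandinavia. Denmark is a kingdom.",
    "Many people say Denmark is a country in Scandinavia.",
    "Jane Fonda is an actress. Jane Fonda is an actress and activist.",
    "Jane Fonda is an actress.",
]

# Naive assumption: a subject is a run of capitalized words, followed by
# " is " and then everything up to the next period.
IS_PATTERN = re.compile(r"\b([A-Z][a-z]+(?: [A-Z][a-z]+)*) is ([^.]+)\.")


def build_is_table(pages):
    """Count every "<subject> is <value>." utterance found in the pages."""
    table = defaultdict(Counter)
    for page in pages:
        for subject, value in IS_PATTERN.findall(page):
            table[subject][value.strip()] += 1
    return table


def most_likely_answer(table, subject):
    """Return the most frequently seen "is" value for a subject, or None."""
    counts = table.get(subject)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    table = build_is_table(PAGES)
    for name in ("Denmark", "Jane Fonda"):
        print(name, "is", most_likely_answer(table, name))
```

The table is effectively an "Is" property per string, filled in automatically, and the single answer shown is just the entry with the highest count. A real system would obviously need far better extraction than one regular expression, but the structure still comes from counting what people actually wrote.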

Some of the less pretty results - e.g. the answer to the question "Who is Jane Fonda" - also indicate that Google is actually storing these autogenerated assertions as metadata (i.e. maintaining an "Is" property for the string "Jane Fonda"). My guess is that such a database of actual utterances is your best shot at a good model of language and reasoning. Obviously you want to add structure to your model, but this too has to be based on the statistics of actual utterances.
If I were a stockholder in Cycorp - a company busy manually building a basic database of this kind of knowledge - I would be trying to get out of that investment. The odds of succeeding by parsing actual utterances in the metadata format natural to us (language) should be much higher than the odds of succeeding by building the database by hand.

