Tuesday 19 May 2009

Wolfram|Alpha and Natural Language

So, this blog seemed to not actually get off the ground. Soon after I started it my computer died, and had to be repaired. When I got it back I had forgotten about this blog mostly. Here, then, is the much delayed first real post, on a completely different subject than originally intended. I wonder if I still have anyone reading.

Wolfram|Alpha was launched yesterday to much fanfare on the internet. It is a computational knowledge engine (trademarked!) that attempts to answer queries by deconstructing the query and computing the answer from a structured database of objective information. This differs from a search engine like Google or a semantic search engine which both, in different ways, simply attempt to match existing text to the query. In Google's case this can be a problem if it misunderstands your query, and in a semantic search engine's case this can be a problem if the answer is buried in pages and pages of text. It may link you to a place where you can get an answer, but finding that answer will take more of your time.

Wolfram|Alpha is attempting to change all that by giving you actual answers instead of links to actual answers. Right now it seems best at very simple queries, ones that aren't even questions, okay at very basic formal questions, and downright horrible at the natural questions you might want to ask it. You input a date and it will try to tell you any notable events that happened on that date, how many days, months, years, etc. it's been since that date, what day of the week that date was, what day of the year it was. I say try because its database does seem lacking in the notable events category. When I queried 1 Jan 2000 it returned all its nice numerical data, and that it was New Year's Day, but nothing about it being the first day of the third millennium. Thinking the people who made Alpha might be as much sticklers about this as I am, I tried 1 Jan 2001, the real start of the new millennium, and all I got was that it was the day Ray Walston died. I clicked on Ray Walston and got that he was an actor, his place and date of birth and death, and that's it. Nothing more. No films he's been in, no awards won, nothing actually about him except very bare facts about his life.

The Wolfram|Alpha about page says:
We aim to collect and curate all objective data; implement every known model, method, and algorithm; and make it possible to compute whatever can be computed about anything.

The first bit pertains to the problem I had above, a simple lack of data. It's not a huge problem: data is something that will get added as time goes on, and there is even a function to submit data with a corroborating link for the Alpha staff to review. I do find it curious that I can look up tiny towns in Norway, yet a search for Kingston spits out the capital of Jamaica with an option for 12 other Kingstons around the world, and Alpha seems entirely clueless as to the existence of the original Kingston outside London. Isn't Stephen Wolfram British, wasn't he born in London? (He was.) Data, however, will be added, and it's not much of a gripe.

What I do have more of an issue with is the second part of the above aim, to implement every known model, method and algorithm. It's a great aim, and in many respects they've implemented a huge number of models, methods and algorithms, but almost all of them are direly mathematical. Queries about forenames give interesting graphs of usage and rankings over time, queries about countries mostly come back with statistics about those countries. It's very good at basic what and when questions, What is the United States?, When did Marilyn Monroe die?, and mildly okay at who questions, though it doesn't come back with as much data as I'd like. However, give it a How question and it pees itself. How did Marilyn Monroe die? or How was the United States created? and it comes back with Wolfram|Alpha isn't sure what to do with your input. In this case it seems that it's not that it doesn't have the data, it may or may not, but that it simply can't understand the question? Really? Really? I know PR bots on AIM that could answer that first one with I'm sorry, I don't know the manner of Marilyn Monroe's death. It wouldn't have an answer, but it would at least understand the syntax of the question.

Alpha is, in general, very bad at many natural language queries. It has a few that it is very good at, but ask it Who shot JFK? or When was slavery abolished in the United States? and it gives you a blank stare. I am going to conjecture on the reason for this, keeping in mind from now on that I don't know the actual reason, and I could be completely wrong.

Wolfram|Alpha is essentially a front-end for Mathematica, Wolfram Research's flagship product, sold at great expense to businesses, think tanks, and universities around the world. Mathematica is a very sophisticated computation program for scientists, engineers, mathematicians and the like. I used it in high school. It takes computational problems in a certain syntax and gives you an answer. This is why Alpha is so good at basic mathematical and statistical questions: it feeds Mathematica the query, gives it access to Alpha's knowledge database, and Mathematica does what it does best. Alpha takes your query and puts it into syntax the Mathematica engine can understand. Mathematica's job is to do the actual computation. In light of this it's obvious why Alpha is very good at the things it is very good at: they're the things Mathematica is very good at (and it should be, it's been around and developed for over 20 years).

What Mathematica can't do, at least not the versions I used and I can find nothing in current literature saying it can do it now, is process natural language. Alpha has essentially two jobs: to maintain a database for Mathematica to do computations on and to interpret human queries into a Mathematica-ready syntax. The first one is just the data problem again, and will be solved as more and more data gets added. The second is the natural language problem. Alpha needs to better understand how to compute language, and for this it needs to better use computational linguistics.

Linguistics, as a field, is the science of language. Linguists do experiments on language, to see how it behaves, to determine why it behaves the way it does, to determine why people use certain words certain ways. Many universities pair it with cognitive science. Computational linguistics is about how a computer can understand language by knowing certain things about words and how they should fit together and what it means when they do fit together. Very roughly a computer should interpret Who shot JFK? as there was a person and that person did something called shot to someone or something called JFK and the user wants to know who the person that did this is. The computer should be able to look up and figure out what shot and JFK mean and then compute the answer if it has the data available. This is a gross oversimplification of a very complex field, but there you go. Doesn't seem too hard, does it? Alpha can't answer it.
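To make that rough description a bit more concrete, here is a minimal Python sketch of the idea. It is purely my own illustration: the `Frame` type, the `parse` and `answer` functions and the `facts` table are all hypothetical names, it only handles questions shaped like "Who/What <verb> <object>?", and it says nothing about how Alpha works internally. The point is the separation between understanding the question and knowing the answer.

```python
import re
from typing import NamedTuple, Optional


class Frame(NamedTuple):
    wh: str    # question word, e.g. "who"
    verb: str  # the action being asked about, e.g. "shot"
    obj: str   # who or what the action was done to, e.g. "JFK"


# Toy pattern for "<who|what> <verb> <object>?" (an assumption for this
# sketch; real question parsing needs far more than one regular expression).
PATTERN = re.compile(r"^(who|what)\s+(\w+)\s+(.+?)\??$", re.IGNORECASE)


def parse(question: str) -> Optional[Frame]:
    """Turn a question into a structured frame, or None if it doesn't fit."""
    match = PATTERN.match(question.strip())
    if match is None:
        return None
    wh, verb, obj = match.groups()
    return Frame(wh.lower(), verb.lower(), obj)


def answer(question: str, facts: dict) -> str:
    """Look the frame up in a (verb, object) -> answer table."""
    frame = parse(question)
    if frame is None:
        return "I don't understand the question."
    known = facts.get((frame.verb, frame.obj.lower()))
    if known is None:
        # The question was understood; the data is simply missing.
        return f"I don't know {frame.wh} {frame.verb} {frame.obj}."
    return known
```

With an empty `facts` table, `answer("Who shot JFK?", {})` comes back with "I don't know who shot JFK.", which is the kind of reply I'd expect from something that understands the question but lacks the data.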

It's entirely possible it can't answer it because it doesn't actually know the answer, even though it may understand the question. To show that it has an actual problem parsing a very natural question it must be shown to know something, but not be able to answer a question about it. Let's take something I know Alpha does know, the flying time of an aircraft from Birmingham to London. If you ask Alpha What is the distance between Birmingham and London?, it lists direct travel times, including the time an aircraft would take. So I ask it How long does it take a plane to get from Birmingham to London? ... and I get a blank stare and an option to click on a supposedly related input, geometric plane surfaces.

So it seems there is something about my question Alpha simply can't understand. At first I thought it might not understand the word plane as being an aircraft instead of a geometric plane, hence the related input it wanted me to try, but that isn't the case. If I ask How long does it take to fly from Birmingham to London? it comes back with the answer, and even tells me that it has interpreted the question as Birmingham to London by Plane. Even though I never used the word plane in my query it understands that by fly I mean in a plane. So it knows what a plane is. What about the rest of the question? If I drop 'plane' from the question and just ask How long does it take to get from Birmingham to London? it happily gives me information which includes the flight time from Birmingham to London.

Alpha understands what How long does it take to get from Birmingham to London? is asking. It seems to understand what a plane is, in the flying sense. It knows how long a plane takes to get from Birmingham to London. But ask it "How long does a plane take to get from Birmingham to London?" and it utterly shits itself. It does this with any synonym of plane I tried: aircraft, airplane, aeroplane, etc.
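What I'd want is for all of these phrasings to collapse into the same underlying query. Here is a rough Python sketch of that, with the caveat that everything in it (the `TravelQuery` type, the `normalize` function, the pattern list) is something I made up for illustration, and a list of regular expressions is nowhere near real natural language processing. It is definitely not a description of what Alpha does.

```python
import re
from typing import NamedTuple, Optional


class TravelQuery(NamedTuple):
    origin: str
    destination: str
    mode: Optional[str]  # "plane", or None when no mode of travel is given


AIRCRAFT = r"(?:plane|aircraft|airplane|aeroplane)"
ROUTE = r"from (?P<o>.+) to (?P<d>.+?)\??$"

# Hypothetical phrasings, one pattern each. All are matched case-insensitively.
PATTERNS = [
    # "How long does it take a plane to get from X to Y?"
    re.compile(rf"how long does it take an? (?P<mode>{AIRCRAFT}) to get {ROUTE}", re.I),
    # "How long does a plane take to get from X to Y?"
    re.compile(rf"how long does an? (?P<mode>{AIRCRAFT}) take to get {ROUTE}", re.I),
    # "How long does it take to fly from X to Y?"
    re.compile(rf"how long does it take to (?P<mode>fly) {ROUTE}", re.I),
    # "How long does it take to get from X to Y?" (no mode mentioned)
    re.compile(rf"how long does it take to get {ROUTE}", re.I),
]


def normalize(question: str) -> Optional[TravelQuery]:
    """Map any supported phrasing onto one canonical travel query."""
    text = question.strip()
    for pattern in PATTERNS:
        match = pattern.match(text)
        if match:
            # Any aircraft word, or the verb "fly", means travel by plane.
            mode = "plane" if match.groupdict().get("mode") else None
            return TravelQuery(match.group("o"), match.group("d"), mode)
    return None
```

Once every phrasing ends up as the same `TravelQuery`, the lookup behind it only has to be written once, and the user never needs to know which wording happens to work.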

Alpha's own FAQ concedes that if Alpha comes back to you not understanding your query you might have to reformulate your question to get an answer. This seems very natural to people used to getting creative with their Google search terms in order to get an answer. But Google is not Alpha. Google was never meant to be something you asked a question and got a fact-based answer from, however nominally good at it it may be. Google is a search engine, not a question answering service. Alpha, however, is a question answering service, and it should be able to understand any question asked of it as long as it's asked with natural grammar and spelled correctly. Users shouldn't have to learn a specific way of asking or a specific syntax to use when asking these questions.

Sure, the technorati that first start using Alpha will be fine learning syntax and special ways to ask, but if Alpha ever wants to move on from them to internet users at large it needs to get better at natural language. Internet-illiterate Grandpa John isn't going to want to learn a whole new way to ask questions to find out roughly how long it'll take his grandson's plane to get to him, he's going to want to be able to just ask and get an answer back.

Alpha's primary goal seems to be to give answers to objective questions easily and quickly, avoiding the fiddliness of a Google search and clicking on links, or reading through half a Wikipedia article to find one fact. To simply make a query and get an answer in one try. At the moment it can't do it a lot of the time, mostly because it has a severe lack of useful data, but also because even when it does have the data it needs you to ask in a very specific way. It's a fun tool, and I've had fun playing with it, but it's not something I'll be using very often. My biggest worry is that once it does have all the information in the world the natural language problem will be even more apparent, as it'll have all the answers but no way to understand half the questions. Solution? Wolfram Research needs to hire some computational linguists, right now.