I'm working on a project to generate questions from sentences. Right now, I'm at a point where I can generate questions like:
"Angela Merkel is the chancelor of Germany." -> "Angela Merkel is who?"
Now, of course, I want the questions to look like "Who is...?" instead. Is there any easy way to do this that I haven't thought of yet?
My current idea would be to train an English(not quite question) -> English(question) translator, maybe using existing machine translation engines like moses. Is this overkill? How much data would I need? Are there corpora that address this or a similar problem? Is using a general translation engine even appropriate for this task?
Check out Michael Heilman's dissertation Automatic Factual Question Generation from Text for background on question generation and to see what his approach to this problem looks like. You can find more by searching for research on "question generation". He mentions a corpus from Microsoft: the Microsoft Research Question-Answering Corpus.
I don't think that an approach based solely on (current) statistical machine translation approaches is going to work that well, since you're usually going to need a deeper syntactic analysis of the source sentence to do a good job of generating an appropriate question. For simple questions like your example, it's pretty easy to design syntactic tree transformations to generate the question, but it gets much trickier as soon as the sentences get a little more complicated.
Off the top of my head, if you restrict yourself to relatively simple questions, you could do a parse, and then flip around the elements to get the question. How do you decide the question word though? Who, What, Where, Why... for this you'll need a classifier that looks at the elements of a sentence. Angela Merkel should be easy to classify as a person/name, so she gets s 'Who', Berlin should be in a dictionary of geos, so it gets 'Where'.
I'm not sure about specific software, but I'd probably do it with NLTK, using a dependency parse and then whatever classification scheme you feel like.
Ultimately your success depends on how big your input and output space is. I'd go for the absolute simplest possible problem first.
Related
I look for spell checker that could use language model.
I know there is a lot of good spell checkers such as Hunspell, however as I see it doesn't relate to context, so it only token-based spell checker.
for example,
I lick eating banana
so here at token-based level no misspellings at all, all words are correct, but there is no meaning in the sentence. However "smart" spell checker would recognize that "lick" is actually correctly written word, but may be the author meant "like" and then there is a meaning in the sentence.
I have a bunch of correctly written sentences in the specific domain, I want to train "smart" spell checker to recognize misspelling and to learn language model, such that it would recognize that even thought "lick" is written correctly, however the author meant "like".
I don't see that Hunspell has such feature, can you suggest any other spell checker, that could do so.
See "The Design of a Proofreading Software Service" by Raphael Mudge. He describes both the data sources (Wikipedia, blogs etc) and the algorithm (basically comparing probabilities) of his approach. The source of this system, After the Deadline, is available, but it's not actively maintained anymore.
One way to do this is via a character-based language model (rather than a word-based n-gram model). See my answer to Figuring out where to add punctuation in bad user generated content?. The problem you're describing is different, but you can apply a similar solution. And, as I noted there, the LingPipe tutorial is a pretty straightforward way of developing a proof-of-concept implementation.
One important difference - to capture more context, you may want to train a larger n-gram model than the one I recommended for punctuation restoration. Maybe 15-30 characters? You'll have to experiment a little there.
I'm looking for an alternative to JMegahal that is just as simple, and easy to use, but yields better results. I know JMegahal uses Markov chains to generate new strings, and I know that they're not necessarily the best. I was pointed towards Bayesian Network as the best conceptual solution to this problem, but I cannot find any libraries for Java that are easy to use at all. I saw WEKA, but it seemed bloated, and hard to follow. I also saw JavaBayes, but it was almost completely undocumented (their javadocs contained little to no information, and the variables were poorly named) and the library was blatantly written in C-style, making it stand out in Java.
You might want to consider extending JMegahal to filter the input sentences. Back in the mid-90s, Jason Hutchens had written a C version of this 4th-order Markov strings algorithm (it was probably used as inspiration for the JMegahal implementation actually). At that time, Jason has added filters to improve on the implementations (by replacing 'you' by 'I', etc...). By doing some basic string manipulation meant to change the subject from the speaker to the system, the output became a lot more coherent. I think the expanded program was called HeX.
Reference 1
Reference 2
I am working in a delivery company. We currently solve 50+ locations routes by "hand".
I have been thinking about using Google Maps API to solve this problem, but I have read that there is a 24 points limit.
Currently we are using rails in our server so I am thinking about using a ruby script that would get the coordinates of the 50+ locations and output a reasonable solution.
What algorithm would you use to approach this problem?
Is Ruby a good programming language to solve this type of problem?
Do you know of any existing ruby script?
This might be what you are looking for:
Warning:
this site gets flagged by firefox as attack site - but I doesn't appear to be. In fact I used it before without a problem
[Check revision history for URL]
rubyquiz seems to be down ( has been down for a bit) however you can still check out WayBack machine and archive.org to see that page:
http://web.archive.org/web/20100105132957/http://rubyquiz.com/quiz142.html
Even with the DP solution mentioned in another answer, that's going to require O(10^15) operations. So you're going to have to look at approximate solutions, which are probably acceptable given that you currently do them by hand. Look at http://en.wikipedia.org/wiki/Travelling_salesman_problem#Heuristic_and_approximation_algorithms
Here are a couple of tricks:
1: Lump locations that are relatively close into one graph, and turn those locations into a single node in your main graph. This lets you be greedy without too much work.
2: Use an approximation algorithm.
2a: My favorite is bitonic tours. They're pretty easy to hack up.
See Update
Here's a py lib with a bitonic tour and here's another
Let me go look for a ruby one. I'm having trouble finding more than just the RGL, which has efficiency issues....
Update
In your case, the minimum spanning tree attack should be effective. I can't think of a case where your cities wouldn't meet the triangle inequality. This means that there should be a relatively sort of kind of almost fast rather decent approximation. Particularly if the distance is euclidean, which I think, again, it must be.
One of the optimized solution is using Dynamic Programming but still very expensive O(2**n), which is not very feasible, unless you use some clustering and distributing computing, ruby or single server won't be very useful for you.
I would recommend you to come up with a greedy criteria instead of using DP or brute force which would be easier to implement.
Once your program ends you can do some memoization, and store the results somewhere for later lookups, which can as well save you some cycles.
in terms of the code, you ll need to implement vertices, edges that have weights.
ie: vertex class which have edges with weights, recursive. than a graph class that will populate the data.
I worked on using meta-heurestic algorithms such as Ant Colony Optimazation to solve TSP problems for the Bays29 (29-city) problem, and it gave me close to optimal solutions in very short time. You can potentially use the same.
I wrote it in Java though, I will link it here anyways, because I am currently working on a port to ruby:
Java: https://github.com/mohammedri/ant_colony_java_TSP
Ruby: https://github.com/mohammedri/aco-ruby (incomplete)
This is the dataset it solves for: https://github.com/jorik041/osmsharp/blob/master/Core/OsmSharp.Tools/Benchmark/TSPLIB/Problems/TSP/bays29.tsp
Keep in mind I am using the Euclidean distance between each city i.e. the straight line distance, I don't think that is ideal in a real life situation considering roads and a city map etc. but it may be a good starting point :)
If you want the cost of the solution produced by the algorithm is within 3/2 of the optimum then you want the Christofides algorithm. ACO and GA don't have a guaranteed cost.
What would be an intelligent way to store text, so that it can be intelligently parsed and translated later on.
For example, The employee is outstanding as he can identify his own strengths and weaknesses and is comfortable with himself.
The above could be the generic text which is shown to the user prior to evaluation. If the user is a Male (say Shaun) or female (say Mary), the above text should be translated as follows.
Mary is outstanding as she can identify her own strengths and weaknesses and is comfortable with herself.
Shaun is outstanding as he can identify his own strengths and weaknesses and is comfortable with himself.
How do we store the evaluation criteria in the first place with appropriate place or token holders. (In the above case employee should be translated to employee name and based on his gender the words he or she, himself or herself needs to be translated)
Is there a mechanism to automatically translate the text with the above information.
The basic idea of doing something like this is called Mail Merge.
This page seems to discus how to implement something like this in Ruby.
[Edit]
A google search gave me this - http://freemarker.org/
I don't know much about this library, but it looks like what you need.
This is a very broad question in the field of Natural Language Processing. There are numerous ways to go around it, the questions you asked seem too broad.
If I understand correctly part of your question this could be done this way :
#variable{name} is outstanding as #gender{he/she} can identify #gender{his/hers} own strengths and weaknesses and is comfortable with #gender{himself/herself}.
Or:
#name is outstanding as #he can identify #his own strengths and weaknesses and is comfortable with #himself.
... if gender is the major problem.
I have had some experience working with a tool called Grammatica, when building a custom user input excel like formula parsing and evaluation engine. It may not be to the level of sophistication you're looking for but it's a start. This basically uses many of the same concepts that popular code compiler parsers employ. It's definitely worth checking out.
I agree with Kornel, this question is too broad. What you seem to be talking about is semantics for which RDF's and OWL can be a good starting point. Read about modeling semantics using markup and you can work your way up from there.
Does a tool exist that can parse text and output that text, hyper-linked to Wikipedia entries for words of interest?
For example, I'd like a tool that could turn something like:
The most popular search algorithm on a
sorted list is the binary search.
Into:
The most popular search algorithm on a
sorted list is the binary search.
It would be wonderful if Wikipedia had an API which would do this since they would be best equipped to determine what "words of interests" are.
In my example I simply linked all combinations which linked directly to an entry except for The and most.
There is a tool that does exactly what you're asking for.
http: //wikify.appointment.at/
It's not perfect, but it works.
You have two separate problems to solve here:
Deciding which words should be linked
Determining if there's a suitable entry to link these words to
Now, (2) is simpler, though it's also somewhat problematic. Wikipedia seems to have an API that allows you to gather data efficiently, and they also allow "screen scraping". But there's a problem with disambiguation - sometimes you might hit not the entry you wanted. For example, python links to a disambiguation page, as it can be a programming language, a snake and a couple of other things.
(1) Is much harder, though. You can take the "simple approach" and attempt to find links for all non-trivial nouns (or even noun/adjective pairs). Non-trivial here means omitting words like "fiend, word, computer" etc.
But This would result in a plethora of links, which isn't convenient to read. It's really up to you to decide what's interesting in the text, and this depends a lot on the text itself. In an article for professional programmers, do you really want to link to "search algorithm" every time? But for beginners, perhaps you do.
To conclude, I strongly doubt there's a single general-purpose tool that will do the trick for you. But you surely have all the options at your hand, and something need-specific can be coded without too much effort.
Silviu Cucerzan of Microsoft Research tackled this problem. Well, not the problem of inserting the links, but the general issue of determining what entities are being mentioned in a some piece of text. Fortunately for you, he used Wikipedia articles as his set of entities. His paper, "Large-Scale Named Entity Disambiguation Based on Wikipedia Data", is available on his website. Direct link: pdf.