metaphone versus soundex versus NYSIIS - machine-learning

I'm trying to come up with an implicit spell checker that will map input words to some kind of more general phonetic representation to account for typos, basically for a search bar that will automatically correct your spelling to a degree. Three algorithms I've been looking into are Metaphone, NYSIIS and Soundex, but I don't really know which would be better for this application.
I would prefer more matches rather than fewer, and I would like the matching to be fairly general, so for that reason I was thinking of going with Soundex, which seems to be a more approximate mapping than the original Metaphone; but I don't really know how large the difference in vagueness is. I know that NYSIIS is fairly similar to Soundex, but I don't have a good idea of how similar they are, or how NYSIIS compares to Metaphone.
I am also looking for the solution that is quickest to execute. I know these phonetic mappers are usually pretty fast, but I'm not sure which is fastest; since I'd like to check spelling without increasing search time, speed is a consideration. Thoughts?

I managed to find a wonderful article on this over here:
http://www.informit.com/articles/article.aspx?p=1848528
Not quite everything I was looking for, but a pretty large amount of it.
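For what it's worth, a quick way to get a feel for both questions (how coarse each encoding is, and how fast it runs) is to encode a handful of deliberate misspellings with each algorithm and time it. The sketch below uses the jellyfish Python library as one convenient implementation (an assumption; any Soundex/Metaphone/NYSIIS implementation will do), and the word list is purely illustrative.

    import timeit
    import jellyfish

    # Deliberate misspellings: if two spellings get the same code, the encoding
    # would treat them as a match in the search bar. Fewer distinct codes across
    # a group of misspellings means a coarser (more forgiving) encoding.
    words = ["caesar", "ceasar", "sezar", "knight", "night", "nite"]

    for w in words:
        print(w, jellyfish.soundex(w), jellyfish.metaphone(w), jellyfish.nysiis(w))

    # Rough speed comparison: encode the whole list repeatedly with each algorithm.
    for name, fn in [("soundex", jellyfish.soundex),
                     ("metaphone", jellyfish.metaphone),
                     ("nysiis", jellyfish.nysiis)]:
        t = timeit.timeit(lambda: [fn(w) for w in words], number=10000)
        print(name, round(t, 3), "seconds")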

Related

reading/parsing common lisp files from lisp without all packages available or loading everything

I'm doing a project which involves parsing the histories of common lisp repos. I need to parse them into list-of-lists or something like that. Ideally, I'd like to preserve as much of the original source file syntax as possible, in some way. For example, in the case of the text #+sbcl <something>, which I think means "If our current lisp is sbcl, read <something>, otherwise skip it", I'd like to get something like (#+ 'sbcl <something>).
I originally wrote a LALR parser in Python, which sort of worked, but it's not ideal for many reasons. I'm having a lot of difficulty getting correct output, and I have tons of special cases to add.
I figured that what I should really do is use lisp itself, since it already has a lisp parser built in. If I could just read a file into sexps, I could dump it into something (cl-json would do) for further processing down the line.
Unfortunately, when I attempt to read https://github.com/fukamachi/woo/blob/master/src/woo.lisp, I get the error
There is no package with the name WOO.EV.TCP
which is of course coming from line 80 of that file, since that package is defined in src/ev/tcp.lisp, and we haven't read it.
Basically, is it possible to just read the file into sexps without caring whether the packages are defined or if they contain the relevant symbols? If so, how? I've tried looking at the hyperspec reader documentation, but I don't see anything that sounds relevant.
I'm out of practice with actually writing common lisp, but it seems potentially possible to hack around this by handling the undefined package condition by creating a blank package with that name, and handling the no-symbol-of-that-name-in-package condition by just interning a given symbol. I think. I don't know how to actually do this, I don't know if it would work, I don't know how many special cases would be involved. Offhand, the first condition is called no-such-package, but the second one (at least in sbcl) is called simple-error, so I don't even know how to determine whether this particular simple-error is the no-such-symbol-in-that-package error, let alone how to extract the relevant names from the condition, fix it, and restart. I'd really like to hear from a common lisp expert that this is the right thing to do here before I go down the road of trying to do it this way, because it will involve a lot of learning.
It also occurs to me that I could fix this by just sed-ing the file before reading it. E.g. turning woo.ev.tcp:start-listening-socket into, say, woo.ev.tcp===start-listening-socket. I don't particularly like this solution, and it's not clear that I wouldn't run into tons more ugly special cases, but it might work if there's no better answer.
I am almost sure there is no easy portable way to do this for a number of reasons.
(Just limiting things to the non-existent-package problem for now.)
First of all there is no portable access into the bit of the reader which decides that tokens are going to be symbols and then looks for package markers &c: that just happens according to the rules in section 2.3 of the standard. So you can't easily intervene in this.
Secondly, the conditions the reader might signal don't portably carry enough information for you to handle them.
There are several possible ways out of this bit of the problem.
If you felt sufficiently heroic you might be able to teach the reader that all of the token-starting characters are in fact things you control and then write a token-reader that somehow deals with the whole package thing by returning some object which isn't a symbol. But to do that you need to deal with numbers, and if you think that's simple, well, it's not.
If you felt less heroic you could write a more primitive token-reader which doesn't even try to deal with anything except grabbing all the characters needed, and returns some kind of object which wraps a string. This would avoid the whole number problem at the cost of losing a lot of information.
If you don't care about portability, find an implementation, understand how its reader does it, and muck around with it. There are more open source or source-available implementations than I can easily count (perhaps I am not very good at counting) so this is a pretty good approach. It's certainly what I'd do.
But this is only the start of the problems. The CL reader is hairy and, in its standard configuration (the configuration which is used for things like compile-file unless people have arranged otherwise) can run completely arbitrary code at read time, including code which modifies the reader itself, some of which may do so in an implementation-dependent way. And people use this: there's a reason Lisp is called the 'programmable programming language' and it's that people program it.
I've decided to solve this using sed (actually Python's re.sub, but who's counting?) because it'll work for my actual use case, and was easy.
For future readers: The various people saying this is impossible in general are probably right. The other questions posted by @Svante look like good easy ways to solve part of the problem. Other parts of the problem might be solved more elegantly by replacing the reader macros for #., #+, #-, etc. with ones which just make a list, which sounds less heroic than the suggestions from @tfb, but I don't have time for that shit.
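Roughly, the re.sub approach mentioned above looks like the sketch below: rewrite package-qualified symbols such as woo.ev.tcp:start-listening-socket into woo.ev.tcp===start-listening-socket before reading. The regex is an illustrative assumption and deliberately ignores strings, comments and uninterned symbols, all of which would need extra care in practice.

    import re

    # package name, one or two colons, then the symbol name; keywords like :foo are
    # left alone because they have nothing before the colon.
    QUALIFIED = re.compile(r"([A-Za-z0-9.+*/<>=!?$%&_-]+)::?([A-Za-z0-9.+*/<>=!?$%&_-]+)")

    def neuter_packages(source):
        # NOTE: this also rewrites colons inside strings and comments; good enough
        # for a rough corpus pass, not for anything that has to round-trip.
        return QUALIFIED.sub(r"\1===\2", source)

    with open("woo.lisp") as f:
        print(neuter_packages(f.read()))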

Find almost-duplicate strings in Objective-C on iOS

I have a list of song tracks that I pulled from the iTunes API. Some of them are duplicates, but not exact duplicates. For example, one pair might be "All 4 u" vs "All for you", or "Some song" vs "Some song feat. some other artist".
I want to be able to identify the duplicates. Is the best way to compute the Levenshtein distance for all pairs? That seems excessive.
I'm working in the Cocoa Touch framework for iOS programming so if anyone knows of any libraries that would help a lot.
Why do you consider computing the Levenshtein distance excessive? What algorithm would you use if you were sitting down to a list with pencil and paper?
That said, Levenshtein is likely necessary, but not sufficient. I would start by normalizing the strings. In some cases, a string might normalize a couple of ways and you'll need to keep both. Normalization would look like the list below (a rough code sketch follows it):
Convert to lowercase
Strip any leading numbers followed by punctuation ("1.", "1 - ", etc.)
Tentatively strip anything after "feat." or "with"
This is an example of special knowledge about your problem set. You're going to have to use a lot of special knowledge like this.
"Tentatively" means you should probably keep both the stripped and non-stripped versions of the string
Keep in mind that things including "feat." might be remixes, so you have to be careful about assuming duplicates. This is of course true of almost any attempt at de-dupping. There are often multiple versions.
Tentatively expand common abbreviations (u=>you, 4=>for, 2=>two, w/=>with, etc. etc.)
Tentatively strip anything in parentheses
Strip English articles (a, an, the). Maybe even strip all very short words (3 or less characters) as a first pass.
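Here is a rough sketch of those normalization steps, in Python for brevity (the question is about Cocoa Touch, but the string handling ports directly to NSString/NSRegularExpression). The abbreviation table and the regexes are illustrative assumptions, not a complete list.

    import re

    ABBREVIATIONS = {"u": "you", "4": "for", "2": "two", "w/": "with"}
    ARTICLES = {"a", "an", "the"}

    def normalizations(title):
        """Return a set of normalized variants of the title worth keeping."""
        s = title.lower()
        s = re.sub(r"^\s*\d+\s*[.)-]*\s*", "", s)            # leading "1.", "1 - ", ...
        words = [ABBREVIATIONS.get(w, w) for w in s.split()]
        s = " ".join(w for w in words if w not in ARTICLES)
        variants = {s}
        variants.add(re.sub(r"\s*\(.*?\)", "", s).strip())    # tentatively drop (...)
        variants.add(re.split(r"\s+(?:feat\.|with)\s+", s)[0].strip())  # drop "feat. X"
        return {v for v in variants if v}

    print(normalizations("1. All 4 U (Remix) feat. Some Artist"))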
Doing this well is complicated and will require a lot of trial and error. I've done a lot of contact de-duping in the past, and one piece of advice: start conservative. It is very easy to accidentally de-dupe way too much. Build a big list of test data that you've de-duped by hand and test, test, test after every algorithm change. Make sure your UI can present the user with anything you're uncertain about, because there are going to be many, many records you can't be certain about. (This is true even when you do it by hand. Look at a big list of human-entered titles and tell me with 100% certainty which ones are duplicates without listening to the tracks. A computer isn't going to do better than you at this.)
I'm not aware of any publicly available library for this. It's been solved by many people many times (search for "dedupe song titles" or anything similar). But it's generally commercial software.
One more piece of advice for this, since it's a huge O(n^2) or worse problem. Look for bucketing opportunities. If you can match artists first, then albums, then tracks, you can divide and conquer in much less time.
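To make the bucketing concrete, here is a sketch (again Python for brevity) that groups tracks by artist and only runs Levenshtein within each bucket. The sample data and the distance threshold are illustrative assumptions; tune the threshold against a hand-de-duped test set, and ideally run the normalization above first.

    from collections import defaultdict

    def levenshtein(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,               # deletion
                               cur[j - 1] + 1,            # insertion
                               prev[j - 1] + (ca != cb))) # substitution
            prev = cur
        return prev[-1]

    tracks = [("Prince", "All For You"), ("Prince", "All For U"), ("Queen", "All For You")]

    buckets = defaultdict(list)
    for artist, title in tracks:
        buckets[artist.lower()].append(title)

    for artist, titles in buckets.items():
        for i in range(len(titles)):
            for j in range(i + 1, len(titles)):
                a, b = titles[i].lower(), titles[j].lower()
                # flag if the edit distance is small relative to the longer title
                if levenshtein(a, b) <= max(len(a), len(b)) // 3:
                    print("possible duplicate:", artist, "-", titles[i], "/", titles[j])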

Alternatives to JMegahal

I'm looking for an alternative to JMegahal that is just as simple, and easy to use, but yields better results. I know JMegahal uses Markov chains to generate new strings, and I know that they're not necessarily the best. I was pointed towards Bayesian Network as the best conceptual solution to this problem, but I cannot find any libraries for Java that are easy to use at all. I saw WEKA, but it seemed bloated, and hard to follow. I also saw JavaBayes, but it was almost completely undocumented (their javadocs contained little to no information, and the variables were poorly named) and the library was blatantly written in C-style, making it stand out in Java.
You might want to consider extending JMegahal to filter the input sentences. Back in the mid-90s, Jason Hutchens wrote a C version of this 4th-order Markov strings algorithm (it was probably the inspiration for the JMegahal implementation, actually). At that time, Jason added filters to improve the output (replacing 'you' with 'I', etc.). By doing some basic string manipulation meant to change the subject from the speaker to the system, the output became a lot more coherent. I think the expanded program was called HeX.
Reference 1
Reference 2
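To make the filtering idea concrete, here is a minimal sketch of that kind of person-flipping filter (in Python for brevity; JMegahal itself is Java). The mapping table is an illustrative assumption, not what HeX actually used.

    SWAPS = {"i": "you", "me": "you", "my": "your", "am": "are",
             "you": "i", "your": "my", "yours": "mine"}

    def flip_person(sentence):
        # Each token is looked up exactly once, so "i" -> "you" and "you" -> "i"
        # don't fight each other within a single pass.
        words = sentence.lower().split()
        return " ".join(SWAPS.get(w, w) for w in words)

    print(flip_person("You said my idea was bad"))
    # -> "i said your idea was bad"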

English query generation through machine translation systems

I'm working on a project to generate questions from sentences. Right now, I'm at a point where I can generate questions like:
"Angela Merkel is the chancelor of Germany." -> "Angela Merkel is who?"
Now, of course, I want the questions to look like "Who is...?" instead. Is there any easy way to do this that I haven't thought of yet?
My current idea would be to train an English (not-quite-question) -> English (question) translator, maybe using existing machine translation engines like Moses. Is this overkill? How much data would I need? Are there corpora that address this or a similar problem? Is using a general translation engine even appropriate for this task?
Check out Michael Heilman's dissertation Automatic Factual Question Generation from Text for background on question generation and to see what his approach to this problem looks like. You can find more by searching for research on "question generation". He mentions a corpus from Microsoft: the Microsoft Research Question-Answering Corpus.
I don't think that an approach based solely on (current) statistical machine translation approaches is going to work that well, since you're usually going to need a deeper syntactic analysis of the source sentence to do a good job of generating an appropriate question. For simple questions like your example, it's pretty easy to design syntactic tree transformations to generate the question, but it gets much trickier as soon as the sentences get a little more complicated.
Off the top of my head, if you restrict yourself to relatively simple questions, you could do a parse and then flip around the elements to get the question. How do you decide the question word, though? Who, What, Where, Why... for this you'll need a classifier that looks at the elements of the sentence. Angela Merkel should be easy to classify as a person/name, so she gets a 'Who'; Berlin should be in a dictionary of geos, so it gets a 'Where'.
I'm not sure about specific software, but I'd probably do it with NLTK, using a dependency parse and then whatever classification scheme you feel like.
Ultimately your success depends on how big your input and output space is. I'd go for the absolute simplest possible problem first.
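In that spirit, here is a rough sketch of the simplest possible case, using NLTK's off-the-shelf tagger and named-entity chunker to pick the question word. The naive split on " is " and the WH-word mapping are illustrative assumptions that only cover the trivial copular pattern, and the NER output will vary with the model.

    # Requires the NLTK data packages: punkt, averaged_perceptron_tagger,
    # maxent_ne_chunker, words (see nltk.download()).
    import nltk

    WH_BY_ENTITY = {"PERSON": "Who", "GPE": "Where", "LOCATION": "Where"}

    def simple_question(sentence):
        subject, _, rest = sentence.partition(" is ")
        if not rest:
            return None  # only handles "X is Y" sentences
        # Run NER over the subject to choose the question word.
        tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(subject)))
        wh = "What"
        for node in tree:
            if hasattr(node, "label") and node.label() in WH_BY_ENTITY:
                wh = WH_BY_ENTITY[node.label()]
                break
        return "%s is %s?" % (wh, rest.rstrip(" .!"))

    print(simple_question("Angela Merkel is the chancellor of Germany."))
    # -> Who is the chancellor of Germany?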

Solving the Travelling Salesman Problem in ruby (50+ locations)

I am working at a delivery company. We currently solve routes of 50+ locations by hand.
I have been thinking about using the Google Maps API to solve this problem, but I have read that there is a 24-point limit.
Currently we are using rails in our server so I am thinking about using a ruby script that would get the coordinates of the 50+ locations and output a reasonable solution.
What algorithm would you use to approach this problem?
Is Ruby a good programming language to solve this type of problem?
Do you know of any existing ruby script?
This might be what you are looking for:
Warning:
this site gets flagged by Firefox as an attack site, but it doesn't appear to be; in fact I have used it before without a problem
[Check revision history for URL]
rubyquiz seems to be down (it has been down for a while), but you can still use the Wayback Machine at archive.org to see that page:
http://web.archive.org/web/20100105132957/http://rubyquiz.com/quiz142.html
Even with the DP solution mentioned in another answer, that's going to require on the order of 10^15 operations. So you're going to have to look at approximate solutions, which are probably acceptable given that you currently do this by hand. Look at http://en.wikipedia.org/wiki/Travelling_salesman_problem#Heuristic_and_approximation_algorithms
Here are a couple of tricks:
1: Lump locations that are relatively close together into one group, and turn each group into a single node in your main graph. This lets you be greedy without too much work.
2: Use an approximation algorithm.
2a: My favorite is bitonic tours. They're pretty easy to hack up.
See Update
Here's a py lib with a bitonic tour and here's another
Let me go look for a ruby one. I'm having trouble finding more than just the RGL, which has efficiency issues....
Update
In your case, the minimum spanning tree attack should be effective. I can't think of a case where your cities wouldn't satisfy the triangle inequality, which means there should be a relatively fast, reasonably decent approximation, particularly if the distance is Euclidean, which, again, I think it must be.
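For concreteness, here is a sketch of the MST attack (in Python for brevity; the structure ports directly to Ruby): build the minimum spanning tree, then take a preorder walk of it as the tour. Under the triangle inequality this is the classic 2-approximation. The coordinates are made up; plug in your 50+ stops.

    import math
    from collections import defaultdict

    points = [(0, 0), (1, 5), (4, 1), (6, 6), (2, 3)]  # illustrative coordinates

    def dist(a, b):
        return math.hypot(points[a][0] - points[b][0], points[a][1] - points[b][1])

    # Prim's algorithm over the complete graph.
    n = len(points)
    in_tree = {0}
    tree = defaultdict(list)
    while len(in_tree) < n:
        u, v = min(((u, v) for u in in_tree for v in range(n) if v not in in_tree),
                   key=lambda e: dist(*e))
        tree[u].append(v)
        in_tree.add(v)

    # Preorder walk of the MST gives the tour.
    def preorder(node, visited=None):
        visited = [] if visited is None else visited
        visited.append(node)
        for child in tree[node]:
            preorder(child, visited)
        return visited

    tour = preorder(0)
    length = sum(dist(tour[i], tour[(i + 1) % n]) for i in range(n))
    print(tour, round(length, 2))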
One optimized solution is dynamic programming, but it is still very expensive at O(2^n), which is not really feasible; unless you use some clustering and distributed computing, Ruby on a single server won't get you very far.
I would recommend coming up with a greedy criterion instead of using DP or brute force; it would be easier to implement.
Once your program finishes, you can cache the results and store them somewhere for later lookups, which can also save you some cycles.
In terms of the code, you'll need to implement vertices and weighted edges: i.e., a vertex class whose edges carry weights (recursively), and then a graph class that will populate the data.
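As a sketch of the greedy idea (in Python for brevity; it maps directly onto the vertex/edge/graph classes described above): from the current stop, always drive to the nearest unvisited stop. The coordinates are illustrative, and math.dist needs Python 3.8+.

    import math

    stops = [(0, 0), (8, 1), (3, 7), (5, 5), (1, 4)]

    def nearest_neighbour_tour(points, start=0):
        unvisited = set(range(len(points))) - {start}
        tour = [start]
        while unvisited:
            here = points[tour[-1]]
            # Greedy criterion: pick the closest remaining stop.
            nxt = min(unvisited, key=lambda i: math.dist(here, points[i]))
            tour.append(nxt)
            unvisited.remove(nxt)
        return tour

    print(nearest_neighbour_tour(stops))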
I have worked on using meta-heuristic algorithms such as Ant Colony Optimization to solve TSP instances like the bays29 (29-city) problem, and it gave me close-to-optimal solutions in very short time. You could potentially use the same approach.
I wrote it in Java, though; I will link it here anyway, because I am currently working on a Ruby port:
Java: https://github.com/mohammedri/ant_colony_java_TSP
Ruby: https://github.com/mohammedri/aco-ruby (incomplete)
This is the dataset it solves for: https://github.com/jorik041/osmsharp/blob/master/Core/OsmSharp.Tools/Benchmark/TSPLIB/Problems/TSP/bays29.tsp
Keep in mind I am using the Euclidean distance between cities, i.e. the straight-line distance. I don't think that is ideal in a real-life situation, considering roads and the city map, but it may be a good starting point :)
If you want the cost of the solution produced by the algorithm to be within 3/2 of the optimum, then you want the Christofides algorithm. ACO and GA don't come with a guaranteed bound.
