I'm looking for some good options for fuzzy string comparison in Rails.
Essentially, I have a set of strings that I'd like to compare against some strings in my database, and I'd like to get the closest one if applicable. In this particular case, I'm not so interested in detecting letters out of order or misspellings, but rather in the ability to ignore extraneous words (extra information, punctuation, and words like "the", "and", "it", etc.) and pick out the best match. These strings will usually be somewhere between two and seven words long.
What would you suggest is the best gem/method for doing this? I've looked at amatch (http://flori.github.com/amatch/doc/index.html), but I'm wondering what else is out there.
Thanks!
Have a look at (and a play with) Thinking Sphinx: http://freelancing-god.github.com/ts/en/
I can heartily recommend it.
There is also a superb Railscast on how to use it here:
http://railscasts.com/episodes/120-thinking-sphinx
Otherwise, use Arel, but you are going to have to implement your own fuzzy logic (not something I'd recommend).
Have a look at the FuzzyMatch gem.
It may help you.
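Whatever gem you settle on, the comparison the question describes (drop filler words, then score word overlap) is simple enough to sketch. Here is a rough illustration in Python; the stop-word list and the Jaccard score are arbitrary choices for demonstration, not the API of amatch or FuzzyMatch:
import re

# Illustrative stop-word list; extend it for your data.
STOP = {"the", "and", "it", "a", "an", "of", "in", "on", "to"}

def tokens(s):
    # Lowercase, split on non-alphanumerics, and drop filler words.
    return {w for w in re.findall(r"[a-z0-9]+", s.lower()) if w not in STOP}

def best_match(query, candidates):
    q = tokens(query)
    scored = []
    for cand in candidates:
        c = tokens(cand)
        # Jaccard similarity: shared words over total distinct words.
        scored.append((len(q & c) / float(len(q | c) or 1), cand))
    return max(scored)

print(best_match("the quick brown fox, and friends",
                 ["Quick Brown Fox", "Lazy Dog", "Brown Bear"]))
# -> (0.75, 'Quick Brown Fox')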
I'm making a simple search engine, and as I go through the documents that are going to be indexed, I want to automatically identify the words that should be ignored (such as "and" and "the").
The only simple method I can think of is to just ignore words up to a certain length (if they're not long enough, they're considered stop words). Any other method would probably require data mining (I'm open to suggestions).
I would prefer a method I can apply as I go through the documents, but I'm open to other suggestions. I just need a simple method.
The short answer is: don't. As in, don't bother; instead, strip them from the query and/or weight them appropriately by TF-IDF.
Quoting the Xapian manual: http://xapian.org/docs/stemming.html
It has been traditional in setting up IR systems to discard the very commonest words of a language - the stopwords - during indexing. A more modern approach is to index everything, which greatly assists searching for phrases for example. Stopwords can then still be eliminated from the query as an optional style of retrieval. In either case, a list of stopwords for a language is useful.
Getting a list of stopwords can be done by sorting a vocabulary of a text corpus for a language by frequency, and going down the list picking off words to be discarded.
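If you do want to build such a list yourself, the frequency-sorting approach above is only a few lines. A minimal sketch, assuming documents is an iterable of plain-text strings from your corpus:
import re
from collections import Counter

def stopwords_by_frequency(documents, top_n=50):
    counts = Counter()
    for doc in documents:
        counts.update(re.findall(r"[a-z']+", doc.lower()))
    # The most common words in almost any English corpus are the usual stop words.
    return [word for word, _ in counts.most_common(top_n)]

docs = ["the cat and the dog", "the dog chased the cat", "it was the cat"]
print(stopwords_by_frequency(docs, top_n=3))  # ['the', 'cat', 'dog'] on this toy corpus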
I am currently thinking about how to find a location in a piece of text, such as a blog post, without the user having to input any additional information. For example, a post could look like this:
"Aberdeen, With a Foot on the Seafloor
Since the early 1970s, Aberdeen, Scotland, has evolved from a gritty fishing town into the world’s center of innovation in technology for the offshore energy industry."
By reading it I realize that the post is about Aberdeen, Scotland, but how can I geotag it? I have been using the geocoder gem (https://github.com/alexreisner/geocoder) by Alex Reisner, but it seems wasteful to check every word against Google/Nominatim (OSM). My initial idea was to simply brute-force it by checking every word with the geocoder and seeing if there are similarities between the words, but it seems like there could be a better way around this.
Has anyone done anything similar to this? Any algorithm that could be suggested (or gem :) ) would be immensely appreciated!
I'm sure there have been projects dedicated to this - for example, Google's uncanny ability to geotag and pick data out of your personal emails effortlessly.
The most obvious answer I can see here would be to create a few regular expressions for locations. The simplest one would be for "City, Country":
Regexp.new("((?:[a-z][a-z]+))(.)(\\s+)((?:[a-z][a-z]+))",Regexp::IGNORECASE);
This would recognize "Aberdeen, Scotland", but also "course, I" or even "thanks, bye". It would be a start, though, to query only those recognized spots instead of every word in the document.
There are also widely known regular expressions for addresses, cities, etc. You could use those as well if you find your algorithm missing matches.
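If it helps, here is a rough Python sketch of the same two-step idea (the regular expression above is Ruby; this only illustrates extracting "City, Country" candidates first and geocoding just those, rather than every word). The pattern is my own assumption and tightens the match to capitalised words:
import re

# Two capitalised word groups separated by a comma, e.g. "Aberdeen, Scotland".
CITY_COUNTRY = re.compile(r"\b([A-Z][a-z]+(?: [A-Z][a-z]+)*),\s+([A-Z][a-z]+(?: [A-Z][a-z]+)*)\b")

def location_candidates(text):
    # Return the candidate strings you would pass to your geocoder of choice.
    return ["{}, {}".format(a, b) for a, b in CITY_COUNTRY.findall(text)]

post = ("Since the early 1970s, Aberdeen, Scotland, has evolved from a gritty "
        "fishing town into the world's center of innovation.")
print(location_candidates(post))  # ['Aberdeen, Scotland'] here; expect some false positives on other texts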
Cheers!
I am really new to Rails, but I was wondering:
What is the best practice for accessing model column names in Rails when doing queries?
For example, I want to order by a column called "title" in DESCENDING order. How would I do it (best practice)?
MyModel.order(:title.to_s.concat(" DESC")).all
MyModel.order("title DESC").all
or something else?
From my experience, using hardcoded strings always proves to be the wrong approach in matters such as this, mainly because the code becomes impossible to refactor.
In my IDE (I am using RubyMine) there is nice code completion for the column symbols, so I am guessing their use will be easier to track this way?
Thanks.
In my opinion, MyModel.order("title DESC").all is the better choice here. The readability and complexity of the other option are poor, and although performance might not be a consideration, the other option also scores badly there.
Apart from that, you should never write code around your IDE's IntelliSense ability - your code should be navigable and readable in all IDEs. I use Vim, and it completes strings as well as it completes symbols, so there is no difference here.
EDIT:
If your order were ASC, you could use MyModel.order(:title).all, which is definitely better than MyModel.order("title").all
I'm thinking about a project which might use functionality similar to how "Quick Add" handles parsing natural language into something that can be understood with some level of semantics. I'm interested in understanding this better and wondered what your thoughts are on how this might be implemented.
If you're unfamiliar with what "Quick Add" is, check out Google's KB about it.
6/4/10 Update
Additional research on "Natural Language Parsing" (NLP) yields results which are MUCH broader than what I feel is actually implemented in something like "Quick Add". Given that this feature expects specific types of input rather than true free-form text, I'm thinking this is a much narrower implementation of NLP. If anyone could suggest a narrower topic that I could research rather than the entire breadth of NLP, it would be greatly appreciated.
That said, I've found a nice collection of resources about NLP including this great FAQ.
I would start by deciding on a standard way to represent all the information I'm interested in: event name, start/end time (and date), guest list, location. For example, I might use an XML notation like this:
<event>
<name>meet Sam</name>
<starttime>16:30 07/06/2010</starttime>
<endtime>17:30 07/06/2010</endtime>
</event>
I'd then aim to build up a corpus of diary entries about dates, annotated with their XML forms. How would I collect the data? Well, if I was Google, I'd probably have all sorts of ways. Since I'm me, I'd probably start by writing down all the ways I could think of to express this sort of stuff, then annotating it by hand. If I could add to this by going through friends' e-mails and whatnot, so much the better.
Now that I've got a corpus, it can serve as a set of unit tests. I need to code a parser to fit the tests. The parser should translate a string of natural language into the logical form of my annotation. First, it should split the string into its constituent words. This is called tokenising, and there is off-the-shelf software available to do it. (For example, see NLTK.) To interpret the words, I would look for patterns in the data: for example, text following 'at' or 'in' should be tagged as a location; 'for X minutes' means I need to add that number of minutes to the start time to get the end time. Statistical methods would probably be overkill here - it's best to create a series of hand-coded rules that express your own knowledge of how to interpret the words, phrases and constructions in this domain.
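To make the hand-coded-rules idea concrete, here is a minimal sketch in Python. The field names and the three patterns (duration, time, location) are my own illustrative assumptions, not Google's actual grammar, and the time is kept as a string rather than a proper date:
import re

def parse_quick_add(text):
    event = {"name": None, "location": None, "time": None, "duration_minutes": None}

    # 'for X minutes' -> duration; remove the match so it doesn't pollute the name.
    m = re.search(r"\bfor\s+(\d+)\s+minutes?\b", text, re.I)
    if m:
        event["duration_minutes"] = int(m.group(1))
        text = text[:m.start()] + text[m.end():]

    # A bare clock time like '16:30' or '4:30pm' -> start time.
    m = re.search(r"\b(\d{1,2}:\d{2}\s*(?:am|pm)?)\b", text, re.I)
    if m:
        event["time"] = m.group(1)
        text = text[:m.start()] + text[m.end():]

    # 'at <Capitalised words>' or 'in <Capitalised words>' -> location.
    m = re.search(r"\b(?:at|in)\s+([A-Z][\w' ]*)", text)
    if m:
        event["location"] = m.group(1).strip()
        text = text[:m.start()] + text[m.end():]

    # Whatever is left over is treated as the event name.
    event["name"] = re.sub(r"\s{2,}", " ", text).strip(" ,.")
    return event

print(parse_quick_add("Meet Sam at Starbucks 16:30 for 60 minutes"))
# -> {'name': 'Meet Sam', 'location': 'Starbucks', 'time': '16:30', 'duration_minutes': 60}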
It would seem that there's really no narrow approach to this problem. I wanted to avoid having to pull along the entirety of NLP to figure out a solution, but I haven't found any alternative. I'll update this if I find a really great solution later.
I need to parse recipe ingredients into amount, measurement, item, and description as applicable to the line, such as "1 cup flour", "the peel of 2 lemons", "1 cup packed brown sugar", etc. What would be the best way of doing this? I am interested in using Python for the project, so I am assuming NLTK is the best bet, but I am open to other languages.
I actually do this for my website, which is now part of an open source project for others to use.
I wrote a blog post on my techniques, enjoy!
http://blog.kitchenpc.com/2011/07/06/chef-watson/
The New York Times faced this problem when they were parsing their recipe archive. They used an NLP technique called linear-chain conditional random fields (CRF). This blog post provides a good overview:
"Extracting Structured Data From Recipes Using Conditional Random Fields"
They open-sourced their code, but quickly abandoned it. I maintain the most up-to-date version of it and I wrote a bit about how I modernized it.
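If you want to experiment with that approach yourself, here is a hedged sketch using the third-party sklearn-crfsuite package rather than the NYT's original setup; the feature set and the QTY/UNIT/COMMENT/NAME tags are illustrative assumptions, not their actual schema, and a real model needs hundreds of hand-labelled lines:
import sklearn_crfsuite  # pip install sklearn-crfsuite

def token_features(tokens, i):
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_digit": tok.replace("/", "").isdigit(),  # catches "1" as well as "1/2"
        "is_first": i == 0,                          # quantities tend to lead the line
    }

def featurize(tokens):
    return [token_features(tokens, i) for i in range(len(tokens))]

# Two labelled lines shown only to illustrate the data shape.
train_tokens = [["1", "cup", "flour"], ["2", "tablespoons", "packed", "brown", "sugar"]]
train_tags = [["QTY", "UNIT", "NAME"], ["QTY", "UNIT", "COMMENT", "NAME", "NAME"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit([featurize(t) for t in train_tokens], train_tags)
print(crf.predict([featurize(["3", "cups", "sugar"])]))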
If you're looking for a ready-made solution, several companies offer ingredient parsing as a service:
Zestful (full disclosure: I'm the author)
Spoonacular
Edamam
I guess this is a few years out, but I was thinking of doing something similar myself and came across this, so I thought I might have a stab at it in case it is useful to anyone else in future.
Even though you say you want to parse free text, most recipes have a pretty standard format for their ingredient lists: each ingredient is on a separate line, and exact sentence structure is rarely all that important. The range of vocabulary is relatively small as well.
One way might be to check each line for words which might be nouns and words/symbols which express quantities. I think WordNet may help with seeing if a word is likely to be a noun or not, but I've not used it before myself. Alternatively, you could use http://en.wikibooks.org/wiki/Cookbook:Ingredients as a word list, though again, I wouldn't know exactly how comprehensive it is.
The other part is to recognise quantities. These come in a few different forms, but few enough that you could probably create a list of keywords. In particular, make sure you have good error reporting. If the program can't fully parse a line, get it to report back to you what that line is, along with what it has/hasn't recognised so you can adjust your keyword lists accordingly.
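A rough sketch of that keyword-list idea, with the error reporting described above: anything the parser can't account for is echoed back so you can grow the lists. The unit list here is deliberately tiny and purely an assumption for illustration:
import re

UNITS = {"cup", "cups", "tablespoon", "tablespoons", "tbsp", "teaspoon", "tsp", "g", "kg", "ml", "l"}

def parse_line(line):
    tokens = re.findall(r"[\w/]+", line.lower())
    amounts = [t for t in tokens if re.fullmatch(r"\d+(/\d+)?", t)]
    units = [t for t in tokens if t in UNITS]
    rest = [t for t in tokens if t not in amounts and t not in units]
    return {"amount": amounts, "unit": units, "rest": rest}

for line in ["1 cup flour", "the peel of 2 lemons", "1 cup packed brown sugar"]:
    parsed = parse_line(line)
    if not parsed["amount"] or not parsed["unit"]:
        # Error reporting: show what was and wasn't recognised on the problem line.
        print("could not fully parse: %r -> %s" % (line, parsed))
    else:
        print(parsed)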
Anyway, I'm not guaranteeing any of this will work (and it's almost certain not to be 100% reliable), but that's how I'd start to approach the problem.
This is an incomplete answer, but you're looking at writing a free-text parser, which, as you know, is non-trivial :)
Some ways to cheat, using knowledge specific to cooking:
Construct lists of words for the "adjectives" and "verbs", and filter against them
measurement units form a closed set, using words and abbreviations like {L., c, cup, t, dash}
instructions -- cut, dice, cook, peel. Things that come after these are almost certainly ingredients
Remember that you're mostly looking for nouns, and you can take a labeled list of non-nouns (from WordNet, for example) and filter against them.
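For the WordNet idea specifically, NLTK exposes it directly. A hedged sketch (it needs the wordnet corpus downloaded, and note that many adjectives also have noun senses, so this is only a rough filter):
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

def probable_nouns(tokens):
    # Keep only tokens for which WordNet lists at least one noun sense.
    return [t for t in tokens if wn.synsets(t.lower(), pos=wn.NOUN)]

print(probable_nouns(["finely", "chopped", "fresh", "parsley"]))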
If you're more ambitious, you can look in the NLTK Book at the chapter on parsers.
Good luck! This sounds like a mostly doable project!
Can you be more specific about what your input is? If you just have input like this:
1 cup flour
2 lemon peels
1 cup packed brown sugar
It won't be too hard to parse it without using any NLP at all.
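For input that regular, something like the minimal sketch below would cover it. The unit set is just an illustrative assumption, and a line whose second word isn't a known unit (like "2 lemon peels") simply gets no measurement:
UNITS = {"cup", "cups", "tbsp", "tablespoon", "tablespoons", "tsp", "teaspoon", "teaspoons"}

def parse_simple(line):
    parts = line.split()
    amount, rest = parts[0], parts[1:]
    # Treat the second word as the measurement only if it is a known unit.
    unit = rest.pop(0) if rest and rest[0].lower() in UNITS else None
    return {"amount": amount, "measurement": unit, "item": " ".join(rest)}

for line in ["1 cup flour", "2 lemon peels", "1 cup packed brown sugar"]:
    print(parse_simple(line))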