I am doing feature selection in machine learning, and I would like to detect words like happyyyyyyyyy, gooood, and loooooooove and replace them with happy, good, and love. I tried using a regex to replace consecutive repeated letters with a single letter, but that works for looooooooove -> love and fails on goooooood -> god. I collected a list of English words like book, cool, chilling, breeze, etc., but this list is not sufficient for my dataset. I need a reference list to continue, as collecting words by hand is really time consuming. Thanks for your replies.
To build your reference, use the regex (.)\1+
with something like grep to match words from a word list that contain consecutive repeated letters (take a look at Dictionary text file for a good place to start).
You should end up with a list of words that legitimately contain consecutive letters, and that list is your reference.
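Once you have that reference, a minimal Python sketch of the normalization step could look like the following. The /usr/share/dict/words path is just a stand-in for whatever list you build, and the collapse-to-two-then-one heuristic is only one possible approach:

import re

# Assumption: a newline-separated word list; /usr/share/dict/words is a stand-in.
with open("/usr/share/dict/words") as f:
    words = {line.strip().lower() for line in f}

def normalize(token):
    # Collapse runs of 3+ repeated letters: first down to 2 (keeps "good", "cool"),
    # then down to 1 (fixes "looove" -> "love"), and keep the first variant
    # that appears in the word list.
    two = re.sub(r"(.)\1{2,}", r"\1\1", token)
    one = re.sub(r"(.)\1{2,}", r"\1", token)
    for candidate in (token, two, one):
        if candidate in words:
            return candidate
    return one  # heuristic fallback when nothing matches

print(normalize("goooooood"))      # good
print(normalize("looooooooove"))   # love
print(normalize("happyyyyyyyyy"))  # happy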
So I am working on an artist classification project that uses hip hop lyrics from genius.com. The problem is that these lyrics are user generated, so the same word can be spelled in many different ways, especially if it is slang, which is very common in hip hop.
I looked into spell correction using hunspell/pyhunspell, but the problem is that it doesn't fix slang misspellings. I could technically make a mini dictionary with a bunch of misspelled variations, but that is effectively useless because there could be a dozen variations of the same word across my (growing) 6,000-song corpus.
Any suggestions?
You could try to stem your words. More information on stemming here. This would help group together words with close spelling variations.
A popular stemming scheme is the Porter stemmer, an implementation of which can be found in most NLP packages, e.g. NLTK.
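For instance, a quick sketch with NLTK's Porter stemmer (the sample words are made up):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Close spelling/inflection variants collapse onto the same stem,
# so they can be counted as a single feature.
for word in ["chilling", "chilled", "chills", "loving", "loved", "loves"]:
    print(word, "->", stemmer.stem(word))
# chilling/chilled/chills -> chill, loving/loved/loves -> love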
If possible, I would discard short words and contracted words, which are too hard to correct automatically (after checking that doing so won't affect your final result).
For longer words, you may want to use metrics like Levenshtein distance or Jaro similarity. The first is the minimum number of insertions, deletions, or substitutions needed to convert one candidate word into another. The second gives a similarity score between 0 and 1; its common Jaro-Winkler variant puts extra weight on characters at the start of the words (a shared prefix).
If you have access to the correct version of a slang word, you could convert the closest candidates to it, of course taking care not to apply this to words that are already correct.
If you're working with Python, some implementations are provided here.
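As an illustration (not from the linked implementations), a small sketch with NLTK's edit_distance, assuming you keep a hand-picked list of canonical spellings:

from nltk.metrics.distance import edit_distance

# Assumption: a short list of spellings you trust; every other token is
# mapped to its closest entry if it is close enough.
canonical = ["chilling", "homie", "money", "love"]

def correct(token, max_distance=2):
    best = min(canonical, key=lambda w: edit_distance(token, w))
    return best if edit_distance(token, best) <= max_distance else token

print(correct("chillinn"))  # chilling
print(correct("homiee"))    # homie
print(correct("tomorrow"))  # unchanged: too far from every canonical word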
I want to do some mining on tweets. Is there a more specific stop word list for tweets, for example one that removes "lol" and other Twitter smileys?
I guess you should merge an ordinary stop word list, like this one or that, with a specific acronyms dictionary, e.g. this slang dictionary, or that, or that, or that (the last one seems to be the easiest to parse; see the comments here for the idea).
I'm not aware of a specific stopwords list, but you could get a list of most frequent single words here:
http://clic.cimec.unitn.it/amac/twitter_ngram/ (download en.1grams.gz)
To detect and then ignore smilies use: https://github.com/brendano/tweetmotif
You may also find these tools useful:
https://github.com/willf/segment (if you want to segment hashtags)
https://github.com/amacinho/Rovereto-Twitter-Tokenizer (if you don't)
I'm not aware of a Twitter-specific stop word list, but it is common practice to simply remove the n most frequent words from your analyses, where n could be 100, for example. Depending on what you would like to do, smileys may actually provide very relevant information.
I'm making a simple search engine, and as I go through the documents that are going to be indexed, I want to automatically identify the words that should be ignored (such as "and" and "the").
The only simple method I can think of is to ignore words up to a certain length (if they're not long enough, they're considered stop words). Any other method would probably require data mining (I'm open to suggestions).
I would prefer a method that I can apply as I go through the documents, but I'm open to other suggestions. I just need a simple method.
The short answer is: don't. As in, don't bother; instead, strip them from the query and/or weight them appropriately with TF-IDF.
Quoting the Xapian manual: http://xapian.org/docs/stemming.html
It has been traditional in setting up IR systems to discard the very commonest words of a language - the stopwords - during indexing. A more modern approach is to index everything, which greatly assists searching for phrases for example. Stopwords can then still be eliminated from the query as an optional style of retrieval. In either case, a list of stopwords for a language is useful.
Getting a list of stopwords can be done by sorting a vocabulary of a text corpus for a language by frequency, and going down the list picking off words to be discarded.
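A rough sketch of that last step in Python (the toy documents below are made up; a real corpus and a much larger cutoff would be used in practice):

import re
from collections import Counter

# Assumption: `documents` is an iterable of raw text strings from your corpus.
documents = ["the cat sat on the mat", "the dog and the cat", "a dog barked at the cat"]

counts = Counter()
for doc in documents:
    counts.update(re.findall(r"[a-z']+", doc.lower()))

# Walk down the frequency-sorted vocabulary and pick off words to discard.
n = 2  # with a real corpus this might be 100 or more
stopwords = {word for word, _ in counts.most_common(n)}
print(stopwords)  # {'the', 'cat'} on this toy corpus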
I wanted some ideas about building a tool which can scan text sentences (written in English) and build a keyword ranking based on the most frequent words or phrases within the texts.
This would be very similar to Twitter trends, wherein Twitter detects and reports the top 10 words within tweets.
I have identified the high-level steps of the algorithm as follows:
Scan the text and remove all the common, frequent words (such as "the", "is", "are", "what", "at", etc.).
Add the remaining words to a hashmap. If the word is already in the map then increment its count.
To get the top 10 words , iterate through the hashmap and find out the top 10 counts.
Steps 2 and 3 are straightforward, but in step 1 I do not know how to detect the important words within a text and separate them from the common words (prepositions, conjunctions, etc.).
Also, if I want to track phrases, what would the approach be?
For example if I have a text saying "This honey is very good"
I might want to track "honey" and "good" but I may also want to track the phrases "very good" or "honey is very good"
Any suggestions would be greatly appreciated.
Thanks in advance
For detecting phrases, I suggest using a chunker. You can use one provided by an NLP tool like OpenNLP or Stanford CoreNLP.
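If you happen to be in Python, NLTK's RegexpParser can serve as a stand-in chunker instead of the tools named above (a toy grammar, just for illustration):

import nltk

# Assumes the punkt and averaged_perceptron_tagger data are installed,
# e.g. via nltk.download().
sentence = "This honey is very good"
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

# Toy grammar: simple noun phrases and adjective phrases.
grammar = r"""
  NP: {<DT>?<JJ>*<NN.*>+}
  ADJP: {<RB>?<JJ>}
"""
chunker = nltk.RegexpParser(grammar)
for subtree in chunker.parse(tagged).subtrees():
    if subtree.label() in ("NP", "ADJP"):
        print(subtree.label(), " ".join(word for word, tag in subtree.leaves()))
# NP This honey
# ADJP very good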
NOTE
"honey is very good" is not a phrase; it is a clause. "very good" is a phrase.
In information retrieval systems, those common words are called stop words.
Actually, your step 1 would be quite similar to step 3, in the sense that you first want to build a reference database of the most common words in the English language. Such a list is easily available on the internet (Wikipedia even has an article referencing the 100 most common words in the English language). You can store those words in a hashmap and simply ignore those tokens while scanning your text.
If you don't trust Wikipedia or the existing lists of common words, you can build your own database. For that purpose, just scan thousands of tweets (the more the better) and build your own frequency chart.
You're facing an n-gram-like problem.
Do not reinvent the wheel. What you seem to want to do has been done thousands of times; just use existing libraries or pieces of code (check the External Links section of the n-gram Wikipedia page).
Check out the NLTK library. It has code for steps one, two, and three:
1. Removing common words can be done using stopwords or a stemmer.
2-3. Getting the most common words can be done with FreqDist.
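Putting those pieces together, a rough sketch (the sample text is made up, and phrases are approximated with bigrams rather than a full chunker):

import nltk
from nltk.corpus import stopwords
from nltk import FreqDist, bigrams

# Assumes the punkt and stopwords data are installed, e.g. via nltk.download().
text = "This honey is very good. Honey from this farm is really very good."
tokens = [t.lower() for t in nltk.word_tokenize(text) if t.isalpha()]

# Step 1: drop common words.
stop = set(stopwords.words("english"))
content = [t for t in tokens if t not in stop]

# Steps 2-3: count and take the top 10.
print(FreqDist(content).most_common(10))

# Phrases, approximated as word pairs (keep stop words so "very good" survives).
print(FreqDist(bigrams(tokens)).most_common(10))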
Second, you can use tools from Stanford NLP to track phrases in your text.
I've done some Google searching but couldn't find what I was looking for.
I'm developing a Scrabble-type word game in Rails, and I was wondering if there's a simple way to validate that what the player inputs in the game is actually a word. They'd be typing the word out.
Is validating against some sort of English-language dictionary database loaded within the app the best way to solve this problem? If so, are there any libraries that offer this kind of functionality? If not, what would you suggest?
Thanks for your help!
You need two things:
a word list
some code
The word list is the tricky part. On most Unix systems there's a word list at /usr/share/dict/words or /usr/dict/words -- see http://en.wikipedia.org/wiki/Words_(Unix) for more details. The one on my Mac has 234,936 words in it. But they're not all valid Scrabble words. So you'd have to somehow acquire a Scrabble dictionary, make sure you have the right license to use it, and process it so it's a text file.
(Update: The word list for LetterPress is now open source, and available on GitHub.)
The code is no problem in the simple case. Here's a script I whipped up just now:
# Load the word list into a hash for constant-time lookups.
words = {}
File.open("/usr/share/dict/words") do |file|
  file.each do |line|
    words[line.strip] = true
  end
end

# Present words map to true; absent words return nil.
p words["magic"]
p words["saldkaj"]
This will output
true
nil
I leave it as an exercise for the reader to make it into a proper Words object. (Technically it's not a Dictionary since it has no definitions.) Or to use a DAWG instead of a hash, even though a hash is probably fine for your needs.
A piece of language-agnostic advice: if you only care about the existence of a word (which, in this case, you do) and you plan to load the entire database into the application (which your question suggests you're considering), then a DAWG will let you check existence in O(n) time, where n is the length of the word. The size of the dictionary has no effect, so each lookup is essentially O(1). It is also a relatively compact structure in memory; indeed, some insertions actually reduce its size, since a DAWG for "top, tap, taps, tops" has fewer nodes than one for "tops, tap".
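For illustration only, here is a minimal trie sketch in Python; a DAWG adds suffix sharing on top of this, but lookups are already O(n) in the word length, independent of dictionary size. Porting the idea to Ruby for the Rails app is straightforward.

class Trie:
    def __init__(self):
        self.root = {}

    def add(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True  # end-of-word marker

    def __contains__(self, word):
        node = self.root
        for ch in word:
            if ch not in node:
                return False
            node = node[ch]
        return "$" in node

trie = Trie()
for w in ["top", "tap", "taps", "tops"]:
    trie.add(w)

print("taps" in trie)  # True
print("ta" in trie)    # False: a prefix, not a stored word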