How to account for variation in spelling (especially for slang) for Word Embeddings/Word2Vec generation using song lyrics? - machine-learning

So I am working on a artist classification project that utilizes hip hop lyrics from genius.com. The problem is these lyrics are user generated, so the same word can be spelled in various different ways, especially if it is slang which is a very common case in hip hop.
I looked into spell correction using hunspell/pyhunspell, but the problem with that is it doesn't fix slang misspellings. I technically could make a mini dictionary with a bunch of misspelled variations but that is effectively useless because there could be a dozen variations of the same word over my (growing) 6000 song corpus.
Any suggestions?

You could try to stem your words. More information on stemming here. This would help grouping together words with close spelling variations.
A popular stemming scheme is the Porter Stemmer, which implementation can be found in most NLP packages, eg. NLTK

I would discard, if possible, short words, or contracted words which somehow are too hard to automatically correct them (conditioned on checking that it won't affect your final result).
For longer words, you may want to use metrics like Levenshtein distance or Jaro similarity. The first one consists of the minimum number of additions, deletes or replaces to convert one candidate word into another. The second one, provides a similar result, between 0 and 1, and putting more emphasis in the last characters of a word.
If you have access to the correct version of your slang word, you could convert the closest candidates to the correct one. Of course, trying not to apply it to different correct words.
If you're working with Python, here some implementations are provided.

Related

How to find the characteristics of a bunch of word Clusters?

My Motivations I'm trying to learn German and realized there's a confounding fact with the structure of German: every noun has a gender which seems unrelated to the noun itself in many cases.
Unlike languages such as English, each noun has a different definite article, depending on gender: der (masculine), die (feminine), and das (neuter). For example:
das Mädchen ("the girl"), der Rock ("the skirt), die Hose ("the trousers/pants"). So, there seems to be no correlation between gender assignment of nouns and their meanings.
The Data
I gathered up to 5000 German words with 3 columns (das, der, die) for each word with 1's and 0's. So, my data is already clustered with one hot encoding and I'm not trying to predict anything.
Why I'm here I am clueless on where to start, how to approach this problem as the concept of distance in clustering doesn't make sense to me in this setting. I can't think of a way to generate an understandable description of these clusters. The mixed data makes it impossible for me to think of some hard-coded metrics for evaluation.
So, my question is:
I want to find some patterns, some characteristics of these words that made them fall in a specific cluster. I don't know if I'm making any sense but some people managed to find some patterns already (for example word endings, elongated long objects tend to be masculine etc., etc) and I believe ML/AI could do a way better job at this. Would it be possible for me to do something like this?
Some personal thoughts
While I was doing some research (perhaps, naive), I realized the potential options are decision trees and cobweb algorithms. Also, I was thinking if I could just scrape a few images (say 5) for every word and try to run some image classification and see the intermediate NN's to see if any specific shapes support a specific object gender. In addition to that, I was wondering whether scraping the data of google n-gram viewers of these words could help in anyway. I couldn't think of a way to use NLP or its sub domains.
Alternatives If everything I just wrote sounds nonsensical, please suggest me a way to make visual representations of my dataframe (more like nodes and paths with images at nodes, one for each cluster) in Python so that I could make pictorial mind maps and try to by heart them.
The ultimate purpose is to make learning German simpler for myself and possibly for others

Selecting suitable model for creating Language Identification tool

I am working on developing a tool for language identification of a given text i.e. given a sample text, identify the language (for e.g. English, Swedish, German, etc.) it is written in.
Now the strategy I have decided to follow (based on a few references I have gathered) are as follows -
a) Create a character n-gram model (The value of n is decided based on certain heuristics and computations)
b) Use a machine learning classifier(such as naive bayes) to predict the language of the given text.
Now, the doubt I have is - Is creating a character N-gram model necessary. As in, what disadvantage does a simple bag of words strategy have i.e. if I use all the words possible in the respective language to create a prediction model, what could be the possible cases where it would fail.
The reason why this doubt arose was the fact that any reference document/research paper I've come across states that language identification is a very difficult task. However, just using this strategy of using the words in the language seems to be a simple task.
EDIT: One reason why N-gram should be preferred is to make the model robust even if there are typos as stated here. Can anyone point out more?
if I use all the words possible in the respective language to create a prediction model, what could be the possible cases where it would fail
Pretty much the same cases were a character n-gram model would fail. The problem is that you're not going to find appropriate statistics for all possible words.(*) Character n-gram statistics are easier to accumulate and more robust, even for text without typos: words in a language tend to follow the same spelling patterns. E.g. had you not found statistics for the Dutch word "uitbuiken" (a pretty rare word), then the occurrence of the n-grams "uit", "bui" and "uik" would still be strong indicators of this being Dutch.
(*) In agglutinative languages such as Turkish, new words can be formed by stringing morphemes together and the number of possible words is immense. Check the first few chapters of Jurafsky and Martin, or any undergraduate linguistics text, for interesting discussions on the possible number of words per language.
Cavnar and Trenkle proposed a very simple yet efficient approach using character n-grams of variable length. Maybe you should try to implement it first and move to a more complex ML approach if C&T approach doesn't meet your requirements.
Basically, the idea is to build a language model using only the X (e.g. X = 300) most frequent n-grams of variable length (e.g. 1 <= N <= 5). Doing so, you are very likely to capture most functional words/morphemes of the considered language... without any prior linguistic knowledge on that language!
Why would you choose character n-grams over a BoW approach? I think the notion of character n-gram is pretty straightforward and apply to every written language. Word, is a much much complex notion which greatly differ from one language to another (consider languages with almost no spacing marks).
Reference: http://odur.let.rug.nl/~vannoord/TextCat/textcat.pdf
The performance really depends on your expected input. If you will be classifying multi-paragraph text all in one language, a functional words list (which your "bag of words" with pruning of hapaxes will quickly approximate) might well serve you perfectly, and could work better than n-grams.
There is significant overlap between individual words -- "of" could be Dutch or English; "and" is very common in English but also means "duck" in the Scandinavian languages, etc. But given enough input data, overlaps for individual stop words will not confuse your algorithm very often.
My anecdotal evidence is from using libtextcat on the Reuters multilingual newswire corpus. Many of the telegrams contain a lot of proper names, loan words etc. which throw off the n-gram classifier a lot of the time; whereas just examining the stop words would (in my humble estimation) produce much more stable results.
On the other hand, if you need to identify short, telegraphic utterances which might not be in your dictionary, a dictionary-based approach is obviously flawed. Note that many North European languages have very productive word formation by free compounding -- you see words like "tandborstställbrist" and "yhdyssanatauti" being coined left and right (and Finnish has agglutination on top -- "yhdyssanataudittomienkinkohan") which simply cannot be expected to be in a dictionary until somebody decides to use them.

Algorithm for keyword/phrase trend search similar to Twitter trends

Wanted some ideas about building a tool which can scan text sentences (written in english language) and build a keyword rank, based on the most occurrences of words or phrases within the texts.
This would be very similar to the twitter trends wherin twitter detects and reports the top 10 words within the tweets.
I have identified the high level steps in the algorithm as follows
Scan the text and remove all the common , frequent words ( such as, "the" , "is" , "are", "what" , "at" etc..)
Add the remaining words to a hashmap. If the word is already in the map then increment its count.
To get the top 10 words , iterate through the hashmap and find out the top 10 counts.
Step 2 and 3 are straightforward but I do not know in step 1 how do I detect the important words within a text and segregate them from the common words (prepositions, conjunctions etc )
Also if I want to track phrases what could be the approach ?
For example if I have a text saying "This honey is very good"
I might want to track "honey" and "good" but I may also want to track the phrases "very good" or "honey is very good"
Any suggestions would be greatly appreciated.
Thanks in advance
For detecting phrases, I suggest to use chunker. You can use one provided by NLP tool like OpenNLP or Stanford CoreNLP.
NOTE
honey is very good is not a phrase. It is clause. very good is a phrase.
In Information Retrieval System, those common word are called Stop Words.
Actually, your step 1 would be quite similar to step 3 in the sense that you may want to constitute an absolute database of the most common words in the English language in the first place. Such a list is available easily on the internet (Wikipedia even has an article referencing the 100 most common words in the English language.) You can store those words in a hashmap and while scanning your text contents just ignore the common tokens.
If you don't trust Wikipedia and the already existing listing for common words, you can build your own database. For that purpose, just scan thousands of tweets (the more the better) and make your own frequency chart.
You're facing an n-gram-like problem.
Do not reinvent the wheel. What you seem to be wanting to do has been done thousands of times, just use existing libs or pieces of code (check the External Links section of the n-gram Wikipedia page.)
Check out the NLTK library. It has code that does number one two and three:
1 Removing common words can be done using stopwords or a stemmer
2,3 getting the most common words can be done with FreqDist
Second you can use tools from Stanford NLP for tracking your text

How do I design a heuristic for matching translated sentences?

Summary
I am trying to design a heuristic for matching up sentences in a translation (from the original language to the translated language) and would like guidance and tips. Perhaps there is a heuristic that already does something similar? So given two text files, I would like to be able to match up the sentences (so I can pick out a sentence and say this is the translation of that sentence).
Details
The input text would be translated novels. So I do not expect the translations to be literal, although, using something like google translate might be a good way to test the accuracy of the heuristic.
To help me, I have a library that will gloss the contents of the translated text and give me the definitions of the words in the sentence. Other things I know:
Chapters and order are preserved; I know that the first sentence in chapter three will match with the first sentence in chapter three of the translation (Note, this is not strictly true; the first sentence might match up with the first two sentences, or even the second sentence)
I can calculate the overall size (characters, sentences, paragraphs); which could give me an idea of the average difference in sentence size (for example, the translation might be 30% longer).
Looking at the some books I have, the translated version has about 30% more sentences than the original text.
Implementation
(if it matters)
I am planning to do this in Java - but I am not that fussed - any language will do.
I am not greatly concerned about speed.
I guess to to be sure of the matches, some user feedback might be required. Like saying "Yes, this sentence definitely matches with that sentence." This would give the heuristic some more ground to stand on. This would mean that the user would need a little proficiency in the languages.
Background
(for those interested)
The reason I want to make this is that I want it to assist with my foreign language study. I am studying Japanese and find it hard to find "good" material (where "good" is defined by what I like). There are already tools to do something similar with subtitles from videos (an easier task - using the timing information of the video). But nothing, as far as I know, for texts.
There are tools called "sentence aligners" used in NLP research that does exactly what you want.
I advise hunalign:
http://mokk.bme.hu/resources/hunalign/
and MS sentence aligner:
http://research.microsoft.com/en-us/downloads/aafd5dcf-4dcc-49b2-8a22-f7055113e656/
Both are quite OK, but remember that nothing is perfect. Sentences that are too hard to be aligned will be dropped and some sentences may be wrongly aligned.

What are some good methods to find the "relatedness" of two bodies of text?

Here's the problem -- I have a few thousand small text snippets, anywhere from a few words to a few sentences - the largest snippet is about 2k on disk. I want to be able to compare each to each, and calculate a relatedness factor so that I can show users related information.
What are some good ways to do this? Are there known algorithms for doing this that are any good, are there any GPL'd solutions, etc?
I don't need this to run in realtime, as I can precalculate everything. I'm more concerned with getting good results than runtime.
I just thought I would ask the Stack Overflow community before going and writing my own thing. There HAVE to be people out there who have found good solutions to this before.
These articles on semantic relatedness and semantic similarity may be helpful. And this SO question about Latent Semantic Analysis.
You could also look into Soundex for words that "sound alike" phonetically.
I've never used it, but you might want to look into Levenshtein distance
Jeff talked about something like this on the pod cast to find the Related questions listed on the right side here. (in podcast 32)
One big tip was to remove all common words, like "the" "and" "this" etc. This will leave you with more meaningful words to compare.
And here is a similar question Is there an algorithm that tells the semantic similarity of two phrases
This is quite doable for reasonable large texts, however harder for smaller texts.
I did it once like this, and it worked pretty well:
Filter all "general" words (like a, an, the, in, etc...) (filters about 10-30% of the words)
Count the frequencies of the remaining words, store the top x of most frequent words, these are your topics.
As an extra step you can create groups of 2/3/4 subsequent words and compare them with the groups in other texts. I used it as a measure for plagerism.
See Manning and Raghavan course notes about MinHashing and searching for similar items, and a C#(?) version. I believe the techniques come from Ullman and Motwani's research.
This book may be relevant.
Edit: here is a related SO question
Phonetic algorithms
The article, Beyond SoundEx - Functions for Fuzzy Searching in MS SQL Server, shows how to install and use the SimMetrics library into SQL Server. This library lets you find relative similarity between strings and includes numerous algorithms.
I ended up mostly using Jaro Winkler to match on names. Here's more information where I asked about matching names on SO: Matching records based on Person Name
A few algorithms based on Levenshtein Distance are also available in the SimMetric library and would probably be useful in your application.

Resources