Character n-gram vs word features in NLP - machine-learning

I'm trying to predict whether Yelp reviews are positive or negative by performing linear regression using SGD. I tried two different feature extractors: the first was character n-grams and the second was splitting the text into words on spaces. I tried different n values for the character n-grams and picked the n that gave me the best test error. I noticed that this test error (0.27 on my test data) was nearly identical to the test error from the space-separated word features. Is there a reason behind this coincidence? Shouldn't the character n-grams have a lower test error, since they extract more features than the word features?
Character n-gram: ex. n=7
"Good restaurant" => "Goodres" "oodrest" "odresta" "drestau" "restaur" "estaura" "stauran" "taurant"
Word features:
"Good restaurant" => "Good" "restaurant"

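It looks like the n-gram method simply produced a lot of redundant, overlapping features that do not improve the predictions. With n=7, each n-gram shares six of its seven characters with its neighbour, so having more features does not mean having more information.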
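To make the overlap concrete, here is a minimal sketch of the two extractors from the question. The function names are made up for illustration, and it assumes (as the question's example does) that spaces are dropped before the character n-grams are cut:

```python
def char_ngrams(text, n):
    """Character n-grams over the text with whitespace removed."""
    chars = "".join(text.split())
    return [chars[i:i + n] for i in range(len(chars) - n + 1)]


def word_features(text):
    """Word features: split on whitespace."""
    return text.split()


review = "Good restaurant"
grams = char_ngrams(review, 7)
words = word_features(review)

print(len(grams), grams)
print(len(words), words)

# Consecutive n-grams are shifted by one character, so they share n - 1 characters.
overlaps = [a[1:] == b[:-1] for a, b in zip(grams, grams[1:])]
print(all(overlaps))
```

The character extractor produces four times as many features here, but every one of them is a one-character shift of the previous one.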

Related

String classification, how to encode character-by-character and train?

I am trying to build a classifier to classify some files into 150 categories based on the name of those files. Here are some examples of file names in my dataset (~700k files):
104932489 - urgent - contract validation for xyz limited.msg
treatment - an I l - contract n°4934283 received by partner.pdf
- invoice_8843238_1_europe services_business 8592342sid paris.xls
140159498736656.txt
140159498736843.txt
fsk_000000001090296_sdiselacrusefeyre_2000912.xls
fsk_000000001091293_lidlsnd1753mdeas_2009316.xls
You can see that the filenames can really be anything, but there is always some pattern that holds within a category. It can be in the numbers (which are sometimes close), in the special characters (spaces, -, °), sometimes in the length, etc.
Extracting all those patterns one by one will take ages because I have approximately 700k documents. Also, I am not interested in 100% accuracy, 70% can be good enough.
The real problem is that I don't know how to encode this data. I have tried many methods:
Tokenizing character by character and feeding them to an LSTM model with an embedding layer. However, I wasn't able to implement it and got dimension errors.
Adapting Word2Vec to convert the characters into vectors. However, this automatically drops all punctuation and space characters, and I also lose the numeric data. Another problem is that it creates useless extra dimensions: if the size is 20, my data is in 20 dimensions, but looking closely there are always the same 150 vectors in those 20 dimensions, so it's really useless. I could use a size of 2, but I still need the numeric data and the special characters.
Generating n-grams from each path, in the range 1-4, then using a CountVectorizer to compute the frequencies. I checked and special characters were not dropped, but it gave me about 400,000 features! I am running a dimensionality reduction using UMAP (n_components=5, metric='hellinger'), but the reduction runs for 2 hours and then the kernel crashes.
Any ideas?
I am also currently working on a character-level LSTM, and it works exactly the same way as it would with words. You need a vocabulary, for example a-z, and then you just take the index of each letter as its integer representation. For example:
"bad" -> "b", "a", "d" -> [1, 0, 3]
Now you could create an embedding lookup table (for example using PyTorch's nn.Embedding module). It just holds a randomly initialized vector for every index of your vocab. For example:
"a" -> 0 -> [-0.93, 0.024, -0.73, ..., -0.12]
You said that you tried this but encountered dimension errors? Maybe show us the code!
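Since the original code was not posted, I can only guess at the cause of the dimension errors. One common culprit is feeding sequences of different lengths, so here is a rough standard-library sketch of the index encoding plus padding to a fixed length. The a-z vocabulary, the padding index, and the embedding size of 4 are all assumptions for illustration; real filenames would need digits, spaces and punctuation in the vocabulary too.

```python
import random
import string

# Assumed vocabulary: lowercase a-z only; index = position in the alphabet.
VOCAB = {ch: i for i, ch in enumerate(string.ascii_lowercase)}
PAD = len(VOCAB)   # extra index reserved for padding (hypothetical choice)
EMBED_DIM = 4      # arbitrary small size for the example


def encode(word, max_len=None):
    """Map each character to its vocabulary index, optionally padding/truncating."""
    ids = [VOCAB[ch] for ch in word]
    if max_len is not None:
        ids = ids[:max_len] + [PAD] * (max_len - len(ids))
    return ids


# A plain lookup table standing in for an embedding layer:
# one random vector per vocabulary index (including the padding index).
random.seed(0)
embedding_table = [
    [random.uniform(-1, 1) for _ in range(EMBED_DIM)]
    for _ in range(len(VOCAB) + 1)
]


def embed(ids):
    """Look up the vector for every index in the sequence."""
    return [embedding_table[i] for i in ids]


print(encode("bad"))               # indices of b, a, d
padded = encode("bad", max_len=5)  # same length for every input
vectors = embed(padded)            # shape: max_len x EMBED_DIM
```

With every input padded to the same max_len, each example becomes a max_len x EMBED_DIM matrix, which is the shape a sequence model expects per example.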
Or you could create non-random embeddings with word2vec using the Gensim library:
from gensim.models import Word2Vec
# 'total_words' is a list containing every word of your dataset split into its characters,
# e.g. [["b", "a", "d"], ["g", "o", "o", "d"]]
total_words = [...]
save_model_file = "char2vec.model"  # any path you like
# gensim >= 4.0 calls this parameter vector_size; older versions call it size
model = Word2Vec(total_words, min_count=1, vector_size=32)
model.save(save_model_file)
# let's test it for the character 'a'
embedder = Word2Vec.load(save_model_file)
# gensim >= 4.0 looks vectors up through .wv
v = embedder.wv["a"]
# v is now the embedding vector of 'a', a 1-D array of length 32
I hope this makes it clear how to create embeddings for characters.
You can treat characters in single-word-classification the exact same way you would treat words in sentence-classification.

Use pos tagging in bag of words

I'm using the bag of words for text classification.
Results aren't good enough, test set accuracy is below 70%.
One of the things I'm considering is using POS tagging to distinguish the function of words. What is the go-to approach for doing this?
I'm thinking of appending the tags to the words. For example, for the word "love", if it's used as a noun use:
love_noun
and if it's a verb use:
love_verb
Test set accuracy near 70% is not that bad if you have hundreds of categories. You might want to measure overall precision and recall instead of accuracy.
What you proposed sounds good; it is a way of adding feature conjunctions as additional features. Here are a few suggestions:
Still keep your original features. That is to say, don't replace love with love_noun or love_verb. Instead, you have two features coming from love:
love, love_noun (or)
love, love_verb
If you need some sample code, you can start with the NLTK Python package.
>>> from nltk import pos_tag, word_tokenize
>>> pos_tag(word_tokenize("Love is a lovely thing"))
[('Love', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('lovely', 'JJ'), ('thing', 'NN')]
Consider using n-grams, maybe starting from adding 2-grams. For example, you might have "in" and "stock" and you might just remove "in" because it is a stop-word. If you consider 2-grams, you will get a new feature:
in-stock
which has a different meaning from "stock". It might help a lot in certain cases, for example to distinguish "finance" from "shopping".
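Both suggestions can be sketched in a few lines of plain Python. This is only an illustration: the function names are made up, the input is the (word, tag) output shown in the NLTK example above, and it uses the raw Penn Treebank tags rather than coarse labels like noun/verb.

```python
def pos_augmented_features(tagged_tokens):
    """Keep each original word and add a word_TAG conjunction next to it."""
    features = []
    for word, tag in tagged_tokens:
        token = word.lower()
        features.append(token)             # original feature, e.g. "love"
        features.append(f"{token}_{tag}")  # conjunction, e.g. "love_NNP"
    return features


def bigram_features(tokens):
    """Join each pair of adjacent tokens with a hyphen, e.g. "in-stock"."""
    return [f"{a}-{b}" for a, b in zip(tokens, tokens[1:])]


tagged = [('Love', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('lovely', 'JJ'), ('thing', 'NN')]
print(pos_augmented_features(tagged))
print(bigram_features(["item", "is", "in", "stock"]))
```

If you want bigrams like "in-stock" to survive, build them before removing stop words; otherwise "in" is already gone by the time the pairs are formed.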

How To Train Stanford NER For Names That Include Spaces?

The example training exercise labels single-term names after tokenizing with something like a simple split(' ').
I need to train for and recognize names that include spaces. How do I train the recognizer?
Example: "I saw a Big Red Apple Tree." -- How would I tokenize for training and then recognize "Big Red Apple Tree" instead of recognizing four separate words?
Will this work for the training data?
I\tO
saw\tO
a\tO
Big Red Apple Tree\tMyName
.\tO
Would the output from the recognizer look the same as that?
The training section in the FAQ says "The training file parser isn't very forgiving: You should make sure each line consists of solely content fields and tab characters. Spaces don't work."
The problem you are trying to solve is phrase identification. There are different ways to tag the words; for example, you can tag them with IOB tags. Train the Stanford NER model on this newly tagged data, then write a post-processing step to concatenate the predicted tokens.
For example, your training data should look like this:
I\tO
saw\tO
a\tO
Big\tB-MyName
Red\tI-MyName
Apple\tI-MyName
Tree\tO-MyName
.\tO
So basically, you are using [ O, B-MyName, I-MyName, O-MyName ] as tags.
I have solved a similar problem this way and it works great. But make sure you have enough data to train on.
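The post-processing step could look roughly like this. It is a sketch, not part of Stanford NER: merge_entities is a made-up helper, and it assumes the recognizer's output has already been read into (token, tag) pairs using the four tags above.

```python
def merge_entities(tagged_tokens):
    """Join consecutive tagged tokens into (phrase, label) pairs."""
    entities = []
    current_words, current_label = [], None

    def close_current():
        if current_words:
            entities.append((" ".join(current_words), current_label))

    for word, tag in tagged_tokens:
        if tag == "O":
            # Plain outside tag: finish any entity that is still open.
            close_current()
            current_words, current_label = [], None
            continue
        prefix, label = tag.split("-", 1)
        if prefix == "B" or label != current_label:
            # A new entity begins; finish the previous one first.
            close_current()
            current_words, current_label = [word], label
        else:
            # I-... or O-... with the same label continues the entity.
            current_words.append(word)
    close_current()
    return entities


predicted = [
    ("I", "O"), ("saw", "O"), ("a", "O"),
    ("Big", "B-MyName"), ("Red", "I-MyName"),
    ("Apple", "I-MyName"), ("Tree", "O-MyName"),
    (".", "O"),
]
print(merge_entities(predicted))
```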

Stanford GloVe's lack of punctuation?

I understand that GloVe trains vectors by noticing what frequently co-occurs, etc., but how come commas and periods are not included? For anything NLP, it seems like it would be important for them to have a vector representation. I realize that something like (king - man = queen) would make no sense with (word - , = ?), but is there a way to represent punctuation marks and numbers?
Is there a pre-made data set that includes such things? Would this even work?
I tried training GloVe with my own data set, but I ran into a problem with separating the punctuation (with a blank space) between words, etc.
Pre-trained GloVe vectors do have punctuation; what makes you think they don't? At least the Wikipedia 2014 + Gigaword 5 (6B tokens) set from http://nlp.stanford.edu/projects/glove/ has embeddings for ",", ".", "-" and others. Just download these word vectors and verify it yourself; they are in plain text format, so it's easy to do.
I have worked a bit with the word vectors used by Senna, and I am looking at the vocab list.
http://ml.nec-labs.com/senna/
I definitely see entries for punctuation.
A trick for handling numbers is to replace every digit with 0, and then learn a distribution for each pattern. For instance 1999 is mapped to 0000, 01-01-2015 is mapped to 00-00-0000, etc...
Senna has entries for these patterns like 0000, etc...
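The digit trick is a one-liner with a regular expression. A small sketch (mask_digits is just a name picked for the example):

```python
import re
from collections import Counter


def mask_digits(token):
    """Replace every digit with 0 so tokens collapse onto their digit pattern."""
    return re.sub(r"\d", "0", token)


tokens = ["1999", "2015", "01-01-2015", "31-12-1999", "hello"]
patterns = Counter(mask_digits(t) for t in tokens)
print(patterns)
```

All four-digit years end up sharing the single vocabulary entry 0000, so the model can learn one vector per pattern instead of one per number.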
I will look over GloVe and try to update this answer soon...
It is totally OK, and also common, to handle punctuation marks as single tokens for word vector generation. See for example the word2vec papers. I assume that the prebuilt word2vec datasets include punctuation, and I'm sure the prebuilt GloVe vectors do too.
There are a lot of tokenizers that split punctuation off into separate tokens. One I know for sure is the ARK Tweet Tokenizer.
I have used this kind of conversion for numbers and punctuation. It is not a great approach, but it can be slightly useful.
For numbers, I convert all numbers to "NUM".
ex: 178 = "NUM" or 654 = "NUM"
For punctuation, I convert each mark to "PUNC".
ex: apple, orange, banana = apple "PUNC" orange "PUNC" banana
This is not a good solution, but it works to some extent.
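A rough sketch of that conversion, assuming a simple regex tokenizer (runs of word characters, or single punctuation marks); the tokenizer and the function name are choices made for the example, not anything standard:

```python
import re


def normalize_tokens(text):
    """Replace pure-digit tokens with NUM and punctuation marks with PUNC."""
    normalized = []
    for token in re.findall(r"\w+|[^\w\s]", text):
        if token.isdigit():
            normalized.append("NUM")
        elif re.fullmatch(r"[^\w\s]", token):
            normalized.append("PUNC")
        else:
            normalized.append(token)
    return normalized


print(normalize_tokens("apple, orange, banana"))
print(normalize_tokens("178 or 654"))
```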

How to convert plain text into feature/value pair format

I checked various SVM classifiers that use a feature/value pair format for classification. (I am focusing on svmlight - http://svmlight.joachims.org/.) The format looks like this:
-1 1:0.43 3:0.12 9284:0.2 # abcdef
But as I am getting user input in the form of plain text, I need to convert the plain text to this format in order to classify it with svmlight.
How can this be done?
You have to use some real-valued embedding. In other words, you have data in the space of texts, which is more or less the space of variable-length sequences of words. There are numerous approaches, each better suited to some purposes than others; the simplest ones include:
encode on the word level, so each word is a "dimension": you create a dictionary of words and assign each word a consecutive integer. Now each document can be encoded as a vector, where each feature's value is, for example, "is the word in the document" (set of words), or "how many times does the word occur" (bag of words; also known as term frequency, tf), or some more complex statistic (for example tf-idf: term frequency multiplied by inverse document frequency).
encode on the level of n-grams: similar to the previous one, but instead of enumerating each word you enumerate each n-gram (an n-gram is any sequence of n words). This is a more syntactic feature, but it requires significantly more data to train on.
use some "magical encoding" or specialized "string kernels".
The first two approaches can easily be done using scikit-learn's TfidfVectorizer, see http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html . The last one requires more complex software.
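For the word-level approach, the conversion to svmlight's format can be sketched without any library. The helper names and the toy documents below are made up, the values are raw term frequencies rather than tf-idf, and feature indices start at 1 and are written in increasing order, which is what svmlight expects:

```python
from collections import Counter


def build_vocabulary(documents):
    """Assign each distinct word a consecutive integer index, starting at 1."""
    vocab = {}
    for doc in documents:
        for word in doc.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab


def to_svmlight_line(label, text, vocab):
    """Encode one document as 'label index:count ...' using term frequencies."""
    counts = Counter(w for w in text.lower().split() if w in vocab)
    pairs = sorted((vocab[w], c) for w, c in counts.items())
    features = " ".join(f"{index}:{count}" for index, count in pairs)
    return f"{label} {features}".rstrip()


training_docs = ["good food good service", "bad food"]
vocab = build_vocabulary(training_docs)

print(to_svmlight_line("+1", training_docs[0], vocab))
print(to_svmlight_line("-1", training_docs[1], vocab))
# Words never seen in training are simply skipped at prediction time.
print(to_svmlight_line("0", "bad unknown service", vocab))
```

The vocabulary has to be built once on the training texts and then reused unchanged for user input, otherwise the indices will not match the trained model. If you go the scikit-learn route instead, sklearn.datasets.dump_svmlight_file can write a vectorizer's output in this format.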
