I am trying to program a machine learning algorithm to learn from training data and classify the language of each instance. There are 4 classes in total: Polish, French, Slovak, and German.
The training data consists of full sentences, but the test data is represented as sequences of single characters.
For example, an instance of my training data looks like this:
"Et oui cest la fille du patron Il fait tout"
But my testing data looks like this:
"e e n t l n r i a e i a v i t s r e t n"
How come my training dataset is so different from my testing dataset, and what would be an appropriate feature selection for this problem?
It is suspicious that your training and test sets look so different. The only method that comes to mind is to use letter-frequency distributions: if you have large enough paragraphs, you can calculate the percentage of occurrences of each letter for each language and match those distributions against your data. For example, it is known that in a large enough English text the letter "a" appears ~8.167% of the time and the letter "e" ~12.702%, whereas in German "a" occurs ~6.5% and "e" ~16.4%. Other languages have different distributions.
Check this Wikipedia article: https://en.wikipedia.org/wiki/Letter_frequency
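As a rough sketch of the matching step (the frequency numbers below are approximate placeholders; fill in full tables for all four languages and all letters from that article):
from collections import Counter

# Approximate letter frequencies per language (placeholders; take complete
# tables, including Polish and Slovak, from the Wikipedia article).
LETTER_FREQ = {
    "French": {"e": 0.147, "a": 0.076, "i": 0.075, "t": 0.072, "n": 0.071},
    "German": {"e": 0.164, "n": 0.098, "i": 0.076, "a": 0.065, "t": 0.062},
}

def language_scores(text):
    letters = [c.lower() for c in text if c.isalpha()]
    counts = Counter(letters)
    total = sum(counts.values()) or 1
    scores = {}
    for lang, freq in LETTER_FREQ.items():
        # sum of squared differences between observed and expected frequencies
        scores[lang] = sum((counts.get(ch, 0) / total - p) ** 2 for ch, p in freq.items())
    return scores  # lower score = closer match

sample = "e e n t l n r i a e i a v i t s r e t n"
scores = language_scores(sample)
print(min(scores, key=scores.get))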
I am trying to build a classifier to classify some files into 150 categories based on the name of those files. Here are some examples of file names in my dataset (~700k files):
104932489 - urgent - contract validation for xyz limited.msg
treatment - an I l - contract n°4934283 received by partner.pdf
- invoice_8843238_1_europe services_business 8592342sid paris.xls
140159498736656.txt
140159498736843.txt
fsk_000000001090296_sdiselacrusefeyre_2000912.xls
fsk_000000001091293_lidlsnd1753mdeas_2009316.xls
You can see that the filenames can really be anything, but there is always some pattern that is respected within the same category. It can be in the numbers (which are sometimes close), in the special characters (spaces, -, °), sometimes the length, etc.
Extracting all those patterns one by one will take ages because I have approximately 700k documents. Also, I am not interested in 100% accuracy, 70% can be good enough.
The real problem is that I don't know how to encode this data. I have tried many methods:
Tokenizing character by character and feeding them to an LSTM model with an embedding layer. However, I wasn't able to implement it and got dimension errors.
Adapting Word2Vec to convert the characters into vectors. However, this automatically drops all punctuation and space characters, and I also lose the numeric data. Another problem is that it creates useless dimensions: if the size is 20, my data lives in 20 dimensions, but if I look closely, it is always the same 150 vectors in those 20 dimensions, so it's really wasteful. I could use a size of 2 dimensions, but I still need the numeric data and the special characters.
Generating n-grams from each path, in the range 1-4, then using a CountVectorizer to compute the frequencies. I checked, and special characters were not dropped, but it gave me around 400,000 features! I am running a dimensionality reduction using UMAP (n_components=5, metric='hellinger'), but the reduction runs for 2 hours and then the kernel crashes. (A minimal sketch of the n-gram step is shown below.)
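For reference, this is roughly what that n-gram step looks like ('filenames' is just a tiny placeholder list):
from sklearn.feature_extraction.text import CountVectorizer

# character n-grams in the range 1-4, as described above
filenames = [
    "104932489 - urgent - contract validation for xyz limited.msg",
    "140159498736656.txt",
]
vectorizer = CountVectorizer(analyzer="char", ngram_range=(1, 4))
X = vectorizer.fit_transform(filenames)
print(X.shape)  # (n_files, n_ngram_features) -- on 700k real names this grows to ~400,000 columns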
Any ideas?
I am currently also working on a character-level LSTM, and it works exactly the same as when you use words. You need a vocabulary, for example a-z, and then you just take the index of each letter as its integer representation. For example:
"bad" -> "b", "a", "d" -> [1, 0, 3]
Now you could create an embedding lookup table (for example using PyTorch's nn.Embedding). You just have to create a random vector for every index of your vocab. For example:
"a" -> 0 > [-0.93, 0.024, -.0.73, ..., -0.12]
You said that you tried this but encountered dimension errors? Maybe show us the code!
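In the meantime, here is a minimal PyTorch sketch of the index-then-embed idea (the vocabulary and sizes are just placeholders):
import torch
import torch.nn as nn

# vocabulary: 'a'-'z' mapped to indices 0-25 (extend with digits/punctuation as needed)
vocab = {ch: i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz")}

embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)

word = "bad"
indices = torch.tensor([vocab[c] for c in word])  # tensor([1, 0, 3])
vectors = embedding(indices)                      # shape (3, 8): one random vector per character
print(vectors.shape)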
Or you could create non-random embeddings with word2vec using the Gensim library:
from gensim.models import Word2Vec

# 'total_words' is a list of your dataset's words, each split into its characters,
# e.g. [["b", "a", "d"], ["g", "o", "o", "d"], ...]
total_words = [...]

model = Word2Vec(total_words, min_count=1, vector_size=32)  # use size=32 in gensim < 4.0
model.save(save_model_file)

# let's test it for the character 'a'
embedder = Word2Vec.load(save_model_file)
v = embedder.wv["a"]
# v is now the embedding vector of 'a', a 32-dimensional vector
I hope this makes it clear how to create embeddings for characters.
You can treat characters in single-word-classification the exact same way you would treat words in sentence-classification.
I'm new to machine learning and I'm trying to come up with a model that will complete the second word in phrases. I couldn't find a solution to this exact problem, although there are lots of tutorials on generating text with RNNs.
So, suppose you have the following 2 files:
1) a word dictionary for training
Say we have a table with 2 columns of word pairs, 'complete' and 'sample', such that the first column contains different word pairs ("Hello dear", "my name", "What time", "He goes", etc.) and the second one contains the first words plus only a part (> 2 letters) of the second words ("Hello de", "my nam", "What ti", "He goe", etc.).
2) a table for testing
It's a table that consists of only 'sample' column.
The aim is to add 'complete' column to the second table with complete pairs of words.
The only approach I could come up with is the following (a small sketch in code follows the list):
compute the frequencies of all first words (P(w1))
compute the frequencies of all complete second words (P(w2))
compute the frequencies of all first words given complete second words (P(w1|w2))
predict complete second words using Bayes rule:
w2 = argmax_{w2} ( P(w2|w1)) = argmax_{w2} (P(w1|w2) * P(w2))
for each w1 in the test table, predict the most probable w2, or the most frequent w2 overall if w1 is not in the dictionary.
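Here is a small sketch of that counting approach (the names and the tiny word list are just placeholders):
from collections import Counter, defaultdict

# 'pairs' stands for the 'complete' training column, split into (first, second) words
pairs = [("Hello", "dear"), ("my", "name"), ("What", "time"), ("He", "goes")]

count_w2 = Counter(w2 for _, w2 in pairs)   # for P(w2)
count_w1_w2 = defaultdict(Counter)          # for P(w1 | w2)
for w1, w2 in pairs:
    count_w1_w2[w2][w1] += 1
total = sum(count_w2.values())

def complete(w1, prefix):
    # candidate second words: complete words starting with the observed partial word
    candidates = [w2 for w2 in count_w2 if w2.startswith(prefix)]
    if not candidates or all(count_w1_w2[w2][w1] == 0 for w2 in candidates):
        return count_w2.most_common(1)[0][0]   # fall back to the most frequent w2
    # argmax over P(w1 | w2) * P(w2)
    return max(candidates,
               key=lambda w2: (count_w1_w2[w2][w1] / count_w2[w2]) * (count_w2[w2] / total))

print(complete("Hello", "de"))  # -> 'dear'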
The problem is that this algorithm doesn't work sufficiently well. How can I optimise the probabilities somehow (maybe gradient descent would be helpful)? Is there any other way to address this task?
I'm trying to predict whether reviews on Yelp are positive or negative by performing linear regression using SGD. I tried two different feature extractors. The first was character n-grams and the second was separating words by spaces. I tried different n values for the character n-grams and found the n value that gave me the best test error. I noticed that this test error (0.27 on my test data) was nearly identical to the test error from extracting words separated by spaces. Is there a reason behind this coincidence? Shouldn't the character n-grams have a lower test error, since they extract more features than the word features?
Character n-gram: ex. n=7
"Good restaurant" => "Goodres" "oodrest" "odresta" "drestau" "restaur" "estaura" "stauran" "taurant"
Word features:
"Good restaurant" => "Good" "restaurant"
Looks like the n-gram method simply produced a lot of redundant, overlapping features which do not contribute to the precision.
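For reference, a minimal scikit-learn sketch of the two extractors being compared (the toy data and model settings are just placeholders):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# toy reviews with 0/1 sentiment labels, only to keep the snippet self-contained
reviews = ["Good restaurant", "Awful food here", "Great service and food", "Terrible place"]
labels = [1, 0, 1, 0]

# character 7-grams (analyzer='char' keeps spaces) vs. word features, both with an SGD linear model
char_model = make_pipeline(CountVectorizer(analyzer="char", ngram_range=(7, 7)), SGDClassifier())
word_model = make_pipeline(CountVectorizer(analyzer="word"), SGDClassifier())

char_model.fit(reviews, labels)
word_model.fit(reviews, labels)
print(char_model.predict(["Good food"]), word_model.predict(["Good food"]))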
What changes are required to the existing seq2seq model in TensorFlow so that I can use character units rather than word units for the seq2seq task? And would this be a good configuration for a predictive text application?
The following function signatures may need modification for this task:
def embedding_rnn_seq2seq(encoder_inputs, decoder_inputs, cell,
num_encoder_symbols, num_decoder_symbols,
output_projection=None, feed_previous=False,
dtype=dtypes.float32, scope=None):
Apart from a reduced input/output vocabulary, what other parameter changes would be required to implement such a character-level seq2seq model?
I think you could use the existing seq2seq model in tensorflow without any code changes for character-based units if you prepare your input data files by whitespace separating your training examples like this:
The quick brown fox.
Becomes:
T h e _SPACE_ q u i c k _SPACE_ b r o w n _SPACE_ f o x .
Then your vocabulary naturally becomes characters not words.
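For example, a tiny preprocessing sketch that produces this format (the _SPACE_ token name just follows the example above):
def to_char_tokens(line):
    # replace real spaces with a _SPACE_ token, then separate every character by whitespace
    return " ".join("_SPACE_" if ch == " " else ch for ch in line.strip())

print(to_char_tokens("The quick brown fox."))
# T h e _SPACE_ q u i c k _SPACE_ b r o w n _SPACE_ f o x .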
You can experiment with vocabulary size, embedding size, eliminating the embedding layer, etc., to see what works best for your data.
I checked various SVM classifiers that use a feature/value pair format for classification. (I am focusing on SVMlight: http://svmlight.joachims.org/.) The format is like this:
-1 1:0.43 3:0.12 9284:0.2 # abcdef
But as I am getting user input in the form of plain text, I need to convert the plain text into this format in order to classify it with SVMlight.
How can this be done?
You have to use some real-valued embedding. In other words, you have data in the space of texts, which is more or less the space of variable-length sequences of words. There are numerous approaches, one better for one purpose and another for a different one; the simplest include:
encode on the word level, so each word is a "dimension": in your case, you create a dictionary of words and assign each word a consecutive integer. Now each document can be encoded as a vector where each feature's value is, for example, "whether the word is in the document" (set of words), "how many times the word occurs" (bag of words; also known as term frequency, tf), or some more complex statistic (for example tf-idf: term frequency multiplied by inverse document frequency).
encode on the level of n-grams, similarly to the previous one, but instead of enumerating each word you enumerate each n-gram (an n-gram is any sequence of n words); this is a more syntactic feature, but it requires significantly more data to train on.
use some "magical encoding" or specialistic "string kernels".
The first two approaches can easily be done using scikit-learn's TfidfVectorizer, see http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html. The last one requires more complex software.
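For example, a minimal sketch with scikit-learn that also writes the result in SVMlight's format (the toy texts, labels, and file name are placeholders):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.datasets import dump_svmlight_file

# toy corpus and labels, just to show the conversion; plug in your own texts and labels
texts = ["this is a positive example", "this one is negative", "another negative text"]
labels = [1, -1, -1]

vectorizer = TfidfVectorizer()        # word-level tf-idf; see the linked docs for n-gram options
X = vectorizer.fit_transform(texts)   # sparse matrix: one row per document

# write the sparse matrix in SVMlight's "label index:value" format
dump_svmlight_file(X, labels, "train.dat", zero_based=False)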