Can I use an embedding layer instead of one-hot encoding as category input?

I am trying to use FFM to predict binary labels. My dataset is as follows:
sex|age|price|label
0|0|0|0
1|0|1|1
I know that FFM (field-aware factorization machine) is a model that treats related attributes as belonging to the same field. If I use one-hot encoding to transform the dataset, it will look like this:
sex_0|sex_1|age_0|age_1|price_0|price_1|label
0|0|0|0|0|0|0
0|1|0|0|0|1|1
Thus, sex_0 and sex_1 can be considered one field; the other attributes are handled similarly.
My question is whether I can use an embedding layer to replace the one-hot encoding step. However, this raises some concerns:
I have no other related dataset, so I cannot use any pre-trained embedding model. I can only randomly initialize the embedding weights and then train them on my own dataset. Will this approach work?
If I use an embedding layer instead of one-hot encoding, does it mean that each attribute will belong to one field?
What is the difference between these two methods? Which is better?

Yes, you can use embeddings, and that approach does work.
A category will not map to a single element of the embedding; rather, the combination of elements (the whole vector) represents that category. The size of the embedding is something you will have to select yourself. A good rule of thumb is embedding_size = min(50, (m + 1) // 2), where m is the number of categories, so for m = 10 you get an embedding size of 5.
A larger embedding size lets the model capture more detail about the relationships between the category values.
In my experience, embeddings help especially when a feature has hundreds of categories; if a feature has only a small number of categories (e.g. a person's sex), one-hot encoding is sufficient.
As for which is better: I find embeddings generally perform better when there are hundreds of unique values in a category. I don't have concrete reasons why, only some intuitions.
For example, representing categories as 300-dimensional dense vectors (word embeddings) requires classifiers to learn far fewer weights than if the categories were represented as 50,000-dimensional vectors (one-hot encoding), and the smaller parameter space likely helps with generalization and avoiding overfitting.
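To make the idea concrete, here is a minimal sketch of one embedded categorical field feeding a binary classifier, assuming a TensorFlow/Keras setup; the field size, layer sizes, and names are illustrative rather than taken from the question:
# Minimal sketch, assuming TensorFlow/Keras; sizes are illustrative.
import tensorflow as tf

num_categories = 10                                    # m: unique values in this field
embedding_size = min(50, (num_categories + 1) // 2)    # rule of thumb from above

field_input = tf.keras.Input(shape=(1,), dtype="int32")        # integer-encoded category
embedded = tf.keras.layers.Embedding(input_dim=num_categories,
                                     output_dim=embedding_size)(field_input)
embedded = tf.keras.layers.Flatten()(embedded)
output = tf.keras.layers.Dense(1, activation="sigmoid")(embedded)

model = tf.keras.Model(field_input, output)
model.compile(optimizer="adam", loss="binary_crossentropy")
# The embedding weights start random and are learned from your own labels,
# so no pre-trained embeddings are needed.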

Related

How to apply tf-idf on multiple predictors, don't want to concatenate into a single column

I have two predictors and want to vectorize each of them using tf-idf (I don't want to concatenate them first, since each needs its own vocabulary). Should I apply a tf-idf vectorizer to each and then join the features?
For example, if I apply tf-idf to predictor1 and get 100 features from it and 200 from predictor2, my training data would simply have 300 features (100 + 200). Am I thinking about this correctly?
I will get two matrices from this (one for each predictor); can I concatenate these using numpy functions and use them as features?
Your approach is correct. The most common way of using two feature sets like this is to concatenate them into one longer vector per sample and then feed that to the model.
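A short sketch of that with scikit-learn (the column names and the tiny DataFrame are hypothetical; scipy.sparse.hstack keeps the matrices sparse):
# Hypothetical two-column text data, just to make the sketch self-contained.
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame({
    "predictor1": ["first text column", "another row of text"],
    "predictor2": ["second text column", "more text here"],
})

vec1 = TfidfVectorizer()                    # separate vocabulary for predictor1
vec2 = TfidfVectorizer()                    # separate vocabulary for predictor2

X1 = vec1.fit_transform(df["predictor1"])   # e.g. (n_samples, 100)
X2 = vec2.fit_transform(df["predictor2"])   # e.g. (n_samples, 200)

X = hstack([X1, X2])                        # combined features, e.g. (n_samples, 300)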
If, for some reason, this doesn't work out for you, we can explore alternatives based on what your constraints are.
For example, if your constraint is total dimensionality, one way to solve this would be to create a multilayer MLP autoencoder.
We can train it with the combined vector as both input and output until the encoder has learned a good compression.
Subsequently, we can use any intermediate layer's activations as input to our model.
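A rough sketch of that idea (assuming Keras; the layer sizes and bottleneck dimension are arbitrary choices, not something prescribed by the question):
# Rough sketch, assuming Keras; sizes are arbitrary.
import tensorflow as tf

input_dim = 300     # size of the concatenated tf-idf vector
bottleneck = 64     # target dimension

inputs = tf.keras.Input(shape=(input_dim,))
encoded = tf.keras.layers.Dense(128, activation="relu")(inputs)
encoded = tf.keras.layers.Dense(bottleneck, activation="relu")(encoded)
decoded = tf.keras.layers.Dense(128, activation="relu")(encoded)
decoded = tf.keras.layers.Dense(input_dim, activation="linear")(decoded)

autoencoder = tf.keras.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(X_dense, X_dense, epochs=20)   # same array as input and target
#   (X_dense would be the concatenated tf-idf matrix made dense, e.g. X.toarray())

encoder = tf.keras.Model(inputs, encoded)        # reuse the bottleneck activations
# features = encoder.predict(X_dense)            # feed these to the downstream model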
It would be easier to suggest a solution if you can describe your constraints in the question.

How to define labels in machine learning other than with one-hot encoding

I want to make a model that can classify attributes, not classes.
For example, when I input this image,
my model should output 'this furniture has [brown color, 4 legs, fabric sheet]'.
I used a pre-trained ResNet, but it doesn't work well,
so I tried to make a new model, but I can't define the label values.
I don't think I can achieve my goal with one-hot encoding.
How can I implement this?
Please give me some ideas.
You're right that this probably doesn't work with plain one-hot encoding; let's take a look at what options you do have.
Option 1: Still one-hot encoding
If you want your model to output only a limited number of attributes, and they are non-overlapping, you can have k one-hot encoded output layers.
For example, the attributes color, number of legs, and material never overlap. You can then have your model predict a color, a number of legs, and a material for each input image. These can be represented and learned using 3 one-hot encoded vectors.
Pros:
typically nicer to train
will not have colliding predictions
Cons:
requires separating the attributes into disjoint groups by hand
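As a reference point for Option 1, here is a minimal multi-output sketch, assuming Keras; the backbone and the per-head class counts are illustrative assumptions:
# Sketch of Option 1, assuming Keras; class counts and backbone are illustrative.
import tensorflow as tf

image_in = tf.keras.Input(shape=(224, 224, 3))
x = tf.keras.layers.Conv2D(32, 3, activation="relu")(image_in)   # stand-in backbone
x = tf.keras.layers.GlobalAveragePooling2D()(x)

color_out = tf.keras.layers.Dense(8, activation="softmax", name="color")(x)
legs_out = tf.keras.layers.Dense(5, activation="softmax", name="legs")(x)
material_out = tf.keras.layers.Dense(6, activation="softmax", name="material")(x)

model = tf.keras.Model(image_in, [color_out, legs_out, material_out])
model.compile(optimizer="adam",
              loss={"color": "categorical_crossentropy",
                    "legs": "categorical_crossentropy",
                    "material": "categorical_crossentropy"})
# Each head gets its own one-hot label, so the three predictions can never collide.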
Option 2: Don't use softmax, sigmoid FTW
If you use a sigmoid activation instead of softmax (which is what I assume you're using), each output node is independent of the others. This way, each output gives its own probability.
In this scenario, your label will not be one-hot encoded, but rather it will be a binary vector, with variable number of 1s and 0s.
Instead of finding the max probability, you would most likely want to take a threshold probability, i.e. take all outputs with a probability of >80% as the predicted labels when evaluating.
Pros:
Does not require hand-made separation of attributes (since we are treating each class as independent of one another)
Easy representation for variable number of attributes
Cons:
Mathematically, and from experience as well, this tends to be much harder to train
It is possible (and, quite frankly, likely) that you will get colliding predictions, e.g. both 4 legs and 3 legs may come out of your neural network. You will need to handle these cases.
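A minimal sketch of Option 2 (assuming Keras; the attribute count, backbone, and the 0.8 threshold are illustrative):
# Sketch of Option 2, assuming Keras; numbers are illustrative.
import numpy as np
import tensorflow as tf

num_attributes = 20    # total number of possible attributes

image_in = tf.keras.Input(shape=(224, 224, 3))
x = tf.keras.layers.Conv2D(32, 3, activation="relu")(image_in)   # stand-in backbone
x = tf.keras.layers.GlobalAveragePooling2D()(x)
out = tf.keras.layers.Dense(num_attributes, activation="sigmoid")(x)

model = tf.keras.Model(image_in, out)
model.compile(optimizer="adam", loss="binary_crossentropy")

# Labels are binary vectors with a variable number of 1s. At prediction time,
# keep every attribute whose probability clears a threshold instead of taking argmax:
probs = model.predict(np.zeros((1, 224, 224, 3)))   # dummy image, just for the sketch
predicted_attributes = (probs > 0.8).astype(int)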
This really comes down to preference and to the sort of data you are working with. If you can choose attributes in a way that cleanly separates the options the neural network chooses from, like color and material (assuming you can't have two colors or two materials), the first option is probably best.
There are a couple of other ways to approach this problem, but these seem most closely applicable.

Word Embedding Model

I have been searching for and attempting to implement a word embedding model to predict similarity between words. I have a dataset made up of 3,550 company names; the idea is that the user can provide a new word (which would not be in the vocabulary) and calculate the similarity between the new name and the existing ones.
During preprocessing I removed stop words and punctuation (hyphens, dots, commas, etc.). In addition, I applied stemming and separated prefixes in the hope of getting more precision, so words such as BIOCHEMICAL ended up as BIO CHEMIC, i.e. the word divided in two (prefix and stemmed word).
The average company name is 3 words long.
The tokens that are the result of preprocessing are sent to word2vec:
# window: maximum distance between the current and predicted word within a sentence
# min_count: ignores all words with total frequency lower than this
# workers: use this many worker threads to train the model
# sg: training algorithm, either CBOW (0) or skip-gram (1); the default is 0
word2vec_model = Word2Vec(prepWords, size=300, window=2, min_count=1, workers=7, sg=1)
After the model has built its vocabulary from all the words, the average sentence vector is calculated for each company name:
df['avg_vector'] = df2.apply(lambda row: avg_sentence_vector(row, model=word2vec_model, num_features=300, index2word_set=set(word2vec_model.wv.index2word)).tolist(), axis=1)
Then, the vector is saved for further lookups:
# Save company names and vector values to a file
df.to_csv('name-submission-vectors.csv', encoding='utf-8', index=False)
If a new company name is not in the vocabulary after preprocessing (stop-word and punctuation removal), I rebuild the model, recalculate the average sentence vector, and save it again.
I have found that this model is not working as expected. For example, asking for the words most similar to pet gives the following results:
ms=word2vec_model.most_similar('pet')
('fastfood', 0.20879755914211273)
('hammer', 0.20450574159622192)
('allur', 0.20118337869644165)
('wright', 0.20001833140850067)
('daili', 0.1990675926208496)
('mgt', 0.1908089816570282)
('mcintosh', 0.18571510910987854)
('autopart', 0.1729743778705597)
('metamorphosi', 0.16965581476688385)
('doak', 0.16890916228294373)
The dataset contains words such as paws or petcare, yet it is other, unrelated words that end up most similar to pet.
This is the distribution of the nearest words to pet:
On the other hand, when I used GoogleNews-vectors-negative300.bin.gz I could not add new words to the vocabulary, but the similarity between pet and the words around it was as expected:
ms=word2vec_model.most_similar('pet')
('pets', 0.771199643611908)
('Pet', 0.723974347114563)
('dog', 0.7164785265922546)
('puppy', 0.6972636580467224)
('cat', 0.6891531348228455)
('cats', 0.6719794869422913)
('pooch', 0.6579219102859497)
('Pets', 0.636363685131073)
('animal', 0.6338439583778381)
('dogs', 0.6224827170372009)
This is the distribution of the nearest words:
I would like your advice on the following:
Is this dataset appropriate for this kind of model?
Is the dataset large enough for word2vec to "learn" the relationships between the words?
What can I do to improve the model so that word2vec creates relationships of the same quality as GoogleNews, where, for instance, pet is correctly placed among similar words?
Is it feasible to use an alternative such as fastText, considering the nature of the current dataset?
Do you know of any public dataset that could be combined with the current one to create those relationships?
Thanks
3500 texts (company names) of just ~3 words each is only around 10k total training words, with a much smaller vocabulary of unique words.
That's very, very small for word2vec & related algorithms, which rely on lots of data, and sufficiently-varied data, to train-up useful vector arrangements.
You may be able to squeeze some meaningful training out of such limited data by using far more training epochs than the default epochs=5 and far smaller vectors than the default size=100. With those sorts of adjustments, you may start to see more meaningful most_similar() results.
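For instance (a sketch assuming gensim 4.x, where the parameters are named vector_size and epochs; in older gensim they were size and iter):
# Sketch of the suggested adjustments, assuming gensim 4.x parameter names.
from gensim.models import Word2Vec

word2vec_model = Word2Vec(
    prepWords,          # the same tokenized company names as before
    vector_size=32,     # much smaller than the default of 100, for a tiny corpus
    window=2,
    min_count=1,
    sg=1,
    epochs=100,         # far more passes over the data than the default of 5
    workers=7,
)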
But, it's unclear that word2vec, and specifically word2vec in your averaging-of-a-name's-words comparisons, is matched to your end goals.
Word2vec needs lots of data, doesn't look at subword units, and can't say anything about word tokens not seen during training. An average of many word vectors can often work as an easy baseline for comparing multiword texts, but it can also dilute the influence of individual words compared to other methods.
Things to consider might include:
Word2vec-related algorithms like FastText that also learn vectors for subword units, and can thus bootstrap not-so-bad guess vectors for words not seen in training. (But, these are also data hungry, and to use on a small dataset you'd again want to reduce vector size, increase epochs, and additionally shrink the number of buckets used for subword learning.)
More sophisticated comparisons of multi-word texts, like "Word Mover's Distance". (That can be quite expensive on longer texts, but for names/titles of just a few words may be practical.)
Finding more data that's compatible with your aims for a stronger model. A larger database of company names might help. If you just want your analysis to understand English words/roots, more generic training texts might work too.
For many purposes, a mere lexicographic comparison - edit distances, count of shared character-n-grams – may be helpful too, though it won't detect all synonyms/semantically-similar words.
Word2vec does not generalize to unseen words.
It does not even work well for words that are seen but rare. It really depends on having many, many examples of word usage. Furthermore, you need enough context to the left and right, but you only use company names, and these are too short. That is likely why your embeddings perform so poorly: too little data and too-short texts.
Hence, it is the wrong approach for you. Retraining the model with the new company name is not enough; you still only have one data point for it. You may as well leave out unseen words; word2vec cannot do better than that even if you retrain.
If you only want to compute similarity between words, probably you don't need to insert new words in your vocabulary.
At a glance, I think you could also use fastText without needing to stem the words, since it also computes vectors for unknown words.
From FastText FAQ:
One of the key features of fastText word representation is its ability
to produce vectors for any words, even made-up ones. Indeed, fastText
word vectors are built from vectors of substrings of characters
contained in it. This allows to build vectors even for misspelled
words or concatenation of words.
FastText seems to be useful for your purpose.
For your task, you can follow the fastText supervised tutorial.
If your corpus proves to be too small, you can build your model starting from available pre-trained vectors (the pretrainedVectors parameter).
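As a quick sketch of the unsupervised side (using gensim's FastText implementation rather than the fastText CLI, with gensim 4.x parameter names assumed and the small-corpus settings suggested above):
# Sketch assuming gensim 4.x; FastText builds vectors from character n-grams,
# so even unseen words get a (rough) vector.
from gensim.models import FastText

ft_model = FastText(prepWords, vector_size=32, window=2, min_count=1, sg=1, epochs=100)

vector = ft_model.wv["biochemical"]                       # works even for a word never seen in training
similar = ft_model.wv.most_similar("biochemical", topn=10)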

Optimizing word2vec model comparisons

I have a word2vec model for every user, so I can see what the same word looks like across different models. Is there a more optimized way to compare the trained models than this?
userAvec = Word2Vec.load('userAvec.w2v')
userBvec = Word2Vec.load('userBvec.w2v')
# for each word in the vocab, compute the cosine similarity via a dot product:
cosine_similarity = np.dot(userAvec.wv['president'], userBvec.wv['president']) / (np.linalg.norm(userAvec.wv['president']) * np.linalg.norm(userBvec.wv['president']))
Is this the best way to compare two models? Is there a stronger way to see how two models compare than word by word? Picture 1,000 users/models, each with a similar number of words in the vocabulary.
There's a faulty assumption at the heart of your question.
If the models userAvec and userBvec were trained in separate sessions, on separate data, the calculated angle between userAvec['president'] and userBvec['president'] is, alone, essentially meaningless. There's randomness in the algorithm's initialization, and then in most modes of training (via things like negative sampling, frequent-word downsampling, and arbitrary reordering of training examples due to thread-scheduling variability). As a result, even repeated model training with the exact same corpus and parameters can result in different coordinates for the same words.
It's only the relative distances/directions, among words that were co-trained in the same iterative process, that have significance.
So it might be interesting to compare whether the two models' lists of top-N similar words, for a particular word, are similar. But the raw angle between the coordinates of the same word in alternate models isn't a meaningful measure.
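A sketch of that kind of comparison (assuming gensim 4.x, where the vocabulary is exposed as wv.key_to_index; the Jaccard overlap of neighbour lists is just one reasonable choice of score):
# Sketch, assuming gensim 4.x: compare models by neighbour-list overlap, not raw angles.
from gensim.models import Word2Vec

userAvec = Word2Vec.load('userAvec.w2v')
userBvec = Word2Vec.load('userBvec.w2v')

def neighbour_overlap(word, top_n=10):
    """Jaccard overlap of the two models' top-N most-similar lists for `word`."""
    neighbours_a = {w for w, _ in userAvec.wv.most_similar(word, topn=top_n)}
    neighbours_b = {w for w, _ in userBvec.wv.most_similar(word, topn=top_n)}
    return len(neighbours_a & neighbours_b) / len(neighbours_a | neighbours_b)

shared_vocab = set(userAvec.wv.key_to_index) & set(userBvec.wv.key_to_index)
overlap_scores = {w: neighbour_overlap(w) for w in shared_vocab}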

multi-label text classification with zero or more labels

I need to classify website text with zero or more categories/labels (5 labels such as finance, tech, etc.). My problem is handling text that doesn't belong to any of these labels.
I tried ML libraries (maxent, naive Bayes), but they incorrectly match "other" text to one of the labels. How do I train a model to handle the "other" text? The "other" label is so broad that it's not possible to pick a representative sample.
Since I have no ML background and don't have much time to build a good training set, I'd prefer a simpler approach like a term-frequency count using a predefined list of terms to match for each label. But with the counts, how do I determine a relevancy score, i.e. whether the text actually belongs to that label? I don't have a corpus and can't use tf-idf, etc.
Another idea is to use a neural network with a softmax output, which gives you a probability for every class. When the network is very confident about a class, it assigns that class a high probability and the others low probabilities; when it is unsure, the differences between the probabilities are small and none of them is very high. You could then define a threshold: if the probability of every class is less than, say, 70%, predict "other".
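A minimal sketch of that thresholding idea (swapping in a scikit-learn logistic regression for the neural net, since the logic is the same; the toy data, features, and the 0.7 cut-off are illustrative):
# Sketch: predict "other" when no known class is confident enough.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the real features/labels (5 known classes: finance, tech, ...).
X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           n_classes=5, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X, y)    # trained only on the known classes

probs = clf.predict_proba(X)                          # one probability per known class
best = probs.max(axis=1)
labels = clf.classes_[probs.argmax(axis=1)].astype(str)
predictions = np.where(best >= 0.7, labels, "other")  # low-confidence rows become "other"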
Whew! Classic ML algorithms don't combine multi-class classification and "in/out" detection at the same time. Perhaps what you could do is train five models, one for each class, each with one-against-the-world training. Then use an uber-model to check whether any of those five claims the input; if none claims it, it's "other".
Another possibility is to reverse the order of evaluation: train one model as a binary classifier on your entire data set. Train a second one as a 5-class SVM (for instance) within those five. The first model finds "other"; everything else gets passed to the second.
What about creating histograms? You could use a bag-of-words approach based on significant indicator terms for, e.g., tech and finance. You could try to identify such indicators by analyzing a website's tags and articles, or just browse the web for such indicators:
http://finance.yahoo.com/news/most-common-words-tech-finance-205911943.html
Let's say your input vector X has n dimensions, where n is the number of indicators. For example, X_i would then hold the count of occurrences of the word "asset" and X_(i+k) the count of the phrase "big data" in the current article.
Instead of defining 5 labels, define 6. Your last category would be something like a "catch-all" category. That's actually your zero-match category.
If you must support zero or more matching categories, train a model that returns a probability score per label/class (such as a neural net, as Luis Leal suggested). You could then rank your output by that score and say that every class with a score higher than some threshold t is a matching category.
Try this NBayes implementation.
For identifying the "other" category, don't bother too much. Just train on your required categories, which clearly identifies them, and introduce a threshold in the classifier.
If no label's score crosses the threshold, the classifier assigns the "other" label.
It's all in the training data.
AWS Elasticsearch percolate would be ideal, but we can't use it due to the HTTP overhead of percolating documents individually.
Classifier4J appears to be the best solution for our needs, because the model looks easy to train and it doesn't require training on non-matches.
http://classifier4j.sourceforge.net/usage.html

Resources