Sequence Models Word2vec - machine-learning

I am working on a data set with more than 100,000 records.
This is what the data looks like:
email_id cust_id campaign_name
123 4567 World of Zoro
123 4567 Boho XYz
123 4567 Guess ABC
234 5678 Anniversary X
234 5678 World of Zoro
234 5678 Fathers day
234 5678 Mothers day
345 7890 Clearance event
345 7890 Fathers day
345 7890 Mothers day
345 7890 Boho XYZ
345 7890 Guess ABC
345 7890 Sale
I am trying to understand the campaign sequence and predict the next possible campaign for the customers.
Assume I have processed my data and stored it in 'camp'.
With Word2Vec-
from gensim.models import Word2Vec
model = Word2Vec(sentences=camp, size=100, window=4, min_count=5, workers=4, sg=0)
The problem with this model is that it accepts tokens and spits out text-tokens with probabilities in return when looking for similarities.
Word2Vec accepts this form of input-
['World','of','Zoro','Boho','XYZ','Guess','ABC','Anniversary','X'...]
And gives this form of output -
model.wv.most_similar('Zoro')
[Guess,0.98],[XYZ,0.97]
Since I want to predict the campaign sequence, I was wondering if there is any way I can give the input below to the model and get a campaign name in the output.
My input to be as -
[['World of Zoro','Boho XYZ','Guess ABC'],['Anniversary X','World of Zoro','Fathers day','Mothers day'],['Clearance event','Fathers day','Mothers day','Boho XYZ','Guess ABC','Sale']]
Output -
model.wv.most_similar('World of Zoro')
[Sale,0.98],[Mothers day,0.97]
I am also not sure if there is any functionality within Word2Vec, or any similar algorithm, that can help predict campaigns for individual users.
Thank you for your help.

I don't believe that word2vec is the right approach to model your problem.
Word2vec uses two possible approaches: Skip-gram (given a target word predict its surrounding words) or CBOW (given the surrounding words predict the target word). Your case is similar to the context of CBOW, but there is no reason why the phenomenon that you want to model would respect the linguistic "rules" for which word2vec has been developed.
word2vec tends to predict the word that occurs more frequently in combination with the targeted one within the moving window (in your code: window=4). So it won't predict the best possible next choice but the one that occurred most often in the window span of the given word.
In your call to word2vec (Word2Vec(sentences=camp, size=100, window=4, min_count=5, workers=4, sg=0)) you are also using min_count=5 so the model is ignoring the words that have a frequency less than 5. Depending on your dataset size, there could be a loss of relevant information.
I suggest looking at forecasting techniques and time series analysis methods. I have the feeling that you will obtain better predictions using these techniques rather than word2vec. (https://otexts.org/fpp2/index.html)
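To make the "next campaign" framing concrete, here is a minimal sketch of a much simpler baseline than word2vec or a full forecasting model: counting which campaign directly follows which in each customer's history. This is not part of the suggestion above, just an illustration. The function names build_transitions and predict_next are made up, and the sequences are the three from the question.

```python
from collections import Counter, defaultdict

# One list of campaign names per customer, in the order they were received
# (taken from the question's example data).
sequences = [
    ["World of Zoro", "Boho XYZ", "Guess ABC"],
    ["Anniversary X", "World of Zoro", "Fathers day", "Mothers day"],
    ["Clearance event", "Fathers day", "Mothers day", "Boho XYZ", "Guess ABC", "Sale"],
]


def build_transitions(sequences):
    """Count how often each campaign is immediately followed by another."""
    transitions = defaultdict(Counter)
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            transitions[current][following] += 1
    return transitions


def predict_next(transitions, campaign, top_n=3):
    """Return up to top_n (campaign, relative frequency) pairs seen right after `campaign`."""
    counts = transitions.get(campaign)
    if not counts:
        return []  # campaign never seen, or only seen at the end of a sequence
    total = sum(counts.values())
    return [(name, n / total) for name, n in counts.most_common(top_n)]


transitions = build_transitions(sequences)
print(predict_next(transitions, "Fathers day"))
```

Unlike most_similar, this only looks at what came directly after a campaign, so it answers "what usually comes next" rather than "what tends to appear nearby in the window".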
I hope it helps

Related

Is a small vocabulary for Neural Nets ok?

I am designing a neural network to try and generate music. The neural network would be a 2-layer LSTM (Long Short-Term Memory).
I am hoping to encode the music into a many-hot format for training, i.e. it would be a 1 if that note was playing and a 0 if that note was not playing.
Here is an excerpt of what this data would look like:
0000000000000000000000000000000000000000000000000001000100100001000000000000000000000000
0000000000000000000000000000000000000000000000000001000100100001000000000000000000000000
0000000000000000000000000000000000000000000000000001000100100001000000000000000000000000
0000000000000000000000000000000000000000000000000001000100100001000000000000000000000000
0000000000000000000000000000000000000000000000000001000100100001000000000000000000000000
0000000000000000000000000000000000000000000000000011010100100001010000000000000000000000
There are 88 columns, which represent the 88 notes, and each row represents a new beat. The output will be at a character level.
I am just wondering since there are only 2 characters in the vocabulary, would the probability of a 0 being next always be higher than the probability of a 1 being next?
I know for a large vocabulary, a large training set is needed, but I only have a small vocabulary. I have 229 files, which correspond to about 50,000 lines of text. Is this enough to prevent the output from being all 0s?
Also, would it be better to have 88 nodes, 1 for each note, or just one node for one character at a time?
Thanks in advance
A small vocabulary is fine as long as your dataset is not skewed overwhelmingly to one of the "words".
As to "would it be better to have 88 nodes, 1 for each note, or just one node for one character at a time?", each timestep is represented as 88 characters. Each character is a feature of that timestep. Your LSTM should be outputting the next timestep, so you should have 88 nodes. Each node should output the probability of that note being present in that timestep.
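A rough sketch of that framing in plain Python, assuming each line of the file is one 88-character beat as in the question. The helper names (parse_piano_roll, next_step_pairs, note_on_fraction, make_beat) are hypothetical. It turns the text into 88-feature timesteps, pairs each timestep with the following one as its training target, and measures how skewed the data is toward 0.

```python
def parse_piano_roll(lines, n_notes=88):
    """Turn lines of '0'/'1' characters into a list of timesteps with n_notes features each."""
    roll = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if len(line) != n_notes or set(line) - {"0", "1"}:
            raise ValueError("expected %d characters of 0/1, got %r" % (n_notes, line))
        roll.append([int(ch) for ch in line])
    return roll


def next_step_pairs(roll):
    """Pair every timestep (input) with the one that follows it (target)."""
    return list(zip(roll[:-1], roll[1:]))


def note_on_fraction(roll):
    """Fraction of all cells that are 1, a quick check of how skewed the data is."""
    total = sum(len(step) for step in roll)
    return sum(map(sum, roll)) / total if total else 0.0


def make_beat(active_notes, n_notes=88):
    """Build one demo line with the given note indices switched on."""
    return "".join("1" if i in active_notes else "0" for i in range(n_notes))


lines = [
    make_beat({51, 55, 58, 63}),
    make_beat({51, 55, 58, 63}),
    make_beat({50, 51, 53, 55, 58, 63, 65}),
]
roll = parse_piano_roll(lines)
pairs = next_step_pairs(roll)
print(len(roll), len(pairs), round(note_on_fraction(roll), 3))
```

With 88 output nodes, each output would typically be an independent sigmoid trained against the corresponding 0/1 in the target timestep, and the note-on fraction gives a feel for how strongly an untrained model will be pulled toward predicting all 0s.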
Finally since you are building a Char-RNN I would strongly suggest using abc notation to represent your data. A song in ABC notation looks like this:
X:1
T:Speed the Plough
M:4/4
C:Trad.
K:G
|:GABc dedB|dedB dedB|c2ec B2dB|c2A2 A2BA|
GABc dedB|dedB dedB|c2ec B2dB|A2F2 G4:|
|:g2gf gdBd|g2f2 e2d2|c2ec B2dB|c2A2 A2df|
g2gf g2Bd|g2f2 e2d2|c2ec B2dB|A2F2 G4:|
This is perfect for Char-RNNs because it represents every song as a sequence of characters, and you can run conversions from MIDI to ABC and vice versa. All you have to do is train your model to predict the next character in this sequence instead of dealing with 88 output nodes.

Using SVM to predict text with label

I have data in a csv file in the following format
Name Power Money
Jon Red 30
George blue 20
Tom Red 40
Bob purple 10
I consider values like "jon", "red" and "30" as inputs. Each input has a label. For instance, the inputs [jon,george,tom,bob] have the label "name". The inputs [red,blue,purple] have the label "power". This is basically what my training data looks like: I have a bunch of values that are each mapped to a label.
Now I want to use an SVM to train a model on my training data so that, given a new input, it accurately identifies the correct label. For instance, if the input provided is "444", the model should be smart enough to categorize it as a "Money" label.
I have installed Python and also installed sklearn. I have completed the following tutorial as well. I am just not sure how to prepare the input data to train the model.
Also, I am new to machine learning, so if I have said something that sounds wrong or odd, please point it out, as I will be happy to learn the correct way.
With how your current question is formulated, you are not dealing with a typical machine learning problem. Currently, you have column-wise data:
Name Power Money
Jon Red 30
George blue 20
Tom Red 40
Bob purple 10
If a user now inputs "Jon", you know it is going to be type "Name", by a simple hash-map look up, e.g.,:
hashmap["Jon"] -> "Name"
The main reason people are saying it is not a machine-learning problem is your "categorisation" or "prediction" is being defined by your column names. Machine learning problems, instead (typically), will be predicting some response variable. For example, imagine instead you had asked this:
Name Power Money Bought_item
Jon Red 30 yes
George blue 20 no
Tom Red 40 no
Bob purple 10 yes
We could build a model to predict Bought_item using the features Name, Power, and Money using SVM.
Your problem would have to look more like:
Feature1 Feature2 Feature3 Category
1.0 foo bar Name
3.1 bar foo Name
23.4 abc def Money
22.22 afb dad Power
223.1 dad vxv Money
You then use Feature1, Feature2, and Feature3 to predict Category. At the moment your question does not give enough information for anyone to really understand what you need or what you have. You would have to reformulate it this way, or consider an unsupervised approach.
Edit:
So frame it this way:
Name Power Money Label
Jon Red 30 Foo
George blue 20 Bar
Tom Red 40 Foo
Bob purple 10 Bar
OneHotEncode Name and Power, so you now have a variable for each name that can be 0/1.
Standardise Money so that it has a range of approximately -1 to 1.
LabelEncode your labels so that they are 0,1,2,3,4,5,6 and so on.
Use a One vs. All classifier, http://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html.
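The four steps above could be sketched roughly like this with scikit-learn and pandas. The data is the toy table from the edit, the Foo/Bar labels are placeholders, and the choice of a linear SVC inside OneVsRestClassifier is an assumption for illustration, not something this answer prescribes. Note that StandardScaler gives zero mean and unit variance, which only approximately keeps values in the -1 to 1 range.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.svm import SVC

data = pd.DataFrame({
    "Name": ["Jon", "George", "Tom", "Bob"],
    "Power": ["Red", "blue", "Red", "purple"],
    "Money": [30, 20, 40, 10],
    "Label": ["Foo", "Bar", "Foo", "Bar"],
})

# LabelEncode the target so the classes become 0, 1, 2, ...
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(data["Label"])

# OneHotEncode the categorical columns and standardise the numeric one.
preprocess = ColumnTransformer([
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["Name", "Power"]),
    ("numeric", StandardScaler(), ["Money"]),
])

# One-vs-rest wrapper around an SVM (linear kernel chosen only for this sketch).
model = make_pipeline(preprocess, OneVsRestClassifier(SVC(kernel="linear")))
model.fit(data[["Name", "Power", "Money"]], y)

new_row = pd.DataFrame({"Name": ["Jon"], "Power": ["Red"], "Money": [35]})
predicted = label_encoder.inverse_transform(model.predict(new_row))
print(predicted[0])
```

handle_unknown="ignore" means a name or power that never appeared in training is encoded as all zeros instead of raising an error at prediction time.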

Machine Learning algorithm for finding drug based on diagnosis

Training Data Set:
--------------------
Patient Age: 25
Patient Weight: 60
Diagnosis one: Fever
Diagnosis two: Headache
> Medicine: **Crocin**
---------------------------------
Patient Age: 25
Patient Weight: 60
Diagnosis one: Fever
Diagnosis two: no headache
> Medicine: Paracetamol
----------------------------------
Given is a sample data set with the drug/medicine prescribed to each patient.
How can I find the right medicine based on patient info (age/weight) and diagnosis (fever/headache/etc.)?
The task you are aiming at is classification, since the target values are on a nominal scale.
Getting the vocabulary right is crucial, since the rest of the work has already been done by others, for example in the sklearn library for Python, which contains most of the relevant algorithms and plenty of data sets to test them on and learn with.
It seems you have four variables as input:
age - metric variable
weight - metric variable
Diagnosis one - nominal variable
Diagnosis two - nominal variable
You will have to encode your nominal variables. I would recommend an array covering all possible diagnoses, such as:
Fever, Headache, Stomach pain, x - [0, 0, 0, 0]
Each array element is then set to 1 if that diagnosis is present and 0 otherwise.
You therefore have a total of 2 + n input variables, where n is the number of possible diagnoses.
Then you can simply go to the sklearn library and start using the most simple classification algorithm: Nearest Neighbour Classification
If this does not yield good results (and it probably will not), you can move on to more sophisticated models (SVM, RandomForest). But first you should learn the vocabulary and use simple models to get to know the methods and the processing chain.
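The encoding described above can be sketched in a few lines of plain Python. The diagnosis vocabulary and the function name encode_patient are made up for illustration; a real vocabulary would come from your data.

```python
# Hypothetical list of all diagnoses that can occur; the order fixes the array layout.
DIAGNOSES = ["Fever", "Headache", "Stomach pain"]


def encode_patient(age, weight, diagnoses, vocabulary=DIAGNOSES):
    """Return [age, weight, d_1, ..., d_n] where d_i is 1 if diagnosis i is present."""
    present = {d.lower() for d in diagnoses}
    multi_hot = [1 if d.lower() in present else 0 for d in vocabulary]
    return [age, weight] + multi_hot


# The two training records from the question.
X = [
    encode_patient(25, 60, ["Fever", "Headache"]),
    encode_patient(25, 60, ["Fever"]),
]
y = ["Crocin", "Paracetamol"]
print(X)
```

Rows built this way have 2 + n columns and can be passed, together with y, to a scikit-learn classifier such as KNeighborsClassifier. Since age and weight are on a different scale than the 0/1 columns, scaling them first is usually worth considering for distance-based methods.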

LibSVM - Multi class classification with unbalanced data

I am experimenting with libsvm and 3D descriptors in order to perform object recognition. So far I have 7 categories of objects, and for each category I list its number of objects (and its percentage):
Category 1. 492 (14%)
Category 2. 574 (16%)
Category 3. 738 (21%)
Category 4. 164 (5%)
Category 5. 369 (10%)
Category 6. 123 (3%)
Category7. 1025 (30%)
So I have in total 3585 objects.
I have followed the practical guide of libsvm.
As a reminder:
A. Scaling the training and the testing
B. Cross validation
C. Training
D. Testing
I separated my data into training and testing.
Using 5-fold cross-validation, I was able to determine good values for C and gamma.
However, I obtained poor results (CV is about 30-40 and my accuracy is about 50%).
Then I looked at my data again and saw that it is unbalanced (categories 4 and 6, for example). I discovered that libSVM has a weight option, so I would now like to set up good weights.
So far I'm doing this:
svm-train -c cValue -g gValue -w1 1 -w2 1 -w3 1 -w4 2 -w5 1 -w6 2 -w7 1
However, the results are the same. I'm sure this is not the right way to do it, which is why I am asking for help.
I saw some topics on the subject, but they were about binary classification rather than multiclass classification.
I know that libSVM uses "one against one" (so binary classifiers), but I don't know how to handle that when I have multiple classes.
Could you please help me?
Thank you in advance for your help.
I've run into the same problem before. I also tried giving the classes different weights, which didn't work.
I recommend training on a subset of the dataset.
Try to use an approximately equal number of samples from each class. You can use all of the category 4 and 6 samples, and then pick about 150 samples from each of the other categories.
I used this method and the accuracy did improve. Hope this helps!
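A minimal sketch of that subsampling step in plain Python, assuming your samples and labels are in two parallel lists. The function name undersample and the fixed seed are illustrative choices, not part of libsvm.

```python
import random
from collections import Counter, defaultdict


def undersample(samples, labels, per_class=150, seed=0):
    """Keep at most per_class randomly chosen samples from every class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    kept_samples, kept_labels = [], []
    for label in sorted(by_class):
        group = by_class[label]
        # Small classes are kept whole; large ones are randomly cut down.
        chosen = group if len(group) <= per_class else rng.sample(group, per_class)
        kept_samples.extend(chosen)
        kept_labels.extend([label] * len(chosen))
    return kept_samples, kept_labels


# Dummy data with the class sizes of categories 1, 4 and 6 from the question.
labels = [1] * 492 + [4] * 164 + [6] * 123
samples = list(range(len(labels)))  # stand-ins for the real feature vectors
balanced_samples, balanced_labels = undersample(samples, labels, per_class=150)
print(Counter(balanced_labels))
```

Only the training split should be subsampled this way. The test split is best left at its original class proportions so that the reported accuracy reflects the real distribution.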

SVM Classifying binary data DNA

I'm working with SVM in R and I would appreciate any input you may provide.
I have a data set that I need to train an SVM on. The format of the data is the following:
ToPredict Data1 Data2 Data3 Data4 DNA
S 1 12 1 11 000000000100
B -1 17 14 3 11011110111110111
S 1 4 0 4 0000
The question that I have is regarding the DNA column.
Is an SVM able to take an input like DNA and still calculate reliable predictions?
For my data set, 0≠00 and 1≠001, so the values cannot be treated as integers. Every value represents information that needs to be processed, and the order is very important: it is a string of binary values, each either 1 or 0.
The 0101 information could be displayed as ABAB etc. (A=0, B=1)
How can I train an SVM with the data above?
Thank you.
For SVMs to work, "all" you need is a kernel function.
So what is a sensible kernel function for your "DNA strings"? You probably don't need to be able to prove it is a proper kernel; you can get away with a good similarity measure.
How would you evaluate similarity of your sequences? I cannot help you on that, because I don't know what the data means; this is up to the user (i.e. you) to specify.
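Purely as an illustration of what such a similarity measure could look like, here is a sketch of a k-mer "spectrum" kernel in Python: two strings are similar if they share many substrings of length k. Whether this is sensible for your data is an assumption only you can check. The function names and the choice of k are hypothetical.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count every substring of length k in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def spectrum_kernel(a, b, k=3):
    """Dot product of the k-mer count vectors of two strings."""
    counts_a, counts_b = kmer_counts(a, k), kmer_counts(b, k)
    return sum(count * counts_b[kmer] for kmer, count in counts_a.items())


def gram_matrix(sequences, k=3):
    """Pairwise kernel values, the form an SVM with a precomputed kernel expects."""
    return [[spectrum_kernel(a, b, k) for b in sequences] for a in sequences]


dna = ["000000000100", "11011110111110111", "0000"]
for row in gram_matrix(dna, k=2):
    print(row)
```

Strings of different lengths are handled naturally, and "0" and "00" get different representations, which matches the requirement in the question. A matrix like this can be given to an SVM implementation that accepts a precomputed kernel. The other numeric columns would need to be combined with it separately, for example by adding a second kernel computed on them.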

Resources