I have a dataset of grades in four lessons (say lessons A, B, C, and D) for 100 students, and let's imagine these grades are associated with the grade in lesson F.
I want to implement naive Bayes to forecast the grade in lesson F from those four grades, but I don't know how to shape the input for this.
I read about naive Bayes for spam mail detection, where the probability of each word is calculated.
But for grades I don't know what probability I need to calculate.
I tried the spam approach, but in this example I only have four names (one for each lesson).
To do a good classification, you need more information about the students than just the classes they are taking. Following your example, spam detection is based on words: terms that generally indicate spam (buy, promotion, money) or the origin in the HTTP headers.
To predict a student's grade, you could imagine having information about the student such as social class, whether they play a sport, gender, and so on.
Getting back to your question, it is not the names of the lessons that are interesting but the grades each student got in those lessons. You need to take the grades of the four lessons plus lesson F to train the naive Bayes classifier.
Your input might look like this:
StudentID  gradeA  gradeB  gradeC  gradeD  gradeF
1          10      9       8       5       8
2          3       5       3       8       8
3          5       3       1       1       2
4          10      10      10      5       4
After training your classifier, you will pass it a new entry for a new student, like this:
StudentID  gradeA  gradeB  gradeC  gradeD
1058       1       5       8       4
The classifier will be able to predict the grade for lesson F taking the preceding grades into consideration.
You might have noticed that I intentionally built a training dataset where gradeF is highly correlated with gradeD. That is what the Bayes classifier will try to learn, just in a more complex way.
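Here is a minimal sketch of how the training and prediction steps could look, assuming scikit-learn's GaussianNB and treating each distinct gradeF value as a class label (the numbers are the toy rows above, so this is purely illustrative):

import numpy as np
from sklearn.naive_bayes import GaussianNB

# Rows are [gradeA, gradeB, gradeC, gradeD]; targets are gradeF.
X = np.array([[10, 9, 8, 5],
              [3, 5, 3, 8],
              [5, 3, 1, 1],
              [10, 10, 10, 5]])
y = np.array([8, 8, 2, 4])  # gradeF values, used as class labels

clf = GaussianNB()
clf.fit(X, y)

# Predict gradeF for the new student from the table above.
print(clf.predict([[1, 5, 8, 4]]))

With 100 real students you would fit on all 100 rows; naive Bayes then picks the gradeF class that maximizes the product of the per-feature likelihoods.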
Let us say I implemented a random forest algorithm with 20 trees using 20 random subsets of the training data,
and there are 4 different class labels that can be predicted.
So, what exactly should be called a majority verdict?
If there are 20 trees in total, should a majority verdict require that the highest-voted class label has at least 10 votes, or does it simply need to be higher than the other labels?
Example:
Total trees = 20, class labels are {A, B, C, D}
Scenario 1:
A = 10 votes
B = 4 votes
C = 3 votes
D = 3 votes
Clearly, A is the winner here.
Scenario 2:
A = 6 votes
B = 5 votes
C = 5 votes
D = 4 votes
Can A be called the winner here?
If you are making a hard decision, meaning you are asked to return the best single guess, then yes, A is the winner.
To capture the difference between these two cases, you can consider a soft-decision system instead, where you return the winner together with a confidence value. An example confidence in this case is A's share of the votes. Then the first case would be a more confident estimate than the latter.
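As a sketch of that hard/soft distinction (plain Python, using the winner's vote share as the confidence value; both calls correspond to the scenarios above):

from collections import Counter

def vote(predictions):
    # Return (winner, confidence), where confidence is the winner's vote share.
    counts = Counter(predictions)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(predictions)

print(vote(['A'] * 10 + ['B'] * 4 + ['C'] * 3 + ['D'] * 3))  # ('A', 0.5)
print(vote(['A'] * 6 + ['B'] * 5 + ['C'] * 5 + ['D'] * 4))   # ('A', 0.3)

Both calls return A as the hard decision, but the confidence makes the first scenario visibly stronger.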
I am working on a dataset with more than 100,000 records.
This is what the data looks like:
email_id  cust_id  campaign_name
123       4567     World of Zoro
123       4567     Boho XYZ
123       4567     Guess ABC
234       5678     Anniversary X
234       5678     World of Zoro
234       5678     Fathers day
234       5678     Mothers day
345       7890     Clearance event
345       7890     Fathers day
345       7890     Mothers day
345       7890     Boho XYZ
345       7890     Guess ABC
345       7890     Sale
I am trying to understand the campaign sequences and predict the next possible campaign for each customer.
Assume I have processed my data and stored it in 'camp'.
With Word2Vec-
from gensim.models import Word2Vec
model = Word2Vec(sentences=camp, size=100, window=4, min_count=5, workers=4, sg=0)
The problem with this model is that it accepts individual word tokens and, when asked for similarities, returns tokens with similarity scores.
Word2Vec accepts input of this form:
['World','of','Zoro','Boho','XYZ','Guess','ABC','Anniversary','X'...]
and gives output of this form:
model.wv.most_similar('Zoro')
[('Guess', 0.98), ('XYZ', 0.97)]
Since I want to predict the campaign sequence, I was wondering if there is any way I can give the input below to the model and get campaign names in the output.
My input would be:
[['World of Zoro', 'Boho XYZ', 'Guess ABC'],
 ['Anniversary X', 'World of Zoro', 'Fathers day', 'Mothers day'],
 ['Clearance event', 'Fathers day', 'Mothers day', 'Boho XYZ', 'Guess ABC', 'Sale']]
Output:
model.wv.most_similar('World of Zoro')
[('Sale', 0.98), ('Mothers day', 0.97)]
I am also not sure whether Word2Vec, or any similar algorithm, has functionality that can help predict campaigns for individual users.
Thank you for your help.
I don't believe that word2vec is the right approach to model your problem.
Word2vec uses two possible approaches: skip-gram (given a target word, predict its surrounding words) or CBOW (given the surrounding words, predict the target word). Your case is similar to the CBOW setting, but there is no reason why the phenomenon you want to model should respect the linguistic "rules" for which word2vec was developed.
word2vec tends to predict the word that occurs most frequently in combination with the target one within the moving window (in your code: window=4). So it won't predict the best possible next choice, but the one that occurred most often within the window span of the given word.
In your call to Word2Vec (Word2Vec(sentences=camp, size=100, window=4, min_count=5, workers=4, sg=0)) you are also using min_count=5, so the model ignores words that occur fewer than 5 times. Depending on your dataset size, this could lose relevant information.
I suggest taking a look at forecasting techniques and time-series analysis methods. I have the feeling that you will obtain better predictions using these techniques rather than word2vec. (https://otexts.org/fpp2/index.html)
I hope it helps.
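To make the forecasting suggestion concrete, here is a hedged sketch of a first-order Markov model over the campaign sequences from the question (the camp structure is assumed to be the nested list shown above; a real dataset would need smoothing and explicit tie-breaking rules):

from collections import Counter, defaultdict

# Per-customer campaign sequences, as in the question.
camp = [['World of Zoro', 'Boho XYZ', 'Guess ABC'],
        ['Anniversary X', 'World of Zoro', 'Fathers day', 'Mothers day'],
        ['Clearance event', 'Fathers day', 'Mothers day', 'Boho XYZ',
         'Guess ABC', 'Sale']]

# Count how often each campaign is immediately followed by each other campaign.
transitions = defaultdict(Counter)
for seq in camp:
    for current, nxt in zip(seq, seq[1:]):
        transitions[current][nxt] += 1

def predict_next(campaign):
    # Most frequent next campaign, or None if the campaign was never seen.
    counts = transitions[campaign]
    return counts.most_common(1)[0][0] if counts else None

print(predict_next('World of Zoro'))  # 'Boho XYZ' (tied here with 'Fathers day')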
I tried to play with libsvm and 3D descriptors in order to perform object recognition. So far I have 7 categories of objects, and for each category I have its number of objects (and its percentage):
Category 1:  492 (14%)
Category 2:  574 (16%)
Category 3:  738 (21%)
Category 4:  164 (5%)
Category 5:  369 (10%)
Category 6:  123 (3%)
Category 7: 1025 (30%)
So I have in total 3585 objects.
I have followed the practical guide of libsvm. As a reminder:
A. Scaling the training and the testing data
B. Cross-validation
C. Training
D. Testing
I separated my data into training and testing sets.
By doing 5-fold cross-validation, I was able to determine good values for C and gamma.
However, I obtained poor results (the CV accuracy is about 30-40% and my test accuracy is about 50%).
Then I thought about my data and saw that it is somewhat unbalanced (categories 4 and 6, for example). I discovered that libSVM has an option for class weights. That's why I would now like to set up good weights.
So far I'm doing this:
svm-train -c cValue -g gValue -w1 1 -w2 1 -w3 1 -w4 2 -w5 1 -w6 2 -w7 1
However, the results are the same. I'm sure this is not the right way to do it, which is why I'm asking for your help.
I saw some topics on the subject, but they were related to binary classification and not multiclass classification.
I know that libSVM does "one against one" (so binary classifiers internally), but I don't know how to handle the weights when I have multiple classes.
Could you please help me?
Thank you in advance for your help.
I've run into the same problem before. I also tried giving the classes different weights, which didn't work.
I recommend training with a subset of the dataset.
Try to use approximately equal numbers of samples from the different classes. You can use all of the category 4 and 6 samples, and then pick about 150 samples from each of the other categories.
I used this method and the accuracy did improve. Hope this helps!
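A hedged sketch of that undersampling step (assuming features X and labels y as NumPy arrays; for libSVM proper you would then write the reduced set back out in its text format):

import numpy as np

def undersample(X, y, cap=150, seed=0):
    # Keep at most `cap` randomly chosen samples per class.
    rng = np.random.default_rng(seed)
    keep = []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        if len(idx) > cap:
            idx = rng.choice(idx, size=cap, replace=False)
        keep.extend(idx)
    keep = np.array(keep)
    return X[keep], y[keep]

With cap=150, categories 4 and 6 are kept nearly whole while the larger categories are cut down to a comparable size.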
Let's say I have a set of training examples where each A_i is an attribute and the outcome is Iris-setosa.
The values in the dataset are:
A1  A2  A3  A4  outcome
3   5   2   2   Iris-setosa
3   4   2   2   Iris-setosa
2   4   2   2   Iris-setosa
3   6   2   2   Iris-setosa
2   5   3   2   Iris-setosa
3   5   2   2   Iris-setosa
3   5   2   3   Iris-setosa
4   6   2   2   Iris-setosa
3   7   2   2   Iris-setosa
From analysis, the ranges of the attributes are:
A1 ----> [2,3,4]
A2 ----> [4,5,6,7]
A3 ----> [2,3]
A4 ----> [2,3]
I have defined:
A1 ----> [Low(2),Medium(3),High(4)]
A2 ----> [Low(4,5),Medium(6),High(7)]
A3 ----> [Low(<2),Medium(2),High(3)]
A4 ----> [Low(<2),Medium(2),High(3)]
I have discretized the set as below:
A1      A2      A3      A4      outcome
Medium  Low     Medium  Medium  Iris-setosa
Medium  Low     Medium  Medium  Iris-setosa
Low     Low     Medium  Medium  Iris-setosa
Medium  Medium  Medium  Medium  Iris-setosa
Low     Low     High    Medium  Iris-setosa
Medium  Low     Medium  Medium  Iris-setosa
Medium  Low     Medium  High    Iris-setosa
High    Medium  Medium  Medium  Iris-setosa
Medium  High    Medium  Medium  Iris-setosa
I know I have to define the fitness function. What is it for this problem? In my actual problem there are 50 training examples, but this is a similar problem.
How can I optimize a rule by using a GA? How can I encode it?
Suppose I input (4,7,2,3); how can the optimization help me classify whether the input is Iris-setosa or not?
Thank you for your patience.
The task you describe is known as one-class classification.
Identifying elements of a specific class amongst all elements, by learning from a training set containing only the objects of that class is
... different from and more difficult than the traditional classification problem, which tries to distinguish between two or more classes with the training set containing objects from all the classes.
A viable approach is to build the outlier-class data artificially and train a two-class model, but it can be tricky:
when generating artificial outlier data, you need a wider range of possible values than the target data (you have to ensure that the target data is surrounded in all attribute directions);
the resulting two-class training dataset also tends to be unbalanced and large.
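A hedged sketch of that artificial-outlier generation (NumPy only; the margin factor is an arbitrary choice, and in practice you may also want to reject samples that land inside the target region):

import numpy as np

def make_outliers(target, n, margin=0.5, seed=0):
    # Sample uniform 'outlier' points over a box wider than the target data,
    # so the target class is surrounded in every attribute direction.
    rng = np.random.default_rng(seed)
    lo, hi = target.min(axis=0), target.max(axis=0)
    span = hi - lo
    return rng.uniform(lo - margin * span, hi + margin * span,
                       size=(n, target.shape[1]))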
Anyway:
if you want to try Genetic Programming for one-class classification, take a look at
One-Class Genetic Programming - Robert Curry, Malcolm I. Heywood (presented at EuroGP'10, the 13th European Conference on Genetic Programming);
also consider anomaly detection techniques (a simple introduction is the 9th week of the Coursera Machine Learning class by Andrew Ng).
Okay, if you just want to know how to program a fitness function... Assume the training data is a list of tuples:
training_data = [(3, 6, 3, 5), (8, 3, 1, 2), (3, 5, 2, 4)]  # ...
Make a reference set for the values of A1, A2, etc. as follows, using the first tuple to determine the length of all the others (that way you can have any number of tuples in your training data):
A = []
for index in range(len(training_data[0])):
    # Collect every value seen at this attribute position.
    A.append(set(row[index] for row in training_data))
Now all your reference data is easy to refer to (sets A[0], A[1], etc.). Let's make a fitness function that takes a tuple and returns a fitness score that will help a GA converge on a right answer (1-4 points if the elements are in range, 5 more if the tuple is in training_data). Play around with the scoring, but this should work fine.
def fitness_function(target):
    # Assume target is a tuple of the same length as the reference data.
    global A, training_data
    score = 0
    # Give a point for each element that is in the reference set
    # for its attribute position.
    for index, t in enumerate(target):
        if t in A[index]:
            score += 1
    # Give 5 extra points if the entire tuple is an exact match.
    if target in training_data:
        score += 5
    return score
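For example, with just the three training tuples shown above:

print(fitness_function((3, 6, 3, 5)))  # 9: every attribute in range, plus the exact-match bonus
print(fitness_function((3, 6, 3, 9)))  # 3: only the first three attributes are in the reference sets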
What you have here is a multi-class classification problem that can be solved with Genetic Programming and related techniques.
I suppose that the data are from the well-known Iris dataset: https://en.wikipedia.org/wiki/Iris_flower_data_set
If you need a quick start, you can use the source code of my method, Multi Expression Programming (which is based on Genetic Programming); it can be downloaded from here: https://github.com/mepx/mep-basic-src
There is a C++ source file named mep_multi_class.cpp in the src folder which can "solve" the Iris dataset. Just call the read_training_data function with the iris.txt file (which can also be downloaded from the dataset folder on GitHub).
Or, if you are not familiar with C++, you can try the MEPX software directly, which has a simple user interface: http://www.mepx.org. A project with the Iris dataset can also be downloaded from GitHub.
I am studying for my Machine Learning (ML) class and I have a question that I couldn't find an answer to with my current knowledge. Assume that I have the following dataset:
att1  att2  att3  class
5     6     10    a
2     1     5     b
47    8     4     c
4     9     8     a
4     5     6     b
The above dataset is clear, and I think I can apply classification algorithms to new incoming data after I train on it. Since each instance carries a label, it is easy to understand which class it belongs to. Now, my question is: what if we had a class consisting of several instances together, as in gesture recognition data? Each class would have multiple instances that jointly specify it. For example:
xcor  ycord  depth
45    100    10
50    20     45
10    51     12
The above three instances belong to class A, and the three instances below belong to class B as a group; I mean those three data instances constitute the class together. For gesture data, they are the coordinates of the movement of your hand.
xcor  ycord  depth
45    100    10
50    20     45
10    51     12
Now, I want every incoming group of three instances to be classified together as either A or B. Is it possible to label all of them as A or B without labeling each instance independently? As an example, assume that the following group belongs to B; I want all of its instances to be labelled together as B, not individually according to each instance's own similarity to class A or B. If this is possible, what is it called?
xcor  ycord  depth
45    10     10
5     20     87
10    51     44
I don't see a scenario where you would want to group an indeterminate number of rows of your dataset as features of a given class. Either the rows are independently associated with a class, or they are all features of one and the same row. Instead of:
xcor  ycord  depth
45    10     10
5     20     87
10    51     44
you would have something like:
xcor1  ycord1  depth1  xcor2  ycord2  depth2  xcor3  ycord3  depth3
45     10      10      5      20      87      10     51      44
This is pretty much the same approach that is used to model time series.
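A small sketch of that flattening step (assuming each gesture arrives as three (xcor, ycord, depth) rows in a NumPy array):

import numpy as np

# One gesture: three rows of (xcor, ycord, depth), as in the tables above.
gesture = np.array([[45, 10, 10],
                    [ 5, 20, 87],
                    [10, 51, 44]])

# Flatten into a single feature row: (xcor1, ycord1, depth1, xcor2, ...).
row = gesture.reshape(-1)
print(row)  # [45 10 10  5 20 87 10 51 44]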
It seems you may be confused about the different types of machine learning.
The dataset given in your class is an example of a supervised classification problem. That is, given some data and some classes, the goal is to learn a classifier that can predict classes for new, unseen data. Classifiers that you can apply to this problem include:
decision trees,
support vector machines,
artificial neural networks, etc.
The second problem you are describing is an example of an unsupervised classification problem. That is, given some data without labels, we want to find an automatic way to separate the different types of data (your A and B) algorithmically. Algorithms that solve this problem include:
K-means clustering,
mixture models,
principal component analysis followed by some sort of clustering.
I would look into running a factor analysis or normalizing your data, then running K-means or a Gaussian mixture model. This should discover the A and B types in your data if they are distinguishable.
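A hedged sketch of that pipeline with scikit-learn (normalizing, then clustering; the two-cluster choice assumes exactly the A/B split described, and the rows are toy flattened gestures):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Each row is one flattened gesture (see the previous answer); toy values only.
X = np.array([[45, 100, 10, 50, 20, 45, 10, 51, 12],
              [45,  10, 10,  5, 20, 87, 10, 51, 44]])

X_norm = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_norm)
print(labels)  # e.g. [0 1]: the discovered A/B grouping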
Take a peek at the use of neural networks for recognizing handwritten text. You can think of a gesture as a handwritten figure with an additional time component (so, give each pixel an "age"). If your training data also includes similar time data, then I think the technique should carry over well.