Suppose I have a data set with two classes and more than 50,000 features. Most of the work I have found selects the features that best distinguish the two classes; those selected features are called the most important features. But what I want to know is which features are most relevant to which class, and those approaches don't tell me that. For example,
          f1    f2    f3    ...   f50000   class
sample 1: .5    .4    23    ...   .45      1
sample 2: .2    .56   .5    ...   .45      2
sample 3: .4    56    .23   ...   .45      2
sample 4: .3    .45   76    ...   .45      1
Here, f1 = feature 1, f2 = feature 2, etc.
Suppose that somehow I know f1, f2, f3, f45, and f344 are related to class 1, and f4, f5, f6, f90, and f99 are related to class 2. The other features are not related to either class. So the output would be:
class 1: f1, f2, f3, f45, f344
class 2: f4, f5, f6, f90, f99
What algorithms can do this?
It would be very helpful if anyone could point me to papers (deep learning or otherwise) or other references. Thanks in advance.
There are many ways to measure the significance of features. A simple approach is to drop features with low variance. Have a look at this scikit-learn article if you'd like to use their implementation.
Another common way is to penalize the number of features used with L1/L2 regularization, which discourages the model from relying on all weights. An implementation is in the same scikit-learn article. I also found this GitHub post, which gives a short, clear explanation of L2 regularization combined with logistic regression.
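If it helps, here is a minimal scikit-learn sketch of both ideas, and of how the sign of an L1-regularized logistic-regression coefficient hints at which class a surviving feature is associated with. X, y, the variance threshold and the C value are placeholders for your own data and tuning:

import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import LogisticRegression

# Placeholder data: X is (n_samples, n_features), y holds labels in {1, 2}
X = np.random.rand(200, 5000)
y = np.random.choice([1, 2], size=200)

# 1) Drop near-constant features
selector = VarianceThreshold(threshold=1e-3)
X_reduced = selector.fit_transform(X)
kept = np.where(selector.get_support())[0]   # original indices of the surviving features

# 2) L1-regularized logistic regression keeps only a sparse set of non-zero weights
clf = LogisticRegression(penalty='l1', solver='liblinear', C=0.1)
clf.fit(X_reduced, y)

coef = clf.coef_[0]
# A positive coefficient pushes the prediction toward clf.classes_[1],
# a negative one toward clf.classes_[0]; L1 zeroes out the rest.
features_for_class_2 = kept[coef > 0]
features_for_class_1 = kept[coef < 0]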
While using Keras, particularly for a U-Net, I am only aware of specifying the model parameters in the following manner:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=[mean_iou])
Now I can set the loss to whatever I define it to be. However, this loss function is applied evenly to all classes. How do I make it so that mis-predictions for certain classes are weighted more heavily than others?
For example, let's say I have the following classes in each image.
Classes A, B, and C. Classes A and B each account for about 45% of the image, while class C accounts for only about 10%. However, I care much more about predicting class C well.
In this situation, the loss function doesn't do such a good job, since the class imbalance swamps the loss contribution of class C. Hence, I would like to find a way to weight the loss of one class more heavily than the others.
I am also open to other suggestions for solving this problem, for instance using two separate networks.
EDIT: Here is a follow-up to this question that is required to implement the answer accepted on this post.
You can assign a weight to each class manually. For example:
class_weight = {0: 0.2, 1: 0.3, 2: 0.25, 3: 0.25}
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=[mean_iou])
model.fit(X_train, Y_train, class_weight=class_weight)  # class_weight is a fit() argument, not a compile() one
or you can use this scikit-learn library function.
There are also many examples on the web; did none of them work for you?
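For a segmentation network like a U-Net, where the target is a per-pixel one-hot mask, the class_weight argument may not be enough on its own; a common alternative is to bake the class weights directly into the loss. Below is a minimal sketch of a weighted categorical cross-entropy, assuming softmax outputs and one-hot masks and reusing the model and mean_iou from the question; the weight values are placeholders:

import numpy as np
import tensorflow.keras.backend as K

def weighted_categorical_crossentropy(class_weights):
    # Categorical cross-entropy where class i's contribution is scaled by class_weights[i]
    w = K.constant(class_weights)
    def loss(y_true, y_pred):
        y_pred = K.clip(y_pred, K.epsilon(), 1.0 - K.epsilon())   # avoid log(0)
        return -K.sum(w * y_true * K.log(y_pred), axis=-1)        # per-pixel weighted cross-entropy
    return loss

# e.g. make the rare class C (index 2) count five times as much as A and B
weights = np.array([1.0, 1.0, 5.0])

# `model` and `mean_iou` are the ones from the question
model.compile(optimizer='adam',
              loss=weighted_categorical_crossentropy(weights),
              metrics=[mean_iou])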
I have been playing with libsvm and 3D descriptors in order to perform object recognition. So far I have 7 categories of objects, and for each category I have its number of objects (and its percentage):
Category 1: 492 (14%)
Category 2: 574 (16%)
Category 3: 738 (21%)
Category 4: 164 (5%)
Category 5: 369 (10%)
Category 6: 123 (3%)
Category 7: 1025 (30%)
So I have 3585 objects in total.
I have followed the practical guide of libsvm.
As a reminder:
A. Scaling the training and the testing
B. Cross validation
C. Training
D. Testing
I separated my data into training and testing.
By doing 5-fold cross-validation, I was able to determine good values for C and gamma.
However, I obtained poor results (cross-validation accuracy is about 30-40% and my test accuracy is about 50%).
Then I thought about my data and realized that some of it is unbalanced (categories 4 and 6, for example). I discovered that libSVM has a weight option, which is why I would now like to set up good weights.
So far I'm doing this :
svm-train -c cValue -g gValue -w1 1 -w2 1 -w3 1 -w4 2 -w5 1 -w6 2 -w7 1
However, the results are the same. I'm sure this is not the right way to do it, and that's why I'm asking for your help.
I saw some topics on the subject, but they dealt with binary classification, not multiclass classification.
I know that libSVM does "one against one" (so binary classifiers), but I don't know how to handle that when I have multiple classes.
Could you please help me ?
Thank you in advance for your help.
I've run into the same problem before. I also tried giving the classes different weights, which didn't work.
I recommend training on a subset of the dataset.
Try to use approximately equal numbers of samples from each class. You can use all the category 4 and 6 samples, and then pick about 150 samples from each of the other categories, as in the sketch below.
I used this method and the accuracy did improve. Hope this helps!
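A minimal Python sketch of that balanced subsampling (features, labels and the 150-sample cap are placeholders; classes already below the cap, such as categories 4 and 6, are kept in full):

import numpy as np

rng = np.random.default_rng(0)

def balanced_subset(features, labels, cap=150):
    # Keep at most `cap` randomly chosen samples per class
    keep = []
    for cls in np.unique(labels):
        idx = np.where(labels == cls)[0]
        if len(idx) > cap:
            idx = rng.choice(idx, size=cap, replace=False)   # downsample the large classes
        keep.append(idx)                                     # small classes stay as they are
    keep = np.concatenate(keep)
    return features[keep], labels[keep]

# features: (n_samples, n_descriptors), labels: (n_samples,) with values 1..7 -- placeholders
# X_balanced, y_balanced = balanced_subset(features, labels)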
I've been working a bit with neural networks and I'm interested on implementing a spiking neuron model.
I've read a fair number of tutorials, but most of them seem to be about generating pulses, and I haven't found any that apply the model to a given input train.
Say for example I got input train:
Input[0] = [0,0,0,1,0,0,1,1]
When it enters the Izhikevich neuron, is the input multiplied by a weight, or does the model only make use of the parameters a, b, c and d?
The Izhikevich equations (in a simple Euler-discretized form, with the after-spike reset) are:
v[n+1] = v[n] + dt*(0.04*v[n]^2 + 5*v[n] + 140 - u[n] + I)
u[n+1] = u[n] + dt*a*(b*v[n] - u[n])
if v[n+1] >= 30 mV: v[n+1] = c, u[n+1] = u[n+1] + d
where v[n] is the membrane potential and u[n] is a membrane recovery variable.
Are there any texts on implementing Izhikevich or similar spiking neuron models for a practical problem? I'm trying to understand how information is encoded in these models, but it looks different from what's done with standard second-generation neurons. The only tutorial I've found that deals with a spike train and a set of weights is [1], but I haven't seen the same for the Izhikevich model.
[1] https://msdn.microsoft.com/en-us/magazine/mt422587.aspx
The plain Izhikevich model by itself does not include weights.
The two equations you mentioned model the membrane potential (v) of a point neuron over time. To use weights, you could connect two or more such cells with synapses.
Each synapse could include some sort of spike-detection mechanism on the source (pre-synaptic) cell and a synaptic-current mechanism on the target (post-synaptic) cell. That synaptic current could then be multiplied by a weight term and become part of the I term (in the first equation above) for the target cell.
As a very simple example of a two-cell network: at every time step, check whether the pre-synaptic cell's v is above (say) 0 mV. If so, inject (say) 0.01 pA * weightPrePost into the post-synaptic cell. weightPrePost would range from 0 to 1 and could be modified in response to things like firing rate, or Hebbian-like spike synchrony as in STDP.
With multiple synaptic currents arriving at a cell, you could devise various schemes for summing them. The simplest is a plain sum; more complicated ones could account for things like distance and dendrite diameters (e.g. a simulated neural morphology).
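A minimal Python sketch of that two-cell scheme, assuming Euler integration and regular-spiking parameters (a=0.02, b=0.2, c=-65, d=8); the drive current, synaptic current amplitude and weight value are arbitrary placeholders:

a, b, c, d = 0.02, 0.2, -65.0, 8.0      # regular-spiking Izhikevich parameters
dt = 0.5                                 # ms, Euler time step
weight_pre_post = 0.5                    # synaptic weight in [0, 1]

def izh_step(v, u, I):
    # One Euler step of the Izhikevich model, including the after-spike reset
    v_new = v + dt * (0.04 * v**2 + 5 * v + 140 - u + I)
    u_new = u + dt * a * (b * v - u)
    if v_new >= 30.0:                    # spike detected: reset
        v_new, u_new = c, u_new + d
    return v_new, u_new

v_pre, u_pre = c, b * c                  # pre-synaptic cell state
v_post, u_post = c, b * c                # post-synaptic cell state

for n in range(2000):
    i_drive = 10.0 if n > 100 else 0.0                        # external drive into the pre- cell
    i_syn = weight_pre_post * 15.0 if v_pre > 0.0 else 0.0    # weighted synaptic current into the post- cell
    v_pre, u_pre = izh_step(v_pre, u_pre, i_drive)
    v_post, u_post = izh_step(v_post, u_post, i_syn)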
This chapter is a nice introduction to other ways to model synapses: Modelling Synaptic Transmission.
Let's say I have a set of training examples where A_i is an attribute and the outcome is Iris-setosa.
The values in the data set are:
A1  A2  A3  A4  outcome
3   5   2   2   Iris-setosa
3   4   2   2   Iris-setosa
2   4   2   2   Iris-setosa
3   6   2   2   Iris-setosa
2   5   3   2   Iris-setosa
3   5   2   2   Iris-setosa
3   5   2   3   Iris-setosa
4   6   2   2   Iris-setosa
3   7   2   2   Iris-setosa
From analysis, the ranges of the attributes are:
A1 ----> [2,3,4]
A2 ----> [4,5,6,7]
A3 ----> [2,3]
A4 ----> [2,3]
I have defined:
A1 ----> [Low(2),Medium(3),High(4)]
A2 ----> [Low(4,5),Medium(6),High(7)]
A3 ----> [Low(<2),Medium(2),High(3)]
A4 ----> [Low(<2),Medium(2),High(3)]
I have discretized the data set as follows:
A1      A2      A3      A4      outcome
Medium  Low     Medium  Medium  Iris-setosa
Medium  Low     Medium  Medium  Iris-setosa
Low     Low     Medium  Medium  Iris-setosa
Medium  Medium  Medium  Medium  Iris-setosa
Low     Low     High    Medium  Iris-setosa
Medium  Low     Medium  Medium  Iris-setosa
Medium  Low     Medium  High    Iris-setosa
High    Medium  Medium  Medium  Iris-setosa
Medium  High    Medium  Medium  Iris-setosa
I know I have to define the fitness function. What should it be for this problem? My actual problem has 50 training examples, but it is similar to this one.
How can I optimize the rules using a GA? How should I encode them?
Suppose I input (4,7,2,3); how can the optimization help me classify whether the input is Iris-setosa or not?
Thank you for your patience.
The task you describe is known as one-class classification.
Identifying elements of a specific class amongst all elements, by learning from a training set containing only the objects of that class, is different from and more difficult than the traditional classification problem, which tries to distinguish between two or more classes with the training set containing objects from all the classes.
A viable approach is to build the outlier-class data artificially and train a two-class model, but it can be tricky.
When generating artificial outlier data you need a wider range of possible values than the target data (you have to ensure that the target data is surrounded in all attribute directions).
The resulting two-class training data set also tends to be unbalanced and large.
Anyway:
if you want to try Genetic Programming for one-class classification, take a look at
One-Class Genetic Programming by Robert Curry and Malcolm I. Heywood (presented at EuroGP 2010, the 13th European Conference on Genetic Programming);
also consider anomaly detection techniques (a simple introduction is week 9 of the Coursera Machine Learning class by Andrew Ng; notes here).
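As a non-GP baseline for the same one-class setup, a scikit-learn one-class model is easy to try; here is a minimal sketch with OneClassSVM, using the example rows from the question as the single-class training data (the nu value and the test points are arbitrary):

import numpy as np
from sklearn.svm import OneClassSVM

# Training examples of the single known class (the rows from the question)
X_setosa = np.array([
    [3, 5, 2, 2], [3, 4, 2, 2], [2, 4, 2, 2],
    [3, 6, 2, 2], [2, 5, 3, 2], [3, 5, 2, 2],
    [3, 5, 2, 3], [4, 6, 2, 2], [3, 7, 2, 2],
])

# nu roughly bounds the fraction of training points treated as outliers
clf = OneClassSVM(nu=0.1, kernel='rbf', gamma='scale')
clf.fit(X_setosa)

# +1 means "looks like the training class", -1 means "outlier"
print(clf.predict([[4, 7, 2, 3], [9, 9, 9, 9]]))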
Okay, if you just want to know how to program a fitness function... Assume the training data is a list of tuples, like so:
training_data = [(3, 6, 3, 5), (8, 3, 1, 2), (3, 5, 2, 4)]  # etc.
Make a reference set for the possible values of A1, A2, etc. as follows, using the first tuple to determine the length of all the others (that way you can have any number of tuples in your training data):
A = []
for index in range(len(training_data[0])):
    values = set(sample[index] for sample in training_data)
    A.append(values)
Now all your reference data is easy to refer to (the sets A[0], A[1], etc.). Let's make a fitness function that takes a tuple and returns a fitness score that will help a GA converge on a right answer (1-4 if it has the right elements, 5+ if it is in training_data). Play around with the scoring, but this should work fine.
def fitness_function(target):
    # Assume target is a tuple of the same length as the reference data
    global A, training_data
    score = 0
    # Give a point for each element that appears in the corresponding reference set
    for index, t in enumerate(target):
        if t in A[index]:
            score += 1
    # Give 5 extra points if the entire tuple is an exact match
    if target in training_data:
        score += 5
    return score
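A quick check of how the scoring behaves, using the example training_data above (the candidate tuples are arbitrary):

print(fitness_function((3, 5, 2, 4)))   # exact match: 4 element points + 5 bonus = 9
print(fitness_function((3, 3, 2, 2)))   # every element seen somewhere, but no exact match: 4
print(fitness_function((9, 9, 9, 9)))   # nothing matches: 0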
What you have here is a multi-class classification problem that can be solved with Genetic Programming and related techniques.
I suppose the data come from the well-known Iris data set: https://en.wikipedia.org/wiki/Iris_flower_data_set
If you need a quick start, you can use the source code of my method, Multi Expression Programming (which is based on Genetic Programming); it can be downloaded from here: https://github.com/mepx/mep-basic-src
There is a C++ source file named mep_multi_class.cpp in the src folder which can "solve" the Iris data set. Just call the read_training_data function with the iris.txt file (which can also be downloaded from the dataset folder on GitHub).
Or, if you are not familiar with C++, you can try the MEPX software directly, which has a simple user interface: http://www.mepx.org. A project with the Iris data set can also be downloaded from GitHub.
I want to classify documents (composed of words) into 3 classes (Positive, Negative, Unknown/Neutral). A subset of the document words becomes the features.
Until now, I have programmed a Naive Bayes classifier using information gain and the chi-square statistic as feature selectors. Now I would like to see what happens if I use the odds ratio as a feature selector.
My problem is that I don't know how to implement the odds ratio. Should I:
1) Calculate the odds ratio for every word w and every class:
E.g. for w:
Prob of word as positive Pw,p = #positive docs with w/#docs
Prob of word as negative Pw,n = #negative docs with w/#docs
Prob of word as unknown Pw,u = #unknown docs with w/#docs
OR(Wi,P) = log( Pw,p*(1-Pw,p) / (Pw,n + Pw,u)*(1-(Pw,n + Pw,u)) )
OR(Wi,N) ...
OR(Wi,U) ...
2) How should I decide whether or not to choose the word as a feature?
Thanks in advance...
Since it took me a while to independently wrap my head around all this, let me explain my findings here for the benefit of humanity.
Using the (log) odds ratio is a standard technique for filtering features prior to text classification. It is a 'one-sided metric' [Zheng et al., 2004] in the sense that it only discovers features which are positively correlated with a particular class. As a log-odds-ratio for the probability of seeing a feature 't' given the class 'c', it is defined as:
LOR(t,c) = log( [Pr(t|c) / (1 - Pr(t|c))] / [Pr(t|!c) / (1 - Pr(t|!c))] )
         = log( [Pr(t|c) * (1 - Pr(t|!c))] / [Pr(t|!c) * (1 - Pr(t|c))] )
Here I use '!c' to mean a document where the class is not c.
But how do you actually calculate Pr(t|c) and Pr(t|!c)?
One subtlety to note is that feature selection probabilities, in general, are usually defined over a document event model [McCallum & Nigam 1998, Manning et al. 2008], i.e., Pr(t|c) is the probability of seeing term t one or more times in the document given the class of the document is c (in other words, the presence of t given the class c). The maximum likelihood estimate (MLE) of this probability would be the proportion of documents of class c that contain t at least once. [Technically, this is known as a Multivariate Bernoulli event model, and is distinct from a Multinomial event model over words, which would calculate Pr(t|c) using integer word counts - see the McCallum paper or the Manning IR textbook for more details, specifically on how this applies to a Naive Bayes text classifier.]
One key to using LOR effectively is to smooth these conditional probability estimates, since, as #yura noted, rare events are problematic here (e.g., the MLE of Pr(t|!c) could be zero, leading to an infinite LOR). But how do we smooth?
In the literature, Forman reports smoothing the LOR by "adding one to any zero count in the denominator" (Forman, 2003), while Zheng et al (2004) use "ELE [Expected Likelihood Estimation] smoothing" which usually amounts to adding 0.5 to each count.
To smooth in a way that is consistent with probability theory, I follow standard practice in text classification with a Multivariate Bernoulli event model. Essentially, we assume that we have seen each presence count AND each absence count B extra times. So our estimate for Pr(t|c) can be written in terms of #(t,c), the number of documents of class c that contain t, and #(!t,c), the number of documents of class c that do not contain t, as follows:
Pr(t|c) = [#(t,c) + B] / [#(t,c) + #(!t,c) + 2B]
        = [#(t,c) + B] / [#(c) + 2B]
If B = 0, we have the MLE. If B = 0.5, we have ELE. If B = 1, we have the Laplacian prior. Note that this looks different from the smoothing for the Multinomial event model, where the Laplacian prior leads you to add |V| to the denominator [McCallum & Nigam, 1998].
You can choose 0.5 or 1 as your smoothing value, depending on which prior work most inspires you, and plug this into the equation for LOR(t,c) above, and score all the features.
Typically, you then decide on how many features you want to use, say N, and then choose the N highest-ranked features based on the score.
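Putting the smoothed estimate and the ranking step together, a minimal numpy sketch might look like this (X is a hypothetical binary document-term matrix, y the class labels, and B=0.5 corresponds to ELE smoothing):

import numpy as np

def log_odds_ratio_scores(X, y, target_class, B=0.5):
    # Smoothed log odds ratio of each term for `target_class`.
    # X: (n_docs, n_terms) 0/1 matrix of term presence; y: (n_docs,) class labels.
    in_c = (y == target_class)
    n_c, n_not_c = in_c.sum(), (~in_c).sum()
    t_c = X[in_c].sum(axis=0)         # documents of class c containing each term
    t_not_c = X[~in_c].sum(axis=0)    # documents of the other classes containing each term
    p_t_c = (t_c + B) / (n_c + 2 * B)
    p_t_not_c = (t_not_c + B) / (n_not_c + 2 * B)
    return np.log(p_t_c * (1 - p_t_not_c) / (p_t_not_c * (1 - p_t_c)))

# keep the N highest-scoring terms for the class of interest, e.g.:
# scores = log_odds_ratio_scores(X, y, target_class="positive")
# top_n = np.argsort(scores)[::-1][:N]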
In a multi-class setting, people have often used one-vs-all classifiers, doing feature selection independently for each classifier, and thus for each positive class, with the one-sided metrics (Forman, 2003). However, if you want to find a unique reduced set of features that works in a multiclass setting, there are more advanced approaches in the literature (e.g. Chapelle & Keerthi, 2008).
References:
Zheng, Wu, Srihari, 2004
McCallum & Nigam 1998
Manning, Raghavan & Schütze, 2008
Forman, 2003
Chapelle & Keerthi, 2008
The odds ratio is not a good measure for feature selection, because it only shows what happens when a feature is present and says nothing about when it is absent. So it does not work for rare features, and since almost all features are rare, it does not work for almost all features. For example, a feature that indicates the positive class with 100% confidence but is present in only a 0.0001 fraction of documents is useless for classification. Therefore, if you still want to use the odds ratio, add a threshold on the feature's frequency, e.g. only consider features present in at least 5% of documents, as in the sketch below. But I would recommend a better approach: use the chi-square or information gain metrics, which handle these problems automatically.
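A minimal sketch of that frequency filter (X is again a hypothetical binary document-term matrix; the 5% cut-off is the one suggested above):

import numpy as np

def frequent_terms(X, min_doc_freq=0.05):
    # Indices of terms present in at least `min_doc_freq` of the documents.
    # X: (n_docs, n_terms) 0/1 matrix of term presence.
    doc_freq = np.asarray(X).mean(axis=0)
    return np.where(doc_freq >= min_doc_freq)[0]

# score only these candidate terms with the odds ratio afterwards
# candidates = frequent_terms(X)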