sklearn LDA: difference between coef_ and scalings_? - machine-learning

I'm new to LDA. The documentation isn't very clear - what's the difference between coef_ and scalings_?
My data has many features (F1, F2, ..., F10000) and some labeled classes (C1, C2, ..., C5). I want to find the equations of the 2 lines (linear combinations of F1-F10000) that form the axes for the 2D space in which the classes (C1-C5) are maximally separated. How do I get the equations for these lines using sklearn?

Related

Transforming Features to increase similarity

I have a large dataset (~20,000 samples x 2,000 features-- each sample w/ a corresponding y-value) that I'm constructing a regression ML model for.
The input vectors are bitvectors with either 1s or 0s at each position.
Interestingly, I have noticed that when I 'randomly' select N samples such that their y-values are between two arbitrary values A and B (such that B-A is much smaller than the total range of values in y), the subsequent model is much better at predicting other values with the A-->B range not used in the training of the model.
However, the overall similarity of the input X vectors for these values are in no way more similar than any random selection of X values across the whole dataset.
Is there an available method to transform the input X-vectors such that those with more similar y-values are "closer" (I'm not particular the methodology, but it could be something like cosine similarity), and those with not similar y-values are separated?
After more thought, I believe this question can be re-framed as a supervised clustering problem. What might be able to accomplish this might be as simple as:
import umap
print(df.shape)
>> (23,312, 2149)
print(len(target))
>> 23,312
embedding = umap.UMAP().fit_transform(df, y=target)

get_feature_names for each label

I'm new to machine learning, started with multilable Text classification. I'm able to classify the new data based on trained modle. however some lables are miss predicted.
f
I want to see the weightage of the tokens or may be called features used for L1 and L2.
or example there are two lables L1 and L2. new records are associated to L1 but they are predicted as L2, these 2 have similar tokens with few difference.
My question to all is , can I see the features mapped by tfidfvectorizer to L1 and L2.something like below using get_feature_names() and 'Y' variables.
L1(hockey)-- 'ball','ground','net','stick'
L2(cricket)-- 'ball','ground','stick','stumps'

classifier using boolean formula

I am trying to find a classifier that is represented by an arbitrary boolean formula. Is it possible to do so ? I tried using the SVC from sklearn.svm using the linear kernel, but not sure if it is correct and if it is, how to extract a formula from the learned classifier.
Here's a simple dataset with 4 variables x,y,z,w (features) and labels 0 and 1. And any data with x=1 or y=1 will have a label 1 and everything else has label 0.
x,y,z,w,label
0,0,0,0,0
0,0,0,1,0
0,0,1,0,1
0,0,1,1,1
0,1,0,0,0
0,1,0,1,0
0,1,1,0,1
0,1,1,1,1
1,0,0,0,1
1,0,0,1,1
1,0,1,0,1
1,0,1,1,1
1,1,0,0,1
1,1,0,1,1
1,1,1,0,1
1,1,1,1,1
For this example, I want to extract the classifier represented by the formula x=1 or z=1. Eventually I will have more complex data represented by complex, arbitrary formula (e.g., (x= 1 or y=0) and (z=0) ... )
The input->output relationships in your data is non-linear, discrete and non-smooth. Any linear models will perform badly in this case. Try instead a DecisionTreeClassifier, which should be OK for your kind of data.
Alternatively you could use a Boolean Satisfiability solver, but this will only work if your data is deterministic and not fuzzy.

Dimensions of LSTM variant in Deep Mind's Differentiable Neural Computer (DNC)

I'm trying to implement Deep Mind's DNC - Nature paper- with PyTorch 0.4.0.
When implementing the variant of LSTM they used I encountered some troubles with dimensions.
To simplify suppose BATCH=1.
The equations they list in the paper are these:
where [x;h] means a concatenation of x and h into one single vector, and i, f and o are column vectors.
My question is about how the state s_t is computed.
The second addendum is obtained by multiplying i with a column vector and so the result is either a scalar (transpose i first, then do scalar product) or wrong (two column vectors multiplied).
So the state results in a single scalar...
With the same reasoning the hidden state h_t is a scalar too, but it has to be a column vector.
Obviously I'm wrong somewhere, but I can't figure out where.
By looking at Wikipedia LSTM Article I think I figured it out.
This is the formal implementation of standard LSTM found in the article:
The circle represents element-by-element product.
By using this product in the corresponding parts of DNC equations (s_t and o_t) the dimensions work.

Decision tree completeness and unclassified data

I made a program that trains a decision tree built on the ID3 algorithm using an information gain function (Shanon entropy) for feature selection (split).
Once I trained a decision tree I tested it to classify unseen data and I realized that some data instances cannot be classified: there is no path on the tree that classifies the instance.
An example (this is an illustration example but I encounter the same problem with a larger and more complex data set):
Being f1 and f2 the predictor variables (features) and y the categorical variable, the values ranges are:
f1: [a1; a2; a3]
f2: [b1; b2; b3]
y : [y1; y2; y3]
Training data:
("a1", "b1", "y1");
("a1", "b2", "y2");
("a2", "b3", "y3");
("a3", "b3", "y1");
Trained tree:
[f2]
/ | \
b1 b2 b3
/ | \
y1 y2 [f1]
/ \
a2 a3
/ \
y3 y1
The instance ("a1", "b3") cannot be classified with the given tree.
Several questions came up to me:
Does this situation have a name? tree incompleteness or something like that?
Is there a way to know if a decision tree will cover all combinations of unknown instances (all features values combinations)?
Does the reason of this "incompleteness" lie on the topology of the data set or on the algorithm used to train the decision tree (ID3 in this case) (or other)?
Is there a method to classify these unclassifiable instances with the given decision tree? or one must use another tool (random forest, neural networks...)?
This situation cannot occur with the ID3 decision-tree learner---regardless of whether it uses information gain or some other heuristic for split selection. (See, for example, ID3 algorithm on Wikipedia.)
The "trained tree" in your example above could not have been returned by the ID3 decision-tree learning algorithm.
This is because when the algorithm selects a d-valued attribute (i.e. an attribute with d possible values) on which to split the given leaf, it will create d new children (one per attribute value). In particular, in your example above, the node [f1] would have three children, corresponding to attribute values a1,a2, and a3.
It follows from the previous paragraph (and, in general, from the way the ID3 algorithm works) that any well-formed vector---of the form (v1, v2, ..., vn, y), where vi is a value of i-th attribute and y is the class value---should be classifiable by the decision tree that the algorithm learns on a given train set.
Would you mind providing a link to the software you used to learn the "incomplete" trees?
To answer your questions:
Not that I know of. It doesn't make sense to learn such "incomplete trees." If we knew that some attribute values will never occur then we would not include them in the specification (the file where you list attributes and their values) in the first place.
With the ID3 algorithm, you can prove---as I sketched in the answer---that every tree returned by the algorithm will cover all possible combinations.
You're using the wrong algorithm. Data has nothing to do with it.
There is no such thing as an unclassifiable instance in decision-tree learning. One usually defines a decision-tree learning problem as follows. Given a train set S of examples x1,x2,...,xn of the form xi=(v1i,v2i,...,vni,yi) where vji is the value of the j-th attribute and yi is the class value in example xi, learn a function (represented by a decision tree) f: X -> Y, where X is the space of all possible well-formed vectors (i.e. all possible combinations of attribute values) and Y is the space of all possible class values, which minimizes an error function (e.g. the number of misclassified examples). From this definition, you can see that one requires that the function f is able to map any combination to a class value; thus, by definition, each possible instance is classifiable.

Resources