I have a few one-hot vectors (more than 2) of size 48 each. How can I add them? Is there any specific method, or is it simple arithmetic addition? If arithmetic addition, then how should I handle the carry bit?
I know this may be a very basic question but I am new to the field of Artificial Intelligence and need help.
One-hot vectors are still just vectors, so the same summation rule that applies to normal vectors applies to them.
To add or subtract two vectors, add or subtract the corresponding components.
Let u = ⟨u1, u2⟩ and v = ⟨v1, v2⟩ be two vectors.
Then, the sum of u and v is the vector
u+v=⟨u1+v1,u2+v2⟩
Adding one-hot vectors isn't a usual practice; it is entirely task-specific. I was dealing with time-series data when I posted this question.
Suppose you have three one-hot vectors [0,1,0,0], [1,0,0,0] and [0,1,0,0]. The addition process that I followed was to add these vectors element-wise, i.e. [0+1+0, 1+0+1, 0+0+0, 0+0+0]. The resulting vector is [1,2,0,0].
A common worry (with very large numbers of vectors) is whether the addition produces a carry. It does not: each component is an ordinary integer count, not a binary digit, so with many vectors a component simply becomes a multi-digit number, e.g. [12, 2, 0, 100].
PS: This is a highly data-centric approach. There may exist better proposals.
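As a minimal sketch of the element-wise addition above, using NumPy (the three vectors are the ones from the example):

```python
import numpy as np

# Three one-hot vectors of length 4 (as in the example above)
vectors = [
    np.array([0, 1, 0, 0]),
    np.array([1, 0, 0, 0]),
    np.array([0, 1, 0, 0]),
]

# Element-wise sum: no carry is needed, because each component
# stays an ordinary integer count rather than a binary digit.
total = np.sum(vectors, axis=0)
print(total)  # [1 2 0 0]
```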
I am trying to figure out which features are considered in each subsampling in my classification problem. For this, I am assuming that a random subset of features of length max_features is considered when building every tree.
I am interested in this because I am using two different types of features for my problem, so I want to make sure that in every tree both types of features are used for every node split. One way to at least make each tree consider all features is to set the max_features parameter to None. So one question here would be:
Does that mean that both types of features are being considered for every node split?
Another one derived from the previous question is:
Since random forest does subsampling for every tree, is this subsampling among cases (rows) or among columns (features) as well? Besides, can this subsampling be done by group of rows instead of randomly?
Besides, it does not seem to be a good idea to use all the features via the max_features parameter, either on decision trees or on random forests, since it runs counter to the whole point of random forests in terms of correlation among trees (I am not completely sure about this statement).
Does anyone know if this is something that can be modified in the source code or if at least it can be approached differently?
Any suggestion or comment is very welcome.
Feel free to correct any assumption.
I have been reading the source code about this but could not find where this might be defined.
Source code inspected so far:
splitter.py code from decision tree
forest.py code from random forest
Does that mean that both types of features are being considered for every node split?
Given that you have correctly pointed out that setting max_features to None will indeed force the algorithm to consider all features in every split, it is unclear what exactly you are asking here: all means all, and, from the point of view of the algorithm, there are no different "types" of features.
Since random forest does subsampling for every tree, is this subsampling among cases (rows) or among columns (features) as well?
Both. But, regarding the rows, it is not exactly subsampling, it is actually bootstrap sampling, i.e. sampling with replacement, which means that, in each sample, some rows will be missing while others will be present multiple times.
Random forest is in fact the combination of two independent ideas: bagging, and random selection of features. The latter corresponds essentially to "column subsampling", while the former includes the bootstrap sampling I have just described.
Besides, can this subsampling be done by group of rows instead of randomly?
AFAIK no, at least in the standard implementations (including scikit-learn).
Does anyone know if this is something that can be modified in the source code or if at least it can be approached differently?
Everything can be modified in the source code, literally; now, if it is really necessary (or even a good idea) to do so is a different story...
Besides, it does not seem to be a good idea to use all the features via the max_features parameter
It does not indeed, as this is the very characteristic that distinguishes RF from the simpler bagging approach (short for bootstrap aggregating). Experiments have shown that adding this random selection of features at each split boosts performance relative to simple bagging.
Although your question (and issue) sounds rather vague, my advice would be to "sit back and relax", letting the (powerful enough) RF algorithm do its job with your data...
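As a small illustration of the parameters discussed above, here is a sketch with scikit-learn; the synthetic dataset and hyperparameter values are arbitrary assumptions:

```python
# max_features=None makes every split consider all columns;
# bootstrap=True resamples rows with replacement for each tree.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

rf = RandomForestClassifier(
    n_estimators=50,
    max_features=None,   # all features considered at every split
    bootstrap=True,      # rows sampled with replacement per tree
    random_state=0,
)
rf.fit(X, y)
print(rf.score(X, y))
```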
My dataset is composed of millions of rows and a handful (tens) of features.
One feature is a label with 1000 different values (imagine each row is a user and this feature is the user's first name):
Firstname,Feature1,Feature2,....
Quentin,1,2
Marc,0,2
Gaby,1,0
Quentin,1,0
What would be the best representation for this feature (to perform clustering)?
I could convert the data to integers using a LabelEncoder, but it doesn't make sense here since there is no logical "order" between two different labels:
Firstname,F1,F2,....
0,1,2
1,0,2
2,1,0
0,1,0
I could split the feature into 1000 features (one for each label), with 1 when the label matches and 0 otherwise. However, this would result in a very big matrix (too big if I can't use a sparse matrix in my classifier):
Quentin,Marc,Gaby,F1,F2,....
1,0,0,1,2
0,1,0,0,2
0,0,1,1,0
1,0,0,1,0
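This one-hot option can be sketched with scikit-learn's OneHotEncoder, which returns a sparse matrix by default, so a 1000-label column need not be materialized densely (the toy names follow the example above):

```python
# One-hot encode a single string column; the result is a scipy
# sparse matrix with one column per distinct label.
import numpy as np
from sklearn.preprocessing import OneHotEncoder

names = np.array([["Quentin"], ["Marc"], ["Gaby"], ["Quentin"]])
enc = OneHotEncoder()
one_hot = enc.fit_transform(names)  # sparse, shape (4, 3)
print(one_hot.toarray())
```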
I could represent the LabelEncoder value as a binary number over N columns; this would reduce the dimension of the final matrix compared to the previous idea, but I'm not sure about the result:
LabelEncoder(Quentin) = 0 = 0,0
LabelEncoder(Marc) = 1 = 0,1
LabelEncoder(Gaby) = 2 = 1,0
A,B,F1,F2,....
0,0,1,2
0,1,0,2
1,0,1,0
0,0,1,0
... Any other ideas?
What do you think about solution 3?
Edit for some extra explanations
I should have mentioned this in my first post, but in the real dataset the feature is more like the final leaf of a classification tree (Aa1, Aa2, etc. in the example; it's not a binary tree).
A B C
Aa Ab Ba Bb Ca Cb
Aa1 Aa2 Ab1 Ab2 Ab3 Ba1 Ba2 Bb1 Bb2 Ca1 Ca2 Cb1 Cb2
So there is similarity between two terms at the same level (Aa1, Aa2 and Aa3 are quite similar, and Aa1 is as different from Ba1 as it is from Cb2).
The final goal is to find similar entities from a smaller dataset: we train a OneClassSVM on the smaller dataset and then get a distance for each term of the entire dataset.
This problem is largely one of one-hot encoding. How do we represent multiple categorical values in a way that we can use clustering algorithms and not screw up the distance calculation that your algorithm needs to do (you could be using some sort of probabilistic finite mixture model, but I digress)? Like user3914041's answer, there really is no definite answer, but I'll go through each solution you presented and give my impression:
Solution 1
If you're converting the categorical column to a numerical one as you mentioned, then you face the pretty big issue you noted: you basically lose the meaning of that column. What does it really even mean if Quentin is 0, Marc 1, and Gaby 2? At that point, why even include that column in the clustering? Like user3914041's answer says, this is the easiest way to change your categorical values into numerical ones, but the resulting values just aren't useful and could even be detrimental to the results of the clustering.
Solution 2
In my opinion, depending upon how you implement all of this and your goals with the clustering, this would be your best bet. Since I'm assuming you plan to use sklearn and something like k-Means, you should be able to use sparse matrices fine. However, like imaluengo suggests, you should consider using a different distance metric. What you can consider doing is scaling all of your numeric features to the same range as the categorical features, and then use something like cosine distance. Or a mix of distance metrics, like I mention below. But all in all this will likely be the most useful representation of your categorical data for your clustering algorithm.
Solution 3
I agree with user3914041 in that this is not useful, and introduces some of the same problems as mentioned with #1 -- you lose meaning when two (probably) totally different names share a column value.
Solution 4
An additional solution is to follow the advice of the answer here. You can consider rolling your own version of a k-means-like algorithm that takes a mix of distance metrics (hamming distance for the one-hot encoded categorical data, and euclidean for the rest). There seems to be some work in developing k-means like algorithms for mixed categorical and numerical data, like here.
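The mixed-metric idea can be sketched as a simple distance function; the relative weight between the Hamming and Euclidean parts is an assumption you would have to tune:

```python
# Hamming distance on the one-hot part plus Euclidean distance on
# the numeric part of each row vector.
import numpy as np

def mixed_distance(a, b, n_categorical, weight=1.0):
    """a, b: 1-D arrays whose first n_categorical entries are one-hot."""
    cat_a, num_a = a[:n_categorical], a[n_categorical:]
    cat_b, num_b = b[:n_categorical], b[n_categorical:]
    hamming = np.sum(cat_a != cat_b)          # count of differing slots
    euclid = np.linalg.norm(num_a - num_b)    # distance on numeric part
    return weight * hamming + euclid

row1 = np.array([1, 0, 0, 1.0, 2.0])  # Quentin, F1=1, F2=2
row2 = np.array([0, 1, 0, 0.0, 2.0])  # Marc,    F1=0, F2=2
print(mixed_distance(row1, row2, n_categorical=3))  # 2 + 1.0 = 3.0
```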
I guess it's also important to consider whether or not you need to cluster on this categorical data. What are you hoping to see?
Solution 3:
I'd say it has the same kind of drawback as using a 1..N encoding (solution 1), in a less obvious fashion. You'll have names that both give a 1 in some column, for no other reason than the order of the encoding...
So I'd recommend against this.
Solution 1:
The 1..N solution is the "easy way" to solve the format issue, as you noted it's probably not the best.
Solution 2:
This looks like the best way to do it, but it is a bit cumbersome and, in my experience, the classifier does not always perform very well with a high number of categories.
Solution 4+:
I think the encoding depends on what you want: if you think that names that are similar (like John and Johnny) should be close, you could use characters-grams to represent them. I doubt this is the case in your application though.
Another approach is to encode the name with its frequency in the (training) dataset. In this way what you're saying is: "Mainstream people should be close, whether they're Sophia or Jackson does not matter".
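The frequency-encoding idea can be sketched in a few lines (the name list is the toy example from the question):

```python
# Replace each name by its relative frequency in the training data,
# so names that are equally common end up with equal values.
from collections import Counter

names = ["Quentin", "Marc", "Gaby", "Quentin"]
counts = Counter(names)
n = len(names)

freq_encoded = [counts[name] / n for name in names]
print(freq_encoded)  # [0.5, 0.25, 0.25, 0.5]
```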
Hope the suggestions help; there's no definite answer to this, so I'm looking forward to seeing what other people do.
I am trying to use relations between objects for a supervised learning task.
For example, given a text like "Cats eat fish", I would like to use the relation Cats-eat-fish as a feature for the learning task (namely, identifying the sense of a word). I would thus like to represent this relation numerically so that I could use it as a feature for learning a model. Any suggestions on how I could accomplish that? I was thinking of hashing it to an integer, but that could pose challenges: two relations that are semantically the same could have two very different hash values. I would ideally like two similar relations (e.g. lives and resides) to hash to the same value. I guess I would also need to figure out whether I could canonicalize relations before hashing.
Other approaches perhaps not using numerical features would also be useful. I am also wondering if there are graph based approaches to this problem.
I'd suggest making very large numbers of binary features for all possible relation types, and then possibly running some form of dimensionality reduction on the resulting (very sparse) feature space.
Another way to do this, which would reduce sparsity, would be to replace the bare words with entity types, for example [animal] eats [animal], or even [animate] eats [animate], and then use binary features in this space. You want to avoid mapping to numerical values on a single dimension because you'll impose spurious ordinal relations between features if you do.
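A sketch of the type-replacement plus binary-features idea; the tiny entity-type lexicon and the helper names here are hypothetical:

```python
# Map words to coarse entity types, then build binary features over
# the resulting subject-verb-object relation strings.
entity_types = {"cats": "[animal]", "fish": "[animal]", "dogs": "[animal]"}

def relation_feature(subject, verb, obj):
    s = entity_types.get(subject, subject)
    o = entity_types.get(obj, obj)
    return f"{s}-{verb}-{o}"

feature_space = {}  # feature name -> column index, grown on the fly

def binarize(relations):
    for rel in relations:
        feature_space.setdefault(rel, len(feature_space))
    vec = [0] * len(feature_space)
    for rel in relations:
        vec[feature_space[rel]] = 1
    return vec

rels = [relation_feature("cats", "eat", "fish")]
print(rels)            # ['[animal]-eat-[animal]']
print(binarize(rels))  # [1]
```

Typed relations collapse many surface forms into one binary feature, which is exactly how the sparsity gets reduced.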
How about representing verbs by features that would express typical words preceding the verb (usually subject) and typical words following the verb (usually object). Say you can take 500 most frequent words (or even better most discriminating words), then each verb would be represented as a 1000-dimensional vector. Each feature in the vector can be either binary (is there the word with frequency above certain threshold or not), or pure count, or probably best as logarithm. Then you can run PCA to reduce the vector to some smaller dimension.
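A sketch of these context vectors on a toy corpus (the corpus, the left/right feature naming, and the log scaling are illustrative assumptions):

```python
# Represent each verb by log counts of the words seen immediately
# before (usually subject) and after (usually object) it.
import math
from collections import defaultdict

corpus = [("cats", "eat", "fish"), ("dogs", "eat", "meat"),
          ("people", "reside", "here"), ("people", "live", "here")]
vocab = sorted({w for s, _, o in corpus for w in (s, o)})

def verb_vector(verb):
    counts = defaultdict(int)
    for s, v, o in corpus:
        if v == verb:
            counts["L:" + s] += 1   # word preceding the verb
            counts["R:" + o] += 1   # word following the verb
    # one slot per vocabulary word on each side, log-scaled counts
    return [math.log1p(counts[side + w]) for side in ("L:", "R:") for w in vocab]

print(verb_vector("eat"))
```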
The approach above is probabilistic which might be good or bad depending on what you want. IF you want to do it precisely with a lot of manual input, then look at situation semantics.
So I want to make sure I have this right. First, I'm an undergrad computer engineering major with much more hardware/EE experience than software. This summer I have found myself using a clustering algorithm that uses one-class SVM's. Is an SVM just a mathematical model used to classify/separate input data? Do SVM's work well on data sets with one attribute/variable? I'm guessing no to the latter, possibly because classification with a single attribute is practically stereotyping. My guess is SVM's perform better on datasets that have multiple attributes/variables to contribute to classification. Thanks in advance!
SVM tries to build a hyperplane separating 2 classes (AFAIK, in one-class SVM there is one class for "normal" and one for "abnormal" instances). With only one attribute you have a one-dimensional space, i.e. a line. A hyperplane there is thus a single point on the line. If the instances of the 2 classes (points on this line) can be separated by this point (i.e. they are linearly separable), then yes, an SVM may be used. Otherwise not.
Note that with several attributes an SVM may still be used to classify even linearly non-separable instances. In the image below there are 2 classes in a two-dimensional space (2 attributes, X and Y), one marked with blue dots and the other with green.
You cannot draw a line that separates them. However, the so-called kernel trick can be used to produce many more attributes by combining the existing ones. With more attributes you get a higher-dimensional space, in which all instances may become separable. Strictly speaking, a kernel can also be applied to a single attribute (e.g. mapping x to (x, x²)), but with only one attribute there is much less raw material to combine.
So, the answer to your question is: an SVM may be used on a dataset with only one attribute as long as the instances of the 2 classes are separable on that attribute, directly or after a suitable kernel mapping.
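A quick sketch of the separable one-attribute case with scikit-learn's linear SVM (the data points are made up):

```python
# A linear SVM on a single attribute can only place one threshold
# point on the line; here the two classes sit on either side of it.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.1], [0.3], [0.4], [2.0], [2.2], [2.5]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear")
clf.fit(X, y)
print(clf.predict([[0.2], [2.1]]))  # [0 1]
```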
Binarization is the act of transforming colorful features of an entity into vectors of numbers, most often binary vectors, to make good examples for classifier algorithms.
If we were to binarize the sentence "The cat ate the dog", we could start by assigning every word an ID (for example cat-1, ate-2, the-3, dog-4) and then simply replace each word by its ID, giving the vector <3,1,2,3,4>.
Given these IDs, we could also create a binary vector by giving each word four possible slots and setting the slot corresponding to a specific word to one, giving the vector <0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,1>. The latter method is, as far as I know, commonly referred to as the bag-of-words method.
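Both encodings can be sketched in a few lines of Python, using the word IDs from the example (the position-wise one-hot here keeps one four-slot group per token, so the repeated "the" gets its own group):

```python
# Two encodings of the toy sentence: an ID sequence, and a
# position-wise one-hot vector with one group of slots per token.
sentence = ["the", "cat", "ate", "the", "dog"]
word_id = {"cat": 1, "ate": 2, "the": 3, "dog": 4}

# ID sequence: each word replaced by its ID
id_vector = [word_id[w] for w in sentence]
print(id_vector)  # [3, 1, 2, 3, 4]

# Position-wise one-hot: four slots per token, one per possible ID
binary_vector = []
for w in sentence:
    slots = [0, 0, 0, 0]
    slots[word_id[w] - 1] = 1
    binary_vector.extend(slots)
print(binary_vector)
```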
Now for my question: what is the best binarization method when it comes to describing features for natural language processing in general, and transition-based dependency parsing (with Nivre's algorithm) in particular?
In this context, we do not want to encode the whole sentence, but rather the current state of the parse, for example the top word on the stack and the first word in the input queue. Since order is highly relevant, this rules out the bag-of-words method.
By best, I mean the method that makes the data most intelligible for the classifier without using up unnecessary memory. For example, I don't want a word bigram to use 400 million features for 20,000 unique words if only 2% of the bigrams actually exist.
Since the answer is also depending on the particular classifier, I am mostly interested in maximum entropy models (liblinear), support vector machines (libsvm) and perceptrons, but answers that apply to other models are also welcome.
This is actually a really complex question. The first decision you have to make is whether to lemmatize your input tokens (your words). If you do this, you dramatically decrease your type count, and your syntax parsing gets a lot less complicated. However, it takes a lot of work to lemmatize a token. Now, in a computer language, this task gets greatly reduced, as most languages separate keywords or variable names with a well defined set of symbols, like whitespace or a period or whatnot.
The second crucial decision is what you're going to do with the data post-facto. The "bag-of-words" method, in the binary form you've presented, ignores word order, which is completely fine if you're doing summarization of a text or maybe a Google-style search where you don't care where the words appear, as long as they appear. If, on the other hand, you're building something like a compiler or parser, order is very much important. You can use the token-vector approach (as in your second paragraph), or you can extend the bag-of-words approach such that each non-zero entry in the bag-of-words vector contains the linear index position of the token in the phrase.
Finally, if you're going to be building parse trees, there are obvious reasons why you'd want to go with the token-vector approach, as it's a big hassle to maintain sub-phrase ids for every word in the bag-of-words vector, but very easy to make "sub-vectors" in a token-vector. In fact, Eric Brill used a token-id sequence for his part-of-speech tagger, which is really neat.
Do you mind if I ask what specific task you're working on?
Binarization is the act of transforming colorful features of an entity into vectors of numbers, most often binary vectors, to make good examples for classifier algorithms.
I have mostly come across numeric features that take values between 0 and 1 (not binary as you describe), representing the relevance of the particular feature in the vector (between 0% and 100%, where 1 represents 100%). A common example for this are tf-idf vectors: in the vector representing a document (or sentence), you have a value for each term in the entire vocabulary that indicates the relevance of that term for the represented document.
As Mike already said in his reply, this is a complex problem in a wide field. In addition to his pointers, you might find it useful to look into some information retrieval techniques like the vector space model, vector space classification and latent semantic indexing as starting points. Also, the field of word sense disambiguation deals a lot with feature representation issues in NLP.
[Not a direct answer] It all depends on what you are trying to parse and then process, but for general short human-phrase processing (e.g. IVT) another method is to use neural networks to learn the patterns. This can be very accurate for smallish vocabularies.