Incremental Adding of New Categories - machine-learning

I was wondering if there is any algorithm for incrementally adding new classes to an existing classifier system. For example, if I have trained a system with 50 categories and I want to add another 10 categories, what methods should I look into? There is a wide range of algorithms that allow incrementally updating a system with additional training samples from existing categories, but I am not aware of methods that allow adding more categories. Theoretically, I think Nearest Neighbor-like algorithms can be applied to this task, but are there other algorithms that are suitable for large-scale tasks (say, updating a system trained with 500 categories with 50 additional categories)? Maybe something in the domain of incremental decision trees? Algorithms like incremental SVM do not scale very well to a large number of categories. If there is any paper/code I would appreciate pointers to it.

If I understand your question correctly, you're asking about divisive clustering (you have a given set of data and want to re-cluster them with a larger number of groups than before).
Most algorithms I'm familiar with would require re-building the clustering basically from scratch. However, you might want to look at the BIRCH algorithm. Since it stores only a summary of the classes (without explicit data references), it is a) suitable for Big Data™, and b) it features a kind of distance measure that might tell you which category you should split next (in case you want to dynamically generate an additional 50 "most distinct" categories).
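For what it's worth, here is a minimal sketch of the incremental-update side of this, assuming scikit-learn's Birch implementation; the feature dimension, threshold and batch contents are made up for illustration:

```python
# A minimal sketch of incremental updating with BIRCH, using scikit-learn.
# Feature dimension, threshold and batch contents are illustrative assumptions.
import numpy as np
from sklearn.cluster import AgglomerativeClustering, Birch

rng = np.random.default_rng(0)
initial_batch = rng.normal(size=(5000, 2))           # data seen so far
new_batch = rng.normal(loc=3.0, size=(1000, 2))      # data arriving later

# n_clusters=None keeps only the CF-tree summary, with no final global clustering.
birch = Birch(threshold=0.3, branching_factor=50, n_clusters=None)
birch.partial_fit(initial_batch)   # build the summary from the first batch
birch.partial_fit(new_batch)       # update it with the new data only

# The subcluster centroids are BIRCH's compact summary; they can be re-grouped
# into any number of final categories without revisiting the raw data.
centroids = birch.subcluster_centers_
n_final = min(60, len(centroids))
labels = AgglomerativeClustering(n_clusters=n_final).fit_predict(centroids)
print(len(centroids), "subclusters grouped into", n_final, "categories")
```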

Related

Ignore columns in SMOTE oversampling

I have six feature columns and one target column, which is imbalanced.
Can I use an oversampling method like ADASYN or SMOTE to create synthetic records from only the four columns X1, X2, X3, X4, while copying the remaining Month and Year columns over unchanged as constants?
Expected behaviour: the method up-samples target class '1', so the number of records increases, but the added records keep their Month and Year values unchanged.
From a programming perspective, an identical question asked in the relevant GitHub repo back in 2017 was answered negatively:
[Question]
I have a data frame that I want to apply smote to but I wish to only use a subset of the columns. The other columns contain additional data for each sample and I want each new sample to contain the original info as well
[Answer]
There is no way to do that apart from extracting the columns into a new matrix and processing it with SMOTE.
Even if you generate new samples, you have to decide what to put as values there, so I don't see how such a feature could be added.
Answering from a modelling perspective, this is not a good idea and, even if you could find a programming workaround, you should not attempt it; arguably, this is the reason why the developer of imbalanced-learn above was dismissive of even the thought of adding such a feature to the SMOTE implementation.
Why is that? Well, synthetic oversampling algorithms, like SMOTE, essentially use some variant of a k-nn approach in order to create artificial samples "between" the existing ones. Given this approach, it goes without saying that, in order for these artificial samples to be indeed "between" the real ones (in a k-nn sense), all the existing (numerical) features must be taken into account.
If, by employing some programming alchemy, you manage in the end to produce new SMOTE samples based only on a subset of your features, putting the unused features back in will destroy any notion of proximity and "betweenness" of these artificial samples to the real ones, thus compromising the whole enterprise by inserting a huge bias into your training set.
In short:
If you think your Month and Year are indeed useful features, just include them in SMOTE; you may get some nonsensical artificial samples, but this should not be considered a (big) problem for the purpose here (see the sketch after this list).
If not, then maybe you should consider removing them altogether from your training.
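A minimal sketch of that first option, assuming imbalanced-learn's SMOTE and an invented toy dataframe with the column names from the question:

```python
# Minimal sketch: apply SMOTE to *all* numeric features together.
# Column names and values below are illustrative assumptions.
import pandas as pd
from imblearn.over_sampling import SMOTE

df = pd.DataFrame({
    "X1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "X2": [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5],
    "X3": [10, 11, 12, 13, 14, 15, 16, 17],
    "X4": [3, 1, 4, 1, 5, 9, 2, 6],
    "Month": [1, 2, 3, 4, 5, 6, 7, 8],
    "Year": [2020] * 4 + [2021] * 4,
    "target": [0, 0, 0, 0, 0, 0, 1, 1],
})

X = df.drop(columns="target")   # Month and Year included as ordinary features
y = df["target"]

# k_neighbors must be smaller than the minority-class count (only 2 here).
X_res, y_res = SMOTE(k_neighbors=1, random_state=0).fit_resample(X, y)
print(y_res.value_counts())
```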

Adding vocabulary and improving word embeddings with another model built on a bigger corpus

I'm new to NLP. I'm currently building an NLP system in a specific domain. After training word2vec and fastText models on my documents, I found that the embeddings are not really good because I didn't feed in enough documents (e.g. the embedding can't see that "bar" and "pub" are strongly related to each other, because "pub" appears only a few times in the documents). Later, I found a word2vec model online built on that domain-specific corpus which definitely has much better embeddings (so "pub" is more related to "bar"). Is there any way to improve my word embeddings using the model I found? Thanks!
Word2Vec (and similar) models really require a large volume of varied data to create strong vectors.
But also, a model's vectors are typically only meaningful alongside other vectors that were trained together in the same session. This is both because the process includes some randomness, and the vectors only acquire their useful positions via a tug-of-war with all other vectors and aspects of the model-in-training.
So, there's no standard location for a word like 'bar' - just a good position, within a certain model, given the training data and model parameters and other words co-populating the model.
This means mixing vectors from different models is non-trivial. There are ways to learn a 'translation' that moves vectors from the space of one model to another – but that is itself a lot like a re-training. You can pre-initialize a model with vectors from elsewhere... but as soon as training starts, all the words in your training corpus will start drifting into the best alignment for that data, and gradually away from their original positions, and away from pure comparability with other words that aren't being updated.
In my opinion, the best approach is usually to expand your corpus with more appropriate data, so that it has "enough" examples of every word important to you, in sufficiently varied contexts.
Many people use large free text dumps like Wikipedia articles for word-vector training, but be aware that its style of writing – dry, authoritative reference texts – may not be optimal for all domains. If your problem-area is "business reviews", you'd probably do best finding other review texts. If it's fiction stories, more fictional writing. And so forth. You can shuffle these other text sources in with your data to expand the vocabulary coverage.
You can also potentially shuffle in extra repeated examples of your own local data, if you want it to effectively have relatively more influence. (Generally, merely repeating a small number of non-varied examples can't help improve word-vectors: it's the subtle contrasts of different examples that helps. But as a way to incrementally boost the influence of some examples, when there are plenty of examples overall, it can make more sense.)
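A minimal sketch of that corpus-expansion approach, assuming gensim 4.x (the toy corpora, repetition factor and parameter values are invented for illustration):

```python
# Sketch: grow the training corpus with additional in-domain text, then retrain.
# The toy corpora, repetition factor and parameters are illustrative assumptions.
import random
from gensim.models import Word2Vec

my_docs = [["the", "pub", "was", "crowded"], ["great", "bar", "downtown"]]
extra_docs = [["this", "bar", "serves", "food"],
              ["a", "quiet", "pub", "near", "the", "station"]]

# Optionally repeat your own documents so they carry relatively more weight.
corpus = extra_docs + my_docs * 3
random.seed(0)
random.shuffle(corpus)

model = Word2Vec(
    sentences=corpus,
    vector_size=100,   # called 'size' in gensim 3.x
    window=5,
    min_count=1,       # raise this once the corpus is realistically large
    epochs=10,
    seed=0,
)
print(model.wv.most_similar("pub", topn=3))
```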

Find the best set of features to separate 2 known groups of data

I need some points of view on whether what I am doing is good or wrong, or whether there is a better way to do it.
I have 10,000 elements. For each of them I have about 500 features.
I am looking to measure the separability between 2 sets of those elements. (I already know the 2 groups; I am not trying to find them.)
For now I am using an SVM. I train the SVM on 2,000 of those elements, then I look at how good the score is when I test on the other 8,000 elements.
Now I would like to know which features maximize this separation.
My first approach was to test each combination of features with the SVM and follow the score given by the SVM. If the score is good, those features are relevant for separating the 2 sets of data.
But this takes far too much time: there are 2^500 possible subsets.
The second approach was to remove one feature and see how much the score is impacted. If the score changes a lot, that feature is relevant. This is faster, but I am not sure it is right. With 500 features, removing just one feature doesn't change the final score much.
Is this a correct way to do it?
Have you tried any other methods? You could try a decision tree or a random forest; they will rank your best features based on entropy/information gain. Can I assume all the features are independent of each other? If not, please remove the redundant ones as well.
Also, for support-vector approaches, you can check out this paper:
http://axon.cs.byu.edu/Dan/778/papers/Feature%20Selection/guyon2.pdf
But it's based more on linear SVM.
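A minimal sketch of both suggestions, assuming scikit-learn's random-forest importances and recursive feature elimination with a linear SVM (the synthetic data and parameter values are invented):

```python
# Sketch: two feature-ranking approaches for two known groups.
# Data shapes and parameters are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=2000, n_features=500,
                           n_informative=20, random_state=0)

# 1) Impurity-based importances from a random forest.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
top_by_forest = np.argsort(forest.feature_importances_)[::-1][:20]

# 2) Recursive feature elimination with a linear SVM (SVM-RFE, in the spirit
#    of the linked paper).
rfe = RFE(estimator=LinearSVC(dual=False), n_features_to_select=20, step=10)
rfe.fit(X, y)
top_by_rfe = np.where(rfe.support_)[0]

print(top_by_forest)
print(top_by_rfe)
```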
You can do statistical analysis on the features to get indications of which terms best separate the data. I like Information Gain, but there are others.
I found this paper (Fabrizio Sebastiani, Machine Learning in Automated Text Categorization, ACM Computing Surveys, Vol. 34, No.1, pp.1-47, 2002) to be a good theoretical treatment of text classification, including feature reduction by a variety of methods from the simple (Term Frequency) to the complex (Information-Theoretic).
These functions try to capture the intuition that the best terms for ci are the ones distributed most differently in the sets of positive and negative examples of ci. However, interpretations of this principle vary across different functions. For instance, in the experimental sciences χ2 is used to measure how the results of an observation differ (i.e., are independent) from the results expected according to an initial hypothesis (lower values indicate lower dependence). In DR we measure how independent tk and ci are. The terms tk with the lowest value for χ2(tk, ci) are thus the most independent from ci; since we are interested in the terms which are not, we select the terms for which χ2(tk, ci) is highest.
These techniques help you choose terms that are most useful in separating the training documents into the given classes; the terms with the highest predictive value for your problem. The features with the highest Information Gain are likely to best separate your data.
I've been successful using Information Gain for feature reduction and found this paper (Entropy based feature selection for text categorization Largeron, Christine and Moulin, Christophe and Géry, Mathias - SAC - Pages 924-928 2011) to be a very good practical guide.
Here the authors present a simple formulation of entropy-based feature selection that's useful for implementation in code:
Given a term tj and a category ck, ECCD(tj, ck) can be computed from a contingency table. Let A be the number of documents in the category containing tj; B, the number of documents in the other categories containing tj; C, the number of documents of ck which do not contain tj; and D, the number of documents in the other categories which do not contain tj (with N = A + B + C + D):

                          in ck    in other categories
  contains tj               A              B
  does not contain tj       C              D

Using this contingency table, Information Gain can be estimated by:

$$IG(t_j, c_k) = -\frac{A+C}{N}\log\frac{A+C}{N} - \frac{B+D}{N}\log\frac{B+D}{N} + \frac{A}{N}\log\frac{A}{A+B} + \frac{B}{N}\log\frac{B}{A+B} + \frac{C}{N}\log\frac{C}{C+D} + \frac{D}{N}\log\frac{D}{C+D}$$
This approach is easy to implement and provides very good Information-Theoretic feature reduction.
You needn't use a single technique either; you can combine them. Term-Frequency is simple, but can also be effective. I've combined the Information Gain approach with Term Frequency to do feature selection successfully. You should experiment with your data to see which technique or techniques work most effectively.
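For completeness, a small sketch of the contingency-table estimate in code; this is a direct transcription of the Information Gain formula above (base-2 logs), not code taken from either paper:

```python
# Sketch: Information Gain of a term for a binary category,
# estimated from the document counts A, B, C, D defined above.
from math import log2

def information_gain(A: int, B: int, C: int, D: int) -> float:
    """IG(t, c) = H(c) - H(c | t), estimated from document counts."""
    N = A + B + C + D

    def entropy(*counts: int) -> float:
        total = sum(counts)
        return -sum((n / total) * log2(n / total) for n in counts if n > 0)

    h_class = entropy(A + C, B + D)                  # H(c): prior class entropy
    h_conditional = ((A + B) / N) * entropy(A, B) \
                  + ((C + D) / N) * entropy(C, D)    # H(c | term present/absent)
    return h_class - h_conditional

# Toy example: a term that appears almost only inside the target category.
print(information_gain(A=80, B=5, C=20, D=95))
```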
If you want a single feature to discriminate your data, use a decision tree, and look at the root node.
SVM by design looks at combinations of all features.
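A minimal sketch of that check, assuming a depth-1 scikit-learn decision tree and invented synthetic data:

```python
# Sketch: find the single most discriminative feature with a decision stump.
# The synthetic data below is an illustrative assumption.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=500,
                           n_informative=10, random_state=0)

stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)
root_feature = stump.tree_.feature[0]   # index of the feature used at the root
print(f"Best single feature: {root_feature}, "
      f"training accuracy: {stump.score(X, y):.3f}")
```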
Have you thought about Linear Discriminant Analysis (LDA)?
LDA aims at discovering a linear combination of features that maximizes the separability. The algorithm works by projecting your data into a space where the within-class variance is minimal and the between-class variance is maximal.
You can use it to reduce the number of dimensions required to classify, and also use it as a linear classifier.
However with this technique you would lose the original features with their meaning, and you may want to avoid that.
If you want more details I found this article to be a good introduction.
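A minimal sketch, assuming scikit-learn's LinearDiscriminantAnalysis and invented synthetic data (with two classes there is a single discriminant axis):

```python
# Sketch: project two known groups onto the single LDA axis and
# inspect which features carry the most weight.  Data is illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_classification(n_samples=10_000, n_features=500,
                           n_informative=15, random_state=0)

lda = LinearDiscriminantAnalysis(n_components=1)
X_proj = lda.fit_transform(X, y)        # shape (10000, 1): one axis for 2 classes

# Large absolute coefficients indicate features that drive the separation,
# though it is the combination, not any single original feature, that separates.
top_features = np.argsort(np.abs(lda.coef_[0]))[::-1][:10]
print(top_features)
print("Accuracy as a linear classifier:", lda.score(X, y))
```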

Does it make sense to construct a learning model using only one feature?

In order to improve the accuracy of an AdaBoost classifier (for image classification), I am using genetic programming to derive new statistical measures. Every time a new feature is generated, I evaluate its fitness by training an AdaBoost classifier and testing its performance. But I want to know if that procedure is correct; I mean, the use of a single feature to train a learning model.
You can build a model on one feature. I assume that by "one feature" you mean simply one number in R (otherwise it would be completely "traditional" usage). However, this means that you are building a classifier in a one-dimensional space, and as such many classifiers will be redundant (as it is a really simple problem). What is more important, checking whether you can correctly classify objects using one particular dimension does not tell you whether it is a good/bad feature once you use a combination of them. In particular, it may be the case that:
Many features may "discover" the same phenomenon in the data, and so each of them separately can yield good results, but once combined they won't be any better than each of them alone (as they simply capture the same information).
Features may be useless until used in combination. Some phenomena can be described only in a multi-dimensional space, and if you analyze only one-dimensional data you won't ever discover their true value. As a simple example, consider four points (0,0), (0,1), (1,0), (1,1) such that (0,0) and (1,1) are elements of one class, and the rest of another. If you look separately at each dimension, then the best possible accuracy is 0.5 (as you always have points of two different classes at exactly the same values, 0 and 1). Once combined, you can easily separate them, as it is an XOR problem (see the sketch below).
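A minimal sketch of that XOR example, using scikit-learn decision trees purely for illustration:

```python
# Sketch: each feature alone is useless for XOR, but the pair separates perfectly.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])   # class 0 for (0,0) and (1,1), class 1 otherwise

for i in range(2):
    clf = DecisionTreeClassifier().fit(X[:, [i]], y)
    # 0.5: both labels occur at each value of this single feature.
    print(f"feature {i} alone:", clf.score(X[:, [i]], y))

clf_both = DecisionTreeClassifier().fit(X, y)
print("both features:", clf_both.score(X, y))   # 1.0
```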
To sum up: it is OK to build a classifier in a one-dimensional space, but:
Such a problem can be solved without "heavy machinery".
The results should not be used as a basis for feature selection (or, to be more precise, doing so can be very deceptive).

What does dimensionality reduction mean?

What does dimensionality reduction mean exactly?
I searched for its meaning, and I just found that it means the transformation of raw data into a more useful form. So what is the benefit of having data in a more useful form? I mean, how can I use it in practical applications?
Dimensionality Reduction is about converting data of very high dimensionality into data of much lower dimensionality such that each of the lower dimensions conveys much more information.
This is typically done while solving machine learning problems to get better features for a classification or regression task.
Here's a contrived example. Suppose you have a list of 100 movies and 1000 people, and for each person you know whether they like or dislike each of the 100 movies. So for each instance (which in this case means each person) you have a binary vector of length 100 [position i is 0 if that person dislikes the i-th movie, 1 otherwise].
You can perform your machine learning task on these vectors directly, but instead you could decide upon 5 genres of movies and, using the data you already have, figure out whether the person likes or dislikes each entire genre. In this way you reduce your data from a vector of size 100 to a vector of size 5 [position i is 1 if the person likes genre i].
The vector of length 5 can be thought of as a good representative of the vector of length 100, because most people might like movies only in their preferred genres.
However, it's not going to be an exact representative, because there might be cases where a person hates all movies of a genre except one.
The point is that the reduced vector conveys most of the information in the larger one while consuming a lot less space and being faster to compute with.
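A minimal sketch of an automated version of that reduction, with truncated SVD standing in for the hand-picked genre mapping and a randomly generated like/dislike matrix:

```python
# Sketch: compress 100 binary like/dislike columns into 5 latent "taste" dimensions.
# TruncatedSVD is an automatic stand-in for hand-picked genres; the like/dislike
# matrix is randomly generated with a built-in genre structure.
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
genre_of_movie = np.repeat(np.arange(5), 20)            # 100 movies, 5 genres of 20
favourite_genre = rng.integers(0, 5, size=1000)         # one favourite per person
likes_own_genre = genre_of_movie[None, :] == favourite_genre[:, None]
noise = rng.random((1000, 100)) < 0.1                   # occasional other likes
likes = (likes_own_genre | noise).astype(float)         # 1000 people x 100 movies

svd = TruncatedSVD(n_components=5, random_state=0)
tastes = svd.fit_transform(likes)                       # shape (1000, 5)

print(tastes.shape)
print("variance kept:", round(svd.explained_variance_ratio_.sum(), 3))
```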
Your question is a little vague, but there's an interesting statistical technique that may be what you're thinking of, called Principal Component Analysis, which does something similar (and incidentally, plotting the results from it was my first real-world programming task).
It's a neat and clever technique which is remarkably widely applicable. I applied it to similarities between protein amino acid sequences, but I've seen it used to analyse everything from relationships between bacteria to malt whisky.
Consider a graph of some attributes of a collection of things where one has two independent variables. To analyse the relationship between these, one obviously plots in two dimensions and you might see a scatter of points. If you have three variables you can use a 3D graph, but after that one starts to run out of dimensions.
In PCA one might have dozens or even a hundred or more independent factors, all of which need to be plotted on perpendicular axes. Using PCA one does this, then analyses the resulting multidimensional graph to find the set of two or three axes within the graph which contain the largest amount of information. For example, the first Principal Coordinate will be a composite axis (i.e. at some angle through n-dimensional space) which carries the most information when the points are plotted along it. The second axis is perpendicular to this (remember this is n-dimensional space, so there are a lot of perpendiculars) and contains the second largest amount of information, and so on.
Plotting the resulting graph in 2D or 3D will typically give you a visualization of the data which contains a significant amount of the information in the original dataset. It's usual for the technique to be considered valid when the representation captures around 70% of the original variance, enough to visualize relationships with some confidence that would otherwise not be apparent in the raw statistics. Note that the technique requires that all factors have the same weight. Even so, it's an extremely widely applicable method that deserves to be more widely known and is available in most statistical packages (I did my work on an ICL 2700 in 1980, which is about as powerful as an iPhone).
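A minimal sketch of the "keep enough components to explain about 70%" idea, assuming scikit-learn's PCA and its bundled wine dataset:

```python
# Sketch: keep just enough principal components to explain ~70% of the variance,
# then use the first two for visualization.  The dataset is only an example.
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_wine().data)   # give factors equal weight

pca = PCA(n_components=0.70)     # a fraction asks for enough components for 70%
X_reduced = pca.fit_transform(X)

print("components kept:", pca.n_components_)
print("variance explained:", pca.explained_variance_ratio_.sum())
# X_reduced[:, :2] can now be scatter-plotted to inspect the structure.
```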
http://en.wikipedia.org/wiki/Dimension_reduction
Maybe you have heard of PCA (principal component analysis), which is a dimension reduction algorithm.
Others include LDA, matrix-factorization-based methods, etc.
Here's a simple example. You have a lot of text files and each file consists of some words. The files can be classified into two categories. You want to visualize a file as a point in a 2D/3D space so that you can see the distribution clearly. So you need to do dimension reduction to transform a file containing a lot of words into only 2 or 3 dimensions.
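A minimal sketch of that text-to-2D pipeline, assuming TF-IDF followed by truncated SVD (sometimes called LSA) and a handful of invented documents:

```python
# Sketch: turn text files into 2D points for plotting.
# The documents and labels are illustrative assumptions.
import matplotlib.pyplot as plt
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat", "dogs and cats are pets",
    "stock prices fell sharply", "the market rallied on earnings",
]
labels = [0, 0, 1, 1]   # two known categories

tfidf = TfidfVectorizer().fit_transform(docs)                 # high-dimensional
points = TruncatedSVD(n_components=2).fit_transform(tfidf)    # down to 2D

plt.scatter(points[:, 0], points[:, 1], c=labels)
plt.title("Documents projected to 2D")
plt.show()
```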
The dimensionality of a measurement of something is the number of numbers required to describe it. So, for example, the number of numbers needed to describe the location of a point in space is 3 (x, y and z).
Now let's consider the location of a train along a long but winding track through the mountains. At first glance this may appear to be a 3-dimensional problem, requiring longitude, latitude and height measurements to specify. But these 3 dimensions can be reduced to one if you just take the distance travelled along the track from the start instead.
If you were given the task of using a neural network or some statistical technique to predict how far a train could get on a certain quantity of fuel, it would be far easier to work with the 1-dimensional data than the 3-dimensional version.
It's a technique of data mining. Its main benefit is that it allows you to produce a visual representation of many-dimensional data. The human brain is peerless at spotting and analyzing patterns in visual data, but can process a maximum of three dimensions (four if you use time, i.e. animated displays), so any data with more than 3 dimensions needs to be somehow compressed down to 3 (or 2, since plotting data in 3D can often be technically difficult).
BTW, a very simple form of dimensionality reduction is the use of color to represent an additional dimension, for example in heat maps.
Suppose you're building a database of information about a large collection of adult human beings. It's also going to be quite detailed, so we could say that the database is going to have a large number of dimensions.
As a matter of fact, each database record will include a measure of the person's IQ and their shoe size. Now let's pretend that these two characteristics are quite highly correlated. Compared to IQs, shoe sizes are easy to measure, and we want to populate the database with useful data as quickly as possible. One thing we could do would be to forge ahead and record shoe sizes for new database records, postponing the task of collecting IQ data for later. We would still be able to estimate IQs using shoe sizes, because the two measures are correlated.
We would be using a very simple form of practical dimension reduction by leaving IQ out of records initially. Principal components analysis, various forms of factor analysis and other methods are extensions of this simple idea.
