I have a question about how one can go about creating a neural network (or similar) architecture(s) that does the following:
Example
Say I have a model that uses feature 1 and feature 2 to predict a target. That model does not perform well because I am limited to the number of training examples that have both feature 1 and feature 2 populated.
If I were to have another neural network with feature 3 and feature 4, and my goal is to predict the same target, how can I go about combining the learning from both models to make the same target prediction?
This continues for several other similar datasets with different features but a common target.
Explanation
I am only doing this because not every training example has features 1, 2, 3 and 4, so they cannot all be incorporated into a single model. The only thing the datasets have in common is that the models are trying to predict the same target.
Question
What machine learning strategy (not just a neural network) would be
most appropriate for such a problem?
The model you describe is built out of two core sub-models:
1. Many feature-dependent encoders, one for each feature set. Features 1 and 2 can be combined by part of the model into some hidden representation. Features 3 and 4 would be translated into the same hidden representation, but would have a different sub-model with a different set of parameters to fit.
2. A single feature-independent decoder on top of the hidden representation, to predict your target.
When it comes to fitting the model, each encoder can only use the data where the desired feature set is available. It is fitting a representation to those features, so it needs to see them. But the decoder can be used for all of your data. This will capture the distribution of the targets, which is common because your targets are common.
This sort of model is appropriate when you believe that there is a meaningful hidden representation. That is, you believe that your feature sets are measuring similar things but in different ways.
That allows you to keep the encoder small, as it is doing a small translation from one way of measuring to another. Translating from the measurements to the target may still be difficult, but because that logic goes in the common decoder it can benefit from all the training data.
To make it concrete, a good example use case for such a model would be if your features were width, height, volume, and weight. And let's say your target is shipping cost.
It is reasonable to say that the intermediate representation is well described by the concept of size. And it is also reasonable to say that translating from size to cost is an interesting problem in its own right, no matter how you measured size originally.
So the model formulation looks something like this:
# Feature encoders.
size ~ width + height
size ~ volume + weight
# Target decoder.
cost ~ size
Now, above I have been careful to describe the model design without any commitment to the type of model. But you did tag this question as related to neural networks specifically, and I think that's a good choice.
For your simple example, using PyTorch, the model might look something like this:
import torch.nn as nn
import torch.nn.functional as F

class MultiEncoderSingleDecoder(nn.Module):
    def __init__(self, hid_sz):
        super().__init__()
        self.using_encoder = 0
        # One encoder per feature set; each maps its two features to the shared hidden size.
        self.encoders = nn.ModuleList([
            nn.Linear(2, hid_sz),
            nn.Linear(2, hid_sz),
        ])
        # A single decoder shared by all feature sets.
        self.decoder = nn.Linear(hid_sz, 1)

    def set_encoder(self, use_encoder):
        self.using_encoder = use_encoder

    def forward(self, inp):
        encoder = self.encoders[self.using_encoder]
        return self.decoder(F.relu(encoder(inp)))
And then usage might look like so:
model = MultiEncoderSingleDecoder(hid_sz=16)  # pick a hidden size; 16 is just an example
model.set_encoder(0)
# Do some training on the first feature set.
model.set_encoder(1)
# Do some more training on the second feature set.
# ...
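A minimal sketch of what that alternating training might look like, assuming hypothetical tensors x0/y0 (examples with features 1 and 2) and x1/y1 (examples with features 3 and 4) that you have prepared yourself:
import torch

# x0, y0 hold the examples where features 1 and 2 are populated;
# x1, y1 hold the examples where features 3 and 4 are populated (both hypothetical).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()

for epoch in range(100):
    for enc_idx, (x, y) in enumerate([(x0, y0), (x1, y1)]):
        model.set_encoder(enc_idx)      # route through the matching encoder
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)     # the shared decoder sees both feature sets
        loss.backward()
        optimizer.step()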
I have used sklearn to create a basic multiclass naive Bayes text classifier. I have 3 classes and around 800 rows of data. Class A has 564 rows, Class B has 159, and Class C has 82. As you can see, the data is unbalanced among the classes, and I understand that this can affect the accuracy, because Bayes' theorem takes into account the probability of a word occurring in the text given that the text is of a specific class, in order to figure out the probability of the text being of said class given that it contains the word. This was my first go, and I plan to get more data; as you might imagine, Class A was the easiest to get while Class C was the hardest to obtain.
I am, however, confused as to how I should approach creating and improving this model, and how balanced the class datasets should be. If I were to get perfectly proportionate data for each class, say 1000 rows per class, or undersample the data I already have, wouldn't this affect the accuracy as well? In reality, the occurrence of Class C is definitely less likely than A and B, and the real proportions of the classes are somewhat similar (although varying from person to person) to the probability of a text being of said class. And since Bayes' theorem also takes into account the probability of a piece of text being a specific class in order to calculate the probability of a text being a specific class given that it contains a word, wouldn't creating a balanced dataset with an equal number of rows for each class decrease the accuracy, as the probability of a class occurring in production is no longer taken into account (the prior is now essentially constant and the same for all classes since they occur equally)? Although making all classes equal does remove the bias of a word due to unbalanced datasets.
So I am unsure how to approach creating this model efficiently, as I feel that with unbalanced data, common words in Class C are perceived by the model to be more likely to occur in an email of Class A, when in reality they are probably more common in C and the skewed data is creating this bias. On the other hand, making the classes balanced ignores the actual probability of a piece of text being a specific class, although I have no way of calculating a universal probability for each class that is accurate for all individuals (does that mean that making the classes balanced has less of a negative effect on accuracy?). Any guidance is greatly appreciated; I am quite new to this.
TL;DR: Don't undersample/oversample; use text augmentation instead.
Undersampling/oversampling can be helpful in certain situations, but certainly not in your case with only 800 rows of data. Undersampling would make you lose too much valuable data, and oversampling would result in an unreliable outcome. A much better solution would be to augment your data.
There are libraries like Snorkel that allow you to augment textual data by swapping or replacing adjectives, verbs, nouns, etc. with synonyms in a probabilistic way, which can greatly increase your data size. I highly recommend taking a look at it, as it's often used in both academia and industry.
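As a rough illustration of what synonym-based augmentation looks like (this is a generic sketch using NLTK's WordNet, not Snorkel's actual API, and class_c_texts is a hypothetical list of your Class C rows):
import random
from nltk.corpus import wordnet  # requires nltk.download('wordnet')

def augment_with_synonyms(text, swap_prob=0.2):
    # Randomly replace some words with one of their WordNet synonyms.
    out = []
    for w in text.split():
        lemmas = {l.name().replace("_", " ")
                  for s in wordnet.synsets(w) for l in s.lemmas()} - {w}
        if lemmas and random.random() < swap_prob:
            out.append(random.choice(sorted(lemmas)))
        else:
            out.append(w)
    return " ".join(out)

# Generate extra rows for the under-represented classes.
augmented_c = [augment_with_synonyms(t) for t in class_c_texts]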
In regards to your concern with balancing your dataset, there are a few factors that can affect the outcome. Examples include the size of your dataset and overfitting, how distinctive the features are at classifying the samples, the presence of outliers, etc. Just because you have 10k samples of cancer patients and 5k of healthy people doesn't necessarily mean your predictions will follow a 2:1 ratio on a real-life dataset. That's because the model isn't necessarily memorizing the distribution of each class, but rather how the features lead to the prediction of the class.
So in your example, if each class has distinctive words that often distinguish it from the others, you'd want to provide samples with those words in the other classes to make sure you're not overfitting each class on those words.
Hope this helps!
When training from an imbalanced training set, the variances of your classifier parameters grow large. The more skewed your prior class distribution is (A, B, C), the larger this problem becomes.
It is recommended, when possible, to train from a balanced training set (the same number of 'A', 'B' and 'C' cases). Correction to the actual prior class distribution can take place afterwards; see the correction formula for posterior probabilities sketched below.
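A sketch of the usual prior-correction step (the function name and arguments are illustrative, not from a particular library): re-weight the posteriors produced by the balanced-training model by the ratio of the true priors to the training priors, then renormalize.
import numpy as np

def correct_posteriors(balanced_posteriors, true_priors, train_priors):
    # Re-weight posteriors from the balanced-training model by the ratio of
    # the real-world priors to the (artificial) training priors, then renormalize.
    w = np.asarray(true_priors) / np.asarray(train_priors)
    p = np.asarray(balanced_posteriors) * w
    return p / p.sum(axis=-1, keepdims=True)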
Your subsets of cases from the different classes must be selected at random from your complete dataset. This is to avoid any selection bias.
I am new in the field of data science, and I want to know about the key steps to get the properties of any dataset used for machine learning tasks.
What you ask is very general and your request is not well defined, but I'll try to give you a short introduction to get you started.
Knowledge required (as I see it):
Statistics and probability
Basic knowledge in mathematics
Basic knowledge of AI techniques and algorithms
The first step in every research project is to define the problem: what are you trying to do?
For instance:
"I would like to predict if the next person who buys this car is a male or a female"
This kind of problem is a Classification problem, which means, the solution will label the "input" person as a male or a female correctly.
This is called a model: a model is a representation of the real world and its properties, and using ML tools we wish to create it.
We do that by looking into historical data. For example, let's say that out of 1000 male customers and 1000 female customers, 850 males bought car X while the rest bought car Y, and 760 females bought car Y while the rest bought car X.
Now, if I tell you the next customer bought car X, can you tell me their gender?
You are probably thinking it's a male, but there's still a chance for it to be a female; yet there's a higher probability it is in fact a male, since we already know the pattern of males' and females' choices.
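To make that intuition concrete, here is the back-of-the-envelope calculation implied by the counts above (a sketch; the equal group sizes stand in for equal priors):
# Out of 1000 males, 850 bought X; out of 1000 females, 1000 - 760 = 240 bought X.
males_bought_x = 850
females_bought_x = 1000 - 760  # 240

# P(male | bought X) by simple counting, since both groups have 1000 customers:
p_male_given_x = males_bought_x / (males_bought_x + females_bought_x)
print(p_male_given_x)  # ~0.78, so "male" is the more probable guess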
That's basically how it works: given a dataset such as yours, you need to use it in order to predict something from it.
Note: whether your dataset is fit for this or not, and how much information gain you'll get from it, is another story.
Now, each piece of data you can learn from is called a record:
first_name: 'LEROY', last_name: 'JENKINS', age: '25', gender: 'male', car_bought: 'x'
and each property is called a feature.
Some features can be useless to you. In our example, only the gender is important and the rest are useless; learning from useless features may cause your model to learn invalid data.
Also, some records may contain invalid data such as NULLs and missing values, so the first thing you need to do is pre-process your data and get it ready for the learning.
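A minimal sketch of that pre-processing step with pandas (the file name and column names are just the ones implied by the record above):
import pandas as pd

# Hypothetical file with the record layout shown above.
df = pd.read_csv("customers.csv")

# Drop features we won't learn from, and records with missing values.
df = df.drop(columns=["first_name", "last_name"])
df = df.dropna(subset=["gender", "car_bought"])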
Once your data is ready, you can start the training. For that, you'll have to choose the most suitable algorithm for you. I won't go over the algorithms because there are a lot of them and you'll have to gain more knowledge about those, but there are many libraries for them and you should just google it.
I'll give you a short code example of simple neural network usage to get you started, predicting the outcome of a simple mathematical function: F(x) = 2*x
import numpy as np

# prepare the dataset
def prepare_dataset():
    X = np.arange(0.0, 1000.0, 1.0)
    Y = np.empty(shape=0, dtype=float)
    for x in X:
        Y = np.append(Y, float(2 * x))
    return X, Y

X, Y = prepare_dataset()
and a simple neural network using keras:
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(5, input_shape=(1,)))
model.add(Dense(1, activation='linear'))
# compile model
model.compile(loss='mean_absolute_error', optimizer='adam', metrics=['accuracy'])
# train model on the X, Y arrays prepared above (reshaped to one feature per sample)
m = model.fit(X.reshape(-1, 1), Y, epochs=500, batch_size=1)
predicting using the output model:
import pandas

for i in np.arange(2000.0, 2010.0, 1.0):
    df = pandas.DataFrame()
    df['X'] = [i]
    print('f(', i, ') = ', model.predict(df)[0][0])
will output:
f(2000.0) = 4000.0
f(2001.0) = 4002.0
f(2002.0) = 4004.0
.
.
Even if the model never saw these numbers before, it can now predict the output, having learned the pattern from the dataset.
I don't expect you to understand how keras works or what it does, only to give you a feel for what it is like to use an ML algorithm.
I hope that answered your question and it can help you get started yourself.
Your question is too general; you need to be more specific. What do you mean by the properties of the dataset?
Nevertheless I'll try to answer what I understood from your question.
After choosing what kind of problem you have (classification or regression) you'll want to try and visualize your data to get a better sense of what you are doing.
Facets (https://pair-code.github.io/facets/) is an excellent tool to do this. It will help you better comprehend how your data is distributed and maybe give you some extra insight on how to tackle your problem, but how you use it depends on the problem you have.
You should also visualize your correlation matrix to see whether you have features that are heavily correlated and thus you can remove unnecessary features.
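A quick way to get that correlation matrix, sketched with pandas and seaborn (the file name is hypothetical):
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("your_dataset.csv")                # hypothetical dataset
corr = df.select_dtypes(include="number").corr()    # pairwise feature correlations
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()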
I remember when I started working on my first machine learning project, things were overwhelming, but the best tip I can give you is to try to find a step-by-step guide that deals with a problem similar to the one you are facing; I'm sure you'll find plenty. And if you clarify your question further, we can give you more insight.
I have a question with regards to the training and validation of a dataset.
I understand the concept of labels for training data, i.e. y_train. What I don't get is why our testing/validation samples should have labels as well.
I assume that by giving labels to the test samples, we define what they are before putting them through the algorithm right?
Let me put it this way, if I have a dataset of pictures of dogs and cats, and I label them 1 and 2, respectively. Then if I want to throw a picture (dog) to test my model, which was not in my training dataset, why should I label it? If I label it 1, then I'm telling beforehand that it's a dog and if I label it 2, then it is a cat already.
Can I have a testing/validation dataset without label?
The validation dataset is used to fine-tune the parameters of your model, while the test set is used to check the accuracy. Without the labels, how can you claim the correctness of your model? This concept applies to supervised learning, so one needs to have labels for the testing and validation datasets.
The purpose of a test set is, as its name implies, to test the performance of your model on data that were not seen during training. And in order to get this performance indication, you certainly need data with known labels, so that you can compare these labels (ground truth) with the corresponding model predictions and arrive at some quantitative measure (e.g. accuracy) of your model performance - something you certainly cannot do without these labels being available in the test set.
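A minimal sketch of that comparison with scikit-learn (model, X_test and y_test are assumed to already exist):
from sklearn.metrics import accuracy_score

# Labels are never shown to the model at prediction time;
# they are only used afterwards to score the predictions.
y_pred = model.predict(X_test)
print(accuracy_score(y_test, y_pred))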
if I want to throw a picture (dog) to test my model, why should I label it? If I label it 1, then I'm telling beforehand that it's a dog and if I label it 2, then it is a cat already.
You are using the term "test" very loosely here - this is not its meaning in the context of a test set (the context I just described above). Notice also that the fact that the test labels are available does not mean that they are being used by the model during prediction (they are certainly not - they are only used for comparison with the model predictions, as described above). Plus, you are referring to a very specific problem where the answer (cat/dog) is obvious to a human observer - try using the same rationale e.g. in a genomics problem, or in one that requests numeric predictions for, say, house prices, and you'll see that the situation is not that simple and straightforward (could you possibly name the price of a house by just looking at a row of numbers?)...
The same applies for a validation set, only the objective here is different (i.e. not model assessment, but model tuning).
Admittedly, some people use the term "test data" to mean in general any unseen data, but this is not correct; after you have built & assessed your model using your training, validation, and test sets, you deploy it, feeding it with new and obviously unseen data, for which it is certainly not expected to already know the labels...
There are literally dozens of online tutorials on the subject, and SO is arguably not the most appropriate forum for this kind of question - I just hope I have given you a good-enough first general idea...
For a multiclass problem, should the data be balanced for machine learning algorithms such as random forests and random ferns, or is it OK for it to be imbalanced to a certain extent?
The issue with imbalanced classes arises when the disproportion alters the separability of the class instances. But this does not happen in every imbalanced dataset: sometimes the more data you have from one class, the better you can differentiate the scarce data from it, since it lets you find more easily which features are meaningful for creating a discriminating plane (even if you are not using discriminant analysis, the point is to classify/separate the instances according to their classes).
For example, I can remember the KDDCup2004 protein classification task, in which one class had 99.1% of the instances in the training set, but if you tried to use undersampling methods to alleviate the imbalance you would only get worse results. That means that the large amount of data from the first class helped define the data in the smaller one.
Concerning random forests, and decision trees in general, they work by selecting, at each step, the most promising feature that can partition the set into two (or more) class-meaningful subsets. Having inherently more data about one class does not bias this partitioning by default (i.e., always), but only when the imbalance is not representative of the classes' real distributions.
So I suggest that you first run a multivariate analysis to try to gauge the extent of imbalance among classes in your dataset, and then run a series of experiments with different undersampling ratios if you are still in doubt.
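One way such an experiment could look, sketched with scikit-learn and plain NumPy undersampling (X, y and the majority label are assumed to be yours; the ratios and the scoring choice are just examples):
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def undersample_majority(X, y, majority_label, keep_ratio):
    # Keep only a fraction of the majority class and all of the other classes.
    maj = np.where(y == majority_label)[0]
    keep = rng.choice(maj, size=int(len(maj) * keep_ratio), replace=False)
    idx = np.concatenate([keep, np.where(y != majority_label)[0]])
    return X[idx], y[idx]

for ratio in (1.0, 0.5, 0.25):
    Xs, ys = undersample_majority(X, y, majority_label=0, keep_ratio=ratio)
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    print(ratio, cross_val_score(clf, Xs, ys, cv=5, scoring="f1_macro").mean())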
I have used random forests in my task before. Although the data don't need to be balanced, if the positive samples are too few, the pattern of the data may be drowned in the noise. Most classification methods (even random forests and AdaBoost) have this flaw more or less. Oversampling may be a good idea to deal with this problem.
Perhaps the paper Logistic Regression in Rare Events Data is useful for this sort of problem, although its topic is logistic regression.
In a particular application I was in need of machine learning (I know the things I studied in my undergraduate course). I used Support Vector Machines and got the problem solved. It's working fine.
Now I need to improve the system. The problems here are:
1. I get additional training examples every week. Right now the system starts training from scratch with the updated examples (old examples + new examples). I want to make it incremental learning: using previous knowledge (instead of previous examples) together with new examples to get a new model (new knowledge).
2. Right now my training examples have 3 classes, so every training example is fitted into one of these 3 classes. I want the functionality of an "Unknown" class: anything that doesn't fit these 3 classes must be marked as "unknown". But I can't treat "Unknown" as a new class and provide examples for it too.
3. Assuming the "unknown" class is implemented: when the class is "unknown", the user of the application inputs what he thinks the class might be. Now I need to incorporate the user input into the learning, and I have no idea how to do this either. Would it make any difference if the user inputs a new class (i.e. a class that is not already in the training set)?
Do I need to choose a new algorithm, or can Support Vector Machines do this?
PS: I'm using libsvm implementation for SVM.
I just wrote my Answer using the same organization as your Question (1., 2., 3).
Can SVMs do this--i.e., incremental learning? Multi-Layer Perceptrons of course can--because the subsequent training instances don't affect the basic network architecture; they'll just cause adjustments in the values of the weight matrices. But SVMs? It seems to me that (in theory) one additional training instance could change the selection of the support vectors. But again, I don't know.
I think you can solve this problem quite easily by configuring LIBSVM in one-against-many--i.e., as a one-class classifier. SVMs are one-class classifiers; application of an SVM for multi-class means that it has been coded to perform multiple, step-wise one-against-many classifications, but again the algorithm is trained (and tested) one class at a time. If you do this, then what's left after step-wise execution against the test set is "unknown"--in other words, whatever data is not classified after performing multiple, sequential one-class classifications is by definition in that 'unknown' class.
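A rough sketch of that idea using scikit-learn's OneClassSVM (one detector per known class; anything rejected by all of them falls into "unknown"; the variable names and the nu value are illustrative):
import numpy as np
from sklearn.svm import OneClassSVM

# X_train, y_train are assumed to exist; fit one one-class model per known class.
detectors = {c: OneClassSVM(gamma="scale", nu=0.05).fit(X_train[y_train == c])
             for c in np.unique(y_train)}

def classify_with_unknown(x):
    # A sample accepted by none of the per-class detectors is labelled "unknown".
    accepted = [c for c, det in detectors.items()
                if det.predict(x.reshape(1, -1))[0] == 1]
    return accepted[0] if accepted else "unknown"   # naive tie-break if several accept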
Why not make the user's guess a feature (i.e., just another independent variable)? The only other option is to make it the class label itself, and you don't want that. So you would, for instance, add a column to your data matrix "user class guess", and just populate it with some value most likely to have no effect for those data points not in the 'unknown' category and therefore for which the user will not offer a guess--this value could be '0' or '1', but really it depends on how you have your data scaled and normalized.
Your first item will likely be the most difficult, since there are essentially no good incremental SVM implementations in existence.
A few months ago, I also researched online or incremental SVM algorithms. Unfortunately, the current state of implementations is quite sparse. All I found was a Matlab example, OnlineSVR (a thesis project only implementing regression support), and SVMHeavy (only binary class support).
I haven't used any of them personally. They all appear to be at the "research toy" stage. I couldn't even get SVMHeavy to compile.
For now, you can probably get away with doing periodic batch training to incorporate updates. I also use LibSVM, and it's quite fast, so it should be a good substitute until a proper incremental version is implemented.
I also don't think SVM's can model the concept of an "unknown" sample by default. They typically work as a series of boolean classifiers, so a sample ends up as positively being classified as something, even if that sample is drastically different from anything seen previously. A possible workaround would be to model the ranges of your features, and randomly generate samples that exist outside of these ranges, and then add these to your training set.
For example, if you have an attribute called "color", which has a minimum value of 4 and a maximum value of 123, then you could add these to your training set
[({'color':3},'unknown'),({'color':125},'unknown')]
to give your SVM an idea of what an "unknown" color means.
There are algorithms to train an SVM incrementally, but I don't think libSVM implements this. I think you should consider whether you really need this feature. I see no problem with your current approach, unless the training process is really too slow. If it is, could you retrain in batches (i.e. after every 100 new examples)?
You can get libSVM to produce probabilities of class membership. I think this can be done for multiclass classification, but I'm not entirely sure about that. You will need to decide some threshold at which the classification is not certain enough and then output 'Unknown'. I suppose something like setting a threshold on the difference between the most likely and second most likely class would achieve this.
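Sketched with scikit-learn's SVC, which wraps libsvm and exposes Platt-scaled probabilities (the 0.2 margin threshold is just an illustrative choice):
import numpy as np
from sklearn.svm import SVC

# X_train, y_train are assumed to exist; probability=True enables probability estimates.
clf = SVC(probability=True).fit(X_train, y_train)

def predict_or_unknown(x, margin=0.2):
    probs = clf.predict_proba(x.reshape(1, -1))[0]
    top_two = np.sort(probs)[-2:]
    if top_two[1] - top_two[0] < margin:    # most likely vs. second most likely
        return "unknown"
    return clf.classes_[np.argmax(probs)]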
I think libSVM scales to any number of new classes. The accuracy of your model may well suffer by adding new classes, however.
Even though this question is probably out of date, I feel obliged to give some additional thoughts.
Since your first question has been answered by others (there is no production-ready SVM which implements incremental learning, even though it is possible), I will skip it. ;)
Adding 'Unknown' as a class is not a good idea. Depending on its use, the reasons are different.
If you are using the 'Unknown' class as a tag for "this instance has not been classified, but belongs to one of the known classes", then your SVM is in deep trouble. The reason is that libsvm builds several binary classifiers and combines them. So if you have three classes - let's say A, B and C - the SVM builds the first binary classifier by splitting the training examples into "classified as A" and "any other class". The latter will obviously contain all examples from the 'Unknown' class. When trying to build a hyperplane, examples in 'Unknown' (which really belong to the class 'A') will probably cause the SVM to build a hyperplane with a very small margin, and it will poorly recognize future instances of A, i.e. its generalization performance will diminish. That's due to the fact that the SVM will try to build a hyperplane which separates most instances of A (those officially labeled as 'A') onto one side of the hyperplane and some instances (those officially labeled as 'Unknown') onto the other side.
Another problem occurs if you are using the 'Unknown' class to store all examples whose class is not yet known to the SVM. For example, the SVM knows the classes A, B and C, but you recently got example data for two new classes D and E. Since these examples are not classified and the new classes are not known to the SVM, you may want to temporarily store them in 'Unknown'. In that case the 'Unknown' class may cause trouble, since it possibly contains examples with enormous variation in the values of its features. That will make it very hard to create good separating hyperplanes, and therefore the resulting classifier will poorly recognize new instances of D or E as 'Unknown'. Probably the classification of new instances belonging to A, B or C will be hindered as well.
To sum up: Introducing an 'Unknown' class which contains examples of known classes or examples of several new classes will result in a poor classifier. I think it's best to ignore all unclassified instances when training the classifier.
I would recommend, that you solve this issue outside the classification algorithm. I was asked for this feature myself and implemented a single webpage, which shows an image of the object in question and a button for each known class. If the object in question belongs to a class which is not known yet, the user can fill out another form to add a new class. If he goes back to the classification page, another button for that class will magically appear. After the instances have been classified, they can be used for training the classifier. (I used a database to store the known classes and reference which example belongs to which class. I implemented an export function to make the data SVM-ready.)