Confusion about test & validation set labels in machine learning [closed] - machine-learning

I have a question regarding the training and validation of a dataset.
I understand the concept of labels for training data, i.e. y_train. What I don't get is why our testing/validation samples should have labels as well.
I assume that by giving labels to the test samples, we define what they are before putting them through the algorithm, right?
Let me put it this way: if I have a dataset of pictures of dogs and cats, and I label them 1 and 2, respectively, then if I want to throw a picture (dog) at my model that was not in my training dataset, why should I label it? If I label it 1, I'm saying beforehand that it's a dog, and if I label it 2, it's already a cat.
Can I have a testing/validation dataset without labels?

The validation dataset is used to fine-tune the parameters of your model, while the test set is used to check its accuracy. Without the labels, how can you claim the correctness of your model? This applies to supervised learning, so one needs labels for both the testing and validation datasets.

The purpose of a test set is, as its name implies, to test the performance of your model on data that were not seen during training. And in order to get this performance indication, you certainly need data with known labels, so you can compare these labels (ground truth) with the corresponding model predictions and arrive at some quantitative measure (e.g. accuracy) of your model's performance - something you certainly cannot do without these labels being available in the test set.
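For instance, here is a minimal sketch with scikit-learn of how the test labels enter the picture (clf, X_test, and y_test are hypothetical stand-ins for your fitted model and held-out data):
from sklearn.metrics import accuracy_score

y_pred = clf.predict(X_test)          # the labels are NOT used here
acc = accuracy_score(y_test, y_pred)  # the labels are used only here, for comparison
print("test accuracy:", round(acc, 3))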
if I want to throw a picture (dog) to test my model, why should I label it? If I label it 1, then I'm telling beforehand that it's a dog and if I label it 2, then it is a cat already.
You are using the term "test" very loosely here - this is not its meaning in the context of a test set (which I just described above). Notice also that the fact that the test labels are available does not mean that they are used by the model during prediction (they are certainly not - they are only used for comparison with the model predictions, as described above). Plus, you are referring to a very specific problem where the answer (cat/dog) is obvious to a human observer - try using the same rationale in, say, a genomics problem, or in one that requires numeric predictions for, say, house prices, and you'll see that the situation is not that simple and straightforward (could you possibly name the price of a house just by looking at a row of numbers?)...
The same applies for a validation set, only the objective here is different (i.e. not model assessment, but model tuning).
Admittedly, some people use the term "test data" to mean any unseen data in general, but this is not correct; after you have built & assessed your model using your training, validation, and test sets, you deploy it by feeding it new and obviously unseen data, for which it is certainly not expected to already know the labels...
There are literally dozens of online tutorials on the subject, and SO is arguably not the most appropriate forum for this kind of question - I just hope I have given you a good-enough first general idea...

Related

Combining neural networks with different features but same target [closed]

I have a question about how one can create a neural network (or similar) architecture that does the following:
Example
Say I have a model that uses feature 1 and feature 2 to predict a target. That model does not perform well because I am limited to the number of training examples that have both feature 1 and feature 2 populated.
If I were to have another neural network with feature 3 and feature 4, and my goal is to predict the same target, how can I go about combining the learning from both models to make the same target prediction?
This continues for several other similar datasets with different features but a common target.
Explanation
I am only doing this because not every training example has features 1, 2, 3, and 4, and therefore they cannot all be incorporated into a single model. The only thing in common is that the models are trying to predict the same target.
Question
What machine learning strategy (not just a neural network) would be most appropriate for such a problem?
The model you describe is built out of 2 core sub-models.
1. Many feature-dependent encoders, one for each feature set. Features 1 and 2 can be combined by part of the model into some hidden representation. Features 3 and 4 would be translated into the same hidden representation, but through a different sub-model with a different set of parameters to fit.
2. A single feature-independent decoder on top of the hidden representation, to predict your target.
When it comes to fitting the model, each encoder can only use the data where the desired feature set is available. It is fitting a representation to those features, so it needs to see them. But the decoder can be used for all of your data. This will capture the distribution of the targets, which is common because your targets are common.
This sort of model is appropriate when you believe that there is a meaningful hidden representation. That is, you believe that your feature sets are measuring similar things but in different ways.
That allows you to keep the encoder small, as it is doing a small translation from one way of measuring to another. Translating from the measurements to the target may still be difficult, but because that logic goes in the common decoder it can benefit from all the training data.
To make it concrete, a good example use case for such a model would be if your features were width, height, volume, and weight, and your target were shipping cost.
It is reasonable to say that the intermediate representation is well described by the concept of size. And it is also reasonable to say that translating from size to cost is an interesting problem in its own right, no matter how you measured size originally.
So the model formulation looks something like this:
# Feature encoders.
size ~ width + height
size ~ volume + weight
# Target decoder.
cost ~ size
Now, above I have been careful to describe the model design without any commitment to the type of model. But you did tag this question as related neural networks specifically, and I think that's a good choice.
For your simple example, using PyTorch, the model might look something like this:
import torch
import torch.nn.functional as F

class MultiEncoderSingleDecoder(torch.nn.Module):
    def __init__(self, hid_sz):
        super().__init__()
        self.using_encoder = 0
        # One encoder per feature set, both mapping into the same hidden size.
        self.encoders = torch.nn.ModuleList([
            torch.nn.Linear(2, hid_sz),
            torch.nn.Linear(2, hid_sz),
        ])
        # A single shared decoder from the hidden representation to the target.
        self.decoder = torch.nn.Linear(hid_sz, 1)

    def set_encoder(self, use_encoder):
        self.using_encoder = use_encoder

    def forward(self, inp):
        encoder = self.encoders[self.using_encoder]
        return self.decoder(F.relu(encoder(inp)))
And then usage might look like so:
model = MultiEncoderSingleDecoder(hid_sz=8)  # hid_sz is required; 8 is an arbitrary choice
model.set_encoder(0)
# Do some training on the first feature set.
model.set_encoder(1)
# Do some more training on the second feature set.
# ...
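To flesh out those training comments, here is one possible fitting loop as a hedged sketch (x1/y1 and x2/y2 are hypothetical stand-ins for the rows where each feature set is populated):
import torch

opt = torch.optim.Adam(model.parameters())
loss_fn = torch.nn.MSELoss()

# x1/y1: rows where features 1 and 2 exist; x2/y2: rows with features 3 and 4.
x1, y1 = torch.randn(100, 2), torch.randn(100, 1)
x2, y2 = torch.randn(100, 2), torch.randn(100, 1)

for epoch in range(50):
    for enc, (x, y) in enumerate([(x1, y1), (x2, y2)]):
        model.set_encoder(enc)        # route through the matching encoder
        opt.zero_grad()
        loss = loss_fn(model(x), y)   # the shared decoder sees all of the data
        loss.backward()
        opt.step()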

What steps should we take to analyze a dataset? [closed]

I am new to the field of data science, and I want to know the key steps for understanding the properties of any dataset used for machine learning tasks.
What you ask is very general and your request is not well defined, but I'll try to give you a short introduction to get you started.
Knowledge required (as I see it):
Statistics and probability
Basic knowledge in mathematics
Basic knowledge of AI techniques and algorithms
The first step in every research project is to define the problem: what are you trying to do?
For instance:
"I would like to predict if the next person who buys this car is a male or a female"
This kind of problem is a classification problem, which means the solution will correctly label the "input" person as a male or a female.
This is called a model; a model is a representation of the real world and its properties, and using ML tools we wish to create one.
We do that by looking at historical data. For example, let's say that out of 1000 male customers and 1000 female customers, 850 males bought car X while the rest bought car Y, and 760 females bought car Y while the rest bought car X.
Now, if I tell you the next customer bought car X, can you tell me their gender?
You are probably thinking it's a male, but there's still a chance it could be a female; yet there's a higher probability it is in fact a male, since we already know the pattern of males' and females' choices.
That's basically how it works: given a dataset such as yours, you need to use it in order to predict something from it.
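To put a number on that intuition, a quick count using the figures above:
# From the numbers above: 850 of 1000 males bought car X,
# and 1000 - 760 = 240 of 1000 females bought car X.
males_x, females_x = 850, 240

# P(male | bought X) by simple counting (equal numbers of each gender):
p_male_given_x = males_x / (males_x + females_x)
print(p_male_given_x)  # ~0.78, so "male" is the better guess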
Note: whether your dataset is fit for this or not, or how much information gain you'll get from it, is another story.
Now, each piece of data you can learn from is called a record:
first_name: 'LEROY', last_name: 'JENKINS', age: '25', gender: 'male' car_bought: 'x'
and each property is called a feature.
Some features can be useless to you; in our example, only the gender is important and the rest are useless, and learning from useless features may cause your model to learn invalid patterns.
Also, some records may contain invalid data such as NULLs and missing values, so the first thing you need to do is pre-process your data and get it ready for learning.
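As a hedged illustration of that pre-processing step with pandas (the file name is hypothetical; the columns follow the record example above):
import pandas as pd

df = pd.read_csv('customers.csv')                         # hypothetical file name
df = df.dropna()                                          # drop records with NULLs / missing data
df = df.drop(columns=['first_name', 'last_name', 'age'])  # drop the useless features
X = df[['gender']]                                        # the informative feature
y = df['car_bought']                                      # the target to predict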
Once your data is ready, you can start the training. For that, you'll have to choose the most suitable algorithm for you; I won't go over the algorithms because there are a lot and you'll have to gain more knowledge about them, but there are many libraries available and you can just google them.
I'll give you a short code example using a simple neural network, to get you started on predicting the outcome of a simple mathematical function: F(x) = 2*x
# prepare the dataset
import numpy as np

X = np.arange(0.0, 1000.0, 1.0)
Y = np.empty(shape=0, dtype=float)
for x in X:
    Y = np.append(Y, float(2 * x))
and a simple neural network using Keras:
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(5, input_shape=(1,)))
model.add(Dense(1, activation='linear'))
# compile model (accuracy is not a meaningful metric for regression, so it is omitted)
model.compile(loss='mean_absolute_error', optimizer='adam')
# train model
m = model.fit(X, Y, epochs=500, batch_size=1)
Predicting using the trained model:
import pandas

for i in np.arange(2000.0, 2010.0, 1.0):
    df = pandas.DataFrame()
    df['X'] = [i]
    print('f(', i, ') = ', model.predict(df)[0][0])
will output (approximately):
f(2000.0) = 4000.0
f(2001.0) = 4002.0
f(2002.0) = 4004.0
...
Even though the model never saw these numbers before, it can now predict the output, having learned the pattern from the dataset.
I don't expect you to understand how Keras works or what it does; this is only to give you a feel for what using an ML algorithm is like.
I hope that answered your question and helps you get started.
Your question is too general; you need to be more specific. What do you mean by the properties of the dataset?
Nevertheless, I'll try to answer what I understood from your question.
After choosing what kind of problem you have (classification or regression), you'll want to visualize your data to get a better sense of what you are doing.
Facets is an excellent tool for this: https://pair-code.github.io/facets/ . It will help you better understand how your data is distributed and maybe give you some extra insight into how to tackle your problem, but how you use it depends on the problem you have.
You should also visualize your correlation matrix to see whether you have features that are heavily correlated, so you can remove unnecessary features; a small sketch follows below.
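For example, a minimal pandas sketch of that correlation check (the file name is a hypothetical stand-in for your own data):
import pandas as pd

df = pd.read_csv('my_data.csv')     # hypothetical file with your features
corr = df.corr(numeric_only=True)   # pairwise feature correlations
print(corr)

# Flag heavily correlated pairs (the 0.9 threshold is a judgment call):
pairs = corr.abs().stack()
print(pairs[(pairs > 0.9) & (pairs < 1.0)])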
I remember that when I started working on my first machine learning project, things were overwhelming. The best tip I can give you is to find a step-by-step guide that deals with a problem similar to yours - I'm sure you'll find plenty. And if you clarify your question more, we can give you more insight.

Machine Learning Two class classification [closed]

I built this ML model in Azure ML studio with 4 features including a date column.
I'm trying to predict whether the price is going to be higher tomorrow than it is today. Higher = 1, not higher = 0.
It is a Two class neural network (with a Tune model hyperparameters module).
When I test it, I expect to get an answer between 0 and 1, which I do. The problem comes when I change the feature from 1 to 0 and get an almost identical answer.
I thought that if a 1 gives a score probability of 0.6,
then a 0 (with the same features) should give a score of 0.4.
A snapshot of the data (yes I know I need more)
The model is trained/tuned on the "Over5" feature, and I hope to get an answer from the Two class neural network module in the range between 0 and 1.
The Score module also produces results between 0 and 1. Everything looks to be correct.
I changed the normalization method (after a recommendation from a commenter) but it does not change the output much.
Everything seems to be in order, but my goal is to get a prediction of the likelihood that a day will finish "Over5" and result in a 1.
When I test the model using a "1" in the Over5 column I get a prediction of 0.55... Then I tested the model with the same settings, only changing the 1 to a 0, and I still got the same output, 0.55...
I do not understand why, since the model is trained/tuned on the Over5 feature. Shouldn't changing the input in that column produce different results?
Outputs of a neural network are not (generally) probabilities, so that could be a reason you're not getting the "1 - P" result you're looking for.
Now, if it were simple logistic regression, you'd get probabilities as output, but I'm assuming what you said is true and you're using a super-simple neural network.
Also, what you may be changing is the bias "feature", which could also lead to the model giving you the same result after training. Honestly, there's too little information in this post to say for certain what's going on. I'd advise you to try normalizing your features and trying again.
EDIT: Do you know if your neural network actually has 2 output nodes, or if it's just one output node? If there are two, then the raw output doesn't matter quite as much as which node had the higher output. If it's just one, I'd look into thresholding it somewhere (like >0.5 means the price will rise, but <=0.5 means the price will fall, or however you want to threshold it.) Some systems used in applications where false positives are more acceptable than false negatives threshold at much lower values, like 0.2.
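As a rough sketch of that single-output thresholding idea (score stands for the raw output of the network for one example):
def classify(score, threshold=0.5):
    # Predict "higher tomorrow" (1) only when the single raw output
    # clears the threshold; lower the threshold when false negatives
    # are costlier than false positives.
    return 1 if score > threshold else 0

print(classify(0.55))       # -> 1 with the default 0.5 threshold
print(classify(0.55, 0.7))  # -> 0 with a stricter threshold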

Handling high cardinality features with supervised ratio and weight of evidence [closed]

Say a dataset has a categorical feature with high cardinality, say zip codes or cities. Encoding this feature would give hundreds of feature columns. Different approaches such as supervised ratio and weight of evidence (WOE) seem to give better performance.
The question is, the supervised ratio and WOE are to be calculated on the training set, right? So I take the training set, process it, calculate the SR and WOE, update the training set with the new values, and keep the calculated values to be used on the test set as well. But what happens if the test set has zip codes that were not in the training set, so there is no SR or WOE value to use? (Practically this is possible if the training dataset does not cover all the possible zip codes, or if there are only one or two records from certain zip codes, which might fall into either the training set or the test set.)
(The same will happen with the encoding approach.)
I am more interested in the question: are SR and/or WOE the recommended way to handle a feature with high cardinality? If so, what do we do when there are values in the test set that were not in the training set?
If not, what are the recommended ways to handle high cardinality features, and which algorithms are more robust to them? Thank you.
This is a well-known problem when applying value-wise transformations to a categorical feature. The most common workaround is to have a set of rules to translate unseen values into values known to your training set.
This can be just a single 'NA' value (or 'others', as another answer suggests), or something more elaborate (e.g. in your example, you could map unseen zip codes to the closest known one in the training set); see the sketch after this answer.
Another possible solution in some scenarios is to have the model refuse to make a prediction in those cases, and just return an error.
For your second question, there is not really a recommended way of encoding high cardinality features (there are many methods, and some may work better than others depending on the other features, the target variable, etc.); what we can recommend is that you implement a few and experiment to see which one is most effective for your problem. You can consider the preprocessing method as just another hyperparameter of your learning algorithm.
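A minimal pandas sketch of the "translate unseen values" rule described above (the WOE values and zip codes are illustrative, not real):
import pandas as pd

# woe_map: WOE per zip code, computed on the training set only.
woe_map = {'10001': 0.42, '10002': -0.13}   # illustrative values
fallback = 0.0                              # e.g. the WOE of a catch-all 'NA' bucket

test = pd.DataFrame({'zipcode': ['10001', '99999']})   # '99999' was never seen
test['zip_woe'] = test['zipcode'].map(woe_map).fillna(fallback)
print(test)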
That's a great question, thanks for asking!
When approaching this kind of problem of handling a feature with high cardinality, like zip codes, I keep just the most frequent values in my training set and put all the others into a new category, "others"; then I calculate their WOE or any other metric.
If some unseen zip code is found in the test set, it falls into the 'others' category. In general, this approach works well in practice; a small sketch follows below.
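A hedged pandas sketch of this approach (train, 'zipcode', and a binary 0/1 'target' column are assumptions about your data):
import numpy as np
import pandas as pd

# Keep only the most frequent zip codes; everything else becomes 'others'.
top = train['zipcode'].value_counts().nlargest(50).index
train['zip_grp'] = train['zipcode'].where(train['zipcode'].isin(top), 'others')

# WOE per group: log of (share of events) over (share of non-events).
events = train.groupby('zip_grp')['target'].sum()
nonevents = train.groupby('zip_grp')['target'].count() - events
woe = np.log((events / events.sum()) / (nonevents / nonevents.sum()))
print(woe)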
I hope this naive solution can help you!

Using Naive Bayes Classification to Identify a Twitter User's Gender [closed]

I have become part of a project at school that has been a lot of fun so far, and it just got a little more interesting. I have roughly 600,000 tweets in my possession (each contains screen name, geo location, text, etc.), and my goal is to try to classify each user as either male or female. Using Twitter4J I can get the user's full name, number of friends, re-tweets, etc. So I was wondering if a combination of looking at a user's name and doing text analysis would be a possible answer. I was originally thinking I could make this a rule-based classifier where I first look at the user's name, then analyze their text, and attempt to come to a conclusion of M or F. I'm guessing I would have trouble using something such as Naive Bayes since I don't have the real truth values?
Also, with the names, I would be checking some kind of dictionary to interpret whether the name is male or female. I know there are cases where it's hard to tell, but that's why I'd be looking at their tweet texts as well. I also forgot to mention: of these 600,000 tweets, I have at minimum two tweets per user available to me.
Any ideas or input on classifying a user's gender would be greatly appreciated! I don't have a ton of experience in this area and I'm looking to learn anything I can get my hands on.
I'm guessing I would have trouble using something such as naive bayes since I don't have the real truth values?
Any supervised learning algorithm, such as Naive Bayes, requires preparing a training set. Without the actual gender for some of the data, you cannot build such a model. On the other hand, if you come up with some rule-based system (like the one based on users' names), you can try a semi-supervised approach. Say your rule-based classifier RC can answer "Male", "Female", or "Do not know"; you can then create a labelling of your data X using RC in a natural way:
X_m = { x in X : RC(x)="Male" }
X_f = { x in X : RC(x)="Female" }
Once you have done this, you can create a training set for the supervised learning model using all your data except what was used to create RC - in this case, the users' names (I assume that RC answers "Male" or "Female" iff it is entirely "sure" about it). As a result, you will train a classifier that tries to generalize the concept of gender from all the additional data (like words used, location, etc.). Let's call it SC. After that, you can simply create a "complex" classifier:
C(x) = "Male" iff RC(x)= Male" or
(RC(x)="Do not know" && SC(x)="Male")
"Female" iff RC(x)= Female" or
(RC(x)="Do not know" && SC(x)="Female")
This way you can, on the one hand, use the most valuable information (the user name) in a rule-based way, while at the same time exploiting the power of supervised learning for the "hard cases", despite not having the ground truth in the first place. A rough sketch follows below.
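In code, that combined classifier C might look like this minimal sketch (rc and sc are hypothetical stand-ins for your rule-based and trained classifiers):
def combined_classifier(x, rc, sc):
    # rc(x): the rule-based classifier, returning "Male", "Female",
    #        or "Do not know" from the user's name.
    # sc(x): the supervised model trained on the RC-labelled data.
    rule_answer = rc(x)
    if rule_answer != "Do not know":
        return rule_answer   # trust the high-precision rule when it fires
    return sc(x)             # fall back to the learned model otherwise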
You need to develop a vocabulary linking name and gender.
Then you have to define features for each tweet.
Finally, you can use Weka (Java), MATLAB, or Python to build the learning set.
Main issues:
Your language? Identifying sex from a name is easy in Italian (-a female, -o male [except Andrea, Luca]), or have a look here: Does anyone know of a good library for mapping a person's name to his or her gender?
The second issue is a bit more complicated: you need a semantic dictionary, or you can analyse only the destination of the tweet (#to) or the presence of a URL or image.
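As a toy sketch of the Italian suffix heuristic mentioned above (the exception list is deliberately tiny and illustrative):
def guess_gender_it(first_name):
    # Italian heuristic: names ending in -a are usually female and -o
    # usually male, with a small exception list (e.g. Andrea, Luca).
    exceptions = {'andrea': 'Male', 'luca': 'Male'}
    name = first_name.strip().lower()
    if name in exceptions:
        return exceptions[name]
    if name.endswith('a'):
        return 'Female'
    if name.endswith('o'):
        return 'Male'
    return 'Do not know'

print(guess_gender_it('Maria'))   # -> Female
print(guess_gender_it('Andrea'))  # -> Male (exception)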
