Machine Learning Two class classification [closed]

I built this ML model in Azure ML studio with 4 features including a date column.
I'm trying to predict whether the price will be higher tomorrow than it is today (higher = 1, not higher = 0).
It is a Two-Class Neural Network (with a Tune Model Hyperparameters module).
When I test it I expect to get an answer between 0 and 1, which I do. The problem comes when I change the feature from 1 to 0 and get almost the same answer.
I thought that if a 1 gives a scored probability of 0.6,
then a 0 (with the same features) should give a score of 0.4.
A snapshot of the data (yes I know I need more)
The model is trained/tuned on the "Over5" feature, and I hope to get an answer from the Two-Class Neural Network module in the range between 0 and 1.
The Score module also produces results between 0 and 1. Everything looks correct.
I changed the normalization method (after a recommendation from a commenter), but it does not change the output much.
Everything seems to be in order, but my goal is to get a prediction of the likelihood that a day will finish "Over5" and result in a 1.
When I test the model using a "1" in the Over5 column I get a prediction of 0.55...; then I tested the model with the same settings, only changing the 1 to a 0, and I still get the same output, 0.55...
I do not understand why this is, since the model is trained/tuned on the Over5 feature. Shouldn't changing the input in that column produce different results?

Outputs of a neural network are not probabilities (generally), so that could be a reason that you're not getting the "1 - P" result you're looking for.
Now, if it's simple logistic regression, you'd get probabilities as output, but I'm assuming what you said is true and you're using a super-simple neural network.
Also, what you may be changing is the bias "feature", which could also lead to the model giving you the same result after training. Honestly, there's too little information in this post to say for certain what's going on. I'd advise you to try normalizing your features and trying again.
EDIT: Do you know if your neural network actually has 2 output nodes, or if it's just one output node? If there are two, then the raw output doesn't matter quite as much as which node had the higher output. If it's just one, I'd look into thresholding it somewhere (like >0.5 means the price will rise, but <=0.5 means the price will fall, or however you want to threshold it.) Some systems used in applications where false positives are more acceptable than false negatives threshold at much lower values, like 0.2.
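As a minimal sketch of that thresholding idea in Python (assuming you export the scored dataset from the experiment; the file name is made up, and "Scored Probabilities" is the column name Azure ML Studio typically adds, so treat both as assumptions):
import pandas as pd

scored = pd.read_csv("scored_output.csv")        # hypothetical export of the Score Model output
threshold = 0.5                                  # lower it (e.g. 0.2) if false positives are cheaper than false negatives
scored["predicted_label"] = (scored["Scored Probabilities"] >= threshold).astype(int)
print(scored["predicted_label"].value_counts())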

Related

Best practices to run a random forest model as fast as possible [closed]

I want to run a random forest classifier model. My data set is pretty big, with 1 million rows and 300 columns. Of course, I'd prefer not to run the model for 3 days non-stop, so I was wondering if there are some good practices for finding the optimal trade-off between running time and prediction quality.
Here are some examples of what I was thinking:
Can I use a random subsample of x rows to tune the parameters and then use those parameters for the model trained on all the data? (If yes, how do I find the best value for x?)
Is there a way to know at what point it is useless to keep adding more data because the prediction will stop improving? (i.e., what is the minimum number of rows that will give me the best results for the running time?)
How can I estimate the running time of the model? With 4000 rows the model takes 4 min; with 8000 it takes 10 min. Is the running time exponential, or is it more or less linear, so that I could expect about 1280 min of running time with 1 million rows?
Tuning on a random subsample and then reusing those parameters on the full data rarely works well, as the small subsample may not be representative of the full data.
About the amount of the data vs the model quality: try using learning curves from sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.learning_curve.html
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

estimator = RandomForestClassifier(n_jobs=-1)   # your model
train_sizes, train_scores, test_scores, fit_times, _ = learning_curve(
    estimator,
    X,                                      # your feature matrix
    y,                                      # your labels
    cv=5,                                   # e.g. 5-fold cross-validation
    n_jobs=-1,                              # use all cores
    train_sizes=np.linspace(0.1, 1.0, 5),   # fractions of the data to try
    return_times=True,
)
This way you'll be able to plot the amount of the data vs the model performance.
Here are some examples of plotting:
https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_kernel_ridge_regression.html#sphx-glr-auto-examples-miscellaneous-plot-kernel-ridge-regression-py
https://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html#sphx-glr-auto-examples-model-selection-plot-learning-curve-py
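Or, as a minimal hand-rolled sketch reusing the arrays returned above:
import matplotlib.pyplot as plt
import numpy as np

plt.plot(train_sizes, np.mean(train_scores, axis=1), marker="o", label="training score")
plt.plot(train_sizes, np.mean(test_scores, axis=1), marker="o", label="cross-validation score")
plt.xlabel("Number of training samples")
plt.ylabel("Score")
plt.legend()
plt.show()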
Estimating total time is difficult, because it isn't linear.
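That said, the fit_times array returned above (thanks to return_times=True) already contains measured training times for each subsample size, so you can at least look at the empirical scaling instead of guessing:
import matplotlib.pyplot as plt
import numpy as np

plt.plot(train_sizes, np.mean(fit_times, axis=1), marker="o")
plt.xlabel("Number of training samples")
plt.ylabel("Fit time (s)")
plt.show()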
Some additional practical suggestions:
set n_jobs=-1 to run the model in parallel on all cores;
use any feature selection approach to decrease the number of features. 300 features is really a lot; it should be possible to get rid of around half of them without a serious decline in model performance, for example as sketched below.
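One possible sketch of such a step (SelectFromModel is just one option among many, and X, y stand for your own data):
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

selector = SelectFromModel(
    RandomForestClassifier(n_estimators=50, n_jobs=-1, random_state=0),
    threshold="median",      # keep features above median importance, roughly halving the count
)
X_reduced = selector.fit_transform(X, y)   # X, y: your features and labels
print(X_reduced.shape)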

What steps should we take to analyze a dataset? [closed]

I am new to the field of data science, and I want to know the key steps for getting the properties of any dataset used for machine learning tasks.
What you ask is very general and your request is not well defined, but I'll try to give you a short introduction to get you started.
Knowledge required (as I see it):
Statistics and probability
Basic knowledge in mathematics
Basic knowledge of AI techniques and algorithms
The first step in any research is to define the problem: what are you trying to do?
For instance:
"I would like to predict if the next person who buys this car is a male or a female"
This kind of problem is a classification problem, which means the solution will correctly label the "input" person as male or female.
This is called a model; a model is a representation of the real world and its properties, and using ML tools we wish to create one.
We do that by looking at historical data. For example, let's say that out of 1000 male customers and 1000 female customers, 850 males bought car X while the rest bought car Y, and 760 females bought car Y while the rest bought car X.
Now, if I tell you the next customer bought car X, can you tell me their gender?
You are probably thinking it's a male, but there's still a chance it's a female; yet there's a higher probability that it is in fact a male, since we already know the pattern of males' and females' choices.
That's basically how it works: given a dataset, such as yours, you need to use it in order to predict something from it.
Note: whether your dataset is fit for this or not, or how much information gain you'll get from it, is another story.
Now, each piece of data you can learn from is called a record:
first_name: 'LEROY', last_name: 'JENKINS', age: '25', gender: 'male', car_bought: 'x'
and each property is called a feature.
Some features can be useless to you. In our example, only the gender is important and the rest are useless; learning from the useless features may cause your model to learn invalid patterns.
Also, some records may contain invalid data such as NULLs and missing values. The first thing you need to do is pre-process your data and get it ready for learning.
Once your data is ready, you can start training. For that, you'll have to choose the most suitable algorithm for you. I won't go over the algorithms because there are a lot of them and you'll have to gain more knowledge about them, but there are many libraries for them and you can just google it.
I'll give you a short code example using a simple neural network, to get you started, predicting the outcome of a simple mathematical function: F(x) = 2*x
# prepare the dataset: pairs of x and 2*x
import numpy as np

def make_dataset():
    X = np.arange(0.0, 1000.0, 1.0)
    Y = np.empty(shape=0, dtype=float)
    for x in X:
        Y = np.append(Y, float(2 * x))
    return X, Y
and a simple neural network using keras:
from keras.models import Sequential
from keras.layers import Dense

X, Y = make_dataset()
X = X.reshape(-1, 1)   # one feature per sample

model = Sequential()
model.add(Dense(5, input_shape=(1,)))
model.add(Dense(1, activation='linear'))
# compile model
model.compile(loss='mean_absolute_error', optimizer='adam')
# train model
m = model.fit(X, Y, epochs=500, batch_size=1)
Predicting with the trained model:
for i in np.arange(2000.0, 2010.0, 1.0):
    x_new = np.array([[i]])   # single sample with one feature
    print('f(', i, ') = ', model.predict(x_new)[0][0])
will output (approximately):
f(2000.0) = 4000.0
f(2001.0) = 4002.0
f(2002.0) = 4004.0
.
.
Even though the model never saw these numbers before, it can now predict the output, having learned the pattern from the dataset.
I don't expect you to understand how keras works or what it does; this is only to give you a feel for what it is like to use an ML algorithm.
I hope that answers your question and helps you get started.
Your question is too general; you need to be more specific. What do you mean by the properties of the dataset?
Nevertheless, I'll try to answer what I understood from your question.
After choosing what kind of problem you have (classification or regression), you'll want to try to visualize your data to get a better sense of what you are doing.
Facets is an excellent tool for this: https://pair-code.github.io/facets/ . It will help you better understand how your data is distributed and may give you some extra insight into how to tackle your problem, but how you use it depends on the problem you have.
You should also visualize your correlation matrix to see whether you have features that are heavily correlated, so that you can remove unnecessary features; a sketch of this is shown below.
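A minimal sketch of that (the file name is made up; any pandas DataFrame will do):
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("your_dataset.csv")        # hypothetical dataset
corr = df.select_dtypes("number").corr()    # pairwise correlations between numeric features

plt.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
plt.colorbar()
plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
plt.yticks(range(len(corr.columns)), corr.columns)
plt.show()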
I remember that when I started working on my first machine learning project, things were overwhelming. The best tip I can give you is to find a step-by-step guide that deals with a problem similar to the one you are facing; I'm sure you'll find plenty. Also, if you clarify your question a bit more, we could give you more insight.

Confusion about test & validation set labels in machine learning [closed]

I have a question with regards to the training and validation of a dataset.
I understand the concept of labels for training data, i.e. y_train. What I don't get is why our testing/validation samples should have labels as well.
I assume that by giving labels to the test samples, we define what they are before putting them through the algorithm, right?
Let me put it this way: if I have a dataset of pictures of dogs and cats, and I label them 1 and 2, respectively, then if I want to throw a picture (of a dog) that was not in my training dataset at my model to test it, why should I label it? If I label it 1, then I'm saying beforehand that it's a dog, and if I label it 2, then it is already a cat.
Can I have a testing/validation dataset without labels?
The validation dataset is used to fine-tune the parameters of your model, while the test set is used to check its accuracy. Without the labels, how can you claim your model is correct? This applies to supervised learning, so you need labels for both the testing and validation datasets.
The purpose of a test set is, as its name implies, to test the performance of your model in data that were not seen during training. And in order to get this performance indication, you certainly need data with known labels, in order to compare these labels (ground truth) with the corresponding model predictions, and to arrive to some quantitative measure (e.g. accuracy) of your model performance - something you can certainly not do without these labels being available in the test set.
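To make that comparison concrete, a generic scikit-learn sketch (model, X_test, and y_test stand for your own fitted model and held-out data):
from sklearn.metrics import accuracy_score

y_pred = model.predict(X_test)              # model predictions on unseen samples
accuracy = accuracy_score(y_test, y_pred)   # compare against the known ground-truth labels
print(f"test accuracy: {accuracy:.3f}")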
if I want to throw a picture (dog) to test my model, why should I label it? If I label it 1, then I'm telling beforehand that it's a dog and if I label it 2, then it is a cat already.
You are using the term "test" very loosely here - this is not its meaning in the context of a test set (which context I just described above). Notice also that, the fact that the test labels are available, does not mean that they are being used by the model during prediction (they are certainly not - they are only used for comparison with the model predictions, as described above). Plus, you are referring to a very specific problem where the answer (cat/dog) is obvious to a human observer - try using the same rationale e.g. in a genomics problem, or in one that requests numeric predictions for, say, house prices, and you'll see that the situation is not that simple and straightforward (could you possibly name the price of a house by just looking at a row of numbers?)...
The same applies for a validation set, only the objective here is different (i.e. not model assessment, but model tuning).
Admittedly, some people use the term "test data" to mean in general any unseen data, but this is not correct; after you have built & assessed your model using your training, validation, and test sets, you deploy it by feeding it new and obviously unseen data, for which it is certainly not expected to already know the labels...
There are literally dozens of online tutorials on the subject, and SO is arguably not the most appropriate forum for this kind of question - I just hope I have given you a good-enough first general idea...

Difference between Regression and classification in Machine Learning? [closed]

I am new to Machine Learning. Can anyone tell me the major difference between classification and regression in machine learning?
Regression aims to predict a continuous output value. For example, say that you are trying to predict the revenue of a certain brand as a function of many input parameters. A regression model would literally be a function which can output potentially any revenue number based on certain inputs. It could even output revenue numbers which never appeared anywhere in your training set.
Classification aims to predict which class (a discrete integer or categorical label) the input corresponds to. For example, let us say that you had divided the sales into Low and High sales, and you were trying to build a model which could predict Low or High sales (binary/two-class classification). The inputs might even be the same as before, but the output would be different. In the case of classification, your model would output either "Low" or "High", and in theory every input would generate only one of these two responses.
(This answer is true for any machine learning method; my personal experience has been with random forests and decision trees).
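To illustrate the difference with code (a sketch on made-up revenue data, using random forests since that is what the answer mentions):
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                          # made-up input features
revenue = 100 + 20 * X[:, 0] + rng.normal(size=200)    # continuous target

# Regression: predict the revenue itself (any real number).
reg = RandomForestRegressor(random_state=0).fit(X, revenue)
print(reg.predict(X[:2]))   # real-valued predictions

# Classification: predict a discrete "Low"/"High" label instead.
labels = np.where(revenue > np.median(revenue), "High", "Low")
clf = RandomForestClassifier(random_state=0).fit(X, labels)
print(clf.predict(X[:2]))   # one of the two class labels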
Regression - the output variable takes continuous values.
Example: Given a picture of a person, we have to predict their age on the basis of the given picture.
Classification - the output variable takes class labels.
Example: Given a patient with a tumor, we have to predict whether the tumor is malignant or benign.
I am a beginner in the machine learning field, but as far as I know, regression is for "continuous values" and classification is for "discrete values". With regression, a line is fitted to your continuous values and you can see whether your model is a good or bad fit. With classification, on the other hand, you can see how the discrete values gain their meaning "discretely". If I am wrong, please feel free to correct me.

Can a machine learning algorithm learn to predict a "random" number from a generator with at least 50% accuracy? [closed]

Because all random number generators are really pseudo-random number generators, can a machine learning algorithm, with enough data, eventually learn to predict the next random number with at least 50% accuracy?
If you are generating just random bits (0 or 1), then any method will get 50% - literally any, ML or not, trained or not - anything besides direct exploitation of the underlying random number generator (like reading the seed and then using the same random number generator as your predictor). So the answer is yes.
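A tiny sanity check of that claim (a sketch; it just guesses a constant and still lands near 50% on fair bits):
import numpy as np

rng = np.random.default_rng(42)
bits = rng.integers(0, 2, size=100_000)   # "random" 0/1 bits from a PRNG

predictions = np.ones_like(bits)          # the laziest possible "model": always predict 1
accuracy = np.mean(predictions == bits)
print(f"accuracy of always guessing 1: {accuracy:.3f}")   # ~0.5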
If you consider more than two "numbers", then no, it is not possible unless the random number generator is not a valid one. The weaker the generating process and the better the model you fit, the easier it is to predict what is happening. For example, if you know exactly what the random number generator looks like, and it is just an iterated function with some parameters, f(x|params), where we start with some random seed s and parameters params and then x1 = f(s|params), x2 = f(x1|params), ..., then you can use ML to learn the state of such a system; this is just about finding the "params" which, plugged into f, generate the observed values. Now, the more complex f is, the harder the problem. For typical random number generators, f is too complex to learn, because you cannot observe any relation between nearby values: if you predict "5.8" and the answer was "5.81", the next sample from your model might be "123" while the true generator gives "-2". This is a completely chaotic process.
To sum up, this is possible only in very easy cases:
either there are just 2 values (then there is nothing to learn; literally any method which is not cheating will get 50%),
or the random number generator is seriously flawed, you know what type of flaw it is, and you can design a parametric model to approximate it.

Resources