I understand the concept of the k-NN algorithm, but I have a question. After we have classified newly arriving test points, do these test points become part of our classification process? I mean, can future test points have previously classified test points among their nearest neighbors, or only the real, initial training points?
"After we have classified newly arriving test points, do these test points become part of our classification process?"
No.
A test point gets classified and doesn't modify your initial dataset. So, if your dataset has 100 points, after a test point has arrived and been classified, your dataset will still have 100 points, not 101 points.
With that said, a test point will be classified against exactly the same dataset whether it arrives first or last (even if several other test points have already arrived and been classified).
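To make this concrete, here is a minimal sketch using scikit-learn (the data, labels and k value are made up): the classifier is fitted once on the training points, and calling predict() on new test points never modifies the fitted data, so every test point is compared against the same 100 training points regardless of arrival order.

```python
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

X_train = np.random.rand(100, 2)          # 100 training points
y_train = np.random.randint(0, 2, 100)    # their known labels

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

first_test = knn.predict(np.random.rand(1, 2))   # classified against the 100 points
later_test = knn.predict(np.random.rand(1, 2))   # still the same 100 points, nothing was added
```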
I have 6 variables, each with a value somewhere between 0 and 2, and a function that takes these variables as input. It predicts the result of a football match by looking at the past matches of both teams.
The output of the function obviously changes depending on the variables. Each variable determines how heavily a certain part of the algorithm is weighted (e.g. how much less a game from 6 months ago should count compared to a game from a week ago).
My goal is now to find the ideal ratios between the variables, and thus between the different parts of the algorithm, so that the algorithm predicts as many matches correctly as possible. Is there any way of achieving that?
I thought of doing this with machine learning, something similar to linear/polynomial regression.
To determine how close a tip is, I thought of giving:
2 points when the tendency is right (predicted that Team A would win, and Team A did win)
4 points when the goal difference is right (prediction: Team A wins 2:1, actual result: 1:0)
5 points when the exact result is predicted correctly (predicted result: 2:1, actual result: 2:1)
This would give a loss function of: maximal points for a game (which is 5) minus the points scored by the predicted result.
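To make the scoring rule concrete, here is a possible sketch (in Python for brevity, even though the question uses JavaScript; the function name and example scores are only illustrative):

```python
# 5 points for the exact score, 4 for the right goal difference,
# 2 for the right tendency, 0 otherwise.
def points_for_prediction(pred_home, pred_away, actual_home, actual_away):
    if (pred_home, pred_away) == (actual_home, actual_away):
        return 5                                            # exact result
    if pred_home - pred_away == actual_home - actual_away:
        return 4                                            # correct goal difference
    pred_sign = (pred_home > pred_away) - (pred_home < pred_away)
    actual_sign = (actual_home > actual_away) - (actual_home < actual_away)
    if pred_sign == actual_sign:
        return 2                                            # correct tendency (win/draw/loss)
    return 0

loss = 5 - points_for_prediction(2, 1, 1, 0)                # loss as described above -> 1 here
```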
If I am able to minimize that, then hopefully, after looking at some training sets (past seasons), the algorithm will score the largest number of points when given a new season, together with the variables computed beforehand, as input.
Now I'm trying to find out by how much, and in which direction, I have to change each of my variables so that the loss gets smaller each time a new training set is processed.
You probably have to look at how big the loss is, but I don't know how to determine which variable to change and in which direction. Is that possible, and if so, how do I do it?
Currently I'm using JavaScript.
I am assuming that you are trying to use gradient descent to train your regression model.
Loss functions used with gradient descent have to be differentiable, so simply awarding or subtracting points for certain properties of the prediction will not work.
A loss function that may be suitable for this task is the mean squared error, which is simply computed by averaging the squared differences between predicted and expected values. Your expected values would then just be the scores of both teams in the game.
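In symbols (this is just the standard definition), with predicted score $\hat{y}_i$, observed score $y_i$ and $n$ games:

$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n} (\hat{y}_i - y_i)^2$$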
You would then have to compute the gradient of the loss of your prediction with respect to the weights that the prediction function uses to compute its outputs. This can be done using backpropagation (details of which are way too broad for this answer, there are many tutorials available on the web).
The intuition behind the gradient of a function is that it points in the direction of steepest ascent of that function. If you update your parameters in that direction, the output of the function will grow. If this output is the loss of your prediction function, you want it to be smaller, so you take a small step in the opposite direction of the gradient.
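As a rough illustration (not the asker's actual prediction function), here is a minimal sketch of gradient descent on the mean squared error for a linear model with 6 weights; the data is synthetic and the learning rate is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: each row holds the 6 hand-crafted features for one match,
# y holds the observed value to predict (e.g. the goal difference).
X = rng.normal(size=(200, 6))
true_w = np.array([0.8, -0.3, 0.5, 0.1, -0.6, 0.2])
y = X @ true_w + rng.normal(scale=0.1, size=200)

w = np.zeros(6)          # the 6 variables we want to tune
lr = 0.05                # learning rate (step size)

for epoch in range(500):
    pred = X @ w                        # model prediction
    error = pred - y                    # difference to the expected values
    loss = np.mean(error ** 2)          # mean squared error
    grad = 2 * X.T @ error / len(y)     # gradient of the MSE w.r.t. w
    w -= lr * grad                      # small step against the gradient

print(loss, w)                          # loss shrinks, w approaches true_w
```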
I'm trying to build a classifier to predict stock prices. I generated extra features using some of the well-known technical indicators and feed these values, as well as values at past time points, to the machine learning algorithm. I have about 45k samples, each representing an hour of OHLCV data.
The problem is actually a 3-class classification problem, with buy, sell and hold signals. I've built these 3 classes as my targets based on the (%) change at each time point. That is, I've classified only the largest positive (%) changes as buy signals, the opposite for sell signals, and the rest as hold signals.
However, presenting this 3-class target to the algorithm has resulted in poor accuracy for the buy & sell classifiers. To improve this, I chose to manually assign classes based on the probabilities of each sample. That is, I set the targets as 1 or 0 based on whether there was a price increase or decrease.
The algorithm then returns a probability between 0 and 1 (usually between 0.45 and 0.55) for its confidence in which class each sample belongs to. I then select a probability bound for each class within that range. For example, I select p > 0.53 to be classified as a buy signal, p < 0.48 to be classified as a sell signal, and anything in between as a hold signal.
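For concreteness, the thresholding step could look like this (a minimal sketch; the 0.53/0.48 bounds are just the example values from above, and the function name is illustrative):

```python
import numpy as np

def signals_from_probabilities(p_up, buy_threshold=0.53, sell_threshold=0.48):
    """Map predicted probabilities of a price increase to buy/sell/hold."""
    signals = np.full(p_up.shape, "hold", dtype=object)
    signals[p_up > buy_threshold] = "buy"
    signals[p_up < sell_threshold] = "sell"
    return signals

probs = np.array([0.55, 0.50, 0.47, 0.53])
print(signals_from_probabilities(probs))   # ['buy' 'hold' 'sell' 'hold']
```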
This method has drastically improved the classification accuracy, at some points to above 65%. However, I'm failing to come up with a method to select these probability bounds without a large validation set. I've tried finding the best probability values on a validation set of 3,000 samples, and this has improved the classification accuracy, yet it is clear that as the validation set grows larger, the prediction accuracy on the test set decreases.
So, what I'm looking for is any method by which I could discern what the specific decision probabilities for each training set should be, without large validation sets. I would also welcome any other ideas as to how to improve this process. Thanks for the help!
What you are experiencing is called a non-stationary process: the market's behaviour depends on when the event happens.
One way I have used to deal with this is to build the model on data from different time chunks.
For example, use data from day 1 to day 10 for training and day 11 for testing/validation, then move forward one day: day 2 to day 11 for training and day 12 for testing/validation.
You can then combine all your test results to compute an overall score for your model. This way you have lots of test data and a model that adapts over time.
You also get three more parameters to tune: (1) how much data to use for training, (2) how much data to use for testing, and (3) how often (in days/hours/data points) you retrain the model.
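A minimal sketch of this rolling-window evaluation (the fit_and_score callable is a hypothetical placeholder for whatever model you train and score on each chunk, and the demo scorer at the bottom is only there to make the snippet runnable):

```python
import numpy as np

def rolling_window_scores(X, y, train_size, test_size, fit_and_score):
    """Slide a training window followed by a test window over chronologically ordered data."""
    scores = []
    start = 0
    while start + train_size + test_size <= len(X):
        train = slice(start, start + train_size)
        test = slice(start + train_size, start + train_size + test_size)
        scores.append(fit_and_score(X[train], y[train], X[test], y[test]))
        start += test_size              # move forward by one test window
    return np.mean(scores)              # overall score across all test chunks

# Dummy demo: score = fraction of test labels equal to the training-window majority.
X = np.arange(100).reshape(-1, 1)
y = np.arange(100) % 2
majority = lambda Xtr, ytr, Xte, yte: np.mean(yte == np.round(ytr.mean()))
print(rolling_window_scores(X, y, train_size=24, test_size=6, fit_and_score=majority))
```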
Suppose that for the k-nearest-neighbour algorithm we have an original training data set x1, x2, ..., xn and we classify a test point p1. After classifying p1, we put p1 into the training data set.
The training data set is now {x1, x2, ..., xn, p1}, and we classify p2 against it, and so on.
I find it quite counter-intuitive that we use "fake" data to train our program, but I cannot think of any proof or reason why we cannot use this "fake" data.
It will only make the model more biased toward the original training set, because the boundary between classes is updated using the model's own predictions. In addition, adding observations to your training set without any ground-truth knowledge only makes the feature space denser and reduces the effective influence of k, which leads to a higher chance of over-fitting.
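For reference, this is roughly the procedure the question describes, sketched with scikit-learn (random data, arbitrary k); as argued above, the labels appended in the loop are only the model's own guesses, not ground truth:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.random.rand(50, 2)
y_train = np.random.randint(0, 2, 50)

for p in np.random.rand(5, 2):                       # incoming test points p1, p2, ...
    knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
    label = knn.predict(p.reshape(1, -1))            # the model's own guess, not ground truth
    X_train = np.vstack([X_train, p])                # the "fake" training point
    y_train = np.append(y_train, label)
```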
I have a training dataset and a test dataset from two different sources. I mean they come from two different experiments, but both contain the same kind of biological images. I want to do binary classification using a deep CNN and I get the following results for test accuracy and train accuracy. The blue line shows train accuracy and the red line shows test accuracy over almost 250 epochs. Why is the test accuracy almost constant and not rising? Is that because the test and train datasets come from different distributions?
Edited:
After adding a dropout layer, regularization terms and mean subtraction, I still get the following strange results, which suggest the model is overfitting from the very beginning!
There could be two reasons. First, you overfit on the training data. You can check this by comparing the validation score against the test score. If so, you can use standard techniques to combat overfitting, like weight decay and dropout.
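For example, a minimal sketch (PyTorch assumed, since the question does not state the framework; the layer sizes are arbitrary) of both remedies: dropout inside the network and weight decay on the optimizer.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Dropout(p=0.5),          # randomly zeroes activations during training
    nn.Linear(16, 2),           # binary classification -> 2 logits
)

# weight_decay adds an L2 penalty on the weights during optimization
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
```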
The second reason is that your data is too different to be learned like this. This is harder to solve. You should first look at the value spread of both sets of images: are they both normalized? Note that Matplotlib normalizes plotted images automatically, so a plot can be misleading. If this still does not work, you might want to look into augmentation to make your training data more similar to the test data. Here I cannot tell you what to use without seeing both the training set and the test set.
Edit:
For normalization, the test set and the training set should have a similar value spread. If you do dataset-level normalization, you calculate the mean and standard deviation on the training set, but you must also apply those same values to the test set, rather than recomputing them from the test set. This only makes sense if the value spread is similar for the training and test sets; if it is not, you might want to do per-sample normalization first.
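A minimal sketch of this, assuming the data fits in NumPy arrays:

```python
import numpy as np

def normalize(train, test):
    mean = train.mean(axis=0)               # statistics from the training set only
    std = train.std(axis=0) + 1e-8          # small epsilon avoids division by zero
    return (train - mean) / std, (test - mean) / std   # same statistics applied to both
```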
Other augmentations that are commonly used are oversampling, random channel shifts, random rotations, random translations and random zoom. These make the model invariant to those operations.
I am trying to overfit my model on a training set that consists of only a single sample. The training accuracy comes out to be 1.00, but when I predict the output for my test data, which consists of that same single training sample, the results are not accurate. The model has been trained for 100 epochs and the loss is ~1e-4.
What could be the possible sources of error?
As mentioned in the comments of your post, it isn't possible to give specific advice without you first providing more details.
Generally speaking, your approach to overfitting a tiny batch (in your case one image) is in essence providing three sanity checks, i.e. that:
backprop is functioning
the weight updates are doing their job
the learning rate is in the correct order of magnitude
As Andrej Karpathy points out in Lecture 5 of the CS231n course at Stanford: "if you can't overfit on a tiny batch size, things are definitely broken".
This means, given your description, that your implementation is incorrect. I would start by checking each of the three points listed above. For example, alter your test somehow by picking several different images, or a batch size of 5 images instead of one. You could also revise your predict function, as that is where there is definitely some discrepancy, given that you are getting essentially zero error during training (and presumably during validation too?).
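As a rough illustration of this sanity check (PyTorch assumed, with a tiny made-up model and data): overfit a single sample, then verify that predicting that same sample in evaluation mode reproduces its label.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 10)                  # the single training sample
y = torch.tensor([1])                   # its label

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for _ in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()                     # checks that backprop is functioning
    optimizer.step()                    # checks that the weight updates do their job

model.eval()                            # switch to eval mode (matters if dropout/batch norm is used)
with torch.no_grad():
    print(loss.item(), model(x).argmax(dim=1))   # prediction should match y if training worked
```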