Online (non-batch) learning + semi-supervised learning

I have a dataset of around 5000 points in the following form:
var1 var2 var3 var4
0.1 0.2 0.4 0.5
0.1 0.2 0.2 0.5
0.1 0.9 0.4 0.6
0.9 0.9 0.8 0.9
0.1 0.1 0.4 0.8
...
The values range from zero to one. The higher a value, the higher the chance of being classified as 1. However, the data is unlabeled and I do not know which variable carries more weight.
Initially I weight all the variables equally with 1/4 (1/number_of_variables), take the weighted score of each point, and classify scores above the median as 1 and scores at or below the median as 0 (a code sketch follows the table below):
var1 var2 var3 var4 score output
0.1 0.2 0.4 0.5 0.3 0
0.1 0.2 0.2 0.5 0.25 0
0.1 0.9 0.4 0.6 0.5 1
0.9 0.9 0.8 0.9 0.875 1
0.1 0.1 0.4 0.8 0.35 0
...
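A minimal numpy sketch of this equal-weight baseline, with random placeholder data standing in for the real 5000 points:
import numpy as np

X = np.random.rand(5000, 4)                        # placeholder for the real data
weights = np.full(X.shape[1], 1 / X.shape[1])      # equal weights: 1/4 each
scores = X @ weights                               # weighted average per point
output = (scores > np.median(scores)).astype(int)  # above the median -> 1, else 0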
Later, a user can re-label 10 to 20 of the 5000 points (the user-defined labels are considered the ground truth).
How can I update the model to take the user's labels into account? Since a user can label at different times, I was thinking of a combination of online learning and a semi-supervised approach. How would you approach the whole task?
Edit:
The main idea I had is the following:
Consider each variable individually and classify each point as 1 if its value is higher than the mean of that variable's column and 0 otherwise.
Train a separate logistic regression model on each variable using those labels (each variable gets its own model).
Give each model an initial weight (for example, 1).
When a user labels some examples, evaluate the models with a simple AdaBoost-style algorithm and update the weights (sketched below).
Do you think this could work?
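A heavily simplified sketch of what that scheme could look like, assuming scikit-learn's LogisticRegression and a multiplicative reweighting in place of full AdaBoost (X is placeholder data as before):
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.random.rand(5000, 4)   # placeholder unlabeled data

# Steps 1-2: pseudo-label each variable against its column mean, one model per variable.
models = []
for j in range(X.shape[1]):
    col = X[:, [j]]
    pseudo = (col.ravel() > col.mean()).astype(int)
    models.append(LogisticRegression().fit(col, pseudo))

model_weights = np.ones(len(models))   # step 3: initial weight of 1 per model

def update_weights(X_user, y_user, beta=0.5):
    # Step 4 (simplified): multiplicatively up-weight models that agree with
    # the user's labels, down-weight those that don't, then renormalize.
    for j, m in enumerate(models):
        err = np.mean(m.predict(X_user[:, [j]]) != y_user)
        model_weights[j] *= np.exp(-beta * (2 * err - 1))
    model_weights[:] = model_weights / model_weights.sum()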

Related

Different number of inputs causes semantic problem in NN during training?

I would like to forecast stock prices. Sometimes I have 3 input values, sometimes 5 or 7 (sometimes there are 3 limit orders around a price change, sometimes 5).
I would like to train a NN to predict a number based on 3-5 inputs. The problem is that sometimes I have 3 inputs, sometimes 4 or 5.
Let's say at one time:
Input:
[ 0.1, 0.3, 0.5, 0.7, 0.1 ]
Output:
0.7
And another time:
Input:
[ 0.4, 0.1, 0.6, ? , ? ]
Output:
0.5
What should I write at the question mark? null or 0 or undefined?
I am using the very basic NN from brain.js, and I know my model is not optimised for forecasting, but please ignore that for now.
I think that if I used 0 when I don't have enough inputs, it would cause a semantic problem: the NN would think the 0 input has an actual meaning regarding the input data and would handle it as valid data.
Edit:
Is it possible to mix input types?
For example, I could have an extra boolean input field that tells whether the next input element should be treated as valid or not.
If I had 5 values to predict from, I would make a 7-element input array with 2 extra fields, which say that 0.5 and 0.7 should be treated as numbers:
[0.1 , 0.4 , 0.1 , BOOLEAN-YES , 0.5 , BOOLEAN-YES , 0.7 ];
If I had only 3 values, I would also make a 7-element input with 2 extra fields, which say that the two 0s should be ignored and NOT treated as numbers:
[0.1 , 0.4 , 0.1 , BOOLEAN-NO , 0 , BOOLEAN-NO , 0 ];
With this method every input would have the same size. Could it be done? How could I do this?
Could BOOLEAN-NO and BOOLEAN-YES simply be 0 and 1?
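This is essentially a standard padding-plus-mask scheme, and yes, the booleans can simply be 0 and 1. A minimal sketch of the idea (shown in Python; brain.js would accept the same flat array), generalizing the two extra fields above to one validity flag per slot:
import numpy as np

MAX_SLOTS = 5   # assumed maximum number of inputs

def encode(values):
    # Pad the values to a fixed length and append one 0/1 validity flag per slot.
    pad = MAX_SLOTS - len(values)
    padded = list(values) + [0.0] * pad
    flags = [1.0] * len(values) + [0.0] * pad
    return np.array(padded + flags)

encode([0.4, 0.1, 0.6])             # [0.4, 0.1, 0.6, 0, 0, 1, 1, 1, 0, 0]
encode([0.1, 0.3, 0.5, 0.7, 0.1])   # all five values, all five flags set to 1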

cost becoming NaN after certain iterations

I am trying to do a multiclass classification problem (containing 3 labels) with softmax regression.
This is my first rough implementation with gradient descent and backpropagation (without regularization or any advanced optimization algorithm), containing only 1 layer.
Also, when the learning rate is big (>0.003) the cost becomes NaN; on decreasing the learning rate the cost function works fine.
Can anyone explain what I'm doing wrong?
# X is (13,177) dimensional
# y is (3,177) dimensional with label 0/1
import numpy as np

m = X.shape[1] # 177
W = np.random.randn(3,X.shape[0])*0.01 # (3,13)
b = 0
cost = 0
alpha = 0.0001 # seems too small to me but for bigger values cost becomes NaN
for i in range(100):
    Z = np.dot(W,X) + b
    t = np.exp(Z)
    add = np.sum(t,axis=0)
    A = t/add
    loss = -np.multiply(y,np.log(A))
    cost += np.sum(loss)/m
    print('cost after iteration',i+1,'is',cost)
    dZ = A-y
    dW = np.dot(dZ,X.T)/m
    db = np.sum(dZ)/m
    W = W - alpha*dW
    b = b - alpha*db
This is what I get:
cost after iteration 1 is 6.661713420377916
cost after iteration 2 is 23.58974203186562
cost after iteration 3 is 52.75811642877174
... (up to 100 iterations) ...
cost after iteration 99 is 1413.555298639879
cost after iteration 100 is 1429.6533630169406
Well, after some time I figured it out.
First of all, the cost was increasing because of this line:
cost += np.sum(loss)/m
The plus sign is not needed: it accumulates the cost computed in all previous iterations, which is not what we want. That kind of accumulation is only appropriate in mini-batch gradient descent, where the cost is summed over the batches within a single epoch.
Secondly, the learning rate was too big for this problem, which is why the cost was overshooting the minimum and becoming NaN.
I looked at my code and found that my features had very different ranges (one was from -1 to 1, another from -5000 to 5000), which limited my algorithm to small learning rates.
So I applied feature scaling (note the keepdims=True, needed so the (13,1) result broadcasts against the (13,177) X):
var = np.var(X, axis=1, keepdims=True)
X = X/var
Now the learning rate can be much bigger (<= 0.001).
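Putting both fixes together, plus the standard max-subtraction trick that keeps np.exp from overflowing, a corrected loop might look like this (random placeholder data; I scale by the standard deviation here rather than the variance, which is the more common choice):
import numpy as np

X = np.random.randn(13, 177) * 100               # placeholder features
y = np.eye(3)[np.random.randint(0, 3, 177)].T    # placeholder one-hot labels, (3,177)

X = X / np.std(X, axis=1, keepdims=True)         # feature scaling
m = X.shape[1]
W = np.random.randn(3, X.shape[0]) * 0.01
b = 0.0
alpha = 0.001

for i in range(100):
    Z = np.dot(W, X) + b
    t = np.exp(Z - Z.max(axis=0, keepdims=True)) # stable softmax: subtract the max
    A = t / t.sum(axis=0, keepdims=True)
    cost = -np.sum(y * np.log(A)) / m            # '=' instead of '+=': fresh cost each iteration
    print('cost after iteration', i + 1, 'is', cost)
    dZ = A - y
    W -= alpha * np.dot(dZ, X.T) / m
    b -= alpha * np.sum(dZ) / m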

Converting Continuous Model Probability Score to a Categorical Rating

I have a standard xgboost classification model that has been trained and now predicts a probability score. However, to make the user interface simpler, I would like to convert this score to a 5-star rating scheme, i.e. discretize the score.
What are intelligent ways of deriving the thresholds for this quantization such that a high rating represents a high probability score with high confidence?
For example, I was considering generating confidence intervals along with the prediction and grouping high-confidence high scores as 5 stars, high-confidence low scores as 1 star, high-confidence medium-high scores as 4 stars, and so on.
I investigated multiple solutions for this and prototyped a V0 solution. The main requirements for the solution are as follows:
As the rating level increases (5 star is better than 1 star) the # of false positives must decrease.
The user doesn't have to manually define thresholds on the score probabilities; the thresholds are derived automatically.
The thresholds are derived from some higher level business requirement.
The thresholds are derived from the labelled data and can be rederived as new information is found.
Other solutions considered:
Confidence-interval-based rating. For example, you could have a high predicted score of 0.9 with low confidence (i.e. a large confidence interval) or a high predicted score of 0.9 with high confidence (i.e. a small interval). I suspect we might want the latter to be a 5-star candidate and the former perhaps a 4-star?
Identifying convexity and concavity of the ROC curve to find the points of maximum value
Using the Youden index to identify the optimal point
Final solution: sample the ROC curve with a given set of business requirements (a set of FPRs associated with each star rating) and then translate them into thresholds.
Note: this worked, but it assumes a somewhat monotonic precision curve, which may not always be the case. I improved the solution by formulating the problem as an optimization problem where the rating thresholds were the degrees of freedom and the objective function was the linearity of the conversion rates between the rating buckets. I'm sure you could try out different objective functions, but for my purpose that worked really well; a rough sketch of the formulation follows.
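The exact objective isn't shown above, so the following is only a crude sketch of the idea: scores and labels are hypothetical arrays of predicted probabilities and true 0/1 labels, and "linearity" is taken as the squared deviation of per-bucket precision from a straight ramp.
import numpy as np
from itertools import combinations

# scores, labels: predicted probabilities and true 0/1 labels (assumed given)

def bucket_precisions(scores, labels, thresholds):
    # Precision of each rating bucket induced by the ascending thresholds.
    edges = np.concatenate(([0.0], np.sort(thresholds), [1.0]))
    precs = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (scores >= lo) & (scores < hi)
        precs.append(labels[mask].mean() if mask.any() else 0.0)
    return np.array(precs)

def linearity_loss(scores, labels, thresholds):
    # Penalize deviation of bucket precisions from a linear 1-to-5-star ramp.
    precs = bucket_precisions(scores, labels, thresholds)
    ramp = np.linspace(precs.min(), precs.max(), len(precs))
    return np.sum((precs - ramp) ** 2)

# Crude grid search over candidate threshold quadruples (4 thresholds -> 5 buckets).
grid = np.linspace(0.05, 0.95, 19)
best = min(combinations(grid, 4),
           key=lambda t: linearity_loss(scores, labels, np.array(t)))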
References:
Converting Continuous Model Probability Score to a Categorical Rating
http://www.medicalbiostatistics.com/roccurve.pdf
http://www.bigdatarepublic.nl/regression-prediction-intervals-with-xgboost/
Prototype Solution:
import numpy as np
import pandas as pd
from sklearn.metrics import roc_curve

# The probas and fpr/tpr/thresholds come from the ROC curve.
probas_ = xgb_model_copy.fit(features.values[train], label.values[train]).predict_proba(features.values[test])
# Compute ROC curve and the area under the curve
fpr, tpr, thresholds = roc_curve(label.values[test], probas_[:, 1])
fpr_req = [0.01, 0.3, 0.5, 0.9]
def find_nearest(array, value):
    idx = (np.abs(array - value)).argmin()
    return idx
fpr_indexes = [find_nearest(fpr, fpr_req_val) for fpr_req_val in fpr_req]
star_rating_thresholds = thresholds[fpr_indexes]
star_rating_thresholds = np.append(np.append([1], star_rating_thresholds), [0])
candidate_ratings = pd.cut(probas_[:, 1],
    star_rating_thresholds[::-1], labels=[5,4,3,2,1], right=False, include_lowest=True)
star_rating_thresholds
array([1. , 0.5073538 , 0.50184137, 0.5011086 , 0.4984425 ,
0. ])
candidate_ratings
[5, 5, 5, 5, 5, ..., 2, 2, 2, 2, 1]
Length: 564
Categories (5, int64): [5 < 4 < 3 < 2 < 1]
You can use the pandas.cut() method:
In [62]: np.random.seed(0)
In [63]: a = np.random.rand(10)
In [64]: a
Out[64]: array([0.5488135 , 0.71518937, 0.60276338, 0.54488318, 0.4236548 , 0.64589411, 0.43758721, 0.891773 , 0.96366276, 0.38344152])
In [65]: pd.cut(a, bins=np.linspace(0, 1, 6), labels=[1,2,3,4,5])
Out[65]:
[3, 4, 4, 3, 3, 4, 3, 5, 5, 2]
Categories (5, int64): [1 < 2 < 3 < 4 < 5]
UPDATE: @EranMoshe has added an important point: "you might want to normalize your output before cutting it into categorical values".
Demo:
In [117]: a
Out[117]: array([0.6 , 0.8 , 0.85, 0.9 , 0.95, 0.97])
In [118]: pd.cut(a, bins=np.linspace(a.min(), a.max(), 6),
labels=[1,2,3,4,5], include_lowest=True)
Out[118]:
[1, 3, 4, 5, 5, 5]
Categories (5, int64): [1 < 2 < 3 < 4 < 5]
Assuming a classification problem with labels of either 1 or 0:
When calculating the AUC of the ROC curve, you sort the "events" by your model's prediction. At the top you'll most likely have a lot of 1's, and the farther you go down the sorted list, the more 0's you'll see.
Now let's say you want to determine the threshold for the score "5".
You can decide the relative % of 0's (false positives) you are willing to accept.
Given the following table:
item  label  score
1     1      0.99
2     1      0.92
3     0      0.89
4     1      0.88
5     1      0.74
6     0      0.66
7     0      0.64
8     0      0.59
9     1      0.55
If I want the user score "5" to have 0% false positives, I would set the threshold for "5" above 0.89.
If I can tolerate 10% false positives (one of the nine items), I would set the threshold above 0.66.
You can do the same for each threshold.
In my opinion this is 100% a business decision, and the smartest way to pick those thresholds is with your knowledge of the users.
If users expect "5" to be a perfect prediction of the class (a life-and-death situation), go with 0% false positives.
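A small sketch of that counting procedure, using the example table above (the 10% case corresponds to tolerating one of the nine items as a false positive):
import numpy as np

def threshold_for_fp_budget(scores, labels, fp_budget):
    # Scan predictions from the highest score down and return the lowest
    # cutoff that admits at most fp_budget items with true label 0.
    order = np.argsort(scores)[::-1]
    fp, thr = 0, 1.0
    for idx in order:
        if labels[idx] == 0:
            fp += 1
            if fp > fp_budget:
                return thr        # the cutoff stays just above the offending item
        thr = scores[idx]
    return thr

scores = np.array([0.99, 0.92, 0.89, 0.88, 0.74, 0.66, 0.64, 0.59, 0.55])
labels = np.array([1, 1, 0, 1, 1, 0, 0, 0, 1])
threshold_for_fp_budget(scores, labels, 0)   # 0.92 -> "above 0.89"
threshold_for_fp_budget(scores, labels, 1)   # 0.74 -> "above 0.66"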

How to convert Euclidean distance to range 0 and 1 like Cosine Similarity?

I want to map Euclidean distance to the range [0, 1], somewhat like the cosine similarity of vectors.
For instance
input  output
0      1.0
1      ~0.9
2      somewhere between 0.8 and 0.9
inf    0.0
I tried the formula 1/(1+d), but that falls away from 1.0 too quickly.
It seems that you want the fraction's denominator to grow more slowly (you have it as (1 + d) so far). There are various ways to handle this. For instance, try a lower power of d, such as
1 / (1 + d**(0.25))
... or an exponential decay in the denominator, such as
1 / (1.1 ** d)
... or using a trig function to temper your mapping, such as
1 - tanh(d)
Would something in one of these families work for you?
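A quick numerical comparison of the three candidates (note that 1 / 1.1**d gives roughly the 0.9 you wanted at d = 1):
import numpy as np

d = np.array([0.0, 1.0, 2.0, 10.0])
print(1 / (1 + d**0.25))    # lower power of d
print(1 / 1.1**d)           # exponential decay: 1.0, 0.909, 0.826, 0.386
print(1 - np.tanh(d))       # 1 at d=0, approaches 0 as d grows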

Finding standard deviation using only mean, min, max?

I want to find the standard deviation:
Minimum = 5
Mean = 24
Maximum = 84
Overall score = 90
I just want to find out my grade by using the standard deviation
Thanks,
A standard deviation cannot, in general, be computed from just the min, max, and mean. This can be demonstrated with two sets of scores that have the same min, max, and mean but different standard deviations:
1 2 4 5 : min=1 max=5 mean=3 stdev≈1.5811
1 3 3 5 : min=1 max=5 mean=3 stdev≈1.4142
Also, what does an 'overall score' of 90 mean if the maximum is 84?
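A quick numpy check of the two sets (np.std computes the population standard deviation by default):
import numpy as np

a = np.array([1, 2, 4, 5])
b = np.array([1, 3, 3, 5])
print(a.min(), a.max(), a.mean(), np.std(a))   # 1 5 3.0 1.5811...
print(b.min(), b.max(), b.mean(), np.std(b))   # 1 5 3.0 1.4142...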
I actually did a quick-and-dirty calculation of the type M Rad mentions. It involves assuming that the distribution is Gaussian, or "normal." This does not apply to your situation but might help others asking the same question. (You can tell your distribution is not normal because the distance from the mean to the max and from the mean to the min are very different.) Even if it were normal, you would need something you don't mention: the number of samples (the number of tests taken, in your case).
Readers who DO have a normal population can use the table below to get a rough estimate: divide the difference between your calculated mean and your measured minimum by the expected distance for your sample size. On average, the estimate will be off by the relative error given in the last column. (I have no idea whether it is biased; change the code below to calculate the error without the abs to get a guess.)
Num Samples Expected distance Expected error
10 1.55 0.25
20 1.88 0.20
30 2.05 0.18
40 2.16 0.17
50 2.26 0.15
60 2.33 0.15
70 2.38 0.14
80 2.43 0.14
90 2.47 0.13
100 2.52 0.13
This experiment shows that the "rule of thumb" of dividing the range by 4 to get the standard deviation is in general incorrect, even for normal populations. In my experiment it only holds for sample sizes between 20 and 40 (and then only loosely). This rule may have been what the OP was thinking about.
You can modify the following Python code to generate the table for different values (change max_sample_size), get more accuracy (change num_simulations), or remove the limitation to multiples of 10 (change the parameters to range in the for loop over idx):
#!/usr/bin/python
import random

# Return the distance of the minimum of samples from its mean
#
# Samples must have at least one entry
def min_dist_from_estd_mean(samples):
    total = 0
    sample_min = samples[0]
    for sample in samples:
        total += sample
        sample_min = min(sample, sample_min)
    estd_mean = total / len(samples)
    return estd_mean - sample_min  # Pos bec min cannot be greater than mean

num_simulations = 4095
max_sample_size = 100

# Calculate expected distances
sum_of_dists = [0] * (max_sample_size + 1)  # +1 so can index by sample size
for iternum in range(num_simulations):
    samples = [random.normalvariate(0, 1)]
    while len(samples) <= max_sample_size:
        sum_of_dists[len(samples)] += min_dist_from_estd_mean(samples)
        samples.append(random.normalvariate(0, 1))
expected_dist = [total / num_simulations for total in sum_of_dists]

# Calculate average error using that distance
sum_of_errors = [0] * len(sum_of_dists)
for iternum in range(num_simulations):
    samples = [random.normalvariate(0, 1)]
    while len(samples) <= max_sample_size:
        ave_dist = expected_dist[len(samples)]
        if ave_dist > 0:
            sum_of_errors[len(samples)] += \
                abs(1 - (min_dist_from_estd_mean(samples) / ave_dist))
        samples.append(random.normalvariate(0, 1))
expected_error = [total / num_simulations for total in sum_of_errors]

cols = " {0:>15}{1:>20}{2:>20}"
print(cols.format("Num Samples", "Expected distance", "Expected error"))
cols = " {0:>15}{1:>20.2f}{2:>20.2f}"
for idx in range(10, len(expected_dist), 10):
    print(cols.format(idx, expected_dist[idx], expected_error[idx]))
You can obtain an estimate of the geometric mean, sometimes called the geometric mean of the extremes or GME, from the min and the max by calculating GME = $\sqrt{Min \cdot Max}$. The SD can then be estimated from your arithmetic mean (AM) and the GME as:
$$SD = \frac{AM}{GME}\sqrt{AM^2 - GME^2}$$
This approach works well for log-normal distributions, or as long as the GME, GM, or median is smaller than the AM.
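Applied to the numbers in the question (min 5, mean 24, max 84), a small sketch:
import math

def sd_from_gme(minimum, mean, maximum):
    # SD estimate from the geometric mean of the extremes (GME).
    # Only valid when mean > GME, per the caveat above.
    gme = math.sqrt(minimum * maximum)
    return (mean / gme) * math.sqrt(mean**2 - gme**2)

print(sd_from_gme(5, 24, 84))   # GME = sqrt(420) ~ 20.5, estimate ~ 14.6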
In principle you can estimate the standard deviation from the mean/min/max and the number of elements in the sample. The min and max of a sample are, if you assume normality, random variables whose statistics follow from the mean, stddev, and number of samples. So given the latter, one can compute (after slogging through the math or running a bunch of Monte Carlo scripts) a confidence interval for the former (e.g., it is 80% probable that the stddev is between 20 and 40, or something like that).
That said, it probably isn't worth doing except in extreme situations.
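For completeness, a crude Monte Carlo sketch of that idea: simulate normal samples under candidate standard deviations, keep the ones whose min/mean/max land near the observed values, and read a rough interval off the survivors. The prior range, tolerance, and toy numbers below are arbitrary assumptions:
import numpy as np

def plausible_sds(obs_min, obs_mean, obs_max, n, trials=20000, tol=0.15):
    rng = np.random.default_rng(0)
    kept = []
    for _ in range(trials):
        sd = rng.uniform(0.1, 3.0) * (obs_max - obs_min) / 4   # candidate SD
        s = rng.normal(obs_mean, sd, n)
        if (abs(s.min() - obs_min) < tol * sd and
                abs(s.max() - obs_max) < tol * sd):
            kept.append(sd)
    return np.percentile(kept, [10, 90]) if kept else None

# Symmetric toy example (the question's own numbers are too skewed for a
# normal model, as noted above):
print(plausible_sds(-2.1, 0.0, 2.2, n=30))   # rough 80% interval for the SD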
