Data Science: Scoring methodology - machine-learning

I am looking for any methodology to assign a risk score to an individual based on certain events. I am looking to have a 0-100 scale with an exponential assignment. For example, for one event a day the score may rise to 25, for 2 it may rise to 50-60 and for 3-4 events a day the score for the day would be 100.
I tried to Google it but since I am not aware of the right terminology, I am landing up on random topics. :(
Is there any mathematical terminology for this kind of scoring system? what are the most common methods you might know?
P.S.: Expert/experience data scientist advice highly appreciated ;)

I would start by writing some qualifications:
0 events trigger a score of 0.
Non edge event count observations are where the score – 100-threshold would live.
Any score after the threshold will be 100.
If so, here's a (very) simplified example:
Stage Data:
userid <- c("a1","a2","a3","a4","a11","a12","a13","a14","u2","wtf42","ub40","foo","bar","baz","blue","bop","bob","boop","beep","mee","r")
events <- c(0,0,0,0,0,0,0,0,0,0,0,0,1,2,3,2,3,6,122,13,1)
df1 <- data.frame(userid,events)
Optional: Normalize events to be in (1,2].
This might be helpful for logarithmic properties. (Otherwise, given the assumed function, score=events^exp, as in this example, 1 event will always yield a score of 1) This will allow you to control sensitivity, but it must be done right as we are dealing with exponents and logarithms. I am not using normalization in the example:
normevents <- (events-mean(events))/((max(events)-min(events))*2)+1.5
Set the quantile threshold for max score:
MaxScoreThreshold <- 0.25
Get the non edge quintiles of the events distribution:
qts <- quantile(events[events>min(events) & events<max(events)], c(seq(from=0, to=100,by=5)/100))
Find the Events quantity that give a score of 100 using the set threshold.
MaxScoreEvents <- quantile(qts,MaxScoreThreshold)
Find the exponent of your exponential function
Given that:
Score = events ^ exponent
events is a Natural number - integer >0: We took care of it by
omitting the edges)
exponent > 1
Exponent Calculation:
exponent <- log(100)/log(MaxScoreEvents)
Generate the scores:
df1$Score <- apply(as.matrix(events^exponent),1,FUN = function(x) {
if (x > 100) {
result <- 100
}
else if (x < 0) {
result <- 0
}
else {
result <- x
}
return(ceiling(result))
})
df1
Resulting Data Frame:
userid events Score
1 a1 0 0
2 a2 0 0
3 a3 0 0
4 a4 0 0
5 a11 0 0
6 a12 0 0
7 a13 0 0
8 a14 0 0
9 u2 0 0
10 wtf42 0 0
11 ub40 0 0
12 foo 0 0
13 bar 1 1
14 baz 2 100
15 blue 3 100
16 bop 2 100
17 bob 3 100
18 boop 6 100
19 beep 122 100
20 mee 13 100
21 r 1 1
Under the assumption that your data is larger and has more event categories, the score won't snap to 100 so quickly, it is also a function of the threshold.
I would rely more on the data to define the parameters, threshold in this case.
If you have prior data as to what users really did whatever it is your score assess you can perform supervised learning, set the threshold # wherever the ratio is over 50% for example. Or If the graph of events to probability of ‘success’ looks like the cumulative probability function of a normal distribution, I’d set threshold # wherever it hits 45 degrees (For the first time).
You could also use logistic regression if you have prior data but instead of a Logit function ingesting the output of regression, use the number as your score. You can normalize it to be within 0-100.
It’s not always easy to write a Data Science question. I made many assumptions as to what you are looking for, hope this is the general direction.

Related

Padding time-series subsequences for LSTM-RNN training

I have a dataset of time series that I use as input to an LSTM-RNN for action anticipation. The time series comprises a time of 5 seconds at 30 fps (i.e. 150 data points), and the data represents the position/movement of facial features.
I sample additional sub-sequences of smaller length from my dataset in order to add redundancy in the dataset and reduce overfitting. In this case I know the starting and ending frame of the sub-sequences.
In order to train the model in batches, all time series need to have the same length, and according to many papers in the literature padding should not affect the performance of the network.
Example:
Original sequence:
1 2 3 4 5 6 7 8 9 10
Subsequences:
4 5 6 7
8 9 10
2 3 4 5 6
considering that my network is trying to anticipate an action (meaning that as soon as P(action) > threshold as it goes from t = 0 to T = tmax, it will predict that action) will it matter where the padding goes?
Option 1: Zeros go to substitute original values
0 0 0 4 5 6 7 0 0 0
0 0 0 0 0 0 0 8 9 10
0 2 3 4 5 6 0 0 0 0
Option 2: all zeros at the end
4 5 6 7 0 0 0 0 0 0
8 9 10 0 0 0 0 0 0 0
2 3 4 5 0 0 0 0 0 0
Moreover, some of the time series are missing a number of frames, but it is not known which ones they are - meaning that if we only have 60 frames, we don't know whether they are taken from 0 to 2 seconds, from 1 to 3s, etc. These need to be padded before the subsequences are even taken. What is the best practice for padding in this case?
Thank you in advance.
The most powerful attribute of LSTMs and RNNs in general is that their parameters are shared along the time frames(Parameters recur over time frames) but the parameter sharing relies upon the assumption that the same parameters can be used for different time steps i.e. the relationship between the previous time step and the next time step does not depend on t as explained here in page 388, 2nd paragraph.
In short, padding zeros at the end, theoretically should not change the accuracy of the model. I used the adverb theoretically because at each time step LSTM's decision depends on its cell state among other factors and this cell state is kind of a short summary of the past frames. As far as I understood, that past frames may be missing in your case. I think what you have here is a little trade-off.
I would rather pad zeros at the end because it doesn't completely conflict with the underlying assumption of RNNs and it's more convenient to implement and keep track of.
On the implementation side, I know tensorflow calculates the loss function once you give it the sequences and the actual sequence size of each sample(e.g. for 4 5 6 7 0 0 0 0 0 0 you also need to give it the actual size which is 4 here) assuming you're implementing the option 2. I don't know whether there is an implementation for option 1, though.
Better go for padding zeroes in the beginning, as this paper suggests Effects of padding on LSTMs and CNNs,
Though post padding model peaked it’s efficiency at 6 epochs and started to overfit after that, it’s accuracy is way less than pre-padding.
Check table 1, where the accuracy of pre-padding(padding zeroes in the beginning) is around 80%, but for post-padding(padding zeroes in the end), it is only around 50%
In case you have sequences of variable length, pytorch provides a utility function torch.nn.utils.rnn.pack_padded_sequence. The general workflow with this function is
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
embedding = nn.Embedding(4, 5)
rnn = nn.GRU(5, 5)
sequences = torch.tensor([[1,2,0], [3,0,0], [2,1,3]])
lens = [2, 1, 3] # indicating the actual length of each sequence
embeddings = embedding(sequences)
packed_seq = pack_padded_sequence(embeddings, lens, batch_first=True, enforce_sorted=False)
e, hn = rnn(packed_seq)
One can collect the embedding of each token by
e = pad_packed_sequence(e, batch_first=True)
Using this function is better than padding by yourself, because torch will limit RNN to only inspecting the actual sequence and stop before the padded token.

Neural Network Character Recognition

Suppose I'm trying to create a Neural Network to recognize characters on a simple 5x5 grid of pixels. I have only 6 possible characters (symbols) - X,+,/,\,|
At the moment I have a Feedforward Neural Network - with 25 input nodes, 6 hidden nodes and a single output node (between 0 and 1 - sigmoid).
The output corresponds to a symbol. Such as 'X' = 0.125, '+' = 0.275, '/' = 0.425 etc.
Whatever the output of the network (on testing) is, corresponds to whatever character is closest numerically. i.e - 0.13 = 'X'
On Input, 0.1 means the pixel is not shaded at all, 0.9 means fully shaded.
After training the network on the 6 symbols I test it by adding some noise.
Unfortunately, if I add a tiny bit of noise to '/', the network thinks it's '\'.
I thought maybe the ordering of the 6 symbols (i.,e - what numeric representation they correspond to) might make a difference.
Maybe the number of hidden nodes is causing this problem.
Maybe my general concept of mapping characters to numbers is causing the problem.
Any help would be hugely appreciated to make the network more accurate.
The output encoding is the biggest problem. You should better use a one-hot encoding for the output so that you have six output nodes.
For example,
- 1 0 0 0 0 0
X 0 1 0 0 0 0
+ 0 0 1 0 0 0
/ 0 0 0 1 0 0
\ 0 0 0 0 1 0
| 0 0 0 0 0 1
This is much easier for the neural network to learn. At prediction time, pick the node that has the highest value as your prediction. For example, if you have below output values at each output node:
- 0.01
X 0.5
+ 0.2
/ 0.1
\ 0.2
| 0.1
Predict the character as "X".

Handling features not correlated with output prediction?

I do regression analysis with multiple features. Number of features is 20-23. For now, I check each feature correlation with output variable. Some features show correlation coefficient close to 1 or -1 (highly correlated). Some features show correlation coefficient near 0. My question is: do I have to remove this feature if it has close to 0 correlation coefficient? Or I can keep it and the only problem is that this feature will no make some noticeable effect to regression model or will have faint affect on it. Or removing that kind of features is obligatory?
In short
High (absolute) correlation between a feature and output implies that this feature should be valuable as predictor
Lack of correlation between feature and output implies nothing
More details
Pair-wise correlation only shows you how one thing affects the other, it says completely nothing about how good is this feature connected with others. So if your model is not trivial then you should not drop variables because they are not correlated with output). I will give you the example which should show you why.
Consider following sample, we have 2 features (X, Y), and one output value (Z, say red is 1, black is 0)
X Y Z
1 1 1
1 2 0
1 3 0
2 1 0
2 2 1
2 3 0
3 1 0
3 2 0
3 3 1
Let us compute the correlations:
CORREL(X, Z) = 0
CORREL(Y, Z) = 0
So... we should drop all values? One of them? If we drop any variable - our prolem becomes completely impossible to model! "magic" lies in the fact that there is actually a "hidden" relation in the data.
|X-Y|
0
1
2
1
0
1
2
1
0
And
CORREL(|X-Y|, Z) = -0.8528028654
Now this is a good predictor!
You can actually get a perfect regressor (interpolator) through
Z = 1 - sign(|X-Y|)

Difference-in-difference analysis in SPSS

I am trying to compare means of the two groups 'single mothers with one child' and 'single mothers with more than one child' before and after the reform of the EITC system in 1993.
Through the procedure T-test in SPSS, I can get the difference between groups before and after the reform. But how do I get the difference of the difference (I still want standard errors)?
I found these methods for STATA and R (http://thetarzan.wordpress.com/2011/06/20/differences-in-differences-estimation-in-r-and-stata/), but I can't seem to figure it out in SPSS.
Hope someone will be able to help.
All the best,
Anne
This can be done with the GENLIN procedure. Here's some random data I generated to show how:
data list list /after oneChild value.
begin data.
0 1 12
0 1 12
0 1 11
0 1 13
0 1 11
1 1 10
1 1 9
1 1 8
1 1 9
1 1 7
0 0 16
0 0 16
0 0 18
0 0 15
0 0 17
1 0 6
1 0 6
1 0 5
1 0 5
1 0 4
end data.
dataset name exampleData WINDOW=front.
EXECUTE.
value labels after 0 'before' 1 'after'.
value labels oneChild 0 '>1 child' 1 '1 child'.
The mean for the groups (in order, before I truncated to integers) are 17, 6, 12, and 9 respectively. So our GENLIN procedure should generate values of -11 (the after-before difference in the >1 child group), -5 (the difference of 1 child - >1 child), and 8 (the child difference of the after-before differences).
To graph the data, just so you can see what we're expecting:
* Chart Builder.
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=after value oneChild MISSING=LISTWISE REPORTMISSING=NO
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: after=col(source(s), name("after"), unit.category())
DATA: value=col(source(s), name("value"))
DATA: oneChild=col(source(s), name("oneChild"), unit.category())
GUIDE: axis(dim(2), label("value"))
GUIDE: legend(aesthetic(aesthetic.color.interior), label(""))
SCALE: linear(dim(2), include(0))
ELEMENT: line(position(smooth.linear(after*value)), color.interior(oneChild))
ELEMENT: point.dodge.symmetric(position(after*value), color.interior(oneChild))
END GPL.
Now, for the GENLIN:
* Generalized Linear Models.
GENLIN value BY after oneChild (ORDER=DESCENDING)
/MODEL after oneChild after*oneChild INTERCEPT=YES
DISTRIBUTION=NORMAL LINK=IDENTITY
/CRITERIA SCALE=MLE COVB=MODEL PCONVERGE=1E-006(ABSOLUTE) SINGULAR=1E-012 ANALYSISTYPE=3(WALD)
CILEVEL=95 CITYPE=WALD LIKELIHOOD=FULL
/MISSING CLASSMISSING=EXCLUDE
/PRINT CPS DESCRIPTIVES MODELINFO FIT SUMMARY SOLUTION.
The results table shows just what we expect.
The >1 child group is 12.3 - 10.1 lower after vs. before. This 95% CI contains the "real" value of 11
The before difference between >1 children and 1 child is 5.7 - 3.5, containing the real value of 5
The difference-of-differences is 9.6 - 6.4, containing the real value of (17-6) - (12-9) = 8
Std. errors, p values, and the other hypothesis testing values are all reported as well. Hope that helps.
EDIT: this can be done with less "complicated" syntax by computing the interaction term yourself and doing simple linear regression:
compute interaction = after*onechild.
execute.
REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF OUTS CI(95) R ANOVA
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT value
/METHOD=ENTER after oneChild interaction.
Note that the resulting standard errors and confidence intervals are actually different from the previous method. I don't know enough about SPSS's GENLIN and REGRESSION procedures to tell you why that's the case. In this contrived example, the conclusion you'd draw from your data would be approximately the same. In real life, the data aren't likely to be this clean, so I don't know which method is "better".
General Linear model, i take it as a 'ANOVA' model.
So use the related module in SPSS's Analyze menu.
After T-test, you need to check the sigma equality of each group .
Regarding the first answer above:
* Note that GENLIN uses maximum likelihood estimation (MLE) whereas REGRESSION
* uses ordinary least squares (OLS). Therefore, GENLIN reports z- and Chi-square tests
* where REGRESSION reports t- and F-tests. Rather than using GENLIN, use UNIANOVA
* to get the same results as REGRESSION, but without the need to compute your own
* product term.
UNIANOVA value BY after oneChild
/PLOT=PROFILE(after*oneChild)
/PLOT=PROFILE(oneChild*after)
/PRINT PARAMETER
/EMMEANS=TABLES(after*oneChild) COMPARE(after)
/EMMEANS=TABLES(after*oneChild) COMPARE(oneChild)
/DESIGN=after oneChild after*oneChild.
HTH.

Finding standard deviation using only mean, min, max?

I want to find the standard deviation:
Minimum = 5
Mean = 24
Maximum = 84
Overall score = 90
I just want to find out my grade by using the standard deviation
Thanks,
A standard deviation cannot in general be computed from just the min, max, and mean. This can be demonstrated with two sets of scores that have the same min, and max, and mean but different standard deviations:
1 2 4 5 : min=1 max=5 mean=3 stdev≈1.5811
1 3 3 5 : min=1 max=5 mean=3 stdev≈0.7071
Also, what does an 'overall score' of 90 mean if the maximum is 84?
I actually did a quick-and-dirty calculation of the type M Rad mentions. It involves assuming that the distribution is Gaussian or "normal." This does not apply to your situation but might help others asking the same question. (You can tell your distribution is not normal because the distance from mean to max and mean to min is not close). Even if it were normal, you would need something you don't mention: the number of samples (number of tests taken in your case).
Those readers who DO have a normal population can use the table below to give a rough estimate by dividing the difference of your measured minimum and your calculated mean by the expected value for your sample size. On average, it will be off by the given number of standard deviations. (I have no idea whether it is biased - change the code below and calculate the error without the abs to get a guess.)
Num Samples Expected distance Expected error
10 1.55 0.25
20 1.88 0.20
30 2.05 0.18
40 2.16 0.17
50 2.26 0.15
60 2.33 0.15
70 2.38 0.14
80 2.43 0.14
90 2.47 0.13
100 2.52 0.13
This experiment shows that the "rule of thumb" of dividing the range by 4 to get the standard deviation is in general incorrect -- even for normal populations. In my experiment it only holds for sample sizes between 20 and 40 (and then loosely). This rule may have been what the OP was thinking about.
You can modify the following python code to generate the table for different values (change max_sample_size) or more accuracy (change num_simulations) or get rid of the limitation to multiples of 10 (change the parameters to xrange in the for loop for idx)
#!/usr/bin/python
import random
# Return the distance of the minimum of samples from its mean
#
# Samples must have at least one entry
def min_dist_from_estd_mean(samples):
total = 0
sample_min = samples[0]
for sample in samples:
total += sample
sample_min = min(sample, sample_min)
estd_mean = total / len(samples)
return estd_mean - sample_min # Pos bec min cannot be greater than mean
num_simulations = 4095
max_sample_size = 100
# Calculate expected distances
sum_of_dists=[0]*(max_sample_size+1) # +1 so can index by sample size
for iternum in xrange(num_simulations):
samples=[random.normalvariate(0,1)]
while len(samples) <= max_sample_size:
sum_of_dists[len(samples)] += min_dist_from_estd_mean(samples)
samples.append(random.normalvariate(0,1))
expected_dist = [total/num_simulations for total in sum_of_dists]
# Calculate average error using that distance
sum_of_errors=[0]*len(sum_of_dists)
for iternum in xrange(num_simulations):
samples=[random.normalvariate(0,1)]
while len(samples) <= max_sample_size:
ave_dist = expected_dist[len(samples)]
if ave_dist > 0:
sum_of_errors[len(samples)] += \
abs(1 - (min_dist_from_estd_mean(samples)/ave_dist))
samples.append(random.normalvariate(0,1))
expected_error = [total/num_simulations for total in sum_of_errors]
cols=" {0:>15}{1:>20}{2:>20}"
print(cols.format("Num Samples","Expected distance","Expected error"))
cols=" {0:>15}{1:>20.2f}{2:>20.2f}"
for idx in xrange(10,len(expected_dist),10):
print(cols.format(idx, expected_dist[idx], expected_error[idx]))
Yo can obtain an estimate of the geometric mean, sometimes called the geometric mean of the extremes or GME, using the Min and the Max by calculating the GME= $\sqrt{ Min*Max }$. The SD can be then calculated using your arithmetic mean (AM) and the GME as:
SD= $$\frac{AM}{GME} * \sqrt{(AM)^2-(GME)^2 }$$
This approach works well for log-normal distributions or as long as the GME, GM or Median is smaller than the AM.
In principle you can make an estimate of standard deviation from the mean/min/max and the number of elements in the sample. The min and max of a sample are, if you assume normality, random variables whose statistics follow from mean/stddev/number of samples. So given the latter, one can compute (after slogging through the math or running a bunch of monte carlo scripts) a confidence interval for the former (like it is 80% probable that the stddev is between 20 and 40 or something like that).
That said, it probably isn't worth doing except in extreme situations.

Resources