How to evaluate Markov Model accuracy

I have created the following Markov chain model, and I am struggling to prove mathematically whether my model works correctly or not.
Sequence: Start, state1, state2, state3, state3, state2, state1, state2, state1, end
States: start, state1, state2, state3, end
Distribution:
Start: 1
state1: 3
state2: 3
state3: 2
end: 1
Pairs of tokens:
(Start, state1): 1
(state1, state2): 2, (state1, end): 1
(state2, state3): 1, (state2, state1): 2
(state3, state3): 1, (state3, state2): 1
(end, 'None'): 1
Possible tokens to follow each key:
Start: [state1]
state1: [state2, end]
state3: [state3, state2]
state2: [state3, state1]
end: 'None'
Transition Matrix (rows = from, columns = to):
        Start   state1  state2  state3  end
Start   0       1       0       0       0
state1  0       0       0.66..  0       0.33..
state2  0       0.66..  0       0.33..  0
state3  0       0       0.5     0.5     0
end     0       0       0       0       0
I use MLE (maximum likelihood estimation) to calculate the transition matrix.
Question: How can I show mathematically whether my model works, e.g. by computing something like the mean squared error?
I have an idea: take the sequence above as ground truth and make suggestions using the matrix. At each step, I test whether the suggestion matches the proven sequence, and accumulate the errors (error_sum). My error would then be error = error_sum / all_steps.
Theoretically this should work, but what I am looking for is a mathematically well-founded method, so I can justify why it was a good idea to use. Can you give me any suggestions?
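For reference, here is a minimal Python sketch of the evaluation idea described above. The function names are mine, not from any library; the MLE fit is just normalized bigram counts, the per-step accuracy is the proposed error metric, and the average log-likelihood is added as the standard companion measure for an MLE-fitted model (ideally computed on a held-out sequence, not the training one):

import math
from collections import Counter, defaultdict

def fit_transitions(sequence):
    # MLE: transition probabilities are normalized bigram counts.
    pair_counts = Counter(zip(sequence, sequence[1:]))
    totals = Counter(sequence[:-1])
    probs = defaultdict(dict)
    for (a, b), c in pair_counts.items():
        probs[a][b] = c / totals[a]
    return probs

def evaluate(probs, sequence):
    # At each step, predict the most likely next state and compare it
    # with the state that actually follows; also accumulate log-likelihood.
    hits, log_lik, steps = 0, 0.0, 0
    for a, b in zip(sequence, sequence[1:]):
        prediction = max(probs[a], key=probs[a].get)
        hits += (prediction == b)
        log_lik += math.log(probs[a].get(b, 1e-12))  # guard for unseen pairs
        steps += 1
    return hits / steps, log_lik / steps

seq = ['Start', 'state1', 'state2', 'state3', 'state3',
       'state2', 'state1', 'state2', 'state1', 'end']
accuracy, avg_ll = evaluate(fit_transitions(seq), seq)
print(accuracy, avg_ll)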


How does CountVectorizer deal with new words in test data?

I understand how CountVectorizer works in general. It takes word tokens and creates a sparse count matrix of documents (rows) and token counts (columns), that we can use for ML modeling.
However, how does it deal with new words that can presumably show up in test data, that weren't in the training data? Does it just ignore them?
Also, from a modeling standpoint, should the assumption be that if certain words are so rare that they didn't show up in the training data at all, then they aren't relevant for any modeling you might perform?
I am assuming you are referring to the scikit-learn CountVectorizer; I don't know of any other myself.
Yes, when new documents are encoded, words that are not part of the vocabulary (created from the training data) are ignored by the count vectorizer.
Example of creating vocabulary: (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> corpus = [
... 'This is the first document.',
... 'This document is the second document.',
... 'And this is the third one.',
... 'Is this the first document?',
... ]
>>> vectorizer = CountVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> print(vectorizer.get_feature_names())  # get_feature_names_out() in newer scikit-learn versions
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
>>> print(X.toarray())
[[0 1 1 1 0 0 1 0 1]
[0 2 0 1 0 1 1 0 1]
[1 0 0 1 1 0 1 1 1]
[0 1 1 1 0 0 1 0 1]]
Now, call transform on a new document and you can see that the out-of-vocabulary words are ignored:
>>> print(vectorizer.transform(['not in any of the document second second']).toarray())
[[0 1 0 0 0 2 1 0 0]]
With respect to rare words that are not part of the training data, I would agree with your statement that they are not significant for modeling, since we would like to believe that the words most relevant for building a good, generalizable model are already part of the training data.

Data Science: Scoring methodology

I am looking for any methodology to assign a risk score to an individual based on certain events. I am looking to have a 0-100 scale with an exponential assignment. For example, for one event a day the score may rise to 25, for 2 it may rise to 50-60 and for 3-4 events a day the score for the day would be 100.
I tried to Google it, but since I am not aware of the right terminology, I keep landing on random topics. :(
Is there any mathematical terminology for this kind of scoring system? What are the most common methods you might know?
P.S.: Expert/experienced data scientist advice highly appreciated ;)
I would start by writing down some requirements:
0 events trigger a score of 0.
Non-edge event counts are where the scores between 0 and 100 live, up to a threshold.
Any event count past the threshold scores 100.
If so, here's a (very) simplified example:
Stage Data:
userid <- c("a1","a2","a3","a4","a11","a12","a13","a14","u2","wtf42","ub40","foo","bar","baz","blue","bop","bob","boop","beep","mee","r")
events <- c(0,0,0,0,0,0,0,0,0,0,0,0,1,2,3,2,3,6,122,13,1)
df1 <- data.frame(userid,events)
Optional: normalize events to lie in (1, 2].
This can be helpful because of logarithmic properties. (Otherwise, given the assumed function score = events^exponent, as in this example, 1 event will always yield a score of 1.) Normalization lets you control sensitivity, but it must be done carefully, as we are dealing with exponents and logarithms. I am not using normalization in the example:
normevents <- (events-mean(events))/((max(events)-min(events))*2)+1.5
Set the quantile threshold for max score:
MaxScoreThreshold <- 0.25
Get the non-edge quantiles of the events distribution:
qts <- quantile(events[events>min(events) & events<max(events)], c(seq(from=0, to=100,by=5)/100))
Find the event quantity that gives a score of 100, using the set threshold:
MaxScoreEvents <- quantile(qts,MaxScoreThreshold)
Find the exponent of your exponential function, given that:
Score = events ^ exponent
events is a natural number (an integer > 0; we took care of that by omitting the edges)
exponent > 1
Exponent calculation:
exponent <- log(100)/log(MaxScoreEvents)
(Any event count above MaxScoreEvents then maps past 100 and is capped, which is why 2 events already saturate the score in the result below.)
Generate the scores:
df1$Score <- apply(as.matrix(events^exponent), 1, FUN = function(x) {
  # Clamp the raw exponential score into [0, 100] and round up.
  if (x > 100) {
    result <- 100
  } else if (x < 0) {
    result <- 0
  } else {
    result <- x
  }
  return(ceiling(result))
})
df1
Resulting Data Frame:
userid events Score
1 a1 0 0
2 a2 0 0
3 a3 0 0
4 a4 0 0
5 a11 0 0
6 a12 0 0
7 a13 0 0
8 a14 0 0
9 u2 0 0
10 wtf42 0 0
11 ub40 0 0
12 foo 0 0
13 bar 1 1
14 baz 2 100
15 blue 3 100
16 bop 2 100
17 bob 3 100
18 boop 6 100
19 beep 122 100
20 mee 13 100
21 r 1 1
Under the assumption that your data is larger and has more event categories, the score won't snap to 100 so quickly; it is also a function of the threshold.
I would rely more on the data to define the parameters, the threshold in this case.
If you have prior data on whether users really did whatever it is your score assesses, you can perform supervised learning: set the threshold wherever the ratio exceeds 50%, for example. Or, if the graph of events against probability of 'success' looks like the cumulative distribution function of a normal distribution, I'd set the threshold where it first hits 45 degrees.
You could also use logistic regression if you have prior data, but instead of feeding the regression output through the usual logit decision, use the number itself as your score. You can normalize it to lie within 0-100.
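A minimal sketch of that last idea, assuming scikit-learn; the event counts and the binary outcome labels here are made up for illustration:

import numpy as np
from sklearn.linear_model import LogisticRegression

events = np.array([0, 0, 1, 2, 3, 6, 13, 122]).reshape(-1, 1)  # event counts per user
outcome = np.array([0, 0, 0, 1, 1, 1, 1, 1])                   # hypothetical prior labels

model = LogisticRegression().fit(events, outcome)
# Instead of the hard 0/1 classification, use the predicted probability
# of the positive class, scaled to 0-100, as the risk score.
score = 100 * model.predict_proba(events)[:, 1]
print(np.round(score))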
It's not always easy to write a data science question. I made many assumptions as to what you are looking for; I hope this is the general direction.

Neural Network Character Recognition

Suppose I'm trying to create a neural network to recognize characters on a simple 5x5 grid of pixels. I have only 6 possible characters (symbols): -, X, +, /, \, |.
At the moment I have a feedforward neural network with 25 input nodes, 6 hidden nodes and a single output node (between 0 and 1, sigmoid).
The output corresponds to a symbol, such as 'X' = 0.125, '+' = 0.275, '/' = 0.425, etc.
Whatever the network outputs at test time is mapped to whichever character is closest numerically, i.e. 0.13 = 'X'.
On input, 0.1 means the pixel is not shaded at all and 0.9 means fully shaded.
After training the network on the 6 symbols, I test it by adding some noise.
Unfortunately, if I add a tiny bit of noise to '/', the network thinks it's '\'.
I thought maybe the ordering of the 6 symbols (i.e. which numeric representation each corresponds to) might make a difference.
Maybe the number of hidden nodes is causing this problem.
Maybe my general concept of mapping characters to numbers is causing the problem.
Any help to make the network more accurate would be hugely appreciated.
The output encoding is the biggest problem. You would do better to use a one-hot encoding for the output, so that you have six output nodes.
For example,
- 1 0 0 0 0 0
X 0 1 0 0 0 0
+ 0 0 1 0 0 0
/ 0 0 0 1 0 0
\ 0 0 0 0 1 0
| 0 0 0 0 0 1
This is much easier for the neural network to learn. At prediction time, pick the node with the highest value as your prediction. For example, if the output nodes produce the values below:
- 0.01
X 0.5
+ 0.2
/ 0.1
\ 0.2
| 0.1
Predict the character as "X".
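As a minimal NumPy sketch of that decoding step (the symbol order follows the table above):

import numpy as np

symbols = ['-', 'X', '+', '/', '\\', '|']

# One-hot training targets: row i is the target vector for symbols[i].
targets = np.eye(len(symbols))

# Example network output at the six output nodes (the values above):
output = np.array([0.01, 0.5, 0.2, 0.1, 0.2, 0.1])

# Predict the symbol whose output node has the highest activation.
print(symbols[int(np.argmax(output))])  # -> 'X'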

Handling features not correlated with output prediction?

I do regression analysis with multiple features (20-23 of them). For now, I check each feature's correlation with the output variable. Some features show a correlation coefficient close to 1 or -1 (highly correlated); some show a coefficient near 0. My question is: do I have to remove a feature if its correlation coefficient is close to 0? Or can I keep it, with the only downside being that it will have little or no noticeable effect on the regression model? Or is removing such features obligatory?
In short:
High (absolute) correlation between a feature and the output implies that the feature should be valuable as a predictor.
Lack of correlation between a feature and the output implies nothing.
More details:
Pairwise correlation only shows you how one variable relates to another; it says nothing about how a feature works in combination with the others. So if your model is not trivial, you should not drop variables just because they are not correlated with the output. I will give you an example that shows why.
Consider the following sample: we have two features (X, Y) and one binary output value Z (1 or 0).
X Y Z
1 1 1
1 2 0
1 3 0
2 1 0
2 2 1
2 3 0
3 1 0
3 2 0
3 3 1
Let us compute the correlations:
CORREL(X, Z) = 0
CORREL(Y, Z) = 0
So... should we drop both features? One of them? If we drop either variable, our problem becomes completely impossible to model! The "magic" lies in the fact that there is actually a "hidden" relation in the data:
|X-Y| (in the same row order as the table above): 0, 1, 2, 1, 0, 1, 2, 1, 0
And
CORREL(|X-Y|, Z) = -0.8528028654
Now this is a good predictor!
You can actually get a perfect regressor (interpolator) through
Z = 1 - sign(|X-Y|)
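A quick NumPy sketch verifying these numbers (np.corrcoef gives the Pearson correlation):

import numpy as np

X = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3])
Y = np.array([1, 2, 3, 1, 2, 3, 1, 2, 3])
Z = np.array([1, 0, 0, 0, 1, 0, 0, 0, 1])

print(np.corrcoef(X, Z)[0, 1])              # ~0
print(np.corrcoef(Y, Z)[0, 1])              # ~0
print(np.corrcoef(np.abs(X - Y), Z)[0, 1])  # about -0.853

# The hidden relation recovers Z exactly:
print(1 - np.sign(np.abs(X - Y)))           # equals Z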

Using LIBLINEAR in transition-based dependency parsing

I am going to do some work on transition-based dependency parsing using LIBLINEAR, but I am confused about how to use it. Details follow.
I set up 3 feature templates for my training and testing processes of transition-based dependency parsing:
1. the word at the top of the stack
2. the word at the front of the queue
3. information from the partial tree built in the steps so far
And the feature defined in LIBLINEAR is:
FeatureNode(int index, double value)
Some examples like:
LABEL ATTR1 ATTR2 ATTR3 ATTR4 ATTR5
----- ----- ----- ----- ----- -----
1 0 0.1 0.2 0 0
2 0 0.1 0.3 -1.2 0
1 0.4 0 0 0 0
2 0 0.1 0 1.4 0.5
3 -0.1 -0.2 0.1 1.1 0.1
But I want to define my features like this (for the sentence 'I love you') at some stage:
feature template 1: the word is 'love'
feature template 2: the word is 'you'
feature template 3: the information is - the left son of 'love' is 'I'
Does it mean I must define features with LIBLINEAR like: -------FORMAT 1
(indexes in vocabulary: 0-I, 1-love, 2-you)
LABEL ATTR1(template1) ATTR2(template2) ATTR3(template3)
----- ----- ----- -----
SHIFT 1 2 0
(or LEFT-arc,
RIGHT-arc)
But having gone through what others have written, it seems I should define features in binary, so I have to define a word vector like:
('I', 'love', 'you'); when 'you' appears, for example, the vector will be (0, 0, 1)
So the features in LIBLINEAR may be: -------FORMAT 2
LABEL ATTR1('I') ATTR2('love') ATTR3('you')
----- ----- ----- -----
SHIFT 0 1 0 ->denoting the feature template 1
(or LEFT-arc,
RIGHT-arc)
SHIFT 0 0 1 ->denoting the feature template 2
(or LEFT-arc,
RIGHT-arc)
SHIFT 1 0 0 ->denoting the feature template 3
(or LEFT-arc,
RIGHT-arc)
Which is correct, FORMAT 1 or FORMAT 2?
Is there something I have misunderstood?
Basically you have a feature vector of the form:
LABEL RESULT_OF_FEATURE_TEMPLATE_1 RESULT_OF_FEATURE_TEMPLATE_2 RESULT_OF_FEATURE_TEMPLATE_3
LIBLINEAR or LIBSVM expects you to translate it into an integer representation:
1 1:1 2:1 3:1
Nowadays, depending on the language you use, there are lots of packages/libraries that will translate the string vector into libsvm format automatically, without you having to know the details.
However, if for whatever reason you want to do it yourself, the easiest approach is to maintain two mappings: one for labels ('shift' -> 1, 'left-arc' -> 2, 'right-arc' -> 3, 'reduce' -> 4) and one for feature-template results ('f1=I' -> 1, 'f2=love' -> 2, 'f3=you' -> 3). Basically, every time your algorithm applies a feature template, you check whether the result is already in the mapping, and if not you add it with a new index.
Remember that LIBLINEAR and LIBSVM expect the feature indices within each vector to be sorted in ascending order.
During processing you would first apply your feature templates to the current parser state, then translate the resulting strings into the libsvm/liblinear integer representation and sort the indices in ascending order.
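A minimal sketch of that bookkeeping in Python (the dictionary and function names here are mine, purely illustrative):

label_index = {}    # e.g. 'shift' -> 1, 'left-arc' -> 2, ...
feature_index = {}  # e.g. 'f1=love' -> 1, 'f2=you' -> 2, ...

def index_of(mapping, key):
    # Assign the next free index (1-based) the first time a key is seen.
    if key not in mapping:
        mapping[key] = len(mapping) + 1
    return mapping[key]

def to_libsvm_line(label, features):
    # features are strings like 'f1=love'; libsvm wants 'label idx:val'
    # with the indices sorted in ascending order.
    idxs = sorted(index_of(feature_index, f) for f in features)
    return ' '.join([str(index_of(label_index, label))] +
                    ['%d:1' % i for i in idxs])

print(to_libsvm_line('shift', ['f1=love', 'f2=you', 'f3=left-son(love)=I']))
# -> 1 1:1 2:1 3:1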
