How does CountVectorizer deal with new words in test data? - machine-learning

I understand how CountVectorizer works in general. It takes word tokens and creates a sparse count matrix of documents (rows) and token counts (columns), that we can use for ML modeling.
However, how does it deal with new words that can presumably show up in test data, that weren't in the training data? Does it just ignore them?
Also, from a modeling standpoint, should the assumption be that if certain words are so rare that they didn't show up in the training data at all, and that they aren't relevant for any modeling you might perform?

I am assuming you are referring to the scikit-learn CountVectorizer. Not that I know if any other myself.
Yes, when new documents are encoded, words that are not part of the vocabulary(created from the training data) are ignored by the count vectorizer.
Example of creating vocabulary: (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> corpus = [
... 'This is the first document.',
... 'This document is the second document.',
... 'And this is the third one.',
... 'Is this the first document?',
... ]
>>> vectorizer = CountVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> print(vectorizer.get_feature_names())
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
>>> print(X.toarray())
[[0 1 1 1 0 0 1 0 1]
[0 2 0 1 0 1 1 0 1]
[1 0 0 1 1 0 1 1 1]
[0 1 1 1 0 0 1 0 1]]
Now, use transform on new document and you can see that the Out of vocabulary words are ignored:
>>> print(vectorizer.transform(['not in any of the document second second']).toarray())
[[0 1 0 0 0 2 1 0 0]]
With respect to the rare words that are not part of the training data, I would agree to your statement that it is not significant for modeling since we would want to believe that the words that are most relevant to create and generalize a good model are already part of the training data.

Related

TfidfVectorizer returns empty elements

I am currently doing sentiment analysis project. I fit the vectorizer with my train data in dataframe format. Then I transform the test data with the same vectorizer but it returns nothing for me. I did check the TfidfVectorizer.get_feature_names() and the desired transform word already exist inside the features. What is wrong with my vectorizer?
Code:
vectorizer = TfidfVectorizer(analyzer=lambda x: x)
x = vectorizer.fit_transform(data['clean_text'])
print(vectorizer.get_feature_names()[9427])
# output sad
print(vectorizer.transform(["sad"]))
# empty result
print(vectorizer.transform(["sad"]).toarray())
# return a whole 0 array
Sample data format (dataframe)
sentiment clean_text
0 0 [respond, go]
1 1 [sooo, sad]
2 1 [bulli]
3 1 [leav, alon]
4 1 [cry]

Analyzing data in SPSS using ROC Curve For categorical variables (nominal)

I use SPSS v25 to build ROC.
I have DataSet with the following data:
Case# Dosage Result
1 DosagA healthy
2 DosagA sick
3 DosagB sick
4 DosageC healthy
....
To analyse Using ROC, I encoded Result as:
Healty =1, sick =0
Case# Dosage Result
1 DosagA 1
2 DosagA 0
2 DosagB 0
4 DosageC 1
....
When trying to build ROC with: Test Variable= result, State variable= Dosage
I get error message:
String Variable are not allowed in the list
Do I have to Encode Dosage with numeric values like:
Case# Dosage Result
1 1 1
2 1 0
2 2 0
4 3 1
...
Or
What is the best solution for ROC Curve using categorical variables (nominal)?
The prediction needs to be numeric. The ROC curve gives sensitivity vs. 1 - specificity for different cut points on a predictor, whether that's a single predictor or a score based on something like a logistic regression.

Encode a categorical feature with multiple categories per example

I am working on a dataset which has a feature that has multiple categories for a single example.
The feature looks like this:-
Feature
0 [Category1, Category2, Category2, Category4, Category5]
1 [Category11, Category20, Category133]
2 [Category2, Category9]
3 [Category1000, Category1200, Category2000]
4 [Category12]
The problem is similar to the this question posted:- Encode categorical features with multiple categories per example - sklearn
Now, I want to vectorize this feature. One solution is to use MultiLabelBinarizer as suggested in the answer of the above similar question. But, there are around 2000 categories, which results into a sparse and very high dimentional encoded data.
Is there any other encoding that can be used? Or any possible solution for this problem. Thanks.
Given an incredibly sparse array one could use a dimensionality reduction technique such as PCA (Principal component analysis) to reduce the feature space to the top k features that best describe the variance.
Assuming the MultiLabelBinarizered 2000 features = X
from sklearn.decomposition import PCA
k = 5
model = PCA(n_components = k, random_state = 666)
model.fit(X)
Components = model.predict(X)
And then you can use the top K components as a smaller dimensional feature space that can explain a large portion of the variance for the original feature space.
If you want to understand how well the new smaller feature space describes the variance you could use the following command
model.explained_variance_
In many cases when I encountered the problem of too many features being generated from a column with many categories, I opted for binary encoding and it worked out fine most of the times and hence is worth a shot for you perhaps.
Imagine you have 9 features, and you mark them from 1 to 9 and now binary encode them, you will get:
cat 1 - 0 0 0 1
cat 2 - 0 0 1 0
cat 3 - 0 0 1 1
cat 4 - 0 1 0 0
cat 5 - 0 1 0 1
cat 6 - 0 1 1 0
cat 7 - 0 1 1 1
cat 8 - 1 0 0 0
cat 9 - 1 0 0 1
This is the basic intuition behind Binary Encoder.
PS: Given that 2 power 11 is 2048 and you may have 2000 categories or so, you can reduce your categories to 11 feature columns instead of many (for example, 1999 in the case of one-hot)!
I also encountered these same problems but I solved using Countvectorizer from sklearn.feature_extraction.text just by giving binary=True, i.e CounterVectorizer(binary=True)

Data Science: Scoring methodology

I am looking for any methodology to assign a risk score to an individual based on certain events. I am looking to have a 0-100 scale with an exponential assignment. For example, for one event a day the score may rise to 25, for 2 it may rise to 50-60 and for 3-4 events a day the score for the day would be 100.
I tried to Google it but since I am not aware of the right terminology, I am landing up on random topics. :(
Is there any mathematical terminology for this kind of scoring system? what are the most common methods you might know?
P.S.: Expert/experience data scientist advice highly appreciated ;)
I would start by writing some qualifications:
0 events trigger a score of 0.
Non edge event count observations are where the score – 100-threshold would live.
Any score after the threshold will be 100.
If so, here's a (very) simplified example:
Stage Data:
userid <- c("a1","a2","a3","a4","a11","a12","a13","a14","u2","wtf42","ub40","foo","bar","baz","blue","bop","bob","boop","beep","mee","r")
events <- c(0,0,0,0,0,0,0,0,0,0,0,0,1,2,3,2,3,6,122,13,1)
df1 <- data.frame(userid,events)
Optional: Normalize events to be in (1,2].
This might be helpful for logarithmic properties. (Otherwise, given the assumed function, score=events^exp, as in this example, 1 event will always yield a score of 1) This will allow you to control sensitivity, but it must be done right as we are dealing with exponents and logarithms. I am not using normalization in the example:
normevents <- (events-mean(events))/((max(events)-min(events))*2)+1.5
Set the quantile threshold for max score:
MaxScoreThreshold <- 0.25
Get the non edge quintiles of the events distribution:
qts <- quantile(events[events>min(events) & events<max(events)], c(seq(from=0, to=100,by=5)/100))
Find the Events quantity that give a score of 100 using the set threshold.
MaxScoreEvents <- quantile(qts,MaxScoreThreshold)
Find the exponent of your exponential function
Given that:
Score = events ^ exponent
events is a Natural number - integer >0: We took care of it by
omitting the edges)
exponent > 1
Exponent Calculation:
exponent <- log(100)/log(MaxScoreEvents)
Generate the scores:
df1$Score <- apply(as.matrix(events^exponent),1,FUN = function(x) {
if (x > 100) {
result <- 100
}
else if (x < 0) {
result <- 0
}
else {
result <- x
}
return(ceiling(result))
})
df1
Resulting Data Frame:
userid events Score
1 a1 0 0
2 a2 0 0
3 a3 0 0
4 a4 0 0
5 a11 0 0
6 a12 0 0
7 a13 0 0
8 a14 0 0
9 u2 0 0
10 wtf42 0 0
11 ub40 0 0
12 foo 0 0
13 bar 1 1
14 baz 2 100
15 blue 3 100
16 bop 2 100
17 bob 3 100
18 boop 6 100
19 beep 122 100
20 mee 13 100
21 r 1 1
Under the assumption that your data is larger and has more event categories, the score won't snap to 100 so quickly, it is also a function of the threshold.
I would rely more on the data to define the parameters, threshold in this case.
If you have prior data as to what users really did whatever it is your score assess you can perform supervised learning, set the threshold # wherever the ratio is over 50% for example. Or If the graph of events to probability of ‘success’ looks like the cumulative probability function of a normal distribution, I’d set threshold # wherever it hits 45 degrees (For the first time).
You could also use logistic regression if you have prior data but instead of a Logit function ingesting the output of regression, use the number as your score. You can normalize it to be within 0-100.
It’s not always easy to write a Data Science question. I made many assumptions as to what you are looking for, hope this is the general direction.

Difference-in-difference analysis in SPSS

I am trying to compare means of the two groups 'single mothers with one child' and 'single mothers with more than one child' before and after the reform of the EITC system in 1993.
Through the procedure T-test in SPSS, I can get the difference between groups before and after the reform. But how do I get the difference of the difference (I still want standard errors)?
I found these methods for STATA and R (http://thetarzan.wordpress.com/2011/06/20/differences-in-differences-estimation-in-r-and-stata/), but I can't seem to figure it out in SPSS.
Hope someone will be able to help.
All the best,
Anne
This can be done with the GENLIN procedure. Here's some random data I generated to show how:
data list list /after oneChild value.
begin data.
0 1 12
0 1 12
0 1 11
0 1 13
0 1 11
1 1 10
1 1 9
1 1 8
1 1 9
1 1 7
0 0 16
0 0 16
0 0 18
0 0 15
0 0 17
1 0 6
1 0 6
1 0 5
1 0 5
1 0 4
end data.
dataset name exampleData WINDOW=front.
EXECUTE.
value labels after 0 'before' 1 'after'.
value labels oneChild 0 '>1 child' 1 '1 child'.
The mean for the groups (in order, before I truncated to integers) are 17, 6, 12, and 9 respectively. So our GENLIN procedure should generate values of -11 (the after-before difference in the >1 child group), -5 (the difference of 1 child - >1 child), and 8 (the child difference of the after-before differences).
To graph the data, just so you can see what we're expecting:
* Chart Builder.
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=after value oneChild MISSING=LISTWISE REPORTMISSING=NO
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: after=col(source(s), name("after"), unit.category())
DATA: value=col(source(s), name("value"))
DATA: oneChild=col(source(s), name("oneChild"), unit.category())
GUIDE: axis(dim(2), label("value"))
GUIDE: legend(aesthetic(aesthetic.color.interior), label(""))
SCALE: linear(dim(2), include(0))
ELEMENT: line(position(smooth.linear(after*value)), color.interior(oneChild))
ELEMENT: point.dodge.symmetric(position(after*value), color.interior(oneChild))
END GPL.
Now, for the GENLIN:
* Generalized Linear Models.
GENLIN value BY after oneChild (ORDER=DESCENDING)
/MODEL after oneChild after*oneChild INTERCEPT=YES
DISTRIBUTION=NORMAL LINK=IDENTITY
/CRITERIA SCALE=MLE COVB=MODEL PCONVERGE=1E-006(ABSOLUTE) SINGULAR=1E-012 ANALYSISTYPE=3(WALD)
CILEVEL=95 CITYPE=WALD LIKELIHOOD=FULL
/MISSING CLASSMISSING=EXCLUDE
/PRINT CPS DESCRIPTIVES MODELINFO FIT SUMMARY SOLUTION.
The results table shows just what we expect.
The >1 child group is 12.3 - 10.1 lower after vs. before. This 95% CI contains the "real" value of 11
The before difference between >1 children and 1 child is 5.7 - 3.5, containing the real value of 5
The difference-of-differences is 9.6 - 6.4, containing the real value of (17-6) - (12-9) = 8
Std. errors, p values, and the other hypothesis testing values are all reported as well. Hope that helps.
EDIT: this can be done with less "complicated" syntax by computing the interaction term yourself and doing simple linear regression:
compute interaction = after*onechild.
execute.
REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF OUTS CI(95) R ANOVA
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT value
/METHOD=ENTER after oneChild interaction.
Note that the resulting standard errors and confidence intervals are actually different from the previous method. I don't know enough about SPSS's GENLIN and REGRESSION procedures to tell you why that's the case. In this contrived example, the conclusion you'd draw from your data would be approximately the same. In real life, the data aren't likely to be this clean, so I don't know which method is "better".
General Linear model, i take it as a 'ANOVA' model.
So use the related module in SPSS's Analyze menu.
After T-test, you need to check the sigma equality of each group .
Regarding the first answer above:
* Note that GENLIN uses maximum likelihood estimation (MLE) whereas REGRESSION
* uses ordinary least squares (OLS). Therefore, GENLIN reports z- and Chi-square tests
* where REGRESSION reports t- and F-tests. Rather than using GENLIN, use UNIANOVA
* to get the same results as REGRESSION, but without the need to compute your own
* product term.
UNIANOVA value BY after oneChild
/PLOT=PROFILE(after*oneChild)
/PLOT=PROFILE(oneChild*after)
/PRINT PARAMETER
/EMMEANS=TABLES(after*oneChild) COMPARE(after)
/EMMEANS=TABLES(after*oneChild) COMPARE(oneChild)
/DESIGN=after oneChild after*oneChild.
HTH.

Resources