Is it more reasonable to remove "IDK" from a survey in SPSS for rudimentary analyses?

I attached a graph showing the typical array of responses to a survey measured ordinally in SPSS. Generally, the answers can be quantified and compared as 1–5, corresponding to responses ranging from "Not Satisfied" to "Very Satisfied," but many questions also have a 6 (equating to "IDK"), although it's usually a minority response. I worry that for my purposes it's just noise that distracts from the 'weight' of the other, more certain answers.
So, does it make statistical sense to measure opinions this way if I'm looking for ordinal certainty on a 0–6 scale? I'm now wondering whether it's best to recode anything, change the Measure, or remove all the 6s from the data entirely.
Thanks for checking this post out!

A Likert scale assumes that the response categories have a rank order. The survey question is apparently measuring satisfaction with Monk Seal Resting and Sea Turtle Nesting, and you can't really rank "I don't know" on a satisfaction scale where, for instance, 1 might represent "Not Satisfied" and 5 might represent "Very Satisfied".
If the purpose of your analysis is to measure statistical differences or levels of agreement between groups, then you probably want to remove "blank" and "I don't know". If you are creating a visualization to show the survey responses, and it would be helpful for your audience to know that a number of people chose "I don't know", and a number of people chose not to respond to the question, then it might make sense to keep those responses in your analysis. I would recommend reordering them so that they appear together at the beginning or end of your graph, instead of sandwiching the true Likert scale responses.
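If you ever mirror this outside SPSS, a small pandas sketch of both options (drop "IDK" for statistics, keep it as a reordered category for visualization) might look like the following; the column name and value labels are hypothetical:

```python
import pandas as pd

# Hypothetical column: 1-5 = satisfaction scale, 6 = "IDK"
df = pd.DataFrame({"satisfaction": [5, 3, 6, 4, 6, 1, 2, 5]})

# Option 1: for statistical comparisons, treat "IDK" (6) as missing
df["satisfaction_num"] = df["satisfaction"].mask(df["satisfaction"].eq(6))

# Option 2: for visualization, keep "IDK" as a labelled category placed at the end
labels = {1: "Not Satisfied", 2: "2", 3: "3", 4: "4", 5: "Very Satisfied", 6: "IDK"}
order = ["Not Satisfied", "2", "3", "4", "Very Satisfied", "IDK"]
df["satisfaction_cat"] = pd.Categorical(df["satisfaction"].map(labels),
                                        categories=order, ordered=True)

print(df["satisfaction_num"].mean())                    # mean over the 1-5 responses only
print(df["satisfaction_cat"].value_counts(sort=False))  # counts in display order
```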

Related

Computing a similarity score for a set of sentences

My team does a lot of chatbot training, and I'm trying to come up with some tools to improve the quality of our work. In chatbot training, it is really important to train intents with diverse utterances that phrase the same intent in very different ways. Ideally, there would be very little similarity in the syntax of the utterances in the set.
Here's an example for an intent inquiring about medical insurance coverage
Bad set of utterances
Is my daughter covered by insurance?
Is my son covered by medical insurance?
Will my son be covered by insurance?
Decent set of utterances
How can I look up whether we have insurance coverage for the whole family?
Seeking details on eligibility for medical coverage
Is there a document that details who is protected under our medical insurance policy?
I want to be able to take all of the utterances associated with an intent and analyze them for similarity. I would expect my set of bad utterances to have a high similarity score and my set of decent utterances to have a low similarity score.
I've tried playing around with a few doc2vec tutorials, but I feel like I'm missing something. I keep seeing stuff like this:
Train a set of data and then measure the similarity of a new sentence to your set of data
Measure the similarity between two sentences
I need to have an array of sentences and understand how similar they are to each other.
Any advice on achieving this?
Answering some questions:
What makes the bad utterances bad? The utterances themselves are not bad; it is the lack of variety between them. If most of the training had been like the "bad" set, then real user utterances of greater variety would not be recognized correctly.
Are you trying to discover new intents? No, this is for prerelease training, trying to improve the effectiveness of it.
Why do bad utterances have high similarity scores and decent utterances have low similarity scores? This is a hypothesis. I know how varied real user utterances are, and I have found my trainers fall into ruts when training, asking things the same way, and not seeing good accuracy results. Improving the variety in the utterances tends to result in better accuracy.
What will I do with this info? I’ll use it to assess the training quality of an intent, to determine if more training is likely necessary. In the future we might build real time tools as utterances are being added to let trainers know if they’re being too repetitive.
Most applications of text vectors benefit from the vectors capturing the "essential meaning" of a text, *without* regard to variances in word choice.
That is, it's considered a feature, not a flaw, if two completely different wordings with similar meaning have nearly the same vector. (Or, if some similarity-measure indicates they are highly similar.)
To contrive a case similar to yours, consider the two phrasings:
"health coverage for brother"
"male sibling medical insurance"
There's no reuse of words, but the likely intended meaning is the same – so a good text-vectorization for typical purposes would create very similar vectors. And a similarity-measure using those vectors, or otherwise using the words/word-vectors as input, would indicate very high similarity.
But from your clarifying answers, it seems you actually want a more superficial "similarity" measure. You'd like a measure that reveals when certain phrasings show variety/contrast in their wording. (And specifically, you already know from other factors, like how they were hand-crafted, that groups of these phrasings are semantically related.)
What you want this similarity measure to show is actually a behavior that many projects using text-vectors would consider a failure of the vectors. So semantic methods like those in Word2Vec, Paragraph Vectors (aka "Doc2Vec"), etc. are likely the wrong tool for your goal.
You could probably do well with a simpler measure based just on the words, or perhaps character-n-grams, of the texts.
For example, for two texts A and B, you could just tally the number of shared words (that appear in both A and B), and divide by the total number of unique words in both A and B, to get a 0.0 to 1.0 "word choice similarity" number.
And, when considering a new text against a set of prior texts, if its average similarity to the prior texts is low, it'd be "good" for your purposes.
Rather than just words, you could also use all n-character substrings ("n-grams") of your texts – which might help better highlight differences in word-forms, or common typos, which may also be useful variances for your purposes.
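For what it's worth, here is a minimal sketch of that word-overlap measure; the function names and the simple tokenization are illustrative only, not a fixed recipe:

```python
import re

def word_choice_similarity(a: str, b: str) -> float:
    """Fraction of unique words shared by two texts (Jaccard overlap), from 0.0 to 1.0."""
    words_a = set(re.findall(r"[a-z0-9']+", a.lower()))
    words_b = set(re.findall(r"[a-z0-9']+", b.lower()))
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0

def avg_similarity_to_set(new_text: str, prior_texts: list[str]) -> float:
    """Average word-choice similarity of a candidate utterance to the existing ones."""
    return sum(word_choice_similarity(new_text, t) for t in prior_texts) / len(prior_texts)

existing = [
    "Is my daughter covered by insurance?",
    "Is my son covered by medical insurance?",
]
# A high average means the new utterance repeats the existing wording too closely
print(avg_similarity_to_set("Will my son be covered by insurance?", existing))
```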
In general, I'd look at the scikit-learn text-vectorization functionality for ideas:
https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction
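As one possible starting point with that scikit-learn functionality, a character n-gram variant might look like this; the analyzer choice and n-gram range are assumptions to tune, not recommendations:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

utterances = [
    "Is my daughter covered by insurance?",
    "Is my son covered by medical insurance?",
    "Will my son be covered by insurance?",
]

# Character 3-5-grams (within word boundaries) pick up shared word forms and near-duplicates
vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(3, 5))
X = vectorizer.fit_transform(utterances)

# Pairwise similarity matrix; high off-diagonal values flag repetitive utterances
print(cosine_similarity(X))
```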

How can I group text questions that are similar to each other?

I have a dataset of 200k questions, and I would like to group them together by similarity/duplicates.
How can I use NLP/machine learning to group these questions with similar intents together?
Given a question and a list of questions, how can I find the question or questions that are similar or duplicates?
Are there any services that can do this?
Generally, you'd want to convert the questions into an abstract numerical format (such as a single high-dimensional vector, or 'bags of words/vectors'), from which it is then possible to calculate numerical pairwise similarities between questions.
For example: you could turn each question into a simple average of the word-vectors for its individual words. (Those word-vectors might come from your own training corpus, that matches the questions' usage domain exactly, or from some other outside source that's good enough.)
If the word-vectors are 300-dimensional, averaging all the word-vectors of a question together then gives you a 300-dimensional vector for the question. You can then use a typical measure of vector-similarity, such as "cosine similarity", to get a number from -1.0 to 1.0 for each pair of questions, with larger values indicating "more similar".
Such a simple approach is often a good baseline. Being smarter about dropping some words, or weighting words by their observed significance (eg by "TF/IDF" weighting) may improve it.
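As a sketch of that averaging baseline, assuming gensim's downloader and a pretrained GloVe model (any set of word-vectors that fits your domain would do):

```python
import numpy as np
import gensim.downloader as api

# Pretrained 300-d GloVe vectors via gensim-data (an assumption; use your own if available)
kv = api.load("glove-wiki-gigaword-300")

def question_vector(text: str) -> np.ndarray:
    """Average the word-vectors of the in-vocabulary words of a question."""
    words = [w for w in text.lower().split() if w in kv]
    if not words:
        return np.zeros(kv.vector_size)
    return np.mean([kv[w] for w in words], axis=0)

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom else 0.0

q1 = question_vector("how do i reset my password")
q2 = question_vector("i forgot my password how can i change it")
print(cosine(q1, q2))  # closer to 1.0 means more similar
```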
But there are other ways to get summary vectors that may work better than a simple average. One relatively straightforward algorithm, largely similar to the way word-vectors are created, is called "Paragraph Vectors", and is sometimes called in popular libraries (like Python gensim) "Doc2Vec". It's not quite a simple average of word-vectors, as much as creating a synthetic word-like token for a full text, which then is trained to be as good as possible at predicting the text's words. Again, once you have a (for example) 300-dimensional text-vector, calculating cosine-similarity can rank question similarities.
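A minimal gensim Doc2Vec sketch of the same idea; the tiny corpus and the hyperparameters below are placeholders, not tuned values:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

questions = [
    "how do i reset my password",
    "i forgot my password how can i change it",
    "what are your opening hours",
]

# Each question becomes a tagged document; the tag identifies it during training
corpus = [TaggedDocument(q.split(), [i]) for i, q in enumerate(questions)]

model = Doc2Vec(corpus, vector_size=100, min_count=1, epochs=40)

# Infer a vector for a new question and rank the training questions by cosine similarity
new_vec = model.infer_vector("how can i change my password".split())
print(model.dv.most_similar([new_vec], topn=3))
```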
There's also an interesting algorithm called "Word Mover's Distance", which leaves the texts as variable-sized bags of each constituent word-vector, as if each word-vector was a pile-of-meaning. It then calculates the "effort" to move the piles from one text's shape-of-piles, to another text's – and less effort seems to correlate well with humans' sense of text similarity. (However, finding these minimal-shifts is a lot more computationally expensive than simple cosine-similarity – so this works best with short texts, or small corpuses, or when you can massively parallelize the computation.)
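gensim exposes this on its word-vector objects as wmdistance (it needs an optimal-transport backend such as POT installed); a brief sketch reusing the `kv` vectors loaded in the averaging example above:

```python
# Assumes `kv` (the pretrained word vectors from the earlier sketch) is already loaded
doc1 = "how do i reset my password".split()
doc2 = "i forgot my password how can i change it".split()
print(kv.wmdistance(doc1, doc2))  # smaller distance = more similar texts
```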
Once you have any of these numeric-similarity measures working, you can also use clustering algorithms to find groups of highly-related questions – and often, once you have those groups, the most-common words in those groups (relative to the others), or human editorial work, can name the groups.
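A hedged sketch of that last step, using TF-IDF vectors only to keep it self-contained (any of the question vectors above would work); the distance threshold is a placeholder to tune, and older scikit-learn versions call the metric parameter affinity:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances

questions = [
    "how do I reset my password",
    "I forgot my password, how can I change it",
    "what are your opening hours",
    "when are you open on weekends",
]

# Any question vectors could go here; TF-IDF just keeps the sketch self-contained
X = TfidfVectorizer().fit_transform(questions)

# Cluster on precomputed cosine distances; questions closer than the threshold get merged
clustering = AgglomerativeClustering(
    n_clusters=None, distance_threshold=0.8, metric="precomputed", linkage="average"
)
labels = clustering.fit_predict(cosine_distances(X))

for label in sorted(set(labels)):
    print(label, [q for q, l in zip(questions, labels) if l == label])
```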

Why is my p value too low with > 1,000,000 obs (p<<0.01)

I know that generally a low P value is good since I want to reject the H0 hypothesis. But my problem is an odd one, and I would appreciate any help or insight you may give me.
I work with huge data sets (n > 1,000,000), each representing data of one year. I am required to analyse the data and find out whether the mean of the year is significantly different than the mean of the previous year. Yet everyone would prefer it to be non-significant instead of significant.
By "significant" I mean that I want to be able to tell my boss, "look, these non-significant changes are noise, while these significant changes represent something real to consider."
The problem is that simply comparing the two averages with a t-test always results in a significant difference, even if the difference is very, very small (probably due to the huge sample size) and falls within an acceptable range in practice. So basically, the way I perceive it, a p value does not function well for my needs.
What do you think I should do?
There is nothing wrong with the p value. With this number of observations, even slight effects will be flagged as significant. You have rightly asserted that the effect size for such a sample is very weak. This basically nullifies whatever argument can be made for using the p value alone for "significance": while the effect can be determined not to be due to chance, its actual usefulness in the real world is likely low, given that it doesn't produce anything predictable.
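A small simulation illustrates the point (NumPy/SciPy; the numbers are made up): a mean shift of one hundredth of a standard deviation is detected as "highly significant" at n = 1,000,000 per group, even though the effect size is negligible.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two "years" with a practically negligible difference in means (0.01 standard deviations)
year1 = rng.normal(loc=100.0, scale=10.0, size=1_000_000)
year2 = rng.normal(loc=100.1, scale=10.0, size=1_000_000)

t, p = stats.ttest_ind(year1, year2)
print(f"p-value: {p:.2e}")  # tiny p-value despite the trivial difference

# Effect size (Cohen's d) tells a different story
pooled_sd = np.sqrt((year1.var(ddof=1) + year2.var(ddof=1)) / 2)
d = (year2.mean() - year1.mean()) / pooled_sd
print(f"Cohen's d: {d:.3f}")  # around 0.01, i.e. a negligible effect
```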
For a comprehensive book on this subject, see the often-cited book by Jacob Cohen on power analysis. You can also check out my recent post on Cross Validated regarding two regression models with significant p values for predictors, but with radically different predictive power.

What is the difference between inferential analysis and predictive analysis?

Objective
To clarify which traits or attributes allow me to say an analysis is inferential or predictive.
Background
I am taking a data science course which touches on inferential and predictive analyses. The explanations, as I understood them, are:
Inferential
Induce a hypothesis from a small sample of a population, and see whether it holds in the larger/entire population.
It seems to me that this is generalisation. I think inferring that smoking causes lung cancer, or that CO2 causes global warming, are inferential analyses.
Predictive
Induce a statement of what can happen by measuring variables of an object.
I think identifying which traits, behaviours, and remarks people react favourably to, and which make a presidential candidate popular enough to become president, is a predictive analysis (this is touched on in the course as well).
Question
I am a bit confused by the two, as it looks to me like there is a grey area or overlap.
Bayesian Inference is "inference" but I think it is used for prediction such as in a spam filter or fraudulent financial transaction identification. For instance, a bank may use previous observations on variables (such as IP address, originator country, beneficiary account type, etc) and predict if a transaction is fraudulent.
I suppose the theory of relativity is an inferential analysis that induced a theory/hypothesis from observations and thought experiments, but it also predicted that the direction of light would be bent.
Kindly help me understand which must-have attributes categorise an analysis as inferential or predictive.
"What is the question?" by Jeffery T. Leek, Roger D. Peng has a nice description of the various types of analysis that go into a typical data science workflow. To address your question specifically:
An inferential data analysis quantifies whether an observed pattern will likely hold beyond the data set in hand. This is the most common statistical analysis in the formal scientific literature. An example is a study of whether air pollution correlates with life expectancy at the state level in the United States (9). In nonrandomized experiments, it is usually only possible to determine the existence of a relationship between two measurements, but not the underlying mechanism or the reason for it.
Going beyond an inferential data analysis, which quantifies the relationships at population scale, a predictive data analysis uses a subset of measurements (the features) to predict another measurement (the outcome) on a single person or unit. Web sites like FiveThirtyEight.com use polling data to predict how people will vote in an election. Predictive data analyses only show that you can predict one measurement from another; they do not necessarily explain why that choice of prediction works.
There is some gray area between the two but we can still make distinctions.
Inferential statistics is when you are trying to understand what causes a certain outcome. In such analyses there is a specific focus on the independent variables and you want to make sure you have an interpretable model. For instance, your example on a study to examine whether smoking causes lung cancer is inferential. Here you are trying to closely examine the factors that lead to lung cancer, and smoking happens to be one of them.
In predictive analytics you are more interested in using a certain dataset to help you predict future variation in the values of the outcome variable. Here you can make your model as complex as possible to the point that it is not interpretable as long as it gets the job done. A more simplified example is a real estate investment company interested in determining which combination of variables predicts prime price for a certain property so it can acquire them for profit. The potential predictors could be neighborhood income, crime, educational status, distance to a beach, and racial makeup. The primary aim here is to obtain an optimal combination of these variables that provide a better prediction of future house prices.
Here is where it gets murky. Let's say you conduct a study on middle-aged men to determine the risks of heart disease. To do this you measure weight, height, race, income, marital status, cholesterol, education, and a potential serum chemical called "mx34" (just making this up), among others. Let's say you find that the chemical is indeed a good risk factor for heart disease. You have now achieved your inferential objective. Then, satisfied with your new findings, you start to wonder whether you can use these variables to predict who is likely to get heart disease. You want to do this so that you can recommend steps to prevent future heart disease.
The same academic paper I was reading that spurred this question for me also gave an answer (from Leo Breiman, a UC Berkeley statistician):
• Prediction. To be able to predict what the responses are going to be to future input variables;
• [Inference]. To [infer] how nature is associating the response variables to the input variables.
Source: http://courses.csail.mit.edu/18.337/2015/docs/50YearsDataScience.pdf

SPSS two way repeated measures ANOVA

I am fairly new to statistics.
I ran an experiment and used a two-way ANOVA with repeated measures. The calculation was done in SPSS. In most papers I have seen, the F-value and the degrees of freedom were reported as well. Is it normal to report those values too? If so, which values do I take from the SPSS output?
How do I interpret these values? What do they mean?
When does the F-value support a significant result, and when does it not?
What are good values for the F-value and the degrees of freedom?
In some articles I also read about critical F-values; how do I get this value?
Most articles describe how to calculate those values but do not explain their meaning for the experiment.
Some clarification on these issues would be greatly appreciated.
My English is not very good, but I will try to answer your question.
The main purpose of ANOVA is to provide statistical evidence of whether the measured groups have the same mean or not. So we state a null hypothesis and an alternative hypothesis, then apply a test statistic to the data. You can use ANOVA if the groups have the same variance (squared standard deviation).
You need to test this first. This is a hypothesis test too: the null hypothesis is that the groups have the same variance, and the alternative hypothesis is that they don't.
You make the decision from the Sig. value: if the value is higher than 0.05, we usually accept the null hypothesis. If the variances are equal, we can use ANOVA. (I assume that the data follow the normal distribution.) The null hypothesis is that the groups have equal means; the alternative hypothesis is that at least one group has a different mean. You can make your decision from the Sig. value as before: if the value is higher than 0.05, we accept the null hypothesis. The critical F-value is not important if you are calculating on a computer. You can build an acceptance interval from the lower and upper critical F-values, and if the F-value falls inside the interval you accept the null hypothesis, but I only used this method in statistics class. You don't need the F-value and the df in the report, because they don't explain anything on their own.
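To illustrate the same decision logic outside SPSS, here is a SciPy sketch with made-up data; note it runs a plain one-way ANOVA, not the repeated-measures design in the question, and the group values are purely illustrative:

```python
from scipy import stats

# Three made-up groups of measurements
g1 = [23.1, 25.4, 22.8, 24.9, 26.0]
g2 = [27.2, 26.8, 28.1, 25.9, 27.5]
g3 = [24.0, 23.5, 25.2, 24.8, 23.9]

# Step 1: test equality of variances (the analogue of SPSS's homogeneity-of-variance test)
lev_stat, lev_p = stats.levene(g1, g2, g3)
print(f"Levene p = {lev_p:.3f}")  # > 0.05 -> no evidence against equal variances

# Step 2: if variances look equal, run the ANOVA; SPSS's Sig. corresponds to this p-value
f_stat, p = stats.f_oneway(g1, g2, g3)
print(f"F = {f_stat:.2f}, p = {p:.4f}")  # p < 0.05 -> at least one group mean differs
```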
