I have carried out an evaluation of a product using a Likert-scale questionnaire and imported the data into SPSS. I have my columns arranged as follows:
ID, Group, Q1, Q2, Q3, Q4
I have two different groups completing the questionnaire, with each person assigned a different numerical ID. Under the Q columns, I have the score that person gave on the Likert scale (from 1 to 5).
In all there are over 300 responses.
I am running the analysis using 'Descriptive Statistics / Frequencies' from the menu bar and not getting the tables I am looking for. Basically, it lumps all respondents together, whereas I would like the tables to compare the two groups.
How can I get descriptive statistics on questionnaire items by group using SPSS?
In addition, if you have any further tips as to what analysis I could perform on this type of data in SPSS I'd be most grateful. I'd like to show that there isn't a significant difference in opinions between the groups, and from looking at the data, it appears that this is the case.
One option:
split the file by group, either via Data > Split File in the menus or with syntax (cases must be sorted by the grouping variable first):
sort cases by group.
split file layered by group.
then run descriptive statistics as usual. Remember to switch it off afterwards with split file off.
See this SPSS FAQ item from UCLA on how to analyze data by categories.
The short answer to your question is that
crosstabs Q1 to Q4 by group.
will produce the table you want. Or, if you have the CTABLES module available, a more compact table will be produced by:
variable level group_id Q1 to Q4 (nominal).
ctables
/table Q1 + Q2 + Q3 + Q4 by group_id.
Either can be elaborated on to produce other statistics if wanted. For your question, it seems to me a chi-square test would be sufficient.
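If you want to double-check the chi-square outside SPSS, here is a minimal sketch in Python (pandas and SciPy; the toy data and the column names group and Q1 are placeholders for your own):

import pandas as pd
from scipy.stats import chi2_contingency

# toy data standing in for the questionnaire: one row per respondent
df = pd.DataFrame({
    'group': [1, 1, 1, 1, 2, 2, 2, 2],
    'Q1':    [4, 5, 3, 4, 2, 3, 5, 4],
})

# response-by-group contingency table, the same shape crosstabs produces
table = pd.crosstab(df['Q1'], df['group'])
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, df = {dof}, p = {p:.4f}")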
As far as further analysis it is a bit of an open-ended question that needs more focus to be able to effectively answer. I frequently suggest visual exploration for such exploratory analysis, and hence I would suggest perusing this question on the site, Visualizing Likert responses using R or SPSS for potential ideas about how to visualize the responses. Another motivating post may be How to visualize 3D contingency matrix?.
There are a ton of other questions related to analyzing Likert responses on this site, though, and it is difficult to give more specific advice without a more specific motivation for the analysis.
While the above answers all have their good points, I usually prefer this procedure (type the following into a syntax window and click Run):
means q1 to q4 by group/stat anova.
This will give you group means, sample sizes, and standard deviations, as well as tests of the difference in means between the groups, for each of the variables Q1 to Q4. Of course, the tests will only give you valid results to the extent that your data meet the standard assumptions of ANOVA. Some may say that variables measured on an ordinal 1-5 scale are not suitable for ANOVA, and in academic contexts this is often true, but in business contexts most people are willing to sacrifice some rigor for the sake of convenience. It's much more convenient to compare 4 x 2 = 8 means than to compare the distributions of 4 x 5 x 2 = 40 categories of responses.
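For what it's worth, the same summary can be sketched outside SPSS as well. A minimal Python version, assuming the data sit in a DataFrame with a group column and the question items (all names here are placeholders):

import pandas as pd
from scipy.stats import f_oneway

# toy stand-in: a 'group' column plus the Likert items
df = pd.DataFrame({
    'group': [1, 1, 1, 1, 2, 2, 2, 2],
    'Q1':    [4, 5, 3, 4, 2, 3, 5, 4],
    'Q2':    [1, 2, 2, 3, 3, 3, 4, 2],
})

for q in ['Q1', 'Q2']:
    # group means, sample sizes, and standard deviations, as MEANS reports
    print(df.groupby('group')[q].agg(['mean', 'count', 'std']))
    # one-way ANOVA across the groups, as /STAT ANOVA reports
    samples = [g[q].to_numpy() for _, g in df.groupby('group')]
    f, p = f_oneway(*samples)
    print(f"{q}: F = {f:.2f}, p = {p:.4f}")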
This can easily be done by using the "Crosstabs" function in SPSS for Windows:
Analyze --> Descriptive Statistics --> Crosstabs. Move the dependent variable(s) into the "Row(s)" box, then move the grouping variable into the "Column(s)" box, then click OK.
Related
I have a complex data model consisting of around a hundred tables containing business data. Some tables are very wide, up to four hundred columns. Columns can have various data types: integers, decimals, text, dates, etc. I'm looking for a way to identify relevant/important information stored in these tables.
I fully understand that business knowledge is essential to correctly process a data model. What I'm looking for are strategies to pre-process tables and identify columns that should be taken to a later stage, where analysts will actually look into them. For example, I could use data profiling and statistics to find and exclude columns that don't have any data at all, or where all records have the same value. This way I could potentially eliminate 30% of fields. However, I'm interested in exploring how AI and machine learning techniques could be used to identify important columns, hoping I could identify around 80% of the relevant data. I'm aware that what counts as relevant will depend on the questions I want to ask. But even then, I hope I could narrow down the columns to simplify the manual assessment taking place in the next stage.
Could anyone provide some guidance on how to use AI and Machine Learning to identify relevant columns in such wide tables? What strategies and techniques can be used to pre-process tables and identify columns that should be taken to the next stage?
Any help or guidance would be greatly appreciated. Thank you.
F.
The most common approach I've seen to evaluate the analytical utility of columns is the correlation method. This tells you whether there is a relationship (positive or negative) between specific column pairs. In my experience, you'll be able to build analysis outputs more easily when columns are correlated, although these analyses may not always be the most accurate.
However, before you even do that, as you indicate, you would probably need to narrow down your list of columns using much simpler methods. For example, you could surely eliminate a whole bunch of columns based on data type and basic count statistics.
Less common analytic data types (IDs, blobs, binary, etc.) can probably be excluded first, followed by running a simple COUNT(DISTINCT ColName) and a COUNT(*) WHERE ColName IS NULL on each column. This will help eliminate unique IDs, keys, and other similar fields. If all the rows are distinct, the column would not be a good field for analysis. The same goes for NULLs: if the percentage of NULLs is greater than some threshold, you can eliminate those columns as well.
To automate this, depending on your database, you could create a fairly simple stored procedure or function that loops through all the tables and columns and runs a data-type, distinct-count, and null-percentage analysis on each field.
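If you can pull samples of the tables into Python instead, a rough sketch of that profiling loop might look like this (pandas; the thresholds are arbitrary and would need tuning):

import pandas as pd

def profile_columns(df, null_threshold=0.9):
    """Flag columns that look unusable for analysis: all-distinct (likely keys),
    single-valued, or mostly NULL."""
    report = {}
    n = len(df)
    for col in df.columns:
        distinct = df[col].nunique(dropna=True)
        null_frac = df[col].isna().mean()
        if distinct == n:
            report[col] = 'all distinct (likely a key)'
        elif distinct <= 1:
            report[col] = 'constant or empty'
        elif null_frac > null_threshold:
            report[col] = f'{null_frac:.0%} NULL'
        else:
            report[col] = 'keep for review'
    return report

df = pd.DataFrame({'id': [1, 2, 3], 'flag': ['a', 'a', 'a'], 'x': [0.1, None, 0.3]})
for col, verdict in profile_columns(df).items():
    print(col, '->', verdict)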
Once you've narrowed down the list of columns, you can use a .corr() function to run the analysis against all the remaining columns in something like a Python script.
If you wanted to keep everything in the database, Postgres also supports a corr() aggregate function, but you'll only be able to run this on 2 columns at a time, like this:
SELECT corr(column1,column2) FROM table;
so you'll need to build a procedure that evaluates multiple columns at once.
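If you go the Python route mentioned above, pandas computes the full pairwise correlation matrix in one call (toy DataFrame as a stand-in for your narrowed-down columns):

import pandas as pd

# toy stand-in for the columns that survived the basic filtering
df = pd.DataFrame({
    'revenue': [10, 20, 30, 40],
    'units':   [1, 2, 3, 5],
    'returns': [5, 3, 2, 1],
})

# Pearson correlation for every pair of numeric columns at once
corr_matrix = df.corr()
print(corr_matrix)

# flag pairs with a strong relationship (the 0.8 cutoff is arbitrary)
strong = corr_matrix.abs().stack()
print(strong[(strong > 0.8) & (strong < 1.0)])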
I've thought about this technical challenge for some time. In general, it's an AI-solvable problem, since there are easy features to extract, such as unique values, clustering, distributions, etc.
We want to bake this ability into https://columns.ai. Obviously we haven't gotten there yet, but the first step we have taken is to collect stats for all columns when a data connection is made, identify columns that have a similar range of unique values, and generate a bunch of query templates for users to explore their dataset.
If you're interested, please take a look; as we keep advancing this part, it will get closer to an AI model that finds relevant columns. Cheers!
Data at hand: 1000 questionnaires with a finite database of questions, say 100 questions about name, gender, income etc. Each questionnaire contains 10 to 30 questions from this question database. The wording of a certain question remains identical across different questionnaires. The 100 questions have their unique label (Q1 to Q100) in the database.
Task: creating a new questionnaire. Assuming I know which questions I need to ask on the new questionnaire (say 20 questions, including Q1, Q5, Q10, Q22, etc.), I need to know in what order I should place these questions.
Machine learning question: how do I learn the patterns from the existing data to help myself order the 20 questions on my new questionnaire?
A simple but inaccurate solution would be to count the position of each question label in the existing data. Say Q1 appeared 300 times in the existing data and 70% of the time it was the first question on the questionnaire; I would then predict that Q1 has order = 1 on any new questionnaire.
Alternatively, I can compute the average position of each question in the existing data. Say Q1 has a mean order of 2.53 and Q10 has a mean order of 1.33. Then, when I create a new questionnaire that contains both Q1 and Q10, Q1 will be placed after Q10.
The above methods fail to capture the relationship between the questions. For instance, maybe Q5 always appears after Q6. I hope the algorithm can capture hidden patterns like this.
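To make the baseline concrete, here is a minimal sketch of the pairwise-precedence idea I have in mind (toy data; ties and conflicting counts are broken arbitrarily, so this is a starting point rather than a full model):

from collections import defaultdict
from functools import cmp_to_key

# toy training data: each questionnaire is an ordered list of question labels
questionnaires = [
    ['Q10', 'Q1', 'Q5'],
    ['Q10', 'Q5', 'Q1'],
    ['Q1', 'Q10', 'Q22'],
]

# count how often each question appeared before another on the same questionnaire
before = defaultdict(int)
for qs in questionnaires:
    for i, a in enumerate(qs):
        for b in qs[i + 1:]:
            before[(a, b)] += 1

def order_new(questions):
    # sort so that pairs keep their majority historical order
    return sorted(questions, key=cmp_to_key(lambda a, b: before[(b, a)] - before[(a, b)]))

print(order_new(['Q1', 'Q5', 'Q10', 'Q22']))  # e.g. ['Q10', 'Q1', 'Q5', 'Q22']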
I am building an item-based recommender system for 10 million users who rate categories out of 20 possible categories (news categories like politics, sport, etc.).
I would like each of them to be recommended at least one other category which they don't know (no rating).
I ran a GenericUserBasedRecommender and asked for recommendations for each user, but it looks extremely slow: maybe 1,000 users processed per minute.
My questions are:
1- Can I run this same GenericUserBasedRecommender on Hadoop, and would it really be faster? I have seen and run an ItemBasedRecommender from the command line on a cluster, but I would rather run a user-based one.
1.5- I saw many users not getting a single recommendation. What is the algorithm's criterion for deciding whether a user gets a recommendation? I thought it could be that the users who don't get recommendations are the ones who gave only a single rating, but I don't understand why.
2- Is there another, smarter way to deal with my problem? Maybe some clustering solution instead of recommendation? I don't exactly see how.
3- Finally, am I right in saying that the algorithms that have no command line are not to be used with Hadoop?
Thank you for your answers.
Sometimes you won't get recommendations for certain items or users because there are few items over which they overlap. It could also be a case where the user data is 'enough', but the user's behaviour/usage patterns are very unique and/or in disagreement with popular trends in the data.
You could perhaps try a LogLikelihood- or Tanimoto-based ItemSimilarity.
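To illustrate the Tanimoto idea itself (plain Python for intuition, not the Mahout API): treat each item as the set of users who rated it and score similarity by set overlap:

def tanimoto(users_a, users_b):
    """Tanimoto (Jaccard) coefficient between two items,
    each represented as the set of users who rated it."""
    inter = len(users_a & users_b)
    union = len(users_a | users_b)
    return inter / union if union else 0.0

item_politics = {'u1', 'u2', 'u3', 'u5'}
item_sport    = {'u2', 'u3', 'u4'}
print(tanimoto(item_politics, item_sport))  # 2 / 5 = 0.4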
Another thing you could look into is a matrix factorization based model. You could use the ALSWR factorizer to generate recommendations. This method decomposes the original user-item matrix into a user-feature matrix, an item-feature matrix, and a diagonal matrix, reduces the dimensionality, and then reconstructs the matrix closest to the original matrix with the same rank. You might lose some data with this method, but the missing values in the user-item matrix are imputed, and you get estimated preference/recommendation values.
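Here is a very small sketch of the alternating-least-squares idea in plain NumPy, on a toy matrix with two latent features (for intuition only; it is not the Mahout ALSWR implementation):

import numpy as np

# toy user-item rating matrix; 0 marks a missing rating
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
], dtype=float)
mask = R > 0
k, lam = 2, 0.1                          # latent features, regularization
rng = np.random.default_rng(0)
U = rng.random((R.shape[0], k))          # user-feature matrix
V = rng.random((R.shape[1], k))          # item-feature matrix

# alternate: fix V and solve for each user's features, then fix U and solve for items
for _ in range(20):
    for i in range(R.shape[0]):
        m = mask[i]
        U[i] = np.linalg.solve(V[m].T @ V[m] + lam * np.eye(k), V[m].T @ R[i, m])
    for j in range(R.shape[1]):
        m = mask[:, j]
        V[j] = np.linalg.solve(U[m].T @ U[m] + lam * np.eye(k), U[m].T @ R[m, j])

# the reconstruction fills in the missing cells with estimated preferences
print(np.round(U @ V.T, 1))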
If you have the features and not just implicit ratings, you could probably experiment with clustering techniques, perhaps starting with hierarchical clustering.
I did not quite get your last question.
It's known how collaborative filtering (CF) is used for movie, music, and book recommendations. In the paper 'Collaborative Topic Modeling for Recommending Scientific Articles', among other things, the authors show an example of collaborative filtering applied to ~5,500 users and ~17,000 scientific articles. With ~200,000 user-item pairs, the user-article matrix is obviously highly sparse.
What if you do collaborative filtering with matrix factorization for, say, all news articles shared on Twitter? The matrix will be even sparser than in the scientific-articles case, which makes CF not very applicable. Of course, we can do some content-aware analysis (taking the text of an article into account), but that's not my focus. Or we could limit the time window (focusing, say, on all news articles shared in the last day or week) to make the user-article matrix denser. Any other ideas on how to fight the fact that the matrix is very sparse? What are the results of research in the area of CF for news article recommendations? Thanks a lot in advance!
You might try using an object-to-object collaborative filter instead of a user-to-object filter. Age out related pairs (and low-incidence pairs) over time since they're largely irrelevant in your use case anyway.
I did some work on the Netflix Prize back in the day, and quickly found that I could significantly outperform the base model with regard to predicting which items were users' favorites. Unfortunately, since it's basically a rank model rather than a scalar predictor, I didn't have RMSE values to compare.
I know this method works because I wrote a production version of this same system. My early tests showed that, given a task wherein 50% of users' top-rated movies were deleted, the object-to-object model correctly predicted (i.e., "replaced") about 16x more of users' actual favorites than a basic slope-one model. Plus the table size is manageable. From there it's easy to include a profitability weight against the sort order, etc. depending on your application.
Hope this helps! I have a working version in production but am still looking for beta clients to bang on the system... if anyone has time to give it a run I'd love to hear from you.
Jeb Stone, PhD
www.selloscope.com
Is there a way to compare all information (aggregates, down to the detail level) between two OLAP cubes? For example, say I wanted to compare a cube created for SQL Server 2000 to that same cube migrated to SQL Server 2005/2008. Technically, they should both return the same information for all dimension/measure combinations, but I need a way to verify that.
I am definitely NOT a developer, but I do have access to Enterprise Manager, and potentially SAS tools, etc., and I know a bit of SQL but not much else. I know that you can compare two-dimensional data sets (i.e., tables) with SQL queries, and also with SAS, but I have never heard of a way to compare three-dimensional cubes.
Am I out of luck on this one? The last thing I want to do is view both cubes and compare all possible results side by side in Excel or something; I hope this can be automated somehow.
Comparing cubes means doing enough "slice-and-dice" queries to prove that you've queried all of the facts.
You can simply get a sum and a count from the various fact and dimension tables. If those are the same, odds are good that any particular query will be the same between the two.
Without details on the dimensions and facts in question, it's hard to make a more specific recommendation.
However, consider that you can easily compute a set of subtotals for each dimension of the cube. If the dimensions have the same number of rows, the results will have the same number of rows. If the grand total is the same, then all that's left is a row-by-row comparison of the subtotals.
If you do this once for each dimension, you should have some confidence that they're the same. Or, you'll find a difference that you can explore with more detailed queries.
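As a sketch of that subtotal check, assuming you can export or query each cube's facts into flat rows (pandas here; all table and column names are placeholders):

import pandas as pd

# toy flat exports of the same fact table from the two cubes
cube_a = pd.DataFrame({'year': [2000, 2000, 2001], 'region': ['E', 'W', 'E'], 'sales': [10, 20, 30]})
cube_b = pd.DataFrame({'year': [2000, 2000, 2001], 'region': ['E', 'W', 'E'], 'sales': [10, 20, 31]})

# one set of subtotals per dimension, as described above
for dim in ['year', 'region']:
    a = cube_a.groupby(dim)['sales'].agg(['sum', 'count'])
    b = cube_b.groupby(dim)['sales'].agg(['sum', 'count'])
    print(dim, '->', 'match' if a.equals(b) else 'mismatch: drill into this dimension')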
The best approach is to compare the cube data by interchanging the rows and columns and verifying that all the counts and totals still match. For example, if you have year-wise totals for a particular location, interchange the locations and the time periods in the query and verify that the totals agree.