SPSS: Aggregating data from another dataset

I'm writing SPSS syntax. I have two datasets: one of individuals, the other of their groups. I have aggregated the overall mean of the individuals in the active dataset. However, I need this constant in the other, the groups dataset, so that I can compare group means with the overall mean. Please help.

As mentioned in the comments, the MATCH FILES command will get you there; see the sketch below.
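A minimal sketch, assuming the two open datasets have been named individuals and groups (via DATASET NAME) and the individual-level variable is called score; all three names are placeholders. Compute a constant key in both files, aggregate the overall mean into a one-case file, then use a MATCH FILES table lookup to spread that value to every group:

DATASET ACTIVATE individuals.
COMPUTE key=1.
AGGREGATE
  /OUTFILE='overall.sav'
  /BREAK=key
  /overall_mean=MEAN(score).

DATASET ACTIVATE groups.
COMPUTE key=1.
MATCH FILES /FILE=* /TABLE='overall.sav' /BY key.
EXECUTE.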

Do you know the "insert" command which generates syntax for you?

Related

How to identify relevant columns in very wide tables using AI and Machine Learning?

I have a complex data model consisting of around a hundred tables containing business data. Some tables are very wide, up to four hundred columns. Columns can have various data types: integers, decimals, text, dates, etc. I'm looking for a way to identify the relevant / important information stored in these tables.
I fully understand that business knowledge is essential to correctly process a data model. What I'm looking for are strategies to pre-process tables and identify columns that should be taken to a later stage, where analysts will actually look into them. For example, I could use data profiling and statistics to find and exclude columns that don't have any data at all, or where all records have the same value. This way I could potentially eliminate 30% of fields. However, I'm interested in exploring how AI and Machine Learning techniques could be used to identify important columns, hoping I could identify around 80% of the relevant data. I'm aware that relevant information will depend on the questions I want to ask. But even then, I hope I could narrow down the columns to simplify the manual assessment taking place in the next stage.
Could anyone provide some guidance on how to use AI and Machine Learning to identify relevant columns in such wide tables? What strategies and techniques can be used to pre-process tables and identify columns that should be taken to the next stage?
Any help or guidance would be greatly appreciated. Thank you.
The most common approach I've seen to evaluating the analytical utility of columns is the correlation method. This tells you whether there is a relationship (positive or negative) between specific column pairs. In my experience you'll be able to build analysis outputs more easily when columns are correlated, although these analyses may not always be the most accurate.
However, before you even do that, like you indicate, you would probably need to narrow down your list of columns using much simpler methods. For example, you could surely eliminate a whole bunch of columns based on datatype and basic count statistics.
Less common analytic data types (IDs, blobs, binary, etc.) can probably be excluded first, followed by running a simple COUNT(DISTINCT ColName) and a COUNT(*) WHERE ColName IS NULL. This will help eliminate unique IDs, keys, and similar fields: if all the rows are distinct, the column is not a good candidate for analysis. The same goes for NULLs; if the percentage of NULLs is greater than some threshold, you can eliminate those columns as well.
To automate this, depending on your database, you could create a fairly simple stored procedure or function that loops through all the tables and columns and runs a data-type, distinct-count, and null-percentage analysis on each field.
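A minimal sketch of that idea in Postgres (the table name my_table is a placeholder): read the column names from the catalog and emit one profiling query per column, which you can then execute or wrap in a procedure.

-- Generate one profiling query per column of "my_table".
SELECT format(
    'SELECT %L AS col, COUNT(DISTINCT %I) AS n_distinct, '
    || 'AVG((%I IS NULL)::int) AS null_fraction FROM %I;',
    column_name, column_name, column_name, table_name)
FROM information_schema.columns
WHERE table_name = 'my_table';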
Once you've narrowed down the list of columns, you can use a .corr() function to run the analysis against all the remaining columns in something like a Python script; a sketch follows.
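A minimal pandas sketch, assuming the surviving columns have been exported to a CSV (the file name is a placeholder):

import pandas as pd

# Load the narrowed-down set of columns.
df = pd.read_csv('narrowed_columns.csv')

# Pairwise Pearson correlations over the numeric columns.
corr = df.corr(numeric_only=True)

# List strongly related pairs, e.g. |r| > 0.7 (each pair appears twice,
# once per ordering; the < 1.0 filter drops the self-correlations).
pairs = corr.abs().stack()
pairs = pairs[(pairs > 0.7) & (pairs < 1.0)]
print(pairs.sort_values(ascending=False))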
If you wanted to keep everything in the database, Postgres also supports a corr() aggregate function, but you'll only be able to run this on 2 columns at a time, like this:
SELECT corr(column1,column2) FROM table;
so you'll need to build a procedure that evaluates multiple columns at once.
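One hedged way to build that, again assuming Postgres and a placeholder table name my_table: generate the pairwise corr() calls from the catalog rather than writing them by hand.

-- Emit one corr() query per pair of numeric columns of "my_table".
SELECT format('SELECT corr(%I, %I) AS r FROM %I;',
              a.column_name, b.column_name, a.table_name)
FROM information_schema.columns a
JOIN information_schema.columns b
  ON a.table_name = b.table_name
 AND a.column_name < b.column_name   -- each pair once
WHERE a.table_name = 'my_table'
  AND a.data_type IN ('integer', 'bigint', 'numeric', 'double precision', 'real')
  AND b.data_type IN ('integer', 'bigint', 'numeric', 'double precision', 'real');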
I've thought about this technical challenge for some time. In general it's an AI-solvable problem, since there are easy features to extract, such as unique values, clustering, and distributions.
We want to bake this ability into https://columns.ai; obviously we haven't gotten there yet. The first step we have taken is to collect stats for all columns when a data connection is made, identify columns that have a similar range of unique values, and generate a set of query templates for users to explore their dataset.
If you're interested, please take a look. As we keep advancing this part, it will get closer to an AI model that finds relevant columns. Cheers!

Integrate multiple same-structure datasets in one database

I have 8 different datasets with the same structure. I am using Neo4j and need to query all of them at different points on the website I am developing. What are the approaches to storing the datasets in one database?
One idea that comes to mind is to give each node an additional property that would distinguish nodes of one dataset from nodes of the others. But that seems too repetitive and wrong to me. The other idea is just to create 8 databases and query them separately, but how could I do that? Running each one on its own port seems crazy.
Any suggestions would be greatly appreciated.
If your datasets are in a tree structure, you could add a different root node to each of them that you could use for reference, similar to GraphAware TimeTree. Another option (better than a property, I think) would be to differentiate each dataset by adding a specific label to nodes from that dataset (e.g. all nodes from "dataset A" get a :DataSetA label), as sketched below.
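A minimal Cypher sketch of the label approach (the :Item label and the property are placeholders):

// Tag each node with its dataset's label at creation/import time.
CREATE (n:Item:DataSetA {name: 'example'});

// Later, scope any query to one dataset via its label.
MATCH (n:DataSetA) RETURN n LIMIT 10;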
I imagine that the specific structure of your dataset may yield other options. For example, if you always begin traversals of the dataset from a few set locations, you only need to be able to determine which dataset the entry points are part of, because once entered, all traversals stay within the same dataset, if that makes sense.

How to select random subset of cases in SPSS based on student number?

I am setting some student assignments where most students will be using SPSS. In order to encourage students to do their own work, I want students to have a partially unique dataset. Thus, I'd like each student to open the master data file and then run a couple of lines of syntax that produce a unique data file. In pseudo code, I'd like to do something like the following, where 12345551234 is a student number:
set random number generator = 12345551234
select 90% random subset of cases and drop the rest.
What is simple SPSS syntax for dropping a random subset of cases from the data file?
After playing around I came up with this syntax, but perhaps there are simpler or otherwise better suggestions.
* Replace the number below with the student number
* (or the first 10 digits of the student number).
SET SEED=1234567891.
* Clear any filter and make all cases available.
FILTER OFF.
USE ALL.
* Keep a random 90% of cases and permanently drop the rest.
SAMPLE .90.
EXECUTE.

How to get descriptive statistics on questionnaire items by group using SPSS?

I have carried out an evaluation of a product using a Likert-scale questionnaire and imported the data into SPSS. I have my columns arranged as follows:
ID, Group, Q1, Q2, Q3, Q4
I have two different groups completing the questionnaire, with each person having a different numerical ID. Under the Q columns, I have the score that person gave (from 1 to 5) on the Likert scale.
In all there are over 300 responses.
I am running analysis using 'descriptive statistics/frequencies' from the menubar and not getting the tables I am looking for. Basically, it is including all respondents together, whereas I would like it to compare the two groups in the tables.
How can I get descriptive statistics on questionnaire items by group using SPSS?
In addition, if you have any further tips as to what analysis I could perform on this type of data in SPSS I'd be most grateful. I'd like to show that there isn't a significant difference in opinions between the groups, and from looking at the data, it appears that this is the case.
One option: split the file by group, then run descriptive statistics as usual; see the sketch below.
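A minimal sketch, assuming the grouping variable is named group and the items are Q1 to Q4:

sort cases by group.
split file layered by group.
frequencies variables=Q1 Q2 Q3 Q4.
split file off.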
See this SPSS FAQ item from UCLA on how to analyze data by categories.
The short answer to your question is that crosstabs Q1 to Q4 by group. will produce the table you want. Or, if you have the CTABLES package available, a more compact table will be produced by:
variable level group_id Q1 to Q4 (nominal).
ctables
  /table Q1 + Q2 + Q3 + Q4 by group_id.
Either can be elaborated on to produce other statistics if wanted. It seems to me a chi-square test would be sufficient for your question.
As far as further analysis it is a bit of an open-ended question that needs more focus to be able to effectively answer. I frequently suggest visual exploration for such exploratory analysis, and hence I would suggest perusing this question on the site, Visualizing Likert responses using R or SPSS for potential ideas about how to visualize the responses. Another motivating post may be How to visualize 3D contingency matrix?.
There are a ton of other questions related to analyzing likert responses on this site though, and it is difficult to give any more specific advice without a more specific motivation for the analysis.
While the above answers all have their good points, I usually prefer this procedure (type the following into a syntax window and Run):
means q1 to q4 by group/stat anova.
This will give you group means, sample sizes, and standard deviations as well as tests of the difference in means between the groups, for each of the variables Q1 to Q4. Of course, the tests will only give you valid results to the extent that your data meet the standard assumptions of anova. Some may say that variables measured on an ordinal 1-5 scale are not suitable for anova, and in academic contexts this is often true, but in business contexts most people are willing to sacrifice some rigor for the sake of convenience. It's much more convenient to compare 4x2=8 means than it is to compare the distributions of 4x5x2=40 categories of responses.
This can easily be done by using the "Crosstabs" function in SPSS for Windows:
Analyze --> Descriptive Statistics --> Crosstabs. Move the dependent variable(s) into the "Row(s)" box, then move the grouping variable into the "Column(s)" box, then click OK.
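The pasted syntax equivalent would be something like the following sketch (assuming the grouping variable is named group; COLUMN adds column percentages, which make the two groups directly comparable):

CROSSTABS /TABLES=Q1 Q2 Q3 Q4 BY group /CELLS=COUNT COLUMN.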

Mahout slope one and categories

I would like to use slope one as an item recommender. The problem is that I have groups of items that are not correlated with each other. It seems to me that there is no way to tell Mahout to use the diff storage for just one group of products. I want this because the groups have an average of 100 items, and I'd prefer not to create a mongoDbDiffStorage from scratch. Does the rescorer determine which differences should be computed, so as to avoid storing useless data?
Thanks
You mean that you know that you want to consider some items completely unrelated? No, you'd have to write your own DiffStorage for that.
Or, I might suggest that if these item sets are completely unrelated, then what you really have is a separate recommender problem for each group of items, and you can use a Recommender for each subset of data. This would probably be easier and more efficient in many ways; see the sketch below.
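A minimal sketch of the one-recommender-per-group idea against the pre-0.9 Taste API (slope one was removed in Mahout 0.9); the file names and user ID are placeholders:

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.slopeone.SlopeOneRecommender;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;

public class PerGroupRecommenders {
    public static void main(String[] args) throws Exception {
        // One DataModel (and thus one diff storage) per unrelated item group,
        // so no cross-group diffs are ever computed or stored.
        Recommender groupA =
            new SlopeOneRecommender(new FileDataModel(new File("groupA.csv")));
        Recommender groupB =
            new SlopeOneRecommender(new FileDataModel(new File("groupB.csv")));

        // Recommend 5 items to user 42 from group A's items only;
        // groupB would be queried the same way for its own items.
        List<RecommendedItem> recs = groupA.recommend(42L, 5);
        for (RecommendedItem item : recs) {
            System.out.println(item.getItemID() + " -> " + item.getValue());
        }
    }
}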
