My dataset looks like the one below, but there are many more brands and categories.
I would like to restructure it so that each brand is a row and each attribute (good quality, affordable) is a column.
I've tried VARSTOCASES and I can calculate means from it, but that's not my desired output.
I need to obtain the brand names somehow - I could extract them from each of my variable names with
compute brand=char.substr(brand, 16)
like
compute brand=char.substr(P1_Good_Quality_BMW, 16)
I am fine with the VARSTOCASES part, and then I can put my output like GQ into a column, but I don't know how to capture all of the brand names and match them to the mean values of the attributes.
Thank you in advance for your help
This will get the data into the structure you intended - with a row for each brand and a column for each attribute:
varstocases /make GQ from P1_GoodQuality_BMW P1_GoodQuality_Audi P1_GoodQuality_Mercedes
/make Afford from P2_Affordable_BMW P2_Affordable_Audi P2_Affordable_Mercedes
/index=brand(GQ).
* at this point you should have the table you were trying to create,
* we'll just extract the brand names properly.
compute brand=char.substr(brand, 16).
execute.
* now we have the data structured nicely, we can aggregate by brand.
dataset declare agg.
aggregate /out=agg /break=brand /GQ Afford = mean (GQ Afford).
dataset activate agg.
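For anyone who wants to see the restructure-then-aggregate logic outside SPSS, here is a rough pure-Python sketch of the same idea (the column names and ratings are made-up sample data, not from the question):

```python
# Illustrative pure-Python sketch (not SPSS) of the same logic: wide columns
# like "P1_GoodQuality_BMW" become rows keyed by brand, then attribute means
# are computed per brand.
from collections import defaultdict
from statistics import mean

rows = [  # one dict per respondent, wide format (sample data)
    {"P1_GoodQuality_BMW": 5, "P1_GoodQuality_Audi": 4,
     "P2_Affordable_BMW": 2, "P2_Affordable_Audi": 3},
    {"P1_GoodQuality_BMW": 4, "P1_GoodQuality_Audi": 5,
     "P2_Affordable_BMW": 1, "P2_Affordable_Audi": 4},
]

# restructure: collect values per (brand, attribute), like VARSTOCASES
long_data = defaultdict(list)
for row in rows:
    for name, value in row.items():
        prefix, attribute, brand = name.split("_")  # e.g. P1, GoodQuality, BMW
        long_data[(brand, attribute)].append(value)

# aggregate: mean per brand and attribute, like AGGREGATE /break=brand
means = {key: mean(values) for key, values in long_data.items()}
print(means[("BMW", "GoodQuality")])  # 4.5
```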
I am trying to create a model to predict whether or not someone is at risk of a stroke. My data contains some "object" variables that could easily be coded to 0 and 1 (like sex). However, I have some object variables with 4+ categories (e.g. type of job).
I'm trying to encode these objects into integers so that my models can ingest them. I've come across two methods to do so:
Create dummy variables for each feature, which creates more columns and encodes them as 0 and 1
Convert the object into an integer using LabelEncoder, which assigns values to each category like 0, 1, 2, 3, and so on within the same column.
Is there a difference between these two methods? If so, what is the recommended best path forward?
Yes, these two are different. The first method creates more columns, which means more features for the model to fit. The second creates only one feature for the model to fit. In machine learning, both ways have their own sets of pros and cons.
Which path to recommend depends on the ML algorithm you use, feature importance, etc.
Go the dummy variable route.
Say you have a column that consists of 5 job types: construction worker, data scientist, retail associate, machine learning engineer, and bartender. If you use a label encoder (0-4) to keep your data narrow, your model is going to interpret the job title of "data scientist" as 1 greater than the job title of "construction worker". It would also interpret the job title of "bartender" as 4 greater than "construction worker".
The problem here is that these job types really have no relation to each other as they are purely categorical variables. If you dummy out the column, it does widen your data but you have a far more accurate representation of what the data actually represents.
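To make the difference concrete, here is a small pure-Python illustration of the two encodings (not the pandas/sklearn API, just the idea) for a hypothetical job-type column:

```python
# Pure-Python illustration of label encoding vs. dummy/one-hot encoding,
# showing why label encoding imposes an ordering that isn't really there.
jobs = ["construction worker", "data scientist", "bartender", "data scientist"]
categories = sorted(set(jobs))

# label encoding: one integer column, implies a false ordering of categories
label_map = {cat: i for i, cat in enumerate(categories)}
label_encoded = [label_map[j] for j in jobs]

# dummy / one-hot encoding: one 0/1 column per category, no implied order
one_hot = [[1 if j == cat else 0 for cat in categories] for j in jobs]

print(label_encoded)  # [1, 2, 0, 2]
print(one_hot[0])     # [0, 1, 0] for "construction worker"
```

With the label encoding, "data scientist" (2) looks twice as large as "construction worker" (1) to a numeric model; with the dummy columns, each category is simply present or absent.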
Use dummy variables, thereby creating more columns/features for fitting your data. As your data will be scaled beforehand, this will not create problems later.
Overall, the accuracy of any model depends on the features involved, though note that adding more features does not automatically improve predictions; irrelevant features can hurt generalization.
I have a dataset in which there are multiple variables for various times.
Here is a sample part of the dataset:
I'm trying to identify the number/percentage of cases that have the same value in any of the multiple variables.
For example, if I have a database of teachers who left a school where they worked, and there are variables for why the teacher left each school, how would I find out if a teacher left multiple schools for the same reason? So I have reasonleft1, reasonleft2, reasonleft3, up to 20. Each reasonleft has the same coded response options. For example, 1=better opportunity elsewhere, 2=retired, 3=otherwise left workforce, etc. I'm stumped on how to figure out if any case/teacher left multiple schools for the same reason - for example, which teachers left multiple schools out of the 20 for 1=better opportunity elsewhere.
Thanks!
This can be done in the following two steps:
You need to restructure the dataset so that each "time" appears in a separate row.
Now you can aggregate to count the number of appearances of each reason per person.
The following syntax will do that:
varstocases
/make facilititype from facilititype1_pre facilititype2_pre facilititype3_pre
/make timeinplace from timeinplace1_pre timeinplace2_pre timeinplace3_pre
/make reasonleft from reasonleft1_pre reasonleft2_pre reasonleft3_pre
/index = timeN(reasonleft).
* continue the variable lists in the same pattern for as many time points as you have (up to 20).
dataset declare MyAgg.
aggregate /outfile=MyAgg /break=ID reasonleft /Ntimes=n.
At this point you have a new dataset which has the count of each reason for each ID. If you wish you can go back to wide format, and create a column for each reason (the values in each column are the count of times this reason appeared for the ID).
This way:
dataset activate MyAgg.
casestovars /id=ID /index=reasonleft/separator="_".
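If it helps to see the logic outside SPSS, here is a pure-Python sketch of the same restructure-and-count idea, using made-up sample data (None stands in for a missing third school):

```python
# Pure-Python sketch (not SPSS) of stacking reasonleft1..reasonleftN into
# long format and counting how often each reason appears per teacher.
from collections import Counter

teachers = [
    {"ID": 1, "reasons": [1, 2, 1]},     # left twice for reason 1
    {"ID": 2, "reasons": [3, 2, None]},  # None = no third school
]

counts = {}  # (ID, reason) -> number of times that reason appears
for t in teachers:
    c = Counter(r for r in t["reasons"] if r is not None)
    for reason, n in c.items():
        counts[(t["ID"], reason)] = n

# teachers who left multiple schools for the same reason
repeaters = [key for key, n in counts.items() if n > 1]
print(repeaters)  # [(1, 1)]: teacher 1, reason 1 (better opportunity elsewhere)
```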
I have a dataset where each case is a student and I have a variable for sex (SEX), as well as one for major (MAJOR). The variable for sex has 2 possible values (male and female), whereas the one for major has dozens (biology, mathematics, etc.).
I would like to use that dataset to create another dataset with one case for each major and 3 variables: MAJOR, MALE and FEMALE. The value of the variable MALE for each major should be the number of men enrolled in that major and the value of the variable FEMALE should be the number of women enrolled in it. The value of MAJOR should just be the label of the value of the variable MAJOR in the original dataset corresponding to that case.
Just so it's clear, when I look at the dataset I would like to create, there should be one line per major, with one column MAJOR that contains the label of each major, one for MALE that contains the number of men enrolled in each major and one column for FEMALE that contains the number of women enrolled in each major.
The dataset I have was created with SPSS and I have never used that program, so I have no idea how to do that, even though it's probably very easy. I would be very grateful for your help!
Best,
Philippe
When your file is open, open a new syntax window, put the following code in it and run it:
dataset name OrigFile.
compute male=(SEX="MALE").
compute female=(SEX="FEMALE").
dataset declare NewFile.
aggregate /outfile='NewFile' /break=major /male female=sum(male female).
After running this you will have two open datasets - your original one and the new one you wanted to create.
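For readers who want to see what the aggregate computes, here is a pure-Python sketch of the same counting logic (made-up student records, not SPSS):

```python
# Pure-Python sketch (not SPSS) of the aggregate above: the 0/1 indicator
# variables (male, female), summed per major, give the counts of each sex.
from collections import defaultdict

students = [
    {"MAJOR": "biology", "SEX": "MALE"},
    {"MAJOR": "biology", "SEX": "FEMALE"},
    {"MAJOR": "biology", "SEX": "FEMALE"},
    {"MAJOR": "mathematics", "SEX": "MALE"},
]

totals = defaultdict(lambda: {"MALE": 0, "FEMALE": 0})
for s in students:
    # compute male=(SEX="MALE") in SPSS is a 0/1 indicator; summing it per
    # break group is the same as counting matching rows
    totals[s["MAJOR"]][s["SEX"]] += 1

print(totals["biology"])  # {'MALE': 1, 'FEMALE': 2}
```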
I am relatively new to using Pig for my work. I have a huge table (3.67 million entries) with fields -- id, feat1:value, feat2:value ... featN:value. Here id is text, feat_i is a feature name, and value is the value of feature i for the given id.
The number of features may vary for each tuple, since it's a sparse representation.
For example this is an example of 3 rows in data
id1 f1:23 f3:45 f7:67
id2 f2:12 f3:23 f5:21
id3 f7:30 f16:8 f23:1
Now the task is to group queries that have common features. I should be able to get those set of queries that have any feature overlapping.
I have tried several things. CROSS and JOIN create an explosion in data and the reducer gets stuck. I'm not familiar with adding conditions to the GROUP BY command.
Is there a way to write a condition in GROUP BY such that it selects only those queries that have common features.
For the above rows result will be:
id1, id2
id1, id3
Thanks
I can't think of an elegant way to do this in Pig. There is no way to group by based on some condition.
However, you could GROUP ALL your relation and pass it to a UDF that compares each record with every other record. Not very scalable and a UDF is required, but it would do the job.
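To show what such a grouping would have to produce, here is a hedged pure-Python sketch (not Pig, and a different technique than the pairwise UDF: an inverted index from feature to ids), using the three sample rows from the question:

```python
# Pure-Python sketch: build an inverted index feature -> ids, then any two
# ids listed under the same feature share that feature.
from collections import defaultdict
from itertools import combinations

rows = {
    "id1": {"f1", "f3", "f7"},
    "id2": {"f2", "f3", "f5"},
    "id3": {"f7", "f16", "f23"},
}

index = defaultdict(set)  # feature -> set of ids having it
for rid, feats in rows.items():
    for f in feats:
        index[f].add(rid)

pairs = set()  # unordered id pairs sharing at least one feature
for ids in index.values():
    for a, b in combinations(sorted(ids), 2):
        pairs.add((a, b))

print(sorted(pairs))  # [('id1', 'id2'), ('id1', 'id3')]
```

This avoids the full cross join: only ids that actually share a feature ever get compared, which is the same intuition behind the feature-table approach below.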
I would try not to parse the string.
If it's possible, read the data as two columns: the ID column and the features column.
Then I would cross join with a features table. It would essentially be a table looking like this:
f1
f2
f3
etc
Create this manually in Excel and load it onto your HDFS.
Then I would group by the features column and for each feature I would print all IDs
essentially, something like this:
features = load 'features.txt' using PigStorage(',') as (feature_number:chararray);
cross_data = cross features, data;
filtered_data = filter cross_data by (data_string_column matches feature_number);
grouped = group filtered_data by feature_number;
Then you can print all the IDs for each feature.
The only problem would be to read the data using something other than Pig storage.
But this would reduce your cross join from 3.6M*3.6M to 3.6M*(number of features).
I am setting some student assignments where most students will be using SPSS. In order to encourage students to do their own work, I want each student to have a partially unique dataset. Thus, I'd like each student to open the master data file and then run a couple of lines of syntax that produce a unique data file. In pseudocode, I'd like to do something like the following, where 12345551234 is a student number:
set random number generator = 12345551234
select a 90% random subset of cases and drop the rest.
What is simple SPSS syntax dropping a subset of cases from the data file?
After playing around I came up with this syntax, but perhaps there are simpler or otherwise better suggestions.
* Replace number below with student number or first 10 numbers of student number.
SET SEED=1234567891.
FILTER OFF.
USE ALL.
SAMPLE .90.
EXECUTE.
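The same seed-then-sample idea can be sketched in plain Python (not SPSS; the student number and case count below are hypothetical), which also shows why seeding makes each student's subset reproducible:

```python
# Pure-Python sketch of SET SEED + SAMPLE .90: seed the RNG with the student
# number so each student reproducibly keeps roughly 90% of the cases.
import random

student_number = 1234567891  # hypothetical example seed
cases = list(range(100))     # stand-in for the master data file's rows

random.seed(student_number)
kept = [c for c in cases if random.random() < 0.90]

# re-seeding with the same number yields the exact same subset
random.seed(student_number)
kept_again = [c for c in cases if random.random() < 0.90]
print(kept == kept_again)  # True
```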