Question
Is there a fast, scalable way to replace number values by mapped text labels in my visualisations?
Background
I often find myself with questionnaire data of the following format:
ID Sex Age class Answer to question
001 1 2 5
002 2 3 2
003 1 3 1
004 2 5 1
The Sex, Age class and Answer column values actually map to text labels. For the example of Sex:
ID Description
0 Unknown
1 Man
2 Woman
Similar mappings are possible for the other columns.
If I create visualisations of e.g. the distribution of sex in my respondent group I'll get a visual showing that 50% of my data has sex 1 and 50% of my data has sex 2.
The data itself often originates from an Excel or CSV file.
What I have tried
To make that visualisation meaningful to other people I:
create a second table containing the mapping between the value and label
create a relationship between the source data and the mapping
use the Description column of my mapping table as a category in my visualisations.
I have to do this for several columns in my dataset, which makes this a tedious process.
Ideal solution
A method that allows me to define, per column, a mapping between values and corresponding text labels. SPSS' VALUE LABELS command comes to mind.
You can simply create a calculated column on your table that defines how to map each ID value using the SWITCH function, and then use that column in your visual. For example:
Sex Label =
SWITCH([Sex],
1, "Man",
2, "Woman",
"Unknown"
)
(Here, the last argument is an else condition that gets returned if none of the previous get matched.)
If you want to do a whole bunch at a time, you can create a new table from your existing table using ADDCOLUMNS like this:
Test =
ADDCOLUMNS(
Table1,
"Sex Label", SWITCH([Sex], 1, "Man", 2, "Woman", "Unknown"),
"Question 1 Label", SWITCH([Question 1], 1, "Yes", 2, "No", "Don't Know"),
"Question 2 Label", SWITCH([Question 2], 1, "Yes", 2, "No", "Don't Know"),
"Question 3 Label", SWITCH([Question 3], 1, "Yes", 2, "No", "Don't Know")
)
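For comparison outside Power BI, the same idea (one mapping per column plus a fallback label) can be sketched in Python. This is only an illustration of the logic, not part of the DAX solution; the names `VALUE_LABELS`, `FALLBACKS` and `label_rows` are hypothetical, and the rows reuse the Sex values from the sample table in the question.

```python
# Per-column value-to-label mappings, mirroring the SWITCH calls above.
VALUE_LABELS = {
    "Sex": {1: "Man", 2: "Woman"},
}
# Label used when a value has no entry (the "else" argument of SWITCH).
FALLBACKS = {"Sex": "Unknown"}


def label_rows(rows):
    """Return copies of the rows with a '<column> Label' field per mapped column."""
    labelled = []
    for row in rows:
        new_row = dict(row)
        for column, mapping in VALUE_LABELS.items():
            new_row[f"{column} Label"] = mapping.get(row[column], FALLBACKS[column])
        labelled.append(new_row)
    return labelled


# Sex values taken from the sample table in the question.
rows = [
    {"ID": "001", "Sex": 1},
    {"ID": "002", "Sex": 2},
    {"ID": "003", "Sex": 1},
    {"ID": "004", "Sex": 2},
]
labelled = label_rows(rows)
```

Adding another column then only means adding one more entry to each dictionary, which is roughly what the extra SWITCH lines in ADDCOLUMNS do.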
Related
I am using Google Forms to create a survey with weighted answers. I've been able to make things work when there is just one possible correct answer: I made a separate tab with a set of answer tables with point values assigned, then used VLOOKUP to match the given response against the answer table and fetch the assigned point value.
=VLOOKUP(P2, Sheet2!$A$49:$B$50, 2, FALSE)
P2 is a value pulled from the "Form Responses" tab - in this case, a yes/no answer. Sheet 2 has a table for each question with the possible answers and the point values for each answer (A49=yes, A50=no)
However, for some of the questions, multiple answers are valid and I want to add up the total number of points for that given question. So for example:
What are your hobbies? and folks can choose from
Riding your bike
Playing football
Swimming
Going fishing
Painting
And the respective point values are 2, 2, 3, 4, 4
So then, if someone chose the "Swimming" and "Going fishing" checkboxes in the form, I'd get "7", and if someone chose "Riding your bike", "Playing football", and "Painting", I'd get "8".
I realize that the output from the Google form will list the chosen answers all in one cell (Playing football, Going fishing), so I'm not sure how to make it count each answer (especially since some of them are multi-word answers) and output the sum of the values.
VLOOKUP is not suitable in this case. Try FILTER, like:
=FILTER(Sheet2!B49:B50, Sheet2!A49:A50=P2)
Then wrap it in VLOOKUP, like:
=SUMPRODUCT(IFNA(VLOOKUP(FILTER(Sheet2!B49:B50, Sheet2!A49:A50=P2), sheetx!A:B, 2, 0)))
where sheetx!A:B looks like:
Riding your bike 2
Playing football 2
Swimming 3
Going fishing 4
Painting 4
If Sheet2!B49:B50 contains multiple values separated by a comma and a space, you will need to split them, like:
=SUMPRODUCT(IFNA(VLOOKUP(FILTER(
IFERROR(SPLIT(Sheet2!B49:B50, ", ")), Sheet2!A49:A50=P2), sheetx!A:B, 2, 0)))
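The split-then-sum logic can also be sketched in Python to make the intended result easier to check. This is a minimal illustration rather than a replacement for the formulas; `POINTS` and `score_response` are hypothetical names, and the point values come from the question.

```python
# Point values per answer, taken from the question.
POINTS = {
    "Riding your bike": 2,
    "Playing football": 2,
    "Swimming": 3,
    "Going fishing": 4,
    "Painting": 4,
}


def score_response(cell, separator=", "):
    """Sum the points of every answer listed in one multi-select response cell."""
    if not cell.strip():
        return 0
    # Splitting on ", " keeps multi-word answers such as "Going fishing" intact.
    answers = [answer.strip() for answer in cell.split(separator)]
    # Answers missing from the table add nothing to the total.
    return sum(POINTS.get(answer, 0) for answer in answers)
```

With the examples from the question, `score_response("Swimming, Going fishing")` gives 7 and `score_response("Riding your bike, Playing football, Painting")` gives 8.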
I have a table on sheet 1 that looks like the following:
Class Section Name Mark
10 A Tom 75
12 B Bob 85
12 A Roy 60
In another sheet (Sheet 2), I want to create a view that lets me see the average mark, optionally filtered by the 'Class' and 'Section' columns.
I am trying to achieve this with the following IFS command in my Sheet 2 respective cell.
=IFS(AND(D2="All Classes",G2= "All Sections"), {result 1}, // No specific class or section chosen, should return average of all marks
AND(D2 <>"All Classes",G2= "All Sections"), {result 2}, // A specific class is chosen, should return average of all marks from that class
AND(D2="All Classes",G2 <> "All Sections"), {result 3}, // A specific section is chosen, should return average of all marks from that section
AND(D2<>"All Classes",G2<> "All Sections"), {result 4}) // A specific class and a section is chosen, should return average of all marks from that class and that section.
I am trying to figure out the respective results 1, 2, 3 and 4 in the above command. I tried both FILTER and QUERY, but neither seems to work for me.
What built in function should I use?
Use this if the Class column contains numbers:
=AVERAGE(QUERY(Sheet1!A:D,CONCATENATE("select D where A ",IF(D2="All Classes","is not null","="&D2)," and B ",IF(G2="All Sections","is not null","='"&G2&"'")),1))
Use this if the Class column contains strings:
=AVERAGE(QUERY(Sheet1!A:D,CONCATENATE("select D where A ",IF(D2="All Classes","is not null","='"&D2&"'")," and B ",IF(G2="All Sections","is not null","='"&G2&"'")),1))
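The way the QUERY condition collapses the four IFS branches into one expression can be sketched in Python: each filter is simply skipped when its "All ..." option is selected. This is an illustrative sketch only; `ROWS` and `average_mark` are hypothetical names, and the data is the sample table from the question.

```python
from statistics import mean

# Rows from Sheet 1 in the question: (class, section, name, mark).
ROWS = [
    (10, "A", "Tom", 75),
    (12, "B", "Bob", 85),
    (12, "A", "Roy", 60),
]


def average_mark(rows, class_choice="All Classes", section_choice="All Sections"):
    """Average the marks, filtering by class and/or section only when one is chosen."""
    marks = [
        mark
        for class_, section, _name, mark in rows
        if (class_choice == "All Classes" or class_ == class_choice)
        and (section_choice == "All Sections" or section == section_choice)
    ]
    # mean() raises StatisticsError on an empty list, so report None instead.
    return mean(marks) if marks else None
```

For instance, `average_mark(ROWS, 12)` averages Bob and Roy, while `average_mark(ROWS, 12, "A")` only sees Roy's mark.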
As the title says.
I'm looking for a way to set the value labels of one column equal to the values in another column.
A B
1 Car
2 Bike
3 Van
1 Car
3 Van
Column A contains the numeric values. Column B contains the labels.
I want to tell SPSS to take the value 1 and assign it the label "Car" (and so on), as is classically done manually with:
VALUE LABELS A
1 "Car"
2 "Bike"
3 "Van".
Execute.
The syntax below will automatically create a new syntax that adds the value labels as you described.
Before starting, I'm recreating the sample data you posted to demonstrate on:
data list list/A (f1) B (a10).
begin data
1 "Car"
2 "Bike"
3 "Van"
1 "Car"
3 "Van"
end data.
dataset name orig.
Now we get to work:
* first we aggregate the data to get only one line for every value/label pair.
dataset declare agg.
aggregate out=agg /break A B /nn=n.
dataset activate agg.
* now we use the data to create a syntax file with the value label commands.
string cat (a50).
compute cat=concat('"', B, '"').
write out="yourpath\my labels syntax.sps" /"add value labels A ", A, cat, ".".
execute.
* getting back to the original data we can now execute the syntax.
dataset activate orig.
insert file="yourpath\my labels syntax.sps".
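The generation step (collapse the data to distinct value/label pairs, then emit one command per pair) can be sketched in Python to show what the generated syntax file is expected to contain. This is an illustration of the idea under the assumption that each value has a single label; `build_label_syntax` is a hypothetical helper, not an SPSS facility.

```python
def build_label_syntax(variable, pairs):
    """Return one 'add value labels' command per distinct (value, label) pair."""
    seen = []
    for value, label in pairs:
        pair = (value, label.strip())
        if pair not in seen:  # keep first-seen order, drop repeats
            seen.append(pair)
    return [f'add value labels {variable} {value} "{label}".' for value, label in seen]


# Value/label pairs from the sample data in the question.
pairs = [(1, "Car"), (2, "Bike"), (3, "Van"), (1, "Car"), (3, "Van")]
commands = build_label_syntax("A", pairs)
for command in commands:
    print(command)
```

The deduplication plays the role of the AGGREGATE step, and the formatted strings correspond to the lines written by the WRITE command.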
I am trying to rank the data in one column in my google sheet so that there are no duplicate rankings. I've seen some solutions such as =RANK(A2,$A$2:$A$10)+COUNTIF($A$2:A2,A2)-1, but the problem is that it increments the duplicates based on occurrence in the sheet.
Let's say my data that I'd like ranked is as follows:
1
1
1
2
The rank order would be 2, 3, 4, 1. The problem is, if I change the second entry to 2 (so that my data is now 1, 2, 1, 2) the ranking order becomes 3, 1, 4, 2 instead of 3, 2, 4, 1 like I want. In the original data, the fourth entry was initially the highest and I'd like it to still have the higher rank, but since the formula counts occurrences it gets demoted. Any way to accomplish this?
No, not with native spreadsheet functions. Spreadsheet formulae have no "awareness" of which values were entered most recently.
You would need to resort to Google Apps Script run on an "on edit" trigger.
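The bookkeeping such a script would need can be sketched in Python: remember when each cell last changed and use that as the tie-breaker. This is a rough sketch of the logic under that assumption, not Apps Script code; `StableRanker` and its methods are hypothetical names.

```python
class StableRanker:
    """Rank values (highest first), breaking ties by which entry got its value earliest."""

    def __init__(self, values):
        self.values = list(values)
        self.clock = 0
        # Every initial entry shares timestamp 0, so ties fall back to row order.
        self.changed_at = [0] * len(self.values)

    def set_value(self, index, value):
        """Record an edit, the way an on-edit handler would."""
        self.clock += 1
        self.values[index] = value
        self.changed_at[index] = self.clock

    def ranks(self):
        """Return the rank of each entry, in row order."""
        order = sorted(
            range(len(self.values)),
            key=lambda i: (-self.values[i], self.changed_at[i], i),
        )
        result = [0] * len(self.values)
        for rank, index in enumerate(order, start=1):
            result[index] = rank
        return result


ranker = StableRanker([1, 1, 1, 2])
before = ranker.ranks()   # [2, 3, 4, 1]
ranker.set_value(1, 2)    # second entry changes from 1 to 2
after = ranker.ranks()    # [3, 2, 4, 1]
```

The edit timestamp is exactly the piece of state a formula cannot see, which is why a trigger has to store it somewhere.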
Here is my word vector :
google
test
stackoverflow
yahoo
I have assigned a value for these words as follows :
google : 1
test : 2
stackoverflow : 3
yahoo : 4
Here are some sample users and their words :
user1 google, test , stackoverflow
user2 test , google
user3 test , yahoo
user4 stackoverflow , yahoo
user5 stackoverflow , google
user6
To cater for users who do not have a value contained in the word vector, I assign '0'.
Based on this, this corresponds to :
user1 1, 2 , 3
user2 2 , 1 , 0
user3 2 , 4 , 0
user4 3 , 4 , 0
user5 3 , 1, 0
user6 0 , 0 , 0
I am unsure whether these are the correct values, or even whether this is the correct approach for assigning a value to each word in the word vector, so that I can apply 'Euclidean distance' and 'correlation'. I'm basing this on a snippet from the book 'Programming Collective Intelligence':
"Collecting Preferences The first thing you need is a way to represent
different people and their preferences. If you were building a
shopping site, you might use a value of 1 to indicate that someone had
bought an item in the past and a value of 0 to indicate that they had
not. "
For my dataset I do not have preference values, so I am just using a unique numerical value to represent whether a user has a given word from the word vector.
Are these the correct values to set for my word vector? How should I determine what these values should be?
To make distance and similarity metrics work out, you need one column per word in the vocabulary, then fill those columns with booleans zero and one as the corresponding words occur in samples. E.g.
G T SO Y!
google, test, stackoverflow => 1, 1, 1, 0
test, google => 1, 1, 0, 0
stackoverflow, yahoo => 0, 0, 1, 1
etc.
The squared Euclidean distance between the first two vectors is now
(1 - 1)² + (1 - 1)² + (1 - 0)² + (0 - 0)² = 1
which makes intuitive sense as the vectors differ in exactly one position. Similarly, the squared distance between the final two vectors is four, which is the maximal squared distance in this space.
This encoding is an extension of the "one-hot" or "one-of-K" coding, and it's a staple of machine learning on text (although few textbooks care to spell it out).
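A minimal Python sketch of this encoding and of the distance calculation, using the vocabulary and users from the question. The names `VOCABULARY`, `encode` and `squared_euclidean` are hypothetical, chosen only for illustration.

```python
VOCABULARY = ["google", "test", "stackoverflow", "yahoo"]


def encode(words):
    """Return a 0/1 vector with one position per vocabulary word."""
    present = {word.strip() for word in words}
    return [1 if word in present else 0 for word in VOCABULARY]


def squared_euclidean(u, v):
    """Sum of squared differences between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


user1 = encode(["google", "test", "stackoverflow"])  # [1, 1, 1, 0]
user2 = encode(["test", "google"])                   # [1, 1, 0, 0]
user4 = encode(["stackoverflow", "yahoo"])           # [0, 0, 1, 1]

print(squared_euclidean(user1, user2))  # 1
print(squared_euclidean(user2, user4))  # 4
```

A user with no words from the vocabulary, such as user6, simply encodes to all zeros, so no special placeholder value is needed.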