Calculate a bunch of data to display on stacked bar - spss

I'm struggling with creating my first chart.
I have a dataset of ordinal-scaled data from a survey.
It contains several questions, each with possible answers from 1 to 5.
I have around 110 answers from different people, which I want to collect and show in a stacked bar chart.
The data looks like this:
| taste | region | brand | price |
| 1 | 3 | 4 | 2 |
| 1 | 1 | 5 | 1 |
| 1 | 3 | 4 | 3 |
| 2 | 2 | 5 | 1 |
| 1 | 1 | 4 | 5 |
| 5 | 3 | 5 | 2 |
| 1 | 5 | 5 | 2 |
| 2 | 4 | 1 | 3 |
| 1 | 3 | 5 | 4 |
| 1 | 4 | 4 | 5 |
...
To display that in a stacked bar chart, I need to count how often each answer occurs per question.
So I know that in the end it needs to be calculated like this:
| | taste | region | brand | price |
| 1 | 60 | 20 | 32 | 12 |
| 2 | 23 | 32 | 54 | 22 |
| 3 | 24 | 66 | 36 | 65 |
| 4 | 55 | 68 | 28 | 54 |
| 5 | 10 | 10 | 12 | 22 |
(This is just to demonstrate; the values are not correct.)
Or maybe there is already a function for this in SPSS, but I have no idea where or how.
Any advice on how to do that?

I can't think of a single command, but there are many ways to get where you want. Here's one.
First, recreating your sample data:
data list list/ taste region brand price .
begin data
1 3 4 2
1 1 5 1
1 3 4 3
2 2 5 1
1 1 4 5
5 3 5 2
1 5 5 2
2 4 1 3
1 3 5 4
1 4 4 5
end data.
Now counting the values for each row:
vector t(5) r(5) b(5) p(5).
* the vector command is only necessary so that the new variables are ordered conveniently for the following steps.
do repeat vl= 1 to 5/t=t1 to t5/r=r1 to r5/b=b1 to b5/p=p1 to p5.
compute t=(taste=vl).
compute r=(region=vl).
compute b=(brand=vl).
compute p=(price=vl).
end repeat.
Now we can aggregate and restructure to arrive at the exact data structure you specified:
aggregate /outfile=* /break= /t1 to t5 r1 to r5 b1 to b5 p1 to p5 = sum(t1 to p5).
varstocases /make taste from t1 to t5 /make region from r1 to r5
/make brand from b1 to b5/ make price from p1 to p5/index=val(taste).
compute val = char.substr(val,2,1).
alter type val(f1).
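As a cross-check outside SPSS, the same tally can be sketched in Python. This is only an illustration on the ten sample rows from the question, not a replacement for the syntax above; the names `columns`, `rows` and `counts` are made up for this sketch.

```python
from collections import Counter

# The ten sample rows from the question: taste, region, brand, price.
columns = ["taste", "region", "brand", "price"]
rows = [
    (1, 3, 4, 2), (1, 1, 5, 1), (1, 3, 4, 3), (2, 2, 5, 1), (1, 1, 4, 5),
    (5, 3, 5, 2), (1, 5, 5, 2), (2, 4, 1, 3), (1, 3, 5, 4), (1, 4, 4, 5),
]

# One Counter per question: how often each answer value was given.
counts = {name: Counter(row[i] for row in rows) for i, name in enumerate(columns)}

# Print one line per answer value (1..5), one column per question.
print("val", *columns)
for val in range(1, 6):
    print(val, *(counts[name][val] for name in columns))
```

Each line of the printout corresponds to one row of the target table (answer value, then the count per question), which is the shape a stacked bar chart needs.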


Cross join of two tables

I have two tables x and y. I want to join on column b such that I get z in output.
x:([a:1 2 1 3]; b:`a`a`b`b)
q) a | b
-----
1 | a
2 | a
1 | b
3 | b
y:([b:`a`a`a`b]; c:7 8 9 10)
q) b | c
------
a | 7
a | 8
a | 9
b | 10
Desired output:
q) a | b | c
-----------
1 | a | 7
1 | a | 8
1 | a | 9
2 | a | 7
2 | a | 8
2 | a | 9
1 | b | 10
3 | b | 10
Is this some sort of cross join?
An equi join (ej) will produce the result you want:
q)ej[`b;x;y]
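To make the semantics concrete, here is a minimal Python sketch of what an equi join on `b` means: every row of `x` is paired with every row of `y` that has the same `b`. It illustrates the idea only and says nothing about how q implements `ej`; the helper name `equi_join` is made up for this sketch.

```python
# Tables from the question as lists of dicts (one dict per row).
x = [{"a": 1, "b": "a"}, {"a": 2, "b": "a"}, {"a": 1, "b": "b"}, {"a": 3, "b": "b"}]
y = [{"b": "a", "c": 7}, {"b": "a", "c": 8}, {"b": "a", "c": 9}, {"b": "b", "c": 10}]


def equi_join(left, right, key):
    """Pair every left row with every right row that has the same key value."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]


z = equi_join(x, y, "b")
for row in z:
    print(row["a"], row["b"], row["c"])
```

So it is not a full cross join: rows are only combined within the same value of `b`.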

How do I know which section the answered questions belong to, and how do I count them?

My scenario:
I have a form with several sections, each section contains 4 questions with an optional "yes" option. link google forms
section 1 | section 2 | section 3 | section 4 | section 5
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20
y y y y y y y y y y
My result:
As per the outline above, I would need to have a list with the sections and their yes counts link google spreadsheets
My need:
sec. 1 | sec. 2 | sec. 3 | sec. 4 | sec. 5
2 | 4 | 1 | 3 | 0
I would like to count only the questions marked "yes" for each section, but I do not have the section line in the google spreadsheet linked to the Google Form.
There is an easy way to do this: first change the answers to numeric values (y→1, n→0), like this:
section 1 | section 2 | section 3 | section 4 | section 5
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20
1 1 1 1 1 1 1 1 1 1
Then you can use the function SUM() under each section to add up the 1s, which gives the number of "yes" answers. (COUNT() would count every numeric cell, including the 0s.) Please ask if anything in this answer needs further clarification.
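The per-section count can be sketched in Python as follows. The 0/1 values in `answers` are hypothetical (the question does not show which questions were answered "yes"); they were chosen only so that the totals match the desired output of 2, 4, 1, 3, 0. The block size of 4 comes from the question.

```python
# Hypothetical 0/1 answers for questions 1..20 (1 = "yes", 0 = not "yes").
answers = [1, 0, 1, 0,  1, 1, 1, 1,  0, 0, 1, 0,  1, 1, 0, 1,  0, 0, 0, 0]

QUESTIONS_PER_SECTION = 4

# Sum each consecutive block of four answers to get the "yes" count per section.
per_section = [
    sum(answers[i:i + QUESTIONS_PER_SECTION])
    for i in range(0, len(answers), QUESTIONS_PER_SECTION)
]

for number, yes_count in enumerate(per_section, start=1):
    print(f"sec. {number}: {yes_count}")
```

Because the section is derived from the question's position, no separate section row is needed in the response data.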

Counting if range matches ranged criteria 1:1

I have an ongoing scoreboard with a friend for a game we play. It looks like this:
A B C D E F
+-----------------------------+-------+------+--------+--------+------------+
1 | Through the Ages Scoreboard | | | | | |
+-----------------------------+-------+------+--------+--------+------------+
2 | Game title | Kevin | M | First? | Winner | Difference |
+-----------------------------+-------+------+--------+--------+------------+
3 | thekoalaz's Game | 174 | 213 | Kevin | M | 39 |
4 | Game #0 | 242 | 126 | Kevin | Kevin | 116 |
5 | Game #1 | 105 | 146 | Kevin | M | 41 |
6 | Game #2 | 158 | 135 | Kevin | Kevin | 23 |
7 | Game #3 | 149 | 145 | M | Kevin | 4 |
8 | Game #4 | 91 | 145 | Kevin | M | 54 |
9 | Game #5 | 211 | 187 | M | Kevin | 24 |
10 | Game #6 | 160 | 158 | M | Kevin | 2 |
11 | Game #7 | 154 | 215 | Kevin | M | 61 |
12 | Game #8 | 169 | 177 | M | M | 8 |
13 | Game #9 | 135 | 129 | M | Kevin | 6 |
14 | Game #10 | 156 | 262 | Kevin | M | 106 |
15 | Game #11 | 205 | 171 | M | Kevin | 34 |
16 | Game #12 (2) | 186 | 203 | Kevin | M | 17 |
17 | | | | | | |
+-----------------------------+-------+------+--------+--------+------------+
Where there's space at the end of the board to add scores for future games.
How do I count how many times the player who goes first wins? In this case it should be 3: D4 = E4, D6 = E6, D12 = E12. Is this possible to do in a single formula? And I'd like to make adding future game scores "just work" with this as well.
Here, first is {K;K;K;K;M;K;M;M;K;M;M;K;M;K}
And winner is {M;K;M;K;K;M;K;K;M;M;K;M;K;M}
I tried =COUNTIF($E$3:$E, $D$3:$D), but this gives me 7, which I presume is the same as =COUNTIF($E$3:$E, $D$3), without the ranged criteria.
Other ranged criteria questions didn't seem to focus on this 1:1 necessity (or maybe I don't know how to word it).
Here's what I used:
=SUMPRODUCT(D3:D=E3:E, E3:E<>"")
Let's break it down.
D3:D=E3:E (also expressible as EQ(D3:D, E3:E)) - equality. I tried to figure out the concept of testing equality of ranges, but the best thing I could find was Microsoft's tutorial on array formulas. What I can say is if you just put =D3:D=E3:E in your Google sheet, it will just be one of the results--the one that matches the row. It requires =ArrayFormula(D3:D=E3:E) to enter as the array of equality results.
SUMPRODUCT - Sums the products of corresponding elements across multiple arrays. For example, SUMPRODUCT({1,3}, {2,4}) = 1*2 + 3*4 = 14. If used with one array, it just sums the array's values. TRUE=1 and FALSE=0, so applied to the array formula above, it counts how many times D3:D=E3:E is true. Ranges work as arrays, so maybe that's why wrapping the equality with ArrayFormula(...) isn't necessary.
E3:E<>"" - Another array expression, testing whether the E cell is not empty (<> is the "not equals" sign). I want this to work automatically for new entries, but D3:D=E3:E also evaluates to true for empty rows (empty=empty), so those rows have to be excluded. Multiplying the two arrays together is effectively an AND operator: "sum this if Dn=En AND En is not empty". To convince you, here are the truth tables:
+-----+---+---+ +------+---+---+
| AND | T | F | | MULT | 1 | 0 |
+-----+---+---+ +------+---+---+
| T | T | F | | 1 | 1 | 0 |
| F | F | F | | 0 | 0 | 0 |
+-----+---+---+ +------+---+---+
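The same logic can be checked outside the spreadsheet with a short Python sketch. The two lists are the First? and Winner columns from the question, followed by two empty strings that stand in for blank rows at the end of the board (the number of blank rows is an arbitrary choice for this illustration).

```python
# Columns D (First?) and E (Winner) for rows 3..16, K = Kevin, M = M,
# followed by two blank rows reserved for future games.
first = list("KKKKMKMMKMMKMK") + ["", ""]
winner = list("MKMKKMKKMMKMKM") + ["", ""]

# Multiply the two boolean tests per row (True * True == 1), then sum,
# mirroring SUMPRODUCT(D3:D=E3:E, E3:E<>"").
first_player_wins = sum((f == w) * (w != "") for f, w in zip(first, winner))

# Without the emptiness test, the blank rows would be counted as matches.
naive_count = sum(f == w for f, w in zip(first, winner))

print(first_player_wins)
print(naive_count)
```

The difference between the two counts is exactly the number of blank rows, which is why the second condition is needed.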

weka gives 100% correctly classified instances for every dataset

I'm not able to get a meaningful accuracy: every dataset I provide gives 100% accuracy for every classifier algorithm I apply. My dataset covers 10 people.
I get the same accuracy with the Naive Bayes, J48, and JRip classifiers.
+----+-------+----+----+----+----+----+-----+----+------+-------+-------+-------+
| id | name | q1 | q2 | q3 | m1 | m2 | tut | fl | proj | fexam | total | grade |
+----+-------+----+----+----+----+----+-----+----+------+-------+-------+-------+
| 1 | abv | 5 | 5 | 5 | 13 | 13 | 4 | 8 | 7 | 40 | 100 | p |
| 2 | ca | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 40 | 48 | f |
| 3 | ga | 4 | 2 | 3 | 5 | 10 | 4 | 5 | 6 | 20 | 59 | f |
| 4 | ui | 5 | 4 | 4 | 12 | 13 | 3 | 7 | 7 | 39 | 94 | p |
| 5 | pa | 4 | 1 | 1 | 4 | 3 | 2 | 4 | 5 | 22 | 46 | f |
| 6 | la | 2 | 3 | 1 | 1 | 2 | 0 | 4 | 2 | 11 | 26 | f |
| 7 | ka | 5 | 4 | 1 | 3 | 3 | 1 | 6 | 4 | 24 | 51 | f |
| 8 | ma | 5 | 3 | 3 | 9 | 8 | 4 | 8 | 0 | 20 | 60 | p |
| 9 | ash | 2 | 5 | 5 | 11 | 12 | 3 | 7 | 6 | 30 | 81 | p |
| 10 | opo | 4 | 2 | 1 | 13 | 1 | 3 | 7 | 3 | 35 | 69 | p |
+----+-------+----+----+----+----+----+-----+----+------+-------+-------+-------+
Make sure to not include any unique identifier column.
Also don't include the total.
Most likely, the classifiers learned that "name" is a good predictor and/or that you need a total of more than 59 points to pass.
I suggest you even withhold at least one exercise because of that - some classifiers will still learn that the sum of the individual points is necessary to pass.
I assume you want to find out if one part is most indicative of passing, i.e. "if you do well on part 3, you will likely pass". But to answer this question, you need to account for e.g. different amount of points per question, etc. - otherwise, your predictor will just identify which question has the most points...
Also, 10 is much too small a sample size!
You can see from the output that is displayed that the tree that J48 generated used only the variable fl, so I do not think that you have the problem that @Anony-Mousse referred to.
I notice that you are testing on the training set (see the "Test Options" radio buttons at upper left of the GUI). That almost always overestimates the accuracy. What you are seeing is overfitting. Instead, use cross-validation to get a better estimate of the accuracy you could expect on new data. With only 10 data points, you should use either 10 folds or 5.
Try evaluating your model with k-fold cross-validation or a percentage split.
With a percentage split, the training set is typically 2/3 of the dataset and the test set 1/3.
Also, I think your dataset is very small; with so few instances, a high accuracy can easily occur.
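The gap between testing on the training set and cross-validating can be illustrated with a small Python sketch. This is not Weka and not one of the classifiers from the question: it swaps in a 1-nearest-neighbour classifier, chosen because it always scores 100% on its own training data. The features are the ten rows from the question with id, name and total left out, as suggested above.

```python
# Rows from the question without id, name and total:
# (q1, q2, q3, m1, m2, tut, fl, proj, fexam), grade
data = [
    ((5, 5, 5, 13, 13, 4, 8, 7, 40), "p"),
    ((1, 1, 1, 1, 1, 1, 1, 1, 40), "f"),
    ((4, 2, 3, 5, 10, 4, 5, 6, 20), "f"),
    ((5, 4, 4, 12, 13, 3, 7, 7, 39), "p"),
    ((4, 1, 1, 4, 3, 2, 4, 5, 22), "f"),
    ((2, 3, 1, 1, 2, 0, 4, 2, 11), "f"),
    ((5, 4, 1, 3, 3, 1, 6, 4, 24), "f"),
    ((5, 3, 3, 9, 8, 4, 8, 0, 20), "p"),
    ((2, 5, 5, 11, 12, 3, 7, 6, 30), "p"),
    ((4, 2, 1, 13, 1, 3, 7, 3, 35), "p"),
]


def nearest_label(point, training):
    """1-nearest-neighbour: label of the closest training row (squared Euclidean distance)."""
    def dist(row):
        return sum((a - b) ** 2 for a, b in zip(point, row[0]))
    return min(training, key=dist)[1]


# Testing on the training set: every row is its own nearest neighbour.
train_acc = sum(nearest_label(x, data) == y for x, y in data) / len(data)

# Leave-one-out cross-validation: predict each row from the other nine.
loo_hits = 0
for i, (x, y) in enumerate(data):
    rest = data[:i] + data[i + 1:]
    loo_hits += nearest_label(x, rest) == y
loo_acc = loo_hits / len(data)

print("training-set accuracy:", train_acc)
print("leave-one-out accuracy:", loo_acc)
```

The training-set figure is 1.0 by construction and therefore says nothing about new data; only the held-out figure is an estimate of that, and with ten rows it is still a rough one.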

Normalization, changing it to 1nf, 2nf and 3nf

INVOICE
So I have to put this into 1NF, 2NF and 3NF.
PROD_NUM   | PROD_LABEL        | PROD_PRICE
-----------|-------------------|-----------
AA-E3422QW | ROTARY SANDER     | 49.95
AA-E3422QW | ROTARY SANDER     | 49.95
QD-300932X | 0.25IN. DRILL BIT | 3.45
RU-95748G  | BAND SAW          | 33.99
GH-778345P | POWER DRILL       | 87.75
VEN_CODE | VEN_NAME
---------|---------------
211      | NEVERFAIL, INC
211      | NEVERFAIL, INC
211      | NEVERFAIL, INC
309      | BEGOOD, INC
157      | TOUGHGO, INC
So far I have these as my 2NF. Am I on the right track? And how do I put the table into 3NF?
Will my 2NF look like this? [2NF table image]
I think the picture you were given is considered 1NF.
What you initially showed is 3NF, but you'll also need an additional table that records which product comes from which vendor, and you'll need to modify the invoice table.
Vendor - Unique list of vendors
VEN_ID | VEN_CODE | VEN_NAME
-------|----------|---------------
1 | 211 | NEVERFAIL, INC
2 | 309 | BEGOOD, INC
3 | 157 | TOUGHGO, INC
Product - Unique list of products
PROD_ID | PROD_NUM | PROD_LABEL | PROD_PRICE
--------|------------|-------------------|-----------
1 | AA-E3422QW | ROTARY SANDER | 49.95
2 | QD-300932X | 0.25IN. DRILL BIT | 3.45
3 | RU-95748G | BAND SAW | 33.99
4 | GH-778345P | POWER DRILL | 87.75
Vendor_Product - the mapping between products and vendors
VEN_ID | PROD_ID
-------|----------
1 | 1
1 | 2
2 | 3
3 | 4
Purchases - The transactions that happened
PURCH_ID | INV_NUM | SALE_DATE | PROD_ID | QUANT_SOLD
---------|---------|-------------|---------|------------
1 | 211347 | 15-JAN-2006 | 1 | 1
2 | 211347 | 15-JAN-2006 | 2 | 8
3 | 211347 | 15-JAN-2006 | 3 | 1
4 | 211348 | 15-JAN-2006 | 1 | 2
5 | 211349 | 16-JAN-2006 | 4 | 1
I think that is good, but it can be split again.
Invoices - A unique list of invoices
INV_ID | INV_NUM | SALE_DATE
--------|---------|-------------
1 | 211347 | 15-JAN-2006
2 | 211348 | 15-JAN-2006
3 | 211349 | 16-JAN-2006
Purchases - The transactions that happened
PURCH_ID | INV_ID | PROD_ID | QUANT_SOLD
---------|--------|---------|---------
1 | 1 | 1 | 1
2 | 1 | 2 | 8
3 | 1 | 3 | 1
4 | 2 | 1 | 2
5 | 3 | 4 | 1
To get 2NF, combine the vendor information back into the Product table, with these columns:
PROD_ID | PROD_NUM | PROD_LABEL | PROD_PRICE | VEN_CODE | VEN_NAME
In this case, the Vendor and Vendor_Product tables aren't needed.
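The fully split design above can be tried out with Python's built-in sqlite3 module. This is a sketch under assumptions: the lowercase table and column names and the column types are choices made for this illustration, and the rows are the ones listed in the tables above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE vendor  (ven_id INTEGER PRIMARY KEY, ven_code INTEGER, ven_name TEXT);
CREATE TABLE product (prod_id INTEGER PRIMARY KEY, prod_num TEXT,
                      prod_label TEXT, prod_price REAL);
CREATE TABLE vendor_product (ven_id  INTEGER REFERENCES vendor(ven_id),
                             prod_id INTEGER REFERENCES product(prod_id));
CREATE TABLE invoice (inv_id INTEGER PRIMARY KEY, inv_num INTEGER, sale_date TEXT);
CREATE TABLE purchase (purch_id INTEGER PRIMARY KEY,
                       inv_id  INTEGER REFERENCES invoice(inv_id),
                       prod_id INTEGER REFERENCES product(prod_id),
                       quant_sold INTEGER);
""")

conn.executemany("INSERT INTO vendor VALUES (?, ?, ?)", [
    (1, 211, "NEVERFAIL, INC"), (2, 309, "BEGOOD, INC"), (3, 157, "TOUGHGO, INC")])
conn.executemany("INSERT INTO product VALUES (?, ?, ?, ?)", [
    (1, "AA-E3422QW", "ROTARY SANDER", 49.95),
    (2, "QD-300932X", "0.25IN. DRILL BIT", 3.45),
    (3, "RU-95748G", "BAND SAW", 33.99),
    (4, "GH-778345P", "POWER DRILL", 87.75)])
conn.executemany("INSERT INTO vendor_product VALUES (?, ?)",
                 [(1, 1), (1, 2), (2, 3), (3, 4)])
conn.executemany("INSERT INTO invoice VALUES (?, ?, ?)", [
    (1, 211347, "15-JAN-2006"), (2, 211348, "15-JAN-2006"), (3, 211349, "16-JAN-2006")])
conn.executemany("INSERT INTO purchase VALUES (?, ?, ?, ?)",
                 [(1, 1, 1, 1), (2, 1, 2, 8), (3, 1, 3, 1), (4, 2, 1, 2), (5, 3, 4, 1)])

# Join the normalized tables back into one flat invoice listing.
rows = conn.execute("""
SELECT i.inv_num, p.prod_num, v.ven_name, pu.quant_sold
FROM purchase pu
JOIN invoice i         ON i.inv_id = pu.inv_id
JOIN product p         ON p.prod_id = pu.prod_id
JOIN vendor_product vp ON vp.prod_id = p.prod_id
JOIN vendor v          ON v.ven_id = vp.ven_id
ORDER BY pu.purch_id
""").fetchall()

for row in rows:
    print(row)
```

The join shows that nothing is lost by the split: each vendor name and product label is stored once, yet the flat invoice lines can still be reproduced on demand.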
