Attach values to one table from a differently shaped table - spss

So I have a data set collected from beekeeper surveys that looks something like this:
[Table 1]
ID | Crop | Year | Hives | HoneyP | CropP |
----+------+------+-------+--------+-------+
1 2 2014 2391 . .
2 4 2008 136 . .
3 12 2019 12346 . .
| |
V (and so on...) V
I also have a spreadsheet of crop prices over a time series, e.g.
[Table 2]
Year | Crop1 | Crop2 |
-----+-------+-------+
2008 $2.56 $6.45
2009 $2.42 $6.64 ->
2010 $2.69 $6.68 (and more crops) ->
2011 $2.62 $7.05 ->
...
Is it possible in PSPP/SPSS to iterate over the observations in Table 1 and insert values from Table 2 into the CropP variable based on the year and crop identifier?
This is what I'm imagining, in pseudo-code:
for each obs:
obs.CropP = Table2[obs.Year][obs.Crop]
I also have other attributes I want to add in to the observations (e.g. price index), but they're all one dimensional and could be entered manually if necessary; if I can programmatically add in a crop's price in the survey year, it would save a lot of time and trouble.

I suggest reshaping instead of iterating.
Assuming you've read both tables into SPSS, and the datasets are called table1 and table2 - follow these two steps:
First you need to reshape the crop prices data to fit the main dataset:
dataset activate table2.
varstocases /make cropPR from crop1 to cropX/index=crop(cropPR).
*your crop index now is a string like "crop3" and needs to be turned into a number.
compute crop=char.substr(crop,5,5).
alter type crop (f5).
sort cases by year crop.
Now this table is ready to attach to your main data.
dataset activate table1.
sort cases by year crop.
match files /file=* /table=table2 /by year crop.
exe.
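For readers who think of the lookup the way the question's pseudo-code does, the same reshape-then-match logic can be sketched in plain Python. This is only an illustration under assumptions: the sample rows are made up, the names `wide_prices`, `to_long`, and `attach_prices` are hypothetical, and the price columns are assumed to be named `Crop1`, `Crop2`, and so on.

```python
# Illustrative sketch only: sample rows and helper names are hypothetical.

# Table 2 (wide): one row per year, one column per crop.
wide_prices = [
    {"Year": 2008, "Crop1": 2.56, "Crop2": 6.45},
    {"Year": 2009, "Crop1": 2.42, "Crop2": 6.64},
]

# Table 1 (survey): one row per observation.
survey = [
    {"ID": 1, "Crop": 2, "Year": 2009, "Hives": 2391},
    {"ID": 2, "Crop": 1, "Year": 2008, "Hives": 136},
    {"ID": 3, "Crop": 12, "Year": 2019, "Hives": 12346},  # no matching price
]


def to_long(rows):
    """Reshape wide rows into a {(year, crop_number): price} lookup."""
    long_prices = {}
    for row in rows:
        for column, price in row.items():
            if column.startswith("Crop"):
                crop_number = int(column[len("Crop"):])  # "Crop2" -> 2
                long_prices[(row["Year"], crop_number)] = price
    return long_prices


def attach_prices(observations, long_prices):
    """Add CropP to each observation; None when no (year, crop) match exists."""
    return [
        {**obs, "CropP": long_prices.get((obs["Year"], obs["Crop"]))}
        for obs in observations
    ]


merged = attach_prices(survey, to_long(wide_prices))
for row in merged:
    print(row["ID"], row["CropP"])
```

The long table keyed by (year, crop) plays the role of the reshaped table2, and the dictionary lookup plays the role of the keyed table match; observations without a matching price simply stay missing.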

Related

TABLEAU - Joins on the fly on raw data

I have been trying to perform joins on the fly in Tableau to perform some online computation - with no luck so far.
I wonder if any of you is aware of a way to achieve this?
I have a typical transactions dataset ("MYDATA"), with user ID (user's identifier), transaction date (when the transaction occurred), and purchases (the transactions). Something like:
ID TRANSACTION DATE PURCHASES
123 20/03/2020 1
123 22/03/2020 4
234 20/03/2020 10
234 22/03/2020 1
345 22/03/2020 5
What I would like to achieve is to add to it a variable with the SUM of PURCHASES by ID (say field "PURCHASES PER ID").
Then, critically, I'd like to make this computation update dynamically as I filter by different values in TRANSACTION DATE from the UI.
Ultimately I'd like to create a chart displaying the count of users (field "ID") in each value of the field "PURCHASES PER ID" (like bins), where "PURCHASES PER ID" is re-computed according to the date ranges selected in the worksheet.
Something like:
Case 1 : FILTER Transaction date = 20/03/2020 AND 22/03/2020
|---------------------|------------------|
| count OF ID | SUM of PURCHASES |
|---------------------|------------------|
| 2 | 5 |
|---------------------|------------------|
| 1 | 11 |
|---------------------|------------------|
Case 2 : FILTER Transaction date = 20/03/2020
|---------------------|------------------|
| count OF ID | SUM of PURCHASES |
|---------------------|------------------|
| 1 | 1 |
|---------------------|------------------|
| 1 | 10 |
|---------------------|------------------|
I'd expect this to be doable in Tableau, as I'm able to do it with a much simpler (and cheaper) tool like Google Data Studio.
In Data Studio I'd simply do a join between "MYDATA" and the sum of PURCHASES grouped by ID, using ID as the key. Then I'd be able to use that calculated sum of purchases as a dimension, and count the IDs in it.
Are you aware of a way to achieve the same in Tableau?
Many thanks
Think I got it.
My solution was:
Columns: ({FIXED [ID]: SUM([PURCHASES])})
Rows: CNTD(ID)
Filters: Add TRANSACTION DATE to Context
This allows me to achieve the view I wanted to.
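The reason the date filter has to be a context filter is that Tableau evaluates FIXED level-of-detail expressions before ordinary dimension filters but after context filters. A small Python sketch of that order of operations, using the sample rows from the question (the function name `users_per_purchase_total` is hypothetical):

```python
from collections import Counter, defaultdict

# Sample rows from the question: (ID, transaction date, purchases).
transactions = [
    (123, "20/03/2020", 1),
    (123, "22/03/2020", 4),
    (234, "20/03/2020", 10),
    (234, "22/03/2020", 1),
    (345, "22/03/2020", 5),
]


def users_per_purchase_total(rows, selected_dates):
    """Filter first, then total purchases per ID, then count IDs per total."""
    totals = defaultdict(int)
    for user_id, date, purchases in rows:
        if date in selected_dates:        # the context filter
            totals[user_id] += purchases  # like {FIXED [ID]: SUM([PURCHASES])}
    return Counter(totals.values())       # like CNTD(ID) for each total


case_1 = users_per_purchase_total(transactions, {"20/03/2020", "22/03/2020"})
case_2 = users_per_purchase_total(transactions, {"20/03/2020"})
print(dict(case_1))
print(dict(case_2))
```

With both dates selected, two users total 5 and one totals 11; with only 20/03/2020 selected, one user totals 1 and one totals 10, matching Case 1 and Case 2 in the question.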

Link tables and combine columns between them

My single dataset (generated from a large spreadsheet) is split into multiple tables. The relevant information is the dates and a numerical value assigned to them.
The data is organized as such on each table:
Start Date | End Date | Return Value
A1 | B1 | C1
A2 | B2 | C2
A3 | B3 | C3
The start and end dates are always quarter start and quarter end dates. The value C is always numeric. Each table represents a specific account. Some of these tables don't start until later dates (So Table 4 might have a start date equal to A3, for example).
I would like to group up these tables so the final report is organized as such:
Date range A1 - B1
Table1.C1 | calc(Table1.C1)
Table2.C1 | calc(Table2.C1)
Table3.C1 | calc(Table3.C1)
etc.
And on each detail line where TableX.CY is listed, perform relevant calculations using formulas.
The formulas I've already figured out and gotten sorted, but I'm lost as to the best way to refer to each table without creating brand new formulas per table. I.e., I don't want to create calcTable1(Table1.C1), calcTable2(Table2.C1), and so on, since there are over 40 tables in this.
How can I link these tables together so that the result set that CR is working with can be easily organized to produce this sort of report?
You can link tables in the Database Fields -> Database Expert -> Links tab.
If you wish to perform these calculations via SQL before they even reach the report, you can do so in the Database Expert by using the Add Command option to write your own SQL.
Otherwise you want to group based on a date range. So you should probably first create a formula to return the format Date range A1 - B1. Then create a group based on that formula you just made.
To add a group, go to Insert -> Group and select your formula as the subject of the Grouped By field.
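To illustrate the shape of the result set that makes this grouping work, here is a rough Python sketch: every per-account table is stacked into one row set tagged with its account name, and the rows are then grouped by date range, so a single formula can serve all accounts. Everything in it is an assumption for illustration: the dates, the values, and the `calc` placeholder are made up, and this is not Crystal Reports code.

```python
from collections import defaultdict

# Hypothetical per-account tables: (start date, end date, return value).
tables = {
    "Table1": [("2020-01-01", "2020-03-31", 1.5), ("2020-04-01", "2020-06-30", 2.0)],
    "Table2": [("2020-01-01", "2020-03-31", 0.5), ("2020-04-01", "2020-06-30", 1.0)],
    "Table4": [("2020-04-01", "2020-06-30", 3.0)],  # starts a quarter later
}


def calc(value):
    """Placeholder for the report formula; doubling is an arbitrary stand-in."""
    return value * 2


def group_by_date_range(account_tables):
    """Stack all tables into one row set and group it by (start, end)."""
    groups = defaultdict(list)
    for account, rows in account_tables.items():
        for start, end, value in rows:
            groups[(start, end)].append((account, value, calc(value)))
    return dict(groups)


report = group_by_date_range(tables)
for (start, end), lines in sorted(report.items()):
    print(f"Date range {start} - {end}")
    for account, value, result in lines:
        print(f"  {account}: {value} | {result}")
```

Accounts that start later simply contribute no detail line to the earlier date ranges.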

Filter to the latest month and then filter to the best score per person

I've got a Google Sheet which holds the results of a monthly competition. The format is
Name | Date | Score
--------------------------------
Alan Smith | 14/01/2016 | 500
Bob Dow | 14/01/2016 | 450
Bob Dow | 16/01/2016 | 470
Clare Allie| 16/01/2016 | 550
Declan Ham | 16/01/2016 | 350
Alan Smith | 10/02/2016 | 490
Bob Dow | 10/02/2016 | 425
Declan Ham | 12/02/2016 | 400
Declan Ham | 12/02/2016 | 390
Clare Allie| 12/02/2016 | 560
I want to do 2 things with this data
1) I want to create a new sheet which holds the latest 'best' results. For the data presented here that would be
Alan Smith | 10/02/2016 | 490
Bob Dow | 10/02/2016 | 425
Declan Ham | 12/02/2016 | 400
Clare Allie| 12/02/2016 | 560
i.e. the results from February with the 'best' score per person. Here Declan Ham's lower score of '390' was removed.
2) I want another sheet to hold the tournament ranking. People are ranked by their top 3 monthly scores, i.e. the best score for each person for each month is obtained and the top 3 scores are combined to give their place in the tournament.
So far I've attempted to use Google queries, vlookups, filters to get these new sheets. But, just focusing on 1), the best I've been able to achieve is
=FILTER(Results!$A:$B, MONTH(Results!$B:$B) = MONTH(MAX(Results!$B:$B)))
Which will get me the results from the latest month. But it does not remove duplicate entries by people.
Does anyone have a suggestion for how I can achieve these requirements? Feel like I'm treading water at the moment.
Rather than trying to remove duplicates, you need to identify the maximum score by each person; you can do that by grouping values by person, then aggregating using max(). Here's how that would look, for the month of February 2016:
=query(Results!A1:C,"select A,max(C) where todate(B) >= date '2016-2-1' group by A")
Instead of using a fixed value for the start of the latest month, we can get the year and month using spreadsheet formulas, and concatenate our query with them:
=query(Results!A1:C,"select A,max(C) where todate(B) >= date '"&year(max(Results!B2:B))&"-"&month(max(Results!B2:B))&"-1' group by A")
That addresses your first question.
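For anyone who wants to check the grouping logic outside the spreadsheet, here is a minimal Python sketch of the same idea: keep only the rows from the latest month, then take the maximum score per person. The rows are the sample data from the question; the function name `latest_month_best` is hypothetical.

```python
from datetime import date

# Sample data from the question: (name, date, score).
results = [
    ("Alan Smith", date(2016, 1, 14), 500),
    ("Bob Dow", date(2016, 1, 14), 450),
    ("Bob Dow", date(2016, 1, 16), 470),
    ("Clare Allie", date(2016, 1, 16), 550),
    ("Declan Ham", date(2016, 1, 16), 350),
    ("Alan Smith", date(2016, 2, 10), 490),
    ("Bob Dow", date(2016, 2, 10), 425),
    ("Declan Ham", date(2016, 2, 12), 400),
    ("Declan Ham", date(2016, 2, 12), 390),
    ("Clare Allie", date(2016, 2, 12), 560),
]


def latest_month_best(rows):
    """Return {name: best score} using only rows from the latest month."""
    latest = max(day for _, day, _ in rows)
    best = {}
    for name, day, score in rows:
        if (day.year, day.month) == (latest.year, latest.month):
            # Equivalent of "group by A" with max(C).
            best[name] = max(score, best.get(name, score))
    return best


best_scores = latest_month_best(results)
for name, score in best_scores.items():
    print(name, score)
```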
Tournament ranking
Your second goal is too complex for a single spreadsheet formula, in my opinion. Here's a way to accomplish it with multiple formulas, though!
The X & Y axes are filled out by spreadsheet formulas. On the X axis (orange), we populate participants' names using this in cell A3:
=unique(Results!A2:A)
The Y axis consists of dates (green). These are the start dates of each unique month that there are scores for, calculated using the following formula in cell D2. This results in strings, e.g. 2016-01-1, and that format is specifically required for the later formulas to work.
=TRANSPOSE(SORT(UNIQUE(ARRAYFORMULA(TEXT(Results!B2:B13,"YYYY-MM-1")))))
Here's the formula for cell D3, which will calculate the sum of the 3 highest scores recorded for the user whose name appears in A3, for the month appearing in D2. (Copy & paste the formula across the full range of participants & months, and it will adjust.)
=sum(query(Results!$A$1:$C,"select C where A='"&$A3&"' and todate(B) >= date '"&D$2&"' and todate(B) < date '"&IF(ISBLANK(E$2),TEXT(TODAY()+1,"yyyy-mm-dd"),E$2)&"' order by C desc limit 3 label C ''"))
Key points about that formula:
The query range needs to use absolute references so that it doesn't shift when copied to additional cells. However, it's still open-ended, to absorb additional rows of scores on the "Results" sheet.
Results!$A$1:$C
A WHERE clause is used to select rows from the Results sheet that are for the given participant (A='"&$A3&"') and fall within the month that heads the column (D$2), i.e. on or after that month's start date and before the start date in the next column (E$2), or before tomorrow if that cell is blank.
...and todate(B) < date '"&IF(ISBLANK(E$2),TEXT(TODAY()+1,"yyyy-mm-dd"),E$2)&"'
The best 3 scores for the month are found by first sorting the above result descending, then limiting the result to 3 rows.
...order by C desc limit 3
Finally, the QUERY headers are suppressed by this little trick, so that we get a single number as the result:
...label C ''
Individual tournament totals appear in column C, with a range SUM across the row, e.g. for cell C3:
SUM(D3:3)
The corresponding ranking in column B is then:
RANK(C3,C$3:C)
Tidy
For simpler copy/paste, you can do some error checking in these formulas, so that they can be placed in the sheet before the corresponding data is - for example, at the start of your season. Using IF(ISBLANK(... or IFERROR(... can be very effective for this.
B3 & down:
=IFERROR(RANK(C3,C$3:C))
C3 & down:
=IF(ISBLANK(A3),"",sum(D3:3))
D3 & rest of field:
=IFERROR(sum(query(Results!$A$1:$C,"select C where A='"&$A3&"' and todate(B) >= date '"&D$2&"' and todate(B) < date '"&IF(ISBLANK(E$2),TEXT(TODAY()+1,"yyyy-mm-dd"),E$2)&"' order by C desc limit 3 label C ''")))
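To sanity-check the grid, here is a rough Python sketch of what the formulas compute as written: each participant/month cell sums up to three of the highest scores in that month, the cells are totalled per participant, and the totals are ranked with ties sharing a rank. Note this follows the formulas above rather than the "best single score per month" wording in the question. The function names `tournament_totals` and `rank_totals` are hypothetical.

```python
from collections import defaultdict
from datetime import date

# Sample data from the question: (name, date, score).
results = [
    ("Alan Smith", date(2016, 1, 14), 500),
    ("Bob Dow", date(2016, 1, 14), 450),
    ("Bob Dow", date(2016, 1, 16), 470),
    ("Clare Allie", date(2016, 1, 16), 550),
    ("Declan Ham", date(2016, 1, 16), 350),
    ("Alan Smith", date(2016, 2, 10), 490),
    ("Bob Dow", date(2016, 2, 10), 425),
    ("Declan Ham", date(2016, 2, 12), 400),
    ("Declan Ham", date(2016, 2, 12), 390),
    ("Clare Allie", date(2016, 2, 12), 560),
]


def tournament_totals(rows, top_n=3):
    """Sum the top_n scores per (name, month) cell, then total per name."""
    cells = defaultdict(list)
    for name, day, score in rows:
        cells[(name, day.year, day.month)].append(score)
    totals = defaultdict(int)
    for (name, _year, _month), scores in cells.items():
        totals[name] += sum(sorted(scores, reverse=True)[:top_n])
    return dict(totals)


def rank_totals(totals):
    """Rank 1 is the highest total; equal totals share the same rank."""
    ordered = sorted(totals.values(), reverse=True)
    return {name: ordered.index(total) + 1 for name, total in totals.items()}


totals = tournament_totals(results)
ranks = rank_totals(totals)
for name in sorted(ranks, key=ranks.get):
    print(ranks[name], name, totals[name])
```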
Alternatively, for the first part of your question (the latest 'best' results), in addition to the solution provided by Mogsdad, this should also work:
=ArrayFormula(iferror(vlookup(unique(A2:A), sort(A2:C, 2, 0, 3, 0), {1,3}, 0)))
EDIT: This formula sorts the table with dates (col B) descending and col C descending and then (ab)uses the fact that vlookup only returns the first match to return the first and last column.
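The sort-then-first-match trick can be sketched in Python as follows, using a subset of the question's rows (the function name `first_match_per_name` is hypothetical). One thing the sketch makes visible: this approach picks each person's own most recent date, so someone who did not compete in the latest month would still appear with an older result.

```python
from datetime import date

# A subset of the question's data: (name, date, score).
results = [
    ("Bob Dow", date(2016, 1, 16), 470),
    ("Alan Smith", date(2016, 2, 10), 490),
    ("Bob Dow", date(2016, 2, 10), 425),
    ("Declan Ham", date(2016, 2, 12), 400),
    ("Declan Ham", date(2016, 2, 12), 390),
    ("Clare Allie", date(2016, 2, 12), 560),
]


def first_match_per_name(rows):
    """Sort by date and score, both descending, and keep the first row per name."""
    ordered = sorted(rows, key=lambda row: (row[1], row[2]), reverse=True)
    picked = {}
    for name, day, score in ordered:
        # Like vlookup, only the first match for each name is used.
        picked.setdefault(name, score)
    return picked


latest_best = first_match_per_name(results)
for name, score in latest_best.items():
    print(name, score)
```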

Is there a multiply-and-add formula in Google's spreadsheet?

What I want is to easily multiply a number by another number for each column and add them up at the end in Google Sheets. For example:
User | Points 1 | Points 2 | Points 3 | Total
| 5 | 1 | 4 |
-----+----------+----------+----------+------
Jane | 2 | 3 | 0 | 13 (2*5 + 3*1 + 0*4)
John | 1 | 11 | 4 | 32 (1*5 + 11*1 + 4*4)
So it's easy enough to make this formula for the total:
= B3*$B$2 + C3*$C$2 + D3*$D$2
The problem is I frequently need to insert additional columns or even remove some columns. So then I have to mess with all the formulas. It's a pain... we have many spreadsheets with these formulas. I wish there was a formula like SUM(B3:D3) where I could just specify a range. Is there anything like MULTIPLY_AND_SUM(B2:D2, B3:D3) that would do this? Then I could insert columns in the middle and the range would still work.
There is a built in function in Google Sheets that does exactly what you are looking for: SUMPRODUCT.
In your example the formula would be:
=sumproduct(B$2:D$2,B3:D3)
See the Google Sheets documentation for SUMPRODUCT for more information about this function.
You can accomplish that without requiring a special-purpose function.
In E3, try this (and copy it to the rest of your rows):
=sum(arrayformula(B3:D3*B$2:D$2))
You can read about arrayformula in the Google Sheets documentation.
As long as you introduce new columns between B and D, this formula will automatically adjust. If you add new columns outside of that range, you'll need to edit (and cut & paste).
On its own, arrayformula(B3:D3*B$2:D$2) operates over each value in B3:D3 in turn, multiplying it by the corresponding value in B$2:D$2. (Note the use of absolute references to 'lock down' to row 2.) The result in this case is three values, [10,3,0], arranged horizontally in three columns because that matches the dimensions of the ranges.
The enveloping sum() function adds up the values of the array produced by arrayformula, which is 13 in this case.
As you copy that formula to other rows, the relative range references get updated for the new row.
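Both formulas compute the same thing: multiply the two ranges element by element, then add up the products. A minimal Python sketch with the numbers from the question (the function name `weighted_total` is hypothetical):

```python
# Multipliers from row 2 of the example sheet.
weights = [5, 1, 4]


def weighted_total(points, multipliers):
    """Multiply matching elements and add up the products."""
    if len(points) != len(multipliers):
        raise ValueError("both ranges must have the same length")
    return sum(p * m for p, m in zip(points, multipliers))


jane = weighted_total([2, 3, 0], weights)
john = weighted_total([1, 11, 4], weights)
print(jane, john)
```

For Jane the individual products are 10, 3 and 0, which add up to 13; for John they add up to 32, as in the example table.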

Apache Pig: Join records by shifting

I have records of type:
time | url
==========
34 google.com
42 cnn.com
54 yahoo.com
64 fb.com
I want to add another column to these records time_diff which basically takes the difference of the time of the current record with the previous record. Output should look like:
time | url | time_diff
======================
34 google.com -- <can drop this row>
42 cnn.com 08
54 yahoo.com 12
64 fb.com 10
If I can somehow add another column (same as time) shifting the time by one such that 42 is aligned with 34, 54 is aligned with 42 and so on, then I can take the difference between these columns to calculate time_diff column.
I can project the time column to a new variable T and if I can drop the first record in the original data, then I can join it with T to obtain the desired result.
I appreciate any help. Thanks!
You'll need to get your tuples in a bag (using GROUP ... ALL in your case), and then in a nested FOREACH, ORDER them and call a UDF to rank them. After you have this rank, you can FLATTEN the bag back out into a set of tuples again, and you'll have three fields: time, url, and rank. Once you have this, create a fourth column which is rank-1, do a self-join on those latter two columns, and you'll have what you need to compute the time_diff.
Since multiple records can have the same time, it would be a good idea to also sort on url so that you are guaranteed the same result every time.
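I think you can use the Over and Stitch UDFs from PiggyBank. Over supports both 'lead' (the next row's value) and 'lag' (the previous row's value); 'lag' is the one that lines each record up with the previous time. Something like the following might work:
-- adjust the path to wherever your piggybank.jar lives
REGISTER piggybank.jar;
DEFINE Over org.apache.pig.piggybank.evaluation.Over();
DEFINE Stitch org.apache.pig.piggybank.evaluation.Stitch();
A = LOAD 'T' AS (time:int, url:chararray);
B = GROUP A ALL;
C = FOREACH B {
    C1 = ORDER A BY time;
    GENERATE FLATTEN(Stitch(C1, Over(C1.time, 'lag')));
}
-- $2 is the third stitched field, i.e. the previous record's time
D = FOREACH C
    GENERATE stitched::time AS time,
             stitched::url AS url,
             stitched::time - $2 AS time_diff;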
https://pig.apache.org/docs/r0.12.0/api/org/apache/pig/piggybank/evaluation/Over.html
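The underlying idea in both answers is to sort the records, pair each one with its predecessor, and subtract. A minimal Python sketch of that shift-by-one pairing, using the records from the question (the function name `with_time_diff` is hypothetical):

```python
# Records from the question: (time, url).
records = [
    (34, "google.com"),
    (42, "cnn.com"),
    (54, "yahoo.com"),
    (64, "fb.com"),
]


def with_time_diff(rows):
    """Pair each record with the previous one and compute the time difference."""
    ordered = sorted(rows)  # by time, then url, so ties are deterministic
    # zip(ordered, ordered[1:]) aligns 34 with 42, 42 with 54, and so on;
    # the first record has no predecessor and is dropped.
    return [
        (time, url, time - previous_time)
        for (previous_time, _), (time, url) in zip(ordered, ordered[1:])
    ]


diffs = with_time_diff(records)
for row in diffs:
    print(row)
```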
