Apache Pig: Join records by shifting - join

I have records of type:
time | url
==========
34 google.com
42 cnn.com
54 yahoo.com
64 fb.com
I want to add another column to these records, time_diff, which is the difference between the time of the current record and the time of the previous record. The output should look like:
time | url | time_diff
======================
34 google.com -- <can drop this row>
42 cnn.com 08
54 yahoo.com 12
64 fb.com 10
If I can somehow add another column (same as time) shifting the time by one such that 42 is aligned with 34, 54 is aligned with 42 and so on, then I can take the difference between these columns to calculate time_diff column.
I can project the time column to a new variable T and if I can drop the first record in the original data, then I can join it with T to obtain the desired result.
I appreciate any help. Thanks!

See this question, for example. You'll need to get your tuples in a bag (using GROUP ... ALL in your case), and then in a nested FOREACH, ORDER them and call a UDF to rank them. After you have this rank, you can FLATTEN the bag back out into a set of tuples again, and you'll have three fields: time, url, and rank. Once you have this, create a fourth column which is rank-1, do a self-join on those latter two columns, and you'll have what you need to compute the time_diff.
Since multiple records can have the same time, it would be a good idea to also sort on url so that you are guaranteed the same result every time.
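To make the rank-and-self-join idea concrete outside of Pig, here is a minimal Python sketch. It assumes the four sample rows from the question; the function name `time_diffs` is hypothetical and only illustrates the steps (order, rank, join rank with rank - 1, subtract).

```python
# Sample rows from the question, deliberately unordered: (time, url)
records = [(54, "yahoo.com"), (34, "google.com"), (64, "fb.com"), (42, "cnn.com")]

def time_diffs(rows):
    """Rank rows by (time, url), then join each rank r with rank r - 1."""
    # Tuples sort by time first and url second, so ties on time are deterministic
    ordered = sorted(rows)
    by_rank = {rank: row for rank, row in enumerate(ordered)}
    result = []
    for rank, (time, url) in by_rank.items():
        prev = by_rank.get(rank - 1)   # the self-join on rank = rank - 1
        if prev is not None:           # the first row has no predecessor and is dropped
            result.append((time, url, time - prev[0]))
    return result

print(time_diffs(records))
```

In Pig the same join would be expressed with a rank column and a rank-1 column rather than a dictionary lookup.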

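I think you can use the Over UDF from PiggyBank with its 'lag' function (or 'lead', which looks at the next row instead of the previous one). Something like the following might work.
-- Assumes piggybank.jar has already been REGISTERed so that Over and Stitch resolve.
DEFINE Over org.apache.pig.piggybank.evaluation.Over('int');
DEFINE Stitch org.apache.pig.piggybank.evaluation.Stitch;
A = LOAD 'T' AS (time:int, url:chararray);
B = GROUP A ALL;
C = FOREACH B {
    C1 = ORDER A BY time;
    -- 'lag' pairs each row with the time of the previous row
    GENERATE FLATTEN(Stitch(C1, Over(C1.time, 'lag')));
}
-- Fields after Stitch: $0 = time, $1 = url, $2 = previous time (null for the first row)
D = FOREACH C
    GENERATE stitched::time AS time,
             stitched::url AS url,
             stitched::time - $2 AS time_diff;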
https://pig.apache.org/docs/r0.12.0/api/org/apache/pig/piggybank/evaluation/Over.html

Related

How to use a LAMBDA function with REDUCE to repeat a row N number of times in Google Sheets

Hello, I've tried using text manipulation to achieve the results, and while it works, I don't think it's an efficient way to do it, and there are limitations on how many times it can be done.
I was trying to figure out how to get it done with REDUCE, but I'm having a hard time working it out.
This is the current table:
Unique ID | Some other Info | How many times to repeat
=====================================================
123 | Some Info | 2
456 | Some Info | 3
The result would be
Unique ID
123
123
456
456
456
Thank you.
Here's one way to do this:
=ArrayFormula(REDUCE("Unique ID",SEQUENCE(COUNTA(A2:A)),LAMBDA(a,c,{a;IF(SEQUENCE(INDEX(C2:C,c)),INDEX(A2:A,c))})))
Explanation
The LAMBDA inside REDUCE works by taking 3 parameters: an accumulator (a), a current value (c) and the operation to perform using them.
The accumulator (a) is initialized to the first argument of REDUCE, which is "Unique ID" and every time the inner LAMBDA is executed, the accumulator updates with the result of that execution.
The current value (c) is a variable parameter and it takes on all the values provided in the second argument of REDUCE SEQUENCE(COUNTA(A2:A)) (1).
Let's assume (1) returns:
1
2
The main work happens here:
{a;IF(SEQUENCE(INDEX(C2:C,c)),INDEX(A2:A,c))} (2)
Before this piece of code is executed, a has a value of "Unique ID" and c has a value of 1.
When it executes for the first time, a and c are replaced with their initial values, so we get:
{"Unique ID";IF(SEQUENCE(INDEX(C2:C,1)),INDEX(A2:A,1))}
Now c becomes 2 and a becomes
{"Unique ID";IF(SEQUENCE(INDEX(C2:C,1)),INDEX(A2:A,1))}
So when (2) is executed for the second time, this is what we get:
{{"Unique ID";IF(SEQUENCE(INDEX(C2:C,1)),INDEX(A2:A,1))};
IF(SEQUENCE(INDEX(C2:C,2)),INDEX(A2:A,2))}
We have now gone through all the values of c so the formula stops executing and that's effectively what it returns.
The number of iterations REDUCE performs depends on the size of its second parameter.
Let's see another example. Assume (1) returns:
1
2
3
First time c=1, a="Unique ID":
{"Unique ID";IF(SEQUENCE(INDEX(C2:C,1)),INDEX(A2:A,1))}
Second time c=2, a=PREVIOUSLY_RETURNED_ARRAY:
{{"Unique ID";IF(SEQUENCE(INDEX(C2:C,1)),INDEX(A2:A,1))};
IF(SEQUENCE(INDEX(C2:C,2)),INDEX(A2:A,2))}
Third and last time c=3, a=PREVIOUSLY_RETURNED_ARRAY:
{{{"Unique ID";IF(SEQUENCE(INDEX(C2:C,1)),INDEX(A2:A,1))};
IF(SEQUENCE(INDEX(C2:C,2)),INDEX(A2:A,2))};
IF(SEQUENCE(INDEX(C2:C,3)),INDEX(A2:A,3))}
And that's the array REDUCE returns.
Do you see a pattern?
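The same accumulate-and-append pattern can be sketched in Python with `functools.reduce`. This is only an analogy, assuming the two sample rows from the question; `ids`, `repeats` and `step` are hypothetical names standing in for A2:A, C2:C and the inner LAMBDA.

```python
from functools import reduce

ids = [123, 456]     # stands in for column A (A2:A)
repeats = [2, 3]     # stands in for column C (C2:C)

def step(acc, c):
    # c is a 1-based row index, like the values produced by SEQUENCE(COUNTA(A2:A)).
    # Appending the ID repeats[c - 1] times mirrors {a; IF(SEQUENCE(n), id)}.
    return acc + [ids[c - 1]] * repeats[c - 1]

# The accumulator starts as the header, just like the first argument of REDUCE
result = reduce(step, range(1, len(ids) + 1), ["Unique ID"])
print(result)
```

Each call to `step` receives the list built so far and returns a longer one, which is what the nested braces in the expanded formulas represent.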
A different approach could be:
=QUERY(FLATTEN(INDEX(SPLIT(REPT(A2:A3&"|",C2:C3),"|"))),"where Col1 is not null")
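As a rough illustration of what the REPT/SPLIT/FLATTEN formula does, the Python sketch below follows the same three steps on the sample data. It keeps the IDs as strings, and the variable names are hypothetical.

```python
ids = ["123", "456"]   # stands in for A2:A3
repeats = [2, 3]       # stands in for C2:C3

# REPT(A & "|", C): repeat each ID, with a trailing delimiter, C times
repeated = [(i + "|") * n for i, n in zip(ids, repeats)]

# SPLIT + FLATTEN: break every string on the delimiter and merge into one list
flat = [part for s in repeated for part in s.split("|")]

# "where Col1 is not null": drop the empty pieces left by the trailing delimiter
expanded = [p for p in flat if p]
print(expanded)
```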

Google Sheets Query Drop Down Filtered Returned Values Not Resetting

Using the following code, I've been able to query a single column properly from my dataset where B2 and B3 are drop downs that I use to filter and D1 is just a header (I have multiple queries side by side that use the same filters)
=IFERROR(QUERY('Input Sheet'!$A:$E, "select A where B = '"&$B$2&"' and C = '"&D$1&"' and D ='"&$B$3&"'",0),"")
I've tried looking it up and when I see other people use the Query function, their search area will reset upon changing their drop down filters. What I mean is, if the first query returns 4 rows and the second returns 1, only my first row will update for the query.
Updated to include sample data and video. Sorry I can't post directly into thread.
Dataset (date is excluded since it isn't relevant to my query)
Dropdowns
Query Table and Filters
Video
So if my first query has 4 rows and my next query returns only 1 row, the last 3 rows will not update.
So for example, let's say my first query returns the following:
10
19
32
41
Changing a filter will return
23
19
32
41
But in reality, the query should only return 23; the rest of the values are left over from the previous query. None of the videos I've watched show this problem, so none of them address my issue.
If I change my filters to something that should return nothing (no data entries) I get the following:
"" (Empty Cell, Null etc)
19
32
41
My data source is formatted like the below
A B C D
1 2 3 4
w x y z
Any help would be appreciated. Thanks.

Google Sheet: formula to loop through a range

It's not hard to do this with a custom function, but I'm wondering if there is a way to do it using a formula, because data won't automatically update when using a custom function.
So I have a course list sheet, each with a price. And I'm using google form to let users choose what courses they will take. Users are allowed to take multiple courses, so how many they will take is unknown.
Now in the response sheet, I have data like
Order ID | User ID | Courses | Total
====================================
1001 | 38 | courseA, courseC | What formula to put here?
1002 | 44 | courseB, courseC, courseD | What formula to put here?
1003 | 55 | courseE | What formula to put here?
and the course sheet is like
course | Price
==============
A | 23
B | 33
C | 44
D | 23
E | 55
I want to output the total for each order and am looking at using FILTER to do this. Firstly I can get a range of unknown length for the chosen courses
=SPLIT(courses, ",") // having named the Courses column as "courses"
Now I need to filter this range against the course sheet, but I'm not quite sure how to do it or even if it is possible. Any hint is appreciated.
try:
=ARRAYFORMULA(IF(A2:A="",,MMULT(IFERROR(
VLOOKUP(SPLIT(C2:C, ", "), {F1&F2:F, G2:G}, 2, 0))*1,
ROW(INDIRECT("1:"&COLUMNS(SPLIT(C2:C, ", "))))^0)))
demo spreadsheet
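The logic behind the formula (split each order's course list, look up each price, sum per order) can be sketched in Python as follows. This is an illustration on the sample data only; `prices`, `orders` and `order_total` are hypothetical names, and unknown course names are treated as 0, which is an assumption about how the IFERROR(...)*1 part behaves.

```python
prices = {"A": 23, "B": 33, "C": 44, "D": 23, "E": 55}   # the course sheet
orders = {
    1001: "courseA, courseC",
    1002: "courseB, courseC, courseD",
    1003: "courseE",
}

def order_total(courses):
    # SPLIT(C2:C, ", "): break the cell into individual course names
    names = [c.strip() for c in courses.split(",")]
    # {F1&F2:F, G2:G}: prefix each key with the header "course" so it matches "courseA"
    lookup = {"course" + key: price for key, price in prices.items()}
    # VLOOKUP each name, count misses as 0, then sum across the row (the MMULT step)
    return sum(lookup.get(name, 0) for name in names)

totals = {order_id: order_total(c) for order_id, c in orders.items()}
print(totals)
```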
As I need time to digest @player0's answer, I am doing this in a more intuitive way.
I create 2 sheets to store intermediate values.
The first one is named "chosen_courses"
Order ID | User ID
==================
1001 | =IFERROR(ARRAYFORMULA(TRIM(SPLIT(index(courses,Row(),1),","))),"")
1002 | =IFERROR(ARRAYFORMULA(TRIM(SPLIT(index(courses,Row(),1),","))),"")
1003 | =IFERROR(ARRAYFORMULA(TRIM(SPLIT(index(courses,Row(),1),","))),"")
In this sheet every row is a horizontal list of the chosen courses, and I created another sheet
total | course price
====================
=IF(isblank(order_id),"",SUM(B2:2)) | =IFERROR(VLOOKUP('chosen_courses'!B2,{course_Names,course_price},2,false),"")
=IF(isblank(order_id),"",SUM(C2:2)) | =IFERROR(VLOOKUP('chosen_courses'!B2,{course_Names,course_price},2,false),"")
=IF(isblank(order_id),"",SUM(D2:2)) | =IFERROR(VLOOKUP('chosen_courses'!B2,{course_Names,course_price},2,false),"")
course_Names,order_id and course_price are named ranges.
This works well, at least for now.
But there is a problem:
I have 20 courses, so in the 2nd sheet there are 21 columns. I copied the formulas to 1000 rows because that is the maximum number of rows you can reach using ctrl+shift+↓ and ctrl+D. Now, sometimes when I open the sheet, there is a progress bar while the formulas in this sheet are calculated, which can take around 2 minutes, even though I have only about 5 test orders in the sheet. I am afraid this will get worse when I have more data or when the sheet is opened on older computers.
Is it because I use some resource consuming functions? Can it be improved?

Checking to which range a value belongs in Google Sheets

I have some data in the following way
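Category | [Range 1_min] | [Range 1_max] | [Range 2_min] | [Range 2_max] | ...
==============================================================================
A | 120 | 130 | | | ...
B | 100 | 119 | 131 | 140 | ...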
I want to be able to quickly query a number and have it return the category it belongs to, for example 135 belongs to B and 121 belongs to A.
I already have a script that does this, but since there are 1000+ categories, it takes a long time to run. Is there a faster way of doing this?
Thanks.
You can use LOOKUP:
=ArrayFormula(LOOKUP(2,1/((G2>=B2:B)*(G2<=C2:C)+(G2>=D2:D)*(G2<=E2:E)),A2:A))
Addition:
For more ranges you can add MMULT (not sure it's easier):
=ArrayFormula(LOOKUP(1,5/(MMULT(--(K2>={B2:B,D2:D,F2:F,H2:H}),ROW(A1:A4)^0)*MMULT(--(K2<={C2:C,E2:E,G2:G,I2:I}),ROW(A1:A4)^0)),A2:A))
Some conditions:
change the first argument of LOOKUP to 1
in the second LOOKUP argument, change the numerator to 5 (number of columns to compare + 1)
in the second MMULT argument, ROW(A1:A4), match the row count to the number of columns being compared (i.e. for 4 columns -> ROW(A1:A4), for 6 columns -> ROW(A1:A6), etc.)
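What the first LOOKUP formula checks can be sketched in Python: a value belongs to a category if it falls inside any of that category's [min, max] ranges. The data below is the two-row sample from the question; `ranges` and `find_category` are hypothetical names. The sketch keeps the last matching category, on the assumption that LOOKUP(2, 1/...) also resolves to the last match.

```python
# Each category maps to its list of (min, max) ranges, as in the sample table
ranges = {
    "A": [(120, 130)],
    "B": [(100, 119), (131, 140)],
}

def find_category(value):
    """Return the last category with a range containing value, or None."""
    match = None
    for category, spans in ranges.items():
        # (G2>=min)*(G2<=max) summed over the ranges: true if any range contains value
        if any(low <= value <= high for low, high in spans):
            match = category
    return match

print(find_category(135), find_category(121))
```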

Google Sheets Query Group By / First-N-Per-Group

I'm trying to find a simple solution for first-n-per-group.
I have a table of data: the first column holds dates and the rest hold data. I want to group by date, as multiple entries per date are allowed. The second column holds numbers, and I want the FIRST record for each date.
Currently the aggregate function I could possibly use is MIN() but that will return the lowest value and not the first.
A B
01/01/2018 10
01/01/2018 15
02/01/2018 10
02/01/2018 2
02/01/2018 100
02/01/2018 20
03/01/2018 5
03/01/2018 2
Desired output
A B
01/01/2018 10
02/01/2018 10
03/01/2018 5
Current results using MIN() - undesired
A B
01/01/2018 10
02/01/2018 2
03/01/2018 2
It's a shame there isn't a FIRST() aggregate function in Google Sheets, which would make this a lot easier.
I saw a couple of examples of using the Row Number and ArrayQuery, but that doesn't seem to work for me. There are about 5000 rows of data so trying to keep this as efficient as possible, and not have to recalculate the entire sheet on any change, each taking a few seconds.
Currently I have this, which appends a third column with the Row Number:
=query({A1:B, arrayformula(row(A1:B))}, "select min(Col1),min(Col2) group by Col1")
Thanks
EDIT 1
A suggested solution was =SORTN(A:B,2^99,2,1,1), which is a clean simple one. However, this requires a large range of "free space" to display the returned dataset. Imagine 3000+ rows.
I was hoping for a QUERY() -based solution, as I wanted to do further operations with the results. Specifically, count the occurrences of distinct values.
For example: I wanted a returned dataset of
A B
01/01/2018 10
02/01/2018 10
03/01/2018 5
Yet I want to count the occurrences of those values (and then ignoring the dates). For example:
B C
10 2
5 1
Perhaps I've confused the situation by using numbers? The "data" in ColB is TEXT (short 3-letter codes); I used numbers to show that I couldn't use the MIN() function, as that returns the numerically lowest value.
So in brief:
Go through all rows (3000+ rows) and group by the FIRST row of a particular date
return the FIRST value of that row
COUNT() all unique occurrences of those FIRST values, disregarding the date. Just a list with the unique values and their count (again, only the first one of any particular day)
=SORTN(A:B,2^99,2,1,1)
If your data is sorted as in the sample, You can easily remove duplicates with SORTN()
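For comparison, the two steps asked for in the question (keep the first value per date, then count how often each first value occurs) can be sketched in Python. This only illustrates the logic on the sample rows, not a Sheets solution, and the variable names are hypothetical.

```python
from collections import Counter

rows = [
    ("01/01/2018", 10), ("01/01/2018", 15),
    ("02/01/2018", 10), ("02/01/2018", 2), ("02/01/2018", 100), ("02/01/2018", 20),
    ("03/01/2018", 5), ("03/01/2018", 2),
]

# Keep only the first value seen for each date; later rows for that date are ignored
first_per_date = {}
for date, value in rows:
    first_per_date.setdefault(date, value)

# Count occurrences of those first values, disregarding the dates
counts = Counter(first_per_date.values())
print(first_per_date)
print(counts)
```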
