Filtering and Grouping tuples on several fields in Pig Latin - join

I am relatively new to using Pig for my work. I have a huge table (3.67 million entries) with fields -- id, feat1:value, feat2:value ... featN:value -- where id is text, feat_i is a feature name, and value is the value of feature i for that id.
The number of features may vary from tuple to tuple since it's a sparse representation.
For example, here are three rows of the data:
id1 f1:23 f3:45 f7:67
id2 f2:12 f3:23 f5:21
id3 f7:30 f16:8 f23:1
Now the task is to group queries (ids) that have features in common: I should be able to get the sets of queries with any overlapping feature.
I have tried several things. CROSS and JOIN cause an explosion in the data and the reducer gets stuck. I'm not familiar with putting conditions on the GROUP BY command.
Is there a way to write a condition in GROUP BY such that it selects only those queries that have common features?
For the above rows, the result would be:
id1, id2
id1, id3
Thanks

I can't think of an elegant way to do this in Pig. There is no way to GROUP BY based on a condition.
However, you could GROUP ALL your relation and pass it to a UDF that compares each record with every other record. Not very scalable, and a UDF is required, but it would do the job.
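For illustration, here is a minimal sketch in plain Python (not an actual Pig UDF) of the pairwise comparison such a UDF would perform on the single bag produced by GROUP ALL; the ids and features are the three example rows from the question:

# Sketch of the pairwise-overlap logic the UDF would implement.
from itertools import combinations

records = {
    "id1": {"f1", "f3", "f7"},
    "id2": {"f2", "f3", "f5"},
    "id3": {"f7", "f16", "f23"},
}

# Compare every record with every other record; O(n^2) in the number of ids,
# which is exactly why this does not scale to millions of rows.
for (id_a, feats_a), (id_b, feats_b) in combinations(records.items(), 2):
    if feats_a & feats_b:          # any shared feature?
        print(id_a, id_b)          # -> id1 id2, then id1 id3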

I would try not to parse the string.
If it's possible, read the data as two columns: the ID column and the features column.
Then I would cross join with a features table, which would essentially be a table looking like this:
f1
f2
f3
etc
Create this manually in Excel and load it onto your HDFS.
Then I would group by the features column and, for each feature, print all the IDs.
Essentially, something like this:
features = LOAD 'features.txt' USING PigStorage(',') AS (feature_number:chararray);
cross_data = CROSS features, data;
-- keep only the rows whose feature string actually contains this feature
-- (INDEXOF instead of MATCHES, since the pattern here is a field, not a constant)
filtered_data = FILTER cross_data BY INDEXOF(data_string_column, feature_number, 0) >= 0;
grouped = GROUP filtered_data BY feature_number;
Then you can print all the IDs for each feature.
The only problem would be reading the data using something other than PigStorage.
But this would reduce your cross join from 3.6M*3.6M to 3.6M*(number of features).
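To make the payoff concrete, here is the same inverted-index idea as a plain-Python sketch over the question's three example rows: grouping by feature gives you, per feature, the ids that carry it, and the overlapping id pairs fall out of each (usually short) group.

# Sketch of the feature -> ids index that the GROUP BY feature step produces,
# and how the overlapping id pairs are derived from it.
from collections import defaultdict
from itertools import combinations

rows = [
    ("id1", ["f1", "f3", "f7"]),
    ("id2", ["f2", "f3", "f5"]),
    ("id3", ["f7", "f16", "f23"]),
]

ids_by_feature = defaultdict(list)      # the GROUP BY feature step
for row_id, feats in rows:
    for f in feats:
        ids_by_feature[f].append(row_id)

pairs = set()
for f, ids in ids_by_feature.items():
    pairs.update(combinations(sorted(ids), 2))

print(sorted(pairs))                    # [('id1', 'id2'), ('id1', 'id3')]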

Related

How to design this Spark join

I need to join two big RDDs, potentially twice, and any help designing these joins is appreciated.
Here is the problem:
The first RDD is (productIdA, productIdB, similarity) and is about 100 GB.
The second RDD is (customerId, productId, boughtPrice) and is about 35 GB.
The result RDD I want is (productIdA, productIdB, similarity, customerIds of the customers who bought both product A and B).
Because both RDDs are quite big, I cannot broadcast either of them, so my design is to aggregate the second RDD by product id and then join with the first RDD twice. But I get huge shuffle spill and all kinds of errors (OOM, or out of disk space because of the shuffle). Errors aside, I would like to know whether there is a better way to achieve the same result. Thanks
Do you have a row for every product pairing in the first RDD?
If you do (or close to it), then you might want to group the second RDD by customerId, create an element for every product pairing, rearrange and group that RDD by pairing to collect a list of customerIds, and then join to add in the similarity (see the sketch below).
(Whether or not this will result in more or less math depends, I think, on the distribution of number of products purchased per customer.)
As zero323's comment also implies, once you have the pairings from grouping on customerId, it might be cheaper to recalculate the similarity than to join against a huge dataset.
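Here is a rough PySpark sketch of that rearrangement, with toy data and made-up variable names (sc is an existing SparkContext):

# 1. group purchases by customer, 2. emit one element per product pairing,
# 3. group by pairing to collect the customers, 4. join in the similarity.
from itertools import combinations

# (customerId, productId) pairs from the second RDD
purchases = sc.parallelize([
    ("c1", "p1"), ("c1", "p2"), ("c2", "p1"), ("c2", "p3"),
])

# similarity RDD keyed by the product pair
similarity = sc.parallelize([
    (("p1", "p2"), 0.9), (("p1", "p3"), 0.4),
])

pairings = (purchases
            .groupByKey()                                   # customer -> products
            .flatMap(lambda kv: [((a, b), kv[0])
                                 for a, b in combinations(sorted(kv[1]), 2)])
            .groupByKey()                                   # pair -> customers
            .mapValues(list))

result = similarity.join(pairings)
print(result.collect())
# -> [(('p1', 'p2'), (0.9, ['c1'])), (('p1', 'p3'), (0.4, ['c2']))]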

Google BigQuery: How to select a subset of a smaller table using a nested fake join?

I would like to solve the problem from How can I join two tables using intervals in Google Big Query? by selecting a subset of the smaller table.
I wanted to use @FelipeHoffa's solution based on the row_number function from Row number in BigQuery?
I have created nested query as follows:
SELECT a.DevID DeviceId,
       a.device_make OS
FROM
  (SELECT device_id DevID, device_make, A, lat, long, is_gps
   FROM [Data.PlacesMaster]
   WHERE NOT device_id IS NULL AND is_gps IS TRUE) a
JOIN
  (SELECT ROW_NUMBER() OVER() row_number,
          top_left_lat, top_left_long, bottom_right_lat, bottom_right_long, A, count
   FROM (SELECT top_left_lat, top_left_long, bottom_right_lat, bottom_right_long, A, COUNT(*) count
         FROM [Karol.fast_food_box]
         GROUP BY (....?)
         ORDER BY count DESC)
   WHERE row_number BETWEEN 1000 AND 2000) b
ON a.A = b.A
WHERE (a.lat BETWEEN b.bottom_right_lat AND b.top_left_lat)
  AND (a.long BETWEEN b.top_left_long AND b.bottom_right_long)
GROUP EACH BY DeviceId, OS
Could you help me finalise it, please? I cannot break the smaller table apart with GROUP BY; I need consistency between the two tables and to select only the items whose lat,long from the MASTER table fit into a given bounding box of the smaller table. I really need to match lat,long against the boxes. My solution from How can I join two tables using intervals in Google Big Query? works only for small tables (approx. 1000 to 2000 rows), hence this question. Thank you in advance.
It looks like you're applying two approaches at once: 1) split a table into chunks of rows and run the join on each chunk, and 2) include a field, "A", tagging your boxes and your points into 'regions' that you can equi-join on. Approach (1) just does the same total work in more pieces (and adds complication), so I would suggest focusing on approach (2), which cuts the work down to roughly quadratic in each 'region' rather than quadratic in the size of the whole world.
So the key thing is what values your A takes on, and how many points and boxes carry each A value. For example, if A is a country code, it has the right logical structure, but it probably doesn't help enough once you get a lot of data in any one country. If it goes down to the state or province code, that gets you one step further. Quantized lat/long grid cells generalize better. Sooner or later you do have to deal with boxes falling across a region edge, which can be somewhat tricky. I would use a lat/long grid myself.
What A values are you using? In your data, what is the A value with the maximum (number of points * number of boxes)?
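For illustration, a small hypothetical sketch (plain Python, names made up) of what a quantized lat/long grid tag for A could look like, including tagging a box that falls across cell edges:

# Both a point and a box get tagged with the grid cells they touch,
# so the equi-join on the cell tag replaces the global cross join.
import math

def grid_cell(lat, lng, cells_per_degree=10):
    """Tag a point with the 0.1-degree grid cell it falls in."""
    return (math.floor(lat * cells_per_degree), math.floor(lng * cells_per_degree))

def box_cells(top_lat, left_lng, bottom_lat, right_lng, cells_per_degree=10):
    """A box falling across cell edges gets tagged with every cell it overlaps."""
    lat_lo = math.floor(bottom_lat * cells_per_degree)
    lat_hi = math.floor(top_lat * cells_per_degree)
    lng_lo = math.floor(left_lng * cells_per_degree)
    lng_hi = math.floor(right_lng * cells_per_degree)
    return [(i, j)
            for i in range(lat_lo, lat_hi + 1)
            for j in range(lng_lo, lng_hi + 1)]

print(grid_cell(52.52, 13.40))                 # (525, 134)
print(box_cells(52.53, 13.39, 52.51, 13.41))   # [(525, 133), (525, 134)]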

How to select random subset of cases in SPSS based on student number?

I am setting some student assignments where most students will be using SPSS. In order to encourage students to do their own work, I want each student to have a partially unique dataset. Thus, I'd like each student to open the master data file and then run a couple of lines of syntax that produce a unique data file. In pseudo code, I'd like to do something like the following, where 12345551234 is a student number:
set random number generator = 12345551234
select 90% random subset of cases and drop the rest.
What is simple SPSS syntax for dropping a subset of cases from the data file?
After playing around I came up with this syntax, but perhaps there are simpler or otherwise better suggestions.
* Replace the number below with the student number or the first 10 digits of the student number.
SET SEED=1234567891.
FILTER OFF.
USE ALL.
SAMPLE .90.
EXECUTE.

Pig Script: Join with multiple files

I am reading a big file (more than a billion records) and joining it with three other files. I was wondering if there is any way to make the process more efficient and avoid multiple reads of the big table. The small tables may not fit in memory.
A = JOIN smalltable1 BY (f1, f2) RIGHT OUTER, massive BY (f1, f2);
B = JOIN smalltable2 BY (f3) RIGHT OUTER, A BY (f3);
C = JOIN smalltable3 BY (f4), B BY (f4);
The alternative I was thinking of is to write a UDF and replace the values in one read, but I am not sure a UDF would be efficient since the small files won't fit in memory. The implementation could look like:
A = LOAD 'massive';
B = FOREACH A GENERATE f1, udfToTranslateF1(f1), f2, udfToTranslateF2(f2), f3, udfToTranslateF3(f3);
Appreciate your thoughts...
Pig 0.10 introduced integration with Bloom filters: http://search-hadoop.com/c/Pig:/src/org/apache/pig/builtin/Bloom.java%7C%7C+%2522done+%2522exec+Tuple%2522
You can train a Bloom filter on the three smaller files and use it to filter the big file; hopefully this results in a much smaller file. After that, perform the standard joins to get 100% precision.
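To show the mechanics (a toy plain-Python sketch, not Pig's builtin Bloom; keys are made up): train on the small table's join keys, then cheaply discard most non-matching rows of the big table.

# Toy Bloom filter: no false negatives, some false positives, so the
# real join afterwards restores 100% precision.
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1 << 16, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0                      # an int used as a bit vector

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all(self.bits & (1 << pos) for pos in self._positions(key))

bloom = BloomFilter()
for k in ["k1", "k2", "k3"]:               # join keys of a small table
    bloom.add(k)

big_table = [("k1", "a"), ("k9", "b"), ("k3", "c")]
candidates = [row for row in big_table if bloom.might_contain(row[0])]
print(candidates)                           # typically [('k1', 'a'), ('k3', 'c')]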
UPDATE 1
You would actually need to train a separate Bloom filter for each of the small tables, as you join on different keys.
UPDATE 2
It was mentioned in the comments that the outer joins are used for augmenting data.
In this case Bloom filters might not be the best choice: they are good for filtering, not for adding data in outer joins, where you want to keep the non-matched records. A better approach would be to partition all the small tables on their respective fields (f1, f2, f3, f4) and store each partition in a separate file small enough to load into memory. Then GROUP the massive table BY f1, f2, f3, f4 and, in a FOREACH, pass each group (f1, f2, f3, f4) with its associated bag to a custom function written in Java that loads the respective partitions of the small files into RAM and performs the augmentation.
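A minimal sketch of that augmentation step in plain Python rather than a Java UDF (record layout and partition contents are made up):

# Each small table's partition for the current group key is loaded as a
# dict, and every record in the bag is augmented by lookup.
def augment_bag(bag, small1_part, small2_part, small3_part):
    """bag: records sharing (f1, f2, f3, f4); small*_part: key -> extra
    columns, each partition small enough to hold in RAM."""
    out = []
    for rec in bag:
        out.append({
            **rec,
            # outer-join semantics: keep the record even when a key is unmatched
            "t1": small1_part.get((rec["f1"], rec["f2"])),
            "t2": small2_part.get(rec["f3"]),
            "t3": small3_part.get(rec["f4"]),
        })
    return out

bag = [{"f1": 1, "f2": 2, "f3": "x", "f4": "y", "val": 42}]
print(augment_bag(bag, {(1, 2): "A"}, {"x": "B"}, {"z": "C"}))
# -> [{'f1': 1, 'f2': 2, 'f3': 'x', 'f4': 'y', 'val': 42,
#      't1': 'A', 't2': 'B', 't3': None}]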

How to sum calculated fields

I'd like to ask a question here that I think will be easy for some people.
OK, I have a query that returns records from two related tables (one to many).
In this query I have about 3 to 4 calculated fields that are based on fields from the 2 tables.
Now I want to add a GROUP BY clause for the names and a SUM to total the calculated fields, but it ends up with an error message saying:
"You tried to execute a query that is not part of an aggregate function"
So I decided to just run the query without the totals (i.e. no GROUP BY, SUM, etc.).
Then I created another query that totals my previous query (i.e. using a GROUP BY clause for the names and SUM for the calculated fields; no calculation here). This works (I have done it this way before), but I don't like having two queries just to get a summary total. Is there any other way of doing this in the design view, creating only one query?
I would very much appreciate it.
Thank you,
JM
Sounds like the query thinks the calculated fields need to be part of the grouping or something. You might need to look into sub-querying.
Can you post the SQL (before and after)? It would help in getting an understanding of what the issue is.
