I am reading a big file (more than a billion records) and joining it with three other files. I was wondering if there is any way the process can be made more efficient to avoid multiple reads of the big table. The small tables may not fit in memory.
A = JOIN smalltable1 BY (f1,f2) RIGHT OUTER, massive BY (f1,f2);
B = JOIN smalltable2 BY (f3) RIGHT OUTER, A BY (f3);
C = JOIN smalltable3 BY (f4), B BY (f4);
The alternative I was thinking of is to write a UDF and replace the values in one read, but I am not sure a UDF would be efficient, since the small files won't fit in memory. The implementation could look like:
A = LOAD 'massive' AS (f1, f2, f3, f4);
B = FOREACH A GENERATE f1, udfToTranslateF1(f1), f2, udfToTranslateF2(f2), f3, udfToTranslateF3(f3);
Appreciate your thoughts...
Pig 0.10 introduced integration with Bloom Filters http://search-hadoop.com/c/Pig:/src/org/apache/pig/builtin/Bloom.java%7C%7C+%2522done+%2522exec+Tuple%2522
You can train a Bloom filter on the 3 smaller files and use it to filter the big file; hopefully that will result in a much smaller file. After that, perform the standard joins to get 100% precision.
UPDATE 1
You would actually need to train a separate Bloom filter for each of the small tables, as you join on different keys.
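For example, a minimal sketch using Pig's built-in BuildBloom / Bloom UDFs, shown here for the f3 key of smalltable2 (the file names and filter parameters are illustrative assumptions and need tuning for your data):
-- Train a Bloom filter on smalltable2's join key.
-- BuildBloom arguments: hash function, expected number of keys, desired false-positive rate.
DEFINE bb BuildBloom('jenkins', '1000000', '0.01');
small2 = LOAD 'smalltable2' AS (f3, val);
grpd   = GROUP small2 ALL;
blm    = FOREACH grpd GENERATE bb(small2.f3);
STORE blm INTO 'smalltable2_bloom';
-- Run the training part first (e.g. in a separate script or before an exec),
-- then use the stored filter to shrink the massive file before the real joins.
DEFINE check Bloom('smalltable2_bloom');
massive = LOAD 'massive' AS (f1, f2, f3, f4);
reduced = FILTER massive BY check(f3);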
UPDATE 2
It was mentioned in the comments that the outer join is used for augmenting data.
In this case Bloom filters might not be the best fit: they are good for filtering, not for adding data in outer joins, since you want to keep the non-matched records. A better approach would be to partition all the small tables on the respective fields (f1, f2, f3, f4), storing each partition in a separate file small enough to load into memory. Then GROUP the massive table BY f1, f2, f3, f4 and, in a FOREACH, pass the group key (f1, f2, f3, f4) with its associated bag to a custom function written in Java that loads the respective partitions of the small files into RAM and performs the augmentation.
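A rough Pig sketch of that approach, where AugmentFromPartitions is a hypothetical Java UDF (name and path are assumptions) that loads the matching partitions of the small files into RAM and returns the augmented rows:
-- Hypothetical Java UDF: given the group key and the bag of massive-table rows,
-- it loads the matching small-table partitions and emits augmented records.
DEFINE AugmentFromPartitions com.example.AugmentFromPartitions('hdfs:///small_partitions');
massive   = LOAD 'massive' AS (f1, f2, f3, f4);
grouped   = GROUP massive BY (f1, f2, f3, f4);
augmented = FOREACH grouped GENERATE FLATTEN(AugmentFromPartitions(group, massive));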
Currently Hive does support non-equi joins. But as the cross product becomes pretty huge, I was wondering what the options are to tackle a join between a large fact table (257 billion rows, 37 TB) and a relatively small (8.7 GB) dimension table.
In the case of an equi join I can make it work easily with proper bucketing on the join column(s), using the same number of buckets for a sort-merge bucket map (SMBM) join, practically converting it to a map join. But this won't be of any advantage for a non-equi join, because the matching values will be in other buckets, practically triggering a shuffle, i.e. a reduce phase.
If anyone has any thoughts on how to overcome this, please suggest.
If the dimension table fits in memory, you can create a custom User Defined Function (UDF), as stated here, and perform the non-equi join in memory.
I'm trying to find an efficient way to transform a DataFrame into a bunch of persisted Series (columns) in Dask.
Consider a scenario where the data size is much larger than the sum of worker memory and most operations will be wrapped by read-from-disk / spill-to-disk. For algorithms which operate only on individual columns (or pairs of columns), reading in the entire DataFrame from disk for every column operation is inefficient. In such a case, it would be nice to locally switch from a (possibly persisted) DataFrame to persisted columns instead. Implemented naively:
persisted_columns = {}
for column in subset_of_columns_to_persist:
    persisted_columns[column] = df[column].persist()
This works, but it is very inefficient because df[column] will re-read the entire DataFrame N = len(subset_of_columns_to_persist) times from disk. Is it possible to extract and persist multiple columns individually based on a single read-from-disk deserialization operation?
Note: len(subset_of_columns_to_persist) is >> 1, i.e., simply projecting the DataFrame to df[subset_of_columns_to_persist] is not the solution I'm looking for, because it still has a significant I/O overhead over persisting individual columns.
You can persist many collections at the same time with the dask.persist function. This will share intermediates.
import dask

# dask.persist shares the underlying read-from-disk work across all columns.
columns = [df[column] for column in df.columns]
persisted_columns = dask.persist(*columns)
d = dict(zip(df.columns, persisted_columns))
I am relatively new to using Pig for my work. I have a huge table (3.67 million entries) with fields id, feat1:value, feat2:value ... featN:value, where id is text, feat_i is the feature name, and value is the value of feature i for the given id.
The number of features may vary for each tuple since it's a sparse representation.
For example, here are 3 rows of the data:
id1 f1:23 f3:45 f7:67
id2 f2:12 f3:23 f5:21
id3 f7:30 f16:8 f23:1
Now the task is to group queries that have common features. I should be able to get the sets of queries that have any overlapping feature.
I have tried several things. CROSS and JOIN create an explosion in the data and the reducer gets stuck. I'm not familiar with putting a condition on the GROUP BY command.
Is there a way to write a condition in GROUP BY such that it selects only those queries that have common features?
For the above rows the result will be:
id1, id2
id1, id3
Thanks
I can't think of an elegant way to do this in Pig. There is no way to group based on some condition.
However, you could GROUP ALL your relation and pass it to a UDF that compares each record with every other record. Not very scalable and a UDF is required, but it would do the job.
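A sketch of what that could look like, where PairsWithCommonFeature is a hypothetical Java UDF and the load schema is an assumption:
-- Hypothetical UDF: receives the whole bag and emits an (id_a, id_b) tuple
-- for every pair of records that share at least one feature.
DEFINE PairsWithCommonFeature com.example.PairsWithCommonFeature();
data       = LOAD 'data' AS (id:chararray, features:chararray);
everything = GROUP data ALL;    -- a single group holding every record (one reducer)
pairs      = FOREACH everything GENERATE FLATTEN(PairsWithCommonFeature(data));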
I would try not to parse the string.
If it's possible, read the data as two columns: the ID column and the features column.
Then I would cross join with a features table. It would essentially be a table looking like this:
f1
f2
f3
etc
Create this manually in Excel and load it onto your HDFS.
Then I would group by the features column and, for each feature, print all the IDs.
Essentially, something like this:
features = load 'features.txt' using PigStorage(',') as (feature_number:chararray);
cross_data = cross features, data;
-- 'matches' must match the whole string, so wrap the feature name in wildcards
filtered_data = filter cross_data by (data_string_column matches CONCAT(CONCAT('.*', feature_number), '.*'));
grouped = group filtered_data by feature_number;
Then you can print all the IDs for each feature.
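For instance, assuming the data relation also carries an id column, something like:
id_lists = foreach grouped generate group as feature, filtered_data.id;
dump id_lists;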
The only problem would be reading the data using something other than PigStorage.
But this would reduce your cross join from 3.6M*3.6M to 3.6M*(number of features).
I have this join.
A = JOIN smallTableBigEnoughForInMemory BY (F1,F2) RIGHT OUTER, massive BY (F1,F2);
B = JOIN anotherSmallTableBigforInMemory BY (F1,F3) RIGHT OUTER, massive BY (F1,F3);
Since both joins are using one common key, I was wondering if COGROUP can be used for joining data efficiently. Please note this is a RIGHT outer join.
I did think about cogrouping on F1, but the small tables have multiple combinations (200-300) per single key, so I have not used a join on the single key.
I think partitioning may help, but the data has skew and I am not sure how to use it in Pig.
You are looking for Pig's implementation of fragment-replicate joins. See the O'Reilly book Programming Pig for more details about the different join implementations. (See in particular Chapter 8, Making Pig Fly.)
In a fragment-replicate join, no reduce phase is required because each record of the large input is streamed through the mapper, matched up with any records of the small input (which is entirely in memory), and output. However, you must be careful not to do this kind of join with an input which will not fit into memory -- Pig will issue an error and the job will fail.
In Pig's implementation, the large input must be given first, so you will actually be doing a left outer join. Just tack on "using 'replicated'":
A = JOIN massive BY (F1,F2) LEFT OUTER, smallTableBigEnoughForInMemory BY (F1,F2) USING 'replicated';
The join for B will be similar.
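That is, presumably something along the lines of:
B = JOIN massive BY (F1,F3) LEFT OUTER, anotherSmallTableBigforInMemory BY (F1,F3) USING 'replicated';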
I am trying to implement a hash join in Hadoop.
However, Hadoop already seems to have a map-side join and a reduce-side join implemented.
What is the difference between these techniques and hash join?
Map-side Join
In a map-side (fragment-replicate) join, you hold one dataset in memory (in say a hash table) and join on the other dataset, record-by-record. In Pig, you'd write
edges_from_list = JOIN a_follows_b BY user_a_id, some_list BY user_id using 'replicated';
taking care that the smaller dataset is on the right. This is extremely efficient, as there is no network overhead and minimal CPU demand.
Reduce Join
In a reduce-side join, you group on the join key using hadoop's standard merge sort.
<user_id, {A, B, F, ..., Z}, {A, C, G, ..., Q}>
and emit a record for every pair of an element from the first set with an element from the second set:
[A user_id A]
[A user_id C]
...
[A user_id Q]
...
[Z user_id Q]
You should design your keys so that the dataset with the fewest records per key comes first -- you need to hold the first group in memory and stream the second one past it. In Pig, for a standard join you accomplish this by putting the largest dataset last. (As opposed to the fragment-replicate join, where the in-memory dataset is given last).
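Assuming a_follows_b is the larger of the two relations from the earlier example, that would read:
edges_from_list = JOIN some_list BY user_id, a_follows_b BY user_a_id;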
Note that for a map-side join the entirety of the smaller dataset must fit in memory. In a standard reduce-side join, only each key's groups must fit in memory (actually each key's group except the last one). It's possible to avoid even this restriction, but it requires care; look for example at the skewed join in Pig.
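In Pig that is just another join strategy (the relation names here are placeholders):
joined = JOIN massive BY user_id, other BY user_id USING 'skewed';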
Merge Join
Finally, if both datasets are stored in total-sorted order on the join key, you can do a merge join on the map side. Same as the reduce-side join, you do a merge sort to cogroup on the join key, and then project (flatten) back out on the pairs.
Because of this, when generating a frequently-read dataset it's often a good idea to do a total sort in the last pass. Zebra and other databases may also give you total-sorted input for (almost) free.
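Pig exposes this as the 'merge' join strategy; a sketch, assuming both inputs are already stored sorted on the join key:
-- Both inputs must already be sorted on the join key.
sorted_follows = LOAD 'a_follows_b_sorted' AS (user_a_id, user_b_id);
sorted_list    = LOAD 'some_list_sorted' AS (user_id, list_name);
joined         = JOIN sorted_follows BY user_a_id, sorted_list BY user_id USING 'merge';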
Both of these Hadoop joins are merge joins, which require (explicit) sorting beforehand.
A hash join, on the other hand, does not require sorting, but partitions the data by some hash function.
Detailed discussion can be found in section "Relational Joins" in Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer, a well-written book that is free and open source.