I found that Hive does not support non-equi joins. Is it just because it is difficult to convert a non-equi join to MapReduce?
Yes, the problem lies in the current MapReduce implementation.
How is a common equi-join implemented in MapReduce?
Input records are copied in chunks to the mappers; the mappers produce output as key-value pairs, which are collected and distributed among the reducers by a partitioning function in such a way that each reducer processes all values for a given key. In other words, the mapper creates a list of key-value pairs for each reducer, grouped by key. The reducers copy the mapper output and sort it to obtain <key, list of values>. The same is done for both datasets. Then the reducer applies a cross product to the two lists that share the same key; this is how the equi-join is implemented. The main idea is that tuples with the same join key are distributed to the same reducer instance and processed together. This is easy to implement because the key itself determines which reducer will process it (the computation is based on key equality), and each reducer instance receives its dedicated list of keys from both datasets; no other reducer works with the same keys.
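A minimal sketch of this reduce-side equi-join in Hadoop MapReduce might look as follows. It assumes tab-separated text inputs whose first field is the join key, and a "join.tag" configuration property (my own convention, not a Hadoop one; in practice you would more likely use MultipleInputs with two mapper classes) telling the mapper which dataset it is reading:

    // Reduce-side equi-join sketch: the mapper tags each record with its source
    // dataset; the reducer separates the two sides and emits their cross product.
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class EquiJoin {

      // Emits (joinKey, "A\t<record>") or (joinKey, "B\t<record>") depending on
      // which dataset the record came from.
      public static class TagMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
          String tag = ctx.getConfiguration().get("join.tag", "A"); // "A" or "B"
          String[] fields = line.toString().split("\t", 2);
          ctx.write(new Text(fields[0]), new Text(tag + "\t" + fields[1]));
        }
      }

      // All values for one join key arrive at the same reducer; split them by tag
      // and emit the cross product, which is exactly the equi-join result.
      public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
          List<String> aSide = new ArrayList<>();
          List<String> bSide = new ArrayList<>();
          for (Text v : values) {
            String[] parts = v.toString().split("\t", 2);
            (parts[0].equals("A") ? aSide : bSide).add(parts[1]);
          }
          for (String a : aSide) {
            for (String b : bSide) {
              ctx.write(key, new Text(a + "\t" + b));
            }
          }
        }
      }
    }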
Now consider a non-equi join: for example, we need to join datasets A and B on the condition A.key <= B.key. In this case the reducer should receive not only the tuples with equal keys from both datasets, but also, for each B.key, all A tuples with a key less than B.key. This is difficult to implement using the same key-equality paradigm.
If, for each B.key, the reducer is to receive the A-tuples with A.key <= B.key, this will cause huge duplication of data on the reducers. For example, if we have A keys (1, 2, 3) and B keys (1, 2, 3), then for B.3 we need [A.1, A.2, A.3] and for B.2 we need [A.1, A.2]. In other words, the mappers need to produce duplicates for each particular key, and the lists produced by the mappers for different keys overlap. The more distinct keys we have, the bigger the duplication will be.
Read this paper for a deep dive into the problems and possible solutions: Processing Theta-Joins using MapReduce
I would like to use Z3 to reason about a configuration problem using some data from a relational database containing physical properties of materials.
As suggested in this post, I could use an outer loop around the solver. But this works only for sorts with finite domains: I don't see how it would work on infinite domains.
I could represent the whole data tables by Z3 functions from primary keys to attributes, using the if-then-else construct, but the reasoning might use only a few rows in the table: it does not seem efficient.
Another approach would be to create a custom background theory solver that would determine the truth values of atoms by database lookup: has that been done before?
Do you see other ways to do it?
My objective is to join two tables, where the second table is flat and the first one has a nested structure. The join key is available inside the nested structure of the first table. In this case, how can I join these two tables using Dataflow Java code? WithKeys (org.apache.beam.sdk.transforms.WithKeys) accepts a direct column name and does not allow something like firstTable.columnname. Could someone help me solve this case?
If both tables are equally large, consider using the CoGroupByKey transform described here. You will have to transform your data into two PCollections keyed by the proper key before this operation.
If one table is significantly smaller than the other, feeding the smaller PCollection as a side input to a ParDo over the larger PCollection as described here might be a better option.
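To make the side-input option concrete, here is a minimal Java sketch. The record type MyNestedRecord, its accessors, and the field layout are hypothetical (not Beam API and not from the question); the point is simply that you extract the join key from the nested structure yourself inside the DoFn and look it up in a map-valued side input built from the smaller table:

    import java.util.Map;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.View;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.PCollectionView;

    // Inside your pipeline-construction code:
    // secondTable: PCollection<KV<String, String>> keyed by the join key (smaller table)
    // firstTable:  PCollection<MyNestedRecord> with the key buried in a nested field
    PCollectionView<Map<String, String>> smallTableView =
        secondTable.apply("SmallTableAsMap", View.asMap());

    PCollection<KV<String, String>> joined = firstTable.apply("JoinViaSideInput",
        ParDo.of(new DoFn<MyNestedRecord, KV<String, String>>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            MyNestedRecord record = c.element();
            // Pull the join key out of the nested structure yourself
            // (WithKeys.of also accepts a SerializableFunction if you prefer that route).
            String key = record.getNested().getKey();            // hypothetical accessors
            String match = c.sideInput(smallTableView).get(key); // lookup in the small table
            if (match != null) {
              c.output(KV.of(key, record.getValue() + "," + match));
            }
          }
        }).withSideInputs(smallTableView));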
Mnesia has four methods of reading from the database: read, match_object, select, and qlc (besides their dirty counterparts, of course). Each of them is more expressive than the previous ones.
Which of them use indices?
Given a query expressed with one of these methods, will the same query expressed with a more expressive method be less efficient in time/memory usage? By how much?
UPD.
As I GIVE CRAP ANSWERS mentioned, read is just a key-value lookup, but after a while of exploration I also found the functions index_read and index_write, which work in the same manner but use indices instead of the primary key.
One at a time, though from memory:
read always uses a key lookup on the keypos. It is basically a key-value lookup.
match_object and select will optimize the query on the keypos key if they can. That is, they only use that key for optimization; they never utilize further index types.
qlc has a query-compiler and will attempt to use additional indexes if possible, but it all depends on the query planner and if it triggers. erl -man qlc has the details and you can ask it to output its plan.
Mnesia tables are basically key-value maps from terms to terms. Usually, this means that if the key part is something the query can latch onto and use, then it is used. Otherwise, you will be looking at a full-table scan. This may be expensive, but do note that the scan is in-memory and thus usually fairly fast.
Also, take note of the table type: set is a hash-table and can't utilize a partial key match. ordered_set is a tree and can do a partial match:
Example - if we have a key {Id, Timestamp}, querying on {Id, '_'} as the key is reasonably fast on an ordered_set, because the lexicographic ordering means we can utilize the tree for a fast walk. This is the equivalent of specifying a composite INDEX/PRIMARY KEY in a traditional RDBMS.
If you can arrange data such that you can do simple queries without additional indexes, then that representation is preferred. Also note that additional indexes are implemented as bags, so if you have many matches for an index value, it is very inefficient. In other words, you should probably not index on a position in the tuples where there are few distinct values. It is better to index on things with many (mostly) distinct values, like an e-mail address in a user column, for instance.
How can I take the input set
{worker-id:1 name:john supervisor-id:3}
{worker-id:2 name:jane supervisor-id:3}
{worker-id:3 name:bob}
and produce the output set
{worker-id:1 name:john supervisor-name:bob}
{worker-id:2 name:jane supervisor-name:bob}
using a "pure" map-reduce framework, i.e. one with only a map phase and a reduce phase but without any extra feature such as CouchDB's lookup?
Exact details will depend on your map-reduce framework, but the idea is this. In your map phase, you emit two types of key/value pairs for each record: one keyed by the record's own worker-id, marking it as a potential boss, e.g. (1, {name:john type:boss}), and one keyed by its supervisor-id, e.g. (3, {worker-id:1 name:john type:worker}). In your reduce phase you get all of the values for a key grouped together. If there is a record of type boss in there, then you remove that record and populate the supervisor-name of the other records. If there isn't, then you drop those records on the floor.
Basically you use the fact that data gets grouped by key then processed together in the reduce to do the join.
(In some map-reduce implementations you incrementally get key/value pairs put together in the reduce. In those implementations you can't throw away records that don't have a boss already, so you wind up needing to map-reduce-reduce for that final filtering step.)
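Exact framework APIs aside, here is a framework-agnostic Java sketch of just the map and reduce logic described above (the record types, field names, and plain-list "emit" shape are my own, for illustration only):

    import java.util.ArrayList;
    import java.util.List;

    public class SupervisorJoin {

      record Worker(int workerId, String name, Integer supervisorId) {}
      record Tagged(String type, int workerId, String name) {}   // type is "boss" or "worker"
      record Emit(int key, Tagged value) {}

      // Map phase: every record is emitted once as a potential boss (keyed by its
      // own worker-id) and, if it has a supervisor, once as a worker (keyed by the
      // supervisor-id).
      static List<Emit> map(Worker w) {
        List<Emit> out = new ArrayList<>();
        out.add(new Emit(w.workerId(), new Tagged("boss", w.workerId(), w.name())));
        if (w.supervisorId() != null) {
          out.add(new Emit(w.supervisorId(), new Tagged("worker", w.workerId(), w.name())));
        }
        return out;
      }

      // Reduce phase: all values for one key arrive together; find the boss record
      // and attach its name to every worker record, or drop the group if no boss.
      static List<String> reduce(int key, List<Tagged> values) {
        String bossName = values.stream()
            .filter(v -> v.type().equals("boss"))
            .map(Tagged::name)
            .findFirst().orElse(null);
        List<String> out = new ArrayList<>();
        if (bossName == null) return out;            // no boss record: drop these on the floor
        for (Tagged v : values) {
          if (v.type().equals("worker")) {
            out.add("{worker-id:" + v.workerId() + " name:" + v.name()
                + " supervisor-name:" + bossName + "}");
          }
        }
        return out;
      }
    }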
Is there only one input file, or more?
I mean, is it possible to have a case where a file contains a worker-id whose supervisor-id is described (i.e., the name of that supervisor) in another file?
I am trying to implement a hash join in Hadoop.
However, Hadoop already seems to have a map-side join and a reduce-side join implemented.
What is the difference between these techniques and a hash join?
Map-side Join
In a map-side (fragment-replicate) join, you hold one dataset in memory (in, say, a hash table) and join against the other dataset, record by record. In Pig, you'd write
edges_from_list = JOIN a_follows_b BY user_a_id, some_list BY user_id using 'replicated';
taking care that the smaller dataset is on the right. This is extremely efficient, as there is no network overhead and minimal CPU demand.
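In raw Hadoop MapReduce the same idea looks roughly like the sketch below: the small dataset (shipped to every mapper, e.g. via the distributed cache) is loaded into a HashMap in setup() and probed for every record of the large dataset, so this is essentially a hash join performed on the map side. The tab-separated layout and the "join.small.file" property are assumptions of mine:

    // Map-side (fragment-replicate) join sketch: the small dataset is held in an
    // in-memory hash table and probed record by record; no reducer is needed.
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class ReplicatedJoinMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

      private final Map<String, String> smallTable = new HashMap<>();

      // Build side: read the small, replicated file into a hash table keyed by the join key.
      @Override
      protected void setup(Context ctx) throws IOException {
        String path = ctx.getConfiguration().get("join.small.file");
        try (BufferedReader r = new BufferedReader(new FileReader(path))) {
          String line;
          while ((line = r.readLine()) != null) {
            String[] f = line.split("\t", 2);
            smallTable.put(f[0], f[1]);
          }
        }
      }

      // Probe side: for each record of the large dataset, look up its join key and
      // emit the joined record on a hit.
      @Override
      protected void map(LongWritable offset, Text line, Context ctx)
          throws IOException, InterruptedException {
        String[] f = line.toString().split("\t", 2);
        String match = smallTable.get(f[0]);
        if (match != null) {
          ctx.write(new Text(f[0] + "\t" + f[1] + "\t" + match), NullWritable.get());
        }
      }
    }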
Reduce Join
In a reduce-side join, you group on the join key using Hadoop's standard merge sort:
<user_id, {A, B, F, ..., Z}, {A, C, G, ..., Q}>
and emit a record for every pair of an element from the first set with an element from the second set:
[A user_id A]
[A user_id C]
...
[A user_id Q]
...
[Z user_id Q]
You should design your keys so that the dataset with the fewest records per key comes first -- you need to hold the first group in memory and stream the second one past it. In Pig, for a standard join you accomplish this by putting the largest dataset last. (As opposed to the fragment-replicate join, where the in-memory dataset is given last).
Note that for a map-side join the entirety of the smaller dataset must fit in memory. In a standard reduce-side join, only the groups for each key must fit in memory (actually, each key's groups except the last one). It's possible to avoid even this restriction, but it requires care; look for example at the skewed join in Pig.
Merge Join
Finally, if both datasets are stored in total-sorted order on the join key, you can do a merge join on the map side. As in the reduce-side join, you do a merge sort to cogroup on the join key, and then project (flatten) back out over the pairs.
Because of this, when generating a frequently-read dataset it's often a good idea to do a total sort in the last pass. Zebra and other databases may also give you total-sorted input for (almost) free.
Both of these Hadoop joins are merge joins, which require (explicit) sorting beforehand.
A hash join, on the other hand, does not require sorting, but partitions data by some hash function.
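To make that contrast concrete, here is a minimal single-process Java sketch of a partitioned hash join (the names and layout are mine, not from the book): both inputs are split into partitions by a hash of the join key, and each partition pair is joined by building a hash table on one side and probing it with the other, with no sort step anywhere.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class PartitionedHashJoin {

      record Row(String key, String value) {}

      // Partition rows by a hash of the join key (the "shuffle" step of a hash join).
      static List<List<Row>> partition(List<Row> rows, int numPartitions) {
        List<List<Row>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) parts.add(new ArrayList<>());
        for (Row r : rows) {
          parts.get(Math.floorMod(r.key().hashCode(), numPartitions)).add(r);
        }
        return parts;
      }

      static List<String> join(List<Row> left, List<Row> right, int numPartitions) {
        List<List<Row>> leftParts = partition(left, numPartitions);
        List<List<Row>> rightParts = partition(right, numPartitions);
        List<String> out = new ArrayList<>();
        for (int p = 0; p < numPartitions; p++) {
          // Build: hash table over the (ideally smaller) left partition.
          Map<String, List<Row>> built = new HashMap<>();
          for (Row l : leftParts.get(p)) {
            built.computeIfAbsent(l.key(), k -> new ArrayList<>()).add(l);
          }
          // Probe: stream the right partition through the hash table.
          for (Row r : rightParts.get(p)) {
            for (Row l : built.getOrDefault(r.key(), List.of())) {
              out.add(l.key() + "\t" + l.value() + "\t" + r.value());
            }
          }
        }
        return out;
      }
    }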
Detailed discussion can be found in section "Relational Joins" in Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer, a well-written book that is free and open source.