Map-side join with Hadoop Streaming - join

I have a file in which each line is a record. I want all records with the same value in a certain field (call if field A) to go to the same mapper. I have heard this is called a Map-Side Join, and I also heard that it's easy if the records in the file are sorted by what I call field A.
If it would be easier, the data could be spread across multiple files, but each file sorted on field A.
Is this right? How do I do this in with streaming? I'm using Python. A assume it's just part of the command I use to start Hadoop?

What is the real justification for wanting only certain records to go to certain mappers? If what you want out of this is the final result to be 3 output files (one with all A, another with all B, last with all C), you can accomplish that with multiple reducers. Need to know what you really want to accomplish.

Related

Pentaho map values to field after join

I am doing a data source integration using Pentaho Data Integration where I need to join a table A with multiple Google Analytics data streams (Lets call them GA_A, GA_B, GA_C, ... GA_Z).All the GA streems have the same fields, but they come from different profiles. I am using a LEFT OUTER JOIN in each merge step to keep all the data from table A while adding the values of each GA data stream. The problem is that, when I make the joins, all the GA fields from each data stream are added to the result but renamed with an underscore. Here is an example:
GA_A , GA_B and GA_C all have the field "name" and are joined to the table A. In the last join result, I get the fields "name" , "name_1", and "name_2").
This obviously happens because of the nature of the LEFT OUTER JOIN. However, I want to "map" of "send" all the values from "name_1", "name_2", "name_3", etc to the field "name". How can I achieve this? I see that there's a "Value Mapper" step in PDI, but I don't want to use a step for each of the 10 fields I bring from GA (also, I'm not sure if that step does what I want to do)
Thanks!
As #Brian.D.Myers said there are multiple solutions available.
First, if all the GA streams are of the same structure there is no need to use join for all of them - you can first union all data (just directing them to a same step i.e. Dummy step) and after do the join - in that case you won't get multiple name_* fields.
However if there are still fields having the same name in table A and GA stream - they will obviously be renamed with underscores (it is essential as you pointed out). To handle this there ar few options:
If you need to just copy values - use the Set field value step - it copies a value from one field to another
If there is some complex processing logic - use the Javascript step
If streams are relatively small and you actually need to retain both fields - you may use the "Stream lookup" step instead of Merge join - it will allow you to specify names of the "merged" columns so no naming conflicts occurs.

How to do lookups (or joins) with map-reduce?

How can I use take the input set
{worker-id:1 name:john supervisor-id:3}
{worker-id:2 name:jane supervisor-id:3}
{worker-id:3 name:bob}
and produce the output set
{worker-id:1 name:john supervisor-name:bob}
{worker-id:2 name:jane supervisor-name:bob}
using a "pure" map-reduce framework, i.e. one with only a map phase and a reduce phase but without any extra feature such as CouchDB's lookup?
Exact details will depend on your map-reduce framework. But the idea is this. In your map phase, you emit two types of key/value pairs. (1, {name:john type:boss}) and (3, {worker-id:1 name:john type:worker}). In your reduce phase you get all of the values for the key grouped together. If there is a record of type boss in there, then you remove that record and populate the supervisor-name of the other records. If there isn't, then you drop those records on the floor.
Basically you use the fact that data gets grouped by key then processed together in the reduce to do the join.
(In some map-reduce implementations you incrementally get key/value pairs put together in the reduce. In those implementations you can't throw away records that don't have a boss already, so you wind up needing to map-reduce-reduce for that final filtering step.)
There is Only one input file or more??
I mean, is it possible a case which we have a file that one of its worker-id have a supervisor-id which its descriptions(name of that supervisor-id) be in another file??

Working with cyclical graphs in RoR

I haven't attempted to work with graphs in Rails before, and am curious as to the best approach. Some background:
I am making a Rails 3 site and thought it would be interesting to store certain objects and their relationships as a graph, where each object is a node and some are connected to show that the two objects are related. The graph does contain cycles, and there wouldn't be more than 100-150 nodes in the graph (probably only closer to 50). One node probably wouldn't have more than five edges, with an average of three to four edges per node.
I figured a simple join table with two columns (each the ID of the object) might be the easiest way to do it, but I doubt it's the best way. Another thought was to use a plugin such as acts_as_tree (which doesn't appear to be updated for Rails 3...) or acts_as_tree_with_dotted_ids, but I am unsure of their ability to work with cycles rather than hierarchical trees.
the most I would currently like is to easily traverse from one node to its siblings. I really can't think of a reason I would want to traverse to a node's sibling's sibling, which is why I was considering just making an SQL join table. I only want to have a section on the site to display objects related to a specified object, and this graph is one of the ways I am specifying relationships.
Advice? Things I should check out? Thanks!
I would use two SQL tables, node and link where a link is simply two foreign keys, source and target. This way you can get the set of inbound or outbound links to a node by performing an SQL select query by constraining the source or target node id. You could take it a step further by adding a "graph_id" column to both tables so you can retrieve all the data for a graph in two queries and build it as a post-processing step.
This strategy should be just as easy (if not easier) than finding, installing, learning to use, and implementing a plugin to do the same, IMHO.
Depending on whether your concern is primarily about operations on graphs, or on storage of graphs, what you need is potentially quite different. If you want convenient operations on graphs, investigate the gem "rgl" (ruby graph library). It has implementations of most of the basic classic traversal and search algorithms.
If you're dealing with something on the order of 150 nodes, you can probably get away with a minimalist adjacency list representation in the database itself, or incidence list. Then you can feed that into RGL for traversal and search operations.
If I remember correctly, RGL has enough abstraction that you may be able to work with an existing class structure and you simply provide methods to get adjacent nodes.
Assuming that it is a directed graph, use a mapping table such as
id | src | dest
where src and dest are FKs to your object table.
If your objects are not all of the same type, either have them all inherit a ruby class or have another table:
id | type | type_id
Where type is the type of object it is and type_id is its id in another table.
By doing this, you should be able to get an array of objects for each object that it points to using:
select dest
from maptable
where dest = self.id
If you need to know its inbound edges, you can preform the same type of query using src instead of dest.
From there, you should be able to easily write any graph algorithms that you want. If you need weights, you can modify the mapping table as such.
id | src | dest | weight

Using multiple key value stores

I am using Ruby on Rails and have a situation that I am wondering if is appropriate for using some sort of Key Value Store instead of MySQL. I have users that have_many lists and each list has_many words. Some lists have hundreds of words and I want users to be able to copy a list. This is a heavy MySQL task b/c it is going to have to create these hundreds of word objects at one time.
As an alternative, I am considering using some sort of key value store where the key would just be the word. A list of words could be stored in a text field in mysql. Each list could be a new key value db? It seems like it would be faster to copy a key value db this way rather than have to go through the database. It also seems like this might be faster in general. Thoughts?
The general way to solve this using a relational database would be to have a list table, a word table, and a table-words table relating the two. You are correct that there would be some overhead, but don't overestimate it; because table structure is defined, there is very little actual storage overhead for each record, and records can be inserted very quickly.
If you want very fast copies, you could allow lists to be copied-on-write. Meaning a single list could be referred to by multiple users, or multiple times by the same user. You only actually duplicate the list when the user tries to add, remove, or change an entry. Of course, this is premature optimization, start simple and only add complications like this if you find they are necessary.
You could use a key-value store as you suggest. I would avoid trying to build one on top of a MySQL text field in less you have a very good reason, it will make any sort of searching by key very slow, as it would require string searching. A key-value data store like CouchDB or Tokyo Cabinet could do this very well, but it would most likely take up more space (as each record has to have it's own structure defined and each word has to be recorded separately in each list). The only dimension of performance I would think would be better is if you need massively scalable reads and writes, but that's only relevant for the largest of systems.
I would use MySQL naively, and only make changes such as this if you need the performance and can prove that this method will actually be faster.

Mnesia table replication/sharing

Assume that we have N erlang nodes, running same application. I want
to share an mnesia table T1 with all N nodes, which I see no problem.
However, I want to share another mnesia table T2 with pairs of nodes.
I mean the contents of T2 will be identical and replicated to/with
only sharing pair. In another words, I want N/2 different contents for
T2 table. Is this possible with mnesia, not with renaming T2 for each
distinct pair of nodes?
It's possible to do this with mnesia's table fragmentation, if one makes use of the mnesia_frag_hash callback behaviour. This allows you to control the distribution of keys, and it would be possible to construct the keys such that the callback is able to determine which node pair (and thus, which fragment) should be used.
Whether or not this works in your particular case depends on your access patterns and data set. Chances are that it's a pretty convoluted approach, and that you'd be better served by simply using different table names instead.
One table is always one table, no matter how many nodes you share it with. If you want pairs of nodes sharing a table, you would have to create a unique table for each pair of nodes.
You can use the same settings (records etc) for all those tables though, so there shouldn't be so much more work to get it done.

Resources