Joining two files in the same directory using hadoop

I am a total Hadoop n00b. I am trying to solve the following as my first Hadoop project. I have a million+ sub-folders sitting in an Amazon S3 bucket. Each of these folders has two files. File1 has data as follows:
date,purchaseItem,purchaseAmount
01/01/2012,Car,12000
01/02/2012,Coffee,4
....................
File2 holds the customer information in the following format:
ClientId:Id1
ClientName:"SomeName"
ClientAge:"SomeAge"
This same pattern is repeated across all the folders in the bucket.
Before I write all this data into HDFS, I want to join File1 and File2 as follows:
Joined File:
ClientId,ClientName,ClientAge,date,purchaseItem,purchaseAmount
Id1,"SomeName","SomeAge",01/01/2012,Car,12000
Id1,"SomeName","SomeAge",01/02/2012,Coffee,4
I need to do this for each and every folder and then feed the joined dataset into HDFS. Can somebody point out how I could achieve something like this in Hadoop? A push in the right direction would be much appreciated.

What comes to mind quickly is an implementation in Cascading.
First, figure out a way to turn the rows of File2 into columns programmatically, so that as you iterate over the folders each File2 is transposed into a single row (ClientId, ClientName, ClientAge).
For just one subfolder:
Set up two Schemes: a TextDelimited scheme for File1 and a TextLine scheme for File2. Wrap each in a Tap, then wrap the Taps in a MultiSourceTap, which concatenates all of those files into one Pipe.
At this point you should have two separate MultiSourceTaps: one for all the File1s and one for all the File2s.
Keep in mind that I am skipping some details here. It may be best to set this up for one subfolder first, then iterate over the other million subfolders, write the output to some other location, and use hadoop fs -getmerge to combine the many small output files into one big one.
Keeping with the Cascading theme, you could then construct Pipes that add the subfolder name using new Insert(subfolder_name) inside an Each, so that both data sets carry a reference to the subfolder they came from. Then join them on that field using Cascading's CoGroup or a HiveQL join.
There may be a much easier implementation than this, but this is what comes to mind thinking quickly. :)
References: TextDelimited, TextLine, MultiSourceTap
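To make the "transpose File2, then prefix every File1 row" idea concrete, here is a minimal sketch in plain Python rather than Cascading. It handles a single folder whose two files have already been read into strings; the function names parse_client_file and join_folder are made up for illustration, and it assumes File2 always contains the ClientId, ClientName and ClientAge keys shown in the question.

```python
def parse_client_file(text):
    """Turn the 'Key:Value' lines of File2 into a dict, keeping values verbatim."""
    record = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first colon only, so values may contain colons.
        key, _, value = line.partition(":")
        record[key.strip()] = value.strip()
    return record


def join_folder(purchases_text, client_text):
    """Prefix the header and every purchase row of File1 with the client columns."""
    client = parse_client_file(client_text)
    prefix = ",".join(
        [client["ClientId"], client["ClientName"], client["ClientAge"]]
    )
    lines = [line for line in purchases_text.splitlines() if line.strip()]
    header = "ClientId,ClientName,ClientAge," + lines[0]
    rows = [prefix + "," + line for line in lines[1:]]
    return [header] + rows
```

The same per-folder function could be applied to each sub-folder in turn, with the results written out before loading into HDFS; how the folders are listed and read from S3 is left out here because it depends on the tooling you choose.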

Have a look at the CombineFileInputFormat.

Related

Neo4j - Creating nodes when an intermediate one does not exist

I'm using LOAD CSV for some tests, and I ran into a problem. How can I create an intermediate node with CREATE only if there isn't one already?
For example:
I have (p:Person)-[:HAS_BANCOMAT]->(bm:Bancomat)-[:FROM]->(b:Bank). In my model, one Person can only have one Bancomat, so if my CSV file contains people who actually have more than one Bancomat, I want to persist only the first occurrence.
I've ended up with this script:
LOAD CSV FROM 'file:///myfile.csv' as row
WITH row.name as name, row.bank as bank, row.id as bancomat_id
//OMITTING CREATING PERSONS p AND BANKS b PART
//JUST GOING ON WHAT IS NOT WORKING
WHERE size((p)-[:HAS_BANCOMAT]->(:Bancomat)-[:FROM]->(b)) = 0
CREATE (bm:Bancomat {bancomatId: row.bancomat_id})
CREATE (p)-[:HAS_BANCOMAT]->(bm)
CREATE (bm)-[:FROM]->(b)
The size part is not working. I've also tried using NOT on the (p)-[:HAS_BANCOMAT]->(:Bancomat)-[:FROM]->(b) path, but that doesn't work either.
MERGE is not the solution I'm looking for.
Any suggestions about what I'm doing wrong?
EDIT1: It's not working because this script still creates more than one linked Bancomat for any person who has more than one in the file.

Multiple Nodes per Line in Neo4j Batch Import Tool

Using Neo4j's Batch Import Tool, how can I create multiple nodes from a single row, and then attribute some properties to Node 1 and some to Node 2?
This is an example from 29.3:
movieId:ID,title,year:int,:LABEL
tt0133093,"The Matrix",1999,Movie
tt0234215,"The Matrix Reloaded",2003,Movie;Sequel
tt0242653,"The Matrix Revolutions",2003,Movie;Sequel
Is there a way to make it so title is "movieId.title" and year is its own ID? Then I can abstract that out to multiple nodes.

The import tool (in contrast to LOAD CSV) expects exactly one node per line, so you have to do some preprocessing to make the format fit your desired graph model.
Typical candidates for this are csvkit or the usual suspects from a Unix command line: sed, awk, ...
In your case I'd strip out the title into a separate file for creating the :Title nodes, and create another csv file for the relationships between movies and titles.
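As a rough illustration of that preprocessing step, the sketch below uses Python's csv module instead of sed or awk to split the movie file into three row lists: movie nodes, title nodes, and movie-to-title relationships. The Title label and the HAS_TITLE relationship type are names invented for this example, and it assumes movie IDs and titles never collide, since both end up in the same ID space.

```python
import csv
import io


def split_movies(movies_csv):
    """Split one movie CSV into movie nodes, title nodes and relationships."""
    reader = csv.DictReader(io.StringIO(movies_csv))
    movies = [["movieId:ID", "year:int", ":LABEL"]]
    titles = [["title:ID", ":LABEL"]]
    rels = [[":START_ID", ":END_ID", ":TYPE"]]
    seen_titles = set()
    for row in reader:
        movies.append([row["movieId:ID"], row["year:int"], row[":LABEL"]])
        title = row["title"]
        # Emit each title node once, even if several movies share it.
        if title not in seen_titles:
            seen_titles.add(title)
            titles.append([title, "Title"])  # "Title" label is hypothetical
        rels.append([row["movieId:ID"], title, "HAS_TITLE"])  # hypothetical type
    return movies, titles, rels


def to_csv(rows):
    """Serialize a list of rows as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

Each of the three results can be written to its own file with to_csv and passed to the import tool as a separate nodes or relationships input.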

You can re-use the same CSV file with two different header files: in each header, mark a different column as :ID and mark the columns you don't want for that node as :IGNORE.
As the header is independent from the data you can use that approach to pull in the same file several times for different nodes, relationships, etc.
It's also explained here: http://neo4j.com/developer/guide-import-csv/#_super_fast_batch_importer_for_huge_datasets

Neo4j data modeling for branching/merging graphs

We are working on a system where users can define their own nodes and connections, and can query them with arbitrary queries. A user can create a "branch" much like in SCM systems and later can merge back changes into the main graph.
Is it possible to create an efficient data model for that in Neo4j? What would be the best approach? Of course we don't want to duplicate all the graph data for every branch as we have several million nodes in the DB.
I have read Ian Robinson's excellent article on Time-Based Versioned Graphs and Tom Zeppenfeldt's alternative approach with Network versioning using relationnodes but unfortunately they are solving a different problem.
I would love to know what you guys think; any thoughts are appreciated.

I'm not sure what your experience level is. Any insight into that would be helpful.
My guess is that this system would rely heavily on tags (labels) on the nodes. Maybe come up with 5-20 very broad node types, each with a name and a few key properties. Then you could let users select from those base categories and create their own spin-offs by adding tags.
Say you had your basic categories of (:Thing{Name:"",Place:""}) and (:Object{Category:"",Count:4})
Your users would have a drop-down or something with "Thing" and "Object". They'd select "Thing" for instance, and type a new label (Say "Cool"), values for "Name" and "Place", and add any custom properties (IsAwesome:True).
So now you've got a new node (:Thing:Cool{Name:"Rock",Place:"Here",IsAwesome:True}), which allows you to query by the broad categories or by a user's custom categories. Hopefully this would keep each broad category to a proportional fraction of your overall node count.
Not sure if this is exactly what you're asking for. Good luck!

Hmm. While this isn't insane, think first about the type of system you're replacing: SQL. In SQL databases you wouldn't use branches, because it's data storage. If you're trying to get data from multiple sources into one DB, I'd suggest exporting them all to CSV files and using a MERGE statement in Cypher to bring them all into your DB at once.
This could work similarly to branching: when it is time to merge, each person runs a script on their own copy of the DB that takes all the nodes and edges in their copy and puts them into a CSV, i.e.
MATCH (n)-[e]-(n2)
RETURN n, e, n2
Then compare these CSVs as you pull them into your final DB to see what's already there from the other copies.
// The Other* columns are placeholders for the second node's properties
LOAD CSV WITH HEADERS FROM "file:///YourFile.CSV" AS file
MERGE (N:Node {Property1: file.Property1, Property2: file.Property2})
MERGE (N2:Node {Property1: file.OtherProperty1, Property2: file.OtherProperty2})
MERGE (N)-[E:Edge]-(N2)
This will work as long as you're using node types that you already know about and nobody is creating new data structures that you only find out about at merge time.

Use CSV to populate Neo4j

I am very new to Neo4j and am still learning this graph database. I need to load a CSV file into a Neo4j database. I have been trying for two days and couldn't find good information on reading a CSV file into Neo4j. Please suggest some sample code or blogs on reading a CSV file into Neo4j.
Example:
Suppose I have a CSV file like this; how can I read it into Neo4j?
id  name             language
1   Victor Richards  West Frisian
2   Virginia Shaw    Korean
3   Lois Simpson     Belarusian
4   Randy Bishop     Hiri Motu
5   Lori Mendoza     Tok Pisin

You may want to try https://github.com/sroycode/neo4j-import
This populates data directly from a pair of CSV files ( entries must be COMMA separated )
To build: (you need maven)
sh build.sh
The nodes file has a mandatory field id and any other fields you like
NODES.txt
id,name,language
1,Victor Richards,West Frisian
2,Virginia Shaw,Korean
3,Lois Simpson,Belarusian
The relationships file has 3 mandatory fields from,to,type. Assuming you have a field age ( long integer), and info, the relations file will look like
RELNS.txt
from,to,type,age#long,info
1,2,KNOWS,10,known each other from school
1,3,CLUBMATES,5,member of country club
Running:
sh run.sh graph.db NODES.txt RELNS.txt
will create graph.db in the current folder which you can copy to the neo4j data folder.
Note:
If you are using neo4j later than 1.6.* , please add this line in conf/neo4j.properties
allow_store_upgrade = true
Have fun.

Please take a look at https://github.com/jexp/batch-import
Can be used as starting point

There is nothing available to generically load CSV data into Neo4j because the source and destination data structures are different: CSV data is tabular whereas Neo4j holds graph data.
In order to achieve such an import, you will need to add a separate step to translate your tabular data into some form of graph (e.g. a tree) before it can be loaded into Neo4j. Taking the tree structure further as an example, the page below shows how XML data can be converted into Cypher which may then be directly executed against a Neo4j instance.
http://geoff.nigelsmall.net/xml2graph/
Please feel free to use this tool if it helps (bear in mind it can only deal with small files) but this will of course require you to convert your CSV to XML first.
Cheers
Nigel

There is probably no known CSV importer for Neo4j, so you must import it yourself.
I usually do it via Gremlin's g.loadGraphML() function:
http://docs.neo4j.org/chunked/snapshot/gremlin-plugin.html#rest-api-load-a-sample-graph
I parse my data with an external script into the XML syntax and then load that XML file. You can view the syntax here:
https://raw.github.com/tinkerpop/gremlin/master/data/graph-example-1.xml
Parsing a 100 MB file takes a few minutes.
In your case, what you need is a simple bipartite graph whose vertices are users and languages, with "speaks" edges between them. If you know some programming, create user nodes with the properties id and name, create unique language nodes with the property name, and create a relationship connecting each user to their language. Note that users can be duplicated whereas languages can't.
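The bipartite structure described above can be sketched in a few lines of Python before worrying about how to load it. This is only an in-memory illustration: build_bipartite is a made-up helper, and it assumes each input row has already been parsed into an (id, name, language) tuple. Users are keyed by id, so two users may share a name, while each language becomes exactly one node.

```python
def build_bipartite(rows):
    """Build user nodes, unique language nodes and 'speaks' edges.

    rows: iterable of (user_id, user_name, language) tuples.
    """
    users = {}
    languages = {}
    edges = []
    for user_id, name, language in rows:
        # One node per user id; names are allowed to repeat across ids.
        users.setdefault(user_id, {"id": user_id, "name": name})
        # One node per distinct language name.
        languages.setdefault(language, {"name": language})
        edges.append((user_id, "speaks", language))
    return list(users.values()), list(languages.values()), edges
```

The three returned lists map directly onto whatever output you choose, for example GraphML elements for g.loadGraphML() or separate node and relationship files.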

I believe your question is too generic. What does your CSV file contain? The logical meaning of the contents of a CSV file can vary a great deal. One example is two columns of IDs, where each row represents two entities connected to each other:
3921 584
831 9891
3841 92
...
In this case you could either write a BatchInserter code snippet, which would import it faster (see http://docs.neo4j.org/chunked/milestone/batchinsert.html),
or import using the regular GraphDatabaseService with transaction sizes of a couple of thousand inserts for performance. See how to set up and use the graph DB at http://docs.neo4j.org/chunked/milestone/tutorials-java-embedded.html.
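The "couple of thousand inserts per transaction" advice boils down to chunking the input. The sketch below shows only that chunking in Python; parse_pairs and batches are hypothetical helpers, the default batch size of 2000 is an arbitrary pick within the suggested range, and the actual Neo4j write is deliberately left out because it depends on which API or driver you use.

```python
from itertools import islice


def parse_pairs(lines):
    """Yield (from_id, to_id) integer pairs, skipping lines that are not two columns."""
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            yield int(parts[0]), int(parts[1])


def batches(rows, size=2000):
    """Yield lists of at most `size` rows; each list is meant for one transaction."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk
```

A loader would then loop over batches(parse_pairs(open_file)), open one transaction per chunk, create the relationships for the pairs in it, and commit before moving on to the next chunk.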

Map-side join with Hadoop Streaming

I have a file in which each line is a record. I want all records with the same value in a certain field (call it field A) to go to the same mapper. I have heard this is called a map-side join, and I have also heard that it's easy if the records in the file are sorted by what I call field A.
If it would be easier, the data could be spread across multiple files, with each file sorted on field A.
Is this right? How do I do this with streaming? I'm using Python. I assume it's just part of the command I use to start Hadoop?

What is the real justification for wanting only certain records to go to certain mappers? If what you want out of this is the final result to be 3 output files (one with all A, another with all B, last with all C), you can accomplish that with multiple reducers. Need to know what you really want to accomplish.
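If the real goal is simply to process all records that share a field A value together, one common way to get there with Hadoop Streaming is to group on the reduce side rather than the map side: the mapper emits field A as the key, Hadoop sorts by key, and the reducer sees each key's records as a contiguous run. The Python sketch below illustrates that reducer-based grouping, not a map-side join. It assumes comma-separated records with field A in the first column; map_lines and reduce_lines are illustrative names.

```python
from itertools import groupby

FIELD_A = 0  # assumed position of field A in a comma-separated record


def map_lines(lines, field_index=FIELD_A, sep=","):
    """Mapper: emit 'key<TAB>record' so Hadoop sorts and partitions on field A."""
    for line in lines:
        record = line.rstrip("\n")
        if not record:
            continue
        key = record.split(sep)[field_index]
        yield f"{key}\t{record}"


def reduce_lines(lines):
    """Reducer: input arrives sorted by key, so equal keys are adjacent."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield key, [record for _, record in group]
```

In a streaming job, the mapper script would print each string from map_lines(sys.stdin), and the reducer script would iterate over reduce_lines(sys.stdin) and print whatever it computes per key.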
