Importing data from Oracle to Neo4j using the Java API

Can you please share any links/sample source code for generating a graph in Neo4j from Oracle database table data?
My use case: Oracle schema table names become nodes and their columns become properties. I also need to generate the graph in a tree structure.

Make sure you commit the transaction after creating the nodes by calling tx.success() and then tx.finish() (close() in newer versions).
If you still don't see the nodes, please post your code and/or any exceptions.

Use JDBC to extract your Oracle data, then use the Java API to build the corresponding nodes:
GraphDatabaseService db; // initialized elsewhere, e.g. via GraphDatabaseFactory
try (Transaction tx = db.beginTx()) {
    // Labels is a user-defined enum implementing org.neo4j.graphdb.Label
    Node datanode = db.createNode(Labels.TABLENAME);
    datanode.setProperty("column name", "column value"); // repeat for each column
    tx.success(); // marks the transaction successful; it commits on close
}
Also remember to batch your transactions. I tend to use around 1500 creates per transaction and it works fine for me, but you might have to play with it a little bit.
Just page through the table: on Oracle 12c+ that is SELECT * FROM table OFFSET X*1000 ROWS FETCH NEXT 1000 ROWS ONLY (older Oracle versions need a ROWNUM subquery, since Oracle has no LIMIT/OFFSET), with X being the number of times you've run the query before. Keep those 1000 records in a collection so you can build your nodes from them, and repeat until you've handled every record in your database.
Not sure what you mean by "need to generate the graph in a tree structure". If you mean you'd like to convert foreign keys into relationships, remember to index the key and, instead of adding the FK as a property, create a relationship to the original node instead. You can find that node with an index lookup, or you could build your own little in-memory index with a HashMap. But since you're already storing 1000 SQL records in memory, plus you are building the transaction, you need to be a bit careful with your memory depending on your JVM settings.
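Putting those pieces together, here is a minimal sketch of the whole loop. It is not the original poster's code: the connection URL, table name, and the Labels enum are placeholder assumptions, and it targets an embedded Neo4j 2.x database.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Label;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

public class OracleImport {
    enum Labels implements Label { EMPLOYEES } // placeholder: one label per table

    public static void main(String[] args) throws Exception {
        GraphDatabaseService db =
                new GraphDatabaseFactory().newEmbeddedDatabase("data/graph.db");
        // requires the Oracle JDBC driver on the classpath; URL is a placeholder
        try (Connection con = DriverManager.getConnection(
                     "jdbc:oracle:thin:@//dbhost:1521/orcl", "user", "password");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT * FROM employees")) {
            ResultSetMetaData meta = rs.getMetaData();
            Transaction tx = db.beginTx();
            try {
                int count = 0;
                while (rs.next()) {
                    // one node per row, one property per column
                    Node node = db.createNode(Labels.EMPLOYEES);
                    for (int i = 1; i <= meta.getColumnCount(); i++) {
                        Object value = rs.getObject(i);
                        if (value != null) {
                            node.setProperty(meta.getColumnName(i), value.toString());
                        }
                    }
                    // commit every 1500 creates and start a fresh transaction
                    if (++count % 1500 == 0) {
                        tx.success();
                        tx.close();
                        tx = db.beginTx();
                    }
                }
                tx.success();
            } finally {
                tx.close();
            }
        }
        db.shutdown();
    }
}

Note that streaming the ResultSet this way avoids holding a 1000-row collection in memory, at the cost of keeping the JDBC cursor open for the whole import.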

You need to code this ETL process yourself. Follow the steps below:
1. Write your first Neo4j example by following this article (a minimal sketch follows this list).
2. Understand how to model with graphs.
3. There are multiple ways of talking to Neo4j from Java. Choose the one that suits your needs.
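For step 1, a first example might look like the sketch below (embedded 2.x API; the store path and names are placeholders, not taken from the article):

import org.neo4j.graphdb.DynamicLabel;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

public class HelloNeo4j {
    enum Rels implements RelationshipType { KNOWS }

    public static void main(String[] args) {
        GraphDatabaseService db =
                new GraphDatabaseFactory().newEmbeddedDatabase("data/hello.db");
        try (Transaction tx = db.beginTx()) {
            Node alice = db.createNode(DynamicLabel.label("Person"));
            alice.setProperty("name", "Alice");
            Node bob = db.createNode(DynamicLabel.label("Person"));
            bob.setProperty("name", "Bob");
            alice.createRelationshipTo(bob, Rels.KNOWS); // (Alice)-[:KNOWS]->(Bob)
            tx.success();
        }
        db.shutdown();
    }
}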

Related

Neo4j web client fails with large Cypher CREATE query (144,000 lines)

I'm new to Neo4j and currently attempting to migrate existing data into a Neo4j database. I have written a small program to convert current data (in a bespoke format) into a large CREATE Cypher query for initial population of the database. My first iteration has been to somewhat retain the structure of the existing object model, i.e. objects become nodes, the node type is the same as the object name in the current object model, and the members become properties (member name is property name). This is done for all fundamental types (and strings), and any member objects are decomposed in the same way as in the original object model.
This has been fine in terms of performance, and 13,000+ line CREATE Cypher queries have been generated which can be executed through the web frontend/client. However, I believe the model is not ideal for a graph database, since there can be many properties, and instead I would like to decompose these 'fundamental' nodes (with members which are fundamental types) into their own nodes, related to a more 'abstract' node which represents the higher-level object/class. This means each member is a node with a single (at first; it may grow) property, say { value:"42" }, or I could set the node type to the data type (i.e. integer). If my understanding is correct, this would also allow me to create relationships between the 'members' (since they are nodes and not properties), allowing greater freedom when expressing relationships between original members of different objects, rather than just relating the parent objects to each other.
The problem is that this now generates 144,000+ line Cypher queries (and this isn't a large dataset in comparison to others), which the Neo4j client seems to balk at. The syntax highlighting appears to work in the query input box of the client (i.e. it highlights correctly, which I assume implies it parsed the query and it is valid Cypher), but when I come to run the query, I get the usual browser-not-responding and then a stack overflow (no pun intended) error. What's more, the Neo4j client doesn't exit elegantly and always requires me to force-end the task, and the db sits at 2.5-3 GB of memory usage for what is effectively a small amount of data (144,000 lines, approx 2/3 of which are relationships, so at most ~48,000 nodes). Yet I read I should be able to deal with millions of nodes and relationships in milliseconds?
I have tried it with Firefox and Chrome. I am using the Neo4j community edition on Windows 10. The SDK would initially be used with C# and C++. This research is in its initial stages, so I haven't used the SDK yet.
Is this a valid approach, i.e. to initially populate the database via a CREATE query?
Also, is my approach of decomposing the data into fundamental types a good one, or are there issues which are likely to arise from it?
That is a very large Cypher query!!!
You would do much better to populate your database using LOAD CSV FROM... and supplying a CSV file containing the data you want to load.
For a detailed explanation, have a look at:
https://neo4j.com/developer/guide-import-csv/
(This page also discusses the batch loader for really large datasets.)
Since you are generating code for the Cypher query I wouldn't imagine you would have too much trouble generating a CSV file.
(As an indication of performance, I have been loading a 1 million record CSV today into Neo4j running on my laptop in under two minutes.)
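As an illustration of what the load might look like (not the poster's data; nodes.csv and its name/value columns are placeholders), run here through the embedded Java API:

import org.neo4j.graphdb.GraphDatabaseService;

// db is the embedded database service (Neo4j 2.2+)
static void importCsv(GraphDatabaseService db) {
    db.execute(
        "LOAD CSV WITH HEADERS FROM 'file:///nodes.csv' AS row " +
        "CREATE (n:Record {name: row.name, value: row.value})");
}

From the browser or shell you can also prepend USING PERIODIC COMMIT 1000 to the statement, which commits in batches instead of building the one giant transaction that made the 144,000-line CREATE fall over.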

How to determine the Max property on a Relationship in Neo4j 2.2.3

How do you quickly get the maximum (or minimum) value for a property across all instances of a relationship? You can assume the machine I'm running this on is well within the recommended specs for CPU and memory for a graph this size, and the heap size is set accordingly.
Facts:
Using Neo4j v2.2.3
Only have access to modify the graph via the Cypher query language, which I'm hitting via PHP or in the web interface; I would love to avoid any solution that requires Java coding.
I've got a relationship, call it likes, that has a single property id that is an integer.
There are about 100 million of these relationships, and growing.
Every day I grab new likes from a MySQL table to add to the graph in Neo4j.
The relationship property id is actually the primary key (auto incrementing integer) from the raw MySQL table.
I only want to add new likes, so before querying MySQL for the new entries I want to get the max id from the likes so I can use it in my SQL query as SELECT * FROM likes_table WHERE id > max_neo4j_like_property_id.
How can I accomplish getting the max id property from Neo4j in an optimal way? Please indicate the CREATE statement needed for any index, as well as the query you'd use to get the final result.
I've tried creating an index as follows:
CREATE INDEX ON :likes(id);
After the index is online I've tried:
MATCH ()-[r:likes]-() RETURN r.id ORDER BY r.id DESC LIMIT 1
as well as:
MATCH ()-[r:likes]->() RETURN MAX(r.id)
They work but take freaking forever, as the explain plans for both indicate that no indexes are being used.
UPDATE: Holy $?##$?!!!! It looks like the new schema indexes aren't functional for relationships, even though you can create them and show them with :schema. It also looks as if there's no way in Cypher to directly create legacy indexes, which look like they might solve this issue.
If you need to query relationship properties, it is generally a sign of a modeling issue.
The need for this query suggests that you would do better to extract these properties into a node, which you'll then be able to query faster.
I won't say it is 100% the case, but certainly 99% of the people seen so far with the same problem have had this modeling concern.
What is your model right now?
Also, you don't use labels at all in your query; likes have a context bound to the nodes.
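To make that concrete, here is a hedged sketch of one possible remodel (all names invented): promote each like to its own node, so the id lives on a labeled node rather than on a relationship.

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Result;

// Assumed remodel: (:User)-[:DID]->(:Like {id})-[:ON]->(:Item)
// instead of (:User)-[:likes {id}]->(:Item).
static long maxLikeId(GraphDatabaseService db) {
    // The lookup now scans only Like nodes instead of every relationship.
    Result result = db.execute("MATCH (l:Like) RETURN max(l.id) AS maxId");
    return (Long) result.next().get("maxId");
}

A schema index (CREATE INDEX ON :Like(id)) also becomes available for exact-match lookups, since 2.x schema indexes cover node properties. Note they don't serve aggregates like max() directly, so the gain for this particular query is the narrower scan.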

Metadata in neo4j graph database

I know that Neo4j stores data structured as graphs rather than in tables. In an RDBMS we have schemas for the tables, but in Neo4j there are no tables; only nodes, relationships, and properties are defined. So is there any concept of metadata in Neo4j? For example, is any information stored about nodes and relationships in the database? If yes, what does this metadata contain and how is it stored? Also, where can we find the metadata-related information in the graph database (its location on disk)?
Thanks,
Neo4j doesn't directly store metadata in the way that you're looking for. The NeoProfiler tool was written precisely for this purpose. You can run it on a Neo4j database, and it will pull out as much information on labels, indexes, constraints, properties, nodes, and relationships as it can. The way it works isn't too far off from the queries that @ulkas suggests in the other answer here; the output is just much better.
More broadly, in an RDBMS the schema information you pull out substantially constrains the database. The schema there is like a set of rules: you can't insert data unless it conforms to that schema. In Neo4j, because it's so flexible, even if there were a schema it would just be documentation of what's there; it would not be a set of constraints on what you can put in. At any time you can insert new data that has nothing to do with the present schema (except that you can't violate things like uniqueness constraints).
If you want to see an equivalent schema for your database in Neo4j, check out NeoProfiler, linked above. A few people out there have written about "metagraphs": representing a Neo4j schema as a graph itself, where, for example, a node refers to a label. Relationships from that "label node" then go out to other label nodes, specifying what sorts of relationships can exist between nodes. For example, nodes labeled "Employee" may frequently have "works_for" relationships to nodes labeled "Company".
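One way to approximate such a metagraph yourself is a plain Cypher aggregation over the live data, sketched here via the embedded Java API (it scans every relationship, so treat it as an offline job on big graphs):

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Result;

static void printMetagraph(GraphDatabaseService db) {
    // For every pair of label sets, list the relationship types seen between them.
    Result rows = db.execute(
        "MATCH (a)-[r]->(b) " +
        "RETURN labels(a) AS from, type(r) AS rel, labels(b) AS to, count(*) AS n");
    System.out.println(rows.resultAsString());
}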
No, direct metadata is not present. The most you can do is query all the structure types to get a small insight into what kind of graph is stored in the db:
START r=rel(*)
RETURN type(r), count(*)
START n=node(*)
RETURN labels(n), count(*)
The specific database files are stored in the folder data/graph.db, but besides some index and key files they are binary and not easy to read.
Meanwhile there is the official APOC library.
It includes procedures like apoc.meta.graph, apoc.meta.schema, and others.
The link above describes the installation; if you run into sandbox errors, check the answers to this question.
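Once APOC is installed (Neo4j 3.x+), the calls are ordinary Cypher procedure invocations; a sketch, with the exact output shape depending on your APOC version:

import org.neo4j.graphdb.GraphDatabaseService;

static void apocMeta(GraphDatabaseService db) {
    // Virtual graph of labels and relationship types, sampled from the data
    System.out.println(db.execute("CALL apoc.meta.graph()").resultAsString());
    // Map of labels to their properties and inferred property types
    System.out.println(db.execute("CALL apoc.meta.schema()").resultAsString());
}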

Neo4j data modeling for branching/merging graphs

We are working on a system where users can define their own nodes and connections, and can query them with arbitrary queries. A user can create a "branch" much like in SCM systems and later can merge back changes into the main graph.
Is it possible to create an efficient data model for that in Neo4j? What would be the best approach? Of course we don't want to duplicate all the graph data for every branch as we have several million nodes in the DB.
I have read Ian Robinson's excellent article on Time-Based Versioned Graphs and Tom Zeppenfeldt's alternative approach with Network versioning using relationnodes, but unfortunately they are solving a different problem.
I would love to know what you guys think; any thoughts appreciated.
I'm not sure what your experience level is. Any insight into that would be helpful.
It would be my guess that this system would rely heavily on tags on the nodes. Maybe come up with 5-20 node types that are very broad, including the names and a few key properties. Then you could allow the users to select from those base categories and create their own spin-offs by adding tags.
Say you had your basic categories of (:Thing{Name:"",Place:""}) and (:Object{Category:"",Count:4})
Your users would have a drop-down or something with "Thing" and "Object". They'd select "Thing", for instance, then type a new label (say "Cool"), values for "Name" and "Place", and add any custom properties (IsAwesome: true).
So now you've got a new node (:Thing:Cool{Name:"Rock",Place:"Here",IsAwesome:true}), which allows you to query by broad categories or by a user's created categories. Hopefully this would keep each broad category to a proportional fraction of your overall node count.
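A hedged sketch of what writing such a user-created node could look like against the embedded 2.x Java API (the labels and properties are just the examples above):

import org.neo4j.graphdb.DynamicLabel;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;

static void createUserNode(GraphDatabaseService db) {
    try (Transaction tx = db.beginTx()) {
        // base category "Thing" plus the user's own spin-off label "Cool"
        Node n = db.createNode(DynamicLabel.label("Thing"),
                               DynamicLabel.label("Cool"));
        n.setProperty("Name", "Rock");
        n.setProperty("Place", "Here");
        n.setProperty("IsAwesome", true); // user-added custom property
        tx.success();
    }
    // Query by the broad category or the user's own label:
    //   MATCH (n:Thing) RETURN n      MATCH (n:Cool) RETURN n
}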
Not sure if this is exactly what you're asking for. Good luck!
Hmm. While this isn't insane, think about the type of system you're replacing first: SQL. In SQL databases you wouldn't use branches, because it's data storage. If you're trying to get data from multiple sources into one DB, I'd suggest exporting them all to CSV files and using a MERGE statement in Cypher to bring them all into your DB at once.
This could manifest similarly to branching by having each person run a script on their own copy of the DB when you merge, one that takes all the nodes and edges in their copy and puts them all into a CSV, i.e.
MATCH (n)-[e]-(n2)
RETURN n, e, n2
Then compare these CSVs as you pull them into your final DB to see what's already there from the other copies:
LOAD CSV WITH HEADERS FROM "file:///YourFile.csv" AS file
// the two MERGEs match the two different endpoints, so they should
// read distinct columns from the CSV (placeholder names here)
MERGE (n:Node {Property1: file.Property1, Property2: file.Property2})
MERGE (n2:Node {Property1: file.Property3, Property2: file.Property4})
MERGE (n)-[e:Edge]-(n2)
This will work as long as you're using node types that you already know about and each person isn't creating new data structures that you don't know about until the merge.

Is a full list returned first and then filtered when using LINQ to SQL to filter data from a database, or just the filtered list?

This is probably a very simple question that I am working through in an MVC project. Here's an example of what I am talking about.
I have a dbml file linked to a database with a table called Users that has 500,000 rows. But I only want to find the Users who were entered on 5/7/2010. So let's say I do this in my UserRepository:
from u in db.GetUsers() where u.CreatedDate == "5/7/2010" select u
(doing this from memory so don't kill me if my syntax is a little off, it's the concept I am looking for)
Does this statement first return all 500,000 rows and then filter them, or does it only bring back the filtered list?
It filters in the database, since you're building your expression on top of an ITable, yielding an IQueryable&lt;T&gt; data source.
Linq to SQL translates your query into SQL before sending it to the database, so only the filtered list is returned.
When the query is executed it will create SQL to return the filtered set only.
One thing to be aware of is that if you do nothing with the results of that query, nothing will be queried at all.
The query will be deferred until you enumerate the result set.
These folks are right, and one recommendation I would have is to monitor the queries that LINQ to SQL creates. LINQ to SQL is a great tool, but it's not perfect. I've noticed a number of little inefficiencies by monitoring the queries it creates and tweaking it a bit where needed.
The DataContext has a "Log" property that you can work with to view the queries created. I created a simple HttpModule that outputs the DataContext's Log (formatted for sweetness) to my output window. That way I can see the SQL it used and adjust if need be. It's been worth its weight in gold.
Side note - I don't mean to be negative about the SQL that LinqToSql creates as it's very good and efficient almost every time. Another good side effect of monitoring the queries is you can show your friends that are die-hard ADO.NET - Stored Proc people how efficient LinqToSql really is.
