I am working on a Rails project, and the database is the OrientDB graph database. I need to transfer data from Postgres to the OrientDB graph. I have written scripts in Ruby to fetch data from Postgres and load it into the graph structure by creating the relevant nodes and edges.
However, this process is very slow and is taking months to load a million records. The graph is somewhat densely connected.
I wanted to use the built-in ETL configuration provided by OrientDB, but it seems relatively complex since I need to create multiple vertices from fields in the same table and then connect them. I referred to this documentation.
Can I write a custom ETL to load data into OrientDB at the same speed as the built-in ETL tool?
Also, are there any benchmarks for the speed of data loading into OrientDB?
If the ETL doesn't fit your needs, you can write a custom importer using Java or any other JVM language of your choice.
If you only need to import the db once, the best way is to use plocal access (embedded) and then move the resulting database under the server.
With this approach, you can achieve the best performances, because the network isn't involved.
The code should be something like this snippet:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import com.tinkerpop.blueprints.impls.orient.OrientGraph;
import com.tinkerpop.blueprints.impls.orient.OrientGraphFactory;
import com.tinkerpop.blueprints.impls.orient.OrientVertex;

// Open the target database in embedded (plocal) mode: no network round-trips.
OrientGraphFactory fc = new OrientGraphFactory("plocal:./databases/import", "admin", "admin");

// Read the source data from the RDBMS with plain JDBC.
Connection conn = DriverManager.getConnection("jdbc....");
Statement stmt = conn.createStatement();
ResultSet resultSet = stmt.executeQuery("SELECT * FROM table");

while (resultSet.next()) {
    // Each iteration runs in its own transaction; shutdown() commits and releases the graph.
    OrientGraph graph = fc.getTx();
    OrientVertex vertex1 = graph.addVertex("class:Class1", "name", resultSet.getString("name"));
    OrientVertex vertex2 = graph.addVertex("class:Class2", "city", resultSet.getString("city"));
    graph.addEdge(null, vertex1, vertex2, "class:edgeClass");
    graph.shutdown();
}

resultSet.close();
stmt.close();
conn.close();
fc.close();
It is more pseudocode than working code, but take it as a template for the operations needed to import a single query/table from the original RDBMS.
About performance, it is quite hard to give numbers; it depends on many factors: schema complexity, type of access (plocal, remote), lookups, and the connection speed of the data source.
A few more words about Teleporter: it imports the original database schema and data into OrientDB automatically. AFAIK you already have a working OrientDB schema, and Teleporter will certainly not create the same schema on OrientDB that you did.
I am using Spring Cloud Data Flow to create a custom stream to load data into Snowflake. I have written a custom sink to load data into Snowflake using Snowflake's JDBC driver. The methodology that I used is similar to any database update using the following steps:
Create a connection pool (using HikariCP) to obtain a Snowflake database connection.
Using a prepared statement, build a batch of rows to commit all at once.
Using a scheduled timer, commit the batch to Snowflake.
This is when I noticed that the batch is being written very slowly to Snowflake, i.e. one or two records at a time, and a batch of 8K rows took well over 45 minutes to land in the Snowflake table (using an XS warehouse).
My question: Is there a better/another/recommended method to stream data into Snowflake? I am aware of the Kafka connector for Snowflake and Snowpipe (which use an internal/external stage), but these are not options we would like to pursue.
PreparedStatement preparedStatement = null;
Connection conn = null;
String compiledQuery = "INSERT INTO " + env.getProperty("snowtable") + " SELECT parse_json (column1) FROM VALUES (?)";
conn = DataSource.getConnection();
preparedStatement = conn.prepareStatement(compiledQuery);

// One batch entry per message: each entry is still a single-row insert on the Snowflake side.
for (int i = 0; i < messageslocal.size(); i++) {
    preparedStatement.setString(1, messageslocal.get(i));
    preparedStatement.addBatch();
}
preparedStatement.executeBatch();
Thank you!
Generally speaking, Snowflake, like many column-store or hybrid-store databases, does not perform well for inserts of single rows or small numbers of rows. So the poor performance you are experiencing does not look strange to me, especially on an XS warehouse.
Without knowing the context of your task, I would suggest writing to a JSON, PARQUET or CSV file (stored on S3 if you're in AWS) instead of writing directly to Snowflake through JDBC. You can make that JSON/PARQUET/CSV file available through a Stage in Snowflake.
Then you can either write a process that copies the Stage data to a table, or put a materialized view on top of the Stage. The materialized view will more or less do the equivalent of triggering the extract of the JSON/PARQUET/CSV data into a Snowflake table, but it operates asynchronously without impacting your application's performance.
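To make the stage-and-copy variant concrete, here is a minimal, untested sketch using the same Snowflake JDBC connection the question already has (and the usual java.sql imports). The stage name, S3 URL, file format and target table (assumed to hold a single VARIANT column) are hypothetical placeholders, and the credentials/storage-integration setup for the external stage is omitted.
try (Statement stmt = conn.createStatement()) {
    // One-time setup: an external stage pointing at the S3 prefix your application writes files to.
    stmt.execute("CREATE STAGE IF NOT EXISTS my_json_stage "
            + "URL = 's3://my-bucket/landing/' "
            + "FILE_FORMAT = (TYPE = JSON)");

    // Run periodically (or after each file upload): bulk-load whatever files are new.
    // COPY INTO keeps load history, so files that were already loaded are skipped.
    stmt.execute("COPY INTO my_variant_table FROM @my_json_stage");
}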
In addition to the great answer by @JeromeE, you should also try using a multi-row insert. What you have in your code is a JDBC batch made of individual single-row inserts.
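For illustration, here is a minimal, untested sketch of that multi-row variant, reusing the conn, env and messageslocal variables from the question: one INSERT ... FROM VALUES statement with one (?) group per row, instead of a batch of single-row statements.
// Build "INSERT INTO <table> SELECT parse_json(column1) FROM VALUES (?), (?), ..., (?)"
StringBuilder sql = new StringBuilder(
        "INSERT INTO " + env.getProperty("snowtable")
        + " SELECT parse_json(column1) FROM VALUES ");
for (int i = 0; i < messageslocal.size(); i++) {
    sql.append(i == 0 ? "(?)" : ", (?)");
}

try (PreparedStatement ps = conn.prepareStatement(sql.toString())) {
    for (int i = 0; i < messageslocal.size(); i++) {
        ps.setString(i + 1, messageslocal.get(i)); // JDBC parameters are 1-based
    }
    ps.executeUpdate(); // one round trip, one statement, many rows
}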
How can we do a data join on a common column in GraphQL?
For example, in SQL: SELECT t.name, z.address FROM t, z WHERE t.id = z.id;
How is this managed by a GraphQL query?
The short answer: It's not (managed). You have to write it in your resolvers.
Your GraphQL schema is basically just a declaration of types that represent nodes, and fields that represent edges in a graph. GraphQL is the language to query this graph. At each edge in your schema, you write a function for how to get the data when the query traverses that edge. This function is called a "resolver", and you can put pretty much any code in your resolvers as long as it returns valid data. This means GraphQL is completely, absolutely database-independent. Your resolvers are responsible for talking to your database(s).
So when it comes to SQL, and joins, the concern of making the queries is entirely the responsibility of the API developer. GraphQL knows nothing about SQL or joins, so this is essentially custom logic for your application which you must write. It turns out that building SQL queries to resolve data for GraphQL queries is very difficult without running into problems like eagerly fetching too much data or the "N+1" problem.
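To make that concrete, here is a minimal, untested sketch in Java using graphql-java and plain JDBC (the tools recommended below are for Node.js, but the idea is the same): the join from the question lives entirely inside the resolver, and GraphQL itself never sees the SQL. The schema, the t and z tables, and the Connection are assumptions for illustration.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.Map;

import graphql.GraphQL;
import graphql.schema.GraphQLSchema;
import graphql.schema.idl.RuntimeWiring;
import graphql.schema.idl.SchemaGenerator;
import graphql.schema.idl.SchemaParser;
import graphql.schema.idl.TypeDefinitionRegistry;

class PersonApi {
    static GraphQL build(Connection conn) {
        // The schema only declares the shape of the graph; it says nothing about storage.
        String sdl = "type Query { person(id: ID!): Person } "
                   + "type Person { name: String address: String }";
        TypeDefinitionRegistry registry = new SchemaParser().parse(sdl);

        RuntimeWiring wiring = RuntimeWiring.newRuntimeWiring()
            .type("Query", builder -> builder.dataFetcher("person", env -> {
                // The resolver is where the join happens: one SQL query over two tables.
                String id = env.getArgument("id");
                try (PreparedStatement ps = conn.prepareStatement(
                        "SELECT t.name, z.address FROM t JOIN z ON t.id = z.id WHERE t.id = ?")) {
                    ps.setString(1, id);
                    try (ResultSet rs = ps.executeQuery()) {
                        if (!rs.next()) {
                            return null;
                        }
                        return Map.of("name", rs.getString("name"),
                                      "address", rs.getString("address"));
                    }
                }
            }))
            .build();

        GraphQLSchema schema = new SchemaGenerator().makeExecutableSchema(registry, wiring);
        return GraphQL.newGraphQL(schema).build();
    }
}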
Some tools in the open-source community have emerged to help solve this problem. Here are a couple I would recommend in the Node.js space:
DataLoader - A database-agnostic tool that batches queries and caches individual records.
Join Monster - A SQL-tailored tool for efficient data-retrieval and query batching. It examines each query and your schema and generates SQL queries dynamically.
Disclaimer: I'm a co-creator of Join Monster.
Can you please share any links or sample source code for generating a graph in Neo4j from Oracle database table data?
My use case: Oracle schema table names become nodes and their columns become properties. I also need to generate the graph in a tree structure.
Make sure you commit the transaction after creating the nodes, with tx.success() and tx.finish().
If you still don't see the nodes, please post your code and/or any exceptions.
Use JDBC to extract your Oracle data. Then use the Java API to build the corresponding nodes:
GraphDatabaseService db;

try (Transaction tx = db.beginTx()) {
    Node datanode = db.createNode(Labels.TABLENAME);
    datanode.setProperty("column name", "column value"); // do this for each column
    tx.success();
}
Also remember to batch your transactions. I tend to use around 1500 creates per transaction and it works fine for me, but you might have to play with it a little bit.
Just do a SELECT * FROM table LIMIT 1000 OFFSET X*1000, with X being the number of times you've run the query before. Then keep those 1000 records stored in a collection so you can build your nodes from them. Repeat this until you've handled every record in your database.
Not sure what you mean by "I also need to generate the graph in a tree structure". If you mean you'd like to convert foreign keys into relationships, remember to index the key and, rather than adding the FK as a property, create a relationship to the original node instead; you can find it by doing an index lookup. Or you could just create your own little in-memory index with a HashMap. But since you're already storing 1000 SQL records in memory, and you are also building up the transaction, you need to be a bit careful with your memory, depending on your JVM settings.
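Putting those pieces together, here is a minimal, untested sketch of the paging-and-batching loop against the Neo4j 2.x embedded API, assuming an already-opened GraphDatabaseService, a JDBC Connection to Oracle, and a previously built map from primary keys to the nodes of a parent table. The orders/customers tables, the customer_id foreign key, and the Oracle 12c OFFSET/FETCH paging syntax are placeholders for your own schema.
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Map;

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Label;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.Transaction;

public class OrderImporter {
    enum Labels implements Label { ORDERS }
    enum RelTypes implements RelationshipType { BELONGS_TO }

    static final int PAGE_SIZE = 1000; // rows fetched per SQL page
    static final int TX_SIZE = 1500;   // creates per Neo4j transaction, tune as needed

    public static void run(GraphDatabaseService db, Connection conn,
                           Map<Long, Node> customerNodesById) throws SQLException {
        int offset = 0;
        boolean more = true;
        while (more) {
            more = false;
            try (Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                         "SELECT id, amount, customer_id FROM orders ORDER BY id "
                       + "OFFSET " + offset + " ROWS FETCH NEXT " + PAGE_SIZE + " ROWS ONLY")) {
                Transaction tx = db.beginTx();
                int created = 0;
                try {
                    while (rs.next()) {
                        more = true;
                        Node order = db.createNode(Labels.ORDERS);
                        order.setProperty("id", rs.getLong("id"));
                        order.setProperty("amount", rs.getDouble("amount"));
                        // Turn the foreign key into a relationship instead of a property.
                        Node customer = customerNodesById.get(rs.getLong("customer_id"));
                        if (customer != null) {
                            order.createRelationshipTo(customer, RelTypes.BELONGS_TO);
                        }
                        if (++created % TX_SIZE == 0) { // commit periodically to bound memory
                            tx.success();
                            tx.close();
                            tx = db.beginTx();
                        }
                    }
                    tx.success();
                } finally {
                    tx.close();
                }
            }
            offset += PAGE_SIZE;
        }
    }
}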
You need to code this ETL process yourself. Follow the steps below:
Write your first Neo4j example by following this article.
Understand how to model with graphs.
There are multiple ways of talking to Neo4j using Java. Choose the one that suits your needs.
Suppose I have an application which fetches a custom XML packet from the server that represents a dataset. Then suppose I wish to execute a SQL statement on that data via a dataset. What can I use to do this? I don't necessarily need the code, just what to use to make this possible and a general explanation of how.
For example, I may fetch a list of customers in XML format from the server. Then I can use any third-party parser to dump that XML data into some client dataset. Then, execute a query on that dataset, for example select * from customers where ZipCode = '12345', without fetching this data from the server again.
XML is not the only case; that's just an example. I might want to do the same with some application settings loaded from an INI file. Either way, the concept is that the original source of the data is unknown.
Whether the dataset stores its temporary data in memory or on disk doesn't matter, but it would be excellent if it could keep it on disk.
TXQuery (http://code.google.com/p/txquery/) is a component that provides a local SQL engine for executing SQL queries against one or more TDataSets. The only issue I have had with it is updating data via a TDBGrid of a query joining multiple tables (TDataSets), specifically determining which table is being updated.
AnyDAC v6 (now FireDAC) also has a local SQL engine. http://www.da-soft.com/anydac/docu/frames.html?frmname=topic&frmfile=Local_SQL.html
Edit: For the example SQL in your question, because it only involves a single table, you can do this with just a Filter on the dataset. For example:
ADataSet.Filtered := False;
ADataSet.Filter := 'ZipCode=' + QuotedStr('12345');
ADataSet.Filtered := True;
Such a feature can be implemented using a local database. You just insert the TDataSet results into a local in-memory (or file-based) stand-alone database, then you can use regular SQL queries on it, including JOINs.
You can for instance use SQLite3, or the free edition of NexusDB.
NexusDB embedded has the benefit of being a native Delphi database, so it sticks to the DB.pas TDataSet paradigm.
Another option is to use the so-called Virtual Table mechanism of SQLite3, which allows you to expose any data (even from a TDataSet, XML, JSON or in-memory objects) to the SQLite3 engine, just as if it were a regular table. Then you can run SQL statements on those "virtual" tables, including JOINs. With this approach, you do not need to INSERT the data into regular tables; the data remain in their original form. Of course, you will miss some performance features like indexes, which have to be handled on the virtual table provider side. We use this feature as the database core of our mORMot ORM/SOA framework, and it is pretty powerful.
The general process that you want to perform is complicated by the difference in data representation. SQL data is stored in tables made up of distinguishable records. XML is a structured representation of data, but in tree form rather than table/row form.
Each of these data forms may be qualified by a schema that provides a context for the data.
You have two general paths that you can follow:
Take the XML and, based on the schema, insert it into a set of interlinked tables, then perform the SQL query. If you have the schema, you can use code generators to build a parser and then, based on the parse tree, insert into a local database with tables constructed on the fly. You can set up MySQL fairly easily (see https://dev.mysql.com/doc/refman/5.7/en/installing.html), then make a connection to the database from your version of Delphi, fill it in first, then query. This would satisfy your desire to have the data stored on disk; unless you purge the tables when done, the data remain available in the local machine's database.
This seems like more work than:
Use XPath or XQuery and work directly on the XML. For this, a package like Saxon in your favorite environment, or expat in Python, would work nicely.
Let me know if either of these paths seems as if it may be fruitful.
I've recently been exposed to the world of graph databases. It's quite an interesting paradigm shift for an old relational dog like me.
Also quite recently, I've been tinkering with liquibase, and it's been quite a neat tool for managing databases.
So, two worlds collide and I was just wondering if there are any tools out there that take on liquibase-like change management for graph databases. I'm especially interested in neo4j and orientdb.
Liquigraph exists now and although still quite new, the author is very receptive to feedback and is actively working on the project.
Pramod Sadalage and Martin Fowler's influential article from 2003 on Evolutionary Database Design had a big impact on how I approached managing schema changes in a database. I went on to use DbDeploy and DbDeploy.net in Java and .NET ecosystems, and now use ActiveRecord migrations. If you find liquibase interesting, I recommend taking a look at these tools.
The Neo4j.rb documentation discusses these kinds of migrations against Neo4j.
I personally haven't used a tool to manage migrations in Neo4j, but I've written migration scripts that have done things like rename properties, change edge labels, or create indexes. As an example use case, here's a snippet from a Gremlin Groovy script I used to remap some foreign keys stored in a Neo4j graph and update an index:
try {
    projects.each { node ->
        old_id = node.ref_id
        new_id = old_to_new_ids[old_id]
        index.remove('project', old_id, node)
        node.ref_id = new_id
        index.put('project', new_id, node)
    }
} catch (Throwable e) {
    println(e)
} finally {
    g.shutdown()
}
As of Neo4j version 1.8, there is a PropertyContainer that can be used for graph metadata. It would be simple to use this container to update a 'schema_version' property. The code would look something like:
EmbeddedGraphDatabase db = new EmbeddedGraphDatabase(dbFilename);
Transaction tx = db.beginTx();
try {
    // The graph-wide PropertyContainer holds metadata that is not attached to any node.
    PropertyContainer properties = db.getNodeManager().getGraphProperties();
    properties.setProperty("schema_version", 3);
    tx.success();
} finally {
    tx.finish();
}
Personally, I would be more interested in something based on the TinkerPop APIs. That API is supported by multiple different databases; that's what it is designed for. I'd prefer to be able to define my vertex labels, edge labels, properties, indexes, etc., rather than trying to align with a (great) technology that was designed for relational databases.
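As a rough illustration of what such a migration step could look like against the TinkerPop 3 API, here is a minimal, untested sketch using the in-memory TinkerGraph as a stand-in for any TinkerPop-enabled database. The Person/knows labels, the name index and the SchemaVersion vertex are assumptions for illustration, not an existing migration tool.
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.structure.Vertex;
import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph;

public class Migration0001 {
    public static void main(String[] args) throws Exception {
        TinkerGraph graph = TinkerGraph.open();
        GraphTraversalSource g = graph.traversal();

        // Index creation is provider-specific; TinkerGraph exposes it directly.
        graph.createIndex("name", Vertex.class);

        // The structural changes this migration introduces.
        Vertex alice = g.addV("Person").property("name", "Alice").next();
        Vertex bob = g.addV("Person").property("name", "Bob").next();
        g.addE("knows").from(alice).to(bob).iterate();

        // Record the applied version on a dedicated metadata vertex.
        g.addV("SchemaVersion").property("version", 1).iterate();

        graph.close();
    }
}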
Objectivity/DB is an object-oriented/graph database that has a feature called "Schema Evolution". This feature allows you to create your schema, load data, change the schema, and load more data. You can change the schema as many times as you'd like. We've had customers who have deployed operational systems and changed their schema hundreds of times without having to reload data.
The Schema Evolution feature uses the concept of schema "shapes", where each shape is stored in the schema catalog and each object has a shape id. When an object is read from disk, the shape id is used to look up the schema shape in the catalog. Then, if the catalog shape is not the "latest" shape for that schema type, the actual object data is "evolved" on the fly to match the newest shape for that object type. This allows an operational system to avoid having to reload petabyte-scale databases just because someone wants an extra attribute.
Many types of schema change are allowed, such as adding, removing, and re-typing attributes, but a few schema changes are not allowed because they would be functionally destructive to the data and/or schema.
Disclaimer: I am employed by Objectivity, Inc.