I am preparing Avro data with deflate codec compression. When I prepare the data in 10 batches, the schema definition is written into the file each time. Is it possible to avoid adding the schema definition repeatedly?
Background of problem
I'm currently building an app that models various geographic features (roads, towns, highways, etc.) in a graph database. The geographic data is all in GeoJSON format.
There is no LOAD JSON function in Cypher, so loading JSON files requires passing the fully parsed JavaScript object as a parameter and using UNWIND to access arrayed properties and objects to create nodes. (This guide helped me out a lot to get started: Loading JSON in neo4j). Since GeoJSON is just a spec built on JSON conventions, the load-JSON method works great for reasonably sized files.
However, geographic data files can be massive. Some of the files I'm trying to import range from 100 features to 200,000 features.
The problem I'm running into is that with these very large files, the query will not MERGE any nodes in the database until it has completely processed the file. For large files, this often exceeds the 3600s timeout limit set in neo4j. So I end up waiting for an hour to find out that I have no new data in my database.
I know that with some data, the current recommendation is to convert it to CSV and then use the optimization of LOAD CSV. However, I don't believe it is easy to condense GeoJSON into CSV.
Primary Question
Is it possible to send the data from a very large JSON/GeoJSON file over in smaller batches so that neo4j will commit the data intermittently?
Current Approach
To import my data, I built a simple Express app that connects to my neo4j database via the Bolt protocol (using the official JavaScript driver). My GeoJSON files all have a well-known text (WKT) property for each feature so that I can make use of neo4j-spatial.
Here's an example of the code I would use to import a set of road data:
session.run("WITH {json} as data UNWIND data.features as features MERGE (r:Road {wkt:features.properties.wkt})", {json: jsonObject})
.then(function (result) {
var records = [];
result.records.forEach((value) => {
records.push(value);
});
console.log("query completed");
session.close();
driver.close();
return records;
})
.catch((error) => {
console.log(error);
// Close out the session objects
session.close();
driver.close();
});
As you can see, I'm passing the entire parsed GeoJSON object as a parameter in my Cypher query. Is there a better way to do this with very large files so I can avoid the timeout issue I'm experiencing?
This answer might be helpful here: https://stackoverflow.com/a/59617967/1967693
apoc.load.jsonArray() will stream the values of the given JSON file. This can then be used as a data source for batching via apoc.periodic.iterate.
CALL apoc.periodic.iterate(
"CALL apoc.load.json('https://dummyjson.com/products', '$.features') YIELD value AS features",
"UNWIND features as feature MERGE (r:Road {wkt:feature.properties.wkt})",
{batchSize:1000, parallel:true}
)
I see plenty of examples on how to convert Avro files to Parquet, with Parquet retaining the Avro schema in its metadata.
However, I'm confused about whether there's a simple way of doing the opposite: converting Parquet to Avro. Are there any examples of that?
I think that with Impala, a query like this should work:
CREATE TABLE parquet_table(....) STORED AS PARQUET;
CREATE TABLE avro_table(....) STORED AS AVRO;
INSERT INTO avro_table SELECT * FROM parquet_table;
The Parquet data stored in parquet_table will be converted to Avro format and inserted into avro_table.
I have a few TB of log data in JSON format, and I want to convert it to Parquet format to get better performance in the analytics stage.
I've managed to do this by writing a MapReduce Java job that uses parquet-mr and parquet-avro.
The only thing I'm not satisfied with is that my JSON logs don't have a fixed schema: I don't know all the fields' names and types. Besides, even if I knew all the fields' names and types, my schema evolves as time goes on; for example, new fields will be added in the future.
For now I have to provide an Avro schema for AvroWriteSupport, and Avro only allows a fixed set of fields.
Is there a better way to store arbitrary fields in Parquet, just like JSON?
One thing that's for sure is that Parquet needs an Avro schema in advance. We'll focus on how to get the schema.
Use SparkSQL to convert JSON files to Parquet files.
SparkSQL can infer a schema automatically from the data, so we don't need to provide one ourselves. Every time the data changes, SparkSQL will infer a new schema.
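For illustration, here is a minimal sketch of that approach using the Spark Java API; the app name and the input/output paths are placeholders, not from the original post:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class JsonToParquet {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("json-to-parquet")   // placeholder app name
                .getOrCreate();

        // Spark samples the JSON files and infers a schema automatically.
        Dataset<Row> logs = spark.read().json("logs/*.json");

        // Write the same data out as Parquet; the inferred schema ends up
        // in the Parquet file footers, so readers don't need a separate one.
        logs.write().mode(SaveMode.Overwrite).parquet("logs-parquet/");

        spark.stop();
    }
}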
Maintain an Avro schema manually.
If you don't use Spark, only Hadoop, you need to infer the schema manually. First, write a MapReduce job that scans all the JSON files and collects all the fields; once you know all the fields, you can write an Avro schema. Use this schema to convert the JSON files to Parquet files.
There will be new, unknown fields in the future; every time new fields appear, add them to the Avro schema. So basically we're doing SparkSQL's job manually.
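For what it's worth, here is a minimal sketch of that manual route using parquet-avro's AvroParquetWriter; the schema, field names, and output path below are examples only, not from the original post:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class AvroSchemaToParquet {
    // Keep this schema in version control and extend it (with defaults)
    // whenever new fields show up in the logs. Example schema only.
    private static final String SCHEMA_JSON =
            "{\"type\":\"record\",\"name\":\"LogEvent\",\"fields\":["
            + "{\"name\":\"timestamp\",\"type\":\"long\"},"
            + "{\"name\":\"level\",\"type\":\"string\"},"
            + "{\"name\":\"message\",\"type\":[\"null\",\"string\"],\"default\":null}]}";

    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);

        try (ParquetWriter<GenericRecord> writer =
                     AvroParquetWriter.<GenericRecord>builder(new Path("logs.parquet"))
                             .withSchema(schema)
                             .withCompressionCodec(CompressionCodecName.SNAPPY)
                             .build()) {
            // In the real job, each parsed JSON log line becomes one record.
            GenericRecord record = new GenericData.Record(schema);
            record.put("timestamp", System.currentTimeMillis());
            record.put("level", "INFO");
            record.put("message", "example line");
            writer.write(record);
        }
    }
}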
Use Apache Drill!
Following https://drill.apache.org/docs/parquet-format/, it can be done in one line of SQL.
After setting up Apache Drill (with or without HDFS), run sqlline (the Drill shell) to execute SQL queries:
-- Set the default output format
ALTER SESSION SET `store.format` = 'parquet';
ALTER SYSTEM SET `store.format` = 'parquet';

-- Migrate the data
CREATE TABLE dfs.tmp.sampleparquet AS (
  SELECT trans_id,
         cast(`date` AS date) transdate,
         cast(`time` AS time) transtime,
         cast(amount AS double) amountm,
         user_info, marketing_info, trans_info
  FROM dfs.`/Users/drilluser/sample.json`);
It may take some time, maybe hours, but in the end you have compact, efficient Parquet files ;-)
In my test, querying a Parquet file was about 4x faster than querying JSON and used fewer resources.
My scenario is a variation on the one discussed here:
How do I write to BigQuery using a schema computed during Dataflow execution?
In this case, the goal is that same (read a schema during execution, then write a table with that schema to BigQuery), but I want to accomplish it within a single pipeline.
For example, I'd like to write a CSV file to BigQuery and avoid fetching the file twice (once to read schema, once to read data).
Is this possible? If so, what's the best approach?
My current best guess is to read the schema into a PCollection via a side output and then use that to create the table (with a custom PTransform) before passing the data to BigQueryIO.Write.
If you use BigQueryIO.Write to create the table, then the schema needs to be known when the table is created.
Your proposed solution of not specifying the schema when you create the BigQueryIO.Write transform might work, but you might get an error because the table doesn't exist and you aren't configuring BigQueryIO.Write to create it if needed.
You might want to consider reading just enough of your CSV files in your main program to determine the schema before running your pipeline. This would avoid the complexity of determining the schema at runtime. You would still incur the cost of the extra read but hopefully that's minimal.
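A sketch of that idea: read only the CSV header in the main program and build a TableSchema from it before constructing the pipeline. The file path and the choice to type every column as STRING are assumptions, not from the original discussion:

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableSchema;

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class SchemaFromCsvHeader {
    public static TableSchema fromHeader(String csvPath) throws IOException {
        String header;
        try (BufferedReader reader = Files.newBufferedReader(Paths.get(csvPath))) {
            header = reader.readLine();  // only the first line is needed
        }

        List<TableFieldSchema> fields = new ArrayList<>();
        for (String column : header.split(",")) {
            // Without sampling any data rows, default every column to STRING.
            fields.add(new TableFieldSchema().setName(column.trim()).setType("STRING"));
        }
        return new TableSchema().setFields(fields);
    }
}

The resulting TableSchema can then be handed to BigQueryIO.Write via withSchema(...) when the pipeline is built.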
Alternatively, you could create a custom sink to write your data to BigQuery. Your sink could write the data to GCS, and its finalize method could then create a BigQuery load job. The custom sink could infer the schema by looking at the records and create the BigQuery table with the appropriate schema.
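As a rough sketch of what such a finalize step could do, here is a load job submitted with the standalone google-cloud-bigquery client (not the Dataflow sink API itself); the bucket, dataset, table, and schema are placeholders:

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.StandardSQLTypeName;
import com.google.cloud.bigquery.TableId;

public class FinalizeLoad {
    public static void loadStagedFiles(Schema inferredSchema) throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // Files previously written to GCS by the sink's write phase (placeholder URI).
        String stagedFiles = "gs://my-staging-bucket/output/*.csv";
        TableId table = TableId.of("my_dataset", "my_table");

        LoadJobConfiguration load = LoadJobConfiguration.newBuilder(table, stagedFiles)
                .setFormatOptions(FormatOptions.csv())
                .setSchema(inferredSchema)  // schema inferred from the records seen
                .build();

        // Kick off the load job and block until BigQuery finishes it.
        Job job = bigquery.create(JobInfo.of(load));
        job.waitFor();
    }

    public static void main(String[] args) throws InterruptedException {
        // Example of a schema a sink might have inferred from its records.
        Schema schema = Schema.of(
                Field.of("id", StandardSQLTypeName.INT64),
                Field.of("name", StandardSQLTypeName.STRING));
        loadStagedFiles(schema);
    }
}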
We have a pipeline that looks like:
BigQuery -> ParDo -> BigQuery
The table has ~2B rows, and is just under 1TB.
After running for just over 8 hours, the job failed with the following error:
May 19, 2015, 10:09:15 PM
S09: (f5a951d84007ef89): Workflow failed. Causes: (f5a951d84007e064): BigQuery job "dataflow_job_17701769799585490748" in project "gdfp-xxxx" finished with error(s): job error: Sources are too large. Limit is 5.00Ti., error: Sources are too large. Limit is 5.00Ti.
Job id is: 2015-05-18_21_04_28-9907828662358367047
It's a big table, but it's not that big, and Dataflow should easily be able to handle it. Why can't it handle this use case?
Also, even though the job failed, it still shows it as successful on the diagram. Why?
I think that error means the data you are trying to write to BigQuery exceeds the 5TB limit set by BigQuery for a single import job.
One way to work around this limit might be to split your BigQuery writes into multiple jobs by having multiple Write transforms so that no Write transform receives more than 5TB.
Before your write transform, you could have a DoFn with N outputs. For each record, randomly assign it to one of the outputs. Each of the N outputs can then have its own BigQueryIO.Write transform. The write transforms could all append data to the same table, so all of the data ends up in the same table.
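Here is a sketch of that sharding idea using the Apache Beam Java SDK (the successor to the Dataflow SDK); Beam's built-in Partition transform is essentially the DoFn-with-N-outputs pattern described above. The shard count and destination table are placeholders, and the table is assumed to already exist:

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.transforms.Partition;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;

import java.util.concurrent.ThreadLocalRandom;

public class ShardedBigQueryWrite {
    private static final int N = 4;  // number of independent write transforms (placeholder)

    public static void shardAndWrite(PCollection<TableRow> rows) {
        // Randomly assign every record to one of N shards.
        PCollectionList<TableRow> shards = rows.apply("AssignShard",
                Partition.of(N, new Partition.PartitionFn<TableRow>() {
                    @Override
                    public int partitionFor(TableRow row, int numPartitions) {
                        return ThreadLocalRandom.current().nextInt(numPartitions);
                    }
                }));

        // Each shard gets its own write; they all append to the same table,
        // so each underlying import job stays below the 5TB source limit.
        for (int i = 0; i < shards.size(); i++) {
            shards.get(i).apply("WriteShard" + i,
                    BigQueryIO.writeTableRows()
                            .to("my_project:my_dataset.my_table")  // placeholder table
                            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
                            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));
        }
    }
}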