Apache Beam: Wait for AvroIO write step is done before start ImportTransform Dataflow template - google-cloud-dataflow

I'm using Apache Beam to create a pipeline that basically reads an input file, converts it to Avro, writes the Avro files to a bucket, and then imports those Avro files into Spanner using the Dataflow import template.
The problem I'm facing is that the last step (importing the Avro files into the database) starts before the previous one (writing the Avro files to the bucket) has finished.
I tried to add Wait.on, but that only works with a PCollection, and writing the Avro files returns PDone.
Example of the Code:
// Step 1: Read files
PCollection<String> lines = pipeline.apply("Reading Input Data exported from Cassandra",
    TextIO.read().from(options.getInputFile()));
// Step 2: Convert to Avro and write the files
lines.apply("Write Item Avro File",
    AvroIO.writeGenericRecords(spannerItemAvroSchema)
        .to(options.getOutput())
        .withSuffix(".avro"));
// Step 3: Import into the database
pipeline.apply(new ImportTransform(
    spannerConfig,
    options.getInputDir(),
    options.getWaitForIndexes(),
    options.getWaitForForeignKeys(),
    options.getEarlyIndexCreateFlag()));
Again, the problem is that Step 3 starts before Step 2 is done.
Any ideas?

This is a flaw in the API; see, for example, a recent discussion on this on the Beam dev list. The only solutions for now are either to fork AvroIO so that it returns a PCollection, or to run two pipelines sequentially.
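For reference, a minimal sketch of the second workaround, running the two pipelines back to back. It reuses the option names and the ImportTransform call from the question; ToGenericRecordFn is a hypothetical DoFn standing in for whatever String-to-GenericRecord conversion the original pipeline performs.
// First pipeline: read the input and write the Avro files.
Pipeline writePipeline = Pipeline.create(options);
writePipeline
    .apply("Reading Input Data exported from Cassandra",
        TextIO.read().from(options.getInputFile()))
    .apply("Convert to GenericRecord", ParDo.of(new ToGenericRecordFn())) // hypothetical conversion DoFn
    .setCoder(AvroCoder.of(spannerItemAvroSchema))
    .apply("Write Item Avro File",
        AvroIO.writeGenericRecords(spannerItemAvroSchema)
            .to(options.getOutput())
            .withSuffix(".avro"));

// Block until every Avro file has been written to the bucket...
writePipeline.run().waitUntilFinish();

// ...and only then build and run the pipeline that imports into Spanner.
Pipeline importPipeline = Pipeline.create(options);
importPipeline.apply(new ImportTransform(
    spannerConfig,
    options.getInputDir(),
    options.getWaitForIndexes(),
    options.getWaitForForeignKeys(),
    options.getEarlyIndexCreateFlag()));
importPipeline.run().waitUntilFinish();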

Related

In pyspark, reading csv files gets failed if even 1 path does not exist. How can we avoid this?

In PySpark, reading CSV files from different paths fails if even one path does not exist.
Logs = spark.read.load(Logpaths, format="csv", schema=logsSchema, header="true", mode="DROPMALFORMED");
Here Logpaths is an array that contains multiple paths, created dynamically from the given startDate and endDate range. If Logpaths contains 5 paths and the first 3 exist but the 4th does not, the whole extraction fails. How can I avoid this in PySpark, or how can I check their existence before reading?
In Scala I did this by checking file existence and filtering out the non-existent paths using the Hadoop HDFS FileSystem globStatus function.
val path = "/bilal/2018.12.16/logs.csv"
val hadoopConf = new org.apache.hadoop.conf.Configuration()
val fs = org.apache.hadoop.fs.FileSystem.get(hadoopConf)
val fileStatus = fs.globStatus(new org.apache.hadoop.fs.Path(path))
So I found what I was looking for. The code I posted in the question can be used in Scala for the file existence check; for PySpark we can use the code below.
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
fs.exists(sc._jvm.org.apache.hadoop.fs.Path("bilal/logs/log.csv"))
This is essentially the same code used in Scala: we are calling the Hadoop Java library, and the Java code runs on the JVM that Spark itself runs on.

Exported Dataflow Template Parameters Unknown

I've exported a Cloud Dataflow template from Dataprep as outlined here:
https://cloud.google.com/dataprep/docs/html/Export-Basics_57344556
In Dataprep, the flow pulls in text files via wildcard from Google Cloud Storage, transforms the data, and appends it to an existing BigQuery table. All works as intended.
However, when trying to start a Dataflow job from the exported template, I can't seem to get the startup parameters right. The error messages aren't overly specific but it's clear that for one thing, I'm not getting the locations (input and output) right.
The only Google-provided template for this use case (found at https://cloud.google.com/dataflow/docs/guides/templates/provided-templates#cloud-storage-text-to-bigquery) doesn't apply, as it uses a UDF and also runs in batch mode, overwriting any existing BigQuery table rather than appending to it.
Inspecting the original Dataflow job details from Dataprep shows a number of parameters (found in the metadata file) but I haven't been able to get those to work within my code. Here's an example of one such failed configuration:
import time
from google.cloud import storage
from googleapiclient.discovery import build
from oauth2client.client import GoogleCredentials

def dummy(event, context):
    pass

def process_data(event, context):
    credentials = GoogleCredentials.get_application_default()
    service = build('dataflow', 'v1b3', credentials=credentials)
    data = event
    gsclient = storage.Client()
    file_name = data['name']
    time_stamp = time.time()
    GCSPATH = "gs://[path to template]"
    BODY = {
        "jobName": "GCS2BigQuery_{tstamp}".format(tstamp=time_stamp),
        "parameters": {
            "inputLocations": '{{\"location1\":\"[my bucket]/{filename}\"}}'.format(filename=file_name),
            "outputLocations": '{{\"location1\":\"[project]:[dataset].[table]\", [... other locations]"}}',
            "customGcsTempLocation": "gs://[my bucket]/dataflow"
        },
        "environment": {
            "zone": "us-east1-b"
        }
    }
    print(BODY["parameters"])
    request = service.projects().templates().launch(projectId=PROJECT, gcsPath=GCSPATH, body=BODY)
    response = request.execute()
    print(response)
The above example produces an error indicating an invalid field ("location1"), a name I pulled from a completed Dataflow job. I know I need to specify the GCS location, the template location, and the BigQuery table, but I haven't found the correct syntax anywhere. As mentioned above, I found the field names and sample values in the job's generated metadata file.
I realize that this specific use case may not ring any bells, but in general, if anyone has had success determining and using the correct startup parameters for a Dataflow job exported from Dataprep, I'd be most grateful to learn more about that. Thanks.
I think you need to review this document [1]; it explains exactly the syntax required for passing the various available pipeline options, including the location parameters you need.
Specifically, the following line in your code snippet does not follow the correct syntax:
"inputLocations": '{{\"location1\":\"[my bucket]/{filename}\"}}'.format(filename=file_name)
In addition to that document, you should also review the available pipeline options and their correct syntax [2].
Please use the links; they are the official Google documentation links. They will not go stale or be removed, as they are actively monitored and maintained by a dedicated team.

Google dataflow: AvroIO read from file in google storage passed as runtime parameter

I want to read Avro files in my Dataflow pipeline using the Java SDK 2.
I schedule my Dataflow job from a Cloud Function that is triggered when files are uploaded to the bucket.
Following is the code for the options:
ValueProvider <String> getInputFile();
void setInputFile(ValueProvider<String> value);
I am trying to read this input file using the following code:
PCollection<user> records = p.apply(
    AvroIO.read(user.class)
        .from(String.valueOf(options.getInputFile())));
I get following error while running the pipeline:
java.lang.IllegalArgumentException: Unable to find any files matching RuntimeValueProvider{propertyName=inputFile, default=gs://test_bucket/user.avro, value=null}
The same code works fine with TextIO.
How can I read the Avro file that was uploaded to trigger the Cloud Function, which in turn triggers the Dataflow pipeline?
Please try .from(options.getInputFile()) without converting it to a string.
For simplicity, you could even define your option as a simple String:
String getInputFile();
void setInputFile(String value);
You simply need to use from(options.getInputFile()): AvroIO explicitly supports reading from a ValueProvider.
Currently the code takes options.getInputFile(), which is a ValueProvider, calls Java's toString() on it (via String.valueOf), which yields the human-readable debug string "RuntimeValueProvider{propertyName=inputFile, default=gs://test_bucket/user.avro, value=null}", and passes that as the filename for AvroIO to read. Of course this string is not a valid filename, which is why the code currently doesn't work.
Also note that the whole point of ValueProvider is that it is a placeholder for a value that is not known while constructing the pipeline and will be supplied later (potentially the pipeline will be executed several times with different values), so extracting the value of a ValueProvider at pipeline construction time is impossible by design, because there is no value yet. At runtime, though (e.g. in a DoFn), you can extract the value by calling .get() on it.
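For clarity, a minimal sketch of the corrected wiring, reusing the user class and the getInputFile option from the question (the MyOptions interface name and the surrounding main method are made up for this example):
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.AvroIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.values.PCollection;

// Keep the option as a ValueProvider so the value can be supplied at template launch time.
public interface MyOptions extends PipelineOptions {
    ValueProvider<String> getInputFile();
    void setInputFile(ValueProvider<String> value);
}

// Inside main(String[] args):
MyOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(MyOptions.class);
Pipeline p = Pipeline.create(options);

// Pass the ValueProvider straight through; AvroIO resolves it at run time.
PCollection<user> records = p.apply(
    AvroIO.read(user.class)
        .from(options.getInputFile()));

p.run();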

Different checksum results for jar files compiled on subsequent build?

I am verifying the jar files present on remote Unix boxes against those built on my local machine (Windows & Cygwin) with the same JVM.
As a POC, I am trying to verify whether the same checksum is produced for jar files generated on my machine by consecutive builds. I tried the following:
Generated the jar file the first time using an Ant script
Calculated the checksum (e.g. "xyz abc")
Generated the jar file again with the same Ant script, without changing anything
Got a different checksum but the same byte count (e.g. "xvw abc")
I am not sure how Java's internal processes produce the class files and then the jar files. Can someone please help me understand the points below?
Does the cksum utility of Unix/Cygwin consider the timestamp of the file while computing the value?
Will the checksum be different for the compiled class files/jar file if everything else is kept the same [compiler version + source code + machine + environment]?
Answer to question 1: cksum doesn't consider the timestamp of the archive file itself (e.g. the jar file), but the timestamps of the files inside the jar are stored in the archive, so they do affect the checksum.
Answer to question 2: The checksums of the individual class files will be the same when everything else is the same (source code, compiler, etc.). The checksums of the jar files will be different. Causes of the differences can be the timestamps of the files inside the jar file, or files being put into the archive in a different order (e.g. caused by parallel builds).
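To see the first cause for yourself, here is a small sketch (not part of the original answer) that prints the stored modification time of every entry in a jar; running it against two consecutive builds shows the differing timestamps:
import java.util.jar.JarFile;

public class JarEntryTimes {
    public static void main(String[] args) throws Exception {
        // args[0] is the path to the jar to inspect.
        try (JarFile jar = new JarFile(args[0])) {
            jar.stream().forEach(entry ->
                // getTime() returns the modification time stored for the archive entry.
                System.out.println(entry.getTime() + "\t" + entry.getName()));
        }
    }
}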
If you want to create a reproducible build with gradle you can do so with the config below:
tasks.withType(AbstractArchiveTask) {
preserveFileTimestamps = false
reproducibleFileOrder = true
}
Maven allows something similar; sorry, I don't know how to do this with Ant.
More info here:
https://dzone.com/articles/reproducible-builds-in-java
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=74682318

Fail to import into Neo4J with batch-import

I'm trying to import a SQLite3 database into Neo4j using batch-import. Being a Neo4j noob, I followed Max De Marzi's post: Batch Importer – Part 2.
I get this error:
# java -server -Xmx2G -jar /opt/batch-import/target/batch-import-jar-with-dependencies.jar /var/lib/neo4j/data/graph.db nodes.csv relations.csv
Usage: Importer data/dir nodes.csv relationships.csv [node_index node-index-name fulltext|exact nodes_index.csv rel_index rel-index-name fulltext|exact rels_index.csv ....]
Using: Importer /var/lib/neo4j/data/graph.db nodes.csv relations.csv
Using Existing Configuration File
..
Importing 271544 Nodes took 2 seconds
Total import time: 4 seconds
Exception in thread "main" org.neo4j.graphdb.NotFoundException: id=271565
at org.neo4j.unsafe.batchinsert.BatchInserterImpl.getNodeRecord(BatchInserterImpl.java:917)
at org.neo4j.unsafe.batchinsert.BatchInserterImpl.createRelationship(BatchInserterImpl.java:471)
at org.neo4j.batchimport.Importer.importRelationships(Importer.java:136)
at org.neo4j.batchimport.Importer.doImport(Importer.java:214)
at org.neo4j.batchimport.Importer.main(Importer.java:78)
But the node exists:
$ grep ^271565 nodes.csv
271565 'la Callas' 'n_term' 0.0
Has anyone else had this issue?
Thanks.
Can you show your file headers?
As you can see, you only imported 271544 nodes, so there is no way there is a node with node id 271565.
The id in the relationship file refers to the row number in the nodes file, not to whatever is in your own "id" column (how could it know?).
The only thing you can do here is to use id:id, which is a special type that forces the Neo4j ids to correspond to the ids you provide. In the relationship file, use start:id and end:id accordingly.
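Purely as an illustration (the non-id column names are made up, and the exact header syntax may vary between batch-import versions), the tab-separated files might then look roughly like this, with the example row taken from the question:
nodes.csv:
id:id	name	type	weight
271565	'la Callas'	'n_term'	0.0

relations.csv:
start:id	end:id	type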
You can try an alternative method to import bulk data into Neo4j.
First convert your database into CSV files and import them into Gephi, a graph visualization tool. Then, using the Gephi plugin for Neo4j database support, you should be able to export your database (from Gephi) into Neo4j format.
Finally, just copy the exported file into the appropriate Neo4j directory.
For importing the database into Gephi, you will need two CSV files: one with all the nodes and the other with all the relationships. Follow this tutorial: http://blog.neo4j.org/2013/01/fun-with-beer-and-graphs.html
Get Gephi from here: https://gephi.org/
Get the Plugin from here : https://marketplace.gephi.org/plugin/neo4j-graph-database-support/
Hope this helps.
Can you supply your input files to test? What branch are you using?
I found a similar error reported here: https://github.com/jexp/batch-import/issues/59
