DataflowRunner pipeline error - Unable to rename - google-cloud-dataflow

My Dataflow job reads a CSV file from a GCS bucket, queries another service for extra data, writes the result to a new CSV file, and stores it back in the bucket, but it seems to fail before it even grabs the input CSV file at the start...
This is the error I get:
DataflowRuntimeException - Dataflow pipeline failed. State: FAILED, Error:
Unable to rename "gs://../../job.1582402027.233469/dax-tmp-2020-02-22_12_07_49-5033316469851820576-S04-0-1719661b275ca435/tmp-1719661b275ca2ea-shard--try-273280d77b2c5b79-endshard.avro" to "gs://../../temp/job.1582402027.233469/tmp-1719661b275ca2ea-00000-of-00001.avro".
Any ideas what could be causing this error?

Usually that error is due to the service account you are using for the Dataflow job not having the right GCS (Google Cloud Storage) permissions.
You should add a role like "roles/storage.objectAdmin" to the service account to allow it to interact with GCS.
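If it helps, here is a minimal sketch of the relevant pipeline options (bucket, project and service-account names are made up); the account passed as serviceAccount is the one that needs roles/storage.objectAdmin on the temp bucket where those dax-tmp files get renamed:

    import org.apache.beam.runners.dataflow.DataflowRunner;
    import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class MyPipeline {
      public static void main(String[] args) {
        DataflowPipelineOptions options =
            PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
        options.setRunner(DataflowRunner.class);
        options.setProject("my-project");                          // hypothetical project
        // Worker service account; this is the account that needs the GCS role.
        options.setServiceAccount("dataflow-worker@my-project.iam.gserviceaccount.com");
        // Temp location where the tmp/avro files are written and then renamed.
        options.setGcpTempLocation("gs://my-bucket/temp");
        Pipeline p = Pipeline.create(options);
        // ... read CSV, enrich with the other service, write CSV ...
        p.run();
      }
    }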

Related

Storage Event Trigger failed to trigger the function

I am working on a pipeline that copies data from ADLS into an Azure Synapse dedicated SQL pool. I used Synapse pipelines and followed the Microsoft docs on how to create a storage event trigger. But when a new file is loaded into ADLS, I get the following error:
"The template language expression 'trigger().outputs.body.ContainerName' cannot be evaluated because property 'ContainerName' doesn't exist; available properties are 'RunToken'."
I have set the following pipeline parameters:
The pipeline runs successfully when I trigger it manually and pass the parameters. I would appreciate any solution or guidance to resolve this issue.
Thank you very much.
I tried to set up the trigger for the Synapse pipeline and copy the new blob into the dedicated pool, but when I monitored the trigger runs, it failed to run.
I can trigger the pipeline manually.
According to the storage event trigger documentation:
The storage event trigger captures the folder path and file name of the blob into the properties @triggerBody().folderPath and @triggerBody().fileName.
It does not have a property called container name.
As per the data you provided, it seems the file is stored directly in your container. For this approach, set the value of the container name parameter to @trigger().outputs.body.folderPath; since the file sits in the container root, folderPath resolves to just the container name.
Now pass these pipeline parameters to the dataset properties dynamically.
The pipeline will then run successfully and copy the data from ADLS to the Synapse dedicated SQL pool.
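For example, the trigger run parameters could be mapped like this (the parameter names are hypothetical, since the original screenshot with the pipeline parameters is not available):

    container_name : @trigger().outputs.body.folderPath
    file_name      : @triggerBody().fileName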

Question: BigQueryIO creates one file per input line, is that correct?

I'm new to Apache Beam and I'm developing a pipeline to get rows from JDBCIO and send them to BigQueryIO. I'm converting the rows to Avro files with withAvroFormatFunction, but it is creating a new file for each row returned by JDBCIO. The same happens for withFormatFunction with JSON files.
It is so slow to run locally with DirectRunner because it uploads a lot of files to Google Storage. Is this approach good for scaling on Google Dataflow? Is there a better way to deal with it?
Thanks
In BigQueryIO there is an option, withNumFileShards, which controls the number of files that get generated when using BigQuery load jobs.
From the documentation
Control how many file shards are written when using BigQuery load jobs. Applicable only when also setting withTriggeringFrequency(org.joda.time.Duration).
You can test your process by setting the value to 1 to see if only one large file gets created.
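For example (a rough sketch; the table name, schema, row type and conversion helper are made up, and note that withNumFileShards only takes effect together with withTriggeringFrequency, i.e. for triggered FILE_LOADS writes):

    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.joda.time.Duration;

    // rows is the PCollection<MyRow> produced by the JDBCIO read
    rows.apply("WriteToBigQuery",
        BigQueryIO.<MyRow>write()
            .to("my-project:my_dataset.my_table")                        // hypothetical table
            .withSchema(tableSchema)
            .withAvroFormatFunction(req -> convertToGenericRecord(req))  // your existing conversion
            .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
            .withTriggeringFrequency(Duration.standardMinutes(5))
            .withNumFileShards(1));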
BigQueryIO will commit results to BigQuery for each bundle. The DirectRunner is known to be a bit inefficient about bundling. It never combines bundles. So whatever bundling is provided by a source is propagated to the sink. You can try using other runners such as Flink, Spark, or Dataflow. The in-process open source runners are about as easy to use as the direct runner. Just change --runner=DirectRunner to --runner=FlinkRunner and the default settings will run in local embedded mode.

Is there any GCP Dataflow template for "Pub/Sub to Cloud Spanner"

I am trying to find out if there is any GCP Dataflow template available for data ingestion with "Pub/Sub to Cloud Spanner". I have found that there is already a default GCP Dataflow template available for a similar case, "Cloud Pub/Sub to BigQuery".
So I am interested to see whether I can do data ingestion into Spanner in streaming or batch mode, and what the behavior would be.
There is a Dataflow template to import Avro files in batch mode that you can use by following these instructions. Unfortunately a Cloud Pub/Sub streaming template is not available yet. If you would like, you can file a feature request.
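In the meantime, writing a small custom pipeline is one possible workaround; below is a minimal, untested sketch using PubsubIO and SpannerIO (topic, instance, database, table and column names are made up):

    import com.google.cloud.spanner.Mutation;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.io.gcp.spanner.SpannerIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.options.StreamingOptions;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.TypeDescriptor;

    public class PubSubToSpannerSketch {
      public static void main(String[] args) {
        StreamingOptions options =
            PipelineOptionsFactory.fromArgs(args).as(StreamingOptions.class);
        options.setStreaming(true);
        Pipeline p = Pipeline.create(options);

        p.apply("ReadMessages",
                PubsubIO.readStrings().fromTopic("projects/my-project/topics/my-topic"))
         .apply("ToMutations",
                MapElements.into(TypeDescriptor.of(Mutation.class))
                    .via((String msg) -> Mutation.newInsertOrUpdateBuilder("my_table")
                            .set("payload").to(msg)   // a real pipeline would parse the message here
                            .build()))
         .apply("WriteToSpanner",
                SpannerIO.write()
                    .withInstanceId("my-instance")
                    .withDatabaseId("my-database"));

        p.run();
      }
    }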
Actually, I tried something like using the "projects/pubsub-public-data/topics/taxirides-realtime" topic and the "gs://dataflow-templates/latest/Cloud_PubSub_to_Avro" template to load sample data files into my GCS bucket. Then I stopped this streaming job and created another batch job with the "gs://dataflow-templates/latest/GCS_Avro_to_Cloud_Spanner" template. But the batch job failed with the error below:
java.io.FileNotFoundException: No files matched spec: gs://cardataavi/archive/spanner-export.json
at org.apache.beam.sdk.io.FileSystems.maybeAdjustEmptyMatchResult(FileSystems.java:166)
at org.apache.beam.sdk.io.FileSystems.match(FileSystems.java:153)
at org.apache.beam.sdk.io.FileIO$MatchAll$MatchFn.process(FileIO.java:636)
It seems that, right now, Spanner import supports only Avro data in a Spanner-specific format. Is that understanding correct?

Azure Cloud Shell - Storage Creation Failed

It seems each time I try to use an existing share for Cloud Shell, it gives me the annoying error:
Error: 400 {"error":{"code":"AccountPropertyCannotBeUpdated","message":"The property 'kind' was specified in the input, but it cannot be updated."}}
I have tried just creating a Resource Group and then a Storage Account beforehand and then selecting to create a new file share, but this too fails. I wanted to use a single share for storing the Cloud Shell .img files for each of the members of my team so we could easily share files.
It seems to be a bad behavior. Please use the standard options to initialize Cloud Shell and verify your Azure account type.

Stop/abort Build the moment certain text is encountered in console output

Bottom line on top: Is there a way to halt a build immediately when a certain string is encountered in the console output?
We have a Maven build that uses the Maven target site-deploy (it uploads Javadoc to a remote server via SSH).
Every once in a blue moon a build fails, and as a result of this failure the console output file grows to ~12+ GB, which fills up our drive, which in turn can cause our Jenkins master to die from running out of disk space.
The log file gets filled up with the following message repeated over and over again:
WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that the RSA host key has just been changed.
The fingerprint for the RSA key sent by the remote host is
3d:69:41:8a:ec:d1:4c:d9:75:ef:7d:71:b7:7d:61:d0.
Please contact your system administrator.
Add correct host key in known_hosts to get rid of this message.
Do you want to delete the old key and insert the new key? (yes/no):
We are in the process of fixing the build so that we don't get this error message, but it would be really cool if Jenkins could stop/abort the build the moment it encounters this message.
Is there a way to do this?
I don't know of any existing solution, but I believe it should be possible to write your own plugin to do so.
You could create a BuildWrapper that decorates the log, searches for your messages, and kills the build when a line matches your criteria.
Here's a BuildWrapper that kills a job that has been running too long:
The plugin
The implementation
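For illustration, a rough, untested sketch of such a wrapper could look like this (the class name and the marker string are made up; it watches each log line and asks the executor to abort the build when the marker appears):

    import hudson.Extension;
    import hudson.Launcher;
    import hudson.console.LineTransformationOutputStream;
    import hudson.model.AbstractBuild;
    import hudson.model.AbstractProject;
    import hudson.model.BuildListener;
    import hudson.model.Result;
    import hudson.tasks.BuildWrapper;
    import hudson.tasks.BuildWrapperDescriptor;
    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.charset.StandardCharsets;
    import org.kohsuke.stapler.DataBoundConstructor;

    public class AbortOnLogMessageWrapper extends BuildWrapper {

        // Hypothetical marker; in a real plugin this would be configurable.
        private static final String FATAL_MARKER = "REMOTE HOST IDENTIFICATION HAS CHANGED";

        @DataBoundConstructor
        public AbortOnLogMessageWrapper() { }

        @Override
        public OutputStream decorateLogger(final AbstractBuild build, final OutputStream logger) {
            return new LineTransformationOutputStream() {
                @Override
                protected void eol(byte[] b, int len) throws IOException {
                    logger.write(b, 0, len);  // pass the line through to the real log
                    String line = new String(b, 0, len, StandardCharsets.UTF_8);
                    if (line.contains(FATAL_MARKER) && build.getExecutor() != null) {
                        build.getExecutor().interrupt(Result.ABORTED);  // abort the running build
                    }
                }
            };
        }

        @Override
        public Environment setUp(AbstractBuild build, Launcher launcher, BuildListener listener) {
            return new Environment() { };  // nothing to set up or tear down
        }

        @Extension
        public static class DescriptorImpl extends BuildWrapperDescriptor {
            @Override
            public boolean isApplicable(AbstractProject<?, ?> item) { return true; }

            @Override
            public String getDisplayName() { return "Abort build when a fatal log message appears"; }
        }
    }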
