How to update a Python library that's already present in GCP Dataflow - google-cloud-dataflow

I am using avro version 1.11.0 for parsing an Avro file and decoding it. We have a custom requirement, so I am not able to use ReadFromAvro. When I try this with Dataflow, a dependency issue arises because avro-python3 version 1.8.2 is already installed on the workers. The problem is the class TimestampMillisSchema, which is not present in avro-python3; the job fails stating "Attribute TimestampMillisSchema not found in avro.schema".
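For context, the decoding happens in a custom DoFn roughly like the sketch below (a minimal sketch; the schema and names are illustrative assumptions, not the actual pipeline code). It needs avro 1.11.0 because the schema uses a timestamp-millis logical type, which newer avro exposes as avro.schema.TimestampMillisSchema, an attribute avro-python3 1.8.2 does not have.

import io

import apache_beam as beam
import avro.schema
from avro.io import BinaryDecoder, DatumReader

# Illustrative writer schema with a timestamp-millis logical type.
SCHEMA_JSON = """
{
  "type": "record",
  "name": "Event",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "ts", "type": {"type": "long", "logicalType": "timestamp-millis"}}
  ]
}
"""

class DecodeAvroRecord(beam.DoFn):
    """Decodes Avro-encoded bytes against a known writer schema."""

    def setup(self):
        self.schema = avro.schema.parse(SCHEMA_JSON)
        # avro.schema.TimestampMillisSchema exists only in avro >= 1.10; with
        # the preinstalled avro-python3 1.8.2 this attribute lookup is what
        # raises "Attribute TimestampMillisSchema not found in avro.schema".
        self.ts_is_millis = isinstance(
            self.schema.fields[1].type, avro.schema.TimestampMillisSchema)
        self.reader = DatumReader(self.schema)

    def process(self, encoded_bytes):
        decoder = BinaryDecoder(io.BytesIO(encoded_bytes))
        yield self.reader.read(decoder)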
I then tried passing a requirements file with avro==1.11.0, but then Dataflow was not able to start, giving the error "Error syncing pod", which seems to be caused by dependency conflicts.
Any idea/help on how this should be resolved?
Thanks

Related

Use of experiments=no_use_multiple_sdk_containers in Google Cloud Dataflow

Issue Summary:
Hi,
I am using avro version 1.11.0 for parsing an Avro file and decoding it. We have a custom requirement, so I am not able to use ReadFromAvro. When I try this with Dataflow, a dependency issue arises because avro-python3 version 1.8.2 is already installed on the workers. The problem is the class TimestampMillisSchema, which is not present in avro-python3; the job fails stating "Attribute TimestampMillisSchema not found in avro.schema". I then tried passing a requirements file with avro==1.11.0, but then Dataflow was not able to start, giving the error "Error syncing pod", which seems to be caused by dependency conflicts.
To solve the issue, we set an experiment flag (--experiments=no_use_multiple_sdk_containers), which ran fine.
I want to know whether there is a better solution to my issue, and also whether the above flag will affect pipeline performance.
Please try the Dataflow run command with:
--prebuild_sdk_container_engine=cloud_build --experiments=use_runner_v2
This uses Cloud Build to build a container with your extra dependencies and then uses it within the Dataflow run.
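For example, a launch along the lines of the sketch below would prebuild the SDK container with the pinned avro dependency and run the job on Runner v2 (a minimal sketch: PROJECT, REGION and BUCKET are placeholders, and depending on your project setup you may also need to point --docker_registry_push_url at a registry you can push to).

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholders: replace PROJECT, REGION and BUCKET with real values.
options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=PROJECT",
    "--region=REGION",
    "--temp_location=gs://BUCKET/tmp",
    "--requirements_file=requirements.txt",         # pins avro==1.11.0
    "--prebuild_sdk_container_engine=cloud_build",  # build once via Cloud Build
    "--experiments=use_runner_v2",
])

with beam.Pipeline(options=options) as pipeline:
    ...  # existing pipeline definition goes here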

Unable to use Kubernetes and Jackson 2 API plugins simultaneously in Jenkins

I am trying to deploy my web application using Kubernetes from Jenkins, but the manifest.yml is not being read properly, as I constantly get the following error logs:
ERROR: Can't construct a java object for tag:yaml.org,2002:io.kubernetes.client.openapi.models.V1Deployment; exception=Class not found: io.kubernetes.client.openapi.models.V1Deployment
in 'reader', line 1, column 1:
apiVersion: v1
^
From what I've read in other entries, this seems to be an issue with the latest versions of the Jackson 2 API plugin. Because of this, I've tried to downgrade the plugin, but that does not seem to be possible, as I have several plugins installed that require the latest Jackson 2 API plugin version, which is why I get an error when trying to downgrade.
Therefore, I'd like to know whether I have any alternative to fix this problem when parsing the manifest.yml, rather than downgrading every dependent plugin one by one.

EMR 6 Beta with Docker Support has S3 Access Issue

I am exploring the new EMR 6.0.0 with Docker support in order to decide whether we want to use it. One of our projects is written in Scala 2.11, but EMR 6.0.0 comes with Spark built with Scala 2.12, so I switched to 6.0.0-beta, which has Spark 2.4.3 built with Scala 2.11. If it works on 6.0.0-beta, we will upgrade our code to Scala 2.12 and use 6.0.0.
A few issues I am having when I try to run my Scala Spark job:
When it tries to read Parquet from S3, I get the error: java.lang.RuntimeException: Cannot create temp dirs: [/mnt/s3]
When I try to make an API call over HTTPS, I get the error: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target.
When it tries to read files from S3, I get the error: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found. I was able to hack around this one by passing the jar path via --jars, which is maybe not the best solution.
I am guessing there must be something I need to set either during bootstrap or in the Dockerfile.
Can someone please help? Thanks!
I figured out the S3 issue. In the beta version, /mnt/s3 is not mounted with read and write permission.
So I needed to add it to "docker.allowed.rw-mounts" in the container-executor configuration like below:
docker.allowed.rw-mounts=/etc/passwd,/mnt/s3

Issue while running Dataflow

I am getting the error below while running a Dataflow job. I am trying to update my existing Beam version to 2.11.0, but I get the following error at runtime.
java.lang.IncompatibleClassChangeError: Class org.apache.beam.model.pipeline.v1.RunnerApi$StandardPTransforms$Primitives does not implement the requested interface com.google.protobuf.ProtocolMessageEnum
    at org.apache.beam.runners.core.construction.BeamUrns.getUrn(BeamUrns.java:27)
    at org.apache.beam.runners.core.construction.PTransformTranslation.<clinit>(PTransformTranslation.java:58)
    at org.apache.beam.runners.core.construction.UnconsumedReads$1.visitValue(UnconsumedReads.java:49)
    at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:666)
    at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:649)
    at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:649)
    at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:649)
    at org.apache.beam.sdk.runners.TransformHierarchy$Node.access$600(TransformHierarchy.java:311)
    at org.apache.beam.sdk.runners.TransformHierarchy.visit(TransformHierarchy.java:245)
    at org.apache.beam.sdk.Pipeline.traverseTopologically(Pipeline.java:458)
    at org.apache.beam.runners.core.construction.UnconsumedReads.ensureAllReadsConsumed(UnconsumedReads.java:40)
    at org.apache.beam.runners.dataflow.DataflowRunner.replaceTransforms(DataflowRunner.java:868)
    at org.apache.beam.runners.dataflow.DataflowRunner.run(DataflowRunner.java:660)
    at org.apache.beam.runners.dataflow.DataflowRunner.run(DataflowRunner.java:173)
    at org.apache.beam.sdk.Pipeline.run(Pipeline.java:313)
    at org.apache.beam.sdk.Pipeline.run(Pipeline.java:299)
This usually means that the version of com.google.protobuf:protobuf-java that Beam was built with does not match the version at runtime. Does your pipeline code also depend on protocol buffers?
UPDATE: I have filed https://issues.apache.org/jira/browse/BEAM-6839 to track this. It is not expected.
I don't have enough rep to leave a comment, but I ran into this issue and later figured out that my problem was that I had different Beam versions in my pom.xml: some had 2.19 and some had 2.20.
I would do a quick search of your versions in the pom or Gradle file to make sure they are all the same.
This may be caused by incompatible dependencies. I successfully upgraded Beam from 2.2.0 to 2.20.0 by upgrading the following dependencies at the same time:
beam.version: 2.20.0
guava.version: 29.0-jre
bigquery.version: v2-rev20191211-1.30.9
google-api-client.version: 1.30.9
google-http-client.version: 1.34.0
pubsub.version: v1-rev20200312-1.30.9

Google Cloud Dataflow Stuck

Recently I've been getting this error when running Dataflow jobs written in Python. The thing is, it used to work and no code has changed, so I'm thinking it has something to do with the environment.
Error syncing pod d557f64660a131e09d2acb9478fad42f (""), skipping:
failed to "StartContainer" for "python" with CrashLoopBackOff:
"Back-off 20s restarting failed container=python pod=dataflow-)
Can anyone help me with this?
In my case, I was using Apache Beam SDK version 2.9.0 and had the same problem.
I used setup.py, and the "install_requires" field was filled dynamically by loading the contents of a requirements.txt file. That is fine with DirectRunner, but DataflowRunner is sensitive to dependencies on local files, so abandoning that technique and hard-coding the dependencies from requirements.txt into "install_requires" solved the issue for me.
If you are stuck on that, try to investigate your dependencies and minimize them as much as you can. Please refer to the Managing Python Pipeline Dependencies documentation for help. Avoid complex or nested code structures and dependencies on the local filesystem.
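As a concrete illustration of the hard-coding approach (a minimal sketch; the package name and the pinned versions are placeholders, not the actual project):

# setup.py: dependencies hard-coded in install_requires instead of being read
# from requirements.txt at runtime, so DataflowRunner never depends on a local
# file while staging the workflow.
import setuptools

setuptools.setup(
    name="my-dataflow-pipeline",   # placeholder name
    version="0.1.0",
    packages=setuptools.find_packages(),
    install_requires=[
        # Pins copied by hand from requirements.txt (illustrative values).
        "avro==1.11.0",
    ],
)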
Neri, thanks for your pointer to the SDK. I noticed that my requirements file was using an older version of the SDK, 2.4.0. I've now changed everything to 2.6.0 and it's no longer stuck.
