How do I resolve a Pickling Error on class apache_beam.internal.clients.dataflow.dataflow_v1b3_messages.TypeValueValuesEnum? - google-cloud-dataflow

A PicklingError is raised when I run my data pipeline remotely: the data pipeline has been written using the Beam SDK for Python and I am running it on top of Google Cloud Dataflow. The pipeline works fine when I run it locally.
The following code generates the PicklingError and ought to reproduce the problem:
import apache_beam as beam
from apache_beam.transforms import pvalue
from apache_beam.io.fileio import _CompressionType
from apache_beam.utils.options import PipelineOptions
from apache_beam.utils.options import GoogleCloudOptions
from apache_beam.utils.options import SetupOptions
from apache_beam.utils.options import StandardOptions
if __name__ == "__main__":
    pipeline_options = PipelineOptions()
    pipeline_options.view_as(StandardOptions).runner = 'BlockingDataflowPipelineRunner'
    pipeline_options.view_as(SetupOptions).save_main_session = True
    google_cloud_options = pipeline_options.view_as(GoogleCloudOptions)
    google_cloud_options.project = "project-name"
    google_cloud_options.job_name = "job-name"
    google_cloud_options.staging_location = 'gs://path/to/bucket/staging'
    google_cloud_options.temp_location = 'gs://path/to/bucket/temp'
    p = beam.Pipeline(options=pipeline_options)
    p.run()
Below is a sample from the beginning and the end of the Traceback:
WARNING: Could not acquire lock C:\Users\ghousains\AppData\Roaming\gcloud\credentials.lock in 0 seconds
WARNING: The credentials file (C:\Users\ghousains\AppData\Roaming\gcloud\credentials) is not writable. Opening in read-only mode. Any refreshed credentials will only be valid for this run.
Traceback (most recent call last):
  File "formatter_debug.py", line 133, in <module>
    p.run()
  File "C:\Miniconda3\envs\beam\lib\site-packages\apache_beam\pipeline.py", line 159, in run
    return self.runner.run(self)
  ....
  ....
  ....
  File "C:\Miniconda3\envs\beam\lib\site-packages\apache_beam\runners\dataflow_runner.py", line 172, in run
    self.dataflow_client.create_job(self.job))
    StockPickler.save_global(pickler, obj)
  File "C:\Miniconda3\envs\beam\lib\pickle.py", line 754, in save_global
    (obj, module, name))
pickle.PicklingError: Can't pickle <class 'apache_beam.internal.clients.dataflow.dataflow_v1b3_messages.TypeValueValuesEnum'>: it's not found as apache_beam.internal.clients.dataflow.dataflow_v1b3_messages.TypeValueValuesEnum

I've found that your error gets raised when a Pipeline object is included in the context that gets pickled and sent to the cloud:
pickle.PicklingError: Can't pickle <class 'apache_beam.internal.clients.dataflow.dataflow_v1b3_messages.TypeValueValuesEnum'>: it's not found as apache_beam.internal.clients.dataflow.dataflow_v1b3_messages.TypeValueValuesEnum
Naturally, you might ask:
What's making the Pipeline object unpickleable when it's sent to the cloud, since normally it's pickleable?
If this were really the problem, then wouldn't I get this error all the time - isn't a Pipeline object normally included in the context sent to the cloud?
If the Pipeline object isn't normally included in the context sent to the cloud, then why is a Pipeline object being included in my case?
(1)
When you call p.run() on a Pipeline with cloud=True, one of the first things that happens is that p.runner.job=apiclient.Job(pipeline.options) is set in apache_beam.runners.dataflow_runner.DataflowPipelineRunner.run.
Without this attribute set, the Pipeline is pickleable. But once this is set, the Pipeline is no longer pickleable, since p.runner.job.proto._Message__tags[17] is a TypeValueValuesEnum, which is defined as a nested class in apache_beam.internal.clients.dataflow.dataflow_v1b3_messages. AFAIK nested classes cannot be pickled (even by dill - see How can I pickle a nested class in python?).
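To see the general limitation outside of Beam, here is a minimal sketch (not Beam code): on Python 2, which this SDK runs on, pickle looks a class up by its module-level name and therefore cannot handle a nested class.
import pickle

class Outer(object):
    class Inner(object):  # nested class, analogous to TypeValueValuesEnum
        pass

try:
    pickle.dumps(Outer.Inner)
    print("pickled OK")  # Python 3.4+ records __qualname__ and succeeds
except pickle.PicklingError as exc:
    # Python 2 looks for a module-level "Inner", cannot find it, and fails;
    # that is the same failure mode as the TypeValueValuesEnum error above.
    print(exc)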
(2)-(3)
Counterintuitively, a Pipeline object is normally not included in the context sent to the cloud. When you call p.run() on a Pipeline with cloud=True, only the following objects are pickled (and note that the pickling happens after p.runner.job gets set):
If save_main_session=True, then all global objects in the module designated __main__ are pickled. (__main__ is the script that you ran from the command line).
Each transform defined in the pipeline is individually pickled
In your case, you encountered #1, which is why your solution worked. I actually encountered #2 where I defined a beam.Map lambda function as a method of a composite PTransform. (When composite transforms are applied, the pipeline gets added as an attribute of the transform...) My solution was to define those lambda functions in the module instead.
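For example, a sketch with made-up names (using the current Beam API): move the callable from the composite transform up to module level, so that pickling the transform does not drag the pipeline along with it.
import apache_beam as beam

# A plain module-level function: nothing from the transform instance is captured.
def extract_key(element):
    return element.split(',')[0]

class ExtractKeys(beam.PTransform):
    """Composite transform that uses the module-level function above
    instead of a lambda defined as a method on this class."""
    def expand(self, pcoll):
        return pcoll | 'ExtractKey' >> beam.Map(extract_key)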
A longer-term solution would be for us to fix this in the Apache Beam project. TBD!

This should be fixed in the google-dataflow 0.4.4 SDK release with https://github.com/apache/incubator-beam/pull/1485

I resolved this problem by encapsulating the body of my main script within a run() method and invoking run().
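A minimal sketch of that workaround, based on the repro script above: since save_main_session=True pickles every global in __main__, keeping the Pipeline inside a function keeps it out of the pickled session.
import apache_beam as beam
from apache_beam.utils.options import PipelineOptions
from apache_beam.utils.options import SetupOptions
from apache_beam.utils.options import StandardOptions

def run():
    # The Pipeline is now a local variable of run(), not a global of __main__,
    # so it is not swept up when the main session is pickled.
    pipeline_options = PipelineOptions()
    pipeline_options.view_as(StandardOptions).runner = 'BlockingDataflowPipelineRunner'
    pipeline_options.view_as(SetupOptions).save_main_session = True
    p = beam.Pipeline(options=pipeline_options)
    p.run()

if __name__ == "__main__":
    run()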

Related

Getting error when adding package to Spacy pipe

I'm getting an error while adding a spaCy-compatible extension, med7, to the pipeline. I've included reproducible code below.
!pip install -U https://med7.s3.eu-west-2.amazonaws.com/en_core_med7_lg.tar.gz
import spacy
import en_core_med7_lg
from spacy.lang.en import English
med7 = en_core_med7_lg.load()
# Create the nlp object
nlp2 = English()
nlp2.add_pipe(med7)
# Process a text
doc = nlp2("This is a sentence.")
The error I get is
Argument 'string' has incorrect type (expected str, got spacy.tokens.doc.Doc)
I realized I was having this problem because I don't understand the difference between some components of spaCy. For instance, in the Negex extension package, loading the pipeline is done with the Negex command:
negex = Negex(nlp, ent_types=["PERSON","ORG"])
nlp.add_pipe(negex, last=True)
I don't understand the difference between Negex and en_core_med7_lg.load(). For some reason, when I add med7 to the pipeline, it causes this error. I'm new to spaCy and would appreciate an explanation so that I can learn. Please let me know if I can make this question any clearer. Thanks!
med7 is already the loaded pipeline. Run:
doc = med7("This is a sentence.")
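For context: en_core_med7_lg.load() returns a full spaCy Language pipeline, whereas something like Negex is a single component that you add to an existing pipeline with add_pipe. A minimal sketch (the example sentence is illustrative):
import en_core_med7_lg

# med7 is a complete Language object, so call it on text directly
# instead of adding it to another pipeline with add_pipe().
med7 = en_core_med7_lg.load()
doc = med7("The patient was prescribed 40mg of citalopram daily.")

for ent in doc.ents:
    print(ent.text, ent.label_)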

Service __len__ not found Unexpected error, recovered safely

python3.8
My code:
from googleads import adwords
def execute_request():
    adwords_client = adwords.AdWordsClient.LoadFromStorage(path="google_general/googleads.yaml")
    campaign_service = adwords_client.GetService('CampaignService', version='v201809')
    pass

context["dict_list"] = execute_request()
Traceback:
Traceback (most recent call last):
  File "/home/michael/pycharm-community-2019.3.2/plugins/python-ce/helpers/pydev/_pydevd_bundle/pydevd_xml.py", line 282, in frame_vars_to_xml
    xml += var_to_xml(v, str(k), evaluate_full_value=eval_full_val)
  File "/home/michael/pycharm-community-2019.3.2/plugins/python-ce/helpers/pydev/_pydevd_bundle/pydevd_xml.py", line 369, in var_to_xml
    elif hasattr(v, "__len__") and not is_string(v):
  File "/home/michael/PycharmProjects/ads3/venv/lib/python3.8/site-packages/googleads/common.py", line 694, in __getattr__
    raise googleads.errors.GoogleAdsValueError('Service %s not found' % attr)
googleads.errors.GoogleAdsValueError: Service __len__ not found
Unexpected error, recovered safely.
The logging section of my googleads.yaml:
logging:
  version: 1
  disable_existing_loggers: False
  formatters:
    default_fmt:
      format: ext://googleads.util.LOGGER_FORMAT
  handlers:
    default_handler:
      class: logging.StreamHandler
      formatter: default_fmt
      level: DEBUG
  loggers:
    # Configure root logger
    "":
      handlers: [default_handler]
      level: DEBUG
I've just started studying the API.
Namely, I'm trying to execute my first request (https://developers.google.com/adwords/api/docs/guides/first-api-call#make_your_first_api_call)
Could you help me with this problem, or at least help me narrow it down more precisely?
This seems to be a problem which results from the way the PyCharm debugger inspects live objects during debugging.
Specifically, it checks if a given object has the __len__ attribute/method in the code of var_to_xml, most likely to determine an appropriate representation of the object for the debugger interface (which seems to require constructing an XML representation).
googleads service objects such as your campaign_service, however, use some magic so that the defined SOAP methods can be called on them without having to hard-code all of them. The code looks like this:
def __getattr__(self, attr):
    """Support service.method() syntax."""
    if self._WsdlHasMethod(attr):
        if attr not in self._method_proxies:
            self._method_proxies[attr] = self._CreateMethod(attr)
        return self._method_proxies[attr]
    else:
        raise googleads.errors.GoogleAdsValueError('Service %s not found' % attr)
This means that the debugger's check for a potential __len__ attribute is intercepted, and because the CampaignService does not have a SOAP operation called __len__, an exception is raised.
You can validate this by running your snippet in the regular way (i.e. not debugging it) and checking if that works.
An actual fix would seem to require either that PyCharm's debugger change the way it inspects objects (not calling hasattr(v, "__len__")), or that googleads change the way it implements __getattr__, for example by implementing a __len__ method that simply raises AttributeError.
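As a sketch of that second option (illustrative only, not the actual googleads code): the quoted __getattr__ above, with a guard added so that dunder lookups such as the debugger's hasattr(v, "__len__") fall through to AttributeError instead of a library-specific error.
import googleads.errors

class DebuggerTolerantServiceProxy(object):
    """Same shape as the quoted googleads __getattr__, plus a dunder guard;
    _WsdlHasMethod, _method_proxies and _CreateMethod are the helpers from
    the original class."""

    def __getattr__(self, attr):
        # Attributes probed by tools (e.g. hasattr(obj, "__len__")) should
        # raise AttributeError rather than GoogleAdsValueError.
        if attr.startswith("__") and attr.endswith("__"):
            raise AttributeError(attr)
        if self._WsdlHasMethod(attr):
            if attr not in self._method_proxies:
                self._method_proxies[attr] = self._CreateMethod(attr)
            return self._method_proxies[attr]
        raise googleads.errors.GoogleAdsValueError('Service %s not found' % attr)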

"would create a cyclic reference" when creating CodePipeline with deploy action in other stack

I want to have my deployment resources in a different stack than my application resources. I have a stack with an auto scaling group that I reference from a deployment group in a second stack. The AWS CDK can synth/deploy both stacks until I add the deploy stage to the pipeline (the stage where I assigned the ASG); then I get a "cyclic reference" error.
I created a Git repo with a CDK example app (https://github.com/codyfy/aws-cdk-pipeline-cyclic-reference-error) that throws the error.
Here is the code snippet where I create the deployment resources and add them to the pipeline (the error only happens after the last line is added):
deploy_application = codedeploy.ServerApplication(self, "CodeDeployApplication", application_name="application")
deployment_group = codedeploy.ServerDeploymentGroup(self, "DeploymentGroup",
    application=deploy_application,
    auto_scaling_groups=[ props ]
)
print(deployment_group)
deploy_action = codepipeline_actions.CodeDeployServerDeployAction(
    action_name="DeployToASG",
    input=source_output,
    deployment_group=deployment_group
)
pipeline.add_stage(stage_name="Deploy", actions=[ deploy_action ])  # the deployment group can be created but not assigned to the pipeline
where props is the ASG passed as a parameter during construct creation:
def __init__(self, scope: core.Construct, id: str, props: autoscaling.AutoScalingGroup, **kwargs) -> None:
    super().__init__(scope, id, **kwargs)
This is the error message cdk synth is throwing:
Error: 'pipeline-stack' depends on 'asg-stack' (dependency added using stack.addDependency(), pipeline-stack -> asg-stack/AutoScalingGroup/InstanceRole/Resource.Arn, pipeline-stack -> asg-stack/AutoScalingGroup/ASG.Ref). Adding this dependency (asg-stack -> pipeline-stack/CodePipeline/ArtifactsBucket/Resource.Arn) would create a cyclic reference.
    at Stack._addAssemblyDependency (/tmp/jsii-kernel-vBZ69a/node_modules/@aws-cdk/core/lib/stack.js:443:19)
    at Object.addDependency (/tmp/jsii-kernel-vBZ69a/node_modules/@aws-cdk/core/lib/deps.js:39:24)
    at Stack.addDependency (/tmp/jsii-kernel-vBZ69a/node_modules/@aws-cdk/core/lib/stack.js:183:16)
    at Stack.prepareCrossReference (/tmp/jsii-kernel-vBZ69a/node_modules/@aws-cdk/core/lib/stack.js:670:28)
    at Stack.prepare (/tmp/jsii-kernel-vBZ69a/node_modules/@aws-cdk/core/lib/stack.js:541:43)
    at Function.prepare (/tmp/jsii-kernel-vBZ69a/node_modules/@aws-cdk/core/lib/construct.js:87:27)
    at Function.synth (/tmp/jsii-kernel-vBZ69a/node_modules/@aws-cdk/core/lib/construct.js:51:14)
    at App.synth (/tmp/jsii-kernel-vBZ69a/node_modules/@aws-cdk/core/lib/app.js:71:52)
    at /home/hornung/dev/tests/cdk-pipeline-mwe/.env/lib/python3.6/site-packages/jsii/_embedded/jsii/jsii-runtime.js:7588:51
    at Kernel._wrapSandboxCode (/home/hornung/dev/tests/cdk-pipeline-mwe/.env/lib/python3.6/site-packages/jsii/_embedded/jsii/jsii-runtime.js:8224:19)
    at /home/hornung/dev/tests/cdk-pipeline-mwe/.env/lib/python3.6/site-packages/jsii/_embedded/jsii/jsii-runtime.js:7588:25
    at Kernel._ensureSync (/home/hornung/dev/tests/cdk-pipeline-mwe/.env/lib/python3.6/site-packages/jsii/_embedded/jsii/jsii-runtime.js:8197:20)
    at Kernel.invoke (/home/hornung/dev/tests/cdk-pipeline-mwe/.env/lib/python3.6/site-packages/jsii/_embedded/jsii/jsii-runtime.js:7587:26)
    at KernelHost.processRequest (/home/hornung/dev/tests/cdk-pipeline-mwe/.env/lib/python3.6/site-packages/jsii/_embedded/jsii/jsii-runtime.js:7296:28)
    at KernelHost.run (/home/hornung/dev/tests/cdk-pipeline-mwe/.env/lib/python3.6/site-packages/jsii/_embedded/jsii/jsii-runtime.js:7236:14)
    at Immediate._onImmediate (/home/hornung/dev/tests/cdk-pipeline-mwe/.env/lib/python3.6/site-packages/jsii/_embedded/jsii/jsii-runtime.js:7239:37)
    at processImmediate (internal/timers.js:439:21)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "app.py", line 14, in <module>
    app.synth()
  File "/home/hornung/dev/tests/cdk-pipeline-mwe/.env/lib/python3.6/site-packages/aws_cdk/core/__init__.py", line 3463, in synth
    return jsii.invoke(self, "synth", [])
  File "/home/hornung/dev/tests/cdk-pipeline-mwe/.env/lib/python3.6/site-packages/jsii/_kernel/__init__.py", line 113, in wrapped
    return _recursize_dereference(kernel, fn(kernel, *args, **kwargs))
  File "/home/hornung/dev/tests/cdk-pipeline-mwe/.env/lib/python3.6/site-packages/jsii/_kernel/__init__.py", line 288, in invoke
    args=_make_reference_for_native(self, args),
  File "/home/hornung/dev/tests/cdk-pipeline-mwe/.env/lib/python3.6/site-packages/jsii/_kernel/providers/process.py", line 348, in invoke
    return self._process.send(request, InvokeResponse)
  File "/home/hornung/dev/tests/cdk-pipeline-mwe/.env/lib/python3.6/site-packages/jsii/_kernel/providers/process.py", line 318, in send
    raise JSIIError(resp.error) from JavaScriptError(resp.stack)
jsii.errors.JSIIError: 'pipeline-stack' depends on 'asg-stack' (dependency added using stack.addDependency(), pipeline-stack -> asg-stack/AutoScalingGroup/InstanceRole/Resource.Arn, pipeline-stack -> asg-stack/AutoScalingGroup/ASG.Ref). Adding this dependency (asg-stack -> pipeline-stack/CodePipeline/ArtifactsBucket/Resource.Arn) would create a cyclic reference.
Has someone encountered a similar issue with resources in two different stacks?
Thanks for any help and advice. Have a nice day ;-)

Flask: How do I successfully use multiprocessing (not multithreading)?

I am using a Flask server to handle requests for some image-processing tasks.
The processing relies extensively on OpenCV and I would now like to trivially-parallelize some of the slower steps.
I have a preference for multiprocessing rather than multithreading (please assume the former in your answers).
But multiprocessing with opencv is apparently broken (I am on Python 2.7 + macOS): https://github.com/opencv/opencv/issues/5150
One solution (see https://github.com/opencv/opencv/issues/5150#issuecomment-400727184) is to use the excellent Loky (https://github.com/tomMoral/loky)
[Question: What other working solutions exist apart from concurrent.futures, loky, joblib..?]
But Loky leads me to the following stacktrace:
    a,b = f.result()
  File "/anaconda2/lib/python2.7/site-packages/loky/_base.py", line 433, in result
    return self.__get_result()
  File "/anaconda2/lib/python2.7/site-packages/loky/_base.py", line 381, in __get_result
    raise self._exception
BrokenProcessPool: A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable.
This was caused directly by
'''
Traceback (most recent call last):
  File "/anaconda2/lib/python2.7/site-packages/loky/process_executor.py", line 391, in _process_worker
    call_item = call_queue.get(block=True, timeout=timeout)
  File "/anaconda2/lib/python2.7/multiprocessing/queues.py", line 135, in get
    res = self._recv()
  File "myfile.py", line 44, in <module>
    app.config['EXECUTOR_MAX_WORKERS'] = 5
  File "/anaconda2/lib/python2.7/site-packages/werkzeug/local.py", line 348, in __getattr__
    return getattr(self._get_current_object(), name)
  File "/anaconda2/lib/python2.7/site-packages/werkzeug/local.py", line 307, in _get_current_object
    return self.__local()
  File "/anaconda2/lib/python2.7/site-packages/flask/globals.py", line 52, in _find_app
    raise RuntimeError(_app_ctx_err_msg)
RuntimeError: Working outside of application context.
This typically means that you attempted to use functionality that needed
to interface with the current application object in some way. To solve
this, set up an application context with app.app_context(). See the
documentation for more information.
'''
The functions to be parallelized are not being called from app/main.py, but rather from an arbitrarily-deep submodule.
I have tried the similarly-useful-looking https://flask-executor.readthedocs.io/en/latest, also so far in vain.
So the question is:
How can I safely pass the application context through to the workers or otherwise get multiprocessing working (without recourse to multithreading)?
I can build out this question if you need more information. Many thanks as ever.
Related resources:
Copy flask request/app context to another process
Flask Multiprocessing
Update:
Non-opencv calls work fine with flask-executor (no Loky) :)
The problem comes when trying to call an opencv function like knnMatch.
If Loky fixes the opencv issue, I wonder if it can be made to work with flask-executor (not for me, so far).
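For reference, here is a minimal sketch of that working non-OpenCV case (the route and task function names are made up): flask-executor configured with a process pool handles a pure-Python task; the trouble only appears once OpenCV objects enter the picture.
from flask import Flask
from flask_executor import Executor

app = Flask(__name__)
app.config['EXECUTOR_TYPE'] = 'process'       # processes, not threads
app.config['EXECUTOR_MAX_WORKERS'] = 5

executor = Executor(app)

def slow_pure_python_task(n):
    # No OpenCV involved, so the task and its arguments pickle cleanly.
    return sum(i * i for i in range(n))

@app.route('/compute/<int:n>')
def compute(n):
    future = executor.submit(slow_pure_python_task, n)
    return str(future.result())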

Not able to save pyspark iforest model using pyspark

Using iforest as described here: https://github.com/titicaca/spark-iforest
But model.save() is throwing an exception:
Exception:
scala.NotImplementedError: The default jsonEncode only supports string, vector and matrix. org.apache.spark.ml.param.Param must override jsonEncode for java.lang.Double.
I followed the code snippet in the "Python API" section of the linked Git page.
from pyspark.ml.feature import VectorAssembler
import os
import tempfile
from pyspark_iforest.ml.iforest import *

# Input dataframe df schema: col_1: integer, col_2: integer, col_3: integer
# in_cols is the list of input column names, e.g. ["col_1", "col_2", "col_3"]
assembler = VectorAssembler(inputCols=in_cols, outputCol="features")
featurized = assembler.transform(df)
iforest = IForest(contamination=0.5, maxDepth=2)
model = iforest.fit(featurized)  # fit on the assembled "features" column
model.save("model_path")
model.save() should be able to save model files.
Below is the output dataframe I'm getting after executing model.transform(df):
col_1:integer
col_2:integer
col_3:integer
features:udt
anomalyScore:double
prediction:double
I have just fixed this issue. It was caused by an incorrect param type. You can check out the latest code in the master branch and try it again.
