Cloud Dataflow: java.lang.IllegalStateException: no evaluator registered for GroupedValues - google-cloud-dataflow

I'm getting the following exception when running the pipeline locally. There is no exception when submitting for cloud execution.
Thanks,
Genady
INFO: Executing pipeline using the DirectPipelineRunner.
Exception in thread "main" java.lang.IllegalStateException: no evaluator registered for GroupedValues [GroupedValues]
at com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner$Evaluator.visitTransform(DirectPipelineRunner.java:606)
at com.google.cloud.dataflow.sdk.runners.TransformTreeNode.visit(TransformTreeNode.java:200)
at com.google.cloud.dataflow.sdk.runners.TransformTreeNode.visit(TransformTreeNode.java:196)
at com.google.cloud.dataflow.sdk.runners.TransformHierarchy.visit(TransformHierarchy.java:109)
at com.google.cloud.dataflow.sdk.Pipeline.traverseTopologically(Pipeline.java:204)
at com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner$Evaluator.run(DirectPipelineRunner.java:583)
at com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner.run(DirectPipelineRunner.java:327)
at com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner.run(DirectPipelineRunner.java:70)
at app.Main.main(Main.java:124)
The code outline is basically this:
PCollection<KV<MyKey, Iterable<MyValue>>> groupedByMyKey = ...
PCollection<KV<MyKey, MyAggregated>> aggregated = groupedByMyKey.apply(
Combine.<MyKey, MyValue, MyAggregated>groupedValues(new Aggregator()));
Aggregator class extends CombineFn<MyValue, List<MyValue>, MyAggregated>

Can you share a code snippet that triggers this? GroupedValues is a PTransform that is often used within various combining transforms, so it might be from using something like Min, Max, etc.
The error means that the DirectPipelineRunner doesn't know how to evaluate a GroupedValues. However, that's unexpected, since that should have been expanded into a ParDo before execution.

I found the reason to this behaviour
I was using a command line argument to run it in remote mode (--runner=BlockingDataflowPipelineRunner) and then forced it to run locally with
PipelineRunner<?> runner = DirectPipelineRunner.fromOptions(options);
runner.run(p);
After removing these lines and just using the --runner=DirectPipelineRunner argument it worked as expected.

Related

Segmentation Fault after creating PyQGIS standalone application the second time even after exitQgis() - Debian

I am trying to create several .qgs project files to be served at a later time by an instance of qgis Server.
To this end I need to start a new PyQGIS application several times upon request. The application runs smoothly the first time it is called, but if I try to run it a second time I get a Segmentation Fault error.
Here is an example code that triggers the problem:
from qgis.core import QgsApplication
import os
os.environ["QT_QPA_PLATFORM"] = "offscreen"
for i in range(2):
print(f'Iteration number {i}')
print('\tSet prefix path')
QgsApplication.setPrefixPath('/usr', False)
print('\tInstantiating application')
qgs = QgsApplication([], False)
print('\tInitializing application')
qgs.initQgis()
print('\tExiting')
qgs.exitQgis()
When executed, I get this output:
Iteration number 0
Set prefix path
Instantiating application
QStandardPaths: XDG_RUNTIME_DIR not set, defaulting to '/tmp/runtime-root'
Initializing application
Exiting
Iteration number 1
Set prefix path
Instantiating application
Initializing application
Segmentation fault
Something similar happens if I enclose the content of the loop inside a function and call it multiple times. In this case the segmentation fault happens upon calling qgs.exitQgis() the second time (and any vector or raster layers added before that would be invalid).
Maybe the problem is that for some reason qgs.exitQgis() is not really cleaning up before exit?
The code is running on a Python:3.9 docker container that comes with Debian Bullseye.
Qgis has been installed following the instruction from the QGIS docs:
https://qgis.org/en/site/forusers/alldownloads.html#debian-ubuntu. QGIS version is QGIS 3.22.3-Białowieża 'Białowieża'.
To prevent an import error when loading qgis.core I had to set the environment variable PYTHONPATH = /usr/lib/python3/dist-packages/.
UPDATE: Following a suggestion of one comment found on this post:
https://gis.stackexchange.com/questions/250933/using-exitqgis-in-pyqgis,
I substituted qgs.exitQgis() with qgs.exit() and now the app can be instantiated again any number of times without crashing.
It is still not clear what causes the segmentation fault, but at least I found this workaround.
It seems like the problem was fixed in QGIS ver. 3.24 Tisler. Now qgs.exitQgis() can be called in a loop without triggering a segmentation fault.

Beam/Dataflow: No session file found: /var/opt/google/dataflow/pickled_main_session

When using Apache Beam (GCP Dataflow) I see the following warning in worker logs:
No session file found: /var/opt/google/dataflow/pickled_main_session.
Functions defined in __main__ (interactive session) may fail.
My Dataflow job seems to be fine regardless, but I'm wondering what this warning is all about.
I have seen the following in some sample code (which I am NOT currently doing):
pipeline_options.view_as(SetupOptions).save_main_session = True
where pipeline_options is the main way of specifying options for the Beam/Dataflow pipeline, as in the following later in the code:
with beam.Pipeline(options=pipeline_options) as p:
# actual pipeline code here
I am curious if the two are related. Does the presence of the warning mean I should always be saving the main session? Are these two things related? Unrelated?
You should be able to safely ignore this warning. No need to set save_main_session if it's not required for your pipeline.

Reactor flux throws illegalArgumentException - suspecting due to bufferTimeout

I have a spring application which builds a reactive pipeline as follows:
buildPipeline(). // returns a flux based on changeStreamEvents or Kafka receives
.bufferTimeout( capacity, Duration.ofSeconds(1))
. flatMap( r -> {
element x = r.get(r.size()-1)
//some processing on element and the batch obtained
})
.doOnError( e-> log.info("error occurred:" + e.toString())
.subscribe()
However, I see my application intermediately throwing the below error -
java.lang.illegalArgumentException:3.9 While the Subscription is not cancelled, Subscription.request(long n) MUST throw a java.lang.illegalArgumentException if argument <= 0
at com.mongodb.reactivestreams.client.internal.ObservableToPublisher$1
$1.request(ObservableToPublisher.java:43)
at reactor.core.publisher.FluxMap$MapSubscriber.request(FluxMap.java:155)
at reactor.core.publisher.FluxBufferTimeout
$BufferimeoutSubscriber.requestMore(FluxBufferTimeout.java:317)
I'm not able to determine what is wrong, and why the stream is terminating with this error.
Any help would be highly appreciated.
The application started throwing this error after I added "bufferTimeout" to add a feature of batching. Before that, I had never encountered this exception.
Not sure how to replicate the issue as well, as it is not occurring locally or in UAT, but only in production environment of the application.
Any leads would be helpful.
Thanks!
Try adding a onBackPressureBuffer(), so that in case of low demand this operator buffers the requests, and emits items in a controlled way.

Dataflow pipeline is dropping events during processing when using outputWithTimestamp

I have a Cloud Dataflow pipeline in which I alter the original timestamp for the event in order to simulate real world scenarios of events arriving late. However, it appears I'm dropping some percentage of my events on each run of the pipeline. Inside my DoFn I use the following code to change the timestamp:
Instant newTimestamp = originalTimestamp.minus(Duration.standardMinutes(RANDOM.nextInt(15)));
c.outputWithTimestamp(KV.of(Integer.toString(RANDOM.nextInt(100)), element), newTimestamp);
The problem is most likely caused by your DoFn step outputting a timestamp that is earlier than the timestamp that was received by the processing step minus the allowed timestamp skew. The exception that would be thrown can be found here in the code:
https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/sdk/src/main/java/com/google/cloud/dataflow/sdk/util/DoFnRunnerBase.java#L493
This behavior is documented with regard to using outputWithTimestamp here:
https://cloud.google.com/dataflow/java-sdk/JavaDoc/com/google/cloud/dataflow/sdk/transforms/DoFn.Context#outputWithTimestamp-OutputT-org.joda.time.Instant-
While you could override the getAllowedTimestampSkew function, is is also documented that this might cause unpredictable issues with the watermark calculations so it should only be used without windowing/grouping.
https://cloud.google.com/dataflow/java-sdk/JavaDoc/com/google/cloud/dataflow/sdk/transforms/DoFn#getAllowedTimestampSkew--

How to implement a custom file parser in Google DataFlow for a Google Cloud Storage file

I have a custom file format in Google Cloud Storage and I want to read it from Google DataFlow.
I've implemented a Source and a Reader by subclassing FileBasedReader, but then I realized it didn't support reading from Google Cloud Storage (while FileBasedSink actually does...) so I'm not sure what's the best idea to solve that here...
I tried to subclass TextIO but I couldn't reach an end with that as it doesn't seem to be designed to be subclassed.
Any good idea on how to deal with that?
Thanks.
Update to reflect on the comments
File pattern used: gs://mybucket/my.json
Implemented the Source class from FileBasedSource:
MessageSource<T> extends FileBasedSource<T>
Implemented the Reader class (what I really care about here) from FileBasedReader:
MessageReader<T> extends FileBasedReader<T>
Process for reading is:
MySource source = // instantiate source
Pipeline p = Pipeline.create(options);
p.apply(TextIO.Read.from(options.getSource()).named("ReadFileData"))
.apply(ParDo.of(new DoFn<String, String>() {
And the getSource() comes from this command line parameter (verified correct):
--source=gs://${BUCKET_NAME}/my.json \
Am I missing anything?
2nd UPDATE
While running source.getEstimatedSizeBytes(options) it tells me no handler found?
java.io.IOException: Unable to find handler for gs://mybucket/my.json
at com.google.cloud.dataflow.sdk.util.IOChannelUtils.getFactory(IOChannelUtils.java:186)
at com.google.cloud.dataflow.sdk.io.FileBasedSource.getEstimatedSizeBytes(FileBasedSource.java:182)
at com.etc.TrackingDataPipeline.main(TrackingDataPipeline.java:66)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:293)
at java.lang.Thread.run(Thread.java:745)
I thought the FileBasedSource was supposed to handle GCS?
From the stack trace you show in "2nd Update", it looks like you have called getEstimatedSizeBytes directly from your main() method. This is expected to lead to the error you see.
The standard URL scheme handlers are registered when a pipeline runner is constructed. In your example code, that would happen when you call Pipeline.create(options) (this calls PipelineRunner.fromOptions(options), where the standard handlers are registered).
If you want to have the standard URL schemes registered in a context other than running a pipeline, you can explicitly call IOChannelUtils.registerStandardIOFactories(). I should note that this is not a supported API, but reaching a bit "under the hood". As such, it may change at any time.

Resources