Can Dataflow sideInput be updated per window by reading a gcs bucket? - google-cloud-dataflow

I’m currently creating a PCollectionView by reading filtering information from a GCS bucket and passing it as a side input to different stages of my pipeline in order to filter the output. If the file in the GCS bucket changes, I want the currently running pipeline to use the new filter info. Is there a way to update this PCollectionView on each new window of data if my filter changes? I thought I could do it in a startBundle, but I can’t figure out how, or whether it’s possible. Could you give an example if it is?
PCollectionView<Map<String, TagObject>> tagMapView =
    pipeline.apply(TextIO.Read.named("TagListTextRead")
            .from("gs://tag-list-bucket/tag-list.json"))
        .apply(ParDo.named("TagsToTagMap").of(new Tags.BuildTagListMapFn()))
        .apply("MakeTagMapView", View.asSingleton());

PCollection<String> windowedData =
    pipeline.apply(PubsubIO.Read.topic("myTopic"))
        .apply(Window.<String>into(
            SlidingWindows.of(Duration.standardMinutes(15))
                .every(Duration.standardSeconds(31))));

PCollection<MY_DATA> lineData = windowedData
    .apply(ParDo.named("ExtractJsonObject")
        .withSideInputs(tagMapView)
        .of(new ExtractJsonObjectFn()));

You probably want something like "use an at-most-1-minute-old version of the filter as a side input" (since in theory the file can change frequently, unpredictably, and independently from your pipeline, there's really no way to completely synchronize changes of the file with the behavior of the pipeline).
Here's a (granted, rather clumsy) solution I was able to come up with. It relies on the fact that side inputs are implicitly also keyed by window. In this solution, we're going to create a side input windowed into 1-minute fixed windows, where each window contains a single value of the tag map, derived from the filter file as of some moment inside that window.
PCollection<Long> ticks = p
    // Produce 1 "tick" per second
    .apply(CountingInput.unbounded().withRate(1, Duration.standardSeconds(1)))
    // Window the ticks into 1-minute windows
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(1))))
    // Use an arbitrary per-window combiner to reduce to 1 element per window
    .apply(Count.globally());

// Produce a collection of tag maps, 1 per each 1-minute window
PCollectionView<TagMap> tagMapView = ticks
    .apply(MapElements.via((Long ignored) -> {
        ... manually read the json file as a TagMap ...
    }))
    .apply(View.asSingleton());
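The elided read step could also be a ParDo in place of the MapElements above, since it does I/O. Here's a minimal sketch, assuming the Dataflow 1.x GcsUtil API for opening GCS files; TagMap and parseTagMap are hypothetical placeholders for your own type and JSON parsing:
// Sketch only: uses GcsOptions/GcsUtil/GcsPath from the Dataflow 1.x SDK;
// TagMap and parseTagMap are hypothetical placeholders.
static class ReadTagMapFn extends DoFn<Long, TagMap> {
  @Override
  public void processElement(ProcessContext c) throws IOException {
    GcsUtil gcsUtil = c.getPipelineOptions().as(GcsOptions.class).getGcsUtil();
    GcsPath path = GcsPath.fromUri("gs://tag-list-bucket/tag-list.json");
    try (InputStream in = Channels.newInputStream(gcsUtil.open(path))) {
      c.output(parseTagMap(in)); // hypothetical JSON -> TagMap parser
    }
  }
}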
This pattern (joining against slowly changing external data as a side input) comes up repeatedly, and the solution I'm proposing here is far from perfect; I wish we had better support for this in the programming model. I've filed a BEAM JIRA issue to track this.

Related

Dask task stream does not display a custom task given to `map_blocks`

I have written a function named nd_rmmeh and passed it to dask.array.Array.map_blocks.
The task runs and completes normally but does not show up in the task stream on the dashboard.
This is despite the fact that it does show in the task "graph" and task "progress" panels of the dashboard.
I moused over the boxes and did not find any nd_rmmeh labels.
The timing of nd_rmmeh does coincide with when the empty (white) sections of the task stream appear.
However, I couldn't tell from the dashboard how it is actually run.
I am interested in checking whether nd_rmmeh releases the GIL enough to be run in threads instead of processes.
I have a suspicion that it doesn't, judging from the htop task manager.
For context, here is how I call map_blocks:
da.copy(
    deep=False,
    data=da.data.map_blocks(
        nd_rmmeh,
        dtype=np.float,
        meta=da.data,
        # the rest is some keyword arguments to nd_rmmeh ... omitted
    ),
)
I cannot recall why I use dask.array's map_blocks instead of xarray.map_blocks,
but it feels like that shouldn't matter.
So the question is:
why doesn't the task stream display the custom function, and what can be done to fix it?

Is there a way to inject a config into a ParDo without sideInput?

I have a ParDo that uses state and timers, with a periodically updating PCollectionView as a side input to that ParDo; Google Dataflow will throw an exception that timers are not allowed in such a case. Is there another way to feed config data to the ParDo without a side input? Essentially, the side input was a map of config data that was read from Datastore about every 24 hours.
I am currently trying to see if I can create a ParDo before the one with state and timers to periodically update the config, but I don't see how we can access that map from within the next ParDo. Any suggestions?
Note: This pipeline is running in streaming mode with a global window, reading Pub/Sub messages as they arrive. Datastore is used to hold the data needed to decide when to output an element to a Pub/Sub topic.
Instead of using state and timers to update the side input, you can use a fixed window to periodically recompute your PCollectionView from your data source:
PCollectionView<Map<String, String>> sideInput = pipeline
    .apply(notifications)
    .apply(
        Window.<Long>into(FixedWindows.of(Duration.standardMinutes(refreshMinutes)))
            .triggering(
                Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
            .withAllowedLateness(Duration.ZERO)
            .discardingFiredPanes())
    .apply( /* query data source */ )
    .apply(View.<Map<String, String>>asSingleton());
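The ParDo that needs the config then reads the freshest pane of the view through c.sideInput in the usual way. A minimal sketch (the String element type and the lookup logic are placeholder assumptions):
// Sketch only: element type and lookup logic are placeholders.
PCollection<String> output = input
    .apply(ParDo.of(new DoFn<String, String>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        // Fetches the latest fired pane of the side input for this window.
        Map<String, String> config = c.sideInput(sideInput);
        c.output(config.getOrDefault(c.element(), c.element()));
      }
    }).withSideInputs(sideInput));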

About the Boost.Beast WebSocket API: async_close, async_write

I have read the official documentation, and I'm confused because it seems to conflict with itself.
Here is the first snippet, picked from the official docs:
However, this code is well-formed:
ws.async_read(b, [](error_code, std::size_t){});
ws.async_write(b.data(), [](error_code, std::size_t){});
ws.async_ping({}, {});
ws.async_close({}, {});
and here is another snippet:
This operation is implemented in terms of one or more calls to the next layer's async_write_some functions, and is known as a composed operation. The program must ensure that the stream performs no other write operations (such as websocket::stream::async_write, websocket::stream::async_write_some, or websocket::stream::async_close).
So, which one should I trust?
This part is correct:
https://www.boost.org/doc/libs/1_67_0/libs/beast/doc/html/beast/using_websocket/notes.html#beast.using_websocket.notes.thread_safety
The other text needs to be updated.

How to count total number of rows in a file using google dataflow

I would like to know if there is a way to find out the total number of rows in a file using Google Dataflow. Any code sample or pointer would be a great help. Basically, I have a method such as
int getCount(String fileName) {}
so the above method will return the total count of rows, and its implementation will be Dataflow code.
Thanks
Seems like your use case is one that doesn't require distributed processing, because the file is compressed and hence cannot be read in parallel. However, you may still find it useful to use the Dataflow APIs for the sake of their easy access to GCS and automatic decompression.
Since you also want to get the result out of your pipeline as an actual Java object, you need to use the direct runner, which runs in-process without talking to the Dataflow service or doing any distributed processing; in return, it provides the ability to extract the contents of a PCollection into Java objects:
Something like this:
PipelineOptions options = ...;
DirectPipelineRunner runner = DirectPipelineRunner.fromOptions(options);
Pipeline p = Pipeline.create(options);
PCollection<Long> countPC =
    p.apply(TextIO.Read.from("gs://..."))
     .apply(Count.<String>globally());
DirectPipelineRunner.EvaluationResults results = runner.run(p);
long count = results.getPCollection(countPC).get(0);
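For comparison, if you don't need GCS access or automatic decompression at all, the same count can be computed in plain Java without a pipeline. A minimal sketch, assuming a gzip-compressed text file on the local filesystem (reading straight from GCS would need the GCS client library):
// Plain-Java baseline (no Dataflow): count lines in a gzip-compressed file.
// Uses java.util.zip.GZIPInputStream; assumes a local path.
long getCount(String fileName) throws IOException {
  try (BufferedReader reader = new BufferedReader(new InputStreamReader(
      new GZIPInputStream(new FileInputStream(fileName)), StandardCharsets.UTF_8))) {
    return reader.lines().count();
  }
}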

Conditional OCR rotation on the image or Page in KOFAX

We have two sources of input to create a Batch: the first is Folder import and the second is Email import.
I need to add a condition where, if the source of the image is Email import, it should not rotate the image, and likewise, if the source is Folder import, it should rotate the image.
I have added a script for this in KTM.
It shows the proper message for the source of the image, but it does not stop the rotation of the image.
Check the script below for reference.
Public Function setRotationRule(ByVal pXDoc As CASCADELib.CscXDocument) As String
   Dim i As Integer
   Dim FullPath As String
   Dim PathArry() As String
   Dim xfolder As CscXFolder
   Set xfolder = pXDoc.ParentFolder
   While Not xfolder.IsRootFolder
      Set xfolder = xfolder.ParentFolder
   Wend
   'Added for KTM script testing
   FullPath = "F:\Emailmport\dilipnikam#gmail.com_09-01-2014_10-02-37\dfdsg.pdf"
   If xfolder.XValues.ItemExists("AC_FIELD_OriginalFileName") Then
      FullPath = xfolder.XValues.ItemByName("AC_FIELD_OriginalFileName").Value
   End If
   PathArry() = Split(FullPath, "\")
   MsgBox(PathArry(1))
   If Not PathArry(1) = "EmailImport" Then
      For i = 0 To pXDoc.CDoc.Pages.Count - 1
         pXDoc.CDoc.Pages(i).Rotation = Csc_RT_NoRotation
      Next i
   End If
End Function
The KTM Scripting Help has a misleading topic named "Dynamically Suppress Orientation Detection for Full Page OCR" where it shows setting Csc_RT_NoRotation from the Document_AfterClassifyXDoc event.
The reason I think this is misleading is that rotation may already have occurred before that event, and thus setting the property has no effect. This can happen if layout classification has run, or if OCR has run (which can be triggered by content classification, or if any project-level locators need OCR). The sample in that topic does suggest that it is only for use when classifiers are not used, but it could be explained better.
The code you've shown would be best called from the event Document_BeforeProcessXDoc. This will run before the entire classify phase (including project-level locators), ensuring that rotation could not have already occurred.
Of course, also make sure this isn't because of a typo or anything else preventing the code from actually executing, as mentioned in the comments.
