Dask task stream does not display a custom task given to `map_blocks`

I have written a function named `nd_rmmeh` and passed it to `dask.array.Array.map_blocks`.
The task runs and completes normally but does not show up in the task stream on the dashboard.
This is despite the fact that it does show in the task "graph" and task "progress" panes, as seen in the picture below.
I moused over the boxes and did not find any `nd_rmmeh` labels.
The timing of `nd_rmmeh` does coincide with when empty (white) sections appear in the task stream.
However, I couldn't see how it is actually run from the dashboard.
I am interested in checking whether `nd_rmmeh` releases the GIL enough to be run in threads instead of processes.
I have a suspicion that it doesn't, from looking at the htop task manager.
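Here is the kind of probe I have in mind for the GIL question (a rough sketch; the dummy kernel below is only a stand-in for the real `nd_rmmeh`): time the function serially versus on a thread pool, since near-linear speedup from threads implies the GIL is being released.
import time
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def nd_rmmeh(block):
    # stand-in for the real function: any NumPy-heavy kernel works here
    return np.median(np.stack([block] * 5), axis=0)

blocks = [np.random.rand(2000, 2000) for _ in range(4)]

t0 = time.perf_counter()
for b in blocks:
    nd_rmmeh(b)
serial = time.perf_counter() - t0

t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(nd_rmmeh, blocks))
threaded = time.perf_counter() - t0

# threaded much smaller than serial -> the GIL is being released
# threaded roughly equal to serial  -> the function holds the GIL
print(f"serial {serial:.2f}s, threaded {threaded:.2f}s")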
For context, here is how I call map_blocks:
da.copy(
    deep=False,
    data=da.data.map_blocks(
        nd_rmmeh,
        dtype=np.float64,  # np.float is deprecated in NumPy; float64 is what it aliased
        meta=da.data,
        # the rest is some keyword arguments to nd_rmmeh ... omitted
    ),
)
I cannot recall why I use `dask.array.Array.map_blocks` instead of `xarray.map_blocks`,
but it feels like that shouldn't matter.
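For reference, I believe the xarray route would look roughly like this (an untested sketch; note that `xarray.map_blocks` hands the function a DataArray rather than a raw NumPy block, so `nd_rmmeh` might need adapting):
import xarray as xr

# template=da promises the output matches da's shape, dtype and coordinates
result = xr.map_blocks(nd_rmmeh, da, template=da)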
So the question is:
why doesn't the task stream display the custom function, and what could be done to fix it?
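One thing I still want to rule out, based on my reading of the dask optimization docs (so treat this as an assumption, not a confirmed cause): graph optimization can fuse a `map_blocks` task into its neighbours, in which case the task-stream rectangles would carry the fused task's label instead of `nd_rmmeh`. Something like the following should keep each task under its own name:
import dask

# disable task fusion so each map_blocks task keeps its own key and label
with dask.config.set({"optimization.fuse.active": False}):
    result = da.compute()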

Related

how to run .fsx in fsi

Exploring F# with FSharp.Charting, I thought I would start with a simple 'hello world', but it leaves me with more questions than lines of code.
#load @"..\packages\FSharp.Charting.0.90.14\FSharp.Charting.fsx"
open FSharp.Charting
let chart = Chart.Line([ for x in 0 .. 10 -> x, x*x ])
chart.ShowChart()
chart.SaveChartAs(@"C:\Temp\chart.png", ChartTypes.ChartImageFormat.Png)
This works in the interactive window in VS, but what I want to do is execute this script from the command line (using fsi.exe). I made an association of .fsx files with fsi, but when I execute the script it opens fsi and no chart is created. What do I need to do?
Short answer: add the following line at the end of your program:
System.Windows.Forms.Application.Run()
Long answer:
The chart does get created, but it immediately disappears, because your program immediately exits, right after creating the chart. This does not happen in the F# Interactive window in Visual Studio, because the F# interactive window doesn't close immediately after executing your program - it just hangs out there, waiting for you to submit more code for execution.
In order to make your program not exit immediately, you could implement some waiting mechanism, such as waiting for a set amount of time (see System.Threading.Thread.Sleep) or waiting for the user to press Enter (via stdin.ReadLine()), etc.
However, this won't actually help you, because there is a second problem: the chart is drawn via Windows Forms, which relies on the message loop running - otherwise the window can't receive messages, and so can't even paint itself.
FSI does have its own built-in event loop, which is why your program works under VS. However, if you implement a "waiting" mechanism (e.g. stdin.ReadLine()), this event loop will be blocked and won't be able to pump messages. Therefore, the only sane way to keep your program from exiting, while not interfering with the functioning of the chart window, is to start your own event loop. And this is exactly what Application.Run() does.
Saving to disk without displaying:
(in response to comment)
From what I understand, the FSharp.Charting library was intended as a quick-and-dirty way to display charts on the screen, primary use case being exploring datasets live within F# Interactive. More specifically, some key properties of the Chart object, such as ChartAreas and Series are not initialized upon chart creation, but only when it is shown on the screen (see source code), and without these properties the chart remains empty.
Short of submitting a pull request to the library, I recommend dropping down to the underlying System.Windows.Forms.DataVisualization.Charting.Chart:
open System.Windows.Forms.DataVisualization.Charting
let ch = new Chart()
ch.ChartAreas.Add( new ChartArea() )
let s = new Series( ChartType = SeriesChartType.Line )
s.Points.DataBind( [for x in 1..10 -> x, x*x], "Item1", "Item2", "" )
ch.Series.Add s;
ch.SaveImage(@"C:\Temp\chart.png", System.Drawing.Imaging.ImageFormat.Png)

Defining "global" behavior in Gulp (measuring task duration)

I'm working on moving us from ant to gulp, and as part of the effort I want to write timing stats to Graphite. We're doing this in ant as well (no idea how; beside the point anyway). My question is: I'd prefer not to have to manually add some plugin to every task we have (we have over 60), but rather to have some sort of global behavior where, for every task, a timer is started before the task runs, and when the task signals completion we push some data to Graphite (over statsd).
Can someone point me in the right direction where to hook into gulp for this? I couldn't find anything particularly useful in the docs / recipes...
We're running gulp 4.
Instead of adding timing code to your numerous tasks, you could make use of the npm gulp-duration package.
A snippet showing an example of its use:
function rebundle() {
  var uglifyTimer = duration('uglify time')
  var bundleTimer = duration('bundle time')
  return bundler.bundle()
    .pipe(source('bundle.js'))
    .pipe(bundleTimer)
    // start just before uglify receives its first file
    .once('data', uglifyTimer.start)
    .pipe(uglify())
    .pipe(uglifyTimer)
    .pipe(gulp.dest('example/'))
}
gulp-duration's duration function:
Creates a new pass-through duration stream. When this stream is
closed, it will log the amount of time since its creation to your
terminal.
This will then allow you to log the duration of the task.
Whilst this is not a global behaviour solution, at least you can specify the timing code in your gulp file, as opposed to having to modify all 60+ of your tasks.

Can Dataflow sideInput be updated per window by reading a gcs bucket?

I'm currently creating a PCollectionView by reading filtering information from a GCS bucket and passing it as a side input to different stages of my pipeline in order to filter the output. If the file in the GCS bucket changes, I want the currently running pipeline to use the new filter info. Is there a way to update this PCollectionView on each new window of data if my filter changes? I thought I could do it in a startBundle but I can't figure out how, or whether it's possible. Could you give an example if it is?
PCollectionView<Map<String, TagObject>> tagMapView =
    pipeline.apply(TextIO.Read.named("TagListTextRead")
                       .from("gs://tag-list-bucket/tag-list.json"))
            .apply(ParDo.named("TagsToTagMap").of(new Tags.BuildTagListMapFn()))
            .apply("MakeTagMapView", View.asSingleton());

PCollection<String> windowedData =
    pipeline.apply(PubsubIO.Read.topic("myTopic"))
            .apply(Window.<String>into(
                SlidingWindows.of(Duration.standardMinutes(15))
                              .every(Duration.standardSeconds(31))));

PCollection<MY_DATA> lineData = windowedData
    .apply(ParDo.named("ExtractJsonObject")
                .withSideInputs(tagMapView)
                .of(new ExtractJsonObjectFn()));
You probably want something like "use an at-most-1-minute-old version of the filter as a side input" (since in theory the file can change frequently, unpredictably, and independently from your pipeline, there's no way to completely synchronize changes of the file with the behavior of the pipeline).
Here's a (granted, rather clumsy) solution I was able to come up with. It relies on the fact that side inputs are implicitly also keyed by window. In this solution we're going to create a side input windowed into 1-minute fixed windows, where each window will contain a single value of the tag map, derived from the filter file as-of some moment inside that window.
PCollection<Long> ticks = p
    // Produce 1 "tick" per second
    .apply(CountingInput.unbounded().withRate(1, Duration.standardSeconds(1)))
    // Window the ticks into 1-minute windows
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(1))))
    // Use an arbitrary per-window combiner to reduce to 1 element per window
    .apply(Count.globally());

// Produce a collection of tag maps, 1 per each 1-minute window
PCollectionView<TagMap> tagMapView = ticks
    .apply(MapElements.via((Long ignored) -> {
        ... manually read the json file as a TagMap ...
    }))
    .apply(View.asSingleton());
This pattern (joining against slowly changing external data as a side input) comes up repeatedly, and the solution I'm proposing here is far from perfect; I wish we had better support for this in the programming model. I've filed a BEAM JIRA issue to track this.
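For what it's worth, the same ticking-side-input shape can be sketched with the Beam Python SDK's PeriodicImpulse transform (a sketch only; read_tag_map is a hypothetical stand-in for re-reading gs://tag-list-bucket/tag-list.json):
import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.transforms.periodicsequence import PeriodicImpulse

def read_tag_map(ts):
    # hypothetical helper: fetch and parse the filter file as of timestamp ts
    return {"example-tag": "example-value"}

with beam.Pipeline() as p:
    tag_map = (
        p
        | "Tick" >> PeriodicImpulse(fire_interval=60)  # one element per minute
        | "WindowTicks" >> beam.WindowInto(window.FixedWindows(60))
        | "ReadFilter" >> beam.Map(read_tag_map)
    )
    # pass this as a side input, as with the Java View.asSingleton() above
    tag_map_view = beam.pvalue.AsSingleton(tag_map)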

Conditional OCR rotation on the image or page in Kofax

We have two sources of input for creating a batch: the first is folder import and the second is email import.
I need to add a condition: if the source of the image is email import, rotation of the image should not be allowed, and likewise if the source is folder import, the image should be rotated.
I have added a script for this in KTM.
It shows the proper message for the source of the image, but it is not stopping the rotation of the image.
Check the script below for reference.
Public Function setRotationRule(ByVal pXDoc As CASCADELib.CscXDocument) As String
   Dim i As Integer
   Dim FullPath As String
   Dim PathArry() As String
   Dim xfolder As CscXFolder
   Set xfolder = pXDoc.ParentFolder
   While Not xfolder.IsRootFolder
      Set xfolder = xfolder.ParentFolder
   Wend
   'Added for KTM script testing
   FullPath = "F:\EmailImport\dilipnikam@gmail.com_09-01-2014_10-02-37\dfdsg.pdf"
   If xfolder.XValues.ItemExists("AC_FIELD_OriginalFileName") Then
      FullPath = xfolder.XValues.ItemByName("AC_FIELD_OriginalFileName").Value
   End If
   PathArry() = Split(FullPath, "\")
   MsgBox(PathArry(1))
   If Not PathArry(1) = "EmailImport" Then
      For i = 0 To pXDoc.CDoc.Pages.Count - 1
         pXDoc.CDoc.Pages(i).Rotation = Csc_RT_NoRotation
      Next i
   End If
End Function
The KTM Scripting Help has a misleading topic named "Dynamically Suppress Orientation Detection for Full Page OCR" where it shows setting Csc_RT_NoRotation from the Document_AfterClassifyXDoc event.
The reason I think this is misleading is that rotation may already have occurred before that event, and thus setting the property has no effect. This can happen if layout classification has run, or if OCR has run (which can be triggered by content classification, or if any project-level locators need OCR). The sample in that topic does suggest that it is only for use when classifiers are not used, but it could be explained better.
The code you've shown would be best called from the event Document_BeforeProcessXDoc. This will run before the entire classify phase (including project-level locators), ensuring that rotation could not have already occurred.
Of course, also make sure this isn't because of a typo or anything else preventing the code from actually executing, as mentioned in the comments.

Why is my Ruby script utilizing 90% of my CPU?

I wrote an admin script that tails a Heroku log and, every n seconds, summarizes averages and notifies me if I cross a certain threshold (yes, I know and love New Relic -- but I want to do custom stuff).
Here is the entire script.
I have never been a master of IO and threads, and I wonder if I am making a silly mistake. I have a couple of daemon threads with while(true){} loops, which could be the culprit. For example:
# read new lines
f = File.open(file, "r")
f.seek(0, IO::SEEK_END)
while true do
  select([f])
  line = f.gets
  parse_heroku_line(line)
end
I use one daemon to watch for new lines of a log, and the other to periodically summarize.
Does someone see a way to make it less processor-intensive?
This probably runs hot because you never really block while reading from the temporary file. IO::select is a thin layer over POSIX select(2). It looks like you're trying to block until the file is ready for reading, but select(2) considers EOF to be ready ("a file descriptor is also ready on end-of-file"), so you always return right away from select, then call gets, which returns nil at EOF.
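To see that concretely, here is a tiny demonstration (Python used only because its select module is a thin wrapper over the same select(2) call; the behavior is identical from Ruby's IO::select):
import select
import tempfile

with tempfile.NamedTemporaryFile(mode="w+") as f:
    f.write("one line\n")
    f.flush()
    f.seek(0, 2)  # seek to end of file, like the Ruby script does
    readable, _, _ = select.select([f], [], [], 1.0)
    print(readable)      # [f]: a regular file at EOF still counts as "ready"
    print(f.readline())  # '' (EOF), so a loop around this spins without blocking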
You can get a truer EOF reading and nice blocking behavior by avoiding the thread which writes to the temp file and instead using IO::popen to fork the %x[heroku logs --ps router --tail --app pipewave-cedar] log tailer, connected to a ruby IO object on which you can loop over gets, exiting when gets returns nil (indicating the log tailer finished). gets on the pipe from the tailer will block when there's nothing to read and your script will only run as hot as it takes to do your line parsing and reporting.
EDIT: I'm not set up to actually try your code, but you should be able to replace the log tailer thread and your temp file read loop with this code to get the behavior described above:
IO.popen( %w{ heroku logs --ps router --tail --app my-heroku-app } ) do |logf|
  while line = logf.gets
    parse_heroku_line(line) if line =~ /^/
  end
end
I also notice your reporting thread does not do anything to synchronize access to #total_lines, #total_errors, etc. So you have some minor race conditions where you can get inconsistent values from the instance vars that the parse_heroku_line method updates.
select is about whether a read would block. f is just a plain old file, so when you get to the end, reads don't block; they just return nil instantly. As a result, select returns instantly rather than waiting for something to be appended to the file, as I assume you're expecting. Because of this you're sitting in a tight busy loop, so high CPU is to be expected.
If you are at EOF (you could check f.eof?, or whether gets returns nil), then you could either start sleeping (perhaps with some sort of backoff) or use something like listen to be notified of filesystem changes.
