In Flink, how to set the parallelism of a joined stream

When I use a join to combine two streams, I can't set the parallelism of the joined stream; it is always 1:
aStream.assignTimestampsAndWatermarks(new AWatermarks())
    .keyBy(AStream::getKey)
    .join(bStream.assignTimestampsAndWatermarks(new BWatermarks())
                 .keyBy(BStream::getKey))
    .where(AStream::getKey).equalTo(BStream::getKey)
    .window(TumblingEventTimeWindows.of(Time.seconds(30)))
    .apply(new Joiner());
Is there any way to set the parallelism of the joined stream?

Your dataflow program looks odd: you have extra keyBys, which may confuse Flink. Could you try
aStream.assignTimestampsAndWatermarks(new AWatermarks())
    .join(bStream.assignTimestampsAndWatermarks(new BWatermarks()))
    .where(AStream::getKey).equalTo(BStream::getKey)
    .window(TumblingEventTimeWindows.of(Time.seconds(30)))
    .apply(new Joiner());
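If you also want to control the parallelism explicitly, a minimal sketch (assuming the usual StreamExecutionEnvironment setup; ResultType and the value 4 are placeholders, not names from the question) is to raise the environment's default parallelism, which the windowed join inherits unless overridden:
// Sketch: operators without an explicit setParallelism() use the environment default
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(4); // arbitrary example value; use the number of available slots

DataStream<ResultType> joined = aStream.assignTimestampsAndWatermarks(new AWatermarks())
    .join(bStream.assignTimestampsAndWatermarks(new BWatermarks()))
    .where(AStream::getKey).equalTo(BStream::getKey)
    .window(TumblingEventTimeWindows.of(Time.seconds(30)))
    .apply(new Joiner()); // this join now runs with the environment's default parallelism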

Related

SystemVerilog: String to Circuit Net

Assuming that I set an environment variable before launching a logic simulation of my circuit, which is wrapped in a testbench written in SystemVerilog, I want to check whether it is possible to read the variable and map it to a net of the circuit.
For instance:
#### from the bash script that invokes the logic simulator ####
export NET_A=tb_top.module_a.submodule_b.n1
//// inside the tb_top in system verilog ////
import "DPI-C" function string getenv(input string env_name);
always_ff @(posedge clk, negedge rst_n) begin
if (getenv("NET_A") == 1'b1) begin
$display("Hello! %s has the value 1", getenv("NET_A"));
end
end
In the example above I simply want to check whether the current net, i.e., NET_A, is assigned the logic value 1 at a certain point in the simulation.
Thanks in advance
SystemVerilog has a C-based API (the Verilog Procedural Interface, VPI) that gives you access to a simulator's database. There are routines like vpi_handle_by_name, which gives you a handle to a signal looked up by a string name, and then vpi_get_value, which gives you the current value of that signal.
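As a rough sketch only (assuming an IEEE 1800 compliant simulator with this C code compiled in and registered; net_is_one is just an illustrative helper name, not a standard routine), the lookup could look like this:
#include <stdlib.h>
#include "vpi_user.h"

/* Sketch: read the net path from the environment and query its current value via VPI. */
static int net_is_one(void)
{
    const char *path = getenv("NET_A");   /* e.g. tb_top.module_a.submodule_b.n1 */
    if (path == NULL)
        return 0;

    vpiHandle net = vpi_handle_by_name((PLI_BYTE8 *)path, NULL);
    if (net == NULL)
        return 0;

    s_vpi_value val;
    val.format = vpiIntVal;               /* request the value as an integer */
    vpi_get_value(net, &val);
    return val.value.integer == 1;
}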
Use of the VPI needs quite a bit of additional knowledge and many simulators give you built-in routines to handle this common application without having to break into C code. In Modelsim/Questa, it is called Signal_Spy.
But regardless of whether you use the VPI or tool-specific routines, looking up a signal by string name has severe performance implications because it prevents many optimizations. Unless a signal represents a storage element, it usually does not keep its value around for queries.
It would be much better to use the signal path name directly:
vlog ... +define+NET_A=tb_top.module_a.submodule_b.n1
Then in your code
if (`NET_A == 1'b1) begin
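In context, a minimal completed version of that check could look like this (a sketch; it assumes the same clk and rst_n as in the question, and because the macro expands at compile time into an ordinary hierarchical reference, no string lookup happens during simulation):
// Sketch: `NET_A expands to tb_top.module_a.submodule_b.n1 at compile time
always_ff @(posedge clk or negedge rst_n) begin
  if (`NET_A == 1'b1) begin
    $display("Hello! the selected net has the value 1");
  end
end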

Override dask scheduler to concurrently load data on multiple workers

I want to run graphs/futures on my distributed cluster which all have a 'load data' root task and then a bunch of training tasks that run on that data. A simplified version would look like this:
from dask.distributed import Client
client = Client(scheduler_ip)
load_data_future = client.submit(load_data_func, 'path/to/data/')
train_task_futures = [client.submit(train_func, load_data_future, params)
                      for params in train_param_set]
Running this as above, the scheduler gets one worker to read the file, then it spills that data to disk to share it with the other workers. However, loading the data usually means reading from a large HDF5 file, which can be done concurrently, so I was wondering if there was a way to force all workers to read this file concurrently (i.e., they all compute the root task) instead of having them wait for one worker to finish and then slowly transferring the data from that worker.
I know there is the client.run() method which I can use to get all workers to read the file concurrently, but how would you then get the data you've read to feed into the downstream tasks?
I cannot use the dask data primitives to concurrently read HDF5 files because I need things like multi-indexes and grouping on multiple columns.
Revisited this question and found a relatively simple solution, though it uses internal API methods and involves a blocking call to client.run(). Using the same variables as in the question:
from distributed import get_worker

client_id = client.id

def load_dataset():
    worker = get_worker()
    data = {'load_dataset-0': load_data_func('path/to/data')}
    info = worker.update_data(data=data, report=False)
    worker.scheduler.update_data(who_has={key: [worker.address] for key in data},
                                 nbytes=info['nbytes'], client=client_id)
client.run(load_dataset)
Now if you run client.has_what() you should see that each worker holds the key load_dataset-0. To use this in downstream computations you can simply create a future for the key:
from distributed import Future
load_data_future = Future('load_dataset-0', client=client)
and this can be used with client.compute() or dask.delayed as usual. Indeed the final line from the example in the question would work fine:
train_task_futures = [client.submit(train_func, load_data_future, params)
                      for params in train_param_set]
Bear in mind that this uses the internal API methods Worker.update_data and Scheduler.update_data; it works fine as of distributed.__version__ == 1.21.6 but could be subject to change in future releases.
As of today (distributed.__version__ == 1.20.2) what you ask for is not possible. The closest thing would be to compute once and then replicate the data explicitly:
from distributed import wait

future = client.submit(load, path)
wait(future)
client.replicate(future)
You may want to raise this as a feature request at https://github.com/dask/distributed/issues/new

Can Dataflow sideInput be updated per window by reading a gcs bucket?

I’m currently creating a PCollectionView by reading filtering information from a GCS bucket and passing it as a side input to different stages of my pipeline in order to filter the output. If the file in the GCS bucket changes, I want the currently running pipeline to use this new filter info. Is there a way to update this PCollectionView on each new window of data if my filter changes? I thought I could do it in startBundle but I can’t figure out how or whether it’s possible. Could you give an example if it is possible?
PCollectionView<Map<String, TagObject>> tagMapView =
    pipeline.apply(TextIO.Read.named("TagListTextRead")
                       .from("gs://tag-list-bucket/tag-list.json"))
            .apply(ParDo.named("TagsToTagMap").of(new Tags.BuildTagListMapFn()))
            .apply("MakeTagMapView", View.asSingleton());

PCollection<String> windowedData =
    pipeline.apply(PubsubIO.Read.topic("myTopic"))
            .apply(Window.<String>into(
                SlidingWindows.of(Duration.standardMinutes(15))
                              .every(Duration.standardSeconds(31))));

PCollection<MY_DATA> lineData = windowedData
    .apply(ParDo.named("ExtractJsonObject")
               .withSideInputs(tagMapView)
               .of(new ExtractJsonObjectFn()));
You probably want something like "use an at-most-1-minute-old version of the filter as a side input" (since in theory the file can change frequently, unpredictably, and independently from your pipeline, so there's no way to completely synchronize changes of the file with the behavior of the pipeline).
Here's a (granted, rather clumsy) solution I was able to come up with. It relies on the fact that side inputs are implicitly also keyed by window. In this solution we're going to create a side input windowed into 1-minute fixed windows, where each window will contain a single value of the tag map, derived from the filter file as-of some moment inside that window.
PCollection<Long> ticks = p
    // Produce 1 "tick" per second
    .apply(CountingInput.unbounded().withRate(1, Duration.standardSeconds(1)))
    // Window the ticks into 1-minute windows
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(1))))
    // Use an arbitrary per-window combiner to reduce to 1 element per window
    .apply(Count.globally());

// Produce a collection of tag maps, 1 per each 1-minute window
PCollectionView<TagMap> tagMapView = ticks
    .apply(MapElements.via((Long ignored) -> {
        ... manually read the json file as a TagMap ...
    }))
    .apply(View.asSingleton());
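To actually consume this view, the main collection takes it as a side input in the usual way. A rough sketch mirroring the question's last step (the DoFn body is only illustrative, and parseAndFilter is a placeholder for the logic that ExtractJsonObjectFn already performs):
PCollection<MY_DATA> lineData = windowedData
    .apply(ParDo.named("ExtractJsonObject")
               .withSideInputs(tagMapView)
               .of(new DoFn<String, MY_DATA>() {
                 @Override
                 public void processElement(ProcessContext c) {
                   // The runner hands each element the side-input value for the
                   // side-input window that corresponds to the element's own window.
                   TagMap tags = c.sideInput(tagMapView);
                   // Placeholder for the question's own extraction/filtering logic.
                   MY_DATA parsed = parseAndFilter(c.element(), tags);
                   if (parsed != null) {
                     c.output(parsed);
                   }
                 }
               }));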
This pattern (joining against slowly changing external data as a side input) comes up repeatedly, and the solution I'm proposing here is far from perfect; I wish we had better support for this in the programming model. I've filed a BEAM JIRA issue to track this.

How to count the total number of rows in a file using Google Dataflow

I would like to know if there is a way to find out the total number of rows in a file using Google Dataflow. Any code sample or pointer would be a great help. Basically, I have a method such as
int getCount(String fileName) {}
So, the above method will return the total count of rows, and its implementation will be Dataflow code.
Thanks
It seems like your use case is one that doesn't require distributed processing, because the file is compressed and hence cannot be read in parallel. However, you may still find it useful to use the Dataflow APIs for the sake of their ease of access to GCS and automatic decompression.
Since you also want to get the result out of your pipeline as an actual Java object, you need to use the Direct runner, which runs in-process without talking to the Dataflow service or doing any distributed processing; however, in return it provides the ability to extract PCollections into Java objects.
Something like this:
PipelineOptions options = ...;
DirectPipelineRunner runner = DirectPipelineRunner.fromOptions(options);
Pipeline p = Pipeline.create(options);

PCollection<Long> countPC =
    p.apply(TextIO.Read.from("gs://..."))
     .apply(Count.<String>globally());

DirectPipelineRunner.EvaluationResults results = runner.run(p);
long count = results.getPCollection(countPC).get(0);

How to read the current machine's NTFS settings?

Before inserting filestream data I'd like to check the following NTFS settings:
1) 8.3 naming status (this is disabled by using fsutil behavior set disable8dot3 1)
2) last access status (this is disabled by using fsutil behavior set disablelastaccess 1)
3) cluster size (this is set with format F: /FS:NTFS /V:MyFILESTREAMContainer /A:64K)
The FILESTREAM recommendation is to disable (1) and (2) and to set (3) to 64 KB.
But before setting these I'd like to know the existing settings. How do I check them? The answer can be in Delphi, but not necessarily.
The GetDiskFreeSpace Windows API call returns the sectors-per-cluster and bytes-per-sector values. I think this function should be in the Windows unit.
You can read the registry for points 1 and 2 (using xp_regread in SQL).
Number 3 is not essential but helps and has been SQL Server best practice for a decade or more. You'd have to use sp_OA% or a CLR function to read this in SQL.
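For completeness, here is a minimal sketch in C rather than Delphi (an assumption on my part: NtfsDisable8dot3NameCreation and NtfsDisableLastAccessUpdate are the registry values that fsutil behavior toggles, and the cluster size is sectors-per-cluster times bytes-per-sector from GetDiskFreeSpace):
#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* (1) and (2): flags under the FileSystem key; nonzero means "disabled" */
    DWORD disable8dot3 = 0, disableLastAccess = 0, size = sizeof(DWORD);
    RegGetValueA(HKEY_LOCAL_MACHINE, "SYSTEM\\CurrentControlSet\\Control\\FileSystem",
                 "NtfsDisable8dot3NameCreation", RRF_RT_REG_DWORD, NULL,
                 &disable8dot3, &size);
    size = sizeof(DWORD);
    RegGetValueA(HKEY_LOCAL_MACHINE, "SYSTEM\\CurrentControlSet\\Control\\FileSystem",
                 "NtfsDisableLastAccessUpdate", RRF_RT_REG_DWORD, NULL,
                 &disableLastAccess, &size);

    /* (3): cluster size = sectors per cluster * bytes per sector for the volume */
    DWORD sectorsPerCluster = 0, bytesPerSector = 0, freeClusters = 0, totalClusters = 0;
    GetDiskFreeSpaceA("F:\\", &sectorsPerCluster, &bytesPerSector,
                      &freeClusters, &totalClusters);

    printf("8.3 name creation disabled: %lu\n", disable8dot3);
    printf("Last access update disabled: %lu\n", disableLastAccess);
    printf("Cluster size: %lu bytes\n", sectorsPerCluster * bytesPerSector);
    return 0;
}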
