Flink: Are multiple execution environments supported?

Is it OK to create multiple ExecutionEnvironments in a Flink program? More specifically, can I create one ExecutionEnvironment and one StreamExecutionEnvironment in the same main method, so that I can work with batch and later transition to streaming without problems?
I guess that the other possibility would be to split the program in two, but for my testing purposes this seems better. Is Flink prepared for this scenario?
All seems to work fine, except that I am currently getting no output when joining two streams on a common index and using window(TumblingProcessingTimeWindows.of(Time.seconds(1))). I have already called setStreamTimeCharacteristic(TimeCharacteristic.EventTime) on the StreamExecutionEnvironment and even tried assigning custom watermarks on both joined streams with assignTimestampsAndWatermarks, where I just return System.currentTimeMillis() as the timestamp of each record.
Since it finishes really quickly, both streams should fit in that 1-second window, no? Both streams print just fine right before the join. I can try supplying the important parts of code (it's rather lengthy) if anyone's interested.
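For reference, here is a condensed sketch of the kind of setup described above (the sources, types, and key fields are hypothetical stand-ins for the actual, much longer program):
import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.AscendingTimestampExtractor;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class JoinSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        // Two small stand-in streams keyed by a common integer index (field f0).
        DataStream<Tuple2<Integer, String>> left =
                withTimestamps(env.fromElements(Tuple2.of(1, "a"), Tuple2.of(2, "b")));
        DataStream<Tuple2<Integer, String>> right =
                withTimestamps(env.fromElements(Tuple2.of(1, "x"), Tuple2.of(2, "y")));

        left.join(right)
            .where(new KeySelector<Tuple2<Integer, String>, Integer>() {
                public Integer getKey(Tuple2<Integer, String> t) { return t.f0; }
            })
            .equalTo(new KeySelector<Tuple2<Integer, String>, Integer>() {
                public Integer getKey(Tuple2<Integer, String> t) { return t.f0; }
            })
            .window(TumblingProcessingTimeWindows.of(Time.seconds(1)))
            .apply(new JoinFunction<Tuple2<Integer, String>, Tuple2<Integer, String>, String>() {
                public String join(Tuple2<Integer, String> l, Tuple2<Integer, String> r) {
                    return l.f1 + "/" + r.f1;
                }
            })
            .print();

        env.execute("windowed join sketch");
    }

    // Timestamps assigned as described above: System.currentTimeMillis() for every record.
    private static DataStream<Tuple2<Integer, String>> withTimestamps(DataStream<Tuple2<Integer, String>> s) {
        return s.assignTimestampsAndWatermarks(new AscendingTimestampExtractor<Tuple2<Integer, String>>() {
            @Override
            public long extractAscendingTimestamp(Tuple2<Integer, String> t) {
                return System.currentTimeMillis();
            }
        });
    }
}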
UPDATE: OK, so I separated the two environments (put each inside its own main method) and then simply called the first main from the second main method. The described problem no longer occurs.

No, this is not supported, and it won't really work.
At least up through Flink 1.9, a given application must either have an ExecutionEnvironment and use the DataSet API, or a StreamExecutionEnvironment and use the DataStream API. You cannot mix the two in one application.
There is ongoing work to unify batch and streaming more completely, but it is still in progress. To understand this better, you might want to watch the video of this recent Flink Forward talk when it becomes available.
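For illustration, here is a minimal sketch of the workaround described in the update above: two separate programs, one per API, with the streaming main invoking the batch main first (class names are hypothetical):
// File 1 -- batch part, DataSet API only.
import org.apache.flink.api.java.ExecutionEnvironment;

public class BatchJob {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        env.fromElements(1, 2, 3).print();   // print() triggers batch execution
    }
}

// File 2 -- streaming part, DataStream API only.
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StreamingJob {
    public static void main(String[] args) throws Exception {
        BatchJob.main(args);                 // run the batch program first, as described in the update
        StreamExecutionEnvironment senv = StreamExecutionEnvironment.getExecutionEnvironment();
        senv.fromElements(1, 2, 3).print();
        senv.execute("streaming part");      // DataStream jobs need an explicit execute()
    }
}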

django channels and running event loop

For a game website, I want a player to be able to compete against either a human or an AI.
I am using Django + Channels (Django-4.0.2 asgiref-3.5.0 channels-3.0.4)
This has been a long learning journey...
Human vs Human: the game takes place in the web browser, turn by turn. Each time a player connects, a WebSocket connection is opened; a move is sent through the socket, processed by the consumer (validated and saved in the database), and sent to the other player.
This is handled entirely with synchronous code.
Human vs AI: I try to use the same route as before. A branch checks whether the game is against the computer and computes a move instead of receiving it from the other end of the WebSocket. Computing this AI move can be a blocking operation, as it can take from 2 to 5 seconds.
I don't want the receive method of the consumer to wait for the AI to return its move, since I have other operations to do quickly (like updating some information on the client side).
Then I thought I could take advantage of the event loop that the Channels framework supposedly already runs: I could send the AI thinking process to this loop and return the result to the client later through the send method of the consumer.
However, when I write:
loop = asyncio.get_event_loop()
loop.create_task(my_AI_thinking())
Django raises a RuntimeError (the same as described here: https://github.com/django/asgiref/issues/278) telling me there is no running event loop.
The solution seemed to be to upgrade asgiref to 3.5.0, which I did, but the issue was not solved.
I think I am a bit short on background here, and some enlightenment would help me understand the root cause of this failure.
My first questions would be:
In the Django + Channels + ASGI combination, which component is in charge of running the event loop?
How can I check whether an event loop is indeed running, whatever the thread?
Maybe your answers will raise other questions.
Did you try running your event_loop example on Django 3.2 (and/or with a different Python version)? I experienced various problems with Django 4.0 and Python 3.10, so I am staying with Django 3.2 and Python 3.7/3.8/3.9 for now; maybe your errors are one of these problems?
If you can't get the event loop running, I see two possible alternative solutions:
Open two WS connections: one only for the moves, and the other for all the other stuff, such as updating information on Player's UI, etc.
You can also use multiprocessing to "manually" send the AI move calculation to another thread, and then join the two threads again after receiving the result (the move). To be honest, multiprocessing in Python is quite simple -- it's pretty handy if you are familiar with the idea of multithreaded applications.
Unfortunately, I have not yet used event loops in channels myself, maybe someone more experienced in that matter will be able to better address your issue.

Marking a key as complete in a GroupBy | Dataflow Streaming Pipeline

To our streaming pipeline, we want to submit unique GCS files, each file containing information about multiple events, and each event also containing a key (for example, device_id). As part of the processing, we want to shuffle by this device_id so as to achieve some form of worker-to-device_id affinity (more background on why we want to do this is in this other SO question). Once all events from the same file are complete, we want to reduce (GroupBy) by their source GCS file (which we will make a property of the event itself, something like file_id) and finally write the output to GCS (could be multiple files).
The reason we want to do the final GroupBy is because we want to notify an external service once a specific input file has completed processing. The only problem with this approach is that since the data is shuffled by the device_id and then grouped at the end by the file_id, there is no way to guarantee that all data from a specific file_id has completed processing.
Is there something we could do about it? I understand that Dataflow provides exactly-once guarantees, which means all events will eventually be processed, but is there a way to set a deterministic trigger that says all data for a specific key has been grouped?
EDIT
I wanted to highlight the broader problem we are facing here. The ability to mark file-level completeness would help us checkpoint different stages of the data as seen by external consumers. For example, this would allow us to trigger per-hour or per-day completeness, which is critical for us to generate reports for that window. Given that these stages/barriers (hour/day) are clearly defined on the input (the GCS files are date/hour partitioned), it is only natural to expect the same of the output. But with Dataflow's model, this seems impossible.
Similarly, although Dataflow guarantees exactly-once, there will be cases where the entire pipeline needs to be restarted since something went horribly wrong - in those cases, it is almost impossible to restart from the correct input marker since there is no guarantee that what was already consumed has been completely flushed out. The DRAIN mode tries to achieve this but as mentioned, if the entire pipeline is messed up and draining itself cannot make progress, there is no way to know which part of the source should be the starting point.
We are considering using Spark, since its micro-batch-based streaming model seems to fit better. We would still like to explore Dataflow if possible, but it seems we won't be able to achieve this without storing these checkpoints externally from within the application. If there is an alternative way of providing these guarantees from Dataflow, that would be great. The idea behind broadening this question was to see if we are missing an alternative perspective that would solve our problem.
Thanks
This is actually tricky. Neither Beam nor Dataflow has a notion of a per-key watermark, and it would be difficult to implement that level of granularity.
One idea would be to use a stateful DoFn instead of the second shuffle. This DoFn would need to receive the number of elements expected in the file (from either a side-input or some special value on the main input). Then it could count the number of elements it had processed, and only output that everything has been processed once it had seen that number of elements.
This would be assuming that the expected number of elements can be determined ahead of time, etc.
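A minimal sketch of that idea with the Beam Java SDK, assuming each event carries its file_id and the total expected element count for its file on the main input (all class and field names here are hypothetical):
import org.apache.beam.sdk.coders.VarLongCoder;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Hypothetical event type: each event knows which file it came from and how many
// events that file contains in total.
class FileEvent implements java.io.Serializable {
    String fileId;
    long expectedCount;
}

// Stateful DoFn applied to a PCollection<KV<String, FileEvent>> keyed by file_id:
// it counts the elements seen per key and emits the file_id once the expected
// number of elements has been processed.
class FileCompletionFn extends DoFn<KV<String, FileEvent>, String> {

    @StateId("seen")
    private final StateSpec<ValueState<Long>> seenSpec = StateSpecs.value(VarLongCoder.of());

    @ProcessElement
    public void processElement(ProcessContext c, @StateId("seen") ValueState<Long> seen) {
        long count = (seen.read() == null ? 0L : seen.read()) + 1;
        seen.write(count);
        if (count >= c.element().getValue().expectedCount) {
            c.output(c.element().getKey());   // this file_id is complete
        }
    }
}
The strings this DoFn emits (one per completed file_id) could then drive whatever transform notifies the external service.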

Rails process batches of elements within an array concurrently

First time learning about concurrency and threading within Rails, so any advice is very appreciated.
I currently have an array of 50 strings. I have a 3rd-party API call that takes in a string and returns a numeric value. Right now I am simply calling the API on each string, one at a time, which takes a really long time.
After looking at a few SO posts like this one, this other one, and finally this one, it seems like I have to use some sort of threading to achieve what I want. My plan is to break the array down into batches of ten strings and then run 5 API calls on each batch of ten strings concurrently, in the hope that it will drastically reduce the time.
I've never done threading of any kind with Rails before, so I'm just wondering if I am on the right track following the third SO post above, or if I should use other techniques that may better suit my needs.
The approach you take will depend on your use case. Do you need to wait for all the calls to be made to do something with the result? Can it be asynchronous?
If you are looking into threads to distribute the work then the third SO post you mentioned is a good way to do it.
If your use case permits the process to be async, I'd definitely look into a scheduler, as mentioned in the first SO post. I've used DelayedJob for this goal; there are some other alternatives.
On a related topic, I usually implement a micro-service that receives those requests and processes them asynchronously instead of having DelayedJob in the same app, but that is just a matter of preference.
Something REALLY important to keep in mind if you go with the async approach is that if you are accessing ActiveRecord records inside a thread, you need to explicitly check out the database connection. Rails only handles the check-in/check-out of connections in the main thread. Be really careful with this, since it can cause connection leaks that are really hard to track down.
The first answer on this SO post shows how to ensure the db connection to be released.
Hope that helps.

How should a parser filter behave in DirectShow Editing Services?

We've created a custom push source / parser filter that is expected to work in a DirectShow Editing Services timeline.
Now everything is great, except that the filter does not stop delivering samples when the current cut has reached its end. The rendering stops, but the downstream filter continues to consume samples; the filter delivers samples until it reaches EOF. This causes high CPU load, so the application is simply unusable.
After a lot of investigation, I'm not able to find a suitable mechanism that can inform my filter that the cut is over and the filter needs to be stopped:
The Deliver function on the connected decoder pins always returns S_OK, meaning the attached decoder is also not aware that the IMediaSamples are being discarded downstream.
There's no flushing in the filter graph.
The IMediaSeeking::SetPositions interface is used, but only the start positions are set; our filter is always instructed to play up to the end of the file.
I would expect that using IAMTimelineSrc::SetMediaTimes(Start, Stop) from the application would set a stop time too, but this does not happen.
I've also tried to manipulate the XTL timeline, adding 'mstop' attributes to all the clips in the hope that this would imply a stop position being set, but to no avail.
From the filter's point of view, the output buffers are always available (as the IMediaSamples are being discarded downstream), so the filter is filling samples as fast as it can until the source file is finished.
Is there any way the filter can detect when to stop, or can we do anything from the application side?
Many thanks
Tilo
You can try adding a custom interface to your filter and calling a method on it externally from your client application. See this SO question for a few more details on this approach. You should be careful with thread safety while implementing this method, and it is indeed possible that there is a neater way of detecting that the capturing should be stopped.
I'm not that familiar with DES, but I have tried my demux filters in DES and the stop time was set correctly when there was a "stop=" tag for the clip.
Perhaps your demux does not implement IMediaSeeking correctly. Do you expose IMediaSeeking through the pins?
I had a chance to work with DES and custom push source filter recently.
From my experience:
DES actually does return an error code to the Receive() call, which is in turn returned to the source's Deliver(), when the cut reaches the end.
I hit a similar situation where the source does not receive it and continues to run to the end of the stream.
The problem I found (after a huge amount of ad hoc trials) is that the source needs to call the DeliverNewSegment() method at each restart after a seek. DES seems to accept incoming samples only after that notification. It looks like DES still returns S_OK for the samples even without that notification, but it just throws them away.
I don't see DES setting the end time via IMediaSeeking::SetPositions, either.
I hope this helps, although this question is very old and I suppose Tilo does not care about this any more...

Clone a lua state

Recently, I have encountered many difficulties while developing with C++ and Lua. My situation is this: for some reason, there can be thousands of Lua states in my C++ program, but these states should all be the same just after initialization. Of course, I can call luaL_openlibs() and luaL_loadfile() for each state, but that is pretty heavy (in fact, it takes a rather long time even to initialize just one state). So I am wondering about the following scheme: what about keeping a separate Lua state (the only state that has to be initialized) which is then cloned for the other Lua states? Is that possible?
When I started with Lua, like you I once wrote a program with thousands of states, had the same problem and thoughts, until I realized I was doing it totally wrong :)
Lua has coroutines and threads, you need to use these features to do what you need. They can be a bit tricky at first but you should be able to understand them in a few days, it'll be well worth your time.
Take a look at the following Lua API call; I think it is exactly what you need.
lua_State *lua_newthread (lua_State *L);
This creates a new thread, pushes it on the stack, and returns a pointer to a lua_State that represents this new thread. The new thread returned by this function shares with the original thread its global environment, but has an independent execution stack.
There is no explicit function to close or to destroy a thread. Threads are subject to garbage collection, like any Lua object.
Unfortunately, no.
You could try Pluto to serialize the whole state. It does work pretty well, but in most cases it costs roughly the same time as normal initialization.
I think it will be hard to do exactly what you're requesting here given that just copying the state would have internal references as well as potentially pointers to external data. One would need to reconstruct those internal references in order to not just have multiple states pointing to the clone source.
You could serialize out the state after one starts up and then load that into subsequent states. If initialization is really expensive, this might be worth it.
I think the closest thing to doing what you want that would be relatively easy would be to put the states in different processes by initializing one state and then forking, however your operating system supports it:
http://en.wikipedia.org/wiki/Fork_(operating_system)
If you want something available from within Lua, you could try something like this:
How do you construct a read-write pipe with lua?
