I need to implement a Bloom filter as state for Apache Flink. I used a ValueState that contains an instance of the Bloom filter, and I update the state's value each time.
state = getRuntimeContext.getState(
new ValueStateDescriptor[BloomFilter](
"value-bloomfilter",
createTypeInformation[BloomFilter]
)
)
But I need a more efficient way, such as a customized state for it. Unfortunately, I couldn't find any resources on how to develop raw state for Apache Flink.
In the federated learning context, and as this tutorial shows, the initial weights of the global model (at the server level) are initialized randomly with: state = iterative_process.initialize(). I want to be able to set these initial weights myself by loading them from another model (load_model()). How can I proceed? I'm new to TFF.
Thanks
You can either manually create a state object with the same structure and any values you choose, or use tff.structure.update_struct.
If you already have a my_model, an instance of either tff.learning.Model or tf.keras.Model with the weights you want, you can write something like
state = tff.structure.update_struct(state, model=tff.learning.ModelWeights.from_model(my_model))
Has anyone integrated Microsoft Power BI into a Delphi application? I believe that I will need to embed a web page into a form, which I am OK with. However, I can't see how to force a refresh or feed Power BI the run-time selection criteria.
It will be linked to a standard SQL Server database (not cloud based at the moment). I have the graph I want working on Power BI desktop.
I'm integrating it into a WPF C# application. It's pretty much the same as in Delphi, but easier thanks to the availability of the ADAL library for C#.
If you want to display a report (or tile, or dashboard) based on the current selection in your application, you must provide this information to the report. You can save the selection (or information about it, such as primary key values) to a table in the database and build the report on this table. Put a session column in it, and on every save generate a unique session ID value. Then filter the report to show only the data for your session.
To filter the embedded report, define a filter and either assign it to the filters property of the embed configuration object that you pass to the JavaScript Power BI Client, or call the report.setFilters method. In your case, IBasicFilter is enough. Construct it like this:
const basicFilter: pbi.models.IBasicFilter = {
$schema: "http://powerbi.com/product/schema#basic",
target: {
table: "ReportTempTableName",
column: "SessionId"
},
operator: "In",
values: [12345],
filterType: 1 // pbi.models.FilterType.BasicFilter
}
replacing 12345 with the unique session ID value that you want to visualize.
To prevent the user from removing the applied filter and seeing the data for all sessions, you can hide the filter pane:
var embedConfig = {
...
settings: {
filterPaneEnabled: false
}
};
Given a list of Fluxes, I'd like to determine which are upstream of others in the list and which aren't. A way to get upstream publishers for each flux would do the trick, but I'm open to other suggestions.
Also, is it possible to detect circular dependencies between fluxes? I'd like to only allow the creation of DAGs.
There is a best-effort way (not 100% supported) using the Scannable interface:
Flux<T> fluxToCheck;
List<Flux> potentialParents;
Scannable s = Scannable.from(fluxToCheck);
// parents() returns a Stream, so collect it into a List before comparing
List<? extends Scannable> parents = s
    .parents() // this is the important part
    .collect(Collectors.toList());
potentialParents.retainAll(parents);
// or some other, more efficient test on the collections
Scannable#parents() recursively looks up the Scannables that advertise a PARENT, which I reckon most Reactor core operators should do.
Scannable.from(foo) returns a NO-OP Scannable if the object you pass is not actually Scannable.
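On the DAG part of the question: once the parent links have been collected (by whatever means, for example the parents() lookup above), "is A upstream of B" and "is there a cycle" become plain graph questions. Here is a minimal sketch in plain Java, with no Reactor dependency. The class UpstreamGraph and the string node names are hypothetical stand-ins for your fluxes, not anything Reactor provides.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: node names stand in for fluxes, and the map records
// each node's direct upstream parents (gathered however you like).
final class UpstreamGraph {
    private final Map<String, List<String>> parents = new HashMap<>();

    void addParent(String child, String parent) {
        parents.computeIfAbsent(child, k -> new ArrayList<>()).add(parent);
    }

    // True if 'candidate' can be reached from 'node' by following parent links.
    boolean isUpstreamOf(String candidate, String node) {
        return reachable(node, candidate, new HashSet<>());
    }

    // True if some node can reach itself through its own parents.
    boolean hasCycle() {
        for (String node : parents.keySet()) {
            if (isUpstreamOf(node, node)) {
                return true;
            }
        }
        return false;
    }

    private boolean reachable(String from, String target, Set<String> visited) {
        for (String parent : parents.getOrDefault(from, List.of())) {
            if (parent.equals(target)) {
                return true;
            }
            // The visited set keeps the search finite even on cyclic input.
            if (visited.add(parent) && reachable(parent, target, visited)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        UpstreamGraph g = new UpstreamGraph();
        g.addParent("mapped", "source");
        g.addParent("filtered", "mapped");
        System.out.println(g.isUpstreamOf("source", "filtered")); // true
        System.out.println(g.hasCycle());                         // false
        g.addParent("source", "filtered");                        // closes a loop
        System.out.println(g.hasCycle());                         // true
    }
}
```

Running hasCycle() before accepting a new link is one way to only allow DAGs. Whether the parent links you collect are complete depends on how well the operators involved advertise their PARENT.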
By the way, how do you create a stream?
I use AppendToStreamAsync directly. Is this right, or should I create a stream first and then append to it?
I also tried running some tests, but with the methods below I can write events to EventStore and can't read them back.
And the most important question: how do I view my saved events in the EventStore admin site?
Here is the code:
public async Task AppendEventAsync(IEvent @event)
{
    try
    {
        var eventData = new EventData(@event.EventId,
            @event.GetType().AssemblyQualifiedName,
            true,
            Serializer.Serialize(@event),
            Encoding.UTF8.GetBytes("{}"));
        var writeResult = await connection.AppendToStreamAsync(
            @event.SourceId.ToString(),
            @event.AggregateVersion,
            eventData);
        Console.WriteLine(writeResult);
    }
    catch (Exception ex)
    {
        Console.WriteLine(ex);
    }
}
public async Task<IEnumerable<IEvent>> ReadEventsAsync(Guid aggregateId)
{
var ret = new List<IEvent>();
StreamEventsSlice currentSlice;
long nextSliceStart = StreamPosition.Start;
do
{
currentSlice = await connection.ReadStreamEventsForwardAsync(aggregateId.ToString(), nextSliceStart, 200, false);
if (currentSlice.Status != SliceReadStatus.Success)
{
throw new Exception($"Aggregate {aggregateId} not found");
}
nextSliceStart = currentSlice.NextEventNumber;
foreach (var resolvedEvent in currentSlice.Events)
{
ret.Add(Serializer.Deserialize(resolvedEvent.Event.EventType, resolvedEvent.Event.Data));
}
} while (!currentSlice.IsEndOfStream);
return ret;
}
Streams are created automatically as you write events. You should follow the recommended naming convention though as it enables a few features out of the box.
await Connection.AppendToStreamAsync("CustomerAggregate-b2c28cc1-2880-4924-b68f-d85cf24389ba", expectedVersion, creds, eventData);
It is recommended to name your streams "category-id" (where the category, in our case, is the aggregate name, as we are using the DDD+CQRS pattern):
CustomerAggregate-b2c28cc1-2880-4924-b68f-d85cf24389ba
The stream matures as you write more events to the same stream name. The first event's ID becomes the "aggregateID" in our case, and each new event ID after that is unique. The only way to recreate our aggregate is to replay the events in sequence. If the sequence fails, an exception is thrown.
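To make "replay the events in sequence" concrete, here is a minimal sketch in plain Java. It does not use any Event Store client. The StoredEvent type, its fields, and the running-balance aggregate are all hypothetical, chosen only to show the idea of folding events in version order and failing when the sequence has a gap.

```java
import java.util.List;

// Hypothetical types for illustration; not part of any Event Store client.
final class AggregateReplay {

    // A stored event: its position in the stream plus a change to apply.
    static final class StoredEvent {
        final long version;
        final long amountDelta;

        StoredEvent(long version, long amountDelta) {
            this.version = version;
            this.amountDelta = amountDelta;
        }
    }

    // Rebuilds a running balance by applying events from version 0 upwards.
    // Throws if a version is missing or out of order.
    static long replay(List<StoredEvent> events) {
        long balance = 0;
        long expectedVersion = 0;
        for (StoredEvent e : events) {
            if (e.version != expectedVersion) {
                throw new IllegalStateException(
                    "Expected version " + expectedVersion + " but got " + e.version);
            }
            balance += e.amountDelta;
            expectedVersion++;
        }
        return balance;
    }

    public static void main(String[] args) {
        List<StoredEvent> events = List.of(
            new StoredEvent(0, 100),
            new StoredEvent(1, -30),
            new StoredEvent(2, 5));
        System.out.println(replay(events)); // prints 75
    }
}
```

A real aggregate would dispatch on the event type instead of summing a number, but the shape (start from empty state, apply each event in order, reject gaps) stays the same.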
The reason to use this naming convention is that Event Store runs a few default internal projections for your convenience (the documentation about them is rather convoluted):
$by_category
$by_event_type
$stream_by_category
$streams
By Category
By category basically means that an internal projection creates one stream per category. For our CustomerAggregate, we subscribe to the $ce-CustomerAggregate stream and see only the events of that category, regardless of their IDs. The event data contains everything we need thereafter.
We use persistent subscribers (small C# console applications) which are set up to work with $ce-CustomerAggregate. Persistent subscribers are great because they remember the last event your client acknowledged. So if the application crashes, you restart it and it resumes from where it left off.
This is where Event Store starts to shine and stand out from the other "event store implementations".
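As a small illustration of how the stream name maps to the category stream, here is a sketch in plain Java. It assumes the projection splits the name at the first "-", which I believe is the default but is configurable, so check your server's settings. StreamNames and its methods are hypothetical helpers, not part of any client library.

```java
import java.util.Optional;

// Hypothetical helper; assumes the category is the text before the first '-'.
final class StreamNames {

    // Returns the category part of "category-id", or empty if there is no '-'.
    static Optional<String> categoryOf(String streamName) {
        int dash = streamName.indexOf('-');
        return dash < 0
            ? Optional.empty()
            : Optional.of(streamName.substring(0, dash));
    }

    // Returns the name of the category stream to subscribe to, e.g. "$ce-Foo".
    static Optional<String> categoryStreamOf(String streamName) {
        return categoryOf(streamName).map(category -> "$ce-" + category);
    }

    public static void main(String[] args) {
        String stream = "CustomerAggregate-b2c28cc1-2880-4924-b68f-d85cf24389ba";
        // prints $ce-CustomerAggregate
        System.out.println(categoryStreamOf(stream).orElse("(no category)"));
    }
}
```

Note that the GUID itself contains dashes, which is why splitting at the first one (and not the last) matters for this naming scheme.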
Viewing your events
The example with persistent subscribers is one way to set things up using code.
You cannot really view "all" your data in the admin site. The purpose of the admin site is to manage projections, manage users, see some statistics, create some projections, and get a view of recent streams and events only. (If you know the IDs you can construct the URLs as you need them, but you can't search for them.)
If you want to see ALL the data, then use the RESTful API with something like Postman. Maybe there is third-party software that can create a grid-like data source viewer, but I am unaware of any. That would probably also just hook into the REST API, and you could create your own visualiser this way quite quickly.
Again, back to code: you can also always read all events from 0 using one of the libraries. Incidentally, with DDD+CQRS you always read the aggregate's stream from 0 to rebuild its state, but you can do the same for other requirements.
In some cases, using snapshots makes replaying events a lot faster, if you have an extremely large stream to deal with.
Paradigm shift
Event Store has quite a learning curve and is a paradigm shift from conventional transactional databases. Event Store's best friend is CQRS. We use a slightly modified version of the CQRS Lite open source framework.
To truly appreciate Event Store you would need to understand DDD concepts and then dig into CQRS/ES. There are a few good YouTube videos and examples.
Problem Context
I am trying to generate a total (linear) order of event items per key from a real-time stream where the order is event time (derived from the event payload).
Approach
I had attempted to implement this using streaming as follows:
1) Set up non-overlapping sequential windows, e.g. with a duration of 5 minutes
2) Establish an allowed lateness - it is fine to discard late events
3) Set accumulation mode to retain all fired panes
4) Use the "AfterWatermark" trigger
5) When handling a triggered pane, only consider the pane if it is the final one
6) Use GroupBy.perKey to ensure all events in this window for this key will be processed as a unit on a single resource
While this approach ensures linear order for each key within a given window, it does not make that guarantee across multiple windows. For example, a later window of events for a key could be processed at the same time as an earlier window for that key. This could easily happen if the first window failed and had to be retried.
I'm considering adapting this approach where the realtime stream can first be processed so that it partitions the events by key and writes them to files named by their window range.
Due to the parallel nature of Beam processing, these files will also be generated out of order.
A single process coordinator could then submit these files sequentially to a batch pipeline - only submitting the next one when it has received the previous file and that downstream processing of it has completed successfully.
The problem is that Apache Beam will only fire a pane if there was at least one element in that time window. Thus if there are gaps in the events, there could be gaps in the files that are generated, i.e. missing files. The problem with missing files is that the coordinating batch processor cannot tell whether the time window has passed with no data or whether there has been a failure, in which case it cannot proceed until the file finally arrives.
One way to force the event windows to trigger might be to somehow add dummy events to the stream for each partition and time window. However, this is tricky to do: if there are large gaps in the time sequence and these dummy events are surrounded by events with much later timestamps, they will be discarded as late.
Are there other approaches to ensuring there is a trigger for every possible event window, even if that results in outputting empty files?
Is generating a total ordering by key from a realtime stream a tractable problem with Apache Beam? Is there another approach I should be considering?
Depending on your definition of tractable, it is certainly possible to totally order a stream per key by event timestamp in Apache Beam.
Here are the considerations behind the design:
Apache Beam does not guarantee in-order transport, so ordering is of no use within a pipeline. I will therefore assume you are doing this so you can write to an external system that can only handle things if they arrive in order.
If an event has timestamp t, you can never be certain no earlier event will arrive unless you wait until t is droppable.
So here's how we'll do it:
We'll write a ParDo that uses state and timers (blog post still under review) in the global window. This makes it a per-key workflow.
We'll buffer elements in state when they arrive. So your allowed lateness affects how efficient a data structure you need. What you need is a heap to peek and pop the minimum timestamp and element; there's no built-in heap state so I'll just write it as a ValueState.
We'll set an event-time timer to receive a callback when an element's timestamp can no longer be contradicted.
I'm going to assume a custom EventHeap data structure for brevity. In practice, you'd want to break this up into multiple state cells to minimize the data transferred. A heap might be a reasonable addition to the primitive types of state.
I will also assume that all the coders we need are already registered and focus on the state and timers logic.
new DoFn<KV<K, Event>, Void>() {

  @StateId("heap")
  private final StateSpec<ValueState<EventHeap>> heapSpec = StateSpecs.value();

  @TimerId("next")
  private final TimerSpec nextTimerSpec = TimerSpecs.timer(TimeDomain.EVENT_TIME);

  @ProcessElement
  public void process(
      ProcessContext ctx,
      @StateId("heap") ValueState<EventHeap> heapState,
      @TimerId("next") Timer nextTimer) {
    EventHeap heap = firstNonNull(
        heapState.read(),
        EventHeap.createForKey(ctx.element().getKey()));
    heap.add(ctx.element().getValue());
    // Write the heap back; changes to the object read from state are not saved otherwise
    heapState.write(heap);
    // When the watermark reaches this time, no more elements
    // can show up that have earlier timestamps
    nextTimer.set(heap.nextTimestamp().plus(allowedLateness));
  }

  @OnTimer("next")
  public void onNextTimestamp(
      OnTimerContext ctx,
      @StateId("heap") ValueState<EventHeap> heapState,
      @TimerId("next") Timer nextTimer) {
    EventHeap heap = heapState.read();
    // If the timer at time t was delivered the watermark must
    // be strictly greater than t
    while (!heap.nextTimestamp().isAfter(ctx.timestamp())) {
      writeToExternalSystem(heap.pop());
    }
    // Persist the heap after popping the released elements
    heapState.write(heap);
    nextTimer.set(heap.nextTimestamp().plus(allowedLateness));
  }
}
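For reference, here is a rough idea of what the assumed EventHeap could look like. This is a minimal sketch in plain Java built on java.util.PriorityQueue. The class name EventHeapSketch, the Event fields, and the popUpTo method are hypothetical and differ from the interface used in the snippet above. It is not a Beam type, so to store it in a ValueState you would also need a coder for it.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical buffer that hands events back in timestamp order.
final class EventHeapSketch {

    // Hypothetical event: an event-time timestamp in millis plus a payload.
    static final class Event {
        final long timestampMillis;
        final String payload;

        Event(long timestampMillis, String payload) {
            this.timestampMillis = timestampMillis;
            this.payload = payload;
        }
    }

    // Min-heap ordered by event timestamp.
    private final PriorityQueue<Event> heap =
        new PriorityQueue<>(Comparator.comparingLong((Event e) -> e.timestampMillis));

    void add(Event e) {
        heap.add(e);
    }

    boolean isEmpty() {
        return heap.isEmpty();
    }

    // Removes and returns, in timestamp order, every buffered event whose
    // timestamp is at or before the cutoff. Later events stay buffered.
    List<Event> popUpTo(long cutoffMillis) {
        List<Event> released = new ArrayList<>();
        while (!heap.isEmpty() && heap.peek().timestampMillis <= cutoffMillis) {
            released.add(heap.poll());
        }
        return released;
    }

    public static void main(String[] args) {
        EventHeapSketch buffer = new EventHeapSketch();
        buffer.add(new Event(30, "c"));
        buffer.add(new Event(10, "a"));
        buffer.add(new Event(20, "b"));
        for (Event e : buffer.popUpTo(20)) {
            System.out.println(e.timestampMillis + " " + e.payload); // 10 a, then 20 b
        }
    }
}
```

Events with equal timestamps come out in no particular order with this comparator, so add a tie-breaker if your external system cares about that.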
This should hopefully get you started on the way towards whatever your underlying use case is.