Convert column to bitmap/HLL during Kafka routine load to StarRocks

I intend to do count distinct in StarRocks and noticed that columns can be converted to BITMAP or HLL during loading. However, I can only find this transformation documented for stream load and broker load. Is there any way to convert a column to BITMAP or HLL during a Kafka routine load?

CREATE ROUTINE LOAD example_tbl1_ordertest1 ON example_tbl1
COLUMNS (c1, c2, temp, c3=to_bitmap(c3))
FROM KAFKA
(
"kafka_broker_list" ="<kafka_broker1_ip>:<kafka_broker1_port>",
"kafka_topic" = "ordertest1",
"kafka_partitions" ="0,1,2,3,4",
"property.kafka_default_offsets" = "OFFSET_BEGINNING"
);

Related

SuperCollider Error: Buffer UGen: no buffer data

I'm working through how to read sound files into a Buffer and then loop them. When I run the script to create a Buffer and read a sound file into it, it succeeds, but when I create a SynthDef using that buffer (the second line of code here), it gives me the error Buffer UGen: no buffer data. It's drawing on the same bufnum, so I'm not sure what's going on.
b = Buffer.read(s, Platform.resourceDir +/+ "sounds/testing.wav");
c= SynthDef(\loopbuffer, {arg start=0, end=10000; Out.ar(0,Pan2.ar(BufRd.ar(1, 0, Phasor.ar(0, BufRateScale.kr(b.bufnum), start, end),0.0)))}).play(s);
Platform.resourceDir ++ "/sounds/testing.wav"
The ++ here means no space is inserted when concatenating.
BufRd.ar(b.numChannels, b.bufnum)
The missing b.bufnum is what causes your error: as written, BufRd reads from bufnum 0 rather than from the buffer you loaded. The channels 0 through 3 are reserved for hardware in/outs.

Access a single element in large published array with Dask

Is there a faster way to retrieve only a single element from a large published array with Dask, without retrieving the entire array?
In the example below client.get_dataset('array1')[0] takes roughly the same time as client.get_dataset('array1').
import distributed
client = distributed.Client()
data = [1]*10000000
payload = {'array1': data}
client.publish_dataset(**payload)
one_element = client.get_dataset('array1')[0]
Note that anything you publish goes to the scheduler, not to the workers, so this is somewhat inefficient. Publish was intended to be used with Dask collections like dask.array.
Client 1
import dask.array as da
x = da.ones(10000000, chunks=(100000,)) # 1e7 size array cut into 1e5 size chunks
x = x.persist() # persist array on the workers of the cluster
client.publish_dataset(x=x) # store the metadata of x on the scheduler
Client 2
x = client.get_dataset('x') # get the lazy collection x
x[0].compute() # this selection happens on the worker, only the result comes down

Apache Kafka Streams Materializing KTables to a topic seems slow

I'm using Kafka Streams and I'm trying to materialize a KTable into a topic.
It works, but it seems to happen only every 30 seconds or so.
How/when does Kafka Streams decide to materialize the current state of a KTable into a topic?
Is there any way to shorten this interval and make it more "real-time"?
Here is the actual code I'm using:
// Stream of random ints: (1,1) -> (6,6) -> (3,3)
// one record every 500ms
KStream<Integer, Integer> kStream = builder.stream(Serdes.Integer(), Serdes.Integer(), RandomNumberProducer.TOPIC);
// grouping by key
KGroupedStream<Integer, Integer> byKey = kStream.groupByKey(Serdes.Integer(), Serdes.Integer());
// same behaviour with or without the TimeWindow
KTable<Windowed<Integer>, Long> count = byKey.count(TimeWindows.of(1000L),"total");
// same behaviour with only count.to(Serdes.Integer(), Serdes.Long(), RandomCountConsumer.TOPIC);
count.toStream().map((k,v) -> new KeyValue<>(k.key(), v)).to(Serdes.Integer(), Serdes.Long(), RandomCountConsumer.TOPIC);
This is controlled by commit.interval.ms, which defaults to 30s. More details here:
http://docs.confluent.io/current/streams/developer-guide.html
The semantics of caching is that data is flushed to the state store and forwarded to the next downstream processor node whenever the earliest of commit.interval.ms or cache.max.bytes.buffering (cache pressure) hits.
and here:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-63%3A+Unify+store+and+downstream+caching+in+streams
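For a concrete idea, here is a minimal sketch of how those two settings could be lowered when configuring the application; the application id, broker address, and the 100 ms value are illustrative assumptions rather than recommendations:

import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "random-count-app"); // placeholder application id
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
// Flush the cache and forward KTable updates every 100 ms instead of the 30 s default.
props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, "100");
// Or disable record caching entirely so every single update is forwarded downstream.
props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, "0");

KafkaStreams streams = new KafkaStreams(builder, props);
streams.start();

Note that setting the cache size to 0 trades throughput for latency: every update is forwarded instead of being deduplicated in the cache, so expect more records on the output topic.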

Joining two streams

Is it possible to join two separate unbounded PubsubIO PCollections using a key present in both of them? I'm trying to accomplish the task with something like:
Read(FirstStream) & Read(SecondStream) -> Flatten -> Generate key to use in joining -> Use session windowing to gather them together -> Group by key, then rewindow with fixed-size windows -> AvroIO write to disk using windowing.
EDIT:
Here is the pipeline code I created. I experience two problems:
Nothing gets written to disk.
The pipeline becomes really unstable: it randomly slows down processing of certain steps, especially the group-by, and cannot keep up with the ingestion rate even when I use 10 Dataflow workers.
I need to handle ~10,000 sessions a second. Each session consists of 1 or 2 events and then needs to be closed.
PubsubIO.Read<String> auctionFinishedReader = PubsubIO.readStrings().withTimestampAttribute(TIMESTAMP_ATTRIBUTE)
.fromTopic("projects/authentic-genre-152513/topics/auction_finished");
PubsubIO.Read<String> auctionAcceptedReader = PubsubIO.readStrings().withTimestampAttribute(TIMESTAMP_ATTRIBUTE)
.fromTopic("projects/authentic-genre-152513/topics/auction_accepted");
PCollection<String> auctionFinishedStream = p.apply("ReadAuctionFinished", auctionFinishedReader);
PCollection<String> auctionAcceptedStream = p.apply("ReadAuctionAccepted", auctionAcceptedReader);
PCollection<String> combinedEvents = PCollectionList.of(auctionFinishedStream)
.and(auctionAcceptedStream).apply(Flatten.pCollections());
PCollection<KV<String, String>> keyedAuctionFinishedStream = combinedEvents
.apply("AddKeysToAuctionFinished", WithKeys.of(new GenerateKeyForEvent()));
PCollection<KV<String, Iterable<String>>> sessions = keyedAuctionFinishedStream
.apply(Window.<KV<String, String>>into(Sessions.withGapDuration(Duration.standardMinutes(1)))
.withTimestampCombiner(TimestampCombiner.END_OF_WINDOW))
.apply(GroupByKey.create());
PCollection<SodaSession> values = sessions
.apply(ParDo.of(new DoFn<KV<String, Iterable<String>>, SodaSession> () {
@ProcessElement
public void processElement(ProcessContext c, BoundedWindow window) {
c.output(new SodaSession("auctionid", "stattedat"));
}
}));
PCollection<SodaSession> windowedEventStream = values
.apply("ApplyWindowing", Window.<SodaSession>into(FixedWindows.of(Duration.standardMinutes(2)))
.triggering(Repeatedly.forever(
AfterProcessingTime.pastFirstElementInPane()
.plusDelayOf(Duration.standardMinutes(1))
))
.withAllowedLateness(Duration.ZERO)
.discardingFiredPanes()
);
AvroIO.Write<SodaSession> avroWriter = AvroIO
.write(SodaSession.class)
.to("gs://storage/")
.withWindowedWrites()
.withFilenamePolicy(new EventsToGCS.PerWindowFiles("sessionsoda"))
.withNumShards(3);
windowedEventStream.apply("WriteToDisk", avroWriter);
I've found an efficient solution. Since one of my collections was disproportionately small compared to the other, I used a side input to speed up the grouping operation. Here is an overview of my solution:
Read both event streams.
Flatten them into a single PCollection.
Use a sliding window sized (closable session duration + maximum session length), sliding every closable session duration.
Partition the flattened collection back into the two original streams.
Create a PCollectionView from the smaller PCollection.
Join the two streams in a ParDo that takes the view created in the previous step as a side input (see the sketch after this list).
Write the sessions to disk.
It handles joining a 4,000 events/sec stream (the larger one) with a 60 events/sec stream on 1-2 Dataflow workers, versus ~15 workers when I used session windowing together with GroupByKey.
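To make the side-input join more concrete, here is a rough sketch in the Beam Java SDK (not the exact code from my pipeline); the String key/value types, the way matched values are concatenated, and the assumption that each key appears at most once per window in the smaller stream are all simplifications:

import java.util.Map;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;

public class SideInputJoin {
    // Both inputs are assumed to already be keyed and windowed identically
    // (e.g. with the sliding windows described in the steps above).
    public static PCollection<KV<String, String>> join(
            PCollection<KV<String, String>> largeStream,
            PCollection<KV<String, String>> smallStream) {

        // Materialize the smaller stream as a per-window map side input.
        // View.asMap() assumes each key occurs at most once per window;
        // use View.asMultimap() if keys can repeat.
        final PCollectionView<Map<String, String>> smallView =
                smallStream.apply("SmallStreamAsMap", View.asMap());

        // Look every element of the larger stream up against the side input.
        return largeStream.apply("JoinViaSideInput",
                ParDo.of(new DoFn<KV<String, String>, KV<String, String>>() {
                    @ProcessElement
                    public void processElement(ProcessContext c) {
                        Map<String, String> lookup = c.sideInput(smallView);
                        String match = lookup.get(c.element().getKey());
                        if (match != null) {
                            c.output(KV.of(c.element().getKey(),
                                    c.element().getValue() + "|" + match));
                        }
                    }
                }).withSideInputs(smallView));
    }
}

Whether asMap or asMultimap is appropriate depends on whether a key can repeat within a window of the smaller stream.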

serial data flow: How to ensure completion

I have a device that sends serial data over a USB-to-COM port to my program at various speeds and lengths.
Within the data there is a chunk of several thousand bytes that starts and ends with a special distinct code (0xFD 0xDD for start, 0xFE 0xEE for end).
Due to the stream's length, occasionally not all of the data is received in one piece.
What is the recommended way to combine all the bytes into one message BEFORE parsing it?
(I took care of the buffer size, but I have no control over the serial line quality, and cannot use hardware flow control with USB.)
Thanks
One possible way to accomplish this is to have something along these lines:
# variables
# buffer: byte buffer
# buffer_length: maximum number of bytes in the buffer
# new_char: char last read from the UART
# prev_char: second last char read from the UART
# n: index to the buffer
n := 0
new_char := 0

loop forever:
    prev_char := new_char
    new_char := receive_from_uart()

    # start marker
    if prev_char = 0xfd and new_char = 0xdd
        # set the index to the beginning of the buffer
        n := 0

    # end marker
    else if prev_char = 0xfe and new_char = 0xee
        # the frame is ready, do whatever you need to do with a complete message
        # the length of the payload is n-1 bytes
        handle_complete_message(buffer, n-1)

    # otherwise
    else
        if n < buffer_length - 1
            n := n + 1
            buffer[n] := new_char
A few tips/comments:
you do not necessarily need separate start and end markers (you can use the same marker for both purposes)
if you want to use two-byte markers, it is easier to have them share the same first byte
you need to make sure the marker combinations do not occur in your data stream
if you use escape codes to keep the markers out of your payload, it is convenient to handle them in the same code (a sketch follows after this list)
see HDLC asynchronous framing (simple to encode, simple to decode, and it takes care of the escaping)
handle_complete_message usually either copies the contents of buffer elsewhere or, if in a hurry, swaps another buffer in place of buffer
if your data frames do not have integrity checking, you should check whether the payload length equals buffer_length - 1, because that may indicate an overflow
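As a rough illustration of the escaping idea mentioned in the tips above (this is not part of the answer's pseudocode), here is a sketch of HDLC-style byte stuffing on the sending side; the 0x7E flag and 0x7D escape values are the classic HDLC choices and are assumptions here, and the receiver simply reverses the transformation before scanning for frame boundaries:

import java.io.ByteArrayOutputStream;

public class ByteStuffing {
    static final int FLAG = 0x7E;   // frame delimiter
    static final int ESCAPE = 0x7D; // escape byte; an escaped value is XORed with 0x20

    // Sender side: wrap a payload so the delimiter can never appear inside it.
    public static byte[] frame(byte[] payload) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(FLAG);
        for (byte b : payload) {
            int v = b & 0xFF;
            if (v == FLAG || v == ESCAPE) {
                out.write(ESCAPE);
                out.write(v ^ 0x20); // receiver XORs with 0x20 again to restore the byte
            } else {
                out.write(v);
            }
        }
        out.write(FLAG);
        return out.toByteArray();
    }
}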
After several tests, I came up with the following simple solution to my own question (for C#).
Shown is a minimal, simplified solution; length checking etc. can be added.
'start' and 'end' are string markers of any length.
public void comPort_DataReceived(object sender, SerialDataReceivedEventArgs e)
{
    SerialPort port = (SerialPort)sender;
    inData = port.ReadExisting();
    if (inData.Contains("start"))
    {
        // Loop to collect all message parts
        while (!inData.Contains("end"))
            inData += port.ReadExisting();
        // Complete by adding the last data chunk
        inData += port.ReadExisting();
    }
    // Use your collected message
    displayData(inData);
}
