How to determine the batchSize of the sinks in Flume? - flume

I am setting the properties of a Flume Agent and I am not sure what value should I use for the batchSize (number of events to batch together for send).
In my particular case I will use the console as a sink. As I understand the logger-sink is the type used in this case. But Flume documentation doesn't mention the batchSize paramenter for this kind of sink. Isn't it necessary to define a batchSize for logger-sinks?

Well, I found an answer for the question: Isn't it necessary to define a batchSize for logger-sinks?
https://flume.apache.org/FlumeUserGuide.html#logger-sink There is not batchSize, instead there is a parameter calle maxBytesToLog which defines the maximum number of bytes of the Event body to log (by default its value is 16). Here there is simple example I found of a Flume Agent which uses the console as sink:
node.sources = my-source
node.channels = my-channel
node.sinks = my-sink
# Since node 1 sink is avro-type, here we indicate avro as source type
node.sources.my-source.type = avro
node.sources.my-source.bind = 0.0.0.0
node.sources.my-source.port = 11112
node.sources.my-source.channels = my-channel
node.channels.my-channel.type = memory
node.channels.my-channel.capacity = 10000
node.channels.my-channel.transactionCapacity = 100
node.sinks.my-sink.type = logger
node.sinks.my-sink.channel = my-channel
node.sinks.my-sink.maxBytesToLog = 256
Source from: https://medium.com/#DCA/something-about-flume-3cb720ba00e8#.37zs23dnt
And about the main question How to determine the batchSize of the sink?
With regards to the hdfs batch size, the larger your batch size the better performance will be. However, keep in mind that if a transaction fails the entire transaction will be replayed which could have the implication of duplicate events downstream.
From:
https://cwiki.apache.org/confluence/display/FLUME/BatchSize,+ChannelCapacity+and+ChannelTransactionCapacity+Properties

Related

Load and merge many files from S3 using Dask

I have about 1m "result" files in S3 bucket which I want to process. Each result file should be merge with additional columns from an associated "context" file, which I have about 50k of (i.e. each context is associated with about 20 results)
Processing it serially is slow so I am using dask to parallelize some of the work.
In my serial code, I just load everything up-front and merge them, e.g.
contexts_map = {get_context_id(ctx_file): load_context(ctx_file) for ctx_file in ctx_files}
data = []
for result_file in result_files:
ctx_id, res_id = get_context_and_res_id(result_file)
ctx = contexts_map[ctx_id]
data.append(process_result(ctx))
df = pd.DataFrame(data)
Initially I thought to divide the data and process in batches using dask (i.e. run the above in parallel on several batches) but then I read about dask bag and dask dataframe from_delayed and thought to use it. What I have:
delayed_get_context = delayed(get_context)
# load the contexts
ctx_map = {}
for ctx_file in ctx_files:
ctx_id = get_context_id(ctx_file)
ctx_map[ctx_file] = delayed_get_context(ctx_item)
# process the contexts
delayed_get_context_stats = delayed(get_context_stats)
ctx_stat_map = {ctx_id: delayed_get_context_stats(ctx) for ctx_id, ctx in ctx_map}
# the main bag of result files to process
res_bag = db.from_sequence(res_items, npartitions=num_workers * 2)
# prepare a list of corresponding delayed per results
# the order in this list corresponds to order of res_bag
res_context_list = [
ctx_stat_map[get_context_and_res_id(item)[0]] for item in res_items
]
# then create a bag from that list
ctx_bag = db.from_sequence(res_context_list, npartitions=num_workers * 2)
# create delays for the results
delayed_extract = delayed(extract_stats)
# from what I understand, if one of the arguments is also a bug
# it is distributed in accordance to the "main" bag
results = res_bag.map(delayed_extract, ctx_stats=ctx_bag)
df = ddf.from_delayed(results)
df = df.compute()
df.to_csv("results.csv")
This create a computation graph similar to the following:
When I run this on a subset (as in the image above) it works ok. Running the code on 1m items, I don't see anything happen (maybe didn't wait enough for it to finish building the graph and moving things around?)
With that, does the code above makes sense? Should I have done it another way?
One of the things I am "afraid" of with the above implementation is that there's a lot of data movement.
I could potentially spend some time up-front to arrange context+results and then treat that as the "unit-of-work" and maybe get better results?
Any feedback here would be appreciated - is there a better approach?
And another question - what number of partitions I should use? I saw in the docs it will default to about 100, but is there some rule of thumb to use here?

Dask compute fails when using client, works when no client setup

I am trying to use the dask client to parallelize my compute. When I run df.compute() I get the correct output (though it is very slow), but when I run the same thing after setting up a client, I get the following error:
distributed.protocol.pickle - INFO - Failed to serialize <function part at 0x7fd5186ed730>. Exception: can't pickle _thread.RLock objects
Here is my code, in the first df.compute(), I get the expected result, in the second I do not.
#dask.delayed
def part(x):
lower, upper = x
q = "SELECT id,tfidf_vec,emb_vec FROM document_table"
lines=man.session.execute(q)
counter = lower
df = []
for line in lines:
df.append(line)
counter += 1
if counter == upper:
break
return pd.DataFrame(df)
parts = [part(x) for x in [[0,100000],[100000,200000]]]
df = dd.from_delayed(parts)
df.compute()
from dask.distributed import Client
client = Client('127.0.0.1:8786')
df.compute()
Your function contains a reference to man.session, which is part of the function closure. When you use the default scheduler, threads, the object can be shared between the threads that execute your code. When you use the distributed scheduler, the function must be serialised and sent to workers in difference process(es).
You should make a function which creates the session object on each invocation, as was suggested as an answer to your very similar question.

I have collection of futures which are result of persist on dask dataframe. How to do a delayed operation on them?

I have setup a scheduler and 4 worker nodes to do some processing on csv. size of the csv is just 300 mb.
df = dd.read_csv('/Downloads/tmpcrnin5ta',assume_missing=True)
df = df.groupby(['col_1','col_2']).agg('mean').reset_index()
df = client.persist(df)
def create_sep_futures(symbol,df):
symbol_df = copy.deepcopy(df[df['symbol' == symbol]])
return symbol_df
lazy_values = [delayed(create_sep_futures)(symbol, df) for symbol in st]
future = client.compute(lazy_values)
result = client.gather(future)
st list contains 1000 elements
when I do this, I get this error:
distributed.worker - WARNING - Compute Failed
Function: create_sep_futures
args: ('PHG', symbol col_3 col_2 \
0 A 1.451261e+09 23.512857
1 A 1.451866e+09 23.886857
2 A 1.452470e+09 25.080429
kwargs: {}
Exception: KeyError(False,)
My assumption is that workers should get full dataframe and query on it. But I think it just gets the block and tries to do it.
What is the workaround for it? Since dataframe chunks are already in workers memory. I don't want to move the dataframe to each worker.
Operations on dataframes, using the dataframe syntax and API, are lazy (delayed) by default, you need do nothing more.
First problem: your syntax is wrong df[df['symbol' == symbol]] => df[df['symbol'] == symbol]. That is the origin of the False key.
So the solution you are probably looking for:
future = client.compute(df[df['symbol'] == symbol])
If you do want to work on the chunks separately, you can look into df.map_partitions, which you use with a normal function and takes care of passing data or delayed/futures or df.to_delayed, which will give you a set of delayed objects which you can use with a delayed function.

Apache Kafka Streams Materializing KTables to a topic seems slow

I'm using kafka stream and I'm trying to materialize a KTable into a topic.
It works but it seems to be done every 30 secs or so.
How/When does Kafka Stream decides to materialize the current state of a KTable into a topic ?
Is there any way to shorten this time and to make it more "real-time" ?
Here is the actual code I'm using
// Stream of random ints: (1,1) -> (6,6) -> (3,3)
// one record every 500ms
KStream<Integer, Integer> kStream = builder.stream(Serdes.Integer(), Serdes.Integer(), RandomNumberProducer.TOPIC);
// grouping by key
KGroupedStream<Integer, Integer> byKey = kStream.groupByKey(Serdes.Integer(), Serdes.Integer());
// same behaviour with or without the TimeWindow
KTable<Windowed<Integer>, Long> count = byKey.count(TimeWindows.of(1000L),"total");
// same behaviour with only count.to(Serdes.Integer(), Serdes.Long(), RandomCountConsumer.TOPIC);
count.toStream().map((k,v) -> new KeyValue<>(k.key(), v)).to(Serdes.Integer(), Serdes.Long(), RandomCountConsumer.TOPIC);
This is controlled by commit.interval.ms, which defaults to 30s. More details here:
http://docs.confluent.io/current/streams/developer-guide.html
The semantics of caching is that data is flushed to the state store and forwarded to the next downstream processor node whenever the earliest of commit.interval.ms or cache.max.bytes.buffering (cache pressure) hits.
and here:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-63%3A+Unify+store+and+downstream+caching+in+streams

How to increase the processing rate of flume agent

I have a flume agent that ingests data into elasticsearch. The agent is using spoolDir source. There is another agent which writes the files into the spoolDir of the elasticsearch agent.
Over the time the files are increased and the difference between the processed files files and the unprocessed files increase.
I want to increase the number of events processed by the flume agent to speedup the ingesting process.
Here is the configuration of the flume agent.
agent04.sources = s1
agent04.channels = ch1
agent04.channels = memoryChannel
agent04.channels.memoryChannel.type = memory
agent04.channels.memoryChannel.capacity=100000
agent04.channels.memoryChannel.transactionCapacity=1000
agent04.sources.s1.channels = memoryChannel
agent04.sources.s1.type = spooldir
agent04.sources.s1.spoolDir = /DataCollection/Flume_Cleaner_Output/Json_Elastic/
agent04.sources.s1.deserializer.maxLineLength = 100000
agent04.sinks = elasticsearch
agent04.sinks.elasticsearch.channel = memoryChannel
agent04.sinks.elasticsearch.type=org.css.cssElasticsearchSink
agent04.sinks.elasticsearch.batchSize=400
agent04.sinks.elasticsearch.hostNames = elastic-node01.css.org
agent04.sinks.elasticsearch.indexName = all_collections
agent04.sinks.elasticsearch.indexType = live_tweets
agent04.sinks.elasticsearch.indexNameBuilder= org.css.sa.flume.elasticsearch.sink.indexNameBuilder.HeaderValueBasedIndexNameBuilder
agent04.sinks.elasticsearch.clusterName = css_rai_social
agent04.sinks.elasticsearch.serializer = org.jai.flume.sinks.elasticsearch.serializer.ElasticSearchJsonBodyEventSerializer
agent04.sinks.elasticsearch.cache_period_ms=90d
Why are you chaining two Flume agents using the spooldir? That'll be really slow and is a surprising configuration. You're incurring the cost of frequent fsyncs as each batch gets processed.
I recommend you chain them using the Avro Sink and Avro Source. I would also up the batch size to at least 1000. (Computers really like batches and Flume is setup to do that).

Resources