Loading large files into HDFS using Flume (spool directory)

We copied a 150 MB CSV file into Flume's spool directory. When it is loaded into HDFS, the file gets split into smaller files of around 80 KB. Is there a way to load the file with Flume without it being split into smaller files? Small files generate extra metadata in the namenode, so we need to avoid them.
My flume-ng configuration looks like this:
# Initialize agent's source, channel and sink
agent.sources = TwitterExampleDir
agent.channels = memoryChannel
agent.sinks = flumeHDFS
# Setting the source to spool directory where the file exists
agent.sources.TwitterExampleDir.type = spooldir
agent.sources.TwitterExampleDir.spoolDir = /usr/local/flume/live
# Setting the channel to memory
agent.channels.memoryChannel.type = memory
# Max number of events stored in the memory channel
agent.channels.memoryChannel.capacity = 10000
# agent.channels.memoryChannel.batchSize = 15000
agent.channels.memoryChannel.transactionCapacity = 1000000
# Setting the sink to HDFS
agent.sinks.flumeHDFS.type = hdfs
agent.sinks.flumeHDFS.hdfs.path = hdfs://info3s7:54310/spool5
agent.sinks.flumeHDFS.hdfs.fileType = DataStream
# Write format can be text or writable
agent.sinks.flumeHDFS.hdfs.writeFormat = Text
# use a single csv file at a time
agent.sinks.flumeHDFS.hdfs.maxOpenFiles = 1
# never roll over based on event count or file size
agent.sinks.flumeHDFS.hdfs.rollCount = 0
agent.sinks.flumeHDFS.hdfs.rollSize = 0
# roll over based on a maximum time of 2000 seconds
agent.sinks.flumeHDFS.hdfs.rollInterval = 2000
agent.sinks.flumeHDFS.hdfs.batchSize = 1000000
# agent.sinks.flumeHDFS.hdfs.idleTimeout = 600
# Connect source and sink with channel
agent.sources.TwitterExampleDir.channels = memoryChannel
agent.sinks.flumeHDFS.channel = memoryChannel

What you want is this:
# rollover file based on maximum size of 10 MB
agent.sinks.flumeHDFS.hdfs.rollCount = 0
agent.sinks.flumeHDFS.hdfs.rollInterval = 0
agent.sinks.flumeHDFS.hdfs.rollSize = 10000000
agent.sinks.flumeHDFS.hdfs.batchSize = 10000
From the Flume documentation:
hdfs.rollSize: File size to trigger roll, in bytes (0: never roll based on file size)
In your example you use a rollInterval of 2000, which rolls the file over after 2000 seconds, resulting in small files.
Also note that batchSize reflects the number of events before the file is flushed to HDFS, not necessarily the number of events before the file is closed and a new one created. You'll want to set it to a value small enough not to time out while writing a large file, but large enough to avoid the overhead of many small requests to HDFS.
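For reference, here is how the whole sink block would read with those roll settings merged into your original configuration (a sketch; all values are taken from above):
agent.sinks.flumeHDFS.type = hdfs
agent.sinks.flumeHDFS.hdfs.path = hdfs://info3s7:54310/spool5
agent.sinks.flumeHDFS.hdfs.fileType = DataStream
agent.sinks.flumeHDFS.hdfs.writeFormat = Text
# roll over only once the file reaches ~10 MB
agent.sinks.flumeHDFS.hdfs.rollCount = 0
agent.sinks.flumeHDFS.hdfs.rollInterval = 0
agent.sinks.flumeHDFS.hdfs.rollSize = 10000000
agent.sinks.flumeHDFS.hdfs.batchSize = 10000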

Related

How to limit number of lines per file written using FileIO

Is there a way to limit the number of lines in each written shard using TextIO, or maybe FileIO?
Example:
Read rows from BigQuery (a batch job; the result is 19500 rows, for example).
Make some transformations.
Write files to Google Cloud Storage (19 files, each limited to 1000 records; one file has 500 records).
Cloud Function is triggered to make a POST request to an external API for each file in GCS.
Here is what I'm trying so far, but it doesn't work (trying to limit each file to 1000 rows):
BQ_DATA = (p
           | 'read_bq_view' >> beam.io.Read(
               beam.io.BigQuerySource(query=query, use_standard_sql=True))
           | beam.Map(json.dumps))

(BQ_DATA
 | beam.WindowInto(GlobalWindows(),
                   Repeatedly(trigger=AfterCount(1000)),
                   accumulation_mode=AccumulationMode.DISCARDING)
 | WriteToFiles(path='fileio', destination="csv"))
Am I conceptually wrong or is there any other way to implement this?
You can implement the write-to-GCS step inside a ParDo and limit the number of elements to include in a "batch" like this:
from apache_beam.io import filesystems

class WriteToGcsWithRowLimit(beam.DoFn):
    def __init__(self, row_size=1000):
        self.row_size = row_size
        self.rows = []

    def process(self, element):
        # buffer elements and flush once a full batch has accumulated
        self.rows.append(element)
        if len(self.rows) >= self.row_size:
            self._write_file()

    def finish_bundle(self):
        # flush whatever is left over from the last incomplete batch
        if len(self.rows) > 0:
            self._write_file()

    def _write_file(self):
        from time import time
        new_file = 'gs://bucket/file-{}.csv'.format(time())
        writer = filesystems.FileSystems.create(path=new_file)
        writer.write(self.rows)  # may need to format
        self.rows = []
        writer.close()

BQ_DATA | beam.ParDo(WriteToGcsWithRowLimit())
Note that, before Edit 1, this would not create any files with fewer than 1000 rows; the finish_bundle method above now flushes the remainder.
(Edit 1: handle the remainders)
(Edit 2: stop using counters in file names, as files would be overwritten)
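On the "may need to format" comment: FileSystems.create returns a writer that expects bytes, so a possible _write_file (a sketch, assuming the elements are the JSON strings produced by beam.Map(json.dumps); the bucket name is a placeholder):
    def _write_file(self):
        from time import time
        new_file = 'gs://bucket/file-{}.csv'.format(time())
        writer = filesystems.FileSystems.create(path=new_file)
        # join the buffered JSON strings into newline-delimited text
        # and encode, since the writer expects bytes
        writer.write('\n'.join(self.rows).encode('utf-8'))
        self.rows = []
        writer.close()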

Access checksum of ZFS dataset via CLI

Is it possible to read/access the checksum of a ZFS dataset? I want to access it to validate that the dataset didn't change between boots.
Reading https://en.wikipedia.org/wiki/ZFS#ZFS_data_integrity: is the top checksum of ZFS's Merkle-tree-like checksumming scheme accessible from userspace?
There's a tool called zdb (mainly intended for developers) which can do this. It's hard to use and its output format is not always backwards compatible :-)
However, if all you want is to make sure that a filesystem hasn't changed, you can use snapshots for this purpose. First, create a snapshot at the point you want to compare to later on with zfs snapshot <pool>/<fs>@<before-reboot-snap>. Then there are two different ways to compare the filesystem to that snapshot later:
1. After reboot, run zfs diff <pool>/<fs>@<before-reboot-snap> <pool>/<fs>. This will show you a list of "diffs" between the snapshot and the current filesystem:
# ls /tank/hello
file1 file2 file3 file4 file5
# zfs snapshot tank/hello@snap
# zfs diff tank/hello@snap tank/hello
# touch /tank/hello/file6
# zfs diff tank/hello@snap tank/hello
M /tank/hello/
+ /tank/hello/file6
# rm /tank/hello/file6
# zfs diff tank/hello@snap tank/hello
M /tank/hello/
Note that even after I deleted the new file, the directory it lived in is still marked as modified.
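Since zfs diff prints nothing when the filesystem still matches the snapshot (as in the first diff above), this is easy to script; a minimal sketch, reusing the names from the example:
# [ -z "$(zfs diff tank/hello@snap tank/hello)" ] && echo unchanged || echo changed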
2. Take another snapshot after the reboot, and then use zfs send -i @<before-reboot-snap> <pool>/<fs>@<after-reboot-snap> to create a stream of all the changes that happened between those snapshots, and analyze it with another tool called zstreamdump:
zfs send -i @snap tank/hello@snap2 | zstreamdump
BEGIN record
hdrtype = 1
features = 4
magic = 2f5bacbac
creation_time = 59036f98
type = 2
flags = 0x4
toguid = 2f080aca53bff68e
fromguid = 66a1da82cd5f1571
toname = tank/hello@snap2
END checksum = 91043406e5/38f3c4043049b/ed0867661876670/1e265bea2b6c3315
SUMMARY:
Total DRR_BEGIN records = 1
Total DRR_END records = 1
Total DRR_OBJECT records = 12
Total DRR_FREEOBJECTS records = 5
Total DRR_WRITE records = 1
Total DRR_WRITE_BYREF records = 0
Total DRR_WRITE_EMBEDDED records = 0
Total DRR_FREE records = 17
Total DRR_SPILL records = 0
Total records = 37
Total write size = 512 (0x200)
Total stream length = 13232 (0x33b0)
The example above shows that there have been a bunch of diffs -- anything like WRITE, FREE, OBJECT, or FREEOBJECTS indicates a change from the original snapshot.

How to determine the batchSize of the sinks in Flume?

I am setting the properties of a Flume agent and I am not sure what value I should use for the batchSize (the number of events to batch together per send).
In my particular case I will use the console as a sink. As I understand it, the logger sink is the type used in this case, but the Flume documentation doesn't mention a batchSize parameter for this kind of sink. Isn't it necessary to define a batchSize for logger sinks?
Well, I found an answer to the question "Isn't it necessary to define a batchSize for logger-sinks?"
According to https://flume.apache.org/FlumeUserGuide.html#logger-sink, there is no batchSize; instead there is a parameter called maxBytesToLog, which defines the maximum number of bytes of the event body to log (its default value is 16). Here is a simple example I found of a Flume agent which uses the console as a sink:
node.sources = my-source
node.channels = my-channel
node.sinks = my-sink
# Since node 1 sink is avro-type, here we indicate avro as source type
node.sources.my-source.type = avro
node.sources.my-source.bind = 0.0.0.0
node.sources.my-source.port = 11112
node.sources.my-source.channels = my-channel
node.channels.my-channel.type = memory
node.channels.my-channel.capacity = 10000
node.channels.my-channel.transactionCapacity = 100
node.sinks.my-sink.type = logger
node.sinks.my-sink.channel = my-channel
node.sinks.my-sink.maxBytesToLog = 256
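To actually see the logger sink's output on the console, the agent is typically started with the root logger directed to the console (standard flume-ng invocation; the config file name is a placeholder):
flume-ng agent --conf conf --conf-file example.conf --name node -Dflume.root.logger=INFO,console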
Source: https://medium.com/@DCA/something-about-flume-3cb720ba00e8#.37zs23dnt
And about the main question, how to determine the batchSize of the sink:
"With regards to the hdfs batch size, the larger your batch size, the better performance will be. However, keep in mind that if a transaction fails, the entire transaction will be replayed, which could have the implication of duplicate events downstream."
From:
https://cwiki.apache.org/confluence/display/FLUME/BatchSize,+ChannelCapacity+and+ChannelTransactionCapacity+Properties
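One related constraint worth showing in config form (a sketch; my-hdfs-sink is a hypothetical name): each sink batch is taken from the channel in a single transaction, so the channel's transactionCapacity must be at least as large as the sink's batchSize:
node.channels.my-channel.transactionCapacity = 1000
node.sinks.my-hdfs-sink.type = hdfs
node.sinks.my-hdfs-sink.channel = my-channel
node.sinks.my-hdfs-sink.hdfs.batchSize = 1000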

How do I give the channel directories (checkpoint & data dir) a date-based name dynamically?

I use a file channel as a backup in Flume, without any sink, and it's working correctly. Below is my working configuration, but how can I set the directory or file name dynamically? (I want to name it by date: when the date changes, a new directory should be created dynamically, and the previous one should remain as a backup.)
# Name the components on this agent
a1.sources = r1
a1.channels =c1
# Describe/configure the source r1
a1.sources.r1.type = http
a1.sources.r1.port = 40441
a1.sources.r1.bind = X.X.X.X
a1.sources.r1.channels = c1
# Use a channel which buffers events in file
a1.channels.c1.type = file
a1.channels.c1.dataDirs = /data/disk11/flume/Test/dataDirs{%y%m%d}
a1.channels.c1.checkpointDir = /data/disk11/flume/Test/checkpointDir{%y%m%d}
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

How to increase the processing rate of a Flume agent

I have a Flume agent that ingests data into Elasticsearch. The agent uses a spooldir source. Another agent writes files into the spool directory of the Elasticsearch agent.
Over time, the files accumulate and the gap between processed and unprocessed files grows.
I want to increase the number of events processed by the Flume agent to speed up the ingestion process.
Here is the configuration of the Flume agent:
agent04.sources = s1
agent04.channels = ch1
agent04.channels = memoryChannel
agent04.channels.memoryChannel.type = memory
agent04.channels.memoryChannel.capacity=100000
agent04.channels.memoryChannel.transactionCapacity=1000
agent04.sources.s1.channels = memoryChannel
agent04.sources.s1.type = spooldir
agent04.sources.s1.spoolDir = /DataCollection/Flume_Cleaner_Output/Json_Elastic/
agent04.sources.s1.deserializer.maxLineLength = 100000
agent04.sinks = elasticsearch
agent04.sinks.elasticsearch.channel = memoryChannel
agent04.sinks.elasticsearch.type=org.css.cssElasticsearchSink
agent04.sinks.elasticsearch.batchSize=400
agent04.sinks.elasticsearch.hostNames = elastic-node01.css.org
agent04.sinks.elasticsearch.indexName = all_collections
agent04.sinks.elasticsearch.indexType = live_tweets
agent04.sinks.elasticsearch.indexNameBuilder= org.css.sa.flume.elasticsearch.sink.indexNameBuilder.HeaderValueBasedIndexNameBuilder
agent04.sinks.elasticsearch.clusterName = css_rai_social
agent04.sinks.elasticsearch.serializer = org.jai.flume.sinks.elasticsearch.serializer.ElasticSearchJsonBodyEventSerializer
agent04.sinks.elasticsearch.cache_period_ms=90d
Why are you chaining two Flume agents using the spooldir? That'll be really slow and is a surprising configuration: you're incurring the cost of frequent fsyncs as each batch gets processed.
I recommend you chain them using an Avro sink and an Avro source. I would also raise the batch size to at least 1000. (Computers really like batches, and Flume is set up to do that.)
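A sketch of that Avro hand-off (agent names, host, and port are placeholders):
# upstream agent: replace the spooldir hand-off with an Avro sink
agent01.sinks.avroSink.type = avro
agent01.sinks.avroSink.channel = memoryChannel
agent01.sinks.avroSink.hostname = elastic-agent-host
agent01.sinks.avroSink.port = 4545
agent01.sinks.avroSink.batch-size = 1000
# downstream agent: receive events over Avro instead of polling a spool directory
agent04.sources.s1.type = avro
agent04.sources.s1.bind = 0.0.0.0
agent04.sources.s1.port = 4545
agent04.sources.s1.channels = memoryChannel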
