Apache Flume file size - flume

I am using flume-ng and its HDFS sink. Now if I add some files to the given folder, they are stored in HDFS automatically. It is working fine; however, I notice that the files in HDFS are quite small: if I put a 1 GB file into the folder, it ends up in HDFS as several hundred small files. Can I make the files in HDFS bigger? How can I configure this in flume-conf.properties?

Did you try increasing hdfs.rollSize?
There is an open JIRA that will help with increasing the HDFS block size: https://issues.apache.org/jira/browse/FLUME-2003
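For reference, a minimal sketch of the relevant settings in flume-conf.properties, assuming an agent named agent1 and an HDFS sink named hdfs-sink (use your own component names and path). hdfs.rollSize is in bytes; setting hdfs.rollCount and hdfs.rollInterval to 0 disables count- and time-based rolling, so only the size threshold triggers a new file.
agent1.sinks.hdfs-sink.type = hdfs
# the path shown here is an assumption; keep your existing value
agent1.sinks.hdfs-sink.hdfs.path = hdfs://namenode:8020/flume/events
# roll at roughly 128 MB instead of the small default
agent1.sinks.hdfs-sink.hdfs.rollSize = 134217728
agent1.sinks.hdfs-sink.hdfs.rollCount = 0
agent1.sinks.hdfs-sink.hdfs.rollInterval = 0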

Related

Creating MBTiles file with varied levels of detail using existing OpenMapTiles docker tasks?

I'm working hard to get up to speed with OpenMapTiles. The quickstart.sh script usually runs to completion, so I've preferred it as a source of truth over the sometimes inconsistent documentation. Time to evolve.
What is the most efficient way to build an MBTiles file that contains, say, planet-level data for zooms 0-6 and bounded data for zooms 7-13, ideally for multiple bounded areas (e.g., a handful of metro areas)? This seems like a common use case during development. Can it be done with the existing Docker tools?
Did you try downloading an OSM file from http://download.geofabrik.de/index.html and placing it in the /data folder, as stated in QUICKSTART.md (https://github.com/openmaptiles/openmaptiles/blob/master/QUICKSTART.md)?
Placing the osm.pbf file in your /data folder and adjusting the .env and openmaptiles.yaml files to your preferred zoom should help you with the next step.
I'm not sure what you mean by the bounds.
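In case it helps, this is the kind of change meant by adjusting the .env. This is a sketch only: the variable names (MIN_ZOOM, MAX_ZOOM, BBOX) may differ between openmaptiles versions, and the bounding box below is just an illustrative metro-area value.
MIN_ZOOM=0
MAX_ZOOM=13
# illustrative bounding box (lon_min,lat_min,lon_max,lat_max); replace with your own area
BBOX=-74.3,40.5,-73.7,41.0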

Viewing a large production log file

I have a 22GB production.log file in my Ruby on Rails app. I want to browse/search the contents over SSH without downloading. Is this possible? Are there any tools?
22 GB is a very large file, so opening the whole thing at once with most tools would be risky for your server. I'd recommend splitting the file into multiple parts and searching each part. For example, this command splits your file into 1 GB chunks:
split -b 1GB very_large_file small_file
Also, you should set up logrotate on your server so the log file doesn't grow this large in the first place.
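If it helps, here is a minimal logrotate sketch for this case; the log path and retention values are assumptions to adapt, and copytruncate avoids restarting the Rails process at the cost of possibly losing a few lines during rotation.
/path/to/app/log/production.log {
    # rotate whenever the log grows past 1 GB
    size 1G
    rotate 14
    compress
    missingok
    notifempty
    # truncate in place so the running Rails process keeps its file handle
    copytruncate
}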

Small change in large file in Docker container produces huge layer

I am using Docker to have a versioned database in my local dev environment (e.g. to be able to snapshot/revert the DB state). I need this due to the nature of my work; I cannot use transactions to achieve what I want (one of the reasons being that some of the statements are DDL).
So I have a Docker container with one large file (a MySQL InnoDB data file).
If I change this file a little bit (like updating a row in a table) and then commit the container, a new layer is created, and the size of this layer is the size of this huge file, even if only a couple of bytes in the file changed.
I understand this happens because, to Docker, a file is an 'atomic' unit: if a file is modified, a copy of it is created in the new layer, and that layer is later included in the image.
Is there a way to change this behavior and make Docker store diffs at the sub-file level, e.g. if 10 bytes of a 10 GiB file changed, create a layer smaller than 10 GiB?
Maybe I can use some other storage driver? (Which one?)
I'm also not very tied to Docker, so I could even switch to rkt. The question is: do you think that would help? (Maybe its image format is different and can store diffs at the file-content level.)
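For what it's worth, here is a small sketch that reproduces the copy-up behavior described above on the default storage driver; the image/container names (bigfile, bigtest) and the 1 GiB dummy file are purely illustrative.
cat > Dockerfile <<'EOF'
FROM alpine:3.18
RUN dd if=/dev/zero of=/big.dat bs=1M count=1024
EOF
docker build -t bigfile:base .
# change a single byte of the large file inside a container, then commit it
docker run --name bigtest bigfile:base sh -c 'printf x | dd of=/big.dat bs=1 count=1 conv=notrunc'
docker commit bigtest bigfile:modified
# the new top layer is ~1 GiB even though only one byte changed
docker history bigfile:modified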

Knowing the end of file in logstash

Is there any way to identify whether Logstash has parsed all lines up to the bottom of the file? I am using Logstash to parse a STATIC file, so Logstash does not need to keep waiting/running after it has completed parsing the existing lines in the file.
If there is no such feature in Logstash, is there any workaround to achieve this without modifying the log file?
Logstash keeps a registry (the sincedb file) of the files it is processing (and has processed). You could compare the offset stored in that file to the actual size of the file. If they match, it's "done".
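A rough sketch of that comparison, assuming the file input uses its default sincedb location and a hypothetical log path (both depend on your Logstash version and on any sincedb_path you set); each sincedb line contains, roughly, the inode, device numbers, and the byte offset read so far.
# registry kept by the file input; location varies with version and sincedb_path
cat ~/.sincedb_*
# compare the offset column above with the file's actual size in bytes
stat -c %s /var/log/my_static_file.log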

Reading from compressed files in Dataflow

Is there a way (or any kind of hack) to read input data from compressed files?
My input consists of a few hundred files, which are produced gzip-compressed, and decompressing them beforehand is somewhat tedious.
Reading from compressed text sources is now supported in Dataflow (as of this commit). Specifically, files compressed with gzip and bzip2 can be read from by specifying the compression type:
TextIO.Read.from(myFileName).withCompressionType(TextIO.CompressionType.GZIP)
However, if the file has a .gz or .bz2 extension, you don't have to do anything: the default compression type is AUTO, which examines file extensions to determine the correct compression type for a file. This even works with globs, where the files matched by the glob may be a mix of .gz, .bz2, and uncompressed.
The slower performance with my workaround was most likely because Dataflow was putting most of the files in the same split, so they weren't being processed in parallel. You can try the following to speed things up.
Create a PCollection for each file by applying the Create transform multiple times (each time to a single file).
Use the Flatten transform to create a single PCollection containing all the files from PCollections representing individual files.
Apply the rest of your pipeline to this PCollection (see the sketch after these steps).
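A rough sketch of those steps against the old Dataflow Java SDK; p, compressedFileNames, and ReadAndDecompressFn are placeholders, the last one standing in for whatever DoFn you already use to open and decompress a single file.
import com.google.cloud.dataflow.sdk.transforms.Create;
import com.google.cloud.dataflow.sdk.transforms.Flatten;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.values.PCollection;
import com.google.cloud.dataflow.sdk.values.PCollectionList;
import java.util.ArrayList;
import java.util.List;

// p is your Pipeline; compressedFileNames is your list of input paths
List<PCollection<String>> perFile = new ArrayList<>();
for (String fileName : compressedFileNames) {
  // one single-element PCollection per file, so files are not lumped into one split
  perFile.add(p.apply(Create.of(fileName)));
}
PCollection<String> allFileNames =
    PCollectionList.of(perFile).apply(Flatten.<String>pCollections());
// ReadAndDecompressFn: your existing decompress-and-read DoFn, applied once per file name
PCollection<String> lines = allFileNames.apply(ParDo.of(new ReadAndDecompressFn()));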
I also found that for files residing in Google Cloud Storage, setting the content type and content encoding appears to "just work" without the need for a workaround.
Specifically, I run:
gsutil -m setmeta -h "Content-Encoding:gzip" -h "Content-Type:text/plain" <path>
I just noticed that specifying the compression type is now available in the latest version of the SDK (v0.3.150210). I've tested it, and was able to load my GZ files directly from GCS to BQ without any problems.
