Knowing the end of file in logstash - parsing

Is there any way to tell whether Logstash has parsed every line up to the end of a file? I am using Logstash to parse a STATIC file, so Logstash does not need to keep waiting/running after it has finished parsing the existing lines in the file.
If Logstash has no such feature, is there a workaround to achieve this without modifying the log file?

Logstash keeps a registry (the file input's sincedb file) of the files that it is processing and has processed. You could compare the offset stored in that file to the actual size of the log file. If they match, it's "done".
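A minimal sketch of that comparison, assuming each sincedb line starts with four whitespace-separated fields (inode, major device number, minor device number, byte offset); the layout can differ between versions of the file input, so check your own sincedb file first. The function name and both paths are hypothetical.

```python
import os


def is_fully_read(log_path, sincedb_path):
    """Return True if the offset recorded for log_path equals its size.

    Assumption: each sincedb line begins with
    "<inode> <major dev> <minor dev> <byte offset>".
    """
    st = os.stat(log_path)
    with open(sincedb_path) as f:
        for line in f:
            fields = line.split()
            if len(fields) < 4:
                continue  # skip lines that do not match the assumed layout
            inode, offset = fields[0], int(fields[3])
            if inode == str(st.st_ino):
                return offset == st.st_size
    return False  # the file has not been recorded yet
```

A wrapper script could poll this function and stop Logstash once it returns True.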

Related

Is there a way to make an XML file in RDB using Jena?

I used apache-jena-3.1.0 and JDBC to read XML files and save them to tables (nodes, prefixes, quads, and triples) that were pre-made in MariaDB.
I wonder if I can generate an XML file by reading MariaDB with Apache Jena.
I also wonder if I can save the stored data (nodes, prefixes, and triples) back to MariaDB in its original form (appropriate for an RDB).
I keep searching for a related jar or method.
Thank you.

Avoid reading the same file multiple times using Telegraf and file input plugin

I need to read csv files inside a folder. New csv files are generated every time a user submits a form. I'm using the "file" input plugin to read the data and send it to Influxdb. These steps are working fine.
The problem is that the same file is read multiple times, once every data collection interval. I was thinking of a solution where I could move each file that was read to a different folder, but I couldn't do that with Telegraf's "exec" output plugin.
PS: I can't change the way the csv files are generated.
Any ideas on how to avoid reading the same csv file multiple times?
As you discovered, the file input plugin reads entire files at each collection interval.
My suggestion is to use the directory monitor input plugin instead. It reads files in a directory, monitors the directory for new files, and parses the ones that have not been picked up yet. That plugin also has some configuration settings that make it easier to control when new files are read.
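To illustrate the "read once, then move it out of the way" idea from the question, here is a rough Python sketch. It is not Telegraf's implementation; the function name and directory arguments are hypothetical.

```python
import csv
import shutil
from pathlib import Path


def process_new_files(incoming_dir, finished_dir):
    """Parse each CSV in incoming_dir once, then move it to finished_dir."""
    incoming, finished = Path(incoming_dir), Path(finished_dir)
    finished.mkdir(parents=True, exist_ok=True)
    rows = []
    for path in sorted(incoming.glob("*.csv")):
        with path.open(newline="") as f:
            rows.extend(csv.DictReader(f))
        # Moving the file out of the watched folder is what prevents
        # it from being read again on the next pass.
        shutil.move(str(path), str(finished / path.name))
    return rows
```

Calling the function a second time returns no rows for files that were already handled, because they no longer live in the incoming folder.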
Another option is the tail input plugin, which tails a file and reads only new updates to that file as they arrive. However, I think the directory monitor is more likely what you are after for your scenario.
Thanks!

Viewing a large production log file

I have a 22GB production.log file in my Ruby on Rails app. I want to browse/search the contents over SSH without downloading. Is this possible? Are there any tools?
22GB is a very large file, so opening the whole file with any tool would be risky for your server. I'd recommend splitting the file into multiple parts and searching each part. For example, this command splits your file into chunks of 1GB:
split -b 1GB very_large_file small_file
Also, you should set up logrotate on your server to keep the log file from getting too big.
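For the search step itself, a small sketch in Python (assuming Python is available on the server) that reads one line at a time, so memory use stays flat whether it is pointed at a chunk produced by split or at the original file. The function name and the `max_hits` cap are illustrative choices, not part of any tool.

```python
def grep_file(path, needle, max_hits=20):
    """Yield (line_number, line) for lines containing needle.

    The file is read lazily, one line at a time, and the scan stops
    after max_hits matches.
    """
    hits = 0
    with open(path, "r", errors="replace") as f:
        for lineno, line in enumerate(f, start=1):
            if needle in line:
                yield lineno, line.rstrip("\n")
                hits += 1
                if hits >= max_hits:
                    return
```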

Are cloud dataflow job outputs transactional?

Assuming I don't know the status of the job that was supposed to generate some output files (in cloud store), can I assume that if some output files exist, they contain all of the job's output?
Or is it possible that only partial output is visible?
Thanks,
G
It is possible that only a subset of the files is visible, but the visible files are complete (cannot grow or change).
The filenames contain the total number of files (output-XXXXX-of-NNNNN), so once you have one file, you know how many more to expect.
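A short sketch of how that naming scheme could be used to check for completeness. It assumes shard indices start at 0 and that you already have the list of visible filenames; the function name is hypothetical.

```python
import re

# Matches the "-XXXXX-of-NNNNN" part of a sharded output filename.
SHARD_RE = re.compile(r"-(\d+)-of-(\d+)")


def missing_shards(filenames):
    """Return the sorted shard indices that are not visible yet.

    Returns None if no filename follows the sharded naming scheme.
    Assumption: shard indices run from 0 to total - 1.
    """
    total = None
    seen = set()
    for name in filenames:
        m = SHARD_RE.search(name)
        if not m:
            continue
        index, count = int(m.group(1)), int(m.group(2))
        if total is None:
            total = count
        elif total != count:
            raise ValueError("files disagree on the total shard count")
        seen.add(index)
    if total is None:
        return None
    return sorted(set(range(total)) - seen)
```

An empty list means every expected file is visible; a non-empty list names the shards still missing.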

Reading from compressed files in Dataflow

Is there a way (or any kind of hack) to read input data from compressed files?
My input consists of a few hundred files, which are produced compressed with gzip, and decompressing them is somewhat tedious.
Reading from compressed text sources is now supported in Dataflow (as of this commit). Specifically, files compressed with gzip and bzip2 can be read from by specifying the compression type:
TextIO.Read.from(myFileName).withCompressionType(TextIO.CompressionType.GZIP)
However, if the file has a .gz or .bz2 extension, you don't have to do anything: the default compression type is AUTO, which examines file extensions to determine the correct compression type for a file. This even works with globs, where the files that result from the glob may be a mix of .gz, .bz2, and uncompressed.
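The extension-based choice can be pictured with a small standalone sketch. This is plain Python using the standard library, not Dataflow's code, and the function name is hypothetical.

```python
import bz2
import gzip


def open_text_auto(path):
    """Open path as text, picking a decompressor from the file extension."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    if path.endswith(".bz2"):
        return bz2.open(path, "rt")
    return open(path, "r")  # no known extension: treat as uncompressed
```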
The slower performance with my workaround was most likely because Dataflow was putting most of the files in the same split, so they weren't being processed in parallel. You can try the following to speed things up.
Create a PCollection for each file by applying the Create transform multiple times (each time to a single file).
Use the Flatten transform to create a single PCollection containing all the files from PCollections representing individual files.
Apply your pipeline to this PCollection.
I also found that for files that reside in the cloud store, setting the content type and content encoding appears to "just work" without the need for a workaround.
Specifically, I run:
gsutil -m setmeta -h "Content-Encoding:gzip" -h "Content-Type:text/plain" <path>
I just noticed that specifying the compression type is now available in the latest version of the SDK (v0.3.150210). I've tested it, and was able to load my GZ files directly from GCS to BQ without any problems.
