Dataflow 2.1.0 streaming application is not cleaning temp folders - google-cloud-dataflow

We have a Dataflow application which reads from Pub/Sub, windows the data into fixed 1-minute windows, and writes the raw messages to GCS with 10 shards. Our application has been running for 10 days now, and it has created a .temp-beam-2017**** folder. There are about 6,200 files under it, and the count is growing every day.
My understanding is that Dataflow moves the temp files to the specified output folder after the write is complete.
Could you please suggest what can be done in this case? Each of these files is about 100 MB.
inputCollection
    .apply("Windowing",
        Window.<String>into(FixedWindows.of(ONE_MINUTE))
            .triggering(AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(ONE_MINUTE))
            .withAllowedLateness(ONE_HOUR)
            .discardingFiredPanes())
    // Writing to GCS
    .apply(TextIO.write()
        .withWindowedWrites()
        .withNumShards(10)
        .to(options.getOutputPath())
        .withFilenamePolicy(new WindowedFileNames(options.getOutputPath())));

The discrepancy between 14400 and 13900 is most likely because your pipeline didn't get any data whose event time falls into a particular window and shard (presumably 10 shards × 1,440 one-minute windows per day = 14,400 expected files). When writing a windowed collection, we don't create empty files for "missing" windows, because in general it is not possible to know which windows are "missing": in theory it is pretty clear for fixed or sliding windows, but not for custom windowing functions, sessions, etc. Moreover, the assignment of elements to shards is random, so it is possible that very little data arrived for a particular window, and then there's a pretty good chance that some of the 10 shards didn't get any of it.
As for why the temporary files are being left over: it seems that the pipeline is occasionally seeing exceptions when writing the files to GCS. The leftover files are "zombies" from those failed attempts to write data. We currently don't do a good job of cleaning up such files automatically in streaming mode (in batch, it is safe to delete the entire temporary directory when the pipeline is done, but in streaming we can't do that, and we delete only the individual files being renamed to their final location), but it should be safe for you to delete old temp files in that directory.
I've filed https://issues.apache.org/jira/browse/BEAM-3145 to improve the latter.
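If you want to automate that cleanup rather than do it by hand, one option is a small job against the Cloud Storage client library that deletes temp objects older than a cutoff. This is only a sketch; the bucket name, prefix, and 7-day cutoff below are placeholders, not anything the pipeline itself defines.

import com.google.cloud.storage.Blob;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

public class CleanupBeamTempFiles {
  public static void main(String[] args) {
    String bucket = "my-output-bucket";          // placeholder bucket
    String tempPrefix = "output/.temp-beam-";    // placeholder prefix under your output path
    long cutoff = System.currentTimeMillis() - 7L * 24 * 60 * 60 * 1000;  // older than 7 days

    Storage storage = StorageOptions.getDefaultInstance().getService();
    for (Blob blob : storage.list(bucket, Storage.BlobListOption.prefix(tempPrefix)).iterateAll()) {
      // Delete only temp objects that have not been updated since the cutoff
      if (blob.getUpdateTime() != null && blob.getUpdateTime() < cutoff) {
        blob.delete();
      }
    }
  }
}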

Related

Avoid reading the same file multiple times using Telegraf and file input plugin

I need to read CSV files inside a folder. New CSV files are generated every time a user submits a form. I'm using the "file" input plugin to read the data and send it to InfluxDB. These steps are working fine.
The problem is that the same file is read multiple times every data collection interval. I was thinking of a solution where I could move each file that has been read to a different folder, but I couldn't do that with Telegraf's "exec" output plugin.
PS: I can't change the way the CSV files are generated.
Any ideas on how to avoid reading the same CSV file multiple times?
As you discovered, the file input plugin reads entire files at each collection interval.
My suggestion is to use the directory monitor input plugin instead. It reads files in a directory, monitors the directory for new files, and only parses the ones that have not been picked up yet. There are also configuration settings in that plugin that make it easier to control when new files are read.
Another option is the tail input plugin, which tails a file and only reads new updates as they arrive. However, I think the directory monitor is more likely what you are after for your scenario.
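For reference, a minimal directory monitor configuration might look roughly like the sketch below; the directory paths and the CSV parser options are placeholders to adapt to your setup.

[[inputs.directory_monitor]]
  ## Directory to watch for new CSV files (placeholder path)
  directory = "/data/csv_incoming"
  ## Files are moved here once they have been parsed, so they are not read again
  finished_directory = "/data/csv_done"
  ## Parse each file as CSV (adjust to your file layout)
  data_format = "csv"
  csv_header_row_count = 1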
Thanks!

Sequential Processing of Files in IIB v9

Environment: IIB v9 broker on Windows; the SFTP server is also on Windows.
We have a requirement to process a batch of files generated by a backend system in sequential order (i.e. FIFO). A batch can have multiple files.
All the files are placed in the IIB source directory, from where the FileInput node is polling (using the move command).
I want to know whether the FileInput node is capable of picking up files in the order they were created by the backend system.
Thanks,
Below is an extract from the help contents, which says that the FileInput node by default reads files in the order they were created (the oldest are picked up first).
How the broker reads a file at the start of a flow
The FileInput node processes messages that are read from files. The FileInput node searches a specified input directory (in the file system that is attached to the broker) for files that match specified criteria. The node can also recursively search the input directory's subdirectories.
Files that meet the criteria are processed in age order, that is, the oldest files are processed first regardless of where they appear in the directory structure.
https://www.ibm.com/support/knowledgecenter/en/SSMKHH_9.0.0/com.ibm.etools.mft.doc/ac55280_.htm
Yes, you can loop in your message flow and set the file path of the file you want to load, one at a time. Feel free to ask for more information if needed.
As for the order, I would have to check (you can still call a Java compute node that lists the files in the order you want; see the sketch below).
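For illustration only, a small helper like the one below (directory path supplied by the caller, error handling omitted) could be called from a JavaCompute node to list the files oldest-first:

import java.io.File;
import java.util.Arrays;
import java.util.Comparator;

public class OldestFirstLister {
    // Returns the files in the given directory sorted by last-modified time (oldest first).
    public static File[] listOldestFirst(String dirPath) {
        File[] files = new File(dirPath).listFiles();
        if (files == null) {
            return new File[0];
        }
        Arrays.sort(files, new Comparator<File>() {
            public int compare(File a, File b) {
                return Long.compare(a.lastModified(), b.lastModified());
            }
        });
        return files;
    }
}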
Regards

File operations conflicts

I'm writing a program which continuously looks for new files in a directory. After it extracts data from each file and does some processing on it, the files are moved to another directory containing all the scanned files.
Imagine I'm copying a new file into the scanned directory while my program is running. Can a file which has not finished copying be processed (and thus produce unforeseen results), or is it locked by the system?
Now, imagine two instances of the program are running on two different computers, continuously scanning the same folder. What can happen if both instances try to move the same file?
Thank you for your help.
I have a project that does much the same thing. Another application is receiving data from a feed and writing files to a folder. My application is processing those files by opening them, acting on them in some way, writing them to another folder, then deleting them.
The strategy I used in the application that does the processing and deleting is to simply open them like this:
TFileStream.Create(AFileName, fmOpenRead OR fmShareDenyWrite);
If the file being opened is still being written by another process, the above will fail, and the file can likely be opened successfully on a subsequent iteration.
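As a rough sketch of that idea (the helper name and retry policy are mine, not part of the original code), wrapping the open in a try/except lets the scanner skip files that are still being written and come back to them on the next pass:

// uses Classes, SysUtils
function TryOpenForProcessing(const AFileName: string; out AStream: TFileStream): Boolean;
begin
  Result := False;
  AStream := nil;
  try
    // fmShareDenyWrite fails while another process still has the file open for writing
    AStream := TFileStream.Create(AFileName, fmOpenRead or fmShareDenyWrite);
    Result := True;
  except
    on EFOpenError do
      ; // still being copied; try again on the next scan
  end;
end;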

Why am I sometimes getting files filled with zeros at their end after being downloaded?

I'm developing a download manager using Indy and Delphi XE (the application uses multithreading to open several connections to the server). Everything works fine, but sometimes the final downloaded file is broken, and when I check the downloaded temp files I see that 2 or 3 of them are filled with zeros at the end. (Each temp file is the download result of one connection.)
The larger the file, the more broken temp files I get as a result.
For example, in one of the temp files, which was 65,536,000 bytes, only the range 0-34,359,426 was valid, and from 34,359,427 to 65,535,999 it was full of zeros. If I delete those zeros, the application will automatically download the missing segments, and the result (assuming the problem doesn't happen again) is a healthy downloaded file.
I want to get rid of those zeros at the end of the temp files without losing download speed.
P.S. I'm using a TFileStream, passing it directly to TIdHTTP, and downloading the files with the GET method.
Additional info: I handle the OnWork event, which assigns AWorkCount to a public Int64 variable. Each time a file is downloaded, the downloaded file size (that Int64 variable) is logged to a text file, and according to the log the file has been downloaded completely (including those zero bytes).
Make sure the server actually supports downloading byte ranges before you request a range to download. If the server does not support ranges, a requested range will be ignored and the entire file will be sent instead. If you are not already doing so, you should be using TIdHTTP.Head() to test for range support before calling TIdHTTP.Get(). You also need to do this anyway to detect whether the remote file has been altered since the last time you downloaded it. Any decent download manager needs to be able to handle things like that.
Also keep in mind that if TIdHTTP knows up front how many bytes are being transferred, it will pre-allocate the size of the destination TStream before downloading data into it. This is to speed up the transfer and optimize disk I/O when using a TFileStream. So you should NOT use a TFileStream accessing the same file as the destination for multiple simultaneous downloads, even if they are writing to different areas of the file. Multiple pre-allocating TFileStream objects will likely trample over each other trying to set the file size to different values. If you need to download a file in multiple pieces simultaneously, then either:
1) download each piece to a separate file and copy them into the final file as needed once you have all of the pieces that you need.
2) use a custom TStream class, or Indy's TIdEventStream class, to manage the file I/O yourself so you can ignore TIdHTTP's pre-allocation attempts and ensure that multiple file I/O operations do not overlap each other incorrectly.
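By way of illustration only (option 1 above, with placeholder URL, segment bounds, and temp file name; a sketch rather than a drop-in implementation), a per-segment download could look roughly like this:

// uses Classes, SysUtils, IdHTTP
procedure DownloadSegment(const URL, TempFileName: string; SegmentStart, SegmentEnd: Int64);
var
  HTTP: TIdHTTP;
  Segment: TFileStream;
begin
  HTTP := TIdHTTP.Create(nil);
  try
    HTTP.Head(URL);
    // Only request a byte range if the server advertises support for it
    if SameText(HTTP.Response.RawHeaders.Values['Accept-Ranges'], 'bytes') then
    begin
      with HTTP.Request.Ranges.Add do
      begin
        StartPos := SegmentStart;
        EndPos := SegmentEnd;
      end;
      Segment := TFileStream.Create(TempFileName, fmCreate);
      try
        HTTP.Get(URL, Segment);  // each segment writes to its own TFileStream
      finally
        Segment.Free;
      end;
    end;
  finally
    HTTP.Free;
  end;
end;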

Heroku: Serving Large Dynamically-Generated Assets Without a Local Filesystem

I have a question about hosting large dynamically-generated assets and Heroku.
My app will offer bulk download of a subset of its underlying data, which will consist of a large file (>100 MB) generated once every 24 hours. If I were running on a server, I'd just write the file into the public directory.
But as I understand it, this is not possible with Heroku. The /tmp directory can be written to, but the guaranteed lifetime of files there seems to be defined in terms of one request-response cycle, not a background job.
I'd like to use S3 to host the download file. The S3 gem does support streaming uploads, but only for files that already exist on the local filesystem, and it looks like the content size needs to be known up front, which won't be possible in my case.
So this looks like a catch-22: I'm trying to avoid building a gigantic string in memory when uploading to S3, but S3 only supports streaming uploads for files that already exist on the local filesystem.
Given a Rails app in which I can't write to the local filesystem, how do I serve a large file that's generated daily without creating a large string in memory?
${RAILS_ROOT}/tmp (not /tmp; it's in your app's directory) lasts for the duration of your process. If you're running a background DJ, the files in tmp will last for the duration of that process.
Actually, the files will last longer; the reason we say you can't guarantee availability is that tmp isn't shared across servers, and each job/process can run on a different server depending on the cloud load. You also need to make sure you delete your files when you're done with them, as part of the job.
-Another Heroku employee
Rich,
Have you tried writing the file to ./tmp then streaming the file to S3?
-Blake Mizerany (Heroku)
