Are cloud dataflow job outputs transactional? - google-cloud-dataflow

Assuming I don't know the status of the job that was supposed to generate some output files (in Cloud Storage), can I assume that if some output files exist, they contain all of the job's output?
Or is it possible that only partial output is visible?
Thanks,
G

It is possible that only a subset of the files is visible, but the visible files are complete (cannot grow or change).
The filenames contain the total number of files (output-XXXXX-of-NNNNN), so once you have one file, you know how many more to expect.
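As a rough illustration of that check, here is a minimal Python sketch that takes a listing of visible file names and reports which shards are still missing. The function name `missing_shards` is made up for this example, and the sketch assumes shard indices start at 0 and that a file extension may follow the shard count; it is not part of any Dataflow API.

```python
import re

# Matches the "-XXXXX-of-NNNNN" part of a shard name, e.g. "output-00003-of-00010".
SHARD_RE = re.compile(r"-(\d+)-of-(\d+)")


def missing_shards(filenames):
    """Return the sorted shard indices that are not visible yet.

    Assumes 0-based shard indices. Raises ValueError if no name matches
    the shard pattern or if the names disagree about the total count.
    """
    seen = set()
    totals = set()
    for name in filenames:
        match = SHARD_RE.search(name)
        if match is None:
            continue  # ignore files that are not shards
        seen.add(int(match.group(1)))
        totals.add(int(match.group(2)))
    if len(totals) != 1:
        raise ValueError(f"expected exactly one shard count, found {sorted(totals)}")
    total = totals.pop()
    return sorted(set(range(total)) - seen)
```

An empty result means every expected file is visible; a non-empty result lists the shards that have not appeared yet.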

Related

Avoid reading the same file multiple times using Telegraf and file input plugin

I need to read csv files inside a folder. New csv files are generated every time a user submits a form. I'm using the "file" input plugin to read the data and send it to Influxdb. These steps are working fine.
The problem is that the same file is read again at every data collection interval. I was thinking of a solution where I could move each file that was read to a different folder, but I couldn't do that with Telegraf's "exec" output plugin.
PS: I can't change the way the csv files are generated.
Any ideas on how to avoid reading the same csv file multiple times?
As you discovered, the file input plugin reads entire files at each collection interval.
My suggestion is to use the directory monitor input plugin instead. It reads the files in a directory, monitors the directory for new files, and parses only the ones that have not been picked up yet. That plugin also has some configuration settings that make it easier to control when new files are read.
Another option is the tail input plugin, which tails a file and reads only new updates to that file as they arrive. However, I think the directory monitor is more likely what you are after for your scenario.
Thanks!

Artifactory docker images without manifests

We have a number of broken docker image uploads in Artifactory. It's quite difficult to clean these up, since the package search feature does not find these image tags as packages. In the UI, the only way to remove these without search is one tag at a time. I'm curious whether anyone else has found a solution for this. Ideally, there would be some AQL or other method to identify and remove any folder in a docker repo that does not contain a manifest file.
You can try creating an AQL query. AQL can search for artifacts based on their properties, which should help you clean up the way you want. https://www.jfrog.com/confluence/display/RTF/Artifactory+Query+Language
I don't think you can trap this with a single AQL, but here is an idea that uses 2 AQLs:
1. Prepare a list of all paths that contain a manifest.json file.
2. Prepare a list of all paths that contain sha256__* files (you will need to make it unique, because the same path will probably be listed multiple times).
3. Sort the two lists and compare them to each other.
4. Lines (i.e. paths) that show up only in the second list are paths to broken images that are missing their manifest file.
5. After confirming that the result list (from step 4) is correct, you can construct from it a set of DELETE API calls (one for each path).
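The comparison in the middle of that procedure can be sketched in Python as a set difference. The function name and the example paths below are hypothetical; in practice the two input lists would come from the results of the two AQL queries, which are not shown here.

```python
def paths_missing_manifest(manifest_paths, layer_paths):
    """Return the sorted paths that hold sha256__* files but no manifest.json.

    manifest_paths: paths returned by the manifest.json query.
    layer_paths: paths returned by the sha256__* query (duplicates allowed).
    """
    with_manifest = set(manifest_paths)
    with_layers = set(layer_paths)  # de-duplicates paths listed once per layer
    return sorted(with_layers - with_manifest)


# Hypothetical example data: "app/1.1" has layers but no manifest.
broken = paths_missing_manifest(
    manifest_paths=["app/1.0", "app/1.2"],
    layer_paths=["app/1.0", "app/1.0", "app/1.1", "app/1.2"],
)
print(broken)
```

Using sets makes the sort-and-compare step unnecessary, but the result should still be reviewed by hand before any DELETE calls are built from it.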

Dataflow 2.1.0 streaming application is not cleaning temp folders

We have a Dataflow application which reads from Pub/Sub, applies fixed 1-minute windows, and writes the raw messages to GCS with 10 shards. Our application has been running for 10 days now and it has created a .temp-beam-2017**** folder. There are about 6200 files under it and the count is growing every day.
My understanding is that Dataflow moves the temp files to the specified output folder after the write is complete.
Could you please suggest what can be done in this case? Each of these files is about 100 MB.
inputCollection
    .apply("Windowing",
        Window.<String>into(FixedWindows.of(ONE_MINUTE))
            .triggering(AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(ONE_MINUTE))
            .withAllowedLateness(ONE_HOUR)
            .discardingFiredPanes())
    // Writing to GCS
    .apply(TextIO.write()
        .withWindowedWrites()
        .withNumShards(10)
        .to(options.getOutputPath())
        .withFilenamePolicy(new WindowedFileNames(options.getOutputPath())));
The discrepancy between 14400 and 13900 is most likely because your pipeline didn't get any data whose event time falls into a particular window and shard. When writing a windowed collection, we don't create empty files for "missing" windows, because in general it is not possible to know which windows are "missing": it is, in theory, pretty clear for fixed or sliding windows, but not for custom windowing functions, sessions, etc. Moreover, the assignment of shards is random, so if very little data arrived for a particular window, there's a pretty good chance that some of the 10 shards didn't get any of it.
As for why the temporary files are being left over: it seems that the pipeline is occasionally seeing exceptions when writing the files to GCS. The leftover files are "zombies" from those failed attempts to write data. We currently don't do a good job of cleaning up such files automatically in streaming mode. In batch, it is safe to delete the entire temporary directory when the pipeline is done; in streaming we can't do that, so we delete only the individual files being renamed to their final location. It should be safe for you to delete old temp files in that directory.
I've filed https://issues.apache.org/jira/browse/BEAM-3145 to improve the latter.
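Until that is improved, a manual cleanup could start from something like the following Python sketch, which only selects candidates for deletion from a listing of the temp folder. The answer does not define what counts as "old": the one-day default here is an arbitrary assumption, and the function name and the (name, last-modified) input format are made up for this example.

```python
from datetime import datetime, timedelta, timezone


def stale_temp_files(files, now, min_age=timedelta(days=1)):
    """Return the names of temp files last modified more than `min_age` ago.

    files: iterable of (name, last_modified) pairs taken from a listing
    of the temp folder. The one-day default is an assumption; pick an age
    comfortably larger than the window size plus the allowed lateness.
    """
    cutoff = now - min_age
    return [name for name, modified in files if modified < cutoff]


# Hypothetical listing: one file from two days ago, one from ten minutes ago.
now = datetime(2017, 11, 1, 12, 0, tzinfo=timezone.utc)
listing = [
    ("temp-a", now - timedelta(days=2)),
    ("temp-b", now - timedelta(minutes=10)),
]
print(stale_temp_files(listing, now))
```

The sketch deliberately stops at selecting names, so the list can be reviewed before anything is deleted.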

Sequential Processing of Files in IIB v9

Environment: IIB 9 broker on Windows
The SFTP server is on Windows
We have a requirement to process a batch of files generated by a backend system in sequential order (i.e. FIFO). A batch can contain multiple files.
All the files are placed in the IIB source directory, which the FileInput node polls using the move command.
I want to know whether the FileInput node is capable of picking up files in the order they were created by the backend system.
Thanks,
Below is an extract from the help contents, which says that by default the FileInput node reads files in the order they were created (oldest first).
How the broker reads a file at the start of a flow
The FileInput node processes messages that are read from files. The
FileInput node searches a specified input directory (in the file
system that is attached to the broker) for files that match specified
criteria. The node can also recursively search the input directory's
subdirectories.
Files that meet the criteria are processed in age order, that is the
oldest files are processed first regardless of where they appear in
the directory structure.
https://www.ibm.com/support/knowledgecenter/en/SSMKHH_9.0.0/com.ibm.etools.mft.doc/ac55280_.htm
Yes, you can loop in your message flow and set the file path of the file you want to load, one at a time. Feel free to ask for more information if needed.
As for the order, I would have to check (you can still call a Java node that lists the files in the order you want).
Regards

How to read a list of files from a folder and export them into a database in Grails

I have lots of files of the same type (e.g. .xml) in a folder. How can I select this folder from the interface, iterate over each file, and send its data to the appropriate database tables?
Thanks
Sonu
Do you always put the files in the same directory? For example, if you generate these files in some other system and just want the data imported into your application, you could:
1. Create a job that runs every X minutes.
2. Have it iterate over each file in the directory and parse the XML, creating and saving the objects to the database.
3. Have it move or delete the files once it has processed them.
Jobs are a Grails concept/plugin: http://www.grails.org/Job+Scheduling+(Quartz)
Processing XML is easy in Groovy - you have many options, depends on your specific scenario - http://groovy.codehaus.org/Processing+XML
Processing files is also trivial - http://groovy.codehaus.org/groovy-jdk/java/io/File.html#eachFile(groovy.lang.Closure)
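To make the shape of such a job concrete, here is a minimal sketch written in Python rather than Groovy, so it illustrates the steps and is not Grails code. The function name, the folder arguments and the `save` callback (which stands in for whatever writes to your database tables) are all hypothetical.

```python
import shutil
import xml.etree.ElementTree as ET
from pathlib import Path


def import_xml_files(inbox, processed, save):
    """Parse each .xml file in `inbox`, hand it to `save`, then move it away.

    save: hypothetical callback that receives the parsed root element and
    stores it in the database. Returns the names of the files handled.
    """
    inbox, processed = Path(inbox), Path(processed)
    processed.mkdir(parents=True, exist_ok=True)
    handled = []
    for path in sorted(inbox.glob("*.xml")):
        root = ET.parse(path).getroot()
        save(root)
        # Move the file only after a successful save, so a failed file
        # stays in the inbox and is retried on the next run.
        shutil.move(str(path), str(processed / path.name))
        handled.append(path.name)
    return handled
```

A scheduled job would call this function every X minutes; moving the processed files out of the inbox is what keeps them from being imported twice.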
This is a high level overview. Hope it helps.
