Sequential Processing of Files in IIB v9

Environment: IIB v9 broker on Windows; the SFTP server is also on Windows.
We have a requirement to process a batch of files generated by a backend system in sequential order (i.e. FIFO). A batch can contain multiple files.
All the files are placed in the IIB source directory, from where the FileInput node polls them using a move command.
I want to know whether the FileInput node is capable of picking up files in the order they were created by the backend system.
Thanks,

Below is an extract from the help contents, which says that by default the FileInput node reads files in the order they were created (the oldest files are picked up first).
How the broker reads a file at the start of a flow
The FileInput node processes messages that are read from files. The
FileInput node searches a specified input directory (in the file
system that is attached to the broker) for files that match specified
criteria. The node can also recursively search the input directory's
subdirectories.
Files that meet the criteria are processed in age order, that is, the
oldest files are processed first regardless of where they appear in
the directory structure.
https://www.ibm.com/support/knowledgecenter/en/SSMKHH_9.0.0/com.ibm.etools.mft.doc/ac55280_.htm

Yes, you can loop in your message flow and set the file path of the file you want to load, one at a time. Feel free to ask for more information if needed.
For the ordering I would have to check, but you can always call a Java compute node that lists the files in the order you want (a rough sketch is below).
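For illustration only, here is a minimal plain-Java sketch of that idea, assuming creation time is what defines the order; the class name and directory path are placeholders, and in IIB this logic would typically sit inside a JavaCompute node.

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.BasicFileAttributes;
import java.nio.file.attribute.FileTime;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Lists the files in a directory sorted by creation time, oldest first.
public class OldestFirstLister {

    public static List<Path> listOldestFirst(Path dir) throws IOException {
        try (Stream<Path> files = Files.list(dir)) {
            return files
                .filter(Files::isRegularFile)
                .sorted(Comparator.comparing(OldestFirstLister::creationTime))
                .collect(Collectors.toList());
        }
    }

    private static FileTime creationTime(Path p) {
        try {
            return Files.readAttributes(p, BasicFileAttributes.class).creationTime();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder input directory.
        listOldestFirst(Paths.get("C:/iib/input")).forEach(System.out::println);
    }
}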
Regards

Related

Jenkins job to select multiple files or directory with parameter

I have a Jenkins job that can process any number of files. However, the File Parameter only lets the user select a single file.
Is there a way to select an arbitrary number of files, or to select a directory? (I can work out the appropriate files given a directory.)
There seems to be no option for selecting a directory with a parameter.
Asking the user to enter the full path in a String parameter is very unfriendly.
I did consider using a File Parameter to select one file in the directory and then extracting the directory from it, but in fact you just get the single file uploaded; there doesn't appear to be any directory information.
(Note: it isn't an option to run the job multiple times once for each file, the files need to be processed as a set.)
Any ideas?

Avoid reading the same file multiple times using Telegraf and file input plugin

I need to read CSV files inside a folder. New CSV files are generated every time a user submits a form. I'm using the "file" input plugin to read the data and send it to InfluxDB. These steps are working fine.
The problem is that the same file is read multiple times every data collection interval. I was thinking of a solution where I could move each file to a different folder after it has been read, but I couldn't do that with Telegraf's "exec" output plugin.
PS: I can't change the way the CSV files are generated.
Any ideas on how to avoid reading the same csv file multiple times?
As you discovered, the file input plugin reads entire files at each collection interval.
My suggestion is to use the directory monitor input plugin instead. It reads files in a directory, monitors the directory for new files, and parses only the ones that have not been picked up yet. There are also some configuration settings in that plugin that make it easier to control when new files are read.
Another option is the tail input plugin, which tails a file and only reads new updates to that file as they arrive. However, I think the directory monitor plugin is more likely what you are after for your scenario.
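For reference, a minimal directory_monitor configuration sketch; the paths are placeholders and the exact option names should be checked against the plugin's README for your Telegraf version.

[[inputs.directory_monitor]]
  ## Directory to watch for new CSV files (placeholder path).
  directory = "/data/csv_incoming"
  ## Parsed files are moved here so they are only read once (placeholder path).
  finished_directory = "/data/csv_done"
  ## Parse each file as CSV with a single header row.
  data_format = "csv"
  csv_header_row_count = 1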
Thanks!

Dataflow 2.1.0 streaming application is not cleaning temp folders

We have a dataflow application which reads from Pub/Sub, windows into fixed-size 1-minute duration windows and writes the raw messages to GCS with 10 shards. Our application has been running for 10 days now and it has created a .temp-beam-2017**** folder. There are about 6200 files under it and the count is growing every day.
My understanding is that Dataflow will move the temp files to the specified output folder after the write is complete.
Could you please suggest what can be done in this case? Each of these files is about 100 MB.
inputCollection
    .apply("Windowing",
        Window.<String>into(FixedWindows.of(ONE_MINUTE))
            .triggering(AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(ONE_MINUTE))
            .withAllowedLateness(ONE_HOUR)
            .discardingFiredPanes())
    // Writing to GCS
    .apply(TextIO.write()
        .withWindowedWrites()
        .withNumShards(10)
        .to(options.getOutputPath())
        .withFilenamePolicy(new WindowedFileNames(options.getOutputPath())));
The discrepancy between 14400 and 13900 is most likely because your pipeline didn't get any data whose event time falls into a particular window and shard. When writing a windowed collection, we don't create empty files for "missing" windows, because in general it is not possible to know which windows are "missing": it is, in theory, pretty clear for fixed or sliding windows, but not so for custom windowing functions or sessions, etc. Moreover, assignment of shards is random, so it is possible that very little data arrived for a particular window, and then there's a pretty good chance that some of the 10 shards didn't get any of it.
As for why the temporary files are being left over: it seems that the pipeline is occasionally seeing exceptions when writing the files to GCS. The leftover files are "zombies" from those failed attempts to write data. We currently don't do a good job of cleaning up such files automatically in streaming mode (in batch, it is safe to delete the entire temporary directory when the pipeline is done, but in streaming we can't do that, and we delete only the individual files being renamed to their final location), but it should be safe for you to delete old temp files in that directory.
I've filed https://issues.apache.org/jira/browse/BEAM-3145 to improve the latter.
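For what it's worth, here is a minimal sketch of deleting old leftover temp objects with the google-cloud-storage Java client; the bucket name, prefix, and age cutoff are placeholders, and you should verify the objects are truly stale before deleting anything.

import com.google.cloud.storage.Blob;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import java.util.concurrent.TimeUnit;

// Deletes temp objects under a prefix that have not been updated for 7 days.
public class TempFileCleanup {
    public static void main(String[] args) {
        Storage storage = StorageOptions.getDefaultInstance().getService();
        long cutoffMillis = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(7);

        // Placeholder bucket and temp-directory prefix.
        Iterable<Blob> blobs = storage
            .list("my-output-bucket", Storage.BlobListOption.prefix("output/.temp-beam-"))
            .iterateAll();

        for (Blob blob : blobs) {
            Long updated = blob.getUpdateTime(); // last update time in epoch millis
            if (updated != null && updated < cutoffMillis) {
                blob.delete();
                System.out.println("Deleted stale temp object: " + blob.getName());
            }
        }
    }
}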

Are cloud dataflow job outputs transactional?

Assuming I don't know the status of the job that was supposed to generate some output files (in Cloud Storage), can I assume that if some output files exist they contain all of the job's output?
Or is it possible that partial output is visible?
Thanks,
G
It is possible that only a subset of the files is visible, but the visible files are complete (cannot grow or change).
The filenames contain the total number of files (output-XXXXX-of-NNNNN), so once you have one file, you know how many more to expect.
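As an illustration of that check, here is a small sketch that extracts the expected total number of files from a single filename; the regular expression assumes the default output-XXXXX-of-NNNNN shard naming mentioned above.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Extracts the total shard count NNNNN from a filename like "output-00003-of-00010".
public class ShardCount {
    private static final Pattern SHARDED_NAME = Pattern.compile(".*-(\\d+)-of-(\\d+)$");

    public static int expectedShards(String filename) {
        Matcher m = SHARDED_NAME.matcher(filename);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a sharded output filename: " + filename);
        }
        return Integer.parseInt(m.group(2));
    }

    public static void main(String[] args) {
        // Seeing this one file tells you 10 files are expected in total.
        System.out.println(expectedShards("output-00003-of-00010"));
    }
}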

How to read a list of files from a folder and export them into a database in Grails

I have lots of files of the same type, such as .xml, in a folder. How can I select this folder from the interface, iterate over each file, and send the data to the appropriate database tables?
Thanks
Sonu
Do you always put the files in the same directory? For example, if you generate these files in some other system and just want the data imported into your application, you could:
Create a job that runs every X minutes
Have it iterate over each file in the directory, parse the XML, and create and save the objects to the database
Move or delete the files once they have been processed
Jobs are a Grails concept/plugin: http://www.grails.org/Job+Scheduling+(Quartz)
Processing XML is easy in Groovy - you have many options; it depends on your specific scenario - http://groovy.codehaus.org/Processing+XML
Processing files is also trivial - http://groovy.codehaus.org/groovy-jdk/java/io/File.html#eachFile(groovy.lang.Closure)
This is a high-level overview; a rough sketch of the file-processing loop follows below. Hope it helps.
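For illustration only, a minimal plain-Java sketch of such a loop; in Grails you would normally write this in Groovy inside a Quartz job and save through GORM domain classes, and the directory paths here are placeholders.

import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Iterates over the XML files in a folder, parses each one, and moves it
// to a "processed" folder so it is not re-imported on the next run.
public class XmlFolderImport {
    public static void main(String[] args) throws Exception {
        File inputDir = new File("/data/import");        // placeholder path
        File processedDir = new File("/data/processed"); // placeholder path
        processedDir.mkdirs();

        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();

        File[] files = inputDir.listFiles((dir, name) -> name.endsWith(".xml"));
        if (files == null) return;

        for (File file : files) {
            Document doc = builder.parse(file);
            String root = doc.getDocumentElement().getNodeName();
            // ... map the parsed XML onto domain objects and save them here ...
            System.out.println("Imported " + file.getName() + " (root: " + root + ")");

            // Move the file out of the input directory once it has been processed.
            file.renameTo(new File(processedDir, file.getName()));
        }
    }
}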
