Adding date to the key expression in s3 sink properties of spring cloud stream - spring-cloud-dataflow

We have a Spring Cloud Data Flow stream which processes input files and produces output files in an S3 bucket.
We are using the following key-expression property to specify the folder for the output file.
app.s3-sink-rabbit.s3.key-expression='XYZ/abc/'+headers.file_name
We are trying to add a date in YYYYMMDD format as a folder for our output files,
i.e. the output location should be XYZ/abc/20230110/{filename}
We understand that the folder is created automatically in S3 if it does not exist when the file is generated.
We could append the date in YYYYMMDD format and a '/' to the file name programmatically, but we want to know if it can be done through an expression in the property.

I believe the following may do what you want:
app.s3-sink-rabbit.s3.key-expression='XYZ/abc/'+T(java.time.LocalDate).now().format(T(java.time.format.DateTimeFormatter).BASIC_ISO_DATE)+'/'+headers.file_name
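For reference, the date portion of that expression is plain java.time formatting; a quick hedged check of what the SpEL fragment evaluates to ("report.csv" below stands in for headers.file_name):

import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class KeyExpressionCheck {
    public static void main(String[] args) {
        // BASIC_ISO_DATE renders today's date as yyyyMMdd, e.g. 20230110
        String datePart = LocalDate.now().format(DateTimeFormatter.BASIC_ISO_DATE);
        String key = "XYZ/abc/" + datePart + "/" + "report.csv";
        System.out.println(key); // e.g. XYZ/abc/20230110/report.csv
    }
}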

Try the key-expression property.
From the S3MessageHandler javadocs...
An S3 Object {@code key} for upload and download can be determined by the provided {@link #keyExpression} or the {@link File#getName()} is used directly. The former has precedence.

Related

GCS and Java - Concatenate dynamic bucket name and file name in TextIO

I want to write a file to a GCS bucket. The bucket path and file name are dynamically provided in two different pipeline options. How can I concatenate those in TextIO to write the file to the GCS bucket?
I tried doing this but no luck.
o.apply("Test:",TextIO.write()
.to(options.getBucktName().toString()+options.getOutName().toString()));
where getOutName = test.txt
and getBucktName = gs://bucket
Edit: Options are ValueProvider
We have faced a similar situation in Dataflow Templates, and we have created DualInputNestedValueProvider to address this.
You can feed it two ValueProviders (a RuntimeValueProvider and a StaticValueProvider, in your case) and a function to map them to a new ValueProvider.
Take a look here for an example.
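I don't know DualInputNestedValueProvider's exact signature off-hand, but the general idea can be sketched as a small custom ValueProvider that combines two others at runtime. The class and method names below are illustrative, not part of the Beam or Dataflow Templates API:

import org.apache.beam.sdk.options.ValueProvider;

// Illustrative sketch: a ValueProvider<String> that concatenates two other
// ValueProviders (e.g. bucket and file name) only when get() is called at runtime.
public class ConcatValueProvider implements ValueProvider<String> {

    private final ValueProvider<String> bucket;
    private final ValueProvider<String> fileName;

    public ConcatValueProvider(ValueProvider<String> bucket, ValueProvider<String> fileName) {
        this.bucket = bucket;
        this.fileName = fileName;
    }

    @Override
    public String get() {
        // Resolved lazily, so runtime-provided values work too.
        return bucket.get() + "/" + fileName.get();
    }

    @Override
    public boolean isAccessible() {
        return bucket.isAccessible() && fileName.isAccessible();
    }
}

Something like new ConcatValueProvider(options.getBucktName(), options.getOutName()) could then be passed to TextIO.write().to(...), which has an overload accepting a ValueProvider<String>.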
By "dynamically provided" do you mean those options are runtime ValueProvider instances? If so, I don't think it's possible to express what you want, since there's currently no hook for combining value providers (per related question).
If these are not value providers, then the example you show should work fine (although missing a / between the bucket and path as written).
Can you share more about how the options are defined?
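For the non-ValueProvider case mentioned above, the only fix needed is the missing separator; a minimal sketch, keeping the original option names and assuming the getters return plain Strings:

o.apply("Test:", TextIO.write()
    .to(options.getBucktName() + "/" + options.getOutName()));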

How to filter and get the folder with the latest date in Google Dataflow

I am passing in a wildcard match string such as gs://dev-test/dev_decisions-2018-11-13*/, and I am passing it to TextIO as below.
p.apply(TextIO.read().from(options.getLocalDate()))
Now I want to read all folders from the bucket named dev-test, filter them, and only read files from the latest folder. Each folder has a name with a timestamp appended to it.
I am new to Dataflow and not sure how I would go about doing this.
Looking at the JavaDoc here, it seems we can write:
String folder = "gs://..."; // the GCS path to the latest/desired folder
PCollection<String> myPcollection = p.apply(TextIO.read().from(folder + "/*"));
The resulting PCollection will thus contain all the text lines from all the files in the specified folder.
Assuming you can have multiple folders in the same bucket with the same date prefix/suffix, for example "data-2018-12-18_part1", "data-2018-12-18_part2", etc., the following will work. It's a Python example, but it works for Java as well; you just need to format the date to match your folder names and construct the path accordingly.
# define the input path pattern for today's folders, e.g. gs://MYBUCKET/data-2018-12-18*/*
import datetime
import apache_beam as beam

input = 'gs://MYBUCKET/data-' + datetime.datetime.today().strftime('%Y-%m-%d') + '*/*'
(p
 | 'ReadFile' >> beam.io.ReadFromText(input)
 ...
 ...
It will read all the files from all the folders matching the pattern.
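A rough Java equivalent of the same idea, with the bucket name and prefix as placeholders:

import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Build today's folder pattern and read every matching file, mirroring the
// Python snippet above.
String inputPattern = "gs://MYBUCKET/data-"
        + LocalDate.now().format(DateTimeFormatter.ofPattern("yyyy-MM-dd"))
        + "*/*";
p.apply("ReadFile", TextIO.read().from(inputPattern));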
If you know that the most recent folder will always be today's date, you could use a literal string as in Tanveer's answer. If you don't know that and need to filter the actual folder names for the most recent date, I think you'll need to use FileIO.match to read file and directory names, then collect them all to one node in order to figure out which is the most recent folder, and then pass that folder name into TextIO.read().from().
The filtering might look something like:
ReduceByKey.of(FileIO.match("mypath"))
    .keyBy(e -> 1) // constant key to get everything to one node
    .valueBy(e -> e)
    .reduceBy(s -> ???) // your code for finding the newest folder goes here
    .windowBy(new GlobalWindows())
    .triggeredBy(AfterWatermark.pastEndOfWindow())
    .discardingFiredPanes()
    .output()
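If the Euphoria DSL is not available, a rough equivalent in plain Beam might look like the sketch below. The file pattern and the assumption that folder names sort lexicographically by their embedded date are mine, not from the original answer:

import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.fs.MatchResult;
import org.apache.beam.sdk.transforms.Combine;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

// Match everything under the bucket, map each file to its parent folder, then
// combine globally (i.e. on a single node) keeping the folder name that sorts last.
PCollection<String> newestFolder =
    p.apply(FileIO.match().filepattern("gs://dev-test/dev_decisions-*/*"))
     .apply(MapElements.into(TypeDescriptors.strings())
         .via((MatchResult.Metadata m) ->
             m.resourceId().getCurrentDirectory().toString()))
     .apply(Combine.globally((Iterable<String> dirs) -> {
         String newest = "";
         for (String d : dirs) {
             if (d.compareTo(newest) > 0) {
                 newest = d; // date-stamped names sort chronologically
             }
         }
         return newest;
     }));

The single resulting element would still need to be expanded into file reads, e.g. by appending a wildcard and feeding it to FileIO.matchAll() downstream, since TextIO.read().from() needs its pattern at pipeline-construction time.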

ADTF dat files - streams and structure types

An ADTF dat file contains streams of data, but in the .dat file there is only a stream name. To find the structure of a stream, one has to go through the DDL .description file.
Sometimes the .description files are incomplete or are missing the link from the stream name to the corresponding structure.
Is there some additional information about the structure name hidden in the .dat file itself? (Or is my understanding completely wrong?)
You must distinguish between ADTF 2.x and ADTF 3.x and their (adtf)dat file structures.
ADTF 2.x:
You are right: you can only interpret the data with DDL. The stream must point to a structure described in the Media Description.
Sometimes the .description files are incomplete or are missing link
from stream name to corresponding structure.
You can avoid this by enabling the option Create Media Description in the Harddisk Recorder. A *.dat.description file will then be stored next to the same-named *.dat file; it contains the correct stream and structure references, because they were available during recording.
Is there some additional information about structure name hidden in the .dat file itself?
No, there is only the stream name, so you need to know the underlying data structure to interpret it. If you have the header (C struct), you can also convert it to DDL and refer to that.
ADTF 3.x:
To avoid these problems with unavailable or incorrect description files, in ADTF 3.x the DDL is now stored in the *.adtfdat file itself.

Multiple file generation while writing to XML through Apache Beam

I'm trying to write an XML file where the source is a text file stored in GCS. The code runs fine, but instead of a single XML file it generates multiple XML files (the number of XML files seems to follow the total number of records in the source text file). I observed this while using the DataflowRunner.
When I run the same code locally, two files are generated: the first contains all the records with proper elements, and the second contains only the opening and closing root element.
Any idea why this unexpected behaviour occurs? Please find the code snippet I'm using below:
PCollection<String> input_records = p.apply(TextIO.read().from("gs://balajee_test/xml_source.txt"));
PCollection<XMLFormatter> input_object = input_records.apply(ParDo.of(new DoFn<String, XMLFormatter>() {
    @ProcessElement
    public void processElement(ProcessContext c) {
        String[] elements = c.element().split(",");
        c.output(new XMLFormatter(elements[0], elements[1], elements[2], elements[3], elements[4]));
        System.out.println("Values to be written have been provided to constructor ");
    }
})).setCoder(AvroCoder.of(XMLFormatter.class));
input_object.apply(XmlIO.<XMLFormatter>write()
    .withRecordClass(XMLFormatter.class)
    .withRootElement("library")
    .to("gs://balajee_test/book_output"));
Please let me know how to generate a single XML file (book_output.xml) as output.
XmlIO.write().to() is documented as follows:
/**
* Writes to files with the given path prefix.
*
* <p>Output files will have the name {@literal {filenamePrefix}-0000i-of-0000n.xml} where n is
* the number of output bundles.
*/
I.e. it is expected that it may produce multiple files: for example, if the runner chooses to process your data by parallelizing it into 3 tasks ("bundles"), you'll get 3 files. Some of the parts may turn out empty in some cases, but the total data written will always add up to the expected data.
Asking the IO to produce exactly one file is a reasonable request if your data is not particularly big. It is supported in TextIO and AvroIO via .withoutSharding(), but not yet supported in XmlIO. Please feel free to file a JIRA with the feature request.
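Until such single-file support exists in XmlIO, one possible workaround for small outputs, sketched here rather than taken from XmlIO itself, is to render each record to an XML fragment and write it with TextIO, which does support single-file output. XMLFormatter#toXmlString() below is an assumed helper, not an existing method:

import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

// Workaround sketch: format the records yourself, then force a single output file.
input_object
    .apply(MapElements.into(TypeDescriptors.strings())
        .via((XMLFormatter rec) -> rec.toXmlString())) // assumed helper on XMLFormatter
    .apply(TextIO.write()
        .withHeader("<library>")   // written once at the top of the file
        .withFooter("</library>")  // written once at the bottom of the file
        .withoutSharding()         // single shard, i.e. a single output file
        .to("gs://balajee_test/book_output.xml"));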

Can writing to an output KSDS be done if we use the alternate key concept with dynamic access mode on an input KSDS?

I have an input KSDS file; I am using emp-id as the primary key and emp-dept as the alternate key, with the access mode as dynamic. I am reading the file using dynamic access based upon the alternate key. In the run JCL I am using the base KSDS file and the KSDS path file, so normally COBOL will read from the path file
(which is sorted upon the alternate key, not the primary key).
The problem is that while I am writing to an output KSDS it shows a file status 21 error, because
records can only be inserted into a KSDS in primary-key order. So what should I do? Is there any alternative method?
Why not:
Write the output to a normal sequential file.
Sort/copy the sequential file into the output VSAM file.
If you are updating an existing file, you should be able to update it. Alternatively, you can always use two programs and sort the output from the first program.
Does the output file really need to be a VSAM file?
