GCS and Java - Concatenate dynamic bucket name and file name in TextIO - google-cloud-dataflow

I want to write a file to a GCS bucket. The bucket path and file name are provided dynamically in two different pipeline options. How can I concatenate those in TextIO to write the file to the GCS bucket?
I tried the following, but with no luck:
o.apply("Test:",TextIO.write()
.to(options.getBucktName().toString()+options.getOutName().toString()));
where getOutName = test.txt
and getBucktName = gs://bucket
Edit: The options are ValueProviders.
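For reference, the options are declared roughly like this (the interface name is an assumption; the getters are the ones used above):
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.ValueProvider;

public interface MyOptions extends PipelineOptions {
    ValueProvider<String> getBucktName();
    void setBucktName(ValueProvider<String> value);

    ValueProvider<String> getOutName();
    void setOutName(ValueProvider<String> value);
}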

We faced a similar situation in Dataflow Templates, and we created DualInputNestedValueProvider to address it.
You can feed it two ValueProviders (a RuntimeValueProvider and a StaticValueProvider, in your case) and a function that maps them to a new ValueProvider.
Take a look here for an example.
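The idea, in sketch form (this is an illustrative class showing the pattern, not the actual Templates implementation; see the linked example for that):
import java.io.Serializable;
import org.apache.beam.sdk.options.ValueProvider;

// Combines two ValueProviders into one, resolving both only at runtime.
public class ConcatPathValueProvider implements ValueProvider<String>, Serializable {
    private final ValueProvider<String> bucket;
    private final ValueProvider<String> file;

    public ConcatPathValueProvider(ValueProvider<String> bucket, ValueProvider<String> file) {
        this.bucket = bucket;
        this.file = file;
    }

    @Override
    public String get() {
        // Evaluated at runtime, once both underlying providers are accessible.
        return bucket.get() + "/" + file.get();
    }

    @Override
    public boolean isAccessible() {
        return bucket.isAccessible() && file.isAccessible();
    }
}
Since TextIO.write().to(...) has an overload that accepts a ValueProvider<String>, the write then becomes:
o.apply("Test", TextIO.write()
    .to(new ConcatPathValueProvider(options.getBucktName(), options.getOutName())));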

By "dynamically provided" do you mean those options are runtime ValueProvider instances? If so, I don't think it's possible to express what you want, since there's currently no hook for combining value providers (per related question).
If these are not value providers, then the example you show should work fine (although missing a / between the bucket and path as written).
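For completeness, if the getters return plain Strings, a minimal fix is just adding the separator (getter names taken from the question):
o.apply("Test", TextIO.write()
    .to(options.getBucktName() + "/" + options.getOutName()));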
Can you share more about how the options are defined?

Related

Consuming contents of declare_directory

I have rule A implemented with a macro that uses declare_directory to produce a set of files:
output = ctx.actions.declare_directory("selected")
Names of those files are not known in advance. The implementation returns the directory created by declare_directory with the following:
return DefaultInfo(
    files = depset([output]),
)
Rule A is included in the "srcs" attribute of rule B. Rule B is also implemented with a macro. Unfortunately, the list of files passed to B's implementation through the "srcs" attribute contains only the "selected" directory created by rule A, not the files residing in that directory.
I know that the Args class supports expansion of directories, so I could pass the names of all files in the "selected" directory to a single action. What I need, however, is a separate action for every individual file, for parallelism and caching. What is the best way to achieve that?
This is one of the intended use cases of directory outputs (called TreeArtifacts in the implementation), and it's implemented using ActionTemplate:
https://github.com/bazelbuild/bazel/blob/c2100ad420618bb53754508da806b5624209d9be/src/main/java/com/google/devtools/build/lib/actions/ActionTemplate.java#L24-L57
However, this is not exposed to Starlark, and it currently has only a couple of usages: in the Android rules (AndroidBinary.java) and the C++ rules (CcCompilationHelper.java). The Android and C++ rules are going to be migrated to Starlark, so this functionality might eventually be made available in Starlark, but I'm not aware of any concrete timelines. It would probably be good to file a feature request on GitHub.

Is there any way I can avoid reading old files from an old folder with Apache Beam's TextIO watchForNewFiles(Duration, condition)?

Use case: at Dataflow job start-up we provide an initial file name to read, and later the job should watch for new files in that directory while treating all remaining old files as already read.
Issues:
Approach 1:
PCollection<String> readfile = pipeline.apply(TextIO.read().from("gs://folder-Name/*").
watchForNewFiles(Duration.standardSeconds(10),
Watch.Growth.afterTimeSinceNewOutput(Duration.standardSeconds(30))));
Used like this, the job treats the old files as new and reads every file already in that folder.
Approach 2:
PCollection<String> readfile = pipeline.apply(TextIO.read().from("gs://folder-Name/file-name").
watchForNewFiles(Duration.standardSeconds(10),
Watch.Growth.afterTimeSinceNewOutput(Duration.standardSeconds(30))));
This reads only that particular file and never picks up newly arriving files.
Can anyone please suggest an approach that achieves this use case?
The watchForNewFiles() function will always read all files matching the filepattern, both existing and new. In your second approach, the filepattern matches only one file, so you just get that one.
However, you can use the lower-level building-block transforms in FileIO to accomplish what you need. The following code will read only files written after the pipeline starts:
PCollection<String> lines = p
.apply(FileIO.match().filepattern("gs://folder-Name/*")
.continuously(Duration.standardSeconds(30), afterTimeSinceNewOutput(Duration.standardHours(1)))
.setCoder(MetadataCoderV2.of())
.apply(Filter.by(metadata -> metadata.lastModifiedMillis() > PIPELINE_START))
.apply(FileIO.readMatches())
.apply(apply(TextIO.readFiles()))
You can change the details of the Filter transform to whatever precise condition you need. To also include specific older files, you can read those with a standard TextIO.read().from(...) and then use Flatten to combine that PCollection with the continuously matched one, like this:
PCollection<String> allLines =
    PCollectionList.of(lines)
        .and(p.apply(TextIO.read().from("gs://folder-Name/file-name")))
        .apply(Flatten.pCollections());
Maybe you need to clarify your use case: do you provide a file name to read, or a file pattern? How many files are expected? Do you really need a Dataflow streaming pipeline, or would a Cloud Function answer your need? What exactly is your issue: that files get read again when you restart your pipeline?
You can, as suggested by danielm, use FileIO to fetch and filter on file metadata in order to know which files were added after the pipeline began.
If you provide a file pattern, then every matching file will be read once by the pipeline. There is no way to keep state between pipelines unless you code it yourself, so when you restart the pipeline you will read all the matching files again.
If you want to avoid that, you can manually move the old files to another path between stopping the old pipeline and starting the new one.
You could also consider consuming GCS notifications on file creation with PubsubIO and using those events to know which files to process in your pipeline.
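A rough sketch of that idea, assuming a Pub/Sub subscription that receives GCS notifications (the subscription name is a placeholder; the attribute names come from the standard GCS notification format):
PCollection<String> newFileLines = p
    .apply("ReadGcsNotifications", PubsubIO.readMessagesWithAttributes()
        .fromSubscription("projects/my-project/subscriptions/gcs-notifications"))
    // Keep only object-creation events.
    .apply(Filter.by((PubsubMessage msg) -> "OBJECT_FINALIZE".equals(msg.getAttribute("eventType"))))
    // Rebuild the gs:// path from the notification attributes.
    .apply(MapElements.into(TypeDescriptors.strings())
        .via((PubsubMessage msg) -> "gs://" + msg.getAttribute("bucketId") + "/" + msg.getAttribute("objectId")))
    .apply(FileIO.matchAll())
    .apply(FileIO.readMatches())
    .apply(TextIO.readFiles());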
A good practice, though, is to have multiple folders that reflect the status of the files:
input
processing
failed
succeed
This way you know the state of each file. You put the files to process in the input folder, and inside your pipeline you move each file to its corresponding status folder.
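A minimal sketch of that move step using Beam's FileSystems API (the paths and the helper name are illustrative; this would typically be called from a DoFn once a file has been processed):
import java.io.IOException;
import java.util.Collections;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.fs.ResourceId;

// Moves a processed file, e.g. gs://bucket/input/file.txt -> gs://bucket/succeed/file.txt.
static void moveToStatusFolder(String gcsPath, String status) throws IOException {
    ResourceId source = FileSystems.matchNewResource(gcsPath, false /* isDirectory */);
    ResourceId destination = FileSystems.matchNewResource(
        gcsPath.replace("/input/", "/" + status + "/"), false /* isDirectory */);
    FileSystems.rename(
        Collections.singletonList(source),
        Collections.singletonList(destination));
}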

Is there a plugin to get Folder path as parameter in Jenkins?

I want to allow users to select a folder path as a parameter and get the entire folder path as the parameter value. Is there any plugin for this purpose?
I have explored the File Parameter plugin; it lets you select a file but gives only the file name as output, not the path.
I have also explored the Filesystem List Parameter plugin, which lists the folders inside a directory as choices.
Have you tried using a String parameter for it? I have several pipelines where paths are defined as string parameters for use in shell scripts.
Are you sure the Filesystem List Parameter plugin can't be configured to meet your needs?
I believe the Extended Choice Parameter plugin should allow you to do this. You'd have to write it as custom Groovy, which could be tricky or take time to load.
You could also request an enhancement to the Filesystem List Parameter plugin.

How to create and load a configuration file in dxl

I have a script which saves some files at a given location. It works fine, but when I send this code to someone else, they have to change the paths in the code. That is not comfortable for someone who does not know what is in the code, and it is tedious for me to explain every time where and how the code should be changed.
I want to put this path in a variable that is read from a configuration file, so everyone only has to change the config file and nothing in my code. But I have never done this before and could not find any information on how to do this on the internet.
PS: I do not have any code yet and I am asking about a general solution; it is really difficult to find good information about DXL on the internet, especially since I am new to it. Maybe someone here already does this or has an idea how it could be done?
DXL has a perm to read the complete content of a file into a variable: string readFile(string) (or Buffer readFile(string)).
You can split the output by \n and then use regular expressions to find all lines that match the pattern
^\s*([^;#].*)\s*=\s*(.*)\s*$
(i.e. key = value, where comment lines start with ; or #).
But in DOORS I prefer using DOORS modules as configuration modules: the Object Heading can be the key and the Object Text the value.
Hardcode the full name of the configuration module into your DXL file, and the user can then modify the behaviour of the application.
The advantage over a file is that you need not make assumptions about where the config file is stored on the file system.
It really depends on your situation. You are going to need to be a little more specific about what you mean by "they need to change the paths in the code". What are these paths to? Are they DOORS module paths, are they paths to local/network files, or are they something else entirely?
Like user3329561 said, you COULD use a DOORS module as a configuration file. I wouldn't recommend it, though, simply because that is not what DOORS modules were designed for. DOORS is fully capable of reading system files one line at a time as well as all at once, but I can't recommend that option either until I know what types of paths you want to load and why.
I suspect that there is a better solution for your problem that will present itself once more information is provided.
I had the same problem: I needed to specify the path of the configuration file used in my DXL script.
I solved this by passing the directory path as a parameter to doors.exe, as follows:
"...\DOORS\9.3\bin\doors.exe" -dxl "string myVar = \"Hello World\""
Then, in my DXL script, myVar is available as a global variable.

Lua - My documents path and file creation date

I'm planning to write a program in Lua that will first read specific files and get information from them. So my first question is: what is the "My Documents" path name? I have searched in a lot of places but am unable to find anything. My second question is: how can I use the first four letters of a file name to see which one is the newest?
In short: find the files in "My Documents", then find the newest created file and read it.
The reading part shouldn't be a problem; the hard part is navigating to "My Documents" and finding the newest created file in a folder.
For your first question, it depends how robust you want your script to be. You could use Lua's built-in os.getenv() to get a variety of user-related environment variables, such as USERNAME, USERPROFILE, HOMEDRIVE and HOMEPATH. Example:
username = os.getenv('USERNAME')
dir = 'C:\\users\\' .. username .. '\\Documents'
For the second question, there is no built-in mechanism in Windows to have the file creation or modification timestamp as part of the filename. You could read the creation or modification timestamp via a C extension you create, or by using an existing Lua library like lfs. Or you could read the contents of a folder and parse the filenames if they were named according to the pattern you mention. Again, there is nothing built into Lua to do this; you would use os.execute(), or lfs, or your own C extension module, or a combination of these.
