A .csv file from another job is being copied into my job's workspace.
Since the .csv file name changes every day, how can I read it without specifying its name?
It's the only .csv file in the folder.
copyArtifacts(projectName: 'scanPinnedVersions')
def lastSuccessBuildNum = Jenkins.instance.getItem("scanPinnedVersions").lastSuccessfulBuild.displayName.replace("#","")
def da = readFile WORKSPACE + "/" + lastSuccessBuildNum + '/scanPinnedVersionReport.2021-12-05-080054.csv'
scanPinnedVersionReport.2021-12-05-080054.csv
Needs to be:
scanPinnedVersionReport*.csv
Thanks for helping. I would love to learn how to do this.
Using findFiles (from the Pipeline Utility Steps plugin) you should be able to accomplish this; see the Jenkins documentation for that step.
It will return a list of files (in this case, sounds like only one) and you can then process it accordingly.
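For the original question, a minimal sketch might look like this (the glob is an assumption about where copyArtifacts drops the file; adjust it to your layout):
def csvFiles = findFiles(glob: '**/scanPinnedVersionReport*.csv')
if (csvFiles.length == 0) {
    error 'No scanPinnedVersionReport CSV found in the workspace'
}
// findFiles returns workspace-relative paths, so readFile can use them directly
def da = readFile file: csvFiles[0].path, encoding: 'utf-8'
echo "Read ${csvFiles[0].name} (${csvFiles[0].length} bytes)"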
Related
So I have a case where I need to populate findFiles with files from more than one directory:
FILES = steps.findFiles(glob: "${FILE}/*.zip")
then I need to go to another folder and update the variable with those files as well:
FILES = steps.findFiles(glob: "${AnotherFilePath}/*.zip")
The end goal is to iterate over the files and do something with each one, e.g.
for (file in FILES) { /* do something with file */ }
I really want to get away from bash, but is it possible to do this the Jenkins Groovy way? Can you populate the FILES variable from both paths?
You could use Groovy's collectMany method, which executes a closure for every item in the initial list and joins the results into one array:
def FILES = [FILE, AnotherFilePath].collectMany{ steps.findFiles(glob: "${it}/*.zip") }
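For completeness, a short usage sketch (the unzip call is only an illustrative "do something" and assumes the Pipeline Utility Steps plugin):
def FILES = [FILE, AnotherFilePath].collectMany { steps.findFiles(glob: "${it}/*.zip") }
FILES.each { f ->
    steps.echo "Processing ${f.path}"
    // do something with each file here, e.g. steps.unzip(zipFile: f.path)
}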
Use case: at Dataflow job startup we provide an initial file name to read data from; afterwards the job should watch for new files in that directory and treat all the remaining old files as already read.
Issues:
Approach 1:
PCollection<String> readfile = pipeline.apply(TextIO.read().from("gs://folder-Name/*").
watchForNewFiles(Duration.standardSeconds(10),
Watch.Growth.afterTimeSinceNewOutput(Duration.standardSeconds(30))));
If we use it like this, it treats the old files as new files for this Dataflow job and reads all the files in that folder.
Approach 2:
PCollection<String> readfile = pipeline.apply(TextIO.read().from("gs://folder-Name/file-name").
watchForNewFiles(Duration.standardSeconds(10),
Watch.Growth.afterTimeSinceNewOutput(Duration.standardSeconds(30))));
It reads only this particular file and cannot pick up the upcoming new files.
Can anyone please suggest an approach to achieve this use case?
The watchForNewFiles() function will always read all files matching the filepattern, both existing and new. In your second approach, the file pattern is only one file, so you just get that.
However, you can use the lower-level building block transforms in FileIO to accomplish what you need. The following code will just read files written after the pipeline starts:
PCollection<String> lines = p
    .apply(FileIO.match().filepattern("gs://folder-Name/*")
        .continuously(Duration.standardSeconds(30),
            Watch.Growth.afterTimeSinceNewOutput(Duration.standardHours(1))))
    .setCoder(MetadataCoderV2.of())
    .apply(Filter.by(metadata -> metadata.lastModifiedMillis() > PIPELINE_START))
    .apply(FileIO.readMatches())
    .apply(TextIO.readFiles());
You can change the details of the Filter transform to whatever precise condition you need. To also include specific older files, you can read those with a standard TextIO.read().from(...) and then use Flatten to combine that PCollection with the continuous set. Like this:
PCollection<String> allLines =
    PCollectionList.of(lines)
        .and(p.apply(TextIO.read().from("gs://folder-Name/file-name")))
        .apply(Flatten.pCollections());
Maybe you need to clarify your use case: do you provide a file name to read, or a file pattern? How many files are expected? Should you really use a Dataflow streaming pipeline, or would a Cloud Function answer your need? And what exactly is your issue: that files get read again when you restart your pipeline?
You can, as danielm suggested, use FileIO to fetch and filter on file metadata in order to know which files were added after the pipeline began.
If you provide a file pattern, then all files matching it will be read once by the pipeline. There is no way to keep state between pipelines unless you code it yourself, so when you restart the pipeline it will read all the files matching the pattern again.
If you want to avoid that, you can manually move old files to another path between stopping the old pipeline and starting a new one.
You could also consider consuming GCS notifications on file creation with PubsubIO, and using those events to know which files to process in your pipeline.
A good practice, though, is to have multiple folders that reflect the status of the files:
input
processing
failed
succeed
This way you know the state of each file. You can put the files to process in the input folder, and inside your pipeline move each file to its corresponding state folder.
I am passing in a wildcard match string such as gs://dev-test/dev_decisions-2018-11-13*/ and passing it to TextIO as below.
p.apply(TextIO.read().from(options.getLocalDate()))
Now I want to look at all the folders in the bucket named dev-test, filter them, and only read files from the latest folder. Each folder has a timestamp appended to its name.
I am new to Dataflow and not sure how I would go about doing this.
Looking at the TextIO Javadoc, it seems as though we can write:
String folder = ...; // the GCS path to the latest/desired folder
PCollection<String> myPcollection = p.apply(TextIO.read().from(folder + "/*"));
The resulting PCollection will thus contain all the text lines from all the files in the specified folder.
Assuming you can have multiple folders in the same bucket with the same date prefix/suffix, for example "data-2018-12-18_part1", "data-2018-12-18_part2", etc., the following will work. It's a Python example, but the same approach works for Java as well. You just need to format the date to match your folder names and construct the path accordingly.
# defining the input path pattern (assumes `import datetime` at the top of the file)
input = 'gs://MYBUCKET/data-' + datetime.datetime.today().strftime('%Y-%m-%d') + '*/*'
(p
| 'ReadFile' >> beam.io.ReadFromText(input)
...
...
It will read all the files from all the folders matching the pattern.
If you know that the most recent folder will always be today's date, you could use a literal string as in Tanveer's answer. If you don't know that and need to filter the actual folder names for the most recent date, I think you'll need to use FileIO.match to read file and directory names, collect them all to one node in order to figure out which is the most recent folder, and then pass that folder name into TextIO.read().from().
The filtering might look something like:
ReduceByKey.of(p.apply(FileIO.match().filepattern("mypath")))
    .keyBy(e -> 1)                 // constant key to get everything to one node
    .valueBy(e -> e)
    .reduceBy(s -> ???)            // your code for finding the newest folder goes here
    .windowBy(new GlobalWindows())
    .triggeredBy(AfterWatermark.pastEndOfWindow())
    .discardingFiredPanes()
    .output();
I want to configure a parameterized job in Jenkins that manipulates a file:
parameters([
file(defaultValue: 'DEFAULT', name: 'tomcatCodesUrl', description: 'URL of service where to find tomcat mapping json file'),
The issue is, this parameter only returns the name of the file. How can I access its content?
Currently there is no easy way to do this. You can find discussion about this in JENKINS-27413
Yeah, that parameter is about as redundant as it can be. You might as well just use a string parameter.
Anyway, you can get the file's content with readFile:
def content = readFile encoding: 'utf-8', file: 'tomcatCodesUrl'
How to use file parameter in jenkins
This post might be helpful. The upshot is that when a user uploads a file, it is saved into the root directory of the project's workspace. You can directly access the file using any programming language you like, given the file name. The file content is not returned to you as a parameter, but since you know where it was saved (the workspace directory) and its file name, you are in control.
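If that behaviour matches your setup, a minimal sketch (assuming the 'tomcatCodesUrl' parameter from the question really lands in the workspace root under that name, and that the Pipeline Utility Steps plugin provides readJSON) would be:
node {
    // the uploaded file is expected at <workspace>/tomcatCodesUrl
    def raw = readFile file: 'tomcatCodesUrl', encoding: 'utf-8'
    def mapping = readJSON text: raw   // parse the tomcat mapping JSON
    echo "Loaded ${mapping.size()} tomcat mapping entries"
}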
In the Job DSL, there is the method readFileFromWorkspace(), which makes it possible to read a file's content from the workspace.
Now I would like to have something like readFilesFromDirectory(), which gives me all the files in some directory.
The goal is to make it possible to choose from different Ansible playbooks:
choiceParam('PLAYBOOK_FILE', ['playbook1.yml', 'playbook2.yml'])
and to populate this list with existing files from a directory. Is something like this possible?
Well, shortly after asking this question, I found the solution.
So the Hudson API can be used:
hudson.FilePath workspace =
hudson.model.Executor.currentExecutor().getCurrentWorkspace()
def resultList = workspace.list().findAll { it.name ==~ /deploy.*\.yml/ }
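To tie this back to the original goal, the matching file names can then be fed straight into the choice parameter (the job name below is made up):
job('deploy-playbook') {
    parameters {
        // populate PLAYBOOK_FILE with the playbooks found in the workspace
        choiceParam('PLAYBOOK_FILE', resultList.collect { it.name })
    }
}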