How to filter and get folder with latest date in google dataflow - google-cloud-dataflow

I am passing in a wildcard match string such as gs://dev-test/dev_decisions-2018-11-13*/, and I pass it to TextIO as below.
p.apply(TextIO.read().from(options.getLocalDate()))
Now I want to read all folders from the bucket named dev-test, filter them, and read files only from the latest folder. Each folder has a name with a timestamp appended to it.
I am new to Dataflow and not sure how I would go about doing this.

Looking at the JavaDoc here it seems as though we can code:
String folder = // The GS path to the latest/desired folder.
PCollection<String> myPcollection = p.apply(TextIO.read().from(folder + "/*"));
The resulting PCollection will thus contain all the text lines from all the files in the specified folder.

Assuming you can have multiple folders in the same bucket with the same date prefix/suffix, for example "data-2018-12-18_part1", "data-2018-12-18_part2", etc., the following will work. It's a Python example, but the same approach works in Java. You just need to format the date to match your folder names and construct the path accordingly.
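import datetime
import apache_beam as beam
# defining the input path pattern: every file in every folder whose name starts with today's date
input_pattern = 'gs://MYBUCKET/data-' + datetime.datetime.today().strftime('%Y-%m-%d') + '*/*'
(p
| 'ReadFile' >> beam.io.ReadFromText(input_pattern)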
...
...
It will read all the files from all the folders matching the pattern.

If you know that the most recent folder will always carry today's date, you could use a literal string as in Tanveer's answer. If you don't know that and need to filter the actual folder names for the most recent date, I think you'll need to use FileIO.match to read file and directory names, then collect them all on one node in order to figure out which is the most recent folder, and then pass that folder name into TextIO.read().from().
The filtering might look something like:
ReduceByKey.of(FileIO.match("mypath"))
.keyBy(e -> 1) // constant key to get everything to one node
.valueBy(e -> e)
.reduceBy(s -> ???) // your code for finding the newest folder goes here
.windowBy(new GlobalWindows())
.triggeredBy(AfterWatermark.pastEndOfWindow())
.discardingFiredPanes()
.output()
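The reduction step left open above amounts to picking the folder whose name carries the latest date. Here is a minimal Python sketch of that logic alone, outside any pipeline. It assumes each folder name contains a YYYY-MM-DD date, as in the question; newest_folder, folder_of and the sample paths are hypothetical names used only for illustration.

```python
import re
from datetime import date
from typing import Iterable, Optional

# Assumed naming scheme: a YYYY-MM-DD date somewhere in the folder name.
DATE_RE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")


def folder_of(path: str) -> str:
    """Return the parent folder of a matched file path (text before the last '/')."""
    return path.rsplit("/", 1)[0]


def newest_folder(paths: Iterable[str]) -> Optional[str]:
    """Return the folder with the latest date in its name, or None if none has a date."""
    best = None  # (date, folder) of the best candidate seen so far
    for path in paths:
        folder = folder_of(path)
        # Only look at the last path component, not the bucket name.
        match = DATE_RE.search(folder.rsplit("/", 1)[-1])
        if match is None:
            continue  # folder name carries no date, skip it
        candidate = (date(*map(int, match.groups())), folder)
        if best is None or candidate > best:
            best = candidate
    return best[1] if best else None
```

The returned folder name with "/*" appended could then serve as the file pattern for the read step. If several folders share the latest date, this sketch keeps only the one that sorts last by name, so a real pipeline might prefer to keep all of them.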

Related

Parsing logs in Zabbix

I'm trying to parse logs from a Windows folder in Zabbix, but every day a new directory such as "2022_03_15" is created with the log files in it. How can I parse logs when the directory name changes?
log["C:\Windows\Temp\app\web\0\Log\YYYY_MM_DD\Application.log"]
The logrt item would come closest, but see https://www.zabbix.com/documentation/current/en/manual/config/items/itemtypes/zabbix_agent#supported-item-keys
It notes:
file_regexp - absolute path to file and the file name described by a regular expression. Note that only the file name is a regular expression
So, sadly, this is not possible yet.
You could try to work around this by first listing the directories to find the latest one, and then using LLD (low-level discovery) to create a new log item when that is needed. For these cases it is a pity that the file and location are part of the item key.
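Finding the latest directory, the first step of that workaround, could be done by a small script outside Zabbix. Here is a minimal Python sketch, assuming directory names follow the YYYY_MM_DD pattern from the question. latest_dated_dir is a hypothetical helper, and feeding its result into discovery is not shown.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def latest_dated_dir(base: Path, fmt: str = "%Y_%m_%d") -> Optional[Path]:
    """Return the subdirectory of `base` whose name parses to the latest date."""
    dated = []
    for entry in base.iterdir():
        if not entry.is_dir():
            continue  # plain files are not candidates
        try:
            dated.append((datetime.strptime(entry.name, fmt), entry))
        except ValueError:
            continue  # name is not a date in the expected format
    return max(dated, key=lambda pair: pair[0])[1] if dated else None
```

Called on the Log folder from the question, this would return the path of the newest dated directory, to which Application.log can be appended.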

Is there a plugin to get Folder path as parameter in Jenkins?

I want to allow users to select a folder path as a parameter and get the entire folder path as the parameter value. Is there any plugin for this purpose?
I have explored the File Parameter; it allows selecting a file path but gives only the file name as output, not the path.
I also explored the Filesystem List Parameter; it lists the folders inside a given folder as choices.
Have you tried using a String parameter for it? I have several pipelines where I have paths defined as string parameters for use in shell scripts.
Are you sure Filesystem List Parameter can't be configured to meet your needs?
I believe Extended Choice Parameter should allow you to do this. You'd have to write a custom Groovy script, which could be tricky or take time to load.
You could maybe enhance the Filesystem List Parameter plugin, or request such an enhancement.

Can I avoid hardcoding file locations in SPSS syntax?

I'm using SPSS 25 syntax to open and process a set of datafiles. I would like these syntax files to be as portable as possible. For that reason, I want the user to be able to select the file locations at runtime without having to recode the syntax itself.
I'm running Windows 10, although hopefully that doesn't matter. I do have the Python plugin for SPSS, although ideally this would be a base SPSS syntax solution.
In SPSS right now, I'm doing this:
GET
FILE='C:\Users\xkcd\studies\project\rawdata'+
'\reallyraw\veryraw.sav'
PASSWORD='CorrectHorseBatteryStaple'.
DATASET NAME Demo WINDOW=FRONT.
In R, I would do this:
message("Where is the veryraw.sav file?")
demo<-fread(file.choose())
Ideally, the user would, at runtime, select the individual files one at a time.
Less ideally, the user would select a folder containing all of the files, which have known names.
I could use FILE HANDLE so that the user would only have to hardcode a few folder locations, but that's less than ideal - I really would rather that the user isn't editing the syntax at all.
Thanks in advance!
Following up on the idea of a fully automated process - the following code will work assuming there is a specific file name you need to run your code on, and only one copy exists in the folder you are searching. This is possible to run on drive C: directly, but will take much less time to run if you can narrow down the path:
* this will create a text file that has the path of the required file.
HOST COMMAND=['dir /s /b "C:\Users\somename\*required file name.sav" > C:\Users\somename\tempname.sps'].
* now to read the name and put it in a handle.
DATA LIST file = "C:\Users\somename\tempname.sps" fixed / pth 1-500 (a).
exe.
string cmd(a500).
compute cmd=concat("file handle myfile / name='", rtrim(pth), "'.").
write out="C:\Users\somename\tempname.sps" /cmd.
exe.
* inserting the new syntax will activate the handle.
insert file = "C:\Users\somename\tempname.sps".
Now you can use the handle myfile in the syntax, e.g:
get file=myfile.

Flume: How to track specified sub folders using spoolDir?

We have a system that uploads log files into folders named by date. It looks like:
/logs
/20181030
/20181031
/20181101
/20181102
/...
Suppose that I want to track the log files produced during November by using spoolDir. How could I do this?
#this won't work
a1.sources.r1.spoolDir = /logs/201811??
#this seems to work only with files. Is it possible to filter folders here?
a1.sources.r1.includePattern = ^.*\.txt$
According to the Flume source code, folders that match the ignorePattern are skipped while recursing the folder tree (to register folder trackers). So you can ignore the folders which don't match your criteria. ^(?!201811..).*$ would exclude all the folders that are not folders of November 2018. Other folders will not be tracked.
But this pattern will also apply to file names, so any file whose name does not match ^201811..$ will also be ignored. You can add the ^.*\.txt$ pattern (the one you are using for the include pattern) to the regex to make Flume accept your input files.
a1.sources.r1.ignorePattern = ^(?!(201811..)|(.*\\.txt)).*$
would do the trick for you.
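One way to sanity-check such a pattern before deploying it is to run it against sample names. The sketch below uses Python's re module, which accepts the same lookahead syntax for this pattern. It assumes the November 2018 prefix 201811 from the question and writes the pattern as a Python raw string, so the dot is escaped with a single backslash. is_ignored is a hypothetical helper, not part of Flume.

```python
import re

# Names that match this pattern would be ignored; everything else is kept.
IGNORE = re.compile(r"^(?!(201811..)|(.*\.txt)).*$")


def is_ignored(name: str) -> bool:
    """Return True if `name` matches the ignore pattern in full."""
    return IGNORE.fullmatch(name) is not None


# Made-up sample names: two date folders and two files.
for sample in ("20181030", "20181101", "server.txt", "server.log"):
    print(sample, "ignored" if is_ignored(sample) else "kept")
```

Note that the lookahead only checks the start of the name, so any name beginning with 201811 followed by at least two characters is kept, not just eight-character folder names.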

Lua - My documents path and file creation date

I'm planning to write a program in Lua that will first read specific files
and get information from those files. So my first question is: what is the "My Documents" path name? I have searched a lot of places, but I'm unable to find anything. My second question is: how can I use the first four letters of a file name to see which file is the newest?
The goal is to find the files in "My Documents", then find the most recently created file and read it.
The reading part shouldn't be a problem; navigating to "My Documents" and finding the most recently created file in a folder is.
For your first question, it depends on how robust you want your script to be. You could use Lua's built-in os.getenv() to get a variety of environment variables related to the user, such as USERNAME, USERPROFILE, HOMEDRIVE, HOMEPATH. Example:
username = os.getenv('USERNAME')
dir = 'C:\\users\\' .. username .. '\\Documents'
For the second question, there is no built-in mechanism in Windows to have the file creation or modification timestamp as part of the filename. You could read the creation or modification timestamp via a C extension you create or by using an existing Lua library like lfs. Or you could read the contents of a folder and parse the filenames if they are named according to the pattern you mention. Again, there is nothing built into Lua to do this; you would use os.execute(), lfs, your own C extension module, or a combination of these.
