I am trying to import CSV files from a folder that contains several different CSV file types.
So I need to filter my Foreach Loop on the flat file name prefix.
I only want to process files matching MyFileType_1*.csv and not the others in the same folder.
Any suggestion is welcome, thanks.
In your Foreach Loop container you can specify which files to read, as Ocaso says. On the Variable Mappings tab of the Foreach Loop container you can set the variable in which each found file path is stored. Then you can use this variable as the connection string of a Flat File connection.
Nighty_'s answer is correct, but just for the sake of completeness it is worth mentioning that to set the ConnectionString of a Flat File connection you must use a variable of package-level scope in the expression (in his case @[User::v_FilePath]). This is because the connection itself is package level. This might feel a bit unintuitive... or ugly... it is.
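For illustration, the relevant settings could look roughly like this (folder, file mask, and variable names below are examples, not taken from the original package):

Foreach Loop Editor > Collection (Foreach File Enumerator)
    Folder:             C:\Import
    Files:              MyFileType_1*.csv
    Retrieve file name: Fully qualified
Foreach Loop Editor > Variable Mappings
    Variable: User::v_FilePath    Index: 0
Flat File Connection Manager > Properties > Expressions
    Property: ConnectionString    Expression: @[User::v_FilePath]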
I have rule A implemented with a macro that uses declare_directory to produce a set of files:
output = ctx.actions.declare_directory("selected")
Names of those files are not known in advance. The implementation returns the directory created by declare_directory with the following:
return DefaultInfo(
    files = depset([output]),
)
Rule A is included in "srcs" attribute of rule B. Rule B is also implemented with a macro. Unfortunately the list of files passed to B implementation through "srcs" attribute only contains the "selected" directory created by rule A instead of files residing in that directory.
I know that Args class supports expansion of directories so I could pass names of all files in "selected" directory to a single action. What I need, however, is a separate action for every individual file for parallelism and caching. What is the best way to achieve that?
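For context, the single-action variant using Args directory expansion that I mentioned would look roughly like this (a sketch only; the tool label and output name are made up):

def _rule_b_impl(ctx):
    out = ctx.actions.declare_file(ctx.label.name + ".out")
    args = ctx.actions.args()
    # add_all() expands directory (TreeArtifact) inputs into the individual
    # files they contain at execution time (expand_directories defaults to True).
    args.add_all(ctx.files.srcs, expand_directories = True)
    args.add(out)
    ctx.actions.run(
        executable = ctx.executable._tool,
        arguments = [args],
        inputs = ctx.files.srcs,
        outputs = [out],
    )
    return DefaultInfo(files = depset([out]))

rule_b = rule(
    implementation = _rule_b_impl,
    attrs = {
        "srcs": attr.label_list(allow_files = True),
        "_tool": attr.label(
            default = "//tools:process_files",  # placeholder tool label
            executable = True,
            cfg = "exec",
        ),
    },
)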
This is one of the intended use cases of directory outputs (called TreeArtifacts in the implementation), and it's implemented using ActionTemplate:
https://github.com/bazelbuild/bazel/blob/c2100ad420618bb53754508da806b5624209d9be/src/main/java/com/google/devtools/build/lib/actions/ActionTemplate.java#L24-L57
However, this is not exposed to Starlark, and it has only a couple of usages currently, in the Android rules (AndroidBinary.java) and the C++ rules (CcCompilationHelper.java). The Android and C++ rules are going to be migrated over to Starlark, so this functionality might eventually be made available in Starlark, but I'm not sure of any concrete timelines. It would probably be good to file a feature request on GitHub.
I'm trying to parse logs from a Windows folder in Zabbix, but every day a new directory like "2022_03_15" is created with the log files in it. How can I parse the new directory name?
log["C:\Windows\Temp\app\web\0\Log\YYYY_MM_DD\Application.log"]
The logrt item would come closest, but the documentation at https://www.zabbix.com/documentation/current/en/manual/config/items/itemtypes/zabbix_agent#supported-item-keys notes:
file_regexp - absolute path to file and the file name described by a regular expression. Note that only the file name is a regular expression
So, sadly, this is not possible yet.
You could try to work around this by first using a directory listing to find the latest directory and then using low-level discovery (LLD) to create a new log item when needed. For these cases it is a pity that the file and its location are part of the item key.
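As an illustrative sketch of that workaround (untested; the key name, script path, and macro name are assumptions, not existing items):

UserParameter=app.log.dirs,powershell -NoProfile -File "C:\zabbix\scripts\list_log_dirs.ps1"

The script would print LLD JSON such as {"data":[{"{#LOGDIR}":"2022_03_15"},{"{#LOGDIR}":"2022_03_16"}]}, and the discovery rule's item prototype could then use a key like log["C:\Windows\Temp\app\web\0\Log\{#LOGDIR}\Application.log"].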
Use case: at Dataflow job start-up we provide an initial file name to read data from; later the job should watch for new files in that directory and consider all remaining old files as already read.
Issues:
Approach 1:
PCollection<String> readfile = pipeline.apply(TextIO.read().from("gs://folder-Name/*")
    .watchForNewFiles(Duration.standardSeconds(10),
        Watch.Growth.afterTimeSinceNewOutput(Duration.standardSeconds(30))));
If we use it like this, it considers old files as new files for this Dataflow job and reads all the files in that folder.
Approach 2:
PCollection<String> readfile = pipeline.apply(TextIO.read().from("gs://folder-Name/file-name")
    .watchForNewFiles(Duration.standardSeconds(10),
        Watch.Growth.afterTimeSinceNewOutput(Duration.standardSeconds(30))));
It reads only this particular file and is not able to read upcoming new files.
Can anyone please suggest an approach to achieve this use case?
The watchForNewFiles() function will always read all files matching the filepattern, both existing and new. In your second approach, the file pattern is only one file, so you just get that.
However, you can use the lower-level building block transforms in FileIO to accomplish what you need. The following code will just read files written after the pipeline starts:
PCollection<String> lines = p
    .apply(FileIO.match().filepattern("gs://folder-Name/*")
        .continuously(Duration.standardSeconds(30),
            Watch.Growth.afterTimeSinceNewOutput(Duration.standardHours(1))))
    .setCoder(MetadataCoderV2.of())
    .apply(Filter.by(metadata -> metadata.lastModifiedMillis() > PIPELINE_START))
    .apply(FileIO.readMatches())
    .apply(TextIO.readFiles());
You can change the details of the Filter transform to whatever precise condition you need. To also include specific older files, you can read those with a standard TextIO.read().from(...) and then use Flatten to combine that PCollection with the continuous set. Like this:
PCollection<String> allLines =
    PCollectionList.of(lines)
        .and(p.apply(TextIO.read().from("gs://folder-Name/file-name")))
        .apply(Flatten.pCollections());
Maybe you need to clarify your use case: do you provide a file name to read, or a file pattern? What number of files is expected? Should you really use a Dataflow streaming pipeline, or would a Cloud Function answer your need? What exactly is your issue? Do files get read again when you restart your pipeline?
You can, as suggested by danielm, use FileIO to fetch and filter on file metadata in order to know which files were added after the pipeline began.
If you provide a file pattern, then all files will be read once by the pipeline. There is no way to keep state between pipelines unless you code it yourself, so when you restart the pipeline you will read all the files matching the pattern again.
If you want to avoid that, you can manually move old files to another path between stopping the old pipeline and starting a new one.
You could also consider consuming GCS notifications on file creation with PubsubIO and using this event to know which file to process in your pipeline.
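As a rough sketch of that notification-based approach in the Beam Java SDK (the subscription name is a placeholder; GCS notifications carry the bucket and object names in the bucketId and objectId message attributes):

PCollection<String> newFilePaths = p
    .apply(PubsubIO.readMessagesWithAttributes()
        .fromSubscription("projects/my-project/subscriptions/gcs-new-files"))
    .apply(MapElements.into(TypeDescriptors.strings())
        .via((PubsubMessage msg) ->
            "gs://" + msg.getAttribute("bucketId") + "/" + msg.getAttribute("objectId")));

// The resulting paths can then be matched and read continuously:
PCollection<String> lines = newFilePaths
    .apply(FileIO.matchAll())
    .apply(FileIO.readMatches())
    .apply(TextIO.readFiles());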
A good practice, though, is to have multiple folders that reflect the status of the files:
input
processing
failed
succeed
This way you know the state of each file. You put the files to process in the input folder, and inside your pipeline move each file to its corresponding state folder.
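A minimal sketch of that move step using the Beam FileSystems API (bucket and folder names are placeholders):

ResourceId source = FileSystems.matchNewResource("gs://my-bucket/processing/file.csv", false);
ResourceId target = FileSystems.matchNewResource("gs://my-bucket/succeed/file.csv", false);
// rename() acts as a move; on GCS it is performed as copy + delete.
FileSystems.rename(Collections.singletonList(source), Collections.singletonList(target));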
I'm using SPSS 25 syntax to open and process a set of datafiles. I would like these syntax files to be as portable as possible. For that reason, I want the user to be able to select the file locations at runtime without having to recode the syntax itself.
I'm running Windows 10, although hopefully that doesn't matter. I do have the Python plugin for SPSS, although ideally this would be a base SPSS syntax solution.
In SPSS right now, I'm doing this:
GET
FILE='C:\Users\xkcd\studies\project\rawdata'+
'\reallyraw\veryraw.sav'
PASSWORD='CorrectHorseBatteryStaple'.
DATASET NAME Demo WINDOW=FRONT.
In R, I would do this:
message("Where is the veryraw.sav file?")
demo<-fread(file.choose())
Ideally, the user would, at runtime, select the individual files one at a time.
Less ideally, the user would select a folder in which all of the files, with known names, are located.
I could use FILE HANDLE so that the user would only have to hardcode a few folder locations, but that's less than ideal - I really would rather that the user isn't editing the syntax at all.
Thanks in advance!
Following up on the idea of a fully automated process - the following code will work assuming there is a specific file name you need to run your code on, and only one copy exists in the folder you are searching. This is possible to run on drive C: directly, but will take much less time to run if you can narrow down the path:
* this will create a text file that has the path of the required file.
HOST COMMAND=['dir /s /b "C:\Users\somename\*required file name.sav" > C:\Users\somename\tempname.sps'].
* now to read the name and put it in a handle.
DATA LIST file = "C:\Users\somename\tempname.sps" fixed / pth 1-500 (a).
exe.
string cmd(a500).
compute cmd=concat("file handle myfile / name='", rtrim(pth), "'.").
write out="C:\Users\somename\tempname.sps" /cmd.
exe.
* inserting the new syntax will activate the handle.
insert file = "C:\Users\somename\tempname.sps".
Now you can use the handle myfile in the syntax, e.g:
get file=myfile.
I have a script which saves some files at a given location. It works fine, but when I send this code to someone else, they have to change the paths in the code. That is not comfortable for someone who does not know what is in the code, and it is tedious for me to explain every time where and how the code should be changed.
I want to put this path in a variable that is read from a configuration file, so it will be easier for everyone to change just this config file and nothing in my code. But I have never done this before and could not find any information on how to do this on the internet.
PS: I do not have any code yet and I am asking about a general solution, but it is really difficult to find good material about DXL on the internet, especially since I'm new to it. Maybe some of you already do this or have an idea how it could be done?
DXL has a perm to read the complete content of a file into a variable: string readFile(string) (or Buffer readFile(string)).
You can split the output by \n and then use regular expressions to find all lines that match the pattern
^\s*([^;#].*)\s*=\s*(.*)\s*$
(i.e. key = value - where comment lines start with ; or #)
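For illustration, applying that idea to a single line in DXL could look roughly like this (untested sketch; the example line is made up and the pattern is simplified for DXL's regexp flavour):

string ln = "outputPath = C:\\temp\\results"    // one line taken from the readFile result
Regexp keyVal = regexp2 "^[ \t]*([^;#][^=]*)=[ \t]*(.*)$"
if (keyVal ln) {
    string key = ln[match 1]
    string value = ln[match 2]
    print key " -> " value "\n"
}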
But in DOORS I prefer using DOORS modules as configuration modules. Object Heading can be the key, Object Text can be the value.
Hardcode the full name of the configuration module into your DXL file and the user can modify the behaviour of the application.
The advantage over a file is that you need not make assumptions on where the config file is to be stored on the file system.
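A rough sketch of reading such a configuration module (untested; the module path is an example):

Module conf = read("/MyProject/Config/Script Settings", false)   // open read-only, no GUI
Skip settings = createString    // maps key string -> value string
Object o
for o in conf do {
    string key = o."Object Heading" ""
    string value = o."Object Text" ""
    put(settings, key, value)
}
string outputPath
if (find(settings, "outputPath", outputPath)) print "outputPath = " outputPath "\n"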
It really depends on your situation. You are going to need to be a little more specific about what you mean by "they need to change the paths in the code". What are these paths to? Are they DOORS module paths, are they paths to local/network files, or are they something else entirely?
Like user3329561 said, you COULD use a DOORS module as a configuration file. I wouldn't recommend it, though, simply because that is not what DOORS modules were designed for. DOORS is fully capable of reading system files one line at a time as well as all at once, but I can't recommend that option either until I know what types of paths you want to load and why.
I suspect that there is a better solution for your problem that will present itself once more information is provided.
I had the same problem: I needed to specify the path of the configuration file used in my DXL script.
I solved this issue by passing the directory path as a parameter to doors.exe, as follows:
"...\DOORS\9.3\bin\doors.exe" -dxl "string myVar = \"Hello World\""
Then, in my DXL script, the variable myVar is available as a global variable.
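Building on that, the same mechanism can pass a configuration path instead of a fixed string; the snippet below is only an illustration (directory and file names are assumptions):

// launched e.g. as: doors.exe -dxl "string configDir = \"C:\\tools\\cfg\""
// configDir is then already defined when the script DOORS runs next:
string cfgFile = configDir "\\myScript.cfg"    // adjacent strings concatenate in DXL
string contents = readFile cfgFile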