Flume: How to track specified sub folders using spoolDir? - flume

We have a system that uploads log files into folders named by date. It looks like:
/logs
/20181030
/20181031
/20181101
/20181102
/...
Suppose I want to track the log files produced during November using spoolDir. How could I do this?
#this won't work
a1.sources.r1.spoolDir = /logs/201811??
#this seems to work only with files. Is it possible to filter folders here?
a1.sources.r1.includePattern = ^.*\.txt$

According to the Flume source code, folders that match the ignorePattern are skipped while recursing the folder tree (to register folder trackers). So you can ignore the folders that don't match your criteria. ^(?!201811..).*$ would exclude all the folders that are not folders of November 2018, so only the November 2018 folders are tracked.
But this pattern also applies to file names, so any file whose name does not match 201811.. will be ignored as well. You can add the ^.*\.txt$ pattern (the one you are using for the include pattern) to the regex to make Flume accept your input files.
a1.sources.r1.ignorePattern = ^(?!(201811..)|(.*\\.txt)).*$
would do the trick for you.
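To check which names a pattern of this shape would skip, here is a minimal Python sketch. It is not Flume code: it assumes the ignore pattern is tested against the whole folder or file name, uses the November 2018 variant of the regex, and is_ignored is a hypothetical helper name. Python and Java regex syntax agree for this pattern, but the sample names are made up for illustration.

```python
import re

# November 2018 variant of the ignore pattern (single backslash because this
# is a Python raw string). Names that do NOT start with 201811.. and do NOT
# end in .txt are matched, i.e. ignored.
IGNORE = re.compile(r"^(?!(201811..)|(.*\.txt)).*$")


def is_ignored(name: str) -> bool:
    """Return True when the whole name matches the ignore pattern."""
    return IGNORE.fullmatch(name) is not None


if __name__ == "__main__":
    # Hypothetical folder and file names, for illustration only.
    for name in ["20181030", "20181101", "app.txt", "app.log"]:
        print(name, "ignored" if is_ignored(name) else "tracked")
```

Note that the lookahead only anchors the start of the name, so anything that merely begins with 201811 plus two characters is kept as well.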

Related

How to filter and get folder with latest date in google dataflow

I am passing in a wildcard match string such as gs://dev-test/dev_decisions-2018-11-13*/, and I am passing it to TextIO as below.
p.apply(TextIO.read().from(options.getLocalDate()))
Now I want to list all folders in the bucket named dev-test, filter them, and read files only from the latest folder. Each folder has a name with a timestamp appended to it.
I am new to Dataflow and not sure how I would go about doing this.
Looking at the JavaDoc here it seems as though we can code:
String folder = // The GS path to the latest/desired folder.
PCollection<String> myPcollection = p.apply(TextIO.read().from(folder + "/*"));
The resulting PCollection will thus contain all the text lines from all the files in the specified folder.
Assuming you can have multiple folders in the same bucket with the same date prefix/suffix, for example "data-2018-12-18_part1", "data-2018-12-18_part2" etc., the following will work. It's a Python example, but the same approach works for Java as well. You just need to format the date to match your folder names and construct the path accordingly.
# defining the input path pattern
input = 'gs://MYBUCKET/data-' + datetime.datetime.today().strftime('%Y-%m-%d') + '*/*'
(p
| 'ReadFile' >> beam.io.ReadFromText(input)
...
...
It will read all the files from all the folders matching the pattern.
If you know that the most recent folder will always be today's date, you could use a literal string as in Tanveer's answer. If you don't know that and need to filter the actual folder names for the most recent date, I think you'll need to use FileIO.match to read file and directory names, then collect them all on one node in order to figure out which is the most recent folder, and then pass that folder name into TextIO.read().from().
The filtering might look something like:
ReduceByKey.of(FileIO.match("mypath"))
.keyBy(e -> 1) // constant key to get everything to one node
.valueBy(e -> e)
.reduceBy(s -> ???) // your code for finding the newest folder goes here
.windowBy(new GlobalWindows())
.triggeredBy(AfterWatermark.pastEndOfWindow())
.discardingFiredPanes()
.output()
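The "find the newest folder" step itself can be sketched in plain Python, outside of any pipeline. This is only an illustration under assumptions: the folder names are assumed to embed a YYYY-MM-DD date (as in the dev_decisions-2018-11-13 example), the input is assumed to be a list of matched file paths, and latest_folder is a hypothetical helper name, not part of Beam or Dataflow.

```python
import re
from typing import Iterable, Optional

# Assumed date format inside the folder name, e.g. "dev_decisions-2018-11-13".
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def latest_folder(paths: Iterable[str]) -> Optional[str]:
    """Return the parent folder whose name carries the newest date, or None."""
    best_date, best_folder = None, None
    for path in paths:
        folder = path.rsplit("/", 1)[0]    # drop the file name
        name = folder.rsplit("/", 1)[-1]   # last path segment = folder name
        found = DATE_RE.search(name)
        # ISO-style dates sort correctly as plain strings.
        if found and (best_date is None or found.group() > best_date):
            best_date, best_folder = found.group(), folder
    return best_folder


if __name__ == "__main__":
    # Hypothetical file listing, for illustration only.
    files = [
        "gs://dev-test/dev_decisions-2018-11-12/a.txt",
        "gs://dev-test/dev_decisions-2018-11-13/b.txt",
    ]
    print(latest_folder(files))
```

The returned folder plus "/*" could then be used as the read pattern. If several folders share the same date, this sketch keeps the first one it sees, so a finer timestamp would need a longer regex.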

Flume "Spooling Directory Source": recursive lookup of files within subdirectories

I am looking for a way to make the Flume "Spooling Directory Source" look recursively for files within subdirectories.
There are some references here: https://issues.apache.org/jira/browse/FLUME-1899
However, multiple versions have come out since then. Is there any way to have a recursive lookup for files within subdirectories in the Spooling Directory Source?
I think you can use the patch FLUME-1899-2.patch directly.
Set "recursiveDirectorySearch" to true in your config file.
NOTE: the regex in the ignorePattern of the config file will also be applied to folder names during the recursive search, so you might need to modify the code in org/apache/flume/client/avro/ReliableSpoolingFileEventReader.java if you want to ignore the folder name.

TFS exclude a directory - .tfignore not working and it is a very large directory

I have seen other posts and read them on stackoverflow - How to ignore files/directories in TFS for avoiding them to go to central source repository?
However this does not seem to work.
I have a folder at the root called /FS, and in that directory I have the following .tfignore:
################################################################################
# This .tfignore file was automatically created by Microsoft(R) Visual Studio.
#
# Local items matching filespecs in this file will not be added to version
# control. This file can be checked in to share exclusions with others.
#
# Wildcard characters are * and ?. Patterns are matched recursively unless the
# pattern is prefixed by the \ character.
#
# You can prepend a path to a pattern to make it more specific. If you do,
# wildcard characters are not permitted in the path portion.
#
# The # character at the beginning of a line indicates a comment.
#
# The ! prefix negates a pattern. This can be used to re-include an item after
# it was excluded by a .tfignore file higher in the tree, or by the Team
# Project Collection's global exclusions list.
#
# The / character is interpreted as a \ character on Windows platforms.
#
# Examples:
#
# # Excludes all files ending in .txt in Alpha\Beta and all its subfolders.
# Alpha\Beta\*.txt
#
# # Excludes all files ending in .cpp in this folder only.
# \*.cpp
#
# # Excludes all files ending in .cpp in this folder and all subfolders.
# *.cpp
#
# # If "Contoso" is a folder, then Contoso and all its children are excluded.
# # If it is a file, then only the "Contoso" in this folder is excluded.
# \Contoso
#
# # If Help.exe is excluded by a higher .tfignore file or by the Team Project
# # Collection global exclusions list, then this pattern re-includes it in
# # this folder only.
# !\Help.exe
#
################################################################################
\BRAND
\CO
\COVER
\COVERBANNER
\DEPT
\DEPT-CATS
\LIB
\PRODUCTS
However, TFS keeps trying to add these directories. I even have at the root directory another .tfignore
/FS
I simply need /FS out of TFS. This is 15GB worth of images that do not belong in source control; we have multiple areas where these images are backed up, and carrying them along when creating branches is very resource intensive.
Do I need to delete the /FS from TFS with CMD PROMPT? Any help would be great. I am simply frustrated and stuck.
I would assume that .tfignore works the same with local and server workspaces, so:
Yes, you have to delete \FS from source control, but doing so from the Team Explorer should suffice.
Add a .tfignore in \FS's parent directory.
This will ensure that your \FS folder does not sit in source control and get copied over when you make a new branch.
This is an old post, but I'm providing an answer anyway since I stumbled upon the same issue and finally found an answer:
Someone commented that .tfignore only works with local workspaces -> this is correct
The syntax for excluding entire subfolders within your current folder (where your .tfignore file resides) can be either of these:
/subfolder
\subfolder
If you want to exclude subfolders of subfolders, or files in subfolders, it can be any of these or something similar (depending on what you want to filter out):
/subfolder/subsubfolder
\subfolder\subsubfolder
/subfolder/*.txt
\subfolder\*.txt
It is noteworthy that tf.exe has some strange quirks: when anything isn't right, nothing in .tfignore will have any effect and you will not receive any visible error message (maybe it is hidden somewhere in some log, not sure). At first I thought I was having some issue with line endings, because when I added a new line with a new entry nothing got excluded until I added a blank line in between. Later I noticed that tf.exe seems to do some kind of validation that doesn't always succeed the first time for certain syntaxes. It's hard to put my finger on exactly which scenarios trigger it, but it works the second time.
I suggest that anyone doing this keep the text editor open, save the changes to .tfignore, and then check in Team Explorer whether the detected files disappear (which they should if they're ignored). If this doesn't work within 1 to 3 seconds, try saving the .tfignore file again without changing anything; then it will probably work. I've seen this behavior 100 times now.
After editing and saving .tfignore file:
After a blank re-save of the file:

Exclude directory in uDeploy plugin for jenkins

I'm trying to import a new version of a uDeploy component through Jenkins and the uDeploy plugin. The component comes from a Git repository and has the .git folder in it. Nothing I've tried to exclude the .git folder from syncing works. I'm thinking that the plugin is looking for files with a .git extension rather than a folder. How do I exclude the .git folder from syncing?
I tried ".git", **/.git/, *.git/*, **.git/*, and a handful of other 'terms' and they all show up in the console output as:
Working Directory: C:\Program Files (x86)\Jenkins\jobs\DIT Com\workspace
Includes: **/
Excludes: ".git"
Uploading files in C:\Program Files (x86)\Jenkins\jobs\DIT Com\workspace
Uploading: .git/hooks/pre-commit.sample
...
Uploading: .git/refs/heads
Files committed
Finished: SUCCESS
This is what the exclude section looks like, with the help bubble clicked (that's what's in the gray box)
Unable to comment so adding as an answer-
Two consecutive asterisks ("**") in patterns matched against full pathname may have special meaning:
A leading "**" followed by a slash means match in all directories. For example, "**/foo" matches file or directory "foo" anywhere, the same as pattern "foo". "**/foo/bar" matches file or directory "bar" anywhere that is directly under directory "foo".
A trailing "/**" matches everything inside. For example, "abc/**" matches all files inside directory "abc", relative to the location of the .gitignore file, with infinite depth.
A slash followed by two consecutive asterisks then a slash matches zero or more directories. For example, "a/**/b" matches "a/b", "a/x/b", "a/x/y/b" and so on.
Other consecutive asterisks are considered invalid.
Have you tried a regular expression? say, ^/.*/.git/
Looks like the answer to excluding directories is in the form of **/dir_name/**.
If someone could give some more information on what the leading *'s are doing (not sure how the second * wildcard interacts, nor the trailing second *) I would be really interested in understanding why it works!
reference: ant fileset dir exclude certain directory
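As a rough illustration of what the leading "**/" and trailing "/**" are doing, here is a small Python sketch that translates a simplified version of the rules quoted above into a regex. It is an approximation written for this thread, not the matcher used by the uDeploy plugin, Ant or Git; glob_to_regex and matches are hypothetical helper names, and paths are assumed to use forward slashes.

```python
import re


def glob_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a simplified **-style glob into a compiled regex."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            # "**/" : zero or more leading directories
            parts.append("(?:.*/)?")
            i += 3
        elif pattern.startswith("/**", i) and i + 3 == len(pattern):
            # trailing "/**" : everything inside, at any depth
            parts.append("/.*")
            i += 3
        elif pattern[i] == "*":
            # single "*" : any characters within one path segment
            parts.append("[^/]*")
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def matches(pattern: str, path: str) -> bool:
    """Return True when the whole relative path matches the glob."""
    return glob_to_regex(pattern).fullmatch(path) is not None


if __name__ == "__main__":
    for pat in ["**/.git/**", "*.git/*"]:
        for path in [".git/hooks/pre-commit.sample", "sub/.git/config", "src/main.py"]:
            print(pat, path, matches(pat, path))
```

Under these simplified rules, a single * never crosses a slash, which is why a pattern like *.git/* cannot reach files nested deeper inside .git, while **/.git/** covers a .git directory at any level together with everything below it.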

tfignore wildcard directory segment

Is it possible using .tfignore to add a wildcard to directories? I assumed it would have been a case of just adding an asterisk wildcard to the directory segment. For example:
\path\*\local.properties
However, this does not work, and I am unsure how I would achieve such behaviour without explicitly declaring every reference that I need to exclude.
Documentation
# begins a comment line
The * and ? wildcards are supported.
A filespec is recursive unless prefixed by the \ character.
! negates a filespec (files that match the pattern are not ignored)
Extract from the documentation.
The documentation should more correctly read:
The * and ? wildcards are supported in the leaf name only.
That is, you can use something like these to select multiple files or multiple subdirectories, respectively, in a common parent:
/path/to/my/file/foo*.txt
/path/to/my/directories/temp*
What may work in your case--to ignore the same file in multiple directories--is just this:
foo*.txt
That is, specify a path-less name or glob pattern to ignore matching files throughout your tree. Unfortunately you have only those two options, local or global; you cannot use a relative path like this--it will not match any files!
my/file/foo*.txt
The global option is a practical one because .tfignore only affects unversioned files. Once you add a file to source control, changes to that file will be properly recognized. Furthermore, if you need to add an instance of an ignored name to source control, you can always go into TFS source control explorer and manually add it.
It seems this is now supported
As you see I edited tfignore in the root folder of the project such that any new branch will ignore its .vs folder when being examined for source control changes
\*\.vs
Directory/folder name wildcarding works for me in VS2019 Professional. For example if I put this in .tfignore:
*uncheckedToTFS
The above will ignore any folder whose name ends with "uncheckedToTFS", regardless of where the folder is (it doesn't have to be a top-level folder; it can be many levels deep).
