I have to admit that I always forgot the syntactical intracacies of the naming patterns for Nant (eg. those used in filesets). The double asterisk/single asterisk stuff seems to be very forgettable in my mind.
Can someone provide a definitive guide to the naming patterns?
The rules are:
a single star (*) matches zero or more characters within a path name
a double star (**) matches zero or more characters across directory levels
a question mark (?) matches exactly one character within a path name
Another way to think about it is double star (**) matches slash (/) but single star (*) does not.
Let's say you have the files:
bar.txt
src/bar.c
src/baz.c
src/test/bartest.c
Then the patterns:
*.c matches nothing (there are no .c files in the current directory)
src/*.c matches 2 and 3
*/*.c matches 2 and 3 (because * only matches one level)
**/*.c matches 2, 3, and 4 (because ** matches any number of levels)
bar.* matches 1
**/bar.* matches 1 and 2
**/bar*.* matches 1, 2, and 4
src/ba?.c matches 2 and 3
Here's a few extra pattern matches which are not so obvious from the documentation. Tested using NAnt for the example files in benzado's answer:
bar.txt
src/bar.c
src/baz.c
src/test/bartest.c
src** matches 2, 3 and 4
**.c matches 2, 3, and 4
**ar.* matches 1 and 2
**/bartest.c/** matches 4
src/ba?.c/** matches 2 and 3
Double asterisks (**) are associated with the folder-names matching, whereas single symbols asterisk (* = multi characters) as well as the question-mark (? = single character) are used to match the file-names.
Check out the Nant reference. The fileset patterns are:
'*' matches zero or more characters, e.g. *.cs
'?' matches one character, e.g. ?.cs
And '**' matches a directory tree e.g. src/**/*.cs will find all cs files in any sub-directory of src.
Related
Just came across this pattern, which I really don't understand:
^[%w-.]+$
And could you give me some examples to match this expression?
Valid in Lua, where %w is (almost) the equivalent of \w in other languages
^[%w-.]+$ means match a string that is entirely composed of alphanumeric characters (letters and digits), dashes or dots.
Explanation
The ^ anchor asserts that we are at the beginning of the string
The character class [%w-.] matches one character that is a letter or digit (the meaning of %w), or a dash, or a period. This would be the equivalent of [\w-.] in JavaScript
The + quantifier matches such a character one or more times
The $ anchor asserts that we are at the end of the string
Reference
Lua Patterns
Actually it will match nothing. Because there is an error: w- this is a start of a text range and it is out of order. So it should be %w\- instead.
^[%w\-.]+$
Means:
^ assert position at start of the string
[%w\-.]+ match a single character present in the list below
+ Quantifier: Between one and unlimited times, as many times as possible, giving back as needed [greedy]
%w a single character in the list %w literally (case sensitive)
\- matches the character - literally
. the literal character .
$ assert position at end of the string
Edit
As the OP changed the question and the tags this answer no longer fits as a proper answer. It is POSIX based answer.
As #zx81 comment:
%w is \w in Lua which means any alphanumeric characters plus "_"
I can't figure out what does this regex match:
A: "\\/\\/c\\/(\\d*)"
B: "\\/\\/(\\d*)"
I suppose they are matching some kind of number sequence since \d matches any digit but I'd like to know an example of a string that would be a match for this regex.
The pattern syntax is that specified by ICU. Expressions are created with NSRegularExpression in an iOS app and are correct.
The first matches //c/ + 0 or more digits. The second matches // + 0 or more digits. In both the digits are captured.
An example of a match for A) is //c/123
An example of a match for B) is //12345
When I use Cygwin which emulates Bash on Windows, I sometimes run into situations where I have to escape my escape characters which is what I think is making this expression look so weird. For instance, when I use sed to look for a single '\' I sometimes have to write it as '\\\\'. (Funny, StackOverflow proved my point. If you write 4 backslashes in the comment, it only shows two. So if you process it again, they might all disappear depending on your situation).
Considering this, it might be helpful to think of pairs of backslashes as representing only one if you're coming from a similar situation. My guess would be you are. Because of this I would say Erik Duymelinck is probably spot on. This will capture a sequence of digits that may or may not follow a couple slashes and a c:
//c/000
//00000
This regex matches an odd sequence of characters, which, at first glance, almost seem like a regex, since \d is a digit, and followed by an asterisk (\d*) would mean zero-or-more digits. But it's not a digit, because the escape-slash is escaped.
\\/\\/c\\/(\\d*)
So, for instance, this one matches the following text:
\/\/c\/\
\/\/c\/\d
\/\/c\/\dd
\/\/c\/\ddd
\/\/c\/\dddd
\/\/c\/\ddddd
\/\/c\/\dddddd
...
This one is almost the same
\\/\\/(\\d*)
except you just delete the c\/ from the above results:
\/\/\
\/\/\d
\/\/\dd
\/\/\ddd
\/\/\dddd
\/\/\ddddd
\/\/\dddddd
...
In both cases, the final \ and optional d is [capture group][1] one.
My first impression was that these regexes were intended for escaping in Java strings, meaning they would be completely invalid. If the were escaped for Java strings, such as
Pattern p = Pattern.compile("\\/\\/c\\/(\\d*)");
It would be invalid, because after un-escaping, it would result in this invalid regex:
\/\/c\/(\d*)
The single escape-slashes (\) are invalid. But the \d is valid, as it would mean any digit.
But again, I don't think they're invalid, and they're not escaped for a Java string. They're just odd.
I am writing a parse for a script language.
I need to recognize strings, integers and floats.
I successfully recognize strings with the rule:
[a-zA-Z0-9_]+ {return STRING;}
But I have problem recognizing Integers and Floats. These are the (wrong) rules I wrote:
["+"|"-"][1-9]{DIGIT}* { return INTEGER;}
["+"|"-"]["0." | [1-9]{DIGIT}*"."]{DIGIT}+ {return FLOAT;}
How can I fix them?
Furthermore, since a "abc123" is a valid string, how can I make sure that it is recognized as a string and not as the concatenation of a string ("abc") and an Integer ("123") ?
First problem: There's a difference between (...) and [...]. Your regular expressions don't do what you think they do because you're using the wrong punctuation.
Beyond that:
No numeric rule recognizes 0.
Both numeric rules require an explicit sign.
Your STRING rule recognizes integers.
So, to start:
[...] encloses a set of individual characters or character ranges. It matches a single character which is a member of the set.
(...) encloses a regular expression. The parentheses are used for grouping, as in mathematics.
"..." encloses a sequence of individual characters, and matches exactly those characters.
With that in mind, let's look at
["+"|"-"][1-9]{DIGIT}*
The first bracket expression ["+"|"-"] is a set of individual characters or ranges. In this case, the set contains: ", +, " (again, which has no effect because a set contains zero or one instances of each member), |, and the range "-", which is a range whose endpoints are the same character, and consequently only includes that character, ", which is already in the set. In short, that was equivalent to ["+|]. It will match one of those three characters. It requires one of those three characters, in fact.
The second bracket expression [1-9] matches one character in the range 1-9, so it probably does what you expected. Again, it matches exactly one character.
Finally, {DIGIT} matches the expansion of the name DIGIT. I'll assume that you have the definition:
DIGIT [0-9]
somewhere in your definitions section. (In passing, I note that you could have just used the character class [:digit:], which would have been unambiguous, and you would not have needed to define it.) It's followed by a *, which means that it will match zero or more repetitions of the {DIGIT} definition.
Now, an example of a string which matches that pattern:
|42
And some examples of strings which don't match that pattern:
-7 # The pattern must start with |, + or "
42 # Again, the pattern must start with |, + or "
+0 # The character following the + must be in the range [0-9]
Similarly, your float pattern, once the [...] expressions are simplified, becomes (writing out the individual pieces one per line, to make it more obvious):
["+|] # i.e. the set " + |
["0.|[1-9] # i.e. the set " 0 | [ 1 2 3 4 5 6 7 8 9
{DIGIT}* # Any number of digits
"." # A single period
] # A single ]
{DIGIT}+ # one or more digits
So here's a possible match:
"..]3
I'll skip over writing out the solution because I think you'll benefit more from doing it yourself.
Now, the other issues:
Some rule should match 0. If you don't want to allow leading zeros, you'll need to just a it as a separate rule.
Use the optional operator (?) to indicate that the preceding object is optional. eg. "foo"? matches either the three characters f, o, o (in order) or matches the empty string. You can use that to make the sign optional.
The problem is not the matching of abc123, as in your question. (F)lex always gives you the longest possible match, and the only rule which could match the starting character a is the string rule, so it will allow the string rule to continue as long as it can. It will always match all of abc123. However, it will also match 123, which you would probably prefer to be matched by your numeric rule. Here, the other (f)lex matching criterion comes into play: when there are two or more rules which could match exactly the same string, and none of the rules can match a longer string, (f)lex chooses the first rule in the file. So if you want to give numbers priority over strings, you have to put the number rule earlier in your (f)lex file than the string rule.
I hope that gives you some ideas about how to fix things.
I am working on an ANT pattern parser as part of a large server project.
There are some good examples of ANT patterns in the answer to this post: How do I use Nant/Ant naming patterns? however, I am still confused about some possible permutations.
One of the examples on the ANT pattern documentation here http://nant.sourceforge.net/release/0.85/help/types/fileset.html is as follows:
**/test/** Matches all files that have a test element in their path, including test as a filename.
My understanding is that ** matches one or more directories and also files under those directories. So I would expect **/test/** to match src/test/subfolder/file.txt and test/file2.txt but this statement seems to imply that it would also match a file named src/test. Is this correct even though there is a / after the test in the pattern?
Also, its not clear whether the following patterns would be valid:
folder**
folder1/folder**
**folder/file.txt
I would imagine that they would work the same as
folder*/**
folder1/folder*/**
**/*folder/file.txt
but are they allowed?
I did some testing with NAnt as per coolcfan's suggestion and answered my own question. The patterns in the question are all valid.
Based on the following files from the link in my question above:
bar.txt
src/bar.c
src/baz.c
src/test/bartest.c
The following unexpected patterns are also valid:
src** matches 2, 3 and 4
**.c matches 2, 3, and 4
**ar.* matches 1 and 2
**/bartest.c/** matches 4
src/ba?.c/** matches 2 and 3
For completeness, these are in addition to the following patterns taken from the link in my question above:
*.c matches nothing (there are no .c files in the current directory)
src/*.c matches 2 and 3
*/*.c matches 2 and 3 (because * only matches one level)
**/*.c matches 2, 3, and 4 (because ** matches any number of levels)
bar.* matches 1
**/bar.* matches 1 and 2
**/bar*.* matches 1, 2, and 4
src/ba?.c matches 2 and 3
I'm not quite good in regex.
With my input string LT 1 BLK 4 LAKES OF PARKWAY 5 R/P & AMEND
I'd like to match just the only part between the figure 4 and 5 in the string.
meaning that, my expected result is LAKES OF PARKWAY.
I've tried to come up with a pattern to get such result.
\d+\s+([A-z ]+)(\d+.*?)*$
but with my pattern, it only matches BLK and 5 R/P & AMEND, as group #1 and group #2 respectively. At the end of my thought pattern, I decide to use end of string matching, $.
So, when 5 R/P & AMEND got matched, the pointer should move further behind to the sub sequence part. Then, ([A-z ]+) should match LAKES OF PARKWAY.
What's wrong with my pattern? and how to get it to work?
Any advice would be very much appreciated.
Try \d+\s+(\D+)\d+\D*$
\D means 'anything that is not \d, so it won't be allowed to match, for example, between the first 1 and 4, because then the ending of the regex would be rejected at the later 5.