I'm trying to do this example :
sentence="{My name is {Adam} and I don't work here}"
Result should be 'Adam'
So what I'm trying to say is: however many pairs of braces exist, I want the result to be the contents of the innermost pair.
It's not clear from your question, but if there can only ever be one set of outer braces at any level (i.e. "{My name} {is {Adam}}" and "{My {name} is {Adam}}" are invalid input), you can take advantage of the fact that what you want is the last opening brace in the sentence.
def deepest(sentence):
    intermediate = sentence.rpartition("{")[-1]
    return intermediate[:intermediate.index("}")]
deepest("{My name is {Adam} and I don't work here}")
# 'Adam'
deepest("{Someone {set us {{up} the bomb}!}}")
# 'up'
The regex answer also makes this assumption, though regex is likely to be much slower. If multiple outer braces are possible, please make your question clearer.
You can't just index strings like that... The best way is to use a clever regex:
>>> import re
>>> re.search(r'{[^{}]*}', "{My name is {Adam} and I don't work here}").group()
'{Adam}'
This regex pattern matches a pair of {} that has no { or } characters between them; re.search returns the first such pair it finds.
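If the braces themselves are not wanted in the result, a capturing group can strip them. This is only a minimal sketch under the same assumption as above (one innermost pair is the target); the helper name innermost is made up for illustration.

```python
import re

def innermost(sentence):
    # Match a brace pair with no braces inside it; the group keeps only the contents.
    match = re.search(r"\{([^{}]*)\}", sentence)
    return match.group(1) if match else None

print(innermost("{My name is {Adam} and I don't work here}"))  # Adam
```

Returning None for input without braces is a design choice of this sketch; raising an error would be equally reasonable.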
my problem is I need to write a Lua code to interpret a text file and match lines with a pattern like
if line_str:match(myPattern) then myAction(arg) end
Let's say I want a pattern to match lines containing "hello" in any context except one containing "hello world". I found that in regex, what I want is called negative lookahead, and you would write it like
.*hello (?!world).*
but I'm struggling to find the Lua version of this.
Let's say I want a pattern to match lines containing "hello" in any context except one containing "hello world".
As Wiktor has correctly pointed out, the simplest way to write this would be line:find"hello" and not line:find"hello world" (you can use both find and match here, but find is probably more performant; you can also turn off pattern matching for find).
I found that in regex, what I want is called negative lookahead, and
you would write it like .*hello (?!world).*
That's incorrect. If you checked against the existence of such a match, all it would tell you would be that there exists a "hello" which is not followed by a "world". The string hello hello world would match this, despite containing "hello world".
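The counter-example is easy to reproduce. The sketch below uses Python rather than Lua purely for illustration, since Python's re module supports the lookahead syntax from the question; the function name has_hello_without_world is hypothetical.

```python
import re

# The lookahead pattern from the question.
lookahead = re.compile(r".*hello (?!world).*")

def has_hello_without_world(line):
    # Plain substring checks: contains "hello" but never "hello world".
    return "hello" in line and "hello world" not in line

line = "hello hello world"
print(lookahead.search(line) is not None)  # True: the first "hello " is not followed by "world"
print(has_hello_without_world(line))       # False: the line does contain "hello world"
```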
Negative lookahead is a questionable feature anyways as it isn't trivially provided by actually regular expressions and thus may not be implemented in linear time.
If you really need it, look into LPeg; negative lookahead is implemented as pattern1 - pattern2 there.
Finally, the RegEx may be translated to "just Lua" simply by searching for (1) the pattern without the negative part (2) the pattern with the negative part and checking whether there is a match in (1) that is not in (2) simply by counting:
local hello_count = 0; for _ in line:gmatch"hello" do hello_count = hello_count + 1 end
local helloworld_count = 0; for _ in line:gmatch"hello world" do helloworld_count = helloworld_count + 1 end
if hello_count > helloworld_count then
    -- there is a "hello" not followed by " world"
end
isnumber(search("-tr",right(j2,3))),isnumber(search("-trus",right(j2,5))),isnumber(search(" ll",right(j2,3))),isnumber(search(" homes",right(j2,6))),isnumber(search("the ",left(j2,4))),isnumber(search(" hoa",right(j2,4))),isnumber(search("b ch",right(j2,4))),isnumber(search(" ch",right(j2,3))),isnumber(search("-trs",right(j2,4))),isnumber(search(" prop",right(j2,5))),isnumber(search(" st",right(j2,3))),isnumber(search(" av",right(j2,3))),isnumber(search(" ave",right(j2,4))),isnumber(search(" servi",right(j2,6))),isnumber(search(" maint",right(j2,6))),isnumber(search(" home",right(j2,5))),isnumber(search(" tr",right(j2,3))),isnumber(search(" assn",right(j2,5))),isnumber(search(" co",right(j2,3))),isnumber(search(" trus",right(j2,5))),isnumber(search(" trs",right(j2,4))),isnumber(search("-trs",right(j2,4))),isnumber(search(" tru",right(j2,4))),isnumber(search("jtrs",right(j2,4))),isnumber(search(" est of",right(j2,7))),isnumber(search(" trs",right(j2,4))),isnumber(value(LEFT(j2,1))),isnumber(search(" apts",right(j2,5))),isnumber(value(right(j2,3))),isnumber(search(" grp",right(j2,4))),isnumber(value(left(right(j2,4),1))),isnumber(search(" mgmt",right(j2,5))),isnumber(search(" props",right(j2,6))),isnumber(search(" tr",right(j2,3))),isnumber(search(" dev",right(j2,4))),isnumber(search(" tr",right(j2,3))),isnumber(search(" fdn",right(j2,4))),isnumber(search(" ent",right(j2,4))),isnumber(search(" PRPTS",right(j2,6))),isnumber(search(" ARPTS",right(j2,6))),isnumber(search(" univ",right(j2,5)))
So I have this giant =OR() statement containing a bunch of isnumber(search()) statements checking whether the string in a cell ends in one of these phrases. Its purpose is to identify company names in lists that contain both people's names and company names. I feel like there must be a more efficient way. Combining them all in one isnumber(search()) in the format {item1|item2|item3} does not work.
I feel like there must be a more efficient way.
Building on the answer provided here, matching the end of the string can be done with the $ sign (which means "end of the string" in regular expressions). Matching the beginning of the string, on the other hand, is done by putting the pattern after a caret (^), which indicates the beginning of the string.
So, if you wanted to add both to the formula provided in the other thread:
(LP|JT/RS)$ : match LP OR JT/RS at the end of the string
^(ABC|DEF) : match ABC OR DEF at the beginning of the string
That would make the formula look something like:
=REGEXMATCH(A2, "(?i)LLC|CORPORATION|COMPANY|HOLDINGS|PARTNERS|EQUITY|(LP|JT/RS)$|^(ABC|DEF)")
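The same anchoring idea can be sketched outside the spreadsheet. The Python below is illustrative only: SUFFIXES is a hypothetical subset of the phrases from the original OR() formula, and looks_like_company is a made-up helper name.

```python
import re

# Hypothetical subset of the suffixes from the original OR() formula.
SUFFIXES = [" ll", " homes", " hoa", " assn", " mgmt", "-trs"]

# One alternation, anchored with $ so each phrase must end the string.
ENDS_WITH_COMPANY_WORD = re.compile(
    "(?:" + "|".join(re.escape(s) for s in SUFFIXES) + ")$",
    re.IGNORECASE,
)

def looks_like_company(name):
    return ENDS_WITH_COMPANY_WORD.search(name) is not None

for name in ["Sunny Acres HOA", "Oak Street Homes", "John Smith"]:
    print(name, looks_like_company(name))
```

Keeping the phrases in a list and escaping each one means new suffixes can be added without touching the pattern syntax.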
REFERENCE:
REGEXMATCH()
RE2 SYNTAX
I'm trying to match prefixes of the string Something. For example, the inputs So, SOM, SomeTH, some and S should all be accepted because they are all prefixes of Something.
My code
Ss[oO]|Ss[omOMOmoM] {
    printf("Accept Something: %s\n", yytext);
}
Input
Som
Output
Accept Something: So
Invalid Character
It's supposed to read Som because it is a prefix of Something. I don't get why my code doesn't work. Can anyone tell me what I am doing wrong?
I don't know what you think the meaning of
Ss[oO]|Ss[omOMOmoM]
is, but what it matches is either:
an S followed by an s followed by exactly one of the letters o or O, or
an S followed by an s followed by exactly one of the letters o, O, m or M. Putting a symbol more than once inside a bracket expression has no effect.
Also, I don't see how that could produce the output you report. Perhaps there was a copy-and-paste error, or perhaps you have other pattern rules.
If you want to match prefixes, use nested optional matches:
s(o(m(e(t(h(i(ng?)?)?)?)?)?)?)?
If you want case-insensitive matches, you could write out all the character classes, but that gets tiresome; simpler is to use a case-insensitive flag:
(?i:s(o(m(e(t(h(i(ng?)?)?)?)?)?)?)?)
(?i: turns on the insensitive flag, until the matching close parenthesis.
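The nested-optional shape can also be generated mechanically from the word. The following is a sketch in Python rather than flex, assuming the goal is simply to test whether a whole input is a case-insensitive prefix; prefix_pattern is a hypothetical helper, and it uses non-capturing groups where the flex pattern uses plain parentheses.

```python
import re

def prefix_pattern(word):
    # Build s(?:o(?:m(...)?)?)? from the inside out.
    tail = ""
    for ch in reversed(word[1:]):
        tail = f"(?:{re.escape(ch)}{tail})?"
    return re.escape(word[0]) + tail

PREFIX = re.compile(prefix_pattern("something"), re.IGNORECASE)

for candidate in ["So", "SOM", "SomeTH", "some", "S", "Sx"]:
    print(candidate, PREFIX.fullmatch(candidate) is not None)
```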
In practice, this is probably not what you want. Normally, you will want to recognise a complete word as a token. You could then check to see if the word is a prefix in the rule action:
[[:alpha:]]+ { if (yyleng <= strlen("something") && 0 == strncasecmp(yytext, "something", yyleng)) {
                 /* do something */
               }
             }
There is lots of information in the Flex manual.
Right now your code (as shown) should only match "Sso" or "SsO" or "Ssm" or "SsM".
You have two alternatives that each start with Ss (without square brackets), so those will be matched literally. That's followed by either [oO] or [omOMOmoM], but the characters in square brackets represent alternatives, so the second is equivalent to [oOmM], i.e., any one of the characters o, O, m or M.
I'd start with: %option caseless to make it a case-insensitive scanner, so you don't have to list the upper- and lower-case equivalents of every letter.
Then it's probably easiest to just list the alternatives literally:
s|so|som|some|somet|someth|somethi|somethin|something { printf("found prefix"); }
I guess you can make the pattern a bit shorter (at least in the source code) by doing something on this order:
s(o(m(e(t(h(i(n(g)?)?)?)?)?)?)?)? { printf("found prefix"); }
Doesn't seem like a huge improvement to me, but some might find it more attractive than I do.
If you don't want to use %option caseless the basic idea helps more:
[sS]([oO]([mM]([eE]([tT]([hH]([iI]([nN]([gG])?)?)?)?)?)?)?)? { printf("found prefix"); }
Listing every possible combination of upper and lower case would get tedious.
Trying to work out how to parse out phone numbers that are left in a string.
e.g.
"Hi Han, this is Chewie, Could you give me a call on 02031234567"
"Hi Han, this is Chewie, Could you give me a call on +442031234567"
"Hi Han, this is Chewie, Could you give me a call on +44 (0) 203 123 4567"
"Hi Han, this is Chewie, Could you give me a call on 0207-123-4567"
"Hi Han, this is Chewie, Could you give me a call on 02031234567 OR +44207-1234567"
And be able to consistently replace any one of them with some other item (e.g. some text, or a link).
Am assuming it's a regex type approach (I'm already doing something similar with email which works well).
I've got to
text.scan(/([^A-Z|^"]{6,})/i)
Which leaves me a leading space I can't work out how to drop (would appreciate the help there).
Is there a standard way of doing this that people use?
It also drops things into arrays, which isn't particularly helpful
i.e. if there were multiple numbers.
[["02031234567"], ["+44207-1234567"]]
as opposed to
["02031234567","+44207-1234567"]
Adding in the third use-case with spaces is difficult. I think the only way to successfully meet that acceptance criteria would be to chain a #gsub call on to your #scan.
Thus:
text.gsub(/\s+/, "").scan(/([^A-Z|^"|^\s]{6,})/i)
The following code will extract all the numbers for you:
text.scan(/(?<=[ ])[\d \-+()]+$|(?<=[ ])[\d \-+()]+(?=[ ]\w)/)
For the examples you supplied this results in:
["02031234567"]
["+442031234567"]
["+44 (0) 203 123 4567"]
["0207-123-4567"]
["02031234567", "+44207-1234567"]
To understand this regex, what we are matching is:
[\d \-+()]+ which is a sequence of one or more digits, spaces, minus, plus, opening or closing brackets (in any order - NB regex is greedy by default, so it will match as many of these characters next to each other as possible)
that must be preceded by a space (?<=[ ]) - NB the space in the positive look-behind is not captured, and therefore this makes sure that there are no leading spaces in the results
and is either at the end of the string $, or | is followed by a space then a word character (?=[ ]\w) (NB this lookahead is not captured)
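The pattern does not depend on anything Ruby-specific, so the breakdown above can be checked in other engines too. As a sketch, here is the same expression in Python's re module (PHONE is just an illustrative name); findall returns a flat list because the pattern has no capturing groups.

```python
import re

# Same pattern as the Ruby scan above: look-behind for a space, then either
# end of string or a look-ahead for " <word char>".
PHONE = re.compile(r"(?<=[ ])[\d \-+()]+$|(?<=[ ])[\d \-+()]+(?=[ ]\w)")

text = "Hi Han, this is Chewie, Could you give me a call on 02031234567 OR +44207-1234567"
print(PHONE.findall(text))  # ['02031234567', '+44207-1234567']
```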
This pattern will get rid of the space but not match your third case with spaces:
/([^A-Z|^"|^\s]{6,})/i
This is what I came to in the end in case it helps somebody
numbers = text.scan(/([^A-Z|^"]{6,})/i).collect{|x| x[0].strip }
That gives me an array of
["+442031234567", "02031234567"]
I'm sure there is a more elegant way of doing this and possibly you'd want to check the numbers for likelihood of being phonelike - e.g. using the brilliant Phony gem.
numbers = text.scan(/([^A-Z|^"]{6,})/i).collect{|x| x[0].strip }
real_numbers = numbers.keep_if{|n| Phony.plausible? PhonyRails.normalize_number(n, default_country_code: "GB")}
Which should help exclude serial numbers or the like from being identified as numbers. You'll obviously want to change the country code to something relevant for you.
Some language grammars use negations in their rules. For example, in the Dart specification the following rule is used:
~('\'|'"'|'$'|NEWLINE)
Which means match anything that is not one of the rules inside the parentheses. Now, I know in flex I can negate character rules (e.g. [^ab]), but some of the rules I want to negate could be more complicated than a single character, so I don't think I could use character rules for that. For example, I may need to negate the sequence '"""' for multiline strings, but I'm not sure how to do that in flex.
(TL;DR: Skip down to the bottom for a practical answer.)
The inverse of any regular language is a regular language. So in theory it is possible to write the inverse of a regular expression as a regular expression. Unfortunately, it is not always easy.
The """ case, at least, is not too difficult.
First, let's be clear about what we are trying to match.
Strictly speaking "not """" would mean "any string other than """". But that would include, for example, x""".
So it might be tempting to say that we're looking for "any string which does not contain """". (That is, the inverse of .*""".*). But that's not quite correct either. The typical usage is to tokenise an input like:
"""This string might contain " or ""."""
If we start after the initial """ and look for the longest string which doesn't contain """, we will find:
This string might contain " or "".""
whereas what we wanted was:
This string might contain " or "".
So it turns out that we need "any string which does not end with " and which doesn't contain """", which is actually the conjunction of two inverses: (~.*" ∧ ~.*""".*)
It's (relatively) easy to produce a state diagram for that:
(Note that the only difference between the above and the state diagram for "any string which does not contain """" is that in that state diagram, all the states would be accepting, and in this one states 1 and 2 are not accepting.)
Now, the challenge is to turn that back into a regular expression. There are automated techniques for doing that, but the regular expressions they produce are often long and clumsy. This case is simple, though, because there is only one accepting state and we need only describe all the paths which can end in that state:
([^"]|\"([^"]|\"[^"]))*
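One way to gain confidence in a hand-derived expression like this is to compare it against the plain-language definition by brute force. The sketch below does that in Python (not flex) over every string of up to six characters drawn from a two-letter alphabet; BODY and is_valid_body are names invented for this check.

```python
import re
from itertools import product

# The expression above, with non-capturing groups and unescaped quotes.
BODY = re.compile(r'(?:[^"]|"(?:[^"]|"[^"]))*')

def is_valid_body(s):
    # "Does not end with a quote and does not contain three quotes in a row."
    return '"""' not in s and not s.endswith('"')

# Exhaustively compare the two definitions on short strings.
for n in range(7):
    for chars in product('"a', repeat=n):
        s = "".join(chars)
        assert (BODY.fullmatch(s) is not None) == is_valid_body(s), s

print("regex and definition agree on all strings up to length 6")
```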
This model will work for any simple string, but it's a little more complicated when the string is not just a sequence of the same character. For example, suppose we wanted to match strings terminated with END rather than """. Naively modifying the above pattern would result in:
([^E]|E([^N]|N[^D]))* <--- DON'T USE THIS
but that regular expression will match the string
ENENDstuff which shouldn't have been matched
The real state diagram we're looking for is
and one way of writing that as a regular expression is:
([^E]|E(E|NE)*([^EN]|N[^ED]))*
Again, I produced that by tracing all the ways to end up in state 0:
[^E] stays in state 0
E in state 1:
(E|NE)*: stay in state 1
[^EN]: back to state 0
N[^ED]:back to state 0 via state 2
This can be a lot of work, both to produce and to read. And the results are error-prone. (Formal validation is easier with the state diagrams, which are small for this class of problems, rather than with the regular expressions which can grow to be enormous).
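Because such expressions are error-prone, a brute-force comparison against the intended definition is a cheap sanity check. The Python sketch below (illustrative names NAIVE, FIXED and is_valid_body) shows the naive pattern accepting ENENDstuff, then checks the state-traced pattern, repeated with a trailing star, against "contains no END and does not end in E or EN" for every string of up to six characters over a small alphabet.

```python
import re
from itertools import product

NAIVE = re.compile(r"(?:[^E]|E(?:[^N]|N[^D]))*")
FIXED = re.compile(r"(?:[^E]|E(?:E|NE)*(?:[^EN]|N[^ED]))*")

def is_valid_body(s):
    # Ends in state 0 (no pending "E" or "EN") and never contains the terminator.
    return "END" not in s and not s.endswith(("E", "EN"))

print(NAIVE.fullmatch("ENENDstuff") is not None)  # True: wrongly accepted
print(FIXED.fullmatch("ENENDstuff") is not None)  # False

# Exhaustively compare FIXED with the definition on short strings.
for n in range(7):
    for chars in product("ENDx", repeat=n):
        s = "".join(chars)
        assert (FIXED.fullmatch(s) is not None) == is_valid_body(s), s
```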
A practical and scalable solution
Practical Flex rulesets use start conditions to solve this kind of problem. For example, here is how you might recognize python triple-quoted strings:
%x TRIPLEQ
start \"\"\"
end \"\"\"
%%
{start}        { BEGIN( TRIPLEQ ); /* Note: no return, flex continues */ }
<TRIPLEQ>.|\n  { /* Append the next token to yytext instead of
                  * replacing yytext with the next token
                  */
                 yymore();
                 /* No return yet, flex continues */
               }
<TRIPLEQ>{end} { /* We've found the end of the string, but
                  * we need to get rid of the terminating """
                  */
                 yylval.str = malloc(yyleng - 2);
                 memcpy(yylval.str, yytext, yyleng - 3);
                 yylval.str[yyleng - 3] = 0;
                 return STRING;
               }
This works because the . rule in start condition TRIPLEQ will not match " if the " is part of a string matched by {end}; flex always chooses the longest match. It could be made more efficient by using [^"]+|\"|\n instead of .|\n, because that would result in longer matches and consequently fewer calls to yymore(); I didn't write it that way above simply for clarity.
This model is much easier to extend. In particular, if we wanted to use <![CDATA[ as the start and ]]> as the terminator, we'd only need to change the definitions
start "<![CDATA["
end "]]>"
(and possibly the optimized rule inside the start condition, if using the optimization suggested above.)
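For readers who want to see the mode-switching idea without flex, here is a rough Python analogue. It is only a sketch of the technique, not how flex is implemented: the boolean inside stands in for the TRIPLEQ start condition, the buffer stands in for what yymore() accumulates, and scan_delimited is a hypothetical name.

```python
def scan_delimited(text, start='"""', end='"""'):
    """Return the text between each start/end pair, mimicking an exclusive start condition."""
    bodies = []
    inside = False   # plays the role of the TRIPLEQ start condition
    buffer = []
    i = 0
    while i < len(text):
        if not inside:
            if text.startswith(start, i):
                inside, buffer = True, []
                i += len(start)
            else:
                i += 1                    # ignore everything outside the delimiters
        elif text.startswith(end, i):     # the terminator wins over a single character
            bodies.append("".join(buffer))
            inside = False
            i += len(end)
        else:
            buffer.append(text[i])        # keep accumulating, like yymore()
            i += 1
    return bodies                         # an unterminated string is silently dropped

print(scan_delimited('x = """This string might contain " or ""."""'))
print(scan_delimited("a<![CDATA[b]]c]]>d", "<![CDATA[", "]]>"))
```

As in the flex version, switching delimiters only means passing different start and end strings.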