Simple text parser - parsing

I want to create a very simple parser to convert:
"I wan't this to be ready by 10:15 p.m. today Mr. Gönzalés.!" to:
(
'I',
' ',
'wan',
'\'',
't',
' ',
'this',
' ',
'to',
' ',
'be',
' ',
'ready',
' ',
'by',
' ',
'10',
':',
'15',
' ',
'p',
'.',
'm',
'.',
' ',
'today',
' ',
'Mr',
'.',
' ',
'Gönzalés',
'.',
'!'
)
So basically I want consecutive letters and numbers to be grouped into a single string. I'm using Python 3 and I don't want to install external libs. I also would like the solution to be as efficient as possible as I will be processing a book.
What approaches would you recommend for solving this problem? Any examples?
The only way I can think of now is to step through the text, character by character, in a for loop. But I'm guessing there's a better, more elegant approach.
Thanks,
Barry

You are looking for a procedure called tokenization. That means splitting raw text into discrete "tokens", in our case just words. For programming languages this is fairly easy, but unfortunately it is not so for natural language.
You need to do two things: split the text into sentences, and split the sentences into words. Usually we do this with regular expressions. Naïvely, you could split sentences on the pattern ". ", i.e. a period followed by a space, and then split each sentence into words on spaces. This won't work very well, however, because abbreviations often end in periods too. As it turns out, tokenization and sentence segmentation are fairly tricky to get right. You could experiment with several regexps, but it would be better to use a ready-made tokenizer. I know you didn't want to install any external libs, but I'm sure this will spare you pain later on. NLTK has good tokenizers.
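To make the abbreviation problem concrete, here is a minimal sketch of the naive ". " split using only the standard library. The function name naive_sentences and the sample text are made up for illustration.

```python
def naive_sentences(text):
    # Naive rule: a sentence ends wherever a period is followed by a space.
    return text.split(". ")

sample = "It is 10:15 p.m. today Mr. X. Bye."
for piece in naive_sentences(sample):
    # "p.m." and "Mr." are wrongly treated as sentence ends
    print(repr(piece))
```

The sample holds two sentences, but the split yields four pieces, because the abbreviations end in a period followed by a space just like real sentence ends do.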

I believe this is a solution:
import regex  # third-party module (pip install regex), not the standard library's re
text = "123 2 can't, 4 Å, é, and 中ABC _ sh_t"
print(regex.findall(r'\d+|\P{alpha}|\p{alpha}+', text))
Can it be improved?
Thanks!
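Since the question asked for no external libraries, here is a rough sketch of a similar tokenizer using only the standard re module. It is one possible approach, not the answer's method: the names TOKEN_RE and tokenize are illustrative, and unlike the pattern above it keeps letters and digits together in one token (so "abc123" stays whole), which is how the question describes the grouping.

```python
import re

# [^\W_]+ : a run of letters and digits (Unicode-aware for str in Python 3)
# [\W_]   : any other single character (space, punctuation, underscore)
TOKEN_RE = re.compile(r"[^\W_]+|[\W_]")

def tokenize(text):
    """Return words/numbers as whole tokens and everything else one character at a time."""
    return TOKEN_RE.findall(text)

print(tokenize("ready by 10:15 p.m. today"))
```

For a whole book, TOKEN_RE.finditer(text) can be used instead of findall to walk the tokens lazily rather than building one large list.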

Related

Delphi 5 Use Pos / PosIgnoreCase for whole words only

I know Delphi 5 is really old but I have no other choice for now because my employer doesn't want to change, so I am stuck with old functions etc.
I would like to know if there is a way to get the position of only the whole words I am looking for.
I have a list of words (if, then, else, and, etc.) named KEYWORDS, and for each word in it I have to check in every .pas file whether that word appears with any uppercase characters.
In my code, I read each line, and for each line I use this to check whether any word from the list appears with uppercase characters:
if(PosIgnoreCase(KEYWORDS[I], S) <> Pos(KEYWORDS[I], S)) //Then the keyword has some uppercases in this line and I must raise an error
My problem is that if a line contains a word that includes a keyword (for example "MODIFICATION"), this will detect the uppercase IF inside it and raise an error.
I tried using if(PosIgnoreCase(' ' + KEYWORDS[I] + ' ', S) <> Pos(' ' + KEYWORDS[I] + ' ', S))
but there may be parentheses or other characters instead of the spaces, so I would like to avoid writing a new condition for each character.
Is there a clean way to do it? I often find myself struggling with the lack of functions in Delphi 5.
Sorry if my question is somewhat confusing; English is not my first language.
Thank you for your time.
Update (from comments):
My list of keywords contains only the reserved keywords in Delphi.
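The whole-word check can be sketched without regular expressions by looking at the characters just before and after each hit. The following is a Python illustration of that logic, not Delphi 5 code. The names IDENT_CHARS and find_miscased_keywords are hypothetical, and it assumes ASCII source text and lowercase entries in the keyword list.

```python
import string

# Characters that can be part of an identifier (assumption: ASCII source files)
IDENT_CHARS = set(string.ascii_letters + string.digits + "_")

def find_miscased_keywords(line, keywords):
    """Return (keyword, position) for whole-word keywords that are not all lowercase."""
    hits = []
    lowered = line.lower()
    for kw in keywords:
        start = lowered.find(kw)
        while start != -1:
            end = start + len(kw)
            # A whole word has no identifier character directly before or after it
            before_ok = start == 0 or line[start - 1] not in IDENT_CHARS
            after_ok = end == len(line) or line[end] not in IDENT_CHARS
            if before_ok and after_ok and line[start:end] != kw:
                hits.append((kw, start))
            start = lowered.find(kw, start + 1)
    return hits

print(find_miscased_keywords("IF x THEN MODIFICATION := 1", ["if", "then"]))
```

Because the boundary test is a set-membership check rather than a comparison against spaces, parentheses, semicolons and other punctuation need no separate condition.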

Match a word or whitespaces in Lua

(Sorry for my broken English)
What I'm trying to do is match a word (with or without numbers and special characters) or whitespace characters (spaces, tabs, optional newlines) in a string in Lua.
For example:
local my_string = "foo bar"
my_string:match(regex) --> should return 'foo', ' ', 'bar'
my_string = "   123!#." -- note: three spaces before '123!#.'
my_string:match(regex) --> should return ' ', ' ', ' ', '123!#.'
Where regex is the Lua regular expression pattern I'm asking for.
Of course I've done some research on Google, but I couldn't find anything useful. What I've got so far is [%s%S]+ and [%s+%S+], but they don't seem to work.
Any solution using the standard library, e.g. string.find, string.gmatch etc., is OK.
match returns either the captures or the whole match, and your patterns do not define any captures. [%s%S]+ matches "(space or not space), one or more times", which is basically everything. [%s+%S+] is plain wrong: the character class [ ] is a set of single-character members; it does not treat sequences of characters in any other way ("[cat]" matches "c", "a", or "t"), nor does it care about +. [%s+%S+] probably means "a single character that is a space, a plus, or a non-space".
The first example 'foo', ' ', 'bar' could be solved by:
regex="(%S+)(%s)(%S+)"
If you want a variable number of captures you are going to need the gmatch iterator:
local capt={}
for q,w,e in my_string:gmatch("(%s*)(%S+)(%s*)") do
  if q and #q>0 then
    table.insert(capt,q)
  end
  table.insert(capt,w)
  if e and #e>0 then
    table.insert(capt,e)
  end
end
However, this returns each run of spaces as a single capture, so it does not discern between a single space and several, and a string made up only of whitespace yields no matches; you'll need to add those checks to the match result processing.
Lua's standard patterns are simplistic; if you need more intricate matching, you might want to have a look at the LPeg library.
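For comparison, the output the question asks for (each word as one item, each whitespace character as its own item) can be sketched in Python, whose re module supports alternation. This is an illustration of the target behaviour only, not a Lua solution, since Lua patterns have no | operator. The function name split_words_and_spaces is made up.

```python
import re

def split_words_and_spaces(s):
    # \S+ : a run of non-whitespace characters (one "word")
    # \s  : exactly one whitespace character, so every space is its own item
    return re.findall(r"\S+|\s", s)

print(split_words_and_spaces("foo bar"))
print(split_words_and_spaces("   123!#."))
```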

Can a lexer rule be applied in only one parser rule?

The issue we're having with ANTLR is that we have a grammar that's parsing something like this:
Hello, my name is bob.
bob offset: 5
Keep in mind that the "bob." in the first line is dynamic, and could be anything. One of those things is "bob". The "bob offset" line is not dynamic, and is in every file of the type that we are parsing.
So, to parse this, we have a couple of rules:
greeting: 'Hello, my name is' id1=IDENT '.' NEWLINE
{ System.out.println("Name: " + $id1.text); }
;
bob_offset: 'bob offset:' id1=5 NEWLINE
{ System.out.println("bob offset: " + $id1.text); }
;
So, the issue is that 'bob offset:' is a token that the lexer reads. Now, when the greeting rule is applied, an error is thrown because it's trying to match 'bob' to 'bob offset:', but it can't.
The solution that would be ideal is if ANTLR had some way to specify context- or parser rule-specific lexer rules. This way, the 'bob offset:' token wouldn't be mistaken anywhere else in the grammar.
Any thoughts on this issue would be appreciated.
We ended up having to work around this with more parser rules to flesh it out more specifically for ANTLR.

Replace underscores with ampersands and hyphens with spaces

I have a MySQL table that contains words joined by underscores and also words joined by hyphens.
example: Engineering-Service_Civil-Geotech
I am able to replace the underscore with an ampersand and add a space on either side, but I'm stuck on how to also replace the hyphen with a single space.
$cleanCat = str_replace( '_', ' & ', $Cat);
echo $cleanCat;
The result of the above code gives me one solution but not both:
example: Engineering-Service & Civil-Geotech
Do I have to use a different command to achieve this?
Thanks in advance.
$cleanCat = str_replace('-', ' ', str_replace( '_', ' & ', $Cat));
str_replace( '-', ' ', $Cat);
should work
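The same two-step replacement, written as a Python sketch for comparison (not PHP; the function name clean_category is made up). The order of the two calls does not matter here, because neither replacement text contains the other's search character.

```python
def clean_category(cat):
    # Underscore becomes " & ", hyphen becomes a single space
    return cat.replace("_", " & ").replace("-", " ")

print(clean_category("Engineering-Service_Civil-Geotech"))
```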

Mathematica function foo that can distinguish foo[.2] from foo[.20]

Suppose I want a function that takes a number and returns it as a string, exactly as it was given. The following doesn't work:
SetAttributes[foo, HoldAllComplete];
foo[x_] := ToString[Unevaluated@x]
The output for foo[.2] and foo[.20] is identical.
The reason I want to do this is that I want a function that can understand dates with dots as delimiters, eg, f[2009.10.20]. I realize that's a bizarre abuse of Mathematica but I'm making a domain-specific language and want to use Mathematica as the parser for it by just doing an eval (ToExpression). I can actually make this work if I can rely on double-digit days and months, like 2009.01.02 but I want to also allow 2009.1.2 and that ends up boiling down to the above question.
I suspect the only answer is to pass the thing in as a string and then parse it, but perhaps there's some trick I don't know. Note that this is related to this question: Mathematica: Unevaluated vs Defer vs Hold vs HoldForm vs HoldAllComplete vs etc etc
I wouldn't rely on Mathematica's float-parsing. Instead I'd define rules on MakeExpression for foo. This allows you to intercept the input, as boxes, prior to it being parsed into floats. This pair of rules should be a good starting place, at least for StandardForm:
MakeExpression[RowBox[{"foo", "[", dateString_, "]"}], StandardForm] :=
  With[{args = Sequence @@ Riffle[StringSplit[dateString, "."], ","]},
    MakeExpression[RowBox[{"foo", "[", "{", args, "}", "]"}], StandardForm]]
MakeExpression[RowBox[{"foo", "[", RowBox[{yearMonth_, day_}], "]"}],
    StandardForm] :=
  With[{args =
      Sequence @@ Riffle[Append[StringSplit[yearMonth, "."], day], ","]},
    MakeExpression[RowBox[{"foo", "[", "{", args, "}", "]"}], StandardForm]]
I needed the second rule because the notebook interface will "helpfully" insert a space if you try to put a second decimal place in a number.
EDIT: In order to use this from the kernel, you'll need to use a front end, but that's often pretty easy in version 7. If you can get your expression as a string, use UsingFrontEnd in conjunction with ToExpression:
UsingFrontEnd[ToExpression["foo[2009.09.20]", StandardForm]]
EDIT 2: There are a lot of possibilities if you want to play with $PreRead, which allows you to apply special processing to the input, as strings, before it is parsed.
$PreRead = If[$FrontEnd =!= Null, #1,
StringReplace[#,x:NumberString /; StringMatchQ[x,"*.*0"] :>
StringJoin[x, "`", ToString[
StringLength[StringReplace[x, "-" -> ""]] -
Switch[StringTake[StringReplace[x,
"-" -> ""], 1], "0", 2, ".", 1, _,
1]]]]] & ;
will display foo[.20] as foo[0.20]. The InputForm of it will be
foo[0.2`2.]
I find parsing and displaying number formats in Mathematica more difficult than it should be...
Floats are, IIRC, parsed by Mathematica into actual Floats, so there's no real way to do what you want.
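The underlying issue is not specific to Mathematica: once text has been read as a float, the original spelling is gone. Here is a small Python sketch of that, together with the string-based fallback the question mentions. It is an illustration only, not Mathematica code, and parse_dotted_date is a hypothetical helper name.

```python
def parse_dotted_date(text):
    """Split 'YYYY.M.D' on dots and return (year, month, day) as ints."""
    year, month, day = (int(part) for part in text.split("."))
    return year, month, day

# As floats, the two spellings are the same value, so the trailing zero is lost
print(float(".2") == float(".20"))

# As strings, single-digit and zero-padded fields both parse to the same date
print(parse_dotted_date("2009.1.2"))
print(parse_dotted_date("2009.01.02"))
```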
