How to parse numbers from a string with Boost methods?

Problem: a Visual C++ 10 project (using the MFC and Boost libraries). In one of my methods I'm reading a simple test.txt file.
Here is what is inside the file (std::string):
12 asdf789, 54,19 1000 nsfewer:22!13
Then I need to convert all the numbers to int using only Boost methods. For example, I have a list of different characters that I have to handle while parsing:
( ’ ' )
( [ ], ( ), { }, ⟨ ⟩ )
( : )
( , )
( ! )
( . )
( - )
( ? )
( ‘ ’, “ ”, « » )
( ; )
( / )
And after the conversion I must have some kind of array of int values, like this one:
12,789,54,19,1000,22,13
Maybe someone has already done this?
PS. I'm new to Boost.
Thanks!
Update
Here is my sample:
std::vector<int> v;
rule<> r = int_p[append(v)] >> *(',' >> int_p[append(v)]);
parse(data.c_str(), r, space_p);
All I have to do is add the additional escape characters (,'[](){}:!...) to my code, but I could not find out how to do that!

The easy way out is regex.
The hard way out is using Spirit.
The middle-of-the-road option is using boost::algorithm::split with the correct separators, and then looping over the individual parts with lexical_cast<>(). That way you can filter out the integers.
But again, regex will be much more robust, plus it's much cleaner than all sorts of primitive string manipulation hacking.
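As a rough illustration of the regex route, here is a minimal sketch. It uses std::regex from the standard library as a stand-in for Boost.Regex (boost::regex and boost::sregex_iterator have the same shape), and std::stoi where boost::lexical_cast<int> could be used. The function name extract_ints is made up for this example, and it assumes every digit run fits in an int.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collect every maximal run of decimal digits as an int.
// Assumes each run fits in an int; std::stoi throws std::out_of_range otherwise.
std::vector<int> extract_ints(const std::string& text) {
    static const std::regex digits("[0-9]+");
    std::vector<int> out;
    // Iterate over all non-overlapping matches; everything between them is ignored.
    for (std::sregex_iterator it(text.begin(), text.end(), digits), end; it != end; ++it) {
        out.push_back(std::stoi(it->str()));
    }
    return out;
}
```

Because the pattern only describes what a number looks like, none of the punctuation characters has to be listed as a separator. For the sample line from the question this gives 12, 789, 54, 19, 1000, 22, 13.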

In addition to regex, boost::spirit, and manually parsing the text, you can use AXE parser generator with VC++ 2010. The AXE rule would look something like this (not tested):
std::vector<unsigned> v;
auto text_rule = *(*(axe::r_any() - axe::r_numstr()) & ~axe::r_numstr()
>> axe::e_push_back(v)) & axe::r_end();
// test it
std::string str("12 asdf789, 54,19 1000 nsfewer:22!13");
text_rule(str.begin(), str.end());
// print result
std::for_each(v.begin(), v.end(), [](unsigned i) { std::cout << i << '\n'; });
The basic idea is to skip all input characters which don't match the number string rule (r_numstr).
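The same skip-what-is-not-a-number idea can also be written by hand, without any parser library. This is only a sketch in plain standard C++, not AXE code; scan_numbers is a made-up name, and overflow of very long digit runs is deliberately not handled.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: skip anything that is not a digit and
// accumulate each digit run into one unsigned value.
// Overflow is not checked: a run too long for unsigned wraps around.
std::vector<unsigned> scan_numbers(const std::string& text) {
    std::vector<unsigned> out;
    std::size_t i = 0;
    while (i < text.size()) {
        if (!std::isdigit(static_cast<unsigned char>(text[i]))) {
            ++i;  // not part of a number: skip it
            continue;
        }
        unsigned value = 0;
        while (i < text.size() && std::isdigit(static_cast<unsigned char>(text[i]))) {
            value = value * 10 + static_cast<unsigned>(text[i] - '0');
            ++i;
        }
        out.push_back(value);
    }
    return out;
}
```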

Related

Skip over input stream in ATLAST forth

I'm trying to implement a kind of "conditional :" in ATLAST, the reasoning being I have a file that gets FLOADed multiple times to handle multiple steps of my program flow (I'm essentially abusing Forth as an assembler, step 1 does a first parsing for references, etc. and in step 2 the instruction words actually emit bytes).
So when declaring words for "macros" in that file, it errors out in step 2, because they were already declared in step 1, but I also can't just FORGET them, because that would forget everything that came afterwards, such as the references I just collected in step 1.
So essentially I need a ": that only runs in step 1", my idea being something like this:
VARIABLE STAGE
: ::
STAGE @ 0 = IF
[COMPILE] : ( be a word declaration )
EXIT
THEN
BEGIN ( eat the disabled declaration )
' ( get the address of the next word )
['] ; ( get the address of semicolon )
= ( loop until they are equal )
UNTIL
; IMMEDIATE
:: FIVE 5 ; ( declares as expected )
FIVE . ( prints 5 )
1 STAGE ! ( up to here everything's fine )
:: FIVE 6 ; ( is supposed to do nothing, but errors out )
FIVE . ( is supposed to print 5 again )
The traced error message (starting from 1 STAGE !):
Trace: !
Trace: ::
Trace: STAGE
Trace: @
Trace: (LIT) 0
Trace: =
Trace: ?BRANCH
Trace: '
Trace: (LIT) 94721509587192
Trace: =
Trace: ?BRANCH
Trace: '
Word not specified when expected.
Trace: ;
Compiler word outside definition.
Walkback:
;
KEY ( -- ch ) as common in some other Forths for reading a single character from the input stream ( outside the :: declaration, since it's IMMEDIATE ) doesn't exist in ATLAST, the only related words I could find are:
': is supposed to read a word from the input stream, then pushes its compile address
[']: like ' but reads a word from the current line (the inside of the :: declaration)
(LIT)/(STRLIT): are supposed to read literals from the input stream according to the documentation, I could only ever make them segmentation fault, I think they're for compiler-internal use only (e.g., if the compiler encounters a number literal it will compile the (LIT) word to make it push that number onto the stack)
There aren't any WORD or PARSE either, as in some other Forths.
As you can see, ' is struggling actually getting something from the input stream for some weird reason, and it looks like ['] is failing to capture the ; which then errors out because it's suddenly encountering a ; where it doesn't belong.
I suspect it actually ran ' ['], even though it's supposed to work on the input stream, not the immediate line, and I'm clearly in compile mode there.
I did a similar thing with conditionally declaring variables, there it was rather easy to just [COMPILE] ' DROP to skip a single word (turning RES x into ' x DROP), but here I'm pretty sure I can't actually compile those instructions, because I can't emit a loop outside of a declaration. Unless there is a way to somehow compile similar code that recursively gets rid of everything until the ;.
One problem is that ' cannot find a number. A possible solution is to use a special dummy name for the definition instead of skipping over it:
: ::
STAGE @ 0 = IF : EXIT THEN
' DROP \ this xt isn't needed
" : _dummy" EVALUATE ( -- n ) DROP
;
Or maybe use a new name every time:
: ::
STAGE @ 0 = IF : EXIT THEN
' >NAME @ \ ( s1 ) \ should be checked
": _dummy_" DUP >R S+
R> EVALUATE ( -- n ) DROP
;
But because it relies on non-standard words, it might not work. Another problem is that definitions other than colon definitions are out of scope.
Perhaps a better solution is preprocessing by external means.
It appears that ATLAST is a primitive Forth that doesn't allow more sophisticated handling of sources. But all is not lost!
For example, a Forth implementation according to the ISO standard will handle the matter with ease with one or more of: REQUIRE [IF] [THEN] [DEFINED] SRC >IN NAME WORD FIND.
As you have a Forth, you can steal these words from another Forth and compile the code.
Another solution that may help directly is executing EXIT in interpret mode while loading a file.
You have to find out whether you can create a flag that says whether to abandon the input source. Then this definition might help:
: ?abandon IF S" EXIT" EVALUATE THEN ;
S" FIVE" FOUND ?abandon
Note that ?abandon must be executed in interpret mode.

What is RDROP in Forth?

I'm new to Forth and I'm using SwiftForth. I am looking for a way to read a matrix from file as described here Writing a text file into an array on Forth, but rdrop is not recognised. Is this exclusive to Gforth or is it part of a library? If it's a library, what are the steps needed to use it?
RDROP is a well known but not standardized word.
This word can be defined in the following way:
: rdrop ( R: x -- ) postpone r> postpone drop ; immediate
A conditional definition in a portable library can look like the following:
[UNDEFINED] RDROP [IF]
: RDROP ( R: x -- ) POSTPONE R> POSTPONE DROP ; IMMEDIATE
[THEN]
"rdrop" can also be defined as follows, although this is not strictly standards compliant:
: rdrop r> r> drop >r ;
This has the advantage that it can be used as an execution token and will not attempt to compile words into the dictionary, although it will not likely do anything sensible.

New lines in word definition using interpreter directives of Gforth

I am using the interpreter directives (non ANS standard) control structures of Gforth as described in the manual section 5.13.4 Interpreter Directives. I basically want to use the loop words to create a dynamically sized word containing literals. I came up with this definition for example:
: foo
[ 10 ] [FOR]
1
[NEXT]
;
Yet this produces an Address alignment exception after the [FOR] (yes, I know you should not use a for loop in Forth at all. This is just for an easy example).
In the end it turned out that you have to write loops as one-liners in order to ensure their correct execution. So doing
: foo [ 10 [FOR] ] 1 [ [NEXT] ] ;
instead works as intended. Running see foo yields:
: foo
1 1 1 1 1 1 1 1 1 1 1 ; ok
which is exactly what I want.
Is there a way to get new lines in the word definition? The words I would like to write are way more complex, and for a presentation I would need them better formatted.
It would really be best to use an immediate word instead. For example,
: ones ( n -- ) 0 ?do 1 postpone literal loop ; immediate
: foo ( -- ten ones ) [ 10 ] ones ;
With SEE FOO resulting in the same as your example. With POSTPONE, especially with Gforth's ]] .. [[ syntax, the repeated code can be as elaborate as you like.
A multiline [FOR] would need to do four things:
Use REFILL to read in subsequent lines.
Save the read-in lines, because you'll need to evaluate them one by one to preserve line-expecting parsing behavior (such as from comments: \ ).
Stop reading in lines, and loop, when you match the terminating [NEXT].
Take care to leave >IN right after the [NEXT] so that interpretation can continue normally.
You might still run into issues with some code, like code checking SOURCE-ID.
For an example of using REFILL to parse across multiple lines, here's code from a recent posting from CLF, by Gerry:
: line, ( u1 caddr2 u2 -- u3 )
tuck here swap chars dup allot move +
;
: <text>  ( "text" -- caddr u )
here 0
begin
refill
while
bl word count s" </text>" compare
while
0 >in ! source line, bl c, 1+
repeat then
;
This collects everything between <text> and a </text> that's on its own line, as with a HERE document, while also adding spaces. To save the individual lines for [FOR] in an easy way, I'd recommend leaving 0 as a sentinel on the data stack and then dropping the SAVE-MEM'd lines on top of it.

Wiki-fying a text using LPeg

Long story coming up, but I'll try to keep it brief. I have many pure-text paragraphs which I extract from a system and re-output in wiki format so that the copying of said data is not such an arduous task. This all goes really well, except that there are no automatic references being generated for the 'topics' we have pages for, which end up needing to be added by reading through all the text and adding it in manually by changing Topic to [[Topic]].
First requirement: each topic is only to be made clickable once, at its first occurrence. Otherwise, it would become a really spammy linkfest, which would detract from readability.
Second requirement: to avoid issues with topics that start with the same words, overlapping topic names should be handled in such a way that the most 'precise' topic gets the link, and in later occurrences, the less precise topics do not get linked, since they're likely not correct.
Example:
topics = { "Project", "Mary", "Mr. Moore", "Project Omega"}
input = "Mary and Mr. Moore work together on Project Omega. Mr. Moore hates both Mary and Project Omega, but Mary simply loves the Project."
output = function_to_be_written(input)
-- "[[Mary]] and [[Mr. Moore]] work together on [[Project Omega]]. Mr. Moore hates both Mary and Project Omega, but Mary simply loves the [[Project]]."
Now, I quickly figured out a simple or complicated string.gsub() could not get me what I need to satisfy the second requirement, as it provides no way to say 'Consider this match as if it did not happen - I want you to backtrack further'. I need the engine to do something akin to:
input = "abc def ghi"
-- Looping over the input would, in this order, match the following strings:
-- 1) abc def ghi
-- 2) abc def
-- 3) abc
-- 4) def ghi
-- 5) def
-- 6) ghi
Once a string matches an actual topic and has not been replaced before by its wikified version, it is replaced. If this topic has been replaced by a wikified version before, don't replace, but simply continue the matching at the end of the topic. (So for a topic "abc def", it would test "ghi" next in both cases.)
Thus I arrive at LPeg. I have read up on it, played with it, but it is considerably complex, and while I think I need to use lpeg.Cmt and lpeg.Cs somehow, I am unable to mix the two properly to make what I want to do work. I am refraining from posting my practice attempts as they are of miserable quality and probably more likely to confuse anyone than assist in clarifying my problem.
(Why do I want to use a PEG instead of writing a triple-nested loop myself? Because I don't want to, and it is a great excuse to learn PEGs.. except that I am in over my head a bit. Unless it is not possible with LPeg, the first is not an option.)
So... I got bored and needed something to do:
topics = { "Project", "Mary", "Mr. Moore", "Project Omega"}
pcall ( require , 'luarocks.require' )
require 'lpeg'
local locale = lpeg.locale ( )
local endofstring = -lpeg.P(1)
local endoftoken = (locale.space+locale.punct)^1
table.sort ( topics , function ( a , b ) return #a > #b end ) -- Sort by word length (longest first)
local topicpattern = lpeg.P ( false )
for i = 1, #topics do
topicpattern = topicpattern + topics [ i ]
end
function wikify ( input )
local topicsleft = { }
for i = 1 , #topics do
topicsleft [ topics [ i ] ] = true
end
local makelink = function ( topic )
if topicsleft [ topic ] then
topicsleft [ topic ] = nil
return "[[" .. topic .. "]]"
else
return topic
end
end
local patt = lpeg.Ct (
(
lpeg.Cs ( ( topicpattern / makelink ) )* #(-locale.alnum+endofstring) -- Match topics followed by something that's not alphanumeric
+ lpeg.C ( ( lpeg.P ( 1 ) - endoftoken )^0 * endoftoken ) -- Skip tokens that aren't topics
)^0 * endofstring -- Match ad infinitum until end of string
)
return table.concat ( patt:match ( input ) )
end
print(wikify("Mary and Mr. Moore work together on Project Omega. Mr. Moore hates both Mary and Project Omega, but Mary simply loves the Project.")..'"')
print(wikify("Mary and Mr. Moore work on Project Omegality. Mr. Moore hates Mary and Project Omega, but Mary loves the Projectaaa.")..'"')
I start off by making a pattern which matches all the different topics; we want to match the longest topics first, so sort the table by word length from longest to shortest.
Now we need to make a list of the topics we haven't seen in the current input.
makelink quotes/links the topic if we haven't seen it already, otherwise leaves it be.
Now for the actual lpeg stuff:
lpeg.Ct packs all our captures into a table (to be concatenated together for output)
topicpattern / makelink captures a topic, and passes it through our makelink function.
lpeg.Cs substitutes the result of makelink back in where the match of the topic was.
+ lpeg.C ( ( lpeg.P ( 1 ) - locale.space )^0 * locale.space^1 ) if we didn't match a topic, skip a word (that is, not spaces followed by a space)
^0 repeat.
Hope that's what you wanted :)
Daurn
Note: Edited code, description no longer correct
So why don't you use string.find? It searches only for the first occurrence of a topic and gives you its start and end indices. All you have to do is wrap the result in '[[' and ']]'.
For each chunk, copy the topics table, and when the first occurrence of a topic has been found, remove that topic from the copy.
Sort the topics by length, longest first, so that the most relevant topic is found first.
LPeg is a good tool, but it's not necessary to use it here.
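To make the plain string-search approach concrete, here is a rough sketch of it. It is written in C++ rather than Lua, and it is not the code from either answer: wikify and is_boundary are names invented here, and it assumes a topic only counts when it starts and ends on a word boundary (a non-alphanumeric byte or the edge of the text). It scans left to right and tries the longest topics first at each word start.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// True if position i is past the end of the text or holds a non-alphanumeric byte.
bool is_boundary(const std::string& s, std::size_t i) {
    return i >= s.size() || !std::isalnum(static_cast<unsigned char>(s[i]));
}

// Hypothetical helper: link the first occurrence of each topic with [[...]].
std::string wikify(const std::string& input, std::vector<std::string> topics) {
    // Longest topics first, so "Project Omega" is tried before "Project".
    std::sort(topics.begin(), topics.end(),
              [](const std::string& a, const std::string& b) { return a.size() > b.size(); });

    std::set<std::string> linked;  // topics that already received their one link
    std::string out;
    std::size_t pos = 0;
    while (pos < input.size()) {
        const std::string* hit = nullptr;
        // Only try topics at the start of a word.
        if (pos == 0 || is_boundary(input, pos - 1)) {
            for (const std::string& t : topics) {
                if (!t.empty() && input.compare(pos, t.size(), t) == 0 &&
                    is_boundary(input, pos + t.size())) {
                    hit = &t;
                    break;
                }
            }
        }
        if (hit == nullptr) {
            out += input[pos++];  // no topic here: copy one character
            continue;
        }
        // insert(...).second is true only the first time a topic is seen.
        out += linked.insert(*hit).second ? "[[" + *hit + "]]" : *hit;
        pos += hit->size();  // resume after the topic, linked or not
    }
    return out;
}
```

Skipping past a matched topic even when it is not linked is what keeps a later "Project Omega" from having its "Project" part linked on its own.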

Comparison Efficiency

What is generally faster:
if (num >= 10)
or:
if (!(num < 10))
The compiler will most likely optimize that sort of thing. Don't worry about it, just code for clarity in this case.
Assembly languages often have operations for >= and <= that are the same number of steps as < and >. For instance, with a Motorola 68k, if you want to compare the data registers %d0 and %d1 and branch if %d0 is greater than or equal to %d1, you would say something like:
cmp %d1, %d0 // compare %d0 with %d1 (computes %d0 - %d1), storing the result
// in the condition code register.
bge labelname // Branch to the given label name if the comparison
// yielded "greater than or equal to" (hence bge)
It's a common mistake to think that a >= b means the computer will perform two operations instead of one because of that "or" in "greater than or equal to".
Any decent compiler will optimize those two statements to exactly the same underlying code. In fact, it will most likely generate exactly the same code for:
if (!(!(!(!(!(!(!(num < 10))))))))
I would opt for the first of yours just because its intent seems much clearer (mildly clearer than your second choice, massively clearer than that monstrosity I posted above). I tend to think in terms of how I would read it. Think of the two sentences:
if number is greater than or equal to ten.
if it's not the case that number is less than ten.
I believe the first one to be clearer.
In fact, just testing with "gcc -S" to get the assembler output, both statements generate the following code:
cmpl $9,-8(%ebp) ; compare value with 9
jle .L3 ; branch if 9 or less.
I believe you're wasting your time looking at micro-optimisations like this - you'd be far more efficient looking at things like algorithm selection. There's likely to be a much greater return on investment there.
In general any speed difference won't matter a great deal, but they don't necessarily mean exactly the same thing.
In many languages, comparing the floating point value NaN returns false for all ordered comparisons (<, <=, >, >=), so if num is NaN, the first is false and the second true.
#include <iostream>
#include <limits>
int main ( ) {
using namespace std;
double num = numeric_limits<double>::quiet_NaN();
cout << boolalpha;
cout << "( num >= 10 ) " << ( num >= 10 ) << endl;
cout << "( ! ( num < 10 ) ) " << ( ! ( num < 10 ) ) << endl;
cout << endl;
}
outputs
( num >= 10 ) false
( ! ( num < 10 ) ) true
So the compiler can use a single instruction to compare num and the value 10 in the first case, but in the second may issue a second instruction to invert the result of the comparison. (Or it may just use a branch-if-zero rather than a branch-if-non-zero; you can't say in general.)
Other languages and compilers will vary, and for types where they really have the same semantics the code emitted might well be identical.
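For integer types the two spellings really do mean the same thing, which is what lets a compiler treat them alike. A small sketch that checks this over a range of values; the function names are made up for illustration, and it says nothing about which instructions a particular compiler emits.

```cpp
// The two spellings from the question, as templates so they accept any numeric type.
template <typename T> bool at_least_ten(T num) { return num >= 10; }
template <typename T> bool not_below_ten(T num) { return !(num < 10); }

// Hypothetical check: for integers the two forms agree on every value in [lo, hi].
bool forms_agree_on_ints(int lo, int hi) {
    for (int n = lo; n <= hi; ++n) {
        if (at_least_ten(n) != not_below_ten(n)) return false;
    }
    return true;
}
```

Calling forms_agree_on_ints(-1000, 1000) returns true. Only for types with unordered values, such as floating point with NaN, can the two forms give different answers.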
