Spirit Qi conditional parsing

I am writing a PDF parsing library.
Originally, I had to parse input like this:
1 0 obj
(anything)
endobj
I created a parsing rule for the outer container and a separate rule for the inner object:
CONTAINER_PARSER %=
number >> number >> "obj" >> OBJECT_PARSER >> "endobj";
OBJECT_PARSER %= number | value | ...
This worked without any problems. But for various reasons I had to redesign the rules so that both container values belong to the object itself.
The container itself is optional, meaning the previous input and the following denote the same object, just without the additional container info:
(anything)
I had two ideas for solving this problem, but it seems to me that both are incompatible with the Qi approach.
Alternative parser
I wanted to tell the parser to parse either a value contained inside obj ... endobj, or only the value itself.
start %=
    (
        object_number
        >> generation_number
        >> qi::lit("obj")
        >> object
        > qi::lit("endobj")
    ) | object;
// I intentionally omitted the semantic actions that assign the values to the object,
// because they are out of scope for this problem.
I didn't manage to make this work, because both branches of the alternative expose the same attribute, and the compiler was confused.
Optional approach
I tried to tell the parser that the container is merely optional around the parsed value.
start %=
    -(
        object_number
        >> generation_number
        >> qi::lit("obj")
    )
    >> object
    > -qi::lit("endobj");
The problem with this approach is that the trailing "endobj" has to be present exactly when the leading part is present, and nothing here enforces that.
The solution might be trivial, but I was really not able to figure it out from the code, the documentation, or Stack Overflow answers.

UPDATE: After the comment:
start =
    (
      ( object_number >> generation_number
      | qi::attr(1) > qi::attr(0)            // defaults
      ) >> "obj" >> object > "endobj"
    | qi::attr(1) >> qi::attr(0) >> object
    )
    ;
Assuming you're not interested in the (optional) numbers:
start =
    -qi::omit[ object_number >> generation_number ]
    >> "obj" >> object > "endobj"
    ;
If you are interested and have suitable defaults:
start =
    ( object_number >> generation_number
    | qi::attr(1) > qi::attr(0)              // defaults
    )
    >> "obj" >> object > "endobj"
    ;
Of course, you could also:
alter the recipient type to expect optional<int> for the object numbers, so you could simply write -object_number >> -generation_number; this would be kinda sloppy, though, since it also allows "1 obj (anything) endobj"
alter the recipient type to be a variant:
boost::variant<simple_object, object_container>
In this case your AST matches the "alternative" approach (the first one) from your question; see the sketch below.
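For illustration, here's a compilable sketch of that variant route. The type names (simple_object, object_container, pdf_value) are made-up stand-ins, not taken from the question, and the "(anything)" object is reduced to a toy string rule:
#include <boost/fusion/include/adapt_struct.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/variant.hpp>
#include <iostream>
#include <string>

namespace qi = boost::spirit::qi;

typedef std::string simple_object;  // stand-in for the real object type

struct object_container {
    int object_number;
    int generation_number;
    simple_object payload;
};
BOOST_FUSION_ADAPT_STRUCT(object_container,
    (int, object_number)(int, generation_number)(simple_object, payload))

typedef boost::variant<simple_object, object_container> pdf_value;

int main() {
    typedef std::string::const_iterator It;
    qi::rule<It, simple_object(), qi::space_type> object =
        '(' >> *~qi::char_(')') >> ')';                 // toy "(anything)" object
    qi::rule<It, object_container(), qi::space_type> container =
        qi::int_ >> qi::int_ >> "obj" >> object > "endobj";
    qi::rule<It, pdf_value(), qi::space_type> start =
        container | object;                             // each branch fills its own variant alternative

    const char* inputs[] = { "1 0 obj (anything) endobj", "(anything)" };
    for (int i = 0; i < 2; ++i) {
        std::string const input(inputs[i]);
        It f = input.begin(), l = input.end();
        pdf_value v;
        if (qi::phrase_parse(f, l, start, qi::space, v))
            std::cout << "parsed alternative #" << v.which() << "\n";  // 1 = container, 0 = bare object
    }
}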

Skip over input stream in ATLAST forth

I'm trying to implement a kind of "conditional :" in ATLAST, the reasoning being that I have a file that gets FLOADed multiple times to handle multiple steps of my program flow (I'm essentially abusing Forth as an assembler: step 1 does a first pass for references, etc., and in step 2 the instruction words actually emit bytes).
So when that file declares words for "macros", it errors out in step 2, because they were already declared in step 1. But I also can't just FORGET them, because that would forget everything that came afterwards, such as the references I just collected in step 1.
So essentially I need a ": that only runs in step 1". My idea is something like this:
VARIABLE STAGE
: ::
  STAGE # 0 = IF
    [COMPILE] :   ( be a word declaration )
    EXIT
  THEN
  BEGIN           ( eat the disabled declaration )
    '             ( get the address of the next word )
    ['] ;         ( get the address of semicolon )
    =             ( loop until they are equal )
  UNTIL
; IMMEDIATE
:: FIVE 5 ; ( declares as expected )
FIVE . ( prints 5 )
1 STAGE ! ( up to here everything's fine )
:: FIVE 6 ; ( is supposed to do nothing, but errors out )
FIVE . ( is supposed to print 5 again )
The traced error message (starting from 1 STAGE !):
Trace: !
Trace: ::
Trace: STAGE
Trace: #
Trace: (LIT) 0
Trace: =
Trace: ?BRANCH
Trace: '
Trace: (LIT) 94721509587192
Trace: =
Trace: ?BRANCH
Trace: '
Word not specified when expected.
Trace: ;
Compiler word outside definition.
Walkback:
;
KEY ( -- ch ), as common in some other Forths for reading a single character from the input stream (outside the :: declaration, since it's IMMEDIATE), doesn't exist in ATLAST; the only related words I could find are:
': is supposed to read a word from the input stream, then push its compile address
[']: like ', but reads a word from the current line (the inside of the :: declaration)
(LIT)/(STRLIT): are supposed to read literals from the input stream according to the documentation; I could only ever make them segfault, so I think they're for compiler-internal use only (e.g., when the compiler encounters a number literal, it compiles the (LIT) word to make it push that number onto the stack)
There aren't any WORD or PARSE either, as in some other Forths.
As you can see, ' is struggling to actually get something from the input stream for some weird reason, and it looks like ['] failed to capture the ;, which then errors out because a ; is suddenly encountered where it doesn't belong.
I suspect it actually ran ' ['], even though ' is supposed to work on the input stream, not the immediate line, and I'm clearly in compile mode there.
I did a similar thing with conditionally declaring variables; there it was rather easy to just [COMPILE] ' DROP to skip a single word (turning RES x into ' x DROP). But here I'm pretty sure I can't actually compile those instructions, because I can't emit a loop outside of a definition, unless there is a way to somehow compile similar code that recursively gets rid of everything up to the ;.
One problem is that ' cannot find a number (like the 6 in the definition to be skipped). A possible solution is to use a special dummy name for the definition, instead of skipping it over:
: ::
  STAGE # 0 = IF : EXIT THEN
  ' DROP                      \ this xt isn't needed
  " : _dummy" EVALUATE ( -- n ) DROP
;
Or maybe use a new name every time:
: ::
  STAGE # 0 = IF : EXIT THEN
  ' >NAME #                   \ ( s1 ) \ should be checked
  ": _dummy_" DUP >R S+
  R> EVALUATE ( -- n ) DROP
;
But due to the non-standard words it might not work. Another problem is that non-colon definitions are out of scope.
Perhaps a better solution is preprocessing by external means.
It appears that ATLAST is a primitive Forth that doesn't allow more sophisticated handling of the input source. But all is not lost!
For example, a Forth implementation conforming to the ISO standard will handle the matter with ease using one or more of: REQUIRE [IF] [THEN] [DEFINED] SRC >IN NAME WORD FIND.
As you have a Forth, you can steal these words from another Forth and compile that code.
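For comparison, in a standard Forth the guard can be as small as this sketch, using the Forth-2012 tools-extension word [UNDEFINED] (exactly the kind of word ATLAST lacks):
[UNDEFINED] FIVE [IF]
: FIVE 5 ;
[THEN]
FIVE . ( prints 5 on every load, but the definition compiles only on the first )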
Another solution that may help directly is executing EXIT in interpret mode while loading a file.
You have to find out whether you can compute a flag that tells whether to abandon the input source. Then this definition might help:
: ?abandon IF S" EXIT" EVALUATE THEN ;
S" FIVE" FOUND ?abandon
Note that ?abandon must be executed in interpret mode.

Is it possible to do a zip operation in apache beam on two PCollections?

I have a PCollection[str] and I want to generate random pairs.
Coming from Apache Spark, my strategy was to:
copy the original PCollection
randomly shuffle it
zip it with the original PCollection
However I can't seem to find a way to zip 2 PCollections...
This is an interesting and not very common use case because, as @chamikara says, there is no order guarantee in Dataflow. However, I thought about implementing a solution where you shuffle the input PCollection and then pair consecutive elements based on state. I found some caveats along the way, but I thought it might be worth sharing anyway.
First, I used the Python SDK, but the Dataflow Runner does not support stateful DoFn's yet. It works with the Direct Runner but: 1) it is not scalable, and 2) it's difficult to shuffle the records without multi-threading. Of course, an easy solution for the latter is to feed an already-shuffled PCollection to the pipeline (we can use a different job to pre-process the data). Otherwise, we can adapt this example to the Java SDK.
For now, I decided to try to shuffle and pair within a single pipeline. I don't really know if this helps or makes things more complicated, but the code can be found here.
Briefly, the stateful DoFn looks at the buffer and if it is empty it puts in the current element. Otherwise, it pops out the previous element from the buffer and outputs a tuple of (previous_element, current_element):
import apache_beam as beam
from apache_beam.coders import PickleCoder
from apache_beam.transforms.userstate import BagStateSpec

class PairRecordsFn(beam.DoFn):
    """Pairs two consecutive elements after shuffle."""
    BUFFER = BagStateSpec('buffer', PickleCoder())

    def process(self, element, buffer=beam.DoFn.StateParam(BUFFER)):
        try:
            previous_element = list(buffer.read())[0]
        except IndexError:  # buffer is empty: this is the first element of the pair
            previous_element = []
        unused_key, value = element
        if previous_element:
            yield (previous_element, value)
            buffer.clear()
        else:
            buffer.add(value)
The pipeline adds keys to the input elements as required to use a stateful DoFn. There is a trade-off here: you can potentially assign the same key to all elements with beam.Map(lambda x: (1, x)). That would not parallelize well, which is not a problem with the Direct Runner anyway (keep it in mind if using the Java SDK), but it would not shuffle the records either. If, instead, we shuffle across a large number of keys, we'll get a larger number of "orphaned" elements that can't be paired (since state is preserved per key and we assign keys randomly, we can get an odd number of records per key):
pairs = (p
         | 'Create Events' >> beam.Create(data)
         | 'Add Keys' >> beam.Map(lambda x: (randint(1, 4), x))
         | 'Pair Records' >> beam.ParDo(PairRecordsFn())
         | 'Check Results' >> beam.ParDo(LogFn()))
In my case I got something like:
INFO:root:('one', 'three')
INFO:root:('two', 'five')
INFO:root:('zero', 'six')
INFO:root:('four', 'seven')
INFO:root:('ten', 'twelve')
INFO:root:('nine', 'thirteen')
INFO:root:('eight', 'fourteen')
INFO:root:('eleven', 'sixteen')
...
EDIT: I thought of another way to do this, using the Sample.FixedSizeGlobally combiner. The good thing is that it shuffles the data better, but you need to know the number of elements a priori (otherwise we'd need an initial pass over the data), and it seems to return all elements together. Briefly, I initialize the same PCollection twice, apply different shuffle orders, and assign indexes in a stateful DoFn. This guarantees that indexes are unique across elements within the same PCollection (even though no order is guaranteed). In my case, both PCollections will have exactly one record for each key in the range [0, 31]. A CoGroupByKey transform will join both PCollections on the same index, thus producing random pairs of elements:
pc1 = (p
       | 'Create Events 1' >> beam.Create(data)
       | 'Sample 1' >> combine.Sample.FixedSizeGlobally(NUM_ELEMENTS)
       | 'Split Sample 1' >> beam.ParDo(SplitFn())
       | 'Add Dummy Key 1' >> beam.Map(lambda x: (1, x))
       | 'Assign Index 1' >> beam.ParDo(IndexAssigningStatefulDoFn()))
pc2 = (p
       | 'Create Events 2' >> beam.Create(data)
       | 'Sample 2' >> combine.Sample.FixedSizeGlobally(NUM_ELEMENTS)
       | 'Split Sample 2' >> beam.ParDo(SplitFn())
       | 'Add Dummy Key 2' >> beam.Map(lambda x: (2, x))
       | 'Assign Index 2' >> beam.ParDo(IndexAssigningStatefulDoFn()))
zipped = ((pc1, pc2)
          | 'Zip Shuffled PCollections' >> beam.CoGroupByKey()
          | 'Drop Index' >> beam.Map(lambda kv: kv[1])
          | 'Check Results' >> beam.ParDo(LogFn()))
Full code here
Results:
INFO:root:(['ten'], ['nineteen'])
INFO:root:(['twenty-three'], ['seven'])
INFO:root:(['twenty-five'], ['twenty'])
INFO:root:(['twelve'], ['twenty-one'])
INFO:root:(['twenty-six'], ['twenty-five'])
INFO:root:(['zero'], ['twenty-three'])
...
How about applying a ParDo transform to both PCollections that attaches keys to elements, and then running the two PCollections through a CoGroupByKey transform?
Please note that Beam does not guarantee the order of elements in a PCollection, so output elements might get reordered after any step, but it seems like this should be OK for your use case since you just need some random order.
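A minimal sketch of that suggestion (illustrative names throughout; since the keys are random, groups can be uneven, so this yields random groupings rather than a strict one-to-one zip):
import random
import apache_beam as beam

NUM_KEYS = 2  # illustrative: controls how coarse the random grouping is

with beam.Pipeline() as p:
    pc1 = p | 'Create 1' >> beam.Create(['zero', 'one', 'two', 'three'])
    pc2 = p | 'Create 2' >> beam.Create(['four', 'five', 'six', 'seven'])

    # Attach a random key to each element of both PCollections.
    keyed1 = pc1 | 'Key 1' >> beam.Map(lambda x: (random.randrange(NUM_KEYS), x))
    keyed2 = pc2 | 'Key 2' >> beam.Map(lambda x: (random.randrange(NUM_KEYS), x))

    # Join on the keys: emits (key, {'left': [...], 'right': [...]}) per key.
    ({'left': keyed1, 'right': keyed2}
     | 'CoGroup' >> beam.CoGroupByKey()
     | 'Log' >> beam.Map(print))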

New lines in word definition using interpreter directives of Gforth

I am using the (non-ANS-standard) interpreter directive control structures of Gforth, as described in manual section 5.13.4 Interpreter Directives. I basically want to use the loop words to create a dynamically sized word containing literals. I came up with this definition, for example:
: foo
  [ 10 ] [FOR]
    1
  [NEXT]
;
Yet this produces an "Address alignment exception" after the [FOR] (yes, I know you should not use a FOR loop in Forth at all; this is just an easy example).
In the end it turned out that you have to write loops as one-liners in order to ensure their correct execution. So doing
: foo [ 10 [FOR] ] 1 [ [NEXT] ] ;
instead works as intended. Running see foo yields:
: foo
1 1 1 1 1 1 1 1 1 1 1 ; ok
which is exactly what I want.
Is there a way to get new lines in the word definition? The words I would like to write are way more complex, and for a presentation I would need them better formatted.
It would really be best to use an immediate word instead. For example,
: ones ( n -- ) 0 ?do 1 postpone literal loop ; immediate
: foo ( -- ten ones ) [ 10 ] ones ;
With SEE FOO resulting in the same as your example. With POSTPONE, especially with Gforth's ]] .. [[ syntax, the repeated code can be as elaborate as you like.
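For instance, here's a sketch of a slightly more elaborate repeated body (SQUARES and EIGHTH are made-up names; the body contains only words, so ]] dup * [[ simply postpones DUP and *):
: squares ( n -- ) 0 ?do ]] dup * [[ loop ; immediate
: eighth ( x -- x^8 ) [ 3 ] squares ;
\ SEE EIGHTH should show something like: dup * dup * dup *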
A multiline [FOR] would need to do four things:
Use REFILL to read in subsequent lines.
Save the read-in lines, because you'll need to evaluate them one by one to preserve line-oriented parsing behavior (such as that of comments: \ ).
Stop reading in lines, and loop, when you match the terminating [NEXT].
Take care to leave >IN right after the [NEXT] so that interpretation can continue normally.
You might still run into issues with some code, like code checking SOURCE-ID.
For an example of using REFILL to parse across multiple lines, here's code from a recent comp.lang.forth posting by Gerry:
: line, ( u1 caddr2 u2 -- u3 )
    tuck here swap chars dup allot move +
;
: <text> ( "text" -- caddr u )
    here 0
    begin
        refill
    while
        bl word count s" </text>" compare
    while
        0 >in ! source line, bl c, 1+
    repeat then
;
This collects everything between <text> and a </text> that's on its own line, as with a here document, while also adding spaces. To save the individual lines for [FOR] in an easy way, I'd recommend leaving 0 as a sentinel on the data stack and then dropping SAVE-MEM'd lines on top of it.

Variable assignment confusion

I'm stuck on a homework problem, and found a solution by looking at someone's work. The problem is, I don't understand it. Here's the code:
ordering, #title_header = {:order => :title}, 'hilite'
What is happening to 'ordering' in this case? I tried googling to see if that's a method, but couldn't find anything.
What you're seeing is actually list expansion. Ruby can do the following:
a, b = 1, 2
This is essentially the same as:
a = 1
b = 2
I'll leave the rest for you to figure out.
This is "parallel assignment".
Parallel assignment in Ruby is what happens when there is more than 1 lvalue (i.e., value on the left-hand side of the equals sign) and/or more than 1 rvalue (value on the right-hand side of the equals sign).
To understand parallel assignment, there are various cases to consider.
Case #1: One value on the left, multiple values on the right
The first, simplest case is when there is 1 lvalue and multiple rvalues. For example:
> a = 1, 2
> a
=> [1, 2]
All that's happening is the right-hand comma-separated list of values is converted into an array and assigned to the left-hand variable.
Case #2: Multiple values on the left, the same number of values on the right
The second case is when there are the same number of lvalues and rvalues. For example:
> a,b = 'foo', 'bar'
> a
=> "foo"
> b
=> "bar"
This is also pretty straightforward: you simply evaluate each item on the right-hand side, then assign it to its corresponding variable on the left-hand side, in order.
Case #3: Multiple values on the left, one value on the right -- but it's an array
The third case is when the rvalue is an array, and that array's elements are distributed (or "expanded") among multiple lvalues. For example:
> a, b = ['foo', 'bar']
> a
=> "foo"
> b
=> "bar"
This is effectively the same as case #2, above, except in this case explicitly employing array syntax.
Case #4: More values on the left than on the right
The fourth case is when there are multiple values on both sides, and there are more lvalues than rvalues. For example:
> a, b, c = 'foo', 'bar' # Note only 2 values on the right
> a
=> 'foo'
> b
=> 'bar'
> c
=> nil
As you can see, Ruby did the best it could to distribute the values, but ran out of them and was forced to assign nil to the last variable on the left.
Case #5: More values on the right than on the left
The fifth case is when there are multiple values on both sides, and there are fewer lvalues than rvalues. For example:
> a, b = 'foo', 'bar', 'baz' # Note only 2 values on the left
> a
=> 'foo'
> b
=> 'bar'
Again, Ruby did the best it could to distribute the values, but had too many to parcel out, and was forced to send the right-most value ("baz") into the ether.
Case #6+: Splatting
Note that in the case above, we lost the last value. However, it is actually possible to capture any such excess values and gather them into an array, using a related operator called the "splat" operator (which consists of an asterisk in front of the variable, as in *my_var). You can do a number of useful things with the splat operator, but so as not to overload this answer too much, it's probably better to go elsewhere to look at examples of it in action. E.g., this blog post lists several variant uses.
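Still, as a quick taste, here is case #5 again with a splat, so the excess value is no longer lost:
> a, *rest = 'foo', 'bar', 'baz'
> a
=> "foo"
> rest
=> ["bar", "baz"]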
Note: Swapping without a temp variable
One nice aspect of this parallel assignment facility is that you can swap values conveniently.
To swap values without parallel assignment, you might write something like this:
> a = 'foo'
> b = 'bar'
> temp = a # Introduce a temp variable
> a = b
> b = temp
> a
=> "bar"
> b
=> "foo"
But with parallel assignment, you can simply rely on Ruby to implicitly handle the swapping for you:
> a = 'foo'
> b = 'bar'
> b, a = a, b
> a
=> "bar"
> b
=> "foo"

How to parse numbers from a string with Boost methods?

Problem: Visual C++ 10 project (using the MFC and Boost libraries). In one of my methods I'm reading a simple test.txt file.
Here is what's inside the file (read into a std::string):
12 asdf789, 54,19 1000 nsfewer:22!13
Then I need to convert all the numbers to ints using only Boost methods. For example, here is the list of separator characters I have to handle:
( ’ ' )
( [ ], ( ), { }, ⟨ ⟩ )
( : )
( , )
( ! )
( . )
( - )
( ? )
( ‘ ’, “ ”, « » )
( ; )
( / )
And after the conversion I need some kind of array of int values, like this one:
12,789,54,19,1000,22,13
Maybe someone has already done this job?
PS: I'm new to Boost.
Thanks!
Update
Here is my sample:
std::vector<int> v;
rule<> r = int_p[append(v)] >> *(',' >> int_p[append(v)]);
parse(data.c_str(), r, space_p);
All I have to do is add the additional separator characters (,'[](){}:!...) to my code, but I could not find out how to do that!
The easy way out is regex.
The hard way out is using Spirit.
The middle-of-the-road option is using algorithm::string::split with the correct separators, and then looping over the individual parts using lexical_cast<>(). That way you can filter out the integers.
But again, regex will be much more robust, plus it's much cleaner than all sorts of primitive string-manipulation hacking.
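As a minimal sketch of the regex route (assuming Boost.Regex is built and linked; boost::lexical_cast does the string-to-int conversion, and the loop style is kept VC++ 10 friendly):
#include <boost/lexical_cast.hpp>
#include <boost/regex.hpp>
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::string data = "12 asdf789, 54,19 1000 nsfewer:22!13";
    std::vector<int> v;

    boost::regex digits("\\d+");  // any run of digits, whatever surrounds it
    boost::sregex_iterator it(data.begin(), data.end(), digits), end;
    for (; it != end; ++it)
        v.push_back(boost::lexical_cast<int>(it->str()));

    for (std::vector<int>::size_type i = 0; i < v.size(); ++i)
        std::cout << v[i] << ' ';  // 12 789 54 19 1000 22 13
}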
In addition to regex, boost::spirit, and manually parsing the text, you can use the AXE parser generator with VC++ 2010. The AXE rule would look something like this (not tested):
std::vector<unsigned> v;
auto text_rule = *(*(axe::r_any() - axe::r_numstr()) & ~axe::r_numstr()
                   >> axe::e_push_back(v))
               & axe::r_end();
// test it
std::string str("12 asdf789, 54,19 1000 nsfewer:22!13");
text_rule(str.begin(), str.end());
// print the result
std::for_each(v.begin(), v.end(), [](unsigned i) { std::cout << i << '\n'; });
The basic idea is to skip all input characters which don't match the number-string rule (r_numstr).
