Libclang tokenKinds - clang

Where can I find the token kinds in libclang?
For example, I know these token kinds exist:
eof, r_paren, l_paren, r_brace, l_brace
Where can I find the rest of them?
Thank you.

In libclang, the token kinds are CXToken_Punctuation, CXToken_Keyword, CXToken_Identifier, CXToken_Literal, and CXToken_Comment. (link)
In clang itself, the list of token kinds can be obtained by preprocessing the file clang/Basic/TokenKinds.def after adding the definition #define TOK(X) X. Doing this gives me the following list:
unknown,
eof,
eod,
code_completion,
comment,
identifier,
raw_identifier,
numeric_constant,
char_constant,
wide_char_constant,
utf8_char_constant,
utf16_char_constant,
utf32_char_constant,
string_literal,
wide_string_literal,
angle_string_literal,
utf8_string_literal,
utf16_string_literal,
utf32_string_literal,
l_square,
r_square,
l_paren,
r_paren,
l_brace,
r_brace,
period,
ellipsis,
amp,
ampamp,
ampequal,
star,
starequal,
plus,
plusplus,
plusequal,
minus,
arrow,
minusminus,
minusequal,
tilde,
exclaim,
exclaimequal,
slash,
slashequal,
percent,
percentequal,
less,
lessless,
lessequal,
lesslessequal,
greater,
greatergreater,
greaterequal,
greatergreaterequal,
caret,
caretequal,
pipe,
pipepipe,
pipeequal,
question,
colon,
semi,
equal,
equalequal,
comma,
hash,
hashhash,
hashat,
periodstar,
arrowstar,
coloncolon,
at,
lesslessless,
greatergreatergreater,
kw_auto,
kw_break,
kw_case,
kw_char,
kw_const,
kw_continue,
kw_default,
kw_do,
kw_double,
kw_else,
kw_enum,
kw_extern,
kw_float,
kw_for,
kw_goto,
kw_if,
kw_inline,
kw_int,
kw_long,
kw_register,
kw_restrict,
kw_return,
kw_short,
kw_signed,
kw_sizeof,
kw_static,
kw_struct,
kw_switch,
kw_typedef,
kw_union,
kw_unsigned,
kw_void,
kw_volatile,
kw_while,
kw__Alignas,
kw__Alignof,
kw__Atomic,
kw__Bool,
kw__Complex,
kw__Generic,
kw__Imaginary,
kw__Noreturn,
kw__Static_assert,
kw__Thread_local,
kw___func__,
kw___objc_yes,
kw___objc_no,
kw_asm,
kw_bool,
kw_catch,
kw_class,
kw_const_cast,
kw_delete,
kw_dynamic_cast,
kw_explicit,
kw_export,
kw_false,
kw_friend,
kw_mutable,
kw_namespace,
kw_new,
kw_operator,
kw_private,
kw_protected,
kw_public,
kw_reinterpret_cast,
kw_static_cast,
kw_template,
kw_this,
kw_throw,
kw_true,
kw_try,
kw_typename,
kw_typeid,
kw_using,
kw_virtual,
kw_wchar_t,
kw_alignas,
kw_alignof,
kw_char16_t,
kw_char32_t,
kw_constexpr,
kw_decltype,
kw_noexcept,
kw_nullptr,
kw_static_assert,
kw_thread_local,
kw_concept,
kw_requires,
kw_co_await,
kw_co_return,
kw_co_yield,
kw__Decimal32,
kw__Decimal64,
kw__Decimal128,
kw___null,
kw___alignof,
kw___attribute,
kw___builtin_choose_expr,
kw___builtin_offsetof,
kw___builtin_types_compatible_p,
kw___builtin_va_arg,
kw___extension__,
kw___imag,
kw___int128,
kw___label__,
kw___real,
kw___thread,
kw___FUNCTION__,
kw___PRETTY_FUNCTION__,
kw___auto_type,
kw_typeof,
kw___FUNCDNAME__,
kw___FUNCSIG__,
kw_L__FUNCTION__,
kw___is_interface_class,
kw___is_sealed,
kw___is_destructible,
kw___is_nothrow_destructible,
kw___is_nothrow_assignable,
kw___is_constructible,
kw___is_nothrow_constructible,
kw___has_nothrow_assign,
kw___has_nothrow_move_assign,
kw___has_nothrow_copy,
kw___has_nothrow_constructor,
kw___has_trivial_assign,
kw___has_trivial_move_assign,
kw___has_trivial_copy,
kw___has_trivial_constructor,
kw___has_trivial_move_constructor,
kw___has_trivial_destructor,
kw___has_virtual_destructor,
kw___is_abstract,
kw___is_base_of,
kw___is_class,
kw___is_convertible_to,
kw___is_empty,
kw___is_enum,
kw___is_final,
kw___is_literal,
kw___is_pod,
kw___is_polymorphic,
kw___is_trivial,
kw___is_union,
kw___is_trivially_constructible,
kw___is_trivially_copyable,
kw___is_trivially_assignable,
kw___underlying_type,
kw___is_lvalue_expr,
kw___is_rvalue_expr,
kw___is_arithmetic,
kw___is_floating_point,
kw___is_integral,
kw___is_complete_type,
kw___is_void,
kw___is_array,
kw___is_function,
kw___is_reference,
kw___is_lvalue_reference,
kw___is_rvalue_reference,
kw___is_fundamental,
kw___is_object,
kw___is_scalar,
kw___is_compound,
kw___is_pointer,
kw___is_member_object_pointer,
kw___is_member_function_pointer,
kw___is_member_pointer,
kw___is_const,
kw___is_volatile,
kw___is_standard_layout,
kw___is_signed,
kw___is_unsigned,
kw___is_same,
kw___is_convertible,
kw___array_rank,
kw___array_extent,
kw___private_extern__,
kw___module_private__,
kw___declspec,
kw___cdecl,
kw___stdcall,
kw___fastcall,
kw___thiscall,
kw___vectorcall,
kw___forceinline,
kw___unaligned,
kw___super,
kw___global,
kw___local,
kw___constant,
kw___private,
kw___generic,
kw___kernel,
kw___read_only,
kw___write_only,
kw___read_write,
kw___builtin_astype,
kw_vec_step,
kw___builtin_omp_required_simd_align,
kw_pipe,
kw___pascal,
kw___vector,
kw___pixel,
kw___bool,
kw_half,
kw___bridge,
kw___bridge_transfer,
kw___bridge_retained,
kw___bridge_retain,
kw___covariant,
kw___contravariant,
kw___kindof,
kw__Nonnull,
kw__Nullable,
kw__Null_unspecified,
kw___ptr64,
kw___ptr32,
kw___sptr,
kw___uptr,
kw___w64,
kw___uuidof,
kw___try,
kw___finally,
kw___leave,
kw___int64,
kw___if_exists,
kw___if_not_exists,
kw___single_inheritance,
kw___multiple_inheritance,
kw___virtual_inheritance,
kw___interface,
kw___builtin_convertvector,
kw___unknown_anytype,
annot_cxxscope,
annot_typename,
annot_template_id,
annot_primary_expr,
annot_decltype,
annot_pragma_unused,
annot_pragma_vis,
annot_pragma_pack,
annot_pragma_parser_crash,
annot_pragma_captured,
annot_pragma_dump,
annot_pragma_msstruct,
annot_pragma_align,
annot_pragma_weak,
annot_pragma_weakalias,
annot_pragma_redefine_extname,
annot_pragma_fp_contract,
annot_pragma_ms_pointers_to_members,
annot_pragma_ms_vtordisp,
annot_pragma_ms_pragma,
annot_pragma_opencl_extension,
annot_pragma_openmp,
annot_pragma_openmp_end,
annot_pragma_loop_hint,
annot_module_include,
annot_module_begin,
annot_module_end
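As a rough illustration of that preprocessing trick, here is a Python sketch that pulls the names out of a TokenKinds.def-style file with a regular expression instead of the C preprocessor. The macro names TOK, PUNCTUATOR, and KEYWORD do appear in the real file, but the sample input below is abbreviated and the function name is my own; the actual file has further macro variants this sketch ignores.

```python
import re

def extract_token_kinds(def_text):
    """Collect token names from TOK(...), PUNCTUATOR(...), and KEYWORD(...)
    entries. KEYWORD(X, ...) expands to kw_X, mirroring TokenKinds.def."""
    names = []
    for macro, args in re.findall(r'\b(TOK|PUNCTUATOR|KEYWORD)\s*\(([^)]*)\)', def_text):
        name = args.split(',')[0].strip()
        if macro == 'KEYWORD':
            name = 'kw_' + name
        names.append(name)
    return names

# Abbreviated sample in the style of clang/Basic/TokenKinds.def:
sample = """
TOK(unknown)
TOK(eof)
PUNCTUATOR(l_paren, "(")
PUNCTUATOR(r_paren, ")")
KEYWORD(auto, KEYALL)
"""
print(extract_token_kinds(sample))
# ['unknown', 'eof', 'l_paren', 'r_paren', 'kw_auto']
```

Running this over the real TokenKinds.def should reproduce a list like the one above.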

Related

How to parse output of unknown type in Julia

I am running a function from an external library here:
https://github.com/baggepinnen/SingularSpectrumAnalysis.jl
When running, I get this output printed in the console:
LinearAlgebra.SVD{Float64,Float64,Array{Float64,2}}
U factor:
3465×10 Array{Float64,2}:
-0.0176092 0.0162669 -0.0286626 … -0.0123348 -0.00889247 0.0149834
-0.0176079 0.023189 -0.00313753 0.0234491 -0.000954835 0.0237124
-0.0175925 0.0216939 0.0187119 0.0418525 -0.0296555 0.0665848
-0.0175613 0.0146738 0.0288932 -0.0266382 0.0127913 0.00602873
-0.0175472 0.0072105 0.0349358 0.0225667 -0.0167306 -0.02098
-0.0175337 -3.25703e-5 0.0304511 … -0.0229247 -0.00725249 -0.00814757
-0.0175243 -0.0070557 0.0154106 0.0424862 -0.0206749 -0.0115423
⋮ ⋱
-0.0124291 0.0454897 -0.0153655 -0.019238 0.00716989 0.00251159
-0.0122423 0.0435812 -0.0148046 … -0.0139234 -0.0187464 0.00739847
-0.0121735 0.0346687 0.00278444 -0.00218233 0.0110443 -0.00929289
-0.0121211 0.0290382 0.0110726 0.0107806 -0.00106763 0.0317442
-0.0120607 0.0194982 0.0217969 0.00578442 -0.0117156 -0.00232344
-0.0120144 0.0126667 0.0164779 -0.0106475 0.00061507 -0.00797532
singular values:
10-element Array{Float64,1}:
23396.412954604883
89.77233712912785
22.739080907231845
6.7695870707469386
1.3883392478470917
2.8068174835480837e-12
8.400039642654283e-13
8.317837915706779e-13
8.065049690243271e-13
7.945414181455442e-13
Vt factor:
10×10 Array{Float64,2}:
-0.316135 -0.316298 -0.316323 … -0.316206 -0.31614 -0.315936
0.408793 0.414593 0.333553 -0.33494 -0.41203 -0.407189
0.363599 0.306612 0.0442128 0.0537767 0.309203 0.350534
-0.314074 -0.183525 0.255186 -0.246725 0.186592 0.323551
-0.295283 -0.0484189 0.451627 0.462796 -0.0426143 -0.315369
0.455353 -0.378881 -0.296744 … 0.171696 0.409193 -0.445054
-0.326408 0.586176 -0.435839 0.420663 -0.0126025 -0.143472
0.114133 0.0540144 -0.485488 -0.155216 -0.293115 0.27703
-0.0714588 0.00900755 0.0450054 -0.018732 -0.13437 0.165802
0.287902 -0.328451 0.0436344 0.528208 -0.571383 0.290639
How do I parse this output to get the "U factor:", "singular values:" and "Vt factor:" output as arrays I can work with in a notebook?
I've tried indexing by position and by name ([1] or ["U factor"]). Both result in errors such as this:
MethodError: no method matching getindex(::LinearAlgebra.SVD{Float64,Float64,Array{Float64,2}}, ::Int64)
It is part of a standard library, so it is documented here: https://docs.julialang.org/en/v1/stdlib/LinearAlgebra/#LinearAlgebra.svd
The printed output is only a visual representation of the data, so it can't be used to access the data programmatically. Use the docs, or introspection functions like fieldnames, to understand how to work with the object. In this case, use the fields U, S, and Vt of the SVD object.
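The same principle can be sketched in Python, if that helps the intuition: the printed form of an object is not an index into it; you discover its fields with introspection and then read them as attributes. The toy Factorization class below is hypothetical, standing in for Julia's LinearAlgebra.SVD, and vars() plays the role of fieldnames.

```python
class Factorization:
    """Toy stand-in for a factorization result object like LinearAlgebra.SVD."""
    def __init__(self, U, S, Vt):
        self.U, self.S, self.Vt = U, S, Vt

F = Factorization(U=[[1.0, 0.0], [0.0, 1.0]],
                  S=[2.0, 1.0],
                  Vt=[[0.0, 1.0], [1.0, 0.0]])

# Discover the fields (the analogue of Julia's fieldnames(typeof(F))):
print(vars(F).keys())   # dict_keys(['U', 'S', 'Vt'])

# Then access them as attributes, not by indexing the printed output:
print(F.S)              # [2.0, 1.0]
```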

(f)lex the difference between PRINTA$ and PRINT A$

I am parsing BASIC:
530 FOR I=1 TO 9:C(I,1)=0:C(I,2)=0:NEXT I
The patterns that are used in this case are:
FOR { return TOK_FOR; }
TO { return TOK_TO; }
NEXT { return TOK_NEXT; }
(many lines later...)
[A-Za-z_#][A-Za-z0-9_]*[\$%\!#]? {
yylval.s = g_string_new(yytext);
return IDENTIFIER;
}
(many lines later...)
[ \t\r] { /* eat non-string whitespace */ }
The problem occurs when the spaces are removed, which was common in the era of 8k RAM. So the line that is actually found in Super Star Trek is:
530 FORI=1TO9:C(I,1)=0:C(I,2)=0:NEXTI
Now I know why this is happening: "FORI" is longer than "FOR" and is a valid IDENTIFIER under my pattern, so it matches as an IDENTIFIER.
The original rule in MS BASIC was that variable names could be only two characters, so the pattern had no * and "FORI" would fail to match. But this version also supports GW BASIC and Atari BASIC, which allow variables with long names, so "FORI" is a legal variable name in my scanner and wins as the longest match.
When I look at the manual, the only similar example deliberately returns an error. It seems what I need is "match the ID, but only if it's not the same as a defined %token". Is there such a thing?
It's easy to recognise keywords even if they have an identifier concatenated. What's tricky is deciding in what contexts you should apply that technique.
Here's a simple pattern for recognising keywords, using trailing context:
tail [[:alnum:]]*[$%!#]?
%%
FOR/{tail} { return TOK_FOR; }
TO/{tail} { return TOK_TO; }
NEXT/{tail} { return TOK_NEXT; }
/* etc. */
[[:alpha:]]{tail} { /* Handle an ID */ }
Effectively, that just extends the keyword match without extending the matched token.
But I doubt the problem is so simple. How should FORFORTO be tokenised, for example?
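The splitting effect of the trailing-context trick can be imitated outside of flex. Here is a hedged Python sketch: because Python's re alternation takes the first alternative that matches at the current position (rather than flex's longest match), simply listing keywords before the identifier pattern makes "FORI" tokenise as FOR followed by the identifier I. The patterns are simplified from the ones in the question, and the tokenize name is my own.

```python
import re

# Keywords listed before the identifier pattern; Python's re tries
# alternatives left to right, so a keyword prefix beats the longer ID.
TOKEN = re.compile(
    r'(?P<kw>FOR|NEXT|TO)'
    r'|(?P<id>[A-Za-z_][A-Za-z0-9_]*[$%!#]?)'
    r'|(?P<num>[0-9]+)'
    r'|(?P<op>[=:(),])'
)

def tokenize(line):
    # finditer silently skips characters (e.g. spaces) no alternative matches
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(line)]

print(tokenize('FORI=1TO9'))    # FOR, I, =, 1, TO, 9 as (kind, text) pairs
print(tokenize('FOR I=1 TO 9')) # identical token stream
```

Note that this shares the ambiguity the answer points out: FORFORTO would come out as three keywords, which may or may not be what a real BASIC intended.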

Parsing custom serialized object in Rails

I am able to export a serialized text representation of an object from our proprietary CNC programming software and need to parse it to import objects in my Rails app.
Example serialized output:
Header {
code "Centric 20170117 16gaHRS"
label "Centric 20170117 16gaHRS"
lccShortname "Centric 20170117 16gaHRS"
jobgroup "20170117 - Pike Sign"
waste 97.5516173272
unit INCH
Material {
code "HRS"
label "HRS"
labelDIN "HRS"
density 0.283647787542
thickness 0.125
}
}
Rawmaterials {
Rawmaterial {
id 52312
format 120 48.25
stock +999
used +1
}
}
Parts {
Part {
id 1
code "8581-Sign"
label "8581-Sign"
need +2
used +2
priority +1
turnAngleIncrement +180
ccAllowed +0
filler +0
area 141.761356753
positioningTime 10.369402427
cuttingTime 346.222969467
piercingTime 35.5976025504
positioningWay 1949.56
cuttingWay 9249.13
countPiercingNormal +75
countPiercingPuls +4
}
}
Plans {
Plan {
id 52313
label "Centric 20170117 16gaHRS 1"
filename "Centric 20170117 16gaHRS01"
border 0.5 0.5 0.5 0.5
cycleCount +1
waste 97.5516173272
positioningTime 11.9357066923
cuttingTime 345.629256802
piercingTime 35.5976025504
auxiliaryProcessTime 79.2405450926
positioningWay 1954.13
cuttingWay 9215.92
countPiercingNormal +75
countPiercingPuls +4
RawmaterialReference 52312
PartReferences {
PartReference {
id 1
layer 21
partId 1
insert -128.833464567 -97.2358267717
}
}
}
Plan {
id 52314
label "Centric 20170117 16gaHRS 2"
filename "Centric 20170117 16gaHRS02"
border 0.5 0.5 0.5 0.5
cycleCount +1
waste 97.5516173272
positioningTime 11.9357066923
cuttingTime 345.629256802
piercingTime 35.5976025504
auxiliaryProcessTime 79.2405450926
positioningWay 1954.13
cuttingWay 9215.92
countPiercingNormal +75
countPiercingPuls +4
RawmaterialReference 52312
PartReferences {
PartReference {
id 1
layer 21
partId 1
insert -128.833464567 -97.2358267717
}
}
}
}
To start with, I would like to extract the code attribute from the Header section, and the filename attribute for each Plan.
I could iterate through the file keeping note of curly braces and which section we are currently processing, but it seems as though there must be a simpler way. I could easily parse it if it were JSON or XML data, but I am at a loss as to the simplest way to parse this non-standard format.
There is no simple way.
A JSON or XML parser does exactly the same thing, going through the file character by character and keeping track of everything; it's just that someone else wrote that code for you.
I see five options:
1. You do as suggested, reading line by line and partially parsing the file. That is called an "island grammar" parser.
2. You use a series of regular expressions to turn the file into a valid JSON file and then parse that; the formats look similar enough that it might be possible.
3. You reverse engineer the format and write your own complete parser.
4. You get the name of the file format from the proprietary vendor and search for a gem that implements a parser. Most likely there will be none.
5. You get the proprietary vendor to export the data in a different format. Most likely they will charge an astronomical price or just say no.
I would give the first two options a try.
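To show that a complete parser for this format need not be huge, here is a hedged sketch of a recursive parser for the block structure shown in the question. It is written in Python for illustration (the same structure translates directly to Ruby), the parse_block name is my own, and value handling is simplified: every scalar stays a string, and repeated blocks collect into lists.

```python
import re

def parse_block(lines, i=0):
    """Parse 'Name { ... }' blocks into nested dicts.
    Repeated child blocks (e.g. multiple Plan entries) become lists."""
    node = {}
    while i < len(lines):
        line = lines[i].strip()
        i += 1
        if line == '}':
            return node, i          # end of the current block
        m = re.match(r'(\w+)\s*\{$', line)
        if m:                        # nested block: recurse
            child, i = parse_block(lines, i)
            node.setdefault(m.group(1), []).append(child)
        elif line:                   # key/value line
            key, _, value = line.partition(' ')
            node[key] = value.strip('"')
    return node, i

sample = '''Header {
code "Centric 20170117 16gaHRS"
Material {
thickness 0.125
}
}'''
doc, _ = parse_block(sample.splitlines())
print(doc['Header'][0]['code'])                      # Centric 20170117 16gaHRS
print(doc['Header'][0]['Material'][0]['thickness'])  # 0.125
```

With the full file parsed this way, the Plan filenames would be reachable as doc['Plans'][0]['Plan'], one dict per Plan block.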

How to convert this code into Antlr Groovy Grammar v4?

I am using a Groovy grammar with ANTLR to parse Groovy files, namely this one:
https://github.com/groovy/groovy-core/blob/master/src/main/org/codehaus/groovy/antlr/groovy.g
I converted the entire grammar to ANTLR v4, though it took me two full days. But I am still wondering how to convert this part to ANTLR v4:
tokens {
BLOCK; MODIFIERS; OBJBLOCK; SLIST; METHOD_DEF; VARIABLE_DEF;
INSTANCE_INIT; STATIC_INIT; TYPE; CLASS_DEF; INTERFACE_DEF; TRAIT_DEF;
PACKAGE_DEF; ARRAY_DECLARATOR; EXTENDS_CLAUSE; IMPLEMENTS_CLAUSE;
PARAMETERS; PARAMETER_DEF; LABELED_STAT; TYPECAST; INDEX_OP;
POST_INC; POST_DEC; METHOD_CALL; EXPR;
IMPORT; UNARY_MINUS; UNARY_PLUS; CASE_GROUP; ELIST; FOR_INIT; FOR_CONDITION;
FOR_ITERATOR; EMPTY_STAT; FINAL="final"; ABSTRACT="abstract";
UNUSED_GOTO="goto"; UNUSED_CONST="const"; UNUSED_DO="do";
STRICTFP="strictfp"; SUPER_CTOR_CALL; CTOR_CALL; CTOR_IDENT; VARIABLE_PARAMETER_DEF;
STRING_CONSTRUCTOR; STRING_CTOR_MIDDLE;
CLOSABLE_BLOCK; IMPLICIT_PARAMETERS;
SELECT_SLOT; DYNAMIC_MEMBER;
LABELED_ARG; SPREAD_ARG; SPREAD_MAP_ARG; //deprecated - SCOPE_ESCAPE;
LIST_CONSTRUCTOR; MAP_CONSTRUCTOR;
FOR_IN_ITERABLE;
STATIC_IMPORT; ENUM_DEF; ENUM_CONSTANT_DEF; FOR_EACH_CLAUSE; ANNOTATION_DEF; ANNOTATIONS;
ANNOTATION; ANNOTATION_MEMBER_VALUE_PAIR; ANNOTATION_FIELD_DEF; ANNOTATION_ARRAY_INIT;
TYPE_ARGUMENTS; TYPE_ARGUMENT; TYPE_PARAMETERS; TYPE_PARAMETER; WILDCARD_TYPE;
TYPE_UPPER_BOUNDS; TYPE_LOWER_BOUNDS; CLOSURE_LIST;MULTICATCH;MULTICATCH_TYPES;
}
Until these lexer tokens are correct, I won't get the correct output. The token file I got looks like this:
T__0=1
T__1=2
T__2=3
T__3=4
T__4=5
T__5=6
T__6=7
T__7=8
T__8=9
T__9=10
T__10=11
T__11=12
T__12=13
T__13=14
T__14=15
T__15=16
T__16=17
T__17=18
T__18=19
T__19=20
T__20=21
T__21=22
T__22=23
T__23=24
T__24=25
T__25=26
T__26=27
T__27=28
T__28=29
T__29=30
T__30=31
T__31=32
T__32=33
T__33=34
T__34=35
T__35=36
T__36=37
T__37=38
T__38=39
T__39=40
T__40=41
T__41=42
T__42=43
T__43=44
T__44=45
T__45=46
T__46=47
T__47=48
T__48=49
T__49=50
T__50=51
T__51=52
QUESTION=53
LPAREN=54
RPAREN=55
LBRACK=56
RBRACK=57
LCURLY=58
RCURLY=59
COLON=60
COMMA=61
DOT=62
ASSIGN=63
COMPARE_TO=64
EQUAL=65
IDENTICAL=66
LNOT=67
BNOT=68
NOT_EQUAL=69
NOT_IDENTICAL=70
PLUS=71
PLUS_ASSIGN=72
INC=73
MINUS=74
MINUS_ASSIGN=75
DEC=76
STAR=77
STAR_ASSIGN=78
MOD=79
MOD_ASSIGN=80
SR=81
SR_ASSIGN=82
BSR=83
BSR_ASSIGN=84
GE=85
GT=86
SL=87
SL_ASSIGN=88
LE=89
LT=90
BXOR=91
BXOR_ASSIGN=92
BOR=93
BOR_ASSIGN=94
LOR=95
BAND=96
BAND_ASSIGN=97
LAND=98
SEMI=99
RANGE_INCLUSIVE=100
RANGE_EXCLUSIVE=101
TRIPLE_DOT=102
SPREAD_DOT=103
OPTIONAL_DOT=104
ELVIS_OPERATOR=105
MEMBER_POINTER=106
REGEX_FIND=107
REGEX_MATCH=108
STAR_STAR=109
STAR_STAR_ASSIGN=110
CLOSABLE_BLOCK_OP=111
WS=112
NLS=113
SL_COMMENT=114
SH_COMMENT=115
ML_COMMENT=116
STRING_LITERAL=117
REGEXP_LITERAL=118
DOLLAR_REGEXP_LITERAL=119
DOLLAR_REGEXP_CTOR_END=120
IDENT=121
AT=122
FINAL=123
ABSTRACT=124
UNUSED_GOTO=125
UNUSED_DO=126
STRICTFP=127
UNUSED_CONST=128
'package'=1
'import'=2
'static'=3
'def'=4
'class'=5
'interface'=6
'enum'=7
'trait'=8
'extends'=9
'super'=10
'void'=11
'boolean'=12
'byte'=13
'char'=14
'short'=15
'int'=16
'float'=17
'long'=18
'double'=19
'as'=20
'private'=21
'public'=22
'protected'=23
'transient'=24
'native'=25
'threadsafe'=26
'synchronized'=27
'volatile'=28
'default'=29
'implements'=30
'this'=31
'throws'=32
'if'=33
'else'=34
'while'=35
'switch'=36
'for'=37
'in'=38
'return'=39
'break'=40
'continue'=41
'throw'=42
'assert'=43
'case'=44
'try'=45
'finally'=46
'catch'=47
'false'=48
'instanceof'=49
'new'=50
'null'=51
'true'=52
'?'=53
'('=54
')'=55
'['=56
']'=57
'{'=58
'}'=59
':'=60
','=61
'.'=62
'='=63
'<=>'=64
'=='=65
'==='=66
'!'=67
'~'=68
'!='=69
'!=='=70
'+'=71
'+='=72
'++'=73
'-'=74
'-='=75
'--'=76
'*'=77
'*='=78
'%'=79
'%='=80
'>>'=81
'>>='=82
'>>>'=83
'>>>='=84
'>='=85
'>'=86
'<<'=87
'<<='=88
'<='=89
'<'=90
'^'=91
'^='=92
'|'=93
'|='=94
'||'=95
'&'=96
'&='=97
'&&'=98
';'=99
'..'=100
'..<'=101
'...'=102
'*.'=103
'?.'=104
'?:'=105
'.&'=106
'=~'=107
'==~'=108
'**'=109
'**='=110
'->'=111
'#'=122
'final'=123
'abstract'=124
'goto'=125
'do'=126
'strictfp'=127
'const'=128
All the tokens starting with T__ are wrong, and the corresponding value in the Groovy lexer is null.

PEGKit combine matched symbols on stack

I'm writing a grammar for PEGKit to parse a Twine exported Twee file. This is my first time using PEGKit and I'm trying to get to grips with how it works.
I have this twee source file that I'm parsing
:: Passage One
P1 Line One
P1 Line Two
:: Passage Two
P2 Line One
P2 Line Two
Currently I've worked out how to parse the above using the following grammar
@before {
    PKTokenizer *t = self.tokenizer;
    [t.symbolState add:@"::"];
    [t.commentState addSingleLineStartMarker:@"::"];
    // New lines as symbols
    [t.whitespaceState setWhitespaceChars:NO from:'\n' to:'\n'];
    [t.whitespaceState setWhitespaceChars:NO from:'\r' to:'\r'];
    [t setTokenizerState:t.symbolState from:'\n' to:'\n'];
    [t setTokenizerState:t.symbolState from:'\r' to:'\r'];
}
start = passage+;
passage = passageTitle contentLine*;
passageTitle = passageStart Word+ eol+;
contentLine = singleLine eol+;
singleLine = Word+;
passageStart = '::'!;
eol = '\n'! | '\r'!;
and the result I get is
[Passage, One, P1, Line, One, P1, Line, Two, Passage, Two, P2, Line, One, P2, Line, Two]::/Passage/One/
/P1/Line/One/
/P1/Line/Two/
/
/::/Passage/Two/
/P2/Line/One/
/P2/Line/Two/
^
Ideally, I'd like the parser to combine the words matched for the passageTitle into a single string, similar to how the built-in PEGKit QuotedString grammar works.
So, eventually, I would have this on the stack
[Passage One, P1 Line One, P1 Line Two, Passage Two, P2 Line One, P2 Line Two]
Any thoughts on how to achieve this would be appreciated.
Creator of PEGKit here.
I understand your ultimate strategy (to collect/combine lines as single string objects), and agree that it makes sense, however, I disagree with your proposed tactic to achieve that (to alter tokenization to try to combine what are essentially multiple separate tokens into single tokens).
Combining lines into convenient string objects makes sense, but altering tokenization to achieve that doesn't make sense IMO (at least not with a recursive descent parsing kit like PEGKit) when the lines in question don't have obvious 'bracketing' characters like quotes or brackets.
You could treat the passageTitle lines starting with :: as single-line Comment tokens, but I probably wouldn't since I gather they are semantically not comments.
So instead of merging multiple tokens via the tokenizer, you should merge multiple tokens in the more natural way for PEGKit: in the parser delegate callbacks.
We have two different cases to deal with here:
The passageTitle lines
The contentLine lines
In your grammar, remove this line so we won't be treating passageTitles as Comment tokens (you didn't have it configured quite correctly anyhow, but never mind that):
[t.commentState addSingleLineStartMarker:@"::"];
And also in your grammar, remove the ! from your passageStart rule so that those tokens won't be discarded:
passageStart = '::';
That's all for the grammar. Now in your Parser Delegate callbacks, implement the two necessary callback methods for the title and content lines. And in each callback, pull all of the necessary tokens off the PKAssembly's stack, and merge them into a single string (in reverse).
@interface TweeDelegate : NSObject
@end

@implementation TweeDelegate

- (void)parser:(PKParser *)p didMatchPassageTitle:(PKAssembly *)a {
    NSArray *toks = [a objectsAbove:[PKToken tokenWithTokenType:PKTokenTypeSymbol stringValue:@"::" doubleValue:0.0]];
    [a pop]; // discard `::`

    NSMutableString *buf = [NSMutableString string];
    for (PKToken *tok in [toks reverseObjectEnumerator]) {
        [buf appendFormat:@"%@ ", tok.stringValue];
    }
    CFStringTrimWhitespace((CFMutableStringRef)buf);

    NSLog(@"Title: %@", buf); // Passage One
}

- (void)parser:(PKParser *)p didMatchContentLine:(PKAssembly *)a {
    NSArray *toks = [a objectsAbove:nil];

    NSMutableString *buf = [NSMutableString string];
    for (PKToken *tok in [toks reverseObjectEnumerator]) {
        [buf appendFormat:@"%@ ", tok.stringValue];
    }
    CFStringTrimWhitespace((CFMutableStringRef)buf);

    NSLog(@"Content: %@", buf); // P1 Line One
}

@end
I receive the following output:
Title: Passage One
Content: P1 Line One
Content: P1 Line Two
Title: Passage Two
Content: P2 Line One
Content: P2 Line Two
As for what to do with these strings once you have created them, I'll leave that up to you :). Hope that helps.
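The core move in both callbacks (pop the tokens off the assembly, then join them back to front, since a stack yields them in reverse order) can be illustrated outside Objective-C. A tiny Python analogue, with a made-up token list rather than real PEGKit API:

```python
# Tokens come off an assembly stack in reverse order; join them back to front.
toks = ['One', 'Passage']  # roughly what objectsAbove: returns for ":: Passage One"
title = ' '.join(reversed(toks)).strip()
print(title)  # Passage One
```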
