Libclang tokenKinds - clang
Where can I find the token kinds in libclang?
For example, I know these token kinds exist:
eof, r_paren, l_paren, r_brace, l_brace
Where can I find the rest of the token kinds?
Thank you.
In libclang, the token kinds are the five values of the CXTokenKind enum: CXToken_Punctuation, CXToken_Keyword, CXToken_Identifier, CXToken_Literal, and CXToken_Comment. (link)
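These are the values that clang_getTokenKind reports for tokens produced by clang_tokenize. A minimal sketch of seeing them in practice (C, error handling omitted; "example.c" is an illustrative file name, and the names array relies on the declaration order of CXTokenKind in clang-c/Index.h):

    #include <clang-c/Index.h>
    #include <stdio.h>

    int main(void) {
        /* CXTokenKind declaration order: Punctuation, Keyword,
           Identifier, Literal, Comment. */
        static const char *names[] = {"Punctuation", "Keyword",
                                      "Identifier", "Literal", "Comment"};

        CXIndex idx = clang_createIndex(0, 0);
        CXTranslationUnit tu = clang_parseTranslationUnit(
            idx, "example.c", NULL, 0, NULL, 0, CXTranslationUnit_None);

        /* Tokenize the whole translation unit and print each token. */
        CXSourceRange range =
            clang_getCursorExtent(clang_getTranslationUnitCursor(tu));
        CXToken *toks;
        unsigned n;
        clang_tokenize(tu, range, &toks, &n);

        for (unsigned i = 0; i < n; ++i) {
            CXString s = clang_getTokenSpelling(tu, toks[i]);
            printf("CXToken_%s\t%s\n",
                   names[clang_getTokenKind(toks[i])], clang_getCString(s));
            clang_disposeString(s);
        }

        clang_disposeTokens(tu, toks, n);
        clang_disposeTranslationUnit(tu);
        clang_disposeIndex(idx);
        return 0;
    }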
In Clang itself, the list of token kinds can be found by preprocessing the file clang/Basic/TokenKinds.def after making the definition #define TOK(X) X. That works because TokenKinds.def is an X-macro file: when its other macros (PUNCTUATOR, KEYWORD, ANNOTATION, and so on) are left undefined, it defines them itself in terms of TOK.
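The same expansion can also be driven from a tiny program instead of running the preprocessor by hand. A minimal sketch (assuming a Clang source or install tree is on the include path):

    #include <stdio.h>

    int main(void) {
    /* Print each token kind's name; the kw_/annot_ prefixes are pasted
       on by TokenKinds.def itself before it invokes TOK. */
    #define TOK(X) puts(#X);
    #include "clang/Basic/TokenKinds.def"
        return 0;
    }

Either way, doing this gives me the following list: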
unknown,
eof,
eod,
code_completion,
comment,
identifier,
raw_identifier,
numeric_constant,
char_constant,
wide_char_constant,
utf8_char_constant,
utf16_char_constant,
utf32_char_constant,
string_literal,
wide_string_literal,
angle_string_literal,
utf8_string_literal,
utf16_string_literal,
utf32_string_literal,
l_square,
r_square,
l_paren,
r_paren,
l_brace,
r_brace,
period,
ellipsis,
amp,
ampamp,
ampequal,
star,
starequal,
plus,
plusplus,
plusequal,
minus,
arrow,
minusminus,
minusequal,
tilde,
exclaim,
exclaimequal,
slash,
slashequal,
percent,
percentequal,
less,
lessless,
lessequal,
lesslessequal,
greater,
greatergreater,
greaterequal,
greatergreaterequal,
caret,
caretequal,
pipe,
pipepipe,
pipeequal,
question,
colon,
semi,
equal,
equalequal,
comma,
hash,
hashhash,
hashat,
periodstar,
arrowstar,
coloncolon,
at,
lesslessless,
greatergreatergreater,
kw_auto,
kw_break,
kw_case,
kw_char,
kw_const,
kw_continue,
kw_default,
kw_do,
kw_double,
kw_else,
kw_enum,
kw_extern,
kw_float,
kw_for,
kw_goto,
kw_if,
kw_inline,
kw_int,
kw_long,
kw_register,
kw_restrict,
kw_return,
kw_short,
kw_signed,
kw_sizeof,
kw_static,
kw_struct,
kw_switch,
kw_typedef,
kw_union,
kw_unsigned,
kw_void,
kw_volatile,
kw_while,
kw__Alignas,
kw__Alignof,
kw__Atomic,
kw__Bool,
kw__Complex,
kw__Generic,
kw__Imaginary,
kw__Noreturn,
kw__Static_assert,
kw__Thread_local,
kw___func__,
kw___objc_yes,
kw___objc_no,
kw_asm,
kw_bool,
kw_catch,
kw_class,
kw_const_cast,
kw_delete,
kw_dynamic_cast,
kw_explicit,
kw_export,
kw_false,
kw_friend,
kw_mutable,
kw_namespace,
kw_new,
kw_operator,
kw_private,
kw_protected,
kw_public,
kw_reinterpret_cast,
kw_static_cast,
kw_template,
kw_this,
kw_throw,
kw_true,
kw_try,
kw_typename,
kw_typeid,
kw_using,
kw_virtual,
kw_wchar_t,
kw_alignas,
kw_alignof,
kw_char16_t,
kw_char32_t,
kw_constexpr,
kw_decltype,
kw_noexcept,
kw_nullptr,
kw_static_assert,
kw_thread_local,
kw_concept,
kw_requires,
kw_co_await,
kw_co_return,
kw_co_yield,
kw__Decimal32,
kw__Decimal64,
kw__Decimal128,
kw___null,
kw___alignof,
kw___attribute,
kw___builtin_choose_expr,
kw___builtin_offsetof,
kw___builtin_types_compatible_p,
kw___builtin_va_arg,
kw___extension__,
kw___imag,
kw___int128,
kw___label__,
kw___real,
kw___thread,
kw___FUNCTION__,
kw___PRETTY_FUNCTION__,
kw___auto_type,
kw_typeof,
kw___FUNCDNAME__,
kw___FUNCSIG__,
kw_L__FUNCTION__,
kw___is_interface_class,
kw___is_sealed,
kw___is_destructible,
kw___is_nothrow_destructible,
kw___is_nothrow_assignable,
kw___is_constructible,
kw___is_nothrow_constructible,
kw___has_nothrow_assign,
kw___has_nothrow_move_assign,
kw___has_nothrow_copy,
kw___has_nothrow_constructor,
kw___has_trivial_assign,
kw___has_trivial_move_assign,
kw___has_trivial_copy,
kw___has_trivial_constructor,
kw___has_trivial_move_constructor,
kw___has_trivial_destructor,
kw___has_virtual_destructor,
kw___is_abstract,
kw___is_base_of,
kw___is_class,
kw___is_convertible_to,
kw___is_empty,
kw___is_enum,
kw___is_final,
kw___is_literal,
kw___is_pod,
kw___is_polymorphic,
kw___is_trivial,
kw___is_union,
kw___is_trivially_constructible,
kw___is_trivially_copyable,
kw___is_trivially_assignable,
kw___underlying_type,
kw___is_lvalue_expr,
kw___is_rvalue_expr,
kw___is_arithmetic,
kw___is_floating_point,
kw___is_integral,
kw___is_complete_type,
kw___is_void,
kw___is_array,
kw___is_function,
kw___is_reference,
kw___is_lvalue_reference,
kw___is_rvalue_reference,
kw___is_fundamental,
kw___is_object,
kw___is_scalar,
kw___is_compound,
kw___is_pointer,
kw___is_member_object_pointer,
kw___is_member_function_pointer,
kw___is_member_pointer,
kw___is_const,
kw___is_volatile,
kw___is_standard_layout,
kw___is_signed,
kw___is_unsigned,
kw___is_same,
kw___is_convertible,
kw___array_rank,
kw___array_extent,
kw___private_extern__,
kw___module_private__,
kw___declspec,
kw___cdecl,
kw___stdcall,
kw___fastcall,
kw___thiscall,
kw___vectorcall,
kw___forceinline,
kw___unaligned,
kw___super,
kw___global,
kw___local,
kw___constant,
kw___private,
kw___generic,
kw___kernel,
kw___read_only,
kw___write_only,
kw___read_write,
kw___builtin_astype,
kw_vec_step,
kw___builtin_omp_required_simd_align,
kw_pipe,
kw___pascal,
kw___vector,
kw___pixel,
kw___bool,
kw_half,
kw___bridge,
kw___bridge_transfer,
kw___bridge_retained,
kw___bridge_retain,
kw___covariant,
kw___contravariant,
kw___kindof,
kw__Nonnull,
kw__Nullable,
kw__Null_unspecified,
kw___ptr64,
kw___ptr32,
kw___sptr,
kw___uptr,
kw___w64,
kw___uuidof,
kw___try,
kw___finally,
kw___leave,
kw___int64,
kw___if_exists,
kw___if_not_exists,
kw___single_inheritance,
kw___multiple_inheritance,
kw___virtual_inheritance,
kw___interface,
kw___builtin_convertvector,
kw___unknown_anytype,
annot_cxxscope,
annot_typename,
annot_template_id,
annot_primary_expr,
annot_decltype,
annot_pragma_unused,
annot_pragma_vis,
annot_pragma_pack,
annot_pragma_parser_crash,
annot_pragma_captured,
annot_pragma_dump,
annot_pragma_msstruct,
annot_pragma_align,
annot_pragma_weak,
annot_pragma_weakalias,
annot_pragma_redefine_extname,
annot_pragma_fp_contract,
annot_pragma_ms_pointers_to_members,
annot_pragma_ms_vtordisp,
annot_pragma_ms_pragma,
annot_pragma_opencl_extension,
annot_pragma_openmp,
annot_pragma_openmp_end,
annot_pragma_loop_hint,
annot_module_include,
annot_module_begin,
annot_module_end
Related
How to parse output of unknown type in Julia
I am running a function from an external library here: https://github.com/baggepinnen/SingularSpectrumAnalysis.jl When running, I get this output printed in the console:

    LinearAlgebra.SVD{Float64,Float64,Array{Float64,2}}
    U factor:
    3465×10 Array{Float64,2}:
     -0.0176092   0.0162669   -0.0286626   …  -0.0123348   -0.00889247   0.0149834
     -0.0176079   0.023189    -0.00313753      0.0234491   -0.000954835  0.0237124
     -0.0175925   0.0216939    0.0187119       0.0418525   -0.0296555    0.0665848
     -0.0175613   0.0146738    0.0288932      -0.0266382    0.0127913    0.00602873
     -0.0175472   0.0072105    0.0349358       0.0225667   -0.0167306   -0.02098
     -0.0175337  -3.25703e-5   0.0304511   …  -0.0229247   -0.00725249  -0.00814757
     -0.0175243  -0.0070557    0.0154106       0.0424862   -0.0206749   -0.0115423
      ⋮                                    ⋱
     -0.0124291   0.0454897   -0.0153655      -0.019238     0.00716989   0.00251159
     -0.0122423   0.0435812   -0.0148046   …  -0.0139234   -0.0187464    0.00739847
     -0.0121735   0.0346687    0.00278444     -0.00218233   0.0110443   -0.00929289
     -0.0121211   0.0290382    0.0110726       0.0107806   -0.00106763   0.0317442
     -0.0120607   0.0194982    0.0217969       0.00578442  -0.0117156   -0.00232344
     -0.0120144   0.0126667    0.0164779      -0.0106475    0.00061507  -0.00797532
    singular values:
    10-element Array{Float64,1}:
     23396.412954604883
     89.77233712912785
     22.739080907231845
     6.7695870707469386
     1.3883392478470917
     2.8068174835480837e-12
     8.400039642654283e-13
     8.317837915706779e-13
     8.065049690243271e-13
     7.945414181455442e-13
    Vt factor:
    10×10 Array{Float64,2}:
     -0.316135   -0.316298    -0.316323   …  -0.316206   -0.31614    -0.315936
      0.408793    0.414593     0.333553     -0.33494    -0.41203    -0.407189
      0.363599    0.306612     0.0442128     0.0537767   0.309203    0.350534
     -0.314074   -0.183525     0.255186     -0.246725    0.186592    0.323551
     -0.295283   -0.0484189    0.451627      0.462796   -0.0426143  -0.315369
      0.455353   -0.378881    -0.296744   …  0.171696    0.409193   -0.445054
     -0.326408    0.586176    -0.435839      0.420663   -0.0126025  -0.143472
      0.114133    0.0540144   -0.485488     -0.155216   -0.293115    0.27703
     -0.0714588   0.00900755   0.0450054    -0.018732   -0.13437     0.165802
      0.287902   -0.328451     0.0436344     0.528208   -0.571383    0.290639

How do I parse this output to get the "U factor:", "singular values:" and "Vt factor:" output as arrays I can work with in a notebook? I've tried indexing by position and by name ([1] or ["U factor"]). Both result in errors such as this:

    MethodError: no method matching getindex(::LinearAlgebra.SVD{Float64,Float64,Array{Float64,2}}, ::Int64)
It is part of a standard library, so it can be found in the documentation: https://docs.julialang.org/en/v1/stdlib/LinearAlgebra/#LinearAlgebra.svd The printed output is only a visual representation of the data, so it can't be used to access the data programmatically. You should use the docs, or introspection functions like fieldnames, to understand how to work with the object. In this case, you should use the fields U, S and Vt of the SVD object.
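A minimal sketch of that field access (A is an illustrative stand-in for whatever matrix SingularSpectrumAnalysis.jl decomposed):

    using LinearAlgebra

    A = randn(3465, 10)      # stand-in data
    F = svd(A)               # F isa LinearAlgebra.SVD

    fieldnames(typeof(F))    # (:U, :S, :Vt)
    U  = F.U                 # the "U factor" (a 3465×10 matrix)
    S  = F.S                 # the "singular values" (a 10-element vector)
    Vt = F.Vt                # the "Vt factor" (a 10×10 matrix)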
(f)lex the difference between PRINTA$ and PRINT A$
I am parsing BASIC:

    530 FOR I=1 TO 9:C(I,1)=0:C(I,2)=0:NEXT I

The patterns that are used in this case are:

    FOR     { return TOK_FOR; }
    TO      { return TOK_TO; }
    NEXT    { return TOK_NEXT; }

    (many lines later...)

    [A-Za-z_#][A-Za-z0-9_]*[\$%\!#]?   { yylval.s = g_string_new(yytext); return IDENTIFIER; }

    (many lines later...)

    [ \t\r\l]   { /* eat non-string whitespace */ }

The problem occurs when the spaces are removed, which was common in the era of 8k RAM. So the line that is actually found in Super Star Trek is:

    530 FORI=1TO9:C(I,1)=0:C(I,2)=0:NEXTI

Now I know why this is happening: "FORI" is longer than "FOR", and it is a valid IDENTIFIER in my pattern, so it matches IDENTIFIER. The original rule in MS BASIC was that variable names could be only two characters, so there was no * and the longer match would fail. But this version also supports GW BASIC and Atari BASIC, which allow variables with long names. So "FORI" is a legal variable name in my scanner, and it matches because it is the longest hit. When I look at the manual, the only similar example deliberately returns an error. It seems what I need is "match the ID, but only if it's not the same as a defined %token". Is there such a thing?
It's easy to recognise keywords even if they have an identifier concatenated. What's tricky is deciding in what contexts you should apply that technique. Here's a simple pattern for recognising keywords, using trailing context:

    tail    [[:alnum:]]*[$%!#]?
    %%
    FOR/{tail}         { return TOK_FOR; }
    TO/{tail}          { return TOK_TO; }
    NEXT/{tail}        { return TOK_NEXT; }
      /* etc. */
    [[:alpha:]]{tail}  { /* Handle an ID */ }

Effectively, that just extends the keyword match without extending the matched token. But I doubt the problem is so simple. How should FORFORTO be tokenised, for example?
Parsing custom serialized object in Rails
I am able to export a serialized text representation of an object from our proprietary CNC programming software and need to parse it to import objects in my Rails app. Example serialized output:

    Header {
        code "Centric 20170117 16gaHRS"
        label "Centric 20170117 16gaHRS"
        lccShortname "Centric 20170117 16gaHRS"
        jobgroup "20170117 - Pike Sign"
        waste 97.5516173272
        unit INCH
        Material {
            code "HRS"
            label "HRS"
            labelDIN "HRS"
            density 0.283647787542
            thickness 0.125
        }
    }
    Rawmaterials {
        Rawmaterial {
            id 52312
            format 120 48.25
            stock +999
            used +1
        }
    }
    Parts {
        Part {
            id 1
            code "8581-Sign"
            label "8581-Sign"
            need +2
            used +2
            priority +1
            turnAngleIncrement +180
            ccAllowed +0
            filler +0
            area 141.761356753
            positioningTime 10.369402427
            cuttingTime 346.222969467
            piercingTime 35.5976025504
            positioningWay 1949.56
            cuttingWay 9249.13
            countPiercingNormal +75
            countPiercingPuls +4
        }
    }
    Plans {
        Plan {
            id 52313
            label "Centric 20170117 16gaHRS 1"
            filename "Centric 20170117 16gaHRS01"
            border 0.5 0.5 0.5 0.5
            cycleCount +1
            waste 97.5516173272
            positioningTime 11.9357066923
            cuttingTime 345.629256802
            piercingTime 35.5976025504
            auxiliaryProcessTime 79.2405450926
            positioningWay 1954.13
            cuttingWay 9215.92
            countPiercingNormal +75
            countPiercingPuls +4
            RawmaterialReference 52312
            PartReferences {
                PartReference {
                    id 1
                    layer 21
                    partId 1
                    insert -128.833464567 -97.2358267717
                }
            }
        }
        Plan {
            id 52314
            label "Centric 20170117 16gaHRS 2"
            filename "Centric 20170117 16gaHRS02"
            border 0.5 0.5 0.5 0.5
            cycleCount +1
            waste 97.5516173272
            positioningTime 11.9357066923
            cuttingTime 345.629256802
            piercingTime 35.5976025504
            auxiliaryProcessTime 79.2405450926
            positioningWay 1954.13
            cuttingWay 9215.92
            countPiercingNormal +75
            countPiercingPuls +4
            RawmaterialReference 52312
            PartReferences {
                PartReference {
                    id 1
                    layer 21
                    partId 1
                    insert -128.833464567 -97.2358267717
                }
            }
        }
    }

To start with, I would like to extract the code attribute from the Header section, and the filename attribute for each Plan. I could iterate through the file keeping note of curly braces and which section we are currently processing, but it seems as though there must be a simpler way. I could easily parse it if it were JSON or XML data, but I am at a loss as to the simplest way to parse this non-standard format.
There is no simple way. A JSON or XML parser does exactly the same thing, going through the file character by character and keeping track of everything; it's just that someone else wrote that code for you. I see 5 options:

1. You do as suggested, reading line by line and partially parsing the file. That is called an "island grammar" parser.
2. You use a series of regular expressions to turn the file into a valid JSON file and then parse that. The formats look similar enough that it might be possible.
3. You reverse engineer the format and write your own complete parser.
4. You get the name of the file format from the proprietary vendor and search for a gem that implements a parser. Most likely there will be none.
5. You get the proprietary vendor to export the data in a different format. Most likely they will charge an astronomical price or just say no.

I would give the first two options a try …
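A minimal sketch of the first option in plain Ruby (the file name and variable names are illustrative, and it assumes each "Name {", attribute, and "}" sits on its own line, as in the export shown above):

    header_code    = nil
    plan_filenames = []
    stack          = []

    File.foreach("export.txt") do |line|
      line = line.strip
      if (m = line.match(/\A(\w+)\s*\{\z/))    # e.g. "Plan {"
        stack.push(m[1])
      elsif line == "}"
        stack.pop
      # Nested sections (Material, PartReference, ...) push their own
      # names, so their code/filename attributes don't collide here.
      elsif stack.last == "Header" && (m = line.match(/\Acode\s+"(.*)"\z/))
        header_code = m[1]
      elsif stack.last == "Plan" && (m = line.match(/\Afilename\s+"(.*)"\z/))
        plan_filenames << m[1]
      end
    end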
How to convert this code into Antlr Groovy Grammar v4?
I am using a Groovy grammar with ANTLR to parse Groovy files. I am using the following Groovy grammar: https://github.com/groovy/groovy-core/blob/master/src/main/org/codehaus/groovy/antlr/groovy.g I converted the entire grammar into ANTLR v4, though it took me 2 complete days. But I am still wondering how to convert this into an ANTLR v4 grammar:

    tokens {
        BLOCK; MODIFIERS; OBJBLOCK; SLIST; METHOD_DEF; VARIABLE_DEF;
        INSTANCE_INIT; STATIC_INIT; TYPE; CLASS_DEF; INTERFACE_DEF;
        TRAIT_DEF; PACKAGE_DEF; ARRAY_DECLARATOR; EXTENDS_CLAUSE;
        IMPLEMENTS_CLAUSE; PARAMETERS; PARAMETER_DEF; LABELED_STAT;
        TYPECAST; INDEX_OP; POST_INC; POST_DEC; METHOD_CALL; EXPR; IMPORT;
        UNARY_MINUS; UNARY_PLUS; CASE_GROUP; ELIST; FOR_INIT;
        FOR_CONDITION; FOR_ITERATOR; EMPTY_STAT;
        FINAL="final"; ABSTRACT="abstract"; UNUSED_GOTO="goto";
        UNUSED_CONST="const"; UNUSED_DO="do"; STRICTFP="strictfp";
        SUPER_CTOR_CALL; CTOR_CALL; CTOR_IDENT; VARIABLE_PARAMETER_DEF;
        STRING_CONSTRUCTOR; STRING_CTOR_MIDDLE; CLOSABLE_BLOCK;
        IMPLICIT_PARAMETERS; SELECT_SLOT; DYNAMIC_MEMBER; LABELED_ARG;
        SPREAD_ARG; SPREAD_MAP_ARG;
        //deprecated - SCOPE_ESCAPE;
        LIST_CONSTRUCTOR; MAP_CONSTRUCTOR; FOR_IN_ITERABLE; STATIC_IMPORT;
        ENUM_DEF; ENUM_CONSTANT_DEF; FOR_EACH_CLAUSE; ANNOTATION_DEF;
        ANNOTATIONS; ANNOTATION; ANNOTATION_MEMBER_VALUE_PAIR;
        ANNOTATION_FIELD_DEF; ANNOTATION_ARRAY_INIT; TYPE_ARGUMENTS;
        TYPE_ARGUMENT; TYPE_PARAMETERS; TYPE_PARAMETER; WILDCARD_TYPE;
        TYPE_UPPER_BOUNDS; TYPE_LOWER_BOUNDS; CLOSURE_LIST; MULTICATCH;
        MULTICATCH_TYPES;
    }

Until these lexer tokens are correct, I won't get the correct output. The token file I got is like this:

    T__0=1   T__1=2   T__2=3   T__3=4   T__4=5   T__5=6   T__6=7   T__7=8
    T__8=9   T__9=10  T__10=11 T__11=12 T__12=13 T__13=14 T__14=15 T__15=16
    T__16=17 T__17=18 T__18=19 T__19=20 T__20=21 T__21=22 T__22=23 T__23=24
    T__24=25 T__25=26 T__26=27 T__27=28 T__28=29 T__29=30 T__30=31 T__31=32
    T__32=33 T__33=34 T__34=35 T__35=36 T__36=37 T__37=38 T__38=39 T__39=40
    T__40=41 T__41=42 T__42=43 T__43=44 T__44=45 T__45=46 T__46=47 T__47=48
    T__48=49 T__49=50 T__50=51 T__51=52
    QUESTION=53 LPAREN=54 RPAREN=55 LBRACK=56 RBRACK=57 LCURLY=58 RCURLY=59
    COLON=60 COMMA=61 DOT=62 ASSIGN=63 COMPARE_TO=64 EQUAL=65 IDENTICAL=66
    LNOT=67 BNOT=68 NOT_EQUAL=69 NOT_IDENTICAL=70 PLUS=71 PLUS_ASSIGN=72
    INC=73 MINUS=74 MINUS_ASSIGN=75 DEC=76 STAR=77 STAR_ASSIGN=78 MOD=79
    MOD_ASSIGN=80 SR=81 SR_ASSIGN=82 BSR=83 BSR_ASSIGN=84 GE=85 GT=86 SL=87
    SL_ASSIGN=88 LE=89 LT=90 BXOR=91 BXOR_ASSIGN=92 BOR=93 BOR_ASSIGN=94
    LOR=95 BAND=96 BAND_ASSIGN=97 LAND=98 SEMI=99 RANGE_INCLUSIVE=100
    RANGE_EXCLUSIVE=101 TRIPLE_DOT=102 SPREAD_DOT=103 OPTIONAL_DOT=104
    ELVIS_OPERATOR=105 MEMBER_POINTER=106 REGEX_FIND=107 REGEX_MATCH=108
    STAR_STAR=109 STAR_STAR_ASSIGN=110 CLOSABLE_BLOCK_OP=111 WS=112 NLS=113
    SL_COMMENT=114 SH_COMMENT=115 ML_COMMENT=116 STRING_LITERAL=117
    REGEXP_LITERAL=118 DOLLAR_REGEXP_LITERAL=119 DOLLAR_REGEXP_CTOR_END=120
    IDENT=121 AT=122 FINAL=123 ABSTRACT=124 UNUSED_GOTO=125 UNUSED_DO=126
    STRICTFP=127 UNUSED_CONST=128
    'package'=1 'import'=2 'static'=3 'def'=4 'class'=5 'interface'=6
    'enum'=7 'trait'=8 'extends'=9 'super'=10 'void'=11 'boolean'=12
    'byte'=13 'char'=14 'short'=15 'int'=16 'float'=17 'long'=18 'double'=19
    'as'=20 'private'=21 'public'=22 'protected'=23 'transient'=24
    'native'=25 'threadsafe'=26 'synchronized'=27 'volatile'=28 'default'=29
    'implements'=30 'this'=31 'throws'=32 'if'=33 'else'=34 'while'=35
    'switch'=36 'for'=37 'in'=38 'return'=39 'break'=40 'continue'=41
    'throw'=42 'assert'=43 'case'=44 'try'=45 'finally'=46 'catch'=47
    'false'=48 'instanceof'=49 'new'=50 'null'=51 'true'=52 '?'=53 '('=54
    ')'=55 '['=56 ']'=57 '{'=58 '}'=59 ':'=60 ','=61 '.'=62 '='=63 '<=>'=64
    '=='=65 '==='=66 '!'=67 '~'=68 '!='=69 '!=='=70 '+'=71 '+='=72 '++'=73
    '-'=74 '-='=75 '--'=76 '*'=77 '*='=78 '%'=79 '%='=80 '>>'=81 '>>='=82
    '>>>'=83 '>>>='=84 '>='=85 '>'=86 '<<'=87 '<<='=88 '<='=89 '<'=90
    '^'=91 '^='=92 '|'=93 '|='=94 '||'=95 '&'=96 '&='=97 '&&'=98 ';'=99
    '..'=100 '..<'=101 '...'=102 '*.'=103 '?.'=104 '?:'=105 '.&'=106
    '=~'=107 '==~'=108 '**'=109 '**='=110 '->'=111 '#'=122 'final'=123
    'abstract'=124 'goto'=125 'do'=126 'strictfp'=127 'const'=128

All these tokens starting with T__ are wrong, and the corresponding value in the Groovy lexer is null.
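For context (an editorial note, not part of the original question): ANTLR v4's tokens block only declares imaginary token names, comma-separated, and no longer accepts assignments such as FINAL="final". Literal keywords instead become ordinary lexer rules, which is also what makes the generated T__N placeholder tokens go away once every literal used by the parser has a named rule. A sketch of what the v4 form might look like:

    tokens {
        BLOCK, MODIFIERS, OBJBLOCK, SLIST, METHOD_DEF, VARIABLE_DEF
        // ... the remaining imaginary tokens, comma-separated
    }

    // Keywords move out of the tokens block into lexer rules:
    FINAL        : 'final';
    ABSTRACT     : 'abstract';
    UNUSED_GOTO  : 'goto';
    UNUSED_CONST : 'const';
    UNUSED_DO    : 'do';
    STRICTFP     : 'strictfp';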
PEGKit combine matched symbols on stack
I'm writing a grammar for PEGKit to parse a Twine exported Twee file. This is my first time using PEGKit and I'm trying to get to grips with how it works. I have this twee source file that I'm parsing:

    :: Passage One
    P1 Line One
    P1 Line Two

    :: Passage Two
    P2 Line One
    P2 Line Two

Currently I've worked out how to parse the above using the following grammar:

    @before {
        PKTokenizer *t = self.tokenizer;
        [t.symbolState add:@"::"];
        [t.commentState addSingleLineStartMarker:@"::"];

        // New lines as symbols
        [t.whitespaceState setWhitespaceChars:NO from:'\n' to:'\n'];
        [t.whitespaceState setWhitespaceChars:NO from:'\r' to:'\r'];
        [t setTokenizerState:t.symbolState from:'\n' to:'\n'];
        [t setTokenizerState:t.symbolState from:'\r' to:'\r'];
    }

    start = passage+;
    passage = passageTitle contentLine*;
    passageTitle = passageStart Word+ eol+;
    contentLine = singleLine eol+;
    singleLine = Word+;
    passageStart = '::'!;
    eol = '\n'! | '\r'!;

and the result I get is:

    [Passage, One, P1, Line, One, P1, Line, Two, Passage, Two, P2, Line, One, P2, Line, Two]::/Passage/One/ /P1/Line/One/ /P1/Line/Two/ / /::/Passage/Two/ /P2/Line/One/ /P2/Line/Two/ ^

Ideally, I'd like the parser to combine the words matched for the passageTitle into a single string, similar to how the built-in PEGKit QuotedString grammar works. I would also like the words matched for a contentLine to be combined as well. So, eventually, I would have this on the stack:

    [Passage One, P1 Line One, P1 Line Two, Passage Two, P2 Line One, P2 Line Two]

Any thoughts on how to achieve this would be appreciated.
Creator of PEGKit here. I understand your ultimate strategy (to collect/combine lines as single string objects), and agree that it makes sense. However, I disagree with your proposed tactic to achieve that (altering tokenization to try to combine what are essentially multiple separate tokens into single tokens). Combining lines into convenient string objects makes sense, but altering tokenization to achieve it doesn't make sense IMO (at least not with a recursive descent parsing kit like PEGKit) when the lines in question don't have obvious 'bracketing' characters like quotes or brackets. You could treat the passageTitle lines starting with :: as single-line Comment tokens, but I probably wouldn't, since I gather they are semantically not comments. So instead of merging multiple tokens via the tokenizer, you should merge multiple tokens in the more natural way for PEGKit: in the parser delegate callbacks. We have two different cases to deal with here:

1. The passageTitle lines
2. The contentLine lines

In your grammar, remove this line so we won't be treating passageTitles as Comment tokens (you didn't have that completely correctly configured anyhow, but never mind that):

    [t.commentState addSingleLineStartMarker:@"::"];

And also in your grammar, remove the ! from your passageStart rule so that those tokens won't be discarded:

    passageStart = '::';

That's all for the grammar. Now in your parser delegate callbacks, implement the two necessary callback methods for the title and content lines. In each callback, pull all of the necessary tokens off the PKAssembly's stack, and merge them into a single string (in reverse):

    @interface TweeDelegate : NSObject
    @end

    @implementation TweeDelegate

    - (void)parser:(PKParser *)p didMatchPassageTitle:(PKAssembly *)a {
        NSArray *toks = [a objectsAbove:[PKToken tokenWithTokenType:PKTokenTypeSymbol stringValue:@"::" doubleValue:0.0]];
        [a pop]; // discard `::`

        NSMutableString *buf = [NSMutableString string];
        for (PKToken *tok in [toks reverseObjectEnumerator]) {
            [buf appendFormat:@"%@ ", tok.stringValue];
        }
        CFStringTrimWhitespace((CFMutableStringRef)buf);

        NSLog(@"Title: %@", buf); // Passage One
    }

    - (void)parser:(PKParser *)p didMatchContentLine:(PKAssembly *)a {
        NSArray *toks = [a objectsAbove:nil];

        NSMutableString *buf = [NSMutableString string];
        for (PKToken *tok in [toks reverseObjectEnumerator]) {
            [buf appendFormat:@"%@ ", tok.stringValue];
        }
        CFStringTrimWhitespace((CFMutableStringRef)buf);

        NSLog(@"Content: %@", buf); // P1 Line One
    }

    @end

I receive the following output:

    Title: Passage One
    Content: P1 Line One
    Content: P1 Line Two
    Title: Passage Two
    Content: P2 Line One
    Content: P2 Line Two

As for what to do with these strings once you have created them, I'll leave that up to you :). Hope that helps.