Parsing custom serialized object in Rails - ruby-on-rails

I am able to export a serialized text representation of an object from our proprietary CNC programming software and need to parse it to import objects in my Rails app.
Example serialized output:
Header {
code "Centric 20170117 16gaHRS"
label "Centric 20170117 16gaHRS"
lccShortname "Centric 20170117 16gaHRS"
jobgroup "20170117 - Pike Sign"
waste 97.5516173272
unit INCH
Material {
code "HRS"
label "HRS"
labelDIN "HRS"
density 0.283647787542
thickness 0.125
Rawmaterials {
Rawmaterial {
id 52312
format 120 48.25
stock +999
used +1
Parts {
Part {
id 1
code "8581-Sign"
label "8581-Sign"
need +2
used +2
priority +1
turnAngleIncrement +180
ccAllowed +0
filler +0
area 141.761356753
positioningTime 10.369402427
cuttingTime 346.222969467
piercingTime 35.5976025504
positioningWay 1949.56
cuttingWay 9249.13
countPiercingNormal +75
countPiercingPuls +4
Plans {
Plan {
id 52313
label "Centric 20170117 16gaHRS 1"
filename "Centric 20170117 16gaHRS01"
border 0.5 0.5 0.5 0.5
cycleCount +1
waste 97.5516173272
positioningTime 11.9357066923
cuttingTime 345.629256802
piercingTime 35.5976025504
auxiliaryProcessTime 79.2405450926
positioningWay 1954.13
cuttingWay 9215.92
countPiercingNormal +75
countPiercingPuls +4
RawmaterialReference 52312
PartReferences {
PartReference {
id 1
layer 21
partId 1
insert -128.833464567 -97.2358267717
Plan {
id 52314
label "Centric 20170117 16gaHRS 2"
filename "Centric 20170117 16gaHRS02"
border 0.5 0.5 0.5 0.5
cycleCount +1
waste 97.5516173272
positioningTime 11.9357066923
cuttingTime 345.629256802
piercingTime 35.5976025504
auxiliaryProcessTime 79.2405450926
positioningWay 1954.13
cuttingWay 9215.92
countPiercingNormal +75
countPiercingPuls +4
RawmaterialReference 52312
PartReferences {
PartReference {
id 1
layer 21
partId 1
insert -128.833464567 -97.2358267717
To start with, I would like to extract the code attribute from the Header section, and the filename attribute for each Plan.
I could iterate through the file keeping note of curly braces and which section we are currently processing, but it seems as though there must be a simpler way. I could easily parse it if it were JSON or XML data, but I am at a loss as to the simplest way to parse this non-standard format.

There is no simple way.
JSON and XML parsers do exactly the same thing: they go through the file character by character and keep track of everything. The only difference is that someone else wrote that code for you.
I see 5 options:
1. You do as you suggested, reading line by line and partially parsing the file. That is called an "island grammar" parser.
2. You use a series of regular expressions to turn the file into valid JSON and then parse that. The formats look similar enough that it might be possible.
3. You reverse engineer the format and write your own complete parser.
4. You get the name of the file format from the vendor and search for a gem that implements a parser. Most likely there will be none.
5. You get the vendor to export the data in a different format. Most likely they will charge an astronomical price or just say no.
I would give the first two options a try …
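The line-by-line approach can be sketched as follows. This is a minimal illustration in Python rather than Ruby, and it assumes that every block is closed by a `}` on its own line (the closing braces are not visible in the sample above) and that each attribute sits on a single line. The names `extract` and `SAMPLE` are hypothetical.

```python
import shlex

# Hypothetical, shortened sample; assumes blocks are closed by "}" lines.
SAMPLE = '''
Header {
  code "Centric 20170117 16gaHRS"
  Material {
    code "HRS"
  }
  Plans {
    Plan {
      id 52313
      filename "Centric 20170117 16gaHRS01"
    }
    Plan {
      id 52314
      filename "Centric 20170117 16gaHRS02"
    }
  }
}
'''

def extract(text):
    """Return (header code, list of plan filenames)."""
    stack = []            # names of the blocks we are currently inside
    header_code = None
    plan_filenames = []
    for raw in text.splitlines():
        tokens = shlex.split(raw)   # honours the double-quoted strings
        if not tokens:
            continue
        if tokens[-1] == "{":
            stack.append(tokens[0])         # entering a block
        elif tokens[0] == "}":
            stack.pop()                     # leaving the innermost block
        elif stack:
            key, values = tokens[0], tokens[1:]
            if stack[-1] == "Header" and key == "code":
                header_code = values[0]
            elif stack[-1] == "Plan" and key == "filename":
                plan_filenames.append(values[0])
    return header_code, plan_filenames

print(extract(SAMPLE))
```

Only the innermost block name is checked, so the `code` attributes of Material and Part are ignored. The same stack idea carries over to Ruby with `String#each_line` and an array.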


Implement heredocs with trim indent using PEG.js

I'm working on a language similar to Ruby called Gaiman, and I'm using PEG.js to generate the parser.
Do you know if there is a way to implement heredocs with proper indentation?
xxx = <<<END
the output should be:
I need this because this code doesn't look very nice:
def foo(arg) {
if arg == "here" then
return <<<END
this is a function where the user wants to return:
I would prefer the code to look like this:
def foo(arg) {
if arg == "here" then
return <<<END
If I trim all the lines, the user will not be able to use a string with leading spaces when they want one. Does anyone know if PEG.js allows this?
I don't have any code for heredocs yet; I just want to be sure that what I want is possible.
So I've tried to implement heredocs and the problem is that PEG doesn't allow back-references.
heredoc = "<<<" marker:[\w]+ "\n" text:[\s\S]+ marker {
    return text.join('');
}
It says that the marker is not defined. As for trimming I think I can use location() function
I don't think that's a reasonable expectation for a parser generator; few if any would be equal to the challenge.
For a start, recognising the here-string syntax is inherently context-sensitive, since the end-delimiter must be a precise copy of the delimiter provided after the <<< token. So you would need a custom lexical analyser, and that means that you need a parser generator which allows you to use a custom lexical analyser. (So a parser generator which assumes you want a scannerless parser might not be the optimal choice.)
Recognising the end of the here-string token shouldn't be too difficult, although you can't do it with a single regular expression. My approach would be to use a custom scanning function which breaks the here-string into a series of lines, concatenating them as it goes until it reaches a line containing only the end-delimiter.
Once you've recognised the text of the literal, all you need to normalise the spaces in the way you want is the column number at which the <<< starts. With that, you can trim each line in the string literal. So you only need a lexical scanner which accurately reports token position. Trimming wouldn't normally be done inside the generated lexical scanner; rather, it would be the associated semantic action. (Equally, it could be a semantic action in the grammar. But it's always going to be code that you write.)
When you trim the literal, you'll need to deal with the cases in which it is impossible, because the user has not respected the indentation requirement. And you'll need to do something with tab characters; getting those right probably means that you'll want a lexical scanner which computes visible column positions rather than character offsets.
I don't know if peg.js corresponds with those requirements, since I don't use it. (I did look at the documentation, and failed to see any indication as to how you might incorporate a custom scanner function. But that doesn't mean there isn't a way to do it.) I hope that the discussion above at least lets you check the detailed documentation for the parser generator you want to use, and otherwise find a different parser generator which will work for you in this use case.
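The scanning and trimming steps described above can be sketched independently of any parser generator. The following Python function is only an illustration: `read_heredoc` and `SOURCE` are hypothetical names, the opening line is assumed to contain no tabs before `<<<`, and tabs in the body are expanded to a fixed width before columns are compared.

```python
def read_heredoc(lines, row, col, tab_size=8):
    """Collect a here-string whose '<<<' token starts at lines[row][col].

    Returns (text, end_row). Every body line must be indented by at least
    `col` columns; that indentation is stripped and anything beyond it is kept.
    """
    marker = lines[row][col + 3:].strip()
    body = []
    for end_row in range(row + 1, len(lines)):
        # Work with visible columns rather than raw character offsets.
        line = lines[end_row].expandtabs(tab_size)
        if line.strip() == marker:
            # A line containing only the end-delimiter closes the literal.
            return "\n".join(body), end_row
        if line[:col].strip():
            # The user did not respect the indentation requirement.
            raise SyntaxError(
                f"line {end_row + 1}: text starts left of column {col + 1}")
        body.append(line[col:])
    raise SyntaxError(f"unterminated here-string, expected {marker!r}")

SOURCE = [
    "x = <<<END",
    "    hello",
    "      indented",
    "    END",
]
text, end_row = read_heredoc(SOURCE, 0, SOURCE[0].index("<<<"))
print(repr(text), end_row)
```

Leading spaces beyond the column of `<<<` survive, which is the behaviour the question asks for; raising an error for under-indented lines is one possible policy, not the only one.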
Here is an implementation of heredocs in Peggy, the successor to PEG.js, which is no longer maintained. This code is based on a GitHub issue.
heredoc = "<<<" begin:marker "\n" text:($any_char+ "\n")+ _ end:marker (
&{ return begin === end; }
/ '' { error(`Expected matched marker "${begin}", but marker "${end}" was found`); }
) {
    const loc = location();
    const min = loc.start.column - 1;
    const re = new RegExp(`\\s{${min}}`);
    return text.map(line => {
        return line[0].replace(re, '');
    }).join('\n');
}
any_char = (!"\n" .)
marker_char = (!" " !"\n" .)
marker "Marker" = $marker_char+
_ "whitespace"
= [ \t\n\r]* { return []; }
EDIT: above didn't work with another piece of code after heredoc, here is better grammar:
{ let heredoc_begin = null; }
heredoc = "<<<" beginMarker "\n" text:content endMarker {
    const loc = location();
    const min = loc.start.column - 1;
    const re = new RegExp(`^\\s{${min}}`, 'mg');
    return {
        type: 'Literal',
        value: text.replace(re, '')
    };
}
__ = (!"\n" !" " .)
marker 'Marker' = $__+
beginMarker = m:marker { heredoc_begin = m; }
endMarker = "\n" " "* end:marker &{ return heredoc_begin === end; }
content = $(!endMarker .)*

(f)lex the difference between PRINTA$ and PRINT A$

I am parsing BASIC:
530 FOR I=1 TO 9:C(I,1)=0:C(I,2)=0:NEXT I
The patterns that are used in this case are:
FOR { return TOK_FOR; }
TO { return TOK_TO; }
NEXT { return TOK_NEXT; }
(many lines later...)
[A-Za-z_#][A-Za-z0-9_]*[\$%\!#]? {
    yylval.s = g_string_new(yytext);
    return IDENTIFIER;
}
(many lines later...)
[ \t\r\l] { /* eat non-string whitespace */ }
The problem occurs when the spaces are removed, which was common in the era of 8k RAM. So the line that is actually found in Super Star Trek is:
530 FORI=1TO9:C(I,1)=0:C(I,2)=0:NEXTI
Now I know why this is happening: "FORI" is longer than "FOR", it's a valid IDENTIFIER in my pattern, so it matches IDENTIFIER.
The original rule in MS BASIC was that variable names could be only two characters, so there was no * so the match would fail. But this version is also supporting GW BASIC and Atari BASIC, which allow variables with long names. So "FORI" is a legal variable name in my scanner, so that matches as it is the longest hit.
When I look at the manual, the only similar example deliberately returns an error. It seems that what I need is "match the ID, but only if it's not the same as a defined %token". Is there such a thing?
It's easy to recognise keywords even if they have an identifier concatenated. What's tricky is deciding in what contexts you should apply that technique.
Here's a simple pattern for recognising keywords, using trailing context:
tail [[:alnum:]]*[$%!#]?
FOR/{tail} { return TOK_FOR; }
TO/{tail} { return TOK_TO; }
NEXT/{tail} { return TOK_NEXT; }
/* etc. */
[[:alpha:]]{tail} { /* Handle an ID */ }
Effectively, that just extends the keyword match without extending the matched token.
But I doubt the problem is so simple. How should FORFORTO be tokenised, for example?
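To experiment with that question outside of flex, here is a small Python sketch. It is not a translation of the flex rules: Python's `re` alternation takes the first alternative that matches instead of the longest match, so listing the keywords first has an effect similar to the trailing-context patterns. `TOKEN_RE` and `tokenize` are hypothetical names, and the punctuation set is an assumption covering only the sample line.

```python
import re

KEYWORDS = ("FOR", "TO", "NEXT")

# Keywords come first, so they win even when identifier characters follow.
TOKEN_RE = re.compile(
    r"(?P<KEYWORD>" + "|".join(KEYWORDS) + r")"
    r"|(?P<NUMBER>\d+)"
    r"|(?P<ID>[A-Za-z][A-Za-z0-9]*[$%!#]?)"
    r"|(?P<PUNCT>[=:(),])"
    r"|(?P<SPACE>\s+)"
)

def tokenize(line):
    """Split a BASIC statement into (kind, text) pairs, dropping whitespace."""
    pos = 0
    tokens = []
    while pos < len(line):
        match = TOKEN_RE.match(line, pos)
        if match is None:
            raise ValueError(f"unexpected character {line[pos]!r} at {pos}")
        if match.lastgroup != "SPACE":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens

print(tokenize("FORI=1TO9"))
print(tokenize("FORFORTO"))
print(tokenize("TOTAL"))
```

With these rules `FORFORTO` becomes three keywords and a variable such as `TOTAL` is split into `TO` followed by `TAL`, while a keyword buried inside an identifier (for example `ITO9`) is not found at all. Those are exactly the context decisions that make the real problem harder than the pattern suggests.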

How can I efficiently parse formatted text from a file in Qt?

I am new to Qt and would like an efficient way of working with strings.
What I am doing:
I load a text file and read it line by line.
Each line contains comma-separated entries.
Line schema:
Fname{limit:list:option}, Lname{limit:list:option} ... etc.
John{0:0:0}, Lname{0:0:0}
Note: limit can be 1 or 0, and the same applies to list and option.
I would like to get Fname and the limit, list, and option values from inside the {}.
I am thinking of writing code that finds { and takes what is inside, reading symbol by symbol.
What is an efficient way to parse that?
The following snippet will give you Fname and limit,list,option from the first set of brackets. It could be easily updated if you are interested in the Lname set as well.
#include <QDebug>
#include <QFile>
#include <QRegularExpression>
QFile file("input.txt");
if (!file.open(QIODevice::ReadOnly | QIODevice::Text))
{
    qDebug() << "Failed to open input file.";
}
QRegularExpression re("(?<name>\\w+)\\{(?<limit>[0-1]):(?<list>[0-1]):(?<option>[0-1])}");
while (!file.atEnd())
{
    QString line = file.readLine();
    QRegularExpressionMatch match = re.match(line);
    if (!match.hasMatch())
        continue; // Skip lines that do not follow the schema.
    QString name = match.captured("name");
    int limit = match.captured("limit").toInt();
    int list = match.captured("list").toInt();
    int option = match.captured("option").toInt();
    // Do something with values ...
}
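If every entry on a line is needed, not just the first one, the same pattern can be applied repeatedly. The sketch below shows the idea in Python for brevity; `FIELD_RE` and `parse_line` are hypothetical names. In Qt the corresponding call would be `QRegularExpression::globalMatch`, which returns an iterator over all matches.

```python
import re

# Same shape as the pattern above: name{limit:list:option}
FIELD_RE = re.compile(
    r"(?P<name>\w+)\{(?P<limit>[01]):(?P<list>[01]):(?P<option>[01])\}"
)

def parse_line(line):
    """Return one (name, limit, list, option) tuple per entry on the line."""
    return [
        (m["name"], int(m["limit"]), int(m["list"]), int(m["option"]))
        for m in FIELD_RE.finditer(line)
    ]

print(parse_line("John{0:0:0}, Lname{0:1:0}"))
```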

insert numerical sequence in large text file

I need to create a file in this format :
I was doing this with UltraEdit which has column mode including insert number ( start + increment including leading zeros ).
Unfortunately, UltraEdit bombs out above 1 million rows.
Does anyone know of a text editor with large file capacity that has a similar operation?
BaltoStar has not written which version of UltraEdit was used and how exactly he has tried to create the file.
However, here is an UltraEdit script which can be used to create a file with lines containing an incrementing number with leading zeros according to last number.
To use that script with UltraEdit v14.20 or UEStudio v9.00 or any later version, copy the code block of the script and paste it into a new ASCII file with DOS line terminations in UltraEdit/UEStudio. Save the file for example as CreateLinesWithIncrementingNumber.js into your preferred directory of UE/UES scripts.
Now run the script by clicking on menu item Run Active Script in menu Scripting.
The script prompts the user for first and last value of the incrementing number, and for strings left and right of the incrementing number which can be both also empty strings.
Then lean back and see how the script writes the lines with the incrementing number into a new file in blocks. I created a file with more than 150 MB with an incrementing number from 0 to 5.000.000 within a few seconds using this UltraEdit script.
if (typeof(UltraEdit.clipboardContent) == "string")
{
   // Write in blocks of not more than 4 MB into the file. Do not increase
   // this value too much as during the script execution much more free
   // RAM in a continuous block is necessary than the value used here for
   // joining the lines in user clipboard 9. A too large value results
   // in a memory exception during script execution and the user of the
   // script also does not see for several seconds what is going on.
   var nBlockSize = 4194304;
   // Create a new file and make sure it uses DOS/Windows line terminations
   // independent on the user configuration for line endings of new files.
   UltraEdit.newFile();
   UltraEdit.activeDocument.unixMacToDos();
   var sLineTerm = "\r\n"; // Type of line termination is DOS/Windows.
   // Ask user of script for the first value to write into the file.
   do
   {
      var nFirstNumber = UltraEdit.getValue("Please enter first value of incrementing number:",1);
      if (nFirstNumber < 0)
      {
         UltraEdit.messageBox("Sorry, but first value cannot be negative.");
      }
   }
   while (nFirstNumber < 0);
   // Ask user of script for the last value to write into the file.
   do
   {
      var nLastNumber = UltraEdit.getValue("Please enter last value of incrementing number:",1);
      if (nFirstNumber >= nLastNumber)
      {
         UltraEdit.messageBox("Sorry, but last value must be greater than "+nFirstNumber.toString(10)+".");
      }
   }
   while (nFirstNumber >= nLastNumber);
   var sBeforeNumber = UltraEdit.getString("Please enter string left of the incrementing number:",1);
   var sAfterNumber = UltraEdit.getString("Please enter string right of the incrementing number:",1);
   // Convert the highest number to a decimal string and get a copy
   // of this string with every character replaced by character '0'.
   // With last number being 39428 the created string is "00000".
   var sLeadingZeros = nLastNumber.toString(10).replace(/./g,"0");
   // Instead of writing the string with the incrementing number line
   // by line to file which would be very slow and which would create
   // lots of undo records, the lines are collected first in an array of
   // strings whereby the number of strings in the array is determined
   // by value of variable nBlockSize. The lines in the array are
   // concatenated into user clipboard 9 and written as block to the
   // file using paste command. That is much faster and produces just
   // a few undo records even on very large files.
   UltraEdit.selectClipboard(9);
   // Calculate number of lines per block which depends on the
   // lengths of the 4 strings which build a line in the file.
   var nLineLength = sBeforeNumber.length + sLeadingZeros.length +
                     sAfterNumber.length + sLineTerm.length;
   var nRemainder = nBlockSize % nLineLength;
   var nLinesPerBlock = (nBlockSize - nRemainder) / nLineLength;
   var asLines = [];
   var nCurrentNumber = nFirstNumber;
   while (nLastNumber >= nCurrentNumber)
   {
      // Convert integer number to decimal string.
      var sNumber = nCurrentNumber.toString(10);
      // Has the decimal string of the current number less
      // characters than the decimal string of the last number?
      if (sNumber.length < sLeadingZeros.length)
      {
         // Build decimal string new with X zeros from the alignment string
         // and concatenate this leading zero string with the number string.
         sNumber = sLeadingZeros.substr(0,sLeadingZeros.length-sNumber.length) + sNumber;
      }
      asLines.push(sBeforeNumber + sNumber + sAfterNumber);
      if (asLines.length >= nLinesPerBlock)
      {
         asLines.push(""); // Results in a line termination at block end.
         UltraEdit.clipboardContent = asLines.join(sLineTerm);
         UltraEdit.activeDocument.paste();
         asLines = [];
      }
      nCurrentNumber++;
   }
   // Output also the last block.
   if (asLines.length)
   {
      UltraEdit.clipboardContent = asLines.join(sLineTerm);
      UltraEdit.activeDocument.paste();
   }
   // Reselect the system clipboard and move caret to top of new file.
   UltraEdit.selectClipboard(0);
   UltraEdit.activeDocument.top();
}
else if(UltraEdit.messageBox)
{
   UltraEdit.messageBox("Sorry, but you need a newer version of UltraEdit/UEStudio for this script.");
}
else
{
   UltraEdit.newFile();
   UltraEdit.activeDocument.write("Sorry, but you need a newer version of UltraEdit/UEStudio for this script.");
}
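For readers without UltraEdit, the core of the script (pad every number to the width of the last number and write the lines out) can be sketched outside the editor. This is an illustration in Python, not part of the original answer; `numbered_lines` and `write_numbered_file` are hypothetical names.

```python
def numbered_lines(first, last, before="", after="", line_term="\r\n"):
    """Yield lines with an incrementing number, zero-padded to the width of `last`."""
    width = len(str(last))
    for number in range(first, last + 1):
        yield f"{before}{number:0{width}d}{after}{line_term}"

def write_numbered_file(path, first, last, before="", after=""):
    # newline="" keeps the explicit DOS line terminators untouched.
    with open(path, "w", newline="") as handle:
        handle.writelines(numbered_lines(first, last, before, after))

print(list(numbered_lines(8, 11, "row-", ";")))
```

Because the lines come from a generator and the file object buffers its output, the whole file is never held in memory at once, which plays the same role as the block-wise pasting in the script.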

ANTLR best way to include meta-data in lexing/parsing (custom objects, kind of annotation)

I plan to include text metadata (like bold, font-size, etc.) in the process of parsing to achieve better recognition.
For instance, I have a given structure where a word on its own line (word\r\n) that is bold and sized 24px is the title of some article. To get better recognition results, I want to take the characters as well as the metadata into account. In terms of ANTLR, I'm not sure how this could best be done. I'd like to do something like one of the following:
1. Wrap each character of the original text in a custom object with fields for the metadata and pass that to ANTLR.
2. Preprocess the text and insert annotations for the metadata at specific places, which are then considered by the grammar.
I would really like to take option 1, but I'm not sure which parts of ANTLR I need to subclass. Do I have to start at the ANTLRInputStream object in order to get a proper stream for a subclassed Lexer that produces custom Tokens for a subclassed Parser, and so on? Is there a more elegant way, especially for querying the tokens while parsing with actions in a {} block?
If anyone has some hints and/or experiences this would be great!
Here is a more specific, simple example: I have a file which includes the encoding of the metadata, which I parse beforehand. The actual text, including newlines, looks like the following:
entryOne
Here is some content one.
entryTwo
Here is some content two.
Here the titles entryOne and entryTwo originally have a font size of 24px and the content has a font size of 12px (example values). Character by character, I create a new instance of a custom object encapsulating the character as a String and the font size.
I initialize respective objects for each of the characters with fields of the font-size, e.g for the first letter of entryOne like
MyChar aTitelChar = new MyChar("e", 24);
For the content, like the second line Here is some content one. I create instances of MyChar like:
MyChar aContentChar= new MyChar("H", 12);
All characters of the texts are wrapped in instances of the below MyChar-Class and added to a List<MyChar> in order to produce a new input for ANTLR.
Below is the Java class for the characters:
public class MyChar {
    private int fontSizePx;
    private String text;
    public MyChar(String text, int fontSizePx) {
        this.text = text;
        this.fontSizePx = fontSizePx;
    }
    public int getFontSizePx() {
        return fontSizePx;
    }
    public String getText() {
        return text;
    }
}
I want that my grammar matches the above two entries (or more formatted this way) which in turn consist each of a title and a content which is terminated with a fullstop. This grammar could look like this:
rule: entry+ NEWLINE
letters NEWLINE
(letters)+ '.' NEWLINE
('a'..'z' | 'A'..'Z')+
(' ' | '\t' | 'f' ) + {$channel = HIDDEN;};
NEWLINE:'\r'? '\n';
Now, for instance, what I want to do is find out whether it really is the title of an entry by checking the font size of all letters making up the title token before the title rule returns. If the input conforms to the grammar but is actually some kind of mistake (the original metadata-encoded file starts with something that conforms to the title rule but is actually content), the author of the grammar could sort that out by checking against the known title font size of 24. If one of the letter tokens doesn't have font size 24, throw an exception, don't return, or do something else appropriate.
What I'm pondering is where to plug in the List<MyChar> to provide this functionality (querying kinds of metadata while parsing in the context of ANTLR). I'm experimenting with ANTLR's classes, but as I'm new to ANTLR I thought some of the experienced users could point me in the right direction. Where would be good insertion points for custom objects? Should I start by implementing CharStream and overriding some methods? Is there something ANTLR provides that I haven't found yet?
Here's one way to accomplish what I think you're going for, using the parser to manage matching input to metadata. Note that I made whitespace significant because it's part of the content and can't be skipped. I also made periods part of content to simplify the example, rather than using them as a marker.
grammar SysEx;
@header {
    import java.util.List;
}
@parser::members {
    private List<MyChar> metadata;
    private int curpos;
    private boolean isTitleInput(String input) {
        return isFontSizeInput(input, 24);
    }
    private boolean isContentInput(String input){
        return isFontSizeInput(input, 12);
    }
    private boolean isFontSizeInput(String input, int fontSize){
        List<MyChar> sublist = metadata.subList(curpos, curpos + input.length());
        System.out.println(String.format("Testing metadata for input=\%s, font-size=\%d", input, fontSize));
        int start = curpos;
        //move our metadata pointer forward.
        skipInput(input);
        for (int i = 0, count = input.length(); i < count; ++i){
            MyChar chardata = sublist.get(i);
            char c = input.charAt(i);
            if (chardata.getText().charAt(0) != c){
                //This character doesn't match the metadata (ERROR!)
                System.out.println(String.format("Content mismatch at metadata position \%d: metadata=(\%s,\%d); input=\%c", start + i, chardata.getText(), chardata.getFontSizePx(), c));
                return false;
            } else if (chardata.getFontSizePx() != fontSize){
                //The font is wrong.
                System.out.println(String.format("Format mismatch at metadata position \%d: metadata=(\%s,\%d); input=\%c", start + i, chardata.getText(), chardata.getFontSizePx(), c));
                return false;
            }
        }
        //All characters check out.
        return true;
    }
    private void skipInput(String str){
        curpos += str.length();
        System.out.println("\t\tMoving metadata pointer ahead by " + str.length() + " to " + curpos);
    }
}
rule[List<MyChar> metadata]
    @init {
        this.metadata = metadata;
    }
    : entry+ EOF
    ;
entry
    : title content
      {System.out.println("Finished reading entry.");}
    ;
title
    : line {isTitleInput($line.text)}? newline {System.out.println("Finished reading title " + $line.text);}
    ;
content
    : line {isContentInput($line.text)}? newline {System.out.println("Finished reading content " + $line.text);}
    ;
newline
    : (NEWLINE{skipInput($NEWLINE.text);})+
    ;
line returns [String text]
    @init {
        StringBuilder builder = new StringBuilder();
    }
    @after {
        $text = builder.toString();
    }
    : (ANY{builder.append($ANY.text);})+
    ;
NEWLINE:'\r'? '\n';
ANY: .; //whitespace can't be skipped because it's content.
A title is a line that matches the title metadata (size 24 font) followed by one or more newline characters.
A content is a line that matches the content metadata (size 12 font) followed by one or more newline characters. As mentioned above, I removed the check for a period for simplification.
A line is a sequence of characters that does not include newline characters.
A validating semantic predicate (the {...}? after line) is used to validate that the line matches the metadata.
Here is the code I used to test the grammar (minus imports, for brevity):
public class SysExGrammar {
    public static void main(String[] args) throws Exception {
        //Create some metadata that matches our input.
        List<MyChar> matchingMetadata = new ArrayList<MyChar>();
        appendMetadata(matchingMetadata, "entryOne\r\n", 24);
        appendMetadata(matchingMetadata, "Here is some content one.\r\n", 12);
        appendMetadata(matchingMetadata, "entryTwo\r\n", 24);
        appendMetadata(matchingMetadata, "Here is some content two.\r\n", 12);
        parseInput(matchingMetadata);
        System.out.println("Finished example #1");
        //Create some metadata that doesn't match our input (negative test).
        List<MyChar> mismatchingMetadata = new ArrayList<MyChar>();
        appendMetadata(mismatchingMetadata, "entryOne\r\n", 24);
        appendMetadata(mismatchingMetadata, "Here is some content one.\r\n", 12);
        appendMetadata(mismatchingMetadata, "entryTwo\r\n", 12); //content font size!
        appendMetadata(mismatchingMetadata, "Here is some content two.\r\n", 12);
        parseInput(mismatchingMetadata);
        System.out.println("Finished example #2");
    }
    private static void parseInput(List<MyChar> metadata) throws Exception {
        //Test setup
        InputStream resource = SysExGrammar.class.getResourceAsStream("SysExTest.txt");
        CharStream input = new ANTLRInputStream(resource);
        SysExLexer lexer = new SysExLexer(input);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        SysExParser parser = new SysExParser(tokens);
        //Run the start rule with the metadata that belongs to this input.
        parser.rule(metadata);
        System.out.println("Parsing encountered " + parser.getNumberOfSyntaxErrors() + " syntax errors");
    }
    private static void appendMetadata(List<MyChar> metadata, String string,
            int fontSize) {
        for (int i = 0, count = string.length(); i < count; ++i){
            metadata.add(new MyChar(string.charAt(i) + "", fontSize));
        }
    }
}
SysExTest.txt (note that this uses Windows newlines, \r\n):
entryOne
Here is some content one.
entryTwo
Here is some content two.
Test output (trimmed; the second example has deliberately-mismatched metadata):
Parsing encountered 0 syntax errors
Finished example #1
Parsing encountered 2 syntax errors
Finished example #2
This solution requires that each MyChar corresponds to a character in the input (including newline characters, although you can remove that limitation if you like -- I would remove it if I didn't already have this answer written up ;) ).
As you can see, it's possible to tie the metadata to the parser and everything works as expected. I hope this helps.
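The bookkeeping behind this approach, a cursor that walks a parallel list of per-character metadata while the text is consumed line by line, can also be tried without ANTLR. The Python sketch below is only an illustration of that idea and not a port of the grammar: `check_entries` and `build_metadata` are hypothetical names, and strictly alternating title and content lines with font sizes 24 and 12 are assumed.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MyChar:
    text: str
    font_size_px: int

def build_metadata(parts):
    """parts: iterable of (string, font size); one MyChar per character."""
    return [MyChar(ch, size) for text, size in parts for ch in text]

def check_entries(source, metadata, sizes=(24, 12)):
    """Count lines whose metadata does not fit the title/content pattern."""
    pos = 0        # cursor into the metadata list
    errors = 0
    for index, line in enumerate(source.splitlines(keepends=True)):
        body = line.rstrip("\r\n")
        expected = sizes[index % 2]          # title first, then content
        chunk = metadata[pos:pos + len(body)]
        matches = len(chunk) == len(body) and all(
            m.text == ch and m.font_size_px == expected
            for m, ch in zip(chunk, body)
        )
        if not matches:
            errors += 1
        pos += len(line)                     # skip the newline characters too
    return errors

SOURCE = ("entryOne\r\nHere is some content one.\r\n"
          "entryTwo\r\nHere is some content two.\r\n")
good = build_metadata([("entryOne\r\n", 24), ("Here is some content one.\r\n", 12),
                       ("entryTwo\r\n", 24), ("Here is some content two.\r\n", 12)])
bad = build_metadata([("entryOne\r\n", 24), ("Here is some content one.\r\n", 12),
                      ("entryTwo\r\n", 12), ("Here is some content two.\r\n", 12)])
print(check_entries(SOURCE, good), check_entries(SOURCE, bad))
```

The count returned here is the number of mismatching lines, which is not the same quantity as the syntax error count ANTLR reports after error recovery.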
