PugiXML Preserve Whitespace, but not EOL - html-parsing

I'm converting HTML into XML, changing tag names and other things, but I have problems preserving whitespace.
This is how I'm loading file:
xml_parse_result check = doc.load_file(sourcePath.c_str(), (parse_default | parse_ws_pcdata), encoding_auto);
But if I use it that way, '\n' and '\r' are preserved as well. I can't understand why, because parse_escapes and parse_eol are on by default.
parse_ws_pcdata_single doesn't fit my case, because the whitespace I want to preserve has siblings.

Related

select only a word that is part of colon

I have a text file using markup language (similar to wikipedia articles)
cat test.txt
This is a sample text having: colon in the text. and there is more [[in single or double: brackets]]. I need to select the first word only.
and second line with no [brackets] colon in it.
I need to select the word "having:" only because that is part of regular text. I tried
grep -v '[*:*]' test.txt
This will correctly avoid the tags, but does not select the expected word.
The square brackets specify a character class, so your regular expression looks for any occurrence of one of the characters * or : (or *, but we said that already, didn't we?)
grep has the option -o to only print the matching text, so something like
grep -ow '[^[:space:]]*:[^[:space:]]*' file.txt
would extract any text with a colon in it, surrounded by zero or more non-whitespace characters on each side. The -w option adds the condition that the match needs to be between word boundaries.
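If you want to experiment with the pattern outside the shell, here is a rough Python sketch of the same idea. It assumes `\S` is an acceptable stand-in for `[^[:space:]]` and does not try to reproduce the `-w` word-boundary check; the function name `colon_tokens` is made up for this illustration.

```python
import re

# \S is used here as a stand-in for the POSIX class [^[:space:]].
COLON_TOKEN = re.compile(r"\S*:\S*")

def colon_tokens(line):
    """Return every run of non-whitespace characters that contains a colon."""
    return COLON_TOKEN.findall(line)

sample = ("This is a sample text having: colon in the text. and there is "
          "more [[in single or double: brackets]]. I need to select the first word only.")
print(colon_tokens(sample))
```

Like the grep command, this also returns the word inside the brackets, which is why the bracketed text has to be dealt with separately.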
However, if you want to restrict in which context you want to match the text, you will probably need to switch to a more capable tool than plain grep. For example, you could use sed to preprocess each line to remove any bracketed text, and then look for matches in the remaining text.
sed -e 's/\[.*]//g' -e 's/ [^: ]*$/ /' -e 's/[^: ]* //g' -e 's/ /\n/' file.txt
(This assumes that your sed recognizes \n in the replacement string as a literal newline. There are simple workarounds available if it doesn't, but let's not go there if it's not necessary.)
In brief, we first remove any text between square brackets. (This needs to be improved if your input could contain multiple sequences of square brackets on a line with normal text between them. Your example only shows nested square brackets, but my approach is probably too simple for either case.) Then, we remove any words which don't contain a colon, with a special provision for the last word on the line, and some subsequent cleanup. Finally, we replace any remaining spaces with newlines, and (implicitly) print whatever is left. (This still ends up printing one newline too many, but that is easy to fix up later.)
Alternatively, we could use sed to remove any bracketed expressions, then use grep on the remaining tokens.
sed -e :a -e 's/\[[^][]*\]//' -e ta file.txt |
grep -ow '[^[:space:]]*:[^[:space:]]*'
The :a creates a label a and ta says to jump back to that label and try again if the regex matched. This one also demonstrates how to handle nested and repeated brackets. (I suppose it could be refactored into the previous attempt, so we could avoid the pipe to grep. But outlining different solution models is also useful here, I suppose.)
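The label-and-jump loop can be sketched in Python as a plain while loop: remove one innermost bracket pair, and repeat until nothing matches. This is only an illustration of the looping idea, not a replacement for the sed command; the name `strip_brackets` is hypothetical.

```python
import re

# One innermost pair: "[", then anything that is not a bracket, then "]".
INNERMOST = re.compile(r"\[[^\[\]]*\]")

def strip_brackets(line):
    """Repeatedly delete innermost [...] pairs until none are left."""
    while True:
        line, replaced = INNERMOST.subn("", line, count=1)
        if replaced == 0:   # same role as falling through "ta" in sed
            return line

print(strip_brackets("more [[in single or double: brackets]]. I need"))
```

Because each pass only removes a pair that contains no other brackets, nested and repeated brackets are both handled.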
If you wanted to ensure that there is at least one non-colon character adjacent to the colon, you could do something like
... file.txt |
grep -owE '[^:[:space:]]+:[^[:space:]]*|[^[:space:]]*:[^: [:space:]]+'
where the -E option selects a slightly more modern regex dialect which allows us to use | between alternatives and + for one or more repetitions. (The original grep of the early 1970s did not have these features at all; much later, the POSIX standard grafted them on with a slightly wacky syntax which requires you to backslash them to remove the literal meaning and select the metacharacter behavior... but let's not go there.)
Notice also how [^:[:space:]] matches a single character which is not a colon or a whitespace character, where [:space:] is the (slightly arcane) special POSIX named character class which matches any whitespace character (regular space, horizontal tab, vertical tab, possibly Unicode whitespace characters, depending on locale).
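To see which tokens the stricter pattern accepts and rejects, here is a small Python sketch. It assumes `\s` and `\S` are close enough to the POSIX `[:space:]` class for this purpose; `strict_colon_tokens` is a made-up name.

```python
import re

# Either at least one non-colon character before the colon,
# or at least one non-colon character after it.
STRICT = re.compile(r"[^:\s]+:\S*|\S*:[^:\s]+")

def strict_colon_tokens(line):
    """Return colon-containing tokens, skipping a colon that stands alone."""
    return STRICT.findall(line)

print(strict_colon_tokens("lone : here key: :val a:b"))
```

The lone colon is skipped, while tokens with text on either side of the colon are kept.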
Awk easily lets you iterate over the tokens on a line. The requirement to ignore matches within square brackets complicates matters somewhat; you could keep a separate variable to keep track of whether you are inside brackets or not.
awk '{ for(i=1; i<=NF; ++i) {
# continue moves on to the next token; next would skip the rest of the line
if($i ~ /\]/) { brackets=0; continue }
if($i ~ /\[/) brackets=1;
if(brackets) continue;
if($i ~ /:/) print $i } }' file.txt
This again hard-codes some perhaps incorrect assumptions about how the brackets can be placed. It will behave unexpectedly if a single token contains a closing square bracket followed by an opening one, and has an oversimplified treatment of nested brackets (the first closing bracket after a series of opening brackets will effectively assume we are no longer inside brackets).
A combined solution using sed and awk:
sed 's/ /\n/g' test.txt | gawk 'i==0 && $0~/:$/{ print $0 }/\[/{ i++} /\]/ {i--}'
sed will change all spaces to a newline
awk (or gawk) will output all lines matching $0~/:$/, as long as i equals zero
The last part of the awk stuff keeps a count of the opening and closing brackets.
Another solution using sed and grep:
sed -r -e 's/\[.*\]+//g' -e 's/ /\n/g' test.txt | grep ':$'
's/\[.*\]+//g' will filter the stuff between brackets
's/ /\n/g' will replace a space with a newline
grep will only find lines ending with :
A third one using only awk:
gawk '{ for (t=1;t<=NF;t++){
if(i==0 && $t~/:$/) print $t;
i=i+gsub(/\[/,"",$t)-gsub(/\]/,"",$t) }}' test.txt
gsub returns the number of replacements.
The variable i counts the nesting level of brackets. On every [ it is incremented by 1, and on every ] it is decremented by 1. This works because gsub(/\[/,"",$t) returns the number of replaced characters. For a token like [[][ the count is increased by (3-1=) 2. When a token has brackets AND a colon my code will fail: the token is matched if it ends with a :, before the bracket count is updated.
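The counting idea can be sketched in Python as follows. This is a variation rather than a translation of the awk script: as an assumption of mine, a token that itself contains a bracket is never reported, which sidesteps the failure case mentioned above. The function name is hypothetical.

```python
def colon_tokens_outside_brackets(line):
    """Return tokens ending in ':' that are not inside [...] at any depth."""
    depth = 0
    found = []
    for token in line.split():
        opens = token.count("[")
        closes = token.count("]")
        # Only report a token when we are outside all brackets and the
        # token itself does not open or close any.
        if depth == 0 and opens == 0 and closes == 0 and token.endswith(":"):
            found.append(token)
        depth += opens - closes
    return found

line = ("This is a sample text having: colon in the text. and there is "
        "more [[in single or double: brackets]]. I need to select the first word only.")
print(colon_tokens_outside_brackets(line))
```

Like the awk version, this trusts the brackets to be balanced; an unmatched ] would push the depth below zero.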

Replacing part of LaTeX command using BBEdit grep

How can I use the BBEdit grep option to replace LaTeX commands like
\textcolor{blue}{Some text}
by the contents of the second set of braces, so
Some text
?
The BBEdit Grep Tutorial gives a lot of information and good examples on using the grep option in BBEdit. What you are trying to achieve is actually a variation of one of the examples. The solution is to enter the following:
Find: \\textcolor\{blue\}\{([^\}]*)\}
Replace: \1
The relevant part is the "Find" section. The first part, \\textcolor\{blue\}\{, searches for the literal text \textcolor{blue}{. You need the backslashes to escape the special characters \, { and }.
Next, we have the cryptic sequence ([^\}]*). The (...) saves everything matched inside the parentheses into the variable \1, which you can use in the "Replace" section to insert the content. Inside, [^\}] is a negated character class: the ^ means "not", so it matches any single character that is not a closing brace \}. The trailing * repeats that match any number of times. Overall, this expression matches all characters up to the next closing brace and saves them into \1.
Finally, the expression ends with a \}, i.e. a closing brace, which is the end of what we want to find.
The "Replace" only contains \1, which is everything inside the parentheses (...) in the "Find" field.
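For anyone who wants to try the pattern outside BBEdit, the same find-and-replace can be sketched with Python's re module. The pattern is the one from the "Find" field; the function name `strip_textcolor` is made up for this example.

```python
import re

# Same pattern as the "Find" field; group 1 captures the second braces' content.
TEXTCOLOR = re.compile(r"\\textcolor\{blue\}\{([^}]*)\}")

def strip_textcolor(text):
    """Replace \\textcolor{blue}{...} with the content of the second braces."""
    return TEXTCOLOR.sub(r"\1", text)

print(strip_textcolor(r"Before \textcolor{blue}{Some text} after"))
```

Because the capture stops at the first closing brace, content that itself contains braces (for example a nested \emph{...}) is not handled by this pattern.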

Flex prints newline to stdout on default rule match - want to alter that behavior

I have the following flex rules in place.
"#"{name} {printf(" HASH | %s\n", yytext);}
. {}
It works great for my purposes and outputs upon a match to the first rule;
HASH | some matched string
What's bothering me is that flex is also printing a newline on each match of the second rule. So I get a stdout filled with newlines. Is there a do-nothing op in C? Am I implicitly telling flex to print a newline with an empty rule action? Omitting the "{}" results in the same behavior. I can use sed or whatever to filter out the newlines, but I'd rather just tell flex to stop printing newlines.
I'm happy to provide follow-up examples and data.
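In flex, . matches any character except a newline, so the newlines are not matched by either of your rules. Unmatched input falls through to flex's built-in default rule, which echoes it to stdout. You need to add \n to your default rule: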
.|\n {}
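The effect can be imitated with a toy scanner in Python, where . also does not match a newline by default. This is only a sketch of the behavior, not of how flex works internally; the definition of {name} is not shown in the question, so the identifier pattern below is an assumption, and `scan` is a made-up function.

```python
import re

# Assumed stand-in for "#"{name}; the real {name} definition is not shown.
HASH_RULE = re.compile(r"#[A-Za-z_][A-Za-z0-9_]*")

def scan(text, swallow_newlines):
    """Toy scanner: try the rules in order, echo anything no rule matches."""
    skip_rule = re.compile(r".|\n" if swallow_newlines else r".")
    out = []
    pos = 0
    while pos < len(text):
        match = HASH_RULE.match(text, pos)
        if match:
            out.append(" HASH | %s\n" % match.group())
            pos = match.end()
            continue
        match = skip_rule.match(text, pos)
        if match:                 # empty action: consume and print nothing
            pos = match.end()
            continue
        out.append(text[pos])     # default rule: echo the unmatched character
        pos += 1
    return "".join(out)

print(repr(scan("a #foo b\nc\n", swallow_newlines=False)))
print(repr(scan("a #foo b\nc\n", swallow_newlines=True)))
```

With the plain . rule the two input newlines are echoed; with .|\n only the HASH line is printed.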

How to create a parser which means any char not in ['(',')','{','}'], in PetitParserDart?

I want to define a parser which accept any char except ['(', ')', '{', '}'] in PetitParserDart.
I tried:
char('(').not() & char(')').not() & char('{').not() & char('}')
I'm not sure if it's correct. Is there a simpler way to do this (something like chars('(){}').neg())?
This matches any character except those listed after the caret ^. It is the character class of all characters without the listed ones:
pattern('^(){}');
This also works (note the .not() on the last character, and the any() to actually consume the character):
char('(').not() & char(')').not() & char('{').not() & char('}').not() & any()
And this one works as well:
anyIn('(){}').neg()
Which is equivalent to:
(anyIn('(){}').not() & any()).pick(1)
And another alternative is:
(char('(') | char(')') | char('{') | char('}')).neg()
Except for the second example, all examples return the parsed character (this can be easily fixed, but I wanted to stay close to your question). The first example is probably the easiest to understand, but depending on context you might prefer one of the alternatives.

Character column parsing in Boost::Spirit

I'm working on a Boost Spirit 2.0 based parser for a small subset of Fortran 77. The issue I'm having is that Fortran 77 is column oriented, and I have been unable to find anything in Spirit that can allow its parsers to be column-aware. Is there any way to do this?
I don't really have to support the full arcane Fortran syntax, but it does need to be able to ignore lines that have a character in the first column (Fortran comments), and recognize lines with a character in the sixth column as continuation lines.
It seems like folks dealing with batch files would at least have the same first-column problem as me. Spirit appears to have an end-of-line parser, but not a start-of-line parser (and certainly not a column(x) parser).
Well, since I now have an answer to this, I guess I should share it.
Fortran 77, like probably all other languages that care about columns, is a line-oriented language. That means your parser has to keep track of the EOL and actually use it in its parsing.
Another important fact is that in my case, I didn't care about parsing the line numbers that Fortran can put in those early control columns. All I need is to know when it is telling me to scan the rest of the line differently.
Given those two things, I could entirely handle this issue with a Spirit skip parser. I wrote mine to
skip the entire line if the first (comment) column contains an alphabetic character.
skip the entire line if there is nothing on it.
ignore the preceding EOL and everything up to the sixth column if the sixth column contains a '.' (continuation line). This tacks it to the preceding line.
skip all non-EOL whitespace (even spaces don't matter in Fortran. Yes, it's a weird language.)
Here's the code:
skip =
// Full line comment
(spirit::eol >> spirit::ascii::alpha >> *(spirit::ascii::char_ - spirit::eol))
[boost::bind (&fortran::parse_info::skipping_line, &pi)]
|
// remaining line comment
(spirit::ascii::char_ ('!') >> *(spirit::ascii::char_ - spirit::eol)
[boost::bind (&fortran::parse_info::skipping_line_comment, &pi)])
|
// Continuation
(spirit::eol >> spirit::ascii::blank >>
spirit::qi::repeat(4)[spirit::ascii::char_ - spirit::eol] >> ".")
[boost::bind (&fortran::parse_info::skipping_continue, &pi)]
|
// empty line
(spirit::eol >>
-(spirit::ascii::blank >> spirit::qi::repeat(0, 4)[spirit::ascii::char_ - spirit::eol] >>
*(spirit::ascii::blank) ) >>
&(spirit::eol | spirit::eoi))
[boost::bind (&fortran::parse_info::skipping_empty, &pi)]
|
// whitespace (this needs to be the last alternative).
(spirit::ascii::space - spirit::eol)
[boost::bind (&fortran::parse_info::skipping_space, &pi)]
;
I would advise against blindly using this yourself for line-oriented Fortran, as I ignore line numbers, and different compilers have different rules for valid comment and continuation characters.
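For readers who don't use Spirit, the same rules can be sketched as a plain line-based preprocessing step in Python. This is a simplified illustration, not a port of the skip parser above: it assumes '.' in column six marks a continuation (as in the grammar), treats '!' as starting a trailing comment even inside string literals, and only drops completely blank lines. The function name `preprocess` is hypothetical.

```python
def preprocess(source):
    """Return logical lines with comments, blanks and whitespace removed."""
    logical = []
    for raw in source.splitlines():
        if raw[:1].isalpha():        # letter in column 1: full-line comment
            continue
        if not raw.strip():          # nothing on the line
            continue
        # Continuation: blank in column 1 and '.' in column 6 (assumption).
        is_continuation = len(raw) >= 6 and raw[0] in " \t" and raw[5] == "."
        body = raw[6:] if is_continuation else raw
        body = body.split("!", 1)[0]     # drop a trailing '!' comment
        body = "".join(body.split())     # whitespace is not significant
        if not body:
            continue
        if is_continuation and logical:
            logical[-1] += body          # tack onto the preceding line
        else:
            logical.append(body)
    return logical

example = "\n".join([
    "C a comment line",
    "      X = 1 +",
    "     .    2",
    "",
    "      Y = 3 ! trailing comment",
])
print(preprocess(example))
```

Doing this inside the skip parser, as in the answer, has the advantage that the main grammar never sees the column layout at all.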
