Parsing a file following a pattern - parsing

I need to parse a file with MediaWiki syntax (tables).
I know sed or awk could do it, but I'm no expert in either.
I need to find the following pattern:
beginning_of_line| [[text]] || random_stuff_until_newline
There may or may not be spaces between the pipes and the brackets, and I need to output the text.
Any solutions for me?
Thx

Parsing text like that is like parsing XML or HTML. Regexes aren't really well suited for that type of document. You should try to find a Python or Perl module that is suited for the job.
However, here is a sed command that will work in the simple case you provided as an example.
sed 's/^[^|]*|[[:space:]]*\[\[\([^]]\+\)\]\].*/\1/' inputfile

I would look for a Mediawiki parser. It must exist somewhere.
Failing that, if you have a grammar for MediaWiki you can generate a parser using ANTLR or similar, depending on what kind of grammar it is.
If you don't have a grammar, or don't want to do that because of the learning curve, then you need some reliable way to distinguish between what you're calling "text" and what you're calling random stuff. Are the pipes guaranteed to be there? If so, in Java you can just do String.split() using the pipes as the argument to split.
Is this what you mean?
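For illustration, the same split idea can be sketched in Python instead of Java. This assumes every row of interest has the shape shown in the question (one pipe, a [[...]] link, then a double pipe); the function name link_text is made up for the example.

```python
def link_text(line):
    """Return the text inside [[...]] in the first cell, or None if the line does not fit."""
    # Everything before the first "||" is the first cell plus its leading pipe.
    head, sep, _rest = line.partition("||")
    if not sep:
        return None
    # Drop whatever precedes the first single pipe.
    _prefix, bar, cell = head.partition("|")
    if not bar:
        return None
    cell = cell.strip()  # spaces around the brackets are optional
    if cell.startswith("[[") and cell.endswith("]]") and len(cell) > 4:
        return cell[2:-2]
    return None


if __name__ == "__main__":
    print(link_text("| [[text]] || random_stuff_until_newline"))
```

Splitting like this is fragile if the link text itself can contain "||", so it only suits input as regular as the example.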

This might work for you (GNU sed):
sed 's/^[^|]*|\s*\[\[\([^]]*\(][^]]*\)*\)]]\s*||.*/\1/;t;d' file
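If a scripting language is an option, roughly the same match can be written with Python's re module. This is a sketch under the same assumptions as the sed command (one row per line, the link sits in the first cell and is followed by ||); the names ROW and extract_links are invented for the example.

```python
import re

# Optional prefix without pipes, one pipe, optional spaces, [[text]],
# optional spaces, then the double pipe that ends the first cell.
ROW = re.compile(r"^[^|]*\|\s*\[\[(.+?)\]\]\s*\|\|")


def extract_links(lines):
    """Yield the bracketed text of every line that fits the pattern; skip the rest."""
    for line in lines:
        match = ROW.match(line)
        if match:
            yield match.group(1)


if __name__ == "__main__":
    sample = ["| [[text]] || random_stuff", "not a table row"]
    for text in extract_links(sample):
        print(text)
```

Like the sed version, lines that do not match are dropped rather than printed unchanged.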

Related

Parsing and pretty printing the same file format in Haskell

I was wondering, if there is a standard, canonical way in Haskell to write not only a parser for a specific file format, but also a writer.
In my case, I need to parse a data file for analysis. However, I also simulate data to be analyzed and save it in the same file format. I could now write a parser using Parsec or something equivalent and also write functions that perform the text output in the way that it is needed, but whenever I change my file format, I would have to change two functions in my code. Is there a better way to achieve this goal?
Thank you,
Dominik
The BNFC-meta package https://hackage.haskell.org/package/BNFC-meta-0.4.0.3
might be what you are looking for:
"Specifically, given a quasi-quoted LBNF grammar (as used by the BNF Converter) it generates (using Template Haskell) a LALR parser and pretty pretty printer for the language."
update: found this package that also seems to fulfill the objective (not tested yet) http://hackage.haskell.org/package/syntax
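The underlying idea, one description of the format from which both the reader and the writer are derived, can be illustrated independently of those packages. Below is a minimal Python sketch, not Haskell and not the API of either package. It assumes a hypothetical record format of "name=value" pairs separated by ";", and FIELDS, write_record and parse_record are invented names.

```python
# Single source of truth: field order, names, and how to convert each value.
# Changing the format means editing only this list.
FIELDS = [("id", int), ("mass", float), ("label", str)]


def write_record(record):
    """Serialise a dict to one line, in the order given by FIELDS."""
    return ";".join(f"{name}={record[name]}" for name, _convert in FIELDS)


def parse_record(line):
    """Parse one line back into a dict, checking names against FIELDS."""
    chunks = line.split(";")
    if len(chunks) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(chunks)}")
    record = {}
    for (name, convert), chunk in zip(FIELDS, chunks):
        key, _eq, raw = chunk.partition("=")
        if key != name:
            raise ValueError(f"expected field {name!r}, got {key!r}")
        record[name] = convert(raw)
    return record


if __name__ == "__main__":
    original = {"id": 3, "mass": 1.5, "label": "run-a"}
    line = write_record(original)
    print(line)
    print(parse_record(line) == original)
```

The round trip only holds for values that contain no ";" or "=", which is one of the simplifications of the sketch.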

POSIX sh EBNF grammar

Is there an existing POSIX sh grammar available or do I have to figure it out from the specification directly?
Note I'm not so much interested in a pure sh; an extended but conformant sh is also more than fine for my purposes.
The POSIX standard defines the grammar for the POSIX shell. The definition includes an annotated Yacc grammar. As such, it can be converted to EBNF more or less mechanically.
If you want a 'real' grammar, then you have to look harder. Choose your 'real shell' and find the source and work out what the grammar is from that.
Note that EBNF is not used widely. It is of limited practical value, not least because there are essentially no tools that support it. Therefore, you are unlikely to find an EBNF grammar (of almost anything) off-the-shelf.
I have done some more digging and found these resources:
An sh tutorial located here
A Bash book containing Bash 2.0's BNF grammar (gone from here) with the relevant appendix still here
I have looked through the sources of bash, pdksh, and posh but haven't found anything remotely at the level of abstraction I need.
I've made multiple attempts at writing my own full-blown Bash interpreter over the past year, and at some point I also reached the same book appendix referenced in the marked answer (#2). It is not completely correct or up to date: for example, it doesn't define production rules using the 'coproc' reserved keyword, and it has a duplicate production rule for a redirection using '<&'. There might be more problems, but those are the ones I've noticed.
The best way I've found is to go to http://ftp.gnu.org/gnu/bash/
Download the current Bash version's sources
Open the parse.y file (the Yacc file that contains all the parsing logic Bash uses) and copy the lines between the '%%' markers into your favorite text editor; those define the grammar's production rules
Then, using a little bit of regex (which I'm terrible at, btw), delete the extra code logic between '{...}' to make the grammar look more BNF-like.
The regex I used was:
(\{(\s+.*?)+\})\s+([;|])
It matches anything non-greedily (.*?), including spaces and newlines (\s+), between curly braces, specifically up to the last closing brace before a ; or | character. I then replaced each match with \3 (i.e. the contents of the third capturing group, which is either ; or |).
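As an alternative to the regex, the action blocks can also be removed by counting brace depth. The Python sketch below does that; it is a different technique from the regex above, the function name strip_actions is invented, and it assumes the braces are balanced and that no C string literal or comment inside an action contains a brace. Quoted one-character tokens such as '{' are kept, since a Yacc grammar for a shell uses them as terminals.

```python
def strip_actions(rules):
    """Remove { ... } action blocks from the rules section of a Yacc file."""
    out = []
    depth = 0  # current nesting level of action braces
    i = 0
    n = len(rules)
    while i < n:
        ch = rules[i]
        # A quoted single character like '{' is a token, not an action brace.
        if ch == "'" and i + 2 < n and rules[i + 2] == "'":
            if depth == 0:
                out.append(rules[i:i + 3])
            i += 3
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
        elif depth == 0:
            out.append(ch)  # keep grammar text that is outside any action
        i += 1
    return "".join(out)


if __name__ == "__main__":
    sample = "group : '{' list '}'\n\t{ $$ = make($2); }\n\t| word\n\t{ if (x) { y(); } }\n\t;\n"
    print(strip_actions(sample))
```

The output still contains the blank lines where the actions used to be; collapsing them is left out to keep the sketch short.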
Here's the grammar definition that I managed to extract at the time of posting https://pastebin.com/qpsK4TF6
I'd expect sh, csh, ash, and bash to contain parsers. GNU versions of these are open source; you might just go check there.

VBScript Partial Parser

I am trying to create a VBScript parser. I was wondering what is the best way to go about it. I have researched and researched. The most popular way seems to be going for something like Gold Parser or ANTLR.
The feature I want to implement is dynamic checking of syntax errors in VBScript. I do not want to compile the entire VBS every time some text changes. How do I go about doing that? I tried to use Gold Parser, but I assume there is no incremental way of parsing with it, something like partial parse trees... Any ideas on how to implement a partial parse tree for such a scenario?
I have implemented VBScript parsing via GOLD Parser. However, it is still not a partial parser: it parses the entire script after every text change. Is there a way to build such a thing?
thks
If you really want to do incremental parsing, consider this paper by Tim Wagner.
It is a brilliant scheme to keep existing parse trees around, shuffling mixtures of string fragments at the points of editing and parse trees representing the parts of the source text that haven't changed, and reintegrating the strings into the set of parse trees. It is done using an incremental GLR parser.
It isn't easy to implement; I did just the GLR part and never got around to the incremental part.
The GLR part was well worth the trouble.
There are lots of papers on incremental parsing. This is one of the really good ones.
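To give a feel for the "reuse what hasn't changed" idea, here is a much simpler stand-in in Python. It is not Wagner's algorithm and not a GLR parser: it only caches the result of a per-line check, so an edit re-checks just the lines whose text changed. It assumes errors can be detected one line at a time, which is not true of block constructs; LineCache and check_parens are invented names, and check_parens is a toy check rather than a VBScript syntax checker.

```python
def check_parens(line):
    """Toy per-line check: report unbalanced parentheses, else None."""
    depth = 0
    for ch in line:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return "unbalanced parentheses"
    return "unbalanced parentheses" if depth else None


class LineCache:
    """Re-run the check only for line texts that have not been seen before."""

    def __init__(self, check):
        self._check = check
        self._cache = {}  # line text -> error message or None
        self.calls = 0    # how many times the real check actually ran

    def errors(self, source):
        found = []
        for number, line in enumerate(source.splitlines(), start=1):
            if line not in self._cache:
                self.calls += 1
                self._cache[line] = self._check(line)
            message = self._cache[line]
            if message is not None:
                found.append((number, message))
        return found


if __name__ == "__main__":
    cache = LineCache(check_parens)
    print(cache.errors("x = f(1\ny = 2"))
    print(cache.errors("x = f(1)\ny = 2"), cache.calls)
```

A real incremental parser reuses subtrees rather than whole lines, which is what makes it both more powerful and harder to implement.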
I'd first look for an existing VBScript parser instead of writing your own, which is not a trivial task!
There's a VBScript grammar in BNF format on this page: http://rosettacode.org/wiki/BNF_Grammar which you can translate into an ANTLR (or some other parser generator) grammar.
Before trying to do fancy things like re-parsing only a part of the source, I recommend you first create a parser that actually works.
Best of luck!

Writing a subshell parsing rule on ANTLR

I'm trying to create a simple BaSH-like grammar on ANTLRv3 but haven't been able to parse (and check) input inside subshell commands.
Further explanation:
I want to parse the following input:
$(command parameters*)
`command parameters`
"some text $(command parameters*)"
And be able to check its contents as I would with simple input such as: command parameters.
i.e.:
Parsing it would generate a tree like (SUBSHELL (CMD command (PARAM parameters*))) (tokens are in upper-case)
I'm able to ignore '$('s and '`'s, but that won't cover the cases where the subshells are used inside double-quoted strings, like:
$ echo "String test $(ls -l) end"
So... any tips on how do I achieve this?
I'm not very familiar with the details of Antlr v3, but I can tell you that you can't handle bash-style command substitution inside double-quoted strings in a traditional-style lexer, as the nesting cannot be expressed using a regular grammar. Most traditional compiler-compilers restrict lexers to use regular grammars so that efficient DFAs can be constructed for them. (Lexers, which irreducibly have to scan every single character of the source, have historically been one of the slowest parts of a compiler.)
You must either parse " as a token and (ideally) use a different lexer or lexer mode for the internals of strings, so that most shell metacharacters, e.g. '{', aren't parsed as tokens but as text; or alternatively, do away with the lexer-parser division and use a scannerless approach, so that the "lexer" rule for double-quoted strings can call into the "parser" rule for command substitutions.
I would favour the scannerless approach. I would investigate how well Antlr v3 supports writing grammars that work directly over a character stream, rather than using a token stream.
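To make the scannerless idea concrete, here is a small hand-written recursive-descent sketch in Python that works directly on characters. It is not ANTLR and says nothing about what ANTLR v3 supports; it only shows the rule for double-quoted strings calling back into the rule for command substitution. It handles plain words, "...", and $(...) only (no backticks, escapes, or other shell syntax), and all function names and node labels other than SUBSHELL, CMD and PARAM are invented.

```python
def _is_word_char(s, j, stop):
    """True if s[j] can be part of a plain word."""
    return not (s[j].isspace() or s[j] in stop or s[j] == '"' or s.startswith("$(", j))


def parse_command(s, i, stop):
    """Parse words until a character in `stop` or the end of input."""
    words = []
    while i < len(s) and s[i] not in stop:
        if s[i].isspace():
            i += 1
        elif s[i] == '"':
            node, i = parse_dquote(s, i + 1)
            words.append(node)
        elif s.startswith("$(", i):
            node, i = parse_subshell(s, i + 2)
            words.append(node)
        else:
            j = i
            while j < len(s) and _is_word_char(s, j, stop):
                j += 1
            words.append(s[i:j])
            i = j
    if not words:
        raise SyntaxError("empty command")
    return ("CMD", words[0], ("PARAM", *words[1:])), i


def parse_subshell(s, i):
    """Parse the inside of $( ... ); `i` points just after the '$('."""
    node, i = parse_command(s, i, stop=")")
    if i >= len(s) or s[i] != ")":
        raise SyntaxError("unterminated $(")
    return ("SUBSHELL", node), i + 1


def parse_dquote(s, i):
    """Parse the inside of a double-quoted string; `i` points after the opening quote."""
    parts, text = [], []
    while i < len(s) and s[i] != '"':
        if s.startswith("$(", i):
            if text:
                parts.append("".join(text))
                text = []
            # The "lexer-level" string rule calls the "parser-level" rule.
            node, i = parse_subshell(s, i + 2)
            parts.append(node)
        else:
            text.append(s[i])
            i += 1
    if i >= len(s):
        raise SyntaxError("unterminated string")
    if text:
        parts.append("".join(text))
    return ("DQUOTE", *parts), i + 1


def parse(source):
    """Parse a whole command line into a nested tuple tree."""
    node, _end = parse_command(source, 0, stop="")
    return node


if __name__ == "__main__":
    print(parse('echo "String test $(ls -l) end"'))
```

Because parse_dquote and parse_subshell call each other, arbitrarily nested substitutions inside strings come for free, which is exactly what a regular-language lexer cannot express.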

Tools for command line file parsing in cygwin

I have to deal with text files in a motley selection of formats. Here's an example (Columns A and B are tab delimited):
A B
a Name1=Val1, Name2=Val2, Name3=Val3
b Name1=Val4, Name3=Val5
c Name1=Val6, Name2=Val7, Name3=Val8
The files could have headers or not, have mixed delimiting schemes, have columns with name/value pairs as above etc.
I often have the ad-hoc need to extract data from such files in various ways. For example from the above data I might want the value associated with Name2 where it is present. i.e.
A B
a Val2
c Val7
What tools/techniques are there for performing such manipulations as one line commands, using the above as an example but extensible to other cases?
I don't like sed too much, but it works for such things:
var="Name2";sed -n "1p;s/\([^ ]*\) .*$var=\([^ ,]*\).*/\1 \2/p" < filename
Gives you:
A B
a Val2
c Val7
You have all the basic bash shell commands, for example grep, cut, sed and awk at your disposal. You can also use Perl or Ruby for more complex things.
From what I've seen I'd start with Awk for this sort of thing and then if you need something more complex, I'd progress to Python.
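For the Python end of that progression, a minimal sketch for the example in the question might look like the following. It assumes the two columns are tab-delimited, the first line is a header, and the pairs in column B are separated by commas; extract_field is an invented name.

```python
def extract_field(lines, name, has_header=True):
    """Return 'key<TAB>value' rows for every line whose pairs include `name`."""
    out = []
    for index, line in enumerate(lines):
        line = line.rstrip("\n")
        if has_header and index == 0:
            out.append(line)  # pass the header through unchanged
            continue
        key, _tab, rest = line.partition("\t")
        # Turn "Name1=Val1, Name2=Val2" into {"Name1": "Val1", "Name2": "Val2"}.
        pairs = dict(p.strip().split("=", 1) for p in rest.split(",") if "=" in p)
        if name in pairs:
            out.append(f"{key}\t{pairs[name]}")
    return out


if __name__ == "__main__":
    data = [
        "A\tB",
        "a\tName1=Val1, Name2=Val2, Name3=Val3",
        "b\tName1=Val4, Name3=Val5",
        "c\tName1=Val6, Name2=Val7, Name3=Val8",
    ]
    print("\n".join(extract_field(data, "Name2")))
```

Other delimiting schemes would need the partition and split arguments changed, which is the part a small personal library could hide.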
I would use sed:
# print section of file between two regular expressions (inclusive)
sed -n '/Iowa/,/Montana/p' # case sensitive
Since you have cygwin, I'd go with Perl. It's the easiest to learn (check out the O'Reilly book: Learning Perl) and widely applicable.
I would use Perl. Write a small module (or more than one) for dealing with the different formats. You could then run Perl one-liners using that library. An example of what it would look like:
perl -e 'use Parser;' -e 'parser("in.input")->get("Name2");'
Don't quote me on the syntax, but that's the general idea. Abstract the task at hand to allow you to think in terms of what you need to do, not how you need to do it. Ruby would be another option, it tends to have a cleaner syntax, but either language would work.

Resources