Feed the Stanford Parser with formatted text

I have a phrase in the format "Word_POS-TAG_Lemma Word_POS-TAG_Lemma Word_POS-TAG_Lemma Word_POS-TAG_Lemma....." Is there a way to feed the Stanford Parser this kind of formatted input? Also, is there a way to obtain a tree in the standard dependencies format?
Thank you in advance

See the FAQ: Can I give the parser part-of-speech (POS) tagged input and force the parser to use those tags?
It's definitely possible, though it would probably help to strip off / ignore the lemma forms to make things easier.
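As a rough illustration of the "strip off the lemma" step, here is a minimal Python sketch that rewrites each Word_POS-TAG_Lemma token as Word/POS-TAG. The function name strip_lemmas and the "/" separator are assumptions made for this example (check the FAQ for the separator your parser setup expects), and it assumes that tags and lemmas contain no underscores themselves.

```python
def strip_lemmas(line, sep="/"):
    """Turn 'Word_TAG_Lemma ...' into 'Word<sep>TAG ...', dropping the lemma.

    Hypothetical helper: the name and the default separator are assumptions.
    """
    out = []
    for token in line.split():
        # Split from the right so a word that itself contains '_' stays intact.
        parts = token.rsplit("_", 2)
        if len(parts) != 3:
            raise ValueError(f"expected Word_TAG_Lemma, got {token!r}")
        word, tag, _lemma = parts
        out.append(f"{word}{sep}{tag}")
    return " ".join(out)


if __name__ == "__main__":
    print(strip_lemmas("The_DT_the dogs_NNS_dog barked_VBD_bark"))
```

The resulting line can then be passed to the parser as pre-tagged, whitespace-tokenized input.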

Related

Parsec or Happy (with Alex) or uu-parsinglib

I am going to write a parser for the Verilog (or VHDL) language and will do a lot of manipulations (transformations of various sorts) on the parsed data. I intend to parse really big files (full Verilog designs, as big as 10K lines) and I will ultimately support most of Verilog. I don't mind typing, but I don't want to rewrite existing code whenever I add support for another rule.
In Haskell, which library would you recommend? I know Haskell and have used Happy before (to play). I feel that Parsec makes it possible to transform the parsed string in the code itself (which is a great plus). I have no experience with uu-parsinglib.
So, to parse the full grammar of Verilog/VHDL, which one of them is recommended? My main concern is the ease and 'correctness' with which I can manipulate the parsed data at my whim. Speed is not a primary concern.
I personally prefer Parsec with the help of Alex for lexing.
I prefer Parsec over Happy for two reasons. 1) Parsec is a library, while Happy is a program: with Happy you write your grammar in a separate language and then compile it with Happy. 2) Parsec gives you context-sensitive parsing abilities thanks to its monadic interface. You can carry extra state for context-sensitive parsing and then inspect it to decide what to do next, or just look at a previously parsed value and choose the next parser based on it (like a <- parseSomething; if test a then ... else ...). And when you don't need any context-sensitive information, you can simply use applicative style and get an implementation that reads much like a grammar written for YACC or a similar tool.
As a downside, Parsec will not tell you if your parser contains left recursion; the parser will simply loop forever at runtime (because Parsec is basically a top-down recursive-descent parser). You have to find left recursions and eliminate them yourself. YACC-style parsers can give you some static guarantees and information (like shift/reduce conflicts, unused terminals etc.) that you can't get with Parsec.
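To make the left-recursion point concrete, here is a small sketch in Python rather than Parsec. The rule expr := expr '-' num | num cannot be written directly in a recursive-descent parser, because the function for expr would call itself before consuming any input. The usual rewrite is expr := num ('-' num)*, implemented as a loop that still builds a left-associative tree. The function name parse_expr and the tuple-based tree are illustrative choices only.

```python
def parse_expr(tokens):
    """Parse 'num (- num)*' left-associatively into a nested tuple tree."""
    pos = 0

    def number():
        nonlocal pos
        if pos >= len(tokens) or not tokens[pos].isdigit():
            raise SyntaxError(f"expected a number at token {pos}")
        value = int(tokens[pos])
        pos += 1
        return value

    # Rewritten rule: expr := num ('-' num)*
    # The loop replaces the left-recursive call and keeps '-' left-associative.
    tree = number()
    while pos < len(tokens) and tokens[pos] == "-":
        pos += 1
        tree = ("-", tree, number())
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return tree


if __name__ == "__main__":
    print(parse_expr("10 - 3 - 2".split()))
```

In Parsec, combinators such as chainl1 serve the same purpose as the loop here.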
Alex is highly recommended for lexing in both situations (I think you have to use Alex if you decide to go with Happy). Even if you use Parsec, it really simplifies your parser implementation and catches a great many bugs too (for example, parsing a keyword as an identifier was a common mistake I made while using Parsec without Alex).
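The keyword-versus-identifier problem can be sketched without Alex. The Python example below is only an illustration of what a separate lexing pass buys you: it matches the longest identifier first and only then checks whether the whole match is a keyword, so "modulex" is never mistaken for the keyword "module". The keyword set and the token names are assumptions for this example, not a real Verilog lexer.

```python
import re

# Illustrative subset of keywords; a real Verilog lexer would have many more.
KEYWORDS = {"module", "endmodule", "wire", "assign"}

# Try identifiers first, then numbers, then any other single character.
TOKEN_RE = re.compile(r"\s*(?:(?P<ident>[A-Za-z_]\w*)|(?P<number>\d+)|(?P<sym>\S))")


def lex(text):
    """Return a list of (kind, text) pairs."""
    text = text.rstrip()
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"cannot tokenize at offset {pos}")
        pos = m.end()
        lexeme = m.group(m.lastgroup)
        if m.lastgroup == "ident":
            # Classify only after the whole identifier has been matched.
            kind = "KEYWORD" if lexeme in KEYWORDS else "IDENT"
        else:
            kind = m.lastgroup.upper()
        tokens.append((kind, lexeme))
    return tokens


if __name__ == "__main__":
    print(lex("module modulex;"))
```

The parser then works on token kinds instead of raw characters, so it never has to decide where a keyword ends.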
You can have a look at my Lua parser implemented in Alex+Parsec. And here's the code to use Alex-generated tokens in Parsec.
EDIT: Thanks John L for corrections. Apparently you can do context-sensitive parsing with Happy too. Also, Alex for lexing is not required in Happy, though it's recommended.

Parsing binary data

I got interested in parser generators. But I don't have the theoretical background. I just read a few things on the internet.
Currently I'm trying to do something with ANTLR.
My data frames have a special format:
The first byte of a frame is a tag that describes the nature of the data
The second byte contains the length (number of bytes) of the data itself
Then follows the data itself
The data can itself contain data frames, and data frames can be listed one after the other
I hope my description is clear. My questions:
Can I create a parser with ANTLR that reads the length of the frame and then knows when the frame ends?
In ANTLR can I load the different tags I use from a generated file?
Thank you!
I'm not 100% sure about this, but:
Parser generators like ANTLR expect a context-free grammar.
Using length fields in your data makes your grammar not context-free (context-sensitive, I think).
It is the latter point I'm not sure about; maybe you want to research that some more.
You probably have to write a packet "parser" yourself (which then has to be a parser for your context-sensitive packet grammar).
Alternatively, you could drop the length field, and use something like s-expressions, JSON or xml; these would be parseable by something generated with antlr.
I think you will be better off to create a hand written binary parser instead of using ANTLR because ANTLR is primarily intended to read and make sense of a text file and not binary data. The lexer part is focused on tokenizing text so trying to make it read binary data instead would be an uphill battle.
It sounds as if your structure needs some kind of recursive way of reading the data, although it may be easier to set up a tree structure and fill it in as you read your file.
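A hand-written reader for the format described in the question (one tag byte, one length byte, then the data) can be quite short. The Python sketch below reads frames recursively into a tree. One thing the question does not say is how to tell that a frame's data holds nested frames, so the CONTAINER_TAGS set is purely an assumption for illustration, as are the function name and the (tag, value) tuple layout.

```python
# Hypothetical: tags whose data is itself a sequence of frames.
CONTAINER_TAGS = {0x30}


def parse_frames(data, start=0, end=None):
    """Parse consecutive tag/length/data frames from data[start:end]."""
    end = len(data) if end is None else end
    frames = []
    pos = start
    while pos < end:
        if pos + 2 > end:
            raise ValueError(f"truncated frame header at offset {pos}")
        tag, length = data[pos], data[pos + 1]
        body_start = pos + 2
        body_end = body_start + length
        if body_end > end:
            raise ValueError(f"frame at offset {pos} runs past its container")
        if tag in CONTAINER_TAGS:
            # The length field tells us exactly where the nested frames stop.
            value = parse_frames(data, body_start, body_end)
        else:
            value = bytes(data[body_start:body_end])
        frames.append((tag, value))
        pos = body_end
    return frames


if __name__ == "__main__":
    raw = bytes([0x30, 0x06, 0x01, 0x01, 0xAA, 0x02, 0x01, 0xBB, 0x01, 0x00])
    print(parse_frames(raw))
```

The set of tags could just as well be loaded from a generated file at startup, since it is ordinary data here rather than part of a grammar.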

VBScript Partial Parser

I am trying to create a VBScript parser, and I was wondering what the best way to go about it is. I have done a lot of research. The most popular approach seems to be something like GOLD Parser or ANTLR.
The feature I want to implement is dynamic checking of syntax errors in VBScript. I do not want to compile the entire VBS file every time some text changes. How do I go about doing that? I tried to use GOLD Parser, but I assume it offers no incremental way of parsing, something like partial parse trees. Any ideas on how to implement a partial parse tree for such a scenario?
I have implemented VBScript parsing via GOLD Parser. However, it is still not a partial parser: it parses the entire script after every text change. Is there a way to build such a thing?
Thanks
If you really want to do incremental parsing, consider this paper by Tim Wagner.
It is a brilliant scheme for keeping existing parse trees around: at the points of editing it shuffles mixtures of string fragments and the parse trees representing the parts of the source text that haven't changed, and then reintegrates the strings into the set of parse trees. It is done using an incremental GLR parser.
It isn't easy to implement; I did just the GLR part and never got around to the incremental part.
The GLR part was well worth the trouble.
There are lots of papers on incremental parsing. This is one of the really good ones.
I'd first look for an existing VBScript parser instead of writing your own, which is not a trivial task!
There's a VBScript grammar in BNF format on this page: http://rosettacode.org/wiki/BNF_Grammar which you can translate into an ANTLR (or some other parser generator) grammar.
Before trying to do fancy things like re-parsing only a part of the source, I recommend you first create a parser that actually works.
Best of luck!

Choosing a Haskell parser

There are many open-source parser implementations available to us in Haskell. Parsec seems to be the standard for text parsing and attoparsec seems to be a popular choice for binary parsing, but I don't know much beyond that. Is there a particular decision tree that you follow for choosing a parser implementation? Have you learned anything interesting about the strengths or weaknesses of the libraries?
You have several good options.
For lightweight parsing of String types: parsec, polyparse
For packed bytestring parsing, e.g. of HTTP headers: attoparsec
For actual binary data most people use either:
binary -- for lazy binary parsing
cereal -- for strict binary parsing
The main question to ask yourself is: what is the underlying string type?
String?
bytestring (strict)?
bytestring (lazy)?
Unicode text?
That decision largely determines which parser toolset you'll use.
The second question to ask is: do I already have a grammar for the data type? If so, I can just use Happy, the parser generator.
And obviously for custom data types there are a variety of good existing parsers:
XML: haxml, xml-light, hxt, hexpat
CSV: bytestring-csv, csv
JSON: json
RSS/Atom: feed
Just to add to Don's post: Personally, I quite like Text.ParserCombinators.ReadP (part of base) for no-nonsense quick and easy stuff. Particularly when Parsec seems like overkill.
There is a bytestringreadp library for the bytestring version, but it doesn't cover Char8 bytestrings, and I suspect attoparsec would be a better choice at this point.
I recently converted some code from Parsec to Attoparsec. Both are quite capable.
Attoparsec wins on performance and memory footprint, but Parsec provides better error reporting and has more complete documentation.
Bryan O’Sullivan’s blog post "What’s in a parser? Attoparsec rewired (2/2)" includes a nice performance benchmark comparing several implementations, along with some comments comparing memory usage.

Approaching Text Parsing in Scala

I'm making an application that will parse commands in Scala. An example of a command would be:
todo get milk for friday
So the plan is to have a pretty smart parser break the line apart and recognize the command part and the fact that there is a reference to time in the string.
In general I need to make a tokenizer in Scala, so I'm wondering what my options are. I'm familiar with regular expressions, but I also plan on making an SQL-like search feature:
search todo for today with tags shopping
And I feel that regular expressions will be too inflexible for implementing commands with a lot of variation. This leads me to think of implementing some sort of grammar.
What are my options in this regard in Scala?
You want to search for "parser combinators". I have a blog post using this approach (http://cleverlytitled.blogspot.com/2009/04/shunting-yard-algorithm.html), but I think the best reference is this series of posts by Stefan Zeiger (http://szeiger.de/blog/2008/07/27/formal-language-processing-in-scala-part-1/)
Here are slides from a presentation I did in Sept. 2009 on Scala parser combinators. (http://sites.google.com/site/compulsiontocode/files/lambdalounge/ImplementingExternalDSLsUsingScalaParserCombinators.ppt) An implementation of a simple Logo-like language is demonstrated. It might provide some insights.
Scala has a parser library (scala.util.parsing.combinator) which enables one to write a parser directly from its EBNF specification. If you have an EBNF for your language, it should be easy to write the Scala parser. If not, you'd better first try to define your language formally.
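To show the general idea of parser combinators on the two example commands, here is a minimal sketch written in Python rather than with scala.util.parsing.combinator. Every name in it (lit, words_until, seq, opt, alt, parse) is made up for this illustration; Scala's library provides its own equivalents for sequencing, alternation, options and repetition. Each parser is a function that takes a token list and a position and returns a (value, new_position) pair, or None on failure.

```python
def lit(word):
    """Match exactly one given word."""
    def p(toks, i):
        return (word, i + 1) if i < len(toks) and toks[i] == word else None
    return p


def words_until(*stop):
    """Match one or more words, stopping before any stop word."""
    def p(toks, i):
        j = i
        while j < len(toks) and toks[j] not in stop:
            j += 1
        return (toks[i:j], j) if j > i else None
    return p


def seq(*parsers):
    """Run parsers one after another and collect their values."""
    def p(toks, i):
        values = []
        for q in parsers:
            r = q(toks, i)
            if r is None:
                return None
            value, i = r
            values.append(value)
        return values, i
    return p


def opt(parser):
    """Make a parser optional; yields None when it does not match."""
    def p(toks, i):
        r = parser(toks, i)
        return r if r is not None else (None, i)
    return p


def alt(*parsers):
    """Try each alternative in order and return the first success."""
    def p(toks, i):
        for q in parsers:
            r = q(toks, i)
            if r is not None:
                return r
        return None
    return p


# Grammar for the two example commands from the question.
todo = seq(lit("todo"), words_until("for"), opt(seq(lit("for"), words_until())))
search = seq(lit("search"), words_until("for"),
             opt(seq(lit("for"), words_until("with"))),
             opt(seq(lit("with"), lit("tags"), words_until())))
command = alt(todo, search)


def parse(line):
    toks = line.split()
    result = command(toks, 0)
    if result is None or result[1] != len(toks):
        raise ValueError(f"cannot parse: {line!r}")
    return result[0]


if __name__ == "__main__":
    print(parse("todo get milk for friday"))
    print(parse("search todo for today with tags shopping"))
```

Adding a new command means writing one more grammar line and listing it in alt, which is the flexibility that a single regular expression lacks.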
