Modern dependency parser for Russian - parsing

Is there any modern part-of-speech tagger + dependency parser for Russian language?
I need a tool or service that will be able to process plain text and output:
division into sentences
division into tokens
part-of-speech tags (fine-grained MSD tags are welcome)
lemmas (base forms)
dependency role labels
I need the tool for commercial purposes.
It could be either an open-source project with a trained statistical model that can be used for commercial purposes (purchased if needed) or an web API.
Eventually it could be a proprietary closed-source binary with a proprietary model.
The parsing models for Russian than I've found online all require the use of TreeTagger, which 1) has a very unfriendly licence, 2) is over 20 years old.

In order to build a (good) dependency parser you require a dependency treebank. All the teams who build their dependency parsers have access to such treebanks, but they would not be allowed to pass on the data. Therefore you can get the parser, but usually not the pretrained model.
That is why you have to train a model yourself. For Russian there exists a dependency treebank (SynTagRus). I don't know whether you will be able to get it for commercial purposes. Maybe these sites will help you:
https://github.com/UniversalDependencies/UD_Russian-SynTagRus
https://habrahabr.ru/post/148124/
http://www.ruscorpora.ru/index.html
If you manage to get the data, training your own model is a very easy task. Either ask here again or you will definitely find enough guides on the internet (training a parser is quite the same whether it is Russian or any other language)

Related

Add a language in the Stanford parser

I would like to use the Stanford parser in another language not already implemented.
I looked on the website but found nothing that could help me with that.
I guess what I have to do is "just" create a new languagePCFG.ser but to do that?
Also, if anyone knows if French and Spanish are supposed to be released?
Several things are needed:
You need a treebank (set of hand-parsed trees) from which the probabilities used in the parser are calculated
You need language-specific files (like xLanguagePack, xTreebankParserParams, which specify things about the language, treebank encoding, and parsing options
You then train the parser on the treebank to produce the grammar file (see makeSerialized.csh in the distribution)
You might need a language-specific tokenizer to divide text into tokens
If you want Stanford Dependencies output, then there is also a rule-based layer that defines the dependencies
Starting in 2011, we did start distributing a French model with the Stanford Parser. And starting in 2015, we have begun distributing a Spanish model.

Text parsing library

A colleague of mine works on an universal text parsing library, based on C# lambdas. The core looks cool, but unfortunately to me he has hardcoded a grammar, specifical to his private task -- math expression evaluating. So, I will not use it as I had intended before I saw the API. And now I'm looking for another lib, that meets at least some of my requirements. It has to:
Be able to load a grammar from an external file -- say, XML, YML or JSON.
Return AST from grammar and parsed tree that is built from any text.
Work fast enough to load C# grammar then parse a large code file.
I'd prefer the library that has grammar format file simple enough for easy writing a grammar for math expressions, is open source and written in C# or C++.
Regards,
--
UPDATED: point 2 has been corrected.
You might check out Text Transformer which claims to be some kind of universal text processing language. I have no specific experience with it.
Building robust langauge front ends and usable processing tools is actually a lot of work.
If you want to process computer languages in a generic way, you might consider our DMS Software Reengineering Toolkit, a kind of generalized compiler technology for parsing, analyzing, transforming, and/or generating code (or any other kind of formal document).
DMS will accept arbitrary context free grammars for langauges, automatically builds an AST with no additional specification effort on your part, and is designed to handle not only large files but very large sets of files in a single computation. Normally people
that want to process code need pattern recognition, code analysis and code transformation capabilities; DMS has all of these built in. It also has a variety of predefined, mature grammars for a wide variety of computer langauges, well-known (C, C++, C#, COBOL, Java, JavaScript, ... ) and otherwise (Natural, EGL, Python, MATLAB, ...), and has been used to carry out massive automated analyses and transformations on programs in these various langauges.
DMS does not meet your open-source or C#/C++ implementation requirements. It is implemented as a set of domain-specific langauges for describing grammars, analyzers, transformations, prettyprinters, and scripting that allows parallel execution to enable complex analyses to run faster than single-threaded programs.

Source of parsers for programming languages?

I'm dusting off an old project of mine which calculates a number of simple metrics about large software projects. One of the metrics is the length of files/classes/methods. Currently my code "guesses" where class/method boundaries are based on a very crude algorithm (traverse the file, maintaining a "current depth" and adjusting it whenever you encounter unquoted brackets; when you return to the level a class or method began on, consider it exited). However, there are many problems with this procedure, and a "simple" way of detecting when your depth has changed is not always effective.
To make this give accurate results, I need to use the canonical way (in each language) of detecting function definitions, class definitions and depth changes. This amounts to writing a simple parser to generate parse trees containing at least these elements for every language I want my project to be applicable to.
Obviously parsers have been written for all these languages before, so it seems like I shouldn't have to duplicate that effort (even though writing parsers is fun). Is there some open-source project which collects ready-to-use parser libraries for a bunch of source languages? Or should I just be using ANTLR to make my own from scratch? (Note: I'd be delighted to port the project to another language to make use of a great existing resource, so if you know of one, it doesn't matter what language it's written in.)
If you want language-accurate parsing, especially in the face of language complications such as macros and preprocessor conditionals, you need full language parsers. These are actually quite a lot of work to construct, and most languages don't lend themselves nicely to the various kinds of parser generators around. Nor are most authors of a language parser interested in other langauges; they tend to choose some parser generator that isn't obviously a huge roadblock when they start, implement their parser for the specific purpose they intend, and move on.
Consequence: there are very few libraries of language definitions around that are defined using a single formalism or a shared foundation. The ANTLR crowd maintains one of the larger sets IMHO, although as far as I can tell most of those parsers are not-quite-production capable. There's always Bison, which has been around long enough so you'd expect a library of langauge definitions to be collected somewhere, but I've never seen one.
I've spent the last 15 years defining foundation machinery for program analysis and transformation, and building another such library, called the DMS Software Reengineering Toolkit. It has production quality parsers for C, C++, C#, Java, COBOL (IBM Enterprise version), JCL, PHP, Python, etc. Your opinion may of course vary from mine but these are used daily with DMS to carry out mass change tasks on large bodies of code.
I don't know of any others where the set of langauge definitions are mature and built on a single foundation... it may be that IBM's compilers are such a set, but IBM doesn't offer out the machinery or the language definitions.
If all you want to do is compute simple metrics, you might be able to live with just lexers and ad hoc nest-counting (as you've described). Even that's harder than it looks to make it work right in most cases (check out Python's, Perl's and PHP crazy string syntaxes). When all is said and done, even C is a surprising amount of work just to define an accurate lexer: we have several thousand lines of sophisticated regular expressions to cover all the strange lexemes you find in Microsoft and/or GNU C.
Because DMS has consistently-defined, mature parsers for many languages, it follows that DMS has consistently defined, mature lexers for the same langauges. We actually build a Source Code Search Engine (SCSE) that provides fast search across large bodies of codes in multiple languages that works by lexing the languages it encounters and indexing those lexemes for fast lookup. The SCSE just so happens to compute the kind of metrics you are discussing, too, as it indexes the code base, pretty much the way you describe, except that it has these langauage accurate lexers to use.
You might be interested in gcc-xml if you are parsing C++. Java CUP has grammars for the Java language.

Looking for an information retrival / text mining application or library

We extract various information from e-mails - flights, car rentals, hotels and more. the method is to extract the body of the mail, usually in HTML form but sometime it's text or we use the information in a PDF/Word/RTF attachment. We then apply regular expressions (sometimes in several steps) in order to get information, which is provided in a tabular form (you can think of a flight table, hotel table, etc.). Notice, even though we parse HTML, this is not web scraping.
Currently we are using QL2's WebQL engine, but we are looking to replace it from business reasons. Can you recommend on another engine? It must run on Linux and be accessible from Java (a Java API would be the the best, but Web services are good solution as well). It also must support regular expressions for text extraction and not just to be based on the HTML structure.
I recommend that you have a look at R. It has an extensive number of text mining packages: have a look at the Natural Language Processing view. In particular, look at the tm package. Here are some relevant links:
Paper about the package in the Journal of Statistical Computing: http://www.jstatsoft.org/v25/i05/paper. The paper includes a nice example of an analysis of the R-devel
mailing list (https://stat.ethz.ch/pipermail/r-devel/) newsgroup postings from 2006.
Package homepage: http://cran.r-project.org/web/packages/tm/index.html
Look at the introductory vignette: http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf
In addition, R provides many tools for parsing HTML or XML. Have a look at this question for an example using the RCurl and XML packages.
Edit: You can integrate R with Java with JRI. It's a very widely used package, with many examples. You can also see these related questions.
Have a look at:
LingPipe - LingPipe is a suite of Java libraries for the linguistic analysis of human language.
Lucene - Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java.
Just wanted to update - our final decision was to implement the parsing in groovy, and to add some required functionality (html to text, pdf to text, clean whitespace, etc.) either by implementing it in Java ot by relying on 3rd party libraries.
I use a custom parser made with Flex and C++ for similar purposes. I'd suggest you take a look at parser generators in java (javaCC .jj files) javacc-faq Nutch does it this way. (NutchAnalysis.jj)

Standard format for concrete and abstract syntax trees

I have an idea for a hobby project which performs some code analysis and manipulation. This project will require both the concrete and abstract syntax trees of a given source file. Additionally, bi-directional references between the two trees would be helpful. I would like to avoid the work of transcribing a grammar to construct my own lexer and parser.
Is there a standard format for describing either concrete or abstract syntax trees?
Do any widely-used tool chains support outputting to these formats?
I don't have a particular target programming language in mind. Any popular one will do for a prototype, but I'd prefer one I know well: Python, C#, Javascript, or C/C++.
I'd like the ability to run a source file through a tool or library and get back both trees. In an ideal world, it would be practical to run this tool on code as it is being edited by a user and be tolerant of errors. Again, I am simply trying to develop a prototype, so these requirements are pretty lax.
Thanks!
The research community decided that graph exchange was the right thing to do when moving information from one program analysis tool to another.
See http://www.gupro.de/GXL
More recently, the OMG has defined a standard for interchanging Abstract Syntax Trees.
See http://www.omg.org/spec/ASTM/1.0/Beta1/
This problem seems to get solved over and over again.
There's half a dozen "tool bus" proposals made over the years
that all solved it, with no one ever overtaking the industry.
The problem is that a) it is easy to represent ASTs using
any kind of nestable notation [parentheses like LISP,
like XML, ...] so people roll their own solution easily,
and b) for one tool to exchange an AST with another, they
both have to agree essentially on what the AST nodes mean;
but most ASTs are rather accidentally derived from the particular
grammar/parsing technology used by each tool, and there's
almost always disagreement about that between tools.
So, I've seen very few tools that exchange ASTs meaningfully.
If you're doing a hobby thing, I'd stick with a lisp-like
encoding of trees, where each node has the following format:
( ... )
Its easy to generate, and easy to read.
I work on a professional tool to manipulate programs. If we
have print out the AST, we do the above. Mostly individual
ASTs are far too complicated to look at in practice,
so we hardly ever print out the entire AST, at best only
a node and a few children deep. Our tool doesn't exchange
ASTs with anybody (see above reasons :) but does just
fine building it in memory, doing whizzy things with it
for analysis reasons or transformation reasons, and then
either just deleteing it (no need to send it anywhere)
or regenerating the original language text from the tree.
[The latter means you need anti-parsing or "prettyprinting"
technology]
In our project we defined the AST metamodel in UML and use ANTLR (Java) to populate the model. We also maintain the token information from ANTLR after parsing, but we have not yet tried to update the underlying text-file with modifications made on the model.
This has a hideous overhead (in infrastructure, such as Eclipse UML2/EMF), but our goal is to use high-level tools for Model-based/driven Development (MDD, MDA) anyway, so we decided to use it on each level.
I think one of our students once played with OpenArchitectureWare and managed to get changes from the Eclipse-based, generated editor back into the syntax tree (not related to the UML model above) automatically, but I don't know the details about this.
You might also want to look at ANTLR's tree grammars.
Specific standards are an expectation, while more general purpose standards may also be appropriate. Ira Baxter already mentioned GXL, and RDF may be added too, just that it would require an appropriate ontology and is more oriented toward semantic than syntax. Still may be an option to investigate.
For specific standards, Ira Baxter already mentioned ASTM, another one, although it rather targets a specific kind of programming language (logic languages), is a standard for semantic/conceptual graph, known as ISO‑IEC 24707 2007.
Not a standard on its own, but a paper about that matter: Towards Portable Source Code Representations Using XML
.
I don't know any effectively used standard (in this area, that's always house‑made cooking everywhere), I'm just interested too in this topic.

Resources