Multi-(programming)-language tokenization - parsing

Is there a tool (or the like) that does multi-(programming)-language tokenization? The input would be a source code file; the tool should then auto-detect the language, tokenize the file, and output the tokens as XML/JSON/...
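To make the desired behaviour concrete, here is a minimal, purely illustrative Java sketch of such a pipeline. The extension table and the regex-based "lexer" are placeholders invented for this example, not an existing tool; a real implementation would delegate to proper per-language lexers (e.g. ANTLR- or tree-sitter-generated ones).

```java
// Illustrative only: crude language detection by file extension, a placeholder
// regex "lexer", and JSON output built by hand.
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenDump {
    // Placeholder tokenizer: identifiers, integers, or any single non-space character.
    private static final Pattern TOKEN = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*|\\d+|\\S");

    public static void main(String[] args) throws Exception {
        Path file = Path.of(args[0]);
        String name = file.getFileName().toString();
        String language = name.endsWith(".java") ? "java"
                        : name.endsWith(".py")   ? "python"
                        : name.endsWith(".c")    ? "c"
                        : "unknown"; // "auto detection" reduced to the file extension

        StringBuilder json = new StringBuilder();
        json.append("{\"language\":\"").append(language).append("\",\"tokens\":[");
        Matcher m = TOKEN.matcher(Files.readString(file));
        boolean first = true;
        while (m.find()) {
            if (!first) json.append(',');
            String escaped = m.group().replace("\\", "\\\\").replace("\"", "\\\"");
            json.append("{\"text\":\"").append(escaped).append("\"}");
            first = false;
        }
        json.append("]}");
        System.out.println(json);
    }
}
```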

Related

How to convert an ANTLR4 grammar file to a tree-sitter grammar file?

Does anyone know of any tool(s) that can convert ANTLR v4 grammar files (.g4 extension) to tree-sitter grammar files (.js extension)? It would also be fine if I had to chain a couple of conversion tools together. For example, going from foo.g4 (antlr4) to foo.ebnf (intermediary format) to foo.js (tree-sitter). Thank you!
I tried using this tool to go from g4 to ebnf, and then this tool to go from ebnf to tree-sitter js, but to no avail. The first tool seemed to create some junk at the bottom of the file which gave the second tool trouble. Additionally, the second tool seems to expect each definition to be completely on one line (and the first tool breaks each definition up into multiple lines for readability).

How to generate a parser generator using Xtext?

I am planning to implement a meta language on top of Xtext. In other words, I am using the Xtext grammar to define my own meta language. This meta language can then be used to define a language (using the syntax that I defined). Using the defined language, a model can be created by the user.
Hence, I would like to use Xtext/Xtend as a generator for parser generators. This would enable me to add as many meta levels as I like. My understanding is that Xtext itself is defined using Xtext, so this should be possible.
The problem is that I don't know how to approach this, as I am not an expert in Xtext or parser generator frameworks in general. Any solutions/approaches/hints are welcome.
Update (more details and motivation)
Xtext can be used to generate anything, so I could write a generator based on Xtext that generates a parser. This could be done by specifying my meta language's grammar, using Xtext to generate a parser for that grammar, so I would have access to an AST that represents a model written in my meta language. However, from here on, I would be left alone to do whatever I want with the AST, e.g. generate a parser (because the AST represents the grammar of a user-defined language). But as Xtext has the specific ability to generate parsers, I was thinking of reusing this feature instead of implementing my own parser generator based on the AST of a grammar.
My motivation is the wish to define my own DSL grammar language (as a replacement for Xtext), while still being able to use the infrastructure provided by the Xtext project.
I came to the following solution:
A grammar that was written using my grammar language will be parsed by Xtext. Next, the resulting AST is transformed to the Xtext grammar language AST, which can be used as input for the existing parser generator.
In general, given some grammar language l1, a model written in this language will be parsed and the resulting AST will be transformed to the AST of the grammar language l2 that was used to specify l1. This step is repeated until we have an AST representing a model of the Xtext grammar language, which will be used to generate the new parser.
Naturally, any information added with the definition of a new grammar language will be lost in each transformation step. Therefore, the infrastructure that is developed around a grammar language has the responsibility to create some kind of functionality that makes this information available to a higher language developed using the grammar language.
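To make the lowering chain described above a bit more concrete, here is a purely hypothetical Java sketch. None of these types are Xtext API; they only name the pieces of the described pipeline, in which each meta level transforms its own AST down to the level below until the Xtext grammar language is reached.

```java
// Hypothetical types, not Xtext API. Each grammar-language level parses its own
// syntax (via its generated parser) and lowers the resulting AST to the AST of
// the level below, until a model of the Xtext grammar language remains.
import java.util.List;

interface GrammarModel {}                            // AST of a grammar, at any meta level
interface XtextGrammarModel extends GrammarModel {}  // bottom of the chain: Xtext's own grammar language

interface GrammarLanguageLevel {
    GrammarModel parse(String grammarSource);  // handled by the level's generated parser
    GrammarModel lower(GrammarModel ownAst);   // transform to the AST of the level below
}

final class MetaChain {
    /** Lower level by level until the Xtext grammar AST is reached; that model is
     *  what the existing Xtext parser generator can consume. */
    static XtextGrammarModel toXtextGrammar(GrammarModel topLevelAst,
                                            List<GrammarLanguageLevel> levels) {
        GrammarModel current = topLevelAst;
        for (GrammarLanguageLevel level : levels) {
            current = level.lower(current);
        }
        return (XtextGrammarModel) current;
    }
}
```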
For a different approach, see:
WWW.XTRAN-LLC.com/xtran.html#parse-gen
In a nutshell, I got tired of creating parsers for XTRAN, our Expert System whose rules language manipulates computer languages, data, and text, so I created a parsing engine that directly executes EBNF at parse time (as opposed to creating parsing code, e.g. Lex/YACC and ANTLR). Since XTRAN must also render code content represented in its Internal Representation / AST (after it's manipulated) as source code text, I created a corresponding rendering engine that executes (a much simpler form of) EBNF at render time.

How to ask GNU GetText (dxGetText) to ignore certain properties of Delphi components (especially SQL texts)?

I am using GNU GetText for Delphi (dxgettext, http://dxgettext.po.dk/download, https://sourceforge.net/projects/dxgettext/, with corrections for Windows 10) for extraction of translation .po files from Delphi source code (*.pas and *.dfm files). GetText usually grabs the texts from SQL queries as well and splits them into individual strings. The *.po file becomes messy, and I am not sure whether the translation will end up being (undesirably) applied to the SQL texts as well.
E.g. I am not using the single word "where" in captions in my program, but GetText extracts "where" occurrences from almost every SQL query text.
So, how can I ask GetText not to extract text from, e.g., TIBQuery.SelectSQL?
I am aware of API functions like procedure TP_GlobalIgnoreClass(IgnClass: TClass);, but I guess these functions act only at runtime. I would like the ignore to be applied already when the *.po file is extracted.

Parsing and pretty printing the same file format in Haskell

I was wondering, if there is a standard, canonical way in Haskell to write not only a parser for a specific file format, but also a writer.
In my case, I need to parse a data file for analysis. However, I also simulate data to be analyzed and save it in the same file format. I could now write a parser using Parsec or something equivalent and also write functions that produce the text output in the required format, but whenever I change my file format, I would have to change two functions in my code. Is there a better way to achieve this goal?
Thank you,
Dominik
The BNFC-meta package https://hackage.haskell.org/package/BNFC-meta-0.4.0.3
might be what you are looking for:
"Specifically, given a quasi-quoted LBNF grammar (as used by the BNF Converter) it generates (using Template Haskell) a LALR parser and pretty printer for the language."
Update: found this package, which also seems to fulfill the objective (not tested yet): http://hackage.haskell.org/package/syntax

Stanford NLP - Using Parsed or Tagged text to generate Full XML

I'm trying to extract data from the PennTreeBank, Wall Street Journal corpus. Most of it already has the parse trees, but some of the data is only tagged.
i.e. wsj_DDXX.mrg and wsj_DDXX.pos files.
I would like to use the already parsed trees and tagged data in these files so as not to use the parser and taggers within CoreNLP, but I still want the output file format that CoreNLP gives; namely, the XML file that contains the dependencies, entity coreference, and the parse tree and tagged data.
I've read many of the Javadocs, but I cannot figure out how to do what I described.
For POS, I tried using the LexicalizedParser, and it allows me to use the tags, but I can only generate an XML file with some of the information I want; there is no option for coreference or for generating the parse trees. To get it to correctly generate the sub-optimal XML files here, I had to write a script to get rid of all of the brackets within the files. This is the command I use:
java -cp "*" edu.stanford.nlp.parser.lexparser.LexicalizedParser -outputFormat typedDependenciesCollapsed,wordsAndTags -outputFilesExtension xml -outputFormatOptions xml -writeOutputFiles -outputFilesDirectory my\dir -tokenized -tagSeparator / -tokenizerFactory edu.stanford.nlp.process.WhitespaceTokenizer -tokenizerMethod newCoreLabelTokenizerFactory edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz my\wsj\files\dir
I also can't generate the data I would like to have for the WSJ data that already has the trees. I tried using what is said here and looked at the corresponding Javadocs. I used a command similar to what is described, but I had to write a Python program to capture the stdout resulting from analyzing each file and write it into a new file. The resulting data is only a text file with the dependencies and is not in the desired XML notation.
To summarize, I would like to use the POS and tree data from these PTB files in order to generate a CoreNLP parse corresponding to what would occur if I used CoreNLP on a regular text file. The pseudo command would be like this:
java -cp "*" edu.stanford.nlp.pipeline.CoreNLP -useTreeFile wsj_DDXX.mrg
and
java -cp "*" edu.stanford.nlp.pipeline.CoreNLP -usePOSFile wsj_DDXX.pos
Edit: fixed a link.
Yes, this is possible, but it is a bit tricky, and there is no out-of-the-box feature that can do it, so you will have to write some code. The basic idea is to replace the tokenize, ssplit and pos annotators (and, in case you also have trees, the parse annotator) with your own code that loads these annotations from your annotated files.
On a very high level you have to do the following:
Load your trees with MemoryTreebank
Loop through all the trees and for each tree create a sentence CoreMap to which you add
a TokensAnnotation
a TreeAnnotation and the SemanticGraphCoreAnnotations
Create an Annotation object with a list containing the CoreMap objects for all sentences
Run the StanfordCoreNLP pipeline with the annotators option set to lemma,ner,dcoref and the option enforceRequirements set to false.
Take a look at the individual annotators to see how to add the required annotations. E.g. there is a method in ParserAnnotatorUtils that adds the SemanticGraphCoreAnnotations.
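Putting those steps together, a rough, untested sketch might look like the following. It assumes a recent CoreNLP 3.x release; the class name, the output file name, the per-token index bookkeeping, and the choice of collapsed dependencies are my own additions, and the exact SemanticGraphFactory / annotation-key names can differ between versions.

```java
import java.io.FileOutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations;
import edu.stanford.nlp.semgraph.SemanticGraphFactory;
import edu.stanford.nlp.trees.MemoryTreebank;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.TreeCoreAnnotations;
import edu.stanford.nlp.util.ArrayCoreMap;
import edu.stanford.nlp.util.CoreMap;

public class TreebankToCoreNlpXml {

  public static void main(String[] args) throws Exception {
    // 1. Load the gold trees from the .mrg file.
    MemoryTreebank treebank = new MemoryTreebank();
    treebank.loadPath(args[0]); // e.g. wsj_DDXX.mrg

    // 2. Build one sentence CoreMap per tree.
    List<CoreMap> sentences = new ArrayList<>();
    StringBuilder text = new StringBuilder();
    int sentIdx = 0;
    for (Tree tree : treebank) {
      // Words and gold POS tags come straight out of the tree.
      List<CoreLabel> tokens = tree.taggedLabeledYield();
      for (int i = 0; i < tokens.size(); i++) {
        tokens.get(i).setIndex(i + 1); // 1-based index, as tokenize/ssplit would set
        text.append(tokens.get(i).word()).append(' ');
      }

      CoreMap sentence = new ArrayCoreMap();
      sentence.set(CoreAnnotations.TokensAnnotation.class, tokens);
      sentence.set(CoreAnnotations.SentenceIndexAnnotation.class, sentIdx++);
      sentence.set(TreeCoreAnnotations.TreeAnnotation.class, tree);

      // Dependencies derived from the gold tree. The convenience method below exists
      // in recent releases; older versions go through ParserAnnotatorUtils
      // .fillInParseAnnotations instead.
      SemanticGraph deps = SemanticGraphFactory.generateCollapsedDependencies(tree);
      sentence.set(SemanticGraphCoreAnnotations.CollapsedDependenciesAnnotation.class, deps);

      sentences.add(sentence);
    }

    // 3. Wrap the sentences in a document-level Annotation.
    Annotation document = new Annotation(text.toString());
    document.set(CoreAnnotations.SentencesAnnotation.class, sentences);

    // 4. Run only the remaining annotators; enforceRequirements=false keeps the
    //    pipeline from complaining that tokenize/ssplit/pos/parse were skipped.
    Properties props = new Properties();
    props.setProperty("annotators", "lemma,ner,dcoref");
    props.setProperty("enforceRequirements", "false");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    pipeline.annotate(document);

    // 5. Write the usual CoreNLP XML output.
    pipeline.xmlPrint(document, new FileOutputStream("wsj_DDXX.xml"));
  }
}
```

For the .pos-only files the same pattern should apply, except that you would build the token list from the tagged words and skip the TreeAnnotation/dependency part.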
