What approach do graphical DSL workbenches use: parsers or projections?

To my knowledge, there are two approaches that DSL editors use:
1- Parser-based approach to develop textual DSLs: The user specifies a grammar and the workbench generates a parser that recognizes this grammar. The parser builds an Abstract Syntax Tree that is used by code generators and so on.
2- Projectional approach: Here there is no parser. The Abstract Syntax Tree is edited directly by the user's gestures, and projection rules specify how the Abstract Syntax Tree is rendered. This allows the use of different notations (textual, graphical, tabular, ...) at the same time.
Now, when I look at graphical-only DSL workbenches (such as DSL-tools from Microsoft), I wonder what approach they use and what steps are involved behind the definition of the DSL. If it's the projectional approach, why is it limited to graphical notation only?
My idea is that it uses both: the projectional approach to make the notation graphical, while the models are saved in a specific format (XML, for example) and parsed.
Thank you.

Well, strictly speaking, any "graphical" editor is projectional. The ability of a language workbench to have different notations, like in MPS, is enabled by the fact that the tool has these notations built in, together with the ability to define several editors for the same piece of model. In the case of MPS it's even possible to create new notations as a plugin (so without having to change MPS itself).
I would say that saving the models to any storage medium can't ultimately be anything else than textual or binary. Any editor that wants to save models will serialize to one of those two options, even MPS. So since it wouldn't make sense to say that there is a projectional way to save models, you can either say that both DSL-tools and MPS have the textual approach for saving and a projectional editor, or (my preferred option) simply that both DSL-tools and MPS can produce projectional editors.
Also, I wouldn't agree to call DSL-tools a language workbench. As you can read in https://homepages.cwi.nl/~storm/publications/lwc13paper.pdf, a program has to meet a number of criteria (more than DSL-tools can meet, in my opinion) to be a language workbench.
In general, I would say that any "graphical" language workbench (i.e. a language workbench that produces editors that are not plain text) uses the projectional approach.
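To make the distinction concrete, here is a minimal sketch, assuming a toy state-machine model (the class and function names are hypothetical and not taken from MPS or DSL-tools). The "editor" changes the model object directly, two projections render the same model in different notations, and saving is plain serialization that is independent of either notation.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class State:
    name: str


@dataclass
class Transition:
    source: str
    target: str
    event: str


@dataclass
class StateMachine:
    """The model (AST). Editors change this object directly; nothing is parsed."""
    states: list = field(default_factory=list)
    transitions: list = field(default_factory=list)


def project_text(model):
    """Textual projection: one line per transition."""
    return "\n".join(f"{t.source} --{t.event}--> {t.target}" for t in model.transitions)


def project_table(model):
    """Tabular projection of the same model."""
    rows = [("from", "event", "to")]
    rows += [(t.source, t.event, t.target) for t in model.transitions]
    return "\n".join(" | ".join(row) for row in rows)


def save(model):
    """Persistence is serialization of the model, not of any notation."""
    return json.dumps(asdict(model))


def load(text):
    """Rebuild the model from its serialized form."""
    data = json.loads(text)
    return StateMachine(
        states=[State(**s) for s in data["states"]],
        transitions=[Transition(**t) for t in data["transitions"]],
    )


# An "editing gesture" mutates the model; every projection is re-rendered from it.
machine = StateMachine(states=[State("idle"), State("running")])
machine.transitions.append(Transition("idle", "running", "start"))
print(project_text(machine))
print(project_table(machine))
```

Reading the saved JSON back does involve a parser, but only for the storage format; the notations the user sees are never parsed.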

An important difference between source and projectional editing environment is the split between persistent storage and editing. Projectional editing systems can choose any persistence mechanism that they choose, while source systems need to have some universal storage mechanism - which is why they are almost always text files. [Martin Fowler]
So if what you are editing is not in the same format as it is stored in, you are using a projectional editor. All non-textual notations (tabular, symbolic, graphical) inherently cannot be stored exactly as they look, so they must be projected.
example: Markdown on this website
An example featuring a commonly used tool that technically (you wouldn't think of it that way) uses a projectional editor could be MS Word, because you can't just open your .docx file quickly in notepad and change the size of the header. You always edit the abstract representation shown to you through a projection.
WYSIWYG word processing systems such as Word, which appear to edit formatted text directly, are essentially structure editors for the underlying marked-up text. [wikipedia]
A tangentially related term is Illustrative programming [Fowler], which features what is often called the most common "programming language" in the world: Excel.


Extracting Specific Information from Scientific Papers

I am looking for specific information that I need to extract from scientific papers. The information mostly resides in the "Evaluation" or "Implementation" sections of the papers. I need to extract any function name, parameter, file name, application name, application version in the content.
Is there any NLP technique/machine learning algorithm to do this type of information extraction from scientific papers?
I'm not aware of any off-the-shelf applications that do this specific task (although that does not mean there isn't one, and there may be commercial solutions to do this). But there are open source options that would probably allow you to do what you want with a bit of work (annotation and/or rule-writing):
GATE (has a "user-friendly" graphical interface so you don't need to code if you don't want to)
Stanford OpenIE
Canary (geared towards clinical NLP by the looks of it, but could be more generally applicable)
GROBID (this looks like it could be of use to segment the articles into sections)
Alternatively, you could build your own solution on top of libraries like NLTK or spaCy (if you code in Python) or Stanford CoreNLP (Java). It sounds like you would need to first identify document sections and then search for patterns within them. Whether you adopt a machine learning or rule-based approach, this will probably take a fair bit of work. If you have a predefined list of items you are looking for, that will make your life far easier!
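As a rough illustration of the rule-based route (identify sections first, then search for patterns within them), here is a minimal sketch using only Python's standard `re` module. The heading pattern, the file extensions, and the sample text are all assumptions made for the example; real papers would need far more robust section detection (for instance from a tool like GROBID) and better patterns.

```python
import re

# Assumed heading shape: an optional section number followed by a short title.
HEADING = re.compile(r"^\s*(?:\d+(?:\.\d+)*\.?\s+)?([A-Z][A-Za-z ]{2,40})\s*$")

# Hypothetical patterns for the kinds of items mentioned in the question.
PATTERNS = {
    "file_name": re.compile(r"\b[\w\-]+\.(?:py|c|h|cpp|java|json|xml|csv|txt)\b"),
    "function": re.compile(r"\b[A-Za-z_]\w*\(\)"),
    "version": re.compile(r"\b[A-Z][A-Za-z+]*\s+v?\d+(?:\.\d+)+\b"),
}


def split_sections(text):
    """Group lines under the most recent heading-like line."""
    sections, current = {}, "preamble"
    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            current = match.group(1).strip().lower()
            sections.setdefault(current, [])
        else:
            sections.setdefault(current, []).append(line)
    return {name: "\n".join(lines) for name, lines in sections.items()}


def extract(text, wanted=("evaluation", "implementation")):
    """Run every pattern over the sections of interest only."""
    found = {}
    for name, body in split_sections(text).items():
        if name in wanted:
            found[name] = {kind: p.findall(body) for kind, p in PATTERNS.items()}
    return found


paper = """1 Introduction
We study caching.
4 Implementation
The prototype is built on Redis 6.2.1 and calls flush_cache() from cache.py.
5 Evaluation
We ran the benchmark with PostgreSQL 13.4 and logged results to runs.csv.
"""
result = extract(paper)
print(result)
```

If you do have a predefined list of items, the regular expressions can be replaced by simple dictionary lookups, which is usually both easier and more precise.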

Does it make sense to interrogate structured data using NLP?

I know that this question may not be suitable for SO, but please let this question be here for a while. Last time my question was moved to cross-validated, it froze; no more views or feedback.
I came across a question that does not make much sense to me: how can IFC models be interrogated via NLP? Consider IFC models as semantically rich structured data. IFC defines an EXPRESS-based entity-relationship model consisting of entities organized into an object-based inheritance hierarchy. Examples of entities include building elements, geometry, and basic constructs.
How could NLP be used for this type of data? I don't see how NLP is relevant at all.
In general, I would suggest that using NLP techniques to "interrogate" already (quite formally) structured data like EXPRESS would be overkill at best and a time / maintenance sinkhole at worst. In general, the strengths of NLP (human language ambiguity resolution, coreference resolution, text summarization, textual entailment, etc.) are wholly unnecessary when you already have such an unambiguous encoding as this. If anything, you could imagine translating this schema directly into a Prolog application for direct logic queries, etc. (which is quite a different direction than NLP).
I did some searches to try to find the references you may have been referring to. The only item I found was "Extending Building Information Models Semiautomatically Using Semantic Natural Language Processing Techniques":
... the authors propose a new method for extending the IFC schema to incorporate CC-related information, in an objective and semiautomated manner. The method utilizes semantic natural language processing techniques and machine learning techniques to extract concepts from documents that are related to CC [compliance checking] (e.g., building codes) and match the extracted concepts to concepts in the IFC class hierarchy.
So in this example, at least, the authors are not "interrogating" the IFC schema with NLP, but rather using it to augment existing schemas with additional information extracted from human-readable text. This makes much more sense. If you want to post the actual URL or reference that contains the "NLP interrogation" phrase, I should be able to comment more specifically.
The project grant abstract you referenced does not contain much in the way of details, but they have this sentence:
... The information embedded in the parametric 3D model is intended for facility or workplace management using appropriate software. However, this information also has the potential, when combined with IoT sensors and cognitive computing, to be utilised by healthcare professionals in Ambient Assisted Living (AAL) environments. This project will examine how as-constructed BIM models of healthcare facilities can be interrogated via natural language processing to support AAL. ...
I can only speculate on the following reason for possibly using an NLP framework for this purpose:
While BIM models include Industry Foundation Classes (IFCs) and aecXML, there are many dozens of other formats, many of them proprietary. Some are CAD-integrated and others are standalone. Rather than pay for many proprietary licenses (some of these enterprise products are quite expensive), and/or spend the time to develop proper structured query behavior for the various diverse file format specifications (which may not be publicly available in proprietary cases), the authors have chosen a more automated, general solution to extract the content they are looking for (which I assume must be textual or textual tags in nearly all cases). This would almost be akin to a search engine "scraping" websites and looking for key words or phrases and synonyms to them, etc. The upside is they don't have to explicitly code against all the different possible BIM file formats to get good coverage, nor pay out large sums of money. The downside is they open up new issues and considerations that come with NLP, including training, validation, supervision, etc. And NLP will never have the same level of accuracy you could obtain from a true structured query against a known schema.
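For contrast, a "true structured query against a known schema" needs no language processing at all. The sketch below assumes a tiny, made-up inheritance hierarchy that only loosely imitates the IFC style; the class names, attributes, and the `select` helper are hypothetical and are not part of any IFC library.

```python
from dataclasses import dataclass


@dataclass
class BuildingElement:
    name: str
    storey: int


@dataclass
class Wall(BuildingElement):
    is_load_bearing: bool = False


@dataclass
class Door(BuildingElement):
    width_mm: int = 900


def select(model, entity_type, **conditions):
    """Return elements of the given type (or a subtype) whose attributes match."""
    return [
        element for element in model
        if isinstance(element, entity_type)
        and all(getattr(element, key) == value for key, value in conditions.items())
    ]


model = [
    Wall("W1", storey=0, is_load_bearing=True),
    Wall("W2", storey=1),
    Door("D1", storey=0, width_mm=1000),
]

# Queries are exact: they either match the schema or they do not.
ground_floor = select(model, BuildingElement, storey=0)
bearing_walls = select(model, Wall, is_load_bearing=True)
print([element.name for element in ground_floor])
print([element.name for element in bearing_walls])
```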

Parsing a Programming Language and Identifying Components of it

I'm looking for steps/libraries/approaches to solve the following problem.
Given a source file in a programming language, I need to parse it and subdivide it into components.
For example, given a Java file, I need to find the following in it:
list of Imports
Classes present in it
Attributes in the Class
Methods in it, along with their parameters, if any.
I need to extract these and store them separately.
Why do I want to do this?
I want to build an inverted index on top of these components.
Example queries to Inverted index
1. Find the list of files with Class name: Sample
2. Find the positions where variable XXX is used within the class AAA.
I need to support queries like the ones above.
So my plan is: given a file, if I extract these components from it, it should be easy to build an inverted index on top of them.
Example: Sample -- Class -- Sample.java (Keyword -- Component -- FileName)
I want to build an inverted index like the one above.
I see this is implemented in many IDEs, such as IntelliJ. What I'm interested in is how much effort it would take to build something like this, and I want to try implementing it for at least one language.
Thanks in advance.
You can try to do this with "just" a parser; for your specific example, that might be enough.
But you'll need a parser for each language. If you stick to just Java, you can find Java parsers pretty easily; just reuse one, there is little point in you reinventing one more set of grammar rules to describe Java.
For more than one language, this starts to get tricky. You can:
try to find a separate parser for each language. This may be sort of successful for mainstream languages. As you get to less well known languages, these get a lot harder to find. If you succeed, you'll have the problem that the parsers are likely incompatible technology; now gluing them together to collectively collect your index information is going to be a mess.
pick one parsing technology and get grammars for all the languages you care about. You have only two realistic choices: YACC/Bison, and ANTLR.
As a practical matter, YACC and Bison have been used to implement LOTS of languages... but the grammar files are not collected in one place, so they are hard to find. ANTLR at least has a single repository you can find at their web site. So that might kind of work.
It's going to be quite an effort to assemble all these into an integrated whole.
A complication is that you may want more than just raw syntax; you might want to know the meaning of the symbols, and for each symbol, precisely where it is defined in which file. After all, you want your index to be accurate at scale, and this will require differentiating foo the variable name from foo the function name. Arguably you need symbol tables.
As a general rule, this is where pure parsing of languages breaks down; there is serious Life After Parsing.
In that case, you want an integrated set of tools for extracting information from the different languages.
Our DMS Software Reengineering Toolkit is such a framework, and has some 40 languages predefined for it. We use something like OP's suggested process to build indexes of a code base for search tools based on DMS. Building something like DMS is an enormous effort.
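For a sense of the "just a parser" end of the spectrum, here is a minimal sketch of the extract-then-index idea from the question. To stay self-contained it swaps Java for Python source and reuses Python's built-in `ast` module as the existing parser; a Java version would need a Java parser instead. The function names and the sample file are hypothetical.

```python
import ast
from collections import defaultdict


def extract_components(source, file_name):
    """Yield (keyword, component, file_name, line) tuples from one Python file."""
    tree = ast.parse(source, filename=file_name)
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                yield alias.name, "import", file_name, node.lineno
        elif isinstance(node, ast.ImportFrom):
            yield node.module or "", "import", file_name, node.lineno
        elif isinstance(node, ast.ClassDef):
            yield node.name, "class", file_name, node.lineno
        elif isinstance(node, ast.FunctionDef):
            yield node.name, "method", file_name, node.lineno
            for arg in node.args.args:
                yield arg.arg, "parameter", file_name, node.lineno
        elif isinstance(node, ast.Name):
            # Store context means the name is being assigned; otherwise it is read.
            kind = "assignment" if isinstance(node.ctx, ast.Store) else "use"
            yield node.id, kind, file_name, node.lineno


def build_index(files):
    """Map each keyword to the (component, file, line) places where it occurs."""
    index = defaultdict(list)
    for file_name, source in files.items():
        for keyword, component, name, line in extract_components(source, file_name):
            index[keyword].append((component, name, line))
    return index


files = {
    "sample.py": (
        "import os\n"
        "\n"
        "class Sample:\n"
        "    def area(self, width):\n"
        "        total = width * 2\n"
        "        return total\n"
    ),
}
index = build_index(files)
print(index["Sample"])
print(sorted(index["width"]))
```

Note that this index records names only. It cannot tell two different symbols called `width` apart, which is exactly the symbol-table problem described above.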

Framework for building structured binary data parsers?

I have some experience with Pragmatic-Programmer-type code generation: specifying a data structure in a platform-neutral format and writing templates for a code generator that consume these data structure files and produce code that pulls raw bytes into language-specific data structures, does scaling on the numeric data, prints out the data, etc. The nice pragmatic(TM) ideas are that (a) I can change data structures by modifying my specification file and regenerating the source (which is DRY and all that) and (b) I can add additional functions that can be generated for all of my structures just by modifying my templates.
What I had used was a Perl script called Jeeves which worked, but it's general purpose, and any functions I wanted to write to manipulate my data I was writing from the ground up.
Are there any frameworks that are well-suited for creating parsers for structured binary data? What I've read of Antlr suggests that it's overkill. My current target languages of interest are C#, C++, and Java, if it matters.
Thanks as always.
Edit: I'll put a bounty on this question. If there are any areas that I should be looking at (keywords to search on) or other ways of attacking this problem that you've developed yourself, I'd love to hear about them.
You may also want to look at a relatively new project, Kaitai Struct, which provides a language for that purpose and also has a good IDE.
You might find ASN.1 interesting, as it provides an abstract way to describe the data you might be processing. If you use ASN.1 to describe the data abstractly, you need a way to map that abstract data to concrete binary streams, for which ECN (Encoding Control Notation) is likely the right choice.
The New Jersey Machine-Code Toolkit is actually focused on binary data streams corresponding to instruction sets, but I think that's a superset of just binary streams. It has very nice facilities for defining fields in terms of bit strings, and automatically generating accessors and generators of such. This might be particularly useful if your binary data structures contain pointers to other parts of the data stream.
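The specification-driven idea from the question (describe the structure once, derive the parsing and scaling from it) can be sketched in a few lines with Python's standard `struct` module. This is only an illustration in Python, not the syntax of Kaitai Struct, ASN.1, or the toolkit above, and the record layout, field names, and scale factors are made up.

```python
import struct

# Hypothetical record layout: (field name, struct format code, scale factor).
SENSOR_RECORD = [
    ("device_id", "H", 1),          # unsigned 16-bit
    ("temperature_c", "h", 0.01),   # signed 16-bit, stored in hundredths
    ("pressure_kpa", "I", 0.001),   # unsigned 32-bit, stored in pascals
]


def make_parser(spec, byte_order="<"):
    """Build a parse function from a field specification."""
    layout = struct.Struct(byte_order + "".join(code for _, code, _ in spec))

    def parse(raw):
        # Unpack exactly one record and apply each field's scale factor.
        values = layout.unpack(raw[:layout.size])
        return {name: value * scale for (name, _, scale), value in zip(spec, values)}

    parse.size = layout.size
    return parse


parse_sensor = make_parser(SENSOR_RECORD)
raw = struct.pack("<HhI", 7, -1250, 101325)
record = parse_sensor(raw)
print(record)
```

Changing the data structure means editing only `SENSOR_RECORD`, and a new operation (printing, validation, and so on) can be written once against the specification rather than once per structure.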

What do people do with Parsers, like antlr javacc?

Out of curiosity, I wonder what people can do with parsers, how they are applied, and what people usually create with them.
I know they are widely used for implementing programming languages, but I think this is just a tiny portion of their uses, right?
Besides special-purpose languages, my most ambitious use of a parser generator yet (with good old yacc back in C, and again later with pyparsing in Python) was to extract, validate and possibly alter certain meta-info from SQL queries -- parsing SQL properly is a real challenge (especially if you hope to support more than one dialect!-), a parser generator (and a lexer it sits on top of) at least remove THAT part of the job!-)
They are used to parse text....
To give a more concrete example, where I work we use lex/yacc to parse strings coming over sockets.
Also from the name it should give you an idea what javacc is used for (java compiler compiler!)
Generally, to parse Domain Specific Languages or scripting languages, or to provide similar support for code snippets.
Previously I have seen it used to parse the command-line output of another software tool. This way the outer tool (VPN software) could reuse the base router IPSec code without modification, since a lot of what was being parsed was IP route tables and other structured, repeated text.
Using a parser allowed simple changes when the formatting changed, instead of trying to find and tweak a hand-written parser. And the output did change a few times over the life of the product.
I used parsers to help process +/- 800 Clipper source files into similar PRGs that could be compiled with Alaska Xbase 32.
You can use it to extend your favorite language by getting its language definition from their repository and then adding what you've always wanted to have. You can pass the regular syntax to your application and handle the extension in your own program.
