Does it make sense to interrogate structured data using NLP? - machine-learning

I know that this question may not be suitable for SO, but please let it stay here for a while. Last time my question was moved to Cross Validated, it froze; no more views or feedback.
I came across a question that does not make much sense to me: how can IFC models be interrogated via NLP? Consider IFC models as semantically rich structured data. IFC defines an EXPRESS-based entity-relationship model consisting of entities organized into an object-based inheritance hierarchy. Examples of entities include building elements, geometry, and basic constructs.
How could NLP be used for this type of data? I don't see how NLP is relevant at all.

In general, I would suggest that using NLP techniques to "interrogate" already (quite formally) structured data like EXPRESS would be overkill at best and a time / maintenance sinkhole at worst. In general, the strengths of NLP (human language ambiguity resolution, coreference resolution, text summarization, textual entailment, etc.) are wholly unnecessary when you already have such an unambiguous encoding as this. If anything, you could imagine translating this schema directly into a Prolog application for direct logic queries, etc. (which is quite a different direction than NLP).
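To make the "direct structured query" point concrete, here is a minimal sketch (my own, not from any referenced work) assuming the open-source ifcopenshell Python library and a hypothetical building.ifc file:

```python
# Minimal sketch: querying an IFC model directly, with no NLP involved.
# Assumes the open-source ifcopenshell library; "building.ifc" is a hypothetical file.
import ifcopenshell

model = ifcopenshell.open("building.ifc")

# Walk the schema-defined class hierarchy and pull out every wall entity.
for wall in model.by_type("IfcWall"):
    print(wall.GlobalId, wall.Name)
```

Because the schema is unambiguous, a query like this needs no disambiguation, training, or validation step.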
I did some searches to try to find the references you may have been referring to. The only item I found was Extending Building Information Models Semiautomatically Using Semantic Natural Language Processing Techniques:
... the authors propose a new method for extending the IFC schema to incorporate CC-related information, in an objective and semiautomated manner. The method utilizes semantic natural language processing techniques and machine learning techniques to extract concepts from documents that are related to CC [compliance checking] (e.g., building codes) and match the extracted concepts to concepts in the IFC class hierarchy.
So in this example, at least, the authors are not "interrogating" the IFC schema with NLP, but rather using it to augment existing schemas with additional information extracted from human-readable text. This makes much more sense. If you want to post the actual URL or reference that contains the "NLP interrogation" phrase, I should be able to comment more specifically.
Edit:
The project grant abstract you referenced does not contain much in the way of details, but they have this sentence:
... The information embedded in the parametric 3D model is intended for facility or workplace management using appropriate software. However, this information also has the potential, when combined with IoT sensors and cognitive computing, to be utilised by healthcare professionals in Ambient Assisted Living (AAL) environments. This project will examine how as-constructed BIM models of healthcare facilities can be interrogated via natural language processing to support AAL. ...
I can only speculate on the following reason for possibly using an NLP framework for this purpose:
While BIM models include Industry Foundation Classes (IFCs) and aecXML, there are many dozens of other formats, many of them proprietary. Some are CAD-integrated and others are standalone.

Rather than pay for many proprietary licenses (some of these enterprise products are quite expensive), and/or spend the time to develop proper structured query behavior for the various diverse file format specifications (which may not be publicly available in proprietary cases), the authors have chosen a more automated, general solution to extract the content they are looking for (which I assume must be textual or textual tags in nearly all cases). This would almost be akin to a search engine "scraping" websites and looking for key words or phrases and synonyms to them, etc.

The upside is they don't have to explicitly code against all the different possible BIM file formats to get good coverage, nor pay out large sums of money. The downside is they open up new issues and considerations that come with NLP, including training, validation, supervision, etc. And NLP will never have the same level of accuracy you could obtain from a true structured query against a known schema.

Related

Naming relationships in ontologies / knowledge graphs

I'm writing game worlds and I've started working on representing the worlds not just as text and images but as a graph of topics and associations. In other words, an ontology representing the game world's characters, places, events, concepts, terms and so on.
Where I've got a bit stuck is in defining and naming the relationships between topics. It's easy enough to come up with things like "is a", "part of", "located in", etc., but as the work goes on, I realize that using the terms loosely will not work well: there are many relationships that overlap in meaning, and you start to wonder whether this hasn't already been done. I've looked into OWL for creating ontologies, and topic maps, but what I lack is an actual data set of named associations (predicates in RDF) that I can build on, that have been vetted and used for larger projects.
What is a good strategy and resource to describe relationships between concepts in an ontology?
Schema.org is probably the definitive resource when it comes to widely accepted nomenclature for ontologies. See their page for Organizations for a good example of their property names.
I found a similar question that was addressed to the Neo4j/Cypher graph database audience, but it may provide some good insights as well. A key one is the trade-off between fine-grained relationships ('LIKES_POST') and coarse-grained relationships ('LIKES'). The finer the grain, the greater the specificity, but also the greater the overall complexity of the graph. In general, these trade-offs depend on your use case; you will have to determine which direction best suits your needs.
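To make the Schema.org suggestion a little more concrete, here is a hedged sketch (my own, not from the linked question) that reuses Schema.org property names as RDF predicates with the rdflib Python library; the game-world entities are invented for illustration:

```python
# Sketch: reusing Schema.org property names as predicates in a small RDF graph.
# Assumes the rdflib library; the game-world entities are invented for illustration.
from rdflib import Graph, Literal, Namespace, RDF

SDO = Namespace("https://schema.org/")
EX = Namespace("http://example.org/world/")

g = Graph()
g.bind("schema", SDO)

g.add((EX.Hero, RDF.type, SDO.Person))
g.add((EX.Hero, SDO.name, Literal("The Hero")))
g.add((EX.Hero, SDO.homeLocation, EX.HarborTown))   # coarse-grained, widely understood predicate
g.add((EX.HarborTown, RDF.type, SDO.Place))

print(g.serialize(format="turtle"))
```

Whether you stick with coarse predicates like homeLocation or mint finer-grained ones of your own is exactly the trade-off described above.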

Extracting Specific Information from Scientific Papers

I am looking for specific information that I need to extract from scientific papers. The information mostly resides in the "Evaluation" or "Implementation" sections of the papers. I need to extract any function name, parameter, file name, application name, or application version mentioned in the content.
Is there any NLP technique/machine learning algorithm to do this type of information extraction from scientific papers?
I'm not aware of any off-the-shelf applications that do this specific task (although that does not mean there isn't one, and there may be commercial solutions to do this). But there are open source options that would probably allow you to do what you want with a bit of work (annotation and/or rule-writing):
GATE (has a "user-friendly" graphical interface so you don't need to code if you don't want to)
Reverb
Stanford OpenIE
Canary (geared towards clinical NLP by the looks of it, but could be more generally applicable)
GROBID (this looks like it could be of use to segment the articles into sections)
Alternatively, you could build your own solution on top of libraries like NLTK or spaCy (if you code in Python) or Stanford CoreNLP (Java). It sounds like you would need to first identify document sections and then search for patterns within them. Whether you adopt a machine learning or rule-based approach, this will probably take a fair bit of work. If you have a predefined list of items you are looking for, that will make your life far easier!
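As a rough illustration of the rule-based direction (my own sketch, not an off-the-shelf solution), spaCy's Matcher can pick out simple patterns such as the word "version" followed by a number; the pattern and example sentence below are invented and would need tuning against real papers:

```python
# Sketch: rule-based extraction of "version <number>" mentions with spaCy's Matcher.
# The pattern and example sentence are invented and would need tuning on real papers.
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
matcher.add("VERSION_NUM", [[{"LOWER": "version"}, {"LIKE_NUM": True}]])

doc = nlp("We evaluated the prototype with TensorFlow version 2.4 on Ubuntu.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)        # -> "version 2.4"
    if start > 0:
        print(doc[start - 1].text)    # preceding token, often the application name -> "TensorFlow"
```

The same idea extends to file names, parameters, and function names, usually via regular expressions or token-shape patterns rather than a trained model.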

Functional Decomposition Diagrams and Data Flow Diagrams

What is the relationship between an FDD and a DFD of the same system?
Functional decomposition is about partitioning the functionality of a big, complicated system into smaller, preferably simpler parts. The FDD is a tool that aids you in this process. Basically, you are breaking down the capabilities of a complicated system into a set of more specific, logically grouped capabilities.
Now, a data flow diagram deals with how data flows through a system for a specific function of the system. So, each of the above-mentioned capabilities might very well have its own unique data flow.
For example, suppose you have an FDD describing a blogging system. You might have functions for, say, displaying a blog post, editing a blog post, and potentially sending a link to a blog post to a friend.
These three functions will all have fairly different data flows, which can be modelled separately with DFDs. So, I'd say the relationship between these two types of diagrams is that one can help identify the individual functions, which might, or might not, need to have an associated data flow mapped.
Hope that is helpful.

How to extract entities from html using natural language processing or other technique

I am trying to parse entities from web pages that contain a time, a place, and a name. I read a little about natural language processing, and entity extraction, but I am not sure if I am heading down the wrong path, so I am asking here.
I haven't started implementing anything yet, so if certain open source libraries are only suitable for a specific language, that is ok.
A lot of times the data would not be found in sentences, but instead in html structures like lists (e.g. 2013-02-01 - Name of Event - Arena Name).
The structure of the webpages will be vastly different (some might use lists, some might put them in a table, etc.).
What topics can I research to learn more about how to achieve this?
Are there any open source libraries that take into account the structure of html when doing entity extraction?
Would extracting these (name, time, place) entities from html be better (or even possible) with machine vision where the CSS styling might make it easier to differentiate important parts (name, time, location) of the unstructured text?
Any guidance on topics/open source projects that I can research would help, I think.
Many programming languages have external libraries that generate canonical date-stamps from various formats (e.g., in Java, using SimpleDateFormat). As you say, the structure of the web pages will be vastly different, but dates can be expressed in only a small number of variations, so writing regular expressions for a few (say, half a dozen) formats will enable extraction of dates from most, if not all, HTML pages.
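Here is the same idea sketched in Python rather than Java's SimpleDateFormat; only two formats are covered, and a real page would need a few more (the list-item example is taken from the question above):

```python
# Sketch: regex-plus-strptime date extraction; only two formats are covered here.
import re
from datetime import datetime

PATTERNS = [
    (re.compile(r"\b(\d{4}-\d{2}-\d{2})\b"), "%Y-%m-%d"),   # 2013-02-01
    (re.compile(r"\b(\d{2}/\d{2}/\d{4})\b"), "%d/%m/%Y"),   # 01/02/2013
]

def extract_dates(text):
    """Return canonical datetime objects for every recognised date string."""
    found = []
    for regex, fmt in PATTERNS:
        for match in regex.findall(text):
            try:
                found.append(datetime.strptime(match, fmt))
            except ValueError:
                pass  # matched the shape but is not a real date, e.g. "2013-13-45"
    return found

print(extract_dates("2013-02-01 - Name of Event - Arena Name"))
```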
Extraction of places and names is harder, however. This is where natural language processing has to come in. What you are looking for is a Named Entity Recognition (NER) system. One of the best open-source NER systems is the Stanford NER. Before using it, you should check out their online demo. The demo has three classifiers (for English) that you can choose from. For most of my tasks, I find their english.all.3class.distsim classifier to be quite accurate.
Note that an NER system performs well when the places and names you want to extract occur in sentences. If they are only going to occur in HTML labels, this approach is probably not going to be very helpful.
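Just to show what NER output looks like on sentence-level text, here is a minimal sketch using spaCy instead of the Stanford NER mentioned above (purely for brevity; entity labels vary by model):

```python
# Sketch: what NER output looks like on sentence-level text.
# Uses spaCy rather than the Stanford NER discussed above; labels vary by model.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The concert takes place on 1 February 2013 at Madison Square Garden in New York.")

for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g. a DATE, a FAC/ORG venue, and a GPE place
```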

Comparison of OData and Semantic Web/Linked Data

I'm trying to get my head around two very different approaches to data sharing: OData and Semantic Web/Linked Data. Is there a good comparison of the two?
As I understand it, OData combines syndication/CRUD (AtomPub), serialisation formats (XML, JSON), a data model, a query language, and some semantics/conventions governing use of those existing technologies. It's primarily intended for exposing data from one system so that others can consume it.
Linked Data is a data model, a rigorous commitment to URIs, an (optional?) serialisation format (RDF/XML), but (correct me if I'm wrong) doesn't say anything about transport, CRUD, etc. It seems intended to allow inferencing across lots of little chunks of data drawn from a wide variety of sources. (Not something of major importance to us right now - we would be synchronising large slabs of data between a small number of sources, and wanting to preserve provenance information).
I'm interested in technologies for sharing data between certain data management platforms, some of which I work on directly. OData seems more appealing as it's very straightforward to explain to developers: implement this API, follow that Atom standard, serialise the data like this. We're already doing something very similar for one platform: sharing XML-serialised data on an Atom feed, with URL parameters used to filter.
By contrast, my past experiences working with RDF have given me an impression of brittle, opaque (massive slabs of RDF/XML), inaccessible (using SPARQL vs SQL) technology - but perhaps I'm confusing the experience of working with a triplestore like Jena with simply exposing an existing database via a linked data API.
Any pointers, comments etc on the differences and similarities between these two approaches in terms of scope, technologies, ease, future potential etc would be great.
I think discussing this in depth is not really what Stack Overflow is meant for, but just to give you some pointers to interesting discussions about the differences and overlap:
Oh - it is data on the Web
Microsoft, OData and RDF
One of the key differences seems to be that OData has no means to link data from different sources to each other. Essentially, you're still stuck in a silo.
It might also be interesting to check out various attempts to convert data between the two approaches. See, among others, http://answers.semanticweb.com/questions/1298/has-anyone-written-a-mapping-from-odata-to-rdf .
OData may be easier, but it's not better by any means. SPARQL and RDF (forget RDF/XML; better to look at Turtle) satisfy everything OData offers, along with many more cutting-edge features such as:
Federation Extensions
Linked Data
Reasoning and Inference (for the more brave)
Equally, the software supporting the standards is actually quite sophisticated. Most people interested in OData generally come from a Microsoft background, so take a look at dotNetRdf.
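To give a feel for the SPARQL side (my own sketch, unrelated to OData or dotNetRdf, and using Python's rdflib for brevity), here is a minimal query over a tiny invented Turtle snippet:

```python
# Sketch: a minimal SPARQL query over an invented Turtle snippet, using rdflib.
from rdflib import Graph

TURTLE = """
@prefix ex: <http://example.org/> .
ex:alice ex:worksFor ex:acme .
ex:acme  ex:locatedIn "Berlin" .
"""

g = Graph()
g.parse(data=TURTLE, format="turtle")

QUERY = """
PREFIX ex: <http://example.org/>
SELECT ?person ?city WHERE {
    ?person ex:worksFor ?org .
    ?org    ex:locatedIn ?city .
}
"""
for row in g.query(QUERY):
    print(row.person, row.city)
```

Turtle plus SPARQL is usually far more approachable than the RDF/XML serialisation that puts many people off.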
Here's a comparison matrix:
http://uoccou.wordpress.com/2011/02/17/linked-data-odata-gdata-datarss-comparison-matrix/
Unfortunately the table formatting is pretty horrible, but the content is useful.
