Extracting Specific Information from Scientific Papers - machine-learning

I am looking for specific information that I need to extract from scientific papers. The information mostly resides in the "Evaluation" or "Implementation" sections of the papers. I need to extract any function name, parameter, file name, application name, application version in the content.
Is there any NLP technique/machine learning algorithm to do this type of information extraction from scientific papers?

I'm not aware of any off-the-shelf applications that do this specific task (although that does not mean there isn't one, and there may be commercial solutions to do this). But there are open source options that would probably allow you to do what you want with a bit of work (annotation and/or rule-writing):
GATE (has a "user-friendly" graphical interface so you don't need to code if you don't want to)
Reverb
Stanford OpenIE
Canary (geared towards clinical NLP by the looks of it, but could be more generally applicable)
GROBID (this looks like it could be of use to segment the articles into sections)
Alternatively, you could build your own solution on top of libraries like NLTK or spaCy (if you code in Python) or Stanford CoreNLP (Java). It sounds like you would need to first identify document sections and then search for patterns within them. Whether you adopt a machine learning or rule-based approach, this will probably take a fair bit of work. If you have a predefined list of items you are looking for, that will make your life far easier!

Related

Does it make sense to interrogate structured data using NLP?

I know that this question may not be suitable for SO, but please let this question be here for a while. Last time my question was moved to cross-validated, it froze; no more views or feedback.
I came across a question that does not make much sense for me. How IFC models can be interrogated via NLP? Consider IFC models as semantically rich structured data. IFC defines an EXPRESS based entity-relationship model consisting of entities organized into an object-based inheritance hierarchy. Examples of entities include building elements, geometry, and basic constructs.
How could NLP be used for such type of data? I don't see NLP relevant at all.
In general, I would suggest that using NLP techniques to "interrogate" already (quite formally) structured data like EXPRESS would be overkill at best and a time / maintenance sinkhole at worst. In general, the strengths of NLP (human language ambiguity resolution, coreference resolution, text summarization, textual entailment, etc.) are wholly unnecessary when you already have such an unambiguous encoding as this. If anything, you could imagine translating this schema directly into a Prolog application for direct logic queries, etc. (which is quite a different direction than NLP).
I did some searches to try to find the references you may have been referring to. The only item I found was Extending Building Information Models Semiautomatically Using Semantic Natural Language Processing Techniques:
... the authors propose a new method for extending the IFC schema to incorporate CC-related information, in an objective and semiautomated manner. The method utilizes semantic natural language processing techniques and machine learning techniques to extract concepts from documents that are related to CC [compliance checking] (e.g., building codes) and match the extracted concepts to concepts in the IFC class hierarchy.
So in this example, at least, the authors are not "interrogating" the IFC schema with NLP, but rather using it to augment existing schemas with additional information extracted from human-readable text. This makes much more sense. If you want to post the actual URL or reference that contains the "NLP interrogation" phrase, I should be able to comment more specifically.
Edit:
The project grant abstract you referenced does not contain much in the way of details, but they have this sentence:
... The information embedded in the parametric 3D model is intended for facility or workplace management using appropriate software. However, this information also has the potential, when combined with IoT sensors and cognitive computing, to be utilised by healthcare professionals in Ambient Assisted Living (AAL) environments. This project will examine how as-constructed BIM models of healthcare facilities can be interrogated via natural language processing to support AAL. ...
I can only speculate on the following reason for possibly using an NLP framework for this purpose:
While BIM models include Industry Foundation Classes (IFCs) and aecXML, there are many dozens of other formats, many of them proprietary. Some are CAD-integrated and others are standalone. Rather than pay for many proprietary licenses (some of these enterprise products are quite expensive), and/or spend the time to develop proper structured query behavior for the various diverse file format specifications (which may not be publicly available in proprietary cases), the authors have chosen a more automated, general solution to extract the content they are looking for (which I assume must be textual or textual tags in nearly all cases). This would almost be akin to a search engine "scraping" websites and looking for key words or phrases and synonyms to them, etc. The upside is they don't have to explicitly code against all the different possible BIM file formats to get good coverage, nor pay out large sums of money. The downside is they open up new issues and considerations that come with NLP, including training, validation, supervision, etc. And NLP will never have the same level of accuracy you could obtain from a true structured query against a known schema.

spell checker uses language model

I look for spell checker that could use language model.
I know there is a lot of good spell checkers such as Hunspell, however as I see it doesn't relate to context, so it only token-based spell checker.
for example,
I lick eating banana
so here at token-based level no misspellings at all, all words are correct, but there is no meaning in the sentence. However "smart" spell checker would recognize that "lick" is actually correctly written word, but may be the author meant "like" and then there is a meaning in the sentence.
I have a bunch of correctly written sentences in the specific domain, I want to train "smart" spell checker to recognize misspelling and to learn language model, such that it would recognize that even thought "lick" is written correctly, however the author meant "like".
I don't see that Hunspell has such feature, can you suggest any other spell checker, that could do so.
See "The Design of a Proofreading Software Service" by Raphael Mudge. He describes both the data sources (Wikipedia, blogs etc) and the algorithm (basically comparing probabilities) of his approach. The source of this system, After the Deadline, is available, but it's not actively maintained anymore.
One way to do this is via a character-based language model (rather than a word-based n-gram model). See my answer to Figuring out where to add punctuation in bad user generated content?. The problem you're describing is different, but you can apply a similar solution. And, as I noted there, the LingPipe tutorial is a pretty straightforward way of developing a proof-of-concept implementation.
One important difference - to capture more context, you may want to train a larger n-gram model than the one I recommended for punctuation restoration. Maybe 15-30 characters? You'll have to experiment a little there.

How to extract entities from html using natural language processing or other technique

I am trying to parse entities from web pages that contain a time, a place, and a name. I read a little about natural language processing, and entity extraction, but I am not sure if I am heading down the wrong path, so I am asking here.
I haven't started implementing anything yet, so if certain open source libraries are only suitable for a specific language, that is ok.
A lot of times the data would not be found in sentences, but instead in html structures like lists (e.g. 2013-02-01 - Name of Event - Arena Name).
The structure of the webpages will be vastly different (some might use lists, some might put them in a table, etc.).
What topics can I research to learn more about how to achieve this?
Are there any open source libraries that take into account the structure of html when doing entity extraction?
Would extracting these (name, time, place) entities from html be better (or even possible) with machine vision where the CSS styling might make it easier to differentiate important parts (name, time, location) of the unstructured text?
Any guidance on topics/open source projects that I can research would help I think.
Many programming languages have external libraries that generate canonical date-stamps from various formats (e.g. in Java, using the SimpleDateFormat). As you say, the structure of the web-pages will be vastly different, but date can be expressed using a small number of variations only, so writing down the regular expressiongs for a few (let's say, half-a-dozen) formats will enable extraction of dates from most, if not all, HTML pages.
Extraction of places and names is harder, however. This is where natural language processing will have to come in. What you are looking for is a Named Entity Recognition system. One of the best open source NER systems is the Standford NER. Before using, you should check out their online demo. The demo has three classifiers (for English) that you can choose from. For most of my tasks, I find their english.all.3class.distsim classifier to be quite accurate.
Note that an NER performs well when the places and names you extract are occurring in sentences. If they are going to occur in HTML labels, this approach is probably not going to be very helpful.

How can I select a FAQ entry from a user's natural-language inquiry?

I am working on an app where the user submits a series of questions. These questions are freeform text, but are based on a specific product, so I have a general understanding of the context. I have a FAQ listing, and I need to try to match the user's question to a question in the FAQ.
My language is Delphi. My general thought approach is to throw out small "garbage words", a, an, the, is, of, by, etc... Run a stemming program over these words to get the root words, and then try to match as many of the remaining words as possible.
Is there a better approach? I have thought about some type of natural language processing, but I am afraid that I would be looking at years of development, rather than a week or two.
You don't need to invent a new way of doing this. It's all been done before. What you need is called a FAQ finder, introduced by Hammond, et al in 1995 (FAQ finder: a case-based approach to knowledge navigation, 11th Conference on Artificial Intelligence for Applications).
AI Magazine included a paper by some of the same authors as the first paper that evaluated their implementation. Burke, et al, Question Answering from Frequently Asked Question Files: Experiences with the FAQ FINDER System, 1997. It describes two stages for how it works:
First, they use Smart, an information-retrieval system, to generate an initial set of candidate questions based on the user's input. It looks like it works similarly to what you described, stemming all the words and omitting anything on the stop list of short words.
Next, the candidates are scored against the user's query according to statistical similarity, semantic similarity, and coverage. (Read the paper for details.) Scoring semantic similarity relies on WordNet, which groups English words into sets of distinct concepts. The FAQ finder reviewed here was designed to cover all Usenet FAQs; since your covered domain is smaller, it might be feasible for you to apply more domain knowledge than the basics that WordNet provides.
Not sure if this solution is precisely what you're looking for, but if you're looking to parse natural language, you could use the Link-Grammar Parser.
Thankfully, I've translated this for use with Delphi (complete with a demo), which you can download (free and 100% open source) from this page on my blog.
In addition to your stemming approach, I suggest that you are going to need to look into one or more of the following:
Recognize important pairs or phrases (2 or more words). For example if your domain is a technical field, an important pair that should be automatically considered as a pair instead of individual words, where the pair of words means something special (in programming, "linked list", "serial port", etc, are more important in their meaning as a pair of words than individually).
A large list of synonyms ("turn == rotate", "open == access", etc ).
I would be tempted to tear apart "search engine" open source software in whatever language it was written in, and see what general techniques they use.

Tool to parse text for possible Wikipedia links

Does a tool exist that can parse text and output that text, hyper-linked to Wikipedia entries for words of interest?
For example, I'd like a tool that could turn something like:
The most popular search algorithm on a
sorted list is the binary search.
Into:
The most popular search algorithm on a
sorted list is the binary search.
It would be wonderful if Wikipedia had an API which would do this since they would be best equipped to determine what "words of interests" are.
In my example I simply linked all combinations which linked directly to an entry except for The and most.
There is a tool that does exactly what you're asking for.
http: //wikify.appointment.at/
It's not perfect, but it works.
You have two separate problems to solve here:
Deciding which words should be linked
Determining if there's a suitable entry to link these words to
Now, (2) is simpler, though it's also somewhat problematic. Wikipedia seems to have an API that allows you to gather data efficiently, and they also allow "screen scraping". But there's a problem with disambiguation - sometimes you might hit not the entry you wanted. For example, python links to a disambiguation page, as it can be a programming language, a snake and a couple of other things.
(1) Is much harder, though. You can take the "simple approach" and attempt to find links for all non-trivial nouns (or even noun/adjective pairs). Non-trivial here means omitting words like "fiend, word, computer" etc.
But This would result in a plethora of links, which isn't convenient to read. It's really up to you to decide what's interesting in the text, and this depends a lot on the text itself. In an article for professional programmers, do you really want to link to "search algorithm" every time? But for beginners, perhaps you do.
To conclude, I strongly doubt there's a single general-purpose tool that will do the trick for you. But you surely have all the options at your hand, and something need-specific can be coded without too much effort.
Silviu Cucerzan of Microsoft Research tackled this problem. Well, not the problem of inserting the links, but the general issue of determining what entities are being mentioned in a some piece of text. Fortunately for you, he used Wikipedia articles as his set of entities. His paper, "Large-Scale Named Entity Disambiguation Based on Wikipedia Data", is available on his website. Direct link: pdf.

Resources