Heuristic Approaches to Finding Main Content

Wondering if anybody could point me in the direction of academic papers or related implementations of heuristic approaches to finding the real meat content of a particular webpage.
Obviously this is not a trivial task, since the problem description is so vague, but I think that we all have a general understanding about what is meant by the primary content of a page.
For example, it may include the story text for a news article, but might not include any navigational elements, legal disclaimers, related story teasers, comments, etc. Article titles, dates, author names, and other metadata fall in the grey category.
I imagine that the application value of such an approach is large, and would expect Google to be using it in some way in their search algorithm, so it would appear to me that this subject has been treated by academics in the past.
Any references?

One way to look at this would be as an information extraction problem.
As such, one high-level algorithm would be to collect multiple examples of the same page type and deduce parsing (or extraction) rules for the parts of the page that differ between examples (these are likely to be the main content). The intuition is that common boilerplate (header, footer, etc.) and ads will eventually appear on multiple examples of those web pages, so by training on a few of them, you can quickly start to reliably identify this boilerplate/additional code and subsequently ignore it. It's not foolproof, but this is also the basis of web scraping technologies, both commercial and academic, such as RoadRunner:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.21.8672&rep=rep1&type=pdf
The citation is:
Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo: RoadRunner: Towards Automatic Data Extraction from Large Web Sites. VLDB 2001: 109-118
There's also a well-cited survey of extraction technologies:
Alberto H. F. Laender, Berthier A. Ribeiro-Neto, Altigran S. da Silva, Juliana S. Teixeira: A brief survey of web data extraction tools. ACM SIGMOD Record, v.31 n.2, June 2002 [doi:10.1145/565117.565137]
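
The RoadRunner paper infers full extraction grammars, but the underlying intuition can be prototyped very cheaply. Below is a minimal sketch (plain Python; the HTML-to-text step is assumed to happen elsewhere) that treats text lines shared by most same-type pages as boilerplate and keeps the rest as candidate main content:

```python
from collections import Counter

def extract_main_content(pages, boilerplate_ratio=0.6):
    """Very rough boilerplate removal.

    `pages` is a list of plain-text renderings of pages built from the
    same template (the HTML-to-text conversion is assumed to happen
    elsewhere). Lines that appear on at least `boilerplate_ratio` of the
    pages are treated as template/boilerplate; the rest are kept as
    page-specific candidate main content.
    """
    line_counts = Counter()
    split_pages = []
    for page in pages:
        lines = [ln.strip() for ln in page.splitlines() if ln.strip()]
        split_pages.append(lines)
        line_counts.update(set(lines))  # count each distinct line once per page

    threshold = boilerplate_ratio * len(pages)
    boilerplate = {ln for ln, n in line_counts.items() if n >= threshold}

    # Per page, keep only the lines that are not shared boilerplate.
    return [[ln for ln in lines if ln not in boilerplate] for lines in split_pages]

# Usage: feed in several article pages rendered from the same site template.
# main_texts = extract_main_content([text_of_page1, text_of_page2, text_of_page3])
```

This is obviously crude compared to RoadRunner's grammar inference, but it shows why even a handful of same-template pages is enough to start separating boilerplate from content.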

References for Z3 - how does it work [internal theory]?

I am interested in reading the internal theory behind Z3. Specifically, I want to read how the Z3 SMT solver works and how it is able to find counterexamples for an incorrect model. I wish to be able to manually work out a trace for some very simple example.
However, all Z3 references seem to be about how to code in it, or give only a very high-level description of the algorithm. I am unable to find a description of the algorithms used. Is this information not made public by Microsoft?
Could anyone suggest any references (papers/books) that give a comprehensive insight into Z3's theory and workings?
My personal opinion is that the best reference to start with is Kroening and Strichman's Decision Procedures book. (Make sure to get the 2nd edition, as it has good updates!) It covers almost all topics of interest to a certain depth, and has many references at the back for you to follow up on. The book also has a website, http://www.decision-procedures.org, with extra readings, slides, and project ideas.
Another book of interest in this field is Bradley and Manna's The Calculus of Computation. While this book isn't specific to SAT/SMT, it covers many of the similar topics and how these ideas play out in the realm of program verification. Also see http://theory.stanford.edu/~arbrad/pivc/index.html for the associated software/tools.
Of course, neither of these books is specific to z3, so you won't find anything detailed about how z3 itself is constructed in them. For programming z3 and some of the theory behind it, the "tutorial" paper by Bjørner, de Moura, Nachmanson, and Wintersteiger is a great read.
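If it helps to have something concrete to trace by hand, here is a tiny sketch using Z3's Python bindings (the z3-solver package); the claimed property is just an invented example, not anything from the references above. The counterexample Z3 reports is simply a model of the negated claim:

```python
from z3 import Ints, Solver, Implies, Not, sat

x, y = Ints("x y")

# An intentionally incorrect claim: "if x + y > 10 then x > 5".
claim = Implies(x + y > 10, x > 5)

s = Solver()
s.add(Not(claim))  # ask Z3 whether the claim can be violated

if s.check() == sat:
    # The model of the negated claim is the counterexample, e.g. x = 0, y = 11.
    print("Counterexample:", s.model())
else:
    print("The claim holds for all integers.")
```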
Once you go through these, I suggest reading individual papers by the developers, depending on where your interests are:
Bjørner: https://www.microsoft.com/en-us/research/people/nbjorner/publications/
de Moura: https://www.microsoft.com/en-us/research/people/leonardo/publications/
Wintersteiger: https://www.microsoft.com/en-us/research/people/cwinter/publications/
Nachmanson: https://www.microsoft.com/en-us/research/people/levnach/publications/
And there is of course a plethora of resources on the internet: many papers, presentations, slide decks, etc. Feel free to ask specific questions directly in this forum, or, for truly z3-internal questions, you can use their discussions forum.
Note: Regarding the differences between the editions of Kroening and Strichman's book, here's what the authors have to say:
The first edition of this book was adopted as a textbook in courses worldwide. It was published in 2008 and the field now called SMT was then in its infancy, without the standard terminology and canonic algorithms it has now; this second edition reflects these changes. It brings forward the DPLL(T) framework. It also expands the SAT chapter with modern SAT heuristics, and includes a new section about incremental satisfiability, and the related Constraints Satisfaction Problem (CSP). The chapter about quantifiers was expanded with a new section about general quantification using E-matching and a section about Effectively Propositional Reasoning (EPR). The book also includes a new chapter on the application of SMT in industrial software engineering and in computational biology, coauthored by Nikolaj Bjørner and Leonardo de Moura, and Hillel Kugler, respectively.

Does it make sense to interrogate structured data using NLP?

I know that this question may not be suitable for SO, but please let this question be here for a while. Last time my question was moved to cross-validated, it froze; no more views or feedback.
I came across a question that does not make much sense to me: how can IFC models be interrogated via NLP? Consider IFC models as semantically rich structured data. IFC defines an EXPRESS-based entity-relationship model consisting of entities organized into an object-based inheritance hierarchy. Examples of entities include building elements, geometry, and basic constructs.
How could NLP be used for such type of data? I don't see NLP relevant at all.
In general, I would suggest that using NLP techniques to "interrogate" already (quite formally) structured data like EXPRESS would be overkill at best and a time / maintenance sinkhole at worst. In general, the strengths of NLP (human language ambiguity resolution, coreference resolution, text summarization, textual entailment, etc.) are wholly unnecessary when you already have such an unambiguous encoding as this. If anything, you could imagine translating this schema directly into a Prolog application for direct logic queries, etc. (which is quite a different direction than NLP).
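To make the contrast concrete, here is a minimal sketch of a direct structured query against an IFC model, assuming the open-source IfcOpenShell Python bindings and a local model.ifc file (both are my assumptions, not something from the question):

```python
import ifcopenshell  # open-source IFC toolkit with Python bindings

# Load an IFC (EXPRESS-based) model and query it directly by entity type.
model = ifcopenshell.open("model.ifc")  # hypothetical file name

# Every IfcWall instance, with its GlobalId and Name attributes.
for wall in model.by_type("IfcWall"):
    print(wall.GlobalId, wall.Name)

# There is no ambiguity to resolve: the schema already defines exactly what
# an IfcWall is and which attributes it carries, so no NLP layer is needed.
```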
I did some searches to try to find the references you may have been referring to. The only item I found was Extending Building Information Models Semiautomatically Using Semantic Natural Language Processing Techniques:
... the authors propose a new method for extending the IFC schema to incorporate CC-related information, in an objective and semiautomated manner. The method utilizes semantic natural language processing techniques and machine learning techniques to extract concepts from documents that are related to CC [compliance checking] (e.g., building codes) and match the extracted concepts to concepts in the IFC class hierarchy.
So in this example, at least, the authors are not "interrogating" the IFC schema with NLP, but rather using it to augment existing schemas with additional information extracted from human-readable text. This makes much more sense. If you want to post the actual URL or reference that contains the "NLP interrogation" phrase, I should be able to comment more specifically.
Edit:
The project grant abstract you referenced does not contain much in the way of details, but they have this sentence:
... The information embedded in the parametric 3D model is intended for facility or workplace management using appropriate software. However, this information also has the potential, when combined with IoT sensors and cognitive computing, to be utilised by healthcare professionals in Ambient Assisted Living (AAL) environments. This project will examine how as-constructed BIM models of healthcare facilities can be interrogated via natural language processing to support AAL. ...
I can only speculate on the following reason for possibly using an NLP framework for this purpose:
While BIM models include Industry Foundation Classes (IFCs) and aecXML, there are many dozens of other formats, many of them proprietary. Some are CAD-integrated and others are standalone. Rather than pay for many proprietary licenses (some of these enterprise products are quite expensive), and/or spend the time to develop proper structured-query behavior for the various file format specifications (which may not be publicly available in proprietary cases), the authors have chosen a more automated, general solution to extract the content they are looking for (which I assume must be text or textual tags in nearly all cases). This would be almost akin to a search engine "scraping" websites and looking for key words or phrases and their synonyms.
The upside is that they don't have to explicitly code against all the different possible BIM file formats to get good coverage, nor pay out large sums of money. The downside is that they open up new issues and considerations that come with NLP, including training, validation, supervision, etc. And NLP will never have the same level of accuracy you could obtain from a true structured query against a known schema.

How to categorize urls using machine learning?

I'm indexing websites' content and I want to implement some categorization based solely on the URLs.
I would like to tell apart content view pages from navigation pages.
By 'content view pages' I mean webpages where one can typically see the details of a product or a written article.
By 'navigation pages' I mean pages that (typically) consist of lists of links to content pages or to other more specific list pages.
Although some sites use a site-wide key system to map their content, most sites do it bit by bit and scope their key mapping, so this should be possible.
In practice, what I want to do is take the list of urls from a site and group them by similarity. I believe this can be done with machine learning, but I have no idea how.
Machine learning appears to be a broad topic; what should I start reading about in particular?
Which concepts, which algorithms, which tools?
If you want to discover these groups automatically, I suggest you find yourself an implementation of a clustering algorithm (K-Means is probably the most popular; you don't say what language you want to do this in). You know there are two categories, so something that allows you to specify the number of categories a priori will make the problem easier.
After that, define a bunch of features for your webpages, and run them through k-means to see what kind of groups are produced. Tweak the features you use until you get something that looks satisfactory. If you have access to the webpages themselves, I'd strongly recommend using features defined over the whole page, rather than just the URLs.
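As a sketch of that pipeline (Python with scikit-learn, using character n-grams over the URLs as features since that's all you say you have; the URLs below are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Made-up URLs standing in for the list crawled from one site.
urls = [
    "https://example.com/products/1234-blue-widget",
    "https://example.com/products/5678-red-widget",
    "https://example.com/category/widgets?page=1",
    "https://example.com/category/widgets?page=2",
]

# Character n-grams capture structural patterns such as "/products/<id>-"
# versus "/category/...?page=" without any hand-written rules.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))
X = vectorizer.fit_transform(urls)

# Two clusters, hoping they line up with "content view" vs "navigation" pages.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

for url, label in zip(urls, labels):
    print(label, url)
```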
You first need to collect a dataset of navigation/content pages and label them. After that it's quite straightforward.
What language will you be using? I'd suggest you try Weka, which is a Java-based tool in which you can simply press a button and get back performance measures for 50-odd algorithms. After that you will know which is the most accurate and can deploy that.
I feel like you are trying to classify authorities and hubs, as in the HITS algorithm.
Hub is your navigation page;
Authority is your content view page.
By doing a link analysis of every web page in a domain and performing HITS on all of them, you should be able to find out the type of each page: the link relations between webpages determine the hub/authority scores, and the scores after running HITS separate the two kinds of page. HITS does not need any labels to start. The update rule is simple: basically just one update for the authority scores and another update for the hub scores.
Here is a tutorial discussing PageRank/HITS.
Here is an extended version of HITS that combines HITS with information retrieval methods (TF-IDF, the vector space model, etc.). It looks much more promising, but it certainly needs more work. I suggest you start with naive HITS and see how good it is; on top of that, try some of the techniques mentioned in BHITS to improve your performance.
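If you want to try the naive HITS suggestion quickly, NetworkX ships an implementation. A minimal sketch, assuming you have already crawled the intra-site links (the edge list below is invented for illustration):

```python
import networkx as nx

# Directed link graph of one domain: an edge points from the linking page
# to the linked page. The paths here are invented for illustration.
G = nx.DiGraph([
    ("/index", "/article/1"),
    ("/index", "/article/2"),
    ("/index", "/article/3"),
    ("/tag/python", "/article/1"),
    ("/tag/python", "/article/3"),
])

hubs, authorities = nx.hits(G, normalized=True)

# High hub score -> navigation/listing page; high authority score -> content page.
for page in G.nodes:
    print(f"{page:15s} hub={hubs[page]:.3f} auth={authorities[page]:.3f}")
```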

How can I select a FAQ entry from a user's natural-language inquiry?

I am working on an app where the user submits a series of questions. These questions are freeform text, but are based on a specific product, so I have a general understanding of the context. I have a FAQ listing, and I need to try to match the user's question to a question in the FAQ.
My language is Delphi. My general thought approach is to throw out small "garbage words" (a, an, the, is, of, by, etc.), run a stemming program over the remaining words to get their root forms, and then try to match as many of those root words as possible.
Is there a better approach? I have thought about some type of natural language processing, but I am afraid that I would be looking at years of development, rather than a week or two.
You don't need to invent a new way of doing this. It's all been done before. What you need is called a FAQ finder, introduced by Hammond, et al in 1995 (FAQ finder: a case-based approach to knowledge navigation, 11th Conference on Artificial Intelligence for Applications).
AI Magazine included a paper by some of the same authors as the first paper that evaluated their implementation. Burke, et al, Question Answering from Frequently Asked Question Files: Experiences with the FAQ FINDER System, 1997. It describes two stages for how it works:
First, they use Smart, an information-retrieval system, to generate an initial set of candidate questions based on the user's input. It looks like it works similarly to what you described, stemming all the words and omitting anything on the stop list of short words.
Next, the candidates are scored against the user's query according to statistical similarity, semantic similarity, and coverage. (Read the paper for details.) Scoring semantic similarity relies on WordNet, which groups English words into sets of distinct concepts. The FAQ finder reviewed here was designed to cover all Usenet FAQs; since your covered domain is smaller, it might be feasible for you to apply more domain knowledge than the basics that WordNet provides.
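If a full FAQ FINDER-style system is more than a week or two of work allows, the retrieval stage alone gets you surprisingly far. Here is a minimal sketch of it in Python with scikit-learn rather than Delphi, just to show the shape of the computation (the FAQ entries are placeholders, and the built-in English stop list stands in for your "garbage words"):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder FAQ entries; in practice, load them from your FAQ listing.
faq_questions = [
    "How do I reset my password?",
    "How do I export my data to CSV?",
    "Why does the installer fail on Windows?",
]

# TF-IDF with English stop words removed; a stemmer can be plugged in via a
# custom tokenizer= argument if you want stemming as well.
vectorizer = TfidfVectorizer(stop_words="english")
faq_matrix = vectorizer.fit_transform(faq_questions)

def best_matches(user_question, top_n=3):
    """Rank FAQ entries by cosine similarity to the user's question."""
    query_vec = vectorizer.transform([user_question])
    scores = cosine_similarity(query_vec, faq_matrix)[0]
    ranked = sorted(zip(scores, faq_questions), reverse=True)
    return ranked[:top_n]

print(best_matches("I forgot the password for my account"))
```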
Not sure if this solution is precisely what you're looking for, but if you're looking to parse natural language, you could use the Link-Grammar Parser.
Thankfully, I've translated this for use with Delphi (complete with a demo), which you can download (free and 100% open source) from this page on my blog.
In addition to your stemming approach, I suggest that you are going to need to look into one or more of the following:
Recognize important pairs or phrases (2 or more words). For example, if your domain is a technical field, certain word pairs should automatically be considered as a pair instead of individual words, because the pair means something special (in programming, "linked list", "serial port", etc. carry more meaning as a pair of words than individually).
A large list of synonyms ("turn == rotate", "open == access", etc.).
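For the synonym list, WordNet (the same resource FAQ FINDER leans on) can serve as a starting point instead of hand-building it. A small sketch using NLTK, which requires a one-time download of the WordNet data:

```python
import nltk
from nltk.corpus import wordnet

# One-time download of the WordNet data (and the multilingual index that
# some NLTK versions expect alongside it).
nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

def synonyms(word):
    """Collect lemma names from every WordNet synset that contains `word`."""
    names = set()
    for synset in wordnet.synsets(word):
        for lemma in synset.lemmas():
            names.add(lemma.name().replace("_", " "))
    return names

print(synonyms("turn"))  # prints all lemma names sharing a synset with "turn"
print(synonyms("open"))
```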
I would be tempted to tear apart "search engine" open source software in whatever language it was written in, and see what general techniques they use.

Tool to parse text for possible Wikipedia links

Does a tool exist that can parse text and output that text, hyper-linked to Wikipedia entries for words of interest?
For example, I'd like a tool that could turn something like:
The most popular search algorithm on a sorted list is the binary search.
Into:
The most popular search algorithm on a sorted list is the binary search.
It would be wonderful if Wikipedia had an API which would do this, since they would be best equipped to determine what "words of interest" are.
In my example I simply linked all combinations which linked directly to an entry, except for "The" and "most".
There is a tool that does exactly what you're asking for.
http://wikify.appointment.at/
It's not perfect, but it works.
You have two separate problems to solve here:
Deciding which words should be linked
Determining if there's a suitable entry to link these words to
Now, (2) is simpler, though it's also somewhat problematic. Wikipedia seems to have an API that allows you to gather data efficiently, and they also allow "screen scraping". But there's a problem with disambiguation - sometimes you might not hit the entry you wanted. For example, python links to a disambiguation page, as it can be a programming language, a snake, and a couple of other things.
(1) is much harder, though. You can take the "simple approach" and attempt to find links for all non-trivial nouns (or even noun/adjective pairs). Non-trivial here means omitting words like "friend", "word", "computer", etc.
But this would result in a plethora of links, which isn't convenient to read. It's really up to you to decide what's interesting in the text, and this depends a lot on the text itself. In an article for professional programmers, do you really want to link to "search algorithm" every time? But for beginners, perhaps you do.
To conclude, I strongly doubt there's a single general-purpose tool that will do the trick for you. But you surely have all the options at hand, and something need-specific can be coded without too much effort.
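For part (2), the MediaWiki API makes the "is there a suitable entry?" check easy, though it does nothing about the disambiguation problem mentioned above. A small Python sketch using the requests library (the candidate phrases are just examples):

```python
import requests

API = "https://en.wikipedia.org/w/api.php"

def wikipedia_page_exists(title):
    """Ask the MediaWiki API whether an article with this title exists."""
    params = {
        "action": "query",
        "titles": title,
        "redirects": 1,   # follow redirects, e.g. alternative capitalizations
        "format": "json",
    }
    data = requests.get(API, params=params).json()
    pages = data["query"]["pages"]
    # Missing titles come back with a "missing" key (and a negative page id).
    return all("missing" not in page for page in pages.values())

# Deciding which phrases to test is problem (1), which this does not solve.
for phrase in ["binary search", "search algorithm", "the most popular"]:
    print(phrase, "->", wikipedia_page_exists(phrase))
```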
Silviu Cucerzan of Microsoft Research tackled this problem. Well, not the problem of inserting the links, but the general issue of determining which entities are being mentioned in some piece of text. Fortunately for you, he used Wikipedia articles as his set of entities. His paper, "Large-Scale Named Entity Disambiguation Based on Wikipedia Data", is available on his website.
