Parsing html data with nutch 1.0 and a custom plugin - html-parsing

I am currently trying to write a custom plugin for nutch 1.0. This plugin is supposed to parse html data and filter out relevant information from documents. I have a basic plugin working; it extends the HtmlParserResult object and is executed each time I do a parse.
My problems are twofold at the moment:
I do not understand the workflow/pipeline of nutch parsing well enough, and I cannot find information about it on the nutch site.
I do not understand how the DOM parsing is done. I see that Nutch has a set of DOM objects and that the HtmlParser plugin does some DOM parsing, but I have not figured out how this is best done.

I remember making a nutch HTML parsing plugin at a previous job. I don't have access to how I did it exactly, but here are the basic points. We wanted to do the following:
parse an HTML page but conditionally use an H1 tag or a tag with a certain class as the page title rather than the actual //html/head/title
extract some special pieces of data that were sometimes on the page (i.e. which tab was selected, which would tell us whether this was a retail customer, a bank customer, or a corporate customer).
etc.
What I did was just find the html-parse plugin class (I'm having trouble finding the actual class name), and extend it. Then override the parsing function. The new function should call the super function and then can walk the DOM tree to find the special data you are looking for. In my case I'd look for a better title and then override the value that the super function came up with.
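The title-override idea can be sketched outside of Nutch. The following is a minimal Python illustration, not Nutch code and not the plugin described above: it walks the parsed tags and prefers the first H1 (or the first element carrying a marker class) over the head title. The class name "page-title" and the function name pick_title are hypothetical, and the sketch does not handle a marker element that nests another element of the same tag.

```python
from html.parser import HTMLParser


class TitlePicker(HTMLParser):
    """Collects the <title> text and the first <h1> or marker-class element."""

    def __init__(self, title_class="page-title"):  # class name is hypothetical
        super().__init__()
        self.title_class = title_class
        self.head_title = ""
        self.better_title = ""
        self._target = None        # attribute that currently receives text
        self._open_tag = None      # tag that opened the "better" title
        self._better_done = False  # only the first candidate is used

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "title":
            self._target = "head_title"
        elif self._target is None and not self._better_done and (
            tag == "h1" or self.title_class in classes
        ):
            self._target = "better_title"
            self._open_tag = tag

    def handle_endtag(self, tag):
        if self._target == "head_title" and tag == "title":
            self._target = None
        elif self._target == "better_title" and tag == self._open_tag:
            self._target = None
            self._better_done = True

    def handle_data(self, data):
        if self._target:
            setattr(self, self._target, getattr(self, self._target) + data)


def pick_title(html):
    """Return the preferred title, falling back to the <title> element."""
    picker = TitlePicker()
    picker.feed(html)
    picker.close()
    return picker.better_title.strip() or picker.head_title.strip()
```

In a real plugin the same decision would be made after the stock parser has run, replacing the title it produced only when a better candidate was found.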
For your second question, I'm not clear what you are asking about. I think you are asking what happens when the DOM isn't well formed? I would just dig through the nutch code (http://grepcode.com/snapshot/repo1.maven.org/maven2/org.apache.nutch/nutch/1.3/) and find out how the parsing is done (I'm sure they use a library to do it). That should tell you more about whether the parsing is greedy, and so on.
Holler if you have questions.

Related

How to scrape different URLs from database with Nokogiri with different requirements

I tried using Feedjira to assist with content analysis of news feeds, but it appears that RSS feeds now only link to content rather than including it in the feed, as I found out in "Feedjira not adding content and author". I plan to use Feedjira to get the URL of the article, and then use Nokogiri to scrape the article and pick out the relevant parts.
The problem is that each media outlet formats its pages differently. I need to know the best way for Nokogiri to take the URL from the database (supplied by Feedjira) and, depending on the associated feed title (also in the database from the Feedjira sync), scrape the page in a specific way and save the result to a separate table in the database. Does anyone have any suggestions?
I don't know your special use case but I'm also doing content analysis using news feeds.
You might have a look at Readability, which provides a generic content scraper.
The problem you've encountered is that every feed generator does it a bit differently, just as with HTML generators. You can assume certain fields are going to be in place in an RDF, RSS or ATOM feed, however the author of the feed could use optional tags that you could find very useful, so you have to write code to look for them.
I wrote several feed aggregators in the past, including one that handled well over 1000 feeds daily. By sniffing out the feed type (ATOM vs. RSS vs. RDF), I could make sensible checks for the fields that were interesting in that format, and extract the data if it was available.
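As a rough illustration of that sniffing step, here is a minimal Python sketch (not the aggregator code mentioned above). It decides the feed type from the root element and then probes an RSS item for optional tags. The function names are made up for this example; the namespace URIs are the standard Atom, RDF, Dublin Core, and RSS content-module ones.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"
CONTENT_NS = "http://purl.org/rss/1.0/modules/content/"


def sniff_feed_type(root):
    """Classify a parsed feed by its root element."""
    if root.tag == "rss":
        return "rss"
    if root.tag == f"{{{ATOM_NS}}}feed":
        return "atom"
    if root.tag == f"{{{RDF_NS}}}RDF":
        return "rdf"
    return "unknown"


def rss_item_extras(item):
    """Probe an RSS <item> for optional tags the feed author may have used."""
    return {
        "author": item.findtext("author") or item.findtext(f"{{{DC_NS}}}creator"),
        "content": item.findtext(f"{{{CONTENT_NS}}}encoded")
        or item.findtext("description"),
    }


def first_rss_item(xml_text):
    """Return (feed type, extras of the first item) for an RSS document."""
    root = ET.fromstring(xml_text)
    kind = sniff_feed_type(root)
    item = root.find("channel/item") if kind == "rss" else None
    return kind, (rss_item_extras(item) if item is not None else None)
```

Atom and RDF feeds would each get their own probing function along the same lines, checking the fields that make sense for that format.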
Pre-canned parsers get it wrong too often, either grabbing data you don't want and making a mess of the output, or skipping data you do want leaving gaps in the output, so be prepared to write code if you want it done correctly.
You'll probably want to take advantage of a backing database too, to keep track of what you looked at last and when you're supposed to look at it again; that's part of being a good network citizen. You'll also want to keep track of whether a feed was down the last n times you looked, so you can trim out dead sites.
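A minimal sketch of such a tracking table, using Python's built-in sqlite3 module. The table layout, the function names, and the fixed one-hour recheck interval are assumptions made for illustration, not a description of any particular aggregator.

```python
import sqlite3
import time


def open_tracker(path=":memory:"):
    """Open (or create) the tracking database."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS feeds (
               url TEXT PRIMARY KEY,
               last_checked REAL,
               next_check REAL,
               consecutive_failures INTEGER NOT NULL DEFAULT 0)"""
    )
    return db


def record_check(db, url, ok, now=None, interval=3600):
    """Record one fetch attempt and schedule the next one."""
    now = time.time() if now is None else now
    db.execute("INSERT OR IGNORE INTO feeds (url) VALUES (?)", (url,))
    if ok:
        # A successful fetch resets the failure counter.
        db.execute(
            "UPDATE feeds SET last_checked = ?, next_check = ?, "
            "consecutive_failures = 0 WHERE url = ?",
            (now, now + interval, url),
        )
    else:
        db.execute(
            "UPDATE feeds SET last_checked = ?, next_check = ?, "
            "consecutive_failures = consecutive_failures + 1 WHERE url = ?",
            (now, now + interval, url),
        )
    db.commit()


def due_feeds(db, now):
    """Feeds that have never been checked or whose next check time has passed."""
    rows = db.execute(
        "SELECT url FROM feeds WHERE next_check IS NULL OR next_check <= ? "
        "ORDER BY url",
        (now,),
    )
    return [row[0] for row in rows]


def dead_feeds(db, n=5):
    """Feeds that were down the last n times they were checked."""
    rows = db.execute(
        "SELECT url FROM feeds WHERE consecutive_failures >= ? ORDER BY url", (n,)
    )
    return [row[0] for row in rows]
```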

Website fetch, using NSURLSession and changing INPUT Field value on this site

I want to fetch the content of a website. But to get the correct content, it is necessary to change the value of an HTML scroll input field on the site.
Any idea how to manage this with Xcode?
Thanks a lot!!
Lars
If you want to retrieve the HTML that you get after filling in an HTML form, you have to identify precisely what the series of requests looks like to fetch the data. And be careful, because it's often not as simple as just looking at the request that the HTML in question generates: unfortunately, it is sometimes a complex series of requests (e.g. the request that retrieves the original HTML often also quietly delivers some critical hidden form fields and/or cookies).
Bottom line, to reverse engineer the required HTTP requests, you often have to pore through HTML code and/or watch the requests with something like Charles. It often takes quite a bit of time to do this with complicated sites.
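To make the hidden-field point concrete, here is a minimal Python sketch of the general pattern (on iOS the same steps would be carried out with NSURLSession). It collects the input fields from the form page, overrides the one value you want to change, and builds the POST body; a cookie-aware opener keeps any session cookies from the first request. The field names in the usage lines are hypothetical, and real sites may need extra headers or several round trips.

```python
import http.cookiejar
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urlencode


class FormFieldCollector(HTMLParser):
    """Collects name/value pairs of every <input> in the page, hidden ones included."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        if attributes.get("name"):
            self.fields[attributes["name"]] = attributes.get("value") or ""


def build_form_body(form_html, overrides):
    """Merge the page's own fields with the values we want to submit."""
    collector = FormFieldCollector()
    collector.feed(form_html)
    collector.close()
    fields = dict(collector.fields)
    fields.update(overrides)
    return urlencode(fields)


def make_session():
    """An opener that remembers cookies between the GET and the POST."""
    jar = http.cookiejar.CookieJar()
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
```

Typical use would be to fetch the form page with `session = make_session()` and `session.open(url)`, pass the decoded HTML to `build_form_body(html, {"region": "eu"})`, and then send the result with `session.open(url, data=body.encode())`, which issues a POST.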
Before you invest a lot of time here, though, you should first see if the web site provider's Terms of Service permit such usage. They often strictly prohibit this sort of practice. It's much better to contact the web site provider and see if they provide a web service to retrieve the data. That's far easier and will result in a far more robust interface for your app.
But if you're forced to parse HTML programmatically, I'd refer you to How to Parse HTML on iOS on Ray Wenderlich's site.

Prevent XSS attacks and still use Html.Raw

I have a CMS in which I use CKEditor to enter data. If a user types <script>alert('This is a bad script, data');</script>, CKEditor does a fair job: it encodes the input correctly and passes &lt;script&gt;alert('This is a bad script, data')&lt;/script&gt; to the server.
But if the user goes into the browser developer tools (using Inspect element) and adds the script there, as shown in the screenshot below, that is when all the trouble starts. When the content is retrieved from the DB and displayed in the browser, it presents an alert box.
So far I have tried many different things. One of them is:
Encode the contents using AntiXssEncoder [HttpUtility.HtmlEncode(Contents)], store them in the database, and when displaying them in the browser decode and render them using MvcHtmlString.Create [MvcHtmlString.Create(HttpUtility.HtmlDecode(Contents))] or Html.Raw [Html.Raw(Contents)]. As you may expect, both of them display the JavaScript alert.
I don't want to replace the <script> manually thru code as it is not comprehensive solution (search for "And the encoded state:").
So far I have consulted many articles (sorry for not listing them all here; I am adding a few as proof that I made a sincere effort before writing this question), but none of them has code which shows the answer. Maybe there is some easy answer and I am not looking in the right direction, or maybe it is not that simple at all and I may need to use something like Content Security Policy.
ASP.Net MVC Html.Raw with AntiXSS protection
Is there a risk in using @Html.Raw?
http://blog.simontimms.com/2013/01/21/content-security-policy-for-asp-net-mvc/
http://blog.michaelckennedy.net/2012/10/15/understanding-text-encoding-in-asp-net-mvc/
To reproduce what I am saying go to *this url and in the text box type <script>alert('This is a bad script, data');</script> and click the button.
*This link is from Michael Kennedy's blog
It isn't easy and you probably don't want to do this. May I suggest you use a simpler language than HTML for end-user formatted input? What about Markdown, which (I believe) is used by Stack Overflow? Or one of the existing wiki or other lightweight markup languages?
If you do allow Html, I would suggest the following:
only support a fixed subset of Html
after the user submits content, parse the Html and filter it against a whitelist of allowed tags and attributes.
be ruthless in filtering and eliminating anything that you aren't sure about.
There are existing tools and libraries that do this. I haven't used it, but I did stumble on http://htmlpurifier.org/. I assume there are many others. Rick Strahl has posted one example for .NET, but I'm not sure if it is complete.
About ten years ago I attempted to write my own whitelist filter. It parsed and normalized the entered Html. Then it removed any elements or attributes that were not on the allowed whitelist. It worked pretty well, but you never know what vulnerabilities you've missed. That project is long dead, but if I had to do it over I would have used an existing simpler markup language rather than Html.
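To show the shape of such a whitelist filter, here is a deliberately small Python sketch. It is not the filter described above and it is not production-grade: the allowed tags, attributes, and URL schemes are arbitrary choices for illustration, relative links are dropped along with unsafe ones, and the output is not guaranteed to be balanced markup. For real untrusted input a vetted library is the safer route.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Illustrative allow-lists; anything not named here is removed.
ALLOWED_TAGS = {"p", "br", "b", "i", "em", "strong", "ul", "ol", "li", "a"}
ALLOWED_ATTRS = {"a": {"href"}}
ALLOWED_SCHEMES = {"http", "https", "mailto"}
DROP_WITH_CONTENT = {"script", "style"}  # remove these tags and their text
VOID_TAGS = {"br"}


def _href_is_allowed(value):
    """Accept only URLs with an explicitly allowed scheme."""
    try:
        return urlsplit(value.strip()).scheme.lower() in ALLOWED_SCHEMES
    except ValueError:
        return False


class WhitelistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self._skip_depth += 1
            return
        if self._skip_depth or tag not in ALLOWED_TAGS:
            return
        kept = []
        for name, value in attrs:
            if name not in ALLOWED_ATTRS.get(tag, ()):
                continue  # drops onclick, style, and everything else unlisted
            if name == "href" and not _href_is_allowed(value or ""):
                continue
            kept.append(f' {name}="{escape(value or "", quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif not self._skip_depth and tag in ALLOWED_TAGS and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self._skip_depth:
            self.out.append(escape(data, quote=False))  # text is always re-encoded


def sanitize(html):
    parser = WhitelistSanitizer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

The design choice worth noting is that the filter rebuilds the output from what it explicitly recognises, rather than trying to find and delete bad patterns in the input.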
There are so many ways for users to inject nasty stuff into your pages, you have to be fierce to prevent this. Even CSS can be used to inject executable expressions into your page, like:
<STYLE type="text/css">BODY{background:url("javascript:alert('XSS')")}</STYLE>
Here is a page with a list of known attacks that will keep you up at night. If you can't filter and prevent all of these, you aren't ready for untrusted users to post formatted content viewable by the public.
Right around the time I was working on my own filter, MySpace (wow I'm old) was hit by an XSS Worm known as Samy. Samy used Style attributes with embedded background Url that had a javascript payload. It is all explained by the author.
Note that your example page says:
This page is meant to accept and display raw HTML by trusted
editors.
The key issue here is trust. If all of your users are trusted (say employees of a web site), then the risk here is lower. However, if you are building a forum or social network or dating site or anything that allows untrusted users to enter formatted content that will be viewable by others, you have a difficult job to sanitize Html.
I managed to resolve this issue using the HtmlSanitizer in NuGet:
https://github.com/mganss/HtmlSanitizer
as recommended by the OWASP Foundation (as good a recommendation as I need):
https://www.owasp.org/index.php/XSS_(Cross_Site_Scripting)_Prevention_Cheat_Sheet#RULE_.236_-_Sanitize_HTML_Markup_with_a_Library_Designed_for_the_Job
First, add the NuGet Package:
> Install-Package HtmlSanitizer
Then I created an extension method to simplify things:
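using Ganss.XSS;
...
// Extension method: strips scripts and other unsafe markup from user-supplied HTML.
public static string RemoveHtmlXss(this string htmlIn, string baseUrl = null)
{
    if (htmlIn == null) return null;
    var sanitizer = new HtmlSanitizer(); // uses the library's default allow-lists
    return sanitizer.Sanitize(htmlIn, baseUrl);
}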
I then validate within the controller when the HTML is posted:
var cleanHtml = model.DodgyHtml.RemoveHtmlXss();
And for completeness, sanitise whenever you present it to the page, especially when using Html.Raw():
<div>@Html.Raw(Model.NotSoSureHtml.RemoveHtmlXss())</div>

How to determine event options?

I have a simple question that led to a larger question. I have an OverlayPanel and have to guess what the "hideEvent" option is. The documentation doesn't give that detail. I downloaded the source and have been searching through it, but so far I have been unable to find any list.
Where can I find a list of what the available options are for a primefaces argument when it's not in the documentation?
It's any standard HTML DOM event, just part of basic HTML. You know, JSF is merely an HTML code generator.

best way to extract info from the web delphi

I want to know if there is a better way of extracting info from a web page than parsing the HTML for what I'm searching for, e.g. extracting a movie rating from 'imdb.com'.
I'm currently using the IndyHttp components to get the page and strUtils to parse the text, but what I can extract this way is limited.
I have found plain, simple regexes to be highly intuitive when dealing with good web sites, and IMDB is a good web site.
For example the movie rating on the IMDB's movie HTML page is in a <DIV> with class="star-box-giga-star". That's VERY easy to extract using a regular expression. The following regular expression will extract the movie rating from the raw HTML into capture group 1:
star-box-giga-star[^>]*>([^<]*)<
It's not pretty, but it does the job. The regex looks for the "star-box-giga-star" class name, then for the > that terminates the DIV's opening tag, and then captures everything up to the following <. To create a new regex like this you should use a web browser that allows inspecting elements (for example Chrome or Opera). With Chrome you can simply look at the web page, right-click on the element you want to capture and choose Inspect element, then look around for easily identifiable elements that can be used to create a good regex. In this case the "star-box-giga-star" class is obviously easily identifiable! You'll usually have no problem finding such identifiable elements on good web sites, because good web sites use CSS, and CSS requires IDs or classes to be able to style the elements properly.
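For readers who want to try the expression quickly, here is the same regex applied in Python (in Delphi you would use whichever regex library you prefer). The sample markup in the usage line is invented to resemble the structure described; the real page may differ and may change at any time.

```python
import re

# The regex from the answer: capture the text between the tag that carries
# the "star-box-giga-star" class and the next "<".
RATING_RE = re.compile(r'star-box-giga-star[^>]*>([^<]*)<')


def extract_rating(html):
    """Return the captured rating text, or None when the class is not found."""
    match = RATING_RE.search(html)
    return match.group(1).strip() if match else None
```

For example, `extract_rating('<div class="titlePageSprite star-box-giga-star"> 8.1 </div>')` returns the string `8.1`.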
Processing an RSS feed is more convenient.
As of the time of posting, the only RSS feeds available on the site are:
Born on this Date
Died on this Date
Daily Poll
However, you can ask for a new one to be added by getting in touch with the help desk.
Resources on RSS feed processing:
Relevant post here on SO.
Super Object
Wikipedia.
When scraping websites, you cannot rely on the availability of the information. IMDB may detect your scraping and attempt to block you, or they may frequently change the format to make it more difficult.
Therefore, you should always try to use a supported API or RSS feed, or at least get permission from the web site to aggregate their data, and ensure that you're abiding by their terms. Often, you will have to pay for this type of access. Scraping a website without permission may open you up to liability on a couple of legal fronts (Denial of Service and Intellectual Property).
Here's IMDB's statement:
You may not use data mining, robots, screen scraping, or similar
online data gathering and extraction tools on our website.
To answer your question, the better way is to use the method provided by the website. For non-commercial use, and if you abide by their terms, you can download the IMDB database directly and use the data from there instead of scraping their site. Simply update your database frequently, and it's a better solution than scraping the site. You could even wrap your own web API around it. Ratings are available as a standalone table.
Use HTML Tidy to convert any HTML to valid XML and then use an XML parser, maybe using XPATH or developing your own code (which is what I do).
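As a small illustration of the second half of that approach, the Python sketch below assumes the page has already been converted to well-formed XML (for example by HTML Tidy, which is not run here) and then uses the limited XPath support in the standard library. The function name and the sample element are hypothetical; note that Tidy's XHTML output may carry a namespace, in which case the tag names in the path need the namespace prefix.

```python
import xml.etree.ElementTree as ET


def find_text_by_class(xhtml, tag, css_class):
    """Return the text of the first <tag class="css_class"> element, or None."""
    root = ET.fromstring(xhtml)
    # ElementTree understands a subset of XPath, including attribute predicates.
    node = root.find(f".//{tag}[@class='{css_class}']")
    if node is None or node.text is None:
        return None
    return node.text.strip()
```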
All the answers posted cover your generic question well. I usually follow a strategy similar to the one detailed by Cosmin. I use wininet and regex for most of my web extraction needs.
But let me add my two cents on the specific subquestion of extracting an IMDB rating. IMDBAPI.COM provides a query interface that returns JSON, which is very handy for this type of search.
So a very simple command line program for getting an IMDB rating would be...
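program imdbrating;
{$apptype console}
uses htmlutils; // assumed to be a helper unit that provides HttpGet and UrlEncode

// Returns the value of "parm" from the JSON text h, or 'N/A' if it is absent.
// Naive string slicing: expects a quoted value and stops at the first comma.
function ExtractJsonParm(parm,h:string):string;
var r:integer;
begin
  r:=pos('"'+Parm+'":',h);
  if r<>0 then
    // the value starts length(Parm)+4 characters after the opening quote of the name
    result:=copy(h,r+length(Parm)+4,pos(',',copy(h,r+length(Parm)+4,length(h)))-2)
  else
    result:='N/A';
end;

var h:string;
begin
  // the movie title is taken from the first command line argument
  h:=HttpGet('http://www.imdbapi.com/?t=' + UrlEncode(ParamStr(1)));
  writeln(ExtractJsonParm('Rating',h));
end.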
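The string slicing above stops at the first comma, so a value that itself contains a comma would be cut short. For comparison, here is a minimal Python sketch that reads the same kind of response with a real JSON parser. The function name mirrors the Delphi one for readability, and the sample payload in the usage line is invented; only the "Rating" field name comes from the program above.

```python
import json


def extract_json_param(name, payload, default="N/A"):
    """Return the named top-level value from a JSON object, or a default."""
    try:
        data = json.loads(payload)
    except ValueError:  # not valid JSON at all
        return default
    if not isinstance(data, dict):
        return default
    return str(data.get(name, default))
```

With the payload `{"Title":"Example","Rating":"8.1","Votes":"1,234"}`, asking for `Rating` yields `8.1` and asking for `Votes` yields `1,234` intact.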
If the page you are crawling is valid XML, I use SimpleXML to extract the information. It works pretty well.
Resource:
Download link.

Resources