Apache Nutch title parsing issue for Language specific websites - parsing

I have configured Apache Nutch 2.3.1 with Hadoop 2.7.5 and HBase 0.98, and I have to crawl some Urdu websites. I am using the default parsers, i.e. html and tika. Some documents have a title in Urdu, which is fine, but others have a title in Urdu while their heading 1 (h1) holds the original title, e.g. bbc-page. Similarly, there are cases where the meta tags hold the relevant title. Is there a built-in option (parser) that handles this, so that h1 is selected for the title when it is available?
Or, if I have to do it myself, what are the possible ways to do so?

Nutch uses the title tag if one is found in the DOM tree (https://github.com/apache/nutch/blob/bb2a7adddbc5c780151bb9957d68af52be7339ca/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java#L251). To change this, you would need to write custom logic in a parser plugin. But the real question is how you would identify the "bad" title tag. Would it be some specific content (like the URL)?
In any case, you'll need to write your own plugin, either a parser plugin or an indexing plugin (for example, one that takes a field and copies it over to the title field under certain conditions).
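The selection rule such a plugin would apply can be sketched independently of Nutch. The following is only an illustration in Python using the standard library, not Nutch plugin code (a real plugin would be written in Java against Nutch's extension points). The names `TitleAndH1Collector`, `choose_title` and the `is_bad_title` callback are hypothetical, and the "bad title" rule is left to the caller because it depends on the site.

```python
from html.parser import HTMLParser


class TitleAndH1Collector(HTMLParser):
    """Collect the text of the first <title> and the first <h1> in a page."""

    def __init__(self):
        super().__init__()
        self._current = None  # tag currently being captured, if any
        self.found = {}       # tag name -> collected text

    def handle_starttag(self, tag, attrs):
        # Only the first occurrence of each tag is captured.
        if tag in ("title", "h1") and tag not in self.found:
            self._current = tag
            self.found[tag] = ""

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self.found[self._current] += data


def choose_title(html, is_bad_title):
    """Return the <title> text, or the <h1> text when the title is missing
    or rejected by the caller-supplied is_bad_title predicate."""
    collector = TitleAndH1Collector()
    collector.feed(html)
    title = collector.found.get("title", "").strip()
    h1 = collector.found.get("h1", "").strip()
    if h1 and (not title or is_bad_title(title)):
        return h1
    return title


# Hypothetical page and rule: treat the generic site name as a "bad" title.
page = ("<html><head><title>Generic site title</title></head>"
        "<body><h1>Actual article headline</h1></body></html>")
chosen = choose_title(page, lambda t: t == "Generic site title")
```

Keeping the predicate separate from the extraction means the hard part of the problem, deciding which titles are "bad", can be changed per site without touching the parsing.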

Related

How does Scrapy (Open Source Web Scraping Framework) work?

Quoting from the official Scrapy documentation:
Scrapy comes with its own mechanism for extracting data. They’re called selectors because they “select” certain parts of the HTML document specified either by XPath or CSS expressions. Source
After reading this, I'm still not sure whether Scrapy works by directly selecting parts of the HTML document by using XPath/CSS expressions or selecting nodes from DOM Tree which is rendered by the browser?
Still confused whether DOM Parsing and HTML Parsing is the same or not...
After reading this, I'm still not sure whether Scrapy works by directly selecting parts of the HTML document by using XPath/CSS expressions or selecting nodes from DOM Tree which is rendered by the browser?
For sure the former, as there is definitely no browser involved. Even the "CSS" part is just syntactic sugar for the XPath part -- which one can see by printing out an "in progress" Selector:
>>> from scrapy.selector import Selector
>>> print(Selector(text="<html><div class='foo'></div></html>").css(".foo"))
[<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' foo ')]" data='<div class="foo"></div>'>]
Still confused whether DOM Parsing and HTML Parsing is the same or not...
Strictly speaking, I believe they are different. For example, lxml is able to parse HTML, but it does so in its own way, and materializes an object tree that is xml.etree-compatible rather than a DOM tree. There is a minimal DOM library that html5lib can target, which is about the closest you'll get to "what a browser would build".
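To make the "no browser involved" point concrete, here is a minimal sketch using Python's standard-library `xml.etree.ElementTree` rather than Scrapy or lxml. This is a substitution for illustration: ElementTree is an XML parser, so it only accepts well-formed markup and supports a limited XPath subset, but it shows the same idea of a parser building its own in-memory tree that path expressions then query.

```python
import xml.etree.ElementTree as ET

# Well-formed markup only: ElementTree is an XML parser, not an HTML parser.
markup = (
    "<html><body>"
    "<div class='foo'>hello</div>"
    "<div class='bar'>bye</div>"
    "</body></html>"
)

# Parsing builds an element tree in memory; nothing is rendered.
root = ET.fromstring(markup)

# Select nodes with ElementTree's XPath subset.
matches = root.findall(".//div[@class='foo']")
texts = [element.text for element in matches]
print(texts)
```

The tree here is a plain Python object structure, not the DOM a browser exposes to JavaScript, which is the distinction the answer draws.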

simple formatting/parsing in markdown for blockquotes

I'm using markdown in my site and I would like to do some simple parsing for news articles.
How can I parse markdown to pull out all blockquotes and links, so I can highlight them separately from the rest of the document?
For example, I would like to parse the first blockquote ( >) in the document so I can push it to the top no matter where it occurs in the document (similar to what many news sites do to highlight certain parts of an article), but then de-blockquote it for the main body. So it occurs twice: once highlighted, always at the top, and then normally where it occurs in the document.
I will assume you're trying to do this at render-time, when the markdown is going to be converted to HTML. To point you in the right direction, one way you could go about this would be to:
1. Convert the markdown to HTML.
2. Pass the HTML to Nokogiri.
3. Grab the first <blockquote>, copy it, and inject it into the top of the Nokogiri node tree.
The result would be a duplicate of the first <blockquote>.
Redcarpet 2 is a great gem for converting Markdown to HTML. Nokogiri is your best bet for HTML parsing.
I can write sample code if necessary, but the documentation for both gems is thorough and this task is trivial enough to just piece together bits from examples within the docs. This at least answers your question of how to go about doing it.
Edit
Depending on the need, this could be done with a line of jQuery too.
$('article').prepend($('article blockquote').first().clone());
Given the <article> DOM element for an article on your page, grab the first <blockquote>, clone it, and prepend it to the top of the <article>.
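For comparison, the same duplication can be sketched directly on the markdown source, before any HTML conversion. This is a different technique from the Redcarpet/Nokogiri route described above, written in Python with no third-party libraries. The function name `promote_first_blockquote` is hypothetical, and the sketch makes a simplifying assumption: a blockquote is a run of consecutive lines that each start with `>`, so lazy continuation lines and `>` characters inside code blocks are not handled.

```python
def promote_first_blockquote(markdown_text):
    """Return the text with a copy of the first blockquote placed at the top.

    Assumption: a blockquote is a run of consecutive lines starting with '>'.
    """
    lines = markdown_text.splitlines()
    quote = []
    for line in lines:
        if line.lstrip().startswith(">"):
            quote.append(line)
        elif quote:
            break  # the first run of quoted lines has ended
    if not quote:
        return markdown_text  # nothing to promote
    # Copy of the quote, a blank separator line, then the original document.
    return "\n".join(quote + [""] + lines)


article = "Intro paragraph.\n\n> A key quote.\n> Second line.\n\nMore text."
promoted = promote_first_blockquote(article)
```

The original blockquote is left in place, so it appears twice, as in the approach above; removing the `>` markers from the copy in the body would be a separate step.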
I know wiki markup (e.g. wikicloth for Ruby) has implementations similar to what you're after for parsing links, categories, and references. I'm not sure about block quotes, but it may be better suited.
Something like:
require 'wikicloth'

# Sample wiki markup with internal links, an external link, a reference, a category and a language link
data = "[[ this ]] is a [[ link ]] and another [http://www.google.com Google]. This is a <ref>reference</ref>, but this is a [[Category:Test]]. This is in another [[de:Sprache]]"
wiki = WikiCloth::Parser.new(:data => data)
# Render first; the collections below are read after rendering
wiki.to_html
puts "Internal Links: #{wiki.internal_links.size}"
puts "External Links: #{wiki.external_links.size}"
puts "References: #{wiki.references.size}"
puts "Categories: #{wiki.categories.size} [#{wiki.categories.join(",")}]"
puts "Languages: #{wiki.languages.size} [#{wiki.languages.keys.join(",")}]"
I haven't seen any such parsers available for markdown. Using redcarpet, converting to HTML, then using Nokogiri does seem a bit convoluted.

Custom URL format for news in Expression Engine

Our site is migrating from MovableType to ExpressionEngine, and there is one small issue we are having. MT uses a date based URL structure, e.g. www.site.com/2012/03/post-title.html, while EE uses a category based structure, e.g. www.site.com/index.php/news/comments/post-title. The issue is that our MT page used Disqus for comments, and as such comments are tied to a specific URL, meaning that we'd lose all of our comments if we were to migrate. I am wondering if there's a way to change the URL structure in EE to match MT's, thus allowing us to keep the comments. Thanks in advance.
Correction: EE uses a Template Group/Template based structure for URLs, not categories - just to clarify.
You've got a couple of options here.
One is to create an .htaccess rule which internally redirects all requests matching YYYY/MM/ to your EE template which displays your posts (say, /news/entry/). I don't know exactly what those rewrite rules would look like off the top of my head, my mod_rewrite-fu is pretty shallow. But it could definitely work.
Another is to export all of your comments from Disqus via their XML export tool, then do a grep-based find and replace using something like BBEdit, replacing all /YYYY/MM/ strings in that file with /news/entry/; delete all of your existing comments on Disqus; then import your newly modified XML file.
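The find-and-replace step of the second option could also be scripted instead of done in an editor. The following is a minimal sketch in Python. The function name `rewrite_date_urls` is hypothetical, the target path `/news/entry/` is the example used above, and the pattern assumes date segments always look like a four-digit year followed by a two-digit month.

```python
import re

# Matches a date-style path segment such as /2012/03/
DATE_SEGMENT = re.compile(r"/\d{4}/\d{2}/")


def rewrite_date_urls(text, replacement="/news/entry/"):
    """Replace every /YYYY/MM/ segment in text with the given path."""
    return DATE_SEGMENT.sub(replacement, text)


old_url = "http://www.site.com/2012/03/post-title.html"
new_url = rewrite_date_urls(old_url)
print(new_url)
```

Note that this only swaps the date segment: the `.html` suffix is left untouched, and any other `/NNNN/NN/` sequence in the export would be rewritten as well, so the result is worth checking before re-importing.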

Linking on a Redmine Wiki

I'm writing a wiki on Redmine for the program my company just developed. I've been reading the Redmine wiki formatting pages, but I simply can't find how to link to headers on a page that contain spaces.
For example:
This works [[Setup#Oracle|Oracle Setup]]
This does not work [[Setup#Oracle DB|Oracle DB Setup]]
The second I have a header with a space, hyphen, underscore... ANYTHING more than one word, Redmine is unable to link.
Any ideas how to link correctly?
Hyphens worked for me using the textile formatting.
[[Wiki#Test-link-target|a link]]
If you open the wiki page, you should see a little paragraph symbol next to each header when you hover your mouse there. That should give you the (semi-)permalink you can use. You can always look at the wiki page's source for the link names.
One problem I remember from working on the Markdown filter was that each text formatter would create its table of contents separately. So the anchor links for Textile might be different from the ones for plain text or Markdown.

What does <a:theme> mean in OpenXML?

I'm trying to understand the inner file content of an OpenXML spreadsheet.
In some files I found the tag <a:theme>. Other tags have the same prefix.
Tags may also have other prefixes, such as p: and w:.
Can you help me understand the meaning of these prefixes in tags?
You can search for each tag and the full specification of Open XML at DII or download the PDF from the ISO site to read offline. All of these tags have a specific meaning in the construction of one or more formats for Word/Excel/PowerPoint 2007/2010 documents, spreadsheets and presentations.
The one that you mentioned above, <a:theme>, is the parent tag for defining the templated look and feel of a document, such as its fonts, font sizes, color schemes, etc. See here for a description.
If you're looking to get a little more familiar with the standard, there is a great eBook that can be downloaded and read: Open XML Markup Explained.
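As for the prefixes themselves: they are XML namespace prefixes. Each one is bound to a namespace URI by an `xmlns:` declaration in the file, and the prefix letter is only a local shorthand for that URI. A minimal sketch in Python shows this for `a:`, which in Open XML files is bound to the DrawingML main namespace. The fragment below is a simplified, hand-written example, not a complete theme part.

```python
import xml.etree.ElementTree as ET

# Simplified fragment; a real theme part contains many more child elements.
fragment = (
    '<a:theme xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main" '
    'name="Office Theme"><a:themeElements/></a:theme>'
)

root = ET.fromstring(fragment)

# ElementTree expands the prefix into "{namespace-uri}local-name".
namespace_uri, local_name = root.tag[1:].split("}")

print(namespace_uri)
print(local_name)
```

The parser reports the element by its namespace URI and local name, which is why two files can use different prefix letters for the same kind of element as long as the URI matches.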

Resources