simple formatting/parsing in markdown for blockquotes - ruby-on-rails

I'm using markdown on my site and I would like to do some simple parsing for news articles.
How can I parse the markdown to pull out all blockquotes and links, so I can highlight them separately from the rest of the document?
For example, I would like to parse the first blockquote (>) in the document so I can push it to the top no matter where it occurs, similar to what many news sites do to highlight certain parts of an article, but then de-blockquote it for the main body. So it occurs twice: once in the highlight, always at the top, and then normally where it occurs in the document.

I will assume you're trying to do this at render time, when the markdown is going to be converted to HTML. To point you in the right direction, one way you could go about this would be to:
1. Convert the markdown to HTML
2. Pass the HTML to Nokogiri
3. Grab the first <blockquote>, copy it, and inject it into the top of the Nokogiri node tree
The result would be a duplicate of the first <blockquote>.
Redcarpet 2 is a great gem for converting Markdown to HTML. Nokogiri is your best bet for HTML parsing.
I can write sample code if necessary, but the documentation for both gems is thorough and this task is trivial enough to just piece together bits from examples within the docs. This at least answers your question of how to go about doing it.
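That said, a minimal sketch of the Redcarpet + Nokogiri route might look like this (assuming a markdown_source string holds the article's raw markdown; the variable name is just a placeholder):

require 'redcarpet'
require 'nokogiri'

# Convert the markdown to HTML.
renderer = Redcarpet::Markdown.new(Redcarpet::Render::HTML)
html = renderer.render(markdown_source)

# Parse the HTML, then prepend a copy of the first blockquote.
doc = Nokogiri::HTML::DocumentFragment.parse(html)
if (quote = doc.at('blockquote'))
  doc.children.first.add_previous_sibling(quote.dup)
end
final_html = doc.to_html

The original blockquote stays where it occurred; only a copy is prepended. If you also want the in-body occurrence de-blockquoted, you could rename that node (e.g. quote.name = 'div') before serializing.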
Edit
Depending on the need, this could be done with a line of jQuery too.
$('article').prepend($('article blockquote').first().clone());
Given the <article> DOM element for an article on your page, grab the first <blockquote>, clone it, and prepend it to the top of the <article>.

I know wiki markup parsers (e.g. wikicloth for Ruby) have implementations like what you're after for parsing links, categories, and references. I'm not sure about blockquotes, though, but it may be better suited.
Something like:
data = "[[ this ]] is a [[ link ]] and another [http://www.google.com Google]. This is a <ref>reference</ref>, but this is a [[Category:Test]]. This is in another [[de:Sprache]]"
wiki = WikiCloth::Parser.new(:data => data)
wiki.to_html
puts "Internal Links: #{wiki.internal_links.size}"
puts "External Links: #{wiki.external_links.size}"
puts "References: #{wiki.references.size}"
puts "Categories: #{wiki.categories.size} [#{wiki.categories.join(",")}]"
puts "Languages: #{wiki.languages.size} [#{wiki.languages.keys.join(",")}]"
I haven't seen any such parsers available for markdown. Using Redcarpet to convert to HTML and then using Nokogiri does seem a bit convoluted.

Related

JSON-LD recognized by Google, but not Facebook Pixel (Ruby on Rails)

I have implemented a dynamic JSON-LD creation process to boost my SEO. The JSON is created through the use of Jbuilder (the code is in a partial) and rendered in a script tag with a type of "application/ld+json". All of it is wrapped up in a content_for, so that I can reuse the logic.
Once it was implemented, I started getting this error in my console: "[Facebook Pixel] - Unable to parse JSON-LD tag. Malformed JSON found: ' "
I tested my JSON-LD on the Google structured data tool and everything came back OK.
When I added a hand-written JSON-LD in my script tag instead of my aforementioned logic, everything looked OK. No error was displayed in the console, and the Chrome Facebook Pixel Helper was able to find my JSON-LD.
Bottom line, it appears that using my dynamic logic with the partials creates a random "'", which makes no sense to me.
Has anyone ever had the same issue, or something similar?
Maybe the templating engine is messing you up. You might consider using the json-ld gem to validate the output as part of continuous integration (you can also semantically validate the content using other gems).
I've had success using JSON-LD in Haml, but I just use to_json from a Hash hierarchy, which has always worked well for me.
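For illustration, a minimal sketch of that approach, with a hypothetical helper and made-up article fields:

# A hypothetical helper: build the whole payload as one Hash and
# serialize it in a single to_json call.
def article_json_ld(article)
  {
    "@context" => "https://schema.org",
    "@type" => "NewsArticle",
    "headline" => article.title,
    "datePublished" => article.published_at.iso8601
  }.to_json
end

-# In the Haml view, != emits the JSON unescaped inside the script tag.
%script{type: "application/ld+json"}!= article_json_ld(@article)

Serializing in one step like this avoids the stray quote characters that hand-assembled JSON in a template can pick up from whitespace or escaping.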

How does Scrapy (open-source web scraping framework) work?

Quoting from the official Scrapy documentation:
Scrapy comes with its own mechanism for extracting data. They’re called selectors because they “select” certain parts of the HTML document specified either by XPath or CSS expressions. Source
After reading this, I'm still not sure whether Scrapy works by directly selecting parts of the HTML document by using XPath/CSS expressions or selecting nodes from DOM Tree which is rendered by the browser?
Still confused whether DOM Parsing and HTML Parsing is the same or not...
After reading this, I'm still not sure whether Scrapy works by directly selecting parts of the HTML document by using XPath/CSS expressions or selecting nodes from DOM Tree which is rendered by the browser?
For sure the former, as there is definitely no browser involved. Even the "CSS" part is just syntactic sugar for the XPath part, which one can see by printing out an "in progress" Selector:
>>> print(Selector(text="<html><div class='foo'></div></html>").css(".foo"))
[<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' foo ')]" data='<div class="foo"></div>'>]
Still confused whether DOM Parsing and HTML Parsing is the same or not...
Strictly speaking, I believe they are different. For example, lxml is able to parse HTML, but it does so in its own way, and materializes an object tree that is xml.etree compatible, not that of the DOM. There is a minimal DOM library that html5lib can target, which is about the closest you'll get to "what a browser would build".
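As an illustration of that last point, a minimal sketch using html5lib's "dom" tree builder (the sample HTML is made up):

import html5lib

# html5lib with the "dom" treebuilder returns an xml.dom.minidom
# Document, roughly the tree a browser would build.
doc = html5lib.parse("<html><body><div class='foo'>hi</div></body></html>",
                     treebuilder="dom")
div = doc.getElementsByTagName("div")[0]
print(div.getAttribute("class"))  # -> foo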

Format Dart code as HTML

I am knocking together a quick debugging view of a backend, as a small set of admin HTML pages (driven by AngularDart, but I'm not sure that is critical).
I get back from my XHR call a complex JSON object. I want to see that on the HTML page, formatted nicely. It doesn't have to be a great implementation, as it's just a debug UI, but the goal is to format the object instead of having it be one long string with no newlines.
I looked at trying to pretty-print the JSON in Dart and then putting that inside <pre></pre> tags, as well as just dumping the Dart Map object to a string (again, inside or not inside <pre></pre> tags), but I'm not getting where I want.
I even searched pub for something similar, such as a syntax highlighter that would output HTML, but didn't find anything obvious.
Any recommendations?
I think what you're looking for is to:
1. Format your JSON so it's readable
2. Have syntax highlighting
For 1: this can be done with JsonEncoder and an indent; see the sketch below.
For 2: you can use the JS library highlight.js pretty easily by appending your formatted JSON into a marked-up div (see highlight.js's docs to see what I mean).
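A minimal sketch of the JsonEncoder part (the sample payload is made up):

import 'dart:convert';

void main() {
  // Decode the raw JSON, then re-encode it with a two-space indent
  // so it reads nicely inside a <pre> tag.
  final data = jsonDecode('{"name": "debug", "values": [1, 2, 3]}');
  final pretty = const JsonEncoder.withIndent('  ').convert(data);
  print(pretty);
}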

How to get HTML tag text by id using Lua

There is a webpage parser, which takes a page containing several tags in a certain structure, where divs are badly nested. I need to extract a certain div element, and copy it and all its content to a new HTML file.
Since I am new to Lua, I may need basic clarification for things that might seem simple.
Thanks,
The ease of extracting the data is going to depend largely on the page itself. If the page uses the exact same tag information throughout its entirety, it'll be much more difficult to extract than it would be if it had named tags.
If you're able to find a version of the page that returns JSON format, then you're that much better off. Here's a snippet of code from something I wrote to grab definitions from a webpage that did not have a JSON format:
local actualword, definition = string.match(wayup,"<html.-<td class='word'>%c(.-)%c</td>.-<div class=\"definition\">(.-)</div>")
Essentially, this code searched down the page until it found the class "word", and took the word after it (%c is the pattern for control characters). It continued on to "definition" and captured that, as well.
As you can see, it's a bit convoluted, but I had the luck of having specifically named tags for what I wanted.
This is edited to fit your comment. As a side note that I should have mentioned before, if you're familiar with regular expressions, you can use its model to capture what you need. In this case, it's capturing the string in its totality:
local data = string.match(page, "(<div id=\"aa\"><div>.-</div>.-</div>)")
It's rarely the fault of the language, but rather the webpage itself, that makes it hard to data-mine anything. Since webpages can literally have hundreds of lines of code, it's hard to pinpoint exactly what you want without coming across garbage information. That's why I prefer a simplified result such as JSON, since Lua has a JSON module that can encode/decode, so you can get precisely the information you need.

Linking on a Redmine Wiki

I'm writing a wiki on Redmine for the program my company just developed. I've been reading the Redmine wiki formatting pages, but I simply can't find how to link to headers on a page that contain spaces.
For example:
This works [[Setup#Oracle|Oracle Setup]]
This does not work [[Setup#Oracle DB|Oracle DB Setup]]
The second I have a header with a space, hyphen, underscore... ANYTHING more than one word, Redmine is unable to link.
Any ideas how to link correctly?
Hyphens worked for me using the textile formatting.
[[Wiki#Test-link-target|a link]]
If you open the wiki page, you should see a little paragraph symbol next to each header that appears when you hover your mouse there. That should give you the (semi-)permalink you can use. You can always look at the wiki page's source for the link names.
One problem I remember from working on the Markdown filter was that each text formatter would create its table of contents separately. So the anchor links for Textile might be different than the ones for plain text or Markdown.
