I'm attempting to parse Media RSS feeds that contain media:* elements, but it seems as though all of the standard RSS parsing libraries for Ruby only support enclosures, not MRSS elements.
I've tried:
SimpleRSS
RSS::Parser
Syndication::RSS::Parser
Ideally, I'd like something that makes it simple to extract elements such as media:thumbnail, similar to how I can extract an entry's enclosure.
http://github.com/cardmagic/simple-rss seems to support Media RSS to some degree.
For example:
pp rss.entries.last
{
  ...
  :media_content_url=>"...",
  :media_content_type=>"image/jpeg",
  :media_content_height=>"426",
  :media_content_width=>"640",
  :media_thumbnail_url=>"...",
  :media_thumbnail_height=>"133",
  :media_thumbnail_width=>"200"
}
(Unfortunately, with the feed I'm testing against, it only picks up the first media:content tag inside the media:group, even though the group has two media:content tags.)
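If SimpleRSS keeps collapsing the group down to one item, a workaround is to parse the feed directly and pull every media:content yourself. Here's a minimal sketch using Ruby's stdlib REXML (the feed snippet and URLs are made up for illustration):

```ruby
require 'rexml/document'

# Hypothetical MRSS fragment with two media:content tags in one media:group.
xml = <<~XML
  <rss xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
      <item>
        <media:group>
          <media:content url="http://example.com/a.jpg" type="image/jpeg" width="640" height="426"/>
          <media:content url="http://example.com/b.jpg" type="image/jpeg" width="200" height="133"/>
        </media:group>
      </item>
    </channel>
  </rss>
XML

doc = REXML::Document.new(xml)

# Bind the media prefix to the MRSS namespace so the XPath matches
# regardless of the prefix the feed happens to use.
ns = { 'media' => 'http://search.yahoo.com/mrss/' }
contents = REXML::XPath.match(doc, '//media:content', ns).map do |el|
  { url: el.attributes['url'], type: el.attributes['type'] }
end

contents.each { |c| puts c[:url] }  # both media:content entries, not just the first
```

Once you have all the media:content hashes, merging them back into whatever SimpleRSS gives you for the rest of the entry is straightforward.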
I am knocking together a quick debugging view of a backend, as a small set of admin HTML pages (driven by AngularDart, though I'm not sure that's critical).
I get back a complex JSON object from my XHR call, and I want to display it on the HTML page, nicely formatted. It doesn't have to be a great implementation, since it's just a debug UI, but the goal is to format the object instead of having it be one long string with no newlines.
I looked at pretty-printing the JSON in Dart and putting the result inside <pre></pre> tags, as well as just dumping the Dart Map object to a string (again, with and without <pre></pre> tags), but I'm not getting where I want.
I even searched pub for something similar, such as a syntax highlighter that would output HTML, but didn't find anything obvious.
Any recommendations?
I think what you're looking for is:
Format your JSON so it's readable
Have syntax highlight
For 1 - This can be done with JsonEncoder.withIndent from dart:convert.
For 2 - You can use the JS library highlight.js fairly easily by appending your formatted JSON into a marked-up div (see the highlight.js docs for what I mean).
There is a webpage parser, which takes a page containing several tags in a certain structure, where divs are badly nested. I need to extract a certain div element, and copy it and all its content to a new HTML file.
Since I am new to Lua, I may need basic clarification for things that might seem simple.
Thanks,
The ease of extraction of data is going to largely depend on the page itself. If the page uses the exact same tag information throughout its entirety, it'll be much more difficult to extract than it would if it has named tags.
If you're able to find a version of the page that returns json format, then you're that much better off. Here's a snippet of code on something I wrote to grab definitions from a webpage that did not have json format:
local actualword, definition = string.match(wayup,"<html.-<td class='word'>%c(.-)%c</td>.-<div class=\"definition\">(.-)</div>")
Essentially, this code searched down the page until it found the class "word", and took the word after it (%c is the pattern for control characters). It continued on to "definition" and captured that, as well.
As you can see, it's a bit convoluted, but I had the luck of having specifically named tags for what I wanted.
This is edited to fit your comment. As a side note that I should have mentioned before, if you're familiar with regular expressions, you can use its model to capture what you need. In this case, it's capturing the string in its totality:
local data = string.match(page, "(<div id=\"aa\"><div>.-</div>.-</div>)")
It's rarely the fault of the language, but rather the webpage itself, that makes it hard to data mine anything. Since webpages could literally have hundreds of lines of code, it's hard to pinpoint exactly what you want without coming across garbage information. It's why I prefer a simplified result such as json, since Lua has a json module that can encode/decode and you can get your precise information.
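For comparison with regular expressions proper: Lua's non-greedy `.-` corresponds to `.*?` in a conventional regex. The same div capture as above, expressed in Ruby (the page snippet here is invented for illustration):

```ruby
# A made-up page with the badly nested div structure from the Lua example.
page = '<html><body><div id="aa"><div>inner</div> tail </div></body></html>'

# Equivalent of Lua's string.match(page, "(<div id=\"aa\"><div>.-</div>.-</div>)"):
# the /m flag makes . match newlines, like Lua's patterns do by default.
data = page[%r{(<div id="aa"><div>.*?</div>.*?</div>)}m, 1]

puts data
```

The same caveat applies in any language: pattern matching like this is tied to the exact markup, so it breaks as soon as the page structure changes.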
I'm wondering if there's an easy way to parse an XML document in rails without loading it all into ram.
I've been using (depending on the XML) a combination of Nokogiri and the standard Hash.from_xml to pull out the contents of the XML.
That is all well and good when I'm dealing with (attempting to import) 100 or even 1,000 products. When the XML doc has 16,000 or 40,000 products in it, though... well, my dyno starts to really feel it.
So I'm wondering if there's a way to walk the XML without pulling it all into memory.
Sorry I don't have code.... I'm attempting to avoid writing anything new. I mean who wants to write their own XML parser eh?
I came to this...
reader = Nokogiri::XML::Reader(File.open('test.xml'))
reader.each do |node|
  if node.name == 'Product'
    hash = Hash.from_xml(node.outer_xml).values.first
    break
  end
end
I watched my memory load while I ran this across a 60 meg file. It accomplished my goal. I'd love to see other answers. Perhaps something even lighter.
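In the same spirit, Ruby's stdlib ships a streaming option too: REXML's StreamParser visits tags one at a time without building a tree, so memory stays flat on huge files. A minimal sketch (the Product tag name mirrors the example above; the inline XML stands in for your 60 MB file):

```ruby
require 'rexml/parsers/streamparser'
require 'rexml/streamlistener'

# Callback-style listener: REXML calls tag_start for every opening tag
# as it streams through the document, never holding the whole tree.
class ProductCounter
  include REXML::StreamListener
  attr_reader :count

  def initialize
    @count = 0
  end

  def tag_start(name, _attrs)
    @count += 1 if name == 'Product'
  end
end

xml = "<Products><Product>a</Product><Product>b</Product></Products>"
listener = ProductCounter.new
REXML::Parsers::StreamParser.new(xml, listener).parse

puts listener.count  # => 2
```

It's more callback bookkeeping than the Reader approach, but it's gem-free and just as light on memory.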
Because XML is hierarchical the parser needs to know the whole structure to parse it correctly. You could feed well formed fragments to Nokogiri::HTML::Document.parse but you'd need to get those fragments out some other way.
Let's say you have a huge xml document:
<products>
<product>stuff</product>
<product>...</product>
... and so on
</products>
The actual products are enveloped within <products>. Strip out the envelope part, then use string splitting to get an array of each <product> and its contents, and parse each of those as an XML fragment. Just a thought.
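A rough sketch of that splitting idea, using plain string operations plus stdlib REXML for the per-fragment parse (regex-splitting XML is fragile in general; this assumes the flat structure shown above, with no nested or attribute-laden product tags):

```ruby
require 'rexml/document'

# Stand-in for the huge document; in practice you'd read this in chunks.
xml = "<products><product>stuff</product><product>more stuff</product></products>"

# Strip the envelope, then split out each <product> element with a
# non-greedy scan so adjacent products don't merge into one match.
body = xml[%r{<products>(.*)</products>}m, 1]
fragments = body.scan(%r{<product>.*?</product>}m)

# Each fragment is now a tiny standalone document, cheap to parse.
values = fragments.map { |frag| REXML::Document.new(frag).root.text }

puts values
```

Each parse only ever sees one product's worth of XML, so peak memory is bounded by the largest single product rather than the whole file.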
This might help, although I've never used it: https://github.com/soulcutter/saxerator
I'm using markdown in my site and I would like to do some simple parsing for news articles.
How can I parse the markdown to pull out all blockquotes and links, so I can highlight them separately from the rest of the document?
For example, I would like to grab the first blockquote (>) in the document so I can push it to the top, no matter where it occurs (similar to what many news sites do to highlight certain parts of an article), but then de-blockquote it for the main body. So it appears twice: once in the highlight at the top, and then normally where it occurs in the document.
I will assume you're trying to do this at render-time, when the markdown is going to be converted to HTML. To point you in the right direction, one way you could go about this would be to
Convert the markdown to HTML
Pass the HTML to Nokogiri
Grab the first <blockquote>, copy it, and inject it into the top of the Nokogiri node tree
The result would be a duplicate of the first <blockquote>.
Redcarpet 2 is a great gem for converting Markdown to HTML. Nokogiri is your best bet for HTML parsing.
I can write sample code if necessary, but the documentation for both gems is thorough and this task is trivial enough to just piece together bits from examples within the docs. This at least answers your question of how to go about doing it.
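To make the three steps concrete anyway, here's a sketch of the copy-and-inject step using stdlib REXML in place of Nokogiri, so it runs without any gems; the Nokogiri calls you'd use instead (`at_css('blockquote')`, `dup`, `add_previous_sibling`) are analogous. The HTML snippet is invented, and assumes the rendered markdown is well-formed:

```ruby
require 'rexml/document'

# Pretend output of the markdown renderer: the blockquote sits mid-article.
html = '<article><p>Intro paragraph.</p>' \
       '<blockquote>Pull quote</blockquote>' \
       '<p>More text.</p></article>'

doc = REXML::Document.new(html)

# Grab the first blockquote, clone it, and inject the clone at the top.
first_bq = REXML::XPath.first(doc, '//blockquote')
copy = first_bq.deep_clone
doc.root.insert_before(doc.root.elements[1], copy)

puts doc.to_s  # blockquote now appears twice: highlighted up top, and in place
```

De-blockquoting the in-place copy (for the "appears normally in the body" half of the question) would just be a matter of restyling or renaming that second node before serializing.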
Edit
Depending on the need, this could be done with a line of jQuery too.
$('article').prepend($($('article blockquote').get(0)).clone())
Given the <article> DOM element for an article on your page, grab the first <blockquote>, clone it, and prepend it to the top of the <article>.
I know wiki markup (i.e. wikicloth for Ruby) has implementations similar to what you're after for parsing links, categories, and references. I'm not sure about blockquotes, but it may be better suited.
Something like:
data = "[[ this ]] is a [[ link ]] and another [http://www.google.com Google]. This is a <ref>reference</ref>, but this is a [[Category:Test]]. This is in another [[de:Sprache]]"
wiki = WikiCloth::Parser.new(:data => data)
wiki.to_html
puts "Internal Links: #{wiki.internal_links.size}"
puts "External Links: #{wiki.external_links.size}"
puts "References: #{wiki.references.size}"
puts "Categories: #{wiki.categories.size} [#{wiki.categories.join(",")}]"
puts "Languages: #{wiki.languages.size} [#{wiki.languages.keys.join(",")}]"
I haven't seen any such parsers available for markdown. Using redcarpet, converting to HTML, then using Nokogiri does seem a bit convoluted.
There is a website that displays a lot of data in html tables. They have paged the data so there are around 500 pages.
What is the most convenient (easy) way of getting the data in those tables and downloading it as a CSV, on Windows?
Basically I need to write a script that does something like this, but it's overkill to write it in C#, and I'm looking for other solutions that people with web experience use:
for(i=1 to 500)
load page from http://x/page_i.html;
parse the source and get the data in table with id='data'
save results in csv
Thanks!
I was doing a screen-scraping application once and found BeautifulSoup to be very useful. You could easily plop that into a Python script and parse across all the tags with the specific id you're looking for.
The easiest non-C# way I can think of is to use Wget to download the page, then run HTMLTidy to convert it to XML/XHTML and then transform the resulting XML to CSV with an XSLT (run with MSXSL.exe)
You will have to write some simple batch files and an XSLT with a basic XPath selector.
If you feel it would be easier to just do it in C#, you can use SgmlReader to read the HTML DOM and do an XPath query to extract the data. It should not take more than about 20 lines of code.