Convert Nokogiri XML Document into Array of Strings? - ruby-on-rails

I'm creating a Ruby on Rails application and using Nokogiri to parse an XML file. I'm trying to parse the XML file into mutable strings which I can manipulate to create other content.
Here's a sample XML I'm using
<feed xmlns="http://www.w3.org/2005/Atom">
<entry>
<title type="html">
<![CDATA[ First Post! ]]>
</title>
<content type="html">
<![CDATA[
<p>I’m very excited to have finally got my site up and running along with this blog!</p>]]>
</content>
</entry>
</feed>
This is what I've done so far relating to my problem
In my controller -
def index
  @blog_title, @blog_post = parse_xml
end

private

def parse_xml
  @xml_doc = Nokogiri::XML(open("atom.xml"))
  titles = @xml_doc.css("entry title")
  post = @xml_doc.css("content")
  return titles, post
end
In my view -
<% for i in 1..@blog_title.length %>
  <li><%= @blog_title[i-1] %></li>
  <li><%= @blog_post[i-1] %></li>
<% end %>
A sample output from the view (it returns a Nokogiri Element) -
<title type="html"><![CDATA[First Post!]]></title>
So ideally, I'd like to make all the Nokogiri::Element inside the Nokogiri::Document a string or make the entire array a String array.
I've tried iterating through each element and calling .to_s but it doesn't seem to work.
I've also tried calling Ruby::String methods such as slice and that doesn't work (for obvious reasons).
The end result I'm trying to get at (using the sample output on my view) is to return only the following and none of the rest.
First Post!
Can anyone help me? If I'm not clear enough or if someone needs to see more work, please feel free to ask!

For your case you should simply use .text to extract the content of tags. Something like titles.text would work.

You're dealing with RSS/Atom feeds which can contain multiple title tags. You need to iterate over all title nodes and extract their content separately, in a way that lets you keep track of their order and what article they're attached to:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<feed xmlns="http://www.w3.org/2005/Atom">
<entry>
<title type="html">
<![CDATA[ First Post! ]]>
</title>
<content type="html">
<![CDATA[
<p>I’m very excited to have finally got my site up and running along with this blog!</p>]]>
</content>
</entry>
</feed>
EOT
doc.search('title').map(&:text)
# => ["\n First Post! \n "]
This returns an array of the text inside the title nodes. From there you can easily clean up each string, manipulate them, reuse them, whatever.
doc.search('title').map{ |s| s.text.strip }
# => ["First Post!"]
search returns a NodeSet, which is akin to an array of title nodes found in the document. If you don't iterate over them you'll get a concatenated string containing all their text, which is usually NOT what you want:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<foo>
<title>this</title>
<title>is</title>
<title>what</title>
<title>you'd</title>
<title>get</title>
</foo>
EOT
doc.search('title').text
# => "thisiswhatyou'dget"
versus:
doc.search('title').map(&:text)
# => ["this", "is", "what", "you'd", "get"]
Trying to tear apart the first result is impossible unless you have prior knowledge of the document's structure which is usually not true. Iterating over the returned NodeSet will yield very usable results.
To maintain consistency with the various title tags in a feed, you need to loop over the entries, then extract the embedded titles which is a bit different than what your sample XML and code shows:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<feed xmlns="http://www.w3.org/2005/Atom">
<entry>
<title type="html">
<![CDATA[ First Post! ]]>
</title>
<content type="html">
<![CDATA[
<p>I’m very excited to have finally got my site up and running along with this blog!</p>]]>
</content>
</entry>
<entry>
<title type="html">
<![CDATA[ Second Post! ]]>
</title>
<content type="html">
<![CDATA[
<p>blah</p>]]>
</content>
</entry>
</feed>
EOT
titles = doc.search('entry').map { |entry|
  entry.at('title').text.strip
}
titles # => ["First Post!", "Second Post!"]
Or perhaps more usable:
titles_and_content = doc.search('entry').map { |entry|
  [
    entry.at('title').text.strip,
    entry.at('content').text.strip
  ]
}
titles_and_content
# => [["First Post!",
# "<p>I’m very excited to have finally got my site up and running along with this blog!</p>"],
# ["Second Post!", "<p>blah</p>"]]
which returns the title and the content for each entry. From this you can easily build up code to extract the links to the articles, date of publishing, refresh-rates, original site, everything you'd want to know about an individual article and its source, then store it in a database for later regurgitation when requested.
There are gems and scripts available for processing RDF, RSS and Atom feeds. Years ago, when I had to write a huge aggregator for feeds, nothing available met my needs, so I wrote one from scratch. I'd recommend trying to find one rather than reinventing that wheel; otherwise, look through their source and learn from their experience. There are a number of things to do in code to be a good network citizen that doesn't swamp the servers and get you banned.
See "How to avoid joining all text from Nodes when scraping" also.

Related

What is the structure of wikipedia dumps?

I need a list of Hungarian words for a project, and the only possible source I found is the Wikipedia XML dumps. They are really big. I guess I could parse them with a read stream and a SAX parser, but it would be nice to know more about the structure so I could test the code on a small example before running it on the big files. Is there a description somewhere of the structure they use and of what the different gzipped XML files contain? https://dumps.wikimedia.org/enwiki/latest/ https://dumps.wikimedia.org/huwiki/latest/
The format is documented here: https://www.mediawiki.org/wiki/Help:Export It looks like this:
<mediawiki xml:lang="en">
<page>
<title>Page title</title>
<restrictions>edit=sysop:move=sysop</restrictions>
<revision>
<timestamp>2001-01-15T13:15:00Z</timestamp>
<contributor><username>Foobar</username></contributor>
<comment>I have just one thing to say!</comment>
<text>A bunch of [[Special:MyLanguage/text|text]] here.</text>
<minor />
</revision>
<revision>
<timestamp>2001-01-15T13:10:27Z</timestamp>
<contributor><ip>10.0.0.2</ip></contributor>
<comment>new!</comment>
<text>An earlier [[Special:MyLanguage/revision|revision]].</text>
</revision>
</page>
<page>
<title>Talk:Page title</title>
<revision>
<timestamp>2001-01-15T14:03:00Z</timestamp>
<contributor><ip>10.0.0.2</ip></contributor>
<comment>hey</comment>
<text>WHYD YOU LOCK PAGE??!!! i was editing that jerk</text>
</revision>
</page>
</mediawiki>
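To test a streaming parser on a small sample before pointing it at the full dump, something like the following could work. It is only a sketch: it uses the REXML stream parser that ships with Ruby as a stand-in for whichever SAX parser you end up choosing, and the PageListener class, its pages accessor and the trimmed SAMPLE document are made up for illustration. It assumes the element names shown above (page, title, revision, text).

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener: collects the <title> of each <page> and the
# <text> of every revision inside it, without building a full tree.
class PageListener
  include REXML::StreamListener

  attr_reader :pages

  def initialize
    @pages = []
    @path = [] # stack of the element names that are currently open
  end

  def tag_start(name, _attrs)
    @path.push(name)
    @pages << { title: +'', texts: [] } if name == 'page'
    @pages.last[:texts] << +'' if name == 'text' && @pages.any?
  end

  def text(chunk)
    return if @pages.empty?

    case @path.last
    when 'title' then @pages.last[:title] << chunk
    when 'text'  then @pages.last[:texts].last << chunk
    end
  end

  def tag_end(_name)
    @path.pop
  end
end

# Trimmed-down sample in the export format (assumption: real dumps carry
# more elements, which this listener simply ignores).
SAMPLE = <<~XML
  <mediawiki>
    <page>
      <title>Page title</title>
      <revision>
        <text>A bunch of text here.</text>
      </revision>
      <revision>
        <text>An earlier revision.</text>
      </revision>
    </page>
    <page>
      <title>Talk:Page title</title>
      <revision>
        <text>hey</text>
      </revision>
    </page>
  </mediawiki>
XML

listener = PageListener.new
REXML::Document.parse_stream(SAMPLE, listener)
listener.pages.each do |page|
  puts "#{page[:title]}: #{page[:texts].length} revision(s)"
end
```

For the real files you would pass an IO object (for example a Zlib::GzipReader) instead of the SAMPLE string and process each page as soon as its closing tag arrives, rather than keeping every page in memory.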

Parse HTML stored as string in Database in ColdFusion

I have taken over this ColdFusion project and found that I need a value out of a database field that includes HTML. The field data looks like this (without the new lines):
<wddxPacket version="1.0">
<header />
<data>
<struct>
<var name="en">
<string>3 Nights' Lodging</string>
</var>
<var name="sp">
<string>3 Noches alojamiento</string>
</var>
</struct>
</data>
</wddxPacket>
I am wanting to use this data but I only need the text between the:
<var name='en'><string>3 Nights' Lodging</string></var>
I used a function that ColdFusion has to remove HTML:
#REReplaceNoCase(pkg.title, "<[^><]*>", '', 'ALL')#
But when I use that, I get something like this:
3 Nights' Lodging3 Noches alojamiento
All I want is:
3 Nights' Lodging
Examining the beginning of the string, i.e. <wddxPacket ...>, shows that it is actually WDDX.
If you do a search for ColdFusion + WDDX you will find the documentation for CFWDDX. It is a built-in tag which supports conversions of WDDX strings to CFML objects (and vice versa) for easier manipulation. In your case use action="wddx2cfml" to convert the string back into a CF structure.
<cfwddx action="wddx2cfml" input="#text#" output="result">
<cfdump var="#result#" label="Raw object">
Then use the key #result.en# to grab the string you want.

How to get the language specific messages using google closure template

I am trying to add internationalization support to my project, and people suggested Google Closure Templates for this, but I am very new to them. I am trying to get language-specific messages using Closure Templates, but they are not showing up in my XLF file. If anyone knows how to generate language-specific messages using Closure Templates, please tell me the steps. That would be a great help.
My .soy file is below.
{namespace poc}
/**
 * Testing message translation
 * @param pageTitle
 */
{template .translate}
<HTML>
<Head>
<title>{$pageTitle}
</title>
</head>
<div>
{msg desc="Hello"}Hello{/msg}
</div>
</html>
{/template}
And the generated .xlf content is below:
<?xml version="1.0" encoding="UTF-8"?>
<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
<file original="SoyMsgBundle" datatype="x-soy-msg-bundle" xml:space="preserve" source-language="en" target-language="pt-BR">
<body>
<trans-unit id="2286494898080570401" datatype="html">
<source>Thanks</source>
<target/>
<note priority="1" from="description">Says thanks</note>
</trans-unit>
</body>
</file>
</xliff>
I see you already used the SoyMsgExtractor to create the base xlf. Next you need to translate this base xlf into each language you want to support, creating one file per language. I used the XLIFF editor from Translution: http://sourceforge.net/projects/eviltrans.
Next, using the SoyToJsSrcCompiler a translation soy can be made per language:
java -jar SoyToJsSrcCompiler.jar --shouldGenerateGoogMsgDefs --bidiGlobalDir 1 --messageFilePathFormat Filename_en-us.xliff --outputPathFormat FileName_fr.js *.soy
This will create a FileName_fr.js file that contains the compiled soy file.
Including this file instead of the original soy (or compiled) will create a localized version.
Good luck!
\Rene
I think the easiest way is to make (i.e. generate from whatever source) a separate JS file which contains one messages object and reference it through an extern-declared function.
It just works and has no complicated dependencies.

Uncaught exception 'DOMException' with message 'Not Found Error'

Basically I'm writing a templating system for my CMS and I want to have a modular structure which involves people putting in tags like:
<module name="news" /> or <include name="anotherTemplateFile" /> which I then want to find in my php and replace with dynamic html.
Someone on here pointed me towards DOMDocument, but I've already come across a problem.
I'm trying to find all <include /> tags in my template and replace them with some simple html. Here is my template code:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>CMS</title>
<include name="head" />
</head>
<body>
<include name="header" />
<include name="content" />
<include name="footer" />
</body>
</html>
And here is my PHP:
$template = new DOMDocument();
$template->load("template/template.tpl");
foreach( $template->getElementsByTagName("include") as $include ) {
$element = '<input type="text" value="'.print_r($include, true).'" />';
$output = $template->createTextNode($element);
$template->replaceChild($output, $include);
}
echo $template->saveHTML();
Now, I get the fatal error Uncaught exception 'DOMException' with message 'Not Found Error'.
I've looked this up and it seems that, because my <include /> tags aren't necessarily DIRECT children of $template, it's not replacing them.
How can I replace them independently of descent?
Thank you
Tom
EDIT
Basically I had a brainwave of sorts. If I do something like this for my PHP I see it's trying to do what I want it to do:
$template = new DOMDocument();
$template->load("template/template.tpl");
foreach( $template->getElementsByTagName("include") as $include ) {
$element = '<input type="text" value="'.print_r($include, true).'" />';
$output = $template->createTextNode($element);
// this line is different:
$include->parentNode->replaceChild($output, $include);
}
echo $template->saveHTML();
However it only seems to change 1 occurrence in the <body> of my HTML... when I have 3. :/
This is a problem with your DOMDocument->load, try
$template->loadHTMLFile("template/template.tpl");
But you may need to give it a .html extension.
This is looking for an HTML or an XML file. Also, whenever you are using DOMDocument with HTML it is a good idea to call libxml_use_internal_errors(true); before the load call.
OKAY THIS WORKS:
foreach( $template->getElementsByTagName("include") as $include ) {
if ($include->hasAttributes()) {
$includes[] = $include;
}
//var_dump($includes);
}
foreach ($includes as $include) {
$include_name = $include->getAttribute("name");
$input = $template->createElement('input');
$type = $template->createAttribute('type');
$typeval = $template->createTextNode('text');
$type->appendChild($typeval);
$input->appendChild($type);
$name = $template->createAttribute('name');
$nameval = $template->createTextNode('the_name');
$name->appendChild($nameval);
$input->appendChild($name);
$value = $template->createAttribute('value');
$valueval = $template->createTextNode($include_name);
$value->appendChild($valueval);
$input->appendChild($value);
if ($include->getAttribute("name") == "head") {
$template->getElementsByTagName('head')->item(0)->replaceChild($input,$include);
}
else {
$template->getElementsByTagName("body")->item(0)->replaceChild($input,$include);
}
}
//$template->load($nht);
echo $template->saveHTML();
However it only seems to change 1 occurrence in the <body> of my HTML... when I have 3. :/
DOM NodeLists are ‘live’: when you remove an <include> element from the document (by replacing it), it disappears from the list. Conversely if you add a new <include> into the document, it will appear in your list.
You might expect this for a NodeList that comes from an element's childNodes, but the same is true of NodeLists that are returned by getElementsByTagName. It's part of the W3C DOM standard and occurs in web browsers' DOMs as well as PHP's DOMDocument.
So what you have here is a destructive iteration. Remove the first <include> (item 0 in the list) and the second <include>, previously item 1, becomes the new item 0. Now when you move on to the next item in the list, item 1 is what used to be item 2, causing you to only look at half the items.
PHP's foreach loop looks like it might protect you from that, but actually under the covers it's doing exactly the same as a traditional indexed for loop.
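The skipping can be reproduced without any DOM at all. The following is a toy model in plain Ruby, not PHP's DOMDocument API: LiveList and replace are hypothetical names, and the "document" is just an array of strings. It only mimics one property of a live NodeList, namely that every lookup re-queries the document. The second half shows the usual workaround of copying the matches into a plain array before changing anything.

```ruby
# Toy "live" list: every call re-queries the document, like a DOM NodeList.
class LiveList
  def initialize(document, tag)
    @document = document
    @tag = tag
  end

  def length
    matches.length
  end

  def item(index)
    matches[index]
  end

  # Snapshot: a plain array that no longer changes with the document.
  def to_a
    matches.dup
  end

  private

  def matches
    @document.select { |node| node.start_with?(@tag) }
  end
end

# Hypothetical stand-in for replaceChild: swaps an include for an input.
def replace(document, node)
  document[document.index(node)] = "input:#{node.split(':').last}"
end

# 1. Destructive iteration: the index advances while the list shrinks.
document = %w[include:header include:content include:footer]
live = LiveList.new(document, 'include')
index = 0
while index < live.length
  replace(document, live.item(index))
  index += 1
end
skipped_result = document.dup
# => ["input:header", "include:content", "input:footer"]

# 2. Snapshot first, then replace: every include is visited.
document = %w[include:header include:content include:footer]
LiveList.new(document, 'include').to_a.each { |node| replace(document, node) }
snapshot_result = document.dup
# => ["input:header", "input:content", "input:footer"]

puts skipped_result.inspect, snapshot_result.inspect
```

Collecting the nodes into a separate array first is the same idea as the $includes array in the working PHP version above. Iterating the live list backwards is another common way around the problem.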
I'd try to avoid creating a new templating language for PHP; there are already so many, not to mention PHP itself. Creating one out of DOMDocument is also going to be especially slow.
eta: In general regex replace would be faster, assuming a simple match pattern that doesn't introduce loads of backtracking. However if you are wedded to an XML syntax, regex isn't very good at parsing that. But what are you attempting to do, that can't already be done with PHP?
<?php function write_header() { ?>
<p>This is the header bit!</p>
<?php } ?>
<body>
...
<?php write_header(); ?>
...
</body>

Parsing atom/rss feed containing multiple <link> tags with Haml on RoR

So, firstly, here's an Atom feed snippet which I am trying to parse:
// http://somelink.com/atom
<feed xmlns="http://www.w3.org/2005/Atom">
<entry>
<title>Title Here</title>
<link href="http://somelink.com/link1&amp;ref=rss" rel="alternate" />
<link href="http://somelink.com/link2&amp;ref=rss" rel="tag:somelink.com/apply_url"/>
...
</entry>
I pull the Atom feed like so,
// In controller index method
@rss = SimpleRSS.parse open('http://somelink.com/atom')
Then I output the response in the view, which I am writing using Haml, as follows:
- @rss.entries.each do |item|
  .title-div
    = item.title
  .title-link
    = item.link //outputs the first link
I could run a second loop for the links but is there a way to get the second link without it? Like reading the "rel" attribute and outputting the correct link? How do I do this in Haml/Rails?
EDIT: The gem I am using: http://simple-rss.rubyforge.org/
I'm not familiar with that gem, but have you tried item.links to see if each item provides a collection of links?
I have never used SimpleRSS but maybe you could give Nokogiri or Hpricot a try? You can then run an XPath query to select only the link with the right attribute. An example with Nokogiri:
require 'nokogiri'
require 'open-uri'
atom_doc = Nokogiri::XML(URI.open("http://www.example.com/atom.xml"))
atom_doc.xpath("/xmlns:feed/xmlns:entry/xmlns:link[@rel='tag:somelink.com/apply_url']")
Don't forget the namespaces if you are parsing an Atom feed.
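If you want to experiment with the "pick the link by its rel attribute" idea without installing any gems, here is a rough sketch using REXML from Ruby's standard distribution instead of Nokogiri or SimpleRSS. The link_href helper and the FEED sample are hypothetical, the feed is the snippet from the question with the elided part left out, and the namespace is checked explicitly rather than through an XPath prefix.

```ruby
require 'rexml/document'

ATOM_NS = 'http://www.w3.org/2005/Atom'

# The snippet from the question, closed off so that it parses.
FEED = <<~XML
  <feed xmlns="http://www.w3.org/2005/Atom">
    <entry>
      <title>Title Here</title>
      <link href="http://somelink.com/link1&amp;ref=rss" rel="alternate" />
      <link href="http://somelink.com/link2&amp;ref=rss" rel="tag:somelink.com/apply_url"/>
    </entry>
  </feed>
XML

# Hypothetical helper: href of the first Atom <link> child of +entry+
# whose rel attribute equals +rel+, or nil when there is none.
def link_href(entry, rel)
  link = entry.elements.to_a.find do |child|
    child.name == 'link' &&
      child.namespace == ATOM_NS &&
      child.attributes['rel'] == rel
  end
  link && link.attributes['href']
end

doc = REXML::Document.new(FEED)
entries = doc.root.elements.to_a.select { |child| child.name == 'entry' }
apply_links = entries.map { |entry| link_href(entry, 'tag:somelink.com/apply_url') }
puts apply_links
```

In a Rails app you would do this lookup in the controller or a helper and hand the resulting strings to the Haml view, so the view needs no second loop.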

Resources