Trouble Extracting Data with Nokogiri - ruby-on-rails

I'm practicing extracting data from an XML site, using Nokogiri to read and parse it. Eventually I need to analyze the data, but for now I'm just trying to get any output at all, without success.
I have the following code:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open("http://www.ibiblio.org/xml/examples/shakespeare/macbeth.xml"))
doc.xpath('//PERSONA').each do |char_element|
puts char_element.text
end
I'm simply trying to read the characters off the XML website, but I'm not getting any results when I run it in the terminal. I also tried just writing a simple xpath call such as the one below:
doc.xpath("//PERSONA")
or
doc.xpath("PLAY TITLE")
And I get either an error or it simply acts as if nothing was entered.
I have put in a simple function to test it, so I know it's reading the file. Can anyone tell me what I'm doing wrong?

You're trying to read an XML file as an HTML one.
Please try this example:
doc = Nokogiri::XML(open("http://www.ibiblio.org/xml/examples/shakespeare/macbeth.xml"))
doc.xpath('//PERSONA').each{|ce| p ce.text }
"DUNCAN, king of Scotland."
"MALCOLM"
"DONALBAIN"
"MACBETH"
"BANQUO"
"MACDUFF"
"LENNOX"
"ROSS"
"MENTEITH"
"ANGUS"
"CAITHNESS"
"FLEANCE, son to Banquo."
"SIWARD, Earl of Northumberland, general of the English forces."
"YOUNG SIWARD, his son."
"SEYTON, an officer attending on Macbeth."
"Boy, son to Macduff. "
"An English Doctor. "
"A Scotch Doctor. "
"A Soldier."
"A Porter."
"An Old Man."
"LADY MACBETH"
"LADY MACDUFF"
"Gentlewoman attending on Lady Macbeth. "
"HECATE"
"Three Witches."
"Apparitions."
"Lords, Gentlemen, Officers, Soldiers, Murderers, Attendants, and Messengers. "
Please be sure you're using Nokogiri::XML instead of Nokogiri::HTML.
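For reference, here is a minimal sketch of the corrected script. Note that on recent Rubies open-uri is reached through URI.open, since plain Kernel#open no longer handles URLs there; everything else follows the answer above.
require 'nokogiri'
require 'open-uri'

url = "http://www.ibiblio.org/xml/examples/shakespeare/macbeth.xml"
# Parse as XML, not HTML, so the case-sensitive PERSONA elements are actually found.
doc = Nokogiri::XML(URI.open(url))

doc.xpath('//PERSONA').each do |persona|
  puts persona.text
end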

Related

How to Parse LARGE JSON files with formatting errors

I have a bunch of large JSON files (> 500 MB) which I would like to parse with a Ruby script (I am trying to parse them with the YAJL gem).
I have noticed that the JSON files have formatting errors: each file is composed of "multiple" JSON objects without a proper tree-like structure or array. Below you can see what such a JSON file looks like:
testfile.json:
{title: "Don Quixote", author: "Miguel de Cervantes", printyear: 2010}
{title: "Great Gatsby", author: "F. Scott Fitzgerald", printyear: 2014}
{title: "Ulysses", author: "James Joyce", printyear: 2010}
This is the script I use to parse the file:
require 'yajl'
json = File.new('testfile.json', 'r')
hash = Yajl::Parser.parse(json)
Here is the error message I get:
Yajl::ParseError: Found multiple JSON objects in the stream but no block or the on_parse_complete callback was assigned to handle them.
I would appreciate it if you could guide me on how to solve this issue.
The error message you got ("Found multiple JSON objects in the stream …") implies that your input contains multiple but valid JSON objects, so I assume your actual file looks more like this:
{"title":"Don Quixote","author":"Miguel de Cervantes","printyear":2010}
{"title":"Great Gatsby","author":"F. Scott Fitzgerald","printyear":2014}
{"title":"Ulysses","author":"James Joyce","printyear":2010}
One of YAJL's features is to:
Parse and encode multiple JSON objects to and from streams or strings continuously.
So given the above input (as a file or string), you can pass a block to parse, which will be called for each parsed object:
require 'yajl'
io = File.open('testfile.json')
Yajl::Parser.parse(io) do |book|
puts "“#{book['title']}” by #{book['author']} (#{book['printyear']})"
end
Output:
“Don Quixote” by Miguel de Cervantes (2010)
“Great Gatsby” by F. Scott Fitzgerald (2014)
“Ulysses” by James Joyce (2010)
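The error message also mentions the on_parse_complete callback; assigning one is an equivalent route if you would rather collect the objects than print them. A minimal sketch, assuming the same testfile.json:
require 'yajl'

parser = Yajl::Parser.new
books = []
# The callback is invoked once for every complete JSON object in the stream.
parser.on_parse_complete = lambda { |book| books << book }
parser.parse(File.open('testfile.json'))

books.size # => 3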
Don't use JSON.parse, because the file's content is not valid JSON. Each line of this file looks like a Ruby hash, so a different parsing method can be used.
You should be able to parse each line by using: YAML.load(line).
Also, because the file is big, don't load the whole file into memory. Use File.foreach to read it line by line.
require 'yaml'
lines = []
File.foreach('testfile.json') do |line|
lines << YAML.load(line)
end
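If the files really are too big to hold in memory, note that the lines array above still grows with the file; a small variation processes each record as it is read instead. This is only a sketch, with field names taken from the sample file and an arbitrary output format:
require 'yaml'

File.foreach('testfile.json') do |line|
  next if line.strip.empty?
  book = YAML.load(line)
  puts "#{book['title']} by #{book['author']} (#{book['printyear']})"
end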

Problem with attachments' character encoding using gmail gem in ruby/rails

What I am doing:
I am using the gmail gem in a Rails 4 app to get email attachments from a specific account at regular intervals. Here is an extract from the core part (for simplicity, only considering the first email and its first attachment):
require 'gmail'
Gmail.connect(#user_email,#user_password) do |gmail|
if gmail.logged_in?
emails = gmail.inbox.emails(:from => #sender_email)
email = emails[0]
attachment = email.message.attachments[0]
File.open("~/temp.csv", 'w') do |file|
file.write(
StringIO.new(attachment.decoded.to_s[2..-2].force_encoding("ISO-8859-15").encode!('UTF-8')).read
)
end
end
end
The encoding of the attached file can vary. The particular one that I am currently having issues with is in Finnish. It contains Finnish characters and a superscripted 3 character.
This is what I expect to get when I run the above code. (This is what I get when I download the attachment manually through the Gmail user interface):
What the problem is:
However, I am getting the following odd results.
From cat temp.csv (Looks good to me):
With nano temp.csv (Here I have no idea what I am looking at):
This is what temp.csv looks like opened in Sublime Text (directly via winscp). First line and small parts look ok but then Chinese/Japanese characters:
This is what temp.csv looks like in Notepad (after download via winscp). Looks ok except a blank space has been inserted between each character and the newlines seem to be missing:
What I have tried:
I have without success tried:
.force_encoding(...) with all the different "ISO-8859-x" character sets
putting the force_encoding("ISO-8859-15").encode!('UTF-8') outside the .read (works but doesn't solve the problem)
encode to UTF-8 without first forcing another encoding but this leads to Encoding::UndefinedConversionError: "\xC4" from ASCII-8BIT to UTF-8
writing as binary with 'wb' and 'w+b' in the File.open() (which oddly doesn't seem to make a difference to the outcome).
searching stackoverflow and the web for other ideas.
Any ideas would be much appreciated!
Not beautiful, but it will work for me now.
After re-encoding, I convert the string to a char array, then remove the chars I do not want and then join the remaining array elements to form a string.
decoded_att = attachment.decoded
data = decoded_att.encode("UTF-8", "ISO-8859-1", invalid: :replace, undef: :replace).gsub("\r\n", "\n")
data_as_array = data.chars
data_as_array = data_as_array.delete_if {|i| i == "\u0000" || i == "ÿ" || i == "þ"}
data = data_as_array.join('').to_s
File.write("~/temp.csv", data.to_s)
This will work for me now. However, I have no idea how these characters have ended up in the attachment ("ÿ" and "þ" at the start of the document and "\u0000" between all remaining characters).
It seems like you need to do attachment.body.decoded instead of attachment.decoded.
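For reference, a minimal sketch of that variant, writing the decoded bytes in binary mode and leaving any transcoding for a second step (user_email, user_password, sender_email and the output path are placeholders, as in the question):
require 'gmail'

Gmail.connect(user_email, user_password) do |gmail|
  if gmail.logged_in?
    email = gmail.inbox.emails(:from => sender_email).first
    attachment = email.message.attachments.first
    # Write the raw bytes untouched ('wb'); transcode afterwards once the
    # source encoding is known (a leading ÿþ byte pair is the UTF-16LE
    # byte-order mark, which would also explain the \u0000 between characters).
    File.open(File.expand_path("~/temp.csv"), 'wb') do |file|
      file.write(attachment.body.decoded)
    end
  end
end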

How to translate language packs in SocialEngine

I know that SocialEngine stores language files as CSV files in application/languages. The common format in the CSV files is as follows:
"Source word"; "Translated word"
But this sometimes gets very complicated, especially when special characters are used in some parts, e.g.:
"Total Credits : %s";"Total Credits : %s"
"_EMAIL_SITEGROUP_BADGEREQUEST_APPROVED_EMAIL_TITLE";"Group Badge Request Approved"
"Video conversion failed. Please try uploading %1$sagain%2$s.";"Video conversion failed. Please try uploading %1$sagain%2$s."
"{item:$subject} replied to a comment on {item:$owner}\'\'s page offer {item:$object:$title}: {body:$body}";"{item:$subject} replied to a comment on {item:$owner}\'\'s page offer {item:$object:$title}: {body:$body}"
"3%s Level Category:";"3%s Level Category:"
"I have read and agree to the <a href='javascript:void(0);' onclick=window.open('%s','mywindow','width=500,height=500')>terms of service</a>.";"I have read and agree to the <a href='javascript:void(0);' onclick=window.open('%s','mywindow','width=500,height=500')>terms of service</a>."
%s
- this is a variable placeholder. When there are several variables they are numbered:
%1$s, %2$s
and so on, for any number X:
%X$s
This is the key (in your case):
"Total Credits : %s"
This is the delimiter:
;
and this is your translation:
"Total Credits : %s"
Cheers;)
Without worrying about any of that, you can use this plugin: Language Translator / Multilingual Plugin

Prestashop product upload with csv error

I am trying to upload a CSV file of products in PrestaShop. Below are the errors that I am getting:
No Name (ID: 61,1,Orous Women's A Line Dress,Home,1399,IN Reduced Rate (4%),0,0,,,,,D2_Yellow,,,,,,,,,,,2,,,,,,"Fabric: Crepe A-line
Exquisite style patterns Gentle machine wash, dry clean, do not
bleach",,,,,,,,,1,,,,http://www.spademark.com/1154X1500/orous/D2_Yellow-_1.jpg,,,,New,,,,,,,,)
cannot be saved
and
Property Product->name is empty
What am I doing wrong?
Docs: http://doc.prestashop.com/display/PS16/CSV+Import+Parameters
Please use this structure for successful upload:
"Enabled";"Name";"Categories";"Price";"Tax rule ID";"Buying price";"On sale";"Reference";"Weight";"Quantity";"Short desc.";"Long desc";"Images URL"
1;"Test";"1,2,3";130;1;75;0;"PROD-TEST";"0.500";10;"'Tis a short desc.";"This is a long description.";"http://www.myprestashop/images/product1.gif"

How to parse og meta tags using httparty for rails 3

I am trying to parse og meta tags using the HTTParty gem using this code:
link = "http://www.usatoday.com/story/gameon/2013/01/08/nfl-jets-tony-sparano-fired/1817037/"
# link = "http://news.yahoo.com/chicago-lottery-winners-death-ruled-homicide-181627271.html"
resp = HTTParty.get(link)
ret_body = resp.body
# title
og_title = ret_body.match(/\<[Mm][Ee][Tt][Aa] property\=\"og:title\"\ content\=\"(.*?)\"\/\>/)
og_title = og_title[1].to_s
The problem is that it worked on some sites (Yahoo!) but not others (USA Today).
Don't parse HTML with regular expressions, because they're too fragile for anything but the simplest problems. A tiny change to the HTML can break the pattern, causing you to begin a slow battle of maintaining an ever-expanding pattern. It's a war you won't win.
Instead, use an HTML parser. Ruby has Nokogiri, which is excellent. Here's how I'd do what you want:
require 'nokogiri'
require 'httparty'
%w[
http://www.usatoday.com/story/gameon/2013/01/08/nfl-jets-tony-sparano-fired/1817037/
http://news.yahoo.com/chicago-lottery-winners-death-ruled-homicide-181627271.html
].each do |link|
resp = HTTParty.get(link)
doc = Nokogiri::HTML(resp.body)
puts doc.at('meta[property="og:title"]')['content']
end
Which outputs:
Jets fire offensive coordinator Tony Sparano
Chicago lottery winner's death ruled a homicide
Perhaps I can offer an easier solution? Check out the OpenGraph gem.
It's a simple library for parsing Open Graph protocol information from web sites and should solve your problem.
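Typical usage is roughly as follows; this is only a sketch, and the exact accessors may vary by gem version:
require 'opengraph'

page = OpenGraph.fetch('http://www.usatoday.com/story/gameon/2013/01/08/nfl-jets-tony-sparano-fired/1817037/')
# fetch returns a small object exposing the og:* properties it found
puts page.title if page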
Solution:
og_title = ret_body.match(/\<[Mm][Ee][Tt][Aa] property\=\"og:title\"\ content\=\"(.*?)\"[\s\/\>|\/\>]/)
og_title = og_title[1].to_s
Trailing whitespace messed up the parsing, so make sure to check for that. I added an OR clause to the regex to allow for both trailing and non-trailing whitespace.