I have the following code:
page = Net::HTTP.get_response(URI.parse('http://www.enhancetv.com.au/tvguide/rss/melbournerss.php')).body rescue nil
show_info = Hash.from_xml(page).to_json
puts show_info
Which outputs the following JSON:
{"rss":{"version":"2.0","channel":{"title":"EnhanceTV Melbourne TV Guide ","description":null,"link":"http://www.enhancetv.com.au","lastBuildDate":"Wed, 21 Dec 2011 10:25:55 1000","generator":"FeedCreator 1.7.2","item":[{"title":"Toasted TV - TEN - 07:00:00 - 21/12/2011","link":"http://www.enhancetv.com.au/tvguide/","description":"Join the team for the latest in gaming, sport, gadgets, pop culture, movies, music and other seriously fun stuff! Featuring a variety of your favourite cartoons."},{"title":"Totally Wild - TEN - 08:00:00 - 21/12/2011","link":"http://www.enhancetv.com.au/tvguide/","description":"The Totally Wild team brings you the latest in action, adventure and wildlife from Australia and around the globe."},{"title":"Explore: Africa's Rift Valley - SBS ONE - 19:30:00 - 21/12/2011","link":"http://www.enhancetv.com.au/tvguide/","description":"Simon Reeve leads a team of journalists on a spectacular journey down East Africa's Rift Valley. From the tiny country of Djibouti, which is the centre of America's military presence in Africa, to the wide open plains of Kenya, the team encounter awe-inspiring landscapes, rich culture and amazing wildlife."}]}}}
First of all I'd like to actually be able to loop through each of the items.
This is what I have so far, however it's throwing an error. I'm not entirely sure how to even loop through the parsed JSON properly:
result = JSON.parse(show_info) # convert JSON data to native ruby
result["channel"].each do |a|
  puts a['title']
  puts a['description']
  puts a['link']
end
This gives the following error:
scraper.rb:118:in `<main>': undefined method `each' for "rss":String (NoMethodError)
I'd then want to be able to do a split on the "title" so I can convert the time and date into a DateTime object to use later.
Thanks in advance.
I've decided to use Nokogiri to parse the XML instead of converting the XML to JSON.
Thanks for your help.
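For anyone who stays with the JSON route: in the output above the items are nested under "rss", then "channel", then "item", so that is the array to iterate over. Below is a minimal sketch under a few assumptions: the sample string is a trimmed copy of the JSON above, the titles always end in " - CHANNEL - HH:MM:SS - DD/MM/YYYY", and the hash keys (:name, :channel, :starts_at) are made up for illustration.

```ruby
require "json"
require "date"

# Trimmed copy of the JSON printed above (two items, descriptions cut to their first sentence).
show_info = <<~FEED_JSON
  {"rss":{"version":"2.0","channel":{"title":"EnhanceTV Melbourne TV Guide ","item":[
    {"title":"Toasted TV - TEN - 07:00:00 - 21/12/2011","link":"http://www.enhancetv.com.au/tvguide/","description":"Join the team for the latest in gaming, sport, gadgets, pop culture, movies, music and other seriously fun stuff!"},
    {"title":"Explore: Africa's Rift Valley - SBS ONE - 19:30:00 - 21/12/2011","link":"http://www.enhancetv.com.au/tvguide/","description":"Simon Reeve leads a team of journalists on a spectacular journey down East Africa's Rift Valley."}
  ]}}}
FEED_JSON

# The items live under rss -> channel -> item, not directly under "channel".
items = JSON.parse(show_info).dig("rss", "channel", "item")
# Defensive: a feed with a single <item> may come through as a Hash rather than an Array.
items = [items] if items.is_a?(Hash)

shows = items.map do |item|
  # The last three " - " separated parts are channel, time and date;
  # whatever precedes them is the show name (which may itself contain " - ").
  *name_parts, channel, time, date = item["title"].split(" - ")
  {
    name: name_parts.join(" - "),
    channel: channel,
    starts_at: DateTime.strptime("#{date} #{time}", "%d/%m/%Y %H:%M:%S"),
    link: item["link"],
    description: item["description"]
  }
end

shows.each do |show|
  puts "#{show[:starts_at].strftime('%H:%M')} #{show[:channel]}: #{show[:name]}"
end
```

The feed gives no time zone in the title, so the resulting DateTime objects are in UTC unless an offset is added to the format string.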
I have a JSON response which is stored as a string in "BQresponse"
{"kind":"bigquery#queryResponse", "schema":{"fields":[{"name":"Revenue", "type":"INTEGER", "mode":"NULLABLE"}, {"name":"Country", "type":"STRING", "mode":"NULLABLE"}]}, "jobReference":{"projectId":"curious-idea-532", "jobId":"job_S5rTcY2vwEu-amtrxb8NRPWiynU"}, "totalRows":"3", "rows":[{"f":[{"v":"100"}, {"v":"Ireland"}]}, {"f":[{"v":"200"}, {"v":"Netherlands"}]}, {"f":[{"v":"50"}, {"v":"Singapore"}]}], "totalBytesProcessed":"0", "jobComplete":true, "cacheHit":true}
I am trying to convert this into a two line response (for later export to CSV), looking exactly like this:
Country||Sum of Revenue|,Ireland,Netherlands,Singapore
Revenue,100,200,50
So far, I've extracted the first parts, like so:
puts BQresponse[/#{D1_mark1}(.*?)#{D1_mark2}/m, 1]+"||"+BQresponse[/#{M1_mark1}(.*?)#{M1_mark2}/m, 1]
Next I need to extract "Ireland,Netherlands,Singapore". However, I cannot use the same approach as above, because the number of values may change as the string is updated (there may be only 2 countries, or 5).
The string includes a part that says "totalRows":"3"," - this 3 is the number of expected countries and I suppose it could be used in a loop/for-each of some sort. But I'm not sure how best to approach this.
The number values on the second line face the exact same issue (each country has a number). The "Revenue" on the second line is simply a repeat of "Revenue" on the first line, with "Sum_of_" removed.
Appreciate suggestions on what direction to head in.
Also, this is valid JSON. If I'm completely off track and it would be easier to parse this string as JSON first, that's okay too.
Thanks!
There's an awesome gem for this, json2csv, that I've used before.
To try it out, save a sample JSON response into a file called sample.json, and then in your terminal run:
json2csv convert sample.json
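Since the string is valid JSON, another route is to parse it with Ruby's standard json library and transpose the rows into columns, which handles any number of countries without relying on totalRows. This is a rough sketch, not a drop-in solution: the variable names are made up, the response is a trimmed copy of the one in the question, and it prints plain "Country" and "Revenue" labels rather than the exact "Country||Sum of Revenue|" header.

```ruby
require "json"

# The response string from the question, with jobReference and the byte/cache fields trimmed.
bq_response = '{"kind":"bigquery#queryResponse", "schema":{"fields":[{"name":"Revenue", "type":"INTEGER", "mode":"NULLABLE"}, {"name":"Country", "type":"STRING", "mode":"NULLABLE"}]}, "totalRows":"3", "rows":[{"f":[{"v":"100"}, {"v":"Ireland"}]}, {"f":[{"v":"200"}, {"v":"Netherlands"}]}, {"f":[{"v":"50"}, {"v":"Singapore"}]}], "jobComplete":true}'

data = JSON.parse(bq_response)

# Column names, in the same order as the cells inside each row: ["Revenue", "Country"].
names = data["schema"]["fields"].map { |field| field["name"] }

# Each row becomes a flat array of cell values, e.g. ["100", "Ireland"].
rows = data["rows"].map { |row| row["f"].map { |cell| cell["v"] } }

# Transposing turns rows into columns; zip pairs each column with its name.
columns = names.zip(rows.transpose).to_h

# Emit the two lines in the order the question asks for: Country first, then Revenue.
csv_lines = %w[Country Revenue].map do |name|
  [name, *columns.fetch(name)].join(",")
end

puts csv_lines
```

If a value could ever contain a comma or a quote, joining by hand is no longer safe and a proper CSV writer would be the better choice.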
I have a bunch of words saved in a plain text file and I would like to import them into Google Translate somehow. They should then be visible in the new Google Translate feature, Phrasebook. What I have done so far is open the Google Translate page with Firebug enabled and enter the word "feuds". The results are the following:
GET /translate/releases/twsfe_w_20130506_RC02/r/js/desktop_module_lazy.js
GET /translate/releases/twsfe_w_20130506_RC02/r/js/desktop_module_lazy.js
GET /translate_a/t?client=t&hl=en&sl=auto&tl=sk&ie=UTF-8&oe=UTF-8&multires=1&ssel=0&tsel=0&uptl=sk&sc=1&q=feuds
POST https://plus.google.com/u/0/_/n/gcosuc?origin=http%3A%2F%2Ftranslate.google.com
200 OK
103ms
>>>
################################################
### AFTER PRESSING SAVE TO PHRASEBOOK BUTTON ###
################################################
POST /translate_a/sg?client=t&cm=a&sl=en&tl=sk&ql=5&hl=en&xt=ALkJrhgAAAAAUZvTtWm7IqJAYJpay1AU8x-VoS_AM0J0
client t
cm a
hl en
ql 5
sl en
tl sk
xt ALkJrhgAAAAAUZvTtWm7IqJAYJpay1AU8x-VoS_AM0J0
200 OK
137ms
>>>
GET /translate_a/sg?client=t&cm=g&tk=8mXp7vd2yN4UVnN8_Bw51LnXE2wqfQI&hl=en&xt=ALkJrhgAAAAAUZvTtWm7IqJAYJpay1AU8x-VoS_AM0J0
client t
cm g
hl en
tk 8mXp7vd2yN4UVnN8_Bw51LnXE2wqfQI
xt ALkJrhgAAAAAUZvTtWm7IqJAYJpay1AU8x-VoS_AM0J0
200 OK
112ms
>>>
You can see that the word is available in the 3rd GET ("&q=feuds"). But what happens when I press "Save to Phrasebook"? It seems that the request sends the source language (sl), target language (tl) etc. along with a strange string, "ALkJrhgAAAAAUZvTtWm7IqJAYJpay1AU8x-VoS_AM0J0", which might be my "hashed" word. Another idea that comes to mind is that this strange string is not necessarily the "hashed" word, but might be, for example, some ID that refers to the word I typed earlier (in this case a few seconds before I hit the "Save to Phrasebook" button). Is it possible to somehow "decode" this string?
I have noticed that this "xt" value is received in the 1st response from "translate.google.com"; look for the variable named "USAGE".
I have also found that the "xt" value appears in all requests related to the Phrasebook (for example: show phrasebook, add/delete a word to/from the phrasebook), and that each newly loaded Google Translate page gets a new "xt" value.
Based on this, we can assume that the "xt" variable acts as an identification token for the Phrasebook rather than as an encoding of the word.
After upgrading to Ruby-1.9.3-p392 today, REXML throws a Runtime Error when attempting to retrieve an XML response over a certain size - everything works fine and no error is thrown when receiving under 25 XML records, but once a certain XML response length threshold is reached, I get this error:
Error occurred while parsing request parameters.
Contents:
RuntimeError (entity expansion has grown too large):
/.rvm/rubies/ruby-1.9.3-p392/lib/ruby/1.9.1/rexml/text.rb:387:in `block in unnormalize'
I realize this was changed in the most recent Ruby version:
http://www.ruby-lang.org/en/news/2013/02/22/rexml-dos-2013-02-22/
As a quick fix, I've changed the size of REXML::Document.entity_expansion_text_limit to a larger number and the error goes away.
Is there a less risky solution?
This issue occurs when you send too much content in a single node of the XML response.
To fix it, restrict the data in each individual node to under 10 KB (instead of sending the whole content, send truncated data and provide a separate link to view the full content).
The error is being raised from the below file :
ruby-2.1.2/lib/ruby/2.1.0/rexml/text.rb
# Unescapes all possible entities
def Text::unnormalize( string, doctype=nil, filter=nil, illegal=nil )
  sum = 0
  string.gsub( /\r\n?/, "\n" ).gsub( REFERENCE ) {
    s = Text.expand($&, doctype, filter)
    if sum + s.bytesize > Security.entity_expansion_text_limit
      raise "entity expansion has grown too large"
    else
      sum += s.bytesize
    end
    s
  }
end
The limit checked in ruby-2.1.2/lib/ruby/2.1.0/rexml/text.rb (Security.entity_expansion_text_limit) defaults to 10240 bytes, which means 10 KB of expanded entity text per node.
REXML already defaults to allowing only 10000 entity substitutions per document, so the maximum amount of text that can be generated by entity substitution will be around 98 megabytes. (See https://www.ruby-lang.org/en/news/2013/02/22/rexml-dos-2013-02-22/ )
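If raising the limit is unavoidable, one way to reduce the exposure is to raise it only around the parse that needs it and restore the previous value afterwards. The sketch below assumes a Ruby version where the setting lives on REXML::Security, as in the 2.1 source quoted above (on 1.9.3-p392 it is REXML::Document.entity_expansion_text_limit instead). The helper name with_entity_text_limit and the 50_000 value are made up for illustration.

```ruby
require "rexml/document"

# Hypothetical helper: temporarily raise the per-node entity expansion limit
# for the duration of the block, then put the old value back.
def with_entity_text_limit(limit)
  previous = REXML::Security.entity_expansion_text_limit
  REXML::Security.entity_expansion_text_limit = limit
  yield
ensure
  # Restore the old limit even if parsing raised an error.
  REXML::Security.entity_expansion_text_limit = previous
end

# Usage: only this one parse runs with the larger limit.
doc = with_entity_text_limit(50_000) do
  REXML::Document.new("<records><record>Fish &amp; chips</record></records>")
end

puts doc.root.elements["record"].text
```

The setting is global to the process, so this does not isolate concurrent threads from each other; it only narrows the window during which the larger limit applies.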
That sounds like a LOT of XML. Do you really need to get all of it? Maybe you can just request certain fields from the remote server? One option might be to try another XML parser (Nokogiri, for example). Another option might be to use something other than XML as a transport (JSON? Binary?).
I'm using ActiveResource to consume a REST webservice provided by Redmine (a bug-tracking tool). That webservice produces XML like the following:
<custom_field name="Issue Owner" id="15">Fred Fake</custom_field>
<custom_field name="Needs Printing" id="16">0</custom_field>
<custom_field name="Review Assignee" id="17">Fran Fraud</custom_field>
<custom_field name="Released On" id="20"></custom_field>
<custom_field name="Client Facing" id="21">0</custom_field>
<custom_field name="Type" id="22">Bug</custom_field>
<custom_field name="QA Assignee" id="23"></custom_field>
<custom_field name="Company Name" id="26"></custom_field>
<custom_field name="QA Notes" id="27"></custom_field>
<custom_field name="Failed QA Attempts" id="28">2</custom_field>
However, when ActiveResource parses that, and I iterate through the results printing them out, I get:
Fred Fake
0
Fran Fraud
#<Redmine::Issue::CustomFields::CustomField:0x5704e95d>
0
Bug
#<Redmine::Issue::CustomFields::CustomField:0x32fd963>
#<Redmine::Issue::CustomFields::CustomField:0x3a68f437>
#<Redmine::Issue::CustomFields::CustomField:0x407964d6>
2
That's right, it throws out all of the attribute info from anything with a value, but keeps the attribute info from the empty elements.
Needless to say, this makes things rather difficult when you're trying to find the value for id 15 (or whatever). I can reference things by their position, but that's very brittle, because those elements are likely to change in the future. I assume there has to be some way to make ActiveResource keep the attribute info, but I can't see what it would be, since I'm not doing anything special.
(My ActiveResource extension is just five lines long: it extends ActiveResource, defines the url, username and password of the service, and that's it).
So, does anyone know how I can make ActiveResource not parse this XML so strangely?
This is a known issue with ActiveResource apparently:
https://github.com/rails/rails/issues/588
Unfortunately, nothing appears to have been done about it and the issue was closed. If you're feeling up to it, the Rails 3 code for updating ActiveResource and Hash.from_xml to preserve all attributes is in the gist below, and you could create a tailored version in your Redmine module to fix it:
https://gist.github.com/971598
Update:
An alternative, as it appears ActiveResource will not be part of Rails 4 core and will be spun out as a separate gem, would be to use an alternative ORM for REST APIs, like Her.
Her allows you to use a custom parser for your XML. This is an example custom parser called Redmine::ParseXML:
https://gist.github.com/3879418
So then all you need to do is create a file like config/initializers/her.rb:
Her::API.setup :url => "https://api.xxxxx.org" do |connection|
  connection.use Faraday::Request::UrlEncoded
  connection.use Redmine::ParseXML
  connection.use Faraday::Adapter::NetHttp
end
and you get a Hash like the following:
#<Redmine::Issue(issues) issues={:attributes=>{:type=>"array", :count=>1640},
:issue=>{:id=>4326,
:project=>{:attributes=>{:name=>"Redmine", :id=>1}},
:tracker=>{:attributes=>{:name=>"Feature", :id=>2}},
:status=>{:attributes=>{:name=>"New", :id=>1}},
:priority=>{:attributes=>{:name=>"Normal", :id=>4}},
:author=>{:attributes=>{:name=>"John Smith", :id=>10106}},
:category=>{:attributes=>{:name=>"Email notifications", :id=>9}},
:subject=>"\n Aggregate Multiple Issue Changes for Email Notifications\n ",
:description=>"\n This is not to be confused with another useful proposed feature that\n would do digest emails for notifications.\n ",
:start_date=>"2009-12-03",
:due_date=>{},
:done_ratio=>0,
:estimated_hours=>{},
:custom_fields=>{
:custom_field=>[
{:attributes=>{:name=>"Issue Owner", :id=>15}, "value"=>"Fred Fake"},
{:attributes=>{:name=>"Needs Printing", :id=>16}, "value"=>0},
{:attributes=>{:name=>"Review Assignee", :id=>17}, "value"=>"Fran Fraud"},
{:attributes=>{:name=>"Released On", :id=>20}},
{:attributes=>{:name=>"Client Facing", :id=>21}, "value"=>0},
{:attributes=>{:name=>"Type", :id=>22}, "value"=>"Bug"},
{:attributes=>{:name=>"QA Assignee", :id=>23}},
{:attributes=>{:name=>"Company Name", :id=>26}},
{:attributes=>{:name=>"QA Notes", :id=>27}},
{:attributes=>{:name=>"Failed QA Attempts", :id=>28}, "value"=>2}]},
:created_on=>"Thu Dec 03 15:02:12 +0100 2009",
:updated_on=>"Sun Jan 03 12:08:41 +0100 2010"}}>
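If all that is needed is a lookup of custom field values by id, the raw XML can also be parsed directly so that no attributes are lost. The sketch below swaps in REXML from Ruby's standard library rather than ActiveResource or Her, so it is an alternative technique, not what either library does internally. The <custom_fields> wrapper element and the fields_by_id name are assumptions made for the example.

```ruby
require "rexml/document"

# A few of the custom fields from the question, wrapped in an assumed parent element.
xml = <<~XML
  <custom_fields>
    <custom_field name="Issue Owner" id="15">Fred Fake</custom_field>
    <custom_field name="Released On" id="20"></custom_field>
    <custom_field name="Type" id="22">Bug</custom_field>
  </custom_fields>
XML

doc = REXML::Document.new(xml)

# Index every custom field by its numeric id, keeping both the name attribute
# and the text value. Empty elements end up with a nil value.
fields_by_id = {}
REXML::XPath.each(doc, "//custom_field") do |element|
  id = element.attributes["id"].to_i
  fields_by_id[id] = { name: element.attributes["name"], value: element.text }
end

puts fields_by_id[15][:value]
puts fields_by_id[22][:name]
```

Looking fields up by id this way avoids depending on their position in the response.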
I am working on Ubuntu 10.04, using Feedzirra to parse RSS feeds, with a MySQL database.
I am trying to parse RSS feeds from Times of India Top Stories. There seems to be a problem with the first link. I am sure the TOI guys will correct it soon, but I don't want to face a similar error later, so I want to ask how to solve this problem.
Just look at this item, especially the link:
<item>
<title>CWG: Abhinav Bindra, Gagan Narang win first Gold for India</title>
<description>Abhinav Bindra and Gagan Narang on Tuesday bagged Gold for the men's 10 m air rifle pair's event, getting India its first gold in the 19th Commonwealth Games.</description>
<link>/cwgarticleshow/6688747.cms</link>
<guid>/cwgarticleshow/6688747.cms</guid>
<pubDate>Tue, 05 Oct 2010 04:57:46 GMT</pubDate>
</item>
The link is <link>/cwgarticleshow/6688747.cms</link>
Now, when I click the link in the view, it is routed to http://localhost:3000/cwgarticleshow/6688747.cms instead of http://timesofindia.indiatimes.com/cwgarticleshow/6688747.cms
And the error I am getting is
**Routing Error**
No route matches "/cwgarticleshow/6688747.cms" with {:method=>:get}
How do I correct this type of error?
Looking forward to your help and support.
Thanks
You just need to prepend http://timesofindia.indiatimes.com to the link tag value and you'll be OK.
You can use the URI class. For example, you can define the following method:
require "uri"
def repair_link(feed_link)
  uri = URI.parse(feed_link)
  uri.scheme ||= "http"
  uri.host ||= "timesofindia.indiatimes.com"
  uri.to_s
end
It sets the scheme and host parts of the URL only if they are not already filled in. So if you call it with a normal link (like http://foo/bar.cms), nothing is changed.
One last thing: you should probably rescue the exception somewhere, as URI.parse raises URI::InvalidURIError for an invalid URI. How you deal with it is up to you.
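As a variation on the same idea, URI.join from the standard library resolves a relative link against a base URL and leaves absolute links alone. This is only a sketch: the method name absolutize_link and the FEED_BASE constant are made up, and returning nil for an unparsable link is just one possible way to handle the error.

```ruby
require "uri"

# Assumed base URL of the feed's site, used to resolve relative links.
FEED_BASE = "http://timesofindia.indiatimes.com"

# Hypothetical helper: returns an absolute URL string, or nil if the link
# cannot be parsed as a URI at all.
def absolutize_link(feed_link, base = FEED_BASE)
  URI.join(base, feed_link).to_s
rescue URI::InvalidURIError
  nil
end

puts absolutize_link("/cwgarticleshow/6688747.cms")
puts absolutize_link("http://foo/bar.cms")
```

Doing this when the feed entries are stored, rather than in the view, keeps the repaired link in the database so every view gets the absolute URL.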