I'm trying to parse quotes from a webpage using Ruby and Nokogiri and am running into issues. I believe it's an encoding issue but I'm not certain.
This is the code:
require 'open-uri'
require 'nokogiri'

doc = Nokogiri::HTML(URI.open("https://www.goodreads.com/work/quotes/23218"))
tags = doc.css('div.quoteText')
a = tags[8].to_s
a = a.encode('iso-8859-1').force_encoding('utf-8')
# a = a.encode('cp1252').force_encoding('UTF-8')
puts ActionView::Base.full_sanitizer.sanitize(a[0,a.index('<br>')])
This is what gets returned:
“â¦most of the damage we cause to the planet is the result of our own ignorance.”
I'm not sure how to get this to return the proper text...
“…most of the damage we cause to the planet is the result of our own ignorance.”
Any advice on how to get this to work properly? There are other examples of this happening from other quotes as well.
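A useful first step, before re-encoding anything, is to pin down what encoding each stage reports. Here is a minimal diagnostic sketch (the URL and quote index come from the question above; passing 'UTF-8' to Nokogiri is an assumption that the page really is UTF-8):

require 'open-uri'
require 'nokogiri'

html = URI.open("https://www.goodreads.com/work/quotes/23218").read
puts html.encoding                         # what Ruby thinks the raw bytes are
doc = Nokogiri::HTML(html, nil, 'UTF-8')   # parse with an explicit encoding
puts doc.encoding                          # what Nokogiri detected/used
quote = doc.css('div.quoteText')[8]
puts quote.text                            # node text, returned as UTF-8

If the text is only garbled after the a.encode('iso-8859-1').force_encoding('utf-8') line runs, that round-trip is the likely culprit: it repairs only strings whose UTF-8 bytes were mislabelled as Latin-1, and applied to a string that is already valid UTF-8 it either corrupts the text or raises Encoding::UndefinedConversionError.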
I'm confused about a difference I'm seeing between Nokogiri commands run from the Rails console and the same commands run from a Rails helper.
In the Rails console, I am able to capture the data I want with these commands:
endpoint = "https://basketball-reference.com/leagues/BAA_1947_totals.html"
browser = Watir::Browser.new(:chrome)
browser.goto(endpoint)
@doc_season = Nokogiri::HTML.parse(URI.open("https://basketball-reference.com/leagues/BAA_1947_totals.html"))
player_season_table = @doc_season.css("tbody")
rows = player_season_table.css("tr")
rows.search('.thead').each(&:remove) #THIS WORKED
rows[0].at_css("td").try(:text) # Gets single player name
rows[0].at_css("a").attributes["href"].try(:value) # Gets that player page URL
However, my Rails helper, which is meant to take those commands and fold them into methods, behaves differently:
module ScraperHelper
  def target_scrape(url)
    browser = Watir::Browser.new(:chrome)
    browser.goto(url)
    doc = Nokogiri::HTML.parse(browser.html)
  end

  def league_year_prefix(year, league = 'NBA')
    # aba_seasons = 1968..1976
    baa_seasons = 1947..1949
    baa_seasons.include?(year) ? "BAA_#{year}" : "#{league}_#{year}"
  end

  def players_total_of_season(year, league = 'NBA')
    # always the latter year of the season, first year is 1947 no quotes
    # ABA is 1968 to 1976
    league_year = league_year_prefix(year, league)
    @doc_season = target_scrape("http://basketball-reference.com/leagues/#{league_year}_totals.html")
  end

  def gather_players_from_season
    player_season_table = @doc_season.css("tbody")
    rows = player_season_table.css("tr")
    rows.search('.thead').each(&:remove)
    puts rows[0].at_css("td").try(:text)
    puts rows[0].at_css("a").attributes["href"].try(:value)
  end
end
In that module, I try to emulate the Rails console commands and break them into methods. To test it out (since I don't have any other functionality or views built yet), I run the Rails console, include this helper, and call the methods.
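Concretely, the test session looks something like this (a sketch of the kind of session described; include on the console's main object makes the helper's methods and its @doc_season instance variable available):

include ScraperHelper

players_total_of_season(1947)  # scrapes the page and assigns @doc_season
gather_players_from_season     # reads @doc_season and prints name and URL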
But I get wildly different results.
In the gather_players_from_season method, I can see that
player_season_table = @doc_season.css("tbody")
is no longer grabbing the same data it grabbed when run line by line in the console. It also doesn't like the attributes method here:
puts rows[0].at_css("a").attributes["href"].try(:value)
So my first thought is that maybe it's a difference in gems? Watir is launching the headless browser, and Nokogiri isn't causing errors as far as I can tell.
Your first thought of comparing the gem versions is a great idea, but I notice a difference between the two code solutions:
In the Rails Console
the code parses HTML fetched over the network with URI.open: Nokogiri::HTML.parse(URI.open("some url"))
In the ScraperHelper code
the code does not call URI.open; it parses the string that Watir hands back: Nokogiri::HTML.parse(browser.html)
Perhaps that difference returns different HTML and makes the rest of the ScraperHelper produce unexpected results.
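One way to check whether that difference matters is to parse the same page both ways and compare what each parser sees. A sketch (the URL comes from the question; whether the page's markup differs between a static fetch and a rendered browser is exactly the assumption being tested):

require 'open-uri'
require 'nokogiri'
require 'watir'

url = "https://basketball-reference.com/leagues/BAA_1947_totals.html"

static_doc = Nokogiri::HTML.parse(URI.open(url))   # static fetch, no JavaScript runs

browser = Watir::Browser.new(:chrome)
browser.goto(url)
rendered_doc = Nokogiri::HTML.parse(browser.html)  # HTML after the browser has rendered it
browser.close

puts static_doc.css("tbody tr").size               # the row counts differ if the two
puts rendered_doc.css("tbody tr").size             # paths see different markup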
I'm trying to parse a string containing multiple "_"s, but I get a CallFailed exception.
I have tried to create as small an example of the problem syntax as possible.
layout Layout = WhitespaceAndComment* !>> [\ \t\n\r#];
lexical WhitespaceAndComment = [\ \t\n\r] | #category="Comment" "#" ![\n]* $;
syntax SourceList = sourceList: "$"? "{"? Id sourceFile "}"?;
lexical Id = ([a-zA-Z/.\-][a-zA-Z0-9_/.]* !>> [a-zA-Z0-9_/.]) \ Reserved;
keyword Reserved =
"$" | "{" | "}" ;
I am unable to parse this small example.
rascal>try { parse(#SourceList, "test"); } catch CallFailed(m, e): println("<m> : <e>");
|prompt:///|(25,9,<1,25>,<1,34>) : [type(sort("SourceList"),(sort("SourceList"):choice(sort("SourceList"),{prod(label("sourceList",sort("SourceList")),[opt(lit("$")),layouts("$default$"),opt(lit("{")),layouts("$default$"),label("sourceFile",lex("Id")),layouts("$default$"),opt(lit("}"))],{})}),layouts("$default$"):choice(layouts("$default$"),{prod(layouts("$default$"),[],{})}),empty():choice(empty(),{prod(empty(),[],{})}),lex("Id"):choice(lex("Id"),{prod(lex("Id"),[conditional(seq([\char-class([range(45,47),range(65,90),range(97,122)]),conditional(\iter-star(\char-class([range(46,57),range(65,90),range(95,95),range(97,122)])),{\not-follow(\char-class([range(46,57),range(65,90),range(95,95),range(97,122)]))})]),{delete(keywords("Reserved"))})],{})}),keywords("Reserved"):choice(keywords("Reserved"),{prod(keywords("Reserved"),[lit("$")],{}),prod(keywords("Reserved"),[lit("}")],{}),prod(keywords("Reserved"),[lit("{")],{})}))),"${test}"]
ok
Changing the source file from "test" to "${test}" gives exactly the same output.
The complete syntax in which SourceList is embedded has many more rules, and with it I get the following results:
set(${TARGET_NAME}_DEPS
GenConfiguration_OBJ_TN_Common # accept
${COMMON_BB_PCMDEPS} # reject
COMMON_BB_PCMDEPS # accept
COMMON_BB_PCM_DEPS # reject
)
and I want a solution for the rejected lines.
What is wrong with the minimal example? Why is test or ${test} not accepted?
BTW: I am using the latest unstable. Does it make sense to install and try the stable release?
I've tried to reproduce your problem, but it seems to work here:
rascal>parse(#SourceList, "test")
SourceList: (SourceList) `test`
The unstable version is fine at the moment. In fact it's high time to release a stable version. So for now you're better off with the unstable version.
The CallFailed exception is confusing. It means that a function was called that could not be matched or could not be found. So maybe parse is not in scope because the ParseTree module was not imported, or a different function called parse, one that does not expect a type[Tree] and a str as parameters, is in scope. As long as the ParseTree module is imported, your call to parse should be fine.
Please let me know if you have made progress. Perhaps a restart of Eclipse might clear up something as well.
I've got this query:
https://api-v3.mojepanstwo.pl/dane/krs_podmioty.json?conditions[krs_podmioty.nip]=7282827109
In a browser, it works OK, showing data specific for the given nip number.
But in Indy, I get a response as if the query part was omitted:
https://api-v3.mojepanstwo.pl/dane/krs_podmioty.json
I've tried this so far:
BurL = "https://api-v3.mojepanstwo.pl/dane/krs_podmioty.json?conditions[krs_podmioty.nip]=7282827109";
BurL = TIdURI::URLEncode("https://api-v3.mojepanstwo.pl/dane/krs_podmioty.json?conditions[krs_podmioty.nip]=7282827109");
And even raw urlencoded data:
BurL= "https://api-v3.mojepanstwo.pl/dane/krs_podmioty.json?conditions%5Bkrs_podmioty.nip%5D=7282827109";
Code:
try {
    Resp = IdHTTPKrs->Get(BurL);
} catch (EIdHTTPProtocolException& e) {
    ShowMessage(e.Message);
}
What's wrong, and how can I fix this? Or, maybe I am too tired already and am missing something obvious?
I suspect there is something with the [] part of the query, but I am just guessing here. Similar queries without the [] work OK.
I am using C++Builder XE6 pro, with Indy 10.6.0.512
Your Indy version is out of date. The latest version, at the time of this writing, is 10.6.2.5448. Using the latest version, I can't reproduce your issue: both URL encodings return the same data for me. And they should, since a web server is required to decode urlencoded characters when processing the requested URL. conditions%5Bkrs_podmioty.nip%5D=7282827109 and conditions[krs_podmioty.nip]=7282827109 should be processed exactly the same way by the server, as they are semantically identical data.
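The equivalence is easy to demonstrate outside Indy as well; for instance, this Ruby snippet (purely an illustration, nothing to do with the Indy client) shows that the encoded form decodes back to the bracketed one:

require 'cgi'

encoded = "conditions%5Bkrs_podmioty.nip%5D=7282827109"
raw     = "conditions[krs_podmioty.nip]=7282827109"

# A compliant server percent-decodes the query before processing it,
# so both spellings carry the same key/value pair.
puts CGI.unescape(encoded) == raw   # => true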
I have tried to decompile a Lua script, just for learning purposes, and I got source code back; however, the code is obfuscated, like below:
local L0_0, L1_1, L2_2, L3_3, L4_4, L5_5, L6_6, L7_7, L8_8, L9_9, L10_10, L11_11, L12_12, L13_13, L14_14, L15_15, L16_16, L17_17, L18_18, L19_19, L20_20, L21_21, L22_22, L23_23, L24_24, L25_25, L26_26, L27_27, L28_28, L29_29
L0_0 = require
L1_1 = "comm.NetworkClock"
L0_0 = L0_0(L1_1)
L1_1 = require
L2_2 = "comm_ads.fullscreenAds"
L1_1 = L1_1(L2_2)
L2_2 = require
L15_15 = L14_14.init
L15_15()
L15_15 = L4_4.log
L16_16 = "IN main"
L15_15(L16_16)
function L15_15()
local L0_30, L1_31
end
print = L15_15
Is there any way to recover the original code from this?
Can you get back to the original source? No, not likely.
Source code is optimized to be read by humans, byte code is optimized to be read by machines. Compiling usually results in a one-way conversion where information required to restore the original source is lost.
Your best bet at this point is to simplify it by hand, with a bunch of find & replace passes, once you identify what each variable or function actually does. For instance, the opening lines L0_0 = require, L1_1 = "comm.NetworkClock", L0_0 = L0_0(L1_1) collapse to local NetworkClock = require("comm.NetworkClock").
If you did find (or build) a tool to simplify the decompiled source code to be more human-readable, it still would not really be reproducing the original source code.
I am trying to get Nokogiri to behave when using it with delayed jobs but haven't been very successful so far.
Basically I am trying to run a parsing task in the background, but when the background worker hits my perform method, it fails in the following line:
HTML_page = Nokogiri::HTML(open('http://www.mysite.com'))
The error message is:
Nokogiri::HTML::Document#inspect failed with ArgumentError: Requires a Node, NodeSet or String argument, and cannot accept a Delayed::Backend::ActiveRecord::Job.
This happens with both Delayed::Job.enqueue and the delay method.
If I try the line below in the console, I get the same error:
Nokogiri::HTML(open('http://www.mysite.com')).delay
It might be a silly oversight as I am fairly new to Ruby and Rails, so any help would be greatly appreciated.
Since Nokogiri "Requires a Node, NodeSet or String argument", why not give it one?
Instead of:
HTML_page = Nokogiri::HTML(open('http://www.mysite.com'))
try:
HTML_page = Nokogiri::HTML(open('http://www.mysite.com').read)
That will cause IO to read the file handle created by open and pass Nokogiri the string content of the URL being read.
An alternate way to help debug the problem, which I don't think lies within Nokogiri, is to split your command up a bit:
body = open('http://www.mysite.com').read
HTML_page = Nokogiri::HTML(body)
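For the Delayed Job side of the question, note that delay has to be called on the object whose method should run in the background, not on the Nokogiri document, which is what the console experiment above does. A minimal sketch of that pattern (the PageParser class and its parse method are hypothetical names, not part of either library):

require 'open-uri'
require 'nokogiri'

class PageParser
  def parse(url)
    body = URI.open(url).read    # fetch first, outside Nokogiri
    doc = Nokogiri::HTML(body)   # then parse the string
    doc.css('title').text
  end
end

# Only PageParser#parse and its String argument get serialized into the
# job; the Nokogiri document never has to round-trip through the queue.
PageParser.new.delay.parse('http://www.mysite.com')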