parse web page encoded text with scrapy - parsing

I can't extract the content preview of a book from an online bookstore. It prevents copying of book previews by encoding the text, if I'm not wrong. I'm looking at the preview of this book.
In the page inspector it looks like this: every word sits outside the span tags, and the ten-character code inside each hidden span corresponds to the word next to it:
<span style='color:red;display:none;'>pq8BMvE37g</span>ولا <span style='color:red;display:none;'>G9XGnpBjnY</span>قدرة
I failed after trying this with Scrapy in Python:
response.xpath("//*[@class='nabza']").extract()
and then, to filter the text:
response.xpath("//*[@class='nabza']/text()").extract()

The fastest way might be to use this XPath:
string(//div[@class='nabza'])
Then use a regex such as [a-zA-Z0-9]+ to replace the ten-character codes with blank spaces; the codes are plain Latin alphanumerics, so the regex cannot touch the Arabic words.
Alternatively, you could use this XPath:
//div[@class='nabza']//*[not(self::span)]/text()
No more ten-character codes. You probably have to do some cleanup (check that the 473 parts of text are correctly merged, check the \r\n, ...) and you should obtain something like this:
https://paste2.org/mWhxzxpj
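For reference, a minimal Scrapy sketch of the second approach might look like the one below; the spider name is a placeholder, and the commented alternative is the string()-plus-regex route:

import scrapy

class PreviewSpider(scrapy.Spider):
    name = "preview"  # hypothetical spider name
    start_urls = [
        "https://www.neelwafurat.com/itempage.aspx?id=lbb179878-143056&search=books"
    ]

    def parse(self, response):
        # Keep the text nodes that live outside the hidden <span> decoys.
        parts = response.xpath(
            "//div[@class='nabza']//*[not(self::span)]/text()"
        ).getall()
        yield {"preview": "".join(parts)}

        # Alternative: flatten the whole div, then blank out the Latin
        # alphanumeric codes; the Arabic words contain no [a-zA-Z0-9]
        # characters, so they survive.
        # import re
        # whole = response.xpath("string(//div[@class='nabza'])").get()
        # yield {"preview": re.sub(r"[a-zA-Z0-9]+", " ", whole)}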
EDIT: R code:

library(RCurl)
library(XML)

# Fetch the page with a browser-like User-Agent, forcing UTF-8
page <- getURL("https://www.neelwafurat.com/itempage.aspx?id=lbb179878-143056&search=books",
               httpheader = c('User-Agent' = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:52.0) Gecko/20100101 Firefox/52.0"),
               .encoding = 'UTF-8')
parse <- htmlParse(page, encoding = "UTF-8")

# Keep only the text nodes that are not inside the hidden <span> decoys
text <- xpathSApply(parse, "//div[@class='nabza']//*[not(self::span)]/text()", xmlValue)
result <- paste0(text, collapse = "")
writeLines(result, "result.txt", useBytes = TRUE)

Related

regex to extract URLs from text - Ruby

I am trying to detect the URLs in a text and replace them by wrapping them in quotes, like below:
original text: Hey, it is a url here www.example.com
required text: Hey, it is a url here "www.example.com"
"Original text" shows my input and "required text" represents the required output. I searched the web a lot but could not find a solution. I have already tried URI.extract, but it does not seem to detect URLs without http or https. Below are examples of some of the URLs I want to deal with. Kindly let me know if you know a solution.
ANQUETIL-DUPERRON Abraham-Hyacinthe, KIEFFER Jean-Luc, www.hominides.net/html/actualites/outils-preuve-presence-hominides-asie-0422.php,Les Belles lettres, 2001.
https://www.ancient-code.com/indian-archeologists-stumbleacross-ruins-great-forgotten-civilization-mizoram/
www.jstor.org/stable/24084454
www.biorespire.com/2016/03/22/une-nouvelle-villeantique-d%C3%A9couverte-en-inde/
insu.cnrs.fr/terre-solide/terre-et-vie/de-nouvellesdatations-repoussent-l-age-de-l-apparition-d-outils-surle-so
www.cerege.fr/spip.php?page=pageperso&id_user=94
Find the words that look like URLs:
str = "ANQUETIL-DUPERRON Abraham-Hyacinthe, KIEFFER Jean-Luc, www.hominides.net/html/actualites/outils-preuve-presence-hominides-asie-0422.php,Les Belles lettres, 2001.\n\nhttps://www.ancient-code.com/indian-archeologists-stumbleacross-ruins-great-forgotten-civilization-mizoram/\n\nwww.jstor.org/stable/24084454\n\nwww.biorespire.com/2016/03/22/une-nouvelle-villeantique-d%C3%A9couverte-en-inde/\n\ninsu.cnrs.fr/terre-solide/terre-et-vie/de-nouvellesdatations-repoussent-l-age-de-l-apparition-d-outils-surle-so\n\nwww.cerege.fr/spip.php?page=pageperso&id_user=94"
str.split.select{|w| w[/(\b+\.\w+)/]}
This will give you an array of words which have no spaces and include one or more . characters, which MIGHT work for your use case.
puts str.split.select{|w| w[/(\b+\.\w+)/]}
www.hominides.net/html/actualites/outils-preuve-presence-hominides-asie-0422.php,
https://www.ancient-code.com/indian-archeologists-stumbleacross-ruins-great-forgotten-civilization-mizoram/
www.jstor.org/stable/24084454
www.biorespire.com/2016/03/22/une-nouvelle-villeantique-d%C3%A9couverte-en-inde/
insu.cnrs.fr/terre-solide/terre-et-vie/de-nouvellesdatations-repoussent-l-age-de-l-apparition-d-outils-surle-so
www.cerege.fr/spip.php?page=pageperso&id_user=94
Updated
Complete solution to modify your string:
str_with_quote = str.clone # make a clone for the `gsub!`
str.split.select{|w| w[/(\b+\.\w+)/]}
.each{|url| str_with_quote.gsub!(url, '"' + url + '"')}
Now your cloned object wraps the URLs in double quotes:
puts str_with_quote
This will give you the output below. Note that the first match drags ",Les" along with it: the string is split on whitespace alone, so punctuation glued to a word stays attached.
ANQUETIL-DUPERRON Abraham-Hyacinthe, KIEFFER Jean-Luc, "www.hominides.net/html/actualites/outils-preuve-presence-hominides-asie-0422.php,Les" Belles lettres, 2001.
"https://www.ancient-code.com/indian-archeologists-stumbleacross-ruins-great-forgotten-civilization-mizoram/"
"www.jstor.org/stable/24084454"
"www.biorespire.com/2016/03/22/une-nouvelle-villeantique-d%C3%A9couverte-en-inde/"
"insu.cnrs.fr/terre-solide/terre-et-vie/de-nouvellesdatations-repoussent-l-age-de-l-apparition-d-outils-surle-so"
"www.cerege.fr/spip.php?page=pageperso&id_user=94"

Extract JPG urls from Bing or Google search

Is there any way to use Xidel to query either Bing or Google image search and then extract all the image URLs from that search? I was interested in doing this via the command line using Xidel.exe. Thanks
Sure. Glad you found Xidel: a great command-line scraper, but very few people seem to know about it.
Here's a one-liner that scrapes 100 "dogs" image URLs from Google Images:
xidel -s "https://images.google.com" ^
--user-agent="Mozilla/5.0 (Windows NT 6.1; WOW64;) Firefox/40" ^
-f "form(//form,{'q':'dogs'})" ^
-e "<div class='rg_meta'>{extract(.,'ou.:.(.+?).,',1)}</div>*"
BTW, Google actually wants you to use their API, for which you can request an API key, but the command above just pretends to be a browser.
Also, if you add --download at the end, it will download all the pics. :-)
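If you do want the API route instead, a rough sketch in Python against the Custom Search JSON API could look like the following; requests is a third-party library, and YOUR_API_KEY and YOUR_CSE_ID are placeholders for credentials you create in Google's consoles:

import requests  # third-party: pip install requests

params = {
    "key": "YOUR_API_KEY",    # placeholder API key
    "cx": "YOUR_CSE_ID",      # placeholder Programmable Search Engine id
    "q": "dogs",
    "searchType": "image",    # restrict results to images
}
resp = requests.get("https://www.googleapis.com/customsearch/v1", params=params)
resp.raise_for_status()

# Each result item's "link" field is a direct image URL.
for item in resp.json().get("items", []):
    print(item["link"])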

Is there a way to emit HTML directly into an FsLab journal from the .fsx file?

I'd like to emit some HTML (generated from my F# code) into an FsLab journal, but I cannot seem to find the correct incantation to make it happen.
If I have a function in my code that returns an HTML snippet, is there a way to get it directly into the page without it being surrounded by a <pre> tag?
I have tried, for example:
let f () =
    """Some <b>bold</b> sample"""
let htmlContent = f ()
then
(*** include-value:htmlContent ***)
but the output is just the HTML source itself, formatted like ordinary output.
I took a dive into the F# Formatting GitHub pages and found the (*** raw ***) command, so I also tried:
(*** include-value:htmlContent, raw ***)
but the output still gets surrounded by the <pre> and <code> tags.
Is it possible to simply emit raw HTML in this way, without the <pre> tag?
If you are using the latest version, then you can add custom HTML printers using fsi.AddHtmlPrinter. We need to improve the FsLab docs, but this mechanism is also used by the F# Interactive service in Atom.
To emit raw HTML, you can include something like this in your script:
(*** hide ***)
type Html = Html of string
#if HAS_FSI_ADDHTMLPRINTER
fsi.AddHtmlPrinter(fun (Html h) ->
    seq [], h)
#endif
Then, you should be able to create HTML nodes with:
let b = Html("""Some <b>bold</b> sample""")
(*** include-value:b ***)

Strange URL containing 'A=0 or '0=A in web server logs

During the last weekend some of my sites logged errors implying wrong usage of our URLs:
...news.php?lang=EN&id=23'A=0
or
...news.php?lang=EN&id=23'0=A
instead of
...news.php?lang=EN&id=23
I found only one page mentioning this (https://forums.adobe.com/thread/1973913), where they speculated that the additional query string comes from GoogleBot or an encoding error.
I recently changed my sites to use PDO instead of the mysql_* functions. Could that change have caused the errors? Any hints would be useful.
Additionally, all of the requests come from the same user-agent shown below.
Mozilla/5.0 (Windows; U; Windows NT 5.1; pt-PT; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 (.NET CLR 3.5.30729)
This led me to find the following threads: pt-BR, and Strange parameter in URL - what are they trying?
It is a bot testing for SQL injection vulnerabilities by closing a query with an apostrophe and then setting a variable. There are also similar injects that deal with shell commands and/or file path traversal. Whether it's a "good bot" or a bad bot is unknown, but if the inject works, you have bigger issues to deal with. There's a 99% chance your site is not generating these style links, and there is nothing you can do to stop bots from crafting those URLs, unless you block the requests with a simple regex string or a more complex WAF such as ModSecurity.
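To make the probe concrete, here is a minimal sketch of what it is testing for (Python's sqlite3 standing in for PHP/PDO, with a hypothetical news table):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE news (id INTEGER, title TEXT)")

probe = "23'A=0"  # the suspicious id value from the logs

# Vulnerable pattern: string interpolation lets the stray apostrophe close
# the literal and reshape the SQL, so the query breaks (or worse).
try:
    conn.execute("SELECT title FROM news WHERE id = '%s'" % probe)
except sqlite3.OperationalError as err:
    print("query broke:", err)  # an error page here tells the bot it worked

# Safe pattern, which is what PDO prepared statements do: the probe stays
# an opaque value and simply matches nothing.
rows = conn.execute("SELECT title FROM news WHERE id = ?", (probe,)).fetchall()
print(rows)  # []

So the switch to PDO will not have caused these requests; if anything, bound parameters are what render them harmless.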
Blocking based on user agent is not an effective angle. You need to look at the request heuristics and block based on those instead. Some examples of things to look for in the URL/request/POST/referrer, as both UTF-8 and hex characters:
double apostrophes
double periods, especially followed by a slash in various encodings
words like "script", "etc" or "passwd"
paths like dev/null used with piping/echoing shell output
%00 null-byte style characters used to initiate a new command
http in the URL more than once (unless your site uses it)
anything regarding cgi (unless your site uses it)
random "enterprise" paths for things like ColdFusion, Tomcat, etc.
If you aren't using a WAF, here is a regex concat that should capture many of those within a URL. We use it in PHP apps, so you may well need to tweak some escapes or lookarounds depending on where you use it. Note that this has .cgi, wordpress, and wp-admin along with a bunch of other stuff in the regex; remove them if you need to.
$invalid = "(\(\))"; // lets not look for quotes. [good]bots use them constantly. looking for () since technically parenthesis arent valid
$period = "(\\002e|%2e|%252e|%c0%2e|\.)";
$slash = "(\\2215|%2f|%252f|%5c|%255c|%c0%2f|%c0%af|\/|\\\)"; // http://security.stackexchange.com/questions/48879/why-does-directory-traversal-attack-c0af-work
$routes = "(etc|dev|irj)" . $slash . "(passwds?|group|null|portal)|allow_url_include|auto_prepend_file|route_*=http";
$filetypes = $period . "+(sql|db|sqlite|log|ini|cgi|bak|rc|apk|pkg|deb|rpm|exe|msi|bak|old|cache|lock|autoload|gitignore|ht(access|passwds?)|cpanel_config|history|zip|bz2|tar|(t)?gz)";
$cgis = "cgi(-|_){0,1}(bin(-sdb)?|mod|sys)?";
$phps = "(changelog|version|license|command|xmlrpc|admin-ajax|wsdl|tmp|shell|stats|echo|(my)?sql|sample|modx|load-config|cron|wp-(up|tmp|sitemaps|sitemap(s)?|signup|settings|" . $period . "?config(uration|-sample|bak)?))" . $period . "php";
$doors = "(" . $cgis . $slash . "(common" . $period . "(cgi|php))|manager" . $slash . "html|stssys" . $period . "htm|((mysql|phpmy|db|my)admin|pma|sqlitemanager|sqlite|websql)" . $slash . "|(jmx|web)-console|bitrix|invoker|muieblackcat|w00tw00t|websql|xampp|cfide|wordpress|wp-admin|hnap1|tmunblock|soapcaller|zabbix|elfinder)";
$sqls = "((un)?hex\(|name_const\(|char\(|a=0)";
$nulls = "(%00|%2500)";
$truth = "(.{1,4})=\1"; // catch OR always-true (1=1) clauses via sql inject - not used atm, its too broad and may capture search=chowder (ch=ch) for example
$regex = "/$invalid|$period{1,2}$slash|$routes|$filetypes|$phps|$doors|$sqls|$nulls/i";
Using it, at least with PHP, is pretty straightforward with preg_match_all(). Here is an example of how you can use it: https://gist.github.com/dhaupin/605b35ca64ca0d061f05c4cf423521ab
WARNING: Be careful if you set this to auto-ban (e.g. as a fail2ban filter). MS/Bing DumbBots (and others) often muck up URLs, for example by producing strange triple dots from truncated link text, or by trying to hit a tel: link as a URI. I don't know why. Here is what I mean: a link with the text www.example.com/link-too-long...truncated.html may point to a correct URL, but Bing may try to access it "as it looks" instead of following the href, resulting in a WAF hit due to the double dots.
Since this is a very old version of Firefox, I blocked it in my .htaccess file:
RewriteCond %{HTTP_USER_AGENT} Firefox/3\.5\.2 [NC]
RewriteRule .* err404.php [R,L]

Different results between Nokogiri and real browser

The target url is: http://courts.delaware.gov/opinions/List.aspx?ag=all+courts
Nokogiri seems to retrieve only the first 10 links, while a real browser retrieves 50.
Here's some sample code to reproduce the error:
require 'open-uri'
require 'nokogiri'
doc=Nokogiri::HTML(open("http://courts.delaware.gov/opinions/List.aspx?ag=all+courts", 'User-Agent' => 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:31.0) Gecko/20100101 Firefox/31.0'))
p "there are missing links" if doc.css('strong a').size < 50
When I open the file produced by that same open(...) call directly, I see the full, expected HTML.
The doc returned by Nokogiri, however, is truncated, with closing HTML tags right after the 10th link and no further content.
This leads me to believe something is misleading the Nokogiri HTML parser into terminating early.
EDIT: It looks like there's something malformed in the HTML. When I remove the last <tr>...</tr> element, Nokogiri grabs more links. I'm still not sure what exactly the problem is, or how to configure Nokogiri to grab everything.
EDIT 2: The problem is that Nokogiri stops parsing after encountering a special character that is invalid UTF-8 (\x02). There is probably some way to sanitize the content or force its encoding before Nokogiri parses it, to work around this bug.
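As a sketch of that sanitize-first idea, shown here with Python's lxml (which sits on the same libxml2 as Nokogiri; the exact control-byte range to strip is an assumption):

import re
import urllib.request

from lxml import html  # third-party: pip install lxml

url = "http://courts.delaware.gov/opinions/List.aspx?ag=all+courts"
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
raw = urllib.request.urlopen(req).read()

# Strip ASCII control bytes such as \x02, which are illegal in HTML and can
# make libxml2-based parsers stop building the tree early.
cleaned = re.sub(rb"[\x00-\x08\x0b\x0c\x0e-\x1f]", b"", raw)

doc = html.fromstring(cleaned.decode("utf-8", errors="replace"))
# If the stray control character was the culprit, the count should now be
# well past the 10 links seen before.
print(len(doc.xpath("//strong/a")))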
