JSoup.clean() is not preserving relative URLs - html-parsing

I have tried:
Whitelist.relaxed();
Whitelist.relaxed().preserveRelativeLinks(true);
Whitelist.relaxed().addProtocols("a","href","#","/","http","https","mailto","ftp");
Whitelist.relaxed().addProtocols("a","href","#","/","http","https","mailto","ftp").preserveRelativeLinks(true);
None of them work: when I clean a link with a relative URL, the href attribute is removed and I am left with <a>test</a>.
I am using JSoup 1.8.2.
Any ideas?

The problem most likely stems from how the clean method is called. If you pass a base URI, everything should work as expected:
String html = ""
        + "<a href=\"/test\">test</a>" // a relative link; the href value "/test" is illustrative
        + "<invalid>stuff</invalid>"
        + "<h2>header1</h2>";
String cleaned = Jsoup.clean(html, "http://base.uri", Whitelist.relaxed().preserveRelativeLinks(true));
System.out.println(cleaned);
The above works and keeps the relative links. With String cleaned = Jsoup.clean(html, Whitelist.relaxed().preserveRelativeLinks(true)), however, no base URI is given, so the relative link cannot be resolved and its href attribute is removed.
Note the documentation of Whitelist.preserveRelativeLinks(true):
Note that when handling relative links, the input document must have
an appropriate base URI set when parsing, so that the link's protocol
can be confirmed. Regardless of the setting of the preserve relative
links option, the link must be resolvable against the base URI to an
allowed protocol; otherwise the attribute will be removed.

Inside Splash, how to use src attribute to append to a url

------------ORIGINAL QUESTION------------------
In my Splash Script, I am trying to use "splash:go" on a new url that is based on the "src" attribute of an "img" tag. How can I access this "src" relative url and join it to a start_url?
For example, imagine that the img element has the following contents:
<img id="ImageViewer1_docImage" onload="BlockerResize('ImageViewer1_ContentBlocker1','ImageViewer1_WaterMarkImage');" src="ACSResource.axd?SCTTYPE=ENCRYPTED&SCTKEY=gMYed5OWqcT9I1Y2fM85DvB48X5U1DQ5mOUiJoUH4rioyau0nJdxt0PHFfGVTMiUsork/YD+Cw0F6ZzcviP4sG09xrqWM8/zJlyEeVRFkKXVnkyHYWgwNJzCSUE4Kh4yCsqw6mCuIxWxPj6BAI7Hbw==&CNTWIDTH=849&CNTHEIGHT=684&FITTYPE=Height&ZOOM=1" alt="Please wait" style="border-width:0px;cursor: url(images/Cursors/hmove.cur); z-index: 1000">
Here I am trying to extract the src attribute and add it to start_url:
https://i2a.uslandrecords.com/ME/Cumberland/D/
I want all of this inside the Splash script. I need it to be done inside of Splash because otherwise I lose my security/encryption or something--it renders "Bad Data" instead of the new webpage. Do you have any recommendations?
------------UPDATE------------------
So I managed to obtain the url I needed from the src attribute using the following code:
var = splash:evaljs("document.getElementById('ImageViewer1_docImage').src;")
splash:go(var)
However, this produces an error message. All I find in the snapshot is a white page with the following message:
Failed loading page (Frame load interrupted by policy change)
https://i2a.uslandrecords.com/ME/Cumberland/D/ACSResource.axd?SCTTYPE=ENCRYPTED&SCTKEY=gMYed5OWqcSvEWOJA6wGVmb642s2oZHqkYmT6VTpORTzMY7CgvDU5jsjJG/xp0X3eQ9BiDnbaTdAmISeLkC3hyjxGjcSnXOKgGDa8cI2fniY0ILT+NqvQToMGIB+/X3ZIs7Q+D4ppTSZGYZ2L4M/
Webkit error #102
Any idea why?
Is the image's src attribute exactly the URL you need to access, or, as the question title suggests, do you need to append it to some other URL?
If it is the latter, you can join the strings with Lua's '..' operator.
Ex.: splash:go(base_url..var) -- concatenation
ISSUE RESOLVED:
Here is the solution. The GET request was breaking down because, given the WebKit settings, Splash didn't know how to render the image as an HTML page. If you execute the GET request without rendering the page, response.body contains the image.
CODE:
local response = splash:http_get(var)
return {
    body = response.body
}

regex to extract URLs from text - Ruby

I am trying to detect the urls from a text and replace them by wrapping in quotes like below:
original text: Hey, it is a url here www.example.com
required text: Hey, it is a url here "www.example.com"
"original text" shows my input value and "required text" shows the required output. I searched the web a lot but could not find a solution. I have already tried URI.extract, but it doesn't seem to detect URLs without http or https. Below are examples of some of the URLs I want to handle. Kindly let me know if you know a solution.
ANQUETIL-DUPERRON Abraham-Hyacinthe, KIEFFER Jean-Luc, www.hominides.net/html/actualites/outils-preuve-presence-hominides-asie-0422.php,Les Belles lettres, 2001.
https://www.ancient-code.com/indian-archeologists-stumbleacross-ruins-great-forgotten-civilization-mizoram/
www.jstor.org/stable/24084454
www.biorespire.com/2016/03/22/une-nouvelle-villeantique-d%C3%A9couverte-en-inde/
insu.cnrs.fr/terre-solide/terre-et-vie/de-nouvellesdatations-repoussent-l-age-de-l-apparition-d-outils-surle-so
www.cerege.fr/spip.php?page=pageperso&id_user=94
Find words that look like URLs:
str = "ANQUETIL-DUPERRON Abraham-Hyacinthe, KIEFFER Jean-Luc, www.hominides.net/html/actualites/outils-preuve-presence-hominides-asie-0422.php,Les Belles lettres, 2001.\n\nhttps://www.ancient-code.com/indian-archeologists-stumbleacross-ruins-great-forgotten-civilization-mizoram/\n\nwww.jstor.org/stable/24084454\n\nwww.biorespire.com/2016/03/22/une-nouvelle-villeantique-d%C3%A9couverte-en-inde/\n\ninsu.cnrs.fr/terre-solide/terre-et-vie/de-nouvellesdatations-repoussent-l-age-de-l-apparition-d-outils-surle-so\n\nwww.cerege.fr/spip.php?page=pageperso&id_user=94"
str.split.select{|w| w[/(\b+\.\w+)/]}
This will give you an array of words that contain no whitespace and include one or more . characters, which MIGHT work for your use case.
puts str.split.select{|w| w[/(\b+\.\w+)/]}
www.hominides.net/html/actualites/outils-preuve-presence-hominides-asie-0422.php,
https://www.ancient-code.com/indian-archeologists-stumbleacross-ruins-great-forgotten-civilization-mizoram/
www.jstor.org/stable/24084454
www.biorespire.com/2016/03/22/une-nouvelle-villeantique-d%C3%A9couverte-en-inde/
insu.cnrs.fr/terre-solide/terre-et-vie/de-nouvellesdatations-repoussent-l-age-de-l-apparition-d-outils-surle-so
www.cerege.fr/spip.php?page=pageperso&id_user=94
Updated
Complete solution to modify your string:
str_with_quote = str.clone # make a clone for the `gsub!`
str.split.select{|w| w[/(\b+\.\w+)/]}
  .each{|url| str_with_quote.gsub!(url, '"' + url + '"')}
Now your cloned object wraps urls inside double quotes
puts str_with_quote
Will give you this output
ANQUETIL-DUPERRON Abraham-Hyacinthe, KIEFFER Jean-Luc, "www.hominides.net/html/actualites/outils-preuve-presence-hominides-asie-0422.php,Les" Belles lettres, 2001.
"https://www.ancient-code.com/indian-archeologists-stumbleacross-ruins-great-forgotten-civilization-mizoram/"
"www.jstor.org/stable/24084454"
"www.biorespire.com/2016/03/22/une-nouvelle-villeantique-d%C3%A9couverte-en-inde/"
"insu.cnrs.fr/terre-solide/terre-et-vie/de-nouvellesdatations-repoussent-l-age-de-l-apparition-d-outils-surle-so"
"www.cerege.fr/spip.php?page=pageperso&id_user=94"

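One limitation of the split-based approach above is that a URL followed directly by a comma and another word (as in ...0422.php,Les) is quoted together with the trailing text. Below is a minimal sketch of a pattern-based alternative. URL_PATTERN and quote_urls are hypothetical names, and the pattern itself is an assumption: it treats any dotted host name, with an optional http/https scheme and an optional path that ends at whitespace or a comma, as a URL.

```ruby
# Assumed pattern (not a full URL grammar): optional scheme, a dotted host
# name ending in at least two letters, then an optional path that stops at
# whitespace or a comma.
URL_PATTERN = %r{(?:https?://)?(?:[a-z0-9-]+\.)+[a-z]{2,}(?:/[^\s,]*)?}i

# Returns a copy of +text+ with every match wrapped in double quotes.
def quote_urls(text)
  text.gsub(URL_PATTERN) { |url| %("#{url}") }
end

puts quote_urls("Hey, it is a url here www.example.com")
```

Because the pattern is deliberately loose, it also matches dotted tokens such as file names (notes.txt), and it cuts off URLs whose path contains a comma, so it may need tightening for other inputs.
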
Ruby/Rails : Get url extension with URI

I'm currently having a bit of trouble parsing a URL with URI.
I've tried this code:
uri = URI::parse(Model.first.media)
#<URI::HTTPS https://my-bucket.s3.amazonaws.com/model/media/41/cdbb21cc-1c59-4aa3-92ec-917e7237a850.mp4>
uri.path
"/model/media/41/cdbb21cc-1c59-4aa3-92ec-917e7237a850.mp4"
File.basename(Model.first.media, '.mp4')
"cdbb21cc-1c59-4aa3-92ec-917e7237a850.mp4"
But I'm still unsure how to get the path without the leading /, for example model/media/41/cdbb21cc-1c59-4aa3-92ec-917e7237a850.mp4, and how to get only the directory part, without the domain and the file name, for example model/media/41.
Do I have to use a regex to get this output, or can URI handle it?
Note:
I've found how to get the URL path without its first character based on this question: Ruby regexp: capture the path of url
The URI class breaks URLs into components and gives you accessor methods such as:
[:scheme, :userinfo, :host, :port, :path, :query, :fragment]
If you simply need to get rid of the leading slash, you can do it like this, with no regex:
uri.path[1..-1] # every character of the string except the one at index 0
But you could probably even get away with:
Model.first.media.split('.com/').last # don't even need URI parse.
For the last part of your question, you can do:
File.dirname(uri.path) # will return => "/model/media/41"
File.dirname(uri.path)[1..-1] # if you want to remove leading /
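Putting the pieces together, here is a small sketch that uses only URI and File from the standard library. The URL is the one from the question, passed as a plain string instead of Model.first.media; the variable names are illustrative, and String#delete_prefix assumes Ruby 2.5 or later.

```ruby
require "uri"

url = "https://my-bucket.s3.amazonaws.com/model/media/41/cdbb21cc-1c59-4aa3-92ec-917e7237a850.mp4"
uri = URI.parse(url)

# Path without the leading slash.
relative_path = uri.path.delete_prefix("/")

# Directory part only (no domain, no file name).
directory = File.dirname(relative_path)

# File extension, and the file name without that extension.
extension = File.extname(uri.path)
basename  = File.basename(uri.path, extension)

puts relative_path
puts directory
puts extension
puts basename
```
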

Access Denied when creating file in Visual F#

The following code runs without a hitch:
On the other hand, I get an access-denied error with this:
The destination is in my personal folder and I have full control. The directory is not read-only. Anyway, in either of those cases, the first code sample should not run either! I appreciate the help ...
In the second sample, you have two problems:
There are backslashes instead of forward slashes, so some of them may get interpreted as escape sequences.
You completely ignore the first parameter of write and specify what I assume is a folder as destination. You can't open a file stream on a folder, no wonder you get access denied.
This should work:
open System.IO

// Writes the memory stream to a file named `filename` inside the target folder.
let write filename (ms: MemoryStream) =
    let path = Path.Combine("C:/Users/<whatever>/signal_processor", filename)
    use fs = new FileStream(path, FileMode.Create)
    ms.WriteTo(fs)

request.serverVariables() "URL" vs "Script_Name"

I am maintaining a classic asp application and while going over the code I came across two similar lines of code:
Request.ServerVariables("URL")
' Output: "/path/to/file.asp"
Request.ServerVariables("SCRIPT_NAME")
' Output: "/path/to/file.asp"
I don't get it... what is the difference? Both of them ignore the URL rewriting that I have set up, which makes the /path folder the document root (the above URL is rewritten to "/to/file.asp").
More info:
The site is deployed on IIS 7
URL: Gives the base portion of the URL, without any querystring or extra path information. For the raw URL, use HTTP_URL or UNENCODED_URL.
SCRIPT_NAME: A virtual path to the script being executed. Can be used for self-referencing URLs.
See http://www.requestservervariables.com/url and /script_name for the definitions.
This could be a bug under IIS 7.
I could not get Request.ServerVariables("URL") and Request.ServerVariables("SCRIPT_NAME") to return different values. I've tried the cases where they were called from an included file (<!--#include file="file.asp"-->) or after a Server.Transfer.
Could the difference show up in the case of a Server.Transfer?
If you do a Server.Transfer, I think you would get different results,
i.e. SCRIPT_NAME would be, e.g., /path/to/transferredfile.asp, whereas URL would remain /path/to/file.asp
