Regex failing to match the punycode url - ruby-on-rails

I was having the url which on converting to punycode has suffix as xn---- which all the regex present in ruby libraries fails to match.
Currently I am using validates_url_format_of ruby library.
Example Url: "https://www.θεραπευτικη-κανναβη.com.gr"
Punycode url: "https://www.xn----ylbbafnbqebomc7ba3bp1ds.com.gr"
So can you please suggest that is there any issue in the regex in the library or the issue lies in the conversion to punycode.
As per the punycode conversion rules the suffix always is xn--. So can anyone suggest what extra two -- means here

"https://www.xn----ylbbafnbqebomc7ba3bp1ds.com.gr".match(/https?:\/\/w*\.xn----.*/)
=> #<MatchData "https://www.xn----ylbbafnbqebomc7ba3bp1ds.com.gr">
Note the url matcher is not perfect

When you have a - inside the URL, the algorithm gets it duplicated and moves it to the beginning of the puny code.
For example:
áéíóú.com -> xn--1caqmy9a.com
á-é-í-ó-ú.com -> xn-------4na3c3a3cwd.com
I guess it has to do with the xn-- encoding restrictions.
This one should work for you:
(xn--)(--)*[a-z0-9]+.com.gr
The beginning of the code: (xn--)
An even number (or 0) of --: (--)*
The domain chars/numbers :([a-z0-9]+)
The TLD of the domain : .com.gr
You can add http/https if you wish
Update:
After adding numbers to the URL I found that the regex needs a fix:
(xn--)(-[-0-9]{1})*[a-z0-9]+.com.gr
á-1é-2í-3ó-4ú.gr.com -> xn---1-2-3-4-7ya6f1b6dve.gr.com

Related

regex to extract URLs from text - Ruby

I am trying to detect the urls from a text and replace them by wrapping in quotes like below:
original text: Hey, it is a url here www.example.com
required text: Hey, it is a url here "www.example.com"
original text show my input value and required text represents the required output. I searched a lot on web but could not find any possible solution. I already have tried URL.extract feature but that doesn't seem to detect URLs without http or https. Below are the examples of some of urls I want to deal with. Kindly let me know if you know the solution.
ANQUETIL-DUPERRON Abraham-Hyacinthe, KIEFFER Jean-Luc, www.hominides.net/html/actualites/outils-preuve-presence-hominides-asie-0422.php,Les Belles lettres, 2001.
https://www.ancient-code.com/indian-archeologists-stumbleacross-ruins-great-forgotten-civilization-mizoram/
www.jstor.org/stable/24084454
www.biorespire.com/2016/03/22/une-nouvelle-villeantique-d%C3%A9couverte-en-inde/
insu.cnrs.fr/terre-solide/terre-et-vie/de-nouvellesdatations-repoussent-l-age-de-l-apparition-d-outils-surle-so
www.cerege.fr/spip.php?page=pageperso&id_user=94
Find words who look like urls:
str = "ANQUETIL-DUPERRON Abraham-Hyacinthe, KIEFFER Jean-Luc, www.hominides.net/html/actualites/outils-preuve-presence-hominides-asie-0422.php,Les Belles lettres, 2001.\n\nhttps://www.ancient-code.com/indian-archeologists-stumbleacross-ruins-great-forgotten-civilization-mizoram/\n\nwww.jstor.org/stable/24084454\n\nwww.biorespire.com/2016/03/22/une-nouvelle-villeantique-d%C3%A9couverte-en-inde/\n\ninsu.cnrs.fr/terre-solide/terre-et-vie/de-nouvellesdatations-repoussent-l-age-de-l-apparition-d-outils-surle-so\n\nwww.cerege.fr/spip.php?page=pageperso&id_user=94"
str.split.select{|w| w[/(\b+\.\w+)/]}
This will give you an array of words which have no spaces and include a one or more . characters which MIGHT work for your use case.
puts str.split.select{|w| w[/(\b+\.\w+)/]}
www.hominides.net/html/actualites/outils-preuve-presence-hominides-asie-0422.php,
https://www.ancient-code.com/indian-archeologists-stumbleacross-ruins-great-forgotten-civilization-mizoram/
www.jstor.org/stable/24084454
www.biorespire.com/2016/03/22/une-nouvelle-villeantique-d%C3%A9couverte-en-inde/
insu.cnrs.fr/terre-solide/terre-et-vie/de-nouvellesdatations-repoussent-l-age-de-l-apparition-d-outils-surle-so
www.cerege.fr/spip.php?page=pageperso&id_user=94
Updated
Complete solution to modify your string:
str_with_quote = str.clone # make a clone for the `gsub!`
str.split.select{|w| w[/(\b+\.\w+)/]}
.each{|url| str_with_quote.gsub!(url, '"' + url + '"')}
Now your cloned object wraps urls inside double quotes
puts str_with_quote
Will give you this output
ANQUETIL-DUPERRON Abraham-Hyacinthe, KIEFFER Jean-Luc, "www.hominides.net/html/actualites/outils-preuve-presence-hominides-asie-0422.php,Les" Belles lettres, 2001.
"https://www.ancient-code.com/indian-archeologists-stumbleacross-ruins-great-forgotten-civilization-mizoram/"
"www.jstor.org/stable/24084454"
"www.biorespire.com/2016/03/22/une-nouvelle-villeantique-d%C3%A9couverte-en-inde/"
"insu.cnrs.fr/terre-solide/terre-et-vie/de-nouvellesdatations-repoussent-l-age-de-l-apparition-d-outils-surle-so"
"www.cerege.fr/spip.php?page=pageperso&id_user=94"

Ruby/Rails : Get url extension with URI

currently i'm have a bit problem to parse URL using URI
i've tried to use this code :
uri = URI::parse(Model.first.media)
#<URI::HTTPS https://my-bucket.s3.amazonaws.com/model/media/41/cdbb21cc-1c59-4aa3-92ec-917e7237a850.mp4>
uri.path
"/model/media/41/cdbb21cc-1c59-4aa3-92ec-917e7237a850.mp4"
File.basename(Model.first.media, '.mp4')
"cdbb21cc-1c59-4aa3-92ec-917e7237a850.mp4"
but i'm still confused to get path without / as first char in example model/media/41/cdbb21cc-1c59-4aa3-92ec-917e7237a850.mp4 and get only the path without domain and the file in example model/media/41
do i must using regex to get above output ? or URI can handle this ?
note:
i've found how to get url extension without first char based on this question Ruby regexp: capture the path of url
URI class helps break apart URLs into components and gives you methods like
[:scheme, :userinfo, :host, :port, :path, :query, :fragment]
If you simply need to get rid of the first slash it's simple as this with no regex.
uri.path[1..-1] #gives all string characters except the 0 index.
But you could probably even get away with:
Model.first.media.split('.com/').last # don't even need URI parse.
For last part of your question you can do:
File.dirname(uri.path) # will return => "/model/media/41"
File.dirname(uri.path)[1..-1] # if you want to remove leading /

JSoup.clean() is not preserving relative URLs

I have tried:
Whitelist.relaxed();
Whitelist.relaxed().preserveRelativeLinks(true);
Whitelist.relaxed().addProtocols("a","href","#","/","http","https","mailto","ftp");
Whitelist.relaxed().addProtocols("a","href","#","/","http","https","mailto","ftp").preserveRelativeLinks(true);
None of them work: When I try to clean a relative url, like test I get the href attribute removed (<a>test</a>).
I am using JSoup 1.8.2.
Any ideas?
The problem most likely stems from the call of the clean method. If you give the base URI all should work as expected:
String html = ""
+ "test"
+ "<invalid>stuff</invalid>"
+ "<h2>header1</h2>";
String cleaned = Jsoup.clean(html, "http://base.uri", Whitelist.relaxed().preserveRelativeLinks(true));
System.out.println(cleaned);
The above works and keeps the relative links. With String cleaned = Jsoup.clean(html, Whitelist.relaxed().preserveRelativeLinks(true)) however the link is deleted.
Note the documentation of Whitelist.preserveRelativeLinks(true):
Note that when handling relative links, the input document must have
an appropriate base URI set when parsing, so that the link's protocol
can be confirmed. Regardless of the setting of the preserve relative
links option, the link must be resolvable against the base URI to an
allowed protocol; otherwise the attribute will be removed.

Where is and which function does encode or decode URL in phpfox?

I'll be wonder if you tell me how phpfox handle URLs?
In details i want to know that which function gererates urls in PHPFox?
I have some problem with encoding or decoding of PHPFox. Because it transform some urls which is in Persian language to ??????.
For example it will resolve this link: 'http://www.mydomain.com/photos/اخبار/' to 'http://www.mydomain.com/photos/???????/'
This is the main library class for URLs: /include/library/phpfox/url/url.class.php
Is this what you are looking for?

Can I use a regular expression to extract the domain from a URL?

Suppose I want to turn this :
http://en.wikipedia.org/wiki/Anarchy
into this :
en.wikipedia.org
or even better, this :
wikipedia.org
Is this even possible in regex?
Why use a regex when Ruby has a library for it? The URI library:
ruby-1.9.1-p378 > require 'uri'
=> true
ruby-1.9.1-p378 > uri = URI.parse("http://en.wikipedia.org/wiki/Anarchy")
=> #<URI::HTTP:0x000001010a2270 URL:http://en.wikipedia.org/wiki/Anarchy>
ruby-1.9.1-p378 > uri.host
=> "en.wikipedia.org"
ruby-1.9.1-p378 > uri.host.split('.')
=> ["en", "wikipedia", "org"]
Splitting the host is one way to separate the domains, but I'm not aware of a reliable way to get the base domain -- you can't just count, in the event of a URL like "http://somedomain.otherdomain.school.ac.uk" vs "www.google.com".
/http:\/\/([^\/]*).*/ will produce en.wikipedia.org from the string you provided.
/http:\/\/.{0,3}\.([^\/]*).*/ will produce wikipedia.org.
yes
Now I know you haven't asked for how, and you haven't specified a language, but I'll answer anyway... (note, this works for all language subsites, not just en.wikipedia...)
perl:
$url =~ s,http://[a-z]{2}\.(wikipedia\.org)/.*,$1,;
ruby:
url = url.sub(/http:\/\/[a-z]{2}\.(wikipedia\.org)\/.*/, '\1')
php:
$url = preg_replace('|http://[a-z]{2}.(wikipedia.org)/.*|, '$1', $url);
Of course, for this particular example, you don't even need a regex, just this will do:
url = 'wikipedia.org'
but I jest...
you probably want to handle any URL and pull out the domain part, and it should also work for domains in different countries, eg: foo.co.uk.
In which case, I'd use Mark Rushakoff's solution to get the hostname and then a regex to pull out the domain:
domain = host.sub(/^.*\.([^.]+\.[^.]+(\.[a-z]{2})?)$/, '\1')
Hope this helps
Also, if you want to learn more, I have a regex tute online: http://tech.bluesmoon.info/2006/04/beginning-regular-expressions.html
Sure all you would have to do is search on http://(.*)/wiki/Anarchy
In Perl (Sorry I don't know Ruby, but I expect it's similar)
$string_to_search =~ s/http:////(.)//. should give you wikipedia.org
to get rid of the en, you can simply search on http:////en(.)//......
That should do it.
Update: In case you're not familiar with Regex, I would recommend picking up a Regex book, this one really rocks and I like it: REGEX BOOK,Mastering Regular Expressions, I saw it on half.com the other day for 14.99 used, but to clarify what i suggested above, is to look for the string http://en, then for anything until you find a / this is all captured in $1 (in perl, not sure if it's the same in ruby), a simple print $1 will print the string.
Update: #2 sorry the star in the regex is not showing up for some reason, so where you see the . in the () and after the // just imagine a *, oh and I forgot for the en part add a /. at the end that way you don't end up with .wikipedia.org

Resources