Can I use a regular expression to extract the domain from a URL? - ruby-on-rails

Suppose I want to turn this :
http://en.wikipedia.org/wiki/Anarchy
into this :
en.wikipedia.org
or even better, this :
wikipedia.org
Is this even possible in regex?

Why use a regex when Ruby has a library for it? The URI library:
ruby-1.9.1-p378 > require 'uri'
=> true
ruby-1.9.1-p378 > uri = URI.parse("http://en.wikipedia.org/wiki/Anarchy")
=> #<URI::HTTP:0x000001010a2270 URL:http://en.wikipedia.org/wiki/Anarchy>
ruby-1.9.1-p378 > uri.host
=> "en.wikipedia.org"
ruby-1.9.1-p378 > uri.host.split('.')
=> ["en", "wikipedia", "org"]
Splitting the host is one way to separate the domains, but I'm not aware of a reliable way to get the base domain -- you can't just count, in the event of a URL like "http://somedomain.otherdomain.school.ac.uk" vs "www.google.com".
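As a rough illustration of why counting labels is not enough, here is a sketch that consults a list of multi-label suffixes before deciding how many labels to keep. The MULTI_LABEL_SUFFIXES constant and the base_domain method are hypothetical names, and the three-entry list is only a stand-in: anything real would need a maintained source such as the Public Suffix List.

```ruby
require 'uri'

# Hypothetical, deliberately tiny list of suffixes that span two labels.
# A real list is far longer and changes over time.
MULTI_LABEL_SUFFIXES = %w[co.uk ac.uk com.gr].freeze

# Returns the registrable domain for a URL, or nil when URI finds no host
# (for example when the scheme is missing, as in "www.google.com").
def base_domain(url)
  host = URI.parse(url).host
  return nil if host.nil?

  labels = host.downcase.split('.')
  # Keep two suffix labels when the last two labels form a known suffix.
  suffix_len = MULTI_LABEL_SUFFIXES.include?(labels.last(2).join('.')) ? 2 : 1
  labels.last(suffix_len + 1).join('.')
end
```

With this list, "http://en.wikipedia.org/wiki/Anarchy" reduces to wikipedia.org and "http://somedomain.otherdomain.school.ac.uk" to school.ac.uk, while a suffix missing from the list would be handled as if it had a single label.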

/http:\/\/([^\/]*).*/ will produce en.wikipedia.org from the string you provided.
/http:\/\/.{0,3}\.([^\/]*).*/ will produce wikipedia.org.
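For reference, here is a minimal Ruby sketch of how those two patterns could be applied. The method names host_from and strip_short_prefix are made up for illustration, and the patterns are anchored and widened to accept https as well, which is an assumption beyond the original expressions.

```ruby
# Capture everything between the scheme and the first slash.
def host_from(url)
  url[%r{\Ahttps?://([^/]*)}, 1]
end

# Skip up to three characters plus a dot after the scheme (enough for
# "en." or "www."). Returns nil when no dot appears within the first
# four characters of the host.
def strip_short_prefix(url)
  url[%r{\Ahttps?://.{0,3}\.([^/]*)}, 1]
end
```

The second pattern relies on the prefix being short, so a URL such as http://wikipedia.org/wiki/Anarchy gives nil rather than wikipedia.org.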

yes
Now I know you haven't asked for how, and you haven't specified a language, but I'll answer anyway... (note, this works for all language subsites, not just en.wikipedia...)
perl:
$url =~ s,http://[a-z]{2}\.(wikipedia\.org)/.*,$1,;
ruby:
url = url.sub(/http:\/\/[a-z]{2}\.(wikipedia\.org)\/.*/, '\1')
php:
$url = preg_replace('|http://[a-z]{2}\.(wikipedia\.org)/.*|', '$1', $url);
Of course, for this particular example, you don't even need a regex, just this will do:
url = 'wikipedia.org'
but I jest...
You probably want to handle any URL and pull out the domain part, and it should also work for domains in different countries, e.g. foo.co.uk.
In which case, I'd use Mark Rushakoff's solution to get the hostname and then a regex to pull out the domain:
domain = host.sub(/^.*?\.([^.]+\.[^.]+(\.[a-z]{2})?)$/, '\1') # lazy .*? so that www.foo.co.uk yields foo.co.uk rather than co.uk
Hope this helps
Also, if you want to learn more, I have a regex tute online: http://tech.bluesmoon.info/2006/04/beginning-regular-expressions.html

Sure. All you would have to do is search on http://(.*)/wiki/Anarchy
In Perl (sorry, I don't know Ruby, but I expect it's similar):
$string_to_search =~ m{http://([^/]*)/};
This captures en.wikipedia.org in $1.
To get rid of the en, match on http://en\.([^/]*)/ instead, so that $1 holds wikipedia.org rather than .wikipedia.org.
That should do it.
Update: In case you're not familiar with regexes, I would recommend picking up a regex book. Mastering Regular Expressions really rocks; I saw it on half.com the other day for 14.99 used. To clarify what I suggested above: look for the string http://en\., then capture anything up to the next /. The capture ends up in $1 (in Perl; I'm not sure it's the same in Ruby), so a simple print $1 will print the string.

Related

LibreOffice: embed script in script URL

In LibreOffice, it is possible to run Python scripts like this:
sURL = "vnd.sun.star.script:file.function?language=Python&location=document"
oScript = scriptProv.getScript(sURL)
x = oScript.Invoke(args, Array(), Array())
In that example 'file' is a filename, and 'function' is the name of a function in that file.
Is it possible to embed script in that URL? sURL="vnd.." & scriptblock & "?language.."
(It seems like the kind of thing that might be possible with the correct URL, or might not be possible if just not supported).
We can use Python's eval() function. Here is an example inspired by JohnSUN's explanation in the discussion. Note: xray() uses XrayTool to show output, but you could replace that line with any output method of your choosing, such as writing to a file.
def runArbitraryCode(*args):
    url = args[0]
    codeString = url.split("&codeToRun=")[1]
    x = eval(codeString)
    xray(x)
Now enter this formula in Calc and Ctrl+click on it.
=HYPERLINK("vnd.sun.star.script:misc_examples.py$runArbitraryCode?language=Python&location=user&codeToRun=5+1")
Result: 6
Obligatory caveat: Running eval() on an unknown string is about the worst idea imaginable in terms of security. So hopefully you're the one controlling the URL and not some black hat hacker!

Regex failing to match the punycode url

I have a URL whose punycode form starts with the prefix xn----, which none of the regexes in the Ruby libraries I tried will match.
Currently I am using the validates_url_format_of Ruby library.
Example Url: "https://www.θεραπευτικη-κανναβη.com.gr"
Punycode url: "https://www.xn----ylbbafnbqebomc7ba3bp1ds.com.gr"
Is there an issue with the regex in the library, or does the issue lie in the conversion to punycode?
As far as I understand the punycode conversion rules, the prefix is always xn--. Can anyone explain what the two extra hyphens mean here?
"https://www.xn----ylbbafnbqebomc7ba3bp1ds.com.gr".match(/https?:\/\/w*\.xn----.*/)
=> #<MatchData "https://www.xn----ylbbafnbqebomc7ba3bp1ds.com.gr">
Note the url matcher is not perfect
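Punycode copies the ASCII characters of a label, including any -, to the front of the encoded string, and then adds a single - as a delimiter before the encoded non-ASCII part. So every - in the original label shows up right after the xn-- prefix, followed by one more - for the delimiter.
For example:
áéíóú.com -> xn--1caqmy9a.com
á-é-í-ó-ú.com -> xn-------4na3c3a3cwd.com
That is why a label containing one hyphen, like yours, starts with xn----.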
This one should work for you:
(xn--)(--)*[a-z0-9]+\.com\.gr
The beginning of the code: (xn--)
Zero or more pairs of hyphens: (--)*
The domain letters/digits: [a-z0-9]+
The TLD of the domain: \.com\.gr
You can add http/https if you wish.
Update:
After adding numbers to the URL I found that the regex needs a fix:
(xn--)(-[-0-9]{1})*[a-z0-9]+\.com\.gr
á-1é-2í-3ó-4ú.gr.com -> xn---1-2-3-4-7ya6f1b6dve.gr.com
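Since the ASCII characters of the original label (letters, digits and hyphens) are carried over after the xn-- prefix, a looser pattern that accepts any mix of them may be easier to maintain than one that tries to count hyphens. The following is only a sketch under the assumption that labels are checked one at a time and are at most 63 characters long; ACE_LABEL and ace_label? are made-up names, not part of validates_url_format_of.

```ruby
# "xn--", then any run of letters, digits and hyphens, ending in a letter
# or digit. This checks the shape only, not whether the label decodes.
ACE_LABEL = /\Axn--[a-z0-9-]*[a-z0-9]\z/i

def ace_label?(label)
  label.length <= 63 && ACE_LABEL.match?(label)
end
```

All three encoded labels from this thread (xn----ylbbafnbqebomc7ba3bp1ds, xn-------4na3c3a3cwd and xn---1-2-3-4-7ya6f1b6dve) pass this check, while a plain label such as example does not.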

regex to extract URLs from text - Ruby

I am trying to detect the urls from a text and replace them by wrapping in quotes like below:
original text: Hey, it is a url here www.example.com
required text: Hey, it is a url here "www.example.com"
The original text shows my input and the required text shows the output I need. I searched the web a lot but could not find a solution. I have already tried URI.extract, but it does not seem to detect URLs without http or https. Below are some examples of the URLs I want to handle. Kindly let me know if you know a solution.
ANQUETIL-DUPERRON Abraham-Hyacinthe, KIEFFER Jean-Luc, www.hominides.net/html/actualites/outils-preuve-presence-hominides-asie-0422.php,Les Belles lettres, 2001.
https://www.ancient-code.com/indian-archeologists-stumbleacross-ruins-great-forgotten-civilization-mizoram/
www.jstor.org/stable/24084454
www.biorespire.com/2016/03/22/une-nouvelle-villeantique-d%C3%A9couverte-en-inde/
insu.cnrs.fr/terre-solide/terre-et-vie/de-nouvellesdatations-repoussent-l-age-de-l-apparition-d-outils-surle-so
www.cerege.fr/spip.php?page=pageperso&id_user=94
Find words that look like URLs:
str = "ANQUETIL-DUPERRON Abraham-Hyacinthe, KIEFFER Jean-Luc, www.hominides.net/html/actualites/outils-preuve-presence-hominides-asie-0422.php,Les Belles lettres, 2001.\n\nhttps://www.ancient-code.com/indian-archeologists-stumbleacross-ruins-great-forgotten-civilization-mizoram/\n\nwww.jstor.org/stable/24084454\n\nwww.biorespire.com/2016/03/22/une-nouvelle-villeantique-d%C3%A9couverte-en-inde/\n\ninsu.cnrs.fr/terre-solide/terre-et-vie/de-nouvellesdatations-repoussent-l-age-de-l-apparition-d-outils-surle-so\n\nwww.cerege.fr/spip.php?page=pageperso&id_user=94"
str.split.select{|w| w[/(\b+\.\w+)/]}
This will give you an array of whitespace-separated words that contain a . followed by word characters, which MIGHT work for your use case.
puts str.split.select{|w| w[/(\b+\.\w+)/]}
www.hominides.net/html/actualites/outils-preuve-presence-hominides-asie-0422.php,Les
https://www.ancient-code.com/indian-archeologists-stumbleacross-ruins-great-forgotten-civilization-mizoram/
www.jstor.org/stable/24084454
www.biorespire.com/2016/03/22/une-nouvelle-villeantique-d%C3%A9couverte-en-inde/
insu.cnrs.fr/terre-solide/terre-et-vie/de-nouvellesdatations-repoussent-l-age-de-l-apparition-d-outils-surle-so
www.cerege.fr/spip.php?page=pageperso&id_user=94
Updated
Complete solution to modify your string:
str_with_quote = str.clone # make a clone for the `gsub!`
str.split.select{|w| w[/(\b+\.\w+)/]}
.each{|url| str_with_quote.gsub!(url, '"' + url + '"')}
Now your cloned object wraps urls inside double quotes
puts str_with_quote
Will give you this output
ANQUETIL-DUPERRON Abraham-Hyacinthe, KIEFFER Jean-Luc, "www.hominides.net/html/actualites/outils-preuve-presence-hominides-asie-0422.php,Les" Belles lettres, 2001.
"https://www.ancient-code.com/indian-archeologists-stumbleacross-ruins-great-forgotten-civilization-mizoram/"
"www.jstor.org/stable/24084454"
"www.biorespire.com/2016/03/22/une-nouvelle-villeantique-d%C3%A9couverte-en-inde/"
"insu.cnrs.fr/terre-solide/terre-et-vie/de-nouvellesdatations-repoussent-l-age-de-l-apparition-d-outils-surle-so"
"www.cerege.fr/spip.php?page=pageperso&id_user=94"

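A different way to get the quoting, sketched here as an alternative to splitting on whitespace, is a single gsub with a pattern that stops at commas, so that a URL glued to the next word (as in "0422.php,Les") is not over-captured. URL_LIKE and quote_urls are hypothetical names, and the pattern is a heuristic that assumes URLs in the text contain no commas or double quotes; it will also match any dotted word that ends in two or more letters.

```ruby
# Optional scheme, one or more dot-terminated labels, a final label of
# letters, then an optional path that runs until whitespace, a comma or
# a double quote.
URL_LIKE = %r{(?:https?://)?(?:[a-z0-9-]+\.)+[a-z]{2,}(?:/[^\s,"]*)?}i

# Wrap every URL-like match in double quotes and return the new string.
def quote_urls(text)
  text.gsub(URL_LIKE) { |match| %("#{match}") }
end
```

Because gsub works on a copy, the original string is left untouched and there is no need to clone it first.
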
Ruby/Rails : Get url extension with URI

I'm currently having a bit of a problem parsing a URL using URI.
I've tried this code:
uri = URI::parse(Model.first.media)
#<URI::HTTPS https://my-bucket.s3.amazonaws.com/model/media/41/cdbb21cc-1c59-4aa3-92ec-917e7237a850.mp4>
uri.path
"/model/media/41/cdbb21cc-1c59-4aa3-92ec-917e7237a850.mp4"
File.basename(Model.first.media, '.mp4')
"cdbb21cc-1c59-4aa3-92ec-917e7237a850"
but I'm still unsure how to get the path without the leading /, for example model/media/41/cdbb21cc-1c59-4aa3-92ec-917e7237a850.mp4, and how to get only the directory part, without the domain and the file name, for example model/media/41.
Do I have to use a regex to get that output, or can URI handle this?
Note:
I've found how to get the URL path without its first character based on this question: Ruby regexp: capture the path of url
URI class helps break apart URLs into components and gives you methods like
[:scheme, :userinfo, :host, :port, :path, :query, :fragment]
If you simply need to get rid of the first slash, it's as simple as this, with no regex:
uri.path[1..-1] #gives all string characters except the 0 index.
But you could probably even get away with:
Model.first.media.split('.com/').last # don't even need URI parse.
For last part of your question you can do:
File.dirname(uri.path) # will return => "/model/media/41"
File.dirname(uri.path)[1..-1] # if you want to remove leading /
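Putting the pieces together, here is a short sketch that pulls out each part using only URI and File. The literal URL stands in for Model.first.media, the variable names are illustrative, and String#delete_prefix assumes Ruby 2.5 or newer (on older versions the [1..-1] slice above does the same job).

```ruby
require 'uri'

url = 'https://my-bucket.s3.amazonaws.com/model/media/41/cdbb21cc-1c59-4aa3-92ec-917e7237a850.mp4'
uri = URI.parse(url)

# Full path without the leading slash.
path_without_slash = uri.path.delete_prefix('/')

# Directory part only, again without the leading slash.
directory = File.dirname(uri.path).delete_prefix('/')

# Extension (including the dot) and the file name without it.
extension = File.extname(uri.path)
basename  = File.basename(uri.path, extension)
```

Using File.extname instead of hard-coding '.mp4' keeps the same code working for other media types.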

Strange URL containing 'A=0 or '0=A in web server logs

During the last weekend some of my sites logged errors implying wrong usage of our URLs:
...news.php?lang=EN&id=23'A=0
or
...news.php?lang=EN&id=23'0=A
instead of
...news.php?lang=EN&id=23
I found only one page originally which mentioned this (https://forums.adobe.com/thread/1973913) where they speculated that the additional query string comes from GoogleBot or an encoding error.
I recently changed my sites to use PDO instead of mysql_*. Maybe this change caused the errors? Any hints would be useful.
Additionally, all of the requests come from the same user-agent shown below.
Mozilla/5.0 (Windows; U; Windows NT 5.1; pt-PT; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 (.NET CLR 3.5.30729)
This led me to find the following threads:
pt-BR
and
Strange parameter in URL - what are they trying?
It is a bot testing for SQL injection vulnerabilities by closing a query with apostrophe, then setting a variable. There are also similar injects that deal with shell commands and/or file path traversals. Whether it's a "good bot" or a bad bot is unknown, but if the inject works, you have bigger issues to deal with. There's a 99% chance your site is not generating these style links and there is nothing you can do to stop them from crafting those urls unless you block the request(s) with a simple regex string or a more complex WAF such as ModSecurity.
Blocking based on user agent is not an effective angle. You need to look for the request heuristics and block based on that instead. Some examples of things to look for in the url/request/POST/referrer, as both utf-8 and hex characters:
double apostrophes
double periods, especially followed by a slash in various encodings
words like "script", "etc" or "passwd"
paths like dev/null used with piping/echoing shell output
%00 null byte style characters used to start a new command
http in the url more than once (unless your site uses it)
anything regarding cgi (unless your site uses it)
random "enterprise" paths for things like coldfusion, tomcat, etc
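A few of the checks listed above can be sketched in a handful of lines. This is an illustration only, written in Ruby rather than PHP; SUSPICIOUS_PATTERNS and suspicious_request? are made-up names, the pattern list is far from complete, and the input is decoded just once, so double-encoded forms such as %252f are not caught.

```ruby
require 'uri'

SUSPICIOUS_PATTERNS = [
  /''/,                             # double apostrophes
  %r{\.\.[/\\]},                    # double periods followed by a slash or backslash
  /\x00/,                           # null byte, visible once %00 has been decoded
  %r{\b(?:etc/passwd|dev/null)\b}i, # well-known paths probed by traversal attempts
  %r{https?://.*https?://}i,        # "http" more than once in the same request
  /'(?:a=0|0=a)/i                   # the 'A=0 and '0=A probes from the question
].freeze

# True when the raw request string, or its percent-decoded form,
# matches any of the patterns above.
def suspicious_request?(raw)
  decoded = begin
    URI.decode_www_form_component(raw).scrub('?')
  rescue ArgumentError
    raw # malformed percent-encoding: fall back to the raw string
  end
  [raw, decoded].any? { |s| SUSPICIOUS_PATTERNS.any? { |re| re.match?(s) } }
end
```

Checking both the raw and the decoded string is a design choice: it lets one pattern cover the literal and the percent-encoded spelling of the same character.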
If you aren't using a WAF, here is a regex concat that should capture many of those within a url. We use it in PHP apps, so you may/will need to tweak some escapes/looks depending on where you are using this. Note that this has .cgi, wordpress, and wp-admin along with a bunch of other stuff in the regex, remove them if you need to.
$invalid = "(\(\))"; // lets not look for quotes. [good]bots use them constantly. looking for () since technically parenthesis arent valid
$period = "(\\002e|%2e|%252e|%c0%2e|\.)";
$slash = "(\\2215|%2f|%252f|%5c|%255c|%c0%2f|%c0%af|\/|\\\)"; // http://security.stackexchange.com/questions/48879/why-does-directory-traversal-attack-c0af-work
$routes = "(etc|dev|irj)" . $slash . "(passwds?|group|null|portal)|allow_url_include|auto_prepend_file|route_*=http";
$filetypes = $period . "+(sql|db|sqlite|log|ini|cgi|bak|rc|apk|pkg|deb|rpm|exe|msi|bak|old|cache|lock|autoload|gitignore|ht(access|passwds?)|cpanel_config|history|zip|bz2|tar|(t)?gz)";
$cgis = "cgi(-|_){0,1}(bin(-sdb)?|mod|sys)?";
$phps = "(changelog|version|license|command|xmlrpc|admin-ajax|wsdl|tmp|shell|stats|echo|(my)?sql|sample|modx|load-config|cron|wp-(up|tmp|sitemaps|sitemap(s)?|signup|settings|" . $period . "?config(uration|-sample|bak)?))" . $period . "php";
$doors = "(" . $cgis . $slash . "(common" . $period . "(cgi|php))|manager" . $slash . "html|stssys" . $period . "htm|((mysql|phpmy|db|my)admin|pma|sqlitemanager|sqlite|websql)" . $slash . "|(jmx|web)-console|bitrix|invoker|muieblackcat|w00tw00t|websql|xampp|cfide|wordpress|wp-admin|hnap1|tmunblock|soapcaller|zabbix|elfinder)";
$sqls = "((un)?hex\(|name_const\(|char\(|a=0)";
$nulls = "(%00|%2500)";
$truth = "(.{1,4})=\\1"; // catch OR always-true (1=1) clauses via sql inject - not used atm, its too broad and may capture search=chowder (ch=ch) for example
$regex = "/$invalid|$period{1,2}$slash|$routes|$filetypes|$phps|$doors|$sqls|$nulls/i";
Using it, at least with PHP, is pretty straightforward with preg_match_all(). Here is an example of how you can use it: https://gist.github.com/dhaupin/605b35ca64ca0d061f05c4cf423521ab
WARNING: Be careful if you set this to autoban (i.e., a fail2ban filter). MS/Bing DumbBots (and others) often muck up URLs by entering things like strange triple dots from following truncated URLs, or by trying to hit a tel: link as a URI. I don't know why. Here is what I mean: a link with the text www.example.com/link-too-long...truncated.html may point to a correct URL, but Bing may try to access it "as it looks" instead of following the href, resulting in a WAF hit due to the double dots.
Since this is a very old version of Firefox, I blocked it in my htaccess file:
RewriteCond %{HTTP_USER_AGENT} Firefox/3\.5\.2 [NC]
RewriteRule .* err404.php [R,L]

Resources