(rails) validating URL help with regexp - ruby-on-rails

I'm using the following to verify that a URL is validly formatted:
validates_format_of :website, :with => URI::regexp(%w(http https))
However, it doesn't work when the URL doesn't start with http:// or https://. Is there a similar way to validate URLs with URI::regexp (or URI) so that it also accepts valid URLs that don't start with http://? (For example, www.google.com is valid, as is http://www.google.com.)

This post on Daring Fireball provides a robust regex:
\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))
A more recent post improves on it (N.B. newlines and indentation added here for clarity; see the post for an even more expanded version with explanations):
(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|
www\d{0,3}[.]|
[a-z0-9.\-]+[.][a-z]{2,4}/)
(?:[^\s()<>]+|
\(([^\s()<>]+|
(\([^\s()<>]+\)))*\))+
(?:\(([^\s()<>]+|
(\([^\s()<>]+\)))*\)|
[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
From my tests, URI::regexp is too loose in its definition of a URI (though it does require http…).
You can use a virtual attribute or a before_save filter to prepend http:// to your URLs if necessary.
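As a rough sketch of that normalization idea, the plain-Ruby helpers below prepend http:// when no scheme is present and then let URI.parse decide whether the result is an http or https URL with a host. The names normalize_url and valid_http_url? are made up for illustration, and the check swaps the regexp for URI.parse, so it will not accept or reject exactly the same strings as the patterns above.

```ruby
require 'uri'

# Hypothetical helper: prepend "http://" when the input has no scheme.
def normalize_url(raw)
  url = raw.to_s.strip
  url =~ %r{\A[a-z][a-z0-9+.-]*://}i ? url : "http://#{url}"
end

# Hypothetical helper: true when the normalized input parses as an
# http or https URI with a non-empty host.
def valid_http_url?(raw)
  uri = URI.parse(normalize_url(raw))
  uri.is_a?(URI::HTTP) && !uri.host.to_s.empty?
rescue URI::InvalidURIError
  false
end

puts valid_http_url?('www.google.com')        # prints true
puts valid_http_url?('http://www.google.com') # prints true
puts valid_http_url?('not a url')             # prints false
```

In a Rails model, normalize_url would probably be called from a before_validation callback rather than before_save, so that the validator sees the normalized value; that wiring is an assumption and is not shown here.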

This is a Ruby version (forward slashes escaped):
(?i)\b((?:https?:(?:\/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!#)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b\/?(?!#)))
Original gist by Gruber
Please copy it into Rubular for testing; I could not make a permanent link, as the regexp is probably too long.
It works with and without http, and with short domains like 'google.com'.

Related

How to get the hostname from a url with accented letters inside in Ruby

I have the following URL in a field of a model:
https://www.reddit.com/r/italy/comments/i6ix3x/trenitalia_sostiene_che_potrà_non_rispettare_il/?sort=new
The URL contains an accented letter (à). If I use URI.parse to get the hostname, it gives me the following error:
URI::InvalidURIError: URI must be ascii only "https://www.reddit.com/r/italy/comments/i6ix3x/trenitalia_sostiene_che_potr\u00E0_non_rispettare_il/?sort=new"
The method URI.encode resolves the problem, but I discovered that URI.encode is obsolete and should not be used.
Which method should I use to replace URI.encode?
This is an encoding issue, and you need to handle it as shown below.
First, encode your URI:
encoded_url = URI.encode('https://www.reddit.com/r/italy/comments/i6ix3x/trenitalia_sostiene_che_potrà_non_rispettare_il/?sort=new')
And then parse it
URI.parse(encoded_url)
good luck
The only solution I found uses the Addressable gem (https://github.com/sporkmonger/addressable):
Addressable::URI.parse('https://www.reddit.com/r/italy/comments/i6ix3x/trenitalia_sostiene_che_potrà_non_rispettare_il/?sort=new').host
Perhaps this could be an inelegant solution:
URI.parse(URI.extract(target.url).first)
# => #<URI::HTTPS https://www.reddit.com/r/italy/comments/i6ix3x/trenitalia_sostiene_che_potr>
Then I use the method host
URI.parse(URI.extract(target.url).first).host
# => "www.reddit.com"
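Since URI.encode (an alias of URI.escape) was removed in Ruby 3.0, one standard-library-only alternative is to percent-encode just the non-ASCII characters before parsing. The sketch below assumes the non-ASCII characters sit in the path or query; a non-ASCII hostname would need IDN (punycode) handling, which this does not attempt. The helper name escape_non_ascii is hypothetical.

```ruby
require 'uri'

# Hypothetical helper: percent-encode every non-ASCII character as its
# UTF-8 bytes, leaving ASCII characters (including existing %XX escapes) alone.
def escape_non_ascii(url)
  url.gsub(/[^[:ascii:]]/) do |char|
    char.bytes.map { |byte| format('%%%02X', byte) }.join
  end
end

url = 'https://www.reddit.com/r/italy/comments/i6ix3x/trenitalia_sostiene_che_potrà_non_rispettare_il/?sort=new'
escaped = escape_non_ascii(url)
host = URI.parse(escaped).host # => "www.reddit.com"
puts host
```

Unlike the URI.extract approach, this keeps the full path, because the accented letter becomes %C3%A0 instead of cutting the URL short.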

(JSON::ParserError) "{N}: unexpected token at 'alihack<%eval request(\"alihack.com\")%>

I have a website running on Ruby on Rails 3.2.11 and Ruby 1.9.3.
What can cause the following error:
(JSON::ParserError) "{N}: unexpected token at 'alihack<%eval request(\"alihack.com\")%>
I have several errors like this in the logs. All of them try to eval request(\"alihack.com\").
Part of the log file:
"REMOTE_ADDR" => "10.123.66.198",
"REQUEST_METHOD" => "PUT",
"REQUEST_PATH" => "/ali.txt",
"PATH_INFO" => "/ali.txt",
"REQUEST_URI" => "/ali.txt",
"SERVER_PROTOCOL" => "HTTP/1.1",
"HTTP_VERSION" => "HTTP/1.1",
"HTTP_X_REQUEST_START" => "1407690958116",
"HTTP_X_REQUEST_ID" => "47392d63-f113-48ba-bdd4-74492ebe64f6",
"HTTP_X_FORWARDED_PROTO" => "https",
"HTTP_X_FORWARDED_PORT" => "443",
"HTTP_X_FORWARDED_FOR" => "23.27.103.106, 199.27.133.183".
199.27.133.183 is a CloudFlare IP.
The "REMOTE_ADDR" values ("10.93.15.235", "10.123.66.198" and others) are, I think, fake IPs belonging to the proxy.
Here's a link to a guy who has the same issue with his website from the same IP address (23.27.103.106).
To sum up, the common IP across all the errors is 23.27.103.106, and they try to run the script using Ruby's eval.
So my questions are:
What type of vulnerability are they trying to find?
What should I do? Block the IP?
Thank you in advance.
Why does it happen?
It seems like an attempt to at least test for, or exploit, a remote code execution vulnerability. Potentially a generic one (targeting a platform other than Rails), or one that existed in earlier versions.
The actual error, however, stems from the fact that the request is an HTTP PUT with an application/json Content-Type header, but the body isn't valid JSON.
To reproduce this on your dev environment:
curl -D - -X PUT --data "not json" -H "Content-Type: application/json" http://localhost:3000
More details
Rails' action_dispatch tries to parse any JSON request by passing the body to be decoded:
# lib/action_dispatch/middleware/params_parser.rb
def parse_formatted_parameters(env)
  ...
  strategy = @parsers[mime_type]
  ...
  case strategy
  when Proc
    ...
  when :json
    data = ActiveSupport::JSON.decode(request.body)
    ...
In this case, the body is not valid JSON, so the error is raised, causing the server to report a 500.
Possible solutions
I'm not entirely sure what the best strategy is for dealing with this. There are several possibilities:
1. Block the IP address using iptables.
2. Filter (PUT or all) requests to /ali.txt within your nginx or apache configs.
3. Use a tool like the rack-attack gem and apply the filter there. (See this rack-attack issue.)
4. Use the request_exception_handler gem to catch the error and handle it from within Rails. (See this SO answer and this github issue.)
5. Block PUT requests within Rails' routes.rb to all URLs but those that are explicitly allowed. It looks like, in this case, the error is raised even before the request reaches Rails' routes, so this might not be possible.
6. Use the rack-robustness middleware and catch the JSON parse error with something like this configuration in config/application.rb
7. Write your own middleware. Something along the lines of the stuff on this post
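The last possibility above (writing your own middleware) could look roughly like the sketch below. It is not the code from the linked post: the class name RejectMalformedJson is hypothetical, and it uses only the plain Rack calling convention (call(env) returning a status, headers and body), so it can be exercised without Rails or a server.

```ruby
require 'json'
require 'stringio'

# Hypothetical Rack-style middleware: answer 400 instead of letting a
# malformed JSON body raise further down the stack.
class RejectMalformedJson
  def initialize(app)
    @app = app
  end

  def call(env)
    if env['CONTENT_TYPE'].to_s.include?('application/json')
      input = env['rack.input']
      body = input.read
      input.rewind # let downstream middleware read the body again
      begin
        JSON.parse(body) unless body.empty?
      rescue JSON::ParserError
        return [400, { 'Content-Type' => 'text/plain' }, ['Malformed JSON']]
      end
    end
    @app.call(env)
  end
end

# Exercise it without a server, using a stub app and a fake env.
app = ->(_env) { [200, { 'Content-Type' => 'text/plain' }, ['ok']] }
middleware = RejectMalformedJson.new(app)
bad_env = {
  'CONTENT_TYPE' => 'application/json',
  'rack.input' => StringIO.new('alihack<%eval request("alihack.com")%>')
}
status, _headers, _body = middleware.call(bad_env)
puts status # prints 400
```

In a Rails 3.2 app this would presumably be registered ahead of the params parser, for example with config.middleware.insert_before ActionDispatch::ParamsParser, RejectMalformedJson; treat that line as an assumption to check against your Rails version.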
I'm currently leaning towards options #3, #4 or #6. All of which might come in handy for other types of bots/scanners or other invalid requests that might pop-up in future...
Happy to hear what people think about the various alternative solutions
I saw some weird log entries on my own site (which doesn't use Ruby) and Google took me to this thread. The IP in my entries was different: 120.37.236.161.
After poking around a bit more, here is my educated guess (mostly speculation):
First, in my own logs I saw a reference to http://api.alihack.com/info.txt - checked this link out; looked like an attempt at a PHP injection.
There's also a "register.php" page there - submitting takes you to an "invite.php" page.
Further examination of this domain took me to http://www.alihack.com/2014/07/10/168.aspx (page is in Chinese but Google Translate helped me out here)
I expect this "Black Spider" tool has been modified by script kiddies and is being used as a carpet bomber to attempt to find any sites which are "vulnerable."
It might be prudent to just add an automatic denial of any attempt including the "alihack" substring to your configuration.
I had a similar issue show up in my Rollbar logs, a PUT request to /ali.txt
Best just to block that IP; I only saw one request on my end with this error. The request I received came from France -> http://whois.domaintools.com/37.187.74.201
If you use nginx, add this to your nginx conf file:
deny 23.27.103.106/32;
deny 199.27.133.183/32;
For Rails 3 there is a dedicated workaround gem: https://github.com/infopark/robust_params_parser

Sanitizing Unicode strings for URL slugs (Ruby/Rails)

I have UTF-8 encoded post titles which I'd rather show using the appropriate characters in slugs. An example is Amazon Japan's URL here.
How can any arbitrary string be converted to a safe URL slug such as this, with Ruby (or Rails)?
(There are some related PHP posts, but nothing I could find for Ruby.)
From reading here it seems like a solution is this:
require 'open-uri'
str = "\x12\x34\x56\x78\x9a\xbc\xde\xf1\x23\x45\x67\x89\xab\xcd\xef\x12\x34\x56\x78\x9a".force_encoding('ASCII-8BIT')
puts URI::encode(str)
Here is the documentation for open-uri, and here is some info on the UTF-8 encoded URL schema.
EDIT: Having looked into this more, I noticed that encode is just an alias for URI.escape, which is documented here. An example taken from the docs is below:
require 'uri'
enc_uri = URI.escape("http://example.com/?a=\11\15")
p enc_uri
# => "http://example.com/?a=%09%0D"
p URI.unescape(enc_uri)
# => "http://example.com/?a=\t\r"
p URI.escape("#?#!", "!?")
# => "#%3F#%21"
Let me know if this is what you were looking for.
EDIT #2: I was interested and kept looking a little more. According to the comments, Ryan Bates' Railscast on FriendlyId also seems to work with Chinese characters.
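If the goal is a slug that keeps the original characters rather than percent-escapes, one possible approach is to strip everything except Unicode letters and digits. The sketch below is an illustration under that assumption, not what FriendlyId or Rails does internally; the helper name unicode_slug is hypothetical, and combining marks (for example decomposed accents) are treated as separators, which may not suit every script.

```ruby
# Hypothetical helper: keep Unicode letters and digits, collapse everything
# else into single hyphens, and trim hyphens from both ends.
def unicode_slug(title)
  title.downcase
       .gsub(/[^\p{L}\p{N}]+/, '-')
       .gsub(/\A-+|-+\z/, '')
end

puts unicode_slug('Hello, World!')         # prints hello-world
puts unicode_slug('日本語のタイトル 2024') # prints 日本語のタイトル-2024
```

Browsers will still percent-encode such a slug on the wire, but most display the decoded characters in the address bar.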

devise confirmation_url in HTTPS

My website is served entirely over SSL, so I would also like the URLs generated by devise (3.2.2) for email verification to be https://....
Currently the URLs are generated by:
confirmation_url(@resource, :confirmation_token => @token)
which produces nice urls like:
http://example.com/users/confirmation?confirmation_token=zqfHS35ckLQZscSbsgMm
I would like the url to be
https://example.com/users/confirmation?confirmation_token=zqfHS35ckLQZscSbsgMm
Also, the email verification currently doesn't work, because nginx redirects every page to its https equivalent, and for some reason things get messed up and the https version is a corrupted URL, like:
https://example.com/users/confirmation?confirmation_token=zqfHS35ckLQZscSbsgMm?confirmation_token=zqfHS35ckLQZscSbsgMm
For some reason nginx redirects to this corrupted URL, so Unicorn can only reject the request.
Any clues?
You can either specify the protocol in the email template, as you did in your own answer, or you can specify it as a default in the mailer. The simplest way to do this, if you are happy for all emails to use https links, is to add it to your app config. For example, in your production.rb:
config.action_mailer.default_url_options = {:protocol => 'https', :host => 'example.com'}
I know it doesn't matter any more if you're going straight to https, but your URL after the nginx redirect from http to https looks like it's appending the query string to the entire URL (which already contains it), so it would be worth fixing that so it works in all cases, even if you no longer need it for the emails. If you're using a return 301 … statement in the nginx config, perhaps there's a trailing $query_string or $args you don't need, for example if you're using $request_uri, which already has the GET parameters in it.
Also, I don't think you will find confirmation_url defined directly anywhere. If you try rake routes you'll probably see one of them is:
user_confirmation GET /users/confirmation(.:format) {:controller=>"devise/confirmations", :action=>"show"}
which means that there will automatically be a user_confirmation_url helper available as with routes in general. I think devise then allows you to use confirmation_url due to its clever tracking of the scope you're using ('user' in this case), though I must admit I haven't looked at the code enough in devise's routing to know exactly how it does it for the routes.
I changed the method call to:
confirmation_url(@resource, :confirmation_token => @token, protocol: "https")
and that started generating the URLs correctly, with https as required.
I couldn't find the definition of confirmation_url anywhere in the devise code, though.

Get URL without filename?

I'm trying to figure out how to parse a URL in Rails and return everything except the filename, that is, everything except what follows the last slash.
For example, I'd like:
http://bucket.s3.amazonaws.com/directoryname/1234/thumbnail.jpg
to become:
http://bucket.s3.amazonaws.com/directoryname/1234/
I've found every way to parse a URI, but this. Any help would be appreciated.
Ruby has methods available to get you there easily:
File.dirname(URL) # => "http://bucket.s3.amazonaws.com/directoryname/1234"
Think about what a URL/URI is: It's a designator for a protocol, a site, and a directory-path to a resource. The directory-path to a resource is the same as a "path/to/file", so File.dirname works nicely, without having to reinvent that particular wheel.
The trailing / isn't included because it's a delimiter between the path segments. You generally don't need that, because joining a resource to a path will automatically supply it.
Really though, using Ruby's URI class is the proper way to manipulate URIs:
require 'uri'
URL = 'http://bucket.s3.amazonaws.com/directoryname/1234/thumbnail.jpg'
uri = URI.parse(URL)
uri.merge('foo.html').to_s
# => "http://bucket.s3.amazonaws.com/directoryname/1234/foo.html"
URI allows you to mess with the path easily too:
uri.merge('../foo.html').to_s
# => "http://bucket.s3.amazonaws.com/directoryname/foo.html"
uri.merge('../bar/foo.html').to_s
# => "http://bucket.s3.amazonaws.com/directoryname/bar/foo.html"
URI is well-tested, and designed for this purpose. It will also allow you to add query parameters easily, encoding them as necessary.
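Building on the merge examples above, resolving "." against the URL gives the directory with its trailing slash, which is what the question asked for. This is a minimal sketch; the helper name directory_url is hypothetical.

```ruby
require 'uri'

# Hypothetical helper: resolve "." against the URL, which drops the last
# path segment (and any query string) but keeps the trailing slash.
def directory_url(url)
  URI.join(url, '.').to_s
end

puts directory_url('http://bucket.s3.amazonaws.com/directoryname/1234/thumbnail.jpg')
# prints http://bucket.s3.amazonaws.com/directoryname/1234/
```

Unlike File.dirname, this works on the parsed path only, so a query string is dropped rather than being treated as part of the file name.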
File name
"http://bucket.s3.amazonaws.com/directoryname/1234/thumbnail.jpg".match(/(.*\/)+(.*$)/)[2]
=> "thumbnail.jpg"
URL without the file name
"http://bucket.s3.amazonaws.com/directoryname/1234/thumbnail.jpg".match(/(.*\/)+(.*$)/)[1]
=> "http://bucket.s3.amazonaws.com/directoryname/1234/"
String#match
'http://a.b.pl/a/b/c/d.jpg'.rpartition('/').first
=> "http://a.b.pl/a/b/c"
