How to remove random excess of slashes from url? - ruby-on-rails

How can I remove random excess slashes from a URL, or at least validate it?
For example,
valid statements:
http://domain.com/url/url2
https://domain.com/url/url2
www.domain.com/url/url2
invalid statements:
http://domain.com//url/url2
https://domain.com/////url/url2
www.domain.com/url/////////url2
Thanks for help!

Use regular expressions:
require 'uri'
url = URI.parse('https://domain.com/////url/url2')
url.path.gsub! %r{/+}, '/'
p url.to_s

This pattern does the job (with or without http(s)):
"https://domain.com/////url/url2".gsub! %r{(?<!:)/+(?=/)}, ''

The other answers do not remove a trailing slash from the URL - which can be important for SEO purposes. There are many ways to do this, but for example:
require 'uri'
url = URI.parse('https://example.com/////url/url2/')
url.path.gsub! %r{/+}, '/'
url.path.sub! %r{/$}, ''
Or:
require 'uri'
url = URI.parse('https://example.com/////url/url2/')
url.path.squeeze!('/')
url.path.chomp!('/')
See: String#squeeze! and String#chomp!.
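Putting the answers above together, here is a minimal sketch that covers both halves of the question (validate, or clean up). The method names excess_slashes? and normalize_slashes are made up for illustration, and the choice to leave a bare "/" root path alone is an assumption rather than something the answers specify.

```ruby
require 'uri'

# Hypothetical helper: true when the URL has a run of two or more slashes
# other than the "//" that directly follows the scheme's colon.
def excess_slashes?(url)
  url.match?(%r{(?<!:)//})
end

# Hypothetical helper: collapse repeated slashes in the path and drop a
# trailing slash (a bare "/" root path is left alone).
def normalize_slashes(url)
  uri = URI.parse(url)
  path = uri.path.squeeze('/')
  path = path.chomp('/') unless path == '/'
  uri.path = path
  uri.to_s
end

puts excess_slashes?('http://domain.com/url/url2')          # false
puts excess_slashes?('www.domain.com/url/////////url2')     # true
puts normalize_slashes('https://domain.com/////url/url2/')  # https://domain.com/url/url2
puts normalize_slashes('www.domain.com/url/////////url2')   # www.domain.com/url/url2
```

Only the path is touched, so a "//" inside the query string survives normalization.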

Related

Decoding a redirect url - Ruby on Rails

I'd like to use the URI or CGI libraries to get the path from the query part of this url. In other words, just: '/scouting/amateur'. Is this possible or do I need to use regexp?
http://10.241.180.63:3149/login?redirect_path=http%3A%2F%2F10.241.180.63%3A3149%2Fscouting%2Famateur
Suggestion with Ruby built-ins (if designing a method, you might want to implement some error handling).
require 'uri'
query = URI("http://10.241.180.63:3149/login?redirect_path=http%3A%2F%2F10.241.180.63%3A3149%2Fscouting%2Famateur").query
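# URI.decode was removed in Ruby 3.0; URI.decode_www_form_component does the decoding here.
path = URI(URI.decode_www_form_component(query).split('=')[1]).path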
You may find the gem uri-query_params helpful / more elegant (it will decode query params automatically). E.g.
require 'uri/query_params'
uri = URI("http://10.241.180.63:3149/login?redirect_path=http%3A%2F%2F10.241.180.63%3A3149%2Fscouting%2Famateur")
URI(uri.query_params["redirect_path"]).path
Try this:
require 'uri'
url = URI.parse('http://10.241.180.63:3149/login?redirect_path=http%3A%2F%2F10.241.180.63%3A3149%2Fscouting%2Famateur')
redirect_path = URI.decode_www_form_component(url.query.split("redirect_path=").last)
# redirect_path is now "http://10.241.180.63:3149/scouting/amateur"
result = redirect_path.split("/").last(2).join("/")
# result is "scouting/amateur" (note: no leading slash)
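For the error handling mentioned above, a small sketch using only the standard library could look like the following. The method name redirect_path_from is hypothetical, and returning nil when the parameter is missing is an assumption about what the caller wants.

```ruby
require 'uri'

# Hypothetical helper: returns the path of the URL stored in the
# redirect_path query parameter, or nil when it is absent.
def redirect_path_from(url)
  query = URI.parse(url).query
  return nil if query.nil?

  # decode_www_form splits the query into [key, value] pairs and
  # percent-decodes both sides.
  params = URI.decode_www_form(query).to_h
  target = params['redirect_path']
  target && URI.parse(target).path
end

login = 'http://10.241.180.63:3149/login?redirect_path=http%3A%2F%2F10.241.180.63%3A3149%2Fscouting%2Famateur'
puts redirect_path_from(login)  # /scouting/amateur
```

Looking the parameter up by name avoids depending on its position in the query string, which the split('=') approach does.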

URI encoding not working

In a Rails app, I need to parse URIs:
a = 'some file name.txt'
URI(URI.encode(a)) # works
b = 'some filename with :colon in it.txt'
URI(URI.encode(b)) # fails URI::InvalidURIError: bad URI(is not URI?):
How can I safely pass a file name to URI that contains special characters? Why doesn't encode work on colon?
URI.escape (or encode) takes an optional second parameter. It's a Regexp matching all symbols that should be escaped. To escape all non-word characters you could use:
URI.encode('some filename with :colon in it.txt', /\W/)
#=> "some%20filename%20with%20%3Acolon%20in%20it%2Etxt"
There are two predefined patterns (plain strings, not Regexp objects) for encode:
URI::PATTERN::UNRESERVED #=> "\\-_.!~*'()a-zA-Z\\d"
URI::PATTERN::RESERVED #=> ";/?:#&=+$,\\[\\]"
require 'uri'
url = "file1:abc.txt"
p URI.encode_www_form_component url
--output:--
"file1%3Aabc.txt"
p URI(URI.encode_www_form_component url)
--output:--
#<URI::Generic:0x000001008abf28 URL:file1%3Aabc.txt>
p URI(URI.encode url, ":")
--output:--
#<URI::Generic:0x000001008abcd0 URL:file1%3Aabc.txt>
Why doesn't encode work on colon?
Because encode/escape is broken.
Use Addressable::URI::encode
require "addressable/uri"
a = 'some file name.txt'
Addressable::URI.encode(Addressable::URI.encode(a))
# => "some%2520file%2520name.txt"
b = 'some filename with :colon in it.txt'
Addressable::URI.encode(Addressable::URI.encode(b))
# => "some%2520filename%2520with%2520:colon%2520in%2520it.txt"
The problem seems to be the space preceding the colon: 'lol :lol.txt' doesn't work, but 'lol:lol.txt' works.
Maybe you could replace the spaces with something else.
If you want to escape special characters in a given string, it is best to use:
esc_uri=URI.escape("String with special character")
The resulting string is URI-escaped and safe to pass to URI.
Refer to URI::Escape for how to use URI escape. Hope this helps.
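Note that URI.escape and URI.encode were removed in Ruby 3.0, so the snippets above that call them raise NoMethodError on current Rubies. One standard-library alternative is ERB::Util.url_encode, sketched below. Choosing it is a suggestion, not something from the answers above. The original failure most likely comes from URI.encode leaving the colon in place, so that everything before it is read as a (malformed) URI scheme.

```ruby
require 'erb'
require 'uri'

name = 'some filename with :colon in it.txt'

# ERB::Util.url_encode percent-encodes every byte outside the unreserved
# set, so both the spaces and the colon are escaped.
encoded = ERB::Util.url_encode(name)
uri = URI(encoded)

puts encoded   # some%20filename%20with%20%3Acolon%20in%20it.txt
puts uri.path  # some%20filename%20with%20%3Acolon%20in%20it.txt
```

Unlike URI.encode_www_form_component, this turns spaces into %20 rather than +, which is usually what you want in a path segment.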

Regex in Ruby: expression not found

I'm having trouble with a regex in Ruby (on Rails). I'm relatively new to this.
The test string is:
http://www.xyz.com/017010830343?$ProdLarge$
I am trying to remove "$ProdLarge$". In other words, the $ signs and anything between.
My regular expression is:
\$\w+\$
Rubular says my expression is ok. http://rubular.com/r/NDDQxKVraK
But when I run my code, the app says it isn't finding a match. Code below:
some_array.each do |x|
logger.debug "scan #{x.scan('\$\w+\$')}"
logger.debug "String? #{x.instance_of?(String)}"
x.gsub!('\$\w+\$','scl=1')
...
My logger debug line shows a result of "[]". String is confirmed as being true. And the gsub line has no effect.
What do I need to correct?
Use /regex/ instead of 'regex':
> "http://www.xyz.com/017010830343?$ProdLarge$".gsub(/\$\w+\$/, 'scl=1')
=> "http://www.xyz.com/017010830343?scl=1"
Don't use a regex for this task, use a tool designed for it, URI. To remove the query:
require 'uri'
url = URI.parse('http://www.xyz.com/017010830343?$ProdLarge$')
url.query = nil
puts url.to_s
=> http://www.xyz.com/017010830343
To change to a different query use this instead of url.query = nil:
url.query = 'scl=1'
puts url.to_s
=> http://www.xyz.com/017010830343?scl=1
URI will automatically encode values if necessary, saving you the trouble. If you need even more URL management power, look at Addressable::URI.
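To see why the quoted pattern in the question found nothing, here is a short sketch comparing a String pattern with a Regexp. The use of Regexp.new is only for the case where the pattern has to arrive as a string. The variable names are illustrative.

```ruby
url = 'http://www.xyz.com/017010830343?$ProdLarge$'

# A String pattern is matched literally, backslashes included, so nothing is found.
p url.scan('\$\w+\$')         # => []

# A Regexp literal is interpreted as a pattern.
p url.scan(/\$\w+\$/)         # => ["$ProdLarge$"]

# Regexp.new compiles a pattern that arrives as a String.
pattern = Regexp.new('\$\w+\$')
p url.gsub(pattern, 'scl=1')  # => "http://www.xyz.com/017010830343?scl=1"
```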

Ruby/Rails 3.1: Given a URL string, remove path

Given any valid HTTP/HTTPS string, I would like to parse/transform it such that the end result is exactly the root of the string.
So given URLs:
http://foo.example.com:8080/whatsit/foo.bar?x=y
https://example.net/
I would like the results:
http://foo.example.com:8080/
https://example.net/
I found the documentation for URI::Parser not super approachable.
My initial, naïve solution would be a simple regex like:
/\A(https?:\/\/[^\/]+\/)/
(That is: Match up to the first slash after the protocol.)
Thoughts & solutions welcome. And apologies if this is a duplicate, but my search results weren't relevant.
With URI::join:
require 'uri'
url = "http://foo.example.com:8080/whatsit/foo.bar?x=y"
baseurl = URI.join(url, "/").to_s
#=> "http://foo.example.com:8080/"
Use URI.parse and then set the path to an empty string and the query to nil:
require 'uri'
uri = URI.parse('http://foo.example.com:8080/whatsit/foo.bar?x=y')
uri.path = ''
uri.query = nil
cleaned = uri.to_s # http://foo.example.com:8080
Now you have your cleaned up version in cleaned. Taking out what you don't want is sometimes easier than only grabbing what you need.
If you only do uri.query = '' you'll end up with http://foo.example.com:8080? which probably isn't what you want.
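If you want the trailing slash from the question and a guard against non-HTTP input, the URI.join approach can be wrapped in a small method. This is only a sketch: the name root_url and the decision to raise ArgumentError are assumptions.

```ruby
require 'uri'

# Hypothetical helper: returns scheme://host[:port]/ for an HTTP(S) URL.
def root_url(url)
  uri = URI.parse(url)
  # URI::HTTPS is a subclass of URI::HTTP, so this accepts both schemes.
  raise ArgumentError, "not an HTTP(S) URL: #{url}" unless uri.is_a?(URI::HTTP)

  URI.join(url, '/').to_s
end

puts root_url('http://foo.example.com:8080/whatsit/foo.bar?x=y')  # http://foo.example.com:8080/
puts root_url('https://example.net/')                             # https://example.net/
```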
You could use URI.split() and then put the parts back together...
WARNING: It's a little sloppy.
require 'uri'
url = "http://example.com:9001/over-nine-thousand"
# URI.split returns [scheme, userinfo, host, port, registry, path, opaque, query, fragment]
parts = URI.split(url)
puts "%s://%s:%s" % [parts[0], parts[2], parts[3]]
=> http://example.com:9001

Why does this regex check return true for this string?

I need a regex that will determine if a string is a tweet URL. I've got this
Regexp.new(/http:|https:\/\/(twitter\.com\/.*\/status\/.*|twitter\.com\/.*\/statuses\/.*|www\.twitter\.com\/.*\/status\/.*|www\.twitter\.com\/.*\/statuses\/.*|mobile\.twitter\.com\/.*\/status\/.*|mobile\.twitter\.com\/.*\/statuses\/.*)/i)
Why does it return true for the following?
"http://i.stack.imgur.com/QdOS0.jpg".match(Regexp.new(/http:|https:\/\/(twitter\.com\/.*\/status\/.*|twitter\.com\/.*\/statuses\/.*|www\.twitter\.com\/.*\/status\/.*|www\.twitter\.com\/.*\/statuses\/.*|mobile\.twitter\.com\/.*\/status\/.*|mobile\.twitter\.com\/.*\/statuses\/.*)/i))? true : false
=> true
http: will always match a URL starting with http:
Try the following:
/https?:\/\/(twitter\.com\/.*\/status\/.*|twitter\.com\/.*\/statuses\/.*|www\.twitter\.com\/.*\/status\/.*|www\.twitter\.com\/.*\/statuses\/.*|mobile\.twitter\.com\/.*\/status\/.*|mobile\.twitter\.com\/.*\/statuses\/.*)/i
The question mark will make the s optional, thus matching http or https.
Your regex could be abbreviated like this:
%r{^https?://(?:www\.|mobile\.)?twitter\.com/.*?/status(?:es)?/.*$}i
Explanation:
%r{ regex delimiter
^ start of line
https? http or https
:// ://
(?: start of non-capturing group
www\.|mobile\. www. or mobile.
)? end of group, optional
twitter\.com/ twitter.com/
.*? any number of any character, non-greedy
/status /status
(?:es)? non-capturing group that optionally matches `es`
/.* / followed by any number of any character
$ end of line
}i delimiter and case-insensitive flag
No need for regular expressions here (as usual).
require 'uri'
uri = URI.parse("http://www.twitter.com/status/12345")
p uri.host.split('.')[-2] == 'twitter' # returns true
More docs at: http://ruby-doc.org/stdlib/
You should group your OR-Clauses, like this:
(http:|https:)
Additionally, it wouldn't hurt to specify beginning and end of it:
^(http:|https:).*$
The start of your regex specifies an option of just 'http:', which naturally matches the URL you are testing. Depending on how strict you need your check to be, you could just remove the http/https parts from the start of the regex.
While many other answers show you a better regex, the answer is because /foo|bar/ will match either foo or bar, and what you wrote was /http:|.../, hence all URLs will be matched.
See @giraff's answer for how you could have written the alternation to do what you expect, or @M42's or @Koraktor's answers for a better regexp.
And as posted in the comments, note that you can write a regex literal as %r{...} instead of /.../, which is nice when you want to use / characters in your regex without escaping them.
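Combining the two ideas above (let URI do the parsing, keep the regex small), a check might be sketched like this. The method name tweet_url?, the host list, and the requirement that the status ID be numeric are all assumptions. The original pattern accepts anything after /status/.

```ruby
require 'uri'

# Assumed set of hosts, taken from the alternatives in the question's regex.
TWEET_HOSTS = %w[twitter.com www.twitter.com mobile.twitter.com].freeze
# Assumed path shape: /<user>/status/<digits> or /<user>/statuses/<digits>.
TWEET_PATH = %r{\A/[^/]+/status(?:es)?/\d+}

# Hypothetical helper: true only for HTTP(S) URLs on a Twitter host whose
# path looks like a status link.
def tweet_url?(str)
  uri = URI.parse(str)
  uri.is_a?(URI::HTTP) &&
    TWEET_HOSTS.include?(uri.host.to_s.downcase) &&
    TWEET_PATH.match?(uri.path)
rescue URI::InvalidURIError
  false
end

puts tweet_url?('https://twitter.com/someuser/status/12345')  # true
puts tweet_url?('http://i.stack.imgur.com/QdOS0.jpg')         # false
```

Checking the host separately means the scheme alternation bug from the question cannot recur, since the regex never sees the scheme at all.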
