Remove URL prefix in Ruby - ruby-on-rails

I have an S3 URL like so:
https://bucket.s3.amazonaws.com/uploads/1c4248b2-4256-4af4-ac1b-0e1e3f7ec2c8/filename.jpg
What I'd like to do, using Ruby, is remove the prefix https://bucket.s3.amazonaws.com/, leaving only uploads/1c4248b2-4256-4af4-ac1b-0e1e3f7ec2c8/filename.jpg.
I'm unsure whether using gsub to replace the hardcoded prefix with an empty string is the right way to go, or whether there's a more efficient approach.
url.gsub('https://bucket.s3.amazonaws.com/', '')

You can use URI from Ruby's standard library:
irb> require 'uri'
=> true
irb> u = URI("https://bucket.s3.amazonaws.com/uploads/1c4248b2-4256-4af4-ac1b-0e1e3f7ec2c8/filename.jpg")
=> #<URI::HTTPS:0x000000020995f0 URL:https://bucket.s3.amazonaws.com/uploads/1c4248b2-4256-4af4-ac1b-0e1e3f7ec2c8/filename.jpg>
irb> u.path
=> "/uploads/1c4248b2-4256-4af4-ac1b-0e1e3f7ec2c8/filename.jpg"
Alternatively, u.request_uri returns the path together with any query parameters on the URI.
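Note that u.path keeps the leading slash, while the question asks for the key without it. A minimal sketch of one way to drop it, assuming Ruby 2.5 or later for String#delete_prefix; the variable names url and key are only illustrative:

```ruby
require 'uri'

url = 'https://bucket.s3.amazonaws.com/uploads/1c4248b2-4256-4af4-ac1b-0e1e3f7ec2c8/filename.jpg'

# URI#path returns "/uploads/..."; remove the single leading slash to get the key.
key = URI(url).path.delete_prefix('/')
puts key
```

Unlike the hardcoded gsub, this does not depend on the bucket name or the scheme.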

Related

How to get the hostname from a url with accented letters inside in Ruby

I have the following url inside a field of model:
https://www.reddit.com/r/italy/comments/i6ix3x/trenitalia_sostiene_che_potrà_non_rispettare_il/?sort=new
Inside the URL there is an accented letter (à). If I use URI.parse to get the hostname, it gives me the following error:
URI::InvalidURIError: URI must be ascii only "https://www.reddit.com/r/italy/comments/i6ix3x/trenitalia_sostiene_che_potr\u00E0_non_rispettare_il/?sort=new"
The method URI.encode resolves the problem, but I discovered that URI.encode is obsolete and should not be used.
Which method should I use as a replacement for URI.encode?
This is an encoding issue, and you can handle it as below.
First, encode your URI (note that URI.encode is deprecated and was removed in Ruby 3.0, so this only works on older Rubies):
encoded_url = URI.encode('https://www.reddit.com/r/italy/comments/i6ix3x/trenitalia_sostiene_che_potrà_non_rispettare_il/?sort=new')
And then parse it
URI.parse(encoded_url)
good luck
The only solution I found uses the Addressable gem (https://github.com/sporkmonger/addressable):
Addressable::URI.parse('https://www.reddit.com/r/italy/comments/i6ix3x/trenitalia_sostiene_che_potrà_non_rispettare_il/?sort=new').host
Perhaps this could be an inelegant solution:
URI.parse(URI.extract(target.url).first)
# => #<URI::HTTPS https://www.reddit.com/r/italy/comments/i6ix3x/trenitalia_sostiene_che_potr>
Then I call the host method:
URI.parse(URI.extract(target.url).first).host
# => "www.reddit.com"
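For a standard-library-only alternative to URI.encode, one possibility is to percent-encode just the non-ASCII characters before parsing. This is a minimal sketch, not a full replacement for URI.encode: the helper name escape_non_ascii is hypothetical, and it assumes the non-ASCII characters sit in the path or query (an internationalized hostname would need different handling).

```ruby
require 'uri'

# Hypothetical helper: percent-encode every non-ASCII character as its UTF-8 bytes,
# leaving the ASCII parts of the URL (scheme, slashes, query separators) untouched.
def escape_non_ascii(url)
  url.gsub(/[^[:ascii:]]/) do |char|
    char.bytes.map { |byte| format('%%%02X', byte) }.join
  end
end

url = 'https://www.reddit.com/r/italy/comments/i6ix3x/trenitalia_sostiene_che_potrà_non_rispettare_il/?sort=new'
escaped = escape_non_ascii(url)

puts escaped
puts URI.parse(escaped).host
```

Because the whole path is preserved, this avoids the truncation seen with URI.extract above.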

Sanitizing Unicode strings for URL slugs (Ruby/Rails)

I have UTF-8 encoded post titles which I'd rather show using the appropriate characters in slugs. An example is Amazon Japan's URL here.
How can any arbitrary string be converted to a safe URL slug such as this, with Ruby (or Rails)?
(There are some related PHP posts, but nothing I could find for Ruby.)
From reading here it seems like a solution is this:
require 'open-uri'
str = "\x12\x34\x56\x78\x9a\xbc\xde\xf1\x23\x45\x67\x89\xab\xcd\xef\x12\x34\x56\x78\x9a".force_encoding('ASCII-8BIT')
puts URI::encode(str)
Here is the documentation for open-uri, and here is some info on UTF-8 encoded URL schemas.
EDIT: Having looked into this more, I noticed that encode is just an alias for URI.escape, which is documented here. An example taken from the docs is below:
require 'uri'
enc_uri = URI.escape("http://example.com/?a=\11\15")
p enc_uri
# => "http://example.com/?a=%09%0D"
p URI.unescape(enc_uri)
# => "http://example.com/?a=\t\r"
p URI.escape("#?#!", "!?")
# => "#%3F#%21"
Let me know if this is what you were looking for.
EDIT #2: I was interested and kept looking a little more. According to the comments, Ryan Bates' RailsCasts episode on FriendlyId also seems to work with Chinese characters.
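If the goal is a slug that keeps the original Unicode characters rather than percent-escapes, a plain-Ruby sketch might look like the following. The helper name unicode_slug is hypothetical, and the rule it applies (keep Unicode letters, combining marks, and digits; turn everything else into hyphens) is an assumption about what counts as "safe", not an established standard.

```ruby
# Hypothetical helper: build a slug that keeps Unicode letters, marks and digits
# and collapses every other run of characters into a single hyphen.
def unicode_slug(title)
  title.strip
       .downcase
       .gsub(/[^\p{L}\p{M}\p{N}]+/, '-') # any run of other characters becomes "-"
       .gsub(/\A-+|-+\z/, '')            # trim hyphens left at either end
end

puts unicode_slug('Héllo, Wörld!')   # prints héllo-wörld
puts unicode_slug('日本語 タイトル')  # prints 日本語-タイトル
```

Browsers generally display such paths with the original characters, although they are percent-encoded when sent over the wire.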

Get URL without filename?

I'm trying to figure out how to parse a URL in Rails and return everything except the filename, that is, everything except what follows the last forward slash.
For example, I'd like:
http://bucket.s3.amazonaws.com/directoryname/1234/thumbnail.jpg
to become:
http://bucket.s3.amazonaws.com/directoryname/1234/
I've found every way to parse a URI, but this. Any help would be appreciated.
Ruby has methods available to get you there easily:
File.dirname(URL) # => "http://bucket.s3.amazonaws.com/directoryname/1234"
Think about what a URL/URI is: It's a designator for a protocol, a site, and a directory-path to a resource. The directory-path to a resource is the same as a "path/to/file", so File.dirname works nicely, without having to reinvent that particular wheel.
The trailing / isn't included because it's a delimiter between the path segments. You generally don't need that, because joining a resource to a path will automatically supply it.
Really though, using Ruby's URI class is the proper way to mangle URIs:
require 'uri'
URL = 'http://bucket.s3.amazonaws.com/directoryname/1234/thumbnail.jpg'
uri = URI.parse(URL)
uri.merge('foo.html').to_s
# => "http://bucket.s3.amazonaws.com/directoryname/1234/foo.html"
URI allows you to mess with the path easily too:
uri.merge('../foo.html').to_s
# => "http://bucket.s3.amazonaws.com/directoryname/foo.html"
uri.merge('../bar/foo.html').to_s
# => "http://bucket.s3.amazonaws.com/directoryname/bar/foo.html"
URI is well-tested, and designed for this purpose. It will also allow you to add query parameters easily, encoding them as necessary.
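To get exactly the output asked for in the question (the URL up to and including the last slash), the same URI object can be used by trimming its path. A minimal sketch, using a lowercase url variable for illustration instead of the URL constant above:

```ruby
require 'uri'

url = 'http://bucket.s3.amazonaws.com/directoryname/1234/thumbnail.jpg'
uri = URI.parse(url)

# Keep the path up to and including its last slash, dropping the filename.
uri.path = uri.path[0..uri.path.rindex('/')]
puts uri.to_s
```

Any query string or fragment on the original URL would be kept as-is; set uri.query and uri.fragment to nil if those should go too.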
File name
"http://bucket.s3.amazonaws.com/directoryname/1234/thumbnail.jpg".match(/(.*\/)+(.*$)/)[2]
=> "thumbnail.jpg"
URL without the file name
"http://bucket.s3.amazonaws.com/directoryname/1234/thumbnail.jpg".match(/(.*\/)+(.*$)/)[1]
=> "http://bucket.s3.amazonaws.com/directoryname/1234/"
String#match
'http://a.b.pl/a/b/c/d.jpg'.rpartition('/').first
=> "http://a.b.pl/a/b/c"

Getting all the links of a web page in ruby without using inbuilt library

I'm a beginner in Ruby. I want a Ruby script that fetches every link associated with a domain, without using gems.
For example, if I enter the URL http://hsps.in
my expected output is:
hsps.in/contacts
hsps.in/projects
hsps.in/blog ..etc
Can anyone tell me how I can achieve this?
open-uri is part of the standard library, but you'll need to install the nokogiri gem, which makes things a lot easier:
require 'open-uri'
require 'nokogiri'
url = 'http://hsps.in'
doc = Nokogiri::HTML(URI.open(url)) # Kernel#open no longer handles URLs as of Ruby 3.0
links = doc.css('a')
links.each { |link| puts link['href'] }
RegExp is your friend :)
Maybe this gist I created a while ago will help you.
In line 570 I use a Regexp to scan for links:
toScan[:links] = toScan[:response].body.scan(/https?:\/\/[^:\s"'<>#\(\)\[\]\{\},;]+/mi)
and in line 572 I use this Regexp to scan for internal links:
interneLinks = toScan[:response].body.scan(/href\s*=\s*['"]\/?[^\s:'"<>#\(\)\[\]\{\},;]+/im )
I also didn't want to use gems and wanted to do it on my own, so I used a Regexp. Regular expressions let you work with text patterns; they are like a small language for identifying text in a string (in your case, URLs). :) There may be a better regexp for links (Google could find one), but I wanted to deal with it on my own.
Hopefully this helps with your case.
In your controller action
arr = []
routes = %x[rake routes]
routes.split(' ').map{|rt| arr << rt if rt.count('/') > 0 && rt.count('#') == 0}
puts arr.uniq
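require 'open-uri'

class PageLinks
  attr_reader :page

  def initialize(url)
    # URI.open comes from open-uri; plain Kernel#open stopped handling URLs in Ruby 3.0
    @page = URI.open(url).readlines
  end

  # Returns every line of the page that contains "href"
  def links
    @page.grep(/href/)
  end
end

url = 'http://www.hsps.in'
doc = PageLinks.new(url)
puts doc.links.inspect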
As you said 'without using any gems' I will take it that includes Rails even though it is tagged as such.
This is not a 'clean' answer, as it doesn't extract the href values of the a tags. But it should demonstrate that it can indeed be done with no gems, using only what comes with Ruby.
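Going one step further with only the standard library, the href values can be pulled out with a regular expression and resolved against the site's base URL. This is a rough sketch under stated assumptions: the helper name same_site_links and the sample HTML are hypothetical, and a regex is a fragile way to read HTML (it will miss unquoted attributes, for example). In real use the html string could come from URI.open(url).read.

```ruby
require 'uri'

# Hypothetical helper: pull href values out of <a> tags with a regular expression,
# resolve them against base_url, and keep only links on the same host.
def same_site_links(html, base_url)
  base_host = URI.parse(base_url).host
  hrefs = html.scan(/<a\s[^>]*?href\s*=\s*["']([^"']+)["']/i).flatten

  links = hrefs.map do |href|
    begin
      URI.join(base_url, href)
    rescue URI::Error
      nil # skip hrefs that are not valid URIs
    end
  end

  links.compact.select { |uri| uri.host == base_host }.map(&:to_s).uniq
end

# Sample markup standing in for a fetched page (made up for this example).
html = <<~HTML
  <a href="/contacts">Contacts</a>
  <a class="nav" href='/projects'>Projects</a>
  <a href="http://hsps.in/blog">Blog</a>
  <a href="http://example.com/other">Elsewhere</a>
HTML

puts same_site_links(html, 'http://hsps.in/')
```

With the sample markup this prints the three hsps.in links and drops the example.com one.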

How to retrieve the 'scheme://domain' part of a generic URL?

I am using Ruby on Rails 3.0.10 and I would like to retrieve the scheme://domain part of a generic URL (note: a URL syntax can be also scheme://domain:port/path?query_string#fragment_id).
That is, for example, if I have the following URLs
http://stackoverflow.com/questions/7304043/how-to-retrieve-the-scheme-domainport-part-of-a-generic-url
ftp://some_link.org/some_other_string
# Consider also 'https', 'ftps' and so on... 'scheme:' values.
I would like to retrieve just the
# Note: The last '/' character was removed.
http://stackoverflow.com
ftp://some_link.org
# Consider also 'https', 'ftps' and so on... 'scheme:' values.
parts. How can I do that?
require 'uri'
uri = URI.parse("http://stackoverflow.com/questions/7304043/how-to-retrieve-the-scheme-domainport-part-of-a-generic-url")
url = "#{uri.scheme}://#{uri.host}" # url is now "http://stackoverflow.com"
From Module: URI.
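Since the question notes that a URL may also carry an explicit port, a small variation can keep the port when it differs from the scheme's default. A minimal sketch; the helper name origin and the example.com and example.org URLs are made up for illustration:

```ruby
require 'uri'

# Hypothetical helper: return "scheme://host", adding ":port" only when the URL
# uses a port other than the default one for its scheme.
def origin(url)
  uri = URI.parse(url)
  base = "#{uri.scheme}://#{uri.host}"
  uri.port && uri.port != uri.default_port ? "#{base}:#{uri.port}" : base
end

puts origin('http://stackoverflow.com/questions/7304043/how-to-retrieve-the-scheme-domainport-part-of-a-generic-url')
puts origin('https://example.com:8443/path?query_string#fragment_id')
puts origin('ftp://example.org/some_other_string')
```

For schemes Ruby's URI does not know about, default_port is nil, so any port written in the URL is kept as given.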
