Sanitizing Unicode strings for URL slugs (Ruby/Rails) - ruby-on-rails

I have UTF-8 encoded post titles which I'd rather show using the appropriate characters in slugs. An example is Amazon Japan's URL here.
How can any arbitrary string be converted to a safe URL slug such as this, with Ruby (or Rails)?
(There are some related PHP posts, but nothing I could find for Ruby.)

From reading here it seems like a solution is this:
require 'open-uri'
str = "\x12\x34\x56\x78\x9a\xbc\xde\xf1\x23\x45\x67\x89\xab\xcd\xef\x12\x34\x56\x78\x9a".force_encoding('ASCII-8BIT')
puts URI::encode(str)
Here is the documentation for open-uri, and here is some info on UTF-8 encoded URL schemas.
EDIT: Having looked into this more, I noticed that encode is just an alias for URI.escape, which is documented here. Note that both methods were long marked obsolete and were removed in Ruby 3.0. An example taken from the docs is below:
require 'uri'
enc_uri = URI.escape("http://example.com/?a=\11\15")
p enc_uri
# => "http://example.com/?a=%09%0D"
p URI.unescape(enc_uri)
# => "http://example.com/?a=\t\r"
p URI.escape("#?#!", "!?")
# => "#%3F#%21"
Let me know if this is what you were looking for.
EDIT #2: I was interested and kept looking a little more. According to the comments, Ryan Bates' Railscast on FriendlyId also seems to work with Chinese characters.
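As an alternative to percent-encoding the whole title, here is a minimal sketch of a slug helper that keeps Unicode letters and digits and only replaces everything else. The method name unicode_slug is hypothetical (it is not part of Ruby, Rails, or FriendlyId), and the rules for what counts as "safe" are an assumption you may want to adjust:

```ruby
require 'uri'

# Hypothetical helper: builds a slug that keeps Unicode letters and digits.
def unicode_slug(title)
  title
    .unicode_normalize(:nfc)            # compose characters consistently
    .downcase                           # no effect on scripts without case
    .gsub(/[^\p{L}\p{M}\p{N}]+/, '-')   # collapse every other run into one dash
    .gsub(/\A-+|-+\z/, '')              # trim leading and trailing dashes
end

slug = unicode_slug('Hello, World! 日本語 タイトル')
puts slug

# If a percent-encoded form is needed (e.g. for a Location header),
# the standard library can produce it from the same slug.
puts URI.encode_www_form_component(slug)
```

Browsers generally display the Unicode form in the address bar and send the percent-encoded form over the wire, so storing the readable slug and encoding it only when required is usually enough.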

Related

Can not transliterate strings with CP850 encoding

For my Blog App I use FriendlyId to generate slugs.
When I create a post in irb, the following message appears:
ArgumentError (Can not transliterate strings with CP850 encoding)
I found out that the error appears because of a space between words in the title only, so it probably has something to do with friendly_id.
I develop on Windows 10
Encoding.default_internal = #<Encoding:UTF-8>
Encoding.default_external = #<Encoding:UTF-8>
Encoding.locale_charmap = "CP850"
I would like to use FriendlyId but also be able to use spaces in post title. Any ideas?
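One possible workaround, sketched under the assumption that the title typed into the console arrives tagged with the console's CP850 encoding: transcode it to UTF-8 before it reaches the slug generator. The helper name to_utf8 is hypothetical, and where to call it (for example in a before_validation callback) is left open:

```ruby
# Hypothetical helper: returns a UTF-8 copy of the given string.
# Assumption: the input is correctly tagged with its real encoding
# (e.g. CP850 when typed into a Windows console).
def to_utf8(str)
  return str if str.encoding == Encoding::UTF_8

  # Replace anything that cannot be converted instead of raising.
  str.encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
end

# Simulate a console string by transcoding a literal to CP850 first.
title = 'My first post'.encode('CP850')
puts title.encoding
puts to_utf8(title).encoding
```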

How to automatically sanitize ActiveStorage Blob filenames on or before upload?

I am using ActiveStorage on a large Rails app and the client seems to be unable to sanitize their own filenames from non-url-safe characters. I cannot find a way to do this automatically within the framework itself. I would rather not just output a sanitized filename when the file URL is output on the frontend, as that might be confusing to other developers debugging issues later on.
ActiveStorage::Blob.filename returns a Filename instance which has a sanitized method on it, however I cannot seem to see where it is actually used?
Blob::filename: https://github.com/rails/rails/blob/master/activestorage/app/models/active_storage/blob.rb#L139
# Returns an ActiveStorage::Filename instance of the filename that can be
# queried for basename, extension, and a sanitized version of the filename
# that's safe to use in URLs.
def filename
ActiveStorage::Filename.new(self[:filename])
end
Filename: https://github.com/rails/rails/blob/master/activestorage/app/models/active_storage/filename.rb#L57
# Returns the sanitized filename.
#
# ActiveStorage::Filename.new("foo:bar.jpg").sanitized # => "foo-bar.jpg"
# ActiveStorage::Filename.new("foo/bar.jpg").sanitized # => "foo-bar.jpg"
#
# Characters considered unsafe for storage (e.g. \, $, and the RTL override character) are replaced with a dash.
def sanitized
@filename.encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: "�").strip.tr("\u{202E}%$|:;/\t\r\n\\", "-")
end
The Rails docs describe how to attach files; however, this is for individual files and wouldn't be suitable if the client is uploading multiple files at once:
https://edgeguides.rubyonrails.org/active_storage_overview.html#attaching-file-io-objects
Another SO Question asked something similar, but requires the record to be updated, rather than sanitizing on create:
https://stackoverflow.com/a/52957480/3588645
A Medium article suggests adding an after-save callback on a model, but the app uses multiple attachment names per model and multiple models with attachments, so again, this would not be a suitable solution:
https://medium.com/fiatinsight/how-to-change-a-filename-in-rails-active-storage-f3e4f26f427e
I feel like this should be pretty simple; however, it's proving to be quite the problem.
Any help would be appreciated.
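For reference, the transformation performed by the sanitized method quoted above can be reproduced in plain Ruby, which makes it possible to clean a name before it is ever handed to an attachment. This is only a sketch: sanitize_filename and UNSAFE_FILENAME_CHARS are hypothetical names, and where to call the helper (controller, form object, or a custom attach wrapper) is an assumption left to the app:

```ruby
# Same character set as the quoted ActiveStorage::Filename#sanitized source.
UNSAFE_FILENAME_CHARS = "\u{202E}%$|:;/\t\r\n\\"

# Hypothetical helper mirroring the quoted method, usable without Rails.
def sanitize_filename(name)
  name
    .encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: "\u{FFFD}")
    .strip                              # drop surrounding whitespace
    .tr(UNSAFE_FILENAME_CHARS, '-')     # replace each unsafe character with a dash
end

puts sanitize_filename('foo:bar.jpg')
puts sanitize_filename('foo/bar.jpg')
```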

Get URL without filename?

I'm trying to figure out how to parse a URL in Rails and return everything except the filename, that is, everything except what follows the last slash.
For example, I'd like:
http://bucket.s3.amazonaws.com/directoryname/1234/thumbnail.jpg
to become:
http://bucket.s3.amazonaws.com/directoryname/1234/
I've found every way to parse a URI, but this. Any help would be appreciated.
Ruby has methods available to get you there easily:
File.dirname(URL) # => "http://bucket.s3.amazonaws.com/directoryname/1234"
Think about what a URL/URI is: It's a designator for a protocol, a site, and a directory-path to a resource. The directory-path to a resource is the same as a "path/to/file", so File.dirname works nicely, without having to reinvent that particular wheel.
The trailing / isn't included because it's a delimiter between the path segments. You generally don't need that, because joining a resource to a path will automatically supply it.
Really though, using Ruby's URI class is the proper way to mangle URIs:
require 'uri'
URL = 'http://bucket.s3.amazonaws.com/directoryname/1234/thumbnail.jpg'
uri = URI.parse(URL)
uri.merge('foo.html').to_s
# => "http://bucket.s3.amazonaws.com/directoryname/1234/foo.html"
URI allows you to mess with the path easily too:
uri.merge('../foo.html').to_s
# => "http://bucket.s3.amazonaws.com/directoryname/foo.html"
uri.merge('../bar/foo.html').to_s
# => "http://bucket.s3.amazonaws.com/directoryname/bar/foo.html"
URI is well-tested, and designed for this purpose. It will also allow you to add query parameters easily, encoding them as necessary.
File name
"http://bucket.s3.amazonaws.com/directoryname/1234/thumbnail.jpg".match(/(.*\/)+(.*$)/)[2]
=> "thumbnail.jpg"
URL without the file name
"http://bucket.s3.amazonaws.com/directoryname/1234/thumbnail.jpg".match(/(.*\/)+(.*$)/)[1]
=> "http://bucket.s3.amazonaws.com/directoryname/1234/"
String#match
'http://a.b.pl/a/b/c/d.jpg'.rpartition('/').first
=> "http://a.b.pl/a/b/c"
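Combining the ideas above, here is a small sketch that uses URI to trim only the path, so the trailing slash is kept and any query string or fragment is dropped. The method name url_without_filename is hypothetical:

```ruby
require 'uri'

# Hypothetical helper: returns the URL up to and including the last
# slash of its path, discarding the query string and fragment.
def url_without_filename(url)
  uri = URI.parse(url)
  last_slash = uri.path.rindex('/')
  # An empty path (e.g. "http://example.com") falls back to "/".
  uri.path = last_slash ? uri.path[0..last_slash] : '/'
  uri.query = nil
  uri.fragment = nil
  uri.to_s
end

puts url_without_filename('http://bucket.s3.amazonaws.com/directoryname/1234/thumbnail.jpg')
# => http://bucket.s3.amazonaws.com/directoryname/1234/
```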

Getting all the links of a web page in ruby without using inbuilt library

I'm a beginner in Ruby. I want a Ruby script that fetches every link associated with a domain without using gems.
For example,
if I enter the URL http://hsps.in,
my expected output is:
hsps.in/contacts
hsps.in/projects
hsps.in/blog ..etc
Can anyone tell me how I can achieve this?
open-uri is part of the standard library, but you'll need to install the nokogiri gem; it makes things a lot easier.
require 'open-uri'
require 'nokogiri'
url = 'http://hsps.in'
doc = Nokogiri::HTML(URI.open(url)) # Kernel#open no longer opens URLs on Ruby 3.0+
links = doc.css('a')
links.each { |link| puts link['href'] }
Regexp is your friend :)
Maybe this gist, which I created a while ago, will help you.
In line 570 I use a Regexp to scan for links:
toScan[:links] = toScan[:response].body.scan(/https?:\/\/[^:\s"'<>#\(\)\[\]\{\},;]+/mi)
and in line 572 I use this Regexp to scan for internal links:
interneLinks = toScan[:response].body.scan(/href\s*=\s*['"]\/?[^\s:'"<>#\(\)\[\]\{\},;]+/im )
I also didn't want to use gems and wanted to do it on my own, so I used a Regexp. Regular expressions let you work with text patterns. They are like a small language you can use to identify text in a string (in your case, URLs). :) Maybe there is a better regexp for links (Google could find one), but I wanted to deal with it on my own.
Hopefully this helps with your case.
In your controller action
arr = []
routes = %x[rake routes]
routes.split(' ').map{|rt| arr << rt if rt.count('/') > 0 && rt.count('#') == 0}
puts arr.uniq
require 'open-uri'
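class PageLinks
  attr_reader :page
  include OpenURI

  def initialize(url)
    # URI.open is provided by open-uri (Ruby 2.5+); plain open(url) no longer works on Ruby 3.0+
    @page = URI.open(url).readlines
  end

  # Returns every line of the page that mentions an href attribute.
  def links
    @page.grep(/href/)
  end
end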
url = 'http://www.hsps.in'
doc = PageLinks.new url
puts doc.links.inspect
As you said 'without using any gems', I take it that includes Rails, even though the question is tagged as such.
This is not a 'clean' answer, as it doesn't extract the href values of the a tags. But it should demonstrate that this can indeed be done with no gems, using only what comes with Ruby.
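To go one step further with the standard library only, the href values can be pulled out with a regexp and resolved against the base URL. This is a rough sketch rather than a real HTML parser: the method name internal_links is hypothetical, the regexp only handles quoted href attributes on a tags, and the HTML is passed in as a string (it could come from Net::HTTP.get(URI(url)), which is not shown here):

```ruby
require 'uri'

# Hypothetical helper: returns absolute URLs of links on the same host.
def internal_links(html, base_url)
  base = URI.parse(base_url)
  # Capture the quoted value of href inside each <a ...> tag.
  hrefs = html.scan(/<a\s[^>]*?href\s*=\s*["']([^"']+)["']/i).flatten

  links = hrefs.map do |href|
    begin
      link = base.merge(href)           # resolves relative links
      link.to_s if link.host == base.host
    rescue URI::Error
      nil                               # skip hrefs that URI cannot parse
    end
  end
  links.compact.uniq
end

html = <<~HTML
  <a href="/contacts">Contacts</a>
  <a class="nav" href='blog'>Blog</a>
  <a href="http://example.com/elsewhere">External</a>
HTML

puts internal_links(html, 'http://hsps.in/')
```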

(rails) validating URL help with regexp

I'm using the following to verify that a URL is validly formatted:
validates_format_of :website, :with => URI::regexp(%w(http https))
However, it doesn't work when the URL doesn't start with http:// or https://. Is there some similar way to validate URLs with URI::regexp (or URI) and make it accept valid URLs that don't start with http://? (For example, www.google.com is valid, as is http://www.google.com.)
This post on Daring Fireball provides a robust regex:
\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))
A more recent post improves on it (N.B. newlines and indentation added here for clarity; see the post for an even more expanded version with explanations):
(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|
www\d{0,3}[.]|
[a-z0-9.\-]+[.][a-z]{2,4}/)
(?:[^\s()<>]+|
\(([^\s()<>]+|
(\([^\s()<>]+\)))*\))+
(?:\(([^\s()<>]+|
(\([^\s()<>]+\)))*\)|
[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
From my tests, URI::regexp is too loose in its definition of a URI (though it does require http…).
You can use a virtual attribute or before_save filter to append a http:// to your URLs if necessary.
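The idea of prepending a scheme before validating can be sketched in plain Ruby as follows. The method names normalize_url and valid_http_url? are hypothetical, and the rule "host must contain a dot" is a simplifying assumption, not something URI itself requires:

```ruby
require 'uri'

# Hypothetical helper: prepends http:// when no http(s) scheme is present.
def normalize_url(raw)
  url = raw.to_s.strip
  url = "http://#{url}" unless url.match?(/\Ahttps?:\/\//i)
  url
end

# Hypothetical check: parses the normalized URL and requires an
# HTTP(S) URI whose host contains at least one dot.
def valid_http_url?(raw)
  uri = URI.parse(normalize_url(raw))
  uri.is_a?(URI::HTTP) && !uri.host.nil? && uri.host.include?('.')
rescue URI::InvalidURIError
  false
end

puts valid_http_url?('www.google.com')
puts valid_http_url?('http://www.google.com')
puts valid_http_url?('not a url')
```

In a Rails model, normalize_url could run in a before_validation callback and valid_http_url? inside a custom validate method; that wiring is left out here.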
This is a Ruby interpretation (with escaped forward slashes):
(?i)\b((?:https?:(?:\/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!#)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b\/?(?!#)))
Original gist by Gruber
Please copy it to Rubular for testing; I could not make a permanent link, as the regexp is probably too long.
It works with and without http, and it works with short domains like 'google.com'.
