I have UTF-8 encoded post titles which I'd rather show using the appropriate characters in slugs. An example is Amazon Japan's URL here.
How can any arbitrary string be converted to a safe URL slug such as this, with Ruby (or Rails)?
(There are some related PHP posts, but nothing I could find for Ruby.)
From reading here it seems like a solution is this:
require 'open-uri'
str = "\x12\x34\x56\x78\x9a\xbc\xde\xf1\x23\x45\x67\x89\xab\xcd\xef\x12\x34\x56\x78\x9a".force_encoding('ASCII-8BIT')
puts URI::encode(str)
Here is the documentation for open-uri, and here is some information on UTF-8 encoded URL schemes.
EDIT: Having looked into this more, I noticed that encode is just an alias for URI.escape, which is documented here. An example taken from the docs is below:
require 'uri'
enc_uri = URI.escape("http://example.com/?a=\11\15")
p enc_uri
# => "http://example.com/?a=%09%0D"
p URI.unescape(enc_uri)
# => "http://example.com/?a=\t\r"
p URI.escape("#?#!", "!?")
# => "#%3F#%21"
Let me know if this is what you were looking for.
EDIT #2: I was interested and kept looking a little more. According to the comments, Ryan Bates' RailsCast on FriendlyId also seems to work with Chinese characters.
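Note that URI.escape and URI.encode were deprecated and later removed in Ruby 3.0, so the calls above fail on current Rubies. Below is a minimal sketch of one alternative that uses only the standard library: keep Unicode letters and digits in the slug, and percent-encode the result only where a pure-ASCII URL is required. The method name unicode_slug is hypothetical, and collapsing everything else into a single dash is an assumption, not something Rails or FriendlyId prescribes.

```ruby
require 'uri'

# Hypothetical helper: builds a slug that keeps non-ASCII letters intact.
def unicode_slug(title)
  title.unicode_normalize(:nfc)        # compose characters into a canonical form
       .downcase                       # Unicode-aware since Ruby 2.4
       .gsub(/[^\p{L}\p{N}]+/, '-')    # each run of non-letters/non-digits becomes one dash
       .gsub(/\A-+|-+\z/, '')          # trim leading and trailing dashes
end

slug = unicode_slug("日本語 タイトル")
puts slug                                   # => 日本語-タイトル
# Percent-encoded form, for contexts that only accept ASCII:
puts URI.encode_www_form_component(slug)
```

Browsers generally display the percent-encoded form as the original characters in the address bar, so either form can be used in links.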
Related
For my blog app I use FriendlyId to generate slugs.
When I create a post in irb, the following message appears:
ArgumentError (Can not transliterate strings with CP850 encoding)
I found out that the error appears only when there is a space between words in the title, so it probably has something to do with friendly_id.
I develop on Windows 10:
Encoding.default_internal = #<Encoding:UTF-8>
Encoding.default_external = #<Encoding:UTF-8>
Encoding.locale_charmap = "CP850"
I would like to use FriendlyId but also be able to use spaces in post title. Any ideas?
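One possible explanation, offered as an assumption rather than a diagnosis: strings typed into irb on this console are tagged with the console code page (CP850), and the transliteration step behind slug generation only accepts a few encodings such as UTF-8. A minimal sketch of transcoding a title to UTF-8 before it is assigned follows; to_utf8 is a hypothetical helper name and the sample title is made up.

```ruby
# Hypothetical helper: returns a UTF-8 copy of the given string.
def to_utf8(str)
  str.encode(Encoding::UTF_8)
end

# Simulate a title as it might arrive from a CP850 console (assumption).
console_title = "Über uns".encode("CP850")

title = to_utf8(console_title)
puts title.encoding == Encoding::UTF_8   # => true
puts title                               # => Über uns
```

If this is the cause, calling something like to_utf8 on the title before saving the post should let titles with spaces through.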
I am using ActiveStorage on a large Rails app and the client seems to be unable to sanitize their own filenames from non-url-safe characters. I cannot find a way to do this automatically within the framework itself. I would rather not just output a sanitized filename when the file URL is output on the frontend, as that might be confusing to other developers debugging issues later on.
ActiveStorage::Blob.filename returns a Filename instance which has a sanitized method on it; however, I cannot see where it is actually used.
Blob::filename: https://github.com/rails/rails/blob/master/activestorage/app/models/active_storage/blob.rb#L139
# Returns an ActiveStorage::Filename instance of the filename that can be
# queried for basename, extension, and a sanitized version of the filename
# that's safe to use in URLs.
def filename
  ActiveStorage::Filename.new(self[:filename])
end
Filename: https://github.com/rails/rails/blob/master/activestorage/app/models/active_storage/filename.rb#L57
# Returns the sanitized filename.
#
# ActiveStorage::Filename.new("foo:bar.jpg").sanitized # => "foo-bar.jpg"
# ActiveStorage::Filename.new("foo/bar.jpg").sanitized # => "foo-bar.jpg"
#
# Characters considered unsafe for storage (e.g. \, $, and the RTL override character) are replaced with a dash.
def sanitized
  @filename.encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: "�").strip.tr("\u{202E}%$|:;/\t\r\n\\", "-")
end
The Rails docs describe how to attach files; however, this is for individual files and wouldn't be suitable if the client is uploading multiple files at once:
https://edgeguides.rubyonrails.org/active_storage_overview.html#attaching-file-io-objects
Another SO Question asked something similar, but requires the record to be updated, rather than sanitizing on create:
https://stackoverflow.com/a/52957480/3588645
A Medium article suggests adding an after-save callback on a model, but the app uses multiple attachment names per model and multiple models with attachments, so again, this would not be a suitable solution:
https://medium.com/fiatinsight/how-to-change-a-filename-in-rails-active-storage-f3e4f26f427e
I feel like this should be pretty simple; however, it's proving to be quite the problem.
Any help would be appreciated.
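For reference, here is a minimal sketch of the sanitizing step on its own, as a plain Ruby function that mirrors the sanitized method quoted above. The name sanitize_filename is hypothetical, and where to call it (for example, before an upload is attached) is left open, since that depends on how the app receives its files.

```ruby
# Hypothetical helper mirroring the quoted ActiveStorage::Filename#sanitized.
def sanitize_filename(name)
  name.encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: "\u{FFFD}")
      .strip                                   # drop leading/trailing whitespace
      .tr("\u{202E}%$|:;/\t\r\n\\", "-")       # replace unsafe characters with a dash
end

puts sanitize_filename("foo:bar.jpg")          # => foo-bar.jpg
puts sanitize_filename("foo/bar.jpg")          # => foo-bar.jpg
puts sanitize_filename(" report|2020.pdf ")    # => report-2020.pdf
```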
I'm trying to figure out how to parse a URL in Rails and return everything except the filename, that is, everything except what follows the last slash.
For example, I'd like:
http://bucket.s3.amazonaws.com/directoryname/1234/thumbnail.jpg
to become:
http://bucket.s3.amazonaws.com/directoryname/1234/
I've found every way to parse a URI but this one. Any help would be appreciated.
Ruby has methods available to get you there easily:
File.dirname(URL) # => "http://bucket.s3.amazonaws.com/directoryname/1234"
Think about what a URL/URI is: It's a designator for a protocol, a site, and a directory-path to a resource. The directory-path to a resource is the same as a "path/to/file", so File.dirname works nicely, without having to reinvent that particular wheel.
The trailing / isn't included because it's a delimiter between the path segments. You generally don't need that, because joining a resource to a path will automatically supply it.
Really though, using Ruby's URI class is the proper way to mangle URIs:
require 'uri'
URL = 'http://bucket.s3.amazonaws.com/directoryname/1234/thumbnail.jpg'
uri = URI.parse(URL)
uri.merge('foo.html').to_s
# => "http://bucket.s3.amazonaws.com/directoryname/1234/foo.html"
URI allows you to mess with the path easily too:
uri.merge('../foo.html').to_s
# => "http://bucket.s3.amazonaws.com/directoryname/foo.html"
uri.merge('../bar/foo.html').to_s
# => "http://bucket.s3.amazonaws.com/directoryname/bar/foo.html"
URI is well-tested, and designed for this purpose. It will also allow you to add query parameters easily, encoding them as necessary.
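For the original question (everything up to and including the last slash), a small sketch along the same lines follows. The name directory_url is hypothetical, and dropping the query string and fragment is an assumption about what is wanted.

```ruby
require 'uri'

# Hypothetical helper: returns the URL with the last path segment removed,
# keeping the trailing slash.
def directory_url(url)
  uri = URI.parse(url)
  uri.path = uri.path.sub(%r{[^/]*\z}, '')   # strip everything after the last "/"
  uri.query = nil                            # assumption: query is not wanted
  uri.fragment = nil                         # assumption: fragment is not wanted
  uri.to_s
end

puts directory_url('http://bucket.s3.amazonaws.com/directoryname/1234/thumbnail.jpg')
# => http://bucket.s3.amazonaws.com/directoryname/1234/
```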
File name
"http://bucket.s3.amazonaws.com/directoryname/1234/thumbnail.jpg".match(/(.*\/)+(.*$)/)[2]
=> "thumbnail.jpg"
URL without the file name
"http://bucket.s3.amazonaws.com/directoryname/1234/thumbnail.jpg".match(/(.*\/)+(.*$)/)[1]
=> "http://bucket.s3.amazonaws.com/directoryname/1234/"
String#match
'http://a.b.pl/a/b/c/d.jpg'.rpartition('/').first
=> "http://a.b.pl/a/b/c"
I'm a beginner in Ruby. I want a Ruby script that fetches every link associated with a domain, without using gems.
For example,
if I enter the URL http://hsps.in
my expected output is:
hsps.in/contacts
hsps.in/projects
hsps.in/blog ..etc
Can anyone tell me how I can achieve this?
open-uri is part of the standard library, but you'll need to install the nokogiri gem; it'll make things a lot easier:
require 'open-uri'
require 'nokogiri'
url = 'http://hsps.in'
doc = Nokogiri::HTML(URI.open(url)) # URI.open: Kernel#open no longer accepts URLs on Ruby 3.0+
links = doc.css('a')
links.each { |link| puts link['href'] }
RegExp is your friend :)
Maybe this gist I created a while ago will help you.
In line 570 I use a Regexp to scan for links:
toScan[:links] = toScan[:response].body.scan(/https?:\/\/[^:\s"'<>#\(\)\[\]\{\},;]+/mi)
and in line 572 I use this Regexp to scan for internal links:
interneLinks = toScan[:response].body.scan(/href\s*=\s*['"]\/?[^\s:'"<>#\(\)\[\]\{\},;]+/im )
I also didn't want to use gems and wanted to do it on my own, so I used a Regexp. With regular expressions you can deal with text patterns. It's like a small language you can use to identify text in a string (in your case, URLs). :) Maybe there is a better regexp for links (Google could find one), but I wanted to deal with it on my own.
Hopefully this helps with your case.
In your controller action
arr = []
routes = %x[rake routes]
routes.split(' ').map{|rt| arr << rt if rt.count('/') > 0 && rt.count('#') == 0}
puts arr.uniq
require 'open-uri'
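class PageLinks
  attr_reader :page
  include OpenURI

  def initialize(url)
    # URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs
    @page = URI.open(url).readlines
  end

  # Returns every line of the page that contains "href"
  def links
    @page.grep(/href/)
  end
end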
url = 'http://www.hsps.in'
doc = PageLinks.new url
puts doc.links.inspect
As you said 'without using any gems', I will take it that this includes Rails, even though the question is tagged as such.
This is not a 'clean' answer, as it doesn't extract the href values of the a tags. But it should demonstrate that it can indeed be done with no gems, using only what comes with Ruby.
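To go one step further and pull out the href values themselves, here is a rough sketch that still uses only the standard library. It works on an HTML string so that it can be tried without a network connection; same_domain_links is a hypothetical name, and the regex is a simplification that only handles quoted href attributes on a tags.

```ruby
require 'uri'

# Hypothetical helper: returns absolute links from the HTML that point
# to the same host as base_url.
def same_domain_links(html, base_url)
  base = URI.parse(base_url)
  # Capture the quoted href value of each <a ...> tag (simplified pattern).
  hrefs = html.scan(/<a\b[^>]*?\bhref\s*=\s*["']([^"']+)["']/i).flatten

  absolute = hrefs.map do |href|
    begin
      base.merge(href)          # resolve relative links against the base URL
    rescue URI::Error
      nil                       # skip hrefs that are not valid URIs
    end
  end

  absolute.compact.select { |uri| uri.host == base.host }.map(&:to_s).uniq
end

html = '<a href="/contacts">Contacts</a> ' \
       '<a href="http://hsps.in/blog">Blog</a> ' \
       '<a href="http://other.example/x">Elsewhere</a>'
puts same_domain_links(html, 'http://hsps.in')
# prints:
# http://hsps.in/contacts
# http://hsps.in/blog
```

To cover a whole site, the same function would need to be applied to each discovered page in turn, keeping track of pages already visited.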
I'm using the following to verify that a URL is validly formatted:
validates_format_of :website, :with => URI::regexp(%w(http https))
However, it doesn't work when the URL doesn't start with http:// or https://. Is there some similar way to validate URLs with URI::regexp (or URI) and make it include valid URLs that don't start with http://? (For example, www.google.com is valid, as is http://www.google.com)
This post on Daring Fireball provides a robust regex:
\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))
A more recent post improves on it (N.B. newlines and indentation added here for clarity; see the post for an even more expanded version with explanations):
(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|
www\d{0,3}[.]|
[a-z0-9.\-]+[.][a-z]{2,4}/)
(?:[^\s()<>]+|
\(([^\s()<>]+|
(\([^\s()<>]+\)))*\))+
(?:\(([^\s()<>]+|
(\([^\s()<>]+\)))*\)|
[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
From my tests, URI::regexp is too loose in its definition of a URI (though it does require http…).
You can use a virtual attribute or before_save filter to prepend http:// to your URLs if necessary.
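A minimal sketch of that normalization step in plain Ruby, assuming that only http and https are acceptable schemes. The names normalize_website and valid_website? are hypothetical; in a Rails model the first could be called from a callback or a virtual attribute setter.

```ruby
require 'uri'

# Hypothetical helper: adds "http://" when the value has no http(s) scheme.
def normalize_website(value)
  value = value.to_s.strip
  return value if value.empty?
  value =~ %r{\Ahttps?://}i ? value : "http://#{value}"
end

# Hypothetical helper: true when the normalized value parses as an
# HTTP(S) URI with a host.
def valid_website?(value)
  uri = URI.parse(normalize_website(value))
  uri.is_a?(URI::HTTP) && !uri.host.nil? && !uri.host.empty?
rescue URI::InvalidURIError
  false
end

puts normalize_website("www.google.com")   # => http://www.google.com
puts valid_website?("www.google.com")      # => true
puts valid_website?("not a url")           # => false
```

This check is deliberately loose: it accepts any host that URI can parse, so it can be combined with one of the regexes above if stricter matching is needed.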
This is a Ruby interpretation (with escaped forward slashes):
(?i)\b((?:https?:(?:\/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!#)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b\/?(?!#)))
Original gist by Gruber
Please copy it to Rubular for testing; I could not make a permalink, as the regexp is probably too long.
It works with and without http, and it works with short domains like 'google.com'.