Detect and replace URLs in text - ruby-on-rails

I want to detect and replace URLs in text entered by users. An example is worth a thousand words:
Here's a link to stackoverflow.com, so is http://stackoverflow.com.
=>
Here's a link to [stackoverflow.com](http://stackoverflow.com), so is [http://stackoverflow.com](http://stackoverflow.com).
All I found from Google is how to detect URLs and change them to <a> tags. Is there a way that I can detect URLs, and replace them with custom code blocks to generate something as the example above? Thanks a lot!

The tricky part is finding a regexp that will match all URLs. For example, this one might work, from http://ryanangilly.com/post/8654404046/grubers-improved-regex-for-matching-urls-written
regexp = /\b((?:https?:\/\/|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/?)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s\`!()\[\]{};:\'\".,<>?«»“”‘’]))/i
Once you've got your regexp, use gsub with a block, e.g.
text = "Here's a link to stackoverflow.com, so is http://stackoverflow.com."
=> "Here's a link to stackoverflow.com, so is http://stackoverflow.com."
text.gsub(regexp){|url| "FOO#{url}BAR"}
=> "Here's a link to stackoverflow.com, so is FOOhttp://stackoverflow.comBAR."
Note that this leaves the first one in the text (the one without a protocol) untouched, because the regexp doesn't treat it as a URL. If you were expecting it to pick up the first one too, that's going to be much harder.
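To get the Markdown-style output from the question, the block passed to gsub can build the link text itself. Below is a minimal sketch, not a definitive solution: URL_PATTERN is a deliberately simplified, hypothetical pattern (not the Gruber regexp above), and markdown_links is a made-up helper name. It also treats bare host names as URLs, which is only a heuristic and will produce false positives such as file names like notes.txt.

```ruby
# Hypothetical, simplified pattern (an assumption, not the Gruber regexp):
# optional http(s) scheme, a dotted host name, and an optional path.
URL_PATTERN = %r{
  \b
  (?:https?://)?                           # optional scheme
  (?:[a-z0-9-]+\.)+[a-z]{2,6}\b            # host name, e.g. stackoverflow.com
  (?:/(?:[^\s<>()]*[^\s<>().,;:!?'"])?)?   # optional path without trailing punctuation
}xi

# Replace every match with a Markdown link, adding a scheme
# to the link target when the matched text has none.
def markdown_links(text)
  text.gsub(URL_PATTERN) do |url|
    href = url.match?(%r{\Ahttps?://}i) ? url : "http://#{url}"
    "[#{url}](#{href})"
  end
end

puts markdown_links("Here's a link to stackoverflow.com, so is http://stackoverflow.com.")
```

The same block could emit any custom markup instead of Markdown; only the string built on the last line of the block changes.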

Related

Regex to normalize topic links in Discourse forum

I am using Discourse forum software. In its current state, Discourse presents links to topics in two ways, with and without a post number at the end.
Example:
forum.domain.com/t/some-topic/23
forum.domain.com/t/some-topic/23/5
The first form is what I want; I don't want the second form to be displayed in the forum at all.
I've written a post about it on the Discourse forum but didn't get an answer about which regex to put in the permalink normalization input field in the admin section.
I was told that there is an option to do it using permalink normalization like so (It's an example shown in the admin under the Regex input text, I didn't write it):
permalink normalizations
Apply the following regex before matching permalinks,
for example: /(topic.*)\?.*/\1 will strip query strings from topic routes.
Format is regex+string use \1 etc. to access captures
I don't know what Regex I should use in order to remove the numerical value of the post number from links. I need it only for topic links.
This is the routes.rb routing library and this is the permalink.rb library (I think that the permalink library should help get a better clue how to achieve this). I have no idea how to approach this, because it seems that I need some knowledge of the Discourse routing to make it work. For example, I don't understand why (topic.) is part of the regex, what does it mean, so their example doesn't help me to find a solution.
In the admin I have an input field in which I need to put the normalization regex code.
I need help with the Regex. I need the regex to work with all topics.
Things I've tried that didn't work out:
/(\/\d+)\/\d+$/\1
/(t/[^/]+/\d+).*/\1
/(\/\d+)\/[0-9]+$/\1
/(\/\d+)\/[0-9]+/\1
/(\/\d+)\/\d+$/\1/
/(forum.domain.com(\/\w+)*\/\d+)\/\d+(?=\s|$)/\1
Note: The Permalink Normalization input field treats the character | as a separator to separate between several Regex expressions.
I think this may be the expression you are looking for to put inside the settings field:
/(t\/.*\/\d+)(\/\d+)/\1
You can see it working on Rubular.
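If you'd rather check it locally than on Rubular, here is a small plain-Ruby sketch. It only exercises the regex with String#sub; it does not call any Discourse code, and strip_post_number is a hypothetical helper name.

```ruby
# The same expression as above, written with %r{} so the slashes need no escaping.
TOPIC_WITH_POST_NUMBER = %r{(t/.*/\d+)(/\d+)}

# Keep the first capture (the topic URL up to the topic id) and drop the post number.
def strip_post_number(url)
  url.sub(TOPIC_WITH_POST_NUMBER, '\1')
end

puts strip_post_number("forum.domain.com/t/some-topic/23/5")
puts strip_post_number("forum.domain.com/t/some-topic/23")
```

Both calls should print forum.domain.com/t/some-topic/23: the second URL has no trailing post number, so the regex doesn't match and the string is returned unchanged.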
However, the code that generates the url is not using the normalization code, so the expression is being ignored.
You could try normalizing the permalink there:
def last_post_url
  url = "#{Discourse.base_uri}/t/#{slug}/#{id}/#{posts_count}"
  url = Permalink.normalize_url url
  url
end
I didn't fully understand your question, but if I got it right, you want links with /some-number at the end but don't want links with /some-number/some-number at the end. If that is the case, the regex is:
forum\.domain\.com\/t\/[^0-9\/]+\/\d{1,9}$
You can replace 'forum' with your forum name and 'domain' with your domain name.
This will remove trailing "/<digits>" after another "/<digits>":
/(forum.domain.com(\/\w+)*\/\d+)\/\d+(?=\s|$)/\1

React Native, how to turn a comment into a URL?

I was wondering how I can turn text in a comment (such as stackoverflow.com) into a URL in a post, so that when clicked, it will go straight to the website?
Thanks for your help in advance
You can pre-process your text and look for things that might be a URL. Look at this answer here: Regex to match URL.
You'd want to take your text, split it by white-space, then for each white-space-separated word, check if it's a URL. If it's a URL, then output a proper <a> tag surrounding it. If not, just output the word.

how to force long URL/text to wrap in table cell when generating PDF in rails?

I am using princely (https://github.com/mbleigh/princely) to generate PDFs in Rails. I have a long URL in one table cell, and it overflows the margin when I generate the PDF. In HTML, "word-break: break-all;" works well, but that rule doesn't work in the PDF. Does anybody have an idea how to wrap the long text when generating the PDF?
Since princely converts the ERB to PDF, I believe we can use Rails' truncate helper to make the link shorter.
= link_to truncate("The anchor you want to place", :length => 5), 'http://yoururl'
I was confronted with a similar issue when printing. Aside from the technical problem, you need to ask yourself some design questions: If you were generating a web page, a user would click the link. But if you're generating a PDF, I assume the goal is to print it. In which case, someone needs to type in this long URL.
How likely are the users going to be to type in a long url? If it only makes sense in the context of a live webpage, maybe you want to eliminate that content from the PDF.
If they do need to visit the destination, but they don't need to see the exact URL, then it may make more sense to shorten the URL.
If you want to shorten the URL, you can implement a URL redirection service in your Rails app. I like the following bit of code for generating the short URL codes because they are all keyboard-friendly (they don't contain confusing characters, and don't require lots of shifting or switching of keyboards):
def generate_short_murl
  # Lowercase letters without 'l' and digits without '0' and '1',
  # which are easily confused with other characters.
  letters = [*('a'..'k'), *('m'..'z')]
  digits = [*('2'..'9')]
  # Four random letters followed by three random digits.
  Array.new(4) { letters.sample }.join + Array.new(3) { digits.sample }.join
end

How to generate complex url like stackoverflow?

I'm using playframework, and I hope to generate complex urls like stackoverflow. For example, I want to generate a question's url:
http://aaa.com/questions/123456/How-to-generator-a-complex-url
Note the last part, it's the title of the question.
But I don't know how to do it.
UPDATED
In the playframework, we can define routes in conf/routes file, and what I do is:
GET /questions/{<\d+>id} Questions.show
In this way, when we call @{Questions.show(id)} in views, it will generate:
http://aaa.com/questions/123456
But getting the generated URL to include a title part is difficult.
With playframework it's easy to generate such a URL. In your routes file, add this:
GET /questions/{id}/{title} YourController.yourMethod
See the doc in playframework site about routing for more info
In your html page :
<a href="@{YourController.yourMethod(id,title.slugify())}">
The slugify method, from JavaExtensions, cleans reserved characters out of your title (see the docs).
This is what a server-side URL rewriter does. In the case of SO it doesn't matter whether you type {...}/questions/4698625/how-to-generate-complex-url-like-stackoverflow or {...}/questions/4698625: they both redirect to the same content. So this postfix is used just to make the URL more readable.
To see more details about url rewriting, see this post.
UPD:
to generate such a postfix,
take a title of the content,
shrink multiple whitespaces into single
replace all whitespaces with dash (-)
remove all non-letter symbols from a title
It's better to perform these operations with regular expressions.
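The steps above can be sketched with regular expressions roughly as follows. This is written in Ruby purely as an illustration of the steps, not as Play code; slug_for is a hypothetical name, and keeping digits and dashes in the last step is my assumption about what "non-letter symbols" should cover.

```ruby
# Build a URL postfix from a title, following the listed steps.
def slug_for(title)
  title.strip
       .gsub(/\s+/, ' ')           # shrink runs of whitespace into a single space
       .tr(' ', '-')               # replace each space with a dash
       .gsub(/[^A-Za-z0-9-]/, '')  # drop everything except letters, digits and dashes (assumption)
end

puts slug_for("How to  generate complex url like stackoverflow?")
```

Because symbols are removed last, a title such as "a & b" ends up with a double dash, so you may want to collapse repeated dashes as an extra step.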

Extracting email addresses in an html block in ruby/rails

I am creating a parser that wards off against spamming and harvesting of emails from a block of text that comes from tinyMCE (so it may or may not have html tags in it)
I've tried regexes and so far this has been successful:
/\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i
The problem is, I need to ignore all email addresses inside mailto hrefs. For example:
<a href="mailto:test@mail.com">test@mail.com</a>
should only return the second email address.
To give some background on what I'm doing: I'm reversing the email addresses in a block, so the above example would look like this:
moc.liam@tset
The problem with my current regex is that it also replaces the one in the href. Is there a way for me to do this with a single regex? Or do I have to check for one, then the other? Is there a way for me to do this just by using gsub, or do I have to use some nokogiri/hpricot magic to parse the mailtos? Thanks in advance!
Here were my references btw:
so.com/questions/504860/extract-email-addresses-from-a-block-of-text
so.com/questions/1376149/regexp-for-extracting-a-mailto-address
im also testing using this:
http://rubular.com/
edit
here's my current helper code:
def email_obfuscator(text)
  text.gsub(/\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i) { |m|
    m = "<span class='anti-spam'>#{m.reverse}</span>"
  }
end
which results in this:
<a target="_self" href="mailto:<span class='anti-spam'>moc.liamg@tset</span>"><span class="anti-spam">moc.liamg@tset</span></a>
Another option if lookbehind doesn't work:
/\b(mailto:)?([A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4})\b/i
This matches all emails; you can then check whether the first captured group is "mailto:" and, if so, skip that match.
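A minimal sketch of that check inside a gsub block, assuming the same span markup as the helper in the question. The constant and method names here are made up for illustration.

```ruby
EMAIL_WITH_PREFIX = /\b(mailto:)?([A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4})\b/i

# Reverse only the addresses that are not prefixed by "mailto:".
def reverse_unlinked_emails(text)
  text.gsub(EMAIL_WITH_PREFIX) do |match|
    prefix = Regexp.last_match(1)  # "mailto:" or nil
    email = Regexp.last_match(2)   # the address itself
    # Returning the match unchanged is how a match gets "skipped" in gsub.
    prefix ? match : "<span class='anti-spam'>#{email.reverse}</span>"
  end
end

puts reverse_unlinked_emails(%q(<a href="mailto:test@mail.com">test@mail.com</a>))
```

With this input, the address in the href should stay intact and only the link text should be wrapped and reversed.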
Would this work?
/\b(?<!mailto:)[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i
The (?<!mailto:) is a negative lookbehind, which rejects any match that is immediately preceded by mailto:
I don't have Ruby set up at work, unfortunately, but it worked with PHP when I tested it...
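For what it's worth, Ruby's regex engine has supported negative lookbehind since 1.9, so the pattern can be dropped into the question's helper more or less as-is. A small sketch under that assumption (UNLINKED_EMAIL is a name I made up; the span markup is copied from the question):

```ruby
UNLINKED_EMAIL = /\b(?<!mailto:)[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i

# Wrap and reverse every address that is not directly preceded by "mailto:".
def email_obfuscator(text)
  text.gsub(UNLINKED_EMAIL) { |m| "<span class='anti-spam'>#{m.reverse}</span>" }
end

puts email_obfuscator(%q(<a href="mailto:test@mail.com">test@mail.com</a>))
```

The \b in front matters: without it the engine could start the match one character later (at "est@mail.com"), where the lookbehind no longer sees "mailto:".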
Why not just store all the matched emails in an array and remove any duplicates? You can do this easily with the Ruby standard library, and (I imagine) it's probably quicker and more maintainable than adding more complexity to your regex.
emails = ["email_one@example.com", "email_one@example.com", "email_two@example.com"]
emails.uniq # => ["email_one@example.com", "email_two@example.com"]

Resources