Regex URL validation in Ruby - ruby-on-rails

I have the following ruby code in a model:
if privacy_policy_link.match(/#{domain}($|\/$)/)
errors.add(:privacy_policy_link, 'Link to a specific privacy policy page on your site instead of your homepage.')
end
This worked until a user tried to save a privacy policy link that looked like this:
https://example.com/about-me/privacy-policy-for-example-com/
The goal is that I don't want them linking to their base homepage (example.com or www.example.com etc) for this privacy policy link (for some random, complicated reasons I won't go into). The link provided above should pass the matcher (meaning that because they are linking to a separate page on their site, it shouldn't be considered a match and they shouldn't see any errors when saving the form) - but because they reference their base domain in second half of the url, it comes up as a match.
I cannot, for the life of me, figure out how the correct regex on rubular to get this url to pass the matching algorithm. And of course I cannot just ask the user to rename their privacy policy link to remove the "com" from it at the end - because this: https://example.com/about-me/privacy-policy-for-example would pass. :)
I would be incredibly grateful for any assistance that could help me understand how to solve this problem!
Rubular link: http://rubular.com/r/G5OmYfzi6t

Your issue is the . character is any character so it matched the - in example-com.
If you chain it to the beginning of the line it will match correctly without trying to escape the . in the domain.
if privacy_policy_link.match(%r{^(http[s]?://|)(www.)?#{domain}($|/$)})

Related

Regex to normalize topic links in Discourse forum

I am using Discourse forum software. As in its current state, Discourse presents links to topic in two ways, with and without a post number at the end.
Example:
forum.domain.com/t/some-topic/23
forum.domain.com/t/some-topic/23/5
The first one is what I want and the second one I want to not be displayed in the forum at all.
I've written a post about it on Discourse forum but didn't receive an answer what Regex to put in the permalink normalization input field in the admin section.
I was told that there is an option to do it using permalink normalization like so (It's an example shown in the admin under the Regex input text, I didn't write it):
permalink normalizations
Apply the following regex before matching permalinks,
for example: /(topic.)\?./\1 will strip query strings from topic routes.
Format is regex+string use \1 etc. to access captures
I don't know what Regex I should use in order to remove the numerical value of the post number from links. I need it only for topic links.
This is the routes.rb routing library and this is the permalink.rb library (I think that the permalink library should help get a better clue how to achieve this). I have no idea how to approach this, because it seems that I need some knowledge of the Discourse routing to make it work. For example, I don't understand why (topic.) is part of the regex, what does it mean, so their example doesn't help me to find a solution.
In the admin I have an input field in which I nee to put the normalization regex code.
I need help with the Regex. I need the regex to work with all topics.
Things I've tried that didn't work out:
/(\/\d+)\/\d+$/\1
/(t/[^/]+/\d+).*/\1
/(\/\d+)\/[0-9]+$/\1
/(\/\d+)\/[0-9]+/\1
/(\/\d+)\/\d+$/\1/
/(forum.domain.com(\/\w+)*\/\d+)\/\d+(?=\s|$)/\1
Note: The Permalink Normalization input field treats the character | as a separator to separate between several Regex expressions.
I think this may be the expression you are looking for to put inside de settings field:
/(t\/.*\/\d+)(\/\d+)/\1
You can see it working on Rubular.
However, the code that generates the url is not using the normalization code, so the expression is being ignored.
You could try normalizing the permalink there:
def last_post_url
url = "#{Discourse.base_uri}/t/#{slug}/#{id}/#{posts_count}"
url = Permalink.normalize_url url
url
end
I didn't truly understand your question, but if I got it right, you are saying that you want links with /some-number at the end but don't what links with /some-number/some-number at the end. If that is the case, the regex is:
forum\.domain\.com\/t\/[^0-9\/]+\/\d{1,9}$
You can replace 'forum' with your forum name and 'domain' with your domain name.
This will remove trailing "/<digits>" after another "/<digits>":
/(forum.domain.com(\/\w+)*\/\d+)\/\d+(?=\s|$)/\1

Can friendly-id gem work with capital letters in url e.g. /users/joe-blogs and /users/Joe-Blogs both work

I like the friendly id gem but one problem i'm seeing is when I type in a url with a capitol letter in it such as /users/Joe-Blogs it cant find the page. Its a little trivial but most sites can handle something like this and will generate the page whether it has a capitol letter or not. Does anyone know a fix for this?
Edit: to clarify this is for when users enter a url manually and put capitals in it just because its a name like author/Joe-Blogs. I've seen other sites handle this but rails seems to just give a 404.
friendly_id uses parameterize to create the slugs.
I think the best way to solve your problem is to parameterize the params before using it to find.
# controller
User.find(params[:id].parameterize)
Or parameterize the url where the link originated from.
As an addition to Vic's answer, you'll want to look at url normalization:
The following normalizations are described in RFC 3986 to result in equivalent URLs:
Converting the scheme and host to lower case.
The scheme and host components of the URL are case-insensitive. Most normalizers will convert them to lowercase.
Example: HTTP://www.Example.com/ → http://www.example.com/
In short - it's against convention to use capitalization in your urls.
You may also wish to look at URI normalize; more importantly, you should work to remove the capitalization from your URLs:
URI.parse(params[:id]).normalize

W3C validator says 'feed does not validate' 'url must be a full URL'... whats wrong with it?

Validating my feed, it has an enclosure with a URL of
https://archive.org/download/NigelFarageAPersonalMessageToNorthernIrelandVoters./Nigel%20Farage,%20a%20personal%20message%20to%20Northern%20Ireland%20voters..mp3
I know it is a bit convoluted... but what is wrong with it? The stop in the directory name? the double dot in the file name? the comma? all of em?
I have looked at the RFC on URL's but cant make it out(!).
This feed does not validate.
line 441, column 2: url must be a full URL: https://archive.org/download/NigelFarageAPersonalMessageToNorthernIrelandVoters./Nigel%20Farage,%20a%20personal%20message%20to%20Northern%20Ireland%20voters..mp3 (4 occurrences) [help]
<enclosure type="audio/mpeg" url="https://archive.org/download/NigelFarage ...
^
** edit **
A useful (even if incorrect) answer was added (and removed...) showing the result from the w3c URL validator - https://validator.w3.org/checklink
This Link Checker looks for issues in links, anchors and referenced objects in a Web page, CSS style sheet, or recursively on a whole Web site. For best results, it is recommended to first ensure that the documents checked use Valid (X)HTML Markup and CSS. The Link Checker is part of the W3C's validators and Quality Web tools.
If you find this question, you may find the link checker a useful resource!
The problem seems to be that it’s a HTTPS URL instead of a HTTP URL.
The linked error documentation, foo attribute of bar must be a full URL, says:
If this is a link to a web page, you must include the "http://" at the beginning and immediately follow it with a valid domain name.
The RSS 2.0 spec says about <enclosure>:
The url must be an http url.
If you change https://archive.org/download/… to http://archive.org/download/…, it validates.
And if you don't have httpS then your SSL says your page isn't secure. #feedvalidator step up. There are a ton of feedback/complaints about this on the support forum here https://groups.google.com/forum/#!forum/feedvalidator-users
More specifically here: https://github.com/rubys/feedvalidator/issues/16

How do SO URLs self correct themselves if they are mistyped?

If an extra character (like a period, comma or a bracket or even alphabets) gets accidentally added to URL on the stackoverflow.com domain, a 404 error page is not thrown. Instead, URLs self correct themselves & the user is led to the relevant webpage.
For instance, the extra 4 letters I added to the end of a valid SO URL to demonstrate this would be automatically removed when you access the below URL -
https://stackoverflow.com/questions/194812/list-of-freely-available-programming-booksasdf
I guess this has something to do with ASP.NET MVC Routing. How is this feature implemented?
Well, this is quite simple to explain I guess, even without knowing the code behind it:
The text is just candy for search engines and people reading the URL:
This URL will work as well, with the complete text removed!
The only part really important is the question ID that's also embedded in the "path".
This is because EVERYTHING after http://stackoverflow.com/questions/194812 is ignored. It is just there to make the link, if posted somewhere, if more speaking.
Internally the URL is mapped to a handler, e.g., by a rewrite, that transforms into something like: http://stackoverflow.com/questions.php?id=194812 (just an example, don't know the correct internal URL)
This also makes the URL search engine friendly, besides being more readable to humans.

dynamic seo title for news articles

I have a news section where the pages resolve to urls like
newsArticle.php?id=210
What I would like to do is use the title from the database to create seo friendly titles like
newsArticle/joe-goes-to-town
Any ideas how I can achieve this?
Thanks,
R.
I suggest you actually include the ID in the URL, before the title part, and ignore the title itself when routing. So your URL might become
/news/210/joe-goes-to-town
That's exactly what Stack Overflow does, and it works well. It means that the title can change without links breaking.
Obviously the exact details will depend on what platform you're using - you haven't specified - but the basic steps will be:
When generating a link, take the article title and convert it into something URL-friendly; you probably want to remove all punctuation, and you should consider accented characters etc. Bear in mind that the title won't need to be unique, because you've got the ID as well
When handling a request to anything starting with /news, take the next part of the path, parse it as an integer and load the appropriate article.
Assuming you are using PHP and can alter your source code (this is quite mandatory to get the article's title), I'd do the following:
First, you'll need to have a function (or maybe a method in an object-oriented architecture) to generate the URLs for you in your code. You'd supply the function with the article object or the article ID and it returns the friendly URL with the ID and the friendly title.
Basically function url(Article $article) => URL.
You will also need some URL rewriting rules to remove the PHP script from the URL. For Apache, refer to the mod_rewrite documentation for details (RewriteEngine, RewriteRule, RewriteCond).

Resources