Validating URL domain in Rails - ruby-on-rails

I want to validate a URL, so I searched and found this
Brian Ray said in his post that
"@Tate's answer is good for a full URL, but if you want to validate a domain column, you don't want to allow the extra URL bits his regex allows (e.g. you definitely don't want to allow a URL with a path to a file).
So I removed the protocol, port, file path, and query string parts of the regex, resulting in this:"
I don't understand what he said at all. How can a URL be a path to a file? What is a "domain column"?

A URL consists of several parts. If you have a very elaborate URL, like:
http://www.example.com:1234/path/to/file.html?key1=value1&key2=value2
The parts are:
protocol: http://
host name: www
domain name: example.com
port: 1234
file path: path/to/file.html
query string: key1=value1&key2=value2
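If it helps to see this breakdown in code, here is a minimal sketch using Ruby's standard `uri` library on the example URL. Note that the library reports the host name and domain name together as one `host`, and the keys of the `parts` hash are just labels chosen for this illustration.

```ruby
require "uri"

# Split the example URL into its parts using Ruby's standard library.
url = "http://www.example.com:1234/path/to/file.html?key1=value1&key2=value2"
uri = URI.parse(url)

parts = {
  protocol: uri.scheme, # "http" (without the "://" separator)
  host:     uri.host,   # "www.example.com" (host name and domain name together)
  port:     uri.port,   # 1234
  path:     uri.path,   # "/path/to/file.html" (with the leading slash)
  query:    uri.query   # "key1=value1&key2=value2"
}

parts.each { |name, value| puts "#{name}: #{value}" }
```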
The only parts that may not be omitted are the protocol (although many programs default to http://) and the host name. Each part has its own rules for which characters are legal in it. Worse, not all web servers agree on what those rules are. So the only thing you can check without making an actual connection and seeing whether it fails is the part needed to contact the web server: the protocol, the host and domain name, and the port. These are all case-insensitive (the rest may not be). I'm not sure which characters are valid in a host or domain name, and this is also an area where name servers may not follow the specification.
In short, the only way to check whether a URL is valid is to try to make a connection to it. If your program uses some magic to reject URLs (or email addresses), some people are going to hate you and/or their internet provider for it (because even if your check follows the specification, some host or domain names don't).
As to your question of how a URL can refer to a local file, there is a special protocol for that: file://. Since the path must also start with a /, this results in URLs like file:///home/user/file.html, with three slashes at the start.
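For the "domain column" case from the question, here is a rough sketch of a syntax-only check. It assumes the commonly cited RFC 1123 host name rules (letters, digits and hyphens, labels of 1 to 63 characters that do not start or end with a hyphen, at most 253 characters overall) and additionally assumes you want at least two labels. The helper name `plausible_domain?` is made up for this example, and, as said above, passing the check does not mean the domain exists or that every real-world name would pass.

```ruby
# Hypothetical helper: checks only the *syntax* of a bare host/domain name.
# It rejects anything carrying a protocol, port, path or query string,
# because ":" "/" and "?" are not legal characters in a label.
LABEL = /\A[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?\z/i

def plausible_domain?(value)
  return false if value.nil? || value.empty? || value.length > 253

  labels = value.split(".", -1) # -1 keeps empty labels, so "a..b" fails
  labels.length >= 2 && labels.all? { |label| label.match?(LABEL) }
end

puts plausible_domain?("www.example.com")
puts plausible_domain?("example.com/path/to/file.html")
```

In a Rails model this could be called from a custom `validate` method, but whether to reject input this strictly is a judgment call given the caveats above.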

Related

My naked url can't be linked with a "www"

Issue:
I have a naked url which works, but "www." + the naked url yields a 404 error. I'd like to point the www version to the naked version.
Context:
I have a serverless website, deployed on google cloud run.
I have a domain name registered through google domains, let's call it foo.bar.
Google cloud run provides two types of resource types for me to provide to google domains: A and AAAA, which are ipv4 and ipv6 addresses respectively.
I have added these A and AAAA records as custom records to google domains. With google domains, the hostname field is left blank, which is equivalent to @. From my understanding, this means that when someone types foo.bar, they get directed to these ipv4/ipv6 addresses. This works as intended.
But if I type www.foo.bar, I get a 404 error. As far as I know, www.foo.bar and foo.bar are not equivalent, and so there is no expectation that these two things should work unless explicitly instructed to. So I have tried two approaches to link these, as follows:
Added the A and AAAA resources, but specified www as the hostname, rather than blank. I would expect this to point www.foo.bar to the ipv4/ipv6 addresses too.
Added a CNAME resource, which specifies www.foo.bar as the hostname, and foo.bar as the data. I would expect this to redirect www.foo.bar to foo.bar, which points towards the ipv4/ipv6 addresses.
I would expect both of these to work equivalently, yet neither of them work. I feel like I am misunderstanding what is going on when managing these records, or that google domains is more opaque about what it is doing than I'd like.
Is my understanding of what should happen incorrect? And how can I properly set up the www extension?
You must create Cloud Run custom domains for each hostname. One for example.com and another for www.example.com. You must also create the DNS resource records, which you mentioned that you did complete.

website's url - domain vs host names

I am very confused about the whole domain vs hostname thing. I tried googling, but the answers are too confusing for me, so I wanted to ask a question using a specific example known to most people.
So let's take Atlassian's products as an example. The URL is atlassian.net.
If I wanted to create my project on Atlassian, the URL would look like myproject.atlassian.net. My customer potentially wants a similar system for his website, so I need to know how to validate the names of subprojects.
So, is myproject a subdomain or a host?
What are the naming rules, e.g. can I use underscores/dots in that part of the URL?
How does routing work for such URLs? DNS resolves the top URL atlassian.net and then server logic serves pages for the specific subproject?
Thanks.
In the most common case, a hostname (e.g. localhost) is a name used for internal reference by programs, while a domain name is used for external (internet) reference. How a domain name resolves to a host can't be summarized easily. You will see many places where the two are mistakenly used as synonyms, so you need to work out the intended meaning from context.
Your example is not correct in the context of JIRA.
"mycompany.atlassian.net" is actually your company account, which may contain multiple projects. Issues within a project get URLs like "mycompany.atlassian.net/browse/STAC-20", where STAC is the project key.
In terms of how domain and subdomain names work, there are different levels of configuration. A good starting point is probably this link:
https://uk.godaddy.com/help/what-is-a-subdomain-296
Here is a short explanation in case you are referring to "name server" hosts:
Each domain, once configured, needs 2-3 name servers that know how to resolve all of its subdomains and aliases. These are usually provided by the hosting company where the domain is placed.
So "mydomain.com" might be configured with "ns1.anyhosting.com" and "ns2.anyhosting.com" to serve DNS requests for "mydomain.com".
If you want to host the name servers yourself, that is possible by configuring "hosts" with explicit IPs at the domain registration, e.g. "ns1.mydomain.com" and "ns2.mydomain.com". These are referred to as hosts.
This is the exact wording from my textbook:
Demonstrate by giving an example:
What is the hostname of the following URL?
http://www.weather.com/summer/temperatures.html
solution: www.weather.com
Reason:
The hostname is the complete domain name, which is the characters after the scheme and before the path.
Some other definitions:
Scheme - Characters at the beginning of a URL followed by a colon ":" or a colon and double slashes "://". Common URL schemes include http, https, mailto, and file. Ex: In http://www.cdc.gov/alcohol, the scheme is "http".
Hostname - The complete domain name specified in the URL. Ex: In http://www.cdc.gov/alcohol, the hostname is "www.cdc.gov".
Path - All characters to the right of the hostname in the URL. Ex: In http://www.cdc.gov/alcohol, the path is "/alcohol".

How does the URL I type in lead to the eventual content I see in my browser?

I'm trying to figure out how these all work together, and there are bits and pieces of information all over the internet.
Here's what I (think) I know:
1) When you enter a url into your browser that gets looked up in a domain name server (DNS), and you are sent an IP address.
2) Your computer then follows this IP address to a server somewhere.
3) On the server there are nameservers, which direct you to the specific content you want within the server. -> This step is unclear to me.
4) With this information, your request is received and the server relays site content back to you.
Is this correct? What do I have wrong? I've done a lot of searching over the past week, and the thing I think I'm missing is the big picture explanation of how all these details tie together.
Smaller questions:
a) How does the nameserver know which site I want directions to?
b) How can a site like GoDaddy own urls? Why do I have to pay them yearly fees, and why can't I buy a url outright?
I'm looking for a cohesive explanation of how all this stuff works together. Thanks!
How does content get loaded when I put a URL in a browser?
There is very good documentation available on this topic, and each step has its own logic and algorithms attached to it. Here is a walkthrough.
Step 1: DNS lookup: The domain name is converted into an IP address. In this process, the domain name from the URL is used to find the IP address of the associated server machine by looking up records on multiple servers called name servers.
Step 2: Service request: Once the IP address is known, a service request, whose form depends on the protocol, is created as packets and sent to the server machine using that IP address. For a browser it will normally be an HTTP request; in other cases it can be something else.
Step 3: Request handling: Depending on the service request and the underlying protocol, the request is handled by a software program that normally lives on the server machine whose address was discovered in Step 1. Following the logic programmed into it, the server program returns an appropriate response; in the case of HTTP this is called an HTTP response.
Step 4: Response handling: The requesting program, in your case a browser, receives the response from the previous step and renders and displays it as defined by the protocol. In the case of HTTP, the HTTP body, which is typically written in HTML, is extracted and rendered.
How does the nameserver know which site I want directions to?
A URL has a very well defined format, from which the browser extracts the hostname/domain name, which in turn is used to find the associated IP address. Name servers run different algorithms to find the correct server machine's IP.
Find more about DNS resolution here.
How can a site like GoDaddy own urls? Why do I have to pay them yearly fees, and why can't I buy a url outright?
Domain names are resources that need management and regulation, which is done by ICANN. They have something called registries, from which registrars (like GoDaddy) get domains and book them for you; the cost you pay is split between ICANN and the registrar.
The registrar does a lot of work for you, e.g. setting up name servers, providing hosting, etc.
Technically you can create your own domain name, but it won't be free, of course, because you will need to set up a name server and replicate it to other servers; that way you can have whatever name you want (it has to be unique). A simple way to do that is to edit your local hosts file: on Linux it is located at /etc/hosts and on Windows at C:\Windows\System32\drivers\etc\hosts. But that is no good on the internet, since it won't be accepted by other servers.
(A precise and detailed description of this process would probably take too much space and time to write; I am sure you can find one with a search.) Although very simplified, you have a pretty good picture of what is going on, but some clarifications are needed (again, I will be somewhat imprecise):
Step 2: Your computer does follow the IP address received in step 1, but the request sent to that IP address usually contains one important piece of information called the 'Host header', which is the actual name as you typed it in your browser.
Step 3: There is no nameserver involved here; the software (or hardware) is usually called a 'webserver' (for example Apache, IIS, nginx, etc.). One webserver can serve one or many different sites. If there is more than one, the webserver uses the 'Host header' to direct you to the specific content you want.
ICANN 'owns' the domain names, and the registration process involves technical and administrative effort, so you pay registrars to handle that.
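To make the 'Host header' point from Step 3 concrete, here is a small sketch of how one server could pick between several sites that share the same IP address. The site names, the page contents and the helper names `host_header` and `serve` are all invented for the illustration; real webservers such as Apache or nginx do this through their virtual host configuration rather than code like this.

```ruby
# Hypothetical table of sites served by one webserver (names are made up).
SITES = {
  "foo.example" => "<h1>Foo site</h1>",
  "bar.example" => "<h1>Bar site</h1>"
}.freeze

# Extract the Host header from a raw HTTP request.
# Header names are case-insensitive, and an optional ":port" suffix is dropped.
def host_header(raw_request)
  line = raw_request.lines.find { |l| l =~ /\Ahost:/i }
  return nil unless line

  line.split(":", 2).last.strip.sub(/:\d+\z/, "").downcase
end

# Choose the content based on the Host header, not on the IP address.
def serve(raw_request)
  SITES.fetch(host_header(raw_request), "404 unknown site")
end

request = "GET /index.html HTTP/1.1\r\nHost: bar.example\r\n\r\n"
puts serve(request)
```

Both requests would arrive at the same IP address; only the Host line differs, which is why the name you typed still matters after the DNS lookup is done.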

Nginx - What precautions need to be taken when I turn underscores_in_headers on?

I'm writing a Rails application and passing in a custom access token through the HTTP headers. To accommodate this I need to turn on underscores_in_headers in nginx.conf for my code to run. (See Rails Not able to access headers after moving to Digital Ocean)
Because this option is off by default, I assume I take on some security risks by turning it on. However, I have been unable to find an explanation of what these risks or concerns are. What are these risks and how do I account for them within my code?
Thanks!
According to the Nginx Pitfalls...
This is done in order to prevent ambiguities when mapping headers to CGI variables, as both dashes and underscores are mapped to underscores during that process.
So it looks like a question of avoiding collisions between variable names. FWIW, the applicable RFC 7230, sec 3.2.6 specifically allows underscores and RFC 3875, sec. 4.1.18 states that:
The HTTP header field name is converted to upper case, has all occurrences of "-" replaced with "_" and has "HTTP_" prepended to give the meta-variable name.
The security problem, then, is related to this conversion of "-" to "_" and how receiving applications then access the resulting variable. For instance, "User-Agent" would be mapped to "HTTP_USER_AGENT" by the server, and then in PHP (for example) the CGI environment variable is accessed as:
$_SERVER['HTTP_USER_AGENT']
In rails:
request.env['HTTP_USER_AGENT']
So what happens if the client sends "User_Agent" instead of "User-Agent"? The underscore would be left in place, and "HTTP_USER_AGENT" will then have been explicitly set by a client script (normally, it's set by the browser). The following post from 2007 discusses the potential to exploit this process:
Exploiting reflected XSS vulnerabilities, where user input must come through HTTP Request Headers
That post suggests there is a problem if the server app "insecurely prints" the header value (to the client browser) and in the example it would presumably execute a javascript alert popup. It's just an example though.
The question is, does the problem still exist? Well, yes. See the following post that discusses the Shellshock vulnerability where the same idea is used to exploit the BASH shell:
Inside Shellshock: How hackers are using it to exploit systems
Therefore, if you intend to parse any header with an older version of BASH, you need to be aware of the vulnerability presented by Shellshock. At the end of the day, you should always take care to sanitize any data value that has been sent to your application outside of your control.
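The name collision itself is easy to reproduce. Below is a minimal sketch of the RFC 3875 mapping quoted above; the helper name `cgi_variable_name` and the header values are made up for the example, and which of two colliding headers wins on a real server depends on the server, so the "last one wins" behaviour here is only an assumption of this sketch.

```ruby
# Convert an HTTP header field name to its CGI meta-variable name,
# following the rule quoted from RFC 3875, sec. 4.1.18:
# upper-case it, replace "-" with "_", and prepend "HTTP_".
def cgi_variable_name(header_name)
  "HTTP_" + header_name.upcase.tr("-", "_")
end

# Hypothetical request headers: the second one is spoofed with an underscore.
headers = [
  ["User-Agent", "RealBrowser/1.0"],
  ["User_Agent", "<script>alert(1)</script>"]
]

# Build the environment the application would see (last write wins here).
env = {}
headers.each { |name, value| env[cgi_variable_name(name)] = value }

puts env.keys.inspect
puts env["HTTP_USER_AGENT"]
```

With underscores allowed, both spellings land in the same variable, so the application cannot tell which header the value came from. That is why any such value should be treated as untrusted input.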

how to add subdomain name from current url using .htacces rules

I have a URL link like,
http://domain.com/abs/def/city and,
i want to display it as http://city.domain.com/ABC/def
using .htaccess.
Can anyone help me by providing .htaccess rules?
I want to write .htaccess rules so that each city name in the URL acts as a subdomain name.
I also want it to be dynamic, as there are different cities available on the site.
I am using the code below in my .htaccess file, but it is not working properly.
RewriteRule ^index.php/(.)/(.)/([^/]+)$ http://$3.domain/$1/$2/$3 [R=301,L]
Is there any way to achieve this by modifying my code above, or with some other .htaccess code?
Sorry, but what you ask is not possible. This is a typical misunderstanding about URL rewriting:
URL rewriting rewrites (manipulates) incoming requests on the server side before processing them. It is not possible to alter outgoing content such that the URLs it contains are changed by this means.
There are solutions for that though:
Apache's proxy module can "map" one URL into the scope of some other URL
there are also modules for automatic post-processing of generated HTML markup
more exotic or creative solutions exist, it depends on your situation in the end...
But usually the easiest approach is to change the application (typically just its central configuration) so that it emits the final URLs (pointing to the subdomain in your case). Then you can indeed use the rewriting module to "re-map" those to the previous scope when future incoming requests refer to them (after they have been clicked).
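As a rough illustration of "change the application so that it emits the final URLs", here is a sketch of the two directions involved. Everything in it is hypothetical: `domain.com` is the placeholder from the question, `abc/def` stands in for the path, and `city_url` and `legacy_path` are invented helper names. In a real setup the second direction would be done by the rewriting module, not by application code.

```ruby
BASE_DOMAIN = "domain.com" # placeholder taken from the question

# Outgoing direction: build the final URL with the city as a subdomain,
# so the links the application sends out already have the desired form.
def city_url(city, *segments)
  "http://#{city}.#{BASE_DOMAIN}/#{segments.join('/')}"
end

# Incoming direction: map a subdomain request back to the original
# path layout (this is the part a rewrite rule would normally handle).
def legacy_path(host, path)
  city = host.delete_suffix(".#{BASE_DOMAIN}")
  "#{path.chomp('/')}/#{city}"
end

puts city_url("cambridge", "abc", "def")
puts legacy_path("cambridge.domain.com", "/abc/def")
```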
Ok, second step getting additional info from your comments:
Just to get this clear: you understand that it is not possible to change the links you send out by means of rewriting, but you want to change the URL shown in the browser after the user has clicked on some city link? That is something different from what you wrote before, and it actually is possible. Great.
If the rewriting works as you want it to (you see the desired URL in the browser's address bar), then we can go on. The error message indicates a name resolution problem, which has nothing to do with rewriting. Most likely the domain "cambridge.192.168.2.107" cannot be resolved, which is actually not surprising. You cannot mix IP addresses and names; it is either one or the other.
Also, I see that you are using internal, non-routable addresses. So you are also responsible for the name resolution yourself, since no public DNS server can guess what you are setting up internally. Did you do that?
I suggest these steps:
stop using an IP address for this, use a domain name.
since you are working internally, take care that the domain name actually resolves to your local system's IP address. How you do this depends on your setup and system, obviously. Most likely you need an entry in the file /etc/hosts or similar.
you also need to take care that those "subdomain names" resolve to the same address. This is not trivial; again, it depends on the setup and system you use locally.
if that name resolution works, then you should see a request in your HTTP server's access log file. Then and only then does it make sense to go on...
