As we know, robots.txt helps us prevent web crawlers/robots from indexing certain web pages or sections. But this method has a couple of disadvantages: 1. web crawlers might simply ignore the robots.txt file; 2. you are exposing the very folders you want to protect to everybody who reads it.
Is there another way of blocking crawlers from the folders you want to protect? Keep in mind that those folders might still need to be accessible from a browser (like /admin).
Check the User-Agent header on requests and issue a 403 if the header contains the name of a robot. This will block all of the honest robots but not the dishonest ones. But then again, if the robot was really honest, it would obey robots.txt.
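In Python terms, that check could look something like the following minimal Flask sketch (the bot-name list and the /admin route are purely illustrative):

# Hedged sketch: return 403 when the User-Agent names a known crawler.
# The bot list is illustrative, not exhaustive.
from flask import Flask, abort, request

app = Flask(__name__)
BOT_NAMES = ("googlebot", "bingbot", "slurp", "duckduckbot")

@app.before_request
def block_crawlers():
    ua = request.headers.get("User-Agent", "").lower()
    if any(bot in ua for bot in BOT_NAMES):
        abort(403)  # honest bots identify themselves; dishonest ones won't

@app.route("/admin")
def admin():
    return "admin page"  # still reachable from a normal browser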
I have a web app that wants to load bootstrap.min.js
It's on these two CDNs (among others):
https://cdnjs.cloudflare.com/ajax/libs/twitter-bootstrap/4.3.1/js/bootstrap.min.js
https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/js/bootstrap.min.js
The odds of a cache hit from some other app using these CDNs are relatively high.
How can I tell the browser to check whether the file is already cached and, if so, load it from the browser cache?
Can a service worker do this?
I believe that there are some privacy/security restrictions in place that deliberately make it difficult to determine, using JavaScript, whether a third-party URL is present in the browser's cache.
Adding a service worker into the mix will not get around those restrictions.
It's possible to use the Fetch API to create a Request with a cache mode of 'only-if-cached', which will behave more or less in the way you describe, but that will only work if the request's mode is 'same-origin'. In other words, only if the Request is for a first-party URL, not a third-party CDN URL as in your example.
I am hosting a static website generated with Middleman on CloudFront and S3. I want to add multiple language support, and Middleman allows me to localize the content and have the English version at /index.html and the translated content at /sp/index.html, for example.
I would like to be able to detect the "Accept-Language" header in the request and, based on that, serve either /index.html or /sp/index.html.
Based on my research I cannot see a way of doing this with S3 and CloudFront, but maybe you guys have an idea?
If there is no "proper and good way" of doing this with CloudFront and S3, what would be the next best alternative? Currently I am thinking of detecting the language in JavaScript and then redirecting the user if the language is not English.
Greetings, Kim
As mentioned in the comments you will need some kind of arbitrator that can read request headers and either redirect or serve dynamic content. S3 is the problem there.
CloudFront can forward the Accept-Language header to your origin server, and ensure that content is only cached per-language. So that part isn't a problem.
If S3 is your origin, then you have a problem because your files are static and unable to process the incoming request with the language information. I don't recommend trying to detect language with JavaScript. It's problematic.
Although CloudFront can be configured with multiple origins (one per language, in your case) it cannot forward to these based on request header. Currently "behaviours" can only match the URL path. I suspect they could introduce header rules at some point, but until they do (or unless you can find another CDN that does) I'm afraid my answer is going to be a "you can't" answer.
As your site is all flat HTML, I suspect you're not interested in a convoluted solution that comprises various CloudFront behaviours and dynamic server scripts, etc..
I think your best option by far is a simple, low-tech one --
Offer the visitor a choice of language and allow them to switch language from any page. This also avoids surprises: if I google something in English but I speak Spanish, I should see the English page that I googled, and then switch to Spanish if I feel like it.
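For completeness, if you did decide to put a small dynamic origin in front of the static files, the "arbitrator" mentioned above could look something like this minimal Python/WSGI sketch (the site root and language map are hypothetical, and the Accept-Language parsing deliberately ignores q-values):

# Hedged sketch: serve a localized index.html based on Accept-Language.
from wsgiref.simple_server import make_server

SITE_ROOT = "/srv/site"               # hypothetical Middleman build output
LANG_DIRS = {"en": "", "es": "/sp"}   # language tag -> content prefix

def pick_language(accept_header):
    # Rough parsing: take the first supported two-letter tag.
    for part in accept_header.split(","):
        tag = part.split(";")[0].strip().lower()[:2]
        if tag in LANG_DIRS:
            return tag
    return "en"

def app(environ, start_response):
    lang = pick_language(environ.get("HTTP_ACCEPT_LANGUAGE", ""))
    with open(SITE_ROOT + LANG_DIRS[lang] + "/index.html", "rb") as f:
        body = f.read()
    # Vary keeps any cache in front from mixing up the languages.
    start_response("200 OK", [("Content-Type", "text/html"),
                              ("Vary", "Accept-Language")])
    return [body]

make_server("127.0.0.1", 8000, app).serve_forever()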
I'm a novice web developer with some background in programming (mostly Python).
I'm looking for some basic advice on choosing the right technology.
I need to serve files over the internet (MP3s), but I need to implement some control over access:
1. Files will be accessible only for authorized users.
2. I need to keep track of how many times a file was downloaded, by whom, etc.
What might be the best technology to implement this? That is, should I learn Apache, or maybe Django? Or maybe something else?
I'm looking for a 'pointer' in the right direction.
Thanks!
R
If you need to track/control the downloads, that suggests the MP3 URLs need to be routed through a Rails controller. Very doable. At that point you can run your checks, track your stats, and send the file back.
If it's a lot of MP3s, you would like to not have Rails do the actual sending of the MP3 data, as that's a waste of its time and ties up an instance. Look into X-Sendfile, where Rails sends a response header indicating the path of the file to serve, and Apache intercepts it and does the actual sending.
https://tn123.org/mod_xsendfile/
http://rack.rubyforge.org/doc/classes/Rack/Sendfile.html
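The pattern itself is framework-agnostic. Since your background is Python, here is a minimal hedged sketch of the same idea as a Django view with Apache's mod_xsendfile enabled (the storage path and the stats model are hypothetical):

# views.py -- hedged sketch: route downloads through the app for
# auth and stats, then let Apache's mod_xsendfile stream the bytes.
import os
from django.http import HttpResponse, HttpResponseForbidden

MEDIA_ROOT = "/srv/media/mp3"  # hypothetical storage path

def download(request, filename):
    if not request.user.is_authenticated:      # 1. authorized users only
        return HttpResponseForbidden()
    # 2. record who downloaded what; Download is a hypothetical model:
    # Download.objects.create(user=request.user, filename=filename)
    response = HttpResponse(content_type="audio/mpeg")
    # Real code must validate `filename` against path traversal.
    # Apache intercepts this header and serves the file itself,
    # freeing the application worker immediately.
    response["X-Sendfile"] = os.path.join(MEDIA_ROOT, filename)
    return response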
You could use Django, with Lighttpd as the web server. With Lighttpd you can use mod_secdownload, which enables you to generate time-limited download URLs.
More info can be found here: http://redmine.lighttpd.net/projects/1/wiki/Docs_ModSecDownload
You can check for permissions in your Django (or any other) app and then redirect the user to this disposable URL if the permission check passes.
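A minimal sketch of generating such a URL in Python, following the token scheme described in the mod_secdownload documentation linked above (the secret, prefix, and path are placeholders, and you should double-check the exact token format for your lighttpd version):

import hashlib
import time

SECRET = "verysecret"      # must match secdownload.secret in lighttpd.conf
URI_PREFIX = "/dl/"        # must match secdownload.uri-prefix
REL_PATH = "/song.mp3"     # relative to secdownload.document-root

def secure_url(rel_path):
    t_hex = "%08x" % int(time.time())
    token = hashlib.md5((SECRET + rel_path + t_hex).encode()).hexdigest()
    # lighttpd recomputes the token and rejects the URL once it is
    # older than secdownload.timeout seconds.
    return "%s%s/%s%s" % (URI_PREFIX, token, t_hex, rel_path)

print(secure_url(REL_PATH))   # e.g. /dl/<md5-token>/<hex-timestamp>/song.mp3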
I need to serve files through Grails; only users with permission have access, so I can't serve them with a static link to a container. The system is able to stream binary files to the client without problems, but now (for bandwidth performance issues on the client) I need to implement segmented or partial downloads in the controllers.
Is there a plugin or proven solution to this problem?
Maybe some kind of Tomcat/Apache plugin to restrict access to files with certain rules or temporary tickets, so I can delegate the "resume download" or "segmented download" problem to the container.
Also, I need to log and save stats on the users' downloads.
I need good performance, so I think doing this in the controller is not a good idea.
Sorry for my bad English.
There is a plugin for Apache: https://tn123.org/mod_xsendfile/ — it doesn't matter what you're running behind Apache in this case. With this plugin, you respond with the special X-Sendfile header containing the path of the file to serve, and Apache takes care of the actual file download for the current request.
If you're using Nginx, you have to use the X-Accel-Redirect header instead; see http://wiki.nginx.org/XSendfile
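Whatever sets the header (a Grails controller included), the mechanism is the same. Here is a minimal Python/WSGI sketch of the idea, with hypothetical paths; note that because Nginx serves the file itself, it also honors Range requests, so resumable and segmented downloads should come for free:

# Hedged sketch: the app only authenticates and sets the header;
# Nginx streams the file and handles Range/resume on its own.
from wsgiref.simple_server import make_server

def app(environ, start_response):
    # ... check permissions and log download stats here ...
    start_response("200 OK", [
        ("Content-Type", "audio/mpeg"),
        # Must point into a location marked `internal` in nginx.conf.
        ("X-Accel-Redirect", "/protected/song.mp3"),
    ])
    return [b""]

make_server("127.0.0.1", 8000, app).serve_forever()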
I am trying to reduce the load on my web servers by adding an "image server" (a dedicated server for handling image requests), and redirecting all requests for .gif, .jpg, .png, etc. to it.
My question is, what is the best way to handle the redirection?
At the firewall level? (can I do this using iptables?)
At the load balancer level? (can ldirectord handle this?)
At the apache level - using rewrite rules?
Thanks for any suggestions on the best way to do this.
--Update--
One thing I would add is that these are domains that are hosted for 3rd parties, so I can't expect all the developers to modify their code and point their images to another server.
The further up the chain you can do it, the better.
Ideally, do it at the DNS level by using a different domain for your images (e.g. imgs.example.com)
If you can afford it, get someone else to do it by using a CDN (Content delivery network).
-Update-
There are also two features of Apache's mod_rewrite that you might want to look at. They are both described well at http://httpd.apache.org/docs/1.3/misc/rewriteguide.html.
The first is under the heading "Dynamic Mirror" in the above document, and uses the mod_rewrite proxy flag [P]. This lets your server silently fetch files from another domain and return them.
The second is to just redirect the request to the new domain. This second option puts less strain on your server, but requests still need to come in, and it slows down the final rendering of the page, as each image still requires an essentially redundant round trip to your server before the real fetch.
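To make the redirect option concrete, here is a hedged sketch of the rule as a tiny Python WSGI middleware (the image host is hypothetical; in practice you would express the same rule with mod_rewrite as described in the guide above):

# Hedged sketch: 307-redirect image requests to a dedicated image host.
IMAGE_HOST = "http://imgs.example.com"   # hypothetical image server
IMAGE_EXTS = (".gif", ".jpg", ".jpeg", ".png")

def image_redirect(app):
    def middleware(environ, start_response):
        path = environ.get("PATH_INFO", "")
        if path.lower().endswith(IMAGE_EXTS):
            start_response("307 Temporary Redirect",
                           [("Location", IMAGE_HOST + path)])
            return [b""]
        return app(environ, start_response)  # everything else: pass through
    return middleware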
I agree with rikh. If you want images to be served from a different web server, then serve them from a different web server. For example:
<IMG src="images/Brett.jpg">
becomes
<IMG src="http://brettnesbitt.akamia-technologies.com/images/Brett.jpg">
Any kind of load balancer will still feed the image through the web server's pipe, which is what you're trying to avoid.
I, of course, know what you really want. What you really want is for any request like:
GET images/Brett.jpg HTTP/1.1
to automatically get converted into:
HTTP/1.1 307 Temporary Redirect
Location: http://brettnesbitt.akamia-technologies.com/images/Brett.jpg
This way you don't have to do any work except copy the images to the other web server.
That, I really don't know how to do.
Your use of the phrase "NAT" implies that the firewall/router receives the HTTP requests, and that you want it to forward a request to a different internal server when that request is for image files.
This then raises the question of what you're actually trying to save. No matter which internal web server services the HTTP request, the data is still going to have to flow through the firewall/router's pipe.
The reason I bring it up is that the common scenario, when someone wants to serve images from a different server, is that they want to split high-bandwidth, mostly static, low-CPU-cost content off from their actual logic.
Merely using NAT to rewrite the packet and send it to a different server does nothing to address that common concern.
The other reason might be that images are not static content on your system, and a request to
GET images/Brett.jpg HTTP/1.1
actually builds an image on the fly, with a high CPU cost, or uses data (e.g. a SQL Server database) that is only available to ServerB.
If this is the case, then I would still use a different server name for the image request:
GET http://www.brettsoft.com/default.aspx HTTP/1.1
GET http://imageserver.brettsoft.com/images/Brett.jpg HTTP/1.1
I understand what you're hoping for: network packet inspection that overrides the NAT rule and sends the request to another server. I've never seen anything that can do that.
It sounds more "proxy-ish", where a web proxy does this (pfSense and m0n0wall, for example, can't do it).
Which then leads to a kind of solution we used once: a custom web server that analyzes the request, makes the appropriate request to some internal server, and writes the binary response back to the client.
That pain-in-the-ass solution was insisted upon by a "security consultant", who apparently believes in security through obscurity.
I know IIS cannot do such things for you itself; I don't know about other web server products.
I just asked around, and apparently if you wanted to write a custom kernel module for your Linux-based router, you could have it inspect packets and take appropriate action. Such a module might already exist; there are, apparently, plenty of other open-source modules to use as a starting point.
But I'd rather shoot myself in the head.