Requesting input on conceptual ideas for disguising browser history - http-redirect

I am working with a Domestic Violence support organisation to build a website and have been asked to provide a "Quick Exit" function.
The purpose is to enable the user to exit the site quickly without closing the browser. I have seen such buttons on similar sites and the normal scenario is that they simply cause a Google search page to be shown. (easy but doesn't hide history)
I am looking for ideas to improve on this function to hide/disguise the history stored in the browser as this is currently a fairly significant flaw with the Quick Exit buttons I've seen to date.
I had a concept but I am looking for input on either fleshing out my concept, or other alternative directions to consider.
My concept was to have two domains: let's call them dv-site.com and decoy-site.com. The former being the source of domestic violence support information and the latter being some random content, could be anything, lets just say weather information for the sake of the conversation.
If a user navigates directly to dv-site.com the server redirects to decoy-site.com but also attaches some session specific, or perhaps single use query string or similar.
decoy-site.com validates the query string and, if valid, loads dv-site.com within an iframe or something like that so from the users perspective they are just looking at dv-site.com, though the domain recorded in history is decoy-site.com.
Links within the iframe loaded site would similarly be redirected with the same or a new query string.
If a user was to click on the browser history and go directly to decoy-site.com it would not be able to validate the query string and would just load the decoy site like a normal site. i.e. just showing weather information that exist on that site.
Domestic violence is a serious systemic issue and I would love some input from anyone who has more technical knowledge than I do on fleshing out this concept.
Other aspects I am unsure of how to tackle;
ensuring that dv-site.com can get crawled and ranked by search engines, even though users are all redirected, as it is imperative that it appears in search results so it can be found
technical aspects of a redirect that does not appear in history.
I'm unsure if it's possible to do this without all content and engagement being attributed to the decoy-site..

For the redirect, I believe that HTTP redirects do not get stored in history. You can use a 302 redirect for that. HTTP has a set-cookie header that lets you record a cookie - coupled with the headers here, you can give the decoy site access without recording it in history. Then, delete the cookie.
As far as pagerank goes, you could add a line to robots.txt as described here (the last point) to force the bot to scrape using a query parameter. Then in the backend, return the dv site only if that parameter is passed, otherwise redirect. If the googlebot removes query params when publishing, it will work out. Otherwise, it might fail.
Best of luck.

Related

Is AmazonAWS creating non-genuine hits? How can I verify?

I have a site that logs a "hit (via saving a record to a Hits table that captures the date/time and IP of the machine whenever the detail page is loaded)" whenever a user brings up a detail page for a particular item so that admins can see how many hits that particular item gets. We get random instances where items are being hit multiple times/day in twos. So, in the data, it looks like a user is viewing an item, but the site is logging their hit twice in the database (same item, same date/time, same IP Address, etc.). Most hits are only being recorded once, and all my testing has lead to assurance the site is working appropriately. I'm noticing that particular IP Addresses are causing double hits. When I do Reverse IP searches, all the "double hits" are tied to IP Addresses that trace back to Amazonaws in northern Virginia, on the other side of the country. Our site is used locally, and the single hits are coming from IPs that trace back to local areas. Is there a bot hitting my site from afar? Should I block Amazonaws in Azure (which is where my site is hosted) or is that going to lock out genuine users? Is there a way I can detect whether a hit is genuine in my code (my site is in .Net MVC)? Has anyone faced a similar situation in the past?
Note: This IS RELEVANT to software engineering because a part of the question is asking how I can verify in my code that a hit is genuine!!!!!!!!!!!!!!!!!!
Basically, what I found out (no thanks to the elitist user who downvoted my question and offered no contribution) is that, my hit counter is being inflated by web crawlers. The quick and dirty solution is to implement a robots.txt file to block crawlers from hitting that page. Of course, that comes with the sacrifice that my client's site will no longer come up, should the public do a google search for the product being offered.
One alternative is the hidden link method; in which we put a hidden page on the site that no human user would ever access. When a bot hits that page, we record the IP in a "blacklist" table. Then, before our real hit counter logs a hit, it checks the user's IP against the blacklist.
Another alternative is to implement a blacklist of known User-Agents used by bots. We check the user's credentials against that list in order to determine whether a user is a bot.
Neither of these solutions are 100% though.
These are fairly adequate responses to my question. Of course, since this is StackExchange (or StackOverflow or StackYourMomma or whatever it is), people are just going to downvote your question and act like you're beneath a response because you didn't follow all the little bull crap rules that come along with being a member of the SE/SO/SYM community.

Prevent bot from crawling certain areas of site

I don't know much about SEO and how web spiders work, so forgive my ignorance here. I'm creating a site (using ASP.NET-MVC) which has areas that displays information retrieved from the database. The data is unique to the user, so there's no real server-side output caching going on. However, since the data can contain things the user may not wish to have displayed from search engine results, I'd like to prevent any spiders from accessing the search results page. Are there any special actions I should take to ensure that the search result directory isn't crawled? Also, would a spider even crawl a page that's dynamically generated and would any actions preventing certain directories being search mess up my search engine rankings?
edit: I should add, I'm reading up on robots.txt protocol, but it relies on co-operation from the web crawler. However, I'd also like to prevent any data-mining users who will ignore the robots.txt file.
I appreciate any help!
You can prevent some malicious clients from hitting your server too heavily by implementing throttling on the server. "Sorry, your IP has made too many requests to this server in the past few minutes. Please try again later." In practice, though, assume that you can't stop a truly malicious user from bypassing any throttling mechanisms that you put in place.
Given that, here's the more important question:
Are you comfortable with the information that you're making available for all the world to see? Are your users comfortable with this?
If the answer to those questions is no, then you should be ensuring that only authorized users are able to see the sensitive information. If the information isn't particularly sensitive but you don't want clients crawling it, throttling is probably a good alternative. Is it even likely that you're going to be crawled anyway? If not, robots.txt should be just fine.
It seems like you have 2 issues.
Firstly a concern about certain data appearing in search results. The second about malicious or unscrupulous user harvesting user related data.
The first issue will be covered by appropriate use of a robots.txt file as all the big search engines honour this.
The second issue seems more to do with data privacy. The first question which immediately springs to mind is: If there is user information which people may not want displayed, why are you making it available at all?
What is the privacy policy for such data?
Do users have the ability to control what information is made available?
If the information is potentially sensitive but important to the system could it be restricted so it is only available to logged in users?
Check out the Robots exclusion standard. It's a text file that you put on your site that tells a bot what it can and can't index. You will also want to address what happens if a bot doesn't honour the robots.txt file.
robots.txt file as mentioned. If that is not enough then you can:
Block unknown useragents - hard to maintain, easy for a bot to forge a browser's (although most legitimate bots wont)
Block unknown IP addresses - not useful for a public site
Require logins
Throttle user connections - tricky to tune, you will still be disclosing information.
Perhaps by using a combination. Either way it is a trade off, if the public can browse to it, so can a bot. Be sure you don't block & alienate people in your attempts to block bots.
a few options:
force the user to login to view the content
add a CAPTCHA page before the content
embed content in Flash
load dynamically with JavaScript

Why would Google Search use client-side URL parameters?

Yesterday morning I noticed Google Search was using hash parameters:
http://www.google.com/#q=Client-side+URL+parameters
which seems to be the same as the more usual search (with search?q=Client-side+URL+parameters). (It seems they are no longer using it by default when doing a search using their form.)
Why would they do that?
More generally, I see hash parameters cropping up on a lot of web sites. Is it a good thing? Is it a hack? Is it a departure from REST principles? I'm wondering if I should use this technique in web applications, and when.
There's a discussion by the W3C of different use cases, but I don't see which one would apply to the example above. They also seem undecided about recommendations.
Google has many live experimental features that are turned on/off based on your preferences, location and other factors (probably random selection as well.) I'm pretty sure the one you mention is one of those as well.
What happens in the background when a hash is used instead of a query string parameter is that it queries the "real" URL (http://www.google.com/search?q=hello) using JavaScript, then it modifies the existing page with the content. This will appear much more responsive to the user since the page does not have to reload entirely. The reason for the hash is so that browser history and state is maintained. If you go to http://www.google.com/#q=hello you'll find that you actually get the search results for "hello" (even if your browser is really only requesting http://www.google.com/) With JavaScript turned off, it wouldn't work however, and you'd just get the Google front page.
Hashes are appearing more and more as dynamic web sites are becoming the norm. Hashes are maintained entirely on the client and therefore do not incur a server request when changed. This makes them excellent candidates for maintaining unique addresses to different states of the web application, while still being on the exact same page.
I have been using them myself more and more lately, and you can find one example here: http://blixt.org/js -- If you have a look at the "Hash" library on that page, you'll see my implementation of supporting hashes across browsers.
Here's a little guide for using hashes for storing state:
How?
Maintaining state in hashes implies that your application (I'll call it application since you generally only use hashes for state in more advanced web solutions) relies on JavaScript. Without JavaScript, the only function of hashes would be to tell the browser to find content somewhere on the page.
Once you have implemented some JavaScript to detect changes to the hash, the next step would be to parse the hash into meaningful data (just as you would with query string parameters.)
Why?
Once you've got the state in the hash, it can be modified by your code (or your user) to represent the current state in your application. There are many reasons for why you would want to do this.
One common case is when only a small part of a page changes based on a variable, and it would be inefficient to reload the entire page to reflect that change (Example: You've got a box with tabs. The active tab can be identified in the hash.)
Other cases are when you load content dynamically in JavaScript, and you want to tell the client what content to load (Example: http://beta.multifarce.com/#?state=7001, will take you to a specific point in the text adventure.)
When?
If you had a look at my "JavaScript realm" you'll see a border-line overkill case. I did it simply because I wanted to cram as much JavaScript dynamics into that page as possible. In a normal project I would be conservative about when to do this, and only do it when you will see positive changes in one or more of the following areas:
User interactivity
Usually the user won't see much difference, but the URLs can be confusing
Remember loading indicators! Loading content dynamically can be frustrating to the user if it takes time.
Responsiveness (time from one state to another)
Performance (bandwidth, server CPU)
No JavaScript?
Here comes a big deterrent. While you can safely rely on 99% of your users to have a browser capable of using your page with hashes for state, there are still many cases where you simply can't rely on this. Search engine crawlers, for example. While Google is constantly working to make their crawler work with the latest web technologies (did you know that they index Flash applications?), it still isn't a person and can't make sense of some things.
Basically, you're on a crossroads between compatability and user experience.
But you can always build a road inbetween, which of course requires more work. In less metaphorical terms: Implement both solutions so that there is a server-side URL for every client-side URL that outputs relevant content. For compatible clients it would redirect them to the hash URL. This way, Google can index "hard" URLs and when users click them, they get the dynamic state stuff!
Recently google also stopped serving direct links in search results offering instead redirects.
I believe both have to do with gathering usage statistics, what searches were performed by the same user, in what sequence, what of the search results the user has followed etc.
P.S. Now, that's interesting, direct links are back. I absolutely remember seeing there only redirects in the last couple of weeks. They are definitely experimenting with something.

ASP.NET MVC: Do GET requests on private web pages have to be nondestructive?

In ASP.NET MVC it seems to be common practice not to use GET requests for calls to a controller that modify the model. For example, deleting a customer should not be possible by clicking a simple HTML link.
The only reason for this rule I am aware of is not safeguard against web-crawlers which might indavertently alter the database. GET requests are commonly regarded as safe, whereas POST requests are not.
Does this mean that this rule does not apply to non-public portions of a website (Example: Your password-protected user administration area)? Or is there any other reason not to use destructive GET requests?
This is generally part of HTTP. From the HTTP 1.1 RFC 2616
Implementors should be aware that the
software represents the user in their
interactions over the Internet, and
should be careful to allow the user to
be aware of any actions they might
take which may have an unexpected
significance to themselves or others.
In particular, the convention has been
established that the GET and HEAD
methods SHOULD NOT have the
significance of taking an action other
than retrieval. These methods ought to
be considered "safe". This allows user
agents to represent other methods,
such as POST, PUT and DELETE, in a
special way, so that the user is made
aware of the fact that a possibly
unsafe action is being requested.
Naturally, it is not possible to
ensure that the server does not
generate side-effects as a result of
performing a GET request; in fact,
some dynamic resources consider that a
feature. The important distinction
here is that the user did not request
the side-effects, so therefore cannot
be held accountable for them.
In other words, it's not enforced, but it's really bad form for a GET request to have side-effects. Imagine if a user bookmarks a URL which does updates something, for example - they probably wouldn't expect that to happen.
Another good reason is accelerator plug-ins for browsers. These attempt to speed up page loads by pre-fetching links on the current page. Imagine if you had a bunch of GET requests to delete all the objects in a list, the plug-in would delete them!
The short of it is that you can't predict what a browser will do with GET requests, if it looks like a plain-old hyperlink then its fair game for a browser to go fetch it.
Yes.
It's not just about web crawlers, it's about CRSF - Cross Site Request Forgery.
So imagine that someone is logged into your web site, and browses to www.hax0rs.com
In the source for hax0rs.com is the following tag
<img src="http://mysite.com/members/statusChange?status=I%20am%20looking%20for%20a%20gimp%20mask" height="0" width="0">
Because your user is logged in, and because the request is going to your site, the authentication cookie goes with it. And bang, suddenly your user's status has changed.
What fun :)
But I suppose you can still do some sort of "non-retrieval" actions on GET requests. For example updating the "LastVisit" records which can be consider undestructive and relatively safe.

Associating source and search keywords with account creation

As a part of the signup process for my online application, I'm thinking of tracking the source and/or search keywords used to get to my site. This would allow me to see what advertising is working and from where with a somewhat finer grain than Google Analytics would.
I assume I could set some kind of cookie with this information when people get to my site, but I'm not sure how I would go about getting it. Is it even possible?
I'm using Rails, but a language-independent solution (or even just pointers to where to find this information) would be appreciated!
Your best bet IMO would be to use javascript to look for a cookie named "origReferrer" or something like that and if that cookie doesn't exist you should create one (with an expiry of ~24hours) and fill it with the current referrer.
That way you'll have preserved the original referrer all the way from your users first visit and when your users have completed whatever steps you want them to have completed (ie, account creation) you can read back that cookie on the server and do whatever parsing/analyzing you want.
Andy Brice explains the technique in his blog post Cookie tracking for profit and pleasure.

Resources