Visitor using URL that doesn't exist - url

I had someone visit my site today from a link like this:
www.example.com/pagename.php?_sm_byp=iVVVMsFFLsqWsDL4
Can someone explain to me how that works since my actual URL ends with pagename.php and I never allowed a user to input any PHP query and never have session IDs or anything similar.

This is not unusual. Many sites/servers allow (or rather, ignore) arbitrary query components.
For example, you can append ?foo=bar to those URLs and still get an HTTP status 200:
https://stackoverflow.com/?foo=bar
http://en.wikipedia.org/wiki/Stack_Overflow?foo=bar
Now as they are linked here, users might visit them, so these URLs would appear in their logs. Apart from manually appending such a query component, they might also be added by various scripts, e.g. for tracking purposes, or third-party services that link to your pages (… and sometimes their origin is unknown).
If you don’t want your URLs to work with arbitrary query components, you can configure your backend/server so that it redirects to the URLs without the query components, or responds with 404, or whatever.
If you keep allowing this, but want to prevent bots from indexing your URLs with these unnecessary query components, you can specify the canonical variants of your URLs with the canonical link relation.
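As a rough illustration of both options, here is a minimal Python sketch that drops unknown query parameters and builds the matching canonical link element. The function names and the allow-list are assumptions made for this example, not part of any framework; a real setup would more likely do this in the web server or application configuration.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def canonical_url(url, allowed_params=()):
    """Return url with every query parameter not in allowed_params removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key in allowed_params
    ]
    # Rebuild the URL from its parts, using only the parameters we kept.
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


def canonical_link_tag(url, allowed_params=()):
    """Build the <link rel="canonical"> element for the page at url."""
    href = escape(canonical_url(url, allowed_params), quote=True)
    return f'<link rel="canonical" href="{href}">'
```

A redirect handler could compare the requested URL with `canonical_url(...)` and redirect when they differ. Note that `urlencode` may re-encode kept parameters slightly differently from how they arrived, so a strict string comparison can trigger a redirect even when nothing was dropped.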

Related

Requesting input on conceptual ideas for disguising browser history

I am working with a Domestic Violence support organisation to build a website and have been asked to provide a "Quick Exit" function.
The purpose is to enable the user to exit the site quickly without closing the browser. I have seen such buttons on similar sites and the normal scenario is that they simply cause a Google search page to be shown. (easy but doesn't hide history)
I am looking for ideas to improve on this function to hide/disguise the history stored in the browser as this is currently a fairly significant flaw with the Quick Exit buttons I've seen to date.
I had a concept but I am looking for input on either fleshing out my concept, or other alternative directions to consider.
My concept was to have two domains: let's call them dv-site.com and decoy-site.com. The former would be the source of domestic violence support information and the latter some random content, could be anything, let's just say weather information for the sake of the conversation.
If a user navigates directly to dv-site.com the server redirects to decoy-site.com but also attaches some session specific, or perhaps single use query string or similar.
decoy-site.com validates the query string and, if valid, loads dv-site.com within an iframe or something like that, so from the user's perspective they are just looking at dv-site.com, though the domain recorded in history is decoy-site.com.
Links within the iframe loaded site would similarly be redirected with the same or a new query string.
If a user were to click on the browser history and go directly to decoy-site.com, it would not be able to validate the query string and would just load the decoy site like a normal site, i.e. just showing the weather information that exists on that site.
Domestic violence is a serious systemic issue and I would love some input from anyone who has more technical knowledge than I do on fleshing out this concept.
Other aspects I am unsure of how to tackle:
ensuring that dv-site.com can get crawled and ranked by search engines, even though users are all redirected, as it is imperative that it appears in search results so it can be found
technical aspects of a redirect that does not appear in history.
I'm unsure if it's possible to do this without all content and engagement being attributed to the decoy-site.
For the redirect, I believe that HTTP redirects do not get stored in history. You can use a 302 redirect for that. HTTP has a Set-Cookie header that lets you record a cookie - coupled with the headers here, you can give the decoy site access without recording it in history. Then, delete the cookie.
As far as PageRank goes, you could add a line to robots.txt as described here (the last point) to force the bot to scrape using a query parameter. Then in the backend, return the dv site only if that parameter is passed, otherwise redirect. If Googlebot removes query params when publishing, it will work out. Otherwise, it might fail.
Best of luck.

Rails app too many parameters

I have a page with a list of items and above the list I have multiple links that act as filters. Clicking on the links causes an ajax request to be fired with a whole host of URL parameters. An example set of params after clicking a few filters:
?letters=a-e&page=1&sort=alphabetically&type=steel
It is all working fine but I feel like the params on the URL are very messy, and the code behind has to do a lot of checking to see which params exist, merge new ones, overwrite existing ones etc.
Is there a nicer way to accomplish this without URL parameters?
I guess the downside to that would be the fact a user would not be able to link to a specific filtered view or is there a way this could be accomplished too?
You have several options when working with long query strings. If this isn't really causing a problem (like requests dying) then you should ask yourself if it's really worth the effort to switch it to something else.
Use POST Requests
If the length of the query string is causing problems, you can switch to using POST requests instead of GET requests from your filter links. That will prevent the URL from containing the filter parameters, but your controller can still deal with the parameters in the same way.
The link_to helper can be set up to use a different HTTP verb as follows:
link_to("My Filter", filter_path, method: :post)
Make sure you update your routes appropriately if you use this technique.
Use an Ajax Request to Refresh the Page
If you configure your filters to all be remote (Ajax) links, you can update the filters and refresh the contents of the page without ever changing the URL. This is the basic pattern of the solution:
Send a remote request to the server with the current filter options
Update the page contents based on those filters
Make sure the filters (and remote request) will submit all of the current parameters again
Store Filters in the User's Session
If you store the current filters in the session, whenever the user visits the base page, you can retrieve the stored filters and only display the appropriate information. Your filter links could still be GET requests (including the lengthy query strings), but instead of rendering the page after the filter request, you would redirect back to the main list with no extra query parameters. That would make it appear to the user that the URL never changed, and would even allow you to remember their last filter if they navigate away.
Sharing Links
Like you mentioned, sharing links becomes a problem with all of these solutions. You can provide a "share this filter" section on the page to help mitigate that. You would put a URL the user could copy in that section that includes the necessary information to recreate the filter. The links could contain the full query string or perhaps an encoded version of the filter.
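To make the "encoded version of the filter" idea concrete, here is a small language-neutral sketch, written in Python rather than Rails. The helper names and the JSON-plus-base64 format are assumptions chosen for illustration; the resulting token is neither secret nor tamper-proof and is not necessarily shorter than the query string, so the server should validate the decoded values as it would any other parameters.

```python
import base64
import json


def encode_filter(filters):
    """Pack a dict of filter values into a single URL-safe token."""
    # sort_keys makes the token independent of the order the filters were set.
    raw = json.dumps(filters, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")


def decode_filter(token):
    """Recover the filter dict from a token produced by encode_filter."""
    padded = token + "=" * (-len(token) % 4)  # restore the stripped padding
    return json.loads(base64.urlsafe_b64decode(padded))


filters = {"letters": "a-e", "page": "1", "sort": "alphabetically", "type": "steel"}
token = encode_filter(filters)
share_url = "http://example.com/items?f=" + token  # hypothetical share link
```

The "share this filter" box would show `share_url`, and the controller would call the equivalent of `decode_filter` on the `f` parameter to rebuild the filter state.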

#rails_folks | "http://twitter.com/#!/user/followers" | Please explain

How would you achieve this route in rails 3 and the last stable version 2.3.9 or soish?
Explained
I don't really care about the followers action. What I'm really after is how to create '#!' in the routing.
Also, what's the point of this? Is it syntax or semantics?
Rails doesn't directly get anything after the #. Instead the index page checks that value with JavaScript and makes an AJAX request to the server based on the URL after the #. What routes they use internally to handle that AJAX request I am not sure.
The point is to have a JavaScript-powered interface, where everyone is on the same "page" but the data in the hash allows it to load any custom data on the fly, and without loading a whole new page if you decide to view a different user, for instance.
The hash part is never sent to the server, but it is a common practice to manipulate the hash to maintain history and bookmarking for AJAX applications. The only problem is that by using a hash to avoid page reloads, search engines are left behind.
If you had a site with some links,
http://example.com/#home
http://example.com/#movies
http://example.com/#songs
Your AJAXy JavaScript application sees the #home, #movies, and #songs, and knows what kind of data it must load from the server and everything works fine.
However, when a search engine tries to open the same URL, the hash is discarded, and it always sends them to http://example.com/. As a result the inner pages of your site - home, movies, and songs never get indexed because there was no way to get to them until now.
Google has created an AJAX crawling specification, or more like a contract, that allows sites to take full advantage of AJAX while still reaping the benefits of indexing by search engines. You can read the spec if you want, but the gist of it is a translation process of taking everything that appears after #! and adding it as a querystring parameter.
So if your AJAX links were using #!, then a search engine would translate a URL like,
http://example.com/#!movies
to
http://example.com/?_escaped_fragment_=movies
Your server is supposed to look at this _escaped_fragment_ parameter and respond the same way that your AJAX does.
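The translation described above can be sketched in a few lines of Python. The function names are made up for this example, and the escaping is an assumption: this sketch percent-encodes every reserved character in the fragment, whereas the actual specification only requires escaping a specific set of characters.

```python
from urllib.parse import parse_qs, quote, urlsplit, urlunsplit


def to_escaped_fragment_url(url):
    """Rewrite a #! URL the way a crawler following the scheme would request it."""
    parts = urlsplit(url)
    if not parts.fragment.startswith("!"):
        return url  # not a hash-bang URL, nothing to translate
    state = quote(parts.fragment[1:], safe="")
    separator = "&" if parts.query else ""
    query = f"{parts.query}{separator}_escaped_fragment_={state}"
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


def fragment_from_query(query_string):
    """Server side: recover the original fragment state, or None if absent."""
    values = parse_qs(query_string).get("_escaped_fragment_")
    return values[0] if values else None
```

On the server, the value returned by `fragment_from_query` would be used to pick the same content the JavaScript application would have loaded for that fragment.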
Note that HTML5's History interface now provides methods to change the address bar path without needing to rely upon the hash fragment to avoid page reloads.
Using the pushState method (and the popstate event to react to back/forward navigation)
history.pushState(null, "Movies page", "/movies");
you could directly change the URL to http://example.com/movies without causing a page refresh. Search engines can continue to use the same URL that you would be using in that case.
The part after the # in a URI is called the fragment identifier, and it is interpreted by the client, not the server. You cannot route this, because it will never leave the browser.
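To see why there is nothing to route, it can help to split the URL the way a client does before sending a request. The following is a small Python sketch; `request_target` is a hypothetical helper for this example, not a Rails or browser API.

```python
from urllib.parse import urlsplit


def request_target(url):
    """Return the path-and-query part a client would send to the server."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    # parts.fragment is deliberately left out: it stays on the client.
    return target


url = "http://twitter.com/#!/user/followers"
sent_to_server = request_target(url)   # just the root path
kept_in_browser = urlsplit(url).fragment
```

For the URL from the question, the server only ever sees a request for the root path, while the `!/user/followers` part is left for the page's JavaScript to interpret.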

Why would Google Search use client-side URL parameters?

Yesterday morning I noticed Google Search was using hash parameters:
http://www.google.com/#q=Client-side+URL+parameters
which seems to be the same as the more usual search (with search?q=Client-side+URL+parameters). (It seems they are no longer using it by default when doing a search using their form.)
Why would they do that?
More generally, I see hash parameters cropping up on a lot of web sites. Is it a good thing? Is it a hack? Is it a departure from REST principles? I'm wondering if I should use this technique in web applications, and when.
There's a discussion by the W3C of different use cases, but I don't see which one would apply to the example above. They also seem undecided about recommendations.
Google has many live experimental features that are turned on/off based on your preferences, location and other factors (probably random selection as well.) I'm pretty sure the one you mention is one of those as well.
What happens in the background when a hash is used instead of a query string parameter is that it queries the "real" URL (http://www.google.com/search?q=hello) using JavaScript, then it modifies the existing page with the content. This will appear much more responsive to the user since the page does not have to reload entirely. The reason for the hash is so that browser history and state is maintained. If you go to http://www.google.com/#q=hello you'll find that you actually get the search results for "hello" (even if your browser is really only requesting http://www.google.com/). With JavaScript turned off, however, it wouldn't work, and you'd just get the Google front page.
Hashes are appearing more and more as dynamic web sites are becoming the norm. Hashes are maintained entirely on the client and therefore do not incur a server request when changed. This makes them excellent candidates for maintaining unique addresses to different states of the web application, while still being on the exact same page.
I have been using them myself more and more lately, and you can find one example here: http://blixt.org/js -- If you have a look at the "Hash" library on that page, you'll see my implementation of supporting hashes across browsers.
Here's a little guide for using hashes for storing state:
How?
Maintaining state in hashes implies that your application (I'll call it application since you generally only use hashes for state in more advanced web solutions) relies on JavaScript. Without JavaScript, the only function of hashes would be to tell the browser to find content somewhere on the page.
Once you have implemented some JavaScript to detect changes to the hash, the next step would be to parse the hash into meaningful data (just as you would with query string parameters.)
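The parsing step is the same as for a query string. In the browser this would be JavaScript reading location.hash; the sketch below uses Python only to show the logic, and `parse_hash_state` is a name invented for this example. It assumes the hash holds key=value pairs, optionally prefixed with a question mark as in some of the URLs in this thread.

```python
from urllib.parse import parse_qsl, urlsplit


def parse_hash_state(url):
    """Parse key=value pairs stored in the fragment of url into a dict."""
    fragment = urlsplit(url).fragment
    if fragment.startswith("?"):
        fragment = fragment[1:]  # tolerate the "#?key=value" style
    return dict(parse_qsl(fragment, keep_blank_values=True))


search_state = parse_hash_state("http://www.google.com/#q=Client-side+URL+parameters")
game_state = parse_hash_state("http://beta.multifarce.com/#?state=7001")
```

The resulting dictionaries are what the application would then use to decide which content to load or which tab to activate.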
Why?
Once you've got the state in the hash, it can be modified by your code (or your user) to represent the current state in your application. There are many reasons for why you would want to do this.
One common case is when only a small part of a page changes based on a variable, and it would be inefficient to reload the entire page to reflect that change (Example: You've got a box with tabs. The active tab can be identified in the hash.)
Other cases are when you load content dynamically in JavaScript, and you want to tell the client what content to load (Example: http://beta.multifarce.com/#?state=7001, will take you to a specific point in the text adventure.)
When?
If you have a look at my "JavaScript realm" you'll see a borderline overkill case. I did it simply because I wanted to cram as much JavaScript dynamics into that page as possible. In a normal project I would be conservative about when to do this, and only do it when it brings positive changes in one or more of the following areas:
User interactivity
Usually the user won't see much difference, but the URLs can be confusing
Remember loading indicators! Loading content dynamically can be frustrating to the user if it takes time.
Responsiveness (time from one state to another)
Performance (bandwidth, server CPU)
No JavaScript?
Here comes a big deterrent. While you can safely rely on 99% of your users to have a browser capable of using your page with hashes for state, there are still many cases where you simply can't rely on this. Search engine crawlers, for example. While Google is constantly working to make their crawler work with the latest web technologies (did you know that they index Flash applications?), it still isn't a person and can't make sense of some things.
Basically, you're at a crossroads between compatibility and user experience.
But you can always build a road in between, which of course requires more work. In less metaphorical terms: Implement both solutions so that there is a server-side URL for every client-side URL that outputs relevant content. For compatible clients it would redirect them to the hash URL. This way, Google can index "hard" URLs and when users click them, they get the dynamic state stuff!
Recently Google also stopped serving direct links in search results, offering redirects instead.
I believe both have to do with gathering usage statistics, what searches were performed by the same user, in what sequence, what of the search results the user has followed etc.
P.S. Now, that's interesting, direct links are back. I absolutely remember seeing there only redirects in the last couple of weeks. They are definitely experimenting with something.

is there any way to overcome the 2k character limitation on the URL length?

I think the URL length can only be 2000 or so characters long. Otherwise, it will choke some versions of IE. Is there any way to overcome this problem?
At first I was thinking about tinyurl, but tinyurl actually immediately redirects to the longer URL, so that probably will fail too.
Update:
I need such a long URL because people need to be able to bookmark the URL or send it to other people by email.
That's what POST is for ;)
For bookmarking reasons you could store a hash of the argument string in the database as well as the argument list. That way when someone bookmarks something they get a bookmark with the hash in it and your internal software looks up the appropriate arguments and gets them.
You are in essence rolling your own tiny url.
If someone else wants to bookmark a page with the same arguments then the hash will be the same.
The only problem is that your table of hashes will grow quite big, and many of these "bookmarks" might never be used.
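A minimal sketch of that lookup table, assuming an in-memory dict in place of the database and a truncated SHA-256 digest as the key; the class and method names are invented for this example. Truncating the digest keeps the URL short but makes collisions possible in principle, so a real implementation would need to handle two argument strings mapping to the same key.

```python
import hashlib


class BookmarkStore:
    """Maps a short hash key to the full argument string it stands for."""

    def __init__(self):
        self._args_by_key = {}  # stand-in for a database table

    def save(self, query_string):
        """Store query_string and return the short key to put in the URL."""
        digest = hashlib.sha256(query_string.encode("utf-8")).hexdigest()
        key = digest[:16]
        self._args_by_key[key] = query_string
        return key

    def load(self, key):
        """Return the stored argument string, or None for an unknown key."""
        return self._args_by_key.get(key)


store = BookmarkStore()
long_args = "&".join(f"item{i}=value{i}" for i in range(300))
key = store.save(long_args)
```

The bookmarkable URL then only needs to carry `key`, and saving the same arguments again yields the same key, so identical bookmarks share one row.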
HTTP POST is designed for transferring larger amounts of data. You should look at that.
What do your URLs look like?
Maybe you could use gzip or MD5 to shorten them, or store them in a database and put the ID of the row in the URL?
There's no realistic way to get around this - it's not a limitation in the specs or anything like that, but in IE itself and presumably how the URL is allocated (I believe the limit is actually 2083 characters by the way, for some reason).
Since IE needs the URL all in one go to send to the server, I can't think of any clever tricks that would enable you to work around it. One option I considered was to send the query parameters via POST instead of GET. However, this is often not interchangeable on the server side, and the clients will treat it differently: the URL can't then appear in a hyperlink or be bookmarked or entered manually, and if the user wants to refresh they'll get the "send information again" warning, which makes sense since POST is meant to update information on the remote server. It will also only work if it's the query string pushing the URL beyond the limit rather than some ungodly URL. Alternatively you could perhaps chunk up the URL, setting the overflow part in a cookie and then making the request to the stub of the URL, which is intelligent enough to pull the context out of the cookie and append it to the URL actually received. However, this again complicates processing on the server, probably far too much to be used beyond a trivial application, and it still means you can't put that URL in hyperlinks or bookmarks or whatever, since an important part of it is client state.
Basically, everything else would involve rewriting the server to somehow piece together the extra information, and if you're able to do this then you should be able to simply change the URL scheme so that everything's below 2000 characters. So no - no real way around it.
(Though if you could use something like tinyurl to act as a proxy rather than issuing a browser redirect to the URL, that could work).

Resources