I've been looking at some login forms lately and noticed how many different types of cookies they use. I haven't been able to identify what they mean. I get the idea that cookies give the server information about the current state, but why so many? This is a screenshot from the login POST request of a popular website. As you can see, there are many. How do they differ, and what information do they reveal when inspected?
I'm not very good at reading cookies, but I think Google thinks you're from Slovenia. ;)
Most of these cookies don't contain much relevant data at all. They're just identifiers, so the server knows that the next request comes from you again. The string of apparent garbage is simply an ID: after you navigate from the login page to another page on the same domain, the server knows it is still you.
Not all cookies are for the login feature of the site itself; some are for tracking purposes:
__gads is for DoubleClick, related to Google advertising.
__ssid is for SiteSpect, a tool for A/B testing.
_ga is for Google Analytics, a tool that gathers website statistics.
and so on.
Still, these cookies normally don't contain user names, passwords, or other sensitive information. They exist to 'anonymously' track you: they gather aggregate data showing that people who visited page X also visited page Y, and from that, companies like Google build profiles to serve targeted advertising.
I have a website that has been replaced by another website with a different domain name.
In Google search, I can still find links to pages on the old site, and I hope they will not show up in future Google searches.
Here is what I did, but I am not sure whether it is correct or sufficient.
Any page on the old website immediately redirects to the homepage of the new website; there is no one-to-one page mapping between the two sites. Here is the redirect code on the old website:
<meta http-equiv="refresh" content="0;url=http://example.com" >
I went to the Google Webmasters site. For the old website, I used Fetch as Google, then clicked "Fetch and Render" and "Reindex".
Really appreciate any input.
A few things you'll want to do here:
You need to use permanent server-side redirects (HTTP 301), not a meta refresh. I also suggest you provide one-to-one page mapping: it's a better user experience, and large numbers of redirects to the root are often interpreted as soft 404s. Consult Google's guide to site migrations for more details.
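For example, assuming the old site runs on Apache with mod_rewrite (a sketch with placeholder domains, not taken from the question), a permanent path-preserving redirect could look like:

```apache
# .htaccess on the old domain: 301-redirect every request
# to the same path on the new domain
RewriteEngine On
RewriteCond %{HTTP_HOST} ^old\.example\.com$ [NC]
RewriteRule ^(.*)$ http://new.example.com/$1 [R=301,L]
```

Because the path is carried over in `$1`, this gives you the one-to-one mapping in a single rule, provided the new site keeps the same URL structure.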
Rather than Fetch & Render, use the Change of Address tool in Google Search Console (formerly Webmaster Tools). Bing has a similar tool.
A common mistake is blocking crawler access to a retired site. That has the opposite of the intended effect: old URLs need to remain accessible to search engines for the redirects to be seen.
I'm building a service where people get notified by email when someone follows a link of the form www.domain.com/this_is_a_hash. The people who use this service can share the link in different places such as Twitter, Tumblr, Facebook, and more.
The main problem I'm having is that as soon as the link is shared on any of these platforms, a lot of requests to www.domain.com/this_is_a_hash hit my server. Each of these requests triggers a notification to the owner of this_is_a_hash, and of course this is not what I want. I only want notifications when real people visit the resource.
I found a very interesting article that talks about the huge number of requests a server receives when a link is posted to Twitter.
So what I need is to prevent search engines from hitting the "resource" URL: www.mydomain.com/this_is_a_hash
Any idea? I'm using rails 3.
Thanks!
If you don’t want these pages to be indexed by search engines, you could use a robots.txt to block these URLs.
User-agent: *
Disallow: /
(That would block all URLs for all user agents. You may want to put these URLs in a dedicated folder and block only that folder. You could also add forbidden URLs dynamically as they are created; however, some bots may cache robots.txt for a while, so they might not notice that a new URL should be blocked.)
It would, of course, only hold back those bots that are polite enough to follow the rules of your robots.txt.
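For instance, if the hash URLs all lived under a common prefix (say /l/, a hypothetical choice), the robots.txt could block only those while leaving the rest of the site crawlable:

```
User-agent: *
Disallow: /l/
```

With that layout, a shared link would look like www.domain.com/l/this_is_a_hash, and polite bots would skip it without you having to update robots.txt per link.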
If your users would copy&paste HTML, you could make use of the nofollow link relationship type:
<a href="http://www.domain.com/this_is_a_hash" rel="nofollow">cute cat</a>
However, this would not be very effective, as even some of those search engines that support this link type still visit the pages.
Alternatively, you could require JavaScript to be able to click the link, but that’s not very elegant, of course.
But I assume they only copy&paste the plain URL, so this wouldn’t work anyway.
So your only real option is to decide whether it's a bot or a human after the link has been clicked.
You could check user agents. You could analyze behaviour on the page (e.g., how long it takes until the first click). Or, if it's really important to you, you could force users to solve a CAPTCHA before seeing the page content at all. Of course, you can never catch all bots with such methods.
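The user-agent check could be sketched like this (the question uses Rails 3, so this Python version is only illustrative; the marker list and both helper names are assumptions, not from the original post):

```python
# Hypothetical sketch: decide whether a hit on /this_is_a_hash should
# trigger a notification, based only on the User-Agent header.
KNOWN_BOT_MARKERS = ("bot", "crawler", "spider", "facebookexternalhit", "twitterbot")

def looks_like_bot(user_agent: str) -> bool:
    """Return True if the User-Agent string matches a known crawler marker."""
    ua = user_agent.lower()
    return any(marker in ua for marker in KNOWN_BOT_MARKERS)

def should_notify(user_agent: str) -> bool:
    # Only notify the link owner for requests that do not look automated;
    # an empty User-Agent is treated as suspicious as well.
    return bool(user_agent) and not looks_like_bot(user_agent)
```

This only filters bots that announce themselves honestly; combining it with behavioural checks or a CAPTCHA catches more.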
You could use analytics on the pages, like Piwik. They try to differentiate users from bots, so that only users show up in the statistics. I’m sure most analytics tools provide an API that would allow sending out mails for each registered visit.
The contents of a commerce website (ASP.NET MVC) are regularly crawled by the competition. These people are programmers and they use sophisticated methods to crawl the site so identifying them by IP is not possible.
Unfortunately replacing values with images is not an option because the site should still remain readable by screen readers (JAWS).
My personal idea is to use robots.txt: prohibit crawlers from accessing one common URL on the page. This could be disguised as a normal item-detail link but hidden from normal users. Valid URL: http://example.com?itemId=1234. Prohibited: anything with an itemId under 128, e.g. http://example.com?itemId=123. If an IP requests a prohibited link, show a CAPTCHA validation.
A normal user would never follow such a link because it is not visible, and Google does not have to crawl it because it is bogus. The issue is that a screen reader would still read the link, and I don't think this would be effective enough to be worth implementing.
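A sketch of the proposed trap link (the markup details are additions, not from the original idea; note that aria-hidden and tabindex would also keep the link away from screen readers such as JAWS, addressing the concern above):

```html
<!-- Hidden honeypot link: itemIds below 128 are reserved as traps.
     display:none hides it from sighted users, aria-hidden removes it
     from the accessibility tree, tabindex keeps it out of tab order. -->
<a href="http://example.com?itemId=123"
   style="display:none" aria-hidden="true" tabindex="-1">item</a>
```

Any client that requests a trap URL despite it being invisible and disallowed is very likely a crawler.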
Your idea could possibly work for a few basic crawlers but would be very easy to work around. They would just need to use a proxy and request each link from a new IP.
If you allow anonymous access to your website, then you can never fully protect your data. Even if you manage to block crawlers with a lot of time and effort, your competitors could just have a human browse the site and capture the content with something like Fiddler. The best way to keep your data away from competitors is not to put it on a public part of your website.
Forcing users to log in might help matters, at least then you could pick up who is crawling your site and ban them.
As mentioned, it's not really possible to hide publicly accessible data from a determined user. However, since these are automated crawlers, you could make life harder for them by regularly altering the layout of your pages.
It is probably possible to use different master pages to produce the same (or similar) layouts, and you could swap in the master page on a random basis - this would make the writing of an automated crawler that bit more difficult.
I am about to reach the phase of protecting my content from crawlers as well.
I am thinking of limiting what an anonymous user can see of the website and requiring registration for full functionality.
example:
public ActionResult Index()
{
    if (User.Identity.IsAuthenticated)
        return RedirectToAction("IndexAll");

    // Show only limited content to anonymous visitors
    return View();
}

[Authorize(Roles = "Users")]
public ActionResult IndexAll()
{
    // Show everything
    return View();
}
Since you now know who your users are, you can punish any of them that turns out to be a crawler.
Recently, search engines have become able to index dynamic content on social networking sites. I would like to understand how this is done. Are there static pages created by a site like Facebook that update semi-frequently? Does Google attempt to store every possible user name?
As I understand it, a page like www.facebook.com/username is not an actual file stored on disk but shorthand for a query like "select username from users", whose result is displayed on the page. How does Google know about every user? This gets even more complicated when things like tweets are involved.
EDIT: I guess I didn't really ask what I wanted to know. Do I need to be as big as Twitter or Facebook for Google to build special ways to crawl my site? Will Google automatically find my users' profiles if I allow anyone to view them? If not, what do I have to do to make that work?
In the case of tweets in particular, Google isn't 'crawling' for them in the traditional sense; they've integrated with Twitter to provide the search results in real-time.
In the more general case of your question, dynamic content is not new to Facebook or Twitter, though it may seem to be. Google crawls a URL; the URL provides HTML data; Google indexes it. Whether it's a dynamic query that's rendering the page, or whether it's a cache of static HTML, makes little difference to the indexing process in theory. In practice, there's a lot more to it (see Michael B's comment below.)
And see Vartec's succinct post on how Google might find all those public Facebook profiles without actually logging in and poking around FB.
OK, that was vastly oversimplified, but let's see what else people have to say..
As far as I know, Google isn't able to read and store the actual contents of private profiles, because Googlebot doesn't have a Facebook account, and doing so would be a huge privacy breach.
The bot works by hitting facebook.com and then following every link it can find, storing whatever content it sees on the pages it reaches. So even if it follows a dynamic URL like www.facebook.com/username, it simply remembers whatever it saw there. Hopefully, in that particular case, that isn't all the private data of said user.
Additionally, facebook can and does provide special instructions that search bots can follow, so that google results don't include a bunch of login pages.
profiles can be linked to from outside the site;
the site may provide a sitemap.
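For illustration, a minimal XML sitemap that tells crawlers which profile pages exist might look like this (the URL is a placeholder, not Facebook's actual sitemap):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/username</loc>
    <changefreq>weekly</changefreq>
  </url>
</urlset>
```

Submitting such a file (e.g. via Search Console or a Sitemap line in robots.txt) is how a site of any size, not just Facebook or Twitter, lets Google discover its user-profile URLs without crawling for them.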
Is there a way to automatically append an identifier to a page URL when it is bookmarked in the browser, perhaps something in the document head that gives the browser a directive, or an onBookmark-style JavaScript event? I'm looking for ways to further segment my direct traffic in Google Analytics (if you have other ideas for doing that not related to bookmarks, please share them as well).
Example:
http://www.example.com/article
When bookmarked becomes:
http://www.example.com/article#bookmarked
I don't think you can reliably do what you're describing. The best you could do is have a button on your pages that calls window.external.AddFavorite with your hashed address. It will work for the people who use the button, which may be better than nothing.
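A minimal sketch of such a button (the URL and title are placeholders; window.external.AddFavorite only exists in older Internet Explorer, so it is wrapped in a feature check):

```html
<!-- Hypothetical bookmark button; AddFavorite is an IE-only API,
     so other browsers simply fall through and do nothing. -->
<a href="#" onclick="
    if (window.external && window.external.AddFavorite) {
        window.external.AddFavorite('http://www.example.com/article#bookmarked', 'Article');
    }
    return false;">Bookmark this page</a>
```

Visits arriving via the bookmarked URL then carry the #bookmarked fragment, which you could surface to analytics yourself, though the fragment alone is not sent to the server.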
Keep in mind that bookmark traffic is not simply part of your direct traffic.
There is a nice demonstration on Justin Cutroni's blog of how Google Analytics tracks bookmark visits; it turned out that, most of the time, bookmark visits show up as organic (assuming people first do a Google search and then bookmark your site).