Recently, search engines have been able to index dynamic content on social networking sites. I would like to understand how this is done. Are there static pages created by a site like Facebook that are updated semi-frequently? Does Google attempt to store every possible user name?
As I understand it, a page like www.facebook.com/username is not an actual file stored on disk but shorthand for a query like "select this username from the users table", with the result displayed on the page. How does Google know about every user? This gets even more complicated when things like tweets are involved.
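Something like the following is what I imagine happening on the server (a rough PHP sketch of my mental model only; the table and column names are guesses):

    <?php
    // Hypothetical front controller: a rewrite rule sends every request for
    // /<username> here instead of to a real file on disk.
    $username = trim($_SERVER['REQUEST_URI'], '/');   // e.g. "username" from /username

    $pdo  = new PDO('mysql:host=localhost;dbname=social', 'dbuser', 'dbpass');
    $stmt = $pdo->prepare('SELECT * FROM users WHERE username = ?');
    $stmt->execute([$username]);
    $profile = $stmt->fetch(PDO::FETCH_ASSOC);

    if ($profile === false) {
        http_response_code(404);
        exit('No such profile');
    }

    // The HTML that Google (or anyone else) sees is generated on the fly.
    echo '<h1>' . htmlspecialchars($profile['username']) . '</h1>';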
EDIT: I guess I didn't really ask what I wanted to know. Do I need to be as big as Twitter or Facebook for Google to build special ways of crawling my site? Will Google automatically find my users' profiles if I allow anyone to view them? If not, what do I have to do to make that work?
In the case of tweets in particular, Google isn't 'crawling' for them in the traditional sense; they've integrated with Twitter to provide the search results in real-time.
In the more general case of your question, dynamic content is not new to Facebook or Twitter, though it may seem to be. Google crawls a URL; the URL provides HTML data; Google indexes it. Whether it's a dynamic query that's rendering the page, or whether it's a cache of static HTML, makes little difference to the indexing process in theory. In practice, there's a lot more to it (see Michael B's comment below.)
And see Vartec's succinct post on how Google might find all those public Facebook profiles without actually logging in and poking around FB.
OK, that was vastly oversimplified, but let's see what else people have to say.
As far as I know, Google isn't able to read and store the actual contents of profiles, because the Googlebot doesn't have a Facebook account, and doing so would be a huge privacy breach.
The bot works by hitting facebook.com and then following every link it can find. Whatever content it sees on the page it hits, it stores. So even if it follows a dynamic URL like www.facebook.com/username, it will just remember whatever it saw when it went there. Hopefully, in that particular case, that isn't all the private data of said user.
Additionally, Facebook can and does provide special instructions that search bots can follow, so that Google results don't include a bunch of login pages.
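Those instructions typically live in a robots.txt file at the site root. A generic example (not Facebook's actual file) looks something like this:

    User-agent: *
    Disallow: /login.php
    Disallow: /private/
    Sitemap: https://www.example.com/sitemap.xml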
Profiles can be linked to from outside sites;
the site may also provide a sitemap (see the sketch below).
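For the sitemap idea, a minimal sketch of how a site could expose its public profile URLs to crawlers (assuming PHP and a hypothetical users table):

    <?php
    // sitemap.php - emits a sitemap of public profile URLs (hypothetical schema).
    header('Content-Type: application/xml; charset=utf-8');

    $pdo  = new PDO('mysql:host=localhost;dbname=social', 'dbuser', 'dbpass');
    $rows = $pdo->query('SELECT username FROM users WHERE is_public = 1');

    echo '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
    echo '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";
    foreach ($rows as $row) {
        echo '  <url><loc>https://www.example.com/'
           . rawurlencode($row['username']) . "</loc></url>\n";
    }
    echo '</urlset>';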
I've been looking at some login forms lately, and I noticed how many different types of cookies they use. I haven't been able to work out what they mean. I get the idea that cookies give the server information about the current state, but why so many? This is a screenshot of a popular website's login POST request. As you can see, there are many. How do they differ, and what information do they reveal when inspected?
I'm not very good at reading cookies, but I think Google thinks you're from Slovenia. ;)
But most of these cookies don't contain much relevant data at all. They're just some ID, so that the server knows the next request is from you again. The string of garbage is just an identifier: after you navigate from the login page to another page on the same domain, the server knows that it is still you.
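A classic session cookie works roughly like this (a minimal PHP sketch, not any particular site's real code):

    <?php
    // On every request, start (or resume) the session. PHP sends back a cookie
    // such as PHPSESSID=9f3c1c0b... - an opaque ID, not your credentials.
    session_start();

    if (!isset($_SESSION['user_id'])) {
        // First request after a (hypothetical) successful login:
        // remember which user this session belongs to.
        $_SESSION['user_id'] = 42;
    }

    // On the next request the browser sends the same PHPSESSID cookie back,
    // so the server can look the session up and knows "it is still you".
    echo 'Logged in as user #' . $_SESSION['user_id'];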
Not all cookies are for the login feature of the site itself, some are for tracking purposes:
__gads is for DoubleClick, related to Google advertising.
__ssid is for SiteSpect, a tool for A/B testing.
_ga, I think, is for Google Analytics, a tool for gathering website statistics.
and so on.
But still, these cookies normally don't contain user names, passwords or other sensitive information. They are just there to 'anonymously' track you: they gather aggregate data saying that people who visited page X also visited page Y, and from that, companies like Google build profiles to show targeted advertising.
I have a range of Google docs that are publicly viewable, but I would like to get some information about how often they are being viewed. I understand that there used to be a way of doing this with Google Analytics, but now that has been removed.
It seems to me that I have two main options, one of which is to make all my doc links point to a page which redirects according to a query string parameter, e.g.:
http://myurl.net?page=1 # Sends you to one page and logs the visit
http://myurl.net?page=2 # Sends you to another page and logs the visit
Or alternatively, I could try to embed some code in each doc that makes a call back to the server with its page number. But I don't know if this is possible.
The first option looks like it should be fairly easy, but I don't see how to redirect the client.
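To make the first option concrete, something like this is roughly what I'm picturing (a rough PHP sketch with placeholder document URLs), though I'm not sure it's the right approach:

    <?php
    // redirect.php?page=1 - logs the visit, then forwards to the real document.
    $targets = [
        1 => 'https://docs.google.com/document/d/DOC_ID_1/view',   // placeholder IDs
        2 => 'https://docs.google.com/document/d/DOC_ID_2/view',
    ];

    $page = (int) ($_GET['page'] ?? 0);
    if (!isset($targets[$page])) {
        http_response_code(404);
        exit('Unknown document');
    }

    // Log the visit (a flat file here; a database would do just as well).
    file_put_contents('views.log', date('c') . " page=$page\n", FILE_APPEND);

    // Redirect the client on to the actual document.
    header('Location: ' . $targets[$page], true, 302);
    exit;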
Could anyone give me some ideas about how to do this? It seems it would be useful for quite a lot of people.
Many thanks.
Justin.
I recently decided to start taking advantage of rich snippets to improve my personal website's content for the search engines and, IMHO most importantly, for the site readers – hi, Mam! ;-). One of these is Google Authorship. Personally, I think the idea behind Google Authorship is a sound one: it helps to bring a sense of identity, personality and, arguably most importantly, credibility to what is still largely an anonymous web.
Normally, I would link my article to Google Authorship using the following line of HTML:
<A REL="author" HREF="https://plus.google.com/112431363835029530079?rel=author">Jordan Clark</A>
However, for a website that publishes articles written by multiple authors, manually entering each author's Google+ UID string quickly becomes a tiresome process.
Is it valid to do the following?
(a) Link to the author like so, using the script "author.php" (or another type of server-side script):
<A REL="author" HREF="/author.php?by=Alice&rel=author/[UID]?rel=author">Alice</A>
(b) The "author.php" script simply does a quick lookup of Alice's (or whichever author's) Google user ID string and then uses a simple HTTP redirect header to pass this data on to Google.
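In other words, something along these lines (only a sketch; the author-to-UID map is hypothetical):

    <?php
    // author.php?by=Alice - redirects to that author's Google+ profile.
    $authors = [
        'Alice'  => 'https://plus.google.com/ALICE_UID',   // placeholder UID
        'Jordan' => 'https://plus.google.com/112431363835029530079',
    ];

    $by = $_GET['by'] ?? '';
    if (!isset($authors[$by])) {
        http_response_code(404);
        exit('Unknown author');
    }

    header('Location: ' . $authors[$by] . '?rel=author', true, 301);
    exit;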
What I would like to know is:
Is it okay to use a local script to redirect to your Google+ user profile? (i.e. will it affect the PageRank of already indexed pages, or have any other unforeseen negative effects on new and indexed pages?)
Why do I not see more people linking with Google’s “prettified” version:
http://profiles.google.com/clarky.y2k?rel=author
Are there any drawbacks to using the “prettified” version of this method?
Ideally, I would like to use the intermediate PHP script, as I have already described above (see part 1). However, any tips, suggestions or other ways you may have implemented on your websites are very welcome!
For item (1), you can maintain your own app's profile pages (author.php in your case) for your authors. On each of those profile pages (author.php), you add a link from that page to Google+ and specify the rel="me" attribute on that link. So Alice's profile page might say something like "Find Alice on Google+".
This indirect authorship linking is supported. You also need the link from Alice's Google+ profile that lists her as a contributor to your site. Once the linking is set up in both directions, authorship can start to show up. Authorship won't always display in all cases, and it can take some time to start appearing, as Google needs to reindex your pages.
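So, rather than redirecting, author.php could render a small on-site profile page along these lines (a sketch only; the data and UID are placeholders):

    <?php
    // author.php?by=Alice - an on-site author profile page (hypothetical data).
    $authors = ['Alice' => 'https://plus.google.com/ALICE_UID'];   // placeholder UID
    $by = $_GET['by'] ?? '';
    if (!isset($authors[$by])) {
        http_response_code(404);
        exit('Unknown author');
    }
    ?>
    <h1>Articles by <?= htmlspecialchars($by) ?></h1>
    <!-- rel="author" links on the articles point at this page; the rel="me"
         link below points on to Google+, completing the two-way loop. -->
    <a rel="me" href="<?= htmlspecialchars($authors[$by]) ?>">Find <?= htmlspecialchars($by) ?> on Google+</a>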
For item (2), I don't think the profiles URL will enable authorship. Some people use that URL as a vanity URL, but as far as I know it isn't supported for use with things like authorship, badges, etc.
You should test if your redirects are followed using the Rich Snippets Testing Tool: http://www.google.com/webmasters/tools/richsnippets
rel="author" is no longer supported.
I'm currently writing a shop-related site that has its own community on different social networks. While posting to VKontakte and Facebook is less of an issue (I understand the concept of a "group", and VK actually has an option to write posts using the group's name), Twitter is more troublesome.
Two questions:
Is there even such a thing as "groups" on Twitter? The closest I have seen is lists and timelines, but neither appears to solve my issue.
I cannot give the operator access to the Twitter account. VK has a specific option, when posting in a group, to use the group's name as the poster name. How does this work on Twitter?
I need something akin to what lamoda has set up. (It appears to be a user, and every post is labelled as written by that user; however, I doubt they give their ops access to the actual Twitter account.)
P.S.: I'm already done with getting past OAuth and using REST to actually post, thus no code provided. I'm just having trouble with the statuses/update.json call, if that's what I should actually be using.
Talk about simple solutions to simple problems.
It appears I have been overcomplicating things. There are no groups on Twitter, nor even comments for that matter. You can only post to your own feed or re-post from somebody else's.
Posting to an account's feed (a shop account's, say) is simple enough using that account's pre-generated access token, which can be stored in the config.
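For example, with the abraham/twitteroauth PHP library (any OAuth 1.0a client works the same way), posting as the shop account looks roughly like this; the keys are of course placeholders:

    <?php
    require 'vendor/autoload.php';

    use Abraham\TwitterOAuth\TwitterOAuth;

    // App credentials plus the shop account's pre-generated access token,
    // all read from config rather than typed in by an operator.
    $connection = new TwitterOAuth(
        'CONSUMER_KEY', 'CONSUMER_SECRET',
        'SHOP_ACCESS_TOKEN', 'SHOP_ACCESS_TOKEN_SECRET'
    );

    // This is the call behind the statuses/update.json endpoint.
    $result = $connection->post('statuses/update', [
        'status' => 'New arrivals in the shop today!',
    ]);

    if ($connection->getLastHttpCode() !== 200) {
        error_log('Tweet failed: ' . json_encode($result));
    }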
How can I get private pages of my web site crawled and indexed by Google?
Maybe it's not very "conventional", but I want links to my private pages displayed in the Google index, while still requiring registration to actually display the pages.
EDIT: Based on the addition of "maybe it's not very conventional, but I want links to my private pages displayed in the Google index, while still requiring registration to display the pages" to the question:
You can check the User-Agent in your PHP code to allow Google to see pages as if it were a registered user (Googlebot's user-agent string contains "Googlebot"; you can search to find the user agents of other common engines).
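A bare-bones sketch of that check (the helper functions are hypothetical, and note the caveat in the next paragraph):

    <?php
    // WARNING: serving different content to Googlebot than to visitors is cloaking
    // and is against Google's guidelines (see below). Sketch for illustration only.
    $ua = $_SERVER['HTTP_USER_AGENT'] ?? '';
    $isGooglebot = stripos($ua, 'Googlebot') !== false;

    if ($isGooglebot || user_is_logged_in()) {   // user_is_logged_in() is hypothetical
        show_private_page();                      // hypothetical
    } else {
        header('Location: /register.php', true, 302);
        exit;
    }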
However, this behavior is specifically against Google's rules, and they can and will remove your site from the index if they catch you doing it. Their policy is that you should not treat Googlebot any differently than you treat any random person who visits your site.
(Original Answer) One way is to use a sitemap to show Google how to find all of your pages.
In general, and even in the case of sitemaps, if the content you want indexed is not linked to from a page that can be reached from the root (/), i.e. there is no way for the public to find it, then it probably won't get indexed. The only way to get it indexed is to link to it from somewhere.
The question is though, why do you want your private pages in google anyway?
They'll get crawled if and only if they're publicly accessible and your robots.txt file allows it. That's pretty much all you need to do.
Are you asking how to get Google to index your pages?
There are a couple of ways. You need to ensure the pages are properly search engine optimised (SEO), with title text and description keywords in your metadata.
You can also submit your site to Google; it's a free service, and your site will be placed in a queue of things for Google to index. It may take some time, though.
By far the best way to get your pages indexed is to use the metadata in the pages themselves.
Google will only index what is:
linked from somewhere already in Google's index;
accessible to its crawler via normal (unauthenticated) HTTP.
It will also make the contents available in search results to anyone. This may conflict with your idea of a "private" page.
I'm going to assume that the other answerers are misunderstanding you. As I read it, you aren't asking how to get Google to index your pages, but rather how to get a list of all the pages Google has already indexed on your site. If that's the case, you should have a look at Google Webmaster Tools.