Google web app to log page accesses to Google docs, slides, sheets, etc - google-docs-api

I have a range of Google docs that are publicly viewable, but I would like to get some information about how often they are being viewed. I understand that there used to be a way of doing this with Google Analytics, but now that has been removed.
It seems to me that I have two main options, one of which is to make all my doc links point to a page which redirects according to a query string parameter, e.g.:
http://myurl.net?page=1 # Sends you to one page and logs the visit
http://myurl.net?page=2 # Sends you to another page and logs the visit
Or alternatively, I could try to embed some code in each doc that makes a call back to the server with its page number. But I don't know if this is possible.
The first option looks like it should be fairly easy, but I don't see how to redirect the client.
Could anyone give me some ideas about how to do this? It seems it would be useful for quite a lot of people.
Many thanks.
Justin.
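The first option (a logging redirect) could be sketched roughly as below. This is only an illustration using Python's standard library, not an Apps Script solution; the `PAGES` mapping, the document URLs, and the in-memory `VISITS` list are all hypothetical placeholders.

```python
from datetime import datetime, timezone
from http.server import BaseHTTPRequestHandler
from urllib.parse import parse_qs, urlsplit

# Hypothetical mapping from the ?page=N parameter to a public document URL.
PAGES = {
    "1": "https://docs.google.com/document/d/EXAMPLE_ID_1/view",
    "2": "https://docs.google.com/presentation/d/EXAMPLE_ID_2/view",
}

# In-memory visit log (assumption: a real service would use a file or database).
VISITS = []


def resolve(url, visits=VISITS):
    """Return the redirect target for a request URL and log the visit.

    Returns None (and logs nothing) when the page parameter is unknown.
    """
    query = parse_qs(urlsplit(url).query)
    page = query.get("page", [None])[0]
    target = PAGES.get(page)
    if target is not None:
        visits.append((datetime.now(timezone.utc).isoformat(), page))
    return target


class RedirectHandler(BaseHTTPRequestHandler):
    """Answers GET /?page=N with a redirect to the mapped document."""

    def do_GET(self):
        target = resolve(self.path)
        if target is None:
            self.send_error(404, "Unknown page")
            return
        # 302 (temporary) so browsers keep asking the server and each visit is logged.
        self.send_response(302)
        self.send_header("Location", target)
        self.end_headers()
```

To try it locally, something like `HTTPServer(("", 8000), RedirectHandler).serve_forever()` (with `HTTPServer` imported from `http.server`) should serve the redirects on port 8000.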

Related

Get rid of old links to a retired website in Google search

I have a website that has been replaced by another website with a different domain name.
In Google search, I am able to find links to the pages on the old site, and I hope they will not show up in future Google search.
Here is what I did, but I am not sure whether it is correct or enough.
Access to any page on the old website will be immediately redirected to the homepage of the new website. There is no one-to-one page mapping between the two sites. Here is the code for the redirect on the old website:
<meta http-equiv="refresh" content="0;url=http://example.com" >
I went to Google Webmasters site. For the old website, I went to Fetch as Google, clicked "Fetch and Render" and "Reindex".
Really appreciate any input.
A few things you'll want to do here:
You need to use permanent server redirects, not meta refresh. Also I suggest you do provide one-to-one page mapping. It's a better user experience, and large numbers of redirects to root are often interpreted as soft 404s. Consult Google's guide to site migrations for more details.
Rather than Fetch & Render, use Google Search Console's (Webmaster Tools) Change of Address tool. Bing has a similar tool.
A common mistake is blocking crawler access to a retired site. That has the opposite of the intended effect: old URLs need to be accessible to search engines for the redirects to be "seen".
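As a rough sketch of what one-to-one permanent redirects mean in practice: each old path maps to its closest equivalent on the new site and is answered with a 301, while unmapped paths get a real 404 rather than a redirect to the homepage. The domain, the `PAGE_MAP` entries, and the function name are hypothetical, and in most setups this logic would live in the web server configuration rather than in application code.

```python
# Hypothetical new domain and old-path -> new-path mapping.
NEW_SITE = "https://new.example.com"
PAGE_MAP = {
    "/about.html": "/about",
    "/products/widget.html": "/shop/widget",
}


def redirect_for(old_path):
    """Return (status, location) for a request to the retired site.

    Mapped pages get a permanent (301) redirect to their counterpart.
    Unmapped pages return 404 instead of being sent to the homepage,
    since mass redirects to root may be treated as soft 404s.
    """
    new_path = PAGE_MAP.get(old_path)
    if new_path is None:
        return 404, None
    return 301, NEW_SITE + new_path
```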

Avoid robots from going into a www.domain.com/thishash when link posted to twitter, facebook

I'm building a service where people get notified (by mail) when someone follows a link with the format www.domain.com/this_is_a_hash. The people who use this service can share this link in different places such as Twitter, Tumblr, Facebook and more...
The main problem I'm having is that as soon as the link is shared on any of these platforms, a lot of requests to www.domain.com/this_is_a_hash come in to my server. Each time one of these requests hits my server, a notification is sent to the owner of this_is_a_hash, which of course is not what I want. I just want to get notifications when real people visit this resource.
I found a very interesting article here that talks about the huge amount of request a server receives when posting to twitter...
So what I need is to keep search engines from hitting the "resource" URL, www.mydomain.com/this_is_a_hash.
Any ideas? I'm using Rails 3.
Thanks!
If you don’t want these pages to be indexed by search engines, you could use a robots.txt to block these URLs.
User-agent: *
Disallow: /
(That would block all URLs for all user-agents. You may want to add a folder and block only the URLs inside it. Or you could add the forbidden URLs dynamically as they get created; however, some bots cache the robots.txt for some time, so they might not notice that a new URL should be blocked, too.)
It would, of course, only hold back those bots that are polite enough to follow the rules of your robots.txt.
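The folder variant mentioned above could look like the sketch below, which also checks the rules with Python's standard `urllib.robotparser`. The `/n/` prefix and the `ExampleBot` agent name are assumptions made for illustration; the original URLs sit at the site root, so they would have to move under such a folder for this to work.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks only the notification links under /n/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /n/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path, agent="ExampleBot"):
    """Return True if a rule-abiding crawler may fetch the given path."""
    return parser.can_fetch(agent, path)
```

With these rules, `allowed("/n/this_is_a_hash")` is false while `allowed("/about")` is true, but only for crawlers that actually honor robots.txt.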
If your users would copy&paste HTML, you could make use of the nofollow link relationship type:
<a href="…" rel="nofollow">cute cat</a>
However, this would not be very effective, as even some of those search engines that support this link type still visit the pages.
Alternatively, you could require JavaScript to be able to click the link, but that’s not very elegant, of course.
But I assume they only copy&paste the plain URL, so this wouldn’t work anyway.
So the only chance you have is to decide if it’s a bot or a human after the link got clicked.
You could check for user-agents. You could analyze the behaviour on the page (e.g. how long it takes for the first click). Or, if it’s really important to you, you could force the users to enter a CAPTCHA to be able to see the page content at all. Of course you can never catch all bots with such methods.
You could use analytics on the pages, like Piwik. They try to differentiate users from bots, so that only users show up in the statistics. I’m sure most analytics tools provide an API that would allow sending out mails for each registered visit.
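The user-agent check suggested above might be sketched like this. The marker list is an assumption and far from complete, and a user agent can be spoofed, so this only filters out crawlers that identify themselves.

```python
# Substrings that commonly appear in crawler user agents (assumed, incomplete list).
BOT_MARKERS = ("bot", "crawler", "spider", "facebookexternalhit", "slurp")


def looks_like_bot(user_agent):
    """Heuristic: treat a missing or crawler-like User-Agent as a bot."""
    if not user_agent:
        return True
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_MARKERS)


def should_notify(user_agent):
    """Send the owner a notification only for requests that look human."""
    return not looks_like_bot(user_agent)
```

In the service described in the question, a check like `should_notify(request_user_agent)` would run before the mail is sent.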

How do search engines see dynamic profiles?

Recently search engines have been able to page dynamic content on social networking sites. I would like to understand how this is done. Are there static pages created by a site like Facebook that update semi-frequently? Does Google attempt to store every possible user name?
As I understand it, a page like www.facebook.com/username is not an actual file stored on disk but is shorthand for a query like: select username from users and display the information on the page. How does Google know about every user? This gets even more complicated when things like tweets are involved.
EDIT: I guess I didn't really ask what I wanted to know about. Do I need to be as big as Twitter or Facebook in order for Google to make special ways to crawl my site? Will Google automatically find my users' profiles if I allow anyone to view them? If not, what do I have to do to make that work?
In the case of tweets in particular, Google isn't 'crawling' for them in the traditional sense; they've integrated with Twitter to provide the search results in real-time.
In the more general case of your question, dynamic content is not new to Facebook or Twitter, though it may seem to be. Google crawls a URL; the URL provides HTML data; Google indexes it. Whether it's a dynamic query that's rendering the page, or whether it's a cache of static HTML, makes little difference to the indexing process in theory. In practice, there's a lot more to it (see Michael B's comment below.)
And see Vartec's succinct post on how Google might find all those public Facebook profiles without actually logging in and poking around FB.
OK, that was vastly oversimplified, but let's see what else people have to say..
As far as I know Google isn't able to read and store the actual contents of profiles, because the Google bot doesn't have a Facebook account, and it would be a huge privacy breach.
The bot works by hitting facebook.com and then following every link it can find. Whatever content it sees on the page it hits, it stores. So even if it follows a dynamic url like www.facebook.com/username, it will just remember whatever it saw when it went there. Hopefully in that particular case, it isn't all the private data of said user.
Additionally, Facebook can and does provide special instructions that search bots can follow, so that Google results don't include a bunch of login pages.
profiles can be linked from outside;
site may provide sitemap
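For a smaller site, providing a sitemap of public profile pages could be sketched as follows. The base URL and usernames are made up; the XML namespace is the one defined by the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(base_url, usernames):
    """Return sitemap XML listing one public profile URL per username."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for name in usernames:
        url = ET.SubElement(urlset, "url")
        # quote() escapes characters that are not safe in a URL path.
        ET.SubElement(url, "loc").text = f"{base_url}/{quote(name)}"
    return ET.tostring(urlset, encoding="unicode")
```

For example, `build_sitemap("https://example.com", ["alice", "bob"])` yields a `urlset` with two `url` entries, which could be written to a file and submitted through the search engine's webmaster tools.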

How would I find all the short urls that link to a particular long url?

Basically I want to know how many people have tweeted a link to a URL, but since there are dozens of link shorteners out there, I don't see any way to do this without having access to all of their URL maps. I found a previous question here, but it was over a year old and didn't have any new answers.
So #1, does anyone know of a service/API for doing this?
And #2, can anyone think of a way to accomplish this task other than submitting the long url in question to all the popular link shortening sites?
ps- I'm also open to comments about why this is impossible or impractical.
You could perform a Google search (or the equivalent via API) for any pages that link to your page. This is done with the link: keyword. So if you're trying to figure out how many people link to www.example.com (regardless of whether it's through a link shortener URL), then you would just do a Google search for link:www.example.com.
e.g.: http://www.google.com/search?q=link:www.example.com
Note that this will only find pages that have been indexed, so pages that haven't been crawled, or pages that get crawled infrequently, will not show up in the results until a later date (if at all).
Since all sites have different algorithms for shortening the URLs, and these are different sites that most likely do not share their data with each other, how can you hope to find all of them in a single or small number of queries?
All you can do is brute-force it, and even then this might not be any good if a site is content to create a new value for the same long-form URL (especially if you send a different long-form URL that maps to the same place, like http://www.stackoverflow.com/ rather than http://stackoverflow.com/).
In order to really get this to work, there would have to be a site that ALREADY automatically collects all of this information from every site, which the URL shortening sites voluntarily call. And even if you wrote such a site, that doesn't account for the URL-shortening sites already out there who already have data!
In short, I do not see how this is remotely possible, unless I'm wrong about there being such a database somewhere out there.
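The problem of different long-form URLs pointing to the same place can be illustrated with a small normalization sketch. Treating the www and non-www hosts as equivalent, and ignoring the scheme and port, are assumptions that happen to hold for the example above but not for every site.

```python
from urllib.parse import urlsplit


def normalize(url):
    """Reduce a URL to a comparison key: lowercased host without 'www.', plus path and query.

    Assumption: scheme, port and the 'www.' prefix do not change the destination.
    """
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path or "/"
    query = "?" + parts.query if parts.query else ""
    return host + path + query
```

Two long URLs would then be counted as the same target only when their keys match, which still says nothing about how many short URLs each shortener has issued for them.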
So months after asking this question I came across a solution to a similar question, that is how to tell how many times a link has been shared on facebook. The solution, via a simple new API call:
http://graph.facebook.com/http://stackoverflow.com
returns the following json data:
{
"id": "http://stackoverflow.com",
"shares": 1627
}
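Reading the share count out of a response shaped like the one above could be sketched as follows. The snippet only parses JSON text and makes no network call; whether the endpoint still responds this way is not something this sketch assumes.

```python
import json

# Sample body copied from the response shown above.
SAMPLE_RESPONSE = '{"id": "http://stackoverflow.com", "shares": 1627}'


def share_count(body):
    """Return the 'shares' value from a JSON response body, or 0 if it is absent."""
    data = json.loads(body)
    return int(data.get("shares", 0))
```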

How-To get private pages being crawled by google

How can I get private pages of my web site crawled and indexed by Google?
Maybe it's not very "conventional", but I want my private page "links" displayed in the Google index, and then require registration to display the page.
EDIT: Based on the addition of "Maybe it's not very "conventional", but I want my private page "links" displayed in the Google index, and then require registration to display the page." to the question:
You can check the User Agent in your PHP code to let Google see the pages as if it were a registered user (Googlebot's user agent string contains "Googlebot/2.1", and you can search to find user agents for other common engines).
However, this behavior is specifically against Google's rules, and they can and will remove your site from the index if they catch you doing it. Their policy is that you should not treat Googlebot any differently than you treat any random person who visits your site.
(Original Answer) One way is to use a sitemap to show google how to find all of your pages.
In general, and even in the case of sitemaps, if the content you want indexed is not linked to from a page that can be found through the "root" (/) (i.e. there is no way for the public to find it), then it probably won't get indexed. The only way to get it indexed is to link it in someplace.
The question is though, why do you want your private pages in google anyway?
They'll get crawled if and only if they're publicly accessible and your robots.txt file allows it. That's pretty much all you need to do.
Are you asking how to get Google to index your pages?
There are a couple of ways. You need to ensure that the pages are properly optimised for search engines (SEO), with title text and description keywords in your metadata.
You can also submit your site to Google. It's a free service, and your site will be placed in a queue of things that Google will index. It may take some time, though.
By far the best way to get your pages indexed is using the meta data in the pages themselves.
Google will only index what is
linked from somewhere already in Google's index
accessible to its crawler via normal (unauthenticated) HTTP
It will also
make the contents available in search results to anyone.
This may conflict with your idea of a "private" page.
I'm going to assume that all the other previous answerers are misunderstanding you. As I read it, you aren't asking how to get Google to index your pages, but rather how to get a list of all the pages that Google currently has already indexed on your site? If that is true, you should have a look at the Google Webmaster Tools.

Resources