We have a relatively large website, and looking at Google Search Console we have found a lot of strange errors. By a lot, I mean 199 URLs that give a 404 response.
My problem is that I don't think these URLs can be found on any of our pages, even though we have a lot of dynamically generated content.
Because of this, I wonder whether these are URLs the crawler found on our pages, or simply requests made to our site, like mysite.com/foobar, which would obviously return a 404.
Google Search Console reports all backlinks to your website that deliver a 404, regardless of whether a page has ever existed at that URL.
When you click on a URL in the list of pages with errors, a pop-up window will give you details. There is a "Linked from" tab listing all (external) links to that page.
(Some of the entries can be outdated. But if these links still exist, try to get them updated or set up redirects for them. The goal in the end is to improve the user experience.)
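If only a handful of inbound URLs are affected, a small redirect map is usually enough. Here is a minimal sketch in Python using Flask; the paths are made-up examples, so substitute the URLs reported under "Linked from":

```python
# Minimal sketch (Flask): permanently redirect a few known-broken inbound URLs.
# The paths below are made-up examples; replace them with the URLs reported
# under "Linked from" in Search Console.
from flask import Flask, redirect

app = Flask(__name__)

# old (broken) path -> current path; example values only
REDIRECTS = {
    "/old-landing-page": "/landing-page",
    "/promo-2019": "/promotions",
}

for old_path, new_path in REDIRECTS.items():
    # A 301 tells browsers and crawlers that the move is permanent.
    app.add_url_rule(
        old_path,
        endpoint="redirect_" + old_path,
        view_func=lambda target=new_path: redirect(target, code=301),
    )
```

The same mapping could equally live in your web server configuration; the point is simply that each broken inbound URL gets a permanent redirect to a real page.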
I have a website that has been replaced by another website with a different domain name.
In Google search, I am able to find links to the pages on the old site, and I hope they will not show up in future Google searches.
Here is what I did, but I am not sure whether it is correct or enough.
Access to any page on the old website will be immediately redirected to the homepage of the new website. There is no one-to-one page mapping between the two sites. Here is the code for the redirect on the old website:
<meta http-equiv="refresh" content="0;url=http://example.com">
I went to the Google Webmasters site. For the old website, I went to Fetch as Google, clicked "Fetch and Render" and then "Reindex".
Really appreciate any input.
A few things you'll want to do here:
You need to use permanent (301) server redirects, not a meta refresh. I also suggest you provide a one-to-one page mapping: it's a better user experience, and large numbers of redirects to the root are often interpreted as soft 404s. Consult Google's guide to site migrations for more details (a minimal sketch of such a redirect follows after these points).
Rather than Fetch & Render, use Google Search Console's (Webmaster Tools) Change of Address tool. Bing has a similar tool.
A common mistake is blocking crawler access to a retired site. That has the opposite of the intended effect: old URLs need to remain accessible to search engines for the redirects to be "seen".
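To illustrate the first point, here is a minimal sketch of a path-preserving 301 redirect, assuming the old and new sites share the same URL structure. It is written in Python with Flask purely for illustration (the same idea is usually expressed in the web server configuration), and https://example.com is a placeholder for the new domain:

```python
# Minimal sketch: path-preserving 301 redirects for a site migration.
# Assumes this app answers requests on the *old* domain and that the new
# site (https://example.com here, a placeholder) uses the same paths.
from flask import Flask, redirect, request

app = Flask(__name__)
NEW_ORIGIN = "https://example.com"  # placeholder for the new domain

@app.route("/", defaults={"path": ""})
@app.route("/<path:path>")
def migrate(path):
    # Preserve path and query string so each old URL maps to its new
    # counterpart, rather than sending everything to the homepage
    # (which risks being treated as a soft 404).
    query = request.query_string.decode()
    target = NEW_ORIGIN + "/" + path + ("?" + query if query else "")
    return redirect(target, code=301)
```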
I noticed that quite a lot of Dropbox pages are indexed by Google, Bing, etc., and was wondering how these search engines obtain links like these, for instance:
https://dl.dropboxusercontent.com/s/85cdji4d5pl5qym/37-71.pdf
https://dl.dropboxusercontent.com/u/11421929/larin2014.pdf
Given that there are no links on dl.dropboxusercontent.com to follow and the path structure is not that easy to guess, how is it possible that a search engine obtains such a link?
One possibility might be that a link was posted on a forum and picked up by the search engine, but I looked up quite a lot of the links and checked the backlinks without success. I also noticed that Bing and Yahoo show considerably more results than Google, which would mean that Bing does a better job of picking up these links, which seems unlikely to me.
Even if the document is really unlinked (no link on their site, no link on someone else's site, no sitemap, no Referer log from a site that gets linked in the document, etc.), it's still possible for search engines to find the link.
Two ways are:
Someone could submit the URL to a search engine (whether via a public tool, or via the site’s webmaster account).
The search engine could get all URLs that certain users visit in their browsers. This could, for example, happen when the user has installed a toolbar from that search engine. This is the case with Bing; see my related answer on Webmasters SE:
Microsoft has confirmed that they do discover and index URLs that they find through users surfing the Internet with the Bing Toolbar installed.
And there might be more ways, of course.
Really strange. One of my posts is being tracked in Google Analytics half the time as its correct permalink, while the other half of the pageviews are attributed to the same URL with a trailing forward slash added, under the same Page Title.
Example:
Title of Page: Official iPhone Unlock
Correct URL of the page: /official-iphone-unlock
Two URLs being tracked with that page title: /official-iphone-unlock and /official-iphone-unlock/
So, needless to say, this is throwing off my numbers, since I'm getting pageviews for this page under both URLs, and it's really hard to figure out what the issue is. I'm using the ECWID shopping cart, and I suspect it's their way of tracking things, but I can't prove it; the issue did start around the time I enabled their tracking code.
Have you tried segmenting the traffic for these pages by browser?
First find the page:
Behavior > Site Content > All Pages (then search for your pages)
...then cross-drill by browser segment:
Secondary Dimension > Visitors > Browsers
One possibility that comes to mind is that some browsers may auto-append a slash to the end of URLs without a file extension, while others may not. For example, Chrome forwards a /foo URL to /foo/ for me. It may be that only specific versions of a browser exhibit this behavior, like IE9 for example.
You can implement a filter to remove the trailing slashes; see https://www.petramanos.com/ecommerce-google-analytics/remove-slashes-end-urls-google-analytics
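If you'd rather confirm the impact before changing anything in Analytics, you can merge the two variants when post-processing an exported report. A minimal sketch in Python; the file name and column names ("Page", "Pageviews") are assumptions, so adjust them to match your actual export:

```python
# Minimal sketch: merge pageview counts for /foo and /foo/ in an exported
# report. The file name and column names are assumptions; adapt as needed.
import csv
from collections import defaultdict

def normalize(path: str) -> str:
    # Treat /official-iphone-unlock and /official-iphone-unlock/ as one page.
    return path.rstrip("/") or "/"

totals = defaultdict(int)
with open("pages_export.csv", newline="") as f:
    for row in csv.DictReader(f):
        totals[normalize(row["Page"])] += int(row["Pageviews"].replace(",", ""))

for page, views in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(page, views)
```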
I am replacing lots of pages in my database at once, and many pages that are indexed by Google will get new URLs. As a result, the old URLs will end up at a 404 page.
So I need to design a new 404 page with a search box in it. I also want the 404 page to grab the keywords from the broken URL in the address bar and show search results based on those keywords, so that the user has an idea of where to find the new link.
Old URL:
http://abc.com/123-good-books-on-rails
New URL:
http://abc.com/good-books-on-rails
Then, when a user comes from a search engine, they land on the old URL. The 404 page will search for the keywords "good books on rails" and return a list of results, so the user can find the current URL for that content.
How do I implement this? I will be using Friendly ID, Sphinx and Rails 2.3.8.
Thanks.
You are far better off simply generating the appropriate redirects yourself than expecting your users to do anything weird when a Google link fails. This won't be needed indefinitely: Google will eventually reindex you. If you use 301 (permanent) redirects, Google will be smart enough to drop the old URLs when it reindexes your site. If you don't want to manually create redirects for hundreds of pages, then you'll need to work out the algorithm for how your old URLs map to the new ones.
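For the example URLs above, that algorithm is just "strip the leading numeric id from the slug". A minimal sketch of the idea, shown in Python for brevity (the same logic translates directly to a Rails route or Rack middleware); the regex assumes old slugs look like 123-good-books-on-rails:

```python
# Minimal sketch: derive the new URL from the old one by stripping the
# leading numeric id, then issue a 301. Assumes old slugs look like
# "123-good-books-on-rails"; adapt the pattern to your actual scheme.
import re

OLD_SLUG = re.compile(r"^/(\d+)-(?P<slug>.+)$")

def redirect_target(old_path: str):
    """Return the new path for an old-style path, or None if it doesn't match."""
    match = OLD_SLUG.match(old_path)
    return "/" + match.group("slug") if match else None

assert redirect_target("/123-good-books-on-rails") == "/good-books-on-rails"
# In the web layer, respond with "301 Moved Permanently" and a Location
# header pointing at redirect_target(request.path) whenever it is not None.
```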
I know the Google Search Appliance has access to this information about broken links (as this factors into the PageRank algorithm), but is there a way to export it from the crawler appliance?
External tools won't work because a significant portion of the content is for a corporate intranet.
There might be something available from Google, but I have never checked. I usually use the link checker provided by the W3C. It can also detect redirects, which is useful if your server handles missing pages by redirecting instead of returning a 404 status code.
You can use Google Webmaster Tools to view, among other things, broken links on your site.
This won't show you broken links to external sites though.
It seems that this is not possible. Under Status and Reports > Crawl Diagnostics there are two styles of report available: the directory drill-down "Tree View" and the 100-URLs-at-a-time "List View". Some people have tried creating programs to page through the List View, but this seems to fail after a few thousand URLs.
My advice is to use your server logs instead. Make sure that 404 and referrer URL logging are enabled on your web server, since you will probably want to correct the page containing the broken link. You could then use a log file analyser to generate a broken link report.
To create an effective, long-term way of monitoring your broken links, you may want to set up a cron job to do the following (a sketch of the same pipeline follows after these steps):
Use grep to extract lines containing 404 entries from the server log file.
Use sed to remove everything except requested URLs and referrer URLs from every line.
Use sort and uniq commands to remove duplicates from the list.
Output the result to a new file each time so that you can monitor changes over time.
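Here is the equivalent logic as a single script, sketched in Python rather than grep/sed. It assumes the common Apache/Nginx "combined" log format and an example log path, so adjust the regex and paths to your server:

```python
# Sketch of the pipeline above: pull the requested URL and referrer for
# every 404, de-duplicate, and write the result to a new file per run.
# Assumes the common Apache/Nginx "combined" log format; adjust the regex
# and the log path if your server logs differently.
import re
from datetime import date

LINE = re.compile(
    r'"[A-Z]+ (?P<url>\S+) \S+" (?P<status>\d{3}) \S+ "(?P<referrer>[^"]*)"'
)

broken = set()
with open("/var/log/nginx/access.log") as log:  # example path
    for line in log:
        m = LINE.search(line)
        if m and m.group("status") == "404":
            broken.add((m.group("url"), m.group("referrer")))

# A new file each run makes it easy to track changes over time from cron.
with open(f"broken-links-{date.today()}.txt", "w") as out:
    for url, referrer in sorted(broken):
        out.write(f"{url}\t{referrer}\n")
```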
A free tool called Xenu turned out to be the weapon of choice for this task: http://home.snafu.de/tilman/xenulink.html#Download
Why not just analyze your webserver logs and look for all the 404 pages? That makes far more sense and is much more reliable.
I know this is an old question, but you can use the Export URLs feature in the GSA admin console and then look for URLs with a state of not_found. This will show you all the URLs that the GSA has discovered but that returned a 404 when it attempted to crawl them.
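If you want to script that, filtering the export for the not_found state is enough. A rough sketch in Python; the file names and the simple substring check are assumptions, since the exact export format can vary:

```python
# Rough sketch: pull lines whose crawl state is "not_found" out of a
# GSA "Export URLs" file. File names and the substring check are
# assumptions to adapt to your actual export.
with open("exported_urls.txt") as export, open("not_found_urls.txt", "w") as out:
    for line in export:
        if "not_found" in line:
            out.write(line)
```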