My website should have a single index.php that checks the requested URL and displays the right content via include(...).
Right now I am using similar_text for URLs that don't exist on the website, so that the most similar existing path is chosen.
But I have heard that similar URLs that serve the same content aren't good for search engines.
So does this have a bad effect on search engines like Google?
All URLs that serve the same content are counted as duplicates.
You should return a 404 Not Found header for every URL that does not exactly match one of yours.
header('HTTP/1.0 404 Not Found');
echo "<h1>Error 404 Not Found</h1>";
For more details about Google's stance, see its documentation.
I've been reading many articles about SEO and investigating how to improve my site. I found an article saying that friendly URLs help search engine crawlers find and rank your site better than URLs with lots of GET parameters, so I decided to adapt my site to this kind of URL. I've also read that there's a way to do it by editing .htaccess, but that it's not the best way and it doesn't look very good.
For example, this is what Google's About URL looks like:
https://www.google.com/search/about/es/
When they browse the server over FTP, do they see the directories search/about/es/index.html? If so, you would have to create many files and directories for each language instead of using &l=es. Is it really worth it?
You can never know (for sure) how resources are mapped to URLs.
For example, the URL https://www.google.com/search/about/es/ could
point to the HTML file /search/about/es/index.html
point to the HTML file /foo/bar/1.html
point to the PHP script /index.php
point to the PHP script /search.php?title=about&lang=es
point to the document available from the URL https://internal.google.com/1238
…
It’s always the server that, given the URL from the request, decides which resource to deliver. Unless you have access to the server, you can’t know how. (Even if a URL ends with .php, it’s not necessarily the case that PHP is involved at all.)
The server could look for a file that physically exists (if URL rewriting is involved: even in "other" places than what the URL path suggests), the server could run a script that generates a document on the fly (e.g., taking the content from your database), the server could output the file available from another URL, etc.
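To make the point concrete, here is a minimal sketch of a front controller, written in Python for illustration only. The routing table, the about_page handler and the dispatch function are all hypothetical names, not anything Google or any particular framework uses. It shows one script answering for several URLs without any matching directories on disk, and answering 404 for every path that is not an exact match.

```python
from typing import Callable, Dict, Tuple


def about_page(lang: str) -> str:
    """Hypothetical handler that renders the About page for one language."""
    return f"<h1>About ({lang})</h1>"


# Hypothetical routing table: URL path -> (handler, keyword arguments).
# No directory such as /search/about/es/ has to exist on disk.
ROUTES: Dict[str, Tuple[Callable[..., str], Dict[str, str]]] = {
    "/search/about/es/": (about_page, {"lang": "es"}),
    "/search/about/en/": (about_page, {"lang": "en"}),
}


def dispatch(path: str) -> Tuple[int, str]:
    """Return (status code, body) for a request path, exact matches only."""
    route = ROUTES.get(path)
    if route is None:
        return 404, "<h1>Error 404 Not Found</h1>"
    handler, kwargs = route
    return 200, handler(**kwargs)
```

Calling dispatch("/search/about/es/") returns a 200 with the Spanish page, while a near miss such as "/search/about/e/" gets a 404 rather than the closest match.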
Related Wikipedia articles:
Rewrite engine
Web framework: URL mapping
Front controller
tl;dr: Misconfigured ASP.NET MVC servers return "200 OK" when they should 404.
I'm building a list of tech employer career page links. I am flummoxed to find it quite common that such companies have open positions listed on their sites but don't link to them. That is, if you visit www.example.com, nowhere on the homepage - sometimes nowhere on the whole website - can a link to www.example.com/jobs be found.
To get around that, after manually indexing a few hundred sites, I made a list of common URL paths:
/careers
/careers/
/careers.html
/jobs.aspx
I have written a straightforward Python script that, when given a list of company homepages, uses pycurl - a wrapper around libcurl - to attempt HTTP HEAD requests for each (homepage, urlpath) pair:
http://www.example.com/careers
http://www.example.com/jobs
http://www.example.net/careers
http://www.example.net/jobs
This mostly works.
However, there is what I gather to be a common misconfiguration problem with ASP.NET MVC, which results in custom 404 pages producing a 200 response code while displaying the custom "Not Found" page. For example:
http://www.microsoft.com/bill-gates-is-the-spawn-of-satan.html
Yes, that's right folks: Microsoft misconfigured their own server. :-D
If you use Firefox's web developer tools you can see that the above link produces a 200 OK instead of a 404 Not Found.
I expect this is a common problem for anyone who deals with scraping or robots: is there a straightforward programmatic way that I could tell that the above link should produce a 404 instead of a 200?
In my particular case, a modestly unsatisfactory solution would be to note when none of my links for a site produces a 404, then emit a "can't find" output. In such cases I manually google the careers pages:
http://www.google.com/search?q=site:microsoft.com+careers
My goal for the near term is to partially automate the discovery of the links for my tech employer index. I expect that fully automating it would be intractable; I hope to automate the easy stuff.
I don't know of any way from the client end to know that a page is invalid when the server is explicitly telling the client that the page is valid. The second best solution that I can come up with is to grep for common text that is usually displayed on such pages such as "sorry" and "not found". This will, of course, do nothing for you if the custom error page is actually a redirect to a completely valid page such as the home page.
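The grep idea can be sketched roughly as below, in Python since that is what the asker's script uses. The phrase list is an assumption and would need tuning for the sites being checked, and looks_like_soft_404 is a hypothetical helper name. Note that a HEAD request returns no body, so the page has to be fetched with GET before this check can run.

```python
import re

# Assumed phrase list: "sorry" and "not found" come from the suggestion above,
# the remaining phrases are guesses at other common wording.
NOT_FOUND_PHRASES = re.compile(
    r"sorry|not found|does not exist|doesn't exist|no longer available",
    re.IGNORECASE,
)


def looks_like_soft_404(status: int, body: str) -> bool:
    """Heuristic: True if a 200 response body reads like a 'not found' page."""
    if status != 200:
        # A real error status needs no guessing.
        return False
    return bool(NOT_FOUND_PHRASES.search(body))
```

As the answer says, this is only a heuristic: it will miss error pages that redirect to a valid page, and it can misfire on a real page that happens to contain one of the phrases.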
I have a robots.txt like below but Google has still indexed my domain. Basically they've indexed mydomain.com but not mydomain.com/any_page
UserAgent: *
Disallow: /
I mean how can I go back further than / which I thought was the root of domain?
Note this domain is a work in progress, hence I don't want Google or any other search engine seeing it for a minute.
If you don't have one already, get a Google Webmaster Tools account. It includes a URL removal tool that may work for you.
This doesn't address the problem of search engines possibly ignoring or misinterpreting your robots.txt file, of course.
If you REALLY want your site to be off the air until it's launched, your best bet is to actually take it off the air. Make the site inaccessible except by password. If you put HTTP Basic authentication on your documentroot, then no search engine will be able to index anything, but you'll have full access with a password.
I have a small problem with Googlebot. My server runs Windows Server 2009 with a system called Workcube, which runs on ColdFusion. It has a built-in error reporter, so I receive every error message. Many of them involve Googlebot trying to visit false links that don't exist. The links look like this:
http://www.bilgiteknolojileri.net/index.cfm?fuseaction=objects2.view_product_list&product_catid=282&HIERARCHY=215.005&brand_id=hoyrrolmwdgldah
http://www.bilgiteknolojileri.net/index.cfm?fuseaction=objects2.view_product_list&product_catid=145&HIERARCHY=200.003&brand_id=hoyrrolmwdgldah
http://www.bilgiteknolojileri.net/index.cfm?fuseaction=objects2.view_product_list&product_catid=123&HIERARCHY=110.006&brand_id=xxblpflyevlitojg
http://www.bilgiteknolojileri.net/index.cfm?fuseaction=objects2.view_product_list&product_catid=1&HIERARCHY=100&brand_id=xxblpflyevlitojg
Of course, a value like brand_id=hoyrrolmwdgldah or brand_id=xxblpflyevlitojg is false. I have no idea what the problem could be. I need advice. Thank you all for your help! ;)
You might want to verify your site with Google Webmaster Tools, which will list the URLs it finds that return errors.
Your logs are also valid, but you need to verify that it really is Googlebot hitting your site and not someone spoofing their User Agent.
Here are instructions to do just that: http://googlewebmastercentral.blogspot.com/2006/09/how-to-verify-googlebot.html
Essentially you need to do a reverse DNS lookup and then a forward DNS lookup after you receive the host from the reverse lookup.
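The reverse-then-forward check could be sketched along these lines in Python. The function name is_real_googlebot and the injectable resolver parameters are assumptions made so the logic can be tried without network access. The accepted hostname suffixes follow Google's published guidance as I understand it, so check the linked instructions for the current list. socket.gethostbyname_ex resolves IPv4 only, so this sketch does not cover IPv6 clients.

```python
import socket


def is_real_googlebot(ip, reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex):
    """Reverse DNS, suffix check, then forward DNS back to the same IP."""
    try:
        # Step 1: reverse lookup, IP -> host name.
        host = reverse(ip)[0]
    except OSError:
        return False
    # Step 2: the host name must belong to Google (assumed suffix list).
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        # Step 3: forward lookup, host name -> list of IPv4 addresses.
        addresses = forward(host)[2]
    except OSError:
        return False
    # Step 4: the original IP must be among the forward results.
    return ip in addresses
```

The forward lookup matters because anyone who controls the reverse DNS for their own IP range can make it claim a googlebot.com name. Only the forward lookup, answered by Google's own DNS, confirms it.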
Once you've verified it's the real Googlebot, you can start troubleshooting. You see, Googlebot won't request URLs that it hasn't naturally seen before, meaning Googlebot shouldn't be making direct object reference requests. I suspect it's a rogue bot with a User Agent of Googlebot, but if it's not, you might want to look through your site to see if you're accidentally linking to those pages.
Unfortunately you posted the full URLs, so even if you clean up your site, Googlebot will see the links from Stack Overflow and continue to crawl them because they'll be in its crawl queue.
I'd suggest 301 redirecting these URLs to someplace that makes sense to your users. Otherwise I would 404 or 410 these pages so Google knows to remove them from its index.
In addition, if these are pages you don't want indexed, I would suggest adding the path to your robots.txt file so Googlebot can't continue to request more of these pages.
Unfortunately there's no real good way of telling Googlebot to never ever crawl these URLs again. You can always go into Google Webmaster Tools and request the URLs to be removed from their index which may stop Googlebot from crawling them again, but that doesn't guarantee it.
I am using a phpBB forum with an SEO plugin, which turned all my dynamic URLs like "viewtopic.php?t=1234" into SEO URLs such as "/super-jackpot-t821.html". I was happy with it.
But now the problem is that I have moved hosts, moved phpBB to a subfolder and upgraded to the latest phpBB. Now that plugin has stopped working, and all the URLs are already indexed by Google, Yahoo, etc.
So I was wondering: is it possible to 301 redirect the SEO URLs back to the normal URLs? Maybe by picking the trailing number 821 out of the SEO URL using .htaccess and turning it back into viewtopic.php?t=821?
Thanks.
Here's an .htaccess guide I found.
http://www.garnetchaney.com/htaccess_tips_and_tricks.shtml
To match 0 to 9999, the regex should be ^[0-9]{1,4}$
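That pattern is anchored to the whole string, so to pull the topic number out of a path like /super-jackpot-t821.html the digits have to be captured from the "-t821.html" tail instead. Here is a small Python sketch of that mapping, which is handy for testing the pattern before putting an equivalent capture group into a rewrite rule. The function name and the "/forum" prefix are assumptions standing in for the real subfolder.

```python
import re
from typing import Optional

# Captures the 1-4 digit topic id from the "-t<digits>.html" tail of the path.
SEO_TOPIC = re.compile(r"-t([0-9]{1,4})\.html$")


def seo_to_viewtopic(path: str, forum_prefix: str = "/forum") -> Optional[str]:
    """Map an old SEO URL path to the viewtopic.php URL, or None if no match.

    forum_prefix is a hypothetical name for the subfolder phpBB now lives in.
    """
    match = SEO_TOPIC.search(path)
    if match is None:
        return None
    return f"{forum_prefix}/viewtopic.php?t={match.group(1)}"
```

Under these assumptions, "/super-jackpot-t821.html" maps to "/forum/viewtopic.php?t=821", and paths without the "-t<digits>.html" tail are left alone. If topic ids can exceed 9999, the {1,4} quantifier would need widening.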