How to block my website from all web search engines

I have a website that should not be accessible to anyone without the URL, nor to any search engine.
No search engine should be aware of my website; only people with the link should be able to access it. Can someone suggest the best approach, since I'm going to share my office data on it?

You can prevent most search engines from indexing your site with a robots.txt file. More details here: http://www.robotstxt.org/
However, this is not very secure: some robots ignore robots.txt. The best way to restrict access is either to require a user to log in before entering the site, or to use a firewall that allows only that user's IP address.
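For reference, a minimal robots.txt that asks every crawler to stay away from the whole site looks like this (it must be served from the web root, e.g. http://www.example.com/robots.txt):

User-agent: *
Disallow: /

Again, this is a request, not access control: well-behaved crawlers honor it, rogue ones won't.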

You need to add a robots.txt file to the root folder of your website indicating that search engine spiders should not index your website.
http://www.robotstxt.org/robotstxt.html
But it's left to the search engines to read the file; the most popular ones do honor it. Another method is to not have an index.htm or default.htm in your website. Even if one exists, remove any links to internal pages. This way spiders will never learn the site structure of your website.

Wow. OK:
1) robots.txt
http://www.robotstxt.org/robotstxt.html
2) Authentication. If you're using Apache, password-protect the site (see the sketch after this list)
3) Ensure no one ever links to it from anywhere.
4) Consider a different alternative, like Dropbox.
http://www.dropbox.com/
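Regarding item 2, here is a minimal sketch of HTTP Basic Auth in an Apache .htaccess file; the .htpasswd path and the user name are placeholders, and the password file would be created beforehand with: htpasswd -c /path/to/.htpasswd someuser

AuthType Basic
AuthName "Private office data"
AuthUserFile /path/to/.htpasswd
Require valid-user

Unlike robots.txt, this actually denies access instead of politely asking crawlers to stay away.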

Related

Get rid of old links to a retired website in Google search

I have a website that has been replaced by another website with a different domain name.
In Google search, I can still find links to pages on the old site, and I hope they will not show up in future Google searches.
Here is what I did, but I am not sure whether it is correct or enough.
Any access to a page on the old website is immediately redirected to the homepage of the new website. There is no one-to-one page mapping between the two sites. Here is the code for the redirect on the old website:
<meta http-equiv="refresh" content="0;url=http://example.com" >
I went to the Google Webmasters site. For the old website, I used Fetch as Google and clicked "Fetch and Render" and "Reindex".
Really appreciate any input.
A few things you'll want to do here:
You need to use permanent server (301) redirects, not a meta refresh. I also suggest you provide one-to-one page mapping: it's a better user experience, and large numbers of redirects to the root are often interpreted as soft 404s. Consult Google's guide to site migrations for more details.
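As a sketch, assuming the old site runs on Apache with mod_rewrite: a blanket path-for-path redirect to the new domain (example.com, as in the question) could go in the old site's .htaccess like this, to be adjusted wherever the two sites' URL structures differ:

RewriteEngine On
RewriteRule ^(.*)$ http://example.com/$1 [R=301,L]

The R=301 flag is what makes the redirect permanent, which tells search engines to transfer the old URLs' standing to the new ones.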
Rather than Fetch & Render, use Google Search Console's (Webmaster Tools) Change of Address tool. Bing has a similar tool.
A common mistake is blocking crawler access to a retired site. That has the opposite of the intended effect: the old URLs need to remain accessible to search engines for the redirects to be "seen".

Want to make a Search Engine

I want to make a torrent search engine that will provide links to other torrent sites, so I need data from those sites to index in my database. Is it legal to crawl a website for this purpose, or is there some other way to do it?
Depending on the site, it is not legal without permission.
You might wish to investigate Common Crawl, a project that has already crawled a large portion of the web. Check out their Terms of Use to check on the legality of it all.

How to delete old Google URLs with parameters

A month ago I relaunched a website in the TYPO3 CMS. Before that, the site was hosted with the Joomla CMS.
In the Joomla configuration, SEF links were disabled, so Google indexed the page URLs like this, for example:
www.domain.de/index.php?com_component&itemid=123....
Now, a month later (after the TYPO3 relaunch), these links are still visible in Google because the URLs don't return a 404 error. That's because "index.php" also exists in TYPO3, and TYPO3 doesn't care about the additional query string/variables: it returns a 200 status code and shows the front page.
In Google Webmaster Tools it's possible to delete single URLs from the Google index, but that way I would have to delete about 10,000 URLs manually...
My question is: is there a way to remove these old URLs from the Google index?
Greetings
With this number of URLs there is only one sensible solution: implement proper 404 handling in your TYPO3, or even better, redirects to the same content placed in TYPO3.
You can use TYPO3's handler (search for it in Install Tool > All configuration); it's called pageNotFound_handling. You can use options like REDIRECT for redirecting to some page, or even USER_FUNCTION, which allows you to use your own PHP script; check the description in the Install Tool.
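As a sketch for older TYPO3 versions (before v9), where this option lives in typo3conf/LocalConfiguration.php, the relevant excerpt could look like the following; the target URL is a placeholder:

'FE' => [
    // Send every "page not found" hit to the front page of the new site.
    'pageNotFound_handling' => 'REDIRECT:http://www.domain.de/',
],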
You can also write a simple condition in TypoScript that checks whether typical Joomla params exist in the URL; that way you can easily return a custom 404 page. If you need a more sophisticated condition (for example, redirecting links that previously pointed to some gallery in Joomla to the new gallery in TYPO3), you can make use of a userFunc condition, and that would probably be the best option for SEO.
If these URLs share a sufficient number of common indicators, you could handle these links with a rule in your virtual host or .htaccess so that Google runs into the correct error message.
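As a sketch of that approach, assuming Apache with mod_rewrite: old Joomla component URLs all carry an option=com_... query parameter, so a rule can answer them with "410 Gone", which search engines treat as a strong removal signal:

RewriteEngine On
RewriteCond %{QUERY_STRING} (^|&)option=com_ [NC]
RewriteRule ^index\.php$ - [G]

The [G] flag returns HTTP 410; use a target URL instead of "-" with the flags [R=301,L] if you would rather redirect than drop the pages.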
I wrote a Google Chrome extension to remove URLs in bulk in Google Webmaster Tools. Check it out here: https://github.com/noitcudni/google-webmaster-tools-bulk-url-removal.
Basically, it's a glorified for loop. You put all the URLs in a text file. For example,
http://your-domain/link-1
http://your-domain/link-2
Having installed the extension as described in the README, you'll find a new "choose a file" button.
Select the file you just created. The extension reads it in, loops through all the URLs, and submits them for removal.

How to restrict access to non-SEO URLs in Joomla 3?

I've converted all my URLs to SEO-friendly URLs.
But I want to block access to my non-SEO-friendly URLs.
As an example, you can currently reach www.example.com/article-1 via http://www.example.com/index.php?option=com_content&view=article&id=76&Itemid=113. But I don't want this; I want the page to be accessible only via http://www.example.com/article-1.
I hope I've explained clearly what I need.
I don't think it's possible, for the simple reason that Joomla always uses the non-SEF links internally. That's why they always work.
Also, there are links that are never converted to SEF links because the user will not see them and Google will not index them, like links used by AJAX scripts or similar things.
If you block non-SEF URLs in your .htaccess file, expect your page to break sooner rather than later. Don't blame the extension developer then :-)

What should I know about search engine crawling?

I don't mean SEO things. What should I know? Such as:
Do engines run JavaScript?
Do they use cookies?
Will cookies carry across crawl sessions (say, cookies from today and a crawl next week or next month)?
Are selected JS files not loaded for any reason (such as a suspected ad that is skipped for optimization reasons)?
I don't want to accidentally have my index page show some kind of error or warning message like "please turn on cookies" or "browser not supported", or fail to be indexed because I did something silly, such as having my sitemap point to /r?id=5 and then not having it indexed because it is a redirect (I would use a 301, however).
From here: http://www.google.com/support/webmasters/bin/answer.py?answer=35769
Use a text browser such as Lynx to examine your site, because most search engine spiders see your site much as Lynx would. If fancy features such as JavaScript, cookies, session IDs, frames, DHTML, or Flash keep you from seeing all of your site in a text browser, then search engine spiders may have trouble crawling your site.
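For example, assuming Lynx is installed, dumping a page shows roughly what a crawler that runs no JavaScript and keeps no cookies would see:

lynx -dump http://www.example.com/

If important content or links are missing from that output, a spider may well be missing them too.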
Read Google's Webmaster guidelines
