How do search engines obtain unlinked pages? - search-engine

I noticed that quite a lot of Dropbox pages are indexed by Google, Bing, etc., and was wondering how these search engines obtain links like these:
https://dl.dropboxusercontent.com/s/85cdji4d5pl5qym/37-71.pdf
https://dl.dropboxusercontent.com/u/11421929/larin2014.pdf
Given that there are no links on dl.dropboxusercontent.com to follow and the path structure is not that easy to guess, how is it possible that a search engine obtains such a link?
One possibility is that a link was posted on a forum and picked up by the search engine, but I looked up quite a lot of the links and checked the backlinks without success. I also noticed that Bing and Yahoo show considerably more results than Google, which would mean that Bing does a better job of picking up these links; that seems unlikely to me.

Even if the document is really unlinked (no link on their site, no link on someone else's site, no sitemap, no Referer log from a site that gets linked in the document, etc.), it's still possible for search engines to find the link.
Two ways are:
Someone could submit the URL to a search engine (whether via a public tool, or via the site’s webmaster account).
The search engine could collect the URLs that certain users visit in their browsers. This can happen, for example, when a user has installed a toolbar from that search engine. This is the case with Bing; see my related answer on Webmasters SE:
Microsoft has confirmed that they do discover and index URLs that they find through users surfing the Internet with the Bing Toolbar installed.
And there might be more ways, of course.
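To make the first route concrete: one common form of URL submission is pinging a search engine with a sitemap that lists the URL. Here's a minimal sketch in Python; the ping endpoints below are ones Google and Bing have historically offered (their continued availability is an assumption), and the sitemap URL is a hypothetical placeholder.

# Hedged sketch: submitting URLs by pinging search engines with a
# sitemap. Endpoints are the historically documented ones; the
# sitemap URL is a hypothetical placeholder.
import requests
from urllib.parse import quote

sitemap = "https://example.com/sitemap.xml"
for ping in ("https://www.google.com/ping?sitemap=",
             "https://www.bing.com/ping?sitemap="):
    resp = requests.get(ping + quote(sitemap, safe=""))
    print(ping, resp.status_code)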

Related

Get rid of old links to a retired website in Google search

I have a website that has been replaced by another website with a different domain name.
In Google search, I am able to find links to the pages on the old site, and I hope they will not show up in future Google search.
Here is what I did, but I am not sure whether it is correct or enough.
Access to any page on the old website will be immediately redirected to the homepage of the new website. There is no one-to-one page mapping between the two sites. Here is the code for the redirect on the old website:
<meta http-equiv="refresh" content="0;url=http://example.com" >
I went to the Google Webmasters site. For the old website, I went to Fetch as Google, clicked "Fetch and Render", and then "Reindex".
I'd really appreciate any input.
A few things you'll want to do here:
You need to use permanent server redirects (HTTP 301), not a meta refresh. I also suggest you provide a one-to-one page mapping; it's a better user experience, and large numbers of redirects to the root are often interpreted as soft 404s (see the sketch after this list). Consult Google's guide to site migrations for more details.
Rather than Fetch & Render, use Google Search Console's (Webmaster Tools) Change of Address tool. Bing has a similar tool.
A common mistake is blocking crawler access to a retired site. That has the opposite of the intended effect: old URLs need to remain accessible to search engines for the redirects to be "seen".
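On the first point, here's a minimal sketch of permanent (301) redirects with a one-to-one page mapping, using only Python's standard library. The domains and the mapping are hypothetical placeholders; in practice you'd configure this in your web server (Apache, nginx, etc.) rather than run a separate process.

# Hedged sketch: permanent redirects with a 1:1 page mapping.
# Domains and paths are hypothetical placeholders.
from http.server import BaseHTTPRequestHandler, HTTPServer

NEW_DOMAIN = "https://example.com"
PAGE_MAP = {
    "/about.html": "/company/about",
    "/products.html": "/catalog",
}

class RedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Fall back to the same path on the new domain when no explicit
        # mapping exists, rather than dumping every visitor on the root.
        target = NEW_DOMAIN + PAGE_MAP.get(self.path, self.path)
        self.send_response(301)  # permanent server redirect
        self.send_header("Location", target)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), RedirectHandler).serve_forever()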

Want to make a Search Engine

I want to make a torrent search engine that will provide links to other torrent sites, so I need data from other sites to index in my database. Is it legal to crawl a website for this purpose, or is there some other way to do it?
Depending on the site, crawling without permission may not be legal.
You might wish to investigate Common Crawl, a project that publishes an openly available crawl of a large portion of the web. Check their Terms of Use for the legality of it all.
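Common Crawl also exposes a URL index you can query before downloading any archive data. A hedged sketch follows; the crawl ID below is just an example (the current list of crawls is published at index.commoncrawl.org).

# Hedged sketch: querying the Common Crawl URL index for captures of
# a domain. The crawl ID is an example; check index.commoncrawl.org
# for the list of available crawls.
import requests

resp = requests.get(
    "https://index.commoncrawl.org/CC-MAIN-2024-10-index",
    params={"url": "example.com/*", "output": "json"},
)
for line in resp.text.splitlines()[:5]:  # one JSON record per capture
    print(line)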

Code to check whether site has been listed on search engines and directories

I am currently developing an application in Rails which needs to check whether a website is listed on Google, Bing, Yahoo, Yelp, and Yellow Pages. From my research, the best approach is to search for site:domain.com on Google and Bing and look for results, and to check the Yahoo directory for the domain.
Is there any other way to do it? I mean some code snippet to check on the domain's home page, or using their API, or something like that. Also, how do I check on Yelp and Yellow Pages?
You can use Mechanize and write web-style drivers.
Google: do a search on your domain with this as the search term:
site:checkmeout360.com
https://www.google.com/search?q=site%3A<SITE_NAME>.com
Try to see how Yelp, Yahoo, Bing, and Yellow Pages do indexing. Then you can use Mechanize to automate the searching process: do the search like above with Google, then write asserts (check whether the content you are looking for appears in the search results). A rough sketch of this follows.
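Here's a hedged sketch of that approach in Python (the mechanize port rather than the Ruby gem), checking whether Google's result page reports no matches. The "did not match any documents" marker is fragile, and the whole approach is subject to the terms-of-service caveat in the next answer, so treat this purely as an illustration of the technique.

# Hedged sketch of the Mechanize-based check described above.
# The zero-results marker is fragile, and scraping result pages
# generally violates the engines' terms of service (see below).
import mechanize

def listed_on_google(domain):
    br = mechanize.Browser()
    br.set_handle_robots(False)                      # /search is disallowed by robots.txt
    br.addheaders = [("User-Agent", "Mozilla/5.0")]  # the default UA is blocked
    page = br.open("https://www.google.com/search?q=site%3A" + domain)
    html = page.read().decode("utf-8", errors="replace")
    return "did not match any documents" not in html

print(listed_on_google("checkmeout360.com"))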
Search engines don't appreciate automated queries that are sent their way.
Here is what Google has to say about it:
Google's Terms of Service do not allow the sending of automated queries of any sort to our system without express permission in advance from Google. Sending automated queries consumes resources and includes using any software (such as WebPosition Gold) to send automated queries to Google to determine how a website or webpage ranks in Google search results for various queries. In addition to rank checking, other types of automated access to Google without permission are also a violation of our Webmaster Guidelines and Terms of Service.
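A sanctioned alternative the answers above don't mention is an official search API. For Google, the Custom Search JSON API can run a site: query given an API key and a Programmable Search Engine ID; both values below are hypothetical placeholders you would create in the Google developer console. A hedged sketch:

# Hedged sketch: checking listing status via Google's Custom Search
# JSON API instead of scraping. API_KEY and CX are placeholders.
import requests

API_KEY = "YOUR_API_KEY"
CX = "YOUR_ENGINE_ID"

resp = requests.get(
    "https://www.googleapis.com/customsearch/v1",
    params={"key": API_KEY, "cx": CX, "q": "site:checkmeout360.com"},
)
data = resp.json()
total = int(data.get("searchInformation", {}).get("totalResults", "0"))
print("indexed" if total > 0 else "not indexed")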

Is there a web search service/site either with an API or which works with YQL?

I'd like to make a tool which accesses a search engine programmatically.
I've been enjoying using YQL recently and thought it might be useful since it can dig data out of HTML pages.
But I tried it with Google, Bing, and Yahoo search and they all seem to block YQL.
I wonder if there are some lesser-known web search sites that might work with YQL.
Or actually if there's still any search engine which offers an API that would be even better.
(In fact I'm only searching linguistics.stackexchange.com because the Stack Exchange APIs don't provide a way to search by text that I can find.)
Most search engine sites will block access from screen scrapers and other agents. YQL is designed to respect robots.txt, so it won't work on many sites like these.
Instead, I suggest moving a step above HTML screen scraping and using a published search API.
In YQL, for example, there is a table that provides access to Bing search results:
select * from microsoft.bing where query="soccer" and source in ("web","image")
You could also look at the Yahoo! BOSS API or use the Bing Search API directly.
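For completeness, here's a hedged sketch of running that YQL query over YQL's public REST endpoint from Python. The endpoint URL and the env parameter for community tables follow YQL's documented conventions, but their continued availability is an assumption.

# Hedged sketch: running the YQL query above via the public REST
# endpoint. The endpoint and the community-tables env parameter are
# assumed to still be available.
import requests

params = {
    "q": 'select * from microsoft.bing where query="soccer" and source in ("web","image")',
    "format": "json",
    "env": "store://datatables.org/alltableswithkeys",  # load community tables
}
resp = requests.get("https://query.yahooapis.com/v1/public/yql", params=params)
print(resp.json())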

How do search engines see dynamic profiles?

Recently, search engines have become able to index dynamic content on social networking sites. I would like to understand how this is done. Are there static pages created by a site like Facebook that update semi-frequently? Does Google attempt to store every possible user name?
As I understand it, a page like www.facebook.com/username is not an actual file stored on disk but shorthand for a query like select username from users, with the result displayed on the page. How does Google know about every user? This gets even more complicated when things like tweets are involved.
EDIT: I guess I didn't really ask what I wanted to know. Do I need to be as big as Twitter or Facebook for Google to build special ways to crawl my site? Will Google automatically find my users' profiles if I allow anyone to view them? If not, what do I have to do to make that work?
In the case of tweets in particular, Google isn't 'crawling' for them in the traditional sense; they've integrated with Twitter to provide the search results in real-time.
In the more general case of your question, dynamic content is not new to Facebook or Twitter, though it may seem to be. Google crawls a URL; the URL provides HTML data; Google indexes it. Whether it's a dynamic query that renders the page or a cache of static HTML makes little difference to the indexing process in theory. In practice, there's a lot more to it (see Michael B's comment below).
And see Vartec's succinct post on how Google might find all those public Facebook profiles without actually logging in and poking around FB.
OK, that was vastly oversimplified, but let's see what else people have to say.
As far as I know, Google isn't able to read and store the actual contents of profiles, because the Googlebot doesn't have a Facebook account, and doing so would be a huge privacy breach.
The bot works by hitting facebook.com and then following every link it can find. Whatever content it sees on the pages it hits, it stores. So even if it follows a dynamic URL like www.facebook.com/username, it will just remember whatever it saw when it went there. Hopefully, in that particular case, that isn't all the private data of said user.
Additionally, Facebook can and does provide special instructions that search bots can follow, so that Google results don't include a bunch of login pages.
In short: profiles can be linked from outside, and the site may provide a sitemap.
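To illustrate the sitemap route: a site with dynamic profile pages can enumerate them for crawlers by generating a sitemap from its user table. A minimal sketch; the domain and usernames are hypothetical placeholders.

# Hedged sketch: generating a sitemap that enumerates dynamic profile
# URLs so crawlers can find them without guessing. The domain and
# usernames are hypothetical.
from xml.sax.saxutils import escape

def profile_sitemap(usernames):
    entries = "\n".join(
        "  <url><loc>https://example.com/%s</loc></url>" % escape(u)
        for u in usernames
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        + entries + "\n</urlset>"
    )

print(profile_sitemap(["alice", "bob"]))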
