Code to check whether site has been listed on search engines and directories - ruby-on-rails

I am currently developing an application in Rails that needs to check whether a website has been listed on Google, Bing, Yahoo, Yelp, and Yellow Pages. From my research, the best approach is to search for site:domain.com on Google and Bing and look for results, and to check the Yahoo directory for the domain.
Is there any other way to do it, for example a code snippet that checks the domain's home page, or something that uses their APIs? Also, how can I check Yelp and Yellow Pages?

You can use Mechanize and write web-style drivers.
For Google, do a search on your domain with this as the search term:
site:checkmeout360.com
https://www.google.com/search?q=site%3A<SITE_NAME>.com
Look at how Yelp, Yahoo, Bing, and Yellow Pages handle indexing. Then you can use Mechanize to automate the search process: run a search like the Google one above and write assertions that check whether what you are looking for appears in the results, as in the sketch below.
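Here is a minimal sketch of that idea, assuming the mechanize gem and the site: query URL above. The helper name and result check are illustrative assumptions; Google's markup can change, and (as the next answer points out) automated queries violate their ToS:

require 'mechanize'

# Hypothetical helper: returns true if a Google site: search yields
# links back to the domain. Illustrative only; Google's result-page
# structure may change, and its ToS forbids automated queries.
def listed_on_google?(domain)
  agent = Mechanize.new
  agent.user_agent_alias = 'Mac Safari'
  page = agent.get("https://www.google.com/search?q=site%3A#{domain}")
  # If any result link points back to the domain, assume it is indexed.
  page.links.any? { |link| link.href.to_s.include?(domain) }
end

puts listed_on_google?('checkmeout360.com')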

Search engines don't appreciate automated queries that are sent their way.
Here is what Google has to say about it:
Google's Terms of Service do not allow the sending of automated queries of any sort to our system without express permission in advance from Google. Sending automated queries consumes resources and includes using any software (such as WebPosition Gold) to send automated queries to Google to determine how a website or webpage ranks in Google search results for various queries. In addition to rank checking, other types of automated access to Google without permission are also a violation of our Webmaster Guidelines and Terms of Service.

Related

How do search engines obtain unlinked pages?

I noticed that quite a lot of Dropbox pages are indexed by Google, Bing, etc., and was wondering how these search engines obtain links like these:
https://dl.dropboxusercontent.com/s/85cdji4d5pl5qym/37-71.pdf
https://dl.dropboxusercontent.com/u/11421929/larin2014.pdf
Given that there are no links on dl.dropboxusercontent.com to follow and the path structure is not that easy to guess, how is it possible that a search engine obtains such a link?
One explanation might be that a link was posted on a forum and picked up by the search engine, but I looked up quite a lot of the links and checked the backlinks without success. I also noticed that Bing and Yahoo show considerably more results than Google, which would mean that Bing does a better job of picking up these links, which seems unlikely to me.
Even if the document is really unlinked (no link on their site, no link on someone else's site, no sitemap, no Referer log from a site that gets linked in the document, etc.), it's still possible for search engines to find the link.
Two ways are:
Someone could submit the URL to a search engine (whether via a public tool, or via the site’s webmaster account).
The search engine could get all URLs that certain users visit in their browsers. This can happen, for example, when the user has installed a toolbar from that search engine. This is the case with Bing; see my related answer on Webmasters SE:
Microsoft has confirmed that they do discover and index URLs that they find through users surfing the Internet with the Bing Toolbar installed.
And there might be more ways, of course.

Is there a web search service/site either with an API or which works with YQL?

I'd like to make a tool which accesses a search engine programmatically.
I've been enjoying using YQL recently and thought it might be useful since it can dig data out of HTML pages.
But I tried it with Google, Bing, and Yahoo search and they all seem to block YQL.
I wonder if there are some lesser-known web search sites that might work with YQL.
Or, if there's still a search engine that offers an API, that would be even better.
(In fact I'm only searching linguistics.stackexchange.com, because the Stack Exchange APIs don't provide a way to search by text, as far as I can find.)
Most search engine sites will block access from screen scrapers and other agents. YQL is designed to respect robots.txt, so it won't work on many sites like these.
Instead, I suggest moving a step above HTML screen scraping and using a published search API.
In YQL, for example, there is a table that provides access to Bing search results:
select * from microsoft.bing where query="soccer" and source in ("web","image")
You could also look at the Yahoo! BOSS API, or use the Bing Search API directly.
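For example, here is a minimal sketch of running that YQL statement from Ruby against the public YQL REST endpoint. The endpoint, the env parameter for loading community tables, and the microsoft.bing table are assumptions about how YQL worked at the time, so treat this as illustrative:

require 'net/http'
require 'json'
require 'uri'

yql = 'select * from microsoft.bing where query="soccer" and source in ("web","image")'
uri = URI('https://query.yahooapis.com/v1/public/yql')
# The env parameter loads YQL community tables such as microsoft.bing
# (assumed here; table availability may vary).
uri.query = URI.encode_www_form(q: yql, format: 'json',
                                env: 'store://datatables.org/alltableswithkeys')

response = Net::HTTP.get_response(uri)
data = JSON.parse(response.body)
puts data.dig('query', 'results')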

Rails site search with Bing API example?

Previously these folks promised to release their implementation of Bing search for their site in the following article: http://www.globalnerdy.com/2009/06/29/learnhub-powered-by-rails-searches-with-bing/
Is anyone familiar with a Ruby or Rails library that would facilitate site search with Bing? Google just hasn't been a good match so far for our site search, and a search with MS Bing, surprisingly, seems to be a much better solution.
Otherwise, an example of how to accomplish this, even without a lib and directly using the API, would be much appreciated.
While not a custom site search per se, you should be able to use RBing to access Bing's search API. There's an introductory tutorial at http://9astronauts.com/code/ruby/rbing/
To make it work like a site search, simply append site:example.com to your queries and it will only return results from that domain. For example:
require 'rbing'

# Sign up for a Bing AppId and pass it to RBing.
bing = RBing.new("YOURAPPID")
query = "something interesting"
# The site: operator restricts results to the given domain.
results = bing.web("#{query} site:stackoverflow.com")
puts results.web.results[0].title
=> "javascript - How to illuminate a browser window/tab when something ..."

How to obtain a log of web search queries?

It would help if I could do a search log analysis for my research. Is it possible to use a search API (Google, Yahoo, Bing) to create a log of web search queries over a specified time span, or is it available on request?
The only thing I know of is the old AOL search logs, which they released a while back. You can find them on some of the torrent sites. For news about it, read this

Is there a search engine, including an indexing bot, which can be used to build a special catalogue by feeding the bot certain properties?

Our application (C#/.NET) needs to run a lot of search queries. Google's limit of 50,000 queries per day is not enough. We need something that would crawl websites by specific rules we set (e.g. country domains) and gather URLs, text, keywords, and site names to build our own internal catalogue, so we wouldn't be limited by any massive external search engine like Google or Yahoo.
Is there any free open source solution we could use to install it on our server?
There's no point in re-inventing the wheel.
DataparkSearch might be the one you need. Or review this list of other open-source search engines.
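If you do end up rolling something small yourself, here is a hedged sketch of rule-based crawling in Ruby (the question is C#/.NET, but the idea carries over). Everything here, from the nokogiri gem to the ALLOWED domain rule and the seed URL, is an illustrative assumption; a production-scale catalogue is better served by an engine like DataparkSearch:

require 'net/http'
require 'nokogiri'
require 'set'
require 'uri'

ALLOWED   = /\.(de|fr)$/           # example rule: only certain country domains
seen      = Set.new
queue     = ['http://example.de/'] # hypothetical seed URL
catalogue = []

until queue.empty? || catalogue.size >= 100
  url = queue.shift
  next if seen.include?(url)
  seen << url

  uri = URI(url)
  next unless uri.host.to_s =~ ALLOWED

  begin
    html = Net::HTTP.get(uri)
  rescue StandardError
    next
  end

  doc = Nokogiri::HTML(html)
  # Gather URL, title, and a text snippet for the internal catalogue.
  catalogue << { url: url, title: doc.title, text: doc.text[0, 200] }

  # Enqueue outgoing links for further crawling.
  doc.css('a[href]').each do |a|
    link = URI.join(url, a['href']).to_s rescue nil
    queue << link if link
  end
end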
