Google Search API vs. dedicated search engine - search-engine

I make a Q&A website like Stack Overflow, and now I want to add it searching capability.
I want going to use the Google Search API but one of my friends said that it's better to have your own searching engine because Google doesn't index all pages (specially recently added pages) and also it has negative effect on the site ranking.
My question is that these statements is true or not and what is the best way for searching in this site?

Depending on your database full text search capabilities may be available.
For example check MySQL : https://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html
Alternative solution is to use a dedicated search index like Lucene: https://lucene.apache.org/core/

Related

How do search engines obtain unlinked pages?

I noticed that quite a lot Dropbox pages are indexed by Google, Bing, etc. and was wondering how these search engines obtain for instance links like these:
https://dl.dropboxusercontent.com/s/85cdji4d5pl5qym/37-71.pdf
https://dl.dropboxusercontent.com/u/11421929/larin2014.pdf
Given that there are no links on dl.dropboxusercontent.com to follow and the path structure is not that easy to guess, how is it possible that a search engine obtains such a link?
One solution might be that it was posted on a forum and picked up by the search engine but I looked up quite a lot of the links and checked the backlinks without success. I also noticed that Bing and Yahoo show a considerable amount of more results than Google which would mean that Bing does a better job in picking up these links which seems unlikely to me.
Even if the document is really unlinked (no link on their site, no link on someone other’s site, no sitemap, no Referer log from a site that gets linked in the document, etc.), it’s still possible for search engines to find the link.
Two ways are:
Someone could submit the URL to a search engine (whether via a public tool, or via the site’s webmaster account).
The search engine could get all URLs that certain users visit in their browsers. This could, for example, happen when the user has installed a toolbar from that search engine. This is the case with Bing, see my related answer on Webmasters SE:
Microsoft has confirmed that they do discover and index URLs that they find through users surfing the Internet with the Bing Toolbar installed.
And there might be more ways, of course.

Code to check whether site has been listed on search engines and directories

I am currently developing an application in Rails, which requires to check whether a website has been listed in Google, Bing, Yahoo, Yelp and Yellow Pages. From my research the best is to check site: domain.com on Google and Bing and look for results and check in Yahoo directory for the domain.
Is there any other way to do it? I mean some code snippet to check on domain's home page or using their API or something like that. Also how to check on Yelp and Yellow pages.
You can use mechanize and write web-style drivers
Google: do a search on your domain with this on the search term
site:checkmeout360.com
https://www.google.com/search?q=site%3A<SITE_NAME>.com
Try to see how yelp, yahoo, bing and yellow pages do indexing. Then you can use mechanize to automate the searching process for you, you can use mechanize to do the search like above with google, then write asserts (check if stuff you are looking for is on the search result)
Search engines don't appreciate automated queries that are sent their way.
Here is what Google has to say about it:
Google's Terms of Service do not allow the sending of automated queries of any sort to our system without express permission in advance from Google. Sending automated queries consumes resources and includes using any software (such as WebPosition Gold) to send automated queries to Google to determine how a website or webpage ranks in Google search results for various queries. In addition to rank checking, other types of automated access to Google without permission are also a violation of our Webmaster Guidelines and Terms of Service.

Is there a web search service/site either with an API or which works with YQL?

I'd like to make a tool which accesses a search engine programatically.
I've been enjoying using YQL recently and thought it might be useful since it can dig data out of HTML pages.
But I tried it with Google, Bing, and Yahoo search and they all seem to block YQL.
I wonder if there are some lesser-known web search sites that might work with YQL.
Or actually if there's still any search engine which offers an API that would be even better.
(In fact I'm only searching linguistics.stackexchange.com because the Stack Exchange APIs don't provide a way to search by text that I can find.)
Most search engine sites will block access from screen scrapers and other agents. YQL is designed to respect the robots.txt file, so on many sites like this it won't work.
Instead, I suggest moving a step above HTML screen scraping and using a published search API.
In YQL for example, there is a table which provides access to the Bing search results:
select * from microsoft.bing where query="soccer" and source in ("web","image")
You could also look at the Yahoo! BOSS API or using the Bing Search API directly.

Search Engine without crawling?

Is there a way to collect web content in order to use it in a search engine without passing by the web crawling phase? Any alternative to web crawling?
Thanks
No, to collect the content you have to...collect the content. :-)
Yes (and sort-of no).
:)
You can download existing data dumps from various websites (wikipedia, stackoverflow, etc.) and construct a partial index that way. It obviously won't be a complete index of the internet.
You could also use meta-search to construct your search engine. This is where you use the APIs of other search engines and use THEIR search results as the basis of your index. Examples include citosearch and opensearch. duckduckgo uses yahoo's boss api (and now yahoo uses bing...) as part of their search engine.
There are also real-time streaming APIs that you could use instead of crawling the web. Look at datasift as an example. There are lots more resources you could cleverly use and avoid/minimize crawling.
If you want to be updated with the latest content on pages, then you can use something like pubsubhubbub protocol to get push notifications for subscribed links.
Or use paid services like superfeedr that make use of the same protocol.
directly or indirectly you have to crawl the web in order to get the content.
Well if you don't want to crawl, you can follow a wiki-like approach, where users can submit links to sites (with title, description and tags). So a collaborative link collection can be built.
To avoid spam a +/- system can be involved, to vote useful sites or tags up and useless ones down.
To avoid spammers mass voting SERPs you can weight votes by user reputation.
User reputation can be gained by submitting useful sites. Or somehow tracing usage patterns.
And considering other abuse patterns too.
Well, you got the point, I think.
As spammers gradually discover weaknesses of traditional search engines (see Google bomb, content scraper sites, etc.), a community based approach may work. But it would suffer severely from the cold start effect, and when community is small the system is easy to abuse and poison...
At least Wikipedia and Stack Exchange is not spammed to useless levels so far...
PS: http://xkcd.com/810/

Fuzzy match in sharepoint search engine?

In sharepoint 2007 sites, we can search for people or other contents. Is the search engine able to do fuzzy match so that "Micheal" can be corrected to "Michael"? If it's possible, does it need extra configuration?
I am also writing a custom webpart that uses sharepoint search service, a web service that has url like "http://site/_vti_bin/search.asm". Is it possible to use this service to do fuzzy search as well?
Thanks.
The Search Summary Web Part provides that capability: try searching SharePoint for "Microsfot" and you'll get a "Did you mean Microsoft?" prompt. However, I only seem to see it when I get no results at all, and it looks like it has some other limitations:
Threewill Wiki (posted by Kirk Liemohn)
I haven't seen that kind of matching used specifically, but you might get some ideas from the wildcard search web part on codeplex.

Resources