Is there a search engine, including an indexing bot, which can be used to build a special catalogue by feeding the bot certain properties?

Our application (C#/.NET) needs to run a large number of search queries. Google's limit of 50,000 queries per day is not enough. We need something that would crawl websites according to specific rules we set (for example, country domains) and gather URLs, texts, keywords and website names, and build our own internal catalogue, so we wouldn't be limited to any massive external search engine like Google or Yahoo.
Is there any free open source solution we could use to install it on our server?
No point in re-inventing the wheel.

DataparkSearch might be the one you need. Or review this list of other Open Source Search Engines.
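If you do end up building something yourself, the rule-based crawling the question describes (for example, restricting the crawl to a country domain and collecting URLs, titles and page text) is easy to prototype. Below is a minimal Python sketch, assuming the requests and beautifulsoup4 packages; the seed URL and the ".de" restriction are illustrative assumptions, not part of the answer above.

# A minimal sketch of rule-based crawling restricted to a country TLD,
# assuming the requests and beautifulsoup4 packages.
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

ALLOWED_TLD = ".de"                 # example rule: German country domain
SEEDS = ["https://example.de/"]     # hypothetical starting points

def crawl(seeds, max_pages=100):
    queue, seen, catalogue = deque(seeds), set(seeds), []
    while queue and len(catalogue) < max_pages:
        url = queue.popleft()
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        soup = BeautifulSoup(resp.text, "html.parser")
        catalogue.append({
            "url": url,
            "title": soup.title.string.strip() if soup.title and soup.title.string else "",
            "text": soup.get_text(" ", strip=True)[:2000],   # truncated page text
        })
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            host = urlparse(link).hostname or ""
            # The crawling rule: only follow links on the allowed country TLD.
            if link.startswith("http") and host.endswith(ALLOWED_TLD) and link not in seen:
                seen.add(link)
                queue.append(link)
    return catalogue

A real deployment would also need politeness delays, robots.txt handling and persistent storage, which is exactly what the ready-made engines above already provide.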

Code to check whether site has been listed on search engines and directories

I am currently developing an application in Rails which needs to check whether a website is listed on Google, Bing, Yahoo, Yelp and Yellow Pages. From my research, the best approach is to search for site:domain.com on Google and Bing and look for results, and to check the Yahoo directory for the domain.
Is there any other way to do it? I mean some code snippet that checks the domain's home page, or uses their APIs, or something like that. Also, how do I check on Yelp and Yellow Pages?
You can use mechanize and write web-style drivers.
Google: do a search on your domain with this as the search term:
site:checkmeout360.com
https://www.google.com/search?q=site%3A<SITE_NAME>.com
Try to see how Yelp, Yahoo, Bing and Yellow Pages do indexing. Then you can use mechanize to automate the searching process: do the search as above with Google, then write asserts (check whether the stuff you are looking for is in the search results).
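A rough Python equivalent of that flow (the question itself is Rails/Ruby mechanize) is sketched below; it assumes the requests library and simply checks whether the domain shows up in the body of Google's site: results page. Keep in mind the next answer's point that this kind of automated querying is against the engines' terms of service.

# A minimal sketch, assuming the requests library; Google may block or
# CAPTCHA automated queries, so treat this as illustration only.
import requests

def listed_on_google(domain: str) -> bool:
    """Return True if a site: query for the domain appears to return results."""
    resp = requests.get(
        "https://www.google.com/search",
        params={"q": f"site:{domain}"},
        headers={"User-Agent": "Mozilla/5.0"},   # bare requests are rejected outright
        timeout=10,
    )
    # Crude assert-style check: the domain should appear in the result page body.
    return resp.ok and domain.lower() in resp.text.lower()

print(listed_on_google("checkmeout360.com"))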
Search engines don't appreciate automated queries that are sent their way.
Here is what Google has to say about it:
Google's Terms of Service do not allow the sending of automated queries of any sort to our system without express permission in advance from Google. Sending automated queries consumes resources and includes using any software (such as WebPosition Gold) to send automated queries to Google to determine how a website or webpage ranks in Google search results for various queries. In addition to rank checking, other types of automated access to Google without permission are also a violation of our Webmaster Guidelines and Terms of Service.

How to set different languages for different spiders on a website?

I have a multilanguage website. Currently, the website language is chosen according to the web browser's language.
Is there any way to set the language according to the search engine spider? For example:
Display the website in Chinese for Baidu search engine spider,
Display the website in Russian for Yandex spider?
This is called crawler identification. When a request is made to your website, the User-Agent field contains information about the browser or the crawler.
Depending on the crawler, the value of this field will be different. You can then associate different values with different languages. You can also take a look at the large list of user agents.
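As an illustration, here is a minimal Flask sketch of that mapping; "Baiduspider" and "YandexBot" are the User-Agent substrings those crawlers identify themselves with, but the exact matching rules and the render_page helper are assumptions for the example. Keep in mind the ranking caveat that follows.

# A minimal sketch of choosing the page language from the crawler's User-Agent,
# assuming Flask; real users fall back to the browser's Accept-Language header.
from flask import Flask, request

app = Flask(__name__)

CRAWLER_LANGUAGES = {
    "Baiduspider": "zh",   # serve Chinese to Baidu's spider
    "YandexBot": "ru",     # serve Russian to Yandex's spider
}

@app.route("/")
def home():
    user_agent = request.headers.get("User-Agent", "")
    for marker, lang in CRAWLER_LANGUAGES.items():
        if marker in user_agent:
            return render_page(lang)
    # Not a known crawler: fall back to the browser's preferred language.
    lang = request.accept_languages.best_match(["en", "zh", "ru"], default="en")
    return render_page(lang)

def render_page(lang: str) -> str:
    # Placeholder: look up the localized template for this language code.
    return f"<html lang='{lang}'>localized content for {lang}</html>"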
I'm still pretty sure that by doing this, you'll lower your rank in search engines since you provide different responses to crawlers than to real users, but I don't have solid references to support this statement.
In all cases, crawlers are expected to gather resources in different languages, and those crawlers know how to deal with multilingual websites, except maybe the ones which try to follow every worst practice. Also, the search engines you mention are not limited to one language. Yandex, for example, is available in Turkish. As for Baidu, according to Wikipedia, it serves China, Japan, Thailand, Egypt and India.

How do BitTorrent search engines work?

For a normal search engine, I understand that it regularly travels across the internet to gather web page information, and sometimes web pages can voluntarily submit their latest updates to the engines. But what about BitTorrent search engines? Torrents cannot simply be found by viewing web pages. So how do they work? Do users submit them?
A publisher submits their torrent to a tracker and then distributes a link to the .torrent file for that tracker. Users in turn use that .torrent file to connect to the specified tracker and download the content; the tracker gives them a list of peers who are sharing that file. The torrent search sites just list which trackers are available and which files can be found on which trackers, as submitted by publishers.
However, I think this may be better suited to something like Super User rather than Stack Overflow...
No, users do not submit torrents. As we did with our torrent search site http://tornado.li/, we created different robots that scan all the added torrent sites for new torrents and add them to a database. The whole process is fully automated; only in this way is it possible to offer a good choice of torrents.
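To make the tracker relationship concrete: a .torrent file is just bencoded metadata whose announce key points at the tracker, and the announce URL plus the name are essentially what an index robot stores. The sketch below is plain Python with no third-party bencode library; the example.torrent filename is hypothetical.

# Minimal bencode decoder: integers (i42e), byte strings (4:spam),
# lists (l...e) and dicts (d...e), which is all a .torrent file uses.
def bdecode(data, i=0):
    """Decode one bencoded value starting at index i; return (value, next_index)."""
    c = data[i:i+1]
    if c == b"i":                          # integer
        end = data.index(b"e", i)
        return int(data[i+1:end]), end + 1
    if c == b"l":                          # list
        i, items = i + 1, []
        while data[i:i+1] != b"e":
            item, i = bdecode(data, i)
            items.append(item)
        return items, i + 1
    if c == b"d":                          # dictionary
        i, d = i + 1, {}
        while data[i:i+1] != b"e":
            key, i = bdecode(data, i)
            d[key], i = bdecode(data, i)
        return d, i + 1
    colon = data.index(b":", i)            # byte string: <length>:<bytes>
    length, start = int(data[i:colon]), colon + 1
    return data[start:start+length], start + length

with open("example.torrent", "rb") as f:   # hypothetical file
    meta, _ = bdecode(f.read())
print(meta[b"announce"].decode())          # the tracker URL
print(meta[b"info"][b"name"].decode())     # the torrent's name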

Search Engine without crawling?

Is there a way to collect web content for use in a search engine without going through the web crawling phase? Is there any alternative to web crawling?
Thanks
No, to collect the content you have to...collect the content. :-)
Yes (and sort-of no). :)
You can download existing data dumps from various websites (Wikipedia, Stack Overflow, etc.) and construct a partial index that way. It obviously won't be a complete index of the internet.
You could also use meta-search to construct your search engine. This is where you use the APIs of other search engines and use THEIR search results as the basis of your index. Examples include citosearch and OpenSearch. DuckDuckGo uses Yahoo's BOSS API (and now Yahoo uses Bing...) as part of its search engine.
There are also real-time streaming APIs that you could use instead of crawling the web. Look at DataSift as an example. There are lots more resources you could cleverly use to avoid or minimize crawling.
If you want to be updated with the latest content on pages, then you can use something like the PubSubHubbub protocol to get push notifications for subscribed links.
Or use paid services like Superfeedr that make use of the same protocol.
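For reference, a push subscription with that protocol boils down to one subscribe request plus a callback endpoint that echoes the hub's verification challenge and then receives content pushes. A minimal sketch, assuming Flask and requests; the topic and callback URLs are placeholders, and the hub shown is Google's public PubSubHubbub hub.

# A minimal WebSub/PubSubHubbub subscriber sketch, assuming Flask and requests.
import requests
from flask import Flask, request

app = Flask(__name__)

HUB = "https://pubsubhubbub.appspot.com/"            # public hub
TOPIC = "https://example.com/feed.xml"               # hypothetical feed to watch
CALLBACK = "https://my-indexer.example.com/websub"   # must be publicly reachable

def subscribe():
    # Ask the hub to push updates for TOPIC to our callback URL.
    requests.post(HUB, data={
        "hub.mode": "subscribe",
        "hub.topic": TOPIC,
        "hub.callback": CALLBACK,
    }, timeout=10)

@app.route("/websub", methods=["GET", "POST"])
def websub():
    if request.method == "GET":
        # Subscription verification: echo the hub's challenge back.
        return request.args.get("hub.challenge", ""), 200
    # Content notification: the hub POSTs the updated feed body.
    index_new_content(request.data)                  # hypothetical indexing hook
    return "", 204

def index_new_content(body: bytes) -> None:
    print("received", len(body), "bytes of updated feed content")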
Directly or indirectly, you have to crawl the web in order to get the content.
Well, if you don't want to crawl, you can follow a wiki-like approach where users submit links to sites (with title, description and tags), so a collaborative link collection can be built.
To avoid spam a +/- system can be involved, to vote useful sites or tags up and useless ones down.
To avoid spammers mass voting SERPs you can weight votes by user reputation.
User reputation can be gained by submitting useful sites. Or somehow tracing usage patterns.
And considering other abuse patterns too.
Well, you got the point, I think.
As spammers gradually discover the weaknesses of traditional search engines (see Google bomb, content scraper sites, etc.), a community-based approach may work. But it would suffer severely from the cold-start effect, and while the community is small the system is easy to abuse and poison...
At least Wikipedia and Stack Exchange are not spammed to useless levels so far...
PS: http://xkcd.com/810/
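As a toy illustration of the reputation-weighted voting idea above (the names and the log-based weighting are made up for the example, not a known algorithm):

# A toy sketch of reputation-weighted voting for a collaborative link catalogue.
import math
from collections import defaultdict

reputation = defaultdict(lambda: 1)    # user -> reputation points
scores = defaultdict(float)            # submitted URL -> weighted score

def vote(user: str, url: str, up: bool) -> None:
    # Weight each vote by the voter's reputation so freshly created accounts
    # (reputation 1) barely move the score.
    weight = math.log1p(reputation[user])
    scores[url] += weight if up else -weight

def reward_submission(user: str, found_useful: bool) -> None:
    # Users gain reputation when sites they submitted are found useful.
    if found_useful:
        reputation[user] += 5

# Example: one established user's vote outweighs several throwaway accounts.
reputation["alice"] = 500
vote("alice", "https://example.org", up=True)
for sock_puppet in ("spam1", "spam2", "spam3"):
    vote(sock_puppet, "https://spam.example", up=True)
print(scores["https://example.org"] > scores["https://spam.example"])  # True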

What's the best method to capture URLs?

I'm trying to find the best method to gather URLs. I could create my own little crawler, but it would take my servers decades to crawl all of the Internet, and the bandwidth required would be huge. Another thought was using Google's Search API or Yahoo's Search API, but that's not really a great solution, as it requires a search to be performed before I get results.
Other thoughts include asking DNS servers and requesting a list of URLs, but DNS servers can limit/throttle my requests or even ban me altogether. My knowledge of asking DNS servers is quite limited at the moment, so I don't know if this is the best method or not.
I just want a massive list of URLs, but I want to build this list without running into brick walls in the future. Any thoughts?
I'm starting this project to learn Python but that really has nothing to do with the question.
$ wget http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
You can register to get access to the entire .com and .net zone files at Verisign.
I haven't read the fine print for terms of use, nor do I know how much (if anything) it costs. However, that would give you a huge list of active domains to use as URLs.
How big is massive? A good place to start is http://www.alexa.com/topsites. They offer a download of the top 1,000,000 sites (by their ranking mechanism). You could then expand this list by going to Google and scraping the results of the query link:url for each URL in the list.
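If you go the top-sites route, reading that CSV is only a few lines of Python. The sketch below assumes the download URL quoted above is still reachable (the Alexa list has since been retired), so treat it as a template rather than a guaranteed working endpoint.

# A minimal sketch: download the top-1m CSV (one "rank,domain" per row) and
# turn it into a seed list of URLs, using only the standard library.
import csv
import io
import urllib.request
import zipfile

URL = "http://s3.amazonaws.com/alexa-static/top-1m.csv.zip"

def top_domains(limit=1000):
    with urllib.request.urlopen(URL) as resp:
        archive = zipfile.ZipFile(io.BytesIO(resp.read()))
    with archive.open("top-1m.csv") as f:
        reader = csv.reader(io.TextIOWrapper(f, encoding="utf-8"))
        return ["http://" + row[1] for _, row in zip(range(limit), reader)]

print(top_domains(5))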
The modern terms are URI and URN; URL is the shrunken/outdated one. I'd scan for sitemap files, which contain many addresses in one file, and study the classic texts on spiders, wanderers, brokers and bots, as well as RFC 3986 (Appendix B, around p. 50), which defines a URI-parsing regex.
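Sitemap scanning is simple to prototype as well. A minimal sketch using only the standard library; the example.com sitemap location is an assumption (the conventional /sitemap.xml path, or whatever robots.txt points to).

# A minimal sketch of pulling URLs out of a sitemap.xml with the standard library.
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def sitemap_urls(sitemap_url: str):
    with urllib.request.urlopen(sitemap_url) as resp:
        root = ET.fromstring(resp.read())
    # Plain sitemaps list <url><loc>...</loc></url> entries; sitemap index files
    # list nested <sitemap><loc>...</loc></sitemap> entries, so just collect <loc>.
    return [loc.text.strip() for loc in root.iter(f"{{{SITEMAP_NS}}}loc") if loc.text]

print(sitemap_urls("https://example.com/sitemap.xml")[:10])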
