How do BitTorrent search engines work?

For a normal search engine, I understand that it regularly travels across the internet to gather web page information, and sometimes site owners voluntarily submit their latest updates to the engines. But what about BitTorrent search engines? Torrents cannot simply be found by viewing web pages, so how do these engines work? Do users submit them?

A publisher submits their torrent to a tracker and then distributes the .torrent file (or a link to it). Users use that file to connect to the specified tracker, which gives them a list of peers sharing the content so they can download it. Torrent search sites simply list which trackers are available and which files can be found on which trackers, as submitted by publishers.
However, I think this question may be better suited to Super User rather than Stack Overflow...
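To make the role of the .torrent file in that answer concrete, here is a minimal Ruby sketch that decodes just enough bencoding (the encoding used by .torrent files) to read the tracker's announce URL. The file name example.torrent is hypothetical and the decoder skips error handling; it is an illustration of the format, not a complete implementation.

    require 'stringio'

    # Minimal bencode decoder: integers (i42e), byte strings (4:spam),
    # lists (l...e) and dictionaries (d...e).
    def bdecode(io)
      case c = io.getc
      when 'i'                               # integer: i<digits>e
        io.gets('e').chomp('e').to_i
      when 'l'                               # list of values
        list = []
        until (ch = io.getc) == 'e'
          io.ungetc(ch)
          list << bdecode(io)
        end
        list
      when 'd'                               # dictionary of key/value pairs
        dict = {}
        until (ch = io.getc) == 'e'
          io.ungetc(ch)
          key = bdecode(io)
          dict[key] = bdecode(io)
        end
        dict
      when /\d/                              # byte string: <length>:<bytes>
        len = (c + io.gets(':').chomp(':')).to_i
        io.read(len)
      else
        raise "unexpected bencode token: #{c.inspect}"
      end
    end

    # A .torrent file is one big bencoded dictionary whose "announce" key
    # names the tracker the client contacts to get a list of peers.
    torrent = bdecode(StringIO.new(File.binread("example.torrent")))
    puts torrent["announce"]
    puts torrent["announce-list"].inspect if torrent["announce-list"]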

No, users do not submit torrents. For our torrent search site http://tornado.li/, we created several robots that scan all of the listed torrent sites for new torrents and add them to the database. The whole process is fully automated; only this way is it possible to offer a good choice of torrents.
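As a rough illustration of what such a robot does (this is not tornado.li's actual code), the Ruby sketch below fetches one listing page from a torrent site and collects the magnet links on it, which a real robot would then deduplicate by info-hash and store in its database. The listing URL is a placeholder.

    require 'net/http'
    require 'uri'
    require 'set'

    # Placeholder listing page of a torrent index the robot scans.
    LISTING_URL = URI("https://torrent-site.example/latest")

    def fetch_new_magnet_links(url)
      html = Net::HTTP.get(url)                        # download the listing page
      # Magnet links embed the info-hash, which uniquely identifies a torrent,
      # so they are a convenient key for spotting newly added entries.
      html.scan(/magnet:\?xt=urn:btih:[0-9A-Za-z]{32,40}[^"'\s<]*/).to_set
    end

    def store(links)
      # Placeholder: a real robot would insert these into its database,
      # skipping info-hashes it has already seen, and run on a schedule.
      links.each { |link| puts link }
    end

    store(fetch_new_magnet_links(LISTING_URL))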

Related

How do search engines obtain unlinked pages?

I noticed that quite a lot of Dropbox pages are indexed by Google, Bing, etc., and was wondering how these search engines obtain links like these, for instance:
https://dl.dropboxusercontent.com/s/85cdji4d5pl5qym/37-71.pdf
https://dl.dropboxusercontent.com/u/11421929/larin2014.pdf
Given that there are no links on dl.dropboxusercontent.com to follow and the path structure is not that easy to guess, how is it possible that a search engine obtains such a link?
One explanation might be that a link was posted on a forum and picked up by the search engines, but I looked up quite a lot of the links and checked the backlinks without success. I also noticed that Bing and Yahoo show considerably more results than Google, which would mean that Bing does a better job of picking up these links, which seems unlikely to me.
Even if the document is really unlinked (no link on their site, no link on someone else’s site, no sitemap, no Referer log from a site that gets linked in the document, etc.), it’s still possible for search engines to find the link.
Two ways are:
Someone could submit the URL to a search engine (whether via a public tool, or via the site’s webmaster account).
The search engine could collect all URLs that certain users visit in their browsers. This can happen, for example, when the user has installed a toolbar from that search engine. This is the case with Bing; see my related answer on Webmasters SE:
Microsoft has confirmed that they do discover and index URLs that they find through users surfing the Internet with the Bing Toolbar installed.
And there might be more ways, of course.

Custom user tracking or 3rd party service for page referral analytics

The question I'm trying to answer for a set of users is how other users end up on their pages. There are about five different ways a user can end up on your page: for example, they could have searched for your name, clicked a link in a newsfeed, or received an e-mail with a link to your page.
What is the best way to track these events? I'm initially inclined to create a table to track them, with each link sending an async event to the server to be added to the table. However, I'm also aware that there are many tracking services out there, such as Google Analytics and Mixpanel. I've looked at their docs briefly and they don't seem to fit my needs.
Am I missing something? Is it worth creating a "custom" event tracking system to accomplish this?
It is not worth creating your own service. Besides, you cannot add async calls to search engine result pages or e-mails (that would require tracking code, which you cannot inject into search results and which would not be executed in mail clients).
Web analytics software tracks traffic sources by analyzing the HTTP headers of incoming traffic. If a referrer is set, the traffic will be attributed to, well, the referring site, unless the referrer is on a list of known search engines, in which case it will be attributed to organic search traffic, and so on.
In most systems you can customize source attribution by adding query parameters to the URL (obviously this will not work with search engines and the like, since you cannot add parameters to organic search results). For example, with Google Analytics you can add custom campaign parameters to email links or advertising campaigns. If people click on those links, the parameter values will be sent to GA and the source/medium/campaign information will be set accordingly (e.g. traffic from web mail clients would usually be attributed as a referral, but campaign parameters allow you to attribute the link to your mail campaigns).
There might be reasons to create your own system, but channel attribution is not one of them; GA and every other system I know of has this thoroughly covered.
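As an illustration of the attribution logic described above (not the internals of GA or any other product), here is a small Ruby sketch that classifies a hit from its landing URL and Referer header. The utm_ parameter names follow the common campaign-tagging convention; the helper itself and the (shortened) search-engine list are made up for the example.

    require 'uri'

    # A (very) shortened list of known search-engine hosts.
    SEARCH_ENGINES = %w[google. bing. yahoo. duckduckgo.].freeze

    # Classify one hit roughly the way analytics tools attribute traffic sources.
    def traffic_source(landing_url, referrer)
      params = URI.decode_www_form(URI(landing_url).query.to_s).to_h

      # 1. Explicit campaign parameters win (e.g. tagged links in an email).
      if params["utm_source"]
        return { source: params["utm_source"],
                 medium: params["utm_medium"] || "campaign",
                 campaign: params["utm_campaign"] }
      end

      # 2. No referrer at all: direct traffic (typed URL, bookmark, many mail clients).
      return { source: "(direct)", medium: "none" } if referrer.to_s.empty?

      # 3. Referrer on the known search-engine list: organic search.
      host = URI(referrer).host.to_s
      return { source: host, medium: "organic" } if SEARCH_ENGINES.any? { |s| host.include?(s) }

      # 4. Anything else: an ordinary referral.
      { source: host, medium: "referral" }
    end

    # A tagged email link vs. a visit coming from a Google result page.
    p traffic_source("https://example.com/me?utm_source=newsletter&utm_medium=email&utm_campaign=launch", nil)
    p traffic_source("https://example.com/me", "https://www.google.com/")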

Want to make a Search Engine

I want to make a torrent search engine that will provide links to other torrent sites, so I need data from those sites to index in my database. Is it legal to crawl a website for this purpose, or is there some other way to do it?
Depending on the site, it may not be legal to do so without permission.
You might wish to investigate Common Crawl, a project that has already crawled a large portion of the web and publishes the data. Check its Terms of Use to confirm the legality of your use case.
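As a hedged sketch of how Common Crawl can replace your own crawling, the Ruby snippet below queries its public URL index (the CDX API served at index.commoncrawl.org) for captures of a domain. The crawl identifier below is only an example; a new index is published for every crawl, so look up a current one on commoncrawl.org.

    require 'net/http'
    require 'uri'
    require 'json'

    # Example crawl id; replace with a current index listed on commoncrawl.org.
    CRAWL_ID = "CC-MAIN-2024-10"

    def commoncrawl_captures(domain)
      uri = URI("https://index.commoncrawl.org/#{CRAWL_ID}-index")
      uri.query = URI.encode_www_form(url: "#{domain}/*", output: "json")
      # The index answers with one JSON object per line, one per captured URL.
      Net::HTTP.get(uri).each_line.map { |line| JSON.parse(line) }
    end

    commoncrawl_captures("example.com").first(5).each do |capture|
      puts "#{capture['url']} (#{capture['timestamp']})"
    end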

Tracking users' clicks and page visits in Rails

I would like to monitor users' page visits and clicks in my Rails app to make recommendations. My questions are:
Is there a Rails gem for this, or is Google Analytics the standard? If the latter, how should I link a page visit to a particular user profile?
It is typical in Rails to have a shared <head> section in application.html.erb, which is used for all pages. If I add the Google Analytics pageview tracking code to that <head> in application.html.erb, will it be able to track all individual pages?
There are other ways, but the vast majority probably use Google Analytics. Several gems exist that help you integrate with GA to get at the data. See here: https://www.ruby-toolbox.com/categories/Web_Analytics.
Based on your first question, it seems you may want more insight than GA can provide. I've used ClickTale (http://www.clicktale.com) and Woopra (http://www.woopra.com) before, to good effect. This article lists several other alternatives, too - notice the high marks for Clicky: http://imimpact.com/web-stats-alternatives-to-google-analytics/.
Google Analytics (and almost all of these others) will take care of your second question automatically whenever the user loads a new page, since tracking is keyed by URL. That means that, although you put the GA script code in a single place, each unique page is tracked individually.
If you have AJAX requests that change the page without changing the URL, you'll need to dig into the GA script API. Essentially you'll need to push a new URL (possibly with a # in it) whenever you want to track an AJAX-driven link/button click. See here: http://davidwalsh.name/ajax-analytics
I am biased, but I would recommend checking out impressionist if you need to integrate the page views into the app in real time. With analytics you will always have some lag time, and you are also relying on an external dependency. Impressionist is good if you need this kind of control, but if you are just looking for simple metrics and don't need to pull them into the app, then analytics is probably the way to go.
Check out Ahoy, at https://github.com/ankane/ahoy. With just a few lines of code in your app, you can track page views and tie them to user accounts.
You can further customize Ahoy to track custom events, on both the client (with JavaScript) and the server.
Ahoy does not depend on any third-party services.
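For a sense of what that looks like in practice, here is a minimal controller-side sketch built around Ahoy's ahoy.track helper; the controller, model, and event name are invented for the example. Ahoy ties each event to the current visit (and to current_user when someone is signed in) and stores it in its own events table, so the data stays inside your app and can be queried like any other model.

    # app/controllers/profiles_controller.rb (hypothetical controller and model)
    class ProfilesController < ApplicationController
      def show
        @profile = Profile.find(params[:id])

        # Record the view as an Ahoy event; Ahoy associates it with the
        # current visit and, if available, the signed-in user.
        ahoy.track "Viewed profile", profile_id: @profile.id
      end
    end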

Is there a search engine, including an indexing bot, that can be used to build a custom catalogue by feeding the bot certain properties?

Our application (C#/.NET) needs to run a lot of search queries. Google's limit of 50,000 queries per day is not enough. We need something that will crawl websites according to rules we set (for example, specific country domains) and gather URLs, text, keywords, and website names to build our own internal catalogue, so we aren't limited by any massive external search engine like Google or Yahoo.
Is there any free open source solution we could use to install it on our server?
No point in re-inventing the wheel.
DataparkSearch might be the one you need. Or review this list of other Open Source Search Engines.
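If you just want to prototype the kind of rule-based gathering described in the question before committing to one of those engines, a rough sketch might look like the following Ruby script: it restricts URLs to example country domains and pulls out each page's URL, title, keywords, and text. The seed list is a placeholder, the regex-based HTML handling is deliberately crude, and everything a production crawler needs (robots.txt, politeness delays, link extraction, persistence) is left out.

    require 'net/http'
    require 'uri'

    ALLOWED_TLDS = %w[.de .fr .nl].freeze            # example country-domain rule
    SEEDS = [URI("https://www.example.de/")]         # placeholder seed URLs

    def allowed?(uri)
      ALLOWED_TLDS.any? { |tld| uri.host.to_s.end_with?(tld) }
    end

    # Crude extraction of the fields mentioned in the question.
    def extract_record(uri, html)
      {
        url:      uri.to_s,
        title:    html[/<title[^>]*>(.*?)<\/title>/im, 1].to_s.strip,
        keywords: html[/<meta\s+name=["']keywords["']\s+content=["'](.*?)["']/im, 1].to_s,
        text:     html.gsub(/<script.*?<\/script>|<style.*?<\/style>|<[^>]+>/im, " ").squeeze(" ")
      }
    end

    catalogue = []
    SEEDS.each do |uri|
      next unless allowed?(uri)
      catalogue << extract_record(uri, Net::HTTP.get(uri))
      # A real crawler would also extract links here, queue the allowed ones,
      # and write the records into the internal catalogue database.
    end

    puts catalogue.first[:title]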
