How do you build a torrent file indexer? - search-engine

I am curious about the technology behind a search engine like torrentz.com. From what I could observe, it doesn't host any torrent files, but rather connects you to other servers that do.
You search for keywords, and it brings up a list of potential titles matching your search.
Then you pick one of these, and it provides you with another list of potential servers hosting the corresponding torrent file.
What I'm interested in particularly is the strategy behind gathering and indexing all that content:
How do they collect then aggregate the data?
Is it a submission-based service, where each of these servers submits its content for indexing?
Is it a crawling algorithm? If so how do you even start crawling a site like piratebay.org?
Do they have access to these other servers' databases?
My knowledge and understanding of the BitTorrent protocol is not very deep, and the documentation I found online pointed me more toward the processes involved in building a tracker service, which isn't exactly what I'm interested in. Any insight and recommended reading material is appreciated.

To begin with, start indexing their RSS feeds and gathering data from them. The next step would be indexing the portals' pages themselves (Mininova, TPB, etc.), but watch out for the fact that you can get banned (IP-based) for doing so, since that would generate a huge number of requests against their servers (I don't think they would be too happy about that).
That said, I doubt they have access to the other servers' databases; more likely it's crawling + RSS.
Another thing you can do: when somebody queries an item that you don't have in your database, run the query against the main BT portals, cache the result in your DB, and then display the results. If another user later makes the same query (which is a pretty common scenario), you can show them the cached data plus new data from the RSS feeds.
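A minimal sketch of that RSS-polling-plus-cache idea in Python (the feed URL and the table layout here are made-up assumptions, not any portal's real API):

import sqlite3
import time
import urllib.request
import xml.etree.ElementTree as ET

FEED_URL = "https://example-torrent-portal.org/rss"   # hypothetical feed URL

db = sqlite3.connect("index.db")
db.execute("CREATE TABLE IF NOT EXISTS torrents (title TEXT, link TEXT UNIQUE, fetched_at REAL)")

def index_feed(url=FEED_URL):
    """Pull an RSS feed and store every <item> we have not seen before."""
    with urllib.request.urlopen(url) as resp:
        root = ET.fromstring(resp.read())
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        db.execute("INSERT OR IGNORE INTO torrents VALUES (?, ?, ?)", (title, link, time.time()))
    db.commit()

def search(keywords):
    """Serve search results from the local cache built by index_feed()."""
    pattern = "%" + keywords + "%"
    return db.execute("SELECT title, link FROM torrents WHERE title LIKE ?", (pattern,)).fetchall()

The same table can also hold results cached from on-demand queries against the big portals, so repeated searches are answered locally instead of hammering their servers.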

Related

Impact of Fiori Apps on Server

We are following the Embedded Architecture for our S/4HANA 1610 system.
Please let me know what the impact on the server will be if we implement 200+ standard Fiori apps in our system.
Regards,
Sayed
When you say “server”, are you referring to the ABAP backend, consisting of one or more SAP application servers and usually one database server?
In this case, you might get a first impression using transaction ST03.
Here, you get a detailed analysis of resource consumption on the SAP application server.
You also get information about database access times, as seen from the application server.
This can give you a good hint about resource consumption on the database server.
Usually, the ABAP backend is accessed from Fiori via OData calls.
Not every user interaction causes an OData call, some interactions are handled locally at the frontend.
In general, implemented apps only require some space on the hard disk, as long as nobody is using them.
So the important questions for defining the expected workload are (a rough back-of-the-envelope estimate is sketched after this list):
How many users are working with these apps, and at which frequency (average think time)?
How many OData calls are sent from these apps to the backend, and how many dialog steps are handled by the frontend itself?
How expensive are these OData calls (see ST03)?
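Purely as an illustration of how those questions combine (every number below is a made-up assumption, not SAP guidance; real figures would come from measurements such as ST03):

concurrent_users = 500            # users active during the peak hour (assumed)
interactions_per_user_hour = 60   # one user interaction per minute on average (assumed)
odata_calls_per_interaction = 2   # the rest is handled locally by the frontend (assumed)
avg_cpu_ms_per_call = 150         # average backend CPU time per OData call (assumed)

calls_per_second = concurrent_users * interactions_per_user_hour * odata_calls_per_interaction / 3600
cpu_load = calls_per_second * avg_cpu_ms_per_call / 1000   # CPU-seconds consumed per second

print(f"~{calls_per_second:.1f} OData calls/s, ~{cpu_load:.1f} backend CPU cores kept busy")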
Every app reflects one or more typical business processes, which need to be defined.
Your specific Customizing also plays an important role, because it controls different internal functionality.
It’s also mandatory to optimize database access, because in productive use tables might grow in size, which might slow down database access over time.
Usually, this kind of sizing is done by SAP Hardware and Technology partners.

How to do Neo4j Cache-based Sharding?

I've been reading Neo4j's Operational Manual on Cache Sharding, and posts all over the web; however, I can hardly find any detailed example of how to configure HAProxy for cache sharding (yes, the one in the Operational Manual is rather brief) on a real-world graph, which may contain multiple node labels.
Has anyone ever done this before? Would be lovely if you could share your experience.
Moreover, I'm a bit confused about the mechanism behind sharding the graph with HAProxy. How do sub-graphs get cached on certain slaves merely by providing rules in HAProxy? It surprised me to learn that cache sharding isn't handled by Neo4j itself.
The goal is to always send queries hitting the same region of your graph to the same instance. This of course means that the request data has to indicate the region. What to use as the "region indicator" depends heavily on the structure and shape of your graph.
In a lot of customer-facing applications, people have successfully used the current user ID, set as an additional HTTP header, which is then evaluated by HAProxy.
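A minimal HAProxy sketch of that idea (the header name, ports and server addresses are assumptions to adapt to your own cluster; this is not taken from the Neo4j manual):

frontend neo4j_http
    bind *:7474
    default_backend neo4j_cluster

backend neo4j_cluster
    # Hash the user-id header so the same user, and therefore the same
    # region of the graph, always lands on the same instance and keeps
    # that sub-graph warm in its cache.
    balance hdr(X-User-Id)
    hash-type consistent
    server neo4j1 10.0.0.11:7474 check
    server neo4j2 10.0.0.12:7474 check
    server neo4j3 10.0.0.13:7474 check

Note that balance hdr(<name>) falls back to round robin for requests that are missing the header, so the application has to set it consistently.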

How to reduce disk read/write overhead?

I have a website mainly composed of JavaScript, hosted on IIS.
This website requests images from a particular folder on the hard disk and displays them to the end user.
The image requests are very frequent and need to be fast.
Is there any way to reduce the overhead of these disk read operations?
I have heard about memory mapping, where a portion of the hard disk can be mapped and used as if it were primary memory.
Can somebody tell me whether I am right or wrong about this, and if I am right, what the steps are to do it?
If I am wrong, is there any other solution for this?
While memory mapping is a viable idea, I would suggest using memcached. It runs as a distinct process, permits horizontal scaling, and is tried and tested, with active deployments at some of the most demanding websites. A well-implemented memcached setup can reduce disk access significantly.
It also has client bindings for many languages, which talk to it over the network. I assume you want a solution for Java (most of your tags relate to that language). Read this article for installation and other admin tasks; it also has code samples (for Java) that you can start off with.
In rough terms, here is what you need to do when you receive a request for, let's say, Moon.jpeg (sketched in Java; "memcached" is assumed to be an already-connected client such as spymemcached, "imageFolder" the directory holding the images, and the usual java.nio.file imports are in place):
byte[] serveImage(String name) throws Exception {      // e.g. name = "Moon.jpeg"
    String key = "img:" + name;                         // or an MD5 hash, or some other key generation mechanism
    byte[] image = (byte[]) memcached.get(key);         // look it up in memcached first
    if (image == null) {                                // cache miss: read the file from disk
        image = Files.readAllBytes(Paths.get(imageFolder, name));
        memcached.set(key, 3600, image);                // store in memcached, associated with the key
    }
    return image;                                       // cache hit: supplied with no disk access
}
This is a very crude algorithm; you can read more about cache algorithms in this Wikipedia article.
This is the wrong direction, though. You want to reduce I/O to (relatively) slow disks; what you actually want is the files held in physical memory. In simple scenarios the OS handles this automagically with its file cache. You may want to check whether Windows provides any tunable parameters, or at least see what performance metrics you can gather.
If I remember correctly (from years ago), IIS handles static files very efficiently thanks to a kernel-mode driver linked to IIS, but only if the request doesn't pass through further ISAPI filters, etc. You can probably find some information related to this on Channel9.
Longer term, you should look at moving static assets to a CDN such as CloudFront.
Like any problem though... are you sure you have a problem?

Printing from one Client to another Client via the Server

I don't know if it sounds crazy, but here's the scenario -
I need to print a document over the internet. My PC (ClientX) initiates the process, using the web browser to access ServerY on the internet, and the printer is connected to ClientZ (which may be yours).
1. The document is stored on ServerY.
2. ClientZ is purely a client; no IIS, no print server, etc.
3. I have the specific details of ClientZ: IP, port, etc.
4. It'll be a completely server-side application (with no client-side component on ClientZ), built with ASP.NET & C#.
So, is it possible? If yes, please give me some clues. Thanks in advance.
This is kind of too big a question for SO, but basically what you need to do is:
upload files to the server -- trivial
do some stuff to figure out if they are allowed to print the document -- trivial to hard depending on scope
add items to a queue for printing and associate them with a user/session -- easy
render and print the document -- trivial to hard depending on scope
notify the user that the document has been printed
handle errors
The big unknown here is scope: if this is for a school project, you probably don't have to worry about billing or queue priority in step two. If it's for a commercial product, billing can be a significant subsystem in itself.
The difficulty of step 4 depends directly on which formats you are going to support, as many formats require document-specific libraries or applications. There are also security considerations here if this is a commercial product, since it isn't safe to try to render all types of files.
Notifications can be easy or hard depending on how you want to do them. You can post back to the HTML page, but depending on how long a job takes to complete, it might be nice to have an email option as well.
You also need to think about errors. What is going to happen when paper or toner runs out or when someone tries to print something on A4 paper? Someone has to be notified so that jobs don't just build up.
On the server I would run just the user interaction piece on the web and have a "print daemon" running as a service to manage getting the documents printed and monitoring their status. I would use WCF to do IPC between the two.
Within the print daemon you are going to need a set of components to print different kinds of documents. I would make one assembly per type (or cluster of types) and load them into your service as plugins using MEF.
Sorry this is so general, but you are asking a pretty general and difficult-to-answer question.

How does the spider in a search engine work?

How does a crawler or spider in a search engine work?
Specifically, you need at least some of the following components:
Configuration: Needed to tell the crawler how, when and where to connect to documents; and how to connect to the underlying database/indexing system.
Connector: This will create the connections to a web page or a disk share or anything, really.
Memory: The pages already visited must be known to the crawler. This is usually stored in the index, but it depends on the implementation and the needs. The content is also hashed, for de-duplication and update-validation purposes.
Parser/Converter: Needed to be able to understand the content of a document and extract meta-data. Will convert the extracted data to a format usable by the underlying database system.
Indexer: Will push the data and meta-data to a database/indexing system.
Scheduler: Will plan runs of the crawler. Might need to handle a large number of running crawlers at the same time and take into consideration what is currently being done.
Connection algorithm: When the parser finds links to other documents, the crawler needs to decide when, how, and where the next connections must be made. Also, some indexing algorithms take the page connection graph into consideration, so it might be necessary to store and sort information related to that.
Policy Management: Some sites require crawlers to respect certain policies (robots.txt, for example); a small robots.txt check is sketched after this list.
Security/User Management: The crawler might need to be able to log in to some systems to access data.
Content compilation/execution: The crawler might need to execute certain content (applets or plugins, for example) to be able to access what's inside.
Crawlers need to be efficient at working together from different starting points, and at managing speed, memory usage and a high number of threads/processes. I/O is key.
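To make the policy-management item concrete, here is a small Python sketch using the standard library's urllib.robotparser (the site and user-agent names are placeholders):

import urllib.robotparser

# Placeholder site; a real crawler would do this once per host it visits.
robots = urllib.robotparser.RobotFileParser("https://example.com/robots.txt")
robots.read()                                   # fetch and parse robots.txt

url = "https://example.com/some/page.html"
if robots.can_fetch("MyCrawlerBot", url):       # respect the site's policy
    print("allowed to crawl", url)
else:
    print("disallowed by robots.txt:", url)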
The World Wide Web is basically a connected, directed graph of web documents, images, multimedia files, etc. Each node of the graph is a component of a web page; for example, a web page consists of images, text, video, etc., all of them linked. A crawler traverses this graph using breadth-first search, following the links in web pages.
A crawler initially starts with one (or more) seed points.
It scans the webpage and explores the links in that page.
This process continues until the whole graph has been explored (some predefined constraint can be used to limit the search depth). A minimal sketch of such a breadth-first crawl is shown below.
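A stripped-down Python sketch of that breadth-first traversal (fetching, link extraction and indexing are deliberately simplified, and robots.txt handling and politeness delays are omitted):

from collections import deque
from urllib.parse import urljoin
import re
import urllib.request

def index_page(url, html):
    """Placeholder for the parser/indexer components described above."""
    print("indexed", url, len(html), "bytes")

def crawl(seeds, max_pages=100):
    """Breadth-first crawl starting from one or more seed URLs."""
    frontier = deque(seeds)        # FIFO queue gives breadth-first order
    visited = set(seeds)           # the crawler's memory of pages already seen
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()
        try:
            html = urllib.request.urlopen(url).read().decode("utf-8", "replace")
        except Exception:
            continue               # skip pages that cannot be fetched
        fetched += 1
        index_page(url, html)      # hand the content to the parser/indexer
        for link in re.findall(r'href="([^"]+)"', html):   # crude link extraction
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in visited:
                visited.add(absolute)
                frontier.append(absolute)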
From How Stuff Works
How does any spider start its travels over the Web? The usual starting points are lists of heavily used servers and very popular pages. The spider will begin with a popular site, indexing the words on its pages and following every link found within the site. In this way, the spidering system quickly begins to travel, spreading out across the most widely used portions of the Web.
