Tiny URL Design - Zero Collision Design with ZooKeeper - Issues

Here is a popular design, found on the internet, for a Tiny URL application:
The application is load-balanced with the help of ZooKeeper, where each server is assigned a counter range registered with ZooKeeper.
Each application server has a locally available counter, which increments on every write request.
There is a cache available which (probably) gets updated on each write request.
Gaps in my understanding:
For every write request, we don't check whether a tiny URL already exists in the DB for the large URL, so we keep inserting rows for all write requests (even though a tiny URL may already exist for that particular large URL). Is that correct? If so, would there be a clean-up activity (removing redundant duplicate tiny URLs for the same large URL) during some intentional downtime of the application during the day?
What is the point of scaling if, for a range of 1 million (or more) counter values, there is just one server handling the requests? Wouldn't that be a problem? For example, if there were a large-scale write load, would the server be scaled vertically to avoid slowness?
Kindly correct me if I have got anything wrong here.

Design problems are open ended; keeping that in mind, here is my take on your questions.
Why there is no check if a large URL is already in the database
It may be a requirement to allow users to have their own tiny URLs, even if they point to the same large URL. For example, every user might want to see stats on how many times their specific tiny URL was clicked on; this is a typical usage of tiny URLs - put them into a blog/video/letter to get stats.
Scaling the service
Let me extend "each server is assigned a counter range registered". This implies that the generated IDs have the structure: X bits of server ID + Y bits from the local counter. The X bits are assigned by ZooKeeper, and this is what makes each server responsible for its own range.
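To make the bit layout concrete, here is a small sketch (Python; the bit widths and values are arbitrary, purely for illustration):

SERVER_ID_BITS = 10      # X bits: server/range ID handed out by ZooKeeper (assumed width)
COUNTER_BITS = 54        # Y bits: the server's local counter (assumed width)

def make_id(server_id: int, counter: int) -> int:
    # The server ID occupies the high bits and the local counter the low bits,
    # so two servers can never generate the same ID (hence zero collisions).
    assert 0 <= server_id < (1 << SERVER_ID_BITS)
    assert 0 <= counter < (1 << COUNTER_BITS)
    return (server_id << COUNTER_BITS) | counter

# Example: server 7 issuing its 42nd ID
print(make_id(7, 42))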
Several servers will be placed behind a load balancer. When a request comes to the load balancer, it is sent to a randomly picked server. If the servers are overloaded, you can simply add more servers behind the load balancer, each of which owns its own range. This allows the service as a whole to scale up and down (with no need for vertical scaling).
The key to understanding this design is that those ranges are arbitrary ranges; there is no need for them to be consecutive.

Related

Slowness in the geolocation API

I'm working on a project that uses HERE's geolocation service.
The project is basically a feature in our system that will route a list of addresses. This routing will happen every day and will involve at least around 7000 points.
Today we use the HERE service to geolocate these addresses and send them to our routing service. However, we are facing a huge bottleneck in this implementation: of the 7000 points we use for testing, we were able to geolocate only about 200; if we send a larger number of points, we simply stop receiving responses, with no timeout or error returned.
About the implementation: we do not send all points in the same request; each point to be geocoded is sent in its own request. We adjusted our software to send only four requests per second, thinking that there could be a QPS limit, but that did not solve the problem. We also thought about implementing a message queue, but this could end up increasing the total time of geolocation + routing, which for us makes that solution unfeasible.
In the code, we have an array that stores the addresses to be geocoded, and for each position of the array we execute a GET request for the following URL: https://geocoder.ls.hereapi.com/6.2/geocode.json?apiKey=TOKEN&searchtext=ADDRESS
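Roughly, the loop looks like the following (a simplified sketch, not the actual code; the four-requests-per-second throttle is the adjustment mentioned above, and the timeout value is illustrative):

import time
import requests

GEOCODER_URL = "https://geocoder.ls.hereapi.com/6.2/geocode.json"

def geocode_addresses(addresses, api_key, max_per_second=4):
    results = []
    for address in addresses:
        started = time.time()
        response = requests.get(
            GEOCODER_URL,
            params={"apiKey": api_key, "searchtext": address},
            timeout=30,  # illustrative; in practice we see neither a response nor a timeout
        )
        response.raise_for_status()
        results.append(response.json())
        # Crude client-side throttle: at most max_per_second requests per second
        time.sleep(max(0.0, 1.0 / max_per_second - (time.time() - started)))
    return results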
I would appreciate any help in finding a solution.
For a large number of geocodes you may wish to consider the Batch Geocoder API:
https://developer.here.com/documentation/batch-geocoder/dev_guide/topics/quick-start-batch-geocode.html
I cannot replicate a problem with more than 200 Geocoder requests in a row, so we may need to see some code before we can help further.
Are you using our freemium service? Just to let you know, version 6.2 of the Geocoder API no longer supports any new feature development, so if you are still implementing this use case, please try to switch to v7. Do you mean that you are not able to send the entire 7000 addresses and fetch responses even in chunks? It could also be due to the Linux system restricting the number of pooled network connections open at the same moment; try sending requests from a home endpoint (one that is not behind a firewall) and from a Windows system.

Controlled data sharding in MongoDB

I am new to MongoDB and have only a very basic knowledge of its sharding concepts. However, I was wondering whether it is possible to control the split of the data yourself - for example, so that a certain part of the records is stored on one specific shard?
This will be used together with a Rails app.
You can turn off the balancer to stop auto balancing:
sh.setBalancerState(false)
If you know the range of the key you are splitting on, you could also pre-split your data ranges onto the desired servers (see the PreSplitting example). The management of the shards would be done via the JavaScript shell and not via your Rails application.
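For illustration only, here is a rough sketch of a pre-split plus a manual chunk placement using the underlying admin commands (issued here through pymongo purely as an example; in practice you would run the equivalent sh.splitAt / sh.moveChunk helpers in the JS shell; the namespace, shard key, value, and shard name are made up):

from pymongo import MongoClient

# Connect to a mongos router, not directly to a shard
client = MongoClient("mongodb://mongos-host:27017")

# Split the collection's chunk at a chosen shard-key value (equivalent to sh.splitAt)
client.admin.command("split", "mydb.records", middle={"customer_id": 50000})

# Move the chunk containing that value to a specific shard (equivalent to sh.moveChunk)
client.admin.command(
    "moveChunk", "mydb.records", find={"customer_id": 50000}, to="shard0001"
)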
You should take care that no shard gets more load than the others (becomes hot); that is why there is auto-balancing by default. Using monitoring such as the free MMS service will help you keep an eye on that.
The decision to shard is a complex one, and you should put a lot of thought into it.
There's a lot to learn about sharding, and much of it is non-obvious. I'd suggest reviewing the information at the following links:
Sharding Introduction
Sharding Overview
FAQ
In the context of a sharded cluster, a chunk is a contiguous range of shard key values assigned to a particular shard. By default, chunks are 64 megabytes (unless the chunk size has been modified). When they grow beyond the configured chunk size, a mongos splits the chunk into two chunks. MongoDB chunks are logical, and the data within them is NOT physically located together.
As mentioned, the balancer moves the chunks around; however, you can also do this manually. The balancer decides to re-balance and requests a chunk migration if there is a large enough difference (a minimum of 8) between the number of chunks on each shard. The actual moving of the chunks is coordinated between the "from" and "to" shards, and when this is finished, the original chunks are removed from the "from" shard and the config servers are informed.
Quite a lot of people also pre-split, which helps with their migration. See here for more information.
In order to see documents split among the two shards, you'll need to insert enough documents to fill up several chunks on the first shard. If you haven't changed the default chunk size, you'd need to insert a minimum of 512MB of data in order to see data migrated to a second shard. It's often a good idea to test this, and you can do so by setting your chunk size to 1MB and inserting 10MB of data. Here is an example of how to test this.
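A rough sketch of that test setup (pymongo against a mongos; it assumes the collection testdb.testcoll already exists and is sharded, and the names are placeholders):

from pymongo import MongoClient

client = MongoClient("mongodb://mongos-host:27017")

# Lower the chunk size to 1 MB (stored in the config database) so that
# splits and migrations can be observed with very little data.
client.config.settings.update_one(
    {"_id": "chunksize"}, {"$set": {"value": 1}}, upsert=True
)

# Insert roughly 10 MB of documents into the sharded test collection.
padding = "x" * 1024  # about 1 KB per document
client.testdb.testcoll.insert_many(
    [{"user_id": i, "pad": padding} for i in range(10 * 1024)]
)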
Probably http://www.mongodb.org/display/DOCS/Tag+Aware+Sharding addresses your requirement in v2.2
Check out Kristina Chodorow's blog post too for a nice example: http://www.kchodorow.com/blog/2012/07/25/controlling-collection-distribution/
Why do you want to split the data yourself if MongoDB is automatically doing it for you? You can upgrade your Rails application layer to talk to a mongos instance, so that mongos routes the call for any CRUD operation to the place where the data resides. This is achieved using the config servers.

How to reduce disk read/write overhead?

I have a website mainly composed of JavaScript, hosted on IIS.
The website requests images from a particular folder on the hard disk and displays them to the end user.
The image requests are very frequent and rapid.
Is there any way to reduce the overhead of these disk read operations?
I have heard about memory mapping, where a portion of the hard disk can be mapped and used as if it were primary memory.
Can somebody tell me whether I am right or wrong? If I am right, what are the steps to do this?
If I am wrong, is there any other solution for this?
While memory mapping is a viable idea, I would suggest using memcached. It runs as a distinct process, permits horizontal scaling, and is tried, tested, and in active deployment at some of the most demanding websites. A well-implemented memcached setup can reduce disk access significantly.
It also has bindings for many languages. I assume you want a solution for Java (most of your tags relate to that language). Read this article for installation and other admin tasks; it also has code samples (for Java) that you can start off with.
In rough terms, what you need to do when you receive a request for, let's say, Moon.jpeg is the following (sketched here with Python's pymemcache client purely for brevity; the same flow applies with the Java bindings):

import hashlib
from pymemcache.client.base import Client

cache = Client(("localhost", 11211))     # local memcached instance

def serve_image(filename):
    key = hashlib.md5(filename.encode()).hexdigest()  # or some other key generation mechanism
    data = cache.get(key)
    if data is None:                     # cache miss
        with open(filename, "rb") as f:  # read Moon.jpeg from disk
            data = f.read()
        cache.set(key, data)             # later requests are served from memory, no disk access
    return data
This is a very crude algorithm; you can read more about cache algorithms in this Wikipedia article.
This is the wrong direction. You want to reduce I/O to (relatively) slow disks, so you would want to have the files mapped into physical memory. In simple scenarios the OS will handle this automagically with its file cache. You may check whether Windows provides any tunable parameters, or at least see what performance metrics you can gather.
If I remember correctly (from years ago), IIS handles static files very efficiently thanks to a kernel-level driver linked to IIS, but only if the request doesn't pass through further ISAPI filters, etc. You can probably find some info related to this on Channel9.
Longer term, you should look at moving static assets to a CDN such as CloudFront.
Like any problem though... are you sure you have a problem?

How does the spider in a search engine work?

How does a crawler or spider in a search engine work?
Specifically, you need at least some of the following components:
Configuration: Needed to tell the crawler how, when and where to connect to documents; and how to connect to the underlying database/indexing system.
Connector: This will create the connections to a web page or a disk share or anything, really.
Memory: The pages already visited must be known to the crawler. This is usually stored in the index, but it depends on the implementation and the needs. The content is also hashed for de-duplication and update-validation purposes.
Parser/Converter: Needed to be able to understand the content of a document and extract meta-data. Will convert the extracted data to a format usable by the underlying database system.
Indexer: Will push the data and meta-data to a database/indexing system.
Scheduler: Will plan runs of the crawler. Might need to handle a large number of running crawlers at the same time and take into consideration what is currently being done.
Connection algorithm: When the parser finds links to other documents, it is necessary to analyse when, how, and where the next connections must be made. Also, some indexing algorithms take into consideration the page connection graph, so it might be necessary to store and sort information related to that.
Policy Management: Some sites require crawlers to respect certain policies (robots.txt, for example).
Security/User Management: The crawler might need to be able to log in to some systems to access data.
Content compilation/execution: The crawler might need to execute certain things to be able to access what's inside, like applets/plugins.
Crawlers need to be efficient at working together from different starting points, and at managing speed, memory usage, and a high number of threads/processes. I/O is key.
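As a rough illustration only, here is a skeletal sketch (Python) of how a few of those components might be shaped; every class and method name here is made up:

import hashlib

class Memory:
    """Tracks pages already visited and content hashes for de-duplication."""
    def __init__(self):
        self.visited = set()
        self.content_hashes = set()

    def should_fetch(self, url):
        return url not in self.visited

    def remember(self, url, content):
        self.visited.add(url)
        self.content_hashes.add(hashlib.sha1(content.encode()).hexdigest())

class Connector:
    """Creates the connection to a web page, a disk share, etc."""
    def fetch(self, url):
        raise NotImplementedError

class Parser:
    """Extracts text, meta-data, and outgoing links from a document."""
    def parse(self, raw):
        raise NotImplementedError  # return (text, metadata, links)

class Indexer:
    """Pushes data and meta-data to the underlying database/indexing system."""
    def index(self, url, text, metadata):
        raise NotImplementedError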
The World Wide Web is basically a connected, directed graph of web documents, images, multimedia files, etc. Each node of the graph is a component of a web page - for example, a web page consists of images, text, video, etc., and all of them are linked. A crawler traverses the graph using breadth-first search, following the links in web pages.
A crawler initially starts with one (or more) seed points.
It scans the webpage and explores the links in that page.
This process continues until the whole graph is explored (some predefined constraint can be used to limit the search depth).
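A minimal sketch of those steps, assuming Python with the requests library and the standard library's HTMLParser; the seed URL, depth limit, and timeout are illustrative:

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
import requests

class LinkExtractor(HTMLParser):
    """Collects the href values of anchor tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_url, max_depth=2):
    seen = {seed_url}
    queue = deque([(seed_url, 0)])          # BFS frontier of (url, depth)
    while queue:
        url, depth = queue.popleft()
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue                        # skip unreachable pages
        yield url, html                     # hand the page off for indexing
        if depth >= max_depth:              # predefined constraint on search depth
            continue
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            link = urljoin(url, href)
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))

# Usage: for url, html in crawl("https://example.com"): ...index the page...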
From How Stuff Works
How does any spider start its travels over the Web? The usual starting points are lists of heavily used servers and very popular pages. The spider will begin with a popular site, indexing the words on its pages and following every link found within the site. In this way, the spidering system quickly begins to travel, spreading out across the most widely used portions of the Web.

How do you build a torrent file indexer?

I am curious about the technology behind a search engine like torrentz.com. From what I could observe, it doesn't host any torrent files, but rather connects you to other servers that do.
You search for keywords, and it brings up a list of potential titles matching your search.
Then you pick one of these, and it provides you with another list of potential servers hosting the corresponding torrent file.
What I'm interested in particularly is the strategy behind gathering and indexing all that content:
How do they collect then aggregate the data?
Is it a submission base service, where each of these servers submits its content for indexing?
Is it a crawling algorithm? If so how do you even start crawling a site like piratebay.org?
Do they have access to these other servers' databases?
My knowledge and understanding of the bittorrent protocol is not very elaborate, but the documentation that I found online pointed me more toward the processes involved in building a tracker service, which isn't exactly what I'm interested in. Any insight and recommended reading material is appreciated.
To begin with, start indexing their RSS feeds and gathering data from them. The next step would be indexing the portals' pages (like Mininova, TPB, etc.), but watch out for the fact that you can get banned (IP-based) for doing so, since that would provoke a huge amount of data being requested from their servers (I don't think they would be too happy about that).
That said, I doubt that they have access to other servers' databases; rather, it's crawling + RSS.
Another thing you can do: when somebody makes a query for an item which you don't have in your database, you make the query against the main BT portals, cache the result in your DB, and then display the results. Then, if another user makes the same query (which is a pretty common scenario), you can show them the cached data plus new data from the RSS feeds.
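A rough sketch of the RSS-indexing idea above (Python with the feedparser library; the feed URL and the SQLite schema are placeholders, not anything a real portal necessarily exposes):

import sqlite3
import feedparser

FEED_URL = "https://example-torrent-portal.example/rss"   # placeholder feed

def index_feed(feed_url=FEED_URL, db_path="torrents.db"):
    db = sqlite3.connect(db_path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS torrents (title TEXT, link TEXT UNIQUE)"
    )
    feed = feedparser.parse(feed_url)
    for entry in feed.entries:
        # UNIQUE(link) + INSERT OR IGNORE de-duplicates items seen in earlier runs
        db.execute(
            "INSERT OR IGNORE INTO torrents (title, link) VALUES (?, ?)",
            (entry.title, entry.link),
        )
    db.commit()
    db.close()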
