Scalable sites - scalability

Facebook, Skype, MySpace, etc. all have millions and millions of users. Does anyone know what their architectures look like? Are they distributed across different nodes, or do they use massive clusters?

Check the link below to read how big applications like Amazon, eBay, Flickr, Google, etc. live with high traffic.
http://highscalability.com/links/weblink/24
It's an interesting website for architects.
(I blogged about this earlier, after doing research for a BIG project - http://blog.ninethsense.com/high-scalability-how-big-applications-works/)

Memcached is used by a lot of sites with a lot of users, including Facebook. You can find lots of blogs that discuss the architecture of various high-traffic web sites. A long time ago I listened to this ARCast, which I thought was quite interesting (if you do ASP.NET with SQL Server).
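For context, the usual pattern is "cache-aside": check memcached first, and only hit the database on a miss. Below is a minimal sketch using the python-memcached client; the server address, key naming, and load_user_from_db() are made-up placeholders, not anyone's actual production code.

```python
# Minimal cache-aside sketch using python-memcached (pip install python-memcached).
# Server address, key naming, and load_user_from_db() are illustrative only.
import memcache

mc = memcache.Client(["127.0.0.1:11211"])

def load_user_from_db(user_id):
    # Placeholder for a real database query.
    return {"id": user_id, "name": "example"}

def get_user(user_id):
    key = "user:%d" % user_id
    user = mc.get(key)                     # 1. try the cache first
    if user is None:
        user = load_user_from_db(user_id)  # 2. fall back to the database
        mc.set(key, user, time=300)        # 3. cache the result for 5 minutes
    return user
```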

Facebook uses Hadoop/Hive and Erlang among other things (see http://www.facebook.com/notes.php?id=9445547199)

This may not directly answer your question, but it's interesting to watch nonetheless: see the Facebook software stack.

Image/File hosting storage best practices and standards

We are building an image and file hosting website, and we will save these files on our servers, so I want to know whether there are any best practices or standards I need to read and follow to make our website scalable and easy to extend in the future.
If there is a book, article, or video on this subject, please share it.
In my experience dealing with large data, it's always best to opt for the cloud; check out Amazon S3 (Amazon AWS) or Windows Azure.
Features like a CDN (CloudFront) are a big plus.
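For what it's worth, if you do go the S3 route, the upload side really is only a few lines with boto3. This is just a sketch; the bucket name and key layout are hypothetical, and credentials are assumed to come from the usual AWS configuration.

```python
# Sketch of storing a user upload in Amazon S3 with boto3 (pip install boto3).
# Bucket name and key layout are placeholders; point CloudFront (the CDN
# mentioned above) at the bucket to serve the files.
import boto3

s3 = boto3.client("s3")

def store_upload(local_path, user_id, filename):
    key = "uploads/%s/%s" % (user_id, filename)
    s3.upload_file(local_path, "my-image-hosting-bucket", key)
    return key
```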
I believe this is not a simple question that can be answered without knowing the following (a back-of-envelope sketch follows the list):
how many files are expected?
how many user/file accesses per hour/day/minute?
your usage scenarios for these files (downloading? streaming? how many concurrent files downloaded at once?)
are you stuck with one particular OS (Windows) and filesystem (NTFS), or is there freedom in this?
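To make those questions concrete, here is a hypothetical back-of-envelope estimate; every number is invented for illustration, not a recommendation.

```python
# Hypothetical capacity estimate built from answers to the questions above.
# All figures are made-up examples.
files_per_day = 50_000              # new files uploaded per day
avg_file_mb = 2.5                   # average file size in MB
downloads_per_file_per_day = 20     # average accesses per file per day

storage_per_year_tb = files_per_day * avg_file_mb * 365 / 1_000_000
egress_per_day_gb = files_per_day * downloads_per_file_per_day * avg_file_mb / 1_000

print("New storage per year: %.1f TB" % storage_per_year_tb)  # ~45.6 TB
print("Egress per day: %.1f GB" % egress_per_day_gb)          # ~2500 GB
```

Numbers like these are what decide whether a single file server, a SAN, or a cloud/CDN setup is appropriate.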
My personal note: building your own image/file hosting is not a trivial task; I strongly recommend that you hire somebody with experience in this area.
I would recommend that, if possible, you look at a third-party solution that provides an API. You'll then get the benefits of a lower cost of ownership, no maintenance costs for the hardware, and continual updates thrown in for free when the third party adds new features to the core offering. I know this from first-hand experience, as we scoped out the options for doing this in a recent project and came to the conclusion that we'd spend 100 times more on our own solution and, even then, might not get it right.
We opted for a company called Razuna, who offer both a hosted and an open-source version of their platform. Their API is very straightforward and can be consumed inside your MVC app with potentially only a few days' effort (depending on your use case). The beauty of this approach is that the hosted elements are actually on the Nirvanix backbone and are served via their CDN - so win-win.
You can get the details at:
http://www.razuna.com
and can view the api docs at:
http://wiki.razuna.com/display/ecp/Developer+Guides
Good luck, and if you need any further real-life guidance on this, feel free to come back. Oh, and by the way, we were also able to ask for 'paid for' features to be added to the core offering at pretty much standard market day rates.

Would I Regret Using ASP.NET MVC?

Firstly, I'm not going to start off asking which is better, ASP.NET MVC or Ruby on Rails. I already know that both are very good MVC solutions, even though I'll admit that at present I only have experience using ASP.NET MVC 3.
I really enjoy using VS 2010 for development work, and with a few additional plugins I'm blown away by it. Looking at NetBeans and RadRails really feels like going back a few years. I also really like Microsoft's take on MVC and the new Razor engine, and I enjoy developing websites with it.
So, my concern is this: if I were to create a website which 'snowballed' in size towards something as big as Amazon, eBay, Facebook, etc., would I eventually have regrets? Are IIS and Windows Server really up to it compared with Linux and, say, Apache? I know that a few years ago, when the big sites first came into being, ASP.NET MVC wasn't in existence; so in, say, 5 years' time will there be a bigger share of the market powered by Microsoft, or are Linux and Apache the more secure and stable workhorses I'm led to believe they are?
I'm also concerned because, when I came across this URL listing sites built with ASP.NET MVC, many of them didn't appear to be up and running:
http://weblogs.asp.net/mikebosch/archive/2008/05/05/gallery-of-live-asp-net-mvc-sites.aspx
I'm also led to believe that this website uses ASP.NET MVC, but then it doesn't have loads of images, etc.
Sorry it's such an open-ended question, but I would love to hear the opinions of anyone who has experience with very large websites.
Thank you.
The Stack Overflow site is built using ASP.NET MVC, so you will not regret using ASP.NET MVC.
Trust me, everyone wants to believe that their next project will be the next Amazon or eBay or Facebook. Those systems have hundreds of developers and more hardware than you can imagine. You can't design a site today for that kind of workload, because you will have no way to test it, or to know how anything you do will affect scalability, until you actually have the hardware and bandwidth to deal with those issues.
The best you can do is design the site for your needs now, and evolve it over time. You can always move to different technology during redesigns. But at that point, you will have a large budget and lots of people to work on the problems.
Do what works best now, and worry about the future IF it happens. Certainly, IIS and ASP.NET MVC are great technologies, and a lot of very busy sites run on them (microsoft.com is a very busy site, as is msn.com, etc.).
Stack Overflow is written in ASP.NET MVC, has millions of visitors, and is a very busy site. No, it's not very image-heavy, but in general images are about bandwidth, and any technology can handle sending images. If you need to do a lot of image processing, that might be different... but all that changes is how you process the images.
Your list of sites is two years old, and lots of sites go up and down in two years, so it's not surprising that many of them aren't still running. That's a business issue, not a technology one.
EDIT:
Regarding Ruby on Rails... I know of no site the size of Amazon, eBay, or Facebook that's written in RoR (or ASP.NET MVC, either). Twitter is probably the largest, but let's just say Twitter has pretty limited functionality. Penny Arcade, GitHub, and Hulu are probably no slouches either. Certainly Hulu is very media-intensive, but I don't know if they use Ruby for the actual video-serving portion (I kind of doubt it, but one never knows).
All you can do is develop in the technology you are most comfortable with. Otherwise, it won't be fun and you will never finish it.
You can build an efficient site on any existing technology.
If you go for Linux-based server software, you just save some cash on licenses.
If license cost isn't an issue, take whatever is more convenient for you.

Google search engine architecture - how do so many concurrent users search on it?

With millions of users searching for so many things on Google, Yahoo, and so on, how can the servers handle so many concurrent searches?
I have no clue as to how they made it so scalable.
Any insight into their architecture would be welcome.
One element: DNS load balancing.
There are plenty of resources on Google's architecture; this site has a nice list:
http://highscalability.com/google-architecture
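You can actually see DNS load balancing from any machine: large sites often return several A records for one hostname, and clients spread across them. A quick sketch (the hostname is just an example, and the records you see will vary by resolver and region):

```python
# Print the set of IP addresses a hostname resolves to. Multiple addresses
# for one name is DNS-level load balancing in action.
import socket

addrs = {info[4][0] for info in socket.getaddrinfo("www.google.com", 80,
                                                   proto=socket.IPPROTO_TCP)}
for ip in sorted(addrs):
    print(ip)
```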
DNS load balancing is correct, but it is not really the full answer to the question. Google uses a multitude of techniques, including but not limited to the following (a toy sharding sketch follows the list):
DNS Load Balancing (suggested)
Clustering - as suggested, but note the following:
clustered databases (database storage and retrieval are spread over many machines)
clustered web services (analogous to DNS LB here)
an internally developed clustered/distributed file system
highly optimised search indices and algorithms, making storage efficient and retrieval fast across the cluster
caching of requests (Squid), responses (Squid), and databases (in memory; see shards in the article above)
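None of this is Google's actual code, of course, but the core idea behind a clustered database is easy to sketch: hash each key to decide which node owns it, so data and load spread across the cluster. The node names here are invented.

```python
# Toy illustration of sharding: hash a key to pick the owning node.
# Real systems add replication and consistent hashing so that adding
# a node does not remap every key.
import hashlib

NODES = ["db-node-0", "db-node-1", "db-node-2", "db-node-3"]

def node_for(key):
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

for term in ["apple", "banana", "cherry"]:
    print(term, "->", node_for(term))
```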
I went searching for information about this topic recently, and Wikipedia's Google Platform article was the best all-around source of information on how Google does it. However, the High Scalability blog has outstanding articles on scalability nearly every day. Be sure to check out their Google architecture article too.
The primary concept in most highly scalable applications is clustering.
Here are some resources on the cluster architecture of different search engines:
http://static.googleusercontent.com/external_content/untrusted_dlcp/labs.google.com/en//papers/googlecluster-ieee.pdf
https://opencirrus.org/system/files/OpenCirrusHadoop2009.ppt
You can also read interesting research articles at Google Research and Yahoo Research.
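The papers above are about spreading the index over many machines, but the data structure at the heart of it, an inverted index, is simple to sketch on a single machine. This toy version maps each term to the set of documents containing it, so a query becomes a cheap set intersection:

```python
# Toy single-machine inverted index. Real engines shard this structure
# across a cluster, which is what the papers above describe.
from collections import defaultdict

docs = {
    1: "google cluster architecture",
    2: "hadoop cluster on commodity hardware",
    3: "search engine architecture",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def search(query):
    term_sets = [index[t] for t in query.split()]
    return set.intersection(*term_sets) if term_sets else set()

print(search("cluster architecture"))  # {1}
```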

FAST ESP vs Google Search Appliance for development

Which of the two provides a better API for developing on top of?
Although there is a virtual Google Search Appliance available for download, no equivalent exists for FAST.
So I'm looking to developers with experience with either of these products for suggestions and links to documentation (especially for FAST, as there's none available on their site).
Kind regards,
I'm pretty sure that FAST does not provide a trial download of its Enterprise Search Platform (ESP) today, nor its SDK (which is useless without ESP).
FAST is pretty much the industry leader for customization (Google is popular as a simple out-of-the-box solution, and Autonomy seems to be the leader in compliance), which is what you are likely interested in an API for. But it's not cheap. There's internal Python customization for processing documents, and external .NET & Java APIs for interacting with the service.
Also, if you are looking for basic enterprise search plus an API, Google the "Solr" project.
I think FAST provides a free trial version. Along with it comes the API documentation and other manuals. My company uses it; I use it.
Answering your question: FAST is obviously better than the Google Search Appliance (for various reasons). That's my view.
Freddie
I have worked with the Google Search Appliance, and it works great.
I can search in metadata, get selective data back from a query, and see the real-time status of documents being crawled; it's scalable with GSA 6.14, and there's great support from Google.
Apache Solr is a great solution with a very flexible client API. You should definitely check it out. We are currently moving from FAST to Solr, and I find the features and API of Solr much better than FAST ESP.
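To give a feel for the API: querying Solr is plain HTTP returning JSON. The host, core name ("docs"), and field names below are placeholders for whatever schema you define.

```python
# Minimal Solr query sketch against the standard /select endpoint.
# Core name, field names, and host are assumptions for illustration.
import requests

resp = requests.get(
    "http://localhost:8983/solr/docs/select",
    params={"q": "title:scalability", "wt": "json", "rows": 10},
)
for doc in resp.json()["response"]["docs"]:
    print(doc.get("id"), doc.get("title"))
```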

Paid support for web-frameworks

This may sound strange, but sometimes when your ASP.NET web app isn't working and you can't tell why, you call Microsoft, pay them something like $300, and get about 1-3 weeks of 1-3 people looking at your configuration, memory dumps, sometimes code... but usually not the DB. With a fairly good success rate, they help you fix your mistakes, without necessarily up-selling you.
I found that Novell would like to offer the same for Mono. Everyone knows MySQL offers it for their clients, because it was part of the reason they got a truck of money to swing by one day and change the nameplate on the door.
I'm curious whether anyone has found people to provide support for the following, and how they'd rate their experience:
Django
Rails
Grails
JRuby
Mono [ratings]
add your own.
I haven't ever looked for paid support for these open-source technologies, but in general I would guess that until there is significant market penetration, there won't be a business case for 'dial-in support' of an app built by a third party.
In general, you'll be looking for a niche technology expert consultant that will probably charge you an hourly rate to look at your problem.
For Django, look at djangogigs.com, or post on rentacoder.com, I suppose.
Each usually has an IRC channel - you could also ask general questions there, or try to find someone for hire.
That niche is typically handled by two groups, I believe:
1. Software component developers. I get a lot of my presentation-layer support from DevExpress, for instance, since I use their widgets for my GUIs. In fact, I typically don't use a technology in an official capacity unless I have identified a dependable support channel. The issue you raise with Microsoft is handled by abstracting your problem before reporting it. That's a common law with most commercial support channels: when an issue involves two vendors, they will blame each other! Your job is to isolate the issue before or during reporting. It's hard, I know, but that's why you get paid the big bucks :-)
2. Outside consultants, who should be able to study your system and do what we described in part 1 (above).
