Can SQL Azure scale without any specific technique or administration (partitioning/replication...)?

Can SQL Azure scale without any specific technique or administration like Google App Engine's BigTable? No manual partitioning or replication required?

Do you mean scale to meet increasing demand, or do you mean increase in size to accommodate additional data?
With respect to size: you pick the "edition" of the database (Web or Business), and each has a different size limit. You are billed based on size only; the max size is 50 GB. Once the edition is picked, capacity grows up to the maximum allowed to accommodate your data. You do nothing special.
With respect to scaling to meet performance demands: you are abstracted away from managing almost anything to do with scalability from the SQL Azure perspective. Your database is colocated with other databases on various SQL Server machines running in Microsoft's data centers. In theory, your database will be moved to a less busy server if it becomes too hot. However, SQL Azure is not considered a highly scalable solution (i.e., Facebook/Twitter scale).
If you need mega-scalability, you'll need to go with Azure Table Storage.

For the majority of applications, SQL Azure will scale just fine.
"Will it scale?" Now that is the question a lot of us wonder about SQL Azure. Especially since you can't tell it how much Ram, CPU Cores or replicated servers with load balancing to allocate. With Windows Azure you can tell it how many of each resource you want your application hosted on, but that isn't the case with SQL Azure. This may sound really bad to some, but SQL Azure is designed to "automagically" scale the database server to your needs. What that means I honestly can't say, as I haven't (as of yet) found much official information from Microsoft on that topic.
With extremely high-traffic sites, such as Facebook and Twitter, it has been suggested that non-relational databases (such as Azure Table Storage) can scale better, since the database has less overhead when querying data. If you need relational database features (such as foreign key relationships and SQL join functionality), then you probably want to use SQL Azure.
It's not as clear cut as "to SQL Azure, or not to SQL Azure." There are database architecture patterns you can apply, such as denormalizing tables (requiring fewer joins per query) and horizontally partitioning your data, to help your design scale; a minimal sketch of the partitioning idea follows.
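As a rough illustration of application-level horizontal partitioning (the shard list, connection strings, and route-by-user-id scheme here are invented assumptions, not SQL Azure features):

    using System;
    using System.Data.SqlClient;

    public static class ShardRouter
    {
        // Hypothetical SQL Azure databases, one per shard (credentials elided).
        private static readonly string[] ShardConnectionStrings =
        {
            "Server=tcp:shard0.database.windows.net;Database=App0;...",
            "Server=tcp:shard1.database.windows.net;Database=App1;...",
            "Server=tcp:shard2.database.windows.net;Database=App2;..."
        };

        // Route each user's rows to a fixed shard so no single database
        // has to hold (or query) the whole data set.
        public static SqlConnection ConnectionForUser(int userId)
        {
            int shard = Math.Abs(userId) % ShardConnectionStrings.Length;
            return new SqlConnection(ShardConnectionStrings[shard]);
        }
    }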
A hybrid or mixed solution of both SQL Azure and Azure Table Storage can be used too: if some data requires relational queries, put it in SQL Azure; if data does not require a relational database, put it in Azure Table Storage (sketched below).
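A rough sketch of that hybrid idea, assuming the Microsoft.WindowsAzure.Storage client library (the entity shape and table name are made up for illustration): comments go to Table Storage, keyed so all comments for a post share one partition, while anything that needs joins stays in SQL Azure.

    using Microsoft.WindowsAzure.Storage;
    using Microsoft.WindowsAzure.Storage.Table;

    // Comments need no joins, so they can live in Azure Table Storage.
    public class CommentEntity : TableEntity
    {
        public CommentEntity() { }                  // required by the client
        public CommentEntity(string postId, string commentId)
        {
            PartitionKey = postId;                  // one partition per post
            RowKey = commentId;                     // unique within the post
        }
        public string Author { get; set; }
        public string Body { get; set; }
    }

    public static class CommentStore
    {
        public static void Save(string connectionString, CommentEntity comment)
        {
            var table = CloudStorageAccount.Parse(connectionString)
                .CreateCloudTableClient()
                .GetTableReference("comments");
            table.CreateIfNotExists();
            table.Execute(TableOperation.Insert(comment));
            // Users, orders, and anything relational stays in SQL Azure.
        }
    }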
Remember, the database design is part of the overall architecture of your application, and you should plan it out just as much as you plan whether to use TDD, IoC, and dependency injection. After all, if your database can't scale, it doesn't matter how awesome the application code is.
As an aside, thinking about this topic makes me wonder what Xbox Live and Bing Search use for their database needs. Is it relational, non-relational, or hybrid?

Related

How to handle multitenant data warehouse (each customer has a unique schema)?

So I am trying to set up a data warehouse for a service where each customer has their own database with a unique schema. How do I go about setting up the warehouse so that each customer has their own semantic layer / relational model set up automatically (since we, centrally, do not know what is in each database), and each customer can easily report on their data? Is there an automatic process we can follow? Am I missing something?
It depends on whether you want a consolidated view of the data, or if each customer's data is to remain segregated.
If consolidation is the objective (and there are huge benefits for a multi-tenant SaaS vendor in having a consolidated overview of customer data), then Nithin B's suggestion is good.
If separate warehouses are required, then you'll need to think about how to optimise your costs. The two biggest components will be ETL/ELT, and database hosting.
The fastest way to ETL/ELT is data warehouse automation. You'll find a good list of vendors on our web site (http://ajilius.com/competitors). Look for a solution that will give you the flexibility to meet your deployment options (cloud and/or on-premise), as well as the geographic reach you'll need for accessing customer data.
Will you be hosting your own databases, or hosting in the cloud? How much data will each tenant require? A good starting point would be PostgreSQL or SQL Server (SMP), and Ajilius gives you the flexibility to migrate to MPP platforms if your needs outgrow them.
There are many ways to address this.
Land all the tables in a landing area, in different schemas.
Stage the data into appropriate staging tables for the dimension and fact loads.
Create a dimension table to identify the customer area, e.g. Dim_Source.
Load the data into the fact tables. Any specific customer's data can then be filtered from the facts using the Dim_Source values, as in the sketch below.
This design would help overall Enterprise reporting as well.
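To make the Dim_Source idea concrete, here is a small sketch (the fact/dimension table and column names are invented for illustration, not from this answer) of filtering one customer's slice out of the shared facts:

    using System.Data.SqlClient;

    public static class WarehouseQueries
    {
        // Filter the shared fact table down to one customer via Dim_Source.
        public static SqlCommand SalesForSource(SqlConnection conn, string sourceName)
        {
            var cmd = conn.CreateCommand();
            cmd.CommandText = @"
                SELECT f.OrderDate, f.Amount
                FROM   Fact_Sales f
                JOIN   Dim_Source s ON s.SourceKey = f.SourceKey
                WHERE  s.SourceName = @sourceName;";
            cmd.Parameters.AddWithValue("@sourceName", sourceName);
            return cmd;
        }
    }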
Hope that helps.
I would start with a Kimball BUS Matrix.
Cheers
Nithin

Graph database in distributed env

I have a question concerning graph databases. Is there a mechanism for using graph databases in a distributed environment? I mean, can you distribute a graph database? Can you even traverse a graph database in a distributed environment?
You can definitely do it.
There are various graph databases that scale very well nowadays (JanusGraph, OrientDB, ArangoDB, etc.).
Even if you have a very big database that has to be scaled beyond a single datacenter to multiple geo-distributed datacenters, you still have options.
For example, you can use JanusGraph with the Cassandra / ScyllaDB storage backends. That gives you the option of asynchronously replicating all your data between datacenters.
Of course, there are still issues to solve, such as consistency, but with today's tools it's very possible to organize a distributed graph database.
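For instance, assuming a JanusGraph instance exposed through Gremlin Server (the host, port, and labels below are placeholders), a remote traversal from .NET with the Gremlin.Net driver might look roughly like this:

    using Gremlin.Net.Driver;
    using Gremlin.Net.Driver.Remote;
    using static Gremlin.Net.Process.Traversal.AnonymousTraversalSource;

    public static class GraphExample
    {
        public static void Run()
        {
            using (var client = new GremlinClient(new GremlinServer("janusgraph-host", 8182)))
            {
                var g = Traversal().WithRemote(new DriverRemoteConnection(client));

                // The traversal is evaluated by the server, which reads from the
                // distributed backend (e.g. Cassandra/ScyllaDB), so the client
                // never needs the whole graph in one place.
                var names = g.V().HasLabel("person").Values<string>("name").ToList();
            }
        }
    }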
Neo4j Enterprise Edition features clustering; read more at http://neo4j.com/docs/stable/ha.html.
Yes, you can use all sorts of graph databases in distributed environments. Can you distribute a graph database? Definitely yes.
BUT - distributing the same graph database to many different places (to speed up reads) is quite easy, and done all the time. Distributing a ridiculously massive graph (so that parts of it live in a bunch of different places) is quite hard.
I recommend this related question which talks about sharding and distributing databases. Pay particular attention to the bit about "sharding is an anti-pattern".

In-memory database scalability

I have been exploring MMDB systems lately, and I haven't been able to find much information about how an in-memory database is supposed to scale. My quite basic assumption is that a main-memory DB is constrained by the memory available on the DB node, and by the operating system's management of that memory. So how can I expand an in-memory system beyond the main memory available? I assume the answer is along the lines of a distributed system, but I haven't got it clear in my head how that would work. And of course it's also possible that I completely misunderstood the idea of an MMDB and I'm missing something obvious.
A bit of background on the question: I am writing a number of cross-platform mobile apps (even though my background is heavily in MySQL and MongoDB), and I don't like native database solutions like SQLite for Android and iOS. So I thought I'd write my own solution (site and github) in JavaScript (I'm working with Cordova/PhoneGap). I realised that I could make this a Node.js module and use it as a DB for a web app (I'm creating a blog powered by it as an experiment, and it's working pretty well), but of course I'm now thinking of making it a separate tier, and I started thinking about the obvious limitation of memory size; hence my question.
In-memory databases scale in size the same way as on-disk (aka persistent) databases do: either throw more storage at it (memory, in this case) or distribute it across multiple nodes of a cluster. The latter alternative increases the complexity (both of the DBMS and of your administration of it) relative to an in-memory database on a single system; consider the difference between vanilla MySQL and MySQL Cluster. And you'll want a really fast network for those times when the DBMS has to perform inter-node operations (e.g., distributing the data, or pulling data from multiple nodes to satisfy a query).
There's nothing particularly special about in-memory databases in this regard. There are some special optimizations in the database engine when you know storage is memory. But it doesn't change the fundamental principles of database systems.
What you don't want to do is create an in-memory database larger than physical memory. You'll force the OS to swap in-memory database pages in and out of swap space, and performance will suck. In that case, you're better off using a conventional DBMS and giving it as much cache as you have memory available for; the DBMS will use its cache more intelligently than the OS will use the swap space.
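To make the "distribute it across multiple nodes" option from the first paragraph concrete, here is a deliberately toy sketch (in-process dictionaries stand in for remote nodes) of hash-partitioning keys, so the total data set is bounded by the cluster's combined RAM rather than one machine's:

    using System;
    using System.Collections.Generic;

    public class PartitionedStore
    {
        // Each dictionary stands in for one node's in-memory store.
        private readonly List<Dictionary<string, string>> _nodes;

        public PartitionedStore(int nodeCount)
        {
            _nodes = new List<Dictionary<string, string>>();
            for (int i = 0; i < nodeCount; i++)
                _nodes.Add(new Dictionary<string, string>());
        }

        // Hash each key to a fixed node; adding nodes grows total capacity.
        private Dictionary<string, string> NodeFor(string key) =>
            _nodes[(key.GetHashCode() & 0x7FFFFFFF) % _nodes.Count];

        public void Put(string key, string value) => NodeFor(key)[key] = value;

        public bool TryGet(string key, out string value) =>
            NodeFor(key).TryGetValue(key, out value);
    }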
Current production-ready in-memory databases have mainly focused on scale-up as opposed to scale-out. So far, they have either integrated a main-memory tier into their core, existing architecture (IBM, via BLU Acceleration) or rebuilt the database almost from scratch to leverage main memory as the primary storage layer (SAP HANA); in both cases, their claim to fame is the obvious speedup that DRAM offers in comparison to disk.
However, very few databases presently have a complete offering that scales out the in-memory performance benefit across multiple nodes. Most in-memory databases require the application to manage the distribution of data/objects across nodes (e.g., SAP HANA).
Oracle's DBIM and MemSQL are, at this time, a few scalable and distributed options that implement a distributed in-memory database/tier by collectively utilizing memory resources across the cluster (RAC, in Oracle's case). MemSQL can be deployed on a cluster of commodity compute nodes and claims to scale by utilizing aggregate resources, including memory. Oracle RAC is a shared-cache architecture that overcomes the limitations of traditional shared-nothing and shared-disk approaches to provide highly scalable and available database solutions, including in-memory benefits.

Using both MongoDB and SQL Server Express with Asp.net MVC

I am building a website with ASP.NET MVC 4 and C#, hosted on a VPS. The site has lots of comment/reply sharing features (I need to store a lot of comments). I was planning to use MongoDB, as it's free and also very suitable for storing blogs/comments, but I got a little hesitant after I read this article.
So now I am thinking of using MongoDB for storing comments and replies, and the SQL Server Express edition, with its 10 GB limit, for storing user accounts and other user profile data.
Is it okay to use two different databases like this in one web app? Or is there some other stable document database, similar to MongoDB, that I can use so I don't have to use SQL Server?
Is RavenDB an option?
Thanks in advance!
Yes, it is okay to use two different databases in one web app. Even more, it's often a good thing to do, as every type of database has its pros and cons, and none of them fits every job. You've probably chosen MongoDB because you've heard it's quite good at storing lots of data under heavy querying. This is mostly true, but there are some caveats. On the other hand, SQL Server can be much slower (depending on the exact workload) and can easily be overloaded by heavy writes. A sketch of the two-database setup appears at the end of this answer.
First of all, MongoDB does work well. Mostly. It tries as hard as it can under heavy load. But it's not as efficient with parallel writes as some alternatives (because of its heavy use of locking), and it's not as safe when you have to keep several collections or documents synchronised, the way a SQL transaction that touches lots of different tables would be: Mongo gives you nothing more than document-level atomicity (every change to a single document is atomic).
There are other stable document databases; look at CouchDB, for example. It's more write-oriented and has some interesting possibilities thanks to multi-version concurrency control (MVCC).
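To make the two-database idea concrete, here is a rough sketch assuming the official MongoDB .NET driver (the database/collection names and the Users table are placeholders):

    using System.Data.SqlClient;
    using MongoDB.Bson;
    using MongoDB.Driver;

    public class Comment
    {
        public ObjectId Id { get; set; }     // assigned by MongoDB
        public string PostId { get; set; }
        public string Author { get; set; }
        public string Body { get; set; }
    }

    public class DataAccess
    {
        private readonly IMongoCollection<Comment> _comments;
        private readonly string _sqlConnectionString;

        public DataAccess(string mongoUri, string sqlConnectionString)
        {
            // Comments and replies live in MongoDB...
            _comments = new MongoClient(mongoUri)
                .GetDatabase("site")
                .GetCollection<Comment>("comments");
            // ...while accounts and profiles stay in SQL Server Express.
            _sqlConnectionString = sqlConnectionString;
        }

        public void AddComment(Comment comment) => _comments.InsertOne(comment);

        public int CountUsers()
        {
            using (var conn = new SqlConnection(_sqlConnectionString))
            using (var cmd = new SqlCommand("SELECT COUNT(*) FROM Users", conn))
            {
                conn.Open();
                return (int)cmd.ExecuteScalar();
            }
        }
    }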

How do you build a scalable infrastructure for a family of related but separate websites?

So, my company is talking about building an ecommerce platform that will serve many different clients. Each client would have a different look and feel and its own set of users, but the backing code (i.e., administration services, authentication servers, checkout services, possibly admin pages, etc.) and some users would be shared, so a bug fix could be applied to all sites at the same time, and a primary admin could log into all the websites.
As the entire Stack Exchange family of websites (with pretty high traffic) runs off a small number of servers (two, I believe), I wonder what it would involve to serve up many unrelated (but similar) websites through one web app, or even one database.
In order to have one database, I imagine every table would have a column identifying which realm each entity belonged to, and every SQL call would filter by that column. That seems like it would become a maintenance nightmare, and (less importantly for me) a DBA's hell.
Another option, with one web app but multiple databases: I imagine the realm could be tied to a specific data source, where all non-shared data could be specified. Then, when any request was made, the appropriate data source could be loaded, and the web app would run as if there were only a single source. This would have the added benefit of being easily horizontally scalable, since the exact same web app, but with a different set of realms and data sources, could be spawned when necessary. Websites could easily be moved to new servers as well, by simply copying the web app and moving the database.
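A rough sketch of that realm-to-data-source mapping (the host names and connection strings are invented; in a real app the map would come from configuration):

    using System.Collections.Generic;
    using System.Data.SqlClient;

    public static class RealmDataSources
    {
        // One realm (website) per database; the shared web app picks the
        // data source from the request's host name and runs unchanged.
        private static readonly Dictionary<string, string> ConnectionsByHost =
            new Dictionary<string, string>
            {
                { "shop-a.example.com", "Server=db1;Database=ShopA;Integrated Security=true" },
                { "shop-b.example.com", "Server=db2;Database=ShopB;Integrated Security=true" }
            };

        public static SqlConnection ForHost(string host) =>
            new SqlConnection(ConnectionsByHost[host]);
    }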
I'm wondering what other possibilities there are, as well as specific examples, if they're out there.
Note: I'm not talking about Twitter-scale scalability, nor about hardware/languages/etc, but rather design methodologies and patterns.
The architecture you are talking about is called "multi-tenant" architecture. There are different ways of going about architecting a multi-tenant application. Broadly speaking, the data tier can be architected in three ways:
Separate database for each client - easier to code, more difficult to maintain
One database, a separate schema for each client
One database, one schema (with a clientid column in each table, except for metadata tables) - more time coding, easier to maintain
Each has its own advantages and disadvantages. Take a look at this article by Microsoft on multi-tenancy: http://msdn.microsoft.com/en-us/library/aa479086.aspx
Broadly speaking, I would suggest option 3, as it offers true multi-tenancy. If you have some tables that you expect to become very large, you can partition them based on the clientid (e.g., if you want 10 partitions, you could partition on clientid mod 10). A sketch of option 3 follows.
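A small sketch of option 3 in practice (the table and column names are illustrative): every query against a shared table is scoped by clientid, and the same value can drive a mod-based partition choice.

    using System.Data.SqlClient;

    public static class TenantQueries
    {
        // Which of the 10 partitions a client's rows land in under mod partitioning.
        public static int PartitionFor(int clientId) => clientId % 10;

        // Every query on a shared table carries the clientid filter.
        public static SqlCommand OrdersForClient(SqlConnection conn, int clientId)
        {
            var cmd = conn.CreateCommand();
            cmd.CommandText =
                "SELECT OrderId, Total FROM Orders WHERE ClientId = @clientId";
            cmd.Parameters.AddWithValue("@clientId", clientId);
            return cmd;
        }
    }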
