Data Warehouse: One Database or many? - data-warehouse

At my new company, they keep all data associated with the data warehouse, including import, staging, audit, dimension and fact tables, together in the same physical database.
I've been a database developer for a number of years now and this consolidation of function and form seems counter to everything I know.
It seems to make security, backup/restore and performance management issues more manually intensive.
Is this something that is done in the industry? Are there substantial reasons for doing or not doing it?
The platform is Netezza. The size is in terabytes, hundreds of millions of rows.
What I'm looking to get from answers to this question is a solid understanding of how right or wrong this path is. From your experience, what are the issues I should be focused on arguing if this is a path that will cause trouble for us down the road. If it is no big deal, then I'd like to know that as well.

In general I would recommend using separate databases. This is the configuration I have always seen used in production and it really makes a lot of sense since - as you mentioned - both databases have fundamentally different purposes / usage patterns / etc.

Edit
If you're using one physical server, the fewer instances on that server the simpler the management and the more efficient the process.
If you put TWO instances on the same Physical Server you get:
Negatives:
Half the memory to use
Twice the count of database process
Positives:
You could take the entire staging db down without affecting the DW
So which is more precious to you, outage windows or CPU and Memory?
On the same the physical server multiple instances make performance management issues MUCH more manual to solve. If you look at the health of one of the instances, it might look fine but users are reporting poor performance, so you have to look at the next instance to see if the problem may be coming from there... and so on per instance.
Security is also harder with more than one instance. At best it's just as hard as a single instance but it's never easier. You'll have two admin accounts (SYS or something), Duplicate process accounts, etc.
Tell us why you think it's better to have more than one instance.
ORIGINAL POST
Can we be clear on terms. When you say "in the same Database" do you mean to say the same instance, or the same physical server. If you did move the staging to a new instance would it reside on the same physical hardware?
I think people get a little too hung up on instances. If you're going to put two instances on the same piece of hardware, you're only doubling the number of everything to very little advantage. All the server processes will be running twice... all the memory pools will be cut in half.
so let's say you really did mean two separate physical boxes...
Let's say you buy 2 12-way boxes (just say). When you're staging db server is done for the day, those 12 CPU's are wasting away. When your users pack up and go home, your prod DW CPUs are wasting away. CPU cycles are perishable, you can't get them back. BUT, if you had one 24 way box... then the staging DB COULD use 20 CPUs at night for some excellent Parallel Execution for building summary tables and your users will have double the capacity for processes during the day.
so let's say you meant the same hardware.
"It seems to make security, backup/restore and performance management issues more manually intensive."
Guaranteed that performance issues are harder to solve the more instances that share the same hardware. Guaranteed.
Security
What security do you do at the instance level?
Backup
What DW are you backing up at the instance level? You're not backing up tablespaces, but rather whole instances? Seems like that pattern will fail at a certain size.
PLATFORM: NETEZZA
Not familiar with the tool specifically. So if it's a single instance on a single box, then the division would seem more logical than physical and therefore the reasons they exist is for management, not performance. You don't increase your CPUs or memory by adding a database, right? So it doesn't seem like there's no performance upside to it. Each DB may be adding separate processes (performance hit), or it might be completely logical like schemas in Oracle. If each database is managed by new processes than data going between them will mean IPC.
Maybe the addition of the Netezza tag will get some traction.

We use databases for every segment (INVENTORY, CRM, BILLING...). There are no performance downsides and maintenance and overview is much better.

Better late than never, but for Netezza:
There are no performance hits while querying cross database. Netezza allows only SELECT operations cross database, no INSERT, UPDATE or DELETEstatements allowed.
This means you cannot do:
THISDB(ADMIN)=>INSERT INTO OTHERDB..TBL SELECT * FROM THISDBTABLE;
but you can do \c OTHERDB then
OTHERDB(ADMIN)=>INSERT INTO TBL SELECT * FROM THISDB..THISDBTABLE;
You are also not able to create a materialized view on a cross-database object, for example:
OTHERDB(ADMIN)=>CREATE MATERIALIZED VIEW BLAH AS SELECT * FROM THISDB..THISDBTABLE;
Administration might be where you will decide (though you probably already did long ago) on what kind of database(s) you'll create. Depending on your infrastructure, you might have a TEST/QA system and a PROD system on the same box, or on separate boxes.

You will gain speed in the load and the output if the tables are in the same schema (database). Obvious...but hey, I said it.
There is more overhead the more tables you put into one schema. Backups time, size of backups, ease of use.
Where I am, we have many multiple TB databases within one data-warehouse. Our rule of thumb is that a single loading process or a single report query should NOT have to span database. This keeps "like" tables together but gives some allowances for our backups and contingency processes. It also makes it a bit easier to "find" data.
For those processes that need to break this rule, we will either move data from one database to the other or allow the process to join across schemas.
I'm not as familiar with Netezza, so I'm not 100% sure what your options might be.

Few points for you to consider
a) If the data in one or more staging, audit, dimension and fact table has to be joined, you are better off keeping them in one database
b) Typically you will retain dimension tables and fact tables in the same database and distribute on most frequently joined columns to leverage "co-located join" functionality of Netezza
c) You should be able to use SQL grant permission to manage access to all objects (DB, tables, views etc)

Related

Is it advisable to restart informatica application monthly to improve performance?

We have around 32 datamarts loading around 200+ tables out of which 50% of tables are on 11g Oracle database and 30% on 10g and rest 20 are flat files.
Lately we are facing performance issues while loading the datamarts.
Database parameters as well are network parameters are looking and as throughput is decreasing drastically we are of the opinion now that it is informatica which has problem.
Recently when through put had gone down and server was utilized to its 90% informatica application was restarted and the performance there after was little better than previous performance.
So my question is should we have Informatica restart as a scheduled activity ? Does restart actually improves the performance of the application or there are some other things which can play a role in the same?
What you have here is a systemic problem, but you have not established which component(s) of the system are the cause.
Are all jobs showing exactly the same degradation in performance? If not, what is the common characteristic of those that are? Not all jobs will have the same reliance on the Informatica server -- some will be dependent more on the performance of their target system(s), some on their source system(s), so I would be amazed if all showed exactly the same level of degradation.
What you have here is an exercise in data gathering, and then turning that data into useful information.
If you can isolate the problem to only certain jobs then I would take a log file from a time when the system is performing well, and from a time when it is not, and compare them directly, looking for differences in the performance of their components. You can also look at any database monitoring tools for changes in execution plan.
Rebooting servers? Maybe, but that is not necessarily the solution -- the real problem is the lack of data you have to diagnose your system.
Yes, It is good to do a restart every quarter.
It will refresh the Integration service cache.
Delete files from Cache and storage before you restart.
Since you said you have recently seen some reduced performance recently it might be due various reasons.
Some tips that may help:
Ensure all Indexes are in valid and compiled state.
If you are calling a procedure via worflow check the EXPLAIN plan and Cost ensure it is not doing a full table scan(cost should be less).
3.Gather stats on the source or target tables (especially which have deletes )) - This will help in de fragmentation - deleting the un allocated space. DBMS_STATS
Always good to have an house keeping scheduled weekly to do the above checks on indexes,remove temp/unnecessary files and gather stats (analyze indexes and tables).
Some best practices here performance tips

Erlang OTP based application - architecture ideas

I'm trying to write an Erlang application (OTP) that would parse a list of users and then launch workers that will work 24X7 to collect user-data (using three different APIs) from remote servers and store it in ets.
What would be the ideal architecture for this kind of application. Do I launch a bunch of workers - one for each user (assuming small number users)? What will happen if number of users increases very rapidly?
Also, to call different APIs I need to put up a Timer mechanism in the worker process.
Any hint will be really appreciated.
Spawning new process for each user is not a such bad idea. There are http servers that do this for each connection, and they doing quite fine.
First of all cost of creating new process is minimal. And cost of maintaining processes is even smaller. If one of the has nothing to do, it won't do anything; there is none (almost) runtime overhead from inactive processes, which in the end means that you are doing only the work you have to do (this is in fact the source of Erlang systems reactivity).
Some issue might be memory usage. Each process has it's own memory stack, and in use-case when they actually do not need to store any internal data, you might be allocating some unnecessary memory. But this also could be modified (even during runtime), and in most cases such memory will be garbage collected.
Actually I would not worry about such things too soon. Issues you might encounter might depend on many things, mostly amount of outside data or user activity, and you can not really design this. Most probably you won't encounter any of them for quite some time. There's no need for premature optimization, especially if you could bind yourself to design that would slow down rest of your development process. In Erlang, with processes being main source of abstraction you can easily swap this process-per-user with pool-of-workers, and ets with external service. But only if you really need it.
What's most important is fact that representing "user" as process would be closest to problem domain. "Users" are independent entities, and deserve separate processes (they have their own state, and they can act or react independent to each other). It is quite similar to using Objects and Classes in other languages (it is over-simplification, but it should get you going).
If you were writing this in Python or C++ would you worry about how many objects you were creating? Only in extreme cases. In Erlang the same general rule applies for processes. Don't worry about how many you are creating.
As for architecture, the only element that is an architectural issue in your question is whether you should design a fixed worker pool or a 1-for-1 worker pool. The shape of the supervision tree would be an outcome of whichever way you choose.
If you are scraping data your real bottleneck isn't going to be how many processes you have, it will be how many network requests you are able to make per second on each API you are trying to access. You will almost certainly get throttled.
(A few months ago I wrote a test demonstration of a very similar system to what you are describing. The limiting factor was API request limits from providers like fb, YouTube, g+, Yahoo, not number of processes.)
As always with Erlang, write some system first, and then benchmark it for real before worrying about performance. You will usually find that performance isn't an issue, and the times that it is you will discover that it is much easier to optimize one small part of an existing system than to design an optimized system from scratch. So just go for it and write something that basically does what you want right now, and worry about optimization tweaks after you have something that basically does what you want. After getting some concrete performance data (memory, request latency, etc.) is the time to start thinking about performance.
Your problem will almost certainly be on the API providers' side or your network latency, not congestion within the Erlang VM.

Multiple servers or everything in a single server?

I have a Rails app that uses MySQL, MongoDB, NodeJS (and SocketIO). Right now, the app (everything) is hosted inside 1 box. I would like to know what I should do when the number of users grow. What factors should I take into account to determine whether I need to host a separate element in another box (like MySQL, Node, Mongo in each of its separate box). Should I just make that one single box bigger? Is there a best-practice method that I can go with?
If you guys can provide me with reference, guides, research regarding this topic. Please do. I am super noob at deployment and server configuration.
We faced this dilemma at work a short while ago and found that simply upgrading to a more powerful single box sufficed and would give us room to grow further by up to 3-4 times.
The most important thing would be to identify your potential bottlenecks.
In our case there were 2 bottlenecks. Disk I/O and the database's ability to utilise memory.
On our new server we had the hard drive array configured in such a way as to maximise the disk I/O and we upgraded the database software to allow it to use more memory. In fact the DBMS now keeps the entire database in memory and only performs write operations to the disk as needed. This significantly improved performance.
The short answer is move everything to its own box. The longer answer is: it depends on your app's usage.
I recommend you use Nagios or similar to monitor your app's resource utilization -- that is, how much CPU and RAM each of your services use. When one starts to each up too much resources (and your page load speed is negatively affected), move that to its own box.
Then continue to monitor that box, beef up when necessary or shard out.
The high scalability blog is good for reading on what other people have done.

RavenDB - Planning for scalability

I have been learning RavenDB recently and would like to put it to use.
I was wondering what advice or suggestions people had around building the system in a way that is ready to scale, specifically sharding the data across servers, but that can start on a single server and only grow as needed.
Is it advisable, or even possible, to create multiple databases on a single instance and implement sharding across them. Then to scale it would simply be a matter of spreading these databases across the machines?
My first impression is that this approach would work, but I would be interested to hear the opinions and experiences of others.
Update 1:
I have been thinking more on this topic. I think my problem with the "sort it out later" approach is that it seems to me difficult to spread data evenly across servers in that situation. I will not have a string key which I can range on (A-E,F-M..) it will be done with numbers.
This leaves two options I can see. Either break it at boundaries, so 1-50000 is on shard 1, 50001-100000 is on shard 2, but then with a site that ages, say like this one, your original shards will be doing a lot less work. Alternatively a strategy that round robins the shards and put the shard id into the key will suffer if you need to move a document to a new shard, it would change the key and break urls that have used the key.
So my new idea, and again I am putting it out there for comment, would be to create from day one a bucketting system. Which works like stuffing the shard id into the key, but you start with a large number, say 1000 which you distribute evenly between. Then when it comes time to split the load into a shard, you can say move buckets 501-1000 to the new server and write your shard logic that 1-500 goes to shard 1 and 501-1000 goes to shard 2. Then when a third server comes online you pick another range of buckets and adjust.
To my eye this gives you the ability to split into as many shards as you originally created buckets, spreading the load evenly both in terms of quantity and age. Without having to change keys.
Thoughts?
It is possible, but really unnecessary. You can start using one instance, and then scale when necessary by setting up sharding later.
Also see:
http://ravendb.net/documentation/docs-sharding
http://ayende.com/blog/4830/ravendb-auto-sharding-bundle-design-early-thoughts
http://ravendb.net/documentation/replication/sharding
I think a good solution is to use virtual shards. You can start with one server and point all virtual shard to a single server. Use module on the incremental id to evenly distribute the rows across the virtual shards. With Amazon RDS you have the option to turn a slave into a master, so before you change the sharding configuration (point more virtual shards to the new server), you should make a slave a master, then update your configuration file, and then delete all the records on the new master using modulu that doesn't comply with the shard range that you use for the new instance.
You also need to delete rows from the original server, but by now all the new data with IDs that are modulu based on the new virtual shard ranges will point to the new server. So you actually don't need to move the data, but take advantage of Amazon RDS server promotion feature.
You can then make replica off the original server. You create a unique ID as: Shard ID + Table Type ID + Incremental number. So when you query the database, you know to which shard to go and fetch the data from.
I don't know how it's possible to do it with RavenDB, but it can work pretty well with Amazon RDS, because Amazon already provide you with replication and server promotion feature.
I agree that their should be a solution that right from the start offer seamless sociability and not telling the developer to sort the problems out when those occur. Furthermore, I've find out that many NoSQL solution that evenly distribute data across shards need to work within a cluster with low latency. So you have to take that into consideration. I've tried using Couchbase with two different EC2 machines (not in a dedicated Amazon cluster) and data balancing was very very slow. That adds to the overall cost too.
I also want to add that what pinterest had done to solve their scalability issues, using 4096 virtual shards.
You should also need to look into paging issues with many NoSQL databases. With that approach you can page data quite easily, but maybe not in the most efficient way, because you might need to query several databases. Another problem is changing schema. Pinterest solved this by putting all the data in a JSON Blob in MySQL. When you want to add a new column, you create a new table with the new column data + key, and can use Index on that column. If you need to query the data, for example, by email, you can create another table with the emails + ID and put an index on the email column. Counters are another problem , I mean atomic counters. So it's better taking those counters out from the JSON and put them in a column so you can increment the counter value.
There are great solutions out there, but at the end of the day you find out that they can be very expensive. I preferred spending time on building my own sharding solution and prevent myself the headache later on. If you choose the other path, there are plenty of companies waiting for you to get into trouble and ask for quite a lot of money to solve your problems. Because at the moment that you need them, they know that you will pay everything to make your project work again. That's from my own experience, that's why I am breaking my head to build my own sharding solution using your approach, which also be much cheaper.
Another option is to use middleware solutions for MySQL like ScaleBase or DBshards. So you can continue working with MySQL, but at the time you need to scale, they have well proven solution. And the costs might be much lower then the alternative.
Another tip: when you create your config for shards, put a write_lock attribute that accepts false or true. So when it false, data won't be written to that shard, so when you fetch the list of shards for specific table type (ie. users), it will be written only to the other shards for that same type. This is also good for backup, so you can show a friendly error for visitors when you want to lock all the shard when backing up all the data to get a point-in-time snapshots of all the shards. Although I think you can send a global request for snapshoting all the databases with Amazon RDS and using point-in-time backup.
The thing is that most companies won't spend time working with a DIY sharding solution , they will prefer paying for ScaleBase. Those solution comes from single developers that can afford paying for a scalable solution from the start, but want to rest assured that when they reach to the point they need it, they have a solution. Just look at the prices out there and you can figure out that it will cost you A LOT. I will gladly share my code with you once I'm done. You are going with the best path in my opinion, it's all depends on your application logic. I model my database to be simple, no joins, not complicated aggregation queries - this solves many of my problems. In the future you can use Map Reduce to solve those big data queries needs.

Is no horizontal scalability when it comes to writing a RDBMS defect? or does it happen to all DBMS'?

When you hit a roof on reading from a database, you have two choices, scale vertically by putting more hardware in the server, or scale horizontally by putting a second server to help offload the reads.
Offloading reads to a second server, means that all writes will hit both servers, while read only hits one.
Problem is when you hit a roof with writing, since writing has to happen to all servers, it means that all servers will be overloaded with write requests, and the server comes unusable. Adding more servers to the problem doesn't help, since it only adds more servers that will be overloaded. So you have to scale vertically.
Is this something that is specific to RDBMS'? or is it something that happens with all DBMS'?
I know you can do things on software side, and split the database in two, eg. all entries starting with 0-m in one db while n-z in another, but IMHO it is more of a workaround than a solution to the problem.
I can't see that this would be specific to the relational model. All databases that have to read and write (and that's most of them) will have a similar problem.
For what it's worth, most databases are read far more than written so the write roof occurs less frequently than you might think. In addition, load balancing databases as per your method tends to be an immediate write to the primary with queued writes to all secondaries (at least in my experience).
In that case, you're not actually waiting around for multiple writes as a user, you just wait for the first. The DBMS itself manages the synchronisation between instances. This of course means that secondary databases might not be totally up-to-date but this can be controlled. Technically, this breaks the ACID properties of the system as a whole but this can be architected around.
I think this is the case with any DBMS, although some handle it better than others. Like you mention, partitioning the database in software seems to be the most common solution to this.
In many applications though, partitioning the database like that makes sense anyways if you are at such a huge scale that it becomes necessary. For example, if you had a social networking app, it would probably make sense to partition your database by country or other geographical regions. This would allow you to have your servers located geographically close to the regions they serve. It would also help mitigate any problems with a cross-database "social graph" since peoples friends tend to live nearby.
You're hardly going to "hit a roof with writing, since writing has to happen to all server" because in most of RDBMS installations:
1) Reads are overwhelming more frequent than writes
2) Modern RDBMs have Multi-Version Concurrency Control able to reduce blocking when reading/writing

Resources