master slave exposes technical debt - ruby-on-rails

Using rails and postgresql.
I wrote my app without having in mind to use a master slave configuration.
Now I've gotten master slave set up in the app, and I'm running into some technical debt. The same process in my app writes to the db and then immediately reads from the db. The read is now taking place on the read db, but the data isn't there yet. Before, this wasn't efficient, but it didn't cause any problems because both dbs were the same. Now this is blowing up in my face.
The problem for me is that it's difficult to find all the places in the code where this problem exists. Can someone please suggest a technique to get my tests to run in such a way that the reads and the writes use different dbs that aren't kept in sync, so that I can figure out where my issues are?
Other solutions will also be welcomed!

I strongly recommend you rethink your master/slave configuration or whether master/slave is even right for your application.
It's not "tech debt" to build a system that assumes data written to persistent store can be read back immediately. It's normal and correct. While you might reasonably be able to avoid the pattern
write A, ..., look up A.key
with various simple cache schemes, trying to code around e.g.
write A, ..., complex query that *might* fetch A
requires you to retain a copy of A and determine whether it would satisfy the WHERE clause of the query in separate code, simply because you can't rely on the query results. Unless your system is very small and simple, trying to do this system-wide will produce a super-complex, fragile, expensive, and ugly code base. I strongly recommend you don't try it.
The usual purpose of a master/slave persistent store organization is to off-line read traffic that's not time-dependent on writes. For example, if your system mines data to produce summaries accessible to users, you'd offline the metric computation and have it mine the slave. This prevents mining queries from drawing resources away from user request handling. The small delay between write on master and copy to slave is no problem.
If your app is struggling because there's too much load on persistent store, you probably want partitioned data (sometimes called sharding), not master/slave. Partitioning can expose you to a different kind of problem: no cross-partition transactions. But this is usually easier to work through than what you're attempting.
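For the narrower question of how to find the write-then-read spots, one possible approach is to run the tests against a read store that never catches up on its own. The following is only a minimal sketch of that idea in plain Python, not a Rails or PostgreSQL API: `LaggingReplicaStore`, `write`, `read`, and `replicate` are hypothetical names, and the dictionaries stand in for the two databases.

```python
class LaggingReplicaStore:
    """Toy primary/replica pair (hypothetical, for illustration only).

    Writes go to the primary, reads come from the replica, and the replica
    only catches up when replicate() is called explicitly.
    """

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.stale_reads = []  # keys that were read while the replica was behind

    def write(self, key, value):
        self.primary[key] = value

    def read(self, key):
        # Record every read whose answer differs from what the primary holds.
        if self.replica.get(key) != self.primary.get(key):
            self.stale_reads.append(key)
        return self.replica.get(key)

    def replicate(self):
        # Simulates the replica catching up with the primary.
        self.replica = dict(self.primary)


store = LaggingReplicaStore()
store.write("user:1", "alice")
immediate = store.read("user:1")  # replica has not caught up yet
store.replicate()
later = store.read("user:1")      # replica is now in sync
```

Here `immediate` comes back empty while `later` returns the written value, and `stale_reads` lists the key that was read too early. A test suite wired up this way would fail exactly at the code paths that depend on reading their own writes.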

After studying this area, I agree with Gene that master slave should only be used for reading data that was written a significant time before the read.
My ORIGINAL concept was that it's better to use a functional programming style, whereby the process retains all the information in its parameters and doesn't need to go back to the database. The downside of this approach is that the human mind has a hard time with functional programming, and in a massive computer program it makes sense not to insist on this added complication.
If you want to write a functional method or process then that is great and very efficient but there shouldn't be anything in the code that insists on this.

Related

Rails: Caching a Tree in memory on the server

I have a postgresql database which contains multidimensional data. I wrote a data structure that sorts all database rows into a tree format. The database is large, so I don't want to generate the tree every time a request comes in from a browser. What I'd like to do is construct the tree once in a certain time period and persist it in memory on the server.
The tree is read-only, by the way. So each time a request comes in, the tree need not be generated anew; it's already there.
How can I make this happen? I'm not an expert programmer, just a beginner and definitely new to web programming. So some of these concepts are new to me.
But if you could please point me in the right direction in terms of the concepts involved here, I can google the rest.
Or if you have actual links or examples that would be fantastic.
Thanks
There are several ways to approach this problem. It depends on just how close to the application you want the variables. If you're really looking to have them right "on top" of the application, for fastest possible use, then you could look at using a global variable "$tree" and hooking in to the application flow. Other options might include memcached, which is still pretty darn close to the application. Redis would be a good option for an in-memory database that could be shared between instances of an application, as it is a NoSQL database that you query. Not quite as close to the application though.
Generally, those are your primary options: in-application variables that survive requests; application frameworks that help variables survive requests and provide you a querying mechanism; or an in-memory database that allows you to store and query rapidly from multiple instances. Each is a viable option, though I'm pretty sure you'd get a lot of 'community' flak for using a straight-up global variable (such practices are considered unclean for their lack of thread-safety and other such concerns).
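To make the first option concrete, here is a minimal sketch of an in-process cache that rebuilds the tree only after a chosen time period and guards the rebuild with a lock, which addresses the thread-safety concern. It is written in plain Python rather than Rails, and `CachedTree`, `build_tree_from_rows`, and the 60-second period are all assumptions for illustration.

```python
import threading
import time


class CachedTree:
    """Keeps one built tree in memory and rebuilds it after max_age_seconds."""

    def __init__(self, build, max_age_seconds, clock=time.monotonic):
        self._build = build              # function that builds the tree
        self._max_age = max_age_seconds
        self._clock = clock              # injectable so the demo can fake time
        self._lock = threading.Lock()
        self._tree = None
        self._built_at = None

    def get(self):
        # The lock ensures only one thread rebuilds at a time.
        with self._lock:
            now = self._clock()
            expired = self._built_at is None or now - self._built_at >= self._max_age
            if expired:
                self._tree = self._build()
                self._built_at = now
            return self._tree


build_calls = []


def build_tree_from_rows():
    # Hypothetical stand-in for the expensive database-to-tree step.
    build_calls.append(1)
    return {"root": {"children": []}}


fake_now = [0.0]
cache = CachedTree(build_tree_from_rows, max_age_seconds=60, clock=lambda: fake_now[0])
first = cache.get()    # builds the tree
second = cache.get()   # served from memory, no rebuild
fake_now[0] = 61.0
third = cache.get()    # older than 60 seconds, so it is rebuilt
```

Note that a cache like this lives inside one server process. With several application processes, each would hold its own copy, which is where memcached or Redis become the more natural fit.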

How to do some reporting with Rails (with a dedicated DB)

In a Rails app, I am wondering how to build a reporting solution. I heard that I should use a separated database for reporting purposes but knowing that I will need to store a huge amount of data, I have a lot of questions :
What kind of DBMS should I choose?
When should I store data in the reporting database?
Should the database schema of the production db and reporting db be identical?
I am storing basic data (information about users, about the results of operations) and I will need, for example, to run a report to know how many users failed an operation during the previous month.
I know that it is a vague question, but any hint would be highly appreciated.
Thanks!
Work Backwards
Start from what the end-users want for reporting or how they want to/should visualize data. Once you have some concepts in mind, then start working backwards to how to achieve those goals. Starting with the assumption that it should be a replicated copy in an RDBMS excludes several reasonable possibilities.
Making a Real-time Interface
If users are looking to aggregate values (counts, averages, etc.) on the fly (per web request), it would be worthwhile looking into replicating the master down to a reporting database if the SQL performance is acceptable (and stays acceptable if you were to double the input data). SQL engines usually do a great job of aggregation and scale pretty far. This would also give you the capability to join data results together and return complex results as the users request them.
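As an illustration of on-the-fly aggregation, the report from the question (how many users failed an operation in a given month) can be a single SQL statement. This sketch uses Python's built-in sqlite3 module with an in-memory database; the `operation_results` table, its columns, and the sample rows are assumptions, not the asker's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE operation_results ("
    " user_id INTEGER NOT NULL,"
    " succeeded INTEGER NOT NULL,"   # 1 = success, 0 = failure
    " finished_on TEXT NOT NULL)"    # ISO date, e.g. 2024-03-05
)
conn.executemany(
    "INSERT INTO operation_results VALUES (?, ?, ?)",
    [
        (1, 0, "2024-03-05"),
        (1, 0, "2024-03-20"),  # the same user failing twice counts once
        (2, 1, "2024-03-11"),
        (3, 0, "2024-03-30"),
        (4, 0, "2024-04-02"),  # outside the reporting month
    ],
)


def users_failed_between(conn, start, end):
    """Count distinct users with at least one failure in [start, end)."""
    row = conn.execute(
        "SELECT COUNT(DISTINCT user_id) FROM operation_results"
        " WHERE succeeded = 0 AND finished_on >= ? AND finished_on < ?",
        (start, end),
    ).fetchone()
    return row[0]


failed_in_march = users_failed_between(conn, "2024-03-01", "2024-04-01")
```

With the sample rows above, users 1 and 3 are counted. Whether a query of this shape stays fast enough is exactly the performance question raised here, so it is worth timing against realistic data volumes.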
Just remember, replication isn't easy or without its own set of problems.
This'll start to show signs of weakness in the hundreds of millions of rows range with normalized data, in my experience. At some point, inserts fight with selects on the same table enough that both become exceptionally slow (remember, replication is still a stream of inserts). Alternatively, indexes become so large that storage I/O is required for rekeying, so overall table performance diminishes.
Batching
On the other hand, if reporting falls under the scheme of sending standardized reports out with little interaction, I wouldn't necessarily recommend backing to an RDBMS. In this case, results are combined, aggregated, joined, etc. once. Paying the overhead of RDBMS indexing and storage bloat isn't worth it.
Batch engines like Hadoop will scale horizontally (many smaller machines instead of a few huge machines) so processing larger volumes of data is economical.
Batch to RDBMS or K/V Store
This is also a useful path if a lot of computation is needed to make the records more meaningful to a reporting engine. Alternatively, records could be denormalized before storing them in the reporting storage engine. The denormalized or simple results would then be shipped to a key/value store or RDBMS to make reporting easier and achieve higher performance at the cost of latency, compute, and possibly storage.
Personal Advice
Don't over-design it to start with. The decisions you make on the initial implementation will probably all change at some point. However, design it with the current and near-term problems in mind. Also, benchmarks done by others are not terribly useful if your usage model isn't exactly the same as theirs; benchmark your usage model.
If you need a large set of reports, I would recommend using a pre-built reporting service rather than writing them out manually.
You might want to look at Tableau http://www.tableausoftware.com/ and other available tools.
Database: yes, it should be separate; that seems safer. Reporting is also generally done on old, consolidated data, and your live data might be too large to perform analysis on.
Database type: you have to choose based on the reporting service used. I think MongoDB is not supported by any of the reporting services; MySQL is preferred.
If there are only one or two reports, you could just build them in Rails.

Avoid writing SQL queries altogether in SSIS

Working on a Data Warehouse project, the guy that gave us the tutorial advised that we stick to using SQL queries over defining a lot of data flow transformations, citing points like it'll consume a lot of memory on the ETL box so we'd rather leave the processing to the DB box. Is this really advisable? Where's the balance between relying on GUI tools over executing a bunch of SQL scripts on your Integration package?
And honestly, I'd like to avoid writing SQL queries as much as I can. (but that's beside the point. I'd really like to look at this objectively.)
The answer is: it depends, but you want to pick one or the other for any given job and avoid mixing the two where possible.
Generally, it's best to either do everything possible within the tool or do everything possible within stored procedure code. When you have significant amounts of logic split between layers the system becomes harder to trace and debug.
Where the tool can do the transformations without the data flows becoming awkward and convoluted you could use the tool and try to have little or no logic in queries. This means that one single layer has the business logic and it should be fairly obvious where to find it. However, ETL tools tend to handle highly complex transformations relatively poorly. The sweet spot for this type of approach is on systems where you have a large number of data sources but relatively simple transformations.
If you have relatively complex transformations you may be better off putting all the business logic and transformation into a layer of stored procedures. SQL code is better at implementing complex transformations in a maintainable way - I have it on fairly good authority that around half of all data warehouse projects in the banking and insurance sectors use this type of architecture for precisely that reason. In this case the ETL tool can be used to implement relatively dumb data copies. Source data can be copied into staging areas essentially verbatim and then picked up by a body of stored procedure code that does the ETL. The ETL tool can be used for data copies, bulk load operations, logging, scheduling and other framework tasks.
In either case you're best off picking one approach. Otherwise, you can end up with business logic spread across extraction layers, database views, data flows, and stored procedure code. Logic spread across multiple layers is much harder to test.
When all of the logic is (for example) contained within stored procedures or focussed ETL transformation jobs you can unit test a given transformation in isolation. The clarity in design also helps with maintenance and auditing.
I find that SQL code is not only faster to run, but also faster to develop and much, much easier to maintain.
Generally, when you want to process each row individually, use a data flow; otherwise it may be better to use a SQL command.
Personally I'd go with writing the SQL where I can. It's easier to optimise later and (usually) faster as well. Google will give much more detailed answers.
Another factor to think about is the provider you use for your connections.
You need to make the decision based on your needs. We use postgres DB, so we have to create a load of staging tables for some processes, which speeds the whole thing up.
You should also take into consideration the boxes it is running on: if you have an all-powerful DB box and a little ETL box, there'd be no point in running anything heavy on the ETL box.
If you do all your processing on the ETL box you'll be dragging a lot of data across the network as well.
Check out these links to get you started:
ssistalk.com/category/ssis/ssis-advanced-techniques/
msdn.microsoft.com/en-us/library/ms141031.aspx
weblogs.sqlteam.com/jamesn/Default.aspx
I think this is a difficult question; and an interesting one as well.
One reason to use SSIS is to improve maintainability, IMHO. If you pack all the logic in SQL statements (and you sure can!) you tend to spoil this reason of using SSIS in the first place. You cannot really "see the data flow" anymore.
On the other hand I feel there are times when a well placed SQL statement has its value. For example when you read data from a table and for whatever reason already know you will only ever need the rows satisfying condition X I do not see the reason for reading the whole table and in the next step "conditional-splitting most of it away".
What I do not know is what this means in terms of performance, by the way. Is SSIS smart enough to see what is happening and change the "read-whole-table-and-conditional-split-it" into a "select Y from where X" on the fly (or when building/deploying)?
The big question is where to draw the line. And this depends to a certain extent on the people working on your ETL process. If everyone who will ever support the process has known SQL from the start, you can support a larger share of SQL in your ETL than if you have co-workers (or customers, or successors you care about) who hardly understand what is happening in all your SQL, let alone how to change, improve, or add to it.
So I think the bottom line is that neither avoiding SQL entirely nor doing everything in SQL is better. Try to make up some simple rules that fit your requirements and that everyone can live with, then follow them. This buys you the most value from using SSIS.
SQL Server does some things well and other things not so well. I use SSIS to import data to or export data from SQL Server. During the course of the move I use SSIS where it makes sense. I can easily do work on a per-row basis, which is not very efficient in SQL Server (cursors). To say that you shouldn't use transformations and data flows on an ETL box because it is too expensive on the ETL box is like saying 'don't drive your car too fast, because it causes the engine to work'. The purpose of ETL and SSIS is to take some of the processing that SQL Server does not do well and move it to an engine that does.
Got to use the right tool for the job. Generally, you do most things in SSIS, with certain things done in "pure" SQL.
For instance, in cases where you do a lot of UPDATE (table difference on dimension table in a dimensional model, say), you really don't want to execute an UPDATE for each row. In this scenario, you do a regular insert into a temporary table and then do the UPDATE in SQL, joining on appropriate keys.
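The staging-table pattern just described can be sketched as follows. This is an illustration only: it uses Python's built-in sqlite3 module instead of SSIS and SQL Server, and the `dim_customer` and `staging_customer` tables and their rows are made up. The incoming rows are bulk-inserted into a temporary table, and one set-based UPDATE then applies all changes by joining on the key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, city TEXT)")
conn.executemany(
    "INSERT INTO dim_customer VALUES (?, ?)",
    [(1, "Oslo"), (2, "Lima"), (3, "Pune")],
)

# Step 1: plain bulk insert of the changed rows into a temporary staging table.
conn.execute(
    "CREATE TEMP TABLE staging_customer (customer_id INTEGER PRIMARY KEY, city TEXT)"
)
conn.executemany(
    "INSERT INTO staging_customer VALUES (?, ?)",
    [(1, "Bergen"), (3, "Mumbai")],
)

# Step 2: a single UPDATE that picks up every staged change by key,
# instead of issuing one UPDATE statement per row.
cursor = conn.execute(
    "UPDATE dim_customer"
    " SET city = (SELECT s.city FROM staging_customer s"
    "             WHERE s.customer_id = dim_customer.customer_id)"
    " WHERE customer_id IN (SELECT customer_id FROM staging_customer)"
)
updated_rows = cursor.rowcount
cities = conn.execute(
    "SELECT customer_id, city FROM dim_customer ORDER BY customer_id"
).fetchall()
```

Only the two staged customers change; customer 2 is left untouched. On SQL Server the second step would more typically be written as an UPDATE with a JOIN, but the shape of the approach is the same.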

server side db programming: why?

Given that database is generally the least scalable component (of a web application), are there any situations where one would put logic in procedures/triggers over keeping it in his favorite programming language (ruby...) or her favorite web framework (...rails!).
Server-side logic is often much faster, even with procedural approach.
You can fine-tune your grant options and hide the data you don't want to show
Having all queries in one place is more convenient than having them scattered all around the code.
And here's a (very subjective) article in my blog on the reason I prefer stored procedures:
Schema Junk
BTW, triggers (as opposed to functions / stored procedures / packages) I generally dislike.
They are a completely different story.
You're keeping the processing in the database, along with the data.
If you process on the server side, then you have to transfer the data out to a server process across the network, process it, and (optionally) send it back. You have the network bandwidth/latency issues, plus memory overheads.
To clarify - if I have 10m rows of data, my two extreme scenarios are to a) pull those 10m rows across the network and process on the server side, or b) process in place in the database using the server and language (SQL) optimised for this purpose. Note that this is a generalisation and not a hard-and-fast rule, but it's the one I follow for most scenarios.
When many heterogeneous applications and various other systems need to access your single database and be sure through their operations data stays consistent without integrity conflicts. So you put your logic into triggers and stored procedures that will offer an interface to external clients.
Maybe not for most web-based systems, but certainly for enterprise databases. Stored procedures and the like allow you much greater control over security and performance, as well as offering a bit of encapsulation for the database itself. You can change the schema all you want as long as the stored procedure interface remains the same.
In (almost) every situation you would keep the processing that is part of the database in the database. Application code cannot substitute for triggers, you won't get very far before you have updated the database and failed to fire the application's equivalent of the triggers (the first time you use the DBMS's management console, for instance).
Let the database do the database work and let the application do the application's work. If you have a specific performance problem with the database, and that performance problem can be addressed by moving processing out of the database, then you might want to consider doing so.
But worrying about database performance without a database performance problem existing (which is what you seem to be doing here) is both silly and, sadly, apparently a preoccupation of many Stack Overflow posters.
Least scalable? SQL???
Look up, "federating."
If the database is shared, having logic in the database is better in order to control everything that happens. If it's not it might just make the system overly complicated.
If you have multiple applications that talk to your database, stored procedures and triggers can enforce correctness more pervasively. Accordingly, if correctness is more important than convenience, putting logic in the database is sensible.
Scalability may be a red herring, though. Sometimes it's easier to express the behavior you want in the domain layer of an OO language, but it can be actually more expensive than doing the idiomatic SQL way.
The security mechanism at a previous company was first built in the service layer, then pushed to the db side. The motivation was actually due to some limitations in a data access framework we were using. The solution turned out to be a bit buggy because our security model was complicated, but the upside was that bugs only had to be fixed in the database; we didn't have to worry about different clients following different rules.
Triggers mean 3rd-party apps can modify the database without creating logical inconsistencies.
If you do that, you are tying your business logic to your model. If you code all your business logic in T-SQL, you aren't going to have a lot of fun if later you need to use Oracle or what have you as your database server. Actually, I'm not sure I understand this question exactly. How do you think this would improve scalability? It really shouldn't.
Personally, I'm really not a fan of triggers, particularly in a database dedicated to a single application. I hate trying to track down why some data is inconsistent, to find it's down to a poorly written trigger (and they can be tricky to get exactly correct).
Security is another advantage of using stored procs. You do not have to set the security at the table level if you don't use dynamic code (including in the stored proc). This means your users cannot do anything unless they have a proc to do it. This is one way of reducing the possibility of fraud.
Further, procs are easier to performance tune than most application code and, even better, when one needs to change, that is all you have to put on production; you don't have to recompile the whole application.
Data integrity must be maintained at the database level. That means constraints, default values, foreign keys, possibly triggers (if you have very complex rules or ones involving multiple tables). If you do not do this at the database level, you will eventually have integrity issues. People will write a quick fix for a problem and run the code in the query window, and the required rules are missed, creating a larger problem. A million new records will have to be imported through an ETL program that doesn't access the application, because going through the application code would take too long running one record at a time.
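A small sketch of what database-level integrity looks like in practice, using Python's built-in sqlite3 module with an in-memory database. The `customers` and `invoices` tables and the `try_insert` helper are hypothetical. The point is that the rules hold even for statements that bypass the application entirely, such as a quick fix typed into a query window.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE invoices ("
    " id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL REFERENCES customers(id),"  # foreign key
    " total REAL NOT NULL CHECK (total > 0),"                  # constraint
    " status TEXT NOT NULL DEFAULT 'open')"                    # default value
)
conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Acme')")
conn.execute("INSERT INTO invoices (customer_id, total) VALUES (1, 250.0)")


def try_insert(sql):
    """Run an ad hoc statement and report whether the database accepted it."""
    try:
        conn.execute(sql)
        return "accepted"
    except sqlite3.IntegrityError:
        return "rejected"


# Two "quick fixes" run directly against the database, bypassing the application.
orphan = try_insert("INSERT INTO invoices (customer_id, total) VALUES (99, 10.0)")
negative = try_insert("INSERT INTO invoices (customer_id, total) VALUES (1, -5.0)")
default_status = conn.execute("SELECT status FROM invoices WHERE id = 1").fetchone()[0]
```

Both ad hoc inserts are rejected by the database itself, and the valid invoice picked up its default status without any application code being involved.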
If you think you are building an application where scalability will be an issue, you need to hire a database professional and follow his or her suggestions for design based on performance. Databases can scale to terabytes of data, but only if they are originally designed by someone who is a specialist in this kind of thing. When you wait until the whole application is running slower than dirt and you have a new large client coming on board, it is too late. Database design must consider performance from the beginning, as it is very hard to redesign when you already have millions of records.
A good way to reduce scalability of your data tier is to interact with it on a procedural basis. (Fetch row..process... update a row, repeat)
This can be done within a stored procedure by use of cursors or within an application (fetch a row, process, update a row) .. The result (poor performance) is the same.
When people say they want to do processing in their application it sometimes implies a procedural interaction.
Sometimes it's necessary to treat data procedurally. However, in my experience developers with limited database experience tend to design systems in a way that does not leverage the strength of the platform, because they are not comfortable thinking in terms of set-based solutions. This can lead to severe performance issues.
For example, to add 1 to a count field of all rows in a table, the following is all that's needed:
UPDATE table SET cnt = cnt + 1
The procedural treatment of the same is likely to be orders of magnitude slower in execution, and developers can easily overlook concurrency issues that make their process inconsistent. For example, this kind of code is inconsistent given the available read isolation levels of many RDBMS platforms.
SELECT id,cnt FROM table
...
foreach row
...
UPDATE table SET cnt = row.cnt+1 WHERE id=row.id
...
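The inconsistency can be reproduced deterministically. The sketch below uses Python's built-in sqlite3 module with a hypothetical `counters` table, and it simulates the competing writer on the same connection purely to keep the example self-contained; in a real system the second increment would come from another session between the SELECT and the row-by-row UPDATEs.

```python
import sqlite3


def fresh_table():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE counters (id INTEGER PRIMARY KEY, cnt INTEGER NOT NULL)")
    conn.executemany("INSERT INTO counters VALUES (?, ?)", [(1, 10), (2, 20)])
    return conn


def counts(conn):
    return [cnt for (cnt,) in conn.execute("SELECT cnt FROM counters ORDER BY id")]


# Set-based: each statement reads and writes the current value in one step,
# so two increments are both reflected.
set_based = fresh_table()
set_based.execute("UPDATE counters SET cnt = cnt + 1")
set_based.execute("UPDATE counters SET cnt = cnt + 1")  # stand-in for a second writer
set_based_counts = counts(set_based)

# Procedural: values are read first, then written back row by row.
procedural = fresh_table()
snapshot = procedural.execute("SELECT id, cnt FROM counters").fetchall()
procedural.execute("UPDATE counters SET cnt = cnt + 1")  # second writer sneaks in
for row_id, cnt in snapshot:
    # Writes snapshot value + 1, silently overwriting the other increment.
    procedural.execute("UPDATE counters SET cnt = ? WHERE id = ?", (cnt + 1, row_id))
procedural_counts = counts(procedural)
```

After two increments the set-based table holds 12 and 22, while the procedural version ends at 11 and 21: one increment has been lost without any error being raised.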
I think just in terms of abstraction and ease of servicing a running environment utilizing stored procedures can be a useful tool.
Procedure plan cache and reduced number of network round trips in high latency environments can also have significant performance advantages.
It is also true that trying to be too clever or work very complex problems in the RDBMS's half-baked procedural language can easily become a recipe for disaster.
"Given that database is generally the least scalable component (of a web application), are there any situations where one would put logic in procedures/triggers over keeping it in his favorite programming language (ruby...) or her favorite web framework (...rails!)."
What makes you think that "scalability" is the only relevant concern in a system design ? I agree with rexem where he commented that it is very obvious that you are "not" biased ...
Databases are sets of assertions of fact. Those sets become more valuable if they can also be guaranteed to conform to certain integrity rules. Those guarantees are not worth a dime if it is the applications that are expected to enforce such integrity. Triggers and sprocs are the only way SQL systems have to allow such guarantees to be offered by the DBMS itself.
That aspect outweighs "scalability" anytime, anywhere, anyhow.

Difference between BPM and App. workflow?

I know there is a lot of talk about BPM these days and I am conscious that some may see it to be a craze rather than a fundamentally important piece of software.
As someone from what most would call 'The Business', I have been doing my best to learn about BPM to ensure we continue to make decisions that not only make sense to the business, but IT as well.
I have noticed while reading that mention is made to application workflow when sometimes discussing BPM. I hadn't given this much thought until recently.
Therefore, what is the difference? When would you use one and not the other?
BPM is about the process and improving it, which takes into account users and potentially more than one application, e.g. an ERP system may have more than one application to it, though there may be other uses of the term. Note that the process can be viewed independently of which applications or technologies are used.
Application workflow is how an application is used to go from a to b. Here it is a specific set of code that is used and what happens over the course of an application getting from a to b. In this case, the application is front and center rather than the process.
Does that provide an answer? Another way to think of it is that multiple application workflows can make up a system which is used in a process that can have BPM applied to it.
Late to the game, but workflow is to database as BPMS is to DBMS. (Convenient how the letters line up, huh?)
IOW, BPM(S) is traditionally meant to refer to a particular framework/application that allows you to manage business processes: defining them, storing them, versioning them, measuring them, etc. This is similar to how a DBMS manages databases.
Now, a workflow is a definition, much like a database is a definition. In the former case, it is a definition of operations/work (Fulfill Order), steps thereof (Send Invoice) and rules/constraints on the work (If no stock, send notice). In the latter, similar case, it is a definition of data structure (CREATE TABLE) and constraints (InvoiceTotal must be > $0.00).
I think this is a potentially confusing subject, particularly as some development environments use a type of process flow model to generate user-facing applications (I'm thinking about Outsystems here, for example).
But, for me, the distinction is crystal clear. Application workflow, as people talk about it, refers to a user's path through an application, i.e. the pages they complete/visit, the data they enter, etc. on their way to completing a transaction of some sort. Application workflow is a poor term for this, though; I think application flow would be more meaningful.
BPM, on the other hand, is about modelling and executing a workflow process. By workflow, in this context, I mean a series of discrete steps (or tasks) that have to be completed (either programmatically or via human interaction) in a certain order to complete a process. These tasks can be implemented as individual application modules (each with their own "application workflow", see above). The job of the workflow engine is to make sure that these separate steps are assigned to the right people (or groups of people) in the right sequence, and that overall the process completes in an orderly way.
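To make the workflow engine's job a little more tangible, here is a deliberately tiny sketch in plain Python. It is not how any particular BPM product works; `Step`, `ProcessInstance`, the step names, and the assignee groups are all hypothetical. It only enforces the two things mentioned above: steps complete in order, and only the assigned person or group may complete them.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    name: str
    assignee: str  # person or group responsible for this step


@dataclass
class ProcessInstance:
    steps: list
    position: int = 0
    history: list = field(default_factory=list)

    def current_step(self):
        return self.steps[self.position] if self.position < len(self.steps) else None

    def complete(self, step_name, by):
        step = self.current_step()
        if step is None:
            raise ValueError("process already finished")
        if step.name != step_name:
            raise ValueError(f"expected step {step.name!r}, got {step_name!r}")
        if by != step.assignee:
            raise PermissionError(f"{step.name!r} is assigned to {step.assignee!r}")
        self.history.append((step.name, by))
        self.position += 1

    @property
    def finished(self):
        return self.position >= len(self.steps)


order = ProcessInstance(
    steps=[Step("Check Stock", "warehouse"), Step("Send Invoice", "billing")]
)

try:
    order.complete("Send Invoice", by="billing")  # out of sequence
    out_of_order = "allowed"
except ValueError:
    out_of_order = "blocked"

order.complete("Check Stock", by="warehouse")
order.complete("Send Invoice", by="billing")
```

Each step could itself be a whole application screen with its own "application flow"; the engine only cares about which step is next and who owns it.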
I don't think there's a clear answer to this at all. These are words, as opposed to theoretical concepts. If you add the word "checklist" into the mix - that just turns out to be a linear version of a process (but you can have conditionals in checklists - making them a workflow).
I am not sure how to help in reframing this question, but it's almost as if no answer can ever be possible. My own thoughts are at https://tallyfy.com/improving-efficiency-workflow-vs-business-process-management/
