Stored Procedure AND/OR ORM for a BI web app - stored-procedures

I am building a business intelligence web app in php 5 that display informations retrieved from a datawarehouse highly normalized (60+ tables in mysql).
We use MODx as our CMF to organize the code. So far the code is mainly procedura, each page is essentially composed of a bunch of sql query directly in the php code (Snippet as in the MODx terminology) and code to display the info in tables and graphically.
We are in the process of creating objects for our main components and put the sql queries there and use PDO. It is easy to do when the query map to a real object of the domain.
For more BI (aggregation with subqueries, join on 5+ tables) or search oriented query, I find it more difficult to see how to replace the dynamically created sql. For example, we have a search functionality in the web app with a lot of criteria. Depending on which criteria are selected, the php code add or remove tables to join, subqueries and change the 'where' clause.
Do you think an ORM or Stored Proc could improve performance/quality of code in that context ?
Is our model (60+ tables highly normalized) too complicated to be directly accessed from the web app and a kind of datamart (basically denormalized view of the data) would bring more benefits than ORM ?
This question is related to : stored-procedures-or-or-mappers

The bottleneck would surely be the level of normalization - if the option is available to you, adopting a more star-schema style DWH would greatly increase performance as it pre-prepares the data for consumption by your BI app.

Related

Best practices for developing an application with Neo4J

I recently came across an application which uses NEO4j as the backend. In my experience with SQL and other Key-value based databases, I have developed an understanding(which could be refined) that other databases store data and your application derives the information while with NEO4J you store the information. This means that the logic of deriving the information is already captured in the model of NEO4J. I am not able to get my head around this because now I cannot have logic that can be composed and most importantly something that can be tested with unit tests. I can sure have component level tests using embedded neo4j but then that's not the same. Can someone please help me understand the application development philosophy/methodology with NEO4J.
...other databases store data and your application derives the information while with NEO4J you store the information.
Hmmm.... Define data and define information. Mostly it goes: Data is something that requires further processing to become information (that is, something informative - something you can derived some conclusion or insights from).
Anyhow, doubt this has anything to do with Graph databases vs relational/aggregate databases. A database, as the name suggests, stores data.
This means that the logic of deriving the information is already captured in the model of NEO4J.
I'm not sure what you mean by "the logic... is already captured". Some queries are much easier with Neo+Cypher that with say SQL; like "Find all the friends of my friends that live in Berlin", but I would hardly relate this to 'logic'.
I cannot have logic that can be composed and most importantly something that can be tested with unit tests.
What do you mean by 'logic that can be composed'? And unit tests has nothing to do with this I'm afraid - there's no logic being tested if you talk about graph vs other databases.
Can someone please help me understand the application development philosophy/methodology with NEO4J.
There's really not much to it. Neo4J is a database like any other database, only that it uses a different model from relational/aggregate databases.
To highlight two of its strengths:
No joins - That's a pain point with relational/aggregate databases, especially with complex queries. Essentially, nearly all system involve a data model that is a graph (you only need one many-to-many relationship in your data model for that), and not using a graph database is a form of dimensionality reduction. The reasons relational databases prevailed for so many years is nothing short of a set of historical coincidences.
Easier DB migrations - and that's for being a schema-less data base. You ripe the same benefits with any other schema-less database.
I strongly recommend you read the 'NOSQL Overview' appendix of the free Graph Databases. It focus on a lot of these points.

Does GraphQL negate the need for Graph Databases

Most of the reasons for using a graph database seem to be that relational databases are slow when making graph like queries.
However, if I am using GraphQL with a data loader, all my queries are flattened and combined using the data loader, so you end up making simpler SELECT * FROM X type queries instead of doing any heavy joins. I might even be using a No-SQL database which is usually pretty fast at these kinds of flat queries.
If this is the case, is there a use case for Graph databases anymore when combined with GraphQL? Neo4j seems to be promoting GraphQL. I'd like to understand the advantages if any.
GraphQL doesn't negate the need for graph databases at all, the connection is very powerful and makes GraphQL more performant and powerful.
You mentioned:
However, if I am using GraphQL with a data loader, all my queries are flattened and combined using the data loader, so you end up making simpler SELECT * FROM X type queries instead of doing any heavy joins.
This is a curious point, because if you do a lot of SELECT * FROM X and the data is connected by a graph loader, you're still doing the joins, you're just doing them in software outside of the database, at another layer, by another means. If even that software layer isn't joining anything, then what you gain by not doing joins in the database you're losing by executing many queries against the database, plus the overhead of the additional layer. Look into the performance profile of sequencing a series of those individual "easy selects". By not doing those joins, you may have lost 30 years value of computer science research...rather than letting the RDMBS optimize the query execution path, the software layer above it is forcing a particular path by choosing which selects to execute in which order, at which time.
It stands to reason that if you don't have to go through any layer of formalism transformation (relational -> graph) you're going to be in a better position. Because that formalism translation is a cost you must pay every time, every query, no exceptions. This is sort of equivalent to the obvious observation that XML databases are going to be better at executing XPath expressions than relational databases that have some XPath abstraction on top. The computer science of this is straightforward; purpose-built data structures for the task typically outperform generic data structures adapted to a new task.
I recommend Jim Webber's article on the motivations for a native graph database if you want to go deeper on why the storage format and query processing approach matters.
What if it's not a native graph database? If you have a graph abstraction on top of an RDBMS, and then you use GraphQL to do graph queries against that, then you've shifted where and how the graph traversal happens, but you still can't get around the fact that the underlying data structure (tables) isn't optimized for that, and you're incurring extra overhead in translation.
So for all of these reasons, a native graph database + GraphQL is going to be the most performant option, and as a result I'd conclude that GraphQL doesn't make graph databases unnecessary, it's the opposite, it shows where they shine.
They're like chocolate and peanut butter. Both great, but really fantastic together. :)
Yes GraphQL allows you to make some kind of graph queries, you can start from one entity, and then explore its neighborhood, and so on.
But, if you need performances in graph queries, you need to have a native graph database.
With GraphQL you give a lot of power to the end-user. He can make a deep GraphQL query.
If you have an SQL database, you will have two choices:
to compute a big SQL query with a lot of joins (really bad idea)
make a lot of SQL queries to retrieve the neighborhood of the neighborhood, ...
If you have a native graph database, it will be just one query with good performance! It's a graph traversal, and native graph database are made for this.
Moreover, if you use GraphQL, you consider your data model as a graph. So to store it as graph seems obvious and gives you less headache :)
I recommend you to read this post: The Motivation for Native Graph Databases
Answer for Graph Loader
With Graph loader you will do a lot of small queries (it's the second choice on my above answer) but wait no, ... there is a cache record.
Graph loaders just do batch and cache.
For comparaison:
you need to add another library and implement the logic (more code)
you need to manage the cache. There is a lot of documentation about this topic. (more memory and complexity)
due to SELECT * in loaders, you will always get more data than needed Example: I only want the id and name of a user not his email, birthday, ... (less performant)
...
The answer from FrobberOfBits is very good. There are many reasons to add (or avoid) using GraphQL, whether or not a graph database is involved. I wanted to add a small consideration against putting GraphQL in front of a graph. Of course, this is just one of what ought to be many other considerations involved with making a decision.
If the starting point is a relational database, then GraphQL (in front of that datbase) can provide a lot of flexibility to the caller – great for apps, clients, etc. to interact with data. But in order to do that, GraphQL needs to be aligned closely with the database behind it, and specifically the database schema. The database schema is sort of "projected out" to apps, clients, etc. in GraphQL.
However, if the starting point is a native graph database (Neo4j, etc.) there's a world of schema flexibility available to you because it's a graph. No more database migrations, schema updates, etc. If you have new things to model in the data, just go ahead and do it. This is a really, really powerful aspect of graphs. If you were to put GraphQL in front of a graph database, you also introduce the schema concept – GraphQL needs to be shown what is / isn't allowed in the data. While your graph database would allow you to continue evolving your data model as product needs change and evolve, your GraphQL interactions would need to be updated along the way to "know" about what new things are possible. So there's a cost of less flexibility, and something else to maintain over time.
It might be great to use a graph + GraphQL, or it might be great to just use a graph by itself. Of course, like all things, this is a question of trade-offs.

How do you achieve performance on the order of a graph database with an RDBMS?

We have a data model stored in a relational data model that effectively looks like a graph. There is a small number of tables, but the tables are quite large and the types of queries we do are often a 5 join-levels deep. It would be most performant if this data were stored in a graph database, but we dont have that option. How does one achieve graph database-level performance with an RDBMS? What tools can you add on top of the database e.g. caching, search indexes, use an OLAP server that will give you anything close to the performance of a graph database in this situation?
How does one achieve graph database-level performance with an RDBMS?
You don't, at least not in the same way. Both systems have their strengths and weaknesses and trying to apply one to the other will usually end up with a mess. If you are stuck using RDBMS, then design your data model for RDBMS.
What tools can you add on top of the database e.g. caching, search indexes, use an OLAP server that will give you anything close to the performance of a graph database in this situation?
It depends on your data model, perhaps you can elaborate on why you cannot use a Graph Database? Can you have both running side by side?

ADO.NET: do you have lots of stored procedure in your own systems?

hi all
We do know that CommandType property of a SqlCommand object has 3 options: TableDirect, Text and StoredProcedure or "SP".
Knowing that "SP" has benefits over two other options, my question is do you make lots of SP in your own systems?
Or What solution do you have instead of creating SP?
Thank you
Aside of creating Stored Procedures you can use Object Relational Mapping
Such as:
linq to sql
Nhibernate
Entity Framework
Data Access :SP's vs ORMs
Choose the best way that suits you.
In all production system I used SPs and pure ADO.NET Core to access the data. Systems range from having 100-300 tables and about 500-1000 stored procedures.
Most of the Data Access code is generated using a tool. I've posted the source code and sample application on my blog if you're interested in using/modifying it. The tool can generate over 100,000 lines of code in about 20-25 seconds going against a database with about 750 stored procedures.
Data Access Layer - Code Gen
Of course if you're no familiar with Databases, data modeling/design and stored procedures you're probably better off using Linq to SQL or EF4 (Entity Framework version 4) or similar. If you need brute force performance then ADO.NET core along with Stored procedures is the way to go.
Re: your first question
When you go down the path of stored procedures, the number of stored procedures begins to grow continually for the life of the project. Outside of the basic CRUD operations, each stored procedure tends to be tightly bound to a particular problem and not very re-usable. A rule of thumb is that I can expect 8-12 stored procedures for each data table (excluding reference or code tables, such as the list of states or countries).
The very large number of procs makes naming conventions very important so that you can find anything without constantly visually re-scanning the whole list of 400-500 procs.
Re: your second question
There are a lot of ugly things that happen with sql written inside of strings inside of C# or VB.NET -- it's error prone, ugly, etc.
Linq, nHybernate and many others exist, but the "concept count" (the number of things you need to learn to start being productive), is much higher than learning how to write a good stored procedure executer in C#.
I try to make sure that stored procedures are only created for database functionality - not business logic.
It's Database Functionality when I have some database architecture that's a bit obscure and I want to hide that from callers.
It's Business Logic when it is simply the way in which my application adds or updates or how much validation they do, etc., etc.

When developing web applications when would you use a Graph database versus a Document database?

I am developing a web-based application using Rails. I am debating between using a Graph Database, such as InfoGrid, or a Document Database, such as MongoDB.
My application will need to store both small sets of data, such as a URL, and very large sets of data, such as Virtual Machines. This data will be tied to a single user.
I am interested in learning about peoples experiences with either Graph or Document databases and why they would use either of the options.
Thank you
I don't feel enough experienced with both worlds to properly and fully answer your question, however I'm using a document database for some time and here are some personal hints.
The document databases are based on a concept of key,value, and static views and are pretty cool for finding a set of documents that have a particular value.
They don't conceptualize the relations between documents.
So if your software have to provide advanced "queries" where selection criteria act on several 'types of document' or if you simply need to perform a selection using several elements, the [key,value] concept is not appropriate.
There are also a number of other cases where document databases are inappropriate : presenting large datasets in "paged" tables, sortable on several columns is one of the cases where the performances are low and disk space usage is huge.
So in many cases you'll have to perform "server side" processing in order to pick up the pieces, and with rails, or any other ruby based framework, you might run into performance issues.
The graph database are based on the concept of tripplestore, meaning that they also conceptualize the relations between the entities.
The graph can be traversed using the relations (and entity roles), and might be more convenient when performing searches across relation-structured data.
As I have no experience with graph database, I'm not aware if the graph database can be easily queried/traversed with several criterias, however if an advised reader has such an information I'd really appreciate any examples of such queries/traversals.
I'm currently reading about InfoGrid and trying to figure if such databases could by handy in order to perform complex requests on a very large set of data, relations included ....
From what I can read, the InfoGrah should be considered as a "data federator" able to search/mine the data from several sources (Stores) wich can also be a NoSQL database such as Mongo.
Wich means that you could use a mongo store for updates and InfoGraph for data searching, and maybe spare a lot of cpu and disk when it comes to complex searches inside a nosql database.
Of course it might seem a little "overkill" if your app simply stores a large set of huge binary files in a database and all you need is to perform simple key queries and to retrieve the result. In that cas a nosql database such as mongo or couch would probably be handy.
Hope some of this helps ;)
When connecting related documents by edges, will you get a shallow or a deep graph? I think the answer to that question is important when deciding between graphdbs and documentdbs. See Square Pegs and Round Holes in the NOSQL World by Jim Webber for thoughts along these lines.

Resources