I have a self-hosted WCF Data Service (OData) that I'm developing. As I've been testing it, I noticed that most client applications I'm using (Excel, browsers, etc.) time out on a request that pulls a particular query from my service. The query returns about 140k records, and the applications simply crash after a long wait.
Right now, the only workaround is to do client-side paging, but if I can simply increase the limit then I would be most grateful for the answer.
Note that my entity model is mapped to database views rather than actual tables, in case that is related to the issue.
Cheers!
Do you really need to transfer such a large amount of data?
I think OData is not a protocol for data replication.
The main advantage of OData is the opportunity to query and thus limit the amount of data to be transferred.
In an application that handles a lot of data, a common approach is to first present aggregations and then refine the query (depending, for example, on successive choices made by the user).
The AdaptiveLINQ component I developed can help you implement this type of service. It is based on the notion of a cube: dimensions and measures are defined as C# expressions.
For example, one can imagine a service for browsing a product catalog (containing lots of data) as follows:
List of product categories and, for each of them, the number of products available:
http://.../catalogService?$select=Category,ItemQuantity
List of available colors in category "shirt":
http://.../catalogService?$select=Color,ItemQuantity&$filter=Category eq 'shirt'
List of "green shirts":
http://.../catalogService?$select=ProductLabel,ProductID&$filter=Category eq 'shirt' and Color eq 'green'
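To give an idea of the underlying aggregations, here is roughly what those three queries amount to in plain LINQ. This is not AdaptiveLINQ's actual API, and the Product shape is an assumption made for the example:

using System.Collections.Generic;
using System.Linq;

// Hypothetical product shape; the property names mirror the fields used in the URLs above.
public record Product(int ProductID, string ProductLabel, string Category, string Color);

public static class CatalogQueries
{
    // Category + ItemQuantity (first query above).
    public static IEnumerable<object> CategoriesWithQuantity(IEnumerable<Product> products) =>
        products.GroupBy(p => p.Category)
                .Select(g => new { Category = g.Key, ItemQuantity = g.Count() });

    // Colors available within one category (second query above).
    public static IEnumerable<object> ColorsWithQuantity(IEnumerable<Product> products, string category) =>
        products.Where(p => p.Category == category)
                .GroupBy(p => p.Color)
                .Select(g => new { Color = g.Key, ItemQuantity = g.Count() });

    // Concrete products once the user has narrowed the search down (third query above).
    public static IEnumerable<object> GreenShirts(IEnumerable<Product> products) =>
        products.Where(p => p.Category == "shirt" && p.Color == "green")
                .Select(p => new { p.ProductLabel, p.ProductID });
}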
Say you have zip codes, services and customers. Given a zip and service, I want to find the corresponding customers as fast as possible.
Options:
Customers are connected to zips via a "service" relationship. This seems like the smallest version: search for a particular zip and only one type of relationship (the targeted service).
Customers are connected to service areas, which point to different zips and services. Here we search for all service areas that point to the targeted service and the targeted zip.
Zips each connect to a service node unique to them, which is then connected to customers. So when you search, you go to the zip you want, go to the service, and anything connected from there is what you want (this feels like I may be overly hand-holding Neo4j).
Do these different versions have different performance? I am having trouble understanding the theoretical difference between search formats in Neo4j. Option 2 is an example where the results are limited on two sides at once, whereas for options 1 and 3 you can travel linearly along the graph as you filter. Does that make a difference?
Thanks,
Brian
Approach 1 has several major drawbacks. All data about a "service" (which I assume is a company that provides services) would have to be duplicated in every associated service relationship. That wastes storage space in the DB. Also, if you wanted to find all the customers for a specific service (regardless of zip code), you'd have to scan every service relationship.
Approach 2 introduces an extra "service area" layer to the data model that seems to provide no advantage and just makes processing your use case more complicated and slower.
Approach 3 (in which I assume every "service" has a unique node) should be the way to go. There is no data duplication, and no scanning is needed to find the desired customers (whether you start from a zip code, or from a service).
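To make approach 3 concrete, a query for "customers in a given zip that use a given service" could look roughly like the following with the Neo4j .NET driver. The labels and relationship types are invented for the example, not taken from your question:

using System;
using System.Threading.Tasks;
using Neo4j.Driver;   // official Neo4j .NET driver

class CustomersByZipAndService
{
    // Assumed model: each Customer is linked to its Zip and to each Service it uses.
    const string Cypher = @"
        MATCH (:Zip {code: $zip})<-[:LOCATED_IN]-(c:Customer)-[:USES]->(:Service {name: $service})
        RETURN c.name AS name";

    static async Task Main()
    {
        using var driver = GraphDatabase.Driver("bolt://localhost:7687",
                                                AuthTokens.Basic("neo4j", "password"));
        var session = driver.AsyncSession();
        try
        {
            // Both anchors (the Zip node and the Service node) are single lookups,
            // so no relationship scan is needed -- the point of approach 3.
            var cursor = await session.RunAsync(Cypher, new { zip = "30301", service = "plumbing" });
            var names = await cursor.ToListAsync(r => r["name"].As<string>());
            names.ForEach(Console.WriteLine);
        }
        finally
        {
            await session.CloseAsync();
        }
    }
}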
Most of the reasons for using a graph database seem to be that relational databases are slow when making graph-like queries.
However, if I am using GraphQL with a data loader, all my queries are flattened and combined using the data loader, so you end up making simpler SELECT * FROM X type queries instead of doing any heavy joins. I might even be using a NoSQL database, which is usually pretty fast at these kinds of flat queries.
If this is the case, is there a use case for Graph databases anymore when combined with GraphQL? Neo4j seems to be promoting GraphQL. I'd like to understand the advantages if any.
GraphQL doesn't negate the need for graph databases at all; on the contrary, the combination is very powerful and makes GraphQL more performant.
You mentioned:
However, if I am using GraphQL with a data loader, all my queries are flattened and combined using the data loader, so you end up making simpler SELECT * FROM X type queries instead of doing any heavy joins.
This is a curious point, because if you do a lot of SELECT * FROM X and the data is connected by a graph loader, you're still doing the joins, you're just doing them in software outside of the database, at another layer, by another means. And if even that software layer isn't joining anything, then what you gain by not doing joins in the database you're losing by executing many queries against the database, plus the overhead of the additional layer. Look into the performance profile of sequencing a series of those individual "easy selects". By not doing those joins, you may have lost 30 years' worth of computer science research: rather than letting the RDBMS optimize the query execution path, the software layer above it is forcing a particular path by choosing which selects to execute in which order, at which time.
It stands to reason that if you don't have to go through any layer of formalism transformation (relational -> graph) you're going to be in a better position. Because that formalism translation is a cost you must pay every time, every query, no exceptions. This is sort of equivalent to the obvious observation that XML databases are going to be better at executing XPath expressions than relational databases that have some XPath abstraction on top. The computer science of this is straightforward; purpose-built data structures for the task typically outperform generic data structures adapted to a new task.
I recommend Jim Webber's article on the motivations for a native graph database if you want to go deeper on why the storage format and query processing approach matters.
What if it's not a native graph database? If you have a graph abstraction on top of an RDBMS, and then you use GraphQL to do graph queries against that, then you've shifted where and how the graph traversal happens, but you still can't get around the fact that the underlying data structure (tables) isn't optimized for that, and you're incurring extra overhead in translation.
So for all of these reasons, a native graph database + GraphQL is going to be the most performant option, and as a result I'd conclude that GraphQL doesn't make graph databases unnecessary, it's the opposite, it shows where they shine.
They're like chocolate and peanut butter. Both great, but really fantastic together. :)
Yes, GraphQL allows you to make a kind of graph query: you can start from one entity, then explore its neighborhood, and so on.
But if you need performance in graph queries, you need a native graph database.
With GraphQL you give a lot of power to the end user, who can make a deeply nested GraphQL query.
If you have an SQL database, you will have two choices:
compute one big SQL query with a lot of joins (a really bad idea)
make a lot of SQL queries to retrieve the neighborhood of the neighborhood, and so on
If you have a native graph database, it will be just one query with good performance! It's a graph traversal, and native graph databases are made for this.
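As a rough illustration (all table names, labels and relationship types below are invented): a three-level "friends of friends of friends" query means one SQL round trip per level with the second choice, while a native graph database can answer it with a single variable-length traversal.

using System;
using System.Collections.Generic;

class NeighborhoodQueries
{
    // Second choice above: one SQL round trip per level of the nested GraphQL query.
    static IEnumerable<string> SqlPerLevel(int depth)
    {
        yield return "SELECT friend_id FROM friendships WHERE user_id = @rootId;  -- round trip 1";
        for (var level = 2; level <= depth; level++)
            yield return $"SELECT friend_id FROM friendships WHERE user_id IN (/* ids from round trip {level - 1} */);  -- round trip {level}";
    }

    // Native graph alternative: the whole neighborhood-of-the-neighborhood in one traversal.
    const string Cypher = "MATCH (u:User {id: $id})-[:FRIEND*1..3]-(other) RETURN DISTINCT other";

    static void Main()
    {
        foreach (var sql in SqlPerLevel(3)) Console.WriteLine(sql);
        Console.WriteLine(Cypher);
    }
}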
Moreover, if you use GraphQL, you already treat your data model as a graph, so storing it as a graph seems obvious and gives you fewer headaches :)
I recommend reading this post: The Motivation for Native Graph Databases
Answer for Graph Loader
With a graph loader you will make a lot of small queries (the second choice in my answer above), but wait, no... there is a cache.
Graph loaders just do batching and caching.
For comparison:
you need to add another library and implement the logic (more code)
you need to manage the cache; there is a lot of documentation on this topic (more memory and complexity)
due to the SELECT * in loaders, you will always get more data than needed. Example: I only want the id and name of a user, not their email, birthday, ... (less performant)
...
The answer from FrobberOfBits is very good. There are many reasons to add (or avoid) using GraphQL, whether or not a graph database is involved. I wanted to add a small consideration against putting GraphQL in front of a graph. Of course, this is just one of what ought to be many other considerations involved with making a decision.
If the starting point is a relational database, then GraphQL (in front of that database) can provide a lot of flexibility to the caller – great for apps, clients, etc. to interact with data. But in order to do that, GraphQL needs to be aligned closely with the database behind it, and specifically the database schema. The database schema is sort of "projected out" to apps, clients, etc. in GraphQL.
However, if the starting point is a native graph database (Neo4j, etc.) there's a world of schema flexibility available to you because it's a graph. No more database migrations, schema updates, etc. If you have new things to model in the data, just go ahead and do it. This is a really, really powerful aspect of graphs. If you were to put GraphQL in front of a graph database, you also introduce the schema concept – GraphQL needs to be shown what is / isn't allowed in the data. While your graph database would allow you to continue evolving your data model as product needs change and evolve, your GraphQL interactions would need to be updated along the way to "know" about what new things are possible. So there's a cost of less flexibility, and something else to maintain over time.
It might be great to use a graph + GraphQL, or it might be great to just use a graph by itself. Of course, like all things, this is a question of trade-offs.
I'm playing around with neo4j - seeing what I can and can't do with it before suggesting it for something serious. One of the things I'm looking at now is Data Partitioning. By this I mean having a single data store that contains data from many different customers, and knowing which customer the data belongs to.
In the SQL world, we've always done this by having a customer_id field on the tables that are customer specific, and then always including that in the queries and indices. This works perfectly well for us, but in the Graph DB world it feels like we can do better.
The options that I've come up with so far are:
The same as before - including a property on the nodes that is the Customer ID
Storing a Label on each Node that identifies the Customer. However, as far as I can tell you can't bind parameters to labels so this would mean that the queries are generated slightly awkwardly.
Storing a Customer Node, and linking all of the other nodes to it.
Option 3 seems to be the "correct" graph DB way of managing this, but I'm concerned about its impact on performance. It's perfectly feasible that there will be hundreds of thousands of links from a single Customer node to the other data nodes, and there will be hundreds of different Customer nodes (based on the volume of data in the existing SQL database).
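To make the comparison concrete, this is roughly how I imagine the queries would differ between option 1 and option 3 (the Order label, properties and relationship type are just invented examples):

class TenantQueries
{
    // Option 1: the Customer ID stored as a property on every node (mirrors our SQL approach).
    public const string ByProperty = @"
        MATCH (o:Order {customerId: $customerId})
        WHERE o.status = 'OPEN'
        RETURN o";

    // Option 3: a Customer node that all of that customer's nodes are linked to;
    // the OWNS relationship type is invented as well.
    public const string ByCustomerNode = @"
        MATCH (:Customer {id: $customerId})-[:OWNS]->(o:Order)
        WHERE o.status = 'OPEN'
        RETURN o";

    static void Main() => System.Console.WriteLine(ByProperty + "\n" + ByCustomerNode);
}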
What's the recommended way of achieving this level of data partitioning whilst maintaining performance?
We have a rather large set of related tables with over 35 million related records each. I need to create a couple of WCF methods that would query the database with some parameters (data ranges, type codes, etc.) and return related results sets (from 10 to 10,000 records).
The company is standardized on EF 4.0 but is open to 4.X. I might be able to make argument to migrate to 5.0 but it's less likely.
What’s the best approach to dealing with such a large number of records using Entity Framework? Should I create a set of stored procs and call them from Entity Framework, or is there something I can do within Entity Framework itself?
I do not have any control over the databases so I cannot split the tables or create some materialized views or partitioned tables.
Any input/idea/suggestion is greatly appreciated.
At my work I faced a similar situation. We had a database with many tables, most of them containing around 7-10 million records each. We used Entity Framework to display the data, but the pages displayed very slowly (90 to 100 seconds), and even sorting the grid took time. I was given the task of seeing whether it could be optimized, and after profiling it (ANTS Profiler) I was able to get it under 7 seconds.
So the answer is yes, Entity Framework can handle large volumes of records (in the millions), but some care must be taken:
Understand that the call to the database is made only when the actual records are required; all the preceding operations just build up the query (SQL). So try to fetch only the slice of data you need rather than requesting a large number of records, and trim the fetch size as much as possible.
Yes, not only should you, you must use stored procedures: import them into your model and create function imports for them (see the sketch below). You can also call them directly via ExecuteStoreCommand() and ExecuteStoreQuery<>(). The same goes for functions and views, but EF has a really odd way of calling scalar functions ("SELECT dbo.blah(@id)").
EF is slower when it has to populate an entity with a deep hierarchy, so be extremely careful with deeply hierarchical entities.
Sometimes when you are requesting records and you are not required to modify them, tell EF not to watch for property changes (AutoDetectChanges); record retrieval is much faster that way.
Database indexing is good in general, but with EF it becomes very important: the columns you use for retrieval and sorting should be properly indexed.
When your model is large, the VS2010/VS2012 model designer gets really unwieldy, so break your model into medium-sized models. One limitation is that entities from different models cannot be shared, even though they may point to the same table in the database.
When you have to make changes to the same entity in different places, pass the same entity instance around and send the changes only once, rather than having each place fetch a fresh copy, modify it, and save it (a real performance-gain tip).
When you need only one or two columns, try not to fetch the full entity: either execute your SQL directly or project into a mini entity (see the sketch below). You may also need to cache some frequently used data in your application.
Transactions are slow, so be careful with them.
If you keep these things in mind, EF should give you almost the same performance as plain ADO.NET, if not identical.
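To make the stored-procedure and projection tips concrete, here is a minimal sketch. It assumes an EF 4.x ObjectContext and invented entity, column and procedure names, so treat it as an illustration rather than a drop-in:

using System;
using System.Data.Objects;      // EF 4.x ObjectContext
using System.Data.SqlClient;
using System.Linq;

// Entity and result shapes are invented for the example.
public class Order
{
    public int OrderId { get; set; }
    public DateTime CreatedOn { get; set; }
    public decimal Total { get; set; }
}

public class OrderSummary
{
    public int OrderId { get; set; }
    public decimal Total { get; set; }
}

public static class OrderQueries
{
    // Projection: fetch only the columns needed instead of materializing full entities.
    public static OrderSummary[] Summaries(IQueryable<Order> orders, DateTime from, DateTime to) =>
        orders.Where(o => o.CreatedOn >= from && o.CreatedOn < to)
              .Select(o => new OrderSummary { OrderId = o.OrderId, Total = o.Total })
              .ToArray();

    // Stored procedure via ExecuteStoreQuery<T> when LINQ-to-Entities gets too heavy.
    public static OrderSummary[] SummariesViaProc(ObjectContext ctx, DateTime from, DateTime to) =>
        ctx.ExecuteStoreQuery<OrderSummary>(
               "EXEC dbo.GetOrderSummaries @from, @to",
               new SqlParameter("@from", from),
               new SqlParameter("@to", to))
           .ToArray();
}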
My experience with EF 4.1, code first: if you only need to read the records (i.e. you won't write them back), you will gain a performance boost by turning off change tracking for your context:
yourDbContext.Configuration.AutoDetectChangesEnabled = false;
Do this before loading any entities. If you need to update the loaded records, you can always call
yourDbContext.ChangeTracker.DetectChanges();
before calling SaveChanges().
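Putting the two calls in context, a minimal read-mostly sketch might look like this (the Customer entity and the surrounding method are assumptions for the example, not code from my project):

using System.Data.Entity;   // EF 4.1 code-first (EntityFramework package)
using System.Linq;

public class Customer
{
    public int Id { get; set; }
    public string Name { get; set; }
    public bool IsActive { get; set; }
}

public static class ReadMostlyExample
{
    // Sketch only: 'context' is whatever DbContext your application already has.
    public static void LoadAndMaybeUpdate(DbContext context)
    {
        // Turn tracking overhead off before loading anything (read-only fast path).
        context.Configuration.AutoDetectChangesEnabled = false;

        var active = context.Set<Customer>()
                            .Where(c => c.IsActive)
                            .ToList();

        // If you do end up changing loaded records, detect the changes explicitly...
        active[0].Name = "Renamed";
        context.ChangeTracker.DetectChanges();

        // ...and persist as usual.
        context.SaveChanges();
    }
}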
The moment I hear statements like "The company is standardized on EF4, or EF5, or whatever", cold shivers run down my spine.
It is the equivalent of a car rental saying "We have standardized on a single car model for our entire fleet".
Or a carpenter saying "I have standardized on chisels as my entire toolkit. I will not have saws, drills etc..."
There is something called the right tool for the right job.
This statement only highlights that the person in charge of making key software architecture decisions has no clue about software architecture.
If you are dealing with over 100K records and the data models are complex (i.e. non-trivial), maybe EF6 is not the best option.
EF6 is based on the concepts of dynamic reflection and has design patterns similar to Castle Project Active Record.
Do you need to load all of the 100K records into memory and perform operations on them? If yes, ask yourself whether you really need to do that, and why executing a stored procedure across the 100K records wouldn't achieve the same thing. Do some analysis and see what the actual data usage pattern is. Maybe the user performs a search that returns 100K records but only navigates through the first 200. Take a Google search as an example: hardly anyone goes past page 3 of the millions of results.
If the answer is still yes, you really do need to load all 100K records into memory and operate on them, then maybe you should consider something else, like a custom-built write-through cache of lightweight objects, perhaps with lazily loaded pointers for nested objects, etc. One instance where I use something like this is large product catalogs for e-commerce sites, where very large numbers of searches are executed against the catalog. Why? In order to provide custom behavior such as early-exit search, regex wildcard search using pre-compiled regexes, or custom hash-table indexes into the product catalog.
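As a very rough sketch of what I mean by a lightweight cached catalog with pre-compiled regex search and an early exit (all names are invented):

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Lightweight read-only projection of a catalog row (not the full EF entity).
public sealed record ProductLite(int Id, string Name, string Category);

public sealed class CatalogCache
{
    private readonly List<ProductLite> _products;
    private readonly Dictionary<string, List<ProductLite>> _byCategory;   // custom index

    public CatalogCache(IEnumerable<ProductLite> products)
    {
        _products = new List<ProductLite>(products);
        _byCategory = new Dictionary<string, List<ProductLite>>(StringComparer.OrdinalIgnoreCase);
        foreach (var p in _products)
        {
            if (!_byCategory.TryGetValue(p.Category, out var bucket))
                _byCategory[p.Category] = bucket = new List<ProductLite>();
            bucket.Add(p);
        }
    }

    // Wildcard search with a pre-compiled regex and an early exit once we have enough hits.
    public IReadOnlyList<ProductLite> Search(string wildcardPattern, int maxResults)
    {
        var regex = new Regex("^" + Regex.Escape(wildcardPattern).Replace(@"\*", ".*") + "$",
                              RegexOptions.Compiled | RegexOptions.IgnoreCase);
        var hits = new List<ProductLite>(maxResults);
        foreach (var p in _products)
        {
            if (regex.IsMatch(p.Name)) hits.Add(p);
            if (hits.Count >= maxResults) break;   // early exit: stop scanning once we have a page
        }
        return hits;
    }

    public IReadOnlyList<ProductLite> ByCategory(string category) =>
        _byCategory.TryGetValue(category, out var bucket) ? bucket : Array.Empty<ProductLite>();
}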
There is no one-size-fits-all answer to this question. It all depends on the data usage scenarios and how the application works with the data. Consider gorilla vs. shark: who would win? It all depends on the environment and the context.
Maybe EF6 is perfect for one piece that would benefit from dynamic reflection, while NetTiers is better for another that needs static reflection and an extensible ORM, and low-level ADO is perhaps best for extreme high-performance pieces.
Current situation:
We have a BPMS (business process management suite) in place, and there is increasing demand for historical and operational reports. The data model in the BPMS is not designed for historical queries, so we are analysing possible solutions.
Solution in mind:
The idea is to push data to an external database as events occur in the flow. Typical events in BPM are: a new process instance was created, a step in the process was performed, or the status of the process instance changed. Besides the star schema, Data Vault is one of the interesting alternatives. Let's assume there are two hubs, PI (process item instances) and OU (organisational units), and a link table LINK_PI_OU. Each time a process item is assigned to an organisational unit, a new row is added to the link table. The LOAD_DATE in the link table contains the datetime when the record was added, so the record with the latest LOAD_DATE shows the current assignment of the process instance.
Question:
Let's assume the business wants to know to whom all open process instances are currently assigned, grouped by organisational unit.
What would a query for this report look like? Can it really be performant?
Or am I on completely the wrong track?
In general terms, I don't think Data Vault is intended to be an end-user reporting layer or even a faux transactional system.
I'm not completely clear on your architecture, but in my understanding the D-V is a historical repository that keeps all data for an enterprise and feeds a (Kimball/Inmon) data warehouse. So in high-level terms...
Transaction systems => D-V => DWH => (cubes =>) users
This being the case, I wouldn't be posing queries to a Data Vault; instead, I would write some ETL to populate a data warehouse and pose queries at the DWH.
The other view, I guess, is that you could build a set of views on top of the D-V that would hide the structure from users, but I'm a bit of a purist and would go for a DWH.
As @Marcud D said, Data Vault is a data warehouse modelling approach, and usually when using DV modelling you have to build data marts from the DV for reporting purposes. I think the organizational unit should be modelled as a Satellite table, not as a Hub table. So, either way, you should build a query to feed a specific data mart from the DV model and then use that for reporting purposes.
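For illustration, a query that feeds such a data mart from the DV model might look roughly like the following. It assumes a status satellite (here called SAT_PI) in addition to the hubs and link table described in the question, and every column name is a guess:

// Illustrative only: the SAT_PI status satellite and all column names below are assumptions
// layered on top of the HUB/LINK tables described in the question.
static class AssignmentMartLoad
{
    public const string Sql = @"
        WITH latest_link AS (
            SELECT l.PI_HK, l.OU_HK,
                   ROW_NUMBER() OVER (PARTITION BY l.PI_HK ORDER BY l.LOAD_DATE DESC) AS rn
            FROM LINK_PI_OU l
        ),
        latest_status AS (
            SELECT s.PI_HK, s.STATUS,
                   ROW_NUMBER() OVER (PARTITION BY s.PI_HK ORDER BY s.LOAD_DATE DESC) AS rn
            FROM SAT_PI s
        )
        SELECT ou.OU_BK AS organisational_unit, COUNT(*) AS open_instances
        FROM latest_link ll
        JOIN latest_status ls ON ls.PI_HK = ll.PI_HK AND ls.rn = 1
        JOIN HUB_OU ou        ON ou.OU_HK = ll.OU_HK
        WHERE ll.rn = 1 AND ls.STATUS = 'OPEN'
        GROUP BY ou.OU_BK;";

    static void Main() => System.Console.WriteLine(Sql);
}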