PolyBase external tables vs. OPENROWSET serverless SQL pool architecture

I am in search of performance benchmarks for querying parquet files in ADLS with a standard dedicated SQL pool using external tables with PolyBase vs. a serverless SQL pool with OPENROWSET views.
From my baseline queries on a 1.5 billion record table, OPENROWSET in the serverless SQL pool does appear to be around 30% faster in elapsed time for the same query, but what is the architecture that powers that? Are there any readily available performance benchmarks?

The architecture behind Azure Synapse serverless SQL pools, and how they achieve such strong performance, is described in the "Polaris" paper:
http://www.vldb.org/pvldb/vol13/p3204-saborit.pdf
Performance benchmarks have been published on multiple blogs. Be aware that these can only be a snapshot in time, as those features are constantly being improved.
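For reference, here is roughly what such a serverless query looks like when issued through the standard SQL Server JDBC driver. This is a minimal sketch: the workspace endpoint, credentials, and storage path are hypothetical placeholders, and the serverless pool still needs permission to read the storage account.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ServerlessParquetQuery {
    public static void main(String[] args) throws Exception {
        // Hypothetical serverless endpoint and credentials -- replace with your own.
        String url = "jdbc:sqlserver://myworkspace-ondemand.sql.azuresynapse.net:1433;"
                   + "database=mydb;user=myuser;password=<password>;encrypt=true";

        // OPENROWSET reads the parquet files directly from ADLS; no external
        // table or PolyBase setup is needed on the serverless side.
        String sql =
            "SELECT COUNT(*) AS cnt "
          + "FROM OPENROWSET( "
          + "    BULK 'https://myaccount.dfs.core.windows.net/mycontainer/mytable/*.parquet', "
          + "    FORMAT = 'PARQUET' "
          + ") AS result";

        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                System.out.println("row count: " + rs.getLong("cnt"));
            }
        }
    }
}
```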

Related

Neo4j Restful VS Neo4j JDBC

What are the comparative advantages of querying a neo4j DB via
REST API
JDBC
as a Spring Data plugin
Performance will be better within Java using JDBC as opposed to a REST API. Here's a good explanation of why:
When you add complexity, the code will run slower. Introducing a REST service if it's not required will slow the execution down, as the system is doing more.
Abstracting the database is good practice. If you're worried about speed you could look into caching the data in memory so that the database doesn't need to be touched to handle the request.
Before optimizing performance, though, I'd look into what problem you're trying to solve and the architecture you're using; I'm struggling to think of a situation where the database options would be direct access vs. REST.
Regarding using Neo4j as a Spring Data plugin: you can certainly do so, but I have to imagine the performance would not be as good as with JDBC.
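To make the JDBC option concrete, here is a minimal sketch using the Neo4j JDBC driver; the Bolt URL, credentials, and graph model are placeholder assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class Neo4jJdbcExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical local instance and credentials -- replace with your own.
        String url = "jdbc:neo4j:bolt://localhost:7687";

        try (Connection conn = DriverManager.getConnection(url, "neo4j", "<password>");
             Statement stmt = conn.createStatement();
             // The Neo4j JDBC driver passes the Cypher text through to the server.
             ResultSet rs = stmt.executeQuery(
                 "MATCH (p:Person)-[:KNOWS]->(f) RETURN p.name AS name, count(f) AS friends")) {
            while (rs.next()) {
                System.out.println(rs.getString("name") + " knows " + rs.getLong("friends"));
            }
        }
    }
}
```

Compared to a REST round trip, this keeps the call path short: one driver, one connection, no intermediate HTTP service to serialize through.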
From the book "Graph Databases" - Ian Robinson
Queries run fastest when the portions of the graph needed to satisfy
them reside in main memory (that is, in the filesystem cache and the
object cache). A single graph database instance today can hold many
billions of nodes, relationships, and properties, meaning that some
graphs will be just too big to fit into main memory.
If you add another layer to the app, it will be reflected in performance: the more directly you can consume your data, the better the performance, but also the greater the complexity and the deeper the understanding of the code required.

Graph database in distributed env

I have a question concerning graph databases: is there a mechanism to use graph databases in a distributed environment? I mean, can you distribute a graph database? Can you even traverse a graph database in a distributed environment?
Definitely you can do it.
There are several graph databases that scale very well nowadays (JanusGraph, OrientDB, ArangoDB, etc.).
Even if you have a very big database that has to be scaled beyond a single datacenter to multiple geo-distributed datacenters, you still have options.
For example, you can use JanusGraph with the Cassandra / ScyllaDB storage backends, which gives you the option to asynchronously replicate all your data across datacenters.
Of course, there are still issues to solve, such as consistency, but with today's tools it's very possible to organize a distributed graph database.
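As an illustration, opening a JanusGraph instance backed by Cassandra takes only a few lines. This is a sketch with a hypothetical contact point; cross-datacenter replication is then configured on the Cassandra side (e.g. with NetworkTopologyStrategy), not in the application code.

```java
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.JanusGraphFactory;

public class DistributedGraphExample {
    public static void main(String[] args) {
        // Hypothetical Cassandra contact point for the storage backend.
        JanusGraph graph = JanusGraphFactory.build()
                .set("storage.backend", "cql")
                .set("storage.hostname", "cassandra.example.com")
                .open();

        // Traversals look the same whether the backend is a single node
        // or a geo-distributed Cassandra cluster.
        GraphTraversalSource g = graph.traversal();
        long count = g.V().hasLabel("person").count().next();
        System.out.println("person vertices: " + count);

        graph.close();
    }
}
```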
Neo4j Enterprise Edition features clustering; read more at http://neo4j.com/docs/stable/ha.html.
Yes, you can use all sorts of graph databases in distributed environments. Can you distribute a graph database? Definitely yes.
BUT - distributing the same graph database in many different places (to speed up reads) is quite easy, and done all the time. Distributing a ridiculously massive database (so that parts of the graph database are in a bunch of different places) is quite hard.
I recommend this related question which talks about sharding and distributing databases. Pay particular attention to the bit about "sharding is an anti-pattern".

Neo4j sharding aspect

I was looking on the scalability of Neo4j, and read a document written by David Montag in January 2013.
Concerning the sharding aspect, he said the 1st release of 2014 would come with a first solution.
Does anyone know if it was done or its status if not?
Thanks!
Disclosure: I'm working as VP Product for Neo Technology, the sponsor of the Neo4j open source graph database.
Now that we've just released Neo4j 2.0 (actually 2.0.1 today!) we are embarking on a 2.1 release that is mostly oriented around (even more) performance & scalability. This will increase the upper limits of the graph to an effectively unlimited number of entities, and improve various other things.
Let me set some context first, and then answer your question.
As you probably saw from the paper, Neo4j's current horizontal-scaling architecture allows read scaling, with writes all going to master and fanning out. This gets you effectively unlimited read scaling, and into the tens of thousands of writes per second.
Practically speaking, there are production Neo4j customers (including Snap Interactive and Glassdoor) with around a billion people in their social graph... in all cases behind an active and heavily-hit web site, handled by comparatively modest Neo4j clusters (no more than 5 instances). So that's one key feature: the Neo4j of today has incredible computational density, and so we regularly see fairly small clusters handling substantially large production workloads... with very fast response times.
More on the current architecture can be found here: www.neotechnology.com/neo4j-scales-for-the-enterprise/
And a list of customers (which includes companies like Wal-Mart and eBay) can be found here: neotechnology.com/customers/ One of the world's largest parcel delivery carriers uses Neo4j to route all of their packages, in real time, with peaks of 3000 routing operations per second, and zero downtime. (This arguably is the world's largest and most mission-critical use of a graph database and of a NOSQL database; though unfortunately I can't say who it is.)
So in one sense the tl;dr is that if you're not yet as big as Wal-Mart or eBay, then you're probably ok. That oversimplifies it only a bit. There is the 1% of cases where you have sustained transactional write workloads into the 100s of thousands per second. However even in those cases it's often not the right thing to load all of that data into the real-time graph. We usually advise people to do some aggregation or filtering, and bring only the more important things into the graph. Intuit gave a good talk about this. They filter a billion B2B transactions into a much smaller number of aggregate monthly transaction relationships with aggregated counts and currency amounts by direction.
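A minimal sketch of that aggregation idea (the transaction record and field names here are hypothetical): instead of loading every raw transaction into the graph as its own relationship, roll them up to one relationship per company pair per month.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TransactionRollup {
    // Hypothetical raw transaction: src pays dst some amount in a given month.
    record Txn(String src, String dst, String month, double amount) {}

    // Aggregated edge value: count and total amount for one (src, dst, month) key.
    static class Agg { long count; double total; }

    static Map<String, Agg> rollUp(List<Txn> txns) {
        Map<String, Agg> edges = new HashMap<>();
        for (Txn t : txns) {
            String key = t.src() + "->" + t.dst() + "@" + t.month();
            Agg a = edges.computeIfAbsent(key, k -> new Agg());
            a.count++;
            a.total += t.amount();
        }
        // One graph relationship per key instead of one per raw transaction.
        return edges;
    }

    public static void main(String[] args) {
        var txns = List.of(
            new Txn("acme", "globex", "2014-01", 120.0),
            new Txn("acme", "globex", "2014-01", 80.0),
            new Txn("globex", "acme", "2014-01", 50.0));
        rollUp(txns).forEach((k, a) ->
            System.out.println(k + ": count=" + a.count + ", total=" + a.total));
    }
}
```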
Enter sharding... Sharding has gained a lot of popularity these days. This is largely thanks to the other three categories of NOSQL, where joins are an anti-pattern and most queries involve reading or writing just a single piece of discrete data. Just as joining is an anti-pattern for key-value stores and document databases, sharding is an anti-pattern for graph databases. What I mean by that is: the very best performance will occur when all of your data is available in memory on a single instance, because hopping back and forth all over the network whenever you're reading and writing will slow things down significantly, unless you've been really really smart about how you distribute your data... and even then. Our approach has been twofold:
1. Do as many smart things as possible in order to support extremely high read & write volumes without having to resort to sharding. This gets you the best and most predictable latency and efficiency. In other words: if we can be good enough to support your requirement without sharding, that will always be the best approach. The link above describes some of these tricks, including the deployment pattern that lets you shard your data in memory without having to shard it on disk (a trick we call cache sharding; see the sketch after this list). There are other tricks along similar lines, and more coming down the pike...
2. Add a secondary architecture pattern into Neo4j that does support sharding. Why do this if sharding is best avoided? As more people find more uses for graphs, and data volumes continue to increase, we think it will eventually be important and inevitable. This would allow you to run all of Facebook, for example, in one Neo4j cluster (a pretty huge one)... not just the social part of the graph, which we can handle today. We've already done a lot of work on this, and have an architecture developed that we believe balances the many considerations. This is a multi-year effort, and while we could very easily release a version of Neo4j that shards naively (that would no doubt be really popular), we probably won't do that. We want to do it right, which amounts to rocket science.
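To illustrate the cache-sharding trick from point 1, here is a toy routing sketch (the replica addresses and the choice of user ID as routing key are hypothetical). Every replica holds the full graph on disk, but because each user is consistently routed to the same instance, each instance's in-memory cache warms up with a different slice of the graph.

```java
import java.util.List;

public class CacheShardingRouter {
    // Hypothetical replica endpoints; each holds the FULL graph on disk.
    private final List<String> replicas;

    CacheShardingRouter(List<String> replicas) {
        this.replicas = replicas;
    }

    // Consistently route a user to one replica, so that replica's cache
    // ends up holding that user's neighborhood of the graph in memory.
    String replicaFor(String userId) {
        int idx = Math.floorMod(userId.hashCode(), replicas.size());
        return replicas.get(idx);
    }

    public static void main(String[] args) {
        var router = new CacheShardingRouter(
            List.of("neo4j-a:7474", "neo4j-b:7474", "neo4j-c:7474"));
        System.out.println(router.replicaFor("alice")); // always the same replica
    }
}
```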
TL;DR: With 2018 just days away, Neo4j still does not support sharding as it is typically understood.
Details: Neo4j still requires all data to fit on a single node. The node's contents can be replicated within a cluster, but actual sharding is not part of the picture.
When Neo4j talks of sharding, it is referring to caching portions of the database in memory: different slices are cached on different replicated nodes. That differs from, say, MySQL sharding, in which each node contains only a portion of the total data.
Here is a summary of their "take" on scalability (their product term is "High Availability"): https://neo4j.com/blog/neo4j-scalability-infographic/. Note that high availability is not the same as scalability, so they do not actually support the latter in the traditional understanding of the term.

In-memory database scalability

I have been exploring MMDB systems lately and I haven't been able to find much information about how an in-memory database is supposed to scale. My quite basic assumption is that a main-memory DB is constrained by the memory available on the DB node, and by the operating system's management of that memory. So how can I expand an in-memory system beyond the size of the main memory available? I assume the answer is along the lines of a distributed system, but I haven't got it clear in my head how that would work. And of course it's also possible I completely misunderstood the idea of an MMDB and I'm missing something obvious.
A bit of background on the question: I am writing a number of cross-platform mobile apps (even though my background is heavily involved with MySQL and MongoDB), and I don't like native database solutions like SQLite for Android and iOS. So I thought I'd write my own solution (site and github) in JavaScript (I'm working on Cordova/PhoneGap). I realised that I could make this a Node.js module and use it as a DB for a web app (I'm creating a blog powered by it as an experiment and it's working pretty well), but of course I'm now thinking of making it a separate tier, and I started thinking about the obvious limitation of memory size, hence my question.
In-memory databases scale in size the same way as on-disk (aka persistent) databases do: either throw more storage at it (memory, in this case) or distribute it across multiple nodes of a cluster. The latter alternative increases the complexity (both of the DBMS and of your administration of it) relative to an in-memory database on a single system. Consider the difference between vanilla MySQL and MySQL Cluster. And you'll want a really fast network for those times when the DBMS has to perform inter-node operations (e.g., distributing the data, or pulling data from multiple nodes to satisfy a query).
There's nothing particularly special about in-memory databases in this regard. There are some special optimizations in the database engine when you know storage is memory. But it doesn't change the fundamental principles of database systems.
What you don't want to do is create an in-memory database larger than physical memory. You'll force the OS to swap in-memory database pages in and out of swap space, and the performance will suck. In that case you're better off using a conventional DBMS and giving it as much cache as you have memory available for. The DBMS will use that cache more intelligently than the OS will use the swap space.
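To make the scale-out alternative from the first paragraph concrete, here is a toy sketch of hash partitioning, with plain in-process maps standing in for cluster nodes. A real distributed in-memory DBMS adds replication, rebalancing, and networking on top, but the core idea of splitting the keyspace so each node only holds its share in RAM looks like this:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PartitionedStore {
    // Each map stands in for one node's RAM.
    private final List<Map<String, String>> nodes = new ArrayList<>();

    PartitionedStore(int nodeCount) {
        for (int i = 0; i < nodeCount; i++) nodes.add(new HashMap<>());
    }

    // Hash partitioning: each key lives on exactly one node, so total
    // capacity is the sum of all nodes' memory rather than one machine's.
    private Map<String, String> nodeFor(String key) {
        return nodes.get(Math.floorMod(key.hashCode(), nodes.size()));
    }

    void put(String key, String value) { nodeFor(key).put(key, value); }
    String get(String key) { return nodeFor(key).get(key); }

    public static void main(String[] args) {
        PartitionedStore store = new PartitionedStore(3);
        store.put("user:42", "alice");
        System.out.println(store.get("user:42")); // alice
    }
}
```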
Current production-ready in-memory databases have mainly focused on scale-up as opposed to scale-out. So far, they have either managed to integrate a main-memory tier into their core, existing architecture (IBM via BLU Acceleration) or have rebuilt the database almost from scratch to leverage main memory as the primary storage layer (SAP HANA), and in both cases their claim to fame is the obvious speedup that DRAM offers in comparison to disk.
However, very few databases presently have a complete offering that scales the in-memory performance benefit out across multiple nodes. Most in-memory databases require the application to manage the distribution of data/objects across nodes (e.g., SAP HANA).
Oracle's DBIM and MemSQL are a few scalable and distributed options, at this time, that implement a distributed in-memory database/tier through collective utilization of memory resources across the cluster (RAC in the case of Oracle). MemSQL can be deployed on a cluster of commodity compute nodes and claims to scale by utilizing aggregate resources, including memory. Oracle RAC is a shared-cache architecture that overcomes the limitations of traditional shared-nothing and shared-disk approaches to provide highly scalable and available database solutions, including in-memory benefits.

Can SQL Azure scale without any specific technique or administration (partitioning/replication...)?

Can SQL Azure scale without any specific technique or administration like Google App Engine's BigTable? No manual partitioning or replication required?
Do you mean scale to meet increasing demand, or do you mean increase in size to accommodate additional data?
With respect to size: you pick the "edition" of the database (Web or Business); each has different size limitations, and you are billed based on size only. The max size is 50 GB. Once the edition is picked, capacity will increase up to the maximum allowed to accommodate your data. You do nothing special.
With respect to scaling to meet performance demands: you are abstracted away from managing really anything to do with scalability from the SQL Azure perspective. Your database is colocated with other databases on various SQL servers running in a Microsoft data center. Theoretically your database will be moved to a less-busy server if it becomes too hot... however, SQL Azure is not considered a highly scalable solution (i.e., Facebook/Twitter quality).
If you need mega-scalability, you'll need to go with Azure Table Storage.
For the majority of applications, SQL Azure will scale just fine.
"Will it scale?" Now that is the question a lot of us wonder about SQL Azure. Especially since you can't tell it how much Ram, CPU Cores or replicated servers with load balancing to allocate. With Windows Azure you can tell it how many of each resource you want your application hosted on, but that isn't the case with SQL Azure. This may sound really bad to some, but SQL Azure is designed to "automagically" scale the database server to your needs. What that means I honestly can't say, as I haven't (as of yet) found much official information from Microsoft on that topic.
With extremely high-traffic sites, such as Facebook and Twitter, it has been suggested that non-relational databases (such as Azure Table Storage) can scale better, since the database has less overhead when querying data. If you need relational database features (such as foreign-key relationships and SQL join functionality) then you probably want to use SQL Azure.
It's not as clear-cut as "to SQL Azure, or not to SQL Azure." There are database architecture design patterns that can be used, such as denormalizing database tables (requiring fewer joins per query) and horizontally partitioning your data, to allow your design to scale better.
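For example, horizontal partitioning can be implemented at the application layer by routing each customer to one of several databases. A sketch, where the shard connection strings and the range split are hypothetical:

```java
public class ShardRouter {
    // Hypothetical shard connection strings, each a separate SQL Azure database.
    private static final String[] SHARDS = {
        "Server=tcp:shard0.database.windows.net;Database=app0;<credentials>",
        "Server=tcp:shard1.database.windows.net;Database=app1;<credentials>",
        "Server=tcp:shard2.database.windows.net;Database=app2;<credentials>"
    };

    // Range-based partitioning: customers 0-999,999 on shard 0, and so on.
    static String connectionStringFor(long customerId) {
        int shard = (int) (customerId / 1_000_000L) % SHARDS.length;
        return SHARDS[shard];
    }

    public static void main(String[] args) {
        System.out.println(connectionStringFor(42L));        // shard 0
        System.out.println(connectionStringFor(1_500_000L)); // shard 1
    }
}
```

The trade-off is that cross-shard queries and joins now have to be stitched together in application code, which is why denormalizing to keep each query within a single shard matters.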
A Hybrid or Mixed solution of both SQL Azure and Azure Table Storage can be used too. If you have some data that requires relational queries, then put it in SQL Azure. If you have data that does not require a relational database, then you could put it in Azure Table Storage.
Remember, the database design is part of the overall architecture of your application, and you should plan it out just as much as you plan whether to use TDD, IoC, and dependency injection. After all, if your database can't scale, it doesn't matter how awesome the application code is.
As an aside, thinking about this topic makes me wonder what Xbox Live and Bing use for their database needs. Is it relational, non-relational, or hybrid?
