How to define Data Warehouse Units (DWUs) in Synapse?

Azure Synapse offers DWU options for a dedicated SQL pool, starting from 100 DWUs, which according to the documentation consist of compute + memory + IO.
But how do I check what type of compute node it is? The documentation says that 100 DWU consists of 1 compute node with 60 distributions and 60 GB of memory,
but what is the configuration of that compute node?
Or, if the configuration can't be found, how do I calculate the DWUs required to process 10 GB of data?

A dedicated SQL pool (formerly SQL DW) represents a collection of analytic resources that are being provisioned. Analytic resources are defined as a combination of CPU, memory, and IO.
These three resources are bundled into units of compute scale called Data Warehouse Units (DWUs). A DWU represents an abstract, normalized measure of compute resources and performance.
What is the configuration of the Compute node, and how do you check what type of compute node it is?
Each Compute node has a node ID that is visible in system views. You can see the Compute node ID by looking for the node_id column in system views whose names begin with sys.pdw_nodes.
For a list of these system views, see Synapse SQL system views.
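As a minimal sketch, you could pull that information from Python with pyodbc; the server, database, and credentials below are placeholders, and the query uses the sys.dm_pdw_nodes DMV, which returns one row per node with its ID and type.

# Minimal sketch: list the nodes of a dedicated SQL pool and their types.
# Connection details are placeholders; run this against the pool itself.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=yourworkspace.sql.azuresynapse.net;"   # placeholder
    "DATABASE=yourdedicatedpool;"                  # placeholder
    "UID=sqladminuser;PWD=..."                     # placeholder
)

cursor = conn.cursor()
cursor.execute("SELECT pdw_node_id, type, name FROM sys.dm_pdw_nodes")
for node_id, node_type, name in cursor.fetchall():
    print(node_id, node_type, name)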
How do you calculate the DWUs required to process 10 GB of data?
Dedicated SQL pool (formerly SQL DW) is a scale-out system that can provision vast amounts of compute and query sizeable quantities of data.
To see its true capabilities for scaling, especially at larger DWUs, we recommend scaling the data set as you scale to ensure that you have enough data to feed the CPUs.
For more information, refer to Data Warehouse Units (DWUs) for dedicated SQL pool (formerly SQL DW) in Azure Synapse Analytics.

How many cores do you think you need to process 10 GB of data? It's a pretty complicated question to answer.
What are your queries doing? Are you doing 20 self-joins? How much tempdb space is needed?
The best way to find out is to run experiments for your particular workload and resize so that you use about 80% of the available resources.
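A rough sketch of what that experiment loop might look like, assuming pyodbc, placeholder connection details, and a pool named yourpool; scaling a dedicated SQL pool is done by running ALTER DATABASE ... MODIFY (SERVICE_OBJECTIVE = ...) against the master database.

# Rough sketch (not a definitive method): try the workload at several DWU
# settings and pick the smallest one that keeps utilization around 80%.
# Pool name and connection details are placeholders.
import time
import pyodbc

master = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=yourworkspace.sql.azuresynapse.net;DATABASE=master;"
    "UID=sqladminuser;PWD=...",                    # placeholder
    autocommit=True)

for objective in ["DW100c", "DW200c", "DW400c"]:
    # Scaling is an ALTER DATABASE against master; it can take a few minutes
    # and the pool is briefly unavailable while it resizes.
    master.execute(f"ALTER DATABASE yourpool MODIFY (SERVICE_OBJECTIVE = '{objective}')")
    # ... wait for the pool to come back online, then run the 10 GB workload.
    start = time.time()
    # run_workload()   # placeholder for your own queries
    print(objective, "elapsed:", time.time() - start)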

Related

Apache Beam: do all keys have to fit into memory on a worker?

Assume I have an unbounded dataset with extremely high cardinality, say more than 1,000,000,000 unique keys, and that I want to count by key over fixed windows.
My understanding is that the combine function will essentially maintain an in-memory accumulator for each key on each machine.
Question 1
Is the above assumption correct, or can workers flush keys and accumulators to disk when under memory pressure?
Question 2 (assuming above correct)
Assuming the data is not naturally partitioned (e.g. reading from Pub/Sub), would we run out of memory on each worker, since every machine may in theory see every key and have to maintain an in-memory structure for each one?
Question 3 (assuming above correct)
Suppose we store the data in Kafka and split it into partitions based on the key we are counting on, and assume one Beam worker reads from one partition, so each worker only sees a consistent subset of the keyspace. In this scenario, would the memory use of the workers be any different?
Beam is meant to be highly scalable; there are Beam pipelines that run on Dataflow with many trillions of unique keys.
When running a combining operation in Beam a table of keys and aggregated values is kept in memory, but when the table becomes full it is flushed to disk (well, technically, to shuffle) so it will not run out of memory. Another worker will read this data out of shuffle, one value at a time, to compute the final aggregate over all upstream worker outputs.
As for your other two questions, if your input is naturally partitioned by key such that each worker only sees a subset of keys it is possible that more combining could happen before the shuffle, leading to less data being shuffled, but this is by no means certain and the effects would likely be small. In particular, memory considerations won't change.
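Purely as an illustration of the pattern being discussed (the Pub/Sub topic and the message format are placeholders), a count-per-key over fixed windows in the Beam Python SDK looks roughly like this; CombinePerKey is the step that gets lifted into a pre-shuffle partial combine.

# Illustration only: count per key over 60-second fixed windows.
# The Pub/Sub topic is a placeholder, and a real unbounded pipeline would
# also need streaming pipeline options.
import apache_beam as beam
from apache_beam.transforms.window import FixedWindows

with beam.Pipeline() as p:
    (p
     | "Read" >> beam.io.ReadFromPubSub(topic="projects/your-project/topics/your-topic")
     | "KeyByField" >> beam.Map(lambda msg: (msg.decode("utf-8").split(",")[0], 1))  # assumes "key,..." messages
     | "Window" >> beam.WindowInto(FixedWindows(60))
     | "CountPerKey" >> beam.CombinePerKey(sum)
     | "Print" >> beam.Map(print))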

How does Titan achieve constant time lookup using HBase / Cassandra?

In chapter 6 of the O'Reilly book "Graph Databases", which is about how Neo4j stores a graph database, it says:
To understand why native graph processing is so much more efficient than graphs based on heavy indexing, consider the following. Depending on the implementation, index lookups could be O(log n) in algorithmic complexity versus O(1) for looking up immediate relationships. To traverse a network of m steps, the cost of the indexed approach, at O(m log n), dwarfs the cost of O(m) for an implementation that uses index-free adjacency.
It is then explained that Neo4j achieves this constant time lookup by storing all nodes and relationships as fixed size records:
With fixed sized records and pointer-like record IDs, traversals are implemented simply by chasing pointers around a data structure, which can be performed at very high speed. To traverse a particular relationship from one node to another, the database performs several cheap ID computations (these computations are much cheaper than searching global indexes, as we'd have to do if faking a graph in a non-graph native database)
This last sentence triggers my question: how does Titan, which uses Cassandra or HBase as a storage backend, achieve these performance gains or make up for it?
Neo4j only achieves O(1) when the data is in-memory in the same JVM. When the data is on disk, Neo4j is slow because of pointer chasing on disk (they have a poor disk representation).
Titan only achieves O(1) when the data is in-memory in the same JVM. When the data is on disk, Titan is faster than Neo4j because it has a better disk representation.
Please see the following blog post that explains the above quantitatively:
http://thinkaurelius.com/2013/11/24/boutique-graph-data-with-titan/
Thus, it's important to understand, when people say O(1), what part of the memory hierarchy they are in. When you are in a single JVM (single machine), it's easy to be fast, as both Neo4j and Titan demonstrate with their respective caching engines. When you can't put the entire graph in memory, you have to rely on intelligent disk layouts, distributed caches, and the like.
Please see the following two blog posts for more information:
http://thinkaurelius.com/2013/11/01/a-letter-regarding-native-graph-databases/
http://thinkaurelius.com/2013/07/22/scalable-graph-computing-der-gekrummte-graph/
OrientDB uses a similar approach, where relationships are managed without indexes (index-free adjacency) but rather with direct pointers (LINKS) between vertices. It's like in-memory pointers, but on disk. In this way OrientDB achieves O(1) traversal both in memory and on disk.
But if you have a vertex "City" with thousands of edges to the vertices "Person", and you're looking for all the people with age > 18, then OrientDB uses indexes because a query is involved, so in this case it's O(log N).
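To make the fixed-size-record argument concrete, here is a toy sketch (not the actual on-disk format of Neo4j, Titan, or OrientDB) of why pointer-like record IDs give O(1) lookups while a global index gives O(log n).

# Toy illustration of index-free adjacency vs. an index lookup.
# With fixed-size records, a record ID converts to a byte offset by simple
# arithmetic (no search); a sorted global index must be binary-searched.
import bisect

RECORD_SIZE = 15  # bytes per record in this toy format, not a real engine's layout

def record_offset(record_id):
    # O(1): pointer-like ID -> byte offset in the store file.
    return record_id * RECORD_SIZE

def index_lookup(sorted_keys, key):
    # O(log n): binary search in a global index.
    i = bisect.bisect_left(sorted_keys, key)
    return i if i < len(sorted_keys) and sorted_keys[i] == key else None

print(record_offset(1_000_000))          # a single multiplication
print(index_lookup(list(range(0, 100, 2)), 42))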

Data storage for time series data

I have some scientific measurement data which should be permanently stored in a data store of some sort.
I am looking for a way to store measurements from 100 000 sensors with measurement data accumulating over years to around 1 000 000 measurements per sensor. Each sensor produces a reading once every minute or less frequently. Thus the data flow is not very large (around 200 measurements per second in the complete system). The sensors are not synchronized.
The data itself comes as a stream of triplets: [timestamp] [sensor #] [value], where everything can be represented as a 32-bit value.
In the simplest form this stream would be stored as-is into a single three-column table. Then the query would be:
SELECT timestamp,value
FROM Data
WHERE sensor=12345 AND timestamp BETWEEN '2013-04-15' AND '2013-05-12'
ORDER BY timestamp
Unfortunately, with row-based DBMSs this will give a very poor performance, as the data mass is large, and the data we want is dispersed almost evenly into it. (Trying to pick a few hundred thousand records from billions of records.) What I need performance-wise is a reasonable response time for human consumption (the data will be graphed for a user), i.e. a few seconds plus data transfer.
Another approach would be to store the data from one sensor into one table. Then the query would become:
SELECT timestamp,value
FROM Data12345
WHERE timestamp BETWEEN '2013-04-15' AND '2013-05-12'
ORDER BY timestamp
This would give a good read performance, as the result would be a number of consecutive rows from a relatively small (usually less than a million rows) table.
However, the RDBMS would then need 100 000 tables, all of which are in use within any given few minutes. This does not seem to be possible with common systems. On the other hand, an RDBMS does not seem to be the right tool anyway, as there are no relations in the data.
I have been able to demonstrate that a single server can cope with the load by using the following mickeymouse system:
Each sensor has its own file in the file system.
When a piece of data arrives, its file is opened, the data is appended, and the file is closed.
Queries open the respective file, find the starting and ending points of the data, and read everything in between.
Very few lines of code. The performance depends on the system (storage type, file system, OS), but there do not seem to be any big obstacles.
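For reference, a sketch of that file-per-sensor scheme in Python; the directory and the record format (two packed 32-bit values per reading) are assumptions, not a recommendation.

# Sketch of the file-per-sensor approach: append a fixed-size record per
# reading, then scan the fixed-size records for the requested time window.
import os
import struct

RECORD = struct.Struct(">II")   # 32-bit timestamp, 32-bit value
DATA_DIR = "/data/sensors"      # placeholder

def append_reading(sensor_id, timestamp, value):
    with open(os.path.join(DATA_DIR, f"{sensor_id}.bin"), "ab") as f:
        f.write(RECORD.pack(timestamp, value))

def read_range(sensor_id, t_start, t_end):
    # Readings arrive roughly in time order, so a linear scan (or a binary
    # search over the fixed-size records) finds the requested window.
    out = []
    with open(os.path.join(DATA_DIR, f"{sensor_id}.bin"), "rb") as f:
        data = f.read()
    for ts, val in RECORD.iter_unpack(data):
        if t_start <= ts <= t_end:
            out.append((ts, val))
    return out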
However, if I go down this road, I end up writing my own code for partitioning, backing up, moving older data deeper down in the storage (cloud), etc. Then it sounds like rolling my own DBMS, which sounds like reinventing the wheel (again).
Is there a standard way of storing the type of data I have? Some clever NoSQL trick?
Seems like a pretty easy problem, really. 100 billion records at 12 bytes per record comes to 1.2 TB, which isn't even a large volume for modern HDDs. In LMDB I would consider using a subDB per sensor. Then your key/value pair is just a 32-bit timestamp / 32-bit sensor reading, and all of your data retrievals will be simple range scans on the key. You can easily retrieve on the order of 50M records/sec with LMDB. (See the SkyDB guys doing just that: https://groups.google.com/forum/#!msg/skydb/CMKQSLf2WAw/zBO1X35alxcJ)
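A sketch of that layout with the Python lmdb bindings, where the path, map size, and sub-database naming are placeholders: one named sub-database per sensor, a big-endian 32-bit timestamp as the key, and a cursor range scan for the query.

# Sketch of the suggested LMDB layout. Big-endian timestamps keep keys in
# time order, so a range scan returns readings sorted by time.
import struct
import lmdb

env = lmdb.open("/data/lmdb", map_size=2 * 1024**4, max_dbs=100_000)  # placeholders

def put_reading(sensor_id, timestamp, value):
    db = env.open_db(f"sensor-{sensor_id}".encode())
    with env.begin(db=db, write=True) as txn:
        txn.put(struct.pack(">I", timestamp), struct.pack(">I", value))

def read_range(sensor_id, t_start, t_end):
    db = env.open_db(f"sensor-{sensor_id}".encode())
    with env.begin(db=db) as txn:
        cur = txn.cursor()
        cur.set_range(struct.pack(">I", t_start))   # first key >= t_start
        for key, val in cur:
            ts = struct.unpack(">I", key)[0]
            if ts > t_end:
                break
            yield ts, struct.unpack(">I", val)[0]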
Try VictoriaMetrics as a time series database for large amounts of data.
It is optimized for storing and querying large volumes of time series data.
It uses low disk IOPS and bandwidth thanks to a storage design based on LSM trees, so it can work quite well on HDDs instead of SSDs.
It has a good compression ratio, so 100 billion typical data points would require less than 100 GB of HDD storage. See the technical details on its data compression.

Why is there this Capacity Limit on Nodes and Relationships in neo4j?

I wonder why neo4j has a capacity limit on nodes and relationships. The limit on nodes and relationships is 2^35, which is a "little" bit more than the "normal" 2^32 integer. Common SQL databases, for example MySQL, store their primary key as int (2^32) or bigint (2^64). Can you explain the advantages of this decision? In my opinion this is a key decision point when choosing a database.
It is an artificial limit. They are going to remove it in the not-too-distant future, although I haven't heard any official ETA.
Often enough, you run into hardware limits on a single machine before you actually hit this limit.
The current option is to manually shard your graphs to different machines. Not ideal for some use cases, but it works in other cases. In the future they'll have a way to shard data automatically--no ETA on that either.
Update:
I've learned a bit more about neo4j storage internals. The reason the limits are exactly what they are is that the ID numbers are stored on disk as pointers in several places (node records, relationship records, etc.). To increase the limit by another power of 2, they'd need to add 1 byte per node and 1 byte per relationship; the format is currently packed as far as it will go without needing to use more bytes on disk. Learn more at this great blog post:
http://digitalstain.blogspot.com/2010/10/neo4j-internals-file-storage.html
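As a quick back-of-the-envelope check of those numbers (arithmetic only, not a statement about the exact record layout):

# 2^35 is the ~34 billion figure mentioned below; 2^32 is the "normal" int limit.
print(f"{2 ** 35:,}")   # 34,359,738,368
print(f"{2 ** 32:,}")   # 4,294,967,296
# Adding one byte per record on a graph with, say, 10 billion relationships
# (a hypothetical size) would grow the store by roughly 10 GB.
print(10_000_000_000 / 1e9, "GB")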
Update 2:
I've heard that in 2.1 they'll be increasing these limits to around another order of magnitude higher than they currently are.
As of neo4j 3.0, all of these constraints are removed.
Dynamic pointer compression expands Neo4j’s available address space as needed, making it possible to store graphs of any size. That’s right: no more 34 billion node limits!
For more information visit http://neo4j.com/blog/neo4j-3-0-massive-scale-developer-productivity.

Why don't EMR instances have as many reducers as mappers?

By default during an EMR job, instances are configured to have fewer reducers than mappers. But the reducers aren't given any extra memory, so it seems like they should be able to have the same number. (For instance, extra-large high-CPU instances have 7 mappers but only 2 reducers, yet both mappers and reducers are configured with 512 MB of memory.)
Does anyone know why this is and is there some way I can specify to use as many reducers as mappers?
EDIT: I had the amount wrong, it's 512 MB
Mappers extract data from their input stream (the mapper's STDIN), and what they emit is much more compact. That outbound stream (the mapper's STDOUT) is also then sorted by key. Therefore, the reducers receive smaller, sorted data as their input.
That is pretty much the reason why the default configuration for any Hadoop MapReduce cluster, not just EMR, is to have more mappers than reducers, proportional to the number of cores available to the jobtracker.
You have the ability to control the number of mappers and reducers through the jobconf parameter. The configuration variables are mapred.map.tasks and mapred.reduce.tasks.
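For example, one way (among several) to pass those jobconf values from a Python streaming job is the mrjob library; note that mapred.reduce.tasks is honored directly, while mapred.map.tasks is only a hint because the number of map tasks is driven by input splits.

# One way to set the jobconf parameters from Python, using mrjob.
# The word-count logic is just a stand-in for your own job.
from mrjob.job import MRJob

class CountWords(MRJob):
    JOBCONF = {
        "mapred.map.tasks": 7,      # a hint only; splits decide the real count
        "mapred.reduce.tasks": 7,   # request as many reducers as mappers
    }

    def mapper(self, _, line):
        for word in line.split():
            yield word, 1

    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == "__main__":
    CountWords.run()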

Resources