Query performance on ADLS Gen2 - data warehouse

I'm trying to migrate our "old school" database (mostly time series) to an Azure Data Lake.
So I took a random table (10 years of data, 200M records, 20 GB), copied the data into a single CSV file, and also split the same data into 4,000 daily files (in monthly folders).
On top of those two sets of files, I created two external tables... and I'm getting pretty much the same performance for both of them. (?!?)
No matter what I query, whether I'm looking for data from a single day (and therefore a single small file) or summing over the whole dataset, it basically takes 3 minutes, whether I'm hitting the single file or the 4,000 daily files. It's as if the whole dataset had to be loaded into memory before doing anything ?!?
So is there a setting somewhere that I could change to avoid loading all the data when it's not required? It could literally make my queries 1000x faster.
As far as I understand, indexes are not possible on external tables. Creating a materialized view would defeat the purpose of using a lake.
Full disclosure: I'm new to Azure Data Lake Storage, and I'm trying to see whether it's the correct technology to address our issue.

Best practice is to use the Parquet format rather than CSV: it is a columnar format optimized for OLAP-style queries (a minimal conversion sketch follows below).
With Azure Synapse (preview) you can then use the SQL on-demand engine (serverless), so you do not need to provision a dedicated DW cluster, and you are charged per TB of data scanned.
Alternatively, you can spin up a Synapse cluster and ingest your data into the DW using the COPY command (also in preview).
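To make the Parquet advice concrete, here is a minimal Python sketch, assuming pandas and pyarrow are installed; the file path and column names are hypothetical. It converts a CSV extract (for example one of the daily files) into Parquet files partitioned by year and month, which is the layout that lets a columnar engine skip data it does not need:

# Minimal sketch: convert a CSV extract into year/month-partitioned Parquet.
# Assumes pandas + pyarrow; the file path and column names are hypothetical.
import pandas as pd

df = pd.read_csv("extract/full_table.csv", parse_dates=["timestamp"])
df["year"] = df["timestamp"].dt.year
df["month"] = df["timestamp"].dt.month

# partition_cols produces year=YYYY/month=MM folders, so a query that
# filters on those values only has to read the matching folders.
df.to_parquet(
    "parquet/table/",
    engine="pyarrow",
    partition_cols=["year", "month"],
    index=False,
)

With the data laid out like this, a query that restricts the date range (in Synapse serverless, typically via a filter on the file path) only reads the matching folders instead of scanning the whole 20 GB.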

Related

Influxdb schema design for large amounts of fast data, multiple or single database?

We are using InfluxDB at different industrial sites, where we log up to 10,000 values at sample rates ranging from 1 Hz to 1,000 Hz from 3-5 different machines, resulting in something like 1 GB of data per hour.
The logging is handled by simple HTTP line-protocol calls to an InfluxDB 1.8 server, running on a 2.5 GHz 10-core Xeon with 64 GB RAM and a 6 TB SSD RAID 5 array.
Right now the values are stored in the same database, with a measurement for each machine, a retention policy of 20 weeks and a shard duration of 1 week.
The data is mostly visualized through Grafana.
Many people query the database at once through multiple Grafana dashboards, which tends to be fairly slow when retrieving large amounts of data. No cross-measurement calculations are performed; it is only visual plots.
Will I get any read-speed benefit from using multiple databases instead of a single database with multiple measurements?
When getting data from a database, does Influx need to "open" files containing data from all measurements in order to find the data for a specific measurement?
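For context on the ingest side, this is roughly what the HTTP line-protocol writes described above look like; a minimal Python sketch against the InfluxDB 1.x /write endpoint, with the host, database, measurement, tag and field names made up for the example:

# Minimal sketch of an InfluxDB 1.x line-protocol write over HTTP.
# Host, database, measurement, tag and field names are hypothetical.
import time
import requests

line = "machine_3,sensor=vibration_x value=0.42 {}".format(time.time_ns())
resp = requests.post(
    "http://influx-host:8086/write",
    params={"db": "plant_data", "precision": "ns"},
    data=line,
)
resp.raise_for_status()  # InfluxDB returns 204 No Content on success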

UniData External Data Access (EDA) synchronization

Is it a good idea to continuously use External Data Access (EDA) for synchronizing large files (say, with 10 million records) with an RDBMS? Will EDA also handle incremental updates to the source UniData file and automatically reflect those changes (CREATE, UPDATE and DELETE) in the target RDBMS?
Also, according to the documentation, EDA currently supports MSSQL, Oracle and DB2. Is it possible to configure EDA to work with, for example, PostgreSQL?

Apache Ignite vs Informix Warehouse Accelerator vs Infinispan?

What's the difference between Apache Ignite, IWA (Informix Warehouse Accelerator) and Infinispan?
I have an application that accepts a large volume of data and processes many transactions per second. Both response time and data integrity are very important for us. Which in-memory database is the best solution for me? I'm not sure how to choose between them. I also use JEE, and the application server is JBoss.
We are looking for the best in-memory database solution for processing data in real time.
Update:
I currently use a relational database. I am looking for an in-memory database to select, insert and update against in order to decrease response time. Data integrity is very important, and so is persisting the data to disk.
Apache Ignite and Infinispan are both data grids / memory-centric databases with similar feature lists, the biggest difference being that Apache Ignite has SQL support for querying data (a small sketch follows below).
Informix Warehouse Accelerator seems to be a narrow use-case product, so it's hard to say whether it's useful for your use case or not.
Otherwise, there is too little information in your question about the specifics of your project to say whether either of these is a good fit, or neither.
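As a small illustration of that SQL support, here is a hedged sketch using Apache Ignite's Python thin client (pyignite); the host, table and values are hypothetical, and in a JEE/JBoss application you would more likely use Ignite's Java client or JDBC driver, but the access pattern is the same:

# Hedged sketch: querying Apache Ignite over SQL via the pyignite thin client.
# Host, table and data are hypothetical.
from pyignite import Client

client = Client()
client.connect("127.0.0.1", 10800)  # default thin-client port

# DDL, DML and queries all go through plain SQL.
client.sql("CREATE TABLE IF NOT EXISTS trades (id INT PRIMARY KEY, amount DOUBLE)")
client.sql("INSERT INTO trades (id, amount) VALUES (?, ?)", query_args=[1, 99.5])

for row in client.sql("SELECT id, amount FROM trades WHERE amount > ?", query_args=[50.0]):
    print(row)

client.close()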

Fetch data subset from gmond

This is in the context of a small data-center setup where the number of servers to be monitored is only in the double digits and may grow slowly to a few hundred (if at all). I am a Ganglia newbie and have just finished setting up a small Ganglia test bed (and have been reading and playing with it). A couple of things I have realised:
gmetad supports interactive queries on port 8652, with which I can get subsets of the metric data - say, the data for a particular metric family in a specific cluster
gmond seems to always return the whole dump of data for all metrics from all nodes in a cluster (on doing 'netcat host 8649')
In my setup, I don't want to use gmetad or RRD. I want to fetch data directly from the multiple gmond clusters and store it in a single data store. There are a couple of reasons not to use gmetad and RRD:
I don't want multiple data stores in the whole setup. I can have one dedicated machine fetch data from the few clusters and store it.
I don't plan to use gweb as the data front end. The data from Ganglia will be fed into a different monitoring tool altogether. With this setup, I want to eliminate the latency that another layer of gmetad would add. That is, if gmetad polls every minute and my management tool polls gmetad every minute, that adds up to 2 minutes of delay, which I feel is unnecessary for a relatively small/medium-sized setup.
There are a couple of problems with this approach for which I need help:
I cannot get filtered data from gmond. Is there some plugin that can help me fetch individual metric/metric-group information from gmond (since different metrics are collected at different intervals)?
gmond output is very verbose text. Is there some other (hopefully binary) format that I can configure for export?
Is my idea of eliminating gmetad/RRD completely a very bad idea? Has anyone tried this approach before? What should I be careful of when doing so, from a data-collection standpoint?
Thanks in advance.
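As a rough sketch of the "fetch directly from gmond and filter on the collector side" approach described in the question: gmond returns its full XML dump when you connect to TCP 8649, so the filtering has to happen client-side. The host and metric names below are hypothetical:

# Rough sketch: pull the full XML dump from gmond (TCP 8649) and
# filter client-side for a single metric. Host and metric name are hypothetical.
import socket
import xml.etree.ElementTree as ET

def fetch_metric(gmond_host, metric_name, port=8649):
    chunks = []
    with socket.create_connection((gmond_host, port), timeout=10) as sock:
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    root = ET.fromstring(b"".join(chunks))
    # The dump is GANGLIA_XML > CLUSTER > HOST > METRIC elements.
    values = {}
    for host in root.iter("HOST"):
        for metric in host.iter("METRIC"):
            if metric.get("NAME") == metric_name:
                values[host.get("NAME")] = metric.get("VAL")
    return values

print(fetch_metric("gmond-node-01", "load_one"))

Note that this does not reduce what gmond sends over the wire (it still ships the whole dump); it only removes the extra gmetad/RRD layer.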

Controlled data sharding in MongoDB

I am new to MongoDB and have only a very basic knowledge of its sharding concepts. However, I was wondering whether it is possible to control the split of the data yourself, for example so that a specific part of the records is stored on one particular shard?
This will be used together with a Rails app.
You can turn off the balancer to stop auto balancing:
sh.setBalancerState(false)
If you know the range of the key you are splitting on, you could also pre-split your data ranges to the desired servers; see the PreSplitting example. The management of the shards would be done via the JavaScript shell and not via your Rails application (a pymongo sketch of the same commands follows below).
You should take care that no shard gets disproportionately more load (becomes hot); that is why auto balancing is on by default. Monitoring, for example with the free MMS service, will help you keep track of that.
The decision to shard is a complex one that you should put a lot of thought into.
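If you prefer to drive this from application code rather than the JavaScript shell, a hedged pymongo sketch of the same ideas might look like the following; the database, collection, shard-key value and shard name are hypothetical, and the exact commands and config-collection layout vary between MongoDB versions:

# Hedged sketch (pymongo, connected to a mongos): stop the balancer,
# pre-split a range and move it to a chosen shard. All names are hypothetical
# and the underlying commands vary between MongoDB versions.
from pymongo import MongoClient

client = MongoClient("mongodb://mongos-host:27017")

# Rough equivalent of sh.setBalancerState(false) on older versions:
# flip the 'stopped' flag in config.settings.
client.config.settings.update_one(
    {"_id": "balancer"}, {"$set": {"stopped": True}}, upsert=True
)

# Pre-split the (already sharded) collection at a chosen shard-key value...
client.admin.command("split", "mydb.events", middle={"user_id": 50000})

# ...and move the chunk containing that value to a specific shard.
client.admin.command(
    "moveChunk", "mydb.events", find={"user_id": 50000}, to="shard0001"
)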
There's a lot to learn about sharding, and much of it is non-obvious. I'd suggest reviewing the information at the following links:
Sharding Introduction
Sharding Overview
FAQ
In the context of a shard cluster, a chunk is a contiguous range of shard key values assigned to a particular shard. By default, chunks are 64 megabytes (unless modified as per above). When a chunk grows beyond the configured chunk size, a mongos splits it into two chunks. MongoDB chunks are logical, and the data within them is NOT physically located together.
As I've mentioned, the balancer moves the chunks around; however, you can also do this manually. The balancer decides to re-balance and requests a chunk migration when there is a large enough difference (a minimum of 8) between the number of chunks on each shard. The actual moving of the chunk is coordinated between the "From" and "To" shards; when it is finished, the original chunk is removed from the "From" shard and the config servers are informed.
Quite a lot of people also pre-split, which helps with their migration. See here for more information.
In order to see documents split between the two shards, you'll need to insert enough documents to fill up several chunks on the first shard. If you haven't changed the default chunk size, you'd need to insert a minimum of 512 MB of data to see data migrated to a second chunk. It's often a good idea to test this, and you can do so by setting your chunk size to 1 MB and inserting 10 MB of data. Here is an example of how to test this.
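A hedged pymongo sketch of that 1 MB chunk-size test; the database, collection and field names are hypothetical, and on older versions the chunk size is stored in config.settings:

# Hedged sketch: shrink the chunk size to 1 MB and insert ~10 MB of data so
# splits and migrations become visible quickly. Names are hypothetical and the
# config-collection layout differs between MongoDB versions.
import string
from pymongo import MongoClient

client = MongoClient("mongodb://mongos-host:27017")

# Historically done in the shell as: db.settings.save({_id: "chunksize", value: 1})
client.config.settings.update_one(
    {"_id": "chunksize"}, {"$set": {"value": 1}}, upsert=True
)

padding = string.ascii_letters * 200  # ~10 KB per document
docs = [{"user_id": i, "payload": padding} for i in range(1000)]  # ~10 MB total
client.mydb.events.insert_many(docs)

# Inspect how the chunks ended up distributed (field names per older versions).
for chunk in client.config.chunks.find({"ns": "mydb.events"}):
    print(chunk["shard"], chunk["min"], chunk["max"])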
Tag Aware Sharding (http://www.mongodb.org/display/DOCS/Tag+Aware+Sharding) probably addresses your requirement in v2.2.
Check out Kristina Chodorow's blog post too for a nice example: http://www.kchodorow.com/blog/2012/07/25/controlling-collection-distribution/
Why do you want to split the data yourself if MongoDB is already doing it for you automatically? You can point your Rails application layer at a mongos instance, so that mongos routes any CRUD operation to wherever the data resides. This is achieved using the config servers.
