Greenplum vs Citus for Data Warehouse

I'm trying to evaluate Citus and Greenplum in terms of using them as a Data Warehouse. The general idea is that data from multiple OLTP systems will be integrated in real time via Kafka Connect in a central warehouse for analytical queries.
How does Citus compare to Greenplum in this respect? I have read that Citus has some SQL limitations, e.g. correlated subqueries are not supported if the correlation is not on the distribution column. Does Greenplum have similar SQL limitations? Will Greenplum work well if data is being streamed into it (as opposed to batch updates)? I have the feeling that Greenplum is more analytics-focused and can sacrifice some OLTP-specific things, which Citus cannot afford to do since it positions itself as HTAP (not OLAP). Citus also positions itself as a solution for sub-second query times, which is not necessary for my use case: several seconds (up to 5) per query would be satisfactory.
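To make the limitation concrete, here is a sketch of the kind of query I mean; the orders/payments tables, the customer_id distribution key and the connection details are made up for illustration, and the query is issued over plain JDBC.

```java
// Hedged sketch: "orders" and "payments" are hypothetical tables distributed on
// customer_id. The EXISTS subquery correlates on order_ref, i.e. NOT on the
// distribution column -- the pattern I have read Citus does not support.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class NonDistributionCorrelation {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://citus-coordinator:5432/warehouse", "app", "secret");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT o.id FROM orders o WHERE EXISTS (" +
                     "  SELECT 1 FROM payments p WHERE p.order_ref = o.order_ref)")) {
            while (rs.next()) {
                System.out.println("order with a matching payment: " + rs.getLong(1));
            }
        }
    }
}
```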

I am not aware of any SQL limitations in Greenplum like the one you mention above. In some cases, e.g. CUBE or the percentile_* ordered-set aggregate functions, GPORCA, the Greenplum Database query optimiser, will fall back to the PostgreSQL query planner, and those queries won't be as performant as GPORCA-enabled queries - but you would still get a response to your query.
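As a hedged illustration, these are the kinds of constructs I mean; the sales table, its columns and the connection details below are invented for the example, and the queries are issued over plain JDBC since Greenplum speaks the PostgreSQL wire protocol.

```java
// Hypothetical JDBC sketch: the table and column names (sales, region, product,
// amount) are made up. GROUP BY CUBE and percentile_cont are the kind of
// constructs that may make GPORCA fall back to the PostgreSQL planner -- the
// query still returns a result, it may just be slower.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class GporcaFallbackExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://gp-coordinator:5432/warehouse", "gpadmin", "secret");
             Statement stmt = conn.createStatement()) {

            // CUBE grouping sets.
            ResultSet rs = stmt.executeQuery(
                    "SELECT region, product, SUM(amount) " +
                    "FROM sales GROUP BY CUBE (region, product)");
            while (rs.next()) {
                System.out.printf("%s | %s | %s%n",
                        rs.getString(1), rs.getString(2), rs.getBigDecimal(3));
            }

            // Ordered-set aggregate (percentile_cont) -- same situation.
            ResultSet p = stmt.executeQuery(
                    "SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY amount) FROM sales");
            if (p.next()) {
                System.out.println("median amount: " + p.getBigDecimal(1));
            }
        }
    }
}
```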
Getting streaming data in vs. batch updates is one thing: using Kafka Connect with JDBC would work out of the box, but it won't take any advantage of the parallel, distributed nature of Greenplum, since all your data would have to pass through the coordinator.
What would be optimal is to use something like the Greenplum Streaming Server (GPSS), which writes the data delivered from the client directly into the segments of the Greenplum Database cluster, allowing maximum parallelism and the best stream-loading performance.
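To make the contrast concrete, here is a rough sketch of the coordinator-only path: a plain Kafka consumer batching rows into Greenplum over a single JDBC connection. The topic, table, host names and credentials are hypothetical; GPSS avoids this funnel by writing to the segments directly.

```java
// Rough sketch of the "everything through the coordinator" path. Topic, table
// and host names are made up for illustration.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class CoordinatorIngest {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");
        props.put("group.id", "gp-ingest");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             Connection conn = DriverManager.getConnection(
                     "jdbc:postgresql://gp-coordinator:5432/warehouse", "gpadmin", "secret");
             PreparedStatement insert = conn.prepareStatement(
                     "INSERT INTO events_raw (payload) VALUES (?)")) {

            consumer.subscribe(List.of("oltp.events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    insert.setString(1, record.value());
                    insert.addBatch();
                }
                insert.executeBatch();   // every row funnels through the coordinator
                consumer.commitSync();
            }
        }
    }
}
```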

Related

Influxdb schema design for large amounts of fast data, multiple or single database?

We are using InfluxDB at different industrial sites, where we log up to 10,000 values at sample rates ranging from 1 Hz to 1000 Hz, from 3-5 different machines, resulting in something like 1 GB of data per hour.
The logging is handled by simple HTTP line-protocol calls to an InfluxDB 1.8 server, running on a Xeon 2.5 GHz 10-core machine with 64 GB RAM and a 6 TB SSD RAID 5 array.
Right now the values are stored in the same database with a measurement for each machine, a retention policy of 20 weeks, and a shard duration of 1 week.
The data is mostly visualized through Grafana.
Many people query the database at once through multiple Grafana dashboards, which tends to be fairly slow when large amounts of data are retrieved. No cross-measurement calculations are performed; these are only visual plots.
Will I get any read-speed benefit from using multiple databases instead of a single database with multiple measurements?
When getting data from a database, does InfluxDB need to "open" files containing data from all measurements in order to find data from a specific measurement?
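For context, a minimal sketch of the HTTP line-protocol write path described above, against an InfluxDB 1.8 /write endpoint; the host, database, measurement, tag and field names are hypothetical.

```java
// Illustration of the HTTP line-protocol write path (InfluxDB 1.x).
// One point per line: <measurement>,<tags> <fields> <timestamp>.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class LineProtocolWrite {
    public static void main(String[] args) throws Exception {
        long nowNs = System.currentTimeMillis() * 1_000_000L; // epoch nanoseconds
        String body = String.join("\n",
                "machine_a,sensor=spindle_speed value=1450.2 " + nowNs,
                "machine_a,sensor=temperature value=61.7 " + nowNs);

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://influx-host:8086/write?db=plant_metrics&precision=ns"))
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("HTTP " + response.statusCode()); // 204 = write accepted
    }
}
```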

Apache Ignite vs Informix Warehouse Accelerator vs Infinispan?

What's the difference between Apache Ignite, IWA (Informix Warehouse Accelerator), and Infinispan?
I have an application that accepts a large volume of data and processes many transactions per second. Not only is response time very important for us, but data integrity is also very important. Which in-memory database is the best solution for me? I'm confused about which to select. I also use JEE, and the application server is JBoss.
We are looking for the best in-memory database solution for processing data in real time.
Update:
I use a relational database, and I am looking for an in-memory database to select, insert, and update from in order to decrease response times. Data integrity is very important, and it is also very important to persist data on disk.
Apache Ignite and Infinispan are both data grids / memory-centric databases with similar feature lists, with the biggest difference being that Apache Ignite has SQL support for querying data.
Informix Warehouse Accelerator seems to be a narrow use case product, so it's hard to say if it's useful for your use case or not.
Otherwise, there's too little information in your question about the specifics of your project to say whether either of those is a good fit, or neither of them.
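To illustrate the SQL support mentioned above, here is a minimal, hedged Ignite sketch; the Trade class, cache name and field names are invented for the example.

```java
// Minimal sketch of ANSI-style SQL over an Ignite cache -- the kind of querying
// that distinguishes Ignite from Infinispan. All names are illustrative.
import java.util.List;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.query.QueryCursor;
import org.apache.ignite.cache.query.SqlFieldsQuery;
import org.apache.ignite.cache.query.annotations.QuerySqlField;
import org.apache.ignite.configuration.CacheConfiguration;

public class IgniteSqlExample {
    public static class Trade {
        @QuerySqlField(index = true)
        private String account;
        @QuerySqlField
        private double amount;

        public Trade(String account, double amount) {
            this.account = account;
            this.amount = amount;
        }
    }

    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            CacheConfiguration<Long, Trade> cfg = new CacheConfiguration<>("trades");
            cfg.setIndexedTypes(Long.class, Trade.class);   // enables SQL over Trade
            IgniteCache<Long, Trade> trades = ignite.getOrCreateCache(cfg);

            trades.put(1L, new Trade("ACC-1", 250.0));
            trades.put(2L, new Trade("ACC-1", 990.0));

            SqlFieldsQuery query = new SqlFieldsQuery(
                    "SELECT account, SUM(amount) FROM Trade GROUP BY account");
            try (QueryCursor<List<?>> cursor = trades.query(query)) {
                for (List<?> row : cursor) {
                    System.out.println(row.get(0) + " -> " + row.get(1));
                }
            }
        }
    }
}
```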

Difference between database connector/reader nodes in KNIME

While creating a basic workflow using KNIME and PSQL I have encountered problems with selecting the proper node for fetching data from the database.
In the node repository we can find at least:
PostgreSQL Connector
Database Reader
Database Connector
Actually, we can do the same using 2) alone, or by connecting either 1) or 2) to node 3)'s input.
I assumed there were some hidden advantages, like improved performance with complex queries or better overall stability, but on the other hand we are using exactly the same database driver anyway.
There is a big difference between the Connector Nodes and the Reader Node.
The Database Reader reads data into KNIME; the data is then on the machine running the workflow. This can be a bad idea for big tables.
The Connector nodes do not. The data remains where it is (usually on a remote machine in your cluster). You can then connect Database nodes to the connector nodes. All data manipulation will then happen within the database; no data is loaded onto your machine (unless you use the output port preview).
For the difference of the other two:
The PostgreSQL Connector is just a special case of the Database Connector that has a pre-set configuration. However, you can make the same configuration with the Database Connector, which also allows you to choose more detailed options for non-standard databases.
One advantage of using 1 or 2 is that you only need to enter connection details once for a database in a workflow, and can then use multiple reader or writer nodes. I'm not sure if there is a performance benefit.
1 offers simpler connection details than 2, thanks to the bundled PostgreSQL JDBC driver.

Apache Hive and record updates

I have streaming data coming into my consumer app that I ultimately want to show up in Hive/Impala. One way would be to use Hive-based APIs to insert the updates in batches into the Hive table.
The alternative approach is to write the data directly into HDFS as an Avro/Parquet file and let Hive detect the new data and pick it up.
I tried both approaches in my dev environment, and the 'only' drawbacks I noticed were high latency writing to Hive and failure conditions I need to account for in my code.
Is there an architectural design pattern/best practices to follow?
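For what it's worth, a minimal sketch of the second approach (files landed on HDFS, then registered with Hive over JDBC); the paths, table name and partition values are hypothetical.

```java
// Sketch of the second approach: land a completed Avro/Parquet file in a
// partition directory on HDFS, then tell Hive about the new partition via JDBC.
// Paths, table and partition names are made up for illustration.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsThenHive {
    public static void main(String[] args) throws Exception {
        // 1. Move a finished file into the partition directory.
        FileSystem fs = FileSystem.get(new Configuration());
        Path staged = new Path("/staging/events/part-0001.parquet");
        Path target = new Path("/warehouse/events/dt=2024-01-15/part-0001.parquet");
        fs.mkdirs(target.getParent());
        fs.rename(staged, target);

        // 2. Register the partition so Hive can see the new data.
        //    (Impala typically also needs a REFRESH of the table afterwards.)
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hive-server:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {
            stmt.execute("ALTER TABLE events ADD IF NOT EXISTS " +
                    "PARTITION (dt='2024-01-15') LOCATION '/warehouse/events/dt=2024-01-15'");
        }
    }
}
```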

Neo4j Huge database query performance configuration

I am new to Neo4j and graph databases. That said, I have around 40,000 independent graphs uploaded into a Neo4j database using batch insertion, and so far everything has gone well. My current database folder size is 180 GB; the problem is querying, which is too slow. Just counting the number of nodes takes forever. I am using a server with 1 TB of RAM and 40 cores, therefore I would like to load the entire database into memory and perform queries on it.
I have looked into the configuration options but am not sure what changes I should make to cache the entire database in memory. Please suggest the properties I should modify.
I also noticed that most of the time Neo4j is using only one or two cores. How can I increase that?
I am using the free version for a university research project, therefore I am unable to use the High-Performance Cache. Is there an alternative in the free version?
My Solution:
I added more graphs to my database and now my database size is 400 GB with more than a billion nodes. I took Stefan's comments, used the Java APIs to access my database, and moved my database to a RAM disk. It takes about 3 hours to walk through all the nodes and collect information from each node.
RAM disk and Java APIs gave a big boost in performance.
Counting nodes in a graph is a global operation that obviously needs to touch each and every node. If the caches are not populated (or not configured appropriately for your dataset), the speed of your hard disk is the most influential factor.
To speed things up, make sure the caches are configured efficiently; see http://neo4j.com/docs/stable/configuration-caches.html.
With current versions of Neo4j, a Cypher query traverses the graph in single-threaded mode. Since most graph applications out there are used concurrently by multiple users, this model still saturates the available cores.
If you want to run a single query multi-threaded, you need to use the Java API.
In general, the Neo4j Community edition has some limitations in scaling beyond 4 cores (the Enterprise edition has a more performant lock manager implementation). The HPC (high-performance cache) in the Enterprise edition also significantly reduces the impact of full garbage collections.
What Neo4j version are you using?
Please share your current config (conf/* and data/graph.db/messages.log). You can use the personal edition of Neo4j Enterprise.
What kinds of use cases do you want to run?
Counting all nodes is probably not your main operation (there are ways in the Java API that make it faster).
For efficient multi-core usage, run multiple clients or write Java code that utilizes more cores during traversal with thread pools, as in the sketch below.
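A rough sketch of that idea, written against the embedded Java API of the Neo4j 2.x line this thread refers to; the database path, the node-ID upper bound and the per-node work are placeholders. Each worker scans its own ID range inside its own transaction, so all cores can be kept busy.

```java
// Multi-threaded node scan over the embedded Java API (Neo4j 2.x style).
// Path, ID bound and per-node work are illustrative only.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.NotFoundException;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

public class ParallelNodeScan {
    public static void main(String[] args) throws Exception {
        GraphDatabaseService db =
                new GraphDatabaseFactory().newEmbeddedDatabase("/data/graph.db");
        int workers = Runtime.getRuntime().availableProcessors();
        long highestNodeId = 1_200_000_000L;   // upper bound on node IDs, illustrative
        long chunk = highestNodeId / workers + 1;

        ExecutorService pool = Executors.newFixedThreadPool(workers);
        List<Future<Long>> results = new ArrayList<>();
        for (int w = 0; w < workers; w++) {
            final long start = w * chunk;
            final long end = Math.min(start + chunk, highestNodeId);
            results.add(pool.submit((Callable<Long>) () -> {
                long counted = 0;
                try (Transaction tx = db.beginTx()) {
                    for (long id = start; id < end; id++) {
                        try {
                            Node node = db.getNodeById(id);
                            // ... collect whatever per-node information you need ...
                            counted++;
                        } catch (NotFoundException deletedOrUnused) {
                            // ID gaps are expected; skip them.
                        }
                    }
                    tx.success();
                }
                return counted;
            }));
        }

        long total = 0;
        for (Future<Long> f : results) {
            total += f.get();
        }
        pool.shutdown();
        db.shutdown();
        System.out.println("nodes visited: " + total);
    }
}
```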
