I noticed that Impala's "Estimated Per-Host Requirements" grow dramatically when my queries use a "group by" with several fields. I suppose it calculates the maximum resources needed for the aggregation:
EXPLAIN select field1, field2
from mytable where field1=123
group by field1, field2
order by field1, field2
limit 100;
I would like to know if there is a way to reduce the value Impala estimates, because the resources actually needed (300 MB) were far lower than the estimated amount (300 GB).
It is important to say that "field1" and "field2" are String.
Unfortunately, it is difficult to estimate the required memory from the information known at query planning time, which is based on the limited statistics available; this is especially true for aggregations and joins, whose memory use depends on the selectivity of the grouping/join expressions.
First, are you sure you have up-to-date statistics on the table(s) you're using? Run COMPUTE STATS [table] to refresh them.
If you still have this issue with correct stats, you can set the mem_limit=XM query option to tell Impala that the query shouldn't use more than X MB of memory, so it will request that amount of memory from Llama rather than the estimate from planning. If you're sure the query doesn't use more than 300 MB, you can issue set mem_limit=300m; and then run your query. If you run other queries from the same session afterwards, clear the query option first.
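A minimal sketch of that workflow in impala-shell, reusing the table from the question (the literal is quoted here since field1 is a STRING, and 300m is just the figure mentioned above, so tune it to your workload):
compute stats mytable;
set mem_limit=300m;
select field1, field2
from mytable
where field1 = '123'
group by field1, field2
order by field1, field2
limit 100;
-- mem_limit=0 means "no limit"; this clears the cap for later queries in the session
set mem_limit=0;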
There is a dataset on my notebook's virtual machine:
2 million unique Customers [:VISITED] 40,000 unique Merchants.
Every [:VISITED] relationship has properties: amount (double) and dt (date).
Every Customer has a property “pty_id” (Integer).
And every Merchant has an mcht_id (String) property.
One Customer may visit one Merchant more than once, and of course one Customer may visit many Merchants. So there are 43,978,539 relationships in my graph between Customers and Merchants.
I have created Indexes:
CREATE INDEX on :Customer(pty_id)
CREATE INDEX on :Merchant(mcht_id)
Parameters of my VM are:
Oracle (RedHat) Linux 7 with a 2-core i7, 2 GB RAM
Parameters of my Neo4j 3.5.7 config:
- dbms.memory.heap.max_size=1024m
- dbms.memory.pagecache.size=512m
My task is:
Get the top 10 Customers, ordered by total_amount, who did NOT spend money at a specified Merchant (M), but who visited the Merchants that were visited by the Customers who did visit that specified Merchant (M).
My Solution is:
Let's say M has mcht_id = "0000000DA5".
Then the CYPHER query will be:
MATCH
(c:Customer)-[r:VISITED]->(mm:Merchant)<-[:VISITED]-(cc:Customer)-[:VISITED]->(m:Merchant {mcht_id: "0000000DA5"})
WHERE
NOT (c)-[:VISITED]->(m)
WITH
DISTINCT c as uc
MATCH
(uc:Customer)-[rr:VISITED]->()
RETURN
uc.pty_id
,round(100*sum(rr.amount))/100 as v_amt
ORDER BY v_amt DESC
LIMIT 10;
The result is OK. I receive my answer:
uc.pty_id - v_amt: 1433798 - 348925.94; 739510 - 339169.83; 374933 - 327962.95; and so on.
The problem is that I received this result after 437,613 ms. That's about 7 minutes! My estimated time for this query was about 10-20 seconds.
My Question is: What am I doing wrong???
There are a few things to improve here.
First, for graph-wide queries on a graph with millions of nodes and ~50 million relationships, 1G of heap and 512M of pagecache is far too low. We usually recommend around 8-10G of heap minimum for medium to large graphs (heap is your "scratch space" memory as a query executes), and getting as much of the graph as possible into pagecache to minimize cache misses as you traverse it. Neo4j likes memory, and memory is relatively cheap. You can use neo4j-admin memrec to get a recommendation for your memory settings, but in general you need to run this on a machine with more memory.
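For illustration only: on a hypothetical machine with 16 GB of RAM, memrec-style settings might look roughly like this (the exact numbers depend on your graph size and OS overhead):
- dbms.memory.heap.initial_size=5g
- dbms.memory.heap.max_size=5g
- dbms.memory.pagecache.size=7g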
And if we're talking about hardware recommendations, usage of SSDs is highly recommended, for when you do need to hit the disk.
As for the query itself, notice in the query plan you posted that your DISTINCT operation drops the number of rows from the neighborhood of 26-35 million down to only 153k rows; that's significant. Your most expensive step, the WHERE NOT (c)-[:VISITED]->(m) filter, is the Expand(Into) operation on the right side of the plan, with nearly 1 billion db hits. It is happening too early in the query: you should apply it AFTER your DISTINCT operation, so it operates on only 153k rows instead of 35 million.
You can also improve on this so you don't even have to hit the graph for that filtering step. Instead of the WHERE NOT <pattern> approach, you can pre-match the customers who visited the first merchant and gather them into a list that you keep around. Then, instead of negating the pattern (which has to expand out all :VISITED relationships of those customers and check whether any leads to the original merchant), we do a list membership check, ensuring they aren't among the 1k or so customers who visited the original merchant. That check happens in memory, since we already collected the list, so it shouldn't hit the graph. In any case, do the DISTINCT before this check.
In your RETURN you're aggregating with respect to a node's unique property, so you pay the cost of projecting that property across 4 million rows BEFORE the aggregation drops the cardinality to 153k rows; that is, you project the property redundantly across a great many duplicate :Customer nodes before they become distinct through the aggregation. You can avoid that redundant and expensive property access by aggregating with respect to the node itself, then doing the property access after the aggregation, and also after your sort and limit, so you only have to project out 10 properties.
So putting that all together, try this out:
MATCH
(cc:Customer)-[:VISITED]->(m:Merchant {mcht_id: "0000000DA5"})
WITH m, collect(DISTINCT cc) as visitors
UNWIND visitors as cc
MATCH (uc:Customer)-[:VISITED]->(mm:Merchant)<-[:VISITED]-(cc)
WHERE
mm <> m
WITH
DISTINCT visitors, uc
WHERE NOT uc IN visitors
MATCH
(uc:Customer)-[rr:VISITED]->()
WITH
uc, round(100*sum(rr.amount))/100 as v_amt
ORDER BY v_amt DESC
LIMIT 10
RETURN uc.pty_id, v_amt;
EDIT
Okay, let's try something else. I suspect that what we're encountering here is a great number of duplicates during expansion (many visitors may have visited the same merchants). Cypher won't eliminate duplicates during traversal unless you explicitly ask for it (since it may need that information for aggregations such as counting occurrences), and this query depends heavily on getting distinct nodes during expansion.
If you can install APOC Procedures, we can make use of some of its expansion procedures, which let us change how Cypher expands, visiting each distinct node only once across all paths. That may improve the timing here. At the least, it will show us whether the slowdown is related to deduplication of nodes during expansion or to something else.
MATCH (m:Merchant {mcht_id: "0000000DA5"})
CALL apoc.path.expandConfig(m, {uniqueness:'NODE_GLOBAL', relationshipFilter:'VISITED', minLevel:3, maxLevel:3}) YIELD path
WITH last(nodes(path)) as uc
MATCH
(uc:Customer)-[rr:VISITED]->()
WITH
uc
,round(100*sum(rr.amount))/100 as v_amt
ORDER BY v_amt DESC
LIMIT 10
RETURN uc.pty_id, v_amt;
While this is a more complicated approach, one neat thing is that with NODE_GLOBAL uniqueness (ensuring we only visit each node once across all expanded paths) and bfs expansion, we don't need to include WHERE NOT (c)-[:VISITED]->(m); it is ruled out naturally. We will already have visited every visitor of m, and since those nodes have been visited, they cannot be visited again, so none of them can appear in the result set at 3 hops.
Give this a try and run it a couple of times to get the data into pagecache (or as much of it as possible... with 512MB of pagecache you may not be able to fit all of the traversed structure in memory).
I have tested all the optimised queries on Neo4j and on Oracle. The results are:
Oracle - 2.197 sec
Neo4j - 5.326 sec
You can see details here: http://homme.io/41163#run
And there is more complementary material for the Neo4j case at http://homme.io/41721.
I am using a neo4j-community-3.5.3 server on a system with 64 GB RAM and 32 cores.
My database size is currently 160 GB and it grows by about 1.5 GB every day. I keep 12 GB in page cache and 8 GB in heap.
Apart from uniqueness constraints, I also create indexes on some of my node properties. Since lucene_native-1.0 indexing is deprecated in the current Neo4j version, I am using the default native-btree-1.0.
The problem I am facing is that my write performance is very good, but reads are not: even when a query uses indexes, the result takes around 1 minute.
My index size is almost 21 GB. My database keeps growing, but I am not getting the query performance I expected.
Please give me some suggestions so that I can tune my query. Thanks in advance.
Here is a sample of my query with indexing, and some profiles:
PROFILE
OPTIONAL MATCH (u1:USER)<-[p:MENTIONS]-(tw:TWEET)<-[m:POST]-(u2:USER)
USING INDEX tw:TWEET(date)
WHERE tw.date='2019-03-03' AND u1.author_screen_name='xxx'
RETURN
u1.author_screen_name as mentioned_author,
u2.author_name as mentioned_by_author,
count(*) AS weight
ORDER BY weight DESC LIMIT 20
(Attached: Query_profile1_using_indexing, Query_profile2_using_indexing, Query_profile3_using_indexing)
And here is a query without indexing, and some profiles:
PROFILE
OPTIONAL MATCH (u1:USER)<-[p:MENTIONS]-(tw:TWEET)<-[m:POST]-(u2:USER)
WHERE tw.date='2019-03-03' AND u1.author_screen_name='xxx'
RETURN
u1.author_screen_name as mentioned_author,
u2.author_name as mentioned_by_author,
count(*) AS weight
ORDER BY weight DESC LIMIT 20
(Attached: Query_profile1_without_using_indexing, Query_profile2_without_using_indexing, Query_profile3_without_using_indexing)
Without indexing, the query takes 880,572 ms; with indexing, it takes 57,674 ms for the same query.
In either case you're doing your projections at the same time as your aggregation, which isn't efficient. First of all, since there's only a single u1, project its author_screen_name at the beginning, while your cardinality is still a single row.
Then, after your match, do your aggregation, ordering, and limiting based on the nodes themselves. Once your results are aggregated, THEN do the projections, so you do a minimal amount of work; you don't want to pay for property access on a ton of rows that you're only going to discard once you have the limited result set:
MATCH (u1:USER)
WITH u1, u1.author_screen_name as mentioned_author
OPTIONAL MATCH ...
...
WITH mentioned_author, u2, count(*) AS weight
ORDER BY weight DESC
LIMIT 20
RETURN mentioned_author, u2.author_name as mentioned_by_author, weight
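Assembled with the pattern from the question, the full query might look like this (a sketch reusing the question's labels and properties; it assumes an index on :USER(author_screen_name) so the first MATCH is cheap):
MATCH (u1:USER)
WHERE u1.author_screen_name = 'xxx'
WITH u1, u1.author_screen_name as mentioned_author
OPTIONAL MATCH (u1)<-[:MENTIONS]-(tw:TWEET)<-[:POST]-(u2:USER)
WHERE tw.date = '2019-03-03'
WITH mentioned_author, u2, count(*) AS weight
ORDER BY weight DESC
LIMIT 20
RETURN mentioned_author, u2.author_name as mentioned_by_author, weight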
I'm working on an application where a lot of records need to be archived. For example, in the case of a task, n hours after it's been marked complete it becomes read-only. The frontend client queries for "Active" tasks or "Archived" tasks, but never both mixed together. I'm wondering what the ideal way of storing the archived task records would be since, over time, they will greatly outnumber the "Active" tasks.
I'm interested mainly in preventing the "Active" task query from coming in contact with a bunch of archived tasks and taking a performance hit.
Is flagging / indexing an archived: boolean column enough? I was also thinking of partitioning / moving them into their own archived_tasks table for total separation, but I'm not sure that's necessary. Any other ideas?
Extra info: I'm also filtering based on a foreign key for the current user.
"The cardinality of an index is the number of unique values within it. Your database table may have a billion rows in it, but if it only has 8 unique values among those rows, your cardinality is very low.
A low cardinality index is not a major efficiency gain. Most SQL indexes are binary search trees (B-Trees). Versus a serial scan of every row in a table to find matching constraints, a B-Tree logarithmically reduces the number of comparisons that have to be made. The gains from executing a search against a B-Tree are very low when the size of the tree is small.
So putting an index on a Boolean field? Or an enumerated value field? A cardinality of a very small number of distinct values among a very large number of rows will not yield noticeable efficiency gains. Save your database indexes for fields with very high cardinality to ensure the gains from scanning a B-Tree are largest versus sequential scans."
-- Joshua Ginsberg, Chief Architect, Red Hat.
More about this topic: http://www.ovaistariq.net/733/understanding-btree-indexes-and-how-they-impact-performance/#.W2gT1H6YPEY
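To make the cardinality point concrete for the task-archiving question above, here is a hedged sketch (the tasks table and its column names are assumptions, standing in for the archived flag and the per-user foreign key mentioned in the question):
-- ~1 billion rows but only 2 distinct values: each value still matches
-- hundreds of millions of rows, so this index saves almost nothing.
CREATE INDEX idx_tasks_archived ON tasks (archived);
-- High cardinality: each user_id narrows the search to a handful of rows,
-- so the B-Tree's logarithmic lookup actually pays off.
CREATE INDEX idx_tasks_user ON tasks (user_id);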
I'm having trouble with performance on the following query:
SELECT [COLUMNS] FROM TABLE A JOIN TABLE B ON [KEYS]
If I remove the join, leaving only the select, the query takes seconds. With the join, it takes 30 minutes.
Table sizes are A (844,082,912) & B (1,540,379,815) rows.
Distribution and sort keys are equivalent to the join KEYS.
Looking at the AWS graphs, I see (attached) one node with around 100% CPU utilisation for a short time.
Looking at the system table (svv_diskusage), I am not sure what I am seeing (attached), as it does not indicate (as far as I can tell) whether one node holds much more data than the others.
If the issue is faulty distribution, how can I see it?
Is it something else?
Here, https://aws.amazon.com/articles/8341516668711341 (Uneven Distribution), you can see an example of the same graph style: one node working harder than the others, which indicates your data is not evenly distributed.
Regarding svv_diskusage, it describes the values stored in each slice. If the slices are not used relatively evenly, that's an indicator of a bad distribution key. Try the following query to get a higher-level view of distribution among nodes rather than slices:
select owner, host, diskno, used, capacity,
(used-tossed)/capacity::numeric *100 as pctused
from stv_partitions order by owner;
You can also check which distribution and sort keys are actually defined on the table via pg_table_def:
set search_path to '$user', 'public', 'ic';
select * from pg_table_def where tablename = '{TableNameHere}';
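As another quick check, svv_table_info exposes a skew_rows column (the ratio of rows on the fullest slice to the emptiest), which makes per-table skew easy to spot; a small sketch, assuming your cluster version provides this view:
select "table", diststyle, skew_rows
from svv_table_info
order by skew_rows desc;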
I have to develop a system for tracking/monitoring performance in a cellular network.
The domain includes a set of hierarchical elements, and each one has an associated set of counters that are reported periodically (every 15 minutes). The system should collect these counter values (available as large XML files) and periodically aggregate them on two dimensions: Time (from 15 minutes to hour, and from hour to day) and Hierarchy (lower-level to higher-level elements). The aggregation is most often a simple SUM, but sometimes requires average/min/max etc. For the element dimension, the aggregation needs to group by the hierarchy (group all children into one parent record). The user should be able to define and view KPIs (Key Performance Indicators), that is, calculations on the various counters. A KPI could be required for just one element, for several elements (producing a data-series for each), or as an aggregation over several elements (resulting in one data-series of aggregated data).
There will be about 10-15 users of the system, issuing probably 20-30 queries an hour. The query response time should be a few seconds (up to 10-15 for very large reports covering many elements and a long time period).
In high level, this is the flow:
Parse and Input Counter Data - there is a set of XML files which contain periodic updates of counter data for the elements. The total size is about 4 GB per 15 minutes (so roughly 400 GB/day).
Hourly Aggregation - once an hour, all the collected counters, for all elements, should be aggregated - every 4 records related to an element are aggregated into one hourly record, which is stored.
Daily Aggregation - once a day, all collected counters, for all elements, should be aggregated - every 24 records related to an element are aggregated into one daily record.
Element Aggregation - with each of the time-dimension aggregations, it may also be required to aggregate along the hierarchy of the elements - all records of child elements are aggregated into one record for the parent element.
KPI Definitions - there should be some way for the user to define a KPI. A KPI is a definition of a calculation based on counters of the same granularity (time dimension). The calculation could (and will) involve more than one element level (e.g. p1.counter1 + sum(c1.counter1), where p1 is a parent of one or more records in c1).
User Interaction - the user can select one or more elements and one or more counters/KPIs, the granularity to use, the time period to view, and whether or not to aggregate the selected data.
In case of aggregation, the result is one data-series that includes the "added up" values for all the selected elements at each relevant point in time. In "SQL":
SELECT p1.time, SUM(p1.counter1) / SUM(p1.counter2) * SUM(c1.counter1)
FROM p1_hour p1, c1_hour c1
WHERE p1.time > :minTime and p1.time < :maxTime AND p1.id in :id_list and join
GROUP BY p1.time
In case there is no aggregation, we need to keep the identifiers from p1 and produce a data-series for each selected element:
SELECT p1.time, p1.id, SUM(p1.counter1) / SUM(p1.counter2) * SUM(c1.counter1)
FROM p1_hour p1, c1_hour c1
WHERE p1.time > :minTime and p1.time < :maxTime AND p1.id in :id_list and join
GROUP BY p1.time, p1.id
The system has to keep data for 10, 100, and 1000 days for the 15-minute, hourly, and daily records respectively. I made a size estimate assuming integer-only columns at 4 bytes of storage, with 400 counters for elements of type P, 50 for elements of type C, and 400 for type GP.
As it adds up, based on the DDL (in reality, DBs optimize storage), I assume 3.5-4 TB of data, plus probably about 20-30% extra required for indexes. The child "tables" can get close to 2 billion records per table.
It is worth noting that from time to time I would like to add counters (maybe every 2-3 months) as the network evolves.
I once implemented a very similar system (though probably with less data) using Oracle. This time around I may not use a commercial DB and must revert to open-source solutions. Also, with the increasing popularity of NoSQL and dedicated time-series DBs, maybe relational is not the way to go?
How would you approach such development? What are the products that could be used?
From a few days of research, I came up with the following:
MySQL / Postgres
InfluxDB (or a similar product)
Cassandra + Spark
Others?
How would each solution be used, and what would be the advantages/disadvantages of each approach? If you can, also elaborate on, or suggest, the overall (hardware) architecture to support this kind of development.
Comments and suggestions are welcome - preferably from people with hands on experience with similar project.
Going with Open Source RDBMS:
Using MySQL or Postgres
The table structure would be (imaginary SQL):
CREATE TABLE LEVEL_GRANULARITY (
  TIMESTAMP DATE,
  PARENT_ID INT,
  ELEMENT_ID INT,
  COUNTER_1 INT,
  ...
  COUNTER_N INT,
  PRIMARY KEY (TIMESTAMP, PARENT_ID, ELEMENT_ID)
)
For example, we would have P1_HOUR, GP_HOUR, P_DAY, GP_DAY, etc.
The tables could be partitioned by date to improve query time and to ease data management (whole partitions can be removed); see the sketch below.
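A minimal sketch of the date partitioning, using Postgres declarative partitioning and the imaginary hourly table (the TIMESTAMP column is renamed TS here to avoid keyword confusion, and the partition bounds are illustrative):
CREATE TABLE P1_HOUR (
  TS TIMESTAMP NOT NULL,
  PARENT_ID INT,
  ELEMENT_ID INT,
  COUNTER_1 INT
  -- ... remaining counters
) PARTITION BY RANGE (TS);
CREATE TABLE P1_HOUR_2019_06_01 PARTITION OF P1_HOUR
  FOR VALUES FROM ('2019-06-01') TO ('2019-06-02');
-- Retention becomes cheap: drop a whole day instead of DELETEing rows.
DROP TABLE P1_HOUR_2019_06_01;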
To facilitate fast loading, use the bulk loaders provided with the DB (e.g. Postgres's COPY or MySQL's LOAD DATA INFILE) - these are usually faster and insert data in bulk.
Aggregation could be done quite easily with a SELECT ... INTO ... query (since the scope of the aggregation is limited, I don't think it will be a problem); a sketch follows.
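A hedged sketch of the hourly roll-up, written as INSERT ... SELECT into the partitioned table above rather than SELECT ... INTO (which would create a new table each run); P1_15MIN and the :hourStart/:hourEnd parameters are assumed names:
-- Roll the four 15-minute rows per element up into one hourly row.
INSERT INTO P1_HOUR (TS, PARENT_ID, ELEMENT_ID, COUNTER_1)
SELECT date_trunc('hour', TS), PARENT_ID, ELEMENT_ID, SUM(COUNTER_1)
FROM P1_15MIN
WHERE TS >= :hourStart AND TS < :hourEnd
GROUP BY date_trunc('hour', TS), PARENT_ID, ELEMENT_ID;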
Queries are straightforward, as aggregation, grouping, and joining are built in. I am not sure about query performance, though, considering how large the tables become.
Since the workload is write-intensive, I don't think clustering could help here.
Pros:
Simple configuration (assuming no clusters etc).
SQL query capabilities - flexible
Cons:
Query performance - will it work?
Management overhead
Rigid Schema
Scaling?
Using InfluxDB (or something like it):
I have not used this DB; I'm writing based on having played around with it a little.
The model would be to create a time-series for every element at every level and granularity.
The data series name will include the identifiers of the element and the granularity.
For example P.P_ElementID.G.15MIN or P.P_ElementID.C.C1_ELEMENT_ID.G.60MIN
The data series will contain all the counters relevant for that level.
The input process has to parse the XML and build the data-series name before inserting the new data points (see the example below).
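For illustration, assuming the InfluxDB 1.x line protocol, a single 15-minute data point for one P element might be written like this (measurement name per the scheme above, integer fields, timestamp in nanoseconds):
P.P_ElementID.G.15MIN counter1=123i,counter2=456i 1436000000000000000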
InfluxDB has an SQL-like query language and allows specifying calculations in an SQL-like manner. It also supports grouping. Grouping by element is possible using a regular expression, e.g. SELECT counter1/counter2 FROM /^P\.P_ElementID\.C1\..*G\.15MIN/ to get all children of ElementID.
There is a notion of grouping by time; in general, the product is made for exactly this kind of data.
Pros:
Should be fast
Supports queries very similar to SQL
Supports deleting by date (but you have to do it on every series...)
Flexible schema
Cons:
Currently, it seems not to support clusters very easily
Clusters = more maintenance
Can it support millions of data-series (and still work fast)?
Less common, less documented (currently)