Where statement on an indexed property takes too much time - neo4j

I have a cypher query as follows:
MATCH (u:User {uid:"984172414"})-[ru:EB]->
(c:Co)<-[rf:EB]-(f:User)-[rc:EB]->(cc:Co)
WHERE (cc.uid in ["84161623"]) AND (rc.from IS NOT NULL AND rc.to IS NULL) AND
((
ru.from IS NOT NULL AND
ru.to IS NOT NULL AND
(
(rf.from <= ru.to) OR
(ru.from <= rf.to)
)
) OR (
ru.from IS NOT NULL AND
ru.to IS NULL AND
(
(ru.from <= rf.to) OR
(rf.from IS NOT NULL AND rf.to IS NULL)
)
) OR (
ru.from IS NULL AND
ru.to IS NOT NULL AND
(
(rf.from <= ru.to) OR
(rf.from IS NULL AND rf.to IS NOT NULL)
)
))
RETURN cc.name as coname,
f.name as fname,
cc.uid as cuid,
f.uid as fuid,
labels(f) as flabels,
null as version
LIMIT 20
This takes about 16192 ms to resolve. I have an index on :Co(uid), but it seems it's not being used. If I remove the check cc.uid IN ["84161623"] and run the following query:
MATCH (u:User {uid:"984172414"})-[ru:EB]->
(c:Co)<-[rf:EB]-(f:User)-[rc:EB]->(cc:Co)
WHERE (rc.from IS NOT NULL AND rc.to IS NULL) AND
((
ru.from IS NOT NULL AND
ru.to IS NOT NULL AND
(
(rf.from <= ru.to) OR
(ru.from <= rf.to)
)
) OR (
ru.from IS NOT NULL AND
ru.to IS NULL AND
(
(ru.from <= rf.to) OR
(rf.from IS NOT NULL AND rf.to IS NULL)
)
) OR (
ru.from IS NULL AND
ru.to IS NOT NULL AND
(
(rf.from <= ru.to) OR
(rf.from IS NULL AND rf.to IS NOT NULL)
)
))
RETURN cc.name as coname,
f.name as fname,
cc.uid as cuid,
f.uid as fuid,
labels(f) as flabels,
null as version
LIMIT 20
The query resolves in only 347 ms. I can't figure out what is wrong with the (cc.uid IN ["84161623"]) condition, and why adding it to the query takes 16 seconds to resolve when I already have an index on the uid. Any help will be appreciated.
Update
As suggested by @cybersam I tried making use of USING INDEX, but that results in the following error:
Cannot use index hint in this context. Index hints require using an equality comparison or IN condition in WHERE (either directly or as part of a top-level AND). The comparison cannot be between two property values. Note that the label and property comparison must be specified on a non-optional node

Try using a USING INDEX clause to provide a hint to use that index. The Cypher query planner does not always automatically generate the most efficient plan.
For example, put this between the MATCH and WHERE clauses:
USING INDEX cc:Co(uid)
You may also need additional USING INDEX clauses if there are other indexes. Note, however, that neo4j cannot use indexes in all situations; and even when it can, the resulting query could theoretically be slower due to other changes to the plan. So take a look at the resulting profile and test the result to make sure you are happy with it.
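For illustration, here is roughly where the hint would go in the original query (a sketch assuming the index is on :Co(uid); the rest of the WHERE predicate is unchanged from the question):
MATCH (u:User {uid:"984172414"})-[ru:EB]->(c:Co)<-[rf:EB]-(f:User)-[rc:EB]->(cc:Co)
USING INDEX cc:Co(uid)
WHERE cc.uid IN ["84161623"] AND rc.from IS NOT NULL AND rc.to IS NULL
// ... the remaining ru/rf conditions from the question ...
RETURN cc.name AS coname, f.name AS fname, cc.uid AS cuid, f.uid AS fuid, labels(f) AS flabels, null AS version
LIMIT 20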

Sounds like you may need to do some sanity checking.
First of all, is :Co.uid an int or a string? It looks like you're addressing it as if it's a string, yet the values themselves look numeric. If it's an int, you can get rid of the quotes.
Same with :User.uid.
If you've been comparing ints to strings all this time, try fixing that first to see if it solves the problem. If not, you'll want to start profiling and figuring out if/when the query isn't using your index.
Next, try simplifying and profiling the query to see if the indices are actually being used:
PROFILE MATCH (u:User {uid:"984172414"}), (cc:Co)
WHERE (cc.uid in ["84161623"])
RETURN u, cc
If they're both using NodeIndexSeek or NodeUniqueIndexSeek, and if the db hits seem reasonable, you might expand out to your entire path and continue profiling. However, it's worth checking for a performance improvement if you match on the start and end nodes first, like above, then try to do your additional pattern matching. For example:
PROFILE MATCH (u:User {uid:"984172414"}), (cc:Co)
WHERE (cc.uid in ["84161623"])
WITH u, cc
MATCH (u)-[ru:EB]->(c:Co)<-[rf:EB]-(f:User)-[rc:EB]->(cc)
WHERE (rc.from IS NOT NULL AND...

Is there a simpler way to get the argument type list for a snowflake procedure?

I need to transfer ownership of Snowflake procedures post-clone to a new role.
To do this I'm using a procedure which works through all objects from the database.information_schema.xxxx views.
The procedures are problematic, though: SHOW PROCEDURES has a column that shows the procedure signature as argument types only, but the information_schema.procedures view shows the actual parameter name as well as its argument type, which, if passed into a GRANT command, does not work - the grant expects the argument-type signature only, not the parameter names :/
SHOW PROCEDURE ARGUMENTS => PROCEDURE_NAME(VARCHAR) RETURN VARCHAR
INFORMATION_SCHEMA.PROCEDURES.ARGUMENT_SIGNATURE => PROCEDURE_NAME(P_PARAM1 VARCHAR)
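For example (database, schema, and role names here are hypothetical), the grant wants the bare type list:
-- works: argument types only
GRANT OWNERSHIP ON PROCEDURE my_db.my_schema.PROCEDURE_NAME(VARCHAR) TO ROLE new_owner_role;
-- fails: parameter names included, as information_schema reports them
GRANT OWNERSHIP ON PROCEDURE my_db.my_schema.PROCEDURE_NAME(P_PARAM1 VARCHAR) TO ROLE new_owner_role;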
I eventually came up with this, which was fun but feels rather complicated; the question is - have I missed a simpler approach?
SELECT procedure_name
, concat('(',listagg(argtype, '') within group (order by argindex)) cleanArgTypes
FROM (SELECT procedure_name
, argument_signature
, lf.index argindex
, lf.value argtype
FROM rock_dev_test_1.information_schema.procedures
, lateral flatten(input=>split(decode(argument_signature
,'()','( )'
,argument_signature
),' ')
,outer=>true) lf
WHERE lf.index/2 != round(lf.index/2)
)
GROUP BY procedure_name
, argument_signature
ORDER by 1,2;
cleanArgTypes => (VARCHAR)
This takes the over-specific argument_signature, splits it into an array using space as a delimiter, laterally flattens the result into rows, discards the parameter names (always at an even index), then groups by procedure name and argument signature and uses LISTAGG to put the argument types back into a string.
One small wrinkle: () doesn't split, so it has to be shifted to ( ) first.
Whilst I enjoyed dabbling with some of Snowflake's semi-structured capabilities, if there is a simpler approach I'd rather use it!
Mostly the same code, but it doesn't need to be nested; I swapped from grouping on arg_sig (the input) to using the SEQ of the split, but it's mostly the same still:
SELECT p.procedure_name
,'( '|| listagg(split_part(trim(t.value),' ',2), ', ') within group (order by t.index) || ')' as out
FROM information_schema.procedures as p
,table(split_to_table(substring(p.argument_signature, 2,length(p.argument_signature)-2), ',')) t
group by 1, t.seq;
For the toy procedures in my Stack Overflow schema I get:
PROCEDURE_NAME            OUT
DATE_HANDLER              ( DATE)
TODAYS_DELIVERY_AMOUNT    ( VARCHAR)
ABC                       ( TIMESTAMP_NTZ, TIMESTAMP_NTZ, VARCHAR)
ABC_DAILY                 ( )
STRING_HANDLER            ( VARCHAR)
I don't think there's a built-in way to do this. Here's an alternate way:
with A as
(
select PROCEDURE_NAME, split(replace(trim(ARGUMENT_SIGNATURE, '()'), ','), ' ') ARGS
,ARGUMENT_SIGNATURE
from test.information_schema.procedures P
)
select PROCEDURE_NAME
,listagg(VALUE::string, ',') as ARGS
from A, table(flatten(ARGS))
where index % 2 = 1
group by PROCEDURE_NAME
;
You could also use a result scan after the show command to get the name of the procedure and argument signature in a single string:
show procedures;
select split("arguments", ')')[0] || ')' as SIGNATURE from table(result_scan(last_query_id()));
I wrote this to pull the list of procedures from the information schema with a properly formatted argument signature, using a combination of SPLIT to break the string up, LATERAL FLATTEN to put each value on a separate row, a filter on INDEX to keep only the data types, and LISTAGG to re-group. No subquery needed either.
SELECT PROCEDURE_SCHEMA || '.' || PROCEDURE_NAME || '(' ||
REPLACE(LISTAGG(C.Value,' ') WITHIN GROUP (ORDER BY C.INDEX),'(','') AS "Procedure"
FROM INFORMATION_SCHEMA.PROCEDURES,
LATERAL FLATTEN (INPUT => SPLIT(ARGUMENT_SIGNATURE,' ')) C
WHERE INDEX % 2 = 1 OR ARGUMENT_SIGNATURE = '()'
GROUP BY PROCEDURE_SCHEMA, PROCEDURE_NAME, ARGUMENT_SIGNATURE

join / union in presto to keep email in one column

I'm trying to join two tables together in Presto:
select o.email
, o.user_id
, c.email
, c.sessions
from datasource o
full join datasource2 c
on o.email = c.email
this yields:
email              user_id  email            sessions
jeff#sessions.com  123      NULL             NULL
mike#berkley.com   987      NULL             NULL
jared#swiss.com    384      jared#swiss.com  14
steph#berk.com     333      NULL             NULL
NULL               NULL     lisa#hart.com    12
The problem with this is that I want to do multiple joins on multiple data sources using email. The only workaround I can think of is to use this as a subquery, create a new column that takes one email and, if that is null, takes the other, then perform the full join on datasource3, and rinse and repeat.
You want to use COALESCE, which will choose the non-null of the two values.
COALESCE is very useful for a lot of things. It can take more than two arguments and will return the first non-NULL value it gets. If all of them are NULL, it simply returns NULL.
SELECT
COALESCE(o.email, c.email) AS email
, o.user_id
, c.sessions
FROM datasource o
FULL JOIN datasource2 c
ON o.email = c.email
For the official documentation on COALESCE see here:
https://prestodb.io/docs/current/functions/conditional.html
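Since you mention rinse-and-repeating for datasource3, note that the joins chain without a subquery; each successive ON just coalesces across everything joined so far. A sketch (datasource3 and its sessions column are hypothetical names for illustration):
SELECT
COALESCE(o.email, c.email, d.email) AS email
, o.user_id
, c.sessions
, d.sessions AS d_sessions
FROM datasource o
FULL JOIN datasource2 c
ON o.email = c.email
FULL JOIN datasource3 d
ON COALESCE(o.email, c.email) = d.email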

MDX join with the same dimension

I'm writing some MDX to join a dimension to itself based on two different periods to get a common list, then do a count against this list for both.
In short, I need to
get a list of Student.UniqueIds for Period1 which have a flag (IsValid) set that isn't set within the Period2 data
get a full list of Students for Period2
join the two lists and produce two records (one for each period) with the same count (these counts will be used for calculated member calculations within each period)
I have tried doing it via a subselect and an exists clause with filter:
SELECT
{
[Measures].[FactStudentCount]
} on COLUMNS,
{ NONEMPTY
(
[TestEvent].[TestEvents].[Name].ALLMEMBERS
* [TestEvent].[PeriodName].[PeriodName].ALLMEMBERS
)
} ON ROWS
FROM ( SELECT ( {
exists
(
filter([Student].[UniqueId].[UniqueId].MEMBERS
,([TestEvent].[Key].&[Period1], [IsValid].[Code].&[Yes]))
,
filter([Student].[UniqueId].[UniqueId].MEMBERS
,[TestEvent].[Key].&[Period2])
)
}) ON COLUMNS
FROM [MyCube])
...however this doesn't give the correct result
(To obtain context) I have also tried similar exists/filter within a where clause
SELECT
{
[Measures].[FactStudentCount]
} on COLUMNS,
{ NONEMPTY
(
[TestEvent].[TestEvents].[Name].ALLMEMBERS
* [TestEvent].[PeriodName].[PeriodName].ALLMEMBERS
)
} ON ROWS
FROM [MyCube]
where (
exists
(
filter([Student].[UniqueId].[UniqueId].MEMBERS
,([TestEvent].[Key].&[Period1], [IsValid].[Code].&[Yes]))
,
filter([Student].[UniqueId].[UniqueId].MEMBERS
,[TestEvent].[Key].&[Period2])
)
)
...however again this doesn't produce the correct result
I have tried tweaking the filter statements (within the exists) to something like
(filter(existing([Student].[UniqueId].[UniqueId].allmembers),[TestEvent].[Key].CurrentMember.MemberValue = 'Period1'), [IsValid].[Code].&[Yes])
,
(filter(existing([Student].[UniqueId].[UniqueId].allmembers),[TestEvent].[Key].CurrentMember.MemberValue = 'Period2'))
...however this only returns one row (for Period1) - that said it is the correct total
I have also tried a CrossJoin with NonEmpty; however, it fails because the fields come from the same hierarchy, with the message "The Key hierarchy is used more than once in the Crossjoin function".
Does anyone have any insight into how to resolve the above scenario?
This is what I did:
NonEmpty(
NonEmpty(
{([Student].[UniqueId].[UniqueId].members)},{([TestEvent].[Key].&[Period1], [IsValid].[Code].&[Yes])}
)
,
{([Student].[UniqueId].[UniqueId].members,[TestEvent].[Key].&[Period2])}
)
This gets all Period1 elements with IsValid = 'Yes', then 'left joins' them with records in Period2.

PSQL group by vs. aggregate speed

So, the general question is: what's faster, taking an aggregate of a field, or having extra expressions in the GROUP BY clause? Here are the two queries.
Query 1 (extra expressions in GROUP BY):
SELECT sum(subquery.what_i_want)
FROM (
SELECT table_1.some_id,
(
CASE WHEN some_date_field IS NOT NULL
THEN
FLOOR(((some_date_field - current_date)::numeric / 7) + 1) * MAX(some_other_integer)
ELSE
some_integer * MAX(some_other_integer)
END
) what_i_want
FROM table_1
JOIN table_2 on table_1.some_id = table_2.id
WHERE ((some_date_field IS NOT NULL AND some_date_field > current_date) OR some_integer > 0) -- per the data and what i want, one of these will always be true
GROUP BY some_id_1, some_date_field, some_integer
) subquery
Query 2 (using an aggregate function, which is arbitrary here because each record for the table_2 fields in question has the same value in this dataset):
SELECT sum(subquery.what_i_want)
FROM (
SELECT table_1.some_id,
(
CASE WHEN MAX(some_date_field) IS NOT NULL
THEN
FLOOR(((MAX(some_date_field) - current_date)::numeric / 7) + 1) * MAX(some_other_integer)
ELSE
MAX(some_integer) * MAX(some_other_integer)
END
) what_i_want
FROM table_1
JOIN table_2 on table_1.some_id = table_2.id
WHERE ((some_date_field IS NOT NULL AND some_date_field > current_date) OR some_integer > 0) -- per the data and what i want, one of these will always be true
GROUP BY some_id_1
) subquery
As far as I can tell, psql doesn't provide good benchmarking tools. \timing only times one query at a time, so running a benchmark with enough trials for meaningful results is... tedious at best.
For the record, I did do this at about n=50 and saw the aggregate method (Query 2) run faster on average, but with a p-value of ~.13, so not quite conclusive.
'sup with that?
The general answer: they should perform about the same. There's a chance of hitting or missing a function-based index when using or not using functions on a field, though that matters for expressions in the WHERE clause more than in the column list, and aggregate functions don't qualify for such indexes anyway. But this is speculation only.
What you should use for analysis is EXPLAIN ANALYZE. In the plan you see not only the scan types, but also row counts, costs, and the time spent in each operation. And of course you can run it from psql.
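For a single run, prefix either query with EXPLAIN to get the plan with per-node timings and buffer usage; for repeated trials, pgbench (shipped with PostgreSQL) can time a query saved to a file. A sketch, with hypothetical file and database names:
EXPLAIN (ANALYZE, BUFFERS)
SELECT sum(subquery.what_i_want)
FROM ( /* paste the subquery from Query 1 or Query 2 here */ ) subquery;
-- from the shell, 50 timed repetitions of a saved query:
-- pgbench -n -f query1.sql -t 50 mydb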

Why do NULL fields get called equal?

I'm writing triggers to detect changes in fields in a database, and it appears I have to do really obnoxious things like
(SELECT SalesPrice FROM __old) <> (SELECT SalesPrice FROM __new)
or ((SELECT SalesPrice FROM __old) IS NULL and (SELECT SalesPrice FROM __new) IS NOT NULL)
or ((SELECT SalesPrice FROM __old) IS NOT NULL and (SELECT SalesPrice FROM __new) IS NULL)
rather than just
(SELECT SalesPrice FROM __old) <> (SELECT SalesPrice FROM __new)
to accurately detect if a field changed.
Am I missing something, or does Advantage effectively claim that NULL == any value? Is there a good reason for this behavior? Is this some weird thing in the SQL definition? Is there a more succinct way to do this that doesn't need 3 checks in place of one?
This is unfortunately how SQL works with NULL values. NULL is not equal to anything; any comparison with it is UNKNOWN. For example:
somevalue = NULL  -> UNKNOWN
somevalue <> NULL -> UNKNOWN
As a result it will never pass a "true" check
Null Values - Wikipedia
There are a couple of options:
A) Do not allow null values (I recommend combining this with a default value)
B) Use IFNULL to set the field to some value such as
(SELECT IFNULL(SalesPrice, -9999) FROM __OLD) <> (SELECT IFNULL(SalesPrice, -9999) FROM __NEW)
But I don't know if I necessarily like this, since you must pick a sentinel value that can never occur as real data.
In SQL, NULL does not compare to anything except via the IS [NOT] NULL expression. If I understand your question correctly, the problem here is that you need NULL to compare equal to NULL. If that is the case, the check may be simplified to:
( SELECT CASE WHEN n.SalesPrice IS NULL AND o.SalesPrice IS NULL THEN TRUE
              WHEN n.SalesPrice IS NULL OR o.SalesPrice IS NULL THEN FALSE
              ELSE n.SalesPrice = o.SalesPrice END
  FROM __old o, __new n )
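A possibly more succinct option, if the dialect supports it (it is standard SQL, though I'm not sure about Advantage): IS DISTINCT FROM treats two NULLs as equal and a NULL against a value as unequal, collapsing the three checks into one expression:
(SELECT SalesPrice FROM __old) IS DISTINCT FROM (SELECT SalesPrice FROM __new)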
