I have the following query:
SELECT domain, path, name, title_part, name_part FROM dataset.match_making_name_parts
LEFT JOIN dataset.raw_title_parts ON LOWER(title_part) = LOWER(name_part)
The problem I am trying to solve is matching a list of names (of the assumed owners of specific websites) against names mentioned in the title tags and other such places in the raw HTML of the websites we have crawled, identified by domain and path.
I want to build a map of partial matches so that I can rank the matches according to some heuristics (how many parts of the full name were matched, in how many URIs, and whether there were matches in other domains).
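Concretely, the kind of ranking I want to end up with looks roughly like this (only a sketch; matched_parts and matched_paths are names I made up, and an inner join is enough here since only actual matches get ranked):

-- Sketch of the ranking I am after, per (name, domain).
SELECT
  name,
  domain,
  COUNT(DISTINCT LOWER(name_part)) AS matched_parts,  -- how many parts of the full name matched
  COUNT(DISTINCT path)             AS matched_paths   -- in how many URIs under that domain
FROM dataset.match_making_name_parts
JOIN dataset.raw_title_parts
  ON LOWER(title_part) = LOWER(name_part)
GROUP BY name, domain
ORDER BY matched_parts DESC, matched_paths DESC;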
Example rows:
//raw_title_parts (domain, path, title_part):
cloud.google.com, bigquery/query-plan-explanation, Query
cloud.google.com, bigquery/query-plan-explanation, Plan
cloud.google.com, bigquery/query-plan-explanation, Google
cloud.google.com, bigquery/query-plan-explanation, Cloud
cloud.google.com, bigquery/query-plan-explanation, Platform
//match_making_name_parts (name, name_part):
Google Cloud Platform, Google
Google Cloud Platform, Cloud
Google Cloud Platform, Platform
I assumed that using a JOIN would be easier for the query planner, but it turns out that it attempts to run the query with an insane number of stages; the first two and the last two perform other tasks, but most of the stages are just doing repartitions:
READ $2, $1, $20
FROM __PSRC___PA1_2
WRITE $2, $1, $20
TO __PA1_REPARTITION2
BY HASH($20)
I have some ideas for alternative solutions to my problem, but this Resources Exceeded error comes up quite often and I would like to understand it more deeply.
What would be the best way to solve my problem in BigQuery? How can I better predict which operations are problematic for BigQuery's query planner, and how can I make better use of patterns that avoid these problems?
EDIT: It seems that the problems start escalating relatively late on. With LIMIT 10000000 the query runs in 30 seconds; with LIMIT 1000000000 it no longer runs at all. Considering the numbers, this might be an issue with "memory pagination" that happens behind the scenes (does the result set fit in RAM)?
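One of the alternative ideas mentioned above, sketched out: deduplicate the lower-cased parts on both sides before joining, so that the join (and its repartition stages) shuffles far fewer rows. Table names are the same as above; this is just a sketch, not something I have benchmarked:

-- Sketch: shrink both inputs before the join to reduce the shuffled data.
WITH title_parts AS (
  SELECT DISTINCT domain, path, LOWER(title_part) AS part
  FROM dataset.raw_title_parts
),
name_parts AS (
  SELECT DISTINCT name, LOWER(name_part) AS part
  FROM dataset.match_making_name_parts
)
SELECT n.name, t.domain, t.path, n.part
FROM name_parts AS n
JOIN title_parts AS t ON n.part = t.part;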
Related
I have a decently large graph (1.8 billion nodes and roughly the same number of relationships) where I am performing the following query:
MATCH (n:Article)
WHERE n.id IN $pmids
MATCH (n)-[:HAS_MENTION]->(m:Mention)
WITH n, collect(m) as mentions
RETURN n.id as pmid, mentions
ORDER BY pmid
where $pmids is a list of strings, e.g. ["1234", "4567"], whose length varies from 100 to 500.
I am currently holding the data in a Neo4j Docker community instance with the following conf modifications: NEO4J_dbms_memory_pagecache_size=32G, NEO4J_dbms_memory_heap_max__size=32G. An index has been created on Article.id.
This query has been quite slow to run (roughly 5 seconds) and I would like to optimize it for a faster runtime. Through work I have access to Neo4j Enterprise, so one approach would be to ingest this data into a Neo4j Enterprise instance where I can tweak advanced configuration settings.
In general, does anyone have any tips on how I might improve performance, whether by optimizing the Cypher query itself, increasing workers, or changing other settings in neo4j.conf?
Thanks in advance.
For anyone interested - I posted this question in the Neo4j forums as well, and there have already been some interesting optimization suggestions (especially around the "type hint" to trigger backward indexing, and using pattern comprehension instead of collect()).
Initial thoughts
you are using a string field to store the PMID, but PMIDs are numeric; it might reduce the database size, and possibly perform better, if they were stored as ints (and indexed as ints, and searched as ints)
if the PMID list is usually large, and the server has more than half a dozen cores, it might be worth looking into the APOC parallel Cypher functions
do you really need every property from the Mention nodes? if not, try gathering just what you need (see the sketch after this list)
what is the size of the database in GBs? (some context is required in terms of memory settings), and what did neo4j-admin memrec recommend?
If this is how the db is always used, all the time, an SQL database might be better, and when building that SQL db, collect the mentions into one field (once and done)
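To illustrate the query-level points above, here is a rough sketch (untested; m.name stands in for whichever Mention properties you actually need, and it keeps PMIDs as strings as in your current model):

// Sketch: pattern comprehension instead of collect(), returning only what is needed.
MATCH (n:Article)
WHERE n.id IN $pmids
RETURN n.id AS pmid,
       [ (n)-[:HAS_MENTION]->(m:Mention) | m.name ] AS mentions
ORDER BY pmid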
Note: Go PubMed!
I am working on an import runner for a new graph database.
It needs to work with:
Amazon Neptune - Gremlin implementation, has great infrastructure support in production, but a pain to work with locally, and does not support Cypher. No visualization tool provided.
JanusGraph - easy to work with locally as a Gremlin implementation, but requires heavy investment to support in production, hence using Amazon Neptune. No visualization tool provided.
Neo4j - excellent visualization tool, the Cypher language feels very familiar, and it even works with Gremlin clients, but it requires heavy investment to support in production, and no visualization tool that works with Gremlin implementations appears to be anywhere near as good as the one found in Neo4j.
So I am creating a graph where the Entities (Nodes/Vertices) have multiple Types (Labels), some being orthogonal to each other, as well as multi-dimensional.
For example, an Entity representing an order made online would be labeled as Order, Online, Spend, Transaction.
            | Spend    | Chargeback
------------+----------+-----------
Transaction | Purchase | Refund
Line        | Sale     | Return
Zooming into the Spend column.
         | Online     | Instore
---------+------------+----------------
Purchase | Order      | InstorePurchase
Sale     | OnlineSale | InstoreSale
In Neo4j and its Cypher query language, this proves to be very powerful for creating Relationships/Edges across multiple types without explicitly knowing what transaction_id values are in the graph:
MATCH (a:Transaction), (b:Line)
WHERE a.transaction_id = b.transaction_id
MERGE (a)<-[edge:TRANSACTED_IN]-(b)
RETURN count(edge);
The problem is, Gremlin/TinkerPop does not natively support multiple labels for its vertices.
Server implementations like AWS Neptune support this using a delimiter, e.g. Order::Online::Spend::Transaction, and the Gremlin client supports it against a Neo4j server, but I haven't been able to find an example where this works for JanusGraph.
Ultimately, I need to be able to run a Gremlin query equivalent to the Cypher one above:
g
.V().hasLabel("Line").as("b")
.V().hasLabel("Transaction").as("a")
.where("b", eq("a")).by("transaction_id")
.addE("TRANSACTED_IN").from("b").to("a")
So there are multiple questions here:
Is there a way to make JanusGraph accept multiple vertex labels?
If not possible, or this is not the best approach, should there be an additional vertex property containing a list of labels?
In the case of option 2, should the label name be the high-level label (Transaction) or the low-level label (Order)?
Is there a way to make JanusGraph accept multiple vertex labels?
No, there is not a way to have multiple vertex labels in JanusGraph.
If not possible, or this is not the best approach, should there be an additional vertex property containing a list of labels?
In the case of option 2, should the label name be the high-level label (Transaction) or the low-level label (Order)?
I'll answer these two together. Based on what you have described above, I would create a single label, probably named Transaction, with different properties associated with it, such as Location (Online or InStore) and Type (Purchase, Refund, Return, Chargeback, etc.). Looking at how you describe the problem above, you are really talking about only a single entity, a Transaction, where all the other items you are using as labels (Online/InStore, Spend/Refund) are really just additional metadata about how that Transaction occurred. As such, this approach allows simple filtering on one or more of these attributes to achieve anything that could be done with the multiple labels you are using in Neo4j.
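For example, under that single-label model the extra labels become plain property filters, along the lines of this sketch (property names and example values are illustrative only, not tested against Neptune or JanusGraph):

// Sketch: one Transaction label, the former labels stored as properties.
g.addV("Transaction").
  property("transaction_id", "t-1").   // hypothetical id value
  property("Location", "Online").
  property("Type", "Purchase")

// Filtering on those properties replaces matching on multiple labels:
g.V().hasLabel("Transaction").
  has("Location", "Online").
  has("Type", "Purchase").
  values("transaction_id")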
I have been running a production website for 4 years with Azure SQL.
With the help of the 'Top Slow Request' query from alexsorokoletov on GitHub, I found 1 super slow query according to the Azure query stats.
The one on top is the one that uses a lot of CPU.
When looking at the LINQ query and the execution plans / live stats, I can't find the bottleneck yet.
And here are the live stats:
The join from results to project is not direct; there is a projectsession table in between, not visible in the query, but perhaps under the hood of Entity Framework.
Might I be affected by parameter sniffing? Can I reset a hash? Maybe the optimized query plan was created in 2014, and now that the result table has about 4 million rows the query is far from optimal?
If I run this query in Management Studio, it's very fast!
Is it just the stats that are wrong?
Regards
Vincent - The Netherlands.
I would suggest you try adding OPTION (HASH JOIN) at the end of the query, if possible. Once you start getting into large arity, a loops join is not particularly efficient. That would prove out whether there is a more efficient plan (likely yes).
Without seeing more of the details (your screenshots are helpful, but they cut off whether auto-parameterization or forced parameterization has kicked in and auto-parameterized your query), it is hard to confirm or deny this explicitly. You can read more about parameter sniffing in a blog post I wrote a bit longer ago than I care to admit ;):
https://blogs.msdn.microsoft.com/queryoptteam/2006/03/31/i-smell-a-parameter/
Ultimately, if you update stats, run DBCC FREEPROCCACHE, or otherwise cause this plan to recompile, your odds of getting a faster plan in the cache are higher if this particular query + parameter values are executed often enough to be sniffed during plan compilation. Your other option is to add an OPTIMIZE FOR UNKNOWN hint, which will disable sniffing and direct the optimizer to use an average value for the frequency of any filters over parameter values. This will likely encourage more hash or merge joins instead of loops joins, since the cardinality estimates of the operators in the tree will likely increase.
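As a rough illustration of where those hints would go (table, column, and parameter names here are placeholders, not your actual query):

-- Hypothetical shape only: test whether a hash-join plan is faster.
DECLARE @since datetime = '2018-01-01';  -- in the real query this would be a sniffed parameter

SELECT r.Id, p.Name
FROM dbo.Results AS r
JOIN dbo.ProjectSessions AS ps ON ps.Id = r.ProjectSessionId
JOIN dbo.Projects AS p ON p.Id = ps.ProjectId
WHERE r.CreatedAt >= @since
OPTION (HASH JOIN);   -- forces hash joins for this statement

-- Alternatively, to disable sniffing for this statement:
-- OPTION (OPTIMIZE FOR UNKNOWN);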
Admittedly, most of my database experience is relational. One of the tenets in that space is to avoid moving data over the network. This manifests by using something like:
select * from person order by last_name limit 10
which will presumably order and limit within the database engine vs using something like:
select * from person
and subsequently ordering and taking the top 10 at the client, which could have disastrous effects if there are a million person records.
So, with Gremlin (from Groovy), if I do something like:
g.V().has('#class', 'Person').order{println('!'); it.a.last_name <=> it.b.last_name}[0..9]
I am seeing the ! printed, so I am assuming that this brings all Person records into the address space of my client prior to the order and limit steps, which is not the desired effect.
Do my options for processing queries entirely in the database engine become product-specific (e.g. for OrientDB, perhaps submitting the query in their flavor of SQL), or is there something about Gremlin that I am missing?
If you want the implementer's query optimizer to kick in, you need to use as many Gremlin steps as possible and avoid pure Groovy/in-memory processing of your graph traversals.
You're most likely looking for something like this (as of TinkerPop v3.2.0):
g.V().has('#class', 'Person').order().by('last_name', incr).limit(10)
If you find yourself using lambdas, chances are often high that this could be done with pure Gremlin steps. Favor Gremlin steps over lambdas.
See TinkerPop v3.2.0 documentation:
Order By step
Limit step
I want to use the Twitter search API to search for a famous person. For instance, I want to search for a particular "Mr Patrick Lee C K". I would construct my search term to be something like:
http://search.twitter.com/search?q=%22lee+c+k%22+OR+%22patrick+lee%22
However, knowing that tweets are often informal, I know that people sometimes address him by his initials, 'lck'. To increase the precision of my search, I figure it would be better if my query also associated him with his company; for instance, my query could also be lck microsoft.
Now, I want to string these 3 search terms ("patrick lee" / "lee c k" / lck microsoft) together in one query. I will probably use OR. Then again, my last search term should not be a fixed phrase, i.e. the words lck and microsoft can be some distance from each other.
Can anyone tell me how I should link these search terms together in one query?
Stringing your queries together with "OR" is indeed the way to do this: search for "patrick lee" OR "lee c k" OR lck microsoft. Note that proximity queries are not supported by the Twitter Search API.
There are a few reasons why: 1) the search query only counts as one request against your API limit, despite being a fairly expensive query, and 2) even though you can't really do a proximity query for lck microsoft, tweets are only 140 characters, so chances are those terms would be fairly close to each other regardless. In fact, eliminating the quotes around "lee c k" might actually raise your recall without compromising too much precision.
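URL-encoded against the same endpoint as in your question, that suggested query would look something like:
http://search.twitter.com/search?q=%22patrick+lee%22+OR+%22lee+c+k%22+OR+lck+microsoft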
The Twitter Advanced Search page shows the full list of features that you can use in your search query.