I'm trying to load a Dask dataframe from a SQL connection. Per the read_sql_table documentation, it is necessary to pass in an index_col. What should I do if there's a possibility that there are no good columns to act as index?
Could this be a suitable replacement?
import math

import dask
import dask.dataframe as dd
import pandas as pd

# Break the SQL query into chunks
chunks = []
num_chunks = math.ceil(num_records / chunk_size)

# Run the query for each chunk on Dask workers
for i in range(num_chunks):
    query = 'SELECT * FROM ' + table + ' LIMIT ' + str(i * chunk_size) + ',' + str(chunk_size)
    chunk = dask.delayed(pd.read_sql)(query, sql_uri)
    chunks.append(chunk)

# Aggregate chunks into one Dask dataframe
df = dd.from_delayed(chunks)
dfs[table] = df
Unfortunately, LIMIT/OFFSET is not, in general, a reliable way to partition a query in most SQL implementations. In particular, to reach an offset and fetch later rows, the engine often has to work through all of the earlier rows first, so the total work to generate a set of partitions is greatly magnified. And because row order is not guaranteed without an ORDER BY, in some cases you might even end up with missed or duplicated rows.
This was the reasoning behind requiring boundary values in the dask sql implementation.
However, there is nothing in principle wrong with the way you are setting up your dask dataframe. If you can show that your server does not suffer from the problems we were anticipating, then you are welcome to take that approach.
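For completeness, a minimal sketch of the boundary-value approach, reusing your variables (table, sql_uri, num_records, chunk_size) and assuming the table has a dense, monotonically increasing integer column, here called id, that can be cut into ranges:

import dask
import dask.dataframe as dd
import pandas as pd

def read_range(lo, hi):
    # Each partition is bounded by explicit id values, so the server can seek
    # straight to the range rather than scanning past an OFFSET.
    q = 'SELECT * FROM {} WHERE id >= {} AND id < {}'.format(table, lo, hi)
    return pd.read_sql(q, sql_uri)

# Assumption: id runs densely from 0 to num_records - 1.
edges = list(range(0, num_records + chunk_size, chunk_size))
parts = [dask.delayed(read_range)(lo, hi) for lo, hi in zip(edges[:-1], edges[1:])]
df = dd.from_delayed(parts)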
I have two dataframes. One is a lookup table consisting of key/value pairs. The other is my main dataframe. The main dataframe has many more records than the lookup table. I need to construct a 'key' from existing columns in my main dataframe and then look up a value matching that key in my lookup table. Here they are:
import pandas as pd
from datetime import datetime

lk = pd.DataFrame({'key': ['key10', 'key9'], 'value': [100, 90]})
lk.set_index('key', inplace=True)

date_today = datetime.now()
df = pd.DataFrame({'date1': [date_today, date_today, date_today],
                   'year': [1999, 2001, 2003],
                   'month': [10, 9, 10],
                   'code': [10, 4, 5],
                   'date2': [None, date_today, None],
                   'keyed_value': [0, 0, 0]})
This is how I get a value:
df['constructed'] = "key" + df['month'].astype('str')

def getKeyValue(lk, k):
    return lk.loc[k, 'value']

print(getKeyValue(lk, df['constructed']))
Here are my issues:
1) I don't want to use iteration or apply methods. My actual data is over 2 million rows and 200 columns, and apply was really slow (over 2 minutes). So I opted for an inner join, hence the need to create a new 'constructed' column, which I drop again after the join. The join brought execution down to 48 seconds, but there has to be a faster way (I am hoping).
2) How do I vectorize this? I don't know how to even approach it. Is it even possible? I tried this but just got an error:
df['keyed_values'] = getKeyValue(lk, df['constructed'])
Any help or pointers are much appreciated.
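For context, a fully vectorized form of this lookup is possible without apply or a join; a sketch using the columns above (Series.map with a Series argument matches each constructed key against lk's index):

# Build the key on the fly and map it against the lookup Series.
# This is the vectorized equivalent of lk.loc[key, 'value'] for every row.
df['keyed_value'] = ("key" + df['month'].astype(str)).map(lk['value'])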
I am trying to load data from a Cassandra database into a Dask dataframe. I have tried querying the following with no success:
query="""SELECT * FROM document_table"""
df = man.session.execute(query)
df = dd.DataFrame(list(df))
TypeError Traceback (most recent call last)
<ipython-input-135-021507f6f2ab> in <module>()
----> 1 a = dd.DataFrame(list(df))
TypeError: __init__() missing 3 required positional arguments: 'name', 'meta', and 'divisions'
Does anybody know an easy way to load data directly from Cassandra into Dask? It is too much data to load into pandas first.
Some problems with your code:
the line df = man.session.execute(query) presumably loads the whole data-set into memory; Dask is not invoked here and plays no part in this (someone with knowledge of the Cassandra driver can confirm).
list(df) produces a list of the column names of a dataframe and drops all the data
dd.DataFrame, if you read the docs, is not constructed like this.
What you probably want to do is a) make a function that returns one partition of the data, b) delay this function and call it with the various values of the partitions, and c) use dd.from_delayed to make the dask dataframe. E.g., assuming the table has a field partfield which handily has possible values 1..6 and a similar number of rows for each partition:
import dask
import dask.dataframe as dd
import pandas as pd

@dask.delayed
def part(x):
    session = ...  # construct the Cassandra session here
    q = "SELECT * FROM document_table WHERE partfield={}".format(x)
    df = session.execute(q)
    return pd.DataFrame(list(df))
parts = [part(x) for x in range(1, 7)]
df = dd.from_delayed(parts)
I am executing one SQL statement in Informix Data Studio 12.1. It takes around 50 to 60 ms to execute (for a one-day date range).
SELECT
sum( (esrt.service_price) * (esrt.confirmed_qty + esrt.pharmacy_confirm_quantity) ) AS net_amount
FROM
episode_service_rendered_tbl esrt,
patient_details_tbl pdt,
episode_details_tbl edt,
ms_mat_service_header_sp_tbl mmshst
WHERE
esrt.patient_id = pdt.patient_id
AND edt.patient_id = pdt.patient_id
AND esrt.episode_id = edt.episode_id
AND mmshst.material_service_sp_id = esrt.material_service_sp_id
AND mmshst.bill_heads_id = 1
AND esrt.delete_flag = 1
AND esrt.customer_sp_code != '0110000006'
AND pdt.patient_category_id IN(1001,1002,1003,1004,1005,1012,1013)
AND edt.episode_type ='ipd'
AND esrt.generated_date BETWEEN '2017-06-04' AND '2017-06-04';
When I try to execute the same query by creating a function, it takes around 35 to 40 seconds.
Please find the code below.
CREATE FUNCTION sb_pharmacy_account_summary_report_test1(START_DATE DATE,END_DATE DATE)
RETURNING VARCHAR(100),DECIMAL(10,2);
DEFINE v_sale_credit_amt DECIMAL(10,2);
BEGIN
SELECT
sum( (esrt.service_price) * (esrt.confirmed_qty +
esrt.pharmacy_confirm_quantity) ) AS net_amount
INTO
v_sale_credit_amt
FROM
episode_service_rendered_tbl esrt,
patient_details_tbl pdt,
episode_details_tbl edt,
ms_mat_service_header_sp_tbl mmshst
WHERE
esrt.patient_id = pdt.patient_id
AND edt.patient_id = pdt.patient_id
AND esrt.episode_id = edt.episode_id
AND mmshst.material_service_sp_id = esrt.material_service_sp_id
AND mmshst.bill_heads_id = 1
AND esrt.delete_flag = 1
AND esrt.customer_sp_code != '0110000006'
AND pdt.patient_category_id IN(1001,1002,1003,1004,1005,1012,1013)
AND edt.episode_type ='ipd'
AND esrt.generated_date BETWEEN START_DATE AND END_DATE;
RETURN 'SALE CREDIT', '' WITH RESUME;
RETURN 'IP SB Credit Amount',v_sale_credit_amt;
END
END FUNCTION;
Can someone tell me the reason for this time variation?
...in very easy words:
If you create a function, the SQL is parsed and stored in the database along with some optimization information. When you call the function, the optimizer already knows about the SQL and just executes it. So optimization is done only once, when you create the function.
If you run the SQL directly, the optimizer parses the SQL, optimizes it and then executes it, every time you run it.
This explains the time difference.
I would say the difference in time is due to the parametrized query.
The first SQL has hardcoded date values, while the one in the SPL has parameters. That may cause a different query plan (e.g. which index to follow) to be applied to the query in the SPL than to the one executed from Data Studio.
You can try getting the query plan (using set explain) from the first SQL and then use directives in the SPL to force the engine to use that same path.
Have a look at:
https://www.ibm.com/support/knowledgecenter/SSGU8G_12.1.0/com.ibm.perf.doc/ids_prf_554.htm
It explains how to use optimizer directives to speed up queries.
I have some cypher queries that I execute against my neo4j database. The query is in this form
MATCH p=(j:JOB)-[r:HAS|STARTS]->(s:URL)-[r1:VISITED]->(t:URL)
WHERE j.job_id =5000 and r1.origin='iframe' and r1.job_id=5000 AND NOT (t.netloc =~ 'VERY_LONG_LIST')
RETURN count(r1) AS number_iframes;
If you can't understand what I am doing, here is a much simpler query:
MATCH (s:WORD)
WHERE NOT (s.text=~"badword1|badword2|badword3")
RETURN s
I am basically trying to match some words against a specific list.
The problem is that this list is very large. As you can see, my job_id=5000 and I have more than 20000 jobs, so if my whitelist is 1 MB long then I will end up with very large queries. I tried 500 jobs and ended up with a 200 MB query file.
I was trying to execute these queries using transactions from py2neo, but this won't be feasible because my POST request will be very large and it will time out. As a result, I thought of using
neo4j-shell -file <queries_file>
However, as you can see, the file size is very large because of the large whitelist. So my question is: is there any way I can store this "whitelist" in a variable in neo4j using cypher?
I wish there were something similar to this:
SAVE $whitelist="word1,word2,word3,word4,word5...."
MATCH p=(j:JOB)-[r:HAS|STARTS]->(s:URL)-[r1:VISITED]->(t:URL)
WHERE j.job_id =5000 and r1.origin='iframe' and r1.job_id=5000 AND NOT (t.netloc =~ $whitelist)
RETURN count(r1) AS number_iframes;
What datatype is your netloc?
If you have an index on netloc, you can also use t.netloc IN {list}, where {list} is a parameter provided from the outside (see the sketch after these suggestions).
Such large regular expressions will not be fast.
What exactly are your regexp and netloc formats like? Perhaps you can change that into a split + index-list lookup?
In general, you can also provide an outside parameter for regexps.
You can also use "IN" + an index for job_ids.
You can also run a separate job that tags the jobs within your whitelist with a label, and use that label for additional filtering, e.g. already in the MATCH.
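To illustrate the parameter approach, here is a rough py2neo-style sketch (assuming py2neo's Graph.run and the older {param} parameter syntax; newer servers use $param, and the exact call may differ by driver version):

from py2neo import Graph

graph = Graph()  # connection details omitted; adjust for your setup

query = """
MATCH p=(j:JOB)-[r:HAS|STARTS]->(s:URL)-[r1:VISITED]->(t:URL)
WHERE j.job_id = {job_id} AND r1.origin = 'iframe' AND r1.job_id = {job_id}
  AND NOT t.netloc IN {whitelist}
RETURN count(r1) AS number_iframes
"""

# The whitelist travels as a query parameter, so the Cypher text stays tiny
# no matter how many entries the list contains.
whitelist = ["word1", "word2", "word3"]
number_iframes = graph.run(query, job_id=5000, whitelist=whitelist).evaluate()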
Why do you have to check this twice? Isn't it enough that the job has id=5000?
j.job_id =5000 and r1.job_id=5000
What's the syntax to get random records from a specific node_auto_index using cypher?
I suppose there is this example:
START x=node:node_auto_index("uname:*") RETURN x SKIP somerandomNumber LIMIT 10;
Is there a better way that won't return a contiguous set?
There is no feature similar to SQL's Random() in neo4j.
You must either declare the random number for the SKIP section before you use cypher (in case you are not querying directly from the console and use some upper language with neo4j)
- this will give a random section of nodes contiguously in a row (a sketch of this is at the end of this answer)
or you must retrieve all the nodes and then do your own randomization in your upper language across these nodes - this will give you a truly random set of nodes.
Or, to approximate a pseudorandom function in cypher, we can try something like this:
START x=node:node_auto_index("uname:*")
WITH x, length(x.uname) AS len
WHERE (id(x) + len) % 3 = 0
RETURN x LIMIT 10
Or make a more sophisticated WHERE part in this query based upon the total number of uname nodes, or on the ASCII value of the uname param, for example.
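For the first option, a rough Python sketch of choosing the random SKIP on the client side (py2neo-style again; whether SKIP accepts a parameter and the exact driver call depend on your Neo4j version, and the result is still a contiguous window):

import random
from py2neo import Graph

graph = Graph()  # connection details omitted; adjust for your setup

# Count the indexed nodes once, then choose the offset in Python,
# since there is no random() to call from inside the query itself.
total = graph.evaluate('START x=node:node_auto_index("uname:*") RETURN count(x)')
offset = random.randint(0, max(total - 10, 0))

records = graph.run(
    'START x=node:node_auto_index("uname:*") RETURN x SKIP {skip} LIMIT 10',
    skip=offset,
)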