py2neo: minimizing write-time when creating graph - neo4j

I would like to write a huge graph to neo4j; with my current code it would take slightly less than two months.
I took the data from Kaggle's event recommendation challenge; the user_friends.csv file I am using looks like this:
user,friends
3197468391,1346449342 3873244116 4226080662, ...
I used the py2neo batch facility to produce the code. Is it the best I can do or is there another way to significantly reduce the running time?
Here's the code:
#!/usr/bin/env python
from __future__ import division
from time import time
import sqlite3
from py2neo import neo4j

graph = neo4j.GraphDatabaseService("http://localhost:7474/db/data/")
batch = neo4j.WriteBatch(graph)
people = graph.get_or_create_index(neo4j.Node, "people")
friends = graph.get_or_create_index(neo4j.Relationship, "friends")

con = sqlite3.connect("test.db")
c = con.cursor()
c.execute("SELECT user, friends FROM user_friends LIMIT 2;")

t = time()
for u_f in c:
    u_node = graph.get_or_create_indexed_node("people", 'name', u_f[0])
    for f in u_f[1].split(" "):
        f_node = graph.get_or_create_indexed_node("people", 'name', f)
        if not f_node.is_related_to(u_node, neo4j.Direction.BOTH, "friends"):
            batch.create((u_node, 'friends', f_node))
batch.submit()
print time() - t
Also, I cannot find a way to create an undirected graph using the high-level py2neo facilities. I know Cypher can do this with something like create (node(1) -[:friends]-node(2)).
Thanks in advance.

You should not create connections with Direction.BOTH. Choose one direction when creating them, and then ignore direction (use Direction.BOTH) when traversing - it has no performance impact, but the relationship directions are then deterministic. Cypher does exactly that when you say a--b.
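Following that advice, here is a minimal sketch of the write loop that creates each friendship only once, in a single direction, and flushes the batch in chunks. The seen-pair set and the chunk size are illustrative additions, and the per-node index lookups are reused from the question as they are:

CHUNK_SIZE = 1000                      # arbitrary flush size
seen = set()                           # unordered friend pairs already written
batch = neo4j.WriteBatch(graph)
pending = 0

for u_f in c:
    u_node = graph.get_or_create_indexed_node("people", "name", u_f[0])
    for f in u_f[1].split(" "):
        # Normalize ids to strings so the unordered pair is comparable.
        pair = tuple(sorted((str(u_f[0]), f)))
        if pair in seen:
            continue                   # friendship already created in the other direction
        seen.add(pair)
        f_node = graph.get_or_create_indexed_node("people", "name", f)
        batch.create((u_node, "friends", f_node))   # one direction only
        pending += 1
        if pending >= CHUNK_SIZE:
            batch.submit()             # flush periodically instead of one huge request
            batch = neo4j.WriteBatch(graph)
            pending = 0

if pending:
    batch.submit()

When reading, an undirected Cypher pattern such as MATCH (a)-[:friends]-(b) matches the relationship regardless of the direction it was stored with, so nothing is lost by picking one direction at write time.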

Related

How can I compare protein sequences to find closest match

How could I build a tool to help with this scenario:
I work in a lab where we use plasmids to express recombinant proteins. We have a database containing all the plasmid identifiers and the sequence of the protein that they code for.
When a new protein is requested, I would like to be able to input the new desired protein sequence and search in our database for the plasmid that has the closest match to that sequence, with the highest identity score. The objective is to then use that existing plasmid and use it as a cloning template for the new plasmid.
In other words, I want to build a tool similar to NCBI blast that would work locally with proprietary sequences that are in an SQL database.
Would Python be able to achieve that?
Thanks!
How about creating your own local BLAST database with makeblastdb? Then you can use something like this:
from Bio.Blast.Applications import NcbiblastnCommandline

run_command = NcbiblastnCommandline(
    query=YOUR_SEQUENCE_FASTA_PATH,
    db=DATABASE_PATH,
    out=RESULT_PATH,
    outfmt=5,
    # … other parameters …
    evalue=1e-10,
)
stdout, stderr = run_command()
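For the database side, the proprietary sequences would first be exported from SQL to a FASTA file and indexed with makeblastdb. A rough sketch, assuming for illustration a SQLite table named plasmids with id and protein_seq columns (since these are protein sequences, the database type is prot, and blastp rather than blastn would be the matching search program):

import sqlite3
import subprocess

# Assumed schema and paths; adjust to the real plasmid database.
con = sqlite3.connect("plasmids.db")
with open("plasmids.fasta", "w") as fasta:
    for plasmid_id, protein_seq in con.execute("SELECT id, protein_seq FROM plasmids"):
        fasta.write(">%s\n%s\n" % (plasmid_id, protein_seq))

# Build a local protein BLAST database from the FASTA export.
subprocess.run(
    ["makeblastdb", "-in", "plasmids.fasta", "-dbtype", "prot", "-out", "plasmid_db"],
    check=True,
)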

How to obtain Node2Vec vectors for all of the nodes?

I have tried nodevectors and fastnode2vec, but I cannot get vectors for all nodes. Why?
For example, the code is:
from fastnode2vec import Graph, Node2Vec

graph = Graph(_lst, directed=True, weighted=True)
model = Node2Vec(graph, dim=300, walk_length=100, context=10, p=2.0, q=0.5, workers=-1)
model.train(epochs=epochs)
I have 10,000 nodes. When I check:
model.index_to_key
there are only 502 nodes.
Why is that?
How to set parameters so I can get the vectors of all nodes?
It's possible that your settings are not generating enough appearances of all nodes to meet other requirements for inclusion, such as the min_count=5 used by the related Word2Vec superclass to discard tokens with too few example usages to model well.
See this other answer for related considerations & possible fixes (though in the context of the nodevectors package rather than the fastnode2vec package you're using):
nodevectors not returning all nodes
If that doesn't help resolve your issue, you should include more details about your graph - such as demonstrating via displayed output that it really has 10,000 nodes, & they're all sufficiently connected, & that the random walks generated by your node2vec library sufficiently revisit all of them for training purposes.
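As a first check along those lines, you could compare the node ids in your edge list against the vocabulary the trained model actually kept. A rough sketch, assuming _lst is the list of (source, target, weight) tuples passed to Graph(...) and that fastnode2vec's Node2Vec exposes its learned vectors through the usual Word2Vec-style .wv attribute:

# Nodes present in the input edge list.
graph_nodes = set()
for src, dst, _w in _lst:
    graph_nodes.add(src)
    graph_nodes.add(dst)

# Nodes that survived training (normalize id types, e.g. int vs str, before comparing).
kept = set(model.wv.index_to_key)

missing = graph_nodes - kept
print(len(graph_nodes), "nodes in graph,", len(kept), "kept,", len(missing), "missing")

If the missing nodes simply appear too rarely in the random walks, lowering the Word2Vec min_count (its default is 5) when constructing Node2Vec may retain them, assuming fastnode2vec forwards extra keyword arguments to its Word2Vec superclass.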

Subtrees visualization in dtreeviz

I am new to dtreeviz.
I am struggling with a very deep decision tree that is very difficult to visualize (overfitting is not an issue for my task). I would like to know if there is a way to visualize only some nodes of the tree (e.g., the first 5 nodes).
Thanks!
from dtreeviz.models.xgb_decision_tree import ShadowXGBDTree
from dtreeviz import trees

xgb_shadow = ShadowXGBDTree(
    xgb_model_reg,
    0,
    d.loc[:, d.columns != output_quantitativi[0]],
    d[output_quantitativi[0]],
    d.loc[:, d.columns != output_quantitativi[0]].columns,
    output_quantitativi[0],
)
trees.dtreeviz(xgb_shadow)
For the dtreeviz method, a depth_range_to_display parameter was just added, which allows you to specify the range of tree levels that you want to display.
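With the xgb_shadow object from the question, that would look something like the following (assuming depth_range_to_display takes a (min_level, max_level) tuple of tree levels):

# Display only the first three levels of the tree (levels 0 through 2).
trees.dtreeviz(xgb_shadow, depth_range_to_display=(0, 2))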
For viz_leaf_samples() you can play with min_samples and max_samples values if the tree contains a lot of leaf nodes.

Dask DataFrame MemoryError when calling to_csv

I'm currently using Dask in the following way. There is a list of files on S3 in the following format:
<day1>/filetype1.gz
<day1>/filetype2.gz
<day2>/filetype1.gz
<day2>/filetype2.gz
...etc
My code then:
1. reads all files of filetype1 and builds up a dataframe and sets the index (e.g. df1 = ddf.read_csv(day1/filetype1.gz, blocksize=None, compression='gzip').set_index(index_col));
2. reads through all files of filetype2 and builds up a big dataframe (similar to above);
3. merges the two dataframes together via merged_df = ddf.merge(df1, df2, how='inner', left_index=True, right_index=True);
4. writes the results out to S3 via merged_df.to_csv(<s3_output_location>).
Note: The goal here really is to merge within a particular day (that is, merge filetype1 and filetype2 for a given day), repeat for every day, and store the union of all those joins, but it seemed like doing the join one day at a time would not leverage parallelism, and that letting Dask manage a larger join would be more performant. I thought Dask would manage the larger join in a memory-aware way based on the following line from the docs (https://docs.dask.org/en/latest/dataframe-joins.html):
If enough memory can not be found then Dask will have to read and write data to disk, which may cause other performance costs.
I see that a MemoryError happens in the call to to_csv. I'm guessing this is because to_csv calls compute, which tries to compute the full result of the join, then tries to store that result. The full file contents certainly cannot fit in memory, but I thought (hoped) that Dask would run the computations and store the resulting Dataframe in a memory-aware way. Any guidance or suggestions on how I should be using Dask to accomplish what I am trying to do? Thanks in advance.
> I see that a MemoryError happens in the call to to_csv. I'm guessing this is because to_csv calls compute, which tries to compute the full result of the join, then tries to store that result. The full file contents certainly cannot fit in memory, but I thought (hoped) that Dask would run the computations and store the resulting Dataframe in a memory-aware way
In general Dask does chunk things up and operate in the way that you expect. Doing distributed joins in a low-memory way is hard, but generally doable. I don't know how to help more here without more information, which I appreciate is hard to deliver concisely on Stack Overflow. My usual recommendation is to watch the dashboard closely.
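If you are not already running against the distributed scheduler, a minimal sketch for getting that dashboard on a single machine (the worker count and memory limit below are placeholders):

from dask.distributed import Client

# Creating a Client switches Dask to the distributed scheduler and exposes
# the diagnostic dashboard (by default on port 8787).
client = Client(n_workers=4, memory_limit="4GB")   # placeholder sizing
print(client.dashboard_link)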
> Note: The goal here really is to merge within a particular day (that is, merge filetype1 and filetype2 for a given day), repeat for every day, and store the union of all those joins, but it seemed like doing the join one day at a time would not leverage parallelism, and that letting Dask manage a larger join would be more performant
In general your intuition is correct that giving more work to Dask at once is good. However in this case it looks like you know something about your data that Dask doesn't know. You know that each file only interacts with one other file. In general joins have to be done in a way where all rows of one dataset may interact with all rows of the other, and so Dask's algorithms have to be pretty general here, which can be expensive.
In your situation I would use Pandas along with Dask delayed to do all of your computation at once.
import dask
import pandas as pd

lazy_results = []
for fn in filenames:
    # Read the two file types for this day lazily, then merge them with plain Pandas.
    left = dask.delayed(pd.read_csv)(fn + "type-1.csv.gz")
    right = dask.delayed(pd.read_csv)(fn + "type-2.csv.gz")
    merged = left.merge(right)
    out = merged.to_csv(...)
    lazy_results.append(out)

dask.compute(*lazy_results)

Merging a huge list of dataframes using dask delayed

I have a function which returns a dataframe to me. I am trying to use this function in parallel by using dask.
I append the delayed objects of the dataframes into a list. However, the run-time of my code is the same with and without dask.delayed.
I use the reduce function from functools along with pd.merge to merge my dataframes.
Any suggestions on how to improve the run-time?
The visualized graph and code are as below.
from functools import reduce
from dask import delayed

d = []
for lot in lots:
    lot_data = data[data["LOTID"] == lot]
    trmat = delayed(LOT)(lot, lot_data).transition_matrix(lot)
    d.append(trmat)

df = delayed(reduce)(lambda x, y: x.merge(y, how='outer', on=['from', 'to']), d)
Visualized graph of the operations
General rule: if your data comfortably fits into memory (including the base size times a small factor for possible intermediates), then there is a good chance that Pandas is fast and efficient for your use case.
Specifically for your case, there is a good chance that the tasks you are trying to parallelise do not release Python's internal lock, the GIL, in which case, although you have independent threads, only one can run at a time. The solution would be to use the "distributed" scheduler instead, which can run any mix of threads and processes; however, using processes comes at a cost for moving data between the client and the processes, and you may find that the extra cost dominates any time saving. You would certainly want to ensure that you load the data within the workers rather than passing it from the client.
Short story: you should do some experimentation, measure well, and read the dataframe and distributed-scheduler documentation carefully.
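For instance, a rough sketch of the distributed-scheduler route with the loading pushed into the workers, assuming a local Client and a hypothetical load_lot_data helper that reads a single lot's rows from storage (neither is part of the original code):

from functools import reduce
from dask import delayed
from dask.distributed import Client

client = Client()   # assumption: local machine; starts a local cluster of worker processes

@delayed
def lot_matrix(lot):
    # Hypothetical helper: read only this lot's rows inside the worker,
    # instead of slicing `data` on the client and shipping the subset over.
    lot_data = load_lot_data(lot)
    return LOT(lot, lot_data).transition_matrix(lot)

parts = [lot_matrix(lot) for lot in lots]
merged = delayed(reduce)(lambda x, y: x.merge(y, how='outer', on=['from', 'to']), parts)
result = merged.compute()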
