Neo4j 2.0 + Gremlin + GraphML label support

I am using Neo4j 2.0 to store a lot of data. The massive amount of data is generated by a Ruby script, saved in a GraphML file, and then imported into Neo4j using Gremlin:
g.loadGraphML('graphml.xml')
With Neo4j 2.0 there is cool new support for labels on nodes, which I would like to take advantage of. Is it possible to specify which labels a node should have this way? Or do I really have to run queries afterwards over all nodes to set their labels?
Thanks

I do not believe there is a way, at least within Blueprints, the interface that Gremlin is built upon. I don't see a way to add a label to a Vertex, nor do I see anything like that in GraphML.
Perhaps Neo4j will update their code to expose the label(s) as a property on a Vertex in Blueprints, but currently there is no way to get/set labels using Gremlin/TinkerPop.
It should also be noted that Blueprints only supports stable versions of Neo4j as far as we know, so a milestone release like 2.0 wouldn't be fully supported in Blueprints yet.

If you're using Neo4j version 2 you can set a label by getting the underlying Neo4j Node from the Blueprints Vertex. Note that this breaks encapsulation and adds a dependency on Neo4j, but that may be unavoidable. Also, due to issues with the latest version of Blueprints I'm still not able to use this code properly, but this is how it'd work:
import com.tinkerpop.blueprints.Vertex;
import com.tinkerpop.blueprints.impls.neo4j.Neo4jGraph;
import com.tinkerpop.blueprints.impls.neo4j.Neo4jVertex;
import org.neo4j.graphdb.DynamicLabel;
import org.neo4j.graphdb.Node;
// ...
Vertex vertex = graph.addVertex(null);
// unwrap the Blueprints vertex to reach the underlying Neo4j node
Neo4jVertex neo4jVertex = (Neo4jVertex) vertex;
Node node = neo4jVertex.getRawVertex();
// Node.addLabel() takes a Label rather than a String, so wrap the name
node.addLabel(DynamicLabel.label("SomeLabel"));
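If you do end up setting labels after the import instead, a bulk-labeling Cypher query could look like the sketch below. It assumes your Ruby script wrote the intended label into a type property on each node; adjust to however your export encodes it.
// hypothetical: promote a 'type' property written at export time into a label
MATCH (n)
WHERE n.type = 'Person'
SET n:Person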

Related

How to obtain Node2Vec vectors for all of the nodes?

I have tried nodevectors and fastnode2vec, but I cannot get vectors for all nodes. Why?
For example, the code is:
from fastnode2vec import Graph, Node2Vec
graph = Graph(_lst, directed=True, weighted=True)
model = Node2Vec(graph, dim=300, walk_length=100, context=10, p=2.0, q=0.5, workers=-1)
model.train(epochs=epochs)
I have 10,000 nodes. When I check:
model.index_to_key
there are only 502 nodes.
Why is that?
How to set parameters so I can get the vectors of all nodes?
It's possible that your settings are not generating enough appearances of all nodes to meet other requirements for inclusion, such as the min_count=5 default used by the Word2Vec superclass to discard tokens with too few example usages to model well.
See this other answer for considerations and possible fixes (though in the context of the nodevectors package rather than the fastnode2vec package you're using):
nodevectors not returning all nodes
If that doesn't help resolve your issue, you should include more details about your graph, such as demonstrating via displayed output that it really has 10,000 nodes, that they're all sufficiently connected, and that the random walks generated by your node2vec library sufficiently revisit all of them for training purposes.
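For example, if fastnode2vec forwards extra keyword arguments to its gensim Word2Vec superclass (an assumption worth verifying against your installed version), you could lower min_count so rarely-visited nodes are kept:
from fastnode2vec import Graph, Node2Vec

# hypothetical toy edge list; substitute your real data
graph = Graph([("a", "b", 1.0), ("b", "c", 1.0), ("c", "a", 1.0)],
              directed=True, weighted=True)

# min_count=1 asks the Word2Vec superclass to keep every node that
# appears at least once in the generated walks (the default is 5)
model = Node2Vec(graph, dim=300, walk_length=100, context=10,
                 p=2.0, q=0.5, workers=4, min_count=1)
model.train(epochs=10)

print(len(model.wv.index_to_key))  # kept vocabulary (gensim 4.x attribute)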

How to access "key" in combine.perKey in beam

In How to create custom Combine.PerKey in beam sdk 2.0, I asked and got a correct answer on how to create a custom Combine.PerKey in the new Beam SDK 2.0. However, I now need to create a custom Combine.PerKey such that, within my custom combine logic, I can access the contents of the key. This was easily possible in Dataflow 1.x, but I'm unsure how to do it in the new Beam SDK 2.0. Any little code snippet/example would be extremely useful.
EDIT #1 (per Ben Chambers's request)
The real use case is hard to explain, but I'm going to try:
We have a 3d space composed of millions of little hills. We try to determine the apex of these millions of hills as follows: we create billions of "rectangular probes" for the whole 3d space, and then we ask each of these billions of probes to "move" in a greedy way to the apex. Once it hits the apex, it stops. The probe then returns the apex and itself. The apex is the KEY for which we'll do a custom combine by key.
Now, the custom combine function is going to finally return a final object (called a feature) which is derived from all the probes that reached the same apex (i.e. the same key). When generating this "feature" object, we need to know information about the final apex/key (i.e. the top of the hill). Hence, I need this key info.
One way to solve this is to use a group by key, but that was slow (at least in Dataflow 1.x); we got it to be fast (in Dataflow 1.x) using a custom combine fn. So, we'd like the key. That said, GroupByKey does work in Beam SDK 2.0.
Alternatively, we could stick the "apex" information into the "probe" objects themselves, but this means each of our billions of probe objects would roughly triple in size just to hold this apex information (and the apex information repeats itself, since there are only say 1 million apexes but 1 billion probes, so this intuitively feels highly inefficient).
Rather than relying on the CombineFn to compute the entire result, could you instead have the CombineFn compute some partial result based only on information about the probes? Then your Combine.perKey(...) returns a PCollection<KV<Apex, InfoAboutProbes>> and you can use a ParDo to combine the information about the apex with the summary information about the probes. This allows you to use the CombineFn for efficiently combining information about many probes, while using a ParDo to access the key.
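A minimal sketch of that pattern might look like the following; the Apex, Probe, ProbeSummary, and Feature types, the ProbeSummaryFn combiner, and the buildFeature helper are hypothetical placeholders rather than anything from the original answer.
import org.apache.beam.sdk.transforms.Combine;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
// ...
PCollection<KV<Apex, Probe>> probes = ...; // from earlier in the pipeline

// combine the probes per apex into a compact summary; the CombineFn
// never needs to see the key
PCollection<KV<Apex, ProbeSummary>> summaries =
    probes.apply(Combine.<Apex, Probe, ProbeSummary>perKey(new ProbeSummaryFn()));

// a ParDo *can* see the key, so build the final feature here
PCollection<Feature> features = summaries.apply(
    ParDo.of(new DoFn<KV<Apex, ProbeSummary>, Feature>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        Apex apex = c.element().getKey();       // key is accessible here
        ProbeSummary summary = c.element().getValue();
        c.output(buildFeature(apex, summary));  // hypothetical helper
      }
    }));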

OSM - Boundary Export from XML file

I've been attempting to export boundary information from an OSM file. My process is nearly there; however, I have an issue with the polygon I'm generating drawing random lines.
I would appreciate some insight on where I may be going wrong.
Step 1: Export the OSM data into XML
osmfilter -v greater-london-latest.osm --keep="boundary= admin_level= place=" > b.txt
Step 2: Run a script to process the XML (a rough sketch of these steps appears below).
cycle each relation node
load the member ways
load the nodes from each specified way
record the lat/lon and build a poly set
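In Python the processing might look roughly like this (a sketch that assumes the osmfilter output in b.txt is well-formed OSM XML with the standard node/way/relation elements):
import xml.etree.ElementTree as ET

root = ET.parse("b.txt").getroot()

# index all nodes and ways by id first
nodes = {n.get("id"): (float(n.get("lat")), float(n.get("lon")))
         for n in root.iter("node")}
ways = {w.get("id"): [nd.get("ref") for nd in w.iter("nd")]
        for w in root.iter("way")}

polys = {}
for rel in root.iter("relation"):            # cycle each relation
    coords = []
    for member in rel.iter("member"):        # load the member ways
        if member.get("type") == "way":
            for ref in ways.get(member.get("ref"), []):  # nodes of each way
                if ref in nodes:
                    coords.append(nodes[ref])            # record the lat/lon
    polys[rel.get("id")] = coords            # build the poly set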
This produces a series of lat/lon points which, when I build them into a polygon, give the correct overall shape I'm looking for. However, there are issues with the connecting lines, I assume.
My polygon output
I'm actually looking for this, which is similar, but I'm obviously missing something.
Actual poly I'm looking to generate
Again, thanks for any help.
Ways in relations are not necessarily sorted. See answers to this question on how to sort ways, especially the answer by user geocodezip.
Alternatively you can make use of various tools/libraries to do the sorting for you. Unfortunately I can't point you directly to one, but there are various tools capable of sorting relation members, including the OSM website itself, JOSM, overpass turbo (I guess), some JS libraries, [...].
Maybe some other user can help out with pointing to some good examples?
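In the meantime, to illustrate what the sorting involves, here is a naive sketch that chains ways end-to-end by matching endpoint node ids. It assumes the ways form a single closed ring and ignores complications such as multiple rings or holes:
def order_ways(ways):
    # naive chaining: repeatedly append the way whose endpoint matches
    # the current end of the ring, reversing a way where necessary
    remaining = [list(w) for w in ways]
    ring = remaining.pop(0)
    while remaining:
        for i, way in enumerate(remaining):
            if way[0] == ring[-1]:        # way continues the ring
                ring.extend(way[1:])
            elif way[-1] == ring[-1]:     # way is stored reversed
                ring.extend(list(reversed(way))[1:])
            else:
                continue
            remaining.pop(i)
            break
        else:
            raise ValueError("ways do not form one connected ring")
    return ring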

Poor performance of Neo4j Cypher query for transitive closure

I have a graph with ~89K nodes and ~1.2M relationships, and am trying to get the transitive closure of a single node via the following Cypher query:
start n=NODE(<id of a single node of interest>)
match (n)-[*1..]->(m)
where has(m.name)
return distinct m.name
Unfortunately, this query goes away and doesn't seem to come back (although to be fair I've only given it about an hour of execution time at this point).
Any suggestions on ways to optimise what I've got here, or better ways to achieve the requirement?
Notes:
Neo4j v2.0.0 (installed via Homebrew).
Mac OSX 10.8.5
Oracle Java 1.7.0_51
8GB physical RAM (neo4j JVM assigned whatever the default is)
Database is hosted on an SSD volume.
Query is submitted via the admin web UI's "Data browser".
"name" is an auto-indexed field.
CPU usage is fairly low - averaging around 20% of 8 cores.
I haven't gotten into the weeds of profiling the Neo4j server yet - my first attempt locked up VisualVM.
That's probably a combinatorial explosion of paths; care to try this?
start n=NODE(<id of a single node of interest>),m=node:node_auto_index("name:*")
match shortestPath((n)-[*]->(m))
return m.name
Without shortestPath it would look like the query below, but as you are only interested in the nodes reachable from n, the above should be good enough.
start n=NODE(<id of a single node of interest>),m=node:node_auto_index("name:*")
match (n)-[*]->(m)
return distinct m.name
Try guery - https://code.google.com/p/gueryframework/ - this is a standalone library, but it has a Neo4j adapter. I.e., you will have to rewrite your queries in the guery format.
Better support for transitive closure was one of the main reasons for developing guery. We mainly use this in software analysis tools where we need reachability / pattern analysis (e.g., the antipattern queries in http://xplrarc.massey.ac.nz/ are computed using guery).
There is a brief discussion about this in the neo4j google group:
https://groups.google.com/forum/#!searchin/neo4j/jens/neo4j/n69ksEJxDtQ/29DNKyWKur4J
and an (older, not maintained) project with some benchmarking code:
https://code.google.com/p/graph-query-benchmarks/
Cheers, Jens

How do I plot benchmark data in a Jenkins matrix project

I have several Jenkins matrix projects where I output benchmark results (i.e. execution times) to a CSV file. I'd like to plot these execution times as a function of the build number, so I can see whether my projects are regressing over time.
I can confirm the Plot Plugin is a correct and quite useful approach. BTW, it supports CSV as well: plot configuration example
I've been using it for several years without any problem. Benchmark results were generated as a property file. The benchmark id (series id) was used as the key and the result as the value. One build produces one result for each benchmark. With that data it is quite easy to create a plot configuration and track performance.
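For example, in the plugin's CSV mode the data file carries a header row of series labels and a row of values written by each build, something like the following (benchmark names invented for illustration; check the plugin's data-series settings for the exact format it expects). The plugin then records one data point per series per build:
sort_large_input,parse_config,end_to_end_pipeline
12.47,0.83,94.02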
This may help you:
https://wiki.jenkins-ci.org/display/JENKINS/Plot+Plugin
It adds plotting capabilities to Jenkins.
