Neo4j : Difference between Cypher execution and Java API call?

Neo4j : Enterprise version 3.2
I see a tremendous difference between the following two calls in terms of speed. Here are the settings and the query/API.
Page Cache : 16g | Heap : 16g
Number of rows/nodes -> 600K
Cypher code (ignore any syntax errors) | Time taken : 50 sec.
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM 'file:///xyx.csv' AS row
CREATE (n:ObjectTension) SET n = row
From Java (session pool, with 15 sessions at a time as an example):
Thread_1 : Time taken : 8 sec / 10K
Map<String, Object> pList = new HashMap<String, Object>();
Map<String, Object> params = new HashMap<String, Object>();
try (Transaction tx = Driver.session().beginTransaction()) {
    for (int i = 0; i < 10000; i++) {
        pList.put(String.valueOf(i), i * i);
        params.put("props", pList);
        String query = "CREATE (n:Label {props})";
        // String query = "CREATE (n:Label) SET n = {props}";
        tx.run(query, params);
    }
    tx.success();
}
Thread_2 : Time taken is 9 sec / 10K (same code as Thread_1)
...
Thread_3 : Basically the above code is reused; it's just an example.
Thread_N, where N = 600K / 10K
Hence, the overall time taken is around 2~3 minutes.
The questions are the following:
How does the CSV load work internally? Does it open a single session and multiple transactions within it?
Or
Does it create multiple sessions based on the USING PERIODIC COMMIT 10000 parameter, i.e. 600K / 10,000 = 60 sessions? etc.
What's the best way to write via Java?
The idea is to achieve the same write performance via Java as with the CSV load, which loads 12,000 nodes in ~5 seconds, or even better.

Your Java code is doing something very different from your Cypher code, so it really makes no sense to compare processing times.
You should change your Java code to read from the same CSV file. File IO is fairly expensive, but your Java code is not doing any.
Also, whereas your pure Cypher query creates nodes with a fixed (and presumably relatively small) number of properties, your Java pList grows in size with every loop iteration -- so each Java loop creates nodes with anywhere from 1 to 10K properties! This may be the main reason why your Java code is much slower.
[UPDATE 1]
If you want to ignore the performance difference between using and not using a CSV file, the following (untested) code should give you an idea of what similar logic would look like in Java. In this example, the i loop assumes that your CSV file has 10 columns (you should adjust the loop to use the correct column count). Also, this example gives all the nodes the same properties, which is OK as long as you have not created a contrary uniqueness constraint.
Session session = Driver.session();
Map<String, Object> pList = new HashMap<String, Object>();
for (int i = 0; i < 10; i++) {
    pList.put(Integer.toString(i), i * i);
}
Map<String, Object> params = new HashMap<String, Object>();
params.put("props", pList);
String query = "CREATE (n:Label) SET n = {props}";
for (int j = 0; j < 60; j++) {
    try (Transaction tx = session.beginTransaction()) {
        for (int k = 0; k < 10000; k++) {
            tx.run(query, params);
        }
        tx.success();
    }
}
[UPDATE 2 and 3, copied from chat and then fixed]
Since the Cypher planner is able to optimize, the actual internal logic is probably a lot more efficient than the Java code I provided (above). If you want to also optimize your Java code (which may be closer to the code that Cypher actually generates), try the following (untested) code. It sends 10000 rows of data in a single run() call, and uses the UNWIND clause to break it up into individual rows on the server.
Session session = Driver.session();
Map<String, Integer> pList = new HashMap<String, Integer>();
for (int i = 0; i < 10; i++) {
    pList.put(Integer.toString(i), i * i);
}
List<Map<String, Integer>> rows = Collections.nCopies(10000, pList);
Map<String, Object> params = new HashMap<String, Object>();
params.put("rows", rows);
String query = "UNWIND {rows} AS row CREATE (n:Label) SET n = row";
for (int j = 0; j < 60; j++) {
    try (Transaction tx = session.beginTransaction()) {
        tx.run(query, params);
        tx.success();
    }
}
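For completeness, here is a rough, untested sketch of what the Java side could look like if it also reads the same CSV file (as suggested at the top of this answer) and sends it to the server in UNWIND batches of 10K rows. The file name and the 10K batch size come from the question; the assumption that the first CSV line holds the column headers, and the use of Collections.singletonMap, are just illustrative. It assumes the 1.x Java driver and the usual java.io/java.util imports.
Session session = Driver.session();
String query = "UNWIND {rows} AS row CREATE (n:ObjectTension) SET n = row";
try (BufferedReader reader = new BufferedReader(new FileReader("xyx.csv"))) {
    String[] headers = reader.readLine().split(",");   // first line: column names (assumption)
    List<Map<String, Object>> rows = new ArrayList<>();
    String line;
    while ((line = reader.readLine()) != null) {
        String[] cols = line.split(",");
        Map<String, Object> row = new HashMap<>();
        for (int c = 0; c < headers.length; c++) {
            row.put(headers[c], cols[c]);
        }
        rows.add(row);
        if (rows.size() == 10000) {                     // commit in 10K batches, like PERIODIC COMMIT
            try (Transaction tx = session.beginTransaction()) {
                tx.run(query, Collections.singletonMap("rows", (Object) rows));
                tx.success();
            }
            rows = new ArrayList<>();
        }
    }
    if (!rows.isEmpty()) {                              // flush the final partial batch
        try (Transaction tx = session.beginTransaction()) {
            tx.run(query, Collections.singletonMap("rows", (Object) rows));
            tx.success();
        }
    }
}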

You can also try creating the nodes using the Java API, instead of relying on Cypher:
createNode - http://neo4j.com/docs/java-reference/current/javadocs/org/neo4j/graphdb/GraphDatabaseService.html#createNode-org.neo4j.graphdb.Label...-
setProperty - http://neo4j.com/docs/java-reference/current/javadocs/org/neo4j/graphdb/PropertyContainer.html#setProperty-java.lang.String-java.lang.Object-
Also, as the previous answer mentioned, the props variable has different values in your two cases.
Additionally, notice that on every iteration you are re-parsing the query string (String query = "CREATE (n:Label {props})";), unless it is optimized away by Neo4j itself.
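For illustration, a minimal untested sketch of that approach with the embedded Java API (this assumes an embedded GraphDatabaseService named db is already available, e.g. from GraphDatabaseFactory; note this is org.neo4j.graphdb.Transaction, not the driver's, and the label and property names are just placeholders):
try (Transaction tx = db.beginTx()) {
    for (int i = 0; i < 10000; i++) {
        Node n = db.createNode(Label.label("ObjectTension")); // createNode with a label
        n.setProperty("value", i * i);                        // one setProperty call per column
    }
    tx.success(); // mark the transaction as successful so it commits on close
}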

Related

Script fails when running normally but in debug its fine

I'm developing a Google spreadsheet that automatically requests information from a site; the code is below. The variable 'tokens' is an array consisting of about 60 different 3-letter unique identifiers. The problem I have been getting is that the code keeps failing to request all the information on the site. Instead it falls back (at random) on the validation part and fills the array up with "Error!" strings. Sometimes it's row 5, then 10-12, then 3, then multiple rows, etc. When I run it in debug mode everything's fine; I can't seem to reproduce the problem.
I already tried to place a sleep (100 ms) but that fixed nothing. I also looked at the amount of traffic the API accepts (10 requests per second, 1,200 per minute, 100,000 per day); it shouldn't be a problem.
Runtime is limited, so I need it to be as efficient as possible. I'm thinking it is an issue of computational power after I push all values from the JSON request into the 'tokens' array. Is there a way to let the script wait as long as necessary for the changes to be committed?
function newGetOrders() {
  var starttime = new Date().getTime().toString();
  var refreshTime = new Date();
  var tokens = retrieveTopBin();
  var sheet = SpreadsheetApp.openById('aaafFzbXXRzSi-eXBu9Xh81Ne2r09vM8rLFkA4fY').getSheetByName("Sheet37");
  sheet.getRange('A2:OL101').clear();
  for (var i = 0; i < tokens.length; i++) {
    var request = UrlFetchApp.fetch("https://api.binance.com/api/v1/depth?symbol=" + tokens[i][0] + "BTC", {muteHttpExceptions: true});
    var json = JSON.parse(request.getContentText());
    tokens[i].push(refreshTime);
    Utilities.sleep(100);
    for (var k in json.bids) {
      tokens[i].push(json.bids[k][0]);
      tokens[i].push(json.bids[k][1]);
    }
    for (var k in json.asks) {
      tokens[i].push(json.asks[k][0]);
      tokens[i].push(json.asks[k][1]);
    }
    if (tokens[i].length < 402) {
      for (var x = tokens[i].length; x < 402; x++) {
        tokens[i].push("ERROR!");
      }
    }
  }
  sheet.getRange(2, 1, tokens.length, 402).setValues(tokens);
}

Split events based on criteria and handle in order

I'm having the following problem: given a list of events that have a partitionId property (0-10 for example), I'd like incoming events to be split according to the partitionId so that events with the same partitionId are handled in the order they're received.
With a more or less even distribution, that would lead to 10 events (one per partition) being handled in parallel.
Besides creating 10 single-threaded dispatchers and sending each event to the right dispatcher, is there a way to accomplish the above using Project Reactor?
Thanks.
The code below:
- splits the source stream into partitions,
- creates a ParallelFlux, one "rail" per partition,
- schedules the "rails" onto separate threads,
- collects the results.
Having a dedicated thread for each partition guarantees its values are processed in their original order.
@Test
public void partitioning() throws InterruptedException {
    final int N = 10;
    Flux<Integer> source = Flux.range(1, 10000).share();
    // partition source into publishers
    Publisher<Integer>[] publishers = new Publisher[N];
    for (int i = 0; i < N; i++) {
        final int idx = i;
        publishers[idx] = source.filter(v -> v % N == idx);
    }
    // create ParallelFlux, each 'rail' containing a single partition
    ParallelFlux.from(publishers)
            // schedule partitions onto different threads
            .runOn(Schedulers.newParallel("proc", N))
            // process each partition in its own thread, i.e. in order
            .map(it -> {
                String threadName = Thread.currentThread().getName();
                Assert.assertEquals("proc-" + (it % 10 + 1), threadName);
                return it;
            })
            // collect results on a single 'rail'
            .sequential()
            // and on a single thread called 'subscriber-1'
            .publishOn(Schedulers.newSingle("subscriber"))
            .subscribe(it -> {
                String threadName = Thread.currentThread().getName();
                Assert.assertEquals("subscriber-1", threadName);
            });
    Thread.sleep(1000);
}

Single array in the hdf5 file

Image of my dataset:
I am using HDF5DotNet with C# and I can only read the full dataset, as shown in the attached image. The HDF5 file is too big, nearly 10 GB, and if I load the whole array into memory it will run out of memory.
I would like to read all data from rows 5 and 7 in the attached image. Is there any way to read only these 2 rows of data into memory at a time, without having to load all the data into memory first?
private static void OpenH5File()
{
    var h5FileId = H5F.open(@"D:\Sandbox\Flood Modeller\Document\xmdf_results\FMA_T1_10ft_001.xmdf", H5F.OpenMode.ACC_RDONLY);
    string dataSetName = "/FMA_T1_10ft_001/Temporal/Depth/Values";
    var dataset = H5D.open(h5FileId, dataSetName);
    var space = H5D.getSpace(dataset);
    var dataType = H5D.getType(dataset);
    long[] offset = new long[2];
    long[] count = new long[2];
    long[] stride = new long[2];
    long[] block = new long[2];
    offset[0] = 1; // start at row 5
    offset[1] = 2; // start at column 0
    count[0] = 2; // read 2 rows
    count[0] = 165701; // read all columns
    stride[0] = 0; // don't skip anything
    stride[1] = 0;
    block[0] = 1; // blocks are single elements
    block[1] = 1;
    // Dataspace associated with the dataset in the file
    // Select a hyperslab from the file dataspace
    H5S.selectHyperslab(space, H5S.SelectOperator.SET, offset, count, block);
    // Dimensions of the file dataspace
    var dims = H5S.getSimpleExtentDims(space);
    // We also need a memory dataspace which is the same size as the file dataspace
    var memspace = H5S.create_simple(2, dims);
    double[,] dataArray = new double[1, dims[1]]; // just get one array
    var wrapArray = new H5Array<double>(dataArray);
    // Now we can read the hyperslab
    H5D.read(dataset, dataType, memspace, space,
        new H5PropertyListId(H5P.Template.DEFAULT), wrapArray);
}
You need to select a hyperslab which has the correct offset, count, stride, and block for the subset of the dataset that you wish to read. These are all arrays which have the same number of dimensions as your dataset.
The block is the size of the element block in each dimension to read, i.e. 1 is a single element.
The offset is the number of blocks from the start of the dataset to start reading, and count is the number of blocks to read.
You can select non-contiguous regions by using stride, which again counts in blocks.
I'm afraid I don't know C#, so the following is in C. In your example, you would have:
hsize_t offset[2], count[2], stride[2], block[2];
offset[0] = 5; // start at row 5
offset[1] = 0; // start at column 0
count[0] = 2; // read 2 rows
count[1] = 165702; // read all columns
stride[0] = 1; // don't skip anything
stride[1] = 1;
block[0] = 1; // blocks are single elements
block[1] = 1;
// This assumes you already have an open dataspace with ID dataspace_id
H5Sselect_hyperslab(dataspace_id, H5S_SELECT_SET, offset, stride, count, block);
You can find more information on reading/writing hyperslabs in the HDF5 tutorial.
It seems there are two forms of H5D.read in C#; you want the second form:
H5D.read(Type) Method (H5DataSetId, H5DataTypeId, H5DataSpaceId,
H5DataSpaceId, H5PropertyListId, H5Array(Type))
This allows you to specify the memory and file dataspaces. Essentially, you need one dataspace which has information about the size, stride, offset, etc. of the variable in memory that you want to read into; and one dataspace for the dataset in the file that you want to read from. This lets you do things like read from a non-contiguous region in a file into a contiguous region in an array in memory.
You want something like
// Dataspace associated with the dataset in the file
var dataspace = H5D.get_space(dataset);
// Select a hyperslab from the file dataspace
H5S.selectHyperslab(dataspace, H5S.SelectOperator.SET, offset, count);
// Dimensions of the file dataspace
var dims = H5S.getSimpleExtentDims(dataspace);
// We also need a memory dataspace which is the same size as the file dataspace
var memspace = H5S.create_simple(rank, dims);
// Now we can read the hyperslab
H5D.read(dataset, datatype, memspace, dataspace,
new H5PropertyListId(H5P.Template.DEFAULT), wrapArray);
From your posted code, I think I've spotted the problem. First you do this:
var space = H5D.getSpace(dataset);
then you do
var dataspace = H5D.getSpace(dataset);
These two calls do the same thing, but create two different variables.
You call H5S.selectHyperslab with space, but H5D.read uses dataspace.
You need to make sure you are using the correct variables consistently. If you remove the second call to H5D.getSpace, and change dataspace -> space, it should work.
Maybe you want to have a look at HDFql, as it abstracts you from the low-level details of HDF5. Using HDFql in C#, you can read rows #5 and #7 of dataset Values using a hyperslab selection like this:
float[,] data = new float[2, 165702];
HDFql.Execute("SELECT FROM Values(5:2:2:1) INTO MEMORY " + HDFql.VariableTransientRegister(data));
Afterwards, you can access these rows through variable data. Example:
for(int x = 0; x < 2; x++)
{
    for(int y = 0; y < 165702; y++)
    {
        System.Console.WriteLine(data[x, y]);
    }
}

Count of the biggest bin in histogram, C#

I want to make a histogram of my data, so I use the Histogram class in C# from MathNet.Numerics.Statistics.
double[] array = { 2, 2, 5,56,78,97,3,3,5,23,34,67,12,45,65 };
Vector<double> data = Vector<double>.Build.DenseOfArray(array);
int binAmount = 3;
Histogram _currentHistogram = new Histogram(data, binAmount);
How can I get the count of the biggest bin? Or just the index of the biggest bin? I tried to get it by using GetBucketOf, but to do this I need an element in that bucket :(
Is there any other way to do this? I've read the documentation and searched Google and I can't find anything.
(Hi, I would use a comment for this, but I just joined today and don't yet have 50 reputation to comment!) I just had a look at http://numerics.mathdotnet.com/api/MathNet.Numerics.Statistics/Histogram.htm. That documentation page (the footer says it was built using http://docu.jagregory.com/) shows a public property named Item which returns a Bucket. I'm wondering if that is the property you need, because the automatically generated documentation states that the Item property "Gets the n'th bucket", but it isn't clear how the Item property acts as an indexer. Looking at your code, I would try _currentHistogram[n] first (if that doesn't work, try _currentHistogram.Item[n]), iterating the buckets in the histogram with something like:
double countOfBiggest = -1;
var indexOfBiggest = -1;
for (var n = 0; n < _currentHistogram.BucketCount; n++)
{
    if (_currentHistogram[n].Count > countOfBiggest) // or _currentHistogram.Item[n].Count
    {
        countOfBiggest = _currentHistogram[n].Count;
        indexOfBiggest = n;
    }
}
The code above assumes that Histogram uses 0-based and not 1-based indexing.

I need to get more than 100 pages in my query

I want to get all the video information possible from YouTube for my project. I know that the page limit is 100.
I wrote the following code:
ArrayList<String> videos = new ArrayList<>();
int videosTotales = 0;
int i = 1;
String peticion = "http://gdata.youtube.com/feeds/api/videos?category=Comedy&alt=json&max-results=50&page=" + i;
URL oracle = new URL(peticion);
URLConnection yc = oracle.openConnection();
BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));
StringBuilder response = new StringBuilder();
String line;
while ((line = in.readLine()) != null) {
    response.append(line);
}
String inputLine = response.toString();
System.out.println(inputLine);
JSONObject jsonObj = new JSONObject(inputLine);
JSONObject jsonFeed = jsonObj.getJSONObject("feed");
JSONArray jsonArr = jsonFeed.getJSONArray("entry");
while (i <= 100) {
    for (int j = 0; j < jsonArr.length(); j++) {
        videos.add(jsonArr.getJSONObject(j).getJSONObject("id").getString("$t"));
        System.out.println("Numero " + videosTotales + jsonArr.getJSONObject(j).getJSONObject("id").getString("$t"));
        videosTotales++;
    }
    i++;
}
When the program finishes, I have 5,000 videos per category, but I need much more, much much more, and the limit is page = 100.
So, how can I get more than 10 million videos?
Thank you!
Are those 5000 also unique IDs?
I see the use of max-results=50, but not a start-index parameter in your URL.
There is a limit on the results you can get per request. There is also a limit on the number of requests that you can send within some time interval. By checking the status code of the response and any error message, you can find these limits, as they may change over time.
Besides the category parameter, use some other parameters too. For instance, you may vary the q parameter (used with some keywords) and/or the order parameter to get a different result set.
See the documentation for available parameters.
Note that you are using API version 2, which is deprecated. There is an API version 3.
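As a rough, untested illustration of the start-index idea above (and bearing in mind that the v2 gdata API is deprecated), the paging loop could look something like this; the endpoint and parameters are copied from the question, everything else is just a sketch:
int maxResults = 50;
for (int page = 0; page < 100; page++) {
    int startIndex = 1 + page * maxResults; // start-index is 1-based in the v2 API
    String peticion = "http://gdata.youtube.com/feeds/api/videos?category=Comedy"
            + "&alt=json&max-results=" + maxResults
            + "&start-index=" + startIndex;
    URL oracle = new URL(peticion);
    BufferedReader in = new BufferedReader(new InputStreamReader(oracle.openConnection().getInputStream()));
    StringBuilder body = new StringBuilder();
    String line;
    while ((line = in.readLine()) != null) {
        body.append(line);
    }
    in.close();
    // parse body.toString() with JSONObject as before and collect the entry ids
}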
