Batch cypher queries generated by RestCypherQueryEngine - neo4j

I am trying to batch together a few Cypher queries with the REST API (using the Java bindings library) so that only one call is made over the wire. But it seems not to respect the batching on the client side and gives this error:
java.lang.RuntimeException: Error reading as JSON ''
at org.neo4j.rest.graphdb.util.JsonHelper.readJson(JsonHelper.java:57)
at org.neo4j.rest.graphdb.util.JsonHelper.jsonToSingleValue(JsonHelper.java:62)
at org.neo4j.rest.graphdb.RequestResult.toEntity(RequestResult.java:114)
at org.neo4j.rest.graphdb.RequestResult.toMap(RequestResult.java:123)
at org.neo4j.rest.graphdb.batch.RecordingRestRequest.toMap(RecordingRestRequest.java:138)
at org.neo4j.rest.graphdb.ExecutingRestAPI.query(ExecutingRestAPI.java:489)
at org.neo4j.rest.graphdb.ExecutingRestAPI.query(ExecutingRestAPI.java:509)
at org.neo4j.rest.graphdb.RestAPIFacade.query(RestAPIFacade.java:233)
at org.neo4j.rest.graphdb.query.RestCypherQueryEngine.query(RestCypherQueryEngine.java:50)
...
Caused by: java.io.EOFException: No content to map to Object due to end of input
at org.codehaus.jackson.map.ObjectMapper._initForReading(ObjectMapper.java:2766)
at org.codehaus.jackson.map.ObjectMapper._readMapAndClose(ObjectMapper.java:2709)
at org.codehaus.jackson.map.ObjectMapper.readValue(ObjectMapper.java:1854)
at org.neo4j.rest.graphdb.util.JsonHelper.readJson(JsonHelper.java:55)
... 41 more
This is how I am trying to batch them:
graphDatabaseService.getRestAPI().executeBatch(new BatchCallback<Void>() {
    @Override
    public Void recordBatch(RestAPI batchRestApi) {
        String query = "CREATE accounts=({userId:{userId}})-[r:OWNS]->({facebookId:{facebookId}})";
        graphDatabaseService.getQueryEngine().query(query, map("userId", 1, "facebookId", "1"));
        graphDatabaseService.getQueryEngine().query(query, map("userId", 2, "facebookId", "2"));
        graphDatabaseService.getQueryEngine().query(query, map("userId", 3, "facebookId", "3"));
        return null;
    }
});
I am using Neo4j version 1.9 and the corresponding client library. Should this be possible?

Here is JUnit sample code that works for your batch. No Cypher string template is used here; instead, the native methods on the RestAPI object are called:
public static final DynamicRelationshipType OWNS = DynamicRelationshipType.withName("OWNS");

@Autowired
private SpringRestGraphDatabase graphDatabaseService;

@Test
public void batchTest()
{
    Assert.assertNotNull(this.graphDatabaseService);
    this.graphDatabaseService.getRestAPI().executeBatch(new BatchCallback<Void>()
    {
        @Override
        public Void recordBatch(RestAPI batchRestApi)
        {
            for (int counter = 1; counter <= 3; counter++)
            {
                RestNode userId = batchRestApi.createNode(map("userId", Integer.valueOf(counter)));
                RestNode facebookId = batchRestApi.createNode(map("facebookId", Integer.valueOf(counter).toString()));
                batchRestApi.createRelationship(userId, facebookId, OWNS, map());
            }
            return null;
        }
    });
}
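As an aside, and only as an assumption I have not verified against the 1.9 client: if RestCypherQueryEngine can be constructed directly over the batch-scoped RestAPI handed to recordBatch, the Cypher-based approach from the question may also record into the batch, because the queries would then go through the batching RestAPI rather than the outer graphDatabaseService. A sketch:

this.graphDatabaseService.getRestAPI().executeBatch(new BatchCallback<Void>()
{
    @Override
    public Void recordBatch(RestAPI batchRestApi)
    {
        // Assumption: building the query engine on batchRestApi routes the queries into the batch.
        RestCypherQueryEngine engine = new RestCypherQueryEngine(batchRestApi);
        String query = "CREATE accounts=({userId:{userId}})-[r:OWNS]->({facebookId:{facebookId}})";
        for (int counter = 1; counter <= 3; counter++)
        {
            engine.query(query, map("userId", counter, "facebookId", String.valueOf(counter)));
        }
        return null;
    }
});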

Related

Publish FluxSink of different object types

I have an RSocket endpoint that responds with a Flux:
#MessageMapping("responses")
Flux<?> deal(#Payload String message) {
return myService.generateResponses(message);
}
The responses can be any of three different types of object, produced asynchronously using the following code (if it worked):
public Flux<?> generateResponses(String request) {
    // Set up response sinks
    final FluxProcessor publish = EmitterProcessor.create().serialize();
    final FluxSink<Response1> sink1 = publish.sink();
    final FluxSink<Response2> sink2 = publish.sink();
    final FluxSink<Response3> sink3 = publish.sink();
    // Get async responses: starts a new thread to gather responses and update the sinks
    new MyResponses(request, sink1, sink2, sink3);
    // Return the Flux
    Flux<?> output = Flux
        .from(publish
        .log());
    return output;
}
The problem is that when I populate the sinks with different objects, only the first sink actually publishes back to the subscriber.
public class MyResponses extends CacheListenerAdapter {
    private FluxSink<Response1> sink1;
    private FluxSink<Response2> sink2;
    private FluxSink<Response3> sink3;

    // Constructor is omitted for brevity

    @Override
    public void afterCreate(EntryEvent event) {
        if (event.getNewValue() instanceof Response1) {
            Response1 r1 = (Response1) event.getNewValue();
            sink1.next(r1);
        }
        if (event.getNewValue() instanceof Response2) {
            Response2 r2 = (Response2) event.getNewValue();
            sink2.next(r2);
        }
        if (event.getNewValue() instanceof Response3) {
            Response3 r3 = (Response3) event.getNewValue();
            sink3.next(r3);
        }
    }
}
If I make the sinks of type <?> then there's a .next error:
The method next(capture#2-of ?) in the type FluxSink<capture#2-of ?> is not applicable for the arguments (Response1)
Is there a better approach to this requirement?
The reason this did not work with different object types came down to Spring Boot Data Geode serialization of the underlying object types. The way to get the object Flux to work was to use one sink of type <Object>:
public Flux<Object> generateResponses(String request) {
    // Set up the Flux
    EmitterProcessor<Object> emitter = EmitterProcessor.create();
    FluxSink<Object> sink = emitter.sink(FluxSink.OverflowStrategy.LATEST);
    // Get async responses: starts a new thread to gather responses and update the sink
    new MyResponses(request, sink);
    // Set up an output Flux to publish the input Flux
    Flux<Object> out = Flux
        .from(emitter
        .log(log.getName()));
    return out;
}
The event handler then uses the single sink:
public class MyResponses extends CacheListenerAdapter {
    private FluxSink<Object> sink;

    // Constructor is omitted for brevity

    @Override
    public void afterCreate(EntryEvent event) {
        if (event.getNewValue() instanceof Response1) {
            Response1 r1 = (Response1) event.getNewValue();
            sink.next(r1);
        }
        if (event.getNewValue() instanceof Response2) {
            Response2 r2 = (Response2) event.getNewValue();
            sink.next(r2);
        }
        if (event.getNewValue() instanceof Response3) {
            Response3 r3 = (Response3) event.getNewValue();
            sink.next(r3);
        }
    }
}
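As a side note, if you are on Reactor 3.4 or later, EmitterProcessor is deprecated; a roughly equivalent setup (a sketch, not verified against this codebase) uses the Sinks API with a single Sinks.Many<Object>:

// The listener would call sink.tryEmitNext(r1) etc. instead of FluxSink.next(...).
Sinks.Many<Object> sink = Sinks.many().multicast().onBackpressureBuffer();
Flux<Object> out = sink.asFlux().log();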

Neo4j return nested object from extension procedure

I am trying to create an extension procedure for Neo4j that returns a complex object (meaning an object that contains another object).
public static class A {
    public final String a;

    public A(String a) {
        this.a = a;
    }
}

public static class Output {
    public Object out;

    public Output(Object out) {
        this.out = out;
    }
}

@Procedure(value = "my_proc", mode = Mode.READ)
public Stream<Output> myProc() {
    return Stream.of(new Output(new A("a")));
}
When I execute call my_proc(); using the Neo4j Browser, it just shows the progress circle and never returns.
When I execute the same query using the Java driver, I get the following exception:
SEVERE: [0xedd70cbd] Fatal error occurred in the pipeline
org.neo4j.driver.internal.shaded.io.netty.handler.codec.DecoderException: Failed to read inbound message:
at org.neo4j.driver.internal.async.inbound.InboundMessageHandler.channelRead0(InboundMessageHandler.java:87)
at org.neo4j.driver.internal.async.inbound.InboundMessageHandler.channelRead0(InboundMessageHandler.java:35)
....
at org.neo4j.driver.internal.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
at org.neo4j.driver.internal.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
at org.neo4j.driver.internal.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
at org.neo4j.driver.internal.shaded.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
at org.neo4j.driver.internal.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at org.neo4j.driver.internal.shaded.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IndexOutOfBoundsException: readerIndex(49) + length(1) exceeds writerIndex(49): PooledDuplicatedByteBuf(ridx: 49, widx: 49, cap: 133, unwrapped: PooledUnsafeDirectByteBuf(ridx: 51, widx: 89, cap: 133))
at org.neo4j.driver.internal.shaded.io.netty.buffer.AbstractByteBuf.checkReadableBytes0(AbstractByteBuf.java:1401)
at org.neo4j.driver.internal.shaded.io.netty.buffer.AbstractByteBuf.readByte(AbstractByteBuf.java:707)
at org.neo4j.driver.internal.async.inbound.ByteBufInput.readByte(ByteBufInput.java:45)
at org.neo4j.driver.internal.packstream.PackStream$Unpacker.unpackLong(PackStream.java:479)
at org.neo4j.driver.internal.messaging.PackStreamMessageFormatV1$Reader.unpackValue(PackStreamMessageFormatV1.java:479)
at org.neo4j.driver.internal.messaging.PackStreamMessageFormatV1$Reader.unpackRecordMessage(PackStreamMessageFormatV1.java:464)
at org.neo4j.driver.internal.messaging.PackStreamMessageFormatV1$Reader.read(PackStreamMessageFormatV1.java:390)
at org.neo4j.driver.internal.async.inbound.InboundMessageHandler.channelRead0(InboundMessageHandler.java:83)
... 39 more
Is there any way to return a nested object without serializing it to JSON before returning it?
#Procedure(value = "ebc.neo4j.justamap", mode = Mode.READ)
public Stream<MapResult> justamap() {
HashMap<String, Object> v1Map = new HashMap<String, Object>(1);
HashMap<String, Object> v2Map = new HashMap<String, Object>(1);
v2Map.put("a", "a string");
v1Map.put("map inside", v2Map);
return Stream.of(new MapResult(v1Map));
}
public static class MapResult {
public Map internalmap;
public MapResult(Map aInternalId) {
this.internalmap = aInternalId;
}
}
Something like the above is possible (as deep as you want), but returning custom objects to be used in Cypher (which is the idea when you're writing a procedure) is not going to work, as Cypher has no idea what those custom objects are. So you will have to do some sanitizing.
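For example, the A object from the question could be flattened into maps before returning; a minimal sketch reusing the MapResult above (the field copy is illustrative, not a Neo4j API):

@Procedure(value = "my_proc", mode = Mode.READ)
public Stream<MapResult> myProc() {
    A a = new A("a");
    // Copy each field of the custom object into a plain map that Cypher understands.
    HashMap<String, Object> sanitized = new HashMap<String, Object>(1);
    sanitized.put("a", a.a);
    HashMap<String, Object> wrapper = new HashMap<String, Object>(1);
    wrapper.put("out", sanitized);
    return Stream.of(new MapResult(wrapper));
}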
Hope this helps.
Regards,
Tom

Recommendation engine using Apache Spark MLlib showing zero recommendations after processing all operations

I am a newbie when it comes to implementing ML algorithms. I wanted to build a recommendation engine and, after a little experimenting, learned that collaborative filtering can be used for this. I am using Apache Spark. I got help from one of the blogs and tried to implement the same locally. Below is the code I tried. Every time I execute it, the count of recommendations that gets printed is always zero. I don't see any evident error. Could someone please help me understand this? Also, please feel free to point me to any other reference on this topic.
package mllib.example;

import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.mllib.recommendation.ALS;
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel;
import org.apache.spark.mllib.recommendation.Rating;
import scala.Tuple2;

public class RecommendationEngine {
    public static void main(String[] args) {
        // Create Java spark context
        SparkConf conf = new SparkConf().setAppName("Recommendation System Example").setMaster("local[2]").set("spark.executor.memory", "1g");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // Read user-item rating file. Format: userId,itemId,rating
        JavaRDD<String> userItemRatingsFile = sc.textFile(args[0]);
        System.out.println("Count is " + userItemRatingsFile.count());
        // Read item description file. Format: itemId,itemName,otherFields,...
        JavaRDD<String> itemDescritpionFile = sc.textFile(args[1]);
        System.out.println("itemDescritpionFile Count is " + itemDescritpionFile.count());
        // Map file to Rating(user, item, rating) tuples
        JavaRDD<Rating> ratings = userItemRatingsFile.map(new Function<String, Rating>() {
            public Rating call(String s) {
                String[] sarray = s.split(",");
                return new Rating(Integer.parseInt(sarray[0]), Integer
                        .parseInt(sarray[1]), Double.parseDouble(sarray[2]));
            }
        });
        System.out.println("Ratings RDD Object " + ratings.first().toString());
        // Create (itemId, itemDescription) tuples, used later to get item names from itemIds
        JavaPairRDD<Integer, String> itemDescritpion = itemDescritpionFile.mapToPair(
                new PairFunction<String, Integer, String>() {
                    @Override
                    public Tuple2<Integer, String> call(String t) throws Exception {
                        String[] s = t.split(",");
                        return new Tuple2<Integer, String>(Integer.parseInt(s[0]), s[1]);
                    }
                });
        System.out.println("itemDescritpion RDD Object " + ratings.first().toString());
        // Build the recommendation model using ALS
        int rank = 10; // 10 latent factors
        int numIterations = Integer.parseInt(args[2]); // number of iterations
        MatrixFactorizationModel model = ALS.trainImplicit(JavaRDD.toRDD(ratings),
                rank, numIterations);
        //ALS.trainImplicit(arg0, arg1, arg2)
        // Create user-item tuples from ratings
        JavaRDD<Tuple2<Object, Object>> userProducts = ratings
                .map(new Function<Rating, Tuple2<Object, Object>>() {
                    public Tuple2<Object, Object> call(Rating r) {
                        return new Tuple2<Object, Object>(r.user(), r.product());
                    }
                });
        // Calculate the itemIds not rated by a particular user, say the user with userId = 1
        JavaRDD<Integer> notRatedByUser = userProducts.filter(new Function<Tuple2<Object, Object>, Boolean>() {
            @Override
            public Boolean call(Tuple2<Object, Object> v1) throws Exception {
                if (((Integer) v1._1).intValue() != 0) {
                    return true;
                }
                return false;
            }
        }).map(new Function<Tuple2<Object, Object>, Integer>() {
            @Override
            public Integer call(Tuple2<Object, Object> v1) throws Exception {
                return (Integer) v1._2;
            }
        });
        // Create user-item tuples for the items that are not rated by the user, with user id 1
        JavaRDD<Tuple2<Object, Object>> itemsNotRatedByUser = notRatedByUser
                .map(new Function<Integer, Tuple2<Object, Object>>() {
                    public Tuple2<Object, Object> call(Integer r) {
                        return new Tuple2<Object, Object>(0, r);
                    }
                });
        // Predict the ratings of the items not rated by the user
        JavaRDD<Rating> recomondations = model.predict(itemsNotRatedByUser.rdd()).toJavaRDD().distinct();
        // Sort the recommendations by rating in descending order
        recomondations = recomondations.sortBy(new Function<Rating, Double>() {
            @Override
            public Double call(Rating v1) throws Exception {
                return v1.rating();
            }
        }, false, 1);
        System.out.println("recomondations Total is " + recomondations.count());
        // Get the top 10 recommendations
        JavaRDD<Rating> topRecomondations = sc.parallelize(recomondations.take(10));
        // Join the top 10 recommendations with the item descriptions
        JavaRDD<Tuple2<Rating, String>> recommendedItems = topRecomondations.mapToPair(
                new PairFunction<Rating, Integer, Rating>() {
                    @Override
                    public Tuple2<Integer, Rating> call(Rating t) throws Exception {
                        return new Tuple2<Integer, Rating>(t.product(), t);
                    }
                }).join(itemDescritpion).values();
        System.out.println("recommendedItems count is " + recommendedItems.count());
        // Print the top recommendations for user 1.
        recommendedItems.foreach(new VoidFunction<Tuple2<Rating, String>>() {
            @Override
            public void call(Tuple2<Rating, String> t) throws Exception {
                System.out.println(t._1.product() + "\t" + t._1.rating() + "\t" + t._2);
            }
        });
    }
}
Also, I see that this job runs for a really long time and creates the model every time. Is there a way I can create the model once, persist it, and load it for subsequent predictions? Can we improve the speed of execution of this job in any way?
Thanks in Advance
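On the persistence question: MLlib's MatrixFactorizationModel supports save and load, so the model only needs to be trained once and can be reloaded for later prediction runs. A minimal sketch, assuming the JavaSparkContext sc from the code above; the path is a placeholder:

// Train once, persist, and reload later (the path is a placeholder).
model.save(sc.sc(), "target/tmp/myCollaborativeFilter");
MatrixFactorizationModel sameModel =
        MatrixFactorizationModel.load(sc.sc(), "target/tmp/myCollaborativeFilter");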

DynamicDestinations in Apache Beam

I have a PCollection<String>, say "X", that I need to dump into a BigQuery table.
The table destination and its schema are in a PCollection<TableRow>, say "Y".
How do I accomplish this in the simplest manner?
I tried extracting the table and schema from "Y" and saving them in static global variables (tableName and schema respectively). But somehow, oddly, BigQueryIO.writeTableRows() always gets the value of the variable tableName as null, yet it does get the schema. I tried logging the values of those variables and I can see the values are there for both.
Here is my pipeline code:
static String tableName;
static TableSchema schema;

PCollection<String> read = p.apply("Read from input file",
    TextIO.read().from(options.getInputFile()));

PCollection<TableRow> tableRows = p.apply(
    BigQueryIO.read().fromQuery(NestedValueProvider.of(
        options.getfilename(),
        new SerializableFunction<String, String>() {
            @Override
            public String apply(String filename) {
                return "SELECT table,schema FROM `BigqueryTest.configuration` WHERE file='" + filename + "'";
            }
        })).usingStandardSql().withoutValidation());

final PCollectionView<List<String>> dataView = read.apply(View.asList());

tableRows.apply("Convert data read from file to TableRow",
    ParDo.of(new DoFn<TableRow, TableRow>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
            tableName = c.element().get("table").toString();
            String[] schemas = c.element().get("schema").toString().split(",");
            List<TableFieldSchema> fields = new ArrayList<>();
            for (int i = 0; i < schemas.length; i++) {
                fields.add(new TableFieldSchema()
                    .setName(schemas[i].split(":")[0]).setType(schemas[i].split(":")[1]));
            }
            schema = new TableSchema().setFields(fields);
            // My code to convert data to TableRow format.
        }
    }).withSideInputs(dataView));

tableRows.apply("write to BigQuery",
    BigQueryIO.writeTableRows()
        .withSchema(schema)
        .to("ProjectID:DatasetID." + tableName)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));
Everything works fine; only the BigQueryIO.write operation fails, and I get the error TableId is null.
I also tried using a SerializableFunction and returning the value from there, but I still get null.
Here is the code that I tried for it:
tableRows.apply("write to BigQuery",
BigQueryIO.writeTableRows()
.withSchema(schema)
.to(new GetTable(tableName))
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));
public static class GetTable implements SerializableFunction<String,String> {
String table;
public GetTable() {
this.table = tableName;
}
#Override
public String apply(String arg0) {
return "ProjectId:DatasetId."+table;
}
}
I also tried using DynamicDestinations, but I get an error saying the schema is not provided. Honestly, I'm new to the concept of DynamicDestinations and I'm not sure I'm doing it correctly.
Here is the code that I tried for it:
tableRows2.apply(BigQueryIO.writeTableRows()
    .to(new DynamicDestinations<TableRow, TableRow>() {
        private static final long serialVersionUID = 1L;

        @Override
        public TableDestination getTable(TableRow dest) {
            List<TableRow> list = sideInput(bqDataView); // bqDataView contains table and schema
            String table = list.get(0).get("table").toString();
            String tableSpec = "ProjectId:DatasetId." + table;
            String tableDescription = "";
            return new TableDestination(tableSpec, tableDescription);
        }

        public String getSideInputs(PCollectionView<List<TableRow>> bqDataView) {
            return null;
        }

        @Override
        public TableSchema getSchema(TableRow destination) {
            return schema; // schema is getting added from the global variable
        }

        @Override
        public TableRow getDestination(ValueInSingleWindow<TableRow> element) {
            return null;
        }
    }.getSideInputs(bqDataView)));
Please let me know what I'm doing wrong and which path I should take.
Thank You.
Part of the reason you're having trouble is the two stages of pipeline execution. First, the pipeline is constructed on your machine. This is when all of the applications of PTransforms occur. In your first example, this is when the following lines are executed:
BigQueryIO.writeTableRows()
    .withSchema(schema)
    .to("ProjectID:DatasetID." + tableName)
The code within a ParDo, however, runs when your pipeline executes, and it does so on many machines. So the following code runs much later than the pipeline construction:
@ProcessElement
public void processElement(ProcessContext c) {
    tableName = c.element().get("table").toString();
    ...
    schema = new TableSchema().setFields(fields);
    ...
}
This means that neither the tableName nor the schema fields will be set when the BigQueryIO sink is created.
Your idea to use DynamicDestinations is correct, but you need to move the code that actually generates the schema and the destination into that class, rather than relying on global variables that aren't available on all of the machines. A sketch of what that could look like is shown below.
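This is not drop-in code, only a sketch under the assumption that each incoming TableRow carries the "table" and "schema" fields from your configuration query; the destination key simply packs both strings together so that getTable and getSchema can each recover what they need without any global state:

tableRows.apply("write to BigQuery",
    BigQueryIO.writeTableRows()
        .to(new DynamicDestinations<TableRow, String>() {
            @Override
            public String getDestination(ValueInSingleWindow<TableRow> element) {
                // Derive the destination from the element itself, not from pipeline-construction-time state.
                TableRow row = element.getValue();
                return row.get("table").toString() + "|" + row.get("schema").toString();
            }

            @Override
            public TableDestination getTable(String destination) {
                String table = destination.split("\\|")[0];
                return new TableDestination("ProjectId:DatasetId." + table, "");
            }

            @Override
            public TableSchema getSchema(String destination) {
                // Rebuild the schema from the "name:type,name:type,..." string carried in the key.
                List<TableFieldSchema> fields = new ArrayList<>();
                for (String field : destination.split("\\|")[1].split(",")) {
                    fields.add(new TableFieldSchema()
                        .setName(field.split(":")[0]).setType(field.split(":")[1]));
                }
                return new TableSchema().setFields(fields);
            }
        })
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));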

How can I batch delete millions of entities using DatastoreIO and Dataflow

I'm trying to use Dataflow to delete many millions of Datastore entities and the pace is extremely slow (5 entities/s). I am hoping you can explain to me the pattern I should follow to allow that to scale up to a reasonable pace. Just adding more workers did not help.
The Datastore Admin console has the ability to delete all entities of a specific kind but it fails a lot and takes me a week or more to delete 40 million entities. Dataflow ought to be able to help me delete millions of entities that match only certain query parameters as well.
I'm guessing that some type of batching strategy should be employed (where I create a mutation with 1000 deletes in it, for example), but it's not obvious to me how I would go about that. DatastoreIO gives me just one entity at a time to work with. Pointers would be greatly appreciated.
Below is my current slow solution.
Pipeline p = Pipeline.create(options);

DatastoreIO.Source source = DatastoreIO.source()
    .withDataset(options.getDataset())
    .withQuery(getInstrumentQuery(options))
    .withNamespace(options.getNamespace());

p.apply("ReadLeafDataFromDatastore", Read.from(source))
 .apply("DeleteRecords", ParDo.of(new DeleteInstrument(options.getDataset())));
p.run();

static class DeleteInstrument extends DoFn<Entity, Integer> {
    String dataset;

    DeleteInstrument(String dataset) {
        this.dataset = dataset;
    }

    @Override
    public void processElement(ProcessContext c) {
        DatastoreV1.Mutation.Builder mutation = DatastoreV1.Mutation.newBuilder();
        mutation.addDelete(c.element().getKey());
        final DatastoreV1.CommitRequest.Builder request = DatastoreV1.CommitRequest.newBuilder();
        request.setMutation(mutation);
        request.setMode(DatastoreV1.CommitRequest.Mode.NON_TRANSACTIONAL);
        try {
            DatastoreOptions.Builder dbo = new DatastoreOptions.Builder();
            dbo.dataset(dataset);
            dbo.credential(getCredential());
            Datastore db = DatastoreFactory.get().create(dbo.build());
            db.commit(request.build());
            c.output(1);
            count++;
            if (count % 100 == 0) {
                LOG.info(count + "");
            }
        } catch (Exception e) {
            c.output(0);
            e.printStackTrace();
        }
    }
}
There is no direct way of deleting entities using the current version of DatastoreIO. This version of DatastoreIO will be deprecated in favor of a new version (v1beta3) in the next Dataflow release. We think there is a good use case for providing a delete utility (either through an example or a PTransform), but it is still a work in progress.
For now you can batch your deletes, instead of deleting one at a time:
public static class DeleteEntityFn extends DoFn<Entity, Void> {
    // Datastore max batch limit
    private static final int DATASTORE_BATCH_UPDATE_LIMIT = 500;

    private Datastore db;
    private List<Key> keyList = new ArrayList<>();

    @Override
    public void startBundle(Context c) throws Exception {
        // Initialize Datastore Client
        // db = ...
    }

    @Override
    public void processElement(ProcessContext c) throws Exception {
        keyList.add(c.element().getKey());
        if (keyList.size() >= DATASTORE_BATCH_UPDATE_LIMIT) {
            flush();
        }
    }

    @Override
    public void finishBundle(Context c) throws Exception {
        if (keyList.size() > 0) {
            flush();
        }
    }

    private void flush() throws Exception {
        // Make one delete request instead of one for each element.
        CommitRequest request =
            CommitRequest.newBuilder()
                .setMode(CommitRequest.Mode.NON_TRANSACTIONAL)
                .setMutation(Mutation.newBuilder().addAllDelete(keyList).build())
                .build();
        db.commit(request);
        keyList.clear();
    }
}
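To wire that into the pipeline, the batching DoFn simply replaces the per-element delete; a short sketch reusing the source from the question (client setup and step names are illustrative):

// Read the entities to delete and apply the batching DoFn instead of the per-entity one.
p.apply("ReadLeafDataFromDatastore", Read.from(source))
 .apply("BatchDeleteRecords", ParDo.of(new DeleteEntityFn()));
p.run();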
