We built a 3-node cluster in a testing environment and used a Neo4j JDBC connection to save JSON data into Neo4j.
When creating just 2000 nodes and 2000 relationships from JSON, the statistics are: total time to save topology data in Neo4j: 456688 ms, links size: 2000, nodes size: 2000.
Saving without checking for duplicate nodes/relationships (checkVertex and checkRelation methods removed):
Total time to save topology data in Neo4j: 446979 ms, links size: 2000, nodes size: 4000 (since we are not checking for duplicates, the nodes are created twice).
Code:
public Connection getConnection(String masterNodeIp, String password) throws Exception {
return (Connection) DriverManager.getConnection("jdbc:neo4j:http://" + masterNodeIp + "/?user=neo4j,password=" + password);
}
// Iterate through the edges and add the source and target nodes.
try {
for (Links link : topology.getL2links()) {
if(conn != null) {
long srcId = getGraphIdByUniquenessOfOrphan(clientId, link.getSrcMgmtIP());
GraphId srcGraphId = prepareGraphId(srcId, "DEVICE");
long tgtId = getGraphIdByUniquenessOfOrphan(clientId, link.getTgtMgmtIP());
GraphId tgtGraphId = prepareGraphId(tgtId, "DEVICE");
String srcQuery = createNode(conn, link, false,clientId,discProfileId,
srcGraphId);
if(srcQuery!=null && !srcQuery.isEmpty())
stmt.execute(srcQuery);
String tgtQuery = createNode(conn, link, true,clientId,discProfileId,
tgtGraphId);
if(tgtQuery != null && !tgtQuery.isEmpty())
stmt.execute(tgtQuery);
String relationQuery = processRelation(conn, link,srcGraphId,tgtGraphId);
if(relationQuery!=null && !relationQuery.isEmpty())
stmt.execute(relationQuery);
}
}
} catch(Exception e) {
System.out.println("Exception in processJsonData ::: "+e.getMessage());
throw e;
} finally {
stmt.close();
conn.close();
}
// Before creating a node, check whether it already exists in order to avoid duplicates.
private boolean checkVertex(Connection conn, String ip, String hostName, long clientId, long discPId, GraphId graphId) throws Exception{
Statement stmt = null;
ResultSet rs = null;
boolean result=false;
try {
stmt = conn.createStatement();
StringBuffer queryBuffer = new StringBuffer();
queryBuffer.append(" MATCH (node) WHERE node.id ='"+graphId.getId()+"' AND node.sourceType = '"+graphId.getSourceType()+"'");
queryBuffer.append(" RETURN node");
rs = (ResultSet) stmt.executeQuery(queryBuffer.toString());
while(rs.next()) {
result=true;
break;
}
} catch(Exception e) {
System.out.println("Exception in fetching node ::: "+e.getMessage());
throw e;
} finally {
rs.close();
stmt.close();
}
return result;
}
// Before creating a relationship, also check for duplicates.
private boolean checkRelation(Connection conn, Links link, GraphId srcGraphId, GraphId tgtGraphId) throws SQLException {
Statement stmt = null;
ResultSet rs = null;
boolean result=false;
try {
stmt = conn.createStatement();
StringBuffer queryBuffer = new StringBuffer();
queryBuffer.append(" MATCH (src:resource)-[r:topology]->(tgt:resource) WHERE src.id='"+srcGraphId.getId()
+"' AND tgt.id='"+tgtGraphId.getId()+"' AND r.srcInt='"+link.getSrcInt()+"'AND r.tgtInt='"+link.getTgtInt()+"'");
queryBuffer.append(" RETURN r");
rs=(ResultSet) stmt.executeQuery(queryBuffer.toString());
while(rs.next()) {
result=true;
break;
}
}
catch(Exception e) {
System.out.println("Exception in fetching node ::: "+e.getMessage());
} finally {
rs.close();
stmt.close();
}
return result;
}
We created indexes for those duplicate-check queries, but performance is still slow.
Also, please let us know how to use a "node key" unique constraint at the Java level so that we can skip the checkVertex query. We tried catching the ConstraintViolationException and logging it instead of rethrowing, but the exception is still thrown and no nodes are saved.
There are a lot of things that you can improve:
for mass data imports use the Java Driver directly, JDBC adds an indirection layer
Use parameters!
Use batching, either with UNWIND or by executing multiple prepared statements as a batch (see the sketch after the links below)
Don't construct queries with literal values.
Make sure you have indexes/constraints for your keys. Your queries don't use any indexes because you didn't provide any labels!
Use MERGE if you don't want to have constraint exceptions.
Don't use StringBuffer, ever.
Use try-with-resources
Use executeUpdate
For Batching:
https://medium.com/@mesirii/5-tips-tricks-for-fast-batched-updates-of-graph-structures-with-neo4j-and-cypher-73c7f693c8cc
For parameters:
http://neo4j-contrib.github.io/neo4j-jdbc/#_minimum_viable_snippet
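Putting a few of those points together, here is a minimal sketch of a batched, parameterized import with the official Neo4j Java driver (4.x packages) and MERGE. The Resource label, the id/sourceType keys, and the row map layout are assumptions taken from the question's queries, not a drop-in replacement:

import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Session;
import org.neo4j.driver.Values;

import java.util.List;
import java.util.Map;

public class TopologyImporter implements AutoCloseable {

    // Run once up front so MERGE can use an index (NODE KEY needs Neo4j Enterprise):
    // CREATE CONSTRAINT ON (n:Resource) ASSERT (n.id, n.sourceType) IS NODE KEY
    private static final String SAVE_LINKS =
        "UNWIND $rows AS row " +
        "MERGE (src:Resource {id: row.srcId, sourceType: row.sourceType}) " +
        "MERGE (tgt:Resource {id: row.tgtId, sourceType: row.sourceType}) " +
        "MERGE (src)-[r:TOPOLOGY {srcInt: row.srcInt, tgtInt: row.tgtInt}]->(tgt)";

    private final Driver driver;

    public TopologyImporter(String uri, String user, String password) {
        this.driver = GraphDatabase.driver(uri, AuthTokens.basic(user, password));
    }

    // Each row carries srcId, tgtId, sourceType, srcInt and tgtInt; send a few thousand rows per call.
    public void saveLinks(List<Map<String, Object>> rows) {
        try (Session session = driver.session()) {
            session.writeTransaction(tx -> tx.run(SAVE_LINKS, Values.parameters("rows", rows)).consume());
        }
    }

    @Override
    public void close() {
        driver.close();
    }
}

Because MERGE matches on the same properties the constraint covers, the separate checkVertex/checkRelation round trips and the ConstraintViolationException handling become unnecessary, and each batch costs one request instead of three statements per link.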
Related
In my recent project, each user payment needs to insert a Receipt and an Invoice into the SQL DB. I have two functions for this, InsertReceipt() and InsertInvoice(), so the code looks like:
void DoPayment()
{
InsertReceipt();
InsertInvoice();
}
bool InsertReceipt()
{
// insert to SQL with a ReceiptId
// return true or false;
}
bool InsertInvoice()
{
// insert to SQL with an InvoiceId
// return true or false;
}
ReceiptId and InvoiceId have to be unique and consecutive here.
My question is: how can I make InsertReceipt() and InsertInvoice() either both succeed or both fail? Or do I have to write a new function, InsertReceiptAndInvoice(), and use a SQL transaction?
If you use a stored procedure, you can absolutely use a transaction. You can read about transactions in the Microsoft docs:
https://learn.microsoft.com/en-us/sql/t-sql/language-elements/begin-transaction-transact-sql?view=sql-server-ver15
With C#, you can use a transaction with SqlConnection. Code example:
private static void ExecuteSqlTransaction(string connectionString)
{
using (SqlConnection connection = new SqlConnection(connectionString))
{
connection.Open();
SqlCommand command = connection.CreateCommand();
SqlTransaction transaction;
// Start a local transaction.
transaction = connection.BeginTransaction("SampleTransaction");
// Must assign both transaction object and connection
// to Command object for a pending local transaction
command.Connection = connection;
command.Transaction = transaction;
try
{
command.CommandText =
"Insert into Region (RegionID, RegionDescription) VALUES (100, 'Description')";
command.ExecuteNonQuery();
command.CommandText =
"Insert into Region (RegionID, RegionDescription) VALUES (101, 'Description')";
command.ExecuteNonQuery();
// Attempt to commit the transaction.
transaction.Commit();
Console.WriteLine("Both records are written to database.");
}
catch (Exception ex)
{
Console.WriteLine("Commit Exception Type: {0}", ex.GetType());
Console.WriteLine(" Message: {0}", ex.Message);
// Attempt to roll back the transaction.
try
{
transaction.Rollback();
}
catch (Exception ex2)
{
// This catch block will handle any errors that may have occurred
// on the server that would cause the rollback to fail, such as
// a closed connection.
Console.WriteLine("Rollback Exception Type: {0}", ex2.GetType());
Console.WriteLine(" Message: {0}", ex2.Message);
}
}
}
}
You can read the docs: https://learn.microsoft.com/en-us/dotnet/api/system.data.sqlclient.sqlconnection.begintransaction?view=dotnet-plat-ext-3.1
Is it possible to get access to line numbers for the lines read into the PCollection from TextIO.Read? For context, I'm processing a CSV file and need access to the line number for a given line.
If it's not possible through TextIO.Read, it seems like it should be possible using some kind of custom Read or transform, but I'm having trouble figuring out where to begin.
You can use FileIO to read the file manually, where you can determine the line number when you read from the ReadableFile.
A simple solution can look as follows:
p
.apply(FileIO.match().filepattern("/file.csv"))
.apply(FileIO.readMatches())
.apply(FlatMapElements
.into(strings())
.via((FileIO.ReadableFile f) -> {
List<String> result = new ArrayList<>();
try (BufferedReader br = new BufferedReader(Channels.newReader(f.open(), "UTF-8"))) {
int lineNr = 1;
String line = br.readLine();
while (line != null) {
result.add(lineNr + "," + line);
line = br.readLine();
lineNr++;
}
} catch (IOException e) {
throw new RuntimeException("Error while reading", e);
}
return result;
}));
The solution above just prepends the line number to each input line.
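If the line number should stay structured for downstream transforms, a small variation (a sketch only, assuming the usual static imports of TypeDescriptors.kvs/integers/strings and org.apache.beam.sdk.values.KV) could emit KV pairs instead of concatenated strings:

.apply(FlatMapElements
.into(kvs(integers(), strings()))
.via((FileIO.ReadableFile f) -> {
    // Same reading loop as above, but keep the line number as the KV key.
    List<KV<Integer, String>> result = new ArrayList<>();
    try (BufferedReader br = new BufferedReader(Channels.newReader(f.open(), "UTF-8"))) {
        int lineNr = 1;
        String line = br.readLine();
        while (line != null) {
            result.add(KV.of(lineNr, line));
            line = br.readLine();
            lineNr++;
        }
    } catch (IOException e) {
        throw new RuntimeException("Error while reading", e);
    }
    return result;
}));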
I am using the Paho library class org.eclipse.paho.client.mqttv3.MqttClient (not MqttAsyncClient) for MQTT connections.
In my case, when I publish using
mqttClient.publish(uid + "/p", new MqttMessage(payload.toString().getBytes()));
this method does the task for me but doesn't return anything, so I can't check the latency between the publish and the PUBACK.
To get the latency, I use the following instead of calling mqttClient's publish function directly:
public long publish(JsonObject payload , String uid, int qos) {
try {
MqttTopic topic = mqttClient.getTopic(uid + "/p");
MqttMessage message = new MqttMessage(payload.toString().getBytes());
message.setQos(qos);
message.setRetained(true);
long publishTime = System.currentTimeMillis();
MqttDeliveryToken token = topic.publish(message);
token.waitForCompletion(10000);
long pubCompleted = System.currentTimeMillis();
if (token.getResponse() != null && token.getResponse() instanceof MqttPubAck) {
return pubCompleted-publishTime;
}
return -1;
} catch (Exception e) {
e.printStackTrace();
return -1;
}
}
This gets the work done, but I am not sure whether this is the right approach. Please let me know if there is some other way to do this.
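One alternative worth considering (a sketch only, an assumption about your setup rather than a drop-in change) is to keep the publish non-blocking and measure latency from the deliveryComplete() notification of MqttCallback, recording the send time per message id:

import org.eclipse.paho.client.mqttv3.IMqttDeliveryToken;
import org.eclipse.paho.client.mqttv3.MqttCallback;
import org.eclipse.paho.client.mqttv3.MqttMessage;

import java.util.concurrent.ConcurrentHashMap;

public class LatencyTrackingCallback implements MqttCallback {
    // Message id -> time the PUBLISH was handed to the client.
    private final ConcurrentHashMap<Integer, Long> publishTimes = new ConcurrentHashMap<>();

    // Call this right after topic.publish(message) with the returned token.
    public void recordPublish(IMqttDeliveryToken token) {
        publishTimes.put(token.getMessageId(), System.currentTimeMillis());
    }

    @Override
    public void deliveryComplete(IMqttDeliveryToken token) {
        // For QoS 1 this fires once the acknowledgement has been received.
        Long start = publishTimes.remove(token.getMessageId());
        if (start != null) {
            System.out.println("Delivery latency: " + (System.currentTimeMillis() - start) + " ms");
        }
    }

    @Override
    public void connectionLost(Throwable cause) { }

    @Override
    public void messageArrived(String topic, MqttMessage message) { }
}

Registered via mqttClient.setCallback(...), this avoids blocking the publishing thread with waitForCompletion(), though the blocking version in the question is also a valid way to measure the round trip for QoS 1.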
I am trying to do a batch insertion into an existing database, but I get the following exception:
Exception in thread "GC-Monitor" java.lang.OutOfMemoryError: Java heap
space at java.util.Arrays.copyOf(Arrays.java:2245) at
java.util.Arrays.copyOf(Arrays.java:2219) at
java.util.ArrayList.grow(ArrayList.java:242) at
java.util.ArrayList.ensureExplicitCapacity(ArrayList.java:216) at
java.util.ArrayList.ensureCapacityInternal(ArrayList.java:208) at
java.util.ArrayList.add(ArrayList.java:440) at
java.util.Formatter.parse(Formatter.java:2525) at
java.util.Formatter.format(Formatter.java:2469) at
java.util.Formatter.format(Formatter.java:2423) at
java.lang.String.format(String.java:2792) at
org.neo4j.kernel.impl.cache.MeasureDoNothing.run(MeasureDoNothing.java:64)
Fail: Transaction was marked as successful, but unable to commit
transaction so rolled back.
Here is the structure of my insertion code:
public void parseExecutionRecordFile(Node episodeVersionNode, String filePath, Integer insertionBatchSize) throws Exception {
Gson gson = new Gson();
BufferedReader reader = new BufferedReader(new FileReader(filePath));
String aDataRow = "";
List<ExecutionRecord> executionRecords = new LinkedList<>();
Integer numberOfProcessedExecutionRecords = 0;
Integer insertionCounter = 0;
ExecutionRecord lastProcessedExecutionRecord = null;
Node lastProcessedExecutionRecordNode = null;
Long start = System.nanoTime();
while((aDataRow = reader.readLine()) != null) {
JsonReader jsonReader = new JsonReader(new StringReader(aDataRow));
jsonReader.setLenient(true);
ExecutionRecord executionRecord = gson.fromJson(jsonReader, ExecutionRecord.class);
executionRecords.add(executionRecord);
insertionCounter++;
if(insertionCounter == insertionBatchSize || executionRecord.getType() == ExecutionRecord.Type.END_MESSAGE) {
lastProcessedExecutionRecordNode = appendEpisodeData(episodeVersionNode, lastProcessedExecutionRecordNode, executionRecords, lastProcessedExecutionRecord == null ? null : lastProcessedExecutionRecord.getTraceSequenceNumber());
executionRecords = new LinkedList<>();
lastProcessedExecutionRecord = executionRecord;
numberOfProcessedExecutionRecords += insertionCounter;
insertionCounter = 0;
}
}
}
public Node appendEpisodeData(Node episodeVersionNode, Node previousExecutionRecordNode, List<ExecutionRecord> executionRecordList, Integer traceCounter) {
Iterator<ExecutionRecord> executionRecordIterator = executionRecordList.iterator();
Node previousTraceNode = null;
Node currentTraceNode = null;
Node currentExecutionRecordNode = null;
try (Transaction tx = dbInstance.beginTx()) {
// some graph insertion
tx.success();
return currentExecutionRecordNode;
}
}
So basically, I read JSON objects from a file (ca. 20,000 objects) and insert them into Neo4j every 10,000 records. If there are only 10,000 JSON objects in the file, it works fine. But when I have 20,000, it throws the exception.
Thanks in advance, any help would be really appreciated!
If it works with 10,000 objects, just try at least doubling the heap memory.
Take a look at the following site: http://neo4j.com/docs/stable/server-performance.html
The wrapper.java.maxmemory option could resolve your problem.
As you also insert several thousand properties, all of that transaction state is held in memory, so I think a 10k batch size is just fine for that amount of heap.
You also don't close your JSON reader, so it might linger around together with the StringReader inside.
You should also use an ArrayList initialized to your batch size and call list.clear() instead of recreating/reassigning it.
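A minimal sketch of what those last two points could look like; the ExecutionRecord handling and the actual Neo4j insert are reduced to a placeholder flush step, so this is illustrative rather than a drop-in replacement for parseExecutionRecordFile:

import com.google.gson.Gson;
import com.google.gson.stream.JsonReader;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class BatchedImport {
    public static void importFile(String filePath, int batchSize) throws Exception {
        Gson gson = new Gson();
        List<Object> batch = new ArrayList<>(batchSize);
        try (BufferedReader reader = new BufferedReader(new FileReader(filePath))) {
            String row;
            while ((row = reader.readLine()) != null) {
                // try-with-resources closes the JsonReader (and its StringReader) per line.
                try (JsonReader jsonReader = new JsonReader(new StringReader(row))) {
                    jsonReader.setLenient(true);
                    batch.add(gson.fromJson(jsonReader, Object.class));
                }
                if (batch.size() >= batchSize) {
                    flush(batch);      // e.g. one Neo4j transaction per batch
                    batch.clear();     // reuse the same list instead of reallocating it
                }
            }
            if (!batch.isEmpty()) {
                flush(batch);
            }
        }
    }

    private static void flush(List<Object> batch) {
        // placeholder for appendEpisodeData(...) from the question
    }
}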
Question:
Assume an email message with an attachment (say a JPEG attachment). How do I parse the email message (not using the Tika facade classes) and return the distinct pieces: a) the email's text contents and b) the email attachment?
Configuration:
Tika 1.2
Java 1.7
Details:
I have been able to properly parse email messages in basic email message formats. However, after parsing, I need to know a) the email's text contents and b) the contents of any attachment to the email. I will store these items in my database, essentially as a parent email with child attachments.
What I cannot figure out is how to "get back" the distinct parts, know that the parent email has attachments, and store those attachments separately, referenced to the email. This is, I believe, essentially similar to extracting ZipFile contents.
Code Example:
private Message processDocument(String fullfilepath) {
try {
File filename = new File(fullfilepath) ;
return this.processDocument(filename) ;
} catch (NullPointerException npe) {
Message error = new Message(false) ;
error.appendErrorMessage("The file name was null.") ;
return error ;
}
}
private Message processDocument(File filename) {
InputStream stream = null;
try {
stream = new FileInputStream(filename) ;
} catch (FileNotFoundException fnfe) {
// TODO Auto-generated catch block
fnfe.printStackTrace();
System.out.println("FileNotFoundException") ;
return diag ;
}
int writelimit = -1 ;
ContentHandler texthandler = new BodyContentHandler(writelimit);
this.safehandlerbodytext = new SafeContentHandler(texthandler);
this.meta = new Metadata() ;
ParseContext context = new ParseContext() ;
AutoDetectParser autodetectparser = new AutoDetectParser() ;
try {
autodetectparser.parse(
stream,
texthandler,
meta,
context) ;
this.documenttype = meta.get("Content-Type") ;
diag.setSuccessful(true);
} catch (IOException ioe) {
// if the document stream could not be read
System.out.println("TikaTextExtractorHelper IOException " + ioe.getMessage()) ;
//FIXME -- add real handling
} catch (SAXException se) {
// if the SAX events could not be processed
System.out.println("TikaTextExtractorHelper SAXException " + se.getMessage()) ;
//FIXME -- add real handling
} catch (TikaException te) {
// if the document could not be parsed
System.out.println("TikaTextExtractorHelper TikaException " + te.getMessage()) ;
System.out.println("Exception Filename = " + filename.getName()) ;
//FIXME -- add real handling
}
return diag ;
}
When Tika hits an embedded document, it goes to the ParseContext to see if you have supplied a recursing parser. If you have, it'll use that to process any embedded resources. If you haven't, it'll skip them.
So, what you probably want to do is something like:
public static class HandleEmbeddedParser extends AbstractParser {
public List<File> found = new ArrayList<File>();
public Set<MediaType> getSupportedTypes(ParseContext context) {
// Return what you want to handle
HashSet<MediaType> types = new HashSet<MediaType>();
types.add(MediaType.application("pdf"));
types.add(MediaType.application("zip"));
return types;
}
public void parse(
InputStream stream, ContentHandler handler,
Metadata metadata, ParseContext context
) throws IOException {
// Do something with the child documents
// eg save to disk
File f = File.createTempFile("tika","tmp");
found.add(f);
FileOutputStream fout = new FileOutputStream(f);
IOUtils.copy(stream,fout);
fout.close();
}
}
ParseContext context = new ParseContext();
context.set(Parser.class, new HandleEmbeddedParser());
parser.parse(....);
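To tie that back to the question's processDocument method, a hedged sketch of the wiring (reusing the field names from the question, not tested against Tika 1.2 specifically) could be:

// Collect attachments with the embedded parser while BodyContentHandler
// still receives the top-level email body text.
HandleEmbeddedParser embedded = new HandleEmbeddedParser();
ParseContext context = new ParseContext();
context.set(Parser.class, embedded);

AutoDetectParser autodetectparser = new AutoDetectParser();
ContentHandler texthandler = new BodyContentHandler(-1);
Metadata meta = new Metadata();
try (InputStream stream = new FileInputStream(filename)) {
    autodetectparser.parse(stream, texthandler, meta, context);
} catch (IOException | SAXException | TikaException e) {
    // handle as in the question's catch blocks
}
// texthandler.toString() -> email body text
// embedded.found         -> temp files holding the extracted attachments

For the JPEG attachment in the question, the getSupportedTypes set would also need to include MediaType.image("jpeg").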