Graph database not updated (Neo4j)

I am using Neo4j version 3.1.0 in embedded mode (API 3.1.0). I have created 20 million graphs, each consisting of 9 nodes with their respective properties. At runtime I change a property of a node, but the change is not reflected in the database when I fetch the node's property with the Neo4j shell.
The database size is almost 20 GB and it is running on a server. My guess is that the change happens in the cache but is never flushed to the graph database. What do I need to change so that the update reaches the graph database?
try (Transaction tx = GraphCreation.getInstance().dbService.beginTx()) {
    Node n = GraphCreation.getInstance().dbService.findNode(MyLabels.IMSI, "IMSI", "A" + 0);
    n.setProperty("MSISDN", "Z" + 0);
    tx.success(); // mark success; the commit happens when the transaction is closed
} catch (Exception e) {
    e.printStackTrace();
}
try (Transaction tx = GraphCreation.getInstance().dbService.beginTx()) {
    String cql = "match (a:IMSI {IMSI: 'A0'}) return a.MSISDN";
    Result result = GraphCreation.getInstance().dbService.execute(cql);
    while (result.hasNext()) {
        Map<String, Object> row = result.next();
        for (String key : result.columns()) {
            System.out.printf("%s = %s%n", key, row.get(key));
        }
    }
    tx.success();
    timerContext.close();
} catch (Exception e) {
    e.printStackTrace();
}

The issue has been solved. The graph database was not being shut down properly when the program was stopped with Ctrl+C. To handle this case, add the following snippet to your code:
private static void registerShutdownHook( final GraphDatabaseService graphDb )
{
    // Registers a shutdown hook for the Neo4j instance so that it
    // shuts down nicely when the VM exits (even if you "Ctrl-C" the
    // running application).
    Runtime.getRuntime().addShutdownHook( new Thread()
    {
        @Override
        public void run()
        {
            graphDb.shutdown();
        }
    } );
}
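For completeness, the hook is typically registered right after the embedded database is created, so an abrupt exit still calls shutdown() and flushes pending changes to disk. A minimal sketch based on the standard Neo4j 3.x embedded API (the store path is a placeholder, and the GraphCreation wrapper from the question is omitted):

import java.io.File;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

// Create the embedded database and register the hook immediately.
GraphDatabaseService graphDb = new GraphDatabaseFactory()
        .newEmbeddedDatabase( new File( "/path/to/graph.db" ) );
registerShutdownHook( graphDb );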

Related

Enable Streaming Data transformation using Cloud Dataflow

I am trying to implement a streaming data transformation from one CloudSQL table to another CloudSQL table using windowing (unbounded PCollections).
My Dataflow job completes successfully, but it does not keep running when new data arrives in the first table. Have I missed anything at the code level to make it run in streaming mode?
Run Command:
--project=<your project ID> --stagingLocation=gs://<your staging bucket>
--runner=DataflowRunner --streaming=true
Code Snippet:
PCollection<TableRow> tblRows = p.apply(JdbcIO.<TableRow>read()
.withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
"org.postgresql.Driver", connectionString)
.withUsername(<UserName>).withPassword(<PWD>))
.withQuery("select id,order_number from public.tableRead")
.withCoder(TableRowJsonCoder.of())
.withRowMapper(new JdbcIO.RowMapper<TableRow>() {
public TableRow mapRow(ResultSet resultSet) throws Exception
{
TableRow result = new TableRow();
result.set("id", resultSet.getString("id"));
result.set("order_number", resultSet.getString("order_number"));
return result;
}
})
);
PCollection<TableRow> tblWindow = tblRows.apply("window 1s", Window.into(FixedWindows.of(Duration.standardMinutes(5))));
PCollection<KV<Integer,TableRow>> keyedTblWindow= tblRows.apply("Process", ParDo.of(new DoFn<TableRow, KV<Integer,TableRow>>()
{
@ProcessElement
public void processElement(ProcessContext c) {
TableRow leftRow = c.element();
c.output(KV.of(Integer.parseInt(leftRow.get("id").toString()), leftRow) );
}}));
PCollection<KV<Integer, Iterable<TableRow>>> groupedWindow = keyedTblWindow.apply(GroupByKey.<Integer, TableRow> create());
groupedWindow.apply(JdbcIO.<KV<Integer, Iterable<TableRow>>>write()
.withDataSourceConfiguration( DataSourceConfiguration.create("org.postgresql.Driver", connectionString)
.withUsername(<UserName>).withPassword(<PWD>))
.withStatement("insert into streaming_testing(id,order_number) values(?, ?)")
.withPreparedStatementSetter(new JdbcIO.PreparedStatementSetter<KV<Integer, Iterable<TableRow>>>() {
public void setParameters(KV<Integer, Iterable<TableRow>> element, PreparedStatement query)
throws SQLException {
Iterable<TableRow> rightRowsIterable = element.getValue();
for (Iterator<TableRow> i = rightRowsIterable.iterator(); i.hasNext(); ) {
TableRow mRow = (TableRow) i.next();
query.setInt(1, Integer.parseInt(mRow.get("id").toString()));
query.setInt(2, Integer.parseInt(mRow.get("order_number").toString()));
}
}
})
);
JdbcIO only works on a snapshot of the data produced at the time of execution; it does not poll for changes.
I suppose you have to build a custom ParDo to detect changes in the database.
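That suggestion could look roughly like the sketch below: GenerateSequence provides an unbounded "tick" source that keeps the pipeline in streaming mode, and each tick re-runs the query over plain JDBC. This is an untested outline, not part of JdbcIO; the 30-second interval is arbitrary, connectionString/userName/password stand in for the question's placeholders, and it re-reads the whole table on every tick (real code would track the last id already emitted).

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import org.apache.beam.sdk.io.GenerateSequence;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;
import com.google.api.services.bigquery.model.TableRow;

// One element every 30 seconds keeps the streaming pipeline alive
// and triggers a fresh query on each tick.
PCollection<TableRow> polledRows = p
    .apply("Ticks", GenerateSequence.from(0)
        .withRate(1, Duration.standardSeconds(30)))
    .apply("PollTable", ParDo.of(new DoFn<Long, TableRow>() {
        @ProcessElement
        public void processElement(ProcessContext c) throws Exception {
            // Plain JDBC query per tick; credentials are placeholders.
            try (Connection conn = DriverManager.getConnection(connectionString, userName, password);
                 PreparedStatement stmt = conn.prepareStatement(
                         "select id, order_number from public.tableRead");
                 ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    TableRow row = new TableRow();
                    row.set("id", rs.getString("id"));
                    row.set("order_number", rs.getString("order_number"));
                    c.output(row);
                }
            }
        }
    }));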

Stateful ParDo not working on Dataflow Runner

Based on the Javadocs and the blog post at https://beam.apache.org/blog/2017/02/13/stateful-processing.html, I tried a simple de-duplication example using the 2.0.0-beta-2 SDK, which reads a file from GCS (containing a list of JSONs, each with a user_id field) and then runs it through the pipeline explained below.
The input data contains about 146K events, of which only 50 are unique. The entire input is about 50 MB, which should be processable in considerably less time than the 2-minute fixed window. I just placed a window there to make sure the per-key-per-window semantics hold without using a GlobalWindow. I run the windowed data through 3 parallel stages to compare the results, each of which is explained below.
1. Just copies the contents into a new file on GCS - this ensures all the events are being processed as expected, and I verified the contents are exactly the same as the input.
2. Combine.PerKey on the user_id, picking only the first element from the Iterable - this should essentially deduplicate the data, and it works as expected. The resulting file has the exact number of unique items from the original list of events - 50 elements.
3. Stateful ParDo which checks if the key has been seen already and emits an output only when it's not. Ideally, the result from this should match the deduped data as in [2], but all I am seeing is only 3 unique events. These 3 unique events always point to the same 3 user_ids in the few runs I did.
Interestingly, when I switch from the DataflowRunner to the DirectRunner and run this whole process locally, the output from [3] matches [2], with only 50 unique elements as expected. So I suspect there may be an issue with the DataflowRunner for the stateful ParDo.
public class StatefulParDoSample {
private static Logger logger = LoggerFactory.getLogger(StatefulParDoSample.class.getName());
static class StatefulDoFn extends DoFn<KV<String, String>, String> {
final Aggregator<Long, Long> processedElements = createAggregator("processed", Sum.ofLongs());
final Aggregator<Long, Long> skippedElements = createAggregator("skipped", Sum.ofLongs());
@StateId("keyTracker")
private final StateSpec<Object, ValueState<Integer>> keyTrackerSpec =
StateSpecs.value(VarIntCoder.of());
@ProcessElement
public void processElement(
ProcessContext context,
@StateId("keyTracker") ValueState<Integer> keyTracker) {
processedElements.addValue(1l);
final String userId = context.element().getKey();
int wasSeen = firstNonNull(keyTracker.read(), 0);
if (wasSeen == 0) {
keyTracker.write( 1);
context.output(context.element().getValue());
} else {
keyTracker.write(wasSeen + 1);
skippedElements.addValue(1l);
}
}
}
public static void main(String[] args) {
DataflowPipelineOptions pipelineOptions = PipelineOptionsFactory.create().as(DataflowPipelineOptions.class);
pipelineOptions.setRunner(DataflowRunner.class);
pipelineOptions.setProject("project-name");
pipelineOptions.setStagingLocation(GCS_STAGING_LOCATION);
pipelineOptions.setStreaming(false);
pipelineOptions.setAppName("deduper");
Pipeline p = Pipeline.create(pipelineOptions);
final ObjectMapper mapper = new ObjectMapper();
PCollection<KV<String, String>> keyedEvents =
p
.apply(TextIO.Read.from(GCS_SAMPLE_INPUT_FILE_PATH))
.apply(WithKeys.of(new SerializableFunction<String, String>() {
@Override
public String apply(String input) {
try {
Map<String, Object> eventJson =
mapper.readValue(input, Map.class);
return (String) eventJson.get("user_id");
} catch (Exception e) {
}
return "";
}
}))
.apply(
Window.into(
FixedWindows.of(Duration.standardMinutes(2))
)
);
keyedEvents
.apply(ParDo.of(new StatefulDoFn()))
.apply(TextIO.Write.to(GCS_SAMPLE_OUTPUT_FILE_PATH).withNumShards(1));
keyedEvents
.apply(Values.create())
.apply(TextIO.Write.to(GCS_SAMPLE_COPY_FILE_PATH).withNumShards(1));
keyedEvents
.apply(Combine.perKey(new SerializableFunction<Iterable<String>, String>() {
@Override
public String apply(Iterable<String> input) {
return !input.iterator().hasNext() ? "empty" : input.iterator().next();
}
}))
.apply(Values.create())
.apply(TextIO.Write.to(GCS_SAMPLE_COMBINE_FILE_PATH).withNumShards(1));
PipelineResult result = p.run();
result.waitUntilFinish();
}
}
This was a bug in the Dataflow service in batch mode, fixed in the upcoming 0.6.0 Beam release (or HEAD if you track the bleeding edge).
Thank you for bringing it to my attention! For reference, or if anything else comes up, this was tracked by BEAM-1611.

Sharing BigTable Connection object among DataFlow DoFn sub-classes

I am setting up a Java Pipeline in DataFlow to read a .csv file and to create a bunch of BigTable rows based on the content of the file. I see in the BigTable documentation the note that connecting to BigTable is an 'expensive' operation and that it's a good idea to do it only once and to share the connection among the functions that need it.
However, if I declare the Connection object as a public static variable in the main class and first connect to BigTable in the main function, I get a NullPointerException when I subsequently try to reference the connection from the processElement() function of my DoFn subclasses in the Dataflow pipeline.
Conversely, if I declare the Connection as a static variable in the actual DoFn class, then the operation works successfully.
What is the best-practice or optimal way to do this?
I'm concerned that if I implement the second option at scale, I will be wasting a lot of time and resources. If I keep the variable as static in the DoFn class, is it enough to ensure that the APIs don't try to re-establish the connection every time?
I realize there is a special BigTable I/O call to sync Dataflow pipeline objects with BigTable, but I think I need to write one on my own to build some special logic into the DoFn processElement() function...
This is what the "working" code looks like:
class DigitizeBT extends DoFn<String, String>{
private static Connection m_locConn;
@Override
public void processElement(ProcessContext c)
{
try
{
m_locConn = BigtableConfiguration.connect("projectID", "instanceID");
Table tbl = m_locConn.getTable(TableName.valueOf("TableName"));
Put put = new Put(Bytes.toBytes(rowKey));
put.addColumn(
Bytes.toBytes("CF1"),
Bytes.toBytes("SomeName"),
Bytes.toBytes("SomeValue"));
tbl.put(put);
}
catch (IOException e)
{
e.printStackTrace();
System.exit(1);
}
}
}
This is what the updated code looks like, FYI:
public void SmallKVJob()
{
CloudBigtableScanConfiguration config = new CloudBigtableScanConfiguration.Builder()
.withProjectId(DEF.ID_PROJ)
.withInstanceId(DEF.ID_INST)
.withTableId(DEF.ID_TBL_UNITS)
.build();
DataflowPipelineOptions options = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
options.setProject(DEF.ID_PROJ);
options.setStagingLocation(DEF.ID_STG_LOC);
// options.setNumWorkers(3);
// options.setMaxNumWorkers(5);
// options.setRunner(BlockingDataflowPipelineRunner.class);
options.setRunner(DirectPipelineRunner.class);
Pipeline p = Pipeline.create(options);
p.apply(TextIO.Read.from(DEF.ID_BAL))
.apply(ParDo.of(new DoFn1()))
.apply(ParDo.of(new DoFn2()))
.apply(ParDo.of(new DoFn3(config)));
m_log.info("starting to run the job");
p.run();
m_log.info("finished running the job");
}
}
class DoFn1 extends DoFn<String, KV<String, Integer>>
{
@Override
public void processElement(ProcessContext c)
{
c.output(KV.of(c.element().split("\\,")[0],Integer.valueOf(c.element().split("\\,")[1])));
}
}
class DoFn2 extends DoFn<KV<String, Integer>, KV<String, Integer>>
{
@Override
public void processElement(ProcessContext c)
{
int max = c.element().getValue();
String name = c.element().getKey();
for(int i = 0; i<max;i++)
c.output(KV.of(name, 1));
}
}
class DoFn3 extends AbstractCloudBigtableTableDoFn<KV<String, Integer>, String>
{
public DoFn3(CloudBigtableConfiguration config)
{
super(config);
}
@Override
public void processElement(ProcessContext c)
{
try
{
Integer max = c.element().getValue();
for(int i = 0; i<max; i++)
{
String owner = c.element().getKey();
String rnd = UUID.randomUUID().toString();
Put p = new Put(Bytes.toBytes(owner+"*"+rnd));
p.addColumn(Bytes.toBytes(DEF.ID_CF1), Bytes.toBytes("Owner"), Bytes.toBytes(owner));
getConnection().getTable(TableName.valueOf(DEF.ID_TBL_UNITS)).put(p);
c.output("Success");
}
} catch (IOException e)
{
c.output(e.toString());
e.printStackTrace();
}
}
}
The input .csv file looks something like this:
Mary,3000
John,5000
Peter,2000
So, for each row in the .csv file, I have to put x rows into BigTable, where x is the value in the second cell of that row...
We built AbstractCloudBigtableTableDoFn (Source & Docs) for this purpose. Extend that class instead of DoFn, and call getConnection() instead of creating a Connection yourself.
10,000 small rows should take a second or two of actual work.
EDIT: As per the comments, BufferedMutator should be used instead of Table.put() for best throughput.
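As a rough, untested illustration of that suggestion, DoFn3 from the question could route its writes through a BufferedMutator (org.apache.hadoop.hbase.client.BufferedMutator) obtained from the shared connection; the class and field names below simply reuse the snippet above.

class DoFn3Buffered extends AbstractCloudBigtableTableDoFn<KV<String, Integer>, String>
{
    public DoFn3Buffered(CloudBigtableConfiguration config)
    {
        super(config);
    }

    @Override
    public void processElement(ProcessContext c) throws Exception
    {
        // getBufferedMutator() batches mutations client-side and flushes
        // them in larger RPCs, which avoids one put() round trip per row.
        try (BufferedMutator mutator = getConnection()
                .getBufferedMutator(TableName.valueOf(DEF.ID_TBL_UNITS)))
        {
            String owner = c.element().getKey();
            int max = c.element().getValue();
            for (int i = 0; i < max; i++)
            {
                Put p = new Put(Bytes.toBytes(owner + "*" + UUID.randomUUID()));
                p.addColumn(Bytes.toBytes(DEF.ID_CF1), Bytes.toBytes("Owner"), Bytes.toBytes(owner));
                mutator.mutate(p);
            }
        } // close() flushes any remaining buffered mutations
        c.output("Success");
    }
}

Keeping one mutator per bundle (opened in startBundle and closed in finishBundle) would batch across elements as well; the per-element form above is only meant to mirror the structure of the original snippet.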

How can I batch delete millions of entities using DatastoreIO and Dataflow

I'm trying to use Dataflow to delete many millions of Datastore entities and the pace is extremely slow (5 entities/s). I am hoping you can explain to me the pattern I should follow to allow that to scale up to a reasonable pace. Just adding more workers did not help.
The Datastore Admin console has the ability to delete all entities of a specific kind but it fails a lot and takes me a week or more to delete 40 million entities. Dataflow ought to be able to help me delete millions of entities that match only certain query parameters as well.
I'm guessing that some type of batching strategy should be employed (where I create a mutation with 1000 deletes in it, for example), but it's not obvious to me how I would go about that. DatastoreIO gives me just one entity at a time to work with. Pointers would be greatly appreciated.
Below is my current slow solution.
Pipeline p = Pipeline.create(options);
DatastoreIO.Source source = DatastoreIO.source()
.withDataset(options.getDataset())
.withQuery(getInstrumentQuery(options))
.withNamespace(options.getNamespace());
p.apply("ReadLeafDataFromDatastore", Read.from(source))
.apply("DeleteRecords", ParDo.of(new DeleteInstrument(options.getDataset())));
p.run();
static class DeleteInstrument extends DoFn<Entity, Integer> {
String dataset;
DeleteInstrument(String dataset) {
this.dataset = dataset;
}
@Override
public void processElement(ProcessContext c) {
DatastoreV1.Mutation.Builder mutation = DatastoreV1.Mutation.newBuilder();
mutation.addDelete(c.element().getKey());
final DatastoreV1.CommitRequest.Builder request = DatastoreV1.CommitRequest.newBuilder();
request.setMutation(mutation);
request.setMode(DatastoreV1.CommitRequest.Mode.NON_TRANSACTIONAL);
try {
DatastoreOptions.Builder dbo = new DatastoreOptions.Builder();
dbo.dataset(dataset);
dbo.credential(getCredential());
Datastore db = DatastoreFactory.get().create(dbo.build());
db.commit(request.build());
c.output(1);
count++;
if(count%100 == 0) {
LOG.info(count+"");
}
} catch (Exception e) {
c.output(0);
e.printStackTrace();
}
}
}
There is no direct way of deleting entities using the current version of DatastoreIO. This version of DatastoreIO is going to be deprecated in favor of a new version (v1beta3) in the next Dataflow release. We think there is a good use case for providing a delete utility (either through an example or a PTransform), but it is still a work in progress.
For now you can batch your deletes, instead of deleting one at a time:
public static class DeleteEntityFn extends DoFn<Entity, Void> {
// Datastore max batch limit
private static final int DATASTORE_BATCH_UPDATE_LIMIT = 500;
private Datastore db;
private List<Key> keyList = new ArrayList<>();
@Override
public void startBundle(Context c) throws Exception {
// Initialize Datastore Client
// db = ...
}
@Override
public void processElement(ProcessContext c) throws Exception {
keyList.add(c.element().getKey());
if (keyList.size() >= DATASTORE_BATCH_UPDATE_LIMIT) {
flush();
}
}
@Override
public void finishBundle(Context c) throws Exception {
if (keyList.size() > 0) {
flush();
}
}
private void flush() throws Exception {
// Make one delete request instead of one for each element.
CommitRequest request =
CommitRequest.newBuilder()
.setMode(CommitRequest.Mode.NON_TRANSACTIONAL)
.setMutation(Mutation.newBuilder().addAllDelete(keyList).build())
.build();
db.commit(request);
keyList.clear();
}
}
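To use it, the per-entity DeleteInstrument step in the original pipeline would be swapped for this DoFn; a sketch of the wiring is below. The Datastore client left as "// db = ..." in startBundle() can be created the same way the question already does, via DatastoreOptions.Builder and DatastoreFactory; the transform name here is illustrative.

p.apply("ReadLeafDataFromDatastore", Read.from(source))
 .apply("BatchDeleteRecords", ParDo.of(new DeleteEntityFn()));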

Neo4j ServerPlugin Configuration

I'm building a Neo4J Server plugin. I'd like to have some configuration values that I can manually set in neo4j.properties or neo4j-server.properties and then the plugin can read and utilize these value. How can I access config values from a ServerPlugin?
Clarification:
I'd really like something that will work across future releases of Neo4J, so something that is part of the public API and is not deprecated would be best.
Using Neo4j's internal dependency mechanism, you can access an instance of Config (https://github.com/neo4j/neo4j/blob/master/community/kernel/src/main/java/org/neo4j/kernel/configuration/Config.java). This class gives you access to the configuration.
Take the following untested snippet as a guideline:
...
import org.neo4j.kernel.configuration.Config;
...

@Description( "An extension to the Neo4j Server accessing config" )
public class ConfigAwarePlugin extends ServerPlugin
{
    @Name( "config" )
    @Description( "Do stuff with config" )
    @PluginTarget( GraphDatabaseService.class )
    public void sample( @Source GraphDatabaseService graphDb )
    {
        Config config = ((GraphDatabaseAPI) graphDb)
                .getDependencyResolver().resolveDependency( Config.class );
        // do stuff with config
    }
}
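If you also want a typed custom property, the kernel's Settings helpers can define one that maps onto a line in neo4j.properties. The same caveat applies as above: this is an untested sketch against internal API, the setting name "myplugin.threshold" is invented for illustration, and depending on the Neo4j version the Settings class lives in org.neo4j.kernel.configuration or org.neo4j.helpers.

import static org.neo4j.kernel.configuration.Settings.INTEGER;
import static org.neo4j.kernel.configuration.Settings.setting;

import org.neo4j.graphdb.config.Setting;

public class MyPluginSettings
{
    // Corresponds to a "myplugin.threshold=42" line in neo4j.properties;
    // falls back to 100 when the property is absent.
    public static final Setting<Integer> threshold =
            setting( "myplugin.threshold", INTEGER, "100" );
}

// Inside the plugin method, after resolving Config as shown above:
// Integer threshold = config.get( MyPluginSettings.threshold );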
I use Properties:
Properties props = new Properties();
try
{
FileInputStream in = new FileInputStream("./conf/neo4j.properties");
props.load(in);
}
catch (FileNotFoundException e)
{
try
{
FileInputStream in = new FileInputStream("./neo4j.properties");
props.load(in);
}
catch (FileNotFoundException e2)
{
logger.warn(e2.getMessage());
}
catch (IOException e2)
{
logger.warn(e2.getMessage());
}
}
catch (IOException e)
{
logger.warn(e.getMessage());
}
String myPropertyString = props.getProperty("myProperty");
if (myPropertyString != null)
{
myProperty = Integer.parseInt(myPropertyString);
}
else
{
myProperty = 100;
}
and in neo4j.properties I have:
...
# Enable shell server so that remote clients can connect via Neo4j shell.
#remote_shell_enabled=true
# Specify custom shell port (default is 1337).
#remote_shell_port=1234
myProperty=100
