Is there a way to apply a side input to a BigQueryIO.read() operation in Apache Beam?
Say, for example, I have a value in a PCollection that I want to use in a query to fetch data from a BigQuery table. Is this possible using a side input, or should something else be used in such a case?
I used NestedValueProvider in a similar case, but I guess that only works when a certain value depends on a runtime value. Or can I use the same thing here? Please correct me if I'm wrong.
The code that I tried:
Bigquery bigQueryClient = start_pipeline.newBigQueryClient(options.as(BigQueryOptions.class)).build();
Tabledata tableRequest = bigQueryClient.tabledata();
PCollection<TableRow> existingData = readData.apply("Read existing data", ParDo.of(new DoFn<String, TableRow>() {
    @ProcessElement
    public void processElement(ProcessContext c) throws IOException {
        List<TableRow> list = c.sideInput(bqDataView);
        String tableName = list.get(0).get("table").toString();
        TableDataList table = tableRequest.list("projectID", "DatasetID", tableName).execute();
        for (TableRow row : table.getRows()) {
            c.output(row);
        }
    }
}).withSideInputs(bqDataView));
The error that I get is:
Exception in thread "main" java.lang.IllegalArgumentException: unable to serialize BeamTest.StarterPipeline$1@86b455
at org.apache.beam.sdk.util.SerializableUtils.serializeToByteArray(SerializableUtils.java:53)
at org.apache.beam.sdk.util.SerializableUtils.clone(SerializableUtils.java:90)
at org.apache.beam.sdk.transforms.ParDo$SingleOutput.<init>(ParDo.java:569)
at org.apache.beam.sdk.transforms.ParDo.of(ParDo.java:434)
at BeamTest.StarterPipeline.main(StarterPipeline.java:158)
Caused by: java.io.NotSerializableException: com.google.api.services.bigquery.Bigquery$Tabledata
at java.io.ObjectOutputStream.writeObject0(Unknown Source)
at java.io.ObjectOutputStream.defaultWriteFields(Unknown Source)
at java.io.ObjectOutputStream.writeSerialData(Unknown Source)
at java.io.ObjectOutputStream.writeOrdinaryObject(Unknown Source)
at java.io.ObjectOutputStream.writeObject0(Unknown Source)
at java.io.ObjectOutputStream.writeObject(Unknown Source)
at org.apache.beam.sdk.util.SerializableUtils.serializeToByteArray(SerializableUtils.java:49)
... 4 more
The Beam model does not currently support this kind of data-dependent operation very well.
A way of doing it is to code your own DoFn that receives the side input and connects directly to BQ. Unfortunately, this would not give you any parallelism, as the DoFn would run completely on the same thread.
Once Splittable DoFns are supported in Beam, this will be a different story.
In the current state of the world, you would need to use the BQ client library to add code that would query BQ as if you were not in a Beam pipeline.
Given the code in your question, a rough idea on how to implement this is the following:
class ReadDataDoFn extends DoFn<String, TableRow> {
    // Marked transient: these are created in @Setup on each worker, so they are
    // never serialized with the DoFn (which is what caused the NotSerializableException).
    private transient Tabledata tableRequest;
    private transient Bigquery bigQueryClient;

    private Bigquery createBigQueryClientWithinDoFn() {
        // I'm not sure how you'd implement this, but you had the right idea
    }

    @Setup
    public void setup() {
        bigQueryClient = createBigQueryClientWithinDoFn();
        tableRequest = bigQueryClient.tabledata();
    }

    @ProcessElement
    public void processElement(ProcessContext c) throws IOException {
        List<TableRow> list = c.sideInput(bqDataView);
        String tableName = list.get(0).get("table").toString();
        TableDataList table = tableRequest.list("projectID", "DatasetID", tableName).execute();
        for (TableRow row : table.getRows()) {
            c.output(row);
        }
    }
}
PCollection<TableRow> existingData = readData.apply("Read existing data", ParDo.of(new ReadDataDoFn()).withSideInputs(bqDataView));
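As for the createBigQueryClientWithinDoFn() stub, a minimal sketch using the BigQuery API client library could look like the following. This assumes application-default credentials are available on the workers; the transport, scopes, and application name are illustrative choices, not requirements:
private Bigquery createBigQueryClientWithinDoFn() {
    try {
        // Build the client on the worker, inside the DoFn lifecycle, so no
        // non-serializable object is captured at pipeline construction time.
        HttpTransport transport = GoogleNetHttpTransport.newTrustedTransport();
        JsonFactory jsonFactory = JacksonFactory.getDefaultInstance();
        GoogleCredential credential = GoogleCredential.getApplicationDefault();
        if (credential.createScopedRequired()) {
            credential = credential.createScoped(BigqueryScopes.all());
        }
        return new Bigquery.Builder(transport, jsonFactory, credential)
                .setApplicationName("beam-bq-side-input-read") // hypothetical name
                .build();
    } catch (IOException | GeneralSecurityException e) {
        throw new RuntimeException("Unable to create BigQuery client", e);
    }
}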
Related
I'm implementing a KStream-GlobalKTable join using Spring Cloud Stream, and I'm facing the problem that the join operation doesn't get any matches, although it definitely should. The code looks as follows:
@Component
@EnableBinding(CustomProcessor.class)
public class MyProcessor {

    private static final Log LOGGER = LogFactory.getLog(MyProcessor.class);

    @Autowired
    private InteractiveQueryService interactiveQueryService;

    ReadOnlyKeyValueStore<Object, Object> keyValueStore;

    @StreamListener
    @SendTo(CustomProcessor.OUTPUT)
    public KStream<EventKey, EventEnriched> process(
            @Input(CustomProcessor.INPUT) KStream<EventKey, EventEnriched> inputStream,
            @Input(CustomProcessor.LOOKUP) GlobalKTable<LookupKey, LookupData> lookupStore
    ) {
        keyValueStore = interactiveQueryService.getQueryableStore("lookupStore", QueryableStoreTypes.keyValueStore());
        LOGGER.info("Lookup: " + keyValueStore.get(new LookupKey("google.de")));
        return inputStream.leftJoin(
                lookupStore,
                (inputKey, inputValue) -> new LookupKey(inputValue.getDomain().replace("www.", "")),
                this::enrichData
        );
    }

    public EventEnriched enrichData(EventEnriched input, LookupData lookupRecord) {
    ...
    }
}
Here is the CustomProcessor:
public interface CustomProcessor extends KafkaStreamsProcessor {
    String INPUT = "input";
    String OUTPUT = "output";
    String LOOKUP = "lookupTable";

    @Input(CustomProcessor.LOOKUP)
    GlobalKTable<LookupKey, ?> lookupTable();
}
Without calling the line in MyProcessor
keyValueStore.get(...)
the code runs fine, but the GlobalKTable seems to be null. But if I call
LOGGER.info("Lookup: " + keyValueStore.get(new LookupKey("google.de")));
in order to inspect the GlobalKTable, the application fails on startup with:
Error starting ApplicationContext. To display the conditions report re-run your application with 'debug' enabled.
2019-06-26T09:04:00.000 [ERROR] [main-858] [org.springframework.boot.SpringApplication] [reportFailure:858] Application run failed
org.springframework.beans.factory.BeanInitializationException: Cannot setup StreamListener for public org.apache.kafka.streams.kstream.KStream MyProcessor.process(org.apache.kafka.streams.kstream.KStream,org.apache.kafka.streams.kstream.GlobalKTable); nested exception is java.lang.reflect.InvocationTargetException
at org.springframework.cloud.stream.binder.kafka.streams.KafkaStreamsStreamListenerSetupMethodOrchestrator.orchestrateStreamListenerSetupMethod(KafkaStreamsStreamListenerSetupMethodOrchestrator.java:214)
at org.springframework.cloud.stream.binding.StreamListenerAnnotationBeanPostProcessor.doPostProcess(StreamListenerAnnotationBeanPostProcessor.java:226)
at org.springframework.cloud.stream.binding.StreamListenerAnnotationBeanPostProcessor.lambda$postProcessAfterInitialization$0(StreamListenerAnnotationBeanPostProcessor.java:196)
at java.base/java.lang.Iterable.forEach(Iterable.java:75)
at org.springframework.cloud.stream.binding.StreamListenerAnnotationBeanPostProcessor.injectAndPostProcessDependencies(StreamListenerAnnotationBeanPostProcessor.java:330)
at org.springframework.cloud.stream.binding.StreamListenerAnnotationBeanPostProcessor.afterSingletonsInstantiated(StreamListenerAnnotationBeanPostProcessor.java:113)
at org.springframework.beans.factory.support.DefaultListableBeanFactory.preInstantiateSingletons(DefaultListableBeanFactory.java:866)
at org.springframework.context.support.AbstractApplicationContext.finishBeanFactoryInitialization(AbstractApplicationContext.java:877)
at org.springframework.context.support.AbstractApplicationContext.refresh(AbstractApplicationContext.java:549)
at org.springframework.boot.web.servlet.context.ServletWebServerApplicationContext.refresh(ServletWebServerApplicationContext.java:142)
at org.springframework.boot.SpringApplication.refresh(SpringApplication.java:775)
at org.springframework.boot.SpringApplication.refreshContext(SpringApplication.java:397)
at org.springframework.boot.SpringApplication.run(SpringApplication.java:316)
at org.springframework.boot.SpringApplication.run(SpringApplication.java:1260)
at org.springframework.boot.SpringApplication.run(SpringApplication.java:1248)
at Transformer.main(Transformer.java:31)
Caused by: java.lang.reflect.InvocationTargetException: null
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at org.springframework.cloud.stream.binder.kafka.streams.KafkaStreamsStreamListenerSetupMethodOrchestrator.orchestrateStreamListenerSetupMethod(KafkaStreamsStreamListenerSetupMethodOrchestrator.java:179)
... 15 common frames omitted
Caused by: java.lang.NullPointerException: null
at MyProcessor.process(MyProcessor.java:62)
... 20 common frames omitted
Process finished with exit code 1
Does anybody see a problem in the code? How can I inspect the content of the GlobalKTable?
Best regards
Martin
Now I'm getting closer to the problem. I have tried to query the lookupStore. If I use
final ReadOnlyKeyValueStore<LookupKey, LookupData> lookupStore =
interactiveQueryService.getQueryableStore("myStore", QueryableStoreTypes.<LookupKey, LookupData>keyValueStore());
Then
lookupStore.get(key)
never returns a value. But if I create a HashMap like this:
final KeyValueIterator<LookupKey, LookupData> lookups = lookupStore.all();
Map<LookupKey, LookupData> lookupMap = new HashMap<>();
while (lookups.hasNext()) {
    KeyValue<LookupKey, LookupData> nextLookup = lookups.next();
    lookupMap.put(nextLookup.key, nextLookup.value);
}
lookups.close();
the HashMap contains the correct data and returns the correct value for each key. But the GlobalKTable itself cannot be joined for some reason; it never gets any matches.
I have a Beam pipeline that starts off by reading multiple text files, where each line in a file represents a row that gets inserted into Bigtable later in the pipeline. The scenario requires confirming that the count of rows extracted from each file and the count of rows later inserted into Bigtable match. For this I am planning to develop a custom windowing strategy so that lines from a single file get assigned to a single window, based on the file name as the key that will be passed to the windowing function.
Is there any code sample for creating custom Windowing functions?
Although I changed my strategy for confirming the number of inserted rows, for anyone who is interested in windowing elements read from a batch source (e.g. FileIO) in a batch job, here's the code for creating a custom windowing strategy:
public class FileWindows extends PartitioningWindowFn<Object, IntervalWindow> {

    private static final long serialVersionUID = -476922142925927415L;
    private static final Logger LOG = LoggerFactory.getLogger(FileWindows.class);

    @Override
    public IntervalWindow assignWindow(Instant timestamp) {
        Instant end = new Instant(timestamp.getMillis() + 1);
        IntervalWindow interval = new IntervalWindow(timestamp, end);
        LOG.info("FileWindows >> assignWindow(): Window assigned with Start: {}, End: {}", timestamp, end);
        return interval;
    }

    @Override
    public boolean isCompatible(WindowFn<?, ?> other) {
        return this.equals(other);
    }

    @Override
    public void verifyCompatibility(WindowFn<?, ?> other) throws IncompatibleWindowException {
        if (!this.isCompatible(other)) {
            throw new IncompatibleWindowException(other, String.format("Only %s objects are compatible.", FileWindows.class.getSimpleName()));
        }
    }

    @Override
    public Coder<IntervalWindow> windowCoder() {
        return IntervalWindow.getCoder();
    }
}
and then it can be used in the pipeline as below:
p
    .apply("Assign_Timestamp_to_Each_Message", ParDo.of(new AssignTimestampFn()))
    .apply("Assign_Window_to_Each_Message", Window.<KV<String, String>>into(new FileWindows())
        .withAllowedLateness(Duration.standardMinutes(1))
        .discardingFiredPanes());
Please keep in mind that you will need to write the AssignTimestampFn() so that each message carries a timestamp.
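In case it helps, here is a minimal sketch of such an AssignTimestampFn. The timestamp source is an assumption; this version simply stamps each element with the current processing time, whereas in practice you would usually derive it from the element or the file metadata:
class AssignTimestampFn extends DoFn<KV<String, String>, KV<String, String>> {
    @ProcessElement
    public void processElement(ProcessContext c) {
        // Stamp each element so the FileWindows assignment above has a timestamp to work with.
        c.outputWithTimestamp(c.element(), Instant.now());
    }
}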
We are trying to run a daily Dataflow pipeline that reads off Bigtable and dumps data into GCS (using HBase's Scan and HBaseResultCoder as coder) as follows (just to highlight the idea):
Pipeline pipeline = Pipeline.create(options);
Scan scan = new Scan();
scan.setCacheBlocks(false).setMaxVersions(1);
scan.addFamily(Bytes.toBytes("f"));
CloudBigtableScanConfiguration btConfig = new CloudBigtableScanConfiguration.Builder().withProjectId("aaa").withInstanceId("bbb").withTableId("ccc").withScan(scan).build();
pipeline.apply(Read.from(CloudBigtableIO.read(btConfig))).apply(TextIO.Write.to("gs://bucket/dir/file").withCoder(HBaseResultCoder.getInstance()));
pipeline.run();
This seems to run perfectly as expected.
Now, we want to be able to use the dumped file in GCS for a recovery job if needed. That is, we want a Dataflow pipeline which reads the dumped data (a PCollection<Result>) from GCS and creates Mutations ('Put' objects, basically). For some reason, the following code fails with a bunch of NullPointerExceptions. We are unsure why that would be the case; the if-statements below, which check for null or 0-length byte arrays, were added to see if that would make a difference, but they did not.
// Part of DoFn<Result, Mutation>
@Override
public void processElement(ProcessContext c) {
    Result result = c.element();
    byte[] row = result.getRow();
    if (row == null || row.length == 0) { // NullPointerException at this line
        return;
    }
    Put mutation = new Put(result.getRow());
    // Go through the column/value entries of this row, and create a corresponding put mutation.
    for (Entry<byte[], byte[]> entry : result.getFamilyMap(Bytes.toBytes(cf)).entrySet()) {
        byte[] qualifier = entry.getKey();
        if (qualifier == null || qualifier.length == 0) {
            continue;
        }
        byte[] val = entry.getValue();
        if (val == null || val.length == 0) {
            continue;
        }
        mutation.addImmutable(cf_bytes, qualifier, entry.getValue());
    }
    c.output(mutation);
}
The error we get is the following (line 83 is marked above):
(2a6ad6372944050d): java.lang.NullPointerException at some.package.RecoveryFromGcs$CreateMutationFromResult.processElement(RecoveryFromGcs.java:83)
I have two questions:
1. Has someone experienced something like this when trying to run a ParDo on a PCollection<Result> to get a PCollection<Mutation> which is to be written to Bigtable?
2. Is this a reasonable approach? The end goal is to be able to leave a daily snapshot of our Bigtable (for a specific column family) on a regular basis, as a back-up in case something bad happens. We want to be able to read the back-up data via Dataflow and write it back to Bigtable when we need to.
Any suggestions and help will be really appreciated!
-------- Edit
Here is the code that scans Bigtable and dumps data to GCS:
(Some details are hidden if they are not relevant.)
public static void execute(Options options) {
    Pipeline pipeline = Pipeline.create(options);
    final String cf = "f"; // some specific column family.
    Scan scan = new Scan();
    scan.setCacheBlocks(false).setMaxVersions(1); // Disable caching and read only the latest cell.
    scan.addFamily(Bytes.toBytes(cf));
    CloudBigtableScanConfiguration btConfig =
        BigtableUtils.getCloudBigtableScanConfigurationBuilder(options.getProject(), "some-bigtable-name").withScan(scan).build();
    PCollection<Result> result = pipeline.apply(Read.from(CloudBigtableIO.read(btConfig)));
    PCollection<Mutation> mutation =
        result.apply(ParDo.of(new CreateMutationFromResult(cf))).setCoder(new HBaseMutationCoder());
    mutation.apply(TextIO.Write.to("gs://path-to-files").withCoder(new HBaseMutationCoder()));
    pipeline.run();
}
The job that reads the output of the above code has the following code:
(This is the one throwing exception when reading from GCS)
public static void execute(Options options) {
    Pipeline pipeline = Pipeline.create(options);
    PCollection<Mutation> mutations = pipeline.apply(TextIO.Read
        .from("gs://path-to-files").withCoder(new HBaseMutationCoder()));
    CloudBigtableScanConfiguration config =
        BigtableUtils.getCloudBigtableScanConfigurationBuilder(options.getProject(), btTarget).build();
    if (config != null) {
        CloudBigtableIO.initializeForWrite(pipeline);
        mutations.apply(CloudBigtableIO.writeToTable(config));
    }
    pipeline.run();
}
The error I am getting (https://jpst.it/Qr6M) is a bit confusing, as the mutations are all Put objects, but the error is about a 'Delete' object.
It's probably best to discuss this issue on the Cloud Bigtable client GitHub issues page. We are currently working on import/export features like this one, so we'll respond quickly. We'll also explore this approach on our own, even if you don't add the GitHub issue; the issue will just allow us to communicate better.
FWIW, I don't understand how you could get an NPE on the line you highlighted. Are you sure you have the right line?
EDIT (12/12):
The following processElement() method should work to convert a Result to a Put:
@Override
public void processElement(DoFn<Result, Mutation>.ProcessContext c) throws Exception {
    Result result = c.element();
    byte[] row = result.getRow();
    if (row != null && row.length > 0) {
        Put put = new Put(row);
        for (Cell cell : result.rawCells()) {
            put.add(cell);
        }
        c.output(put);
    }
}
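One note on this version, for what it's worth: rawCells() iterates over every cell in the Result regardless of column family, so it avoids the getFamilyMap(...) call from the original snippet, which can return null (e.g. for an empty Result) and would then NPE on entrySet(). That may or may not be the NPE you are seeing, but it is one less place for it to hide.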
I am currently using the Neo4j TimeTree REST API. Is there any way to navigate to the time before and after a given timestamp? My resolution is Second, and I just realized that when the minute changes, there is no NEXT relationship bridging the last Second of the previous Minute to the current Second. This makes the Cypher query quite complicated, and I just don't want to reinvent the wheel if this is already available.
Thanks in advance and your response would be really appreciated!
EDIT
I've managed to reproduce the missing NEXT relationship issue again, as you can see in the picture below. This starts to happen from the third time I add a new Second time instant.
I actually create a NodeEntity to operate with the Second nodes. The class is like below.
@NodeEntity(label = "Second")
public class TimeTreeSecond {

    @GraphId
    private Long id;

    private Integer value;

    @Relationship(type = "CREATED_ON", direction = Relationship.INCOMING)
    private FilterVersionChange relatedFilterVersionChange;

    @Relationship(type = "NEXT", direction = Relationship.OUTGOING)
    private TimeTreeSecond nextTimeTreeSecond;

    @Relationship(type = "NEXT", direction = Relationship.INCOMING)
    private TimeTreeSecond prevTimeTreeSecond;

    public TimeTreeSecond() {
    }

    public Long getId() {
        return id;
    }

    public void next(TimeTreeSecond nextTimeTreeSecond) {
        this.nextTimeTreeSecond = nextTimeTreeSecond;
    }

    public FilterVersionChange getRelatedFilterVersionChange() {
        return relatedFilterVersionChange;
    }
}
The problem here is the incoming NEXT relationship; when I omit that, everything works fine. Sometimes I even get this kind of exception in my console when I create time instants repeatedly with a short delay:
Exception in thread "main" org.neo4j.ogm.session.result.ResultProcessingException: Could not initialise response
at org.neo4j.ogm.session.response.GraphModelResponse.<init>(GraphModelResponse.java:38)
at org.neo4j.ogm.session.request.SessionRequestHandler.execute(SessionRequestHandler.java:55)
at org.neo4j.ogm.session.Neo4jSession.load(Neo4jSession.java:108)
at org.neo4j.ogm.session.Neo4jSession.load(Neo4jSession.java:100)
at org.springframework.data.neo4j.repository.GraphRepositoryImpl.findOne(GraphRepositoryImpl.java:50)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.springframework.data.repository.core.support.RepositoryFactorySupport$QueryExecutorMethodInterceptor.executeMethodOn(RepositoryFactorySupport.java:452)
at org.springframework.data.repository.core.support.RepositoryFactorySupport$QueryExecutorMethodInterceptor.doInvoke(RepositoryFactorySupport.java:437)
at org.springframework.data.repository.core.support.RepositoryFactorySupport$QueryExecutorMethodInterceptor.invoke(RepositoryFactorySupport.java:409)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:179)
at org.springframework.transaction.interceptor.TransactionInterceptor$1.proceedWithInvocation(TransactionInterceptor.java:99)
at org.springframework.transaction.interceptor.TransactionAspectSupport.invokeWithinTransaction(TransactionAspectSupport.java:281)
at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:96)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:179)
at org.springframework.dao.support.PersistenceExceptionTranslationInterceptor.invoke(PersistenceExceptionTranslationInterceptor.java:136)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:179)
at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:207)
at com.sun.proxy.$Proxy32.findOne(Unknown Source)
at de.rwthaachen.service.core.FilterDefinitionServiceImpl.createNewFilterVersionChange(FilterDefinitionServiceImpl.java:100)
at sampleapp.FilterLauncher.main(FilterLauncher.java:50)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
Caused by: org.neo4j.ogm.session.result.ResultProcessingException: "errors":[{"code":"Neo.ClientError.Statement.InvalidType","message":"Expected a numeric value for empty iterator, but got null"}]}
at org.neo4j.ogm.session.response.JsonResponse.parseErrors(JsonResponse.java:128)
at org.neo4j.ogm.session.response.JsonResponse.parseColumns(JsonResponse.java:102)
at org.neo4j.ogm.session.response.JsonResponse.initialiseScan(JsonResponse.java:46)
at org.neo4j.ogm.session.response.GraphModelResponse.initialiseScan(GraphModelResponse.java:66)
at org.neo4j.ogm.session.response.GraphModelResponse.<init>(GraphModelResponse.java:36)
... 27 more
2015-05-23 01:30:46,204 INFO ork.data.neo4j.config.Neo4jConfiguration: 62 - Intercepted exception
Below is one REST call example which I use to create the time instant nodes:
http://localhost:7474/graphaware/timetree/1202/single/1432337658713?resolution=Second&timezone=Europe/Amsterdam
The method that I use to create the data:
public FilterVersionChange createNewFilterVersionChange(String projectName,
                                                        String filterVersionName,
                                                        String filterVersionChangeDescription,
                                                        Set<FilterState> filterStates) {
    Long filterVersionNodeId = filterVersionRepository.findFilterVersionByName(projectName, filterVersionName);
    FilterVersion newFilterVersion = filterVersionRepository.findOne(filterVersionNodeId, 2);

    // Populate all the existing filters in the current project
    Map<String, Filter> existingFilters = new HashMap<String, Filter>();
    try {
        for (Filter filter : newFilterVersion.getProject().getFilters()) {
            existingFilters.put(filter.getMatchingString(), filter);
        }
    } catch (Exception e) {}

    // Map the filter states to the populated filters, if any. Otherwise, create a new filter for it.
    for (FilterState filterState : filterStates) {
        Filter filter = existingFilters.get(filterState.getMatchingString());
        if (filter == null) {
            filter = new Filter(filterState.getMatchingString(), filterState.getMatchingType(), newFilterVersion.getProject());
        }
        filterState.stateOf(filter);
    }

    Long now = System.currentTimeMillis();
    TimeTreeSecond timeInstantNode = timeTreeSecondRepository.findOne(timeTreeService.getFilterTimeInstantNodeId(projectName, now));
    FilterVersionChange filterVersionChange = new FilterVersionChange(filterVersionChangeDescription, now, filterStates, filterStates, newFilterVersion, timeInstantNode);
    FilterVersionChange addedFilterVersionChange = filterVersionChangeRepository.save(filterVersionChange);
    return addedFilterVersionChange;
}
Leaving aside for a moment the specific use of TimeTree, I'd like to describe how to generally manage a doubly-linked list using SDN 4, specifically for the case where the underlying graph uses a single relationship type between nodes, e.g.
(p1:Post)-[:NEXT]->(p2:Post)
What you can't do
Due to limitations in the mapping framework, it is not possible to reliably declare the same relationship type twice in two different directions in your object model, i.e. this (currently) will not work:
class Post {

    @Relationship(type = "NEXT", direction = Relationship.OUTGOING)
    Post next;

    @Relationship(type = "NEXT", direction = Relationship.INCOMING)
    Post previous;
}
What you can do
Instead we can combine the @Transient annotation with the use of annotated setter methods to obtain the desired result:
class Post {

    Post next;

    @Transient
    Post previous;

    @Relationship(type = "NEXT", direction = Relationship.OUTGOING)
    public void setNext(Post next) {
        this.next = next;
        if (next != null) {
            next.previous = this;
        }
    }
}
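For illustration, a hypothetical usage (session stands in for a Neo4j OGM Session; the post variables are made up):
Post first = new Post();
Post second = new Post();
first.setNext(second); // the setter also wires up second.previous = first in memory
session.save(first);   // persists a single NEXT relationship; previous is @Transient and never stored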
As a final point, if you then want to be able to navigate forwards and backwards through the entire list of Posts from any starting Post without having to continually refetch them from the database, you can set the fetch depth to -1 when you load the post, e.g.:
findOne(post.getId(), -1);
Bear in mind that an infinite depth query will fetch every reachable object in the graph from the matched one, so use it with care!
Hope this is helpful
The Seconds are linked to each other via a NEXT relationship, even across minutes.
Hope this is what you meant
This is my scenario: we are building a routing system using Neo4j and the spatial plugin. We start from an OSM file, read it, and import nodes and relationships into our graph (a custom graph model).
Now, if we don't use the batch inserter of Neo4j, importing a compressed OSM file (around 140 MB compressed, around 2 GB uncompressed) takes around 3 days on a dedicated server with the following characteristics: CentOS 6.5 64-bit, quad core, 8 GB RAM. Please note that most of that time is spent creating the Neo4j nodes and relationships; in fact, if we read the same file without doing anything with Neo4j, it is read in around 7 minutes (I'm sure about this because our process first reads the file to store the correct OSM node ids, and then reads it again to create the Neo4j graph).
Obviously we need to improve the import process, so we are trying to use the BatchInserter. So far, so good (I still need to check how well the BatchInserter performs, but I guess it will be faster); the first thing I did was try the batch inserter in a simple test case (very similar to our code, but without modifying our code directly).
I list my software versions:
Neo4j: 2.0.2
Neo4jSpatial: 0.13-neo4j-2.0.1
Neo4jGraphCollections: 0.7.1-neo4j-2.0.1
Osmosis: 0.43.1
Since I'm using Osmosis to read the OSM file, I wrote the following Sink implementation:
public class BatchInserterSinkTest implements Sink {

    public static final Map<String, String> NEO4J_CFG = new HashMap<String, String>();

    private static File basePath = new File("/home/angelo/Scrivania/neo4j");
    private static File dbPath = new File(basePath, "db");

    private GraphDatabaseService graphDb;
    private BatchInserter batchInserter;
    // private BatchInserterIndexProvider batchIndexService;
    private SpatialDatabaseService spatialDb;
    private SimplePointLayer spl;

    static {
        NEO4J_CFG.put("neostore.nodestore.db.mapped_memory", "100M");
        NEO4J_CFG.put("neostore.relationshipstore.db.mapped_memory", "300M");
        NEO4J_CFG.put("neostore.propertystore.db.mapped_memory", "400M");
        NEO4J_CFG.put("neostore.propertystore.db.strings.mapped_memory", "800M");
        NEO4J_CFG.put("neostore.propertystore.db.arrays.mapped_memory", "10M");
        NEO4J_CFG.put("dump_configuration", "true");
    }

    @Override
    public void initialize(Map<String, Object> arg0) {
        batchInserter = BatchInserters.inserter(dbPath.getAbsolutePath(), NEO4J_CFG);
        graphDb = new SpatialBatchGraphDatabaseService(batchInserter);
        spatialDb = new SpatialDatabaseService(graphDb);
        spl = spatialDb.createSimplePointLayer("testBatch", "latitudine", "longitudine");
        // batchIndexService = new LuceneBatchInserterIndexProvider(batchInserter);
    }

    @Override
    public void complete() {
        // TODO Auto-generated method stub
    }

    @Override
    public void release() {
        // TODO Auto-generated method stub
    }

    @Override
    public void process(EntityContainer ec) {
        Entity entity = ec.getEntity();
        if (entity instanceof Node) {
            Node osmNodo = (Node) entity;
            org.neo4j.graphdb.Node graphNode = graphDb.createNode();
            graphNode.setProperty("osmId", osmNodo.getId());
            graphNode.setProperty("latitudine", osmNodo.getLatitude());
            graphNode.setProperty("longitudine", osmNodo.getLongitude());
            spl.add(graphNode);
        } else if (entity instanceof Way) {
            // do something with the way
        } else if (entity instanceof Relation) {
            // do something with the relation
        }
    }
}
Then I wrote the following test case:
public class BatchInserterTest {

    private static final Log logger = LogFactory.getLog(BatchInserterTest.class.getName());

    @Test
    public void batchInserter() {
        File file = new File("/home/angelo/Scrivania/MilanoPiccolo.osm");
        try {
            boolean pbf = false;
            CompressionMethod compression = CompressionMethod.None;
            if (file.getName().endsWith(".pbf")) {
                pbf = true;
            } else if (file.getName().endsWith(".gz")) {
                compression = CompressionMethod.GZip;
            } else if (file.getName().endsWith(".bz2")) {
                compression = CompressionMethod.BZip2;
            }
            RunnableSource reader;
            if (pbf) {
                reader = new crosby.binary.osmosis.OsmosisReader(new FileInputStream(file));
            } else {
                reader = new XmlReader(file, false, compression);
            }
            reader.setSink(new BatchInserterSinkTest());
            Thread readerThread = new Thread(reader);
            readerThread.start();
            while (readerThread.isAlive()) {
                try {
                    readerThread.join();
                } catch (InterruptedException e) {
                    /* do nothing */
                }
            }
        } catch (Exception e) {
            logger.error("Error creating neo4j with batchInserter", e);
        }
    }
}
By executing this code, I get this exception:
Exception in thread "Thread-1" java.lang.ClassCastException: org.neo4j.unsafe.batchinsert.SpatialBatchGraphDatabaseService cannot be cast to org.neo4j.kernel.GraphDatabaseAPI
at org.neo4j.cypher.ExecutionEngine.<init>(ExecutionEngine.scala:113)
at org.neo4j.cypher.javacompat.ExecutionEngine.<init>(ExecutionEngine.java:53)
at org.neo4j.cypher.javacompat.ExecutionEngine.<init>(ExecutionEngine.java:43)
at org.neo4j.collections.graphdb.ReferenceNodes.getReferenceNode(ReferenceNodes.java:60)
at org.neo4j.gis.spatial.SpatialDatabaseService.getSpatialRoot(SpatialDatabaseService.java:76)
at org.neo4j.gis.spatial.SpatialDatabaseService.getLayer(SpatialDatabaseService.java:108)
at org.neo4j.gis.spatial.SpatialDatabaseService.containsLayer(SpatialDatabaseService.java:253)
at org.neo4j.gis.spatial.SpatialDatabaseService.createLayer(SpatialDatabaseService.java:282)
at org.neo4j.gis.spatial.SpatialDatabaseService.createSimplePointLayer(SpatialDatabaseService.java:266)
at it.eng.pinf.graph.batch.test.BatchInserterSinkTest.initialize(BatchInserterSinkTest.java:46)
at org.openstreetmap.osmosis.xml.v0_6.XmlReader.run(XmlReader.java:95)
at java.lang.Thread.run(Thread.java:744)
This is related to this code:
spl = spatialDb.createSimplePointLayer("testBatch", "latitudine", "longitudine");
So now I'm wondering: how can I use the BatchInserter for my case? I have to add the created nodes to the SimplePointLayer... so how can I create it by using the batch inserter graph db service?
Is there any simple sample?
Any tip is really really appreciated
cheers
Angelo
The OSMImporter class in the code has an example of using the batch inserter to import OSM data. The main thing is that the batch inserter is not really supported by neo4j spatial, so you need to do a few things manually. If you look at the class OSMImporter.OSMBatchWriter, you will see how it does things. It is not using the SimplePointLayer at all, since that does not support the batch inserter. It is creating the graph structure it wants directly. The simple point layer is quite simple, certainly much simpler than the OSM model created by the code I'm referencing, so I think you should be able to write a batch-inserter compatible version yourself without too much trouble.
What I would recommend is that you create the layer and nodes using the batch inserter to create the correct graph structure, then switch to the normal embedded API and use that to iterate through the nodes and add them to the spatial index.
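To make that concrete, here is a rough sketch of the two-phase approach under stated assumptions: the property keys, paths, config map and layer name are borrowed from your test code, and the overall flow (batch-insert first, then reopen embedded and index) is only an outline, not a supported neo4j-spatial batch API:
// Phase 1: create the raw nodes with the batch inserter (no spatial layer yet).
BatchInserter inserter = BatchInserters.inserter(dbPath.getAbsolutePath(), NEO4J_CFG);
Map<String, Object> props = new HashMap<String, Object>();
props.put("osmId", 123L);          // values shown are placeholders
props.put("latitudine", 45.4642);
props.put("longitudine", 9.1900);
long nodeId = inserter.createNode(props);
inserter.shutdown();

// Phase 2: reopen the same store with the embedded API and add the nodes to the spatial index.
GraphDatabaseService db = new GraphDatabaseFactory().newEmbeddedDatabase(dbPath.getAbsolutePath());
try (Transaction tx = db.beginTx()) {
    SpatialDatabaseService spatialDb = new SpatialDatabaseService(db);
    SimplePointLayer layer = spatialDb.createSimplePointLayer("testBatch", "latitudine", "longitudine");
    layer.add(db.getNodeById(nodeId));
    tx.success();
}
db.shutdown();
The split matters because the layer-creation path goes through Cypher's ExecutionEngine (as your ClassCastException shows), which needs a real embedded database rather than a batch-inserter wrapper.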