After reading about Apache Flume and the benefits it provides for handling client events, I decided it was time to look into it in more detail. Another great benefit appears to be that it can handle Apache Avro objects :-) However, I am struggling to understand how the Avro schema is used to validate the Flume events received.
To help understand my problem in more detail I have provided code snippets below;
Avro schema
For the purpose of this post I am using a sample schema defining a nested Object1 record with 2 fields.
{
  "namespace": "com.example.avro",
  "name": "Example",
  "type": "record",
  "fields": [
    {
      "name": "object1",
      "type": {
        "name": "Object1",
        "type": "record",
        "fields": [
          {
            "name": "value1",
            "type": "string"
          },
          {
            "name": "value2",
            "type": "string"
          }
        ]
      }
    }
  ]
}
Embedded Flume agent
Within my Java project I am currently using the Apache Flume embedded agent as detailed below;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.agent.embedded.EmbeddedAgent;
import org.apache.flume.event.EventBuilder;

public class EmbeddedAgentExample {
    public static void main(String[] args) {
        // The event body is just the UTF-8 bytes of "Test"
        final Event event = EventBuilder.withBody("Test", StandardCharsets.UTF_8);

        final Map<String, String> properties = new HashMap<>();
        properties.put("channel.type", "memory");
        properties.put("channel.capacity", "100");
        properties.put("sinks", "sink1");
        properties.put("sink1.type", "avro");
        properties.put("sink1.hostname", "192.168.99.101");
        properties.put("sink1.port", "11111");
        properties.put("sink1.batch-size", "1");
        properties.put("processor.type", "failover");

        final EmbeddedAgent embeddedAgent = new EmbeddedAgent("TestAgent");
        embeddedAgent.configure(properties);
        embeddedAgent.start();
        try {
            embeddedAgent.put(event);
        } catch (EventDeliveryException e) {
            e.printStackTrace();
        }
    }
}
In the above example I create a new Flume event with "Test" as the event body and send it to a separate Apache Flume agent running inside a VM (192.168.99.101).
Remote Flume agent
As described above I have configured this agent to receive events from the embedded Flume agent. The Flume configuration for this agent looks like;
# Name the components on this agent
hello.sources = avroSource
hello.channels = memoryChannel
hello.sinks = loggerSink
# Describe/configure the source
hello.sources.avroSource.type = avro
hello.sources.avroSource.bind = 0.0.0.0
hello.sources.avroSource.port = 11111
hello.sources.avroSource.channels = memoryChannel
# Describe the sink
hello.sinks.loggerSink.type = logger
# Use a channel which buffers events in memory
hello.channels.memoryChannel.type = memory
hello.channels.memoryChannel.capacity = 1000
hello.channels.memoryChannel.transactionCapacity = 1000
# Bind the source and sink to the channel
hello.sources.avroSource.channels = memoryChannel
hello.sinks.loggerSink.channel = memoryChannel
And I am executing the following command to launch the agent;
./bin/flume-ng agent --conf conf --conf-file ../sample-flume.conf --name hello -Dflume.root.logger=TRACE,console -Dorg.apache.flume.log.printconfig=true -Dorg.apache.flume.log.rawdata=true
When I execute the Java project main method I see the "Test" event is passed through to my logger sink with the following output;
2019-02-18 14:15:09,998 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] Event: { headers:{} body: 54 65 73 74 Test }
However, it is unclear to me where exactly I should configure the Avro schema to ensure that only valid events are received and processed by Flume. Can someone please help me understand where I am going wrong, or whether I have misunderstood how Flume is designed to convert Flume events into Avro events?
In addition to the above I have also tried using the Avro RPC client after changing the Avro schema to specify a protocol talking directly to my remote Flume agent, but when I attempt to send events I see the following error;
Exception in thread "main" org.apache.avro.AvroRuntimeException: Not a remote message: test
at org.apache.avro.ipc.Requestor$Response.getResponse(Requestor.java:532)
at org.apache.avro.ipc.Requestor$TransceiverCallback.handleResult(Requestor.java:359)
at org.apache.avro.ipc.Requestor$TransceiverCallback.handleResult(Requestor.java:322)
at org.apache.avro.ipc.NettyTransceiver$NettyClientAvroHandler.messageReceived(NettyTransceiver.java:613)
at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
at org.apache.avro.ipc.NettyTransceiver$NettyClientAvroHandler.handleUpstream(NettyTransceiver.java:595)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:558)
at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:786)
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296)
at org.jboss.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:458)
at org.jboss.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:439)
at org.jboss.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:303)
at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:558)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:553)
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:84)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.processSelectedKeys(AbstractNioWorker.java:471)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:332)
at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:35)
at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:102)
at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
My goal is to ensure that the events produced by my application conform to the generated Avro schema, so that invalid events are not published. I would prefer to achieve this using the embedded Flume agent, but if that is not possible I would consider using the Avro RPC approach and talking directly to my remote Flume agent.
Any help or guidance would be much appreciated. Thanks in advance.
UPDATE
After further reading I wonder if I have misunderstood the purpose of Apache Flume. I originally thought it could be used to automatically create Avro events based on the data / schema, but I am now wondering whether the application should take responsibility for producing Avro events, which are then stored in Flume according to the channel configuration and sent as a batch via the sink (in my case to a Spark Streaming cluster).
If the above is correct, I would like to know whether Flume needs to know about the schema, or only my Spark Streaming cluster, which will eventually process this data. If Flume does need to know about the schema, can you please provide details of how this can be achieved?
Thanks in advance.
Since your goal is to process the data using a Spark Streaming cluster, you can solve this problem in one of two ways.
1) Use the Flume client (tested with flume-ng-sdk 1.9.0) and Spark Streaming (tested with spark-streaming_2.11 2.4.0 and spark-streaming-flume_2.11 2.3.0) without a Flume server in between.
The client class sends a Flume JSON event to port 41416:
import java.nio.charset.StandardCharsets;

import org.apache.flume.Event;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class JSONFlumeClient {
    public static void main(String[] args) {
        RpcClient client = RpcClientFactory.getDefaultInstance("localhost", 41416);
        String jsonData = "{\r\n" + " \"namespace\": \"com.example.avro\",\r\n" + " \"name\": \"Example\",\r\n"
                + " \"type\": \"record\",\r\n" + " \"fields\": [\r\n" + " {\r\n"
                + " \"name\": \"object1\",\r\n" + " \"type\": {\r\n" + " \"name\": \"Object1\",\r\n"
                + " \"type\": \"record\",\r\n" + " \"fields\": [\r\n" + " {\r\n"
                + " \"name\": \"value1\",\r\n" + " \"type\": \"string\"\r\n" + " },\r\n"
                + " {\r\n" + " \"name\": \"value2\",\r\n" + " \"type\": \"string\"\r\n"
                + " }\r\n" + " ]\r\n" + " }\r\n" + " }\r\n" + " ]\r\n" + "}";
        // The JSON text is sent as the raw bytes of the event body
        Event event = EventBuilder.withBody(jsonData, StandardCharsets.UTF_8);
        try {
            client.append(event);
        } catch (Throwable t) {
            System.err.println(t.getMessage());
            t.printStackTrace();
        } finally {
            client.close();
        }
    }
}
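The Spark Streaming server class listens on port 41416:
import java.util.Date;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.flume.FlumeUtils;
import org.apache.spark.streaming.flume.SparkFlumeEvent;

public class SparkStreamingToySample {
    public static void main(String[] args) throws Exception {
        SparkConf sparkConf = new SparkConf().setMaster("local[2]")
                .setAppName("SparkStreamingToySample");
        JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, Durations.seconds(30));
        JavaReceiverInputDStream<SparkFlumeEvent> lines = FlumeUtils
                .createStream(ssc, "localhost", 41416);
        // Decode each event body as UTF-8 and print every 30-second batch
        lines.map(sfe -> new String(sfe.event().getBody().array(), "UTF-8"))
                .foreachRDD((data, time) ->
                        System.out.println("***" + new Date(time.milliseconds()) + "=" + data.collect().toString()));
        ssc.start();
        ssc.awaitTermination();
    }
}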
2) Use the Flume client, a Flume server in between, and Spark Streaming (as the destination of the Flume sink).
For this option the code is the same, but if you are running this locally for testing, Spark Streaming must now specify the fully qualified DNS hostname instead of localhost when starting the server on the same port 41416. The Flume client connects to the Flume server on port 41415. The tricky part is defining the Flume topology: you need to specify both a source and a sink for this to work.
See the Flume configuration below:
agent1.channels.ch1.type = memory
agent1.sources.avroSource1.channels = ch1
agent1.sources.avroSource1.type = avro
agent1.sources.avroSource1.bind = 0.0.0.0
agent1.sources.avroSource1.port = 41415
agent1.sinks.avroSink.channel = ch1
agent1.sinks.avroSink.type = avro
agent1.sinks.avroSink.hostname = <full dns qualified hostname>
agent1.sinks.avroSink.port = 41416
agent1.channels = ch1
agent1.sources = avroSource1
agent1.sinks = avroSink
You should get the same results with both solutions. Returning to your question of whether Flume is really needed for Spark Streaming to consume a JSON stream: it depends. Flume supports interceptors, so it could be used to cleanse or filter invalid data for your Spark project. However, since you are adding an extra component to the topology, it may impact performance and require more resources (CPU/memory) than running without Flume.
I have conducted a test sending 100K persistent MQTT messages (QoS 2) to ActiveMQ Artemis. The topic has two Telegraf listeners, one on VM 85 and the other on VM 86. These listeners write data to the InfluxDB on their respective servers.
The main goal of the test is to ensure all messages delivered to VM 85 are also delivered to VM 86 even if VM 86 is down. Before executing the test both listeners connect to the broker each with a unique client ID and with clean-session = false and subscribe to the topic using QoS 2. This ensures the subscription for each is present when the messages are sent whether or not the listeners are actually active. Neither listener is connected when the test starts. The order of operations is:
1) Start listener on VM 85.
2) Send data.
3) Ensure messages are delivered to listener on VM 85.
4) Start listener on VM 86.
5) Ensure messages are delivered to listener on VM 86.
The good news is that all messages are delivered to the Influx DB on both VMs. However, the relevant queue for VM 86 still shows about 4.3 K messages remaining, as shown below:
If I then restart the listener on VM 86, it shows it's writing more data, as shown below:
However, the total messages in the InfluxDB correctly remains at 100K. If InfluxDB receives a duplicate record, it will overwrite it. However, the client is incrementing by one and setting the date at each increment, so this shouldn't occur, at least from the client.
I'm not clear on why this would be. Why does the listener on VM 86 need to be restarted to completely empty the queue?
There is one parameter I haven't tried in the Telegraf plugin:
## Maximum messages to read from the broker that have not been written by an
## output. For best throughput set based on the number of metrics within
## each message and the size of the output's metric_batch_size.
##
## For example, if each message from the queue contains 10 metrics and the
## output metric_batch_size is 1000, setting this to 100 will ensure that a
## full batch is collected and the write is triggered immediately without
## waiting until the next flush_interval.
# max_undelivered_messages = 1000
It seems the batch size defaults to 1000, based on the output messages. But the maximum messages to read before output seems to be something greater, since 4.3K are output when restarted. Except that they have already been output. That's the confusing part.
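The arithmetic in the plugin comment above can be sketched as follows. This only illustrates the sizing rule quoted from the sample config; it is not Telegraf code, and the class and method names are made up for the example.

```java
/** Illustrates the sizing rule from the mqtt_consumer sample config comment. */
final class UndeliveredSizing {

    /**
     * Smallest max_undelivered_messages value that lets one full output batch
     * accumulate: ceil(metricBatchSize / metricsPerMessage).
     */
    static int minUndeliveredForFullBatch(int metricBatchSize, int metricsPerMessage) {
        if (metricBatchSize <= 0 || metricsPerMessage <= 0) {
            throw new IllegalArgumentException("both arguments must be positive");
        }
        // Integer ceiling division
        return (metricBatchSize + metricsPerMessage - 1) / metricsPerMessage;
    }

    public static void main(String[] args) {
        // Example from the comment: 10 metrics per message, batch size 1000
        System.out.println(minUndeliveredForFullBatch(1000, 10));
        // The test client here publishes one line-protocol point per message
        System.out.println(minUndeliveredForFullBatch(1000, 1));
    }
}
```

With one point per MQTT message, as in this test, the rule suggests that a value below the output batch size would keep a full batch from ever accumulating, so writes would wait for the flush interval instead.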
Client Code:
package abc;

import java.time.Instant;

import org.eclipse.paho.client.mqttv3.MqttClient;
import org.eclipse.paho.client.mqttv3.MqttConnectOptions;
import org.eclipse.paho.client.mqttv3.MqttException;
import org.eclipse.paho.client.mqttv3.MqttMessage;
import org.eclipse.paho.client.mqttv3.MqttSecurityException;
import org.eclipse.paho.client.mqttv3.persist.MemoryPersistence;

import com.influxdb.client.domain.WritePrecision;
import com.influxdb.client.write.Point;

public class MqttPublishSample {

    public static void main(String[] args) throws MqttSecurityException, MqttException, InterruptedException {
        String broker = "tcp://localhost:1883";
        String clientId = "JavaSample";
        MemoryPersistence persistence = new MemoryPersistence();
        int qos = 2;
        int start = Integer.parseInt(args[0]);
        int end = Integer.parseInt(args[1]);
        // The topic argument is optional; fall back to a default when it is omitted
        String topic = args.length > 2 ? args[2] : "testtopic/999";
        System.out.println("start: " + start + ", end: " + end + ", topic: " + topic + " qos: " + qos);

        MqttClient sampleClient = new MqttClient(broker, clientId, persistence);
        MqttConnectOptions connOpts = new MqttConnectOptions();
        connOpts.setCleanSession(false);
        connOpts.setUserName("admin");
        connOpts.setPassword("xxxxxxx".toCharArray());
        System.out.println("Connecting to broker: " + broker);
        sampleClient.connect(connOpts);
        System.out.println("Connected");

        for (int i = start; i <= end; i++) {
            // print progress every 100 messages
            if (i % 100 == 0) {
                System.out.println("i: " + i);
            }
            try {
                // One InfluxDB point per MQTT message, timestamped at send time
                Point point = Point.measurement("temperature").addTag("machine", "unit43").addField("external", i)
                        .time(Instant.now(), WritePrecision.NS);
                String content = point.toLineProtocol();
                MqttMessage message = new MqttMessage(content.getBytes());
                message.setQos(qos);
                sampleClient.publish(topic, message);
                Thread.sleep(10);
            } catch (MqttException me) {
                System.out.println("reason " + me.getReasonCode());
                System.out.println("msg " + me.getMessage());
                System.out.println("loc " + me.getLocalizedMessage());
                System.out.println("cause " + me.getCause());
                System.out.println("excep " + me);
                me.printStackTrace();
            }
        }
        sampleClient.disconnect();
        System.out.println("Disconnected");
    }
}
Telegraf plugin on 85:
###############################################################################
# INPUT PLUGINS #
###############################################################################
[[inputs.mqtt_consumer]]
servers = ["tcp://127.0.0.1:1883"]
## Topics that will be subscribed to.
topics = [
"testtopic/#",
]
## The message topic will be stored in a tag specified by this value. If set
## to the empty string no topic tag will be created.
# topic_tag = "topic"
## When using a QoS of 1 or 2, you should enable persistent_session to allow
## resuming unacknowledged messages.
qos = 2
persistent_session = true
## If unset, a random client ID will be generated.
client_id = "InfluxData_on_86_listen_local"
## Username and password to connect MQTT server.
username = "admin"
password = "xxxxxx"
data_format = "influx"
[[inputs.mqtt_consumer]]
servers = ["tcp://10.102.11.86:1883"]
## Topics that will be subscribed to.
topics = [
"testtopic/#",
]
## The message topic will be stored in a tag specified by this value. If set
## to the empty string no topic tag will be created.
# topic_tag = "topic"
## When using a QoS of 1 or 2, you should enable persistent_session to allow
## resuming unacknowledged messages.
qos = 2
persistent_session = true
## If unset, a random client ID will be generated.
client_id = "InfluxData_on_86_listen_85"
## Username and password to connect MQTT server.
username = "admin"
password = "xxxx"
data_format = "influx"
###############################################################################
# OUTPUT PLUGINS #
###############################################################################
[[outputs.influxdb_v2]]
## The URLs of the InfluxDB cluster nodes.
##
## Multiple URLs can be specified for a single cluster, only ONE of the
## urls will be written to each interval.
urls = ["http://127.0.0.1:8086"]
## Token for authentication.
token = "xxxx"
## Organization is the name of the organization you wish to write to.
organization = "xxxx"
# ## Destination bucket to write into.
bucket = "events"
I wasn't able to replicate this issue at lower volumes, although I hit it twice at 100K messages.
When I added the following parameter to the Telegraf listener:
max_undelivered_messages = 100
It seemed to slow things down, as batches were limited to 100 according to the Telegraf output.
However, when I removed it, batches still seemed to be limited to 100.
Finally, I changed the same parameter to 1000:
max_undelivered_messages = 1000
After this, message batch sizes improved to well beyond 100, as they were initially.
Furthermore, at least on the third try of 100K messages, there are no longer any messages remaining in the queue after the sequence described in the question is completed.
I'm not really sure if this change did anything, but in any case the correct number of messages was always received.
So, I'm marking this as answered.
I started the Kafka connector using the following command:
./bin/connect-standalone etc/schema-registry/connect-avro-standalone.properties etc/kafka-connect-postgres/connect-postgres.properties
The serialization properties in connect-avro-standalone.properties are:
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
internal.key.converter=org.apache.kafka.connect.json.JsonConverter
internal.value.converter=org.apache.kafka.connect.json.JsonConverter
internal.key.converter.schemas.enable=false
internal.value.converter.schemas.enable=false
I've created a Java backend that listens to this Kafka topic, and it is able to get the data from Postgres on each add/update/delete.
But the data arrives in some unknown encoding, which is why I can't read it correctly.
Here is the relevant code snippet:
properties.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG,
        Serdes.String().getClass().getName());
properties.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG,
        Serdes.ByteArray().getClass().getName());
StreamsBuilder streamsBuilder = new StreamsBuilder();
final Serde<String> stringSerde = Serdes.String();
final Serde<byte[]> byteSerde = Serdes.ByteArray();
streamsBuilder.stream(Pattern.compile(getTopic()), Consumed.with(stringSerde, byteSerde))
        .mapValues(data -> {
            System.out.println("->" + new String(data));
            return data;
        });
I'm confused about where and what I need to change: the Avro connector properties or the Java code?
Your Kafka Connect config here means that the messages on the Kafka topic will be Avro serialised:
value.converter=io.confluent.connect.avro.AvroConverter
This means that you need to deserialise using Avro in your Streams app. See here for more details: https://docs.confluent.io/current/streams/developer-guide/datatypes.html#avro
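To see why `new String(data)` prints unreadable characters: values written through the AvroConverter are binary, not text. As far as I know, Confluent's wire format prefixes each message with a magic byte (0) and a 4-byte big-endian schema ID, followed by the Avro binary payload. The sketch below only inspects that header using the JDK. The class and method names are made up for illustration, and real decoding should go through the Avro serde described in the link above.

```java
import java.nio.ByteBuffer;

/** Inspects the 5-byte header assumed to precede Confluent-framed Avro values. */
final class ConfluentHeader {
    static final byte MAGIC_BYTE = 0x0;
    static final int HEADER_LENGTH = 5; // 1 magic byte + 4-byte schema ID

    /** Returns the schema ID, or throws if the bytes do not carry the expected framing. */
    static int schemaId(byte[] value) {
        if (value == null || value.length < HEADER_LENGTH || value[0] != MAGIC_BYTE) {
            throw new IllegalArgumentException("not in the expected Avro wire format");
        }
        // ByteBuffer reads big-endian by default
        return ByteBuffer.wrap(value, 1, 4).getInt();
    }

    /** Number of bytes that remain after the header, i.e. the Avro-encoded payload. */
    static int payloadLength(byte[] value) {
        schemaId(value); // validates the header first
        return value.length - HEADER_LENGTH;
    }

    public static void main(String[] args) {
        // Header (schema ID 42) followed by the Avro encoding of the string "Test"
        byte[] sample = {0, 0, 0, 0, 42, 8, 'T', 'e', 's', 't'};
        System.out.println("schema id = " + schemaId(sample));
        System.out.println("avro payload bytes = " + payloadLength(sample));
    }
}
```

Logging the schema ID this way inside `mapValues` can be a quick check that the topic really contains Avro-framed data before switching the value serde.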
I am trying to use Jena to write to a local free standalone GraphDB (version 8.5.0) repository.
What I have tried
(1) Direct use from Jena
I used this Jena 3.7.0 code snippet:
String strInsert =
        "INSERT DATA {"
        + "<http://dbpedia.org/resource/Grace_Hopper> "
        + "<http://dbpedia.org/ontology/birthDate>"
        + " \"1906-12-09\"^^<http://www.w3.org/2001/XMLSchema#date> .}";
UpdateRequest updateRequest = UpdateFactory.create(strInsert);
UpdateProcessor updateProcessor = UpdateExecutionFactory.createRemote(updateRequest,
        "http://localhost:7200/repositories/PersonData");
updateProcessor.execute();
which results in the following exception
org.apache.jena.atlas.web.HttpException: 415 -
at org.apache.jena.riot.web.HttpOp.exec(HttpOp.java:1091)
at org.apache.jena.riot.web.HttpOp.execHttpPost(HttpOp.java:718)
at org.apache.jena.riot.web.HttpOp.execHttpPost(HttpOp.java:501)
at org.apache.jena.riot.web.HttpOp.execHttpPost(HttpOp.java:459)
at org.apache.jena.sparql.modify.UpdateProcessRemote.execute(UpdateProcessRemote.java:81)
at org.graphdb.jena.tutorial.SimpleInsertQueryExample.main(SimpleInsertQueryExample.java:91)
On the GraphDB side I get the following error:
[INFO ] 2018-06-29 11:33:05,605 [repositories/PersonData | o.e.r.h.s.ProtocolExceptionResolver] Client sent bad request ( 415)
org.eclipse.rdf4j.http.server.ClientHTTPException: Unsupported MIME type: application/sparql-update
(2) GraphDB via Jena Fuseki
As an alternative I explored the GraphDB documentation, which states that it is possible to access GraphDB using the Jena Joseki, now Fuseki, server. But for that, Fuseki needs to be configured to read GraphDB as a Jena dataset, which is then accessed via an Ontotext Jena adapter, com.ontotext.jena.SesameDataset. But I can find no GraphDB libraries that include this class.
(3) Accessing GraphDB using RDF4J
Accessing GraphDB from RDF4J works without issues:
Repository repository = new HTTPRepository(GRAPHDB_SERVER, REPOSITORY_ID);
repository.initialize();
RepositoryConnection repositoryConnection = repository.getConnection();
repositoryConnection.begin();
Update updateOperation = repositoryConnection.prepareUpdate(QueryLanguage.SPARQL, strInsert);
updateOperation.execute();
try {
    repositoryConnection.commit();
} catch (Exception e) {
    if (repositoryConnection.isActive())
        repositoryConnection.rollback();
}
My Question
Is there a way to access GraphDB efficiently from Jena? I have seen this related SO question, but I was hoping for a better approach.
GraphDB implements standard SPARQL 1.1 endpoints according to the RDF4J protocol:
http://localhost:7200/repositories/PersonData - SPARQL query endpoint, which does not support "application/sparql-update"
http://localhost:7200/repositories/PersonData/statements - SPARQL update endpoint
Try changing your code to point to the update endpoint:
UpdateProcessor updateProcessor = UpdateExecutionFactory.createRemote(updateRequest,
"http://localhost:7200/repositories/PersonData/statements");
The Jena adapter to GraphDB is no longer supported.
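For reference, the same update can be written as a plain HTTP request, which makes the endpoint difference visible. This is a sketch using the JDK 11+ `java.net.http` API rather than Jena. The class name is hypothetical, the repository URL is the one from the question, and it assumes the `/statements` endpoint accepts a direct POST with `application/sparql-update`, as the error message and the fix above suggest.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Builds (but does not send) a SPARQL update request for an RDF4J-style repository. */
final class SparqlUpdateRequest {

    /** The update goes to <repositoryUrl>/statements, not to the query endpoint itself. */
    static HttpRequest build(String repositoryUrl, String update) {
        return HttpRequest.newBuilder()
                .uri(URI.create(repositoryUrl + "/statements"))
                .header("Content-Type", "application/sparql-update")
                .POST(HttpRequest.BodyPublishers.ofString(update))
                .build();
    }

    public static void main(String[] args) {
        String update = "INSERT DATA { <http://dbpedia.org/resource/Grace_Hopper> "
                + "<http://dbpedia.org/ontology/birthDate> "
                + "\"1906-12-09\"^^<http://www.w3.org/2001/XMLSchema#date> . }";
        HttpRequest request = build("http://localhost:7200/repositories/PersonData", update);
        // Only prints the request line; sending it requires a running GraphDB instance
        System.out.println(request.method() + " " + request.uri());
    }
}
```

To actually send it, something like `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())` should work against a running server.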
FWIW not an answer to "how to connect with Jena", but the code you use to access GraphDB via the RDF4J API is more complicated than it needs to be. You can simply do this:
Repository repository = new HTTPRepository(GRAPHDB_SERVER, REPOSITORY_ID);
repository.initialize();
try (RepositoryConnection conn = repository.getConnection()) {
    conn.prepareUpdate(strInsert).execute();
}
It will auto-commit and also automatically roll back on connection close if necessary.
I just need to run a Dataflow pipeline on a daily basis, but suggested solutions like the App Engine Cron Service, which requires building a whole web app, seem a bit too much to me.
I was thinking about just running the pipeline from a cron job in a Compute Engine Linux VM, but maybe that's far too simple :). What's the problem with doing it that way, why isn't anybody (besides me I guess) suggesting it?
This is how I did it using Cloud Functions, Pub/Sub, and Cloud Scheduler
(this assumes you've already created a Dataflow template and it exists somewhere in your GCS bucket).
Create a new topic in Pub/Sub. This will be used to trigger the Cloud Function.
Create a Cloud Function that launches a Dataflow job from a template. I find it easiest to create this from the Cloud Functions console. Make sure the service account you choose has permission to create a Dataflow job. The function's index.js looks something like:
const { google } = require('googleapis');

exports.triggerTemplate = (event, context) => {
  // in this case the PubSub message payload and attributes are not used
  // but can be used to pass parameters needed by the Dataflow template
  const pubsubMessage = event.data;
  console.log(Buffer.from(pubsubMessage, 'base64').toString());
  console.log(event.attributes);

  google.auth.getApplicationDefault(function (err, authClient, projectId) {
    if (err) {
      console.error('Error occurred: ' + err.toString());
      throw new Error(err);
    }
    const dataflow = google.dataflow({ version: 'v1b3', auth: authClient });
    // launch a job from the template stored in GCS
    dataflow.projects.templates.create({
      projectId: projectId,
      resource: {
        parameters: {},
        jobName: 'SOME-DATAFLOW-JOB-NAME',
        gcsPath: 'gs://PATH-TO-YOUR-TEMPLATE'
      }
    }, function (err, response) {
      if (err) {
        console.error("Problem running dataflow template, error was: ", err);
        return;
      }
      console.log("Dataflow template response: ", response);
    });
  });
};
The package.json looks like
{
  "name": "pubsub-trigger-template",
  "version": "0.0.1",
  "dependencies": {
    "googleapis": "37.1.0",
    "@google-cloud/pubsub": "^0.18.0"
  }
}
Go to Pub/Sub and the topic you created, and manually publish a message. This should trigger the Cloud Function and start a Dataflow job.
Use Cloud Scheduler to publish a Pub/Sub message on a schedule:
https://cloud.google.com/scheduler/docs/tut-pub-sub
There's absolutely nothing wrong with using a cron job to kick off your Dataflow pipelines. We do it all the time for our production systems, whether it be our Java or Python developed pipelines.
That said however, we are trying to wean ourselves off cron jobs, and move more toward using either AWS Lambdas (we run multi cloud) or Cloud Functions. Unfortunately, Cloud Functions don't have scheduling yet. AWS Lambdas do.
There is a FAQ answer to that question:
https://cloud.google.com/dataflow/docs/resources/faq#is_there_a_built-in_scheduling_mechanism_to_execute_pipelines_at_given_time_or_interval
You can automate pipeline execution by using Google App Engine (Flexible Environment only) or Cloud Functions.
You can use Apache Airflow's Dataflow Operator, one of several Google Cloud Platform Operators in a Cloud Composer workflow.
You can use custom (cron) job processes on Compute Engine.
The Cloud Function approach is described as "Alpha", and it's still true that Cloud Functions don't have scheduling (no equivalent to AWS CloudWatch scheduled events), only Pub/Sub messages, Cloud Storage changes, and HTTP invocations.
Cloud Composer looks like a good option. It is effectively a re-badged Apache Airflow, which is itself a great orchestration tool. Definitely not "too simple" like cron :)
You can use Cloud Scheduler to schedule your job as well. See my post:
https://medium.com/@zhongchen/schedule-your-dataflow-batch-jobs-with-cloud-scheduler-8390e0e958eb
Terraform script
data "google_project" "project" {}

resource "google_cloud_scheduler_job" "scheduler" {
  name     = "scheduler-demo"
  schedule = "0 0 * * *"
  # This needs to be us-central1 even if the app engine is in us-central.
  # You will get a resource not found error if just using us-central.
  region = "us-central1"

  http_target {
    http_method = "POST"
    uri         = "https://dataflow.googleapis.com/v1b3/projects/${var.project_id}/locations/${var.region}/templates:launch?gcsPath=gs://zhong-gcp/templates/dataflow-demo-template"
    oauth_token {
      service_account_email = google_service_account.cloud-scheduler-demo.email
    }
    # need to encode the string
    body = base64encode(<<-EOT
      {
        "jobName": "test-cloud-scheduler",
        "parameters": {
          "region": "${var.region}",
          "autoscalingAlgorithm": "THROUGHPUT_BASED"
        },
        "environment": {
          "maxWorkers": "10",
          "tempLocation": "gs://zhong-gcp/temp",
          "zone": "us-west1-a"
        }
      }
    EOT
    )
  }
}
Background:
I am using a locally run Neo4J instance, (at localhost:7474), and accessing it through a Java adaptor which uses Cypher via the REST API (with Jersey), and makes data accessible to my Grails app running on the same server.
Question:
Is it possible to query a Neo4J db using Cypher, via the REST API, and return the URI of a node? Right now, I can check Neo4J server status, create nodes, populate node properties, query, and create relationships.
My problem is that my "add relationship" and traversal code requires node URIs as input. I can query for nodes and obtain the correct JSON describing the results, but I cannot seem to get the URI locations.
Here is a simplified version of my getUserByEmail code:
public URI getUserByEmail( String email )
{
    System.out.println( "GETTING USER BY EMAIL [" + email + "]..." );
    String queryStr = "MATCH (user) WHERE user.nodetype='user' and user.email='" + email + "' RETURN user";
    WebResource webResource = client.resource( ROOT_URI + "/transaction/commit" );
    String payload = "{\"statements\" : [ {\"statement\" : \"" + queryStr + "\"} ]}";
    ClientResponse response = webResource
            .accept( MediaType.APPLICATION_JSON )
            .type( MediaType.APPLICATION_JSON )
            .entity( payload )
            .post( ClientResponse.class );
    String responseStr = response.getEntity( String.class );
    URI responseLocation = response.getLocation();
    System.out.println( "RESPONSE STRING: " + responseStr );
    System.out.println( "GOT USER AT: [" + responseLocation + "]" );
    return responseLocation;
}
The JSON results come back fine and reflect what is in the graph db. The location, however, is always null.
The "add relationship" code that I am using works, as long as I have the URI to the start node. The code I have is based on the addRelationship() code that lives here:
https://github.com/neo4j/neo4j/blob/2.1.6/community/server-examples/src/main/java/org/neo4j/examples/server/CreateSimpleGraph.java
In your JSON results, the self property value for each "user" will be its URI.
In this example, the response has 2 "n" nodes, and the self property value of each is its URI.
Here is an example of how to get the transactional endpoint (which is normally less verbose than the legacy endpoint) to also return the self property.
In your case you can build the node URI yourself by appending the node's internal id to the following URL:
http://localhost:7474/db/data/node/your-123-id
You may want to put the scheme, host, and database port in a configuration file, so that you don't have to change hard-coded values when the database location changes.
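A minimal sketch of that idea, assuming the node id has already been read from the query result (for example by returning `id(user)` from the Cypher statement). The class name, the `neo4j.root.uri` property key, and the default URL are illustrative choices, not part of any Neo4j API.

```java
import java.net.URI;
import java.util.Properties;

/** Builds node URIs from a configurable root URI and a node's internal id. */
final class NodeUris {
    private final String rootUri;

    NodeUris(Properties config) {
        // "neo4j.root.uri" is a hypothetical key; the default matches the question's setup
        String configured = config.getProperty("neo4j.root.uri", "http://localhost:7474/db/data");
        // Drop a trailing slash so the path is joined consistently
        this.rootUri = configured.endsWith("/")
                ? configured.substring(0, configured.length() - 1)
                : configured;
    }

    URI nodeUri(long nodeId) {
        if (nodeId < 0) {
            throw new IllegalArgumentException("node id must be non-negative");
        }
        return URI.create(rootUri + "/node/" + nodeId);
    }

    public static void main(String[] args) {
        NodeUris uris = new NodeUris(new Properties());
        System.out.println(uris.nodeUri(123));
    }
}
```

The resulting URI can then be passed to the existing "add relationship" code in place of the null `response.getLocation()` value.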