Spark Structured Streaming and Neo4j

My goal is to write transformed data from a MongoDB collection into Neo4j using Spark Structured Streaming. According to the Neo4j docs, this should be possible with the "Neo4j Connector for Apache Spark" version 4.1.2.
Batch queries work fine so far. However, with the following example, I run into an error message:
spark-shell --packages org.mongodb.spark:mongo-spark-connector:10.0.2,org.neo4j:neo4j-connector-apache-spark_2.12:4.1.2_for_spark_3
val dfTxn = spark.readStream.format("mongodb")
.option("spark.mongodb.connection.uri", "mongodb://<IP>:<PORT>")
.option("spark.mongodb.database", "test")
.option("spark.mongodb.collection", "txn")
.option("park.mongodb.read.readPreference.name","primaryPreferred")
.option("spark.mongodb.change.stream.publish.full.document.only", "true")
.option("forceDeleteTempCheckpointLocation", "true").load()
val query = dfTxn.writeStream.format("org.neo4j.spark.DataSource")
.option("url", "bolt://<IP>:<PORT>")
.option("save.mode", "Append")
.option("checkpointLocation", "/tmp/checkpoint/myCheckPoint")
.option("labels", "Account")
.option("node.keys", "txn_snd").start()
This gives me the following error message:
java.lang.UnsupportedOperationException: Data source org.neo4j.spark.DataSource does not support streamed writing
The connector is supposed to officially support streaming starting with version 4.x, so does anybody have an idea what I'm doing wrong?
Thanks in advance!

In case the connector doesn't support streaming writes, you can try something like the following:
you can leverage the foreachBatch() functionality from Spark Structured Streaming and write the data into Neo4j in batch mode.
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach-and-foreachbatch
def process_entry(batch_df, batch_id):
    # Write each micro-batch via the connector's batch write path,
    # reusing the url/labels/node.keys options from the question
    batch_df.write.format("org.neo4j.spark.DataSource").option("url", "bolt://<IP>:<PORT>") \
        .option("labels", "Account").option("node.keys", "txn_snd") \
        .mode("Append").save()
query = dfTxn.writeStream.foreachBatch(process_entry).start()
In process_entry you can place your Neo4j writer logic and write each micro-batch into the database in batch mode.

Related

Nested rows using STRUCT are not supported in Dataflow SQL (GCP)

With Dataflow SQL I would like to read a Pub/Sub topic, enrich the message and write the message to a Pub/Sub topic.
Which Dataflow SQL query will create my desired output message?
Pub/Sub input message: {"event_timestamp":1619784049000, "device":{"ID":"some_id"}}
Desired Pub/Sub output message: {"event_timestamp":1619784049000, "device":{"ID":"some_id", "NAME":"some_name"}}
What I get is: {"event_timestamp":1619784049000, "device":{"ID":"some_id"}, "NAME":"some_name" }
but I need the NAME inside the "device" attribute.
SELECT message_table.device as device, devices.name as NAME
FROM pubsub.topic.project_id.`topic` as message_table
JOIN bigquery.table.project_id.dataflow_sql_dataset.devices as devices
ON devices.device_id = message_table.device.id
Unfortunately, Dataflow SQL does not currently support STRUCT/subqueries, but we are working on it. Since some Apache Beam dependencies are blocking its progress (Nested Rows Support, Upgrading Calcite), we cannot provide an ETA at the moment, but you can follow its progress on this issue tracker.
You need to create a struct in the projection (the SELECT part):
SELECT STRUCT(message_table.device.ID as ID, devices.name as NAME) as device
FROM pubsub.topic.project_id.`topic` as message_table
JOIN bigquery.table.project_id.dataflow_sql_dataset.devices as devices
ON devices.device_id = message_table.device.id

No such property: ToInputStream for class: Script4

I have a situation where I want to import my graph data into the database. I am running JanusGraph (latest version) with Cassandra (version 3) and Elasticsearch (version 6.6.0) using Docker. I have been advised to use the gryo format, so I tried this command:
graph.io(IoCore.gryo()).reader().create().readGraph(ToInputStream.from("my_graph.kryo"), graph);
but ended up with an error
No such property: ToInputStream for class: Script4
The documentation I am following is here. Please take a look and point me to the right procedure. Thanks in advance!
ToInputStream is not a function of Gremlin or JanusGraph. I believe it is only a function of IBM Compose, so unless you are running JanusGraph on that specific platform, this command will not work.
Versions of JanusGraph that utilize TinkerPop 3.4.x will support the io() step and this is the preferred manner in which to load gryo (as well as graphson and graphml) files.
Graph graph = ... // set up the JanusGraph instance
GraphTraversalSource g = traversal().withGraph(graph); // might use withRemote() here instead, depending on how you are connecting
g.io("graph.kryo").read().iterate();
Note that if you are connecting remotely (it seems you are sending scripts to the Docker instance, given your error), be sure that the "graph.kryo" file path is accessible to Docker. That's what's nice about ToInputStream from Compose: it allows you to access remote sources.

ROS - How do I publish a message and get the subscribed callback immediately

I have a ROS node that allows you to "publish" a data structure to it, to which it responds by publishing an output. The timestamps of what I publish and what it publishes are matched.
Is there a mechanism for a blocking call where I publish an input and it waits until I receive the output?
I think you need the ROS Services (client/server) pattern instead of publisher/subscriber.
Here is a simple example to do that in Python:
Client code snippet:
import rospy
from test_service.srv import MySrvFile

rospy.wait_for_service('a_topic')
try:
    send_hi = rospy.ServiceProxy('a_topic', MySrvFile)
    print('Client: Hi, do you hear me?')
    resp = send_hi('Hi, do you hear me?')
    print("Server: {}".format(resp.response))
except rospy.ServiceException as e:
    print("Service call failed: %s" % e)
Server code snippet:
import rospy
from test_service.srv import MySrvFile, MySrvFileResponse

def callback_function(req):
    print(req)
    return MySrvFileResponse('Hello client, your message received.')

rospy.init_node('server')
rospy.Service('a_topic', MySrvFile, callback_function)
rospy.spin()
MySrvFile.srv
string request
---
string response
Server out:
request: "Hi, do you hear me?"
Client out:
Client: Hi, do you hear me?
Server: Hello client, your message received.
Learn more in the ROS wiki.
Project repo on GitHub.
[UPDATE]
If you are looking for fast communication, TCP-ROS communication is not what you want, because it is slower than a broker-less communicator like ZeroMQ (which has low latency and high throughput); a minimal REQ/REP sketch follows the list below:
ROS-Service pattern equivalent in ZeroMQ is REQ/REP (client/server)
ROS publisher/subscriber pattern equivalent in ZeroMQ is PUB/SUB
ROS publisher/subscriber with waitForMessage equivalent in ZeroMQ is PUSH/PULL
ZeroMQ is available in both Python and C++
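For illustration, here is a minimal pyzmq REQ/REP sketch; the port 5555 and the message strings are arbitrary placeholders, and both halves are shown in one process for brevity (they would normally live in separate programs):
import zmq

ctx = zmq.Context()

# Server (REP) half: answers one request at a time
rep = ctx.socket(zmq.REP)
rep.bind("tcp://*:5555")

# Client (REQ) half
req = ctx.socket(zmq.REQ)
req.connect("tcp://localhost:5555")

req.send_string("Hi, do you hear me?")
msg = rep.recv_string()   # blocks until the request arrives
rep.send_string("Hello client, your message received.")
print(req.recv_string())  # blocks until the reply arrives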
Also, to transfer huge amounts of data (e.g. point clouds), ROS has a mechanism called nodelet, which is supported only in C++. This communication is based on shared memory on a single machine instead of a TCP-ROS socket.
What exactly is a nodelet?
Since you want to stick with publishers/subscribers (assuming from your comment that services are too slow), I would have a look at waitForMessage (Documentation).
And for an example on how to use it you can have a look at this ros answers question.
All you need to do is to publish your data and immediately call waitForMessage on the output topic and manually pass the received message to your "callback".
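A minimal rospy sketch of that idea (the topic names and the String message type are placeholders; rospy.wait_for_message is the Python counterpart of waitForMessage):
import rospy
from std_msgs.msg import String

rospy.init_node('blocking_round_trip')
pub = rospy.Publisher('input_topic', String, queue_size=1)
rospy.sleep(1.0)  # give the publisher time to connect to subscribers
pub.publish(String(data='my request'))
# Block until the node publishes its matching output (or the timeout expires)
reply = rospy.wait_for_message('output_topic', String, timeout=5.0)
print(reply.data)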
I hope this is what you were looking for.
To get this request/reply behaviour, ROS has a mechanism called ROS services.
You can specify the input and output of your service in a service file, similar to a ROS message definition. You can then call the service of a node with your input, and the call will return the output when the service is finished.
Here is a tutorial on how to use this mechanism in Python. If you prefer C++, there is also one, which you should find easily.

Is it possible to do igblast (analyze immunoglobulin (Ig) sequences) with Python/Biopython?

I'm new to Python/Biopython and tried BLAST using Biopython like this:
from Bio.Blast import NCBIWWW
fasta_string = open("C:\\xxxx\\xxxx\\xxxx\\abc.fasta").read()
result_handle = NCBIWWW.qblast("blastp", "nr", fasta_string)
print(result_handle.read())
I was wondering if it is possible to run igblast using biopython. I have searched for this but it seems no one is really doing this.
I haven't tried them, but I have found:
pyigblast.
PyIgBlast - open source parser to call IgBlast and parse results for high-throughput sequencing. Uses Python multiprocessing to get around bottlenecks of IgBlast multi-threading. Parses blast output to delimited files (CSV, JSON) for uploading to databases. Can connect directly with MySQL and Mongo instances to insert directly.
The code uses Python 2.7 and BioPython.
PyIR
Immunoglobulin and T-Cell receptor rearrangement software. A Python wrapper for IgBLAST that scales to allow for the parallel processing of millions of reads on shared memory computers. All output is stored in a convenient JSON format.
The code uses Python 3.6 and BioPython.
Both tools use a locally installed igblastn executable. You can install igblastn locally with conda install igblast.
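If you only need to drive the executable from Python yourself, a minimal subprocess sketch could look like the following; the query file, database paths, and organism are placeholder assumptions, so check igblastn -help for the options your version supports:
import subprocess

# All paths and flag values here are hypothetical placeholders
cmd = [
    "igblastn",
    "-query", "abc.fasta",
    "-germline_db_V", "database/human_V",
    "-germline_db_D", "database/human_D",
    "-germline_db_J", "database/human_J",
    "-organism", "human",
]
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
print(result.stdout)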

Erlang connecting to TinkerPop via REST

In the TinkerPop or Titan documentation, all operations are based on a sample graph. How do I create a new empty graph to work on?
I am programming in Erlang, connecting to TinkerGraph, and plan to use Titan later in production. There is no Erlang driver for either, so I am connecting via REST. It is easy to read from the graph, but what if I want to read from a user's input and then write it into the graph, for example, to create a person named teddy:
screenshot 1
I got those errors. What is the correct way?
Thank you.
Update: for the following situation:
23> Newperson=terry.
terry
24> Newperson.
terry
If I want to add this terry, the two attempts below will not work. What's the correct way to do it?
screenshot 2
TitanGraph titanGraph = TitanFactory.open(config); will open a Titan graph without the sample data.
If you have already committed the sample data to your keyspace, then you can just change the keyspace defined in your config file.
For example, if you are using a Cassandra backend, you would change storage.cassandra.keyspace=xxxxxx.
You can also clear any keyspace using TitanCleanup.clear(graph);
As for the error you are seeing: it looks like you are trying to label your vertex incorrectly. I posted the following and it worked:
{
  "gremlin": "g.addV(label, x).property(y,z)",
  "bindings": {
    "x": "person",
    "y": "name",
    "z": "Teddy"
  }
}
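For reference, a minimal sketch of posting that same payload from a script (shown here with Python's requests; http://localhost:8182 is the Gremlin Server default endpoint and may differ in your setup):
import requests

payload = {
    "gremlin": "g.addV(label, x).property(y,z)",
    "bindings": {"x": "person", "y": "name", "z": "Teddy"},
}
# Requires the HTTP channelizer mentioned in the note below
r = requests.post("http://localhost:8182", json=payload)
print(r.json())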
A final note: when you start using Titan 1.0.0, make sure you check out this section of the TinkerPop docs. In particular, make sure to change the channelizer in the gremlin-server.yaml config to:
channelizer: com.tinkerpop.gremlin.server.channel.HttpChannelizer
Answer to my own question: construct the request body with lists:concat() or ++, then POST it.
