Re-using Bigtable connection with AbstractCloudBigtableTableDoFn - google-cloud-dataflow

I have a DoFn that extends AbstractCloudBigtableTableDoFn<> in order to send frequent Buffered Mutation requests to Bigtable.
When I run the job in the Cloud, I see repeated log entries at this step of the Dataflow pipeline that look like this:
Opening connection for projectId XXX, instanceId XXX, on data host batch-bigtable.googleapis.com, table admin host bigtableadmin.googleapis.com...
and
Bigtable options: BigtableOptions{XXXXX (lots of option entries here)}
The code within the DoFn looks something like this:
@ProcessElement
public void processElement(ProcessContext c)
{
    try
    {
        // getConnection() is inherited from AbstractCloudBigtableTableDoFn
        BufferedMutator mPutUnit = getConnection().getBufferedMutator(TableName.valueOf(TABLE_NAME));
        for (CONDITION)
        {
            // create lots of different rowsIDs
            Put p = new Put(newRowID).addColumn(COL_FAMILY, COL_NAME, COL_VALUE);
            mPutUnit.mutate(p);
        }
        // close() flushes any mutations that are still buffered
        mPutUnit.close();
    } catch (IOException e) { e.printStackTrace(); }
    c.output(0);
}
This DoFn gets called very frequently.
Should I worry that Dataflow re-establishes the connection to Bigtable on every call to this DoFn? I was under the impression that inheriting from this class ensures that a single connection to Bigtable is re-used across all calls.

"Opening connection for projectId ..." should appear once per worker per AbstractCloudBigtableTableDoFn instance. Can you double check that connections are being opened per call as opposed to per worker?
Limit the number of workers to a handful.
In Stackdriver, expand the "Opening connection for projectId" messages and check whether jsonPayload.worker is duplicated across different log messages.
Also, can you detail which version of the client and which version of Beam you are using?
Thanks!

To answer your questions...
Yes, you should be worried if Dataflow tries to reestablish a connection to Bigtable with each call to the DoFn. The expected behavior of AbstractCloudBigtableTableDoFn is that a Connection instance is maintained per worker.
No, inheriting from AbstractCloudBigtableTableDoFn does not ensure that a single Connection instance is reused for every call to the DoFn. This is not possible because the DoFn is serialized and distributed across multiple physical machines, based on the number of workers allocated for the Dataflow job.
First, ensure that there are no connection/authentication issues to Bigtable. Occasionally, Dataflow will need to reestablish a connection to Bigtable. However, doing so for each call to the DoFn is not expected.
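To make the expected behavior concrete, here is a minimal, library-free sketch of the reuse pattern described above: an object that opens its connection lazily on first use and then hands back the same one on every later call. FakeConnection and ReusingFn are hypothetical stand-ins invented for this illustration; they are not Bigtable or Beam classes, and the real AbstractCloudBigtableTableDoFn may manage its connection differently.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ConnectionReuseSketch {
    /** Hypothetical stand-in for an expensive connection; counts how often one is opened. */
    static final class FakeConnection {
        static final AtomicInteger OPENED = new AtomicInteger();

        FakeConnection() {
            OPENED.incrementAndGet();
        }
    }

    /** Hypothetical stand-in for a DoFn instance that keeps one connection for its lifetime. */
    static final class ReusingFn {
        private FakeConnection connection;

        // Opens the connection on first use only; later calls return the cached one.
        synchronized FakeConnection getConnection() {
            if (connection == null) {
                connection = new FakeConnection();
            }
            return connection;
        }

        void processElement(String element) {
            // Reuses the cached connection, so no new "open" happens per element.
            getConnection();
        }
    }

    public static void main(String[] args) {
        ReusingFn fn = new ReusingFn();
        for (int i = 0; i < 1000; i++) {
            fn.processElement("row-" + i);
        }
        System.out.println("connections opened: " + FakeConnection.OPENED.get());
    }
}
```

If the log pattern in the question corresponded to this model, one would expect roughly one "Opening connection" entry per function instance per worker, not one per processed element.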

Related

Stream PubSub to Spanner - Wait.on Step

The requirement is to delete the data in Spanner tables before inserting the data from Pub/Sub messages. Because MutationGroup does not guarantee the order of execution, I separated the delete mutations into their own set, so there are two sets: one for the delete mutations and one for the AddReplace mutations.
PCollection<Data> dataJson =
    pipeLine
        .apply(PubsubIO.readStrings().fromSubscription(options.getInputSubscription()))
        .apply("ParsePubSubMessage", ParDo.of(new PubSubToDataFn()))
        .apply(Window.into(FixedWindows.of(Duration.standardSeconds(10))));
SpannerWriteResult deleteResult = dataJson
    .apply("DeleteDataMutation", MapElements.via(......))
    .apply("DeleteData", SpannerIO.write().withSpannerConfig(spannerConfig).grouped());
dataJson
    .apply("WaitOnDeleteMutation", Wait.on(deleteResult.getOutput()))
    .apply("AddReplaceMutation", MapElements.via(...))
    .apply("UpsertInfoToSpanner", SpannerIO.write().withSpannerConfig(spannerConfig).grouped());
This is a streaming Dataflow job. I tried several windowing strategies, but it never executes the "UpsertInfoToSpanner" step.
How can I fix this issue? Can someone suggest a path forward?
Update:
The requirement is to apply two mutation groups sequentially to the same input data, i.e. read the JSON from the Pub/Sub message to delete existing data from multiple tables with one mutation group, and then insert the data read from the same JSON Pub/Sub message.
Re-pasting the comment earlier for better visibility:
The Mutation operations within a single MutationGroup are guaranteed to be executed in order within a single transaction, so I don't see what the issue is here... The reason why Wait.on() never releases is because the output stream that is being waited on is on the global window, so will never be closed in a streaming pipeline.

Override dask scheduler to concurrently load data on multiple workers

I want to run graphs/futures on my distributed cluster which all have a 'load data' root task and then a bunch of training tasks that run on that data. A simplified version would look like this:
from dask.distributed import Client
client = Client(scheduler_ip)
load_data_future = client.submit(load_data_func, 'path/to/data/')
train_task_futures = [client.submit(train_func, load_data_future, params)
                      for params in train_param_set]
Running this as above the scheduler gets one worker to read the file, then it spills that data to disk to share it with the other workers. However, loading the data is usually reading from a large HDF5 file, which can be done concurrently, so I was wondering if there was a way to force all workers to read this file concurrently (they all compute the root task) instead of having them wait for one worker to finish then slowly transferring the data from that worker.
I know there is the client.run() method which I can use to get all workers to read the file concurrently, but how would you then get the data you've read to feed into the downstream tasks?
I cannot use the dask data primitives to concurrently read HDF5 files because I need things like multi-indexes and grouping on multiple columns.
Revisited this question and found a relatively simple solution, though it uses internal API methods and involves a blocking call to client.run(). Using the same variables as in the question:
from distributed import get_worker
client_id = client.id
def load_dataset():
    worker = get_worker()
    data = {'load_dataset-0': load_data_func('path/to/data')}
    info = worker.update_data(data=data, report=False)
    worker.scheduler.update_data(who_has={key: [worker.address] for key in data},
                                 nbytes=info['nbytes'], client=client_id)
client.run(load_dataset)
Now if you run client.has_what() you should see that each worker holds the key load_dataset-0. To use this in downstream computations you can simply create a future for the key:
from distributed import Future
load_data_future = Future('load_dataset-0', client=client)
and this can be used with client.compute() or dask.delayed as usual. Indeed the final line from the example in the question would work fine:
train_task_futures = [client.submit(train_func, load_data_future, params)
                      for params in train_param_set]
Bear in mind that it uses internal API methods Worker.update_data and Scheduler.update_data and works fine as of distributed.__version__ == 1.21.6 but could be subject to change in future releases.
As of today (distributed.__version__ == 1.20.2) what you ask for is not possible. The closest thing would be to compute once and then replicate the data explicitly:
from dask.distributed import wait
future = client.submit(load, path)
wait(future)
client.replicate(future)
You may want to raise this as a feature request at https://github.com/dask/distributed/issues/new

UVM ports: put,get,export, analysis

I am trying to master UVM and am completely lost with UVM ports. Please help me understand the ports better.
As I understand it, there are 3 main types of ports:
Put -> Get: the producer puts data and the consumer gets the data. This is a blocking statement.
Put -> Export -> Imp
Analysis -> Subscriber: the producer transmits the data and the subscribers get it. This is a non-blocking statement.
There are also TLM FIFOs, which allow transactions to be buffered for later use. There are 2 types: uvm_tlm_fifo and uvm_tlm_analysis_fifo.
And my questions are:
Is my understanding right?
What is the difference between get and export?
What is the difference between uvm_tlm_fifo and uvm_tlm_analysis_fifo?
Thanks
Hayk
The use of TLM interfaces isolates each component from changes in other components throughout the environment.
To understand ports, there are two common terms: producer and consumer. Instead of producer and consumer, think in terms of the initiator and the target of communication between components.
An initiator always has a port, just as a driver has seq_item_port.
A target always has an export, just as a sequencer has seq_item_export.
For Put/Get ports:
Initiator/Producer:
port.put(tr);
Target/Consumer: (Note the input in the task)
task put(input simple_trans t);
  //...
endtask
With a put port, the initiator is the producer, which puts a transaction for the consumer. The initiator/producer blocks until the target/consumer's put task returns.
Initiator/Consumer:
port.get(tr);
Target/Producer: (Note the output in the task)
task get(output simple_trans t);
  //...
endtask
With a get port, the initiator is the consumer. The consumer requests a transaction and the producer provides it. The initiator/consumer blocks until the target/producer's get task returns.
The put/get ports are typically used to model the operational behavior of a system. These ports are used for one-to-one communication.
Analysis ports are generally used to broadcast a transaction. The write method is always non-blocking. There may be zero or more connections to an analysis port. Again, the rules for initiator and target remain the same.
Initiator:
port.write(tr);
Target:(Note the function, not task)
function void write(simple_trans tr);
//...
endfunction
All the ports require the methods to be implemented in the user's classes; uvm_*_imp is used for this. Buffering of data can be done through FIFOs.
For analysis ports, uvm_tlm_analysis_fifo is used, since these FIFOs must be able to further broadcast the transaction. The default size of an analysis FIFO is unbounded.
uvm_tlm_fifo is used with put/get ports, that is, for one-to-one communication. The default size of uvm_tlm_fifo is 1, which can be changed to unbounded.
Again, a FIFO always puts/gets data upon request from a component, hence there is an export type of connection at both ends.
For further information, refer to UVM User Guide.

Using Neo4j with React JS

Can we use the graph database Neo4j with React JS? If not, is there an alternative option for using a graph database with React JS?
Easily, all you need is neo4j-driver: https://www.npmjs.com/package/neo4j-driver
Here is the most simplistic usage:
neo4j.js
//import { v1 as neo4j } from 'neo4j-driver'
const neo4j = require('neo4j-driver').v1
const driver = neo4j.driver('bolt://localhost', neo4j.auth.basic('username', 'password'))
const session = driver.session()
session
  .run(`
    MATCH (n:Node)
    RETURN n AS someName
  `)
  .then((results) => {
    results.records.forEach((record) => console.log(record.get('someName')))
    session.close()
    driver.close()
  })
It is best practice to always close the session after you get the data. Sessions are inexpensive and lightweight.
It is best practice to close the driver only once your program is done (as with MongoDB). You will see severe errors if you close the driver at a bad time, which is important to know if you are a beginner: errors like 'connection to server closed', etc. In async code, for example, if you run a query and close the driver before the results are parsed, you will have a bad time.
You can see in my example that I close the driver afterwards, but only to illustrate proper cleanup. If you run this code in a standalone JS file to test without that call, you will see that Node.js hangs after the query and you need to press CTRL + C to exit. Adding driver.close() fixes that. Normally, the driver is not closed until the program exits or crashes, which is never in a backend API, and not until the user logs out in the frontend.
Knowing this now, you are off to a great start.
Remember, session.close() immediately every time, and be careful with the driver.close().
You could put this code in a React component or action creator easily and render the data.
You will find it no different than hooking up and working with Axios.
You can run statements in a transaction also, which is beneficial for writelocking affected nodes. You should research that thoroughly first, but transaction flow is like this:
const session = driver.session()
const tx = session.beginTransaction()
tx
  .run(query)
  .then(/* same as normal */)
  .catch(/* errors */)
// the difference is you can run multiple queries in the same transaction:
const tx1 = await tx.run().then()
// use results
const tx2 = await tx.run().then()
// then, once you are ready to commit the changes:
if (results.good !== true) {
  tx.rollback()
  session.close()
  throw error
}
await tx.commit()
session.close()
const finalResults = { tx1, tx2 }
return finalResults
// in my experience, you have to await tx.commit
// in async/await syntax conditions, otherwise it may not commit properly
// that operation is not instant
tl;dr;
Yes, you can!
You are mixing two different technologies together. Neo4j is a graph database and React.js is a front-end framework.
You can connect to Neo4j from JavaScript - http://neo4j.com/developer/javascript/
Interesting topic. I am using the driver in a React app and recently experienced some issues. I am closing the session every time a lifecycle hook completes, like in your example. When there were more intensive queries I would see a timeout error. Going back to my setup, I decided to experiment by closing the driver in some of the more expensive queries, and it looks like (still need more testing) the crashes are gone.
If you are deploying a real-world application, I would urge you to think about authentication and authorization when using a DB-React-only setup, as you would have to store the username/password of the Neo4j server in the client. I am looking into options for having the Neo4j server issue a token and receive it for authorization, but the best practice is for sure to have a Node.js server in the middle with something like Passport to handle authentication.
So, all in all, maybe the best scenario is to only use the driver in Node and have the browser always communicating with the Node server using axios...

Altering the timeout setting of an Axis 1.4 generated SOAP Java client

I have a problem with changing the standard options used by an Axis 1.4 generated web service client code.
We consume a certain web service of a partner who is using the old RPC/Encoded style, which basically means we're not able to go for Axis 2 but are limited to Axis 1.4.
The service client is retrieving data from the remote server through our proxy which actually runs quite nicely.
Our application is deployed as a servlet. The retrieved response of the foreign web service is inserted into a (XML) document we provide to our internal systems/CMS.
But if the external service is not responding (which hasn't happened yet but might happen at any time), we want to degrade gracefully and return our produced XML document, without the calculated web service information, within a reasonable time.
The data retrieved is optional (if this specific calculation is missing it isn't a big issue at all).
So I tried to change the timeout settings. I applied all the methods and keys I could find in the Axis documentation and by searching the web to alter the connection and socket timeouts.
None of these seems to influence the connection timeouts.
Can anyone give me advice how to alter the settings for an axis stub/service/port based on version 1.4?
Here's an example for the several configurations I tried:
MyService service = new MyServiceLocator();
MyServicePort port = null;
try {
    port = service.getMyServicePort();
    javax.xml.rpc.Stub stub = (javax.xml.rpc.Stub) port;
    // properties set on the stub itself
    stub._setProperty("axis.connection.timeout", 10);
    stub._setProperty(org.apache.axis.client.Call.CONNECTION_TIMEOUT_PROPERTY, 10);
    stub._setProperty(org.apache.axis.components.net.DefaultCommonsHTTPClientProperties.CONNECTION_DEFAULT_CONNECTION_TIMEOUT_KEY, 10);
    stub._setProperty(org.apache.axis.components.net.DefaultCommonsHTTPClientProperties.CONNECTION_DEFAULT_SO_TIMEOUT_KEY, 10);
    // the same keys set globally through AxisProperties
    AxisProperties.setProperty("axis.connection.timeout", "10");
    AxisProperties.setProperty(org.apache.axis.client.Call.CONNECTION_TIMEOUT_PROPERTY, "10");
    AxisProperties.setProperty(org.apache.axis.components.net.DefaultCommonsHTTPClientProperties.CONNECTION_DEFAULT_CONNECTION_TIMEOUT_KEY, "10");
    AxisProperties.setProperty(org.apache.axis.components.net.DefaultCommonsHTTPClientProperties.CONNECTION_DEFAULT_SO_TIMEOUT_KEY, "10");
    logger.error(AxisProperties.getProperties());
    service = new MyClimateServiceLocator();
    port = service.getMyServicePort();
} catch (javax.xml.rpc.ServiceException e) {
    // the generated getMyServicePort() declares ServiceException
    logger.error(e);
}
I assigned the property changes before the generation of the service and after, I set the properties during initialisation, I tried several other timeout keys I found, ...
I think I'm getting mad about that and start to forget what I tried already!
What am I doing wrong? I mean there must be an option, mustn't it?
If I don't find a proper solution I thought about setting up a synchronized thread with a timeout within our code which actually feels quite awkward and somehow silly.
Can you imagine anything else?
Thanks in advance
Jens
I think it may be a bug, as indicated here:
https://issues.apache.org/jira/browse/AXIS-2493?jql=text%20~%20%22CONNECTION_DEFAULT_CONNECTION_TIMEOUT_KEY%22
Typecast the service port object to org.apache.axis.client.Stub, i.e.:
org.apache.axis.client.Stub stub = (org.apache.axis.client.Stub) port;
Then set all the properties:
stub._setProperty(org.apache.axis.client.Call.CONNECTION_TIMEOUT_PROPERTY, 10);
stub._setProperty(org.apache.axis.components.net.DefaultCommonsHTTPClientProperties.CONNECTION_DEFAULT_CONNECTION_TIMEOUT_KEY, 10);
stub._setProperty(org.apache.axis.components.net.DefaultCommonsHTTPClientProperties.CONNECTION_DEFAULT_SO_TIMEOUT_KEY, 10);
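If the stub properties still have no effect, the fallback mentioned in the question (running the call on a separate thread with a time limit) can be sketched with the standard java.util.concurrent classes only. This is a rough sketch, not an Axis feature: TimeLimitedCall and callWithTimeout are hypothetical names, and the Callable stands in for the generated port call. Note that cancelling the task only interrupts the worker thread; a call blocked in socket I/O may keep running in the background until the socket itself times out.

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeLimitedCall {
    /**
     * Runs the call on the given executor and waits at most timeoutMillis for it.
     * Returns Optional.empty() if the call is too slow or fails, so the caller
     * can carry on without the optional data.
     */
    static <T> Optional<T> callWithTimeout(ExecutorService executor, Callable<T> call, long timeoutMillis) {
        Future<T> future = executor.submit(call);
        try {
            return Optional.ofNullable(future.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the worker; blocking socket I/O may ignore this
            return Optional.empty();
        } catch (ExecutionException e) {
            return Optional.empty(); // the call itself threw an exception
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            future.cancel(true);
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            // Hypothetical slow service: sleeps far longer than the 100 ms limit.
            Optional<String> slow = callWithTimeout(executor, () -> {
                Thread.sleep(5_000);
                return "late";
            }, 100);
            System.out.println("slow call present: " + slow.isPresent());
        } finally {
            executor.shutdownNow();
        }
    }
}
```

In a servlet, the executor would normally be created once and shared, and an empty result would simply mean that the XML document is produced without the optional web service data.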
