ksqlDB CLI failed to terminate a query: Timeout while waiting for command topic consumer to process command topic

I want to terminate a query so that I can drop a table, but I got the error below. After a while the query is terminated, yet the ksql log doesn't print any error message. How can I find the root cause?
ksql> terminate CTAS_KSQL1_TABLE_SACMES_PACK_STATS_275;
Could not write the statement 'terminate CTAS_KSQL1_TABLE_SACMES_PACK_STATS_275;' into the command topic.
Caused by: Timeout while waiting for command topic consumer to process command topic

Looks like you may have run into a bug in older versions of ksqlDB. Maybe this one: https://github.com/confluentinc/ksql/issues/4267
The general issue is that the query gets into a state where it can't shut down cleanly. Whatever is blocking the shutdown does eventually complete or time out. In the case of issue #4267 above, the sink topic, i.e. the topic ksqlDB is writing to, had been deleted out-of-band, i.e. by something other than ksqlDB, and ksqlDB was stuck trying to fetch metadata for the non-existent topic. Did you delete the sink topic?
There were other resolved issues too that I can't find right now.
Bouncing the server after issuing the terminate should clean up the stuck query, though it's a pretty severe workaround!
Upgrading to a later version, i.e. something released after May 2020, should resolve the issue.
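If bouncing the server is the route you take, the clean-up afterwards might look like this (a sketch only: the query ID is from the question above, and the table name is inferred from the CTAS query ID, so it may differ in your setup):

```sql
-- After restarting the ksqlDB server, re-issue the terminate,
-- then drop the table the query was writing to.
TERMINATE CTAS_KSQL1_TABLE_SACMES_PACK_STATS_275;
DROP TABLE KSQL1_TABLE_SACMES_PACK_STATS;
```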

Related

Message Loss (message sent to spring-amqp doesn't get published to rabbitmq)

We have a setup where we use spring-amqp transacted channels to push our messages to RabbitMQ. During testing we found that messages were not even getting published from spring-amqp to RabbitMQ;
we suspect a failure of metricsCollector.basicPublish(this) in com.rabbitmq.client.impl.ChannelN (no exception is thrown),
because we can see that RabbitUtils.commitIfNecessary(channel) in org.springframework.amqp.rabbit.core.RabbitTemplate is not getting called when there is an issue executing metricsCollector.basicPublish(this) in the same code flow.
We have taken TCP dumps and could see that messages were written to the stream/socket on RabbitMQ, but since the commit didn't happen, due to a probable AMQP API failure, the messages were not delivered to the corresponding queues.
Jar versions used in the setup:
spring-amqp-2.2.1.RELEASE.jar
spring-rabbit-2.2.1.RELEASE.jar
amqp-client-5.7.3.jar
metrics-core-3.0.2.jar
Is anyone facing a similar issue? Can someone please help?
---edit 1
(Setup): We are using the same connection factory for flows running with a parent transaction and for flows not running with a parent transaction.
On further analysis of the issue, we found that isChannelLocallyTransacted sometimes shows inconsistent behavior, because ConnectionFactoryUtils.isChannelTransactional(channel, getConnectionFactory()) sometimes holds a reference to a transacted channel (it returns true, hence the expression isChannelLocallyTransacted evaluates to false), due to which tx.commit never happens; so the message gets lost before being committed to RabbitMQ.
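For what it's worth, the decision described in the edit above can be modelled as a tiny boolean function. This is a hypothetical model, not the real Spring classes; shouldCommitLocally stands in for the isChannelLocallyTransacted check:

```java
// Hypothetical model of Spring AMQP's local-commit decision (not the real API).
public class LocalCommitModel {

    // Mirrors the logic described above: commit locally only when the template
    // is transacted AND the channel is not already bound to an outer (parent)
    // transaction.
    static boolean shouldCommitLocally(boolean templateTransacted, boolean channelBoundToOuterTx) {
        return templateTransacted && !channelBoundToOuterTx;
    }

    public static void main(String[] args) {
        // Plain transacted send: the template commits the channel itself.
        System.out.println(shouldCommitLocally(true, false)); // prints "true"

        // Channel wrongly seen as bound to a parent transaction (e.g. via a
        // shared connection factory): the local commit is skipped, and if no
        // parent transaction ever commits this channel, the message is lost.
        System.out.println(shouldCommitLocally(true, true)); // prints "false"
    }
}
```

This is why sharing one connection factory between transactional and non-transactional flows is risky: the second branch can be taken for a flow that has no parent transaction to fall back on.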

KSQL failed to create stream from topic

I have a JSON topic named "customer-event" and am trying to create a stream with the KSQL below:
create stream cssc_customer_event_json (description varchar, pageEvent_id varchar) with (kafka_topic='customer-event', value_format='json');
It returns the message below:
Message
------------------------------------
Statement written to command topic
------------------------------------
After the statement runs, no stream is created. Can anyone advise what the problem may be?
Thanks
Regards,
Han
The same thing happens when I try to create a table.
create stream cssc_customer_event_json (reportSuite varchar, exclude_id varchar, exclude_value varchar, exclude_description varchar, pageEvent_id varchar) with (kafka_topic='customer-event', value_format='json');
It should create a new stream, but no stream is created:
ksql> show streams;
Stream Name | Kafka Topic | Format
------------------------------------
------------------------------------
The message:
Statement written to command topic
is generally only seen when the REST endpoint's thread writes the statement to the command topic and then times out waiting for the engine side to read and process it.
There are a couple of reasons this can occur:
1. There's a misconfiguration in Kafka stopping the engine side from reading, e.g. ACLs set incorrectly, so that ksqlDB can write the data but can't read it back.
2. There's a stability issue in Kafka, e.g. under-replicated partitions etc., though this tends to stop the write to Kafka from working, not the read side.
3. The thread reading the command topic has crashed. There were some bugs in earlier versions that could cause this. The ksql application log would report this. Restarting, or upgrading, may fix it.
4. The thread reading the command topic is stuck. There were some bugs in earlier versions that could cause this. The ksql application log would NOT report this; you'd need to take a thread dump or similar to diagnose it. Restarting, or upgrading, may fix it.
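Once the engine side is reading the command topic again (for example after a restart or upgrade), re-running the statement from the question and verifying it took effect would look like:

```sql
-- Re-run the statement from the question, then confirm the stream exists.
CREATE STREAM cssc_customer_event_json (description VARCHAR, pageEvent_id VARCHAR)
  WITH (kafka_topic='customer-event', value_format='json');

SHOW STREAMS;
```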

Hyperledger Composer Fails to Update unless Peer is restarted

I am having trouble with Hyperledger Composer. I am using JWT on a Docker-deployed composer rest server. At times when I try to update my data, despite the API returning a 200 OK, when I call GET to retrieve the newly updated info the data remains unchanged. I could only temporarily solve this with 'docker restart '. After an unknown amount of time the update fails again and I have to restart the peer.
I am wondering what the problem could be.
You don't give much detail, for example the environment you are running, the versions of composer, fabric etc., so I am going to guess you are running composer 0.20 with fabric 1.2.0.
There was a big problem in fabric 1.2.0 which meant that the blockchain and world state didn't get updated. You need to use fabric 1.2.1, which solves the issue.
Every data update happens via a transaction.
For the data to be updated, the transaction must first be approved. The peer will issue a transaction proposal first which will then be subject to the endorsement policy that you setup when you created the channel.
This means there can be a delay before the transaction is accepted and committed, or the transaction might be rejected altogether, meaning your data will not be updated. Even if the transaction does go onto the ledger, you can't really know when this will happen.
I would start by checking the peer logs to see what's actually happening.
I would also check the bugs reported against your specific version of fabric to see if there are any known issues. You might want to ask in the fabric Rocket.Chat channel as well.

Error creating the GCE VMs or starting Dataflow

I'm getting the following error in the recent jobs I'm trying to submit:
2015-01-07T15:51:56.404Z: (893c24e7fd2fd6de): Workflow failed.
Causes: (893c24e7fd2fd601):
There was a problem creating the GCE VMs or starting Dataflow on the VMs so no data was processed. Possible causes:
1. A failure in user code on the worker.
2. A failure in the Dataflow code.
Next Steps:
1. Check the GCE serial console for possible errors in the logs.
2. Look for similar issues on http://stackoverflow.com/questions/tagged/google-cloud-dataflow.
There are no other errors.
What does this error mean?
The Dataflow starts up VM instances and then launches an agent on those VMs. Those agents then do the heavy lifting of executing your code (e.g. ParDo's, reading and writing) your Data.
The error indicates the job failed because no agents were requesting work. As a result, the service marked the job as a failure because it wasn't making any progress and never would since there weren't any agents to process your data.
So we need to figure out where in the agent startup process things failed.
The first thing to check is whether the VMs actually started. When you run your job, do you see any VMs created in your project? It might take a minute or two for the VMs to start up, but they should appear shortly after the runner prints the message "Starting worker pool setup". The VMs should be named something like
<PREFIX-OF-JOB-NAME>-<TIMESTAMP>-<random hexadecimal number>-<instance number>
Only a prefix of the job name is used to ensure we don't exceed GCE name limits.
If the VMs start up, the next thing to do is to inspect the worker logs to look for errors indicating problems launching the agent.
The easiest way to access the logs is using the UI. Go to the Google Cloud Console and then select the Dataflow option in the left hand frame. You should see a list of your jobs. You can click on the job in question. This should show you a graph of your job. On the right side you should see a button "view logs". Please click that. You should then see a UI for navigating the logs and you can look for errors.
The second option is to look for the logs on GCS. The location to look for is:
gs://PATH TO YOUR STAGING DIRECTORY/logs/JOB-ID/VM-ID/LOG-FILE
You might see multiple log files. The one we are most interested in is the one that starts with "start_java_worker". If that log file doesn't exist then the worker didn't make enough progress to actually upload the file; or else there might have been a permission problem uploading the log file.
In that case the best thing to do is to try to ssh into one of the VMs before it gets torn down. You should have about 15 minutes before the job fails and the VMs are deleted.
Once you login to the VM you can find all the logs in
/var/log/dataflow/...
The log we care most about at this point is:
/var/log/dataflow/taskrunner/harness/start_java_worker-SOME ID.log
If there is a problem starting the code that runs on the VM that log should tell us. That log and the other logs should also tell us if there is a permission problem that prevents the code running on the worker from being able to access Dataflow.
Please take a look and let us know if you find anything.
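As an aside, the GCS log location described above is just a concatenation of the staging directory, job ID and VM ID, so it can be assembled with ordinary shell variables. All names below are hypothetical placeholders, not from a real job:

```shell
# Illustrative only: substitute your own staging bucket, job ID and VM name.
STAGING="gs://my-bucket/staging"          # your staging directory
JOB_ID="2015-01-07_15_51_56-1234"         # your job ID
VM_ID="myjob-01071551-abcdef12-0"         # a VM name, as described above
LOG_PATH="${STAGING}/logs/${JOB_ID}/${VM_ID}"
echo "${LOG_PATH}"
# Then list the log files, looking for one starting with start_java_worker:
# gsutil ls "${LOG_PATH}/"
```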
Apart from Jeremy Lewi's great answer, I would like to add that I've seen this error appear when you don't enable the proper Google APIs in the Developers Console, as mentioned here, which leads to a permission issue, like Jeremy said.

Neo4j won't elect a different master after deadlock

I received the following error from neo4j while our app was running a cypher query.
Error: HTTP Error 400 when running the cypher query against neo4j.ForsetiClient[43]
can't acquire ExclusiveLock{owner=ForsetiClient[42]} on NODE(922), because holders
of that lock are waiting for ForsetiClient[43]. Wait list:ExclusiveLock[ForsetiClient[42]
waits for []]
This is the second time this has happened to us and I am not sure how to prevent it from happening again. We have never received these errors in our development environment, so that's why it is a little weird.
Another thing: since this happened, if I take down the current master in the cluster, none of the other instances will become the master, which is a major problem. I need to get this fixed quickly; any help would be great.
We are using Neo4j 2.1.4 in HA mode.
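No answer was recorded here for the election problem, but for the deadlock itself a common client-side mitigation (general advice, not from this thread) is to retry the transaction with a short backoff, since lock contention like the error above is usually transient. A minimal sketch, assuming the query is wrapped in a Callable; in embedded Neo4j the transient failure would be DeadlockDetectedException, while over REST you would inspect the HTTP 400 body, so here a generic RuntimeException keeps the sketch self-contained:

```java
import java.util.concurrent.Callable;

// Generic retry-with-backoff wrapper for transient failures such as deadlocks.
public class RetryOnDeadlock {

    static <T> T retry(Callable<T> work, int maxAttempts, long backoffMillis) throws Exception {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return work.call();
            } catch (RuntimeException e) { // e.g. a deadlock error
                last = e;
                Thread.sleep(backoffMillis * attempt); // back off before retrying
            }
        }
        throw last; // all attempts failed
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Demo: fails twice with a "deadlock", succeeds on the third attempt.
        String result = retry(() -> {
            if (++calls[0] < 3) {
                throw new RuntimeException("transient deadlock");
            }
            return "committed";
        }, 5, 10);
        System.out.println(result); // prints "committed"
    }
}
```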
