DSE OpsCenter showing wrong node status - datastax-enterprise

I've come across several occurrences already where one or two of our DSE Search nodes are shown with "Down - Unresponsive" status in OpsCenter even though the node is up (i.e. I can access the Solr admin UI). Sometimes nodetool status also shows that the node is down, but more often it's only OpsCenter. I found out that the fix is to restart the datastax-agent service. What could be causing this?
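For reference, the workaround amounts to this on the affected node (assuming a package-based install where the agent runs as the datastax-agent service):

# Restart the agent that reports status to OpsCenter
sudo service datastax-agent restart

# Confirm the node is actually up from the cluster's point of view
nodetool status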
I'd also like to follow up on my other questions:
New Solr node in "Active - Joining" state for several days
Fault tolerance and topology transparency of multi-node DSE Cluster

Related

DRBD stuck in Connected/WFBitMapS

So I am trying to set up DRBD replication between 2 nodes.
When I restart my primary node, the second one correctly takes over as primary, but once the restarted node is back online, the pair stays stuck in the state below.
Primary has the following status: 0:r0/0 WFBitMapS Primary/Secondary UpToDate/Consistent
and secondary has the following: 0:r0/0 Connected Secondary/Primary Negotiating/UpToDate.
Rebooting the secondary node fixes the issue, but every time there is a failover, my nodes get stuck in the above state again.
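If it helps frame answers: the less drastic recovery I would like to validate (instead of a full reboot) is a plain disconnect/reconnect of the resource on the stuck secondary, roughly like this (resource name r0 as above):

# Check the current connection and disk state
cat /proc/drbd

# Tear down and re-establish the replication link for resource r0
drbdadm disconnect r0
drbdadm connect r0

# Watch whether the resync actually starts
watch cat /proc/drbd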
Below is some information on my cluster:
debian 10
drbd 8.9.10
drbd disk size: 6.7TB.
Does anyone have an idea what's going on, or what this state means? I did not find any useful information about it on Google...
Thank you

Solr7 and zookeeper behavior leading to deleted data directories, how to research/prevent

During testing, I came across the following situation:
I had set up 3 VMs, all Ubuntu 18.04.
The first 2 machines had a solr7 instance. All 3 machines had a zookeeper. All of these are in Docker containers, the entire config deployed via Ansible.
Solr 7.5, Zookeeper 3.14.3
There's a frontend that acts as an interface for inserting data.
The zookeeper machines were set up to create an ensemble, which they properly did. They all had their id, a leader was elected, solr7 instances could connect and received their settings properly.
Inserting a bunch of data all worked fine.
Then I took down 2 of the VMs, leaving 1 with both a solr7 and a zookeeper instance, and redeployed the new config without a zookeeper ensemble.
This did not work: the interface refused to come up, and it all took too long, so I decided to go back to 3 VMs.
While I could once again connect, I noticed all data was gone.
Even worse, when looking at the location of the solr data directories, those were all gone. Every single collection/core was gone.
I've been trying to google this issue, but there seems to be no documentation of anything like this.
My current working theory is that solr started and asked the zookeeper ensemble for its configuration. Zookeeper either was not in sync or had lost its settings, and sent an empty reply or no reply at all. Solr then decided to remove the existing data folders, since the config it received specified nothing (or it received no config at all).
That's just guesswork though. I'm at a complete loss even finding information about this.
I'm not even sure what to search for. All results I get are "how to delete solr cores" or "how to remove collections".
Any help or pointing in the right direction would be appreciated.
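In case it's useful context: before changing the ZooKeeper topology again, I plan to back up each collection through the Collections API first (the backup name, collection name and location below are placeholders; the location has to be a path Solr can write to, or a shared filesystem in a multi-node setup):

# Back up one collection before touching the ZooKeeper ensemble
curl "http://localhost:8983/solr/admin/collections?action=BACKUP&name=products_backup&collection=products&location=/var/backups/solr"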
EDIT: After talking about it on the solr mailing list, a ticket was made for this: https://issues.apache.org/jira/browse/SOLR-13396
So I'm answering my own question so this can be closed.

Keyspace doesn't appear in Opscenter Repair Service UI

I recently updated to Datastax OpsCenter 6.1.0 and enabled the repair service against my cluster running DSE 5.0.5. The UI shows repair task status for tables in the OpsCenter keyspace, but none for the keyspace which stores all my application data. My keyspace is configured with NetworkTopologyStrategy and two replicas in each of two data centers.
How can I determine why OpsCenter is not repairing my keyspace (I don't see anything relevant in the logs)? Is there something specific I need to change in the configuration?
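For what it's worth, the checks I can run outside OpsCenter look like this (the keyspace name is a placeholder for my application keyspace):

# Confirm the replication settings really are NetworkTopologyStrategy with 2 replicas per DC
cqlsh -e "DESCRIBE KEYSPACE my_app_keyspace;"

# Sanity check that a manual primary-range repair of that keyspace works at all
nodetool repair -pr my_app_keyspace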
After updating to OpsCenter 6.1.3, the "Subrange" area of the repair service appears to be making significant progress for the first time. It's almost to 10%, whereas previously it would only ever get to ~2% at most. I am optimistic that the upgrade has fixed the issue.

Neo4j won't elect a different master after deadlock

I received the following error from neo4j while our app was running a cypher query.
Error: HTTP Error 400 when running the cypher query against neo4j.ForsetiClient[43]
can't acquire ExclusiveLock{owner=ForsetiClient[42]} on NODE(922), because holders
of that lock are waiting for ForsetiClient[43]. Wait list:ExclusiveLock[ForsetiClient[42]
waits for []]
This is the second time this has happened to us, and I am not sure how to prevent it from happening again. We have never received these errors in our development environment, so that's why it is a little weird.
Another thing: since this happened, if I take down the current master in the cluster, none of the other instances will become the master, which is a major problem. I need to get this fixed quickly; any help would be great.
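In case it helps with diagnosis, these are the HA status endpoints each instance exposes, which I can poll to see which instance thinks it is master and whether it is available (host and port are placeholders for our instances):

# Returns true on the instance that believes it is the current master
curl http://localhost:7474/db/manage/server/ha/master

# Returns master or slave when the instance is available to serve requests
curl http://localhost:7474/db/manage/server/ha/available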
We are using Neo4j 2.1.4 in HA mode.

AWS Spot Instances and ipcluster plugin

Currently what does the ipcluster plugin do when AWS shuts down one or more of the spot instance nodes? Is there any mechanism to re-start and then re-add these nodes back to the IPython cluster?
You need to use the loadbalance command in order to scale your cluster. If you know you want x nodes at all times, simply launch it with "--max_nodes x --min_nodes x" and it will try to add back the nodes as soon as they go away.
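For example, a concrete invocation would look roughly like this (assuming the StarCluster CLI, with "mycluster" as a placeholder cluster tag):

# Keep the cluster pinned at 4 nodes; the balancer re-adds nodes lost to spot terminations
starcluster loadbalance --max_nodes 4 --min_nodes 4 mycluster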
If your nodes go away, it's probably because of spot market price fluctuations, so you might have to wait for the spot price to drop back below your SPOT_BID value before seeing them reappear.
I use the load balancer a lot with the SGE plugin, but never tried it with ipcluster, so I do not know how well it will behave.
