Consul issue between versions on a cluster (0.6 - 0.7.1) - docker

I have a 3-node consul cluster; the nodes run version 0.6, all in Docker containers inside a private network. Recently at work we noticed that version 0.6 doesn't support the snapshot command (the HTTP endpoint returns 404), so I decided to upgrade one of those nodes to 0.7.1 in an isolated container. I got this version running in the cluster, but I noticed a couple of problems:
1) the snapshot command now fails; I get a 500 error: http: Request GET /v1/snapshot?dc=dc1, error: failed to decode response: read tcp 10.109.140.9:50728->10.109.140.7:8300: read: connection reset by peer from=10.109.140.9:58846
2) I looked at a few endpoints using the UI, and they seem to be failing when I browse the nodes section; I get this JS error: "Handlebars error: Could not find property 'tomographyGraph' on object."
Any recommendations for my situation? Thanks :)

Related

Isilon Cluster - SyncIQ Job Failed with No node on source cluster was able to connect to target cluster

Note: encryption is disabled on both clusters and I'm using a trial license. Both clusters' OneFS version is 9.1, and I also tried between 9.1 and 9.4.
The SyncIQ log /var/log/isi_migrate.log shows the message "Killing policy: No node on source cluster was able to connect to target cluster."
When I try to start the job, I get the above error. Can you tell me what I'm missing, or is there some configuration that needs to be done before running the job?

Running Kafka connect in standalone mode, having issues with offsets

I am using this GitHub repo and folder path I found: https://github.com/entechlog/kafka-examples/tree/master/kafka-connect-standalone to run Kafka Connect locally in standalone mode. I have made some changes to the Docker Compose file, but mainly changes that pertain to authentication.
The problem I am now having is that when I run the Docker image, I get this error multiple times, for each partition (there are 10 of them, 0 through 9):
[2021-12-07 19:03:04,485] INFO [bq-sink-connector|task-0] [Consumer clientId=connector- consumer-bq-sink-connector-0, groupId=connect-bq-sink-connector] Found no committed offset for partition <topic name here>-<partition number here> (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator:1362)
I don't think there are any issues with authenticating or connecting to the endpoint(s), I think the consumer (connect sink) is not sending the offset back.
Am I missing an environment variable? You will see this docker compose file has CONNECT_OFFSET_STORAGE_FILE_FILENAME: /tmp/connect.offsets, and I tried adding CONNECTOR_OFFSET_STORAGE_FILE_FILENAME: /tmp/connect.offsets (CONNECT_ vs. CONNECTOR_) and then I get an error Failed authentication with <Kafka endpoint here>, so now I'm just going in circles.
I think you are focused on the wrong output.
That is an INFO message
The offsets file (or topic in distributed mode) is for source connectors.
Sink connectors use consumer groups. If no committed offset is found for groupId=connect-bq-sink-connector, then the consumer group simply hasn't committed one yet.
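For reference, the worker property behind that environment variable is offset.storage.file.filename, and it only applies to source connectors. A minimal standalone worker sketch (the broker address and converter choices here are assumptions, not taken from the repo):

```
# connect-standalone.properties -- sketch; broker address and converters assumed
bootstrap.servers=localhost:9092
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
# Only consulted by *source* connectors. Sink connectors commit their progress
# through their consumer group (here: connect-bq-sink-connector) to Kafka itself.
offset.storage.file.filename=/tmp/connect.offsets
```

In the Confluent Docker images, CONNECT_-prefixed environment variables are mapped onto worker properties like these, which is why a CONNECTOR_ prefix has no effect.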

Failed to pull image "mcr.microsoft.comoss/calico/pod2daemon-flexvol:v3.18.1" (missing "/")

Executive summary
For several weeks we sporadically see the following error on all of our AKS Kubernetes clusters:
Failed to pull image "mcr.microsoft.comoss/calico/pod2daemon-flexvol:v3.18.1
Obviously there is a missing "/" after "mcr.microsoft.com".
The problem started after upgrading the clusters from 1.17 to 1.20.
Where does this spelling error come from? Is there anything WE can do about it?
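For illustration, the broken and the expected references differ only by that separator; a tiny sketch (the variable names are mine, not anything AKS actually uses):

```python
registry = "mcr.microsoft.com"
repository = "oss/calico/pod2daemon-flexvol:v3.18.1"

# Concatenating without a separator reproduces the reference from the error
broken = registry + repository
# The expected, pullable reference joins the two with "/"
fixed = "/".join([registry, repository])

print(broken)  # mcr.microsoft.comoss/calico/pod2daemon-flexvol:v3.18.1
print(fixed)   # mcr.microsoft.com/oss/calico/pod2daemon-flexvol:v3.18.1
```

Since "mcr.microsoft.comoss" is not a registered hostname, the DNS lookup in the error below fails with "no such host".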
Some details
The full error is:
Failed to pull image "mcr.microsoft.comoss/calico/pod2daemon-flexvol:v3.18.1": rpc error: code = Unknown desc = failed to pull and unpack image "mcr.microsoft.comoss/calico/pod2daemon-flexvol:v3.18.1": failed to resolve reference "mcr.microsoft.comoss/calico/pod2daemon-flexvol:v3.18.1": failed to do request: Head https://mcr.microsoft.comoss/v2/calico/pod2daemon-flexvol/manifests/v3.18.1: dial tcp: lookup mcr.microsoft.comoss on 168.63.129.16:53: no such host
In 50% of the cases the following is also logged:
Pod 'calico-system/calico-typha-685d454c58-pdqkh' triggered a Warning-Event: 'FailedMount'. Warning Message: Unable to attach or mount volumes: unmounted volumes=[typha-ca typha-certs calico-typha-token-424k6], unattached volumes=[typha-ca typha-certs calico-typha-token-424k6]: timed out waiting for the condition
There seems to be no measurable effect on cluster health apart from the warnings - I see no correlating errors in any services.
We did not find a trigger that causes the behavior. It does not seem to be correlated with any change on our side (deployments, scaling, ...).
Also there seems to be no pattern as to the frequency. Sometimes there is no problem for several days and then we have the error pop up 10 times per day.
Another observation is that the calico-kube-controller and several pods were restarted. The ReplicaSets and Deployments did not change.
Since all the pods of the daemonset are running eventually, the problem seems to be solving itself after some time.
Are you behind a firewall, and did you use this link to set it up?
https://learn.microsoft.com/en-us/azure/aks/limit-egress-traffic
If so, add an HTTP rule for mcr.microsoft.com as well; it looks like MS missed the 's' (https) in a recent update.
Paul

OpenShift 4 error: Error reading manifest

During OpenShift installation from a local mirror registry, after I started the bootstrap machine I see the following error in the journal log:
release-image-download.sh[1270]:
Error: error pulling image "quay.io/openshift-release-dev/ocp-release@sha256:999a6a4bd731075e389ae601b373194c6cb2c7b4dadd1ad06ef607e86476b129":
unable to pull quay.io/openshift-release-dev/ocp-release@sha256:999a6a4bd731075e389ae601b373194c6cb2c7b4dadd1ad06ef607e86476b129: unable to pull image:
Error initializing source docker://quay.io/openshift-release-dev/ocp-release@sha256:999a6a4bd731075e389ae601b373194c6cb2c7b4dadd1ad06ef607e86476b129:
(Mirrors also failed: [my registry:5000/ocp4/openshift4@sha256:999a6a4bd731075e389ae601b373194c6cb2c7b4dadd1ad06ef607e86476b129: Error reading manifest
sha256:999a6a4bd731075e389ae601b373194c6cb2c7b4dadd1ad06ef607e86476b129 in my registry:5000/ocp4/openshift4: manifest unknown: manifest unknown]):
quay.io/openshift-release-dev/ocp-release@sha256:999a6a4bd731075e389ae601b373194c6cb2c7b4dadd1ad06ef607e86476b129: error pinging docker registry quay.io:
Get "https://quay.io/v2/": dial tcp 50.16.140.223:443: i/o timeout
Does anyone have any idea what it could be?
The answer is here in the error:
... dial tcp 50.16.140.223:443: i/o timeout
Try this on the command line:
$ podman pull quay.io/openshift-release-dev/ocp-release@sha256:999a6a4bd731075e389ae601b373194c6cb2c7b4dadd1ad06ef607e86476b129
You'll need to be authenticated to actually download the content (this is what the pull secret does). However, if you can't even get the "unauthenticated" error, that points more solidly to a network configuration problem.
That IP resolves to a quay host (you can verify that with "curl -k https://50.16.140.223"). Perhaps you have an internet filter or firewall in place that's blocking egress?
Resolutions:
fix your network issue, if you have one
look at doing a disconnected/airgap install -- https://docs.openshift.com/container-platform/4.7/installing/installing-mirroring-installation-images.html has more details on that
(If you're already doing an airgap install and it's your local mirror that's failing, then your local mirror is failing)

Getting Neo4J running on OpenShift

I am trying to get the Bitnami Neo4j image running on OpenShift (testing on my local Minishift), but I am unable to connect. I am following the steps outlined in this issue (now closed), however, now I cannot access the external IP for the load balancer.
Here are the steps I have taken:
1. Deploy the image (bitnami/neo4j)
2. Create a service for the load balancer, using the YAML supplied in the issue mentioned
3. Get the external IP address for the LB (oc get services)
The command in step 3 lists 2 of the same IP addresses, and when I attempt to go to this IP in my browser it times out.
I can create a route that points to port 7374 on the IP of the LB, but then I get the same error as reported in the aforementioned issue. (ServiceUnavailable: WebSocket connection failure. Due to security constraints in your web browser, the reason for the failure is not available to this Neo4j Driver. Please use your browser's development console to determine the root cause of the failure. Common)
Configure neo4j to accept non-local connections. E.g.:
dbms.connector.bolt.address=0.0.0.0:7687
Source: https://neo4j.com/developer/kb/explanation-of-error-websocket-connection-failure/
