What is the right alternative to ntpdate for the feature described below? - ntpd

Since ntpdate has been deprecated, ntpd -gq serves the same purpose in most cases.
With ntpdate, if one provides a wrong hostname or IP, the command reports the error
no server suitable for synchronization found
and exits.
Is this behaviour not yet available with ntpd? That is, can ntpd -gq be made to quit if it did not get any response from the server (an invalid host, or one not running NTP)?
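One way to approximate the fail-fast behaviour from the outside, as a minimal sketch rather than a confirmed ntpd feature, is to run the one-shot command under an external deadline and treat expiry as "no server responded". The helper name run_with_deadline and the 30-second figure in the note below are assumptions for illustration; only ntpd -gq comes from the question.

```python
import subprocess


def run_with_deadline(cmd, seconds):
    """Run cmd and return its exit code, or None if it did not finish in time."""
    try:
        return subprocess.run(cmd, timeout=seconds).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising, so no process lingers.
        return None
```

Calling run_with_deadline(["ntpd", "-gq"], 30) and treating a None result as a failure would give a script the same "exit with an error" signal that ntpdate provided. What a sensible deadline is depends on the network and is not something the question specifies.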

Tasks will not run in Spring Cloud Data Flow (Docker/K8S)

Last week I installed the Docker/Kubernetes based version of Spring Cloud Data Flow.
Although there were no overt errors, things are not working correctly.
I am able to create streams and tasks in the web UI and Spring Cloud Data Flow Shell but nothing runs.
I am most interested in Tasks.
When I create them, they all show with a Task Status of UNKNOWN.
Unfortunately, no matter how many times I launch them, the status always remains UNKNOWN.
I'm able to delete them but what magic must I use to make them run?
There's nothing apparent from the description as to what has failed. If you can update it with more details, that would be useful.
From a troubleshooting standpoint, if deploying streams or launching tasks fails for any reason, the failures will be logged in the SCDF-server/Skipper-server logs. You'd have to tail the logs of the respective pod to learn more about the failures.
Also, it'd be useful to check the output of kubectl describe pod/<POD_NAME> to see what's causing the stream/task pods not to start successfully. The causes are usually listed towards the end of the command output.
The usual suspects are pod health-check failures and/or stream/task application Docker images that aren't resolvable at runtime. You'll see the reasons in the logs, of course.
This was a misconfiguration on my end.
I'm able to run as expected now.

Random and occasional network error (NSURLErrorDomain Code=-1001 and NSURLErrorDomain Code=-1005)

The last couple of days I've tried to debug a network error from d00m. I'm starting to run out of ideas/leads and my hope is that other SO users have valuable experience that might be useful. I hope to be able to provide all relevant information, but I'm not personally in control of the server environments.
The whole thing started by users noticing a couple of "network errors" in our app. The error seemed to occur randomly, without any noticeable pattern related to internet connectivity, iOS version or backend updates. The two errors that occur behind the scenes are:
Error Domain=NSURLErrorDomain Code=-1001 "The request timed out."
and more frequently:
Error Domain=kCFErrorDomainCFNetwork Code=-1005 "The network connection was lost."
After debugging it for a couple of days, I've managed to reproduce these errors (occurring at random) by firing approx. 10 random (GET and POST) requests towards our backend with a random sleep timer between each request (set at 1-20 seconds). However, it only occurs in periods. What I've experienced the last couple of days is that when a "period of error" starts, I get one of the two errors once every one or two runs of the code (meaning an error rate of 1 in 10 or 1 in 20 requests). This error rate continues for a couple of hours, then the error disappears for a couple of hours, and then it starts all over.
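The reproduction loop described above could be sketched outside the app roughly as follows. This is an illustration in Python under stated assumptions, not the code actually used: the function names, the placeholder POST body, and the injectable send_fn, sleep and rng parameters are all hypothetical, and the URLs would be the backend's real endpoints.

```python
import random
import time
import urllib.request


def send(method, url, timeout=30):
    """Issue one request; return None on success or the exception on failure."""
    data = b"{}" if method == "POST" else None  # placeholder body (assumption)
    request = urllib.request.Request(url, data=data, method=method)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            response.read()
        return None
    except OSError as error:
        # Covers timeouts, resets, and URLError/HTTPError (all OSError subclasses).
        return error


def run_batch(urls, count=10, min_sleep=1, max_sleep=20,
              send_fn=send, sleep=time.sleep, rng=random):
    """Fire `count` random GET/POST requests with a random pause between them."""
    failures = []
    for index in range(count):
        method = rng.choice(["GET", "POST"])
        url = rng.choice(urls)
        error = send_fn(method, url)
        if error is not None:
            failures.append((index, method, url, error))
        if index < count - 1:
            sleep(rng.uniform(min_sleep, max_sleep))
    return failures
```

Running run_batch repeatedly and logging the length of the returned list against the wall-clock time would show whether the "error periods" also appear for a non-iOS client.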
Some quick facts about the setup:
Happens on device and simulator
Happens on iOS 8.4 and iOS 7.1 - although v. 8.4 is the main one I use for testing.
We use NSURLSession for our network requests. We also have AFNetworking included (updated to latest version), but we only use the Security part for SSL Pinning. Even with SSL pinning totally turned off, the error still occurs.
Some findings I've written down during the last couple of days:
It seems to only happen on our production environments, which have a somewhat different configuration from our staging environments. This led me to think that it might be related to the keep-alive bug as discussed here and here. However, our ops department have set up a new staging environment sending the same keep-alive header as the production environments, but this did not make the error occur on the staging environment.
The Android version of our app was unable to reproduce the error using the same setup of requests. Further, we've not received any customer issues on "network errors" in the Android app.
My gut feeling says that it's related to the server environment and the HTTP implementation in iOS. I'm however unable to track down a convincing pattern that proves anything. I've made the same setup using a simple Rails script, and when the next "error period" occurs, I'll be ready to try and reproduce it outside of iOS land. I'll update the question when this happens.
I'm not looking for solutions involving resetting wifi settings, shutting down the simulator or similar, as I do not see these as feasible solutions in a production environment. I've also considered making the retry-loop-fix as mentioned in the GitHub issue, but I see this as a last resort.
Please let me know if you need any more information.
In my experience, those sorts of problems usually point to massive packet loss, particularly over a cell network, where minor variations in multipath interference and other issues can make the difference between reliably passing traffic and not.
Another possibility that comes to mind is poor-quality NAT implementations, in the unlikely event that your server's timeout interval is long enough to cause the NAT to give up on the TCP connection.
Either way, the only way to know for sure what's happening is to take a packet trace. To do that, connect a Mac to the Internet via a wired connection, enable network sharing over Wi-Fi, and connect the iOS device to that Wi-Fi network. Then run Wireshark and tell it to monitor the bridge interface. Instructions here:
http://www.howtogeek.com/104278/how-to-use-wireshark-to-capture-filter-and-inspect-packets/
From there, you should be able to see exactly what is being sent and when. That will probably go a long way towards understanding why it is failing.
Ok, I lost a lot of time investigating a similar issue.
1005 could be caused by a known iOS bug, and there are a couple of fixes. For example, add the header "Connection" with the value "close".
More info
1001 is a different story. In my case the problem was a strange (bad?) firewall on the server. It was banning the device when there were many (not so many) requests in a short period of time.
I believe you can do an easy test to see whether you are facing a similar issue.
Send a lot of requests in a loop (how many depends on the firewall settings; let's say 50 in 1 second).
Close/kill the app (this will close connections to the server).
(OPTIONAL) Wait a while (let's say 60 seconds).
Start the app again and try to send a request.
If you now get a timeout for all subsequent requests, you probably have the same issue and you should talk with the server guys.
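The test steps above can be sketched as a small script, assuming a hypothetical send_fn callable that opens a fresh connection, performs one request, and returns True on success or False on a timeout or connection error. The script cannot kill the app, so the pause stands in for the close-and-wait steps; the defaults of 50 requests and 60 seconds are taken from the steps, while the number of probes is an assumption.

```python
import time


def burst_then_probe(send_fn, burst=50, wait_seconds=60, probes=5,
                     sleep=time.sleep):
    """Send a burst of requests, pause, then count how many probe requests fail."""
    burst_ok = sum(1 for _ in range(burst) if send_fn())
    sleep(wait_seconds)  # stands in for closing the app and waiting
    probe_failures = sum(1 for _ in range(probes) if not send_fn())
    return {
        "burst_ok": burst_ok,
        "probe_failures": probe_failures,
        # Every probe failing after the burst matches the "banned" pattern.
        "likely_banned": probe_failures == probes,
    }
```

A result with likely_banned set to True only suggests the firewall pattern described here; confirming it still needs the server-side logs.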
PS: if you don't have access to the server, you can tell the user to restart Wi-Fi on the device to get out of that timeout loop. It could be a last resort in some cases.

HBase 0.98.1 Put operations never time out

I am using the 0.98.1 version of the HBase server and client. My application has strict response time requirements. As far as HBase is concerned, I would like to abort the HBase operation if the execution exceeds 1 or 2 seconds. This task timeout is useful in case a Region-Server is non-responsive or has crashed.
I tried configuring
1) HBASE_RPC_TIMEOUT_KEY = "hbase.rpc.timeout";
2) HBASE_CLIENT_RETRIES_NUMBER = "hbase.client.retries.number";
However, the Put operations never time out (I am using sync flush). The operations return only after the Put is successful.
I looked through the code and found that the function receiveGlobalFailure in the AsyncProcess class keeps resubmitting the task without any check on the retries. This is in version 0.98.1.
I do see that in 0.99.1 there have been some changes to AsyncProcess class that might do what I want. I have not verified it though.
My questions are:
Is there any other configuration that I missed that can give me the desired functionality?
Do I have to use the 0.99.1 client to solve my problem? Does 0.99.1 solve my problem?
If I have to use the 0.99.1 client, then do I have to use the 0.99.1 server, or can I still use my existing 0.98.1 region-server?
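Independently of the HBase settings, the "abort after 1 or 2 seconds" requirement can be approximated on the caller's side by waiting on the operation with a deadline. The sketch below shows that generic pattern in Python; it does not use any HBase API, and call_with_deadline and its operation argument are hypothetical names. Note the limitation stated in the docstring: the caller stops waiting, but the underlying operation is not cancelled.

```python
import concurrent.futures


def call_with_deadline(operation, seconds):
    """Run operation() in a worker thread and stop waiting after `seconds`.

    Raises concurrent.futures.TimeoutError when the deadline passes. The
    worker thread is not interrupted; it keeps running in the background.
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        future = executor.submit(operation)
        return future.result(timeout=seconds)
    finally:
        # Do not block on a possibly stuck worker while shutting down.
        executor.shutdown(wait=False)
```

This keeps the application's response time bounded, at the cost of leaving a stuck call behind, so it complements rather than replaces a client-side timeout that actually works.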

Icinga - check_yum - Socket Timeout?

I'm using the check_yum plugin in my Icinga monitoring environment to check whether security-critical updates are available. This works quite well, but sometimes I get a "CHECK_NRPE: Socket timeout after xx seconds." while executing the check. Currently my NRPE timeout is 30 seconds.
If I re-schedule the check a few times, or execute the check directly from my Icinga server with a higher NRPE timeout value, everything works fine, at least after a few executions of the check. All other checks via NRPE are not throwing any errors, so I think there is no general error with my NRPE config or the plugins I'm using. Is there some explanation for this strange behaviour of the check_yum plugin? Maybe some caching issues on the monitored servers?
First, be sure you are using the 1.0 version of this check from: https://code.google.com/p/check-yum/downloads/detail?name=check_yum_1.0.0&can=2&q=
The changes I've seen in that version could fix this issue, depending on its root cause.
Second, if your server(s) are not configured to use all 'local' cache repos, then this check will likely hit the 30-second timeout before it completes. This is because (1) the amount of data from the refresh/update is pretty large and may take a long time to download from remote servers (including RH proper), and (2) most of the 'official' update servers tend to go offline A LOT.
The best solution I've found is to have a cron job perform the update check at a set interval (I use weekly) and create a log file containing the security patches the system(s) require. Then use a Nagios check, via a simple shell script, to see if said file has any new items in it.
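The file-based check could look roughly like the sketch below. It is written in Python rather than the shell script mentioned above, and it assumes the cron job writes one pending security update per line; the function name and that file layout are assumptions. The exit codes follow the standard Nagios plugin convention (0 OK, 2 CRITICAL, 3 UNKNOWN).

```python
import os

OK, CRITICAL, UNKNOWN = 0, 2, 3  # standard Nagios plugin exit codes


def check_pending_updates(path):
    """Return (exit_code, message) based on the cron-generated log file."""
    if not os.path.exists(path):
        return UNKNOWN, f"UNKNOWN: {path} not found"
    with open(path, encoding="utf-8") as handle:
        # Count non-empty lines; each is assumed to name one pending update.
        pending = [line.strip() for line in handle if line.strip()]
    if pending:
        return CRITICAL, f"CRITICAL: {len(pending)} security update(s) pending"
    return OK, "OK: no security updates pending"
```

A thin wrapper would print the message and exit with the returned code. Because the check only reads a local file, it returns almost instantly and no longer depends on the update servers being reachable within the NRPE timeout.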

How do you ensure your Rails server is running

What is the common approach to make sure that a Rails server is auto-restarted after a serious crash or a process kill? How do you deal with hanging processes? I have nginx and thin running on my production server. Would you suggest putting something in between them, or using another server?
Firstly:
You should identify the cause of a process hang or kill. These are not normal behaviours and indicate a fault somewhere.
Look for:
Insufficient memory or high load before a crash - indicates a configuration problem.
Versions of nginx that are too new.
If you're virtualising, this can cause a number of subtle problems with Linux kernels that may cause segfaults. If you're using EC2, use Amazon Linux for your best chance. Ubuntu server is too bleeding edge for this purpose.
In order to do the restarts, I suggest you use monit as this is quick, easy and reliable - it's the normal way to do this.
Lastly, I suggest you set up external monitoring as well using something like Pingdom, as even monit won't catch every type of fault, such as hardware failures.
If you only want to monitor an application, I always use Nagios with Centreon. You can set up email alerts for when your Rails server is down. You have to set up NRPE on every machine you want to monitor.
When an error is detected you can run a bash file to kill hanging processes and restart the server automatically. Personally, I never use that because a crash means something has gone wrong, so I do it manually in order to check everything.
Take a look here: http://www.centreon.com/
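The detect-and-restart idea mentioned in these answers can be illustrated with a minimal watchdog sketch. This is not how monit or Nagios work internally; it only shows the pattern. The function names, the TCP port probe, and the restart command are assumptions.

```python
import socket
import subprocess


def port_is_open(host, port, timeout=5):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def ensure_running(is_alive, restart_cmd, run=subprocess.run):
    """Run restart_cmd when is_alive() is False; return True if a restart ran."""
    if is_alive():
        return False
    run(restart_cmd, check=True)
    return True
```

Run periodically with is_alive set to a probe of the port thin listens on, this restarts the server when it stops accepting connections. A hung process that still accepts connections would need an HTTP-level probe instead, which is part of why a dedicated tool is the usual choice.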
