Detecting data/node partition errors - Erlang

The last time I saw a data/node partition error, it was because I launched the Erlang shell, which connected to the node on the same machine via the shared cookie. Immediately after startup, the shell dumped the partition error to the screen. This is terribly bothersome...
how do I trap this exception?
how do I repair the exception programmatically? (asked in another question)
how do I prevent this exception?
[update] I have two boxes running my Yaws application. The databases are replicated via Mnesia's extra_db_nodes feature. At some point after the servers are running, I log into one of the boxes and launch 'erl' with a different sname and the same cookie so that the three nodes can communicate. Shortly after the shell stabilizes and the prompt is displayed, a complex tuple is dumped on the screen indicating a network partition error. This message appears to be a console dump rather than an exception that could be trapped by my Yaws applications, but I want my Yaws applications to detect the error and take corrective action.
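Mnesia reports partitions as system events rather than exceptions, so one way for the application itself to catch them is to subscribe to Mnesia's system events and watch for inconsistent_database. A minimal sketch, with the module name and the corrective action left as placeholders:

-module(partition_watcher).
-export([start_link/0]).

start_link() ->
    {ok, spawn_link(fun init/0)}.

init() ->
    {ok, _Node} = mnesia:subscribe(system),
    loop().

loop() ->
    receive
        {mnesia_system_event, {inconsistent_database, Context, Node}} ->
            %% Context is running_partitioned_network or
            %% starting_partitioned_network.
            error_logger:error_msg("Partition detected with ~p: ~p~n",
                                   [Node, Context]),
            %% Corrective action goes here, e.g. restart Mnesia and
            %% reload tables from a node you choose as master.
            loop();
        _Other ->
            loop()
    end.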

Related

What happens if an IIS application worker process hangs?

I am totally new to web programming. I am working on an already-implemented ASP.NET MVC application deployed in IIS. This app is bound to an application pool which has only one worker process. At the moment, I am trying to understand what happens if the worker process freezes/hangs due to an uncontrolled exception thrown by app code. Could someone explain this to me?
What we have observed is that when this happens, the application stops working correctly and we need to restart its application pool for the app to work correctly again. After observing this behavior, I have a doubt. In the application pool's advanced configuration, under Process Model, the ping maximum response time (seconds) is set to 90. As far as I know, when the application pool pings the worker process and it does not respond because it is hung, the worker process should terminate after 90 seconds. But it seems it is not terminating, because when this happens we need to restart the application pool for the app to work again. So why, in this case, does the worker process not terminate?
First off, you have "only" one worker process and should probably keep it that way. Oftentimes web gardening causes more issues than it helps, particularly with .NET apps. Second, you say it freezes/hangs due to an "uncontrolled" (unhandled?) exception thrown by app code. Why do you think this is the case? Do you have an error page or something indicating it's an exception? The "ping" process checks whether the process is still doing work, but not necessarily finishing requests. So from the perspective of WAS, IIS is still responding.
If you want to troubleshoot, you could investigate by capturing a memory dump with DebugDiag and running its automated analysis on it: https://support.microsoft.com/en-us/help/919792/how-to-use-the-debug-diagnostics-tool-to-troubleshoot-a-process-that-h

Distributed Erlang - network split recovery and using heart with distributed applications

I have a standard situation: two distributed Erlang nodes, one master, one standby.
When I stop the master, the standby takes over (failover); when I start the master again, the standby stops (takeover). Everything works fine as long as heart is not turned on and there is no network split.
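For reference, a sketch of the kernel configuration such a master/standby pair typically uses (application and node names are illustrative):

%% sys.config -- myapp fails over from master@host1 to standby@host2
%% after 5000 ms; each node lists the other under sync_nodes_optional.
[{kernel,
  [{distributed, [{myapp, 5000, ['master@host1', 'standby@host2']}]},
   {sync_nodes_optional, ['standby@host2']},
   {sync_nodes_timeout, 30000}]}].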
However, when I disconnect the master from the network, after 60 seconds or so the standby gives me the error message ** removing (timedout) connection ** and starts up as if the master node had stopped. This makes sense to me: it doesn't know whether the master is alive or not, and since the connection to the master node has timed out, the master is removed from the nodes() list. Let's pretend for a moment that this is the desired outcome.
The problem is that when the connection is restored, I have master and standby running at the same time, and the standby is oblivious to the fact that the master is running. Pinging the standby during the master's init does not solve the issue. I checked nodes() on the standby after doing so; it sees the master node but still continues to run.
My solution for now has been to create a process that monitors all nodes above it in the hierarchy and, if any of them are online (can be pinged), calls erlang:halt() to terminate the standby node. It works for simple situations, but maybe someone can tell me if there is a better way? I found a similar problem described on the Elixir forum, so it is probably a known Erlang problem without an easy solution: https://elixirforum.com/t/distributed-application-network-split/10007
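A bare-bones sketch of such a monitor (module name, intervals, and the message handling are illustrative):

-module(split_guard).
-export([start_link/1]).

%% Superiors: node names that outrank this node in the failover hierarchy.
start_link(Superiors) ->
    {ok, spawn_link(fun() -> init(Superiors) end)}.

init(Superiors) ->
    ok = net_kernel:monitor_nodes(true),  %% delivers {nodeup, N} / {nodedown, N}
    loop(Superiors).

loop(Superiors) ->
    %% Probe actively as well, in case a reconnect happens silently.
    case [N || N <- Superiors, net_adm:ping(N) =:= pong] of
        []      -> ok;
        [_ | _] -> erlang:halt()          %% a higher-ranked node is back
    end,
    receive
        {nodeup, Node} ->
            case lists:member(Node, Superiors) of
                true  -> erlang:halt();
                false -> ok
            end;
        {nodedown, _Node} ->
            ok
    after 5000 ->
        ok
    end,
    loop(Superiors).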
If you don't want two nodes running in parallel during a network split, I'm guessing an outside monitoring application needs to be used?
The second major issue is heart. If heart is turned on as-is, the failover never happens. If heart runs with a sleep before it calls start, it stops the failover node when it calls application start. So even when it can't start the master, for example due to not having access to vital resources, it stops the failover node and doesn't bring it back up after it fails to start the master. I don't know whether heart is simply not supposed to be used with a distributed application, or whether there is an option to run some Erlang code that checks that the resources are available before attempting to start the node, and before stopping the failover node?
The documentation on heart is not great, and it is very hard to find any examples of HEART_COMMAND. I found a way to set the HEART_COMMAND to a script from within my application, but there is a limit to how long the argument can be, and from what I can tell it's not as long as stated in the documentation. The following, for example, sets a sleep timer for 60 seconds before calling application start again. It doesn't solve any issues, because after 60 seconds it stops the failover node and hangs if the master node can't start:
heart:set_cmd("sleep 60; ./bin/myapp start")
The solution I've ended up with for now is letting heart of the main release start another release, a pre-loader, which does a preliminary check that all resources are available. If they are, it starts the main release/application; if they are not, it keeps checking forever. This way the main app runs on the failover node without interruption. So the main release has heart turned on, and the pre-loader does not. I ended up using a bash script file because I needed to do more work than I could fit into heart:set_cmd/1, so I'm still calling heart:set_cmd(Dir ++ "/priv/myHeartScript.sh " ++ Arg1 ++ " " ++ Arg2), but don't get carried away with the args, as there is a limit on their size! I also used environment variables, set in vm.args using -env, to pass data to the script, such as the pre-loader path/name. This let me avoid having to edit the script during deployment.
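A stripped-down sketch of the pre-loader idea (module name, file path, and the start command are illustrative, not from the actual release):

-module(preloader).
-export([start/0]).

start() ->
    wait_for_resources(),
    %% Hand off to the main release once everything is available.
    os:cmd("./bin/myapp start").

wait_for_resources() ->
    case resources_available() of
        true  -> ok;
        false ->
            timer:sleep(10000),
            wait_for_resources()
    end.

%% Stub: replace with real checks (database reachable, files present, ...).
resources_available() ->
    filelib:is_file("/var/run/myapp/ready").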
If anyone has a better solution PLEASE let me know.
UPDATE
The team at Erlang Solutions was kind enough to shed some light on the subject. Basically, nobody they know uses Erlang's built-in distributed model. Everything revolves around the data, and as long as it is available on redundant databases, you can spin up new applications at any time. They recommend using cloud hosts that can spin up new servers when one goes down, or using a redundant node design: have, say, five nodes up in parallel, and if a few go down you can restart them manually or by other means.
As for me, I can say that getting heart to start a pre-loader release/app gets the job done, but it gets complicated fast. Launching the app now requires provisioning several extra sys.config/vm.args/rebar.config files. I will be looking into their suggestions for the next iteration.
UPDATE
Moved away from using the Erlang distributed model. Using RabbitMQ to send heartbeats to all nodes, including itself. If a node is receiving heartbeats from itself and no other node, it's the master; if it is receiving more than one, use any attribute, like the node name, to choose the master. You don't have to use RabbitMQ, but you need to make sure all nodes can reach the same destination and consume from it.
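A rough illustration of that election rule (the module is hypothetical, and the heartbeat transport is left out):

-module(heartbeat_election).
-export([role/1]).

%% HeartbeatNodes: node names whose heartbeats were consumed in the last
%% window, always including this node's own. Any deterministic attribute
%% works; here the lexically smallest node name wins.
role(HeartbeatNodes) ->
    case lists:min(HeartbeatNodes) of
        Self when Self =:= node() -> master;
        _Other                    -> standby
    end.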
Also, DevOps oppose using heart. They prefer to use standard Linux tools to monitor application status and restart it after a crash or a server reboot.

How to debug Neo4J stalls?

I have a Neo4J server running that periodically stalls for tens of seconds. The web frontend will say it's "disconnected" with the red warning bar at the top, and a normally instant REST query in my application will apparently hang until the stall ends; then everything returns to normal, the web frontend becomes usable, and my REST query completes fine.
Is there any way to debug what is happening during one of these stall periods? Can you get a list of currently running queries? Or a list of what hosts are connected to the server? Or any kind of indication of server load?
Most likely this is JVM garbage collection kicking in because you haven't allocated enough heap space.
There are a number of ways to debug this. You can, for example, enable GC logging (uncomment the appropriate lines in neo4j-wrapper.conf), or use a profiler (e.g. YourKit) to see what's going on and why the pauses happen.

Error creating the GCE VMs or starting Dataflow

I'm getting the following error in the recent jobs I'm trying to submit:
2015-01-07T15:51:56.404Z: (893c24e7fd2fd6de): Workflow failed.
Causes: (893c24e7fd2fd601):
There was a problem creating the GCE VMs or starting Dataflow on the VMs so no data was processed. Possible causes:
1. A failure in user code on in the worker.
2. A failure in the Dataflow code.
Next Steps:
1. Check the GCE serial console for possible errors in the logs.
2. Look for similar issues on http://stackoverflow.com/questions/tagged/google-cloud-dataflow.
There are no other errors.
What does this error mean?
Sorry for the trouble.
The Dataflow service starts up VM instances and then launches an agent on those VMs. Those agents then do the heavy lifting of executing your code (e.g. ParDos, reading and writing your data).
The error indicates the job failed because no agents were requesting work. As a result, the service marked the job as a failure because it wasn't making any progress and never would since there weren't any agents to process your data.
So we need to figure out where in the agent startup process things failed.
The first thing to check is whether the VMs actually started. When you run your job, do you see any VMs created in your project? It might take a minute or two for the VMs to start up, but they should appear shortly after the runner prints the message "Starting worker pool setup". The VMs should be named something like
<PREFIX-OF-JOB-NAME>-<TIMESTAMP>-<random hexadecimal number>-<instance number>
Only a prefix of the job name is used to ensure we don't exceed GCE name limits.
If the VMs start up, the next thing to do is inspect the worker logs for errors indicating problems launching the agent.
The easiest way to access the logs is through the UI. Go to the Google Cloud Console and select the Dataflow option in the left-hand frame. You should see a list of your jobs; click on the job in question. This should show you a graph of your job. On the right side you should see a "view logs" button; click that. You should then see a UI for navigating the logs, where you can look for errors.
The second option is to look for the logs on GCS. The location to look for is:
gs://PATH TO YOUR STAGING DIRECTORY/logs/JOB-ID/VM-ID/LOG-FILE
You might see multiple log files. The one we are most interested in is the one that starts with "start_java_worker". If that log file doesn't exist, then either the worker didn't make enough progress to actually upload it, or there was a permission problem uploading it.
In that case the best thing to do is to try to ssh into one of the VMs before it gets torn down. You should have about 15 minutes before the job fails and the VMs are deleted.
Once you login to the VM you can find all the logs in
/var/log/dataflow/...
The log we care most about at this point is:
/var/log/dataflow/taskrunner/harness/start_java_worker-SOME ID.log
If there is a problem starting the code that runs on the VM that log should tell us. That log and the other logs should also tell us if there is a permission problem that prevents the code running on the worker from being able to access Dataflow.
Please take a look and let us know if you find anything.
Apart from Jeremy Lewi's great answer, I would like to add that I've seen this error appear when you don't enable the proper Google APIs in the Developers Console, as mentioned here, which leads to a permission issue, like Jeremy said.

Exit the VM when an application stops running

I've got an Erlang application packaged with Rebar that's meant to be run as a service. It clusters with other instances of itself.
One thing I've noticed is that if the application crashes on one node, the Erlang VM remains up even when the application reaches its supervisor's restart limit and vanishes forever. The result is that other nodes in the cluster don't notice anything until they try to talk to the application.
Is there a simple way to link the VM to the root supervisor, so that the application takes down the whole VM when it dies?
When starting your application with application:start/2, you can pass the optional Type parameter as one of the atoms permanent, transient, or temporary. I guess you are looking for permanent.
As mentioned in the documentation for application:start/2:
If a permanent application terminates, all other applications and the entire Erlang node are also terminated.
If a transient application terminates with Reason == normal, this is reported but no other applications are terminated. If a transient application terminates abnormally, all other applications and the entire Erlang node are also terminated.
If a temporary application terminates, this is reported but no other applications are terminated.
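For example, where myapp is a placeholder for your application's name:

%% If myapp later terminates for any reason, the whole node exits too.
ok = application:start(myapp, permanent).

%% On modern OTP you can also start its dependencies in the same call:
{ok, _Started} = application:ensure_all_started(myapp, permanent).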
