I'm writing a few tests in ExUnit to illustrate how different Supervisor strategies work. I had planned to test results by intentionally causing spawned processes to fail, and then testing the restarted processes' output.
Thus far I have been unsuccessful in creating passing tests, as the initial process failure causes the test to fail. I have tried capturing the errors (try/catch) in both the Supervisor/GenServer implementation and the test implementation, but I have not been able to capture them, and the test still fails.
Is there any way to capture these errors so that they do not trigger a test failure?
Is there a better/different means of testing different supervisor strategies?
Thanks!
You need to be careful with your links. When you start a supervisor, it is linked to the current process, so if you crash the supervisor (or any other linked process), it will cause the test process to crash too.
You can change this behaviour by setting Process.flag(:trap_exit, true). With exits trapped, links won't trigger crashes; instead you will find messages of the format {:EXIT, pid, reason} in your mailbox.
This is a fine approach for testing, but for production, or in general, you likely want to set up some sort of monitor.
Because I was intentionally causing a process to fail and wanted to ignore this failure within the ExUnit test, I ended up using catch_exit/1 to prevent the test process from failing.
I have a Dataflow job that is not making progress - or it is making very slow progress, and I do not know why. How can I start looking into why the job is slow / stuck?
The first resource that you should check is Dataflow documentation. It should be useful to check these:
Troubleshooting your Pipeline
Common error guidance
If these resources don't help, I'll try to summarize some reasons why your job may be stuck, and how you can debug it. I'll separate these issues depending on which part of the system is causing the trouble. Your job may be:
Job stuck at startup
A job can get stuck being received by the Dataflow service, or starting up new Dataflow workers. Some risk factors for this are:
Did you add a custom setup.py file?
Do you have any dependencies that require a special setup on worker startup?
Are you manipulating the worker container?
To debug this sort of issue I usually open StackDriver logging and look for the worker-startup logs. These logs are written by the worker as it starts up a docker container with your code and your dependencies. If you see any problem here, it would indicate an issue with your setup.py, your job submission, staged artifacts, etc.
Another thing you can do is to keep the same setup, and run a very small pipeline that stages everything:
import logging

import apache_beam as beam

with beam.Pipeline(...) as p:
    (p
     | beam.Create(['test element'])
     | beam.Map(lambda x: logging.info(x)))
If you don't see your logs in StackDriver, then you can continue to debug your setup. If you do see the log in StackDriver, then your job may be stuck somewhere else.
Job seems stuck in user code
Something else that could happen is that your job is performing some operation in user code that is stuck or slow. Some risk factors for this are:
Is your job performing operations that require you to wait for them? (e.g. loading data to an external service, waiting for promises/futures)
Note that some of the builtin transforms of Beam do exactly this (e.g. the Beam IOs like BigQueryIO, FileIO, etc).
Is your job loading very large side inputs into memory? This may happen if you are using View.AsList for a side input.
Is your job loading very large iterables after GroupByKey operations?
A symptom of this kind of issue can be that the pipeline's throughput is lower than you would expect. Another symptom is seeing the following line in the logs:
Processing stuck in step <STEP_NAME>/<...>/<...> for at least <TIME> without outputting or completing in state <STATE>
.... <a stacktrace> ....
In cases like these it makes sense to look at which step is consuming the most time in your pipeline, and inspect the code for that step, to see what may be the problem.
Some tips:
Very large side inputs can be troublesome, so if your pipeline relies on accessing a very large side input, you may need to redesign it to avoid that bottleneck.
It is possible to have asynchronous requests to external services, but I recommend that you commit / finalize work on startBundle and finishBundle calls.
If your pipeline's throughput is not what you would normally expect, it may be because you don't have enough parallelism. This can be fixed with a Reshuffle, by sharding your existing keys into subkeys (Beam often does processing per key, so if you have too few keys, your parallelism will be low), or by using a Combiner instead of GroupByKey + ParDo (see the sketch after these tips).
Another reason your throughput may be low is that your job spends too long waiting on external calls. You can address this with batching strategies or asynchronous IO.
In general, there's no silver bullet for improving your pipeline's throughput, and you'll need to experiment.
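As a rough illustration of the parallelism and batching tips above, here is a minimal sketch using the Beam Java SDK. The data and names (clicksPerUser, user IDs, the batch size of 100) are made up for the example; adapt them to your own pipeline.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.coders.VarLongCoder;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.GroupIntoBatches;
import org.apache.beam.sdk.transforms.Reshuffle;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class ThroughputSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // Toy input: one element per click, keyed by user.
    PCollection<KV<String, Long>> clicksPerUser =
        p.apply(Create.of(KV.of("user-1", 1L), KV.of("user-2", 1L), KV.of("user-1", 1L))
            .withCoder(KvCoder.of(StringUtf8Coder.of(), VarLongCoder.of())));

    // Prefer a Combiner over GroupByKey + ParDo: the combine runs partially
    // before the shuffle, so a hot key never materializes one huge iterable
    // on a single worker.
    PCollection<KV<String, Long>> totals = clicksPerUser.apply(Sum.<String>longsPerKey());

    // If a slow step is fused to a low-parallelism producer, a Reshuffle
    // redistributes its input across workers.
    PCollection<KV<String, Long>> redistributed = totals.apply(Reshuffle.viaRandomKey());

    // For slow external calls, group elements into batches first so that each
    // call to the external service handles many elements at once.
    PCollection<KV<String, Iterable<Long>>> batched =
        clicksPerUser.apply(GroupIntoBatches.<String, Long>ofSize(100));

    p.run().waitUntilFinish();
  }
}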
Data freshness or system lag is increasing
First of all, I'd recommend you check out this presentation on watermarks.
For streaming, the advance of the watermarks is what drives the pipeline to make progress; thus, it is important to watch for things that could hold the watermark back and stall your pipeline downstream. Some reasons why the watermark may become stuck:
One possibility is that your pipeline is hitting an unresolvable error condition. When a bundle fails processing, your pipeline will continue to attempt to execute that bundle indefinitely, and this will hold the watermark back.
When this happens, you will see errors in your Dataflow console, and the error count will keep climbing as the bundle is retried.
You may have a bug when associating the timestamps to your data. Make sure that the resolution of your timestamp data is the correct one!
Although unlikely, it is possible that you've hit a bug in Dataflow. If neither of the other tips helps, please open a support ticket.
I have inherited a system that consists of a couple daemons that asynchronously process messages. I am trying to find a clean way to introduce integration testing into this system with minimal impact/risk on the existing programs. Here is a very simplified overview of their responsibilities:
Process 1 polls a queue for messages, and inserts a row into a DB for each one it dequeues.
Process 2 polls the DB for rows inserted by Process 1, does some calculations, and then deposits a file into a directory on the host and sends an email.
These processes are quite old and complex, and I am strongly inclined to avoid modifying them in any way. What I would like to do is put each of them in a container, and also stand up the dependencies (queue, DB, mail server) in other containers. This part is straightforward, but what I'm unsure about is the best way to orchestrate these tests. Since these processes consume and generate output asynchronously I will need to poll or wait for the expected outcome (mail sent, file created).
Normally I would just write a series of tests in a single test suite of my language of choice (Java, Go, etc), and make the setUp / tearDown hooks responsible for resetting the environment to the desired state. But because these processes have a lot of internal state I am afraid I cannot successfully "clean up" properly after each distinct test. This would be a problem if, for example, one test failed to generate the desired output in a specific period of time so I marked it as failed, but a subsequent test falsely got marked as passed because the original test case actually did output something (albeit much slower than anticipated) that was mistakenly attributed to the subsequent test. For these reasons I feel I need to recreate the world between each test.
In order to do this the only options I can see are:
Use a shell script to actually run my tests -- having it bring up the containers, execute a single test file, and then terminate my containers for each test.
Follow my usual pattern of setUp / tearDown in my existing test framework but call out to docker to terminate and start up the containers between each test.
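Roughly, I imagine option 2 looking something like the following JUnit sketch (assuming a docker-compose.yml that stands up the queue, DB, mail server, and both daemon containers; all names here are just placeholders):

import org.junit.After;
import org.junit.Before;
import org.junit.Test;

public class PipelineIntegrationTest {

    // Recreate the whole environment around every test so that no state
    // (queue contents, DB rows, generated files, sent mail) leaks between cases.
    @Before
    public void setUp() throws Exception {
        run("docker", "compose", "up", "-d", "--force-recreate");
    }

    @After
    public void tearDown() throws Exception {
        run("docker", "compose", "down", "-v");
    }

    @Test
    public void messageOnQueueEventuallyProducesFileAndEmail() throws Exception {
        // enqueue a test message, then poll with a deadline for the expected
        // file in the output directory and the expected mail on the mail server
    }

    private static void run(String... command) throws Exception {
        Process process = new ProcessBuilder(command).inheritIO().start();
        if (process.waitFor() != 0) {
            throw new IllegalStateException("command failed: " + String.join(" ", command));
        }
    }
}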
Am I missing another option? Is there some kind of existing framework or pattern used for this sort of testing?
I have written a RabbitMQ consumer by implementing the MessageListener interface and setting up a SimpleMessageListenerContainer. It's working well when I test it manually. Now I would like to write an integration test that:
Creates a message
Pushes the message to my RabbitMQ server
Waits while the message is consumed by my MessageListener implementation
Makes some assertions once everything is done
However, since my MessageListener runs in a separate thread, this is difficult to test. Using a Thread.sleep in my test to wait for the MessageListener is unreliable; I need some kind of blocking approach.
Is setting up a response queue and using rabbitTemplate.convertSendAndReceive my only option? I wanted to avoid setting up response queues, since they won't be used in the real system.
Is there any way to accomplish this using only rabbitTemplate.convertAndSend and then somehow waiting for my MessageListener to receive the message and process it? Ideally, I would imagine something like this:
rabbitTemplate.convertAndSend("routing.key", testObject);
waitForListener() // Somehow wait for my MessageListener to consume the message
assertTrue(...)
assertTrue(...)
I know I could just pass a message directly to my MessageListener without connecting to RabbitMQ at all, but I was hoping to test the whole system if it is possible to do so. I plan on falling back to that solution if there is no way to accomplish my goal in a reasonably clean way.
There are several approaches, the easiest being to wrap your listener and pass in a CountDownLatch which is counted down by the listener, while the main test thread uses
assertTrue(latch.await(10, TimeUnit.SECONDS));
You can also pass back the actual message received so you can verify it is as expected.
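For example, a rough sketch of that wrapper might look like this (assuming Spring AMQP's org.springframework.amqp.core.MessageListener, and a production listener called MyMessageListener; the names are illustrative):

import java.util.concurrent.CountDownLatch;

import org.springframework.amqp.core.Message;
import org.springframework.amqp.core.MessageListener;

public class LatchingMessageListener implements MessageListener {

    private final MessageListener delegate;
    private final CountDownLatch latch;
    private volatile Message lastMessage;

    public LatchingMessageListener(MessageListener delegate, CountDownLatch latch) {
        this.delegate = delegate;
        this.latch = latch;
    }

    @Override
    public void onMessage(Message message) {
        try {
            delegate.onMessage(message);   // run the real listener logic
        } finally {
            this.lastMessage = message;    // keep the received message for assertions
            latch.countDown();             // release the waiting test thread
        }
    }

    public Message getLastMessage() {
        return lastMessage;
    }
}

The test then sets this wrapper on the SimpleMessageListenerContainer instead of the bare listener, sends with convertAndSend, and waits on the latch (the 10-second timeout is just an example):

CountDownLatch latch = new CountDownLatch(1);
LatchingMessageListener wrapper = new LatchingMessageListener(new MyMessageListener(), latch);
container.setMessageListener(wrapper);

rabbitTemplate.convertAndSend("routing.key", testObject);
assertTrue(latch.await(10, TimeUnit.SECONDS));
// assert on wrapper.getLastMessage() or on the listener's side effects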
Also see the integration test cases in the framework itself.
I am building an educational service, and it is a rather process-heavy application. I have a lot of user actions triggering various follow-up actions, some immediate, some in the future.
For example, when a student completes a lesson for his day, the following should happen:
updating progress count for his user-module record
checking if he has completed a particular module and progressing him to the next one (which in turn triggers more actions)
triggering current emails to other users
triggering FUTURE emails to himself (ongoing lesson plans)
creating a range of other objects (grading todos by teachers)
any other special case events
All these triggers are built into the observers of various objects, and the execution delayed using Sidekiq.
What is killing me is the testing, and the paranoia that I might be breaking something whenever I push a change. In the past, I did a lot of assertion and validation checks, and they were sufficient. For this project, I don't think that is enough, given the elevated complexity.
As such, I would like to implement a testing framework, but after reading through the various options (Rspec, Cucumber), it is not immediately clear what I should be investing my effort into, given my rather specific needs, especially for the observers and scheduled events.
Any advice and tips on what approach and framework would be the most appropriate? Would probably save my ass in the very near future ;)
Not that it matters, but I am using Rails 3.2 / Mongoid. Happy to upgrade if it works.
Testing can be a very subjective topic, with different approaches depending on the problems at hand.
I would say that given your primary need for testing end-to-end processes (often referred to as acceptance testing), you should definitely check out something like cucumber or steak. Both allow you to drive a headless browser and run through your processes. This kind of testing will catch any big show stoppers and allow you to modify the system and be notified of breaks caused by your changes.
Unit testing, although very important and something that should always be used alongside acceptance tests, isn't for end-to-end testing; it's primarily for testing the output of specific methods in isolation.
A common pattern to use is called Test Driven Development (TDD). In this, you write your acceptance tests first, in the "outer" test loop, and then code your app with Unit tests as part of the "inner" test loop. The idea being, when you've finished the code in the inner loop, then the outer loop should also pass, and you should have built up enough test coverage to have confidence that any future changes to the code will either pass/fail the test depending on if the original requirements are still met.
And lastly, a test suite is something that should grow and change as your app does. You may find that whole sections of your test suite can (and maybe should) be rewritten depending on how the requirements of the system change.
Unit testing is a must. You can use RSpec or Test::Unit for that. It will give you at least 80% confidence.
Enable "render views" for controller specs. You will catch syntax errors and simple logical errors faster that way. There are ways to test Sidekiq jobs. Have a look at this.
Once you are confident that you have enough unit tests, you can start looking into using cucumber/capybara or rspec/capybara for feature testing.
We have a service to pick up custom tests in XML and convert those to CodedUI tests. We then start a process for MSTest to load the tests into the Test Controller, which then distributes the tests across various Agents. We run regression tests at night, so no one is around to fix a system if something goes wrong. When certain exceptions occur in the test program, it pops open an error window and no more tests can run on the system. Subsequent tests are loaded onto the agent and fail immediately because they cannot perform their assigned tasks. Thousands of tests that should take all night on multiple systems now fail in minutes.
We can detect that an error occurred by how quickly a test is returned, but we don't know how to disable the agent so as not to allow it to pick up any more tests.
addendum:
If the test has failed so miserably that no more tests can attempt a successful run (as noted, we may not have an action to handle some, likely new, popup), then we want to disable that agent as no more tests need to run on it: they will all fail. As we have many agents running concurrently, if one fails (and gets disabled), the load can still be distributed without a long string of failures. These other regression tests can still have a chance to succeed (everything works) or fail (did we miss another popup, or is this an actual regression failure).
2000 failures in 20 seconds doesn't tell us anything except that one system had a problem no one realized it could have, and now we have wasted a whole night of tests. 2 failures (1 natural, 1 caused by the issue from the previous failure) and 1 system down means the total night's run might be extended by an hour or two, and we have useful data on how to start the day: fix 1 test and rerun both failures.
One would need to abort the test run in that case. If you are running MSTest yourself, you would need to inject a ^C into the command-line process. But: if no one is around to fix it, why does it matter that the subsequent tests fail? If it's just to see quickly which test was the cause of the error, why not generate a CodedUI check to see if the message box is there and mark the test inconclusive with Assert.Inconclusive. The causing test would stand out like a flag.
If you can detect the point at which you want to disable the agent, then you can disable it by running "TestAgentConfig.exe delete", which will reset the agent to an unconfigured state.