Quality Control / Log Monitoring - monitoring

One of the articles I really enjoyed reading recently was Quality Control by Last.FM. In the spirit of that article, I was wondering whether anyone else has favorite monitoring setups for web applications. Or, if you don't believe in log monitoring, why not?
I'm looking for a mix of opinion and experience here.

We get a bunch of email/pager alerts from an older host/app/network monitoring environment that get gradually more abusive depending on severity of the problem/time taken to respond. Fortunately we all have thick skins and very broad senses of humour. :)

We use log4net, and normally write both to log files and the database. However, when we've been tracking down a particularly difficult problem, we've enabled the email appender, so that critical log messages went straight to a developer's email account. This allowed us to figure out what was happening more immediately.
In addition, our infrastructure team has several tools they use to monitor system uptime, event logs, etc., to give them early warning when something is about to go down. We've also helped them implement custom monitoring scripts that test specific functionality of our code.
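The email appender approach above is specific to log4net, but the idea carries over to other stacks. Here is a minimal sketch in Ruby, assuming the standard library Logger; the FanOutLogger class and the escalation callback are hypothetical names for illustration, not part of log4net or of any library. Every message goes to all destinations, and only messages at or above a chosen severity are handed to a callback that could send an email.

```ruby
require "logger"
require "stringio"

# Hypothetical helper: forwards every message to all wrapped loggers and
# hands messages at or above `escalate_at` to a callback (for example,
# one that emails a developer).
class FanOutLogger
  def initialize(loggers, escalate_at: Logger::ERROR, &on_escalate)
    @loggers = loggers
    @escalate_at = escalate_at
    @on_escalate = on_escalate
  end

  def add(severity, message)
    @loggers.each { |logger| logger.add(severity, message) }
    @on_escalate&.call(severity, message) if severity >= @escalate_at
  end

  def info(message)
    add(Logger::INFO, message)
  end

  def fatal(message)
    add(Logger::FATAL, message)
  end
end

file_like = StringIO.new # stands in for a log file
escalated = []           # stands in for an outgoing mail queue

log = FanOutLogger.new([Logger.new(file_like)], escalate_at: Logger::FATAL) do |_severity, message|
  escalated << message
end

log.info("request served")
log.fatal("payment queue unreachable")
```

Both messages end up in the log destination, while only the fatal one reaches the escalation list. Keeping escalation behind a severity threshold is what makes it practical to switch on temporarily while chasing a difficult problem.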

Related

How to set up alerts for memory usage in Heroku?

That, basically. I have a Rails 5 application, but on a more general level I'm trying to find a way to get email alerts when my dynos reach a certain memory usage threshold. I'm using one web dyno and one worker dyno. I can't find anything.
Heroku has an 'Application metrics' page which can also alert you on some conditions. However there is no option for alerting for memory usage. We have asked Heroku support and this is their answer:
Yes, unfortunately there's not a built-in way to receive alerts about
memory usage. However, you could potentially set up your Papertrail
instance to alert you if your app emits an R14 or an R15. Keep in mind
that those are cases where it might be too late to take any corrective
action on memory as it is already alerting a disruption to the app.
For more granularity, you'll need to enable the log-runtime-metrics
lab so that your memory usage is printed to your application logs:
https://devcenter.heroku.com/articles/log-runtime-metrics.
You can also use a log drain to parse, chart, and setup alerts for
those errors. Librato is the most common tool I see customers using
for that kind of workflow: https://elements.heroku.com/addons/librato.
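As a rough illustration of what a log drain consumer could do with those runtime metrics, here is a minimal Ruby sketch. It assumes the log lines carry a source=... field and a sample#memory_total=...MB field, as described in the log-runtime-metrics article; the sample line, the function names and the thresholds are made up for the example.

```ruby
# Matches the dyno name and the memory_total sample (in MB) in one log line.
# The exact line layout is an assumption based on the log-runtime-metrics docs.
METRIC_PATTERN = /source=(?<source>\S+).*?sample#memory_total=(?<mb>\d+(?:\.\d+)?)MB/

# Returns the dyno name and memory in MB, or nil if the line has no sample.
def parse_memory_sample(line)
  match = METRIC_PATTERN.match(line)
  return nil unless match

  { source: match[:source], memory_total_mb: match[:mb].to_f }
end

# True when the line carries a memory sample at or above the limit.
def over_threshold?(line, limit_mb)
  sample = parse_memory_sample(line)
  !sample.nil? && sample[:memory_total_mb] >= limit_mb
end

# Made-up sample line for illustration.
line = "source=web.1 dyno=heroku.1234.abcd sample#memory_total=412.50MB sample#memory_rss=400.00MB"

puts "memory alert for #{parse_memory_sample(line)[:source]}" if over_threshold?(line, 400)
```

Alerting on a threshold below the dyno's quota gives some warning before an R14 appears, which is the gap the support answer points out.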
Relying on Papertrail is good enough for us for now, but we are still exploring other options. One option is to have the Rails app report its own memory usage, for example through a system call like this: pmap #{Process.pid} | tail -1. Another option is to use Monit (https://mmonit.com/), but it is not easy to set up and configure, and you will need a custom buildpack to run it on Heroku. Heroku also supports some third-party add-ons for monitoring that can send alerts, such as New Relic.
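The self-reporting option could look roughly like the sketch below. It assumes pmap is available on the dyno and that its last line looks like " total   123456K"; note that this total is the mapped (virtual) size, not resident memory. The function names and the limit are hypothetical.

```ruby
# Parses the last line of `pmap <pid>` output, assumed to look like
# " total          123456K", and returns the size in megabytes.
def parse_pmap_total_mb(total_line)
  kilobytes = total_line[/(\d+)K/, 1]
  return nil unless kilobytes

  kilobytes.to_i / 1024.0
end

# Runs pmap for the current process; returns nil when pmap is unavailable
# or prints nothing usable.
def current_memory_mb
  output = `pmap #{Process.pid} 2>/dev/null`
  parse_pmap_total_mb(output.lines.last.to_s)
rescue Errno::ENOENT
  nil
end

# Returns a warning string when usage is at or above the limit, else nil.
def memory_warning(limit_mb, usage_mb = current_memory_mb)
  return nil if usage_mb.nil? || usage_mb < limit_mb

  format("memory usage %.1f MB exceeds limit of %d MB", usage_mb, limit_mb)
end
```

A periodic job or a piece of middleware could call memory_warning(450) and pass any non-nil result to whatever mailer or logger the app already uses.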
You can't in Heroku itself, but the best solution I've found so far is to use AppSignal.
It lets you set up custom alerts for any metric it tracks, including host memory usage.
AppSignal is a pretty solid APM in general too, so you can ditch New Relic, Scout, or whatever other tool you might be using as well.

Agresso payment creation via acrbatchinput

We're attempting to generate payments in an Agresso 5.5 system. The mechanism we've been told to use is to write new payment data into table acrbatchinput where it will be picked up and processed by a regular job running in agrbibat.dll. We have code that worked on a previous version of Agresso but following the upgrade our payments get rejected by the agrbibat job. Sometimes it generates useful messages in the log, sometimes it doesn't, and working through failures without good information is becoming a bit of a slog.
Is there some documentation we're missing? In particular, it would be useful to have a full list of the validation rules the job uses so we can implement them ourselves rather than trying to infer them from the log. I can't find any; there's not a lot on acrbatchinput on Google. Does this list or some other documentation exist? Is agrbibat something easily decompilable, e.g. .NET?
Thanks. The test system we have is running against Oracle on Solaris with the Agresso jobs hosted on Windows. We have limited access to the Oracle and Agresso systems because (I think!) the same Oracle server is hosting the live payment system, but I could probably talk finance into giving us agrbibat.dll if that might help. We're unlikely to get enough access to their servers to debug it in place.
It turns out that our problem is partly because the new test system we've been given access to wasn't set up correctly, so we might be able to progress this without extra information - we're waiting on the financial team here for input.
However, we're still interested in acrbatchinput or agrbibat documentation or information. You've missed the bounty I set, but ticks, votes and gratitude are still available.
I know this is an ancient question, but here's my response anyway for anyone else who finds it.
The only documentation is the usual Agresso help files from within the desktop client. Meaningful information is only gleaned through trial and error, however!
The required fields differ depending on whether a given record is a GL, AP/AR or tax transaction. (That much, at least, is explained in the help.)
In addition to using the log file, it's often helpful to look at GL07's report output for errors.

Automated testing with Ruby on Rails - best practices

Curious: what are you folks doing as far as automating your unit tests with Ruby on Rails? Do you create a script that runs a rake job in cron and mails you the results? A pre-commit hook in git? Just manual invocation? I understand tests completely, but I'm wondering what the best practices are for catching errors before they happen. Let's take for granted that the tests themselves are flawless and work as they should. What's the next step to make sure potentially detrimental results reach you at the right time?
I'm not sure exactly what you want to hear, but there are a couple of levels of automated codebase control:
While working on a feature, you can use something like autotest to get instant feedback on what's working and what's not.
To make sure that your commits really don't break anything, use a continuous integration server like cruisecontrolrb or Integrity (you can bind these to post-commit hooks in your SCM system).
Use some kind of exception notification system to catch all the unexpected errors that might pop up in production.
To get a more general view of what happened (what the user was doing when the exception occurred), you can use something like Rackamole.
Hope that helps.
If you are developing with a team, the best practice is to set up a continuous integration server. To start, you can run this on any developer's machine. But in general it's nice to have a dedicated box so that it's always up, is fast, and doesn't disturb a developer. You can usually start out with someone's old desktop, but at some point you may want it to be one of the faster machines so that you get an immediate response from tests.
I've used CruiseControl, Bamboo and TeamCity, and they all work fine. In general, the less you pay, the more time you'll spend setting it up. I got lucky and did a full Bamboo setup in less than an hour (once), but expect to spend at least a couple of hours the first time through.
Most of these tools will notify you in some way. The baseline is an email, but many offer IM, IRC, RSS, SMS (among others).
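For a sense of what the smallest version of this looks like without a CI server, here is a Ruby sketch of a script that cron or a git hook could run: it executes the test command, captures the output, and builds a notification subject. The function names are hypothetical, and the actual delivery (email, IM and so on) is left out.

```ruby
require "open3"
require "rbconfig"

# Runs a command, capturing stdout and stderr together, and returns a small
# report hash that a mailer or chat notifier could consume.
def run_suite(*command)
  output, status = Open3.capture2e(*command)
  {
    command: command.join(" "),
    passed: status.success?,
    output: output
  }
end

# Builds a one-line subject suitable for an email or IM notification.
def notification_subject(report)
  state = report[:passed] ? "PASSED" : "FAILED"
  "[tests #{state}] #{report[:command]}"
end

# In a Rails project the command would typically be something like
# run_suite("bundle", "exec", "rake", "test"); a trivial Ruby one-liner
# stands in here so the sketch runs anywhere.
report = run_suite(RbConfig.ruby, "-e", "puts 'simulated failure'; exit 1")
puts notification_subject(report)
```

A dedicated CI server adds history, build queues and richer notifications on top of this, which is why it is usually worth the setup time for a team.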

Keeping applications and infrastructure connected

I work in an IT department that is divided into two groups. One group develops and manages applications; the other manages the company's infrastructure and servers. One of the problems we face is a breakdown in communication. I work for the application group, and one of the problems I have is not being notified when a server is taken down by infrastructure or a database is being refreshed.
Does anyone have suggestions on how to improve communication between the two groups, or any ideas on how to keep a lightweight log across multiple systems (both Linux and Windows)? Ideally it would be nice if we could have our boxes just tweet their statuses or something.
Thanks for the help,
Ben
One thing you could do to communicate server status is to have your infrastructure group set up a network monitoring system like Nagios. This will give everyone in your application group a snapshot view of the status of every server in the system. Having this kind of status is invaluable when you are doing development.
Nagios gives you network monitoring, but also allows you to show scheduled downtime for a particular server in the system.
Another thing your group could do to foster communication with the infrastructure group is to have your build system report which servers it is currently using for building and testing your products.
Also, setting up a regular meeting between stakeholders of both groups is probably a good idea. If you are all talking to each other, even for 15 minutes a week, you'll probably see incidents like the one you described above go down quite a bit.
I think this is a bigger issue of change control.
You should have hardware and software change control and an approval process.
Ultimately, infrastructure serves you - the purpose for IT infrastructure is to run applications.
In my current large financial data company, servers are not TOUCHED without proper authorization through the client and application groups. It seems like a huge pain, but every single server is there for a reason: to meet a specific business goal and run a specific application. There is simply no excuse for the infrastructure group to be changing things or upsetting servers of their own volition.
Response to critical hardware failure might be an exception.
Needed software and OS updates are handled through scheduled maintenance windows and an approved change process.
I like the Nagios idea as well. If you want to set up something that's more of a communication tool, I would recommend a content management system like Drupal.
We use Drupal internally to communicate between teams. When one team takes a server down, they would add an event into Drupal. The rest of us would either get it as an email, an RSS item or just by refreshing the page.
Implement a change control process where changes are submitted, approved and scheduled for BOTH groups. This lets everyone know what is going on. This process can be as light or heavy-weight as you want.

Using MSTest as site/environment monitoring tool

We currently use HP SiteScope for monitoring synthetic transactions across some of our web apps. This works pretty well, except that the licensing cost for each synthetic transaction makes it prohibitive to ensure adequate coverage across our applications.
So, an alternative would be to use SiteScope's URL monitoring, which can basically call a URL and then run some basic checks for certain strings. With that approach, I'd like to create a page that either calls a bunch of pages or tries to tap into an MSTest group somehow to run tests.
In the end, I'd like a set of test cases that can be used against multiple environments to be used for production verification, uptime, status, etc.
Thanks,
Matt
Have you taken a look at System Center Operations Manager 2007?
I'm just getting started, but it appears to do what you are describing in your question.
We are looking to monitor our data center and a web application. From the few things I have found on the web, it is going to fit our needs.
Update
I've since moved to Application Insights. A great overview can be found here: https://azure.microsoft.com/en-us/documentation/articles/app-insights-monitor-web-app-availability/
There are two methods you can use: a simple ping, or a recorded multi-step synthetic user "experience". Basically, you act as a user: using IE and a Visual Studio Web Test project, you record yourself navigating around your site and upload that file to Azure.
For example, I record logging in, navigating a few pages, and then logging out. As long as all of those events happen in a timely manner the site is in a good operating state.
If a test fails, or for example takes too long to respond, I get an email alerting me that something isn't quite right.
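The idea from the question, one set of checks that can run against several environments and collapse into a single string a URL monitor can match, can be sketched independently of any product. The Ruby below is an illustration only: the Check struct, the function names, the example.test URLs and the 5-second limit are all assumptions, and the page fetcher is stubbed so the sketch runs without a network.

```ruby
# One check: a URL path plus the strings its response body must contain.
Check = Struct.new(:path, :expected_strings)

# Runs every check against one environment. `fetch` is any callable that
# takes a full URL and returns the response body; it is injected so the
# same checks can run against a stub or a real HTTP client.
def run_checks(base_url, checks, fetch)
  checks.map do |check|
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    body = fetch.call(base_url + check.path)
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    missing = check.expected_strings.reject { |text| body.include?(text) }
    { path: check.path, ok: missing.empty?, missing: missing, seconds: elapsed }
  end
end

# Collapses results into a single line that a URL monitor can string-match.
def status_line(results, max_seconds: 5.0)
  healthy = results.all? { |r| r[:ok] && r[:seconds] <= max_seconds }
  healthy ? "STATUS: OK" : "STATUS: FAIL"
end

# Stubbed pages stand in for a real site.
pages = {
  "https://staging.example.test/login" => "<h1>Sign in</h1>",
  "https://staging.example.test/home" => "<h1>Oops</h1>"
}
fetch = ->(url) { pages.fetch(url, "") }
checks = [Check.new("/login", ["Sign in"]), Check.new("/home", ["Dashboard"])]

results = run_checks("https://staging.example.test", checks, fetch)
puts status_line(results)
```

Against a real site, the stub could be swapped for something like ->(url) { Net::HTTP.get(URI(url)) } after require "net/http", and the same checks could be pointed at each environment by changing only the base URL.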

Resources