How long can an Azure web role stay active? - upload

My cloud application (web role) uploads over 5000 records into SQL Azure.
The total time for this process takes about 15-19 minutes on my development machine.
Once I deploy to the cloud and try again, it fails after 10-12 minutes without any error message.
My guess is that the web role times out after a certain period of time. Is there a setting for this?

Your web role doesn't time out.
SQL Azure can time out, and your connection can also be killed.
Some reasons for connection termination:
Sessions consuming greater than one million locks are terminated.
Transactions with a log file size > 1 GB are terminated.
The distance from the first or oldest active transaction log sequence number (LSN) to the tail of the log (current LSN) cannot exceed 20% of the size of the log file.
If a transaction locks a resource required by an underlying system operation for more than 20 seconds, it is terminated.
When a session uses more than 5 GB of tempdb space (= 655,360 pages), the session is terminated.
When there is memory contention, sessions consuming more than 16 megabytes (MB) for more than 20 seconds are terminated in descending order of the time the resource has been held, so the oldest session is terminated first.
A database becomes read-only when it reaches its maximum database size. Transactions attempting updates or inserts are terminated when this happens.
http://social.technet.microsoft.com/wiki/contents/articles/sql-azure-connection-management-in-sql-azure.aspx#Reasons

May I suggest you page your upload? Even if it's just to build the record set first (in a cache, queue, table, or blob store rather than in memory) and then execute it against SQL.
At the very least it'll allow you to isolate where the issue is.
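A minimal sketch of the paging idea, written in Python for brevity. It uses the standard-library sqlite3 module purely as a stand-in for SQL Azure. The table name `records`, the batch size of 500, and the helpers `batches` and `upload_in_batches` are assumptions made up for illustration. Each batch is committed in its own short transaction, so no single transaction grows large or holds locks for long, and a failure points at the batch that broke.

```python
import sqlite3
from itertools import islice


def batches(rows, size):
    """Yield successive lists of at most `size` rows from any iterable."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def upload_in_batches(conn, rows, batch_size=500):
    """Insert rows in small transactions; return the number of batches committed."""
    committed = 0
    for chunk in batches(rows, batch_size):
        try:
            conn.executemany(
                "INSERT INTO records (id, payload) VALUES (?, ?)", chunk
            )
            conn.commit()  # one short transaction per batch
        except sqlite3.Error:
            conn.rollback()
            print(f"batch {committed + 1} failed; earlier batches are already saved")
            raise
        committed += 1
    return committed


# An in-memory sqlite3 database stands in for SQL Azure here (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, payload TEXT)")
rows = [(i, f"record-{i}") for i in range(5000)]
batch_count = upload_in_batches(conn, rows)
row_count = conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
print(batch_count, row_count)
```

Against a real database the same loop shape should apply, with the driver's own connection object and error type in place of the sqlite3 ones.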

Related

Neo4J taking out long-lived locks in non-query transaction

In our application we occasionally add around 10,000 nodes and 100,000 relationships to a Neo4J graph over the course of a few minutes, and then DETACH DELETE many of them a few minutes later. Previously the delete query was very quick (<100ms), but after a small change to our data model and some of our other queries (which are not running at the time), it now often blocks for minutes before completing.
While this blocking is happening there are no other queries running, and I have an export from Halin showing all the transactions that are happening at the time. It's difficult to reproduce here, but in summary there are exactly two transactions going on, one of which is my delete query. The delete query is stated to be blocked by the other one, which has 7 locks out, is in the Running state, and has no attached query or client at all. I imagine this means that it's an internal Neo4J process. It has 0 cpu time, and its entire 180s runtime is accounted for by idle time. There's no other information given.
What could be causing this transaction to lock the nodes that I want to delete for such a long time with no queries running?
What I've tried:
Using apoc.periodic.iterate and apoc.periodic.commit to split the query into smaller chunks - the inner queries end up locked
Looking in the query logs - difficult to be sure but I can't see any evidence of the internal transaction
Looking in the debug logs - records of garbage collections (always around 300ms) and some graph algorithms running, but never while this query is blocked, and nothing else relevant
Other info:
Neo4J version: 3.5.18-enterprise (docker)
Cluster mode: HA cluster with 2 nodes (also reproduced with only 1 node)
It turned out that there was a query a few minutes before that had been set going and then the client disconnected (missing await in C#). I still don't quite understand why this caused the observations, but my guess is that Neo4j put the query into a weird state after the client disconnected, and then some part of it ended up waiting for the transaction timeout before releasing its locks.

Sidekiq concurrency and database connections pool

Here is my problem: Each night, I have to process around 50k Background Jobs, each taking an average of 60s. Those jobs are basically calling the Facebook, Instagram and Twitter APIs to collect users' posts and save them in my DB. The jobs are processed by sidekiq.
At first, my setup was:
:concurrency: 5 in sidekiq.yml
pool: 5 in my database.yml
RAILS_MAX_THREADS set to 5 in my Web Server (puma) configuration.
My understanding is:
my web server (rails s) will use max 5 threads hence max 5 connections to my DB, which is OK as the connection pool is set to 5.
my sidekiq process will use 5 threads (as the concurrency is set to 5), which is also OK as the connection pool is set to 5.
To process more jobs at the same time and reduce the overall time needed to process all my jobs, I decided to increase the sidekiq concurrency to 25. In production, I provisioned a Heroku Postgres Standard database with a maximum of 120 connections, to be sure I would be able to use the higher Sidekiq concurrency.
Thus, now the setup is:
:concurrency: 25 in sidekiq.yml
pool: 25 in my database.yml
RAILS_MAX_THREADS set to 5 in my Web Server (puma) configuration.
I can see that 25 sidekiq workers are working, but each job is taking far more time (sometimes more than 40 minutes instead of 1 minute).
I have been running some tests and realized that processing 50 of my jobs with a sidekiq concurrency of 5, 10 or 25 results in the same duration, as if there were a bottleneck of 5 connections somewhere.
I have checked Sidekiq Documentation and some other posts on SO (sidekiq - Is concurrency > 50 stable?, Scaling sidekiq network archetecture: concurrency vs processes) but I haven't been able to solve my problem.
So I am wondering:
Is my understanding of the rails database.yml connection pool and sidekiq concurrency right?
What's the correct way to set up those parameters?
Dropping this here in case someone else could use a quick, very general pointer:
Sometimes increasing the number of concurrent workers may not yield the expected results.
For instance, if there's a large discrepancy between the number of tasks and the number of cores, the scheduler will keep switching between your tasks and there isn't much to gain: the jobs will just take about the same time, or a bit more.
Here's a link to a rather interesting read on how job scheduling works https://en.wikipedia.org/wiki/Scheduling_(computing)#Operating_system_process_scheduler_implementations
There are other aspects to consider as well, such as datastore access. Are your workers using the same table(s)? Is the datastore backed by a storage engine that locks the entire table, such as MyISAM? If so, it won't matter if you have 100 workers running at the same time with enough RAM and cores: they will all wait in line for whichever query is running to release the lock on the table they all need.
This can also happen with tables using engines such as InnoDB, which uses row-level locking rather than locking the entire table on write: different workers may be accessing the same rows, or large indexes may slow the table down without locking it.
Another issue I've encountered was related to Rails (which I'm assuming you're using) taking quite a toll on RAM in some cases, so you might want to look at your memory footprint as well.
My suggestion is to turn on logging and look at the data. Where do your workers spend most of their time? Is it something on the network layer (unlikely)? Are they waiting to get access to a core? Reading from or writing to your data store? Is your machine swapping?
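To make the "bottleneck of 5 connections" hypothesis from the question concrete, here is a toy model in Python (not Ruby, and not Sidekiq's or ActiveRecord's actual code). A semaphore plays the role of the connection pool, and the constants `POOL_SIZE` and `CONCURRENCY` are illustrative stand-ins for `pool:` and `:concurrency:`. The point is only that when the pool is smaller than the number of worker threads, the surplus workers wait, so peak connection use is capped by the pool. Measuring the equivalent peak on the real system is one way to see which limit is actually in effect.

```python
import threading
import time

POOL_SIZE = 5      # stands in for `pool:` in database.yml (assumption)
CONCURRENCY = 25   # stands in for `:concurrency:` in sidekiq.yml (assumption)

pool = threading.BoundedSemaphore(POOL_SIZE)  # toy connection pool
state_lock = threading.Lock()
in_use = 0
peak_in_use = 0


def job():
    """Check out a 'connection', do some fake work, and return it."""
    global in_use, peak_in_use
    with pool:  # blocks while all connections are checked out
        with state_lock:
            in_use += 1
            peak_in_use = max(peak_in_use, in_use)
        time.sleep(0.01)  # pretend to run a query
        with state_lock:
            in_use -= 1


threads = [threading.Thread(target=job) for _ in range(CONCURRENCY)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"{CONCURRENCY} workers, peak connections in use: {peak_in_use}")
```

The peak never exceeds `POOL_SIZE`, however many workers are started. If the real setup shows a similar ceiling of 5, the effective pool size is worth checking before anything else.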

Application Pool Occasionally Spiking Memory Consumption

We have just launched a new MVC5 web site. The site uses Entity Framework for its data and also implements a couple of WebApi services for some simple AngularJS pages used on the web site.
The site has gone through development and testing without a problem, but now it is installed on an IIS 8.5 production server we are seeing the following entries in the IIS (WAS) event logs:
Here is the first error:
A worker process serving application pool 'xxx' has requested a recycle
because it reached its private bytes memory limit.
Around 90 seconds later we see this error:
A worker process '4880' serving application pool 'xxx' failed to stop
a listener channel for protocol 'http' in the allotted time. The data
field contains the error number.
Which is immediately (the same time to the second) followed by a third error:
A process serving application pool 'xxx' exceeded time limits during
shut down. The process id was '4880'.
Finally, we see another Application Pool recycle event:
A worker process serving application pool 'xxx' has requested a recycle
because it reached its private bytes memory limit.
We are currently seeing this problem approximately once per day and it does not seem to be related to site traffic/loading.
We set the Application Pool to recycle when Private Bytes consumption exceeds 4,194,304 KB (4 GB) because we had noticed that the Application Pool's private memory consumption, which normally sits below 1 GB (for perhaps 36 hours), would occasionally increase linearly. Again, we did not see this during development or local testing.
We have tried running load tests of several hundred concurrent users across the application, but have been unable to replicate this error sequence.
We have also run the application locally for extended periods of time with ReSharper's dotMemory profiler and memory snapshots do not reveal any problems.
Are there any tools/techniques available that we can run on the production server that would give us more information on what is happening?

neo4j high cpu and open transactions

Is there a way to check why the server (neo4j dedicated) has high CPU usage after a while of running queries?
Also, is the attached monitor screen OK? There are lots of open transactions there, and the number only increases.
Opened should continue to increase. That is not how many are currently opened but rather just a total including transactions that were opened and are now finished and not running.
However, "current" shows 7, which means you still have 7 transactions running. That probably explains the high CPU usage, depending on what those transactions are doing. Is it expected that you would have 7 transactions running? If so, there's probably nothing to worry about. If not, you might want to look into why those transactions didn't finish when you expected them to. You can also configure the execution guard to limit how long each query can run before being killed.
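The difference between the two numbers can be shown with a toy counter, sketched here in Python. The class `TransactionCounters` and its fields are hypothetical and are not Neo4j's monitoring API. They only model the idea that "opened" is a running total while "current" is the number opened minus the number finished.

```python
class TransactionCounters:
    """Toy model of cumulative vs. current transaction counts (not Neo4j's API)."""

    def __init__(self):
        self.opened = 0    # cumulative: only ever increases
        self.finished = 0  # cumulative: committed or rolled back

    def begin(self):
        self.opened += 1

    def finish(self):
        self.finished += 1

    @property
    def current(self):
        # Transactions still running right now.
        return self.opened - self.finished


counters = TransactionCounters()
for _ in range(100):
    counters.begin()
for _ in range(93):
    counters.finish()

print(counters.opened, counters.current)  # prints "100 7"
```

A steadily growing "opened" value is therefore normal. A "current" value that keeps growing is the number that would point at transactions not finishing.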

How can I get Jelastic to sleep?

Yesterday I got a trial account on webhosting.net's Jelastic v2.2.2 and configured an environment with a minimum of 0 cloudlets (max 8, i.e., all dynamic, no reserved). Then I deployed a Grails war which was using 3 cloudlets after it started up (around 350 MB). It worked great, and I was very impressed.
However, I did not access my app overnight, and the billing history shows it kept using 3 dynamic cloudlets every hour, even with 0 requests (i.e., 0 MB paid traffic) for 14 hours. Is there some way I can get my Jelastic environment to sleep (i.e., hibernation) after some period with no requests (e.g., after an hour or two)? Then, when it gets a request, I'd like it to automatically wake up (i.e., allocate some cloudlets and restore memory from disk). I see how to stop and restart it manually, but I would like it to work automatically, for any requester.
edit: I found the following documentation, but does it not work for Tomcat/Grails?
Hibernation
Jelastic’s hibernation feature delivers even better utilization of cluster resources. Optimal use of resources is achieved by suspending non-active containers and returning released resources back to the cluster.
Because they are in sleep mode, hibernated containers do not consume resources (only disk space). As a result you save money while your containers are in hibernate mode. If applications are needed again the platform returns them to a running state again in just a few seconds.
It takes a little time to awaken your environment from sleep, so it isn't suitable for production use in the way you describe: you would effectively lose visitors, because the delay on that first access would make your service appear to be offline.
For that reason the 'sleep' function is only active for trial accounts, and the inactivity time before sleep is set by the hosting provider (so you should contact them directly for help on that point).
Of course you should also remember that accesses from search engine spiders etc. may keep your environment awake.
