Weird three-minute delay for TSQLConnection.Connect with InterBase 7.5 - delphi

Environment:
Delphi 2009 client applications (and one Java application), running on a Windows 2003 server
connecting to InterBase 7.5.1 (another Windows 2003 Server) over dbExpress
The Delphi applications log the time to open the TSQLConnection using the AfterConnect event handler of the TSQLConnection object.
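A simplified sketch of that kind of timing (not the actual application code; ConnectStart and LogLine stand in for whatever field and logging routine are really used):

var
  ConnectStart: Cardinal;  // set in BeforeConnect, read in AfterConnect

procedure TDM.SQLConnection1BeforeConnect(Sender: TObject);
begin
  ConnectStart := GetTickCount;  // Windows unit
end;

procedure TDM.SQLConnection1AfterConnect(Sender: TObject);
begin
  // log how long TSQLConnection.Connect actually took
  LogLine(Format('SQLConnection.Connect took %d ms', [GetTickCount - ConnectStart]));
end;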
At random intervals, the connect needs three minutes of "extra time". I first suspected a problem with the SQL query, but more detailed logging today showed that it is SQLConnection.Connect that hangs.
I am not sure if this could be a problem with network, the InterBase server, or the Delphi / dbExpress layer.
Has anybody experienced a similar three-minute "hang"?
P.S. The Java application does not log connect time, so I cannot say whether it is affected by this problem.
This phenomenon has appeared in the log files ever since we started logging in 2012, but the rate increased sharply last month. The only environment change has been the addition of new Windows servers (for RDP services, mail, and fax), so it could be a network-related problem.

Aside from a possible network problem, the delay could be caused by the fact that, from time to time, your query triggers a garbage collection in one of the table(s) it is querying.
I don't know InterBase 7.5 internals in detail, but in my experience (with Firebird) this usually happens when a SELECT is made on a table from which many records have recently been deleted or updated.
This doc at IBExpert.net explains it:
A garbage collection is only performed during a database sweep, database backup or when a SELECT query is made on a table (and not by INSERT, ALTER or DELETE). Whenever Firebird/InterBase® touches a row, such as during a SELECT operation, the versioning engine sweeps out any versions of the row where the transaction number is older than the Oldest Interesting Transaction (OIT). This helps to keep the version history small and manageable and also keeps performance reasonable.
A periodic sweep or backup, run during low-usage hours, can increase performance and minimize the risk of being hit by an inconvenient garbage collection. See Sweep interval and automated housekeeping (page 6-20) and Facilitating garbage collection (page 11-19) in the InterBase 7.5 Operations Guide for more on this.

Please check whether hard disk power saving is enabled on any disk of the servers mentioned. That would explain a delay on the first connect and no delay on the following connections; after a while, power saving kicks in again and the same problem reappears.

Since the rate has increased with the addition of new servers to the network, you could be seeing packet loss and a long retry timeout. To test that hypothesis, you can change the connection timeout to a small value. You can also monitor the network traffic between the servers using Wireshark or tcpdump.
Monitoring
To monitor only the TCP handshake (packets with the SYN flag set), you can use:
tcpdump -i eth0 'tcp[13] & 2 = 2'

Related

Erlang/Elixir on Docker and Hot Code Swap

One of the features of Erlang (and, by extension, Elixir) is that you can do hot code swapping. However, this seems to be at odds with Docker, where you would need to stop your instances and start new ones from new images holding the new code. This essentially seems to be what everyone does.
That being said, I also know that it is possible to use one hidden node to distribute updates to all other nodes over the network. Of course, put like that it sounds like asking for trouble, but...
My question is the following: has anyone tried, with reasonable success, to set up a Docker-based infrastructure for Erlang/Elixir that allows hot code swapping? If so, what are the dos, don'ts and caveats?
The story
Imagine a system to handle mobile phone calls or mobile data access (that's what Erlang was created for). There are gateway servers that maintain the user session for the duration of the call or the data access session (I will call it the session going forward). Those servers have an in-memory representation of the session for as long as the session is active (the user is connected).
Now there is another system that calculates how much to charge the user for the call or the data transferred (call it the PDF - Policy Decision Function). The two systems are connected in such a way that the gateway server creates a handful of TCP connections to the PDF, and it drops user sessions if those TCP connections go down. The gateway can handle a few hundred thousand customers at a time. Whenever there is an event the user needs to be charged for (the next data transfer, another minute of the call), the gateway notifies the PDF, and the PDF subtracts a specific amount of money from the user's account. When the user's account is empty, the PDF tells the gateway to disconnect the call (you've run out of money, you need to top up).
Your question
Finally, let's talk about your question in this context. We want to upgrade a PDF node, and the node is running on Docker. We create a new Docker instance with the new version of the software, but we can't shut down the old version (there are hundreds of thousands of customers in the middle of their calls; we can't disconnect them). But we need to move the customers somehow from the old PDF to the new version. So we tell the gateway node to create any new connections to the updated node instead of the old PDF. Customers can be chatty, and some of them may have long-running data connections (downloading a Windows 10 ISO), so the whole operation takes 2-3 days to complete. That's how long it can take to upgrade from one version of the software to another in the case of a critical bug. And there may be dozens of servers like this one, each handling hundreds of thousands of customers.
But what if we used the Erlang release handler instead? We create the relup file with the new version of the software. We test it properly and deploy it to the PDF nodes. Each node is upgraded in place: the internal state of the application is converted, and the node runs the new version of the software. Most importantly, the TCP connection with the gateway server has not been dropped. So customers happily continue their calls, or keep downloading the latest Windows ISO, while we upgrade the system. The whole thing is done in 10 seconds rather than 2-3 days.
The answer
This is an example of a specific system with specific requirements. Docker and Erlang's release handling are orthogonal technologies. You can use either or both; it all boils down to the following:
Requirements
Cost
Will you have enough resources to test both approaches predictably, and enough patience to teach your Ops team so that they can deploy the system using either method? What if the testing facility costs millions of pounds (because of the required hardware) and can only exercise one of those two methods at a time (because a test cycle takes days)?
The pragmatic approach might be to deploy the nodes initially using Docker and then upgrade them with the Erlang release handler (if you need to use Docker in the first place). Or, if your system doesn't need to be available during the upgrade (unlike the example PDF system), you might just opt for always deploying new versions with Docker and forget about release handling. Or you may as well stick with the release handler and forget about Docker, if you need quick and reliable on-the-fly updates and Docker would only be used for the initial deployment. I hope that helps.

Is DataSnap Optimized for responding to more than 1k users at the same time?

We want to start a big multi-tier application. The server-side application must respond to more than 1000 users at the same time. We want to build the server application with the 64-bit compiler and the client side with the 32-bit one. In this case we don't know whether DataSnap can serve all the clients without problems.
The server computer is very powerful (multi-processor, more than 16 GB of RAM) and the database management system is Firebird 2.5.
You need a way to perform realistic load tests.
For the Firebird database, you can simulate concurrent users with the free Apache JMeter tool. It can run SQL statements and record their execution time statistics (average, min/max, etc.). So you could, for example, create a thread group with twenty different SQL queries and then run twenty threads, each of which performs these queries sequentially.
JMeter lets you define time limits on the SQL queries and treats it as an error if a query exceeds its limit. You can then try to find the maximum client count at which the overall error rate is still below (for example) five percent.
But you also need to know how high the expected database load will be, and you will also need to have a test database with a realistic size, not only a couple of records. Also, some database queries like reports might cause higher load - these should be included in the simulation too, as they can affect overall performance. In JMeter, you can create a second thread group, running in parallel with the first one, for these long-running statements with different settings (less simulated clients).
Testing the database will show if there is a bottleneck already in this area. For example, the test result could be that the database can serve twenty clients with a total average transaction rate of 20 TPS (transactions per second), which means one client executes one transaction per second. But this TPS value will decrease with higher user count.
Related question: Firebird usage in big projects which also has a link to http://www.firebirdsql.org/en/case-studies-catalog/
Regarding DataSnap client load simulation: this can be done with a scripted client, which runs a predefined set of statements / commands over the connection.
To run a high number of load-test clients simultaneously you could use a service like Amazon Elastic Compute Cloud (EC2) to launch clones of your test machine image, saving hardware costs. But of course I would start with a small client machine which simply runs ten or twenty scripted clients.
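As a rough illustration of such a scripted client (a sketch only, not DataSnap-specific: it uses plain dbExpress, and the queries, connection parameters and run count below are all invented), it could look something like this:

program ScriptedClient;
{$APPTYPE CONSOLE}
uses
  SysUtils, Windows, SqlExpr;  // plus the matching DBX driver unit for your setup

const
  Queries: array[0..2] of string = (
    'SELECT COUNT(*) FROM CUSTOMERS',
    'SELECT FIRST 100 * FROM ORDERS',
    'SELECT COUNT(*) FROM INVOICES');
var
  Conn: TSQLConnection;
  Qry: TSQLQuery;
  Run, I: Integer;
  Start: Cardinal;
begin
  Conn := TSQLConnection.Create(nil);
  Qry := TSQLQuery.Create(nil);
  try
    Conn.DriverName := 'Firebird';  // or 'Interbase', depending on your driver
    Conn.Params.Values['Database'] := 'server:C:\data\test.fdb';
    Conn.Params.Values['User_Name'] := 'SYSDBA';
    Conn.Params.Values['Password'] := 'masterkey';
    Conn.LoginPrompt := False;
    Conn.Open;
    Qry.SQLConnection := Conn;
    for Run := 1 to 100 do                 // repeat the "script" many times
      for I := Low(Queries) to High(Queries) do
      begin
        Start := GetTickCount;
        Qry.SQL.Text := Queries[I];
        Qry.Open;
        while not Qry.Eof do Qry.Next;     // fetch the whole result set
        Qry.Close;
        Writeln(Format('%d ms  %s', [GetTickCount - Start, Queries[I]]));
      end;
  finally
    Qry.Free;
    Conn.Free;
  end;
end.

Starting many instances of this (or wrapping the loop in threads) gives you a crude way to ramp up the simulated client count.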
As far as I know, DataSnap is based on Indy, and Indy's connection handling model - one thread per connection - is not very scalable and is very resource-consuming. Even using Indy's thread pools is not an option, I think... Also, 32-bit Windows, for example, has a limit on the maximum number of threads you can create (2000, IIRC). Anyway, using many threads is not good and hurts server performance (for reference: the Windows Internals book, the Windows Performance Team blog, etc.).
A scalable, robust and professional application server would use I/O completion ports (IOCP) for data processing, but I don't know whether DataSnap can take advantage of them.
UPDATE:
At CodeRage 7 I asked similar scalability questions. Here are the answers:
Q: Recently there was a question on StackOverflow about DataSnap's scalability/performance. Can DS handle, for example, 2000 or more concurrent user requests at the network and application level?
A: The scalability is based on the scalability of TCP/HTTP/HTTPS and the number of connections allowed by your server operating system, and also on the memory and hardware you employ. There is no specific limit in DataSnap.
My comment: While this is true, Indy's connection handling model, i.e. one thread per connection, introduces a bottleneck, especially on 32-bit Windows (2000 threads max). On Win64 it should not be as much of a problem, but again, this way of handling the data flow leads to performance degradation.
Q: Does DataSnap support some kind of load balancing?
A: Not directly. You can do this in code in your DataSnap server(s).
My comment: I've found a very good paper on implementing failover/load balancing in DataSnap on Andreano Lanusse's blog.
Q: Does DataSnap support I/O completion ports for better scalability?
This question of mine was left unanswered.
Hope this helps!
UPDATE2:
I found a very interesting post on DataSnap performance: DataSnap analysis based on Speed & Stability tests
UPDATE3:
DataSnap, Deployment, Performance, and More (Marco Cantu)
Monitoring and control of connections in DataSnap XE2 - translated into English
Monitoring and control of connections in DataSnap XE2 - original
When the specifications for a system are written, you need to be very precise about the number of users.
For example: you create a website, and the client expects 15,000 unique users.
The client then usually comes up with a requirement that the system must support 15,000 simultaneous users, which is very naive.
You'll need a more detailed specification than that.
Usually it's more sensible to say something like: for 99% of the requests, 99% of the users get a response to their request within an average of 5 seconds.
In normal usage, you'll never see all users send a request within the same second. If at some point they all arrive within the same minute (also very unlikely), you'll have a lot fewer concurrent users.
Even for websites with tens of thousands of users, where most of them connect on a daily basis, the web server is idle most of the time, and once in a while the load jumps to 5% or, in extreme cases, to 20%. If we really had to serve all of these users at once we'd be screwed, but that never happens, and it's not realistic to size a server for such loads.
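As a back-of-the-envelope illustration of why (all numbers below are invented), Little's Law says the average number of requests in flight equals the arrival rate times the average service time:

program ConcurrencyEstimate;
{$APPTYPE CONSOLE}
uses
  SysUtils;
const
  UsersPerDay     = 15000;
  RequestsPerUser = 200;     // invented average
  ServiceTimeSec  = 0.25;    // invented average server time per request
  SecondsPerDay   = 86400;
var
  ArrivalRate, AvgInFlight: Double;
begin
  ArrivalRate := UsersPerDay * RequestsPerUser / SecondsPerDay;  // requests/sec
  AvgInFlight := ArrivalRate * ServiceTimeSec;                   // Little's Law
  Writeln(Format('%.1f requests/sec, %.1f requests in flight on average',
    [ArrivalRate, AvgInFlight]));
  // prints roughly 34.7 requests/sec and about 9 concurrent requests -
  // a far cry from "15,000 simultaneous users"; peaks are higher, but
  // nowhere near the full user count.
end.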

TCP connection timeout is 20 or 21 seconds on *some* PCs when set to 500ms

I was given 10 new PCs, all (supposedly) with Windows 7 Pro freshly installed and nothing else done to them.
I have a program, coded in Delphi XE2, using Indy 10 components for the networking. I set the "connect timeout" and "read timeout" properties of my TIdTCPClient to 500 ms, set "reuse socket" to "OS dependent" (I also tried a build with it set to No) and leave "use Nagle" (whatever that is) set to True (I also tried with False).
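For reference, that configuration looks roughly like this in code (the component name, host and port are placeholders):

IdTCPClient1.Host           := '192.168.1.50';
IdTCPClient1.Port           := 502;
IdTCPClient1.ConnectTimeout := 500;            // ms
IdTCPClient1.ReadTimeout    := 500;            // ms
IdTCPClient1.ReuseSocket    := rsOSDependent;  // also tried rsFalse
IdTCPClient1.UseNagle       := True;           // also tried False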
Here's the problem: when I run the same .EXE on these PCs and test the case where I pull the network cable, on some PCs my debug trace shows the connect attempt / connect timeout happening within the same second or the next second (with a granularity of 1 second), but on others it takes 20 or 21 seconds before I see the connection timeout.
It would seem that some of the PCs are not totally "fresh installs" as claimed, although I see no apps installed. Maybe someone installed something and then removed it; maybe they tried to tweak performance.
Before I reinstall Windows on 10 PCs, can anyone suggest where to look? Does 20 (or 21) seconds ring a bell with regard to TCP Client connect timeout?
[update] I am attempting to connect directly to a specific IP address, so I am not sure whether @Nikolai's suggestion to check DNS is relevant. Sorry for not mentioning this originally.
[update] The program does not attempt to keep the socket open. It connects, sends some data and disconnects - repeatedly, for each new piece of data.
Sadly, this is working as intended. The connect did already time out: Indy determined, within the 500 milliseconds you asked for, that the connect would fail. However, that does not guarantee the function will return within that time.
After the connect times out, Indy spins down the connection to release all of its resources. It does this synchronously, which means you wind up waiting for the underlying TCP operation to fail. This typically takes 20 seconds.
The solution is to call connect in a thread. Believe it or not, this is what Indy already does to implement the timeout. However, when it times out waiting for the thread, it tries to shut down the connection in the main thread. You need to defer that to a worker thread.
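One possible shape of that worker-thread connect, as a sketch only (ConnectInBackground is an invented name; error reporting and thread lifetime management are left out):

uses
  SysUtils, Classes, IdTCPClient;

procedure ConnectInBackground(const AHost: string; APort: Word);
begin
  TThread.CreateAnonymousThread(
    procedure
    var
      Client: TIdTCPClient;
    begin
      Client := TIdTCPClient.Create(nil);
      try
        Client.Host := AHost;
        Client.Port := APort;
        Client.ConnectTimeout := 500;   // ms
        try
          Client.Connect;
          // ...send the data here, then:
          Client.Disconnect;
        except
          // a failed connect (and its slow teardown) now only blocks this
          // thread; report it back with TThread.Queue if the UI needs to know
        end;
      finally
        Client.Free;
      end;
    end).Start;
end;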
As for why it happens immediately on some systems and in 20 seconds on others, it depends on the precise networking configuration. For example, if IPv6 is enabled, the stack may attempt to use an IPv6-to-IPv4 connection, and that may not report down even if the physical interface is down. Immediate detection of connection impossibility is never guaranteed and you shouldn't rely on it.
I've had the same problems with Indy in the past (while using D6, around 1998-2000). I switched to IP*Works. At that time it was an external component, but as far as I know it is included in XE2. IP*Works is a bit hard to understand at the beginning, but their approach to the communication structure is quite different.
I think it would be worth giving it a try.

How to license number of users at the database?

Given a Delphi and InterBase client-server application, I'd like to license the application by the number of users at the database. How can this be done with commercial licensing software? I don't see any of them listing features that look like they would cover this. Every user initially logs on to the database. The database seems so available that it would be open to any user - or at least to administrators. Would I also have to write a Delphi exe or dll to run on the server - perhaps as a function in the database - with the licensing connected to that? I'm not sure how to proceed.
BTW, InterBase licenses simultaneous users, but I think they wrote that right into the server; I want something similar.
To control simultaneous client connections you definitely need a server-side application.
It can be a simple TCP/IP socket server running as a service (a daemon on Linux) or another (MIDAS?) server layer.
When your client app starts, it calls a server method, for example Session.Connect; there you count the active connections and return False (refuse the connection) when the maximum limit has been reached.
When the application closes, you notify the server with Session.Disconnect to decrease the connection count.
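A hedged sketch of the counting part (all names invented; needs SyncObjs for TCriticalSection) that such Connect/Disconnect methods could call on the server:

type
  TSessionCounter = class
  private
    FLock: TCriticalSection;
    FActive: Integer;
    FMaxLicensed: Integer;
  public
    constructor Create(AMaxLicensed: Integer);
    destructor Destroy; override;
    function TryConnect: Boolean;   // False when the licensed limit is reached
    procedure Disconnect;
  end;

constructor TSessionCounter.Create(AMaxLicensed: Integer);
begin
  inherited Create;
  FLock := TCriticalSection.Create;
  FMaxLicensed := AMaxLicensed;
end;

destructor TSessionCounter.Destroy;
begin
  FLock.Free;
  inherited;
end;

function TSessionCounter.TryConnect: Boolean;
begin
  FLock.Enter;
  try
    Result := FActive < FMaxLicensed;
    if Result then
      Inc(FActive);               // one more licensed session in use
  finally
    FLock.Leave;
  end;
end;

procedure TSessionCounter.Disconnect;
begin
  FLock.Enter;
  try
    if FActive > 0 then
      Dec(FActive);               // session released
  finally
    FLock.Leave;
  end;
end;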
It's also a good idea to keep a live (permanent) connection between the client app and the server service (as I said, sockets) to handle application hang-ups and uncontrolled restarts: process the disconnect event (for example OnSocketDisconnect) on the server side to decrease the connection count and handle the disconnect properly, for example by writing to logs, etc.
Of course, the communication should be encrypted (with a handshake) to keep out unwanted guests.
You can also play with SIM card readers, etc.
This method will not provide an industrial (nuclear) level of security, but if coded correctly it can take some time even for an expert hacker to break it.
Or you may take a look at ready-made protection tools like SafeNet (HASP protection).
Also, Firebird (and maybe InterBase) has ON CONNECT / ON DISCONNECT database triggers, in which a sufficiently privileged user can read the connection count. But these can easily be changed if the DB is stored on the customer's server.

Can my program use Indy 10 at a customer site if I wrote it to use Indy 9?

I have written a program in Delphi 7 (it includes a ModBus component that uses Indy). On my machine it uses Indy 9 and works fine; it communicates well with other machines via the ModBus protocol. However, when the program is run on a different machine, I get a CPU load of 90-100%. Unfortunately that machine is not in my office but "on the other side of the world". How can I find out whether this machine is using Indy 9 or Indy 10? And further, if it is running Indy 10, could that be the problem, or is this very unlikely?
The definitive answer is no.
If you compile your program with Indy 9, even when using packages, it will use Indy 9 at runtime. AFAIK, there's no way to compile the executable with Indy 9 and use Indy 10 at runtime even if you wanted to, and no way for it to happen by accident.
To find out what's causing the high CPU load you might try a profiler like AQTime or SamplingProfiler.
That will show you the method(s) that are running most of the time. Then you will be able to find out what's causing the problem.
Alternatively you could add some logging to your application.
To find the root cause you could prepare a test application which will go through a sequence of actions like opening / closing connections. If it asks the user for confirmation ("Continue ? y/n") before proceeding, the user can check the CPU load for every step to detect the critical operation.
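A rough sketch of such a stepped test program (Delphi 7 compatible; the ModBus calls are placeholders for your own component and polling code):

program StepTest;
{$APPTYPE CONSOLE}
uses
  SysUtils;

function Confirm(const AStep: string): Boolean;
var
  Answer: string;
begin
  Write(AStep, ' - continue? (y/n) ');
  ReadLn(Answer);
  Result := SameText(Answer, 'y');
end;

begin
  if Confirm('Connect to the ModBus server') then
    Writeln('  (call your Connect here)');       // placeholder
  if Confirm('Poll once') then
    Writeln('  (run a single poll cycle here)');  // placeholder
  if Confirm('Poll continuously for 60 s') then
    Writeln('  (run the polling loop here)');     // placeholder
  if Confirm('Disconnect') then
    Writeln('  (call your Disconnect here)');     // placeholder
end.

Between confirmations, the CPU load can be watched in Task Manager to see which step drives it up.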
Thanks for the answers. I do not think this is an Indy issue, though. On my quad-CPU PC the CPU load also goes up from 1-2% to approx. 25%. This happens if I keep the line open (connected). If, however, I disconnect the ModBus server after every poll from the ModBus client side and let that PC reconnect, the CPU load is always low. What is normal? Keeping the line open all the time, or connecting and disconnecting for every poll? The polling frequency is 2000 ms in idle mode and 500 ms in active mode.
You need to add logs to make sure you know what's going on.
Is it the connection itself that is causing the issue, or is it the work performed while connected?
Logs will help you narrow this down, and you may be able to alter your code to be less processor-hungry.
Using AQTime or SamplingProfiler, as also suggested earlier, will help you.
Personally, I always add logging to every application by default; a lot of it needs to be switched on, but it's there. Once the software is on site you never know what may change, and simply turning the logs on can save you a lot of time.
