-
Problem report
-
Resolution: Fixed
-
Trivial
-
None
-
None
-
Prev.Sprint, S24-W10/11, S24-W12/13
-
2
In one environment, it is not possible to install agents directly on the database hosts for organizational reasons. Instead, a few proxy hosts with their agents monitor several hundred databases remotely, so it’s crucial that the agents handle concurrent connections smoothly.
During testing, we noticed that many of the hosts in the frontend showed agent interface timeout errors and that metrics were collected only sporadically.
It was determined that this was caused by the connection pool locking mechanism in the Zabbix Agent2.
The agent locks its connection pool during any database connection attempt and during that time, any further database connection, whether it’s already in the pool or not, is prevented by the lock until the current attempt has timed out.
This effectively cancels out the agent2’s capability to process multiple requests at once, both in active and passive mode.
Is it possible to change the connection code so that a lock only applies to the slot that is currently in use in the pool, rather than the entire pool? We have tested the locking behavior with Oracle databases only, but it seems that the connection pool code is shared with that of other databases supported by the agent, so they likely have the same issue. Fixing this issue could potentially greatly improve performance overall in environments with mixed online and offline database hosts.
In order to speed up a possible solution, we customized the agent to also save failed connection attempts in the pool and to keep track of the error from the attempt.
This error is then returned immediately when another attempt to the same database host is made.
The failed connections are cleaned up after the keep alive period, allowing further attempts later on. This doesn’t fix the locking issue directly, but still helped a lot in the environment and gives some time for a long term solution.
The patch is provided for consideration.
In order to reproduce the issue, a test.sh script is provided.
$ time ./test.sh
zabbix_get [325911]: Timeout while executing operation
zabbix_get [325912]: Timeout while executing operation
zabbix_get [325916]: Timeout while executing operation
zabbix_get [325914]: zabbix_get [325917]: Timeout while executing operationTimeout while executing operation
zabbix_get [325915]: Timeout while executing operation
zabbix_get [325918]: Timeout while executing operation
zabbix_get [325919]: Timeout while executing operation
zabbix_get [325923]: Timeout while executing operation
zabbix_get [325920]: Timeout while executing operation
zabbix_get [325921]: Timeout while executing operation
zabbix_get [325913]: Timeout while executing operation
zabbix_get [325932]: Timeout while executing operation
zabbix_get [325928]: zabbix_get [325929]: Timeout while executing operationzabbix_get [325931]: Timeout while executing operation
Timeout while executing operation
zabbix_get [325926]: Timeout while executing operation
zabbix_get [325935]: Timeout while executing operation
zabbix_get [325922]: Timeout while executing operationzabbix_get [325934]: Timeout while executing operation
zabbix_get [325924]: Timeout while executing operation
zabbix_get [325940]: Timeout while executing operation
zabbix_get [325930]: Timeout while executing operation
zabbix_get [325933]: zabbix_get [325925]: Timeout while executing operationTimeout while executing operation
zabbix_get [325937]: Timeout while executing operation
zabbix_get [325938]: Timeout while executing operation
zabbix_get [325936]: Timeout while executing operation
zabbix_get [325939]: Timeout while executing operation
zabbix_get [325927]: Timeout while executing operation

real 0m30,006s
user 0m0,032s
sys 0m0,014s
The zabbix_get instances that were started earliest exit one by one after 3 seconds each, but all zabbix_get instances that are still running exit after 30 seconds (the default timeout of the zabbix_get tool).
After running the script, run this command in another shell window:
watch -n1 'netstat -np | grep 3333'
You can see that, for 90 seconds (30 forked zabbix_get processes * 3 seconds of agent timeout), single SYN_SENT connections appear one by one.