• Prev.Sprint, S24-W10/11, S24-W12/13
    • 2

      In one environment, agents cannot be installed directly on the database hosts for organizational reasons. Instead, a few proxy hosts with their agents monitor several hundred databases remotely, so it’s crucial that the agents handle concurrent connections smoothly.

      During testing, it was noticed that many of the hosts in the frontend had agent interface timeout errors and that metrics were collected only sporadically.

      It was determined that this was caused by the connection pool locking mechanism in the Zabbix Agent2. 
      The agent locks its connection pool during any database connection attempt and during that time, any further database connection, whether it’s already in the pool or not, is prevented by the lock until the current attempt has timed out.
      This effectively cancels out the agent2’s capability to process multiple requests at once, both in active and passive mode.

      Is it possible to change the connection code so that a lock only applies to the slot that is currently in use in the pool, rather than the entire pool? We have tested the locking behavior with Oracle databases only, but it seems that the connection pool code is shared with that of other databases supported by the agent, so they likely have the same issue. Fixing this issue could potentially greatly improve performance overall in environments with mixed online and offline database hosts.
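The per-slot locking asked for above could look roughly like the following. This is a minimal sketch, not the agent's actual code: the `pool`, `slot`, `conn` and `dialStub` names are hypothetical, and the 200 ms `dialTimeout` only stands in for a real connection timeout. The pool-wide mutex guards just the map lookup and is released before dialing, so a slow attempt blocks only requests for the same target.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// dialTimeout is an assumed, shortened stand-in for a real connect timeout.
const dialTimeout = 200 * time.Millisecond

// conn stands in for a real database handle (hypothetical type).
type conn struct{ target string }

// slot holds one pooled connection together with its own lock.
type slot struct {
	mu sync.Mutex
	c  *conn
}

// pool maps a connection target to its slot. The pool-wide mutex
// protects only the map and is never held while dialing.
type pool struct {
	mu    sync.Mutex
	slots map[string]*slot
	dial  func(target string) (*conn, error)
}

func newPool(dial func(string) (*conn, error)) *pool {
	return &pool{slots: make(map[string]*slot), dial: dial}
}

// get returns the pooled connection for target, dialing if necessary.
func (p *pool) get(target string) (*conn, error) {
	p.mu.Lock()
	s, ok := p.slots[target]
	if !ok {
		s = &slot{}
		p.slots[target] = s
	}
	p.mu.Unlock() // released before the potentially slow dial

	s.mu.Lock() // blocks only callers asking for the same target
	defer s.mu.Unlock()
	if s.c != nil {
		return s.c, nil
	}
	c, err := p.dial(target)
	if err != nil {
		return nil, err
	}
	s.c = c
	return c, nil
}

// dialStub simulates one unreachable host that takes dialTimeout to fail.
func dialStub(target string) (*conn, error) {
	if target == "offline" {
		time.Sleep(dialTimeout)
		return nil, errors.New("connection timed out")
	}
	return &conn{target: target}, nil
}

// onlineLatency measures a request to a reachable target that is issued
// while an attempt to the unreachable target is still in progress.
func onlineLatency() (time.Duration, error) {
	p := newPool(dialStub)
	go p.get("offline") // occupies only the "offline" slot
	time.Sleep(dialTimeout / 10)
	start := time.Now()
	_, err := p.get("online")
	return time.Since(start), err
}

func main() {
	d, err := onlineLatency()
	fmt.Println("online request took", d, "err:", err)
}
```

With a single pool-wide lock held across the dial, the request for the reachable target would instead wait until the unreachable one had timed out.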

      To speed up a possible solution, we customized the agent to also save failed connection attempts in the pool and to keep track of the error that occurred during the attempt.
      This error is then returned immediately when another attempt to the same database host is made.

      The failed connections are cleaned up after the keep alive period, allowing further attempts later on. This doesn’t fix the locking issue directly, but still helped a lot in the environment and gives some time for a long term solution.
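The idea behind this workaround can be sketched as follows. This is an illustration under stated assumptions, not the provided patch: `failCache`, `fakeClock` and `errTimeout` are hypothetical names, the 300-second `keepAlive` value is assumed, and the clock is injected only so that the expiry can be shown without waiting.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// keepAlive is an assumed value used for illustration only.
const keepAlive = 300 * time.Second

var errTimeout = errors.New("connection timed out")

// failure records the error of a failed attempt and when it happened.
type failure struct {
	err error
	at  time.Time
}

// failCache remembers recent failures per target.
type failCache struct {
	mu     sync.Mutex
	failed map[string]failure
	now    func() time.Time // injectable clock
}

func newFailCache(now func() time.Time) *failCache {
	return &failCache{failed: make(map[string]failure), now: now}
}

// connect returns the remembered error while it is younger than keepAlive;
// otherwise it performs a real attempt and records a new failure, if any.
func (f *failCache) connect(target string, dial func(string) error) error {
	f.mu.Lock()
	if rec, ok := f.failed[target]; ok {
		if f.now().Sub(rec.at) < keepAlive {
			f.mu.Unlock()
			return rec.err // fail fast, no new attempt
		}
		delete(f.failed, target) // record expired, allow a retry
	}
	f.mu.Unlock()

	err := dial(target)
	if err != nil {
		f.mu.Lock()
		f.failed[target] = failure{err: err, at: f.now()}
		f.mu.Unlock()
	}
	return err
}

// fakeClock is a manually advanced clock for the demonstration.
type fakeClock struct{ t time.Time }

func (c *fakeClock) now() time.Time          { return c.t }
func (c *fakeClock) advance(d time.Duration) { c.t = c.t.Add(d) }

func main() {
	clock := &fakeClock{}
	cache := newFailCache(clock.now)
	dials := 0
	dial := func(string) error { dials++; return errTimeout }

	cache.connect("db1", dial) // real attempt, failure is recorded
	cache.connect("db1", dial) // remembered error, no dial
	clock.advance(keepAlive)   // the record expires
	cache.connect("db1", dial) // real attempt again
	fmt.Println("dial attempts:", dials)
}
```

The trade-off is that a host that comes back online is not retried until its failure record expires.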

      The patch is provided for consideration.

      To reproduce the issue, a test.sh script is provided.

      $ time ./test.sh
      zabbix_get [325911]: Timeout while executing operation
      zabbix_get [325912]: Timeout while executing operation
      zabbix_get [325916]: Timeout while executing operation
      zabbix_get [325914]: zabbix_get [325917]: Timeout while executing operationTimeout while executing operation
      
      zabbix_get [325915]: Timeout while executing operation
      zabbix_get [325918]: Timeout while executing operation
      zabbix_get [325919]: Timeout while executing operation
      zabbix_get [325923]: Timeout while executing operation
      zabbix_get [325920]: Timeout while executing operation
      zabbix_get [325921]: Timeout while executing operation
      zabbix_get [325913]: Timeout while executing operation
      zabbix_get [325932]: Timeout while executing operation
      zabbix_get [325928]: zabbix_get [325929]: Timeout while executing operationzabbix_get [325931]: Timeout while executing operation
      Timeout while executing operation
      
      zabbix_get [325926]: Timeout while executing operation
      zabbix_get [325935]: Timeout while executing operation
      zabbix_get [325922]: Timeout while executing operationzabbix_get [325934]: Timeout while executing operation
      
      zabbix_get [325924]: Timeout while executing operation
      zabbix_get [325940]: Timeout while executing operation
      zabbix_get [325930]: Timeout while executing operation
      zabbix_get [325933]: zabbix_get [325925]: Timeout while executing operationTimeout while executing operation
      
      zabbix_get [325937]: Timeout while executing operation
      zabbix_get [325938]: Timeout while executing operation
      zabbix_get [325936]: Timeout while executing operation
      zabbix_get [325939]: Timeout while executing operation
      zabbix_get [325927]: Timeout while executing operation
      
      real    0m30,006s
      user    0m0,032s
      sys     0m0,014s
      

      The zabbix_get instances that started first exit one by one at 3-second intervals, but all instances still running exit after 30 seconds (the default timeout of the zabbix_get tool).

      After starting the script, run this command in another shell window:

      watch -n1 'netstat -np | grep 3333'
      

      You can see that over 90 seconds (30 forked zabbix_get processes × the agent's 3-second timeout), single SYN_SENT connections appear one by one.
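The timing can be checked with a small calculation. This is a simplified model that assumes every attempt takes exactly the agent's timeout and that the attempts run strictly one after another; the function and constant names are illustrative only.

```go
package main

import (
	"fmt"
	"time"
)

// Values taken from the description above.
const (
	clients       = 30               // forked zabbix_get processes
	agentTimeout  = 3 * time.Second  // per-attempt timeout on the agent
	clientTimeout = 30 * time.Second // default zabbix_get timeout
)

// drainTime returns how long n queued attempts take when a pool-wide
// lock forces them to run one after another.
func drainTime(n int, perAttempt time.Duration) time.Duration {
	return time.Duration(n) * perAttempt
}

// answeredBefore counts the clients whose attempt finishes strictly
// before the client-side timeout expires.
func answeredBefore(n int, perAttempt, limit time.Duration) int {
	count := 0
	for i := 1; i <= n; i++ {
		if time.Duration(i)*perAttempt < limit {
			count++
		}
	}
	return count
}

func main() {
	fmt.Println("agent works through the queue in:", drainTime(clients, agentTimeout))
	fmt.Println("clients answered before their own timeout:",
		answeredBefore(clients, agentTimeout, clientTimeout))
}
```

Under this model the agent keeps attempting connections for 90 seconds, long after the clients have given up at 30 seconds.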

            rzvejs Rudolfs Zvejs
            zalex_ua Oleksii Zagorskyi
            Team INT