After wasting several hours on each of the last several days troubleshooting this seemingly obscure TLS encryption issue in SSSD, I feel compelled to write down some notes. It all started with a switch upgrade in our rack. Two new servers I had just provisioned were failing to find users in our 389 directory server. The systemd journal showed this unexpected message:
sssd[be[default]][930]: Could not start TLS encryption. unknown error
I have been deploying the same SSSD configuration on servers via Ansible for years and it has always worked reliably until now. Imagine my surprise when I looked on a few older servers and found that they had ceased working as well!
Watching Logs and Pulling Hair
Scratching my head, I increased the debug level and watched the logs as I restarted the sssd service. I was seeing various errors indicating problems with encryption on the LDAP server, as well as general connection issues. For example:
[be[default]] [sss_ldap_init_sys_connect_done] (0x0020): ldap_install_tls failed: [Connect error] [unknown error]
[fo_set_port_status] (0x0100): Marking port 636 of server 'example.org' as 'not working'
[be[default]] [sdap_sys_connect_done] (0x0020): sdap_async_connect_call request failed: [5]: Input/output error.
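A small filter can make the connection-related messages stand out from the TLS noise. This is only a sketch: the function name is made up for illustration, the patterns are taken from the three log lines above, and the log path in the usage comment assumes the usual /var/log/sssd/sssd_<domain>.log naming with a domain called "default" (as the be[default] prefix suggests).

```shell
# Print log lines that point at connection trouble rather than certificates:
# failover port-status changes, connect errors, and I/O errors.
# (sssd_connection_errors is a hypothetical helper, not part of SSSD.)
sssd_connection_errors() {
    grep -E 'fo_set_port_status|Connect error|Input/output error' "$1"
}

# Usage (path is an assumption, adjust to your domain name):
#   sssd_connection_errors /var/log/sssd/sssd_default.log
```

A port being marked as 'not working' alongside an input/output error hints that the connection itself is failing, whatever the TLS layer reports.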
Because our setup uses a self-signed TLS certificate, I spent some time double-checking that I was using the correct certificate authority, that the certificate authority was installed correctly on all clients, how the TLS configuration differed between /etc/openldap/ldap.conf and /etc/sssd/sssd.conf, and so on.
On the LDAP server side, the slapd access logs were also leading me to believe there was a TLS certificate issue:
[11/Jun/2021:14:28:52.182392072 +0300] conn=7418 op=-1 fd=72 closed – Peer does not recognize and trust the CA that issued your certificate.
The curious thing is that I was able to query the LDAP server directly using ldapsearch with the TLS-enabled configuration in /etc/openldap/ldap.conf:
# ldapsearch -x -Z
This confirmed that the certificates and the server were fine, so the issue had to be specific to SSSD or its configuration. Having said that, I noticed that ldapsearch took five to ten seconds to complete on some hosts, both with plaintext ldap:// on port 389 and TLS-encrypted ldaps:// on port 636. This should have tipped me off early on to the possibility of issues other than TLS, but alas…
An MTU Problem
As I shared a blow-by-blow account of this ordeal on the #sssd channel on Libera.Chat (IRC), one user suggested that I try increasing several of the LDAP-specific SSSD timeouts. Until then I had only tried the general SSSD timeout that controls the heartbeat frequency. The sssd-ldap man page documents several LDAP timeouts with a default of six seconds, which is suspiciously close to my slow ldapsearch responses above.
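To make that comparison less of a hunch, one could time the query and check it against the six-second figure. The following is a rough sketch, not something I ran at the time; the helper names are hypothetical, and the threshold is simply the default quoted from the man page.

```shell
# Default shared by several sssd-ldap timeouts, per the man page mentioned above.
SSSD_LDAP_DEFAULT_TIMEOUT=6

# Print how many whole seconds a command takes to run (its output is discarded).
elapsed_seconds() {
    local start end
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $(( end - start ))
}

# Succeed if the given duration (in seconds) reaches the default timeout.
exceeds_default_timeout() {
    [ "$1" -ge "$SSSD_LDAP_DEFAULT_TIMEOUT" ]
}

# Usage:
#   secs=$(elapsed_seconds ldapsearch -x -Z)
#   exceeds_default_timeout "$secs" && echo "query took ${secs}s, at or over the default timeout"
```

Whole-second resolution is crude, but it is enough to tell a sub-second query from one that takes five to ten seconds.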
I made the following adjustments and tried again:
ldap_network_timeout = 20
ldap_opt_timeout = 20
ldap_search_timeout = 20
ldap_enumeration_search_timeout = 20
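These options belong in the domain section of /etc/sssd/sssd.conf. As a quick sanity check before restarting the service, something like the following sketch lists which LDAP timeouts are actually set; the function name is hypothetical and the default path is an assumption based on the file mentioned earlier.

```shell
# List every ldap_*timeout option set in an sssd.conf-style file.
# Defaults to /etc/sssd/sssd.conf when no file is given.
show_ldap_timeouts() {
    grep -E '^[[:space:]]*ldap_[a-z_]*timeout[[:space:]]*=' "${1:-/etc/sssd/sssd.conf}"
}

# Usage:
#   show_ldap_timeouts            # inspect the live configuration
#   systemctl restart sssd        # then apply the changes
```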
And this worked! Remembering that we had recently installed a new switch and configured jumbo frames, and that several of the hosts were exhibiting abnormally slow networking, I managed to discover that I was experiencing an MTU problem:
You have the symptoms of an MTU problem: some TCP connections freeze, more or less reproducibly for a given command or URL but with no easily discernible overall pattern. A telltale symptom is that interactive ssh sessions work well but file transfers almost always fail.
In my case, completely unrelated to SSSD, there were a few hosts where the SSH connection was more or less fine until I started vim or top. Using tracepath I confirmed that those hosts were connected to ports on the switch that were incorrectly configured with the default MTU of 1500 bytes, but whose own links were brought up with jumbo frames. After reconfiguring those few ports on the switch, my networking was much more reliable and SSSD was once again able to resolve users in LDAP, and quickly!
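For anyone who wants to check for the same mismatch, here is a minimal sketch of an MTU probe using ping with fragmentation prohibited, alongside tracepath. It assumes Linux iputils ping (the -M do and -s flags), IPv4 without IP options, and a jumbo-frame MTU of 9000 bytes; the host name is a placeholder and the helper names are made up for illustration.

```shell
# IPv4 header (20 bytes) + ICMP echo header (8 bytes), assuming no IP options.
ICMP_OVERHEAD=28

# Largest ICMP echo payload that fits in one unfragmented IPv4 packet of the given MTU.
icmp_payload_for_mtu() {
    echo $(( $1 - ICMP_OVERHEAD ))
}

# Send three pings with fragmentation prohibited, sized to fill the given MTU.
probe_mtu() {
    local host=$1 mtu=$2
    ping -c 3 -M do -s "$(icmp_payload_for_mtu "$mtu")" "$host"
}

# Usage (ldap.example.org is a placeholder):
#   probe_mtu ldap.example.org 1500   # should pass on almost any path
#   probe_mtu ldap.example.org 9000   # errors or packet loss if a hop cannot carry jumbo frames
#   tracepath ldap.example.org        # reports the path MTU it discovers along the way
```

If the 1500-byte probe succeeds while the jumbo-sized one fails between two hosts that are both configured for jumbo frames, something in between (here, the switch ports) is a likely suspect.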
In hindsight I should have paid more attention to the other error messages in the SSSD domain log. It had been years since I last had to look at this setup, and I was so focused on this being an obscure TLS issue or an SSSD bug (version 1.16.5-10.el7_9.7) that I didn’t consider that the problem might come from elsewhere. Just because the SSSD service fails with a TLS error doesn’t necessarily mean that there is a cryptographic error in the TLS configuration. In this case the TLS connection was timing out because of a network misconfiguration!