A Python/Synapse Debugging Nightmare

Posted 2019-04-14 01:33:35 UTC

Why dynamic types are scary


Synapse is slow

Synapse is the reference implementation of a Matrix homeserver. It's also currently the only functional implementation. Unfortunately, it has terrible performance, and is difficult to configure. There are many factors contributing to it's performance issues and instability, but I think I might have discovered a new one, at least for our particular configuration.

flamegraph showing lots of time spent in getaddrinfo

It turns out that twisted, the networking library that synapse uses, defaults to doing hostname lookups by calling socket.getaddrinfo from the python standard library. Importantly, this function blocks until resolution has either completed successfully or failed, potentially by timing out. On linux, this timeout can be up configured up to 30 seconds. This is a serious problem for an asynchronous networking library, so twisted does improve on this situation a bit by calling getaddrinfo inside a thread pool. Unfortunately, python has the famous GIL, so this doesn't mean all that much. Fortunately, twisted also comes with an asynchronous DNS resolver in twisted.names.client.

Attempted solution

How hard could it be to swap out the default resolver for the asynchronous one? It turns out, not very:

diff --git a/synapse/app/_base.py b/synapse/app/_base.py
index d4c6c4c8e..e72b4f472 100644
--- a/synapse/app/_base.py
+++ b/synapse/app/_base.py
@@ -22,6 +22,7 @@ import traceback
 import psutil
 from daemonize import Daemonize

+import twisted.names
 from twisted.internet import error, reactor
 from twisted.protocols.tls import TLSMemoryBIOFactory

@@ -124,6 +125,10 @@ def start_reactor(
             change_resource_limit(soft_file_limit)
             if gc_thresholds:
                 gc.set_threshold(*gc_thresholds)
+
+            # change the default resolver to avoid blocking getaddrinfo calls
+            reactor.installResolver(twisted.names.client)
+
             reactor.run()

     if daemonize:

It took a while to figure out which two lines I had to add, but the solution in the end is simple enough. Sure enough, the profiler results are much more reasonable. I sent this off as a pull request on synapse and it immediately failed a bunch of tests.

flamegraph without getaddrinfo

More problems

Synapse has two different testing systems: the python unit tests and the black box integration tests run by a tool called sytest. This patch had passed all the unit tests. I didn't want to mess with perl package management, so I never actually ran sytest before submitting the pull request. This was a mistake, as the sytest run spits out a bunch of messages that look like this:

twisted.names.error.DNSNameError: <Message id=55641 rCode=3 maxSize=0 flags=answer,recDes,recAv queries=[Query('localhost.localdomain', 255, 1)] authority=[<RR name= type=SOA class=IN ttl=9917s auth=False>]>
twisted.internet.error.DNSLookupError: DNS lookup failed: no results for hostname lookup: localhost.localdomain.

My first thought here was "oh, the async resolver probably isn't using /etc/hosts, which is how localhost gets defined in the first place". Interestingly, twisted.names.client definitely talks about reading the /etc/hosts file by default in the documentation. I wrote up a simple test case to make debugging easier:

lookupRemote = client.getHostByName("example.com")
lookupRemote.addCallback(print)
lookupRemote.addErrback(print)

lookupLocal = client.getHostByName("localhost.localdomain")
lookupLocal.addCallback(print)
lookupLocal.addErrback(print)

reactor.run()
[Failure instance: Traceback (failure with no frames): <class 'twisted.names.error.DNSNameError'>: <Message id=17073 rCode=3 maxSize=0 flags=answer,recDes,recAv queries=[Query('localhost.localdomain', 255, 1)] authority=[<RR name= type=SOA class=IN ttl=4543s auth=False>]> ]
2606:2800:220:1:248:1893:25c8:1946

This looks promising. The actual DNS lookup returns an IP address successfully, but the lookup from /etc/hosts fails.

Wow, everything is broken!

The implementation of name resolution from /etc/hosts in twisted ends up in the function searchFileFor in twisted/names/hosts.py. This function calls searchFileForAll in the same file. The implementation of both of these is shown below.

def searchFileForAll(hostsFile, name):
    """
    Search the given file, which is in hosts(5) standard format, for an address
    entry with a given name.
    @param hostsFile: The name of the hosts(5)-format file to search.
    @type hostsFile: L{FilePath}
    @param name: The name to search for.
    @type name: C{str}
    @return: L{None} if the name is not found in the file, otherwise a
        C{str} giving the address in the file associated with the name.
    """
    results = []
    try:
        lines = hostsFile.getContent().splitlines()
    except:
        return results

    name = name.lower()
    for line in lines:
        idx = line.find(b'#')
        if idx != -1:
            line = line[:idx]
        if not line:
            continue
        parts = line.split()

        if name.lower() in [s.lower() for s in parts[1:]]:
            results.append(nativeString(parts[0]))
    return results



def searchFileFor(file, name):
    """
    Grep given file, which is in hosts(5) standard format, for an address
    entry with a given name.
    @param file: The name of the hosts(5)-format file to search.
    @param name: The name to search for.
    @type name: C{str}
    @return: L{None} if the name is not found in the file, otherwise a
        C{str} giving the address in the file associated with the name.
    """
    addresses = searchFileForAll(FilePath(file), name)
    if addresses:
        return addresses[0]
    return None

As expected, searchFileFor("/etc/hosts", "localhost.localdomain") returns None. Besides the questionable decision to silently return [] when reading the hosts file fails, there's a very subtle bug here. Let's try a variant:

lines = hostsFile.getContent().splitlines()
for line in lines:
    parts = line.split()
    print([s.lower() for s in parts[1:]])
[]
[b'/etc/hosts:', b'static', b'lookup', b'table', b'for', b'host', b'names']
[]
[]
[b'<hostname.domain.org>', b'<hostname>']
[b'localhost.localdomain', b'localhost']
[b'localhost.localdomain', b'localhost']
[b'servy.localdomain', b'servy']
[]
[b'end', b'of', b'file']
>>> "localhost.localdomain" == b"localhost.localdomain"
False

Oh.

Note that the docstring is wrong. searchFileFor and searchFileForAll do not accept str for name. I'm going to assume this behavior is not intentional, and was presumably introduced when migrating the project from Python 2 to Python 3. This is because in Python 2, strings defaulted to ascii bytes, with u"string" being unicode. This was swapped in Python 3, for very good reasons.

This is why refactoring in python is scary. There is no way our puny human brains are going to be able to reason about this system.