Date: Mon, 6 Sep 2021 17:30:05 +0200
From: Martin Pieuchot 
Subject: Analyse of kernel lock contention

Unlocking UVM faults decreases build time a lot and improves the
overall latency of mixed userland workloads.  In other words it gives
a smoother feeling for "desktop usage": it is now possible to do 'make
-j17' and watch an HD video at the same time.

So what next?  The 4 Flamegraphs below were captured with patrick@
during the weekend.  We used his desktop 16-core arm64 machine with
amdgpu(4).  They all include the UVM unlocking diff and one also
includes the poll(2)/select(2) diff + unlocked sowakeup().  Web browsing
has been performed with iridium.

1) make-j17_arm64.svg

Building a kernel with 17 jobs is hard and only 30% of CPU time is spent
in userland.

  - Overall spinning time is ~40% (18% on KERNEL_LOCK(), 10% on
    SCHED_LOCK(), 12% on UVM's pageqlock)

    . The UVM unlocking diff made the contention shift from the
      KERNEL_LOCK() to the global pageqlock and per-amap rwlock.
      Due to the high contention on shared amaps in this workload,
      many threads go to sleep at the same time, which makes some
      contention appear on the SCHED_LOCK().

    . The SCHED_LOCK() is not *yet* a problem.  What is happening here
      shows that our rwlock implementation relying on a global sleep
      queue is suboptimal.  However in UVM's `vmobjlock' case we should
      hopefully be able to turn many of the existing write locks into
      read locks.  NetBSD is already doing that and this should be good
      enough to prevent some threads from going to sleep, thus avoiding
      SCHED_LOCK() (or any global lock for the sleep queue) contention.

    . contention on the pageqlock could be reduced by revisiting/adding
      per UVM page locking

  - 10% of CPU time is spent idle.  It is hard to say how much of this
    is because of the scheduler and/or its interaction with high
    spinning time.  However it is worth investigating.

  - Syscalls that need the KERNEL_LOCK() for this workload fall into 2
    categories:

    . UVM ones that could be unlocked as part of a UVM next step:
       execve(2), fork(2), kbind(2), mmap(2), munmap(2), mprotect(2)

    . FS ones where the KERNEL_LOCK() could be pushed down to the VFS
      layer similarly to what has already been done for read(2) & write(2):
         dofsstatat(2), doopenat(2), __realpath(2), ioctl(2)

2) 2ytHD+make-j17_arm64.svg

The goal of this test was to generate enough workload to not have idle
CPUs and to expose where the contention is with a "desktop" usage.
Almost the same amount of CPU time is spent in userland, ~30-35%, which
gives us an indication that the OpenBSD kernel isn't yet scaling to 16
CPUs for such a use case.

  - Overall spinning time is also ~40% but with a different distribution
    (30% on KERNEL_LOCK(), 2% on SCHED_LOCK(), 8% on UVM's pageqlock).

  - syscalls that need the KERNEL_LOCK() for this workload are the same
    as above (for obvious reasons) but the following are, IMHO, the most
    important ones:

    . The kernel lock spinning time in futex(2) is there because sleeping
      with PCATCH still requires it.

    . pipe, unix and network sockets all use selwakeup() and spin there
      because poll(2) & select(2) still need it.


  - With the kqpoll diff (2ytHD+make-j17+kqpoll_unlocked_arm64.svg) the
    contention in sowakeup() disappears; the one in pipeselwakeup() could
    receive the same treatment.

3) 2ytHD+googlemap_arm64.svg

The intent of this test is to expose where the contention is for a
heavily multi-threaded process workload.  We didn't care much about idle
time; it is much more about low latency, i.e. how "smooth" desktop apps
can run; in other words, what happens in the kernel.

  - UVM fault unlocking is "good enough" for such a workload and all the
    contention is due to syscalls.

  - If we look at time spent in kernel, 37% is spent spinning on the
    KERNEL_LOCK() and 12% on the SCHED_LOCK().  So almost half of %sys
    time is spinning.

    . futex(2) for FUTEX_WAIT exposes most of it.  It spins on the
      KERNEL_LOCK() because sleeping with PCATCH requires it, then it
      spins on the SCHED_LOCK() to put itself on the sleep queue.

    . kevent(2), poll(2), and DRM ioctl(2) are responsible for a lot
      of KERNEL_LOCK() contention in this workload.

    . NET_LOCK() contention in poll(2) and kqueue(2) generates a lot of
      sleeps which, together with a lot of futex(2), makes the
      SCHED_LOCK() contention bad.


Unlocking UVM faults is the obvious next step and we are not finished
with that yet.

Making poll(2) & select(2) work on top of the kqueue subsystem will
allow us to unlock selwakeup() & friends.  This will also help for
workloads with network traffic going to userland (server, proxy, etc).  
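
For readers less familiar with the path discussed here, this is what it
looks like from userland.  A minimal sketch, assuming nothing beyond
POSIX pipe(2) and poll(2); the function name is made up.  It polls with
a timeout of 0 so nothing actually sleeps, but in the blocking case the
write below is the event that has to wake up the poller, which is the
kernel-side notification the flamegraphs show spinning:

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/*
 * Returns the number of ready descriptors reported by a non-blocking
 * poll(2) on the read end of a pipe, optionally after writing one
 * byte to the write end first.  Returns -1 if the pipe can't be made.
 */
int
pipe_readable_count(int write_first)
{
	struct pollfd pfd;
	int fds[2], nready;
	char byte = 'x';

	if (pipe(fds) == -1)
		return -1;

	if (write_first) {
		/* This write is what makes the read end become ready. */
		ssize_t n = write(fds[1], &byte, 1);
		assert(n == 1);
	}

	pfd.fd = fds[0];
	pfd.events = POLLIN;
	pfd.revents = 0;
	nready = poll(&pfd, 1, 0);	/* timeout 0: report state, never sleep */

	close(fds[0]);
	close(fds[1]);
	return nready;
}
```

An empty pipe reports 0 ready descriptors; after one byte is written it
reports 1.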

Completely unlocking poll(2), select(2) and kqueue(2) will require
making rwsleep(9) w/ PCATCH work without KERNEL_LOCK().  This implies
making signals work w/o KERNEL_LOCK().  This will also reduce the
contention in futex(2).

Unlocking UVM faults will make it easier to unlock many UVM-related
syscalls.  This will help for workloads that fork a lot.

Pushing the KERNEL_LOCK() at the VFS border in all other syscalls
that matter can already be done and should already help, so I see no
reason to wait.
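
What "pushing the lock down" means can be shown with a toy model.  This
is a single-threaded sketch, not kernel code: the mutex is a stand-in
for the KERNEL_LOCK(), and the three syscall phases and all function
names are hypothetical.  Which phases are really safe to run unlocked
depends on the actual code paths.  The model only counts how many steps
of a syscall run with the big lock held before and after the change:

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for the kernel lock; not the real KERNEL_LOCK(). */
static pthread_mutex_t giant = PTHREAD_MUTEX_INITIALIZER;
static int giant_held;		/* 1 while held (single-threaded model) */
static int steps_under_giant;	/* steps executed with the lock held */

static void
giant_lock(void)
{
	pthread_mutex_lock(&giant);
	giant_held = 1;
}

static void
giant_unlock(void)
{
	giant_held = 0;
	pthread_mutex_unlock(&giant);
}

/* One unit of work; counts itself if it ran under the big lock. */
static void
step(void)
{
	if (giant_held)
		steps_under_giant++;
}

/* Hypothetical syscall phases: argument handling, file lookup, FS work. */
static void copy_in_args(void) { step(); }
static void lookup_file(void) { step(); }
static void fs_operation(void) { step(); }

/* Before: the whole syscall runs under the big lock. */
int
syscall_fully_locked(void)
{
	steps_under_giant = 0;
	giant_lock();
	copy_in_args();
	lookup_file();
	fs_operation();
	giant_unlock();
	return steps_under_giant;
}

/* After: the lock is pushed down and only wraps the FS layer call. */
int
syscall_lock_pushed_down(void)
{
	steps_under_giant = 0;
	copy_in_args();
	lookup_file();
	giant_lock();
	fs_operation();
	giant_unlock();
	return steps_under_giant;
}
```

In this model the fully locked variant runs 3 steps under the lock and
the pushed-down variant runs 1; the shorter the locked section, the less
time other CPUs spend spinning on it.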