Saturday, September 27, 2008

Improving boot time on a general Linux distribution, not an easy task

We have just released Mandriva Linux 2009 RC2, with GNOME 2.24 final among the new features ;) and reduced boot time on a lot of systems.

I thought it would be interesting to explain the various things we tried in order to save some seconds when booting, since it is a hot topic these days, with impressive results from various people, including Arjan van de Ven's 5s boot on an EEE 901 PC. I don't agree with all of Arjan's conclusions, mostly because it is not always possible to achieve the same kind of tuning with a flexible distribution which can run on many hardware platforms, in contrast to a stripped-down installation on a single (and now underpowered) hardware platform. (Unfortunately, Mandriva folks couldn't attend LPC this year, because we were busy working on the Mandriva 2009 release; let's hope next year's LPC schedule won't conflict with our own.)

Before continuing, note that "boot time" covers three different aspects (and timings):
- full boot time : from kernel startup to text login being available
- perceived boot time : from kernel startup to graphical login being available (you'll understand why I separate both later ;)
- graphical desktop startup time : from graphical login to desktop environment up and running (all apps from the session running and correctly rendered)

Over the years, at Mandriva, we have worked on improving boot time without causing regressions in our distribution:

  • In 2002 (yes, 6 years ago), we started reducing perceived boot time by starting the X server (and display manager) before the entire boot was completed. This was the beginning of "parallel boot" and it gave good results after some tuning (you must be sure all services needed by the display manager have already been started when it starts).
  • In 2006, Couriousous (from the Mandriva community) developed a parallel init implementation, called PrcSys, which relies on the LSB headers of initscripts to handle dependencies across services. This created a virtuous circle: by ensuring initscripts were LSB compliant, we got parallel init support as a bonus. When done properly, it reduces full boot time by up to 12s (compared to standard boot); the reduction in perceived boot time is often not as big.
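To give an idea of how LSB headers can drive a parallel init, here is a minimal Python sketch. It is not the PrcSys code: the function names and the sample service names in the usage note are made up for illustration, and only the "Provides" and "Required-Start" keywords are read. Services are grouped into "waves"; everything in one wave has its dependencies satisfied and could be started in parallel.

```python
import re


def parse_lsb_header(script_text):
    """Extract the Provides and Required-Start fields from an LSB init info block."""
    info = {"Provides": [], "Required-Start": []}
    in_block = False
    for line in script_text.splitlines():
        if line.startswith("### BEGIN INIT INFO"):
            in_block = True
        elif line.startswith("### END INIT INFO"):
            break
        elif in_block:
            match = re.match(r"#\s*([\w-]+):\s*(.*)", line)
            if match and match.group(1) in info:
                info[match.group(1)] = match.group(2).split()
    return info


def dependency_graph(scripts):
    """Map each provided facility to the set of facilities it requires."""
    graph = {}
    for text in scripts.values():
        info = parse_lsb_header(text)
        for name in info["Provides"]:
            graph[name] = set(info["Required-Start"])
    return graph


def start_waves(graph):
    """Group services into waves; services in the same wave can start in parallel."""
    remaining = {name: set(deps) for name, deps in graph.items()}
    started, waves = set(), []
    while remaining:
        ready = sorted(n for n, deps in remaining.items() if deps <= started)
        if not ready:
            # Nothing can start: a cycle, or a dependency nobody provides.
            raise ValueError("unresolvable dependencies: %s" % sorted(remaining))
        waves.append(ready)
        started.update(ready)
        for name in ready:
            del remaining[name]
    return waves
```

With four hypothetical scripts where "messagebus" and "network" both require "syslog" and "dm" requires "messagebus", start_waves() returns three waves, and the two middle services share one. A real implementation would also have to handle the LSB virtual facilities (names starting with "$") and the other header keywords.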

For Mandriva Linux 2009.0, we decided to try to improve boot time again (mostly full boot time and perceived boot time) by fixing the bottlenecks we found after testing many different systems.

The first fix was to no longer wait for the network to be up before starting dbus or the display manager when user authentication does not use the network (LDAP / NIS / Samba). Gain: 1s in perceived boot time.

Then, we looked into the "udev is slow" complaint. Despite what most people think, udev by itself is not slow. What is usually slow is "coldplug", i.e. ensuring all modules for the hardware on the system are loaded at startup and waiting for those modules to settle.

After deciphering strace logs, we discovered that about 256 legacy ptys were created by the kernel, which are no longer needed for most uses. With help from our kernel team, we reduced the default number of those ptys to 0 (it can be increased dynamically). Gain in full boot time and perceived boot time: 2s.

We also had reports of "udev takes forever" when people had USB storage devices plugged into their system. We did some tests and it was adding about 5s to boot, mostly because of the "usb-storage" settle delay (which is 5s) being hit when udev coldplug starts. To reduce this, we now load the usb-storage module before udev is started if a USB mass storage device is detected, so that the 5s "usb-storage" settle delay runs in parallel with udev. Average gain: 3s (there is still a penalty of about 2s when USB mass storage is plugged in, but we can't really do anything about it ATM).
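The arithmetic behind those numbers can be written down as a tiny Python sketch. This is only a simplified model of my own (the function name and the idea of a single "head start" value are assumptions, not something we measured as such): the part of the settle delay that overlaps with other boot work is free, and only the remainder is paid.

```python
def remaining_penalty(settle_delay, head_start):
    """Seconds still spent waiting for usb-storage to settle.

    settle_delay: how long usb-storage waits before scanning (5s here).
    head_start:   how long before the waiting point the module was loaded
                  (illustrative value, chosen to match the averages above).
    """
    return max(0.0, settle_delay - head_start)
```

With a 5s settle delay and roughly 3s of head start, remaining_penalty(5.0, 3.0) gives the 2s we still observe; with no head start at all, the full 5s is paid.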

We also found some hardware-specific issues (Asus EEE 701 and also a Core2 Duo laptop) where the initrd stage was extremely long. After investigation, it turned out our installer was adding usb-storage support to the initrd for those platforms, even when it was not needed. And since the initrd was waiting for USB devices to "settle", we were losing between 6 and 15s. Yes, 15s on an EEE PC! (BTW, you can check whether you have the issue on your Mandriva system by looking in /etc/modprobe.conf for the scsi_hostadapter line; if modprobe usb-storage is there, remove the call and regenerate your initrd.)
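If you prefer a script to eyeballing the file, here is a small Python sketch of that check. The function name is made up, and since the exact shape of the line can differ between installs, the matching is deliberately loose: any non-comment line mentioning both scsi_hostadapter and usb-storage is reported.

```python
import re


def find_usb_storage_hostadapter(conf_text):
    """Return (line number, line) pairs that tie scsi_hostadapter to usb-storage."""
    hits = []
    for number, line in enumerate(conf_text.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # ignore commented-out lines
        if "scsi_hostadapter" in stripped and re.search(r"usb[-_]storage", stripped):
            hits.append((number, stripped))
    return hits
```

Feed it the content of /etc/modprobe.conf (for example open("/etc/modprobe.conf").read()); an empty list means there is nothing to clean up. The script only reports: editing the file and regenerating the initrd is still up to you.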

Another issue was floppy support (you know, those old plastic squares). The floppy module was still being loaded by default on all systems at startup, even when no floppy drive was present, delaying the entire boot by about 2s in that case.

Unfortunately, not all our experiments were successful in reducing udev startup time: since coldplug causes a lot of modprobe calls and modprobe is not very smart (it parses the kernel aliases on each call), blino wrote a modprobe daemon to try to reduce those calls, but it didn't give us any improvement (and sometimes we even got regressions). Same result when using "modprobe --all" instead of a separate modprobe call for each module.

Still on the subject of module loading, we tried to reduce such modprobe calls in initscripts (it is better to handle this in the modprobe configuration file). The alsa startup script got fixed (it is much less costly now than on 2008.1 and we will probably nuke it completely in 2009 Spring, by only using udev to handle alsa support), as well as the iptables script (it was doing a lot of modprobe calls for optional features which were not enabled in 99% of cases). This saved 3 to 5s of work, but since those scripts were run in parallel, it didn't really reduce full boot time; at least the CPU is no longer spending precious cycles doing useless work.

Then, we checked what was spending too much time in the early part of boot (preventing display manager startup) and we found two bottlenecks, harddrake and dkms:
  • harddrake is our hardware autoconfiguration tool. It reconfigures your system on the fly if your hardware has changed since the previous boot; likewise, if you change kernel version and no proprietary driver is available, harddrake reconfigures X to use the free driver. It was quite slow and, since it was blocking display manager start (you want to be sure X is properly configured), it was directly impacting perceived boot time.
  • dkms (integrated in our distro since 2004) handles automatic rebuilding of kernel modules, mostly for drivers not included in the Mandriva kernel or for proprietary modules. Unfortunately, even when modules were correctly built for the current kernel, the dkms script was still very slow, even for precompiled dkms modules (in that particular case, it was clearly a bug, since the dkms script was not needed at all).
For those two bottlenecks, we discovered they ran much faster when we timed their execution after boot (after ensuring disk caches were flushed). They were badly impacted by parallel init. So, we moved dkms and harddrake startup out of parallel init and into rc.sysinit, and we were able to gain 3s for harddrake and 2 to 4s for dkms.

So far, we got good results, but you might wonder why I didn't talk about readahead, since it is used on other distros. Well, we did experiment with readahead in the past, and each time, we had regressions in both full boot time and perceived boot time. Why? Because parallel init is already doing a pretty good job, and readahead was not causing regressions when we disabled parallel init.
When using the default readahead setup, additional IO was done while other services were also trying to start, causing bad performance. And even ensuring readahead was started before all other services caused regressions in boot time.

Does this mean readahead is a dead end? Not really, when you look at a bootchart closely. The idea was to find time slots where no IO is done and run readahead at that time (thanks to Arjan's idea from his 5s boot talk). And there is a big slot with low IO usage: udev coldplug! The first trial was to start readahead very early in the boot, but it was still causing a boot time regression. Then, we tried to call readahead directly in the start_udev script, just after coldplug is initiated, before all udev triggers are settled. And things started to improve (yay!).

So, we tried a "custom" readahead list (not our default stripped list of files but the real list of files used on the test system). And then, the regression came back. Back to the drawing board. Discussion with other colleagues (and hints from Arjan) led us to try something else: scheduling readahead IO as idle, to make sure readahead does not impact other processes when the readahead file list is large. And guess what? It worked! No more regression in boot time and, even better, improved boot time, both full and perceived, and we were able to move the readahead call back to early in the boot process, before starting udev. But since the test was only done on one system, we checked that this change also worked on a lot of different systems, with powerful or low-end CPUs, slow hard drives, fast hard drives and SSDs (which are still quite slow these days). And the results were quite positive: we never had any regression in boot time. Either timings were the same (the EEE 701 is a good example: its CPU is always at 100%, so IO is not the bottleneck), or both full and perceived boot time were improved. On my home system, I got a 2s improvement in both full and perceived boot time.
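For the curious, here is a rough Python sketch of the two ingredients, not the actual readahead code we ship. The function names are made up. It assumes the ionice tool from util-linux is available (its class 3 is the "idle" IO class) and uses posix_fadvise() to ask the kernel to pull files into the page cache, falling back to plain reads where that call is missing.

```python
import os


def idle_io_command(command):
    """Prefix a command so that it runs in the idle IO scheduling class."""
    return ["ionice", "-c", "3"] + list(command)


def prefetch(paths):
    """Hint the kernel to cache each file; return how many files were hinted."""
    hinted = 0
    for path in paths:
        try:
            fd = os.open(path, os.O_RDONLY)
        except OSError:
            continue  # files on the list may have disappeared since it was built
        try:
            if hasattr(os, "posix_fadvise"):
                # Length 0 means "up to the end of the file".
                os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_WILLNEED)
            else:
                while os.read(fd, 1 << 20):
                    pass  # fallback: read the file to warm the cache
            hinted += 1
        except OSError:
            pass  # skip files that cannot be hinted
        finally:
            os.close(fd)
    return hinted
```

The prefetching itself is cheap to write; the point of the paragraph above is the wrapper: whatever does the prefetching should be launched through something like idle_io_command(), so its IO only gets served when nobody else needs the disk.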

So, we plugged in IO idle readahead (for testers: don't look for it in Mdv 2009 RC2, it was not part of it), as well as automatic readahead file list creation (based on work from Fedora folks). What does it mean? On first boot, readahead will not improve boot time but will instead monitor which files are used during boot. Then, the list will be optimized based on the storage device and will be used on the second boot. Moreover, this list will be refreshed automatically every month (after a reboot, of course), to ensure the optimisations are still relevant to the system.
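The decision taken at each boot boils down to something like the following Python sketch. It is only an illustration: the function name and the "collect" / "replay" labels are mine, and "every month" is approximated as 30 days.

```python
REFRESH_SECONDS = 30 * 24 * 3600  # "every month", approximated as 30 days (assumption)


def boot_mode(list_mtime, now):
    """Decide what readahead does on this boot.

    list_mtime: timestamp of the existing file list, or None if there is none yet.
    now:        current timestamp, in the same unit (seconds).
    Returns "collect" (monitor files, no speed-up) or "replay" (use the list).
    """
    if list_mtime is None or now - list_mtime >= REFRESH_SECONDS:
        return "collect"
    return "replay"
```

So the very first boot, and the first boot after the list has grown older than a month, are spent collecting; every other boot replays the existing list.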

So far, so good, but what about desktop login timing? Kudos to Behdad for his work on the preload daemon: it preloads files used at login by the desktop environment, using the idle time in the display manager (gdm / kdm) while it waits for the user to type his login and password. This daemon monitors the system to learn which programs are being used at login and preloads them automatically. This is great because it is not based on a static file list (difficult to maintain for a desktop-agnostic distro like ours) and doesn't require a specific mode for file monitoring, like readahead does. And even better, if the user changes his habits (switching from KDE to GNOME for instance), after several logins, preload will preload GNOME files instead of KDE files. Better still, if autologin is enabled (so there is no idle time to preload in), there is no regression in desktop login time, since the preload IO is done as idle. According to our measurements, preload gave us about 5s improvement in desktop login.

In conclusion, as you can see, improving boot time is not an easy task, but we worked hard to improve it for the upcoming Mandriva 2009. It requires a lot of measurements (thanks, bootchart), a lot of experiments and a wide range of systems (you can find some of the bootcharts used during our tests here).