Saturday, 27 September 2008

Improving boot time on a general Linux distribution, not an easy task

We have just released Mandriva Linux 2009 RC2 (with the final version of GNOME 2.24, among other new features ;), which also reduces boot time on a lot of systems.

I thought it would be interesting to explain the various things we tried in order to save some seconds when booting, since it is a hot topic these days, with impressive results from various people, including Arjan van de Ven's 5s boot on an EEE PC 901. I don't agree with all of Arjan's conclusions, mostly because it is not always possible to achieve the same kind of tuning on a flexible distribution that must run on many hardware platforms, as opposed to a stripped-down installation on a single (and now underpowered) hardware platform. (Unfortunately, Mandriva folks couldn't attend LPC this year, because we were busy working on the Mandriva 2009 release; let's hope next year's LPC schedule won't conflict with our own.)

Before continuing, note that "boot time" covers three different aspects (and timings):
- full boot time: from kernel startup to text login being available
- perceived boot time: from kernel startup to graphical login being available (you'll understand later why I separate the two ;)
- graphical desktop startup time: from graphical login to the desktop environment being up and running (all apps from the session running and correctly rendered)

Over the years, at Mandriva, we have worked on improving boot time without causing regressions in our distribution:

  • In 2002 (yes, 6 years ago), we started reducing perceived boot time by starting the X server (and display manager) before the entire boot was completed. This was the beginning of "parallel boot" and it gave good results after some tuning (you must be sure all services needed by the display manager have already been started when it starts).
  • In 2006, Couriousous (from the Mandriva community) developed a parallel init implementation, called PrcSys, which relies on initscript LSB headers to handle dependencies between services. This created a virtuous circle: by ensuring initscripts were LSB compliant, we got parallel init support as a bonus. When done properly, it reduces full boot time by up to 12s (compared to a standard boot); the reduction in perceived boot time is often not as big.
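To illustrate how LSB headers can drive a parallel boot, here is a minimal Python sketch. It is not PrcSys code: the function names, the sample scripts and the rule that unknown facilities (such as `$local_fs`) count as already available are all assumptions made for the example. It reads the `Provides` and `Required-Start` fields and groups services into "waves" whose members could be started at the same time.

```python
import re


def parse_lsb_header(script_text):
    """Extract Provides / Required-Start from the LSB comment block of an initscript."""
    info = {"Provides": [], "Required-Start": []}
    in_block = False
    for line in script_text.splitlines():
        if line.startswith("### BEGIN INIT INFO"):
            in_block = True
        elif line.startswith("### END INIT INFO"):
            break
        elif in_block:
            match = re.match(r"#\s*([\w-]+):\s*(.*)", line)
            if match and match.group(1) in info:
                info[match.group(1)] = match.group(2).split()
    return info


def start_waves(scripts):
    """Group services into waves; all services in one wave can start in parallel.

    `scripts` maps a script name to its text. Facilities that no script
    provides are assumed (hypothetically) to be available already.
    """
    provided_by, needs = {}, {}
    for name, text in scripts.items():
        info = parse_lsb_header(text)
        for facility in info["Provides"] or [name]:
            provided_by[facility] = name
        needs[name] = info["Required-Start"]

    deps = {
        name: {provided_by[f] for f in required
               if f in provided_by and provided_by[f] != name}
        for name, required in needs.items()
    }

    waves, done = [], set()
    while len(done) < len(deps):
        ready = sorted(n for n in deps if n not in done and deps[n] <= done)
        if not ready:
            raise ValueError("dependency cycle among: %s" % sorted(set(deps) - done))
        waves.append(ready)
        done.update(ready)
    return waves


def _header(provides, required):
    # Helper that builds a made-up initscript header for the example.
    return ("### BEGIN INIT INFO\n# Provides: %s\n# Required-Start: %s\n"
            "### END INIT INFO\n" % (provides, required))


EXAMPLE_SCRIPTS = {
    "messagebus": _header("messagebus", "$local_fs"),
    "network": _header("$network", "$local_fs"),
    "dm": _header("dm", "messagebus"),
    "ntpd": _header("ntpd", "$network"),
}

if __name__ == "__main__":
    for number, wave in enumerate(start_waves(EXAMPLE_SCRIPTS), start=1):
        print("wave %d: %s" % (number, ", ".join(wave)))
```

In this made-up example the display manager only waits for messagebus, not for the network, so it lands in the second wave together with ntpd.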

For Mandriva Linux 2009.0, we decided to try to improve boot time again (mostly full boot time and perceived boot time) by testing many different systems and fixing the bottlenecks we found.

The first fix was to stop waiting for the network to be up before starting dbus or the display manager when user authentication does not rely on the network (LDAP / NIS / Samba). Gain: 1s in perceived boot time.

Then, we looked into the "udev is slow" complaint. Despite what most people think, udev by itself is not slow. What is usually slow is "coldplug", i.e. ensuring all modules for the hardware present on the system are loaded at startup, and waiting for those modules to settle.

After deciphering strace logs, we discovered that the kernel was creating about 256 legacy ptys, which are no longer needed for most uses. With help from our kernel team, we reduced the default number of those ptys to 0 (it can be increased dynamically). Gain in both full boot time and perceived boot time: 2s.

We also had reports of "udev takes forever" when people had USB storage devices plugged into their system. We did some tests and it was adding about 5s to boot, mostly because of the "usb-storage" settle delay (which is 5s) when udev coldplug starts. To reduce this, we now load the usb-storage module before udev is started if a USB mass storage device is detected, so that the 5s "usb-storage" settle delay runs in parallel with udev. Average gain: 3s (there is still a penalty of about 2s when a USB mass storage device is plugged in, but we can't really do anything about it ATM).
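If you want to see which settle delay your own system uses, the following Python sketch reads it from sysfs. It assumes the delay discussed above corresponds to the `delay_use` parameter of the usb-storage module, exposed under `/sys/module/usb_storage/parameters/`; the function name and the `sysfs_root` argument (there only to make the function easy to test) are made up for the example.

```python
from pathlib import Path


def usb_storage_settle_delay(sysfs_root="/sys"):
    """Return usb-storage's delay_use value in seconds.

    Returns None when the module is not loaded (no such sysfs entry) or when
    the value is not a plain integer.
    """
    param = Path(sysfs_root) / "module" / "usb_storage" / "parameters" / "delay_use"
    try:
        return int(param.read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    delay = usb_storage_settle_delay()
    if delay is None:
        print("usb-storage does not appear to be loaded")
    else:
        print("usb-storage settle delay: %ds" % delay)
```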

We also found some hardware-specific issues (Asus EEE 701 and also a Core 2 Duo laptop) where the initrd stage was extremely long. After investigation, it turned out our installer was adding usb-storage support to the initrd on those platforms, even when it was not needed. And since the initrd was waiting for USB devices to "settle", we were losing between 6 and 15s. Yes, 15s on an EEE PC! (BTW, you can check whether you have the issue on your Mandriva system by looking for the scsi_hostadapter line in /etc/modprobe.conf; if modprobe usb-storage is there, remove the call and regenerate your initrd.)
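The check described above can be automated with a few lines of Python. This is only a sketch: the exact form of the offending line is not given here, so the code simply reports any non-comment line that mentions both `scsi_hostadapter` and `usb-storage`, and the sample configuration in the usage note is invented.

```python
def usb_storage_hostadapter_lines(conf_text):
    """Return (line number, line) pairs tying scsi_hostadapter to usb-storage."""
    hits = []
    for number, line in enumerate(conf_text.splitlines(), start=1):
        active = line.split("#", 1)[0]  # ignore comments
        mentions_usb = "usb-storage" in active or "usb_storage" in active
        if "scsi_hostadapter" in active and mentions_usb:
            hits.append((number, line.strip()))
    return hits


def check_file(path="/etc/modprobe.conf"):
    """Convenience wrapper for a real file; returns [] if it cannot be read."""
    try:
        with open(path) as handle:
            return usb_storage_hostadapter_lines(handle.read())
    except OSError:
        return []
```

For example, calling `usb_storage_hostadapter_lines("install scsi_hostadapter /sbin/modprobe usb-storage; /bin/true\n")` reports line 1, while a file with no such line yields an empty list. The script only reports; removing the call and regenerating the initrd is still up to you.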

Another issue was floppy support (you know, those old plastic squares). The floppy module was still being loaded by default on all systems at startup, even when no floppy drive was present, delaying the entire boot by about 2s in that case.

Unfortunately, not all our experiments were successful in reducing udev startup time: since coldplug causes a lot of modprobe calls and modprobe is not very smart (it parses the kernel aliases on each call), blino wrote a modprobe daemon to try to reduce those calls, but it didn't give us any improvement (and sometimes we even got regressions). We got the same result when using "modprobe --all" instead of one modprobe call per module.


Still on the subject of module loading, we tried to reduce such modprobe calls in initscripts (it is better to handle this in the modprobe configuration file). The ALSA startup script got fixed (it is much less costly now than in 2008.1 and we will probably nuke it completely in 2009 Spring, by only using udev to handle ALSA support), as well as the iptables script (it was doing a lot of modprobe calls for optional features which were not enabled in 99% of cases). This gave a 3 to 5s gain, but since those scripts were run in parallel, it didn't really reduce full boot time; at least the CPU is no longer spending precious cycles doing useless work.

Then, we checked what was spending too much time in the early part of boot (delaying display manager startup) and we found two bottlenecks, harddrake and DKMS:
  • harddrake is our hardware autoconfiguration tool, which reconfigures your system on the fly if your hardware has changed since the previous boot; likewise, if you change kernel version and no proprietary driver is available, harddrake will reconfigure X to use the free driver. It was quite slow, and since it was blocking display manager start (you want to be sure X is properly configured), it was directly impacting perceived boot time.
  • dkms (integrated in our distro since 2004) handles automatic kernel module rebuilds, mostly for drivers not included in the Mandriva kernel or for proprietary modules. Unfortunately, even when modules were correctly built for the current kernel, the dkms script was still very slow, even for precompiled dkms modules (in this particular case, it was clearly a bug, since the dkms script was not needed).
For those two bottlenecks, we discovered they ran much faster when we timed their execution after boot (after ensuring disk caches were flushed): they were badly impacted by parallel init. So, we moved dkms and harddrake startup out of parallel init into rc.sysinit, and we were able to gain 3s for harddrake and 2 to 4s for dkms.

So far, we got good results, but you might wonder why I haven't talked about readahead, since it is used on other distros. Well, we did experiment with readahead in the past, and each time we had regressions in both full boot time and perceived boot time. Why? Because parallel init is already doing a pretty good job: readahead was not causing regressions when we disabled parallel init.
With the default readahead setup, additional IO was being done while other services were also trying to start, causing bad performance. And even ensuring readahead was started before all other services caused regressions in boot time.

Does this mean readahead is a dead end? Not really, when you look closely at bootcharts. The idea was to find time slots where no IO was being done and run readahead at that time (thanks to Arjan's idea from his 5s boot talk). And there is a big slot with low IO usage: udev coldplug! Our first trial was to start readahead very early in the boot, but it was still causing a boot time regression. Then, we tried to call readahead directly in the start_udev script, just after coldplug is initiated, before all udev triggers are settled. And things started to improve (yay!).

So, we tried with a "custom" readahead list (not our default stripped-down list of files, but the real list of files used on the test system). And then the regression came back. Back to the drawing board. Discussions with other colleagues (and hints from Arjan) led us to try something else: scheduling readahead IO as idle, to make sure readahead does not impact other processes when the readahead file list is large. And guess what? It worked! No more regression in boot time and, even better, improved boot time, both full and perceived; we were also able to move the readahead call back to early in the boot process, before starting udev. But since this test was only done on one system, we checked that the change also worked on a lot of different systems, with powerful or low-end CPUs, slow hard drives, fast hard drives, and SSDs (which are still quite slow these days). The results were quite positive: we never had any regression in boot time. Either timings were the same (the EEE 701 is a good example: its CPU is always at 100%, so IO is not the bottleneck), or both full and perceived boot time were improved. On my home system, I got a 2s improvement in both full and perceived boot time.
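For readers who want to experiment with the "idle IO" idea, here is a minimal Python sketch. It is not the readahead tool used in Mandriva: `prefetch` and `idle_io_command` are names invented for the example, and the way the idle class is requested (wrapping a command with `ionice -c 3` from util-linux) is an assumption, since the text above does not say how it was done. `prefetch` only asks the kernel to pull files into the page cache.

```python
import os


def _hint_willneed(fd):
    """Tell the kernel we will need the whole file soon."""
    if hasattr(os, "posix_fadvise"):
        # offset 0 with length 0 means "from the start to the end of the file"
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_WILLNEED)
    else:
        # Fallback for platforms without posix_fadvise: just read the file.
        while os.read(fd, 1 << 20):
            pass


def prefetch(paths):
    """Queue each readable file for caching; return how many were queued."""
    queued = 0
    for path in paths:
        try:
            fd = os.open(path, os.O_RDONLY)
        except OSError:
            continue  # stale entry in the file list: skip it
        try:
            _hint_willneed(fd)
            queued += 1
        except OSError:
            pass  # the hint is best effort only
        finally:
            os.close(fd)
    return queued


def idle_io_command(argv):
    """Wrap a command so it runs in the idle IO scheduling class (ionice -c 3)."""
    return ["ionice", "-c", "3"] + list(argv)
```

A hypothetical prefetch script could then be launched with `subprocess.Popen(idle_io_command(["python3", "prefetch.py"]))`, so that its reads only happen when no other process needs the disk.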

So, we plugged in IO-idle readahead (for testers: don't look for it in Mdv 2009 RC2, it was not part of it), as well as automatic creation of the readahead file list (based on work from the Fedora folks). What does this mean? On first boot, readahead will not improve boot time but will instead monitor which files are used during boot. Then, the list will be optimized based on the storage device and used on the second boot. Moreover, this list will be refreshed automatically every month (after a reboot, of course), to ensure the optimisations are still relevant to the system.

So far, so good, but what about desktop login timing? Kudos to Behdad for his work on the preload daemon: it preloads files used at login by the desktop environment, using the idle time in the display manager (gdm / kdm) while it waits for the user to enter his login and password. This daemon monitors the system to learn which programs are used at login and preloads them automatically. This is great because it is not based on a static file list (difficult to maintain for a desktop-agnostic distro like ours) and, unlike readahead, doesn't require a specific file-monitoring mode. Even better, if the user changes his habits (switching from KDE to GNOME for instance), after several logins preload will preload GNOME files instead of KDE files. And if autologin is enabled (so there is no idle time to preload), there is no regression in desktop login time, since preload IO is done in idle. According to our measurements, preload gave us about a 5s improvement in desktop login.

In conclusion, as you can see, improving boot time is not an easy task, but we worked hard to improve it for the upcoming Mandriva 2009. It requires a lot of measurements (thanks, bootchart), a lot of experiments and a wide range of systems (you can find some of the bootcharts used during our tests here).

32 comments:

  1. I suppose you've looked at how XP handles this? And microsoft bootvis... in theory it learns and improves how to schedule things (I think the practice is a little different)

  2. Thanks for posting this. I don't do work that looks anything like this but I find it fascinating to hear the different approaches (and successes) in optimizing this specific case.

  3. Well, since nobody in the Free Software community has access to the XP source code (and nobody wants to), no, we didn't look at XP. I'm not sure their way of "optimizing" boot can be mapped onto Linux, since Linux boot is extremely flexible and not easily optimisable.

  4. The XP (and OSX 10.4) ways of doing boot optimisation have been looked at (at least in passing) in other distributions.

    XP does boot and application prefetching - http://www.microsoft.com/whdc/archive/benchmark.mspx . Vista can also cache hard disk sectors on USB keys (the so called ReadyBoost - http://en.wikipedia.org/wiki/ReadyBoost ). Both can also do defragging.

    The OSX optimisations mentioned on http://www.kernelthread.com/mac/apme/optimizations/ are roughly readahead, caching results of previous boots, defragging and keeping a cache of needed boot files near the fastest part of the hard disk.

  5. Have you found any regression in login time in GNOME 2.23.x/2.24, since there have been some significant changes? I logged http://bugzilla.gnome.org/show_bug.cgi?id=553959 but I have not pinned down exactly why it is the case.

    Should your results show some slowdown in login, this may be one of them.

    -Ghee

  6. There's another simple way, that isn't in your list.

    Implement a new flag for some low priority tasks/daemons, and run it after the user has logged in and started his software.

    I wrote a little script that, instead of loading services on boot, monitors /proc/stat to get iowait, and only loads services once iowait is non-existent (with a 1 sec delay between services).

    Services I start "late" : postfix, vixie-cron, mdadm, sshd, samba, netmount, ntp-client, apcupsd, smartd and 1-2 custom daemons.

    This should be even faster with readahead, as it allows 100% CPU + IO focus on getting the desktop environment + the applications the user loads right away (firefox/instant messenger etc.) while loading the daemons later.
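The iowait-based approach described in this comment could look roughly like the Python sketch below. The commenter's actual script is not shown, so every name, the threshold and the polling interval are hypothetical. The sketch samples the aggregate `cpu` line of `/proc/stat` twice, computes the share of time spent in iowait between the two samples, and returns once that share is low enough.

```python
import time


def parse_cpu_line(stat_text):
    """Return the counters of the aggregate 'cpu' line of /proc/stat as integers."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_fraction(before, after):
    """Share of CPU time spent waiting for IO between two samples.

    Counter order is: user nice system idle iowait irq softirq steal; later
    guest fields are ignored because they are already counted in user time.
    """
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    return deltas[4] / total if total > 0 else 0.0


def read_proc_stat():
    with open("/proc/stat") as handle:
        return handle.read()


def wait_for_quiet_io(read_stat=read_proc_stat, threshold=0.01,
                      interval=1.0, sleep=time.sleep):
    """Block until the iowait share over one interval drops to the threshold."""
    previous = parse_cpu_line(read_stat())
    while True:
        sleep(interval)
        current = parse_cpu_line(read_stat())
        if iowait_fraction(previous, current) <= threshold:
            return
        previous = current
```

A late-start script would call `wait_for_quiet_io()` before launching each deferred service, which also yields the delay of about one second between services mentioned above.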

  7. Regarding your usb-storage settle delay of 5s - I just discovered that my USB stick needs a settle delay of 15s (see http://bugzilla.kernel.org/show_bug.cgi?id=11640). This requires a kernel parameter at boot time. So please read the scsi_mod.inq_timeout parameter rather than assuming 5s!

  8. Nice Read. And what is now the boot time for a complete boot in Mandriva?

  9. Great work Fred, and nice to have a few words about the work done behind the doors by the Mandriva team.
    Congrats !

  10. - Regarding using flash for optimizing boot, even hard disk manufacturers selling hybrid drives recognized that this Vista "feature" is not working at all.
    - I'm not convinced by "defragmentation" speeding up boot, especially with ext3, and especially without any benchmarks.
    - no issue found in GNOME 2.24 startup
    - Anders, thanks for your idea, it might be interesting to add some additional custom headers to init scripts to "flag" such services and start them "later" using PrcSys. Do you have any figures showing how much you gained?
    - CkeekyGoat: don't worry, we haven't hardcoded nor changed any delay for USB storage (because we knew it could cause some problems). We just tried to make sure the USB settle delay was shared with other processes, so other processes could run until the dust settles on the USB bus.
    - Andre4s: well, it is extremely hardware dependent. On my favorite test box (P4 2.4GHz, average hard drive), full boot went from 29 to 26s and perceived boot from 27.5s to 21s (I'm writing this from memory, I don't have the figures available right now).

  11. What about allowing the user to remove the calls to modprobe from the initscripts?
    I always recompile my kernels with exactly my
    hardware specification compiled in, so I have no modules at all.
    There is no need to use modprobe at boot time
    in this case, right?

    So my suggestion would be to make it easy to remove the task of searching and loading modules from the initscripts. Is it possible?

  12. Carlos: while it might be doable for expert users, rebuilding a kernel for each hardware configuration is not doable automatically: it would require installing gcc / kernel-source on the user's system and rebuilding the kernel (of course, in idle background, it would take a looong time and a lot of free disk space). This would also depend on which hardware is plugged into the system at the time of the kernel configuration. Each kernel security update would restart the process. And if the hardware changed even a little, the kernel would have to be rebuilt.

    Frankly, I stopped building kernels for myself 8 years ago when I started working at Mandriva, because there were people more competent than me doing it and it was way faster. I'm not sure we should go back to "everybody rebuilds their own kernel".

  13. Have you tried booting with the "quiet" parameter? It can be an improvement on some machines when the kernel first starts (before udev).

    Thanks for your work! :-)

    Pacho

  14. I don't have any figures on how much I gained from tuning some services to start later, but it was noticeable and very easy to do. I had some daemons loading a fair bit of disk data during start, however, so my results might be a bit off. I don't know how many non-important daemons you start on a stock Mandriva system.

    Regarding USB pens for preloading, this can produce very good improvements in theory: as there's no seek overhead, one can load small files EXTREMELY fast. However, this isn't something I'd bother putting in a default setup, as very few people would use it.

    Also, I looked into the readahead daemons you mentioned a while back and decided not to use them: some people reported problems running applications with one active, and that its memory usage was severe. I wrote my own a while ago that I still haven't gotten around to using; it uses kernel auditing to monitor file access. The theory was that my readahead daemon would run and keep ahead of needed file access. I also wouldn't want preloading after I've started my main applications; I don't see any good reason for running a readahead daemon 20 seconds after the user logs into X, for example.

    Regarding modprobe, it might be possible to use insmod in some cases? I imagine quite a lot of modules could be loaded with that instead of modprobe, as aliases for, say, crypto algorithms are unimportant.
    Maybe something that calculates direct insmod statements everywhere where possible (can check if kernel args changed, and update cache when needed) and stuffs them in a script that's executed directly?

  15. Hi Frederic,

    I've been running Mandriva since 2009 beta2. I'm really impressed with the distro. Though I noticed in either yesterday or today's sync my boot became significantly slower. I switched to terminal and saw it was after mounting the root and it was waiting for loading init or something similar (sorry I don't remember the exact message) and had a bunch of ....s.

    If you'd like me to run through some tests for this to help track it down I'm more than happy to assist. Feel free to mail me skyphyr using email from gmail. Sorry for the bot avoiding wording of my mail addy.

    Cheers,

    Alan.

  16. It may be worth reconsidering an idea I had back in April of this year (while working for Mandriva, BTW):

    Add support of static file-lists to preload.

    The patch I sent to the ml was just a proof of concept, but it worked and the results were quite good, especially for small devices with a pre-configured environment and for big subsystems such as kde, gnome or openoffice.

    http://sourceforge.net/mailarchive/forum.php?forum_name=preload-devel&max_rows=25&style=ultimate&viewmonth=200804

  17. Strangely, nearly everything you mention is non-existent here. I do not have loads of autodetect mechanisms left and right because I know my system (like everyone who tunes his boot should) and configured everything hard/monolithic, except for parts that may be missing (like USB devices or disks).

    I also already had parallel booting and X starting early in Gentoo by default, through intelligent boot service dependency resolution.

    So I could not reduce my boot time by even one second. The slowest parts here are the bios, and services that I need absolutely.

  18. Pacho: no, I haven't. Could you try generating bootchart with "quiet" set and not set in order to get some data to compare ?

    Anders: well, we don't run that many services which are CPU intensive and slow down the boot by default. Moreover, anacron is configured not to start its remaining jobs immediately after boot. As for modprobe, it is called directly by udev coldplug, using module aliases, so replacing it with insmod is not really an option (and we would lose any options set in modprobe.conf(.d)). preload memory usage is quite low and it won't start on low-memory systems.

    Alan : you should upgrade to latest cooker and mail to cooker mailing list if you still have issues.

    Ademar : for preconfigured devices, it might be better to use readahead_later, rather than patching preload.

    Evi1M4chine: I'm glad your system is already optimized. However, I don't know how long you spent doing so, and many users don't want (nor have the knowledge) to do such optimization. A bootchart of your system would still be interesting ;)

  19. For folks unable to see the LWN link here's a link to Arjan's slides in PowerPoint format - http://www.fenrus.org/plumbers_fastboot.ppt .

  20. Fred: the problem with readahead_later is that you don't know which program the user uses most. The idea of my patch is quite simple, but it's not about the boot time, but about application startup instead:

    - preload is good at predicting what program the user runs more frequently;

    - preload kind of sux when predicting which files are used by a program, as the file mappings from /proc are only the files open at some point in time (programs usually open/close lots of files during startup and preload doesn't include them).

    My patch solves the second problem:
    - You (the distro vendor) create a list of files opened by several programs during their startups (using strace, there's a python script there);
    - Whenever preload decides it should preload a particular program (let's say, firefox-bin, oowriter-bin, kdeinit4, etc), you load not just this particular file, but all the files from the filelist associated to this program (falling back to standard behavior if there's no filelist associated).

    It's quite simple and works flawlessly, with optimal results. But it's a hack, of PoC quality. Too bad I didn't have time to work more on this and Behdad was not giving attention to preload at that time, so the idea went to /dev/null.

  21. Ademar: thanks for the detailed explanations. I guess now is a good time to ping Behdad again ;)

  22. What about loading a suspended-to-disk kernel image (or maybe even daemons, X)? This would be as fast as the HDD permits.

  23. Klemensas: you mean a hibernation image, not suspend, I guess. Well, that would require hibernation to work reliably, which is unfortunately still not the case (mostly because of Xorg, but there are also a lot of devices which aren't yet handled properly in the kernel). Moreover, it would slow down shutdown (unless you want to start with the same hibernation image every time). And from looking at various bootcharts, starting the kernel itself is not the bottleneck.

  24. Please install the package "prelink", which can improve start time of files formatted as ELF, by default.

    The kernel option "quiet" can suppress messages from kernel, maybe saving time to show information.

  25. Please don't do anything XP does. Sure, it gets to the desktop fast, but it is totally bogged down for a long time after that, which is super-annoying when it seems you could start using it but can't.

  26. In attempts to reduce the total boot times (although this is out of scope for the work on the distribution per se) I would be interested to hear people's experiences of using different BIOS to boot faster, such as OpenBIOS etc.

    Is there a site where you can see if the (new, open, faster) BIOS was successfully run on the device you own?

    thanks for information

  27. Gary: prelink is only saving some CPU cycles when starting applications. We just did some quick tests on an "underpowered" system (EEE PC 701) and using prelink didn't reduce boot time but X seems to have started 1s earlier. No visible change in GNOME startup. Anyway, we'll investigate prelink during Mandriva 2009.1 development cycle.

    Slux: how about testing Mandriva 2009 when it is released next week so you can see if you did a good job ?

    Andrew : this blog post is focusing on Linux distro boot time optimization, so I'd prefer to stay on focus.

  28. have you seen this?
    http://lwn.net/Articles/299483/

  29. Klemensas : hmm, did you check the 3rd link in this blog post ?

  30. Andrew: coreboot (ex-linuxbios) can jump to the linux kernel very very fast, i.e. quasi instantly, so you'll save a lot of time on legacy BIOS, but (and it's a big but) still too few motherboards are supported, and a new port is not trivial.

    have a look at : http://www.coreboot.org/Supported_Motherboards

  31. Can the moblin project have any impact on Mandriva boot time?
    This is impressive, even with Xfce:
    http://www.phoronix.com/scan.php?page=article&item=intel_moblin_2&num=1

  32. jms: partially ; we already looked at some of moblin ideas and we were already doing similar things on our OEM products. However, moblin is designed for a specific target and some of their technical choices can't be used for generic distro. Anyway, we've just pushed first phase of speedboot on Cooker yesterday, which improves boot time speed a lot.
