Yesterday I wanted to set up an OpenBSD VPS. The cheapest VPS provider these days seems to be Scaleway, which doesn’t allow custom ISO uploads.
So I picked Debian and, with the serial console attached, ran this from GRUB:
set root=(hd0,gpt1)
kopenbsd /bsd.rd
boot
And I got nothing, because this thing is a serial console, not a proper terminal. Or so I thought at first. Actually, the problem is that the combination of GRUB + UEFI + OpenBSD ramdisk fails to output anything except over serial.
Then I learned about -h com0:
set root=(hd0,gpt1)
kopenbsd -h com0 /bsd.rd
boot
This works fine in qemu. Output goes to the serial console, installation can start, etc.
On the VPS, the shell never starts: https://p.teknik.io/yEJ0N
I tried different versions of OpenBSD and found that versions before 5.3 worked. Something changed in 5.3; 5.2 “works” (network card not recognized, GPT/UEFI don’t work). Of course it doesn’t work well enough to do a proper install, but it gets to a shell.
Now I asked myself: is the process hanging here (1), or is the output switching from serial to something that doesn’t work through the Scaleway tty (2)?
And I went to IRC for help. Specifically, freenode #openbsd.
And then the couple of guys who tried to help me decided it was certainly the second thing.
Why? Because I was not using the blessed boot.conf to set tty com0.
Because grub was doing it wrong.
Because I wasn’t using a custom ramdisk with a nice boot.conf.
This was all a big misunderstanding, for various reasons:
- I’m getting all the kernel messages but the last on the serial console. By the time the messages stop, the kernel has been chatting over serial for a while, and it’s just missing one message to send.
- Local qemu tests proved that grub with -h com0 could work fine.
- I am not using /boot at all, because everything it does I’m already doing with grub, successfully.
But still, the people on IRC helped me mess with ramdisks, and messing with ramdisks proved to me that the kernel really was hanging.
The first test was modifying /.profile to make it reboot as soon as it got to the shell. This worked correctly on qemu, but did nothing on Scaleway. If it were only a display problem, it would still reboot.
The second test was removing /sbin/init. This makes the kernel panic and reboot… but only after finding the root partition. Ours is hanging just before finding the root, so obviously it didn’t reboot with this either.
In short, the situation is that the kernel is hanging somewhere between printing
scsibus2 at softraid0: 256 targets
and
root on rd0a swap on rd0b dump on rd0b
The next thing was seeing if the kernel’s verbose mode would help. I booted an OpenBSD VM, ran boot -c (yes, this time from the famous boot> prompt, since it was a full installation), enabled verbose mode and looked at the output.
There were no added lines between scsibus… and root on…, so verbosity wasn’t going to help.
The only thing left to do, then, is to build an OpenBSD kernel (and ramdisk image) with added printfs everywhere, to see where in the code the kernel hangs and finally find the bug. Or is it?
No, first let’s compare good and bad boots to see that the devices are different. Then try to boot a local UEFI qemu with virtio devices, just as Scaleway does, and see that it fails in exactly the same way. Ah, locally reproducing a bug, that sure is nice. Especially so when the VPS takes over a minute to reboot every time.
But when booting an actual install66.fs it doesn’t fail; it only fails when booting from GRUB with both virtio and UEFI. (I can’t test GRUB on UEFI without serial, because nothing is displayed at all.)
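The good-versus-bad comparison can be done by eye, but a rough sketch of automating it might look like this. It assumes both serial logs were saved to files, and the function names are made up for illustration. It only looks at autoconf lines of the form "name0 at parent0 …", so driver names containing digits are skipped.

```shell
# List the driver names that attach in a saved boot log, one per line.
# Only matches lines like "virtio0 at pci0 ..." (letters, unit number, " at ").
attached_drivers() {
    sed -n 's/^\([a-z][a-z]*\)[0-9][0-9]* at .*/\1/p' "$1" | sort -u
}

# Print the drivers that appear in only one of the two logs.
# "<" means only in the first log, ">" means only in the second.
diff_drivers() {
    a=$(mktemp)
    b=$(mktemp)
    attached_drivers "$1" > "$a"
    attached_drivers "$2" > "$b"
    diff "$a" "$b" | grep '^[<>]'
    rm -f "$a" "$b"
}

# Usage (hypothetical file names):
#   diff_drivers good-boot.log bad-boot.log
```

Anything that shows up only on the bad side is a candidate for the local reproduction.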
Some tests:
- UEFI+GRUB ⇒ no image
- UEFI+GRUB+virtio ⇒ no image
- UEFI+GRUB+com0 ⇒ works
- UEFI+GRUB+com0+virtio ⇒ kernel freezes after “scsibusN at softraidM: 256 targets”
- BIOS+GRUB+virtio ⇒ works
- UEFI+install66.fs+virtio ⇒ works
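For reference, the two GRUB+virtio rows could be reproduced locally with something along these lines. This is a sketch, not the exact commands I ran: the OVMF path, the disk image name, the memory size and the qemu_cmd helper are all placeholders, and the image is assumed to already contain GRUB and bsd.rd. It also only switches the disk to virtio; which virtio device actually triggers the hang is not established here.

```shell
# Placeholders, adjust to the local setup.
OVMF=${OVMF:-/usr/share/ovmf/OVMF.fd}   # UEFI firmware image (path varies by distro)
DISK=${DISK:-disk.img}                  # raw disk image with GRUB and bsd.rd on it

# Print a qemu command line for one firmware/disk-bus combination.
#   $1: "uefi" or "bios"
#   $2: "virtio" or "ide"
qemu_cmd() {
    fw=""
    if [ "$1" = uefi ]; then
        fw="-bios $OVMF "
    fi
    # -nographic routes the guest's first serial port to this terminal.
    printf 'qemu-system-x86_64 -m 1024 -nographic %s-drive file=%s,format=raw,if=%s\n' \
        "$fw" "$DISK" "$2"
}

qemu_cmd uefi virtio   # the combination that froze
qemu_cmd bios virtio   # the combination that worked
```

The function only prints the command, so it can be inspected before being pasted into a terminal.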
Anyway, there’s a bug here somewhere.
How to actually boot the install media
From the Scaleway control panel, reboot into the rescue Ubuntu system that boots off the network. Then, from its shell:
wget http://ftp.hostserver.de/pub/OpenBSD/6.6/amd64/miniroot66.fs
cat miniroot66.fs > /dev/vda
sync
halt
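Since the image comes over plain HTTP and gets written straight over the disk, it may be worth checking it before the cat step. A minimal sketch, assuming the mirror's SHA256 file sits next to the image and uses the BSD-style "SHA256 (name) = hex" format; verify_sha256 is a made-up helper name. Fetching the list from the same mirror only catches a corrupted download, not a tampered one.

```shell
# Succeed only if the file's SHA-256 matches its entry in a BSD-style
# checksum list with lines of the form "SHA256 (name) = hex".
verify_sha256() {
    file=$1
    list=$2
    # Expected hash: 4th field of the line whose 2nd field is "(basename)".
    want=$(awk -v f="($(basename "$file"))" \
        '$1 == "SHA256" && $2 == f { print $4 }' "$list")
    # Actual hash of the downloaded file.
    have=$(sha256sum "$file" | awk '{ print $1 }')
    [ -n "$want" ] && [ "$want" = "$have" ]
}

# Usage, before writing to the disk:
#   wget http://ftp.hostserver.de/pub/OpenBSD/6.6/amd64/SHA256
#   verify_sha256 miniroot66.fs SHA256 && cat miniroot66.fs > /dev/vda
```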
Set the server to boot from disk again, reboot, and attach the console. The boot> prompt will appear:
set tty com0
boot
I told the installer to use the whole disk as GPT. It will warn that “An EFI/GPT disk may not boot. Proceed?”. I disregarded the warning, and the system booted fine once the installation had finished.