Recovering VMs when they fail with disk errors


My setup carves a logical volume out of my domU volume group and hands it to virt-install to use as the root disk. The installer then partitions that LV and puts LVM on top of it. But that nesting isn't visible to the VM host, so what's inside the disk image is opaque, i.e. it can't be seen by the usual disk tools. When my fileserver VM failed to boot with a disk error, I knew something had to be wrong, but running fsck /dev/domU/helium didn't work. The following steps did:
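(As an aside, if you want to confirm what the LV actually contains before touching it, a couple of read-only commands should do it; something like

file -s /dev/domU/helium
fdisk -l /dev/domU/helium

should report a boot sector and partition table rather than an ext filesystem.)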

Step 1. Map the partitions inside the disk image: kpartx -a /dev/domU/helium

This set up block devices in /dev/mapper, namely domU-helium1 and domU-helium2. domU-helium1 was my /boot partition (since grub can't read LVM, /boot is a plain partition on its own), and fsck ran on it without problems:
[root@elemental mapper]# fsck /dev/mapper/domU-helium1
fsck from util-linux-ng 2.17.2
e2fsck 1.41.10 (10-Feb-2009)
/dev/mapper/domU-helium1: clean, 44/128016 files, 78681/512000 blocks
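As an aside, kpartx -l gives a dry run, listing the mappings that -a would create without actually adding them:

kpartx -l /dev/domU/helium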

But, domU-helium2 was a different story:
[root@elemental mapper]# fsck.ext4 /dev/mapper/domU-helium2
e2fsck 1.41.10 (10-Feb-2009)
fsck.ext4: Superblock invalid, trying backup blocks...
fsck.ext4: Bad magic number in super-block while trying to open /dev/mapper/domU-helium2

The superblock could not be read or does not describe a correct ext2
filesystem. If the device is valid and it really contains an ext2
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
e2fsck -b 8193
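If the partition really had been ext4, the usual next move would be to point e2fsck at a backup superblock, something along these lines (32768 is the typical first backup on a 4k-block filesystem, and mke2fs -n only prints, it doesn't format):

mke2fs -n /dev/mapper/domU-helium2    # dry run: shows where the superblock backups should be
e2fsck -b 32768 /dev/mapper/domU-helium2

But that would have been a dead end here, because the partition isn't ext4 at all.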

As a side note, I ran fsck.ext4 thinking that specifying the filesystem type would help. Plain fsck would actually have been more informative:
[root@elemental mapper]# fsck /dev/mapper/domU-helium2
fsck from util-linux-ng 2.17.2
fsck: fsck.LVM2_member: not found
fsck: Error 2 while executing fsck.LVM2_member for /dev/mapper/domU-helium2

Note the "fsck.LVM2_member": fsck had detected that the partition holds an LVM physical volume, not a filesystem. That would have put me on the right track, but I didn't notice it at the time.
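A quicker way to find out what a mystery partition holds is blkid (or file -s), which should report the type directly:

blkid /dev/mapper/domU-helium2    # should show TYPE="LVM2_member" for an LVM physical volume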

Anyway, getting LVM on the host to activate the guest's volume group was the tricky part. My first thought was vgimport, but it failed:
[root@elemental mapper]# vgimport vg_helium
Volume group "vg_helium" is not exported
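In hindsight that makes sense: vgimport only applies to volume groups that were previously exported with vgexport. A VG that is merely inactive just needs to be scanned and activated; something like

vgscan
vgs

should show that vg_helium was already known to LVM, just not active.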

But the volume group was definitely visible once kpartx had mapped the partitions (compare pvscan before and after):
[root@elemental mapper]# pvscan
PV /dev/md127 VG bulkdata lvm2 [2.66 TiB / 0 free]
PV /dev/sda3 VG domU lvm2 [235.44 GiB / 75.44 GiB free]
PV /dev/sda2 VG vg_elemental lvm2 [42.00 GiB / 0 free]
Total: 3 [2.93 TiB] / in use: 3 [2.93 TiB] / in no VG: 0 [0 ]
[root@elemental mapper]# kpartx -a /dev/domU/helium
[root@elemental mapper]# ls
bulkdata-lvm0 domU-Fedora13PV domU-fedora13pv.2 domU-helium domU-helium2 domU-nexentaroot vg_elemental-root
control domU-fedora13pv.1 domU-fedora14 domU-helium1 domU-nexenta3 domU-OIroot vg_elemental-swap
[root@elemental mapper]# pvscan
PV /dev/md127 VG bulkdata lvm2 [2.66 TiB / 0 free]
PV /dev/mapper/domU-helium2 VG vg_helium lvm2 [19.51 GiB / 0 free]
PV /dev/sda3 VG domU lvm2 [235.44 GiB / 75.44 GiB free]
PV /dev/sda2 VG vg_elemental lvm2 [42.00 GiB / 0 free]
Total: 4 [2.95 TiB] / in use: 4 [2.95 TiB] / in no VG: 0 [0 ]
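At this point the volume group is visible, but its logical volumes are still inactive, which lvscan should make explicit:

lvscan    # vg_helium's LVs should show up as "inactive" here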

So, Step 2: After much searching, the trick I found was vgchange -a y, which marks the volume group as active.
[root@elemental mapper]# vgchange -a y vg_helium
2 logical volume(s) in volume group "vg_helium" now active

Running ls in /dev/mapper showed the newly activated LVs:
[root@elemental mapper]# ls
bulkdata-lvm0 domU-Fedora13PV domU-fedora13pv.2 domU-helium domU-helium2 domU-nexentaroot vg_helium-lv_root vg_elemental-root
control domU-fedora13pv.1 domU-fedora14 domU-helium1 domU-nexenta3 domU-OIroot vg_helium-lv_swap vg_elemental-swap

From there, it was a simple fsck /dev/mapper/vg_helium-lv_root:
[root@elemental mapper]# fsck /dev/mapper/vg_helium-lv_root
fsck from util-linux-ng 2.17.2
e2fsck 1.41.10 (10-Feb-2009)
[removed for brevity]
lv_root: ***** FILE SYSTEM WAS MODIFIED *****
lv_root: 65133/1246032 files (0.2% non-contiguous), 567364/4982784 blocks
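If you're feeling paranoid, you can force a full pass even though the filesystem now claims to be clean, before handing the disk back to the VM:

e2fsck -f /dev/mapper/vg_helium-lv_root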

Cleaning up after the fix was a simple vgchange -a n vg_helium followed by kpartx -d /dev/domU/helium:
[root@elemental mapper]# vgchange -a n vg_helium
0 logical volume(s) in volume group "vg_helium" now active
[root@elemental mapper]# kpartx -d /dev/domU/helium

Starting the VM back up, everything worked:
[root@elemental mapper]# virsh start helium
Domain helium started

[root@elemental mapper]# virsh console helium
Connected to domain helium
Escape character is ^]
PCI: Fatal: No config space access function found
drivers/rtc/hctosys.c: unable to open rtc device (rtc0)
Welcome to Fedora
Press 'I' to enter interactive startup.
Starting udev: [ OK ]
Setting hostname helium: [ OK ]
Setting up Logical Volume Management: 2 logical volume(s) in volume group "vg_helium" now active [ OK ]
Checking filesystems
Checking all file systems.
[/sbin/fsck.ext4 (1) -- /] fsck.ext4 -a /dev/mapper/vg_helium-lv_root
/dev/mapper/vg_helium-lv_root: clean, 65133/1246032 files, 567364/4982784 blocks
[/sbin/fsck.ext3 (1) -- /boot] fsck.ext3 -a /dev/xvda1
ext2fs_check_if_mount: Can't check if filesystem is mounted due to missing mtab file while determining whether /dev/xvda1 is mounted.
/dev/xvda1: clean, 44/128016 files, 78681/512000 blocks
[/sbin/fsck.ext3 (1) -- /home] fsck.ext3 -a /dev/xvdb
ext2fs_check_if_mount: Can't check if filesystem is mounted due to missing mtab file while determining whether /dev/xvdb is mounted.
/dev/xvdb: clean, 7088/178651136 files, 448748230/714604544 blocks [ OK ]
Remounting root filesystem in read-write mode: [ OK ]
Mounting local filesystems: [ OK ]
Enabling local filesystem quotas: [ OK ]
Enabling /etc/fstab swaps: [ OK ]
Entering non-interactive startup
Calling the system activity data collector (sadc):
Starting monitoring for VG vg_helium: 2 logical volume(s) in volume group "vg_helium" monitored [ OK ]
Bringing up loopback interface: [ OK ]
Bringing up interface eth0:
Determining IP information for eth0... done. [ OK ]
[removed for brevity]
Fedora release 13 (Goddard)
Kernel 2.6.34.7-61.fc13.x86_64 on an x86_64 (/dev/hvc0)
helium login:

And we’re live again.
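For future reference, the whole recipe condenses to this (with the VM shut down, and using the same LV and VG names as above):

kpartx -a /dev/domU/helium            # map the partitions inside the LV
vgchange -a y vg_helium               # activate the guest's volume group
fsck /dev/mapper/vg_helium-lv_root    # repair the guest's root filesystem
vgchange -a n vg_helium               # deactivate the VG again
kpartx -d /dev/domU/helium            # remove the partition mappings
virsh start helium                    # boot the VM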
