Recovering VMs when they fail with disk errors


My setup carves a logical volume out of my domU volume group and hands it to virt-install to use as the root disk. The installer then partitions that LV and puts LVM on top of it. But that nesting isn't visible to the VM host, so what's inside the disk image is opaque, i.e. it can't be seen by the usual disk tools. When my fileserver VM failed to boot with a disk error, I knew something had to be wrong, but running fsck /dev/domU/helium didn't work. The following steps did:
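(As an aside, if you want to confirm what the LV actually contains before touching it, a couple of read-only commands should do it; something like

file -s /dev/domU/helium
fdisk -l /dev/domU/helium

should report a boot sector and partition table rather than an ext filesystem.)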

Step 1. Map the partitions inside the disk image: kpartx -a /dev/domU/helium

This set up block devices in /dev/mapper, namely domU-helium1 and domU-helium2. domU-helium1 was my /boot partition (since grub can't read LVM, /boot is a plain partition on its own), and fsck ran on it without problems:
[root@elemental mapper]# fsck /dev/mapper/domU-helium1
fsck from util-linux-ng 2.17.2
e2fsck 1.41.10 (10-Feb-2009)
/dev/mapper/domU-helium1: clean, 44/128016 files, 78681/512000 blocks
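As an aside, kpartx -l gives a dry run, listing the mappings that -a would create without actually adding them:

kpartx -l /dev/domU/helium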

But, domU-helium2 was a different story:
[root@elemental mapper]# fsck.ext4 /dev/mapper/domU-helium2
e2fsck 1.41.10 (10-Feb-2009)
fsck.ext4: Superblock invalid, trying backup blocks...
fsck.ext4: Bad magic number in super-block while trying to open /dev/mapper/domU-helium2

The superblock could not be read or does not describe a correct ext2
filesystem. If the device is valid and it really contains an ext2
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
e2fsck -b 8193
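If the partition really had been ext4, the usual next move would be to point e2fsck at a backup superblock, something along these lines (32768 is the typical first backup on a 4k-block filesystem, and mke2fs -n only prints, it doesn't format):

mke2fs -n /dev/mapper/domU-helium2    # dry run: shows where the superblock backups should be
e2fsck -b 32768 /dev/mapper/domU-helium2

But that would have been a dead end here, because the partition isn't ext4 at all.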

As a side note, I ran fsck.ext4 thinking that specifying the filesystem type would help. Plain fsck would actually have been more informative:
[root@elemental mapper]# fsck /dev/mapper/domU-helium2
fsck from util-linux-ng 2.17.2
fsck: fsck.LVM2_member: not found
fsck: Error 2 while executing fsck.LVM2_member for /dev/mapper/domU-helium2

Note the "fsck.LVM2_member": fsck had detected that the partition holds an LVM physical volume, not a filesystem. That would have put me on the right track, but I didn't notice it at the time.
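A quicker way to find out what a mystery partition holds is blkid (or file -s), which should report the type directly:

blkid /dev/mapper/domU-helium2    # should show TYPE="LVM2_member" for an LVM physical volume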

Anyway, getting LVM on the host to activate the guest's volume group was the tricky part. My first thought was vgimport, but it failed:
[root@elemental mapper]# vgimport vg_helium
Volume group "vg_helium" is not exported
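In hindsight that makes sense: vgimport only applies to volume groups that were previously exported with vgexport. A VG that is merely inactive just needs to be scanned and activated; something like

vgscan
vgs

should show that vg_helium was already known to LVM, just not active.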

But the volume group was definitely visible once kpartx had mapped the partitions (compare pvscan before and after):
[root@elemental mapper]# pvscan
PV /dev/md127 VG bulkdata lvm2 [2.66 TiB / 0 free]
PV /dev/sda3 VG domU lvm2 [235.44 GiB / 75.44 GiB free]
PV /dev/sda2 VG vg_elemental lvm2 [42.00 GiB / 0 free]
Total: 3 [2.93 TiB] / in use: 3 [2.93 TiB] / in no VG: 0 [0 ]
[root@elemental mapper]# kpartx -a /dev/domU/helium
[root@elemental mapper]# ls
bulkdata-lvm0 domU-Fedora13PV domU-fedora13pv.2 domU-helium domU-helium2 domU-nexentaroot vg_elemental-root
control domU-fedora13pv.1 domU-fedora14 domU-helium1 domU-nexenta3 domU-OIroot vg_elemental-swap
[root@elemental mapper]# pvscan
PV /dev/md127 VG bulkdata lvm2 [2.66 TiB / 0 free]
PV /dev/mapper/domU-helium2 VG vg_helium lvm2 [19.51 GiB / 0 free]
PV /dev/sda3 VG domU lvm2 [235.44 GiB / 75.44 GiB free]
PV /dev/sda2 VG vg_elemental lvm2 [42.00 GiB / 0 free]
Total: 4 [2.95 TiB] / in use: 4 [2.95 TiB] / in no VG: 0 [0 ]
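At this point the volume group is visible, but its logical volumes are still inactive, which lvscan should make explicit:

lvscan    # vg_helium's LVs should show up as "inactive" here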

So, Step 2: After much searching, the trick I found was vgchange -a y, which marks the volume group as active.
[root@elemental mapper]# vgchange -a y vg_helium
2 logical volume(s) in volume group "vg_helium" now active

Running ls in /dev/mapper showed the newly activated LVs:
[root@elemental mapper]# ls
bulkdata-lvm0 domU-Fedora13PV domU-fedora13pv.2 domU-helium domU-helium2 domU-nexentaroot vg_helium-lv_root vg_elemental-root
control domU-fedora13pv.1 domU-fedora14 domU-helium1 domU-nexenta3 domU-OIroot vg_helium-lv_swap vg_elemental-swap

From there, it was a simple fsck /dev/mapper/vg_helium-lv_root:
[root@elemental mapper]# fsck /dev/mapper/vg_helium-lv_root
fsck from util-linux-ng 2.17.2
e2fsck 1.41.10 (10-Feb-2009)
[removed for brevity]
lv_root: ***** FILE SYSTEM WAS MODIFIED *****
lv_root: 65133/1246032 files (0.2% non-contiguous), 567364/4982784 blocks
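If you're feeling paranoid, you can force a full pass even though the filesystem now claims to be clean, before handing the disk back to the VM:

e2fsck -f /dev/mapper/vg_helium-lv_root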

Cleaning up after the fix was a simple vgchange -a n vg_helium followed by kpartx -d /dev/domU/helium:
[root@elemental mapper]# vgchange -a n vg_helium
0 logical volume(s) in volume group "vg_helium" now active
[root@elemental mapper]# kpartx -d /dev/domU/helium

Starting the VM back up, everything worked:
[root@elemental mapper]# virsh start helium
Domain helium started

[root@elemental mapper]# virsh console helium
Connected to domain helium
Escape character is ^]
PCI: Fatal: No config space access function found
drivers/rtc/hctosys.c: unable to open rtc device (rtc0)
Welcome to Fedora
Press 'I' to enter interactive startup.
Starting udev: [ OK ]
Setting hostname helium: [ OK ]
Setting up Logical Volume Management: 2 logical volume(s) in volume group "vg_helium" now active [ OK ]
Checking filesystems
Checking all file systems.
[/sbin/fsck.ext4 (1) -- /] fsck.ext4 -a /dev/mapper/vg_helium-lv_root
/dev/mapper/vg_helium-lv_root: clean, 65133/1246032 files, 567364/4982784 blocks
[/sbin/fsck.ext3 (1) -- /boot] fsck.ext3 -a /dev/xvda1
ext2fs_check_if_mount: Can't check if filesystem is mounted due to missing mtab file while determining whether /dev/xvda1 is mounted.
/dev/xvda1: clean, 44/128016 files, 78681/512000 blocks
[/sbin/fsck.ext3 (1) -- /home] fsck.ext3 -a /dev/xvdb
ext2fs_check_if_mount: Can't check if filesystem is mounted due to missing mtab file while determining whether /dev/xvdb is mounted.
/dev/xvdb: clean, 7088/178651136 files, 448748230/714604544 blocks [ OK ]
Remounting root filesystem in read-write mode: [ OK ]
Mounting local filesystems: [ OK ]
Enabling local filesystem quotas: [ OK ]
Enabling /etc/fstab swaps: [ OK ]
Entering non-interactive startup
Calling the system activity data collector (sadc):
Starting monitoring for VG vg_helium: 2 logical volume(s) in volume group "vg_helium" monitored [ OK ]
Bringing up loopback interface: [ OK ]
Bringing up interface eth0:
Determining IP information for eth0... done. [ OK ]
[removed for brevity]
Fedora release 13 (Goddard)
Kernel 2.6.34.7-61.fc13.x86_64 on an x86_64 (/dev/hvc0)
helium login:

And we’re live again.
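For future reference, the whole recipe condenses to this (with the VM shut down, and using the same LV and VG names as above):

kpartx -a /dev/domU/helium            # map the partitions inside the LV
vgchange -a y vg_helium               # activate the guest's volume group
fsck /dev/mapper/vg_helium-lv_root    # repair the guest's root filesystem
vgchange -a n vg_helium               # deactivate the VG again
kpartx -d /dev/domU/helium            # remove the partition mappings
virsh start helium                    # boot the VM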
