Remove a faulty SCSI HDD from RAID with mdadm

I manage a lot of servers with a simple software RAID1 configuration. One of the disks started failing, most notably on partition sdb8, so I had to set all of its other partitions to failed status before removing and replacing the disk:

mdadm --manage /dev/md0 --fail /dev/sdb5
mdadm --manage /dev/md3 --fail /dev/sdb7
mdadm --manage /dev/md1 --fail /dev/sdb6

After this my /proc/mdstat looked like this:

cat /proc/mdstat
Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md2 : active raid1 sdb8[1](F) sda8[0]
 284019576 blocks super 1.0 [2/1] [U_]
 bitmap: 3/3 pages [12KB], 65536KB chunk

md0 : active raid1 sda5[0] sdb5[1](F)
 528372 blocks super 1.0 [2/1] [U_]
 bitmap: 0/1 pages [0KB], 65536KB chunk

md3 : active raid1 sda7[0] sdb7[1](F)
 4200436 blocks super 1.0 [2/1] [U_]
 bitmap: 0/1 pages [0KB], 65536KB chunk

md1 : active raid1 sda6[0] sdb6[1](F)
 4199412 blocks super 1.0 [2/1] [U_]
 bitmap: 1/1 pages [4KB], 65536KB chunk

unused devices: <none>
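If you prefer a more verbose view than /proc/mdstat, mdadm --detail prints the state of every member device of an array, for example:

mdadm --detail /dev/md2

At this point it should show sdb8 as faulty and the array itself as degraded.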

Normally you would now remove the faulty drive sdb and replace it with a fresh one. Do not forget to properly remove the SCSI device first. I forgot, and ended up with the new disk recognized as sdc instead of sdb, which complicates things a bit, but it can be fixed. I will add a note on how to remove a SCSI disk properly at the end of this post.
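If you are not sure which name the kernel gave the replacement disk, the kernel log and the partition list will tell you (on my server the new disk came up as sdc, with no partitions on it yet):

dmesg | tail -n 20
cat /proc/partitions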

Take a look at the partition table on the working drive sda:

fdisk -l /dev/sda

Disk /dev/sda: 300.0 GB, 300000000000 bytes
255 heads, 63 sectors/track, 36472 cylinders, total 585937500 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000af798
Device Boot Start End Blocks Id System
/dev/sda1 * 2048 585936895 292967424 f W95 Ext'd (LBA)
/dev/sda5 4096 1060863 528384 fd Linux raid autodetect
/dev/sda6 1062912 9461759 4199424 fd Linux raid autodetect
/dev/sda7 9463808 17864703 4200448 fd Linux raid autodetect
/dev/sda8 17866752 585906175 284019712 fd Linux raid autodetect

Check that your newly inserted drive has no partitions:

fdisk -l /dev/sdc

Disk /dev/sdc: 300.0 GB, 300000000000 bytes
255 heads, 63 sectors/track, 36472 cylinders, total 585937500 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

Force-copy the partition table from disk sda to your new disk sdc with:

sfdisk -d /dev/sda | sfdisk /dev/sdc --force
Checking that no-one is using this disk right now ...
Warning: extended partition does not start at a cylinder boundary.
DOS and Linux will interpret the contents differently.
OK

Disk /dev/sdc: 36472 cylinders, 255 heads, 63 sectors/track

sfdisk: ERROR: sector 0 does not have an msdos signature
 /dev/sdc: unrecognized partition table type
Old situation:
No partitions found
New situation:
Units = sectors of 512 bytes, counting from 0

Device Boot Start End #sectors Id System
/dev/sdc1 * 2048 585936895 585934848 f W95 Ext'd (LBA)
/dev/sdc2 0 - 0 0 Empty
/dev/sdc3 0 - 0 0 Empty
/dev/sdc4 0 - 0 0 Empty
/dev/sdc5 4096 1060863 1056768 fd Linux raid autodetect
/dev/sdc6 1062912 9461759 8398848 fd Linux raid autodetect
/dev/sdc7 9463808 17864703 8400896 fd Linux raid autodetect
/dev/sdc8 17866752 585906175 568039424 fd Linux raid autodetect
Warning: partition 1 does not end at a cylinder boundary
Successfully wrote the new partition table

Re-reading the partition table ...

If you created or changed a DOS partition, /dev/foo7, say, then use dd(1)
to zero the first 512 bytes: dd if=/dev/zero of=/dev/foo7 bs=512 count=1
(See fdisk(8).)
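If the kernel does not pick up the new partition table right away, you can force a re-read before checking; both commands below are standard tools, although partprobe ships with parted and may not be present on a minimal install:

blockdev --rereadpt /dev/sdc
partprobe /dev/sdc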

Check that the partition table was written correctly:

fdisk -l /dev/sdc

Disk /dev/sdc: 300.0 GB, 300000000000 bytes
255 heads, 63 sectors/track, 36472 cylinders, total 585937500 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

Device Boot Start End Blocks Id System
/dev/sdc1 * 2048 585936895 292967424 f W95 Ext'd (LBA)
/dev/sdc5 4096 1060863 528384 fd Linux raid autodetect
/dev/sdc6 1062912 9461759 4199424 fd Linux raid autodetect
/dev/sdc7 9463808 17864703 4200448 fd Linux raid autodetect
/dev/sdc8 17866752 585906175 284019712 fd Linux raid autodetect

Now it is time to add the new partitions to the RAID1 arrays:

mdadm --manage /dev/md0 --add /dev/sdc5
mdadm --manage /dev/md1 --add /dev/sdc6
mdadm --manage /dev/md3 --add /dev/sdc7
mdadm --manage /dev/md2 --add /dev/sdc8

Synchronization starts immediately as you add each partition back to its array. I usually wait for one synchronization to finish before adding the next partition. You can check the status of the synchronization with a simple cat command:

cat /proc/mdstat
Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md2 : active raid1 sdc8[2] sdb8[1](F) sda8[0]
 284019576 blocks super 1.0 [2/1] [U_]
 [>....................] recovery = 0.0% (264576/284019576) finish=107.2min speed=44096K/sec
 bitmap: 3/3 pages [12KB], 65536KB chunk

md0 : active raid1 sdc5[2] sda5[0] sdb5[1](F)
 528372 blocks super 1.0 [2/2] [UU]
 bitmap: 0/1 pages [0KB], 65536KB chunk

md3 : active raid1 sdc7[2] sda7[0] sdb7[1](F)
 4200436 blocks super 1.0 [2/2] [UU]
 bitmap: 0/1 pages [0KB], 65536KB chunk

md1 : active raid1 sdc6[2] sda6[0] sdb6[1](F)
 4199412 blocks super 1.0 [2/2] [UU]
 bitmap: 1/1 pages [4KB], 65536KB chunk

unused devices: <none>
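If you do not want to keep re-running the cat command by hand, watch can refresh it for you, and the kernel resync speed limit can be raised while you wait (the 50000 KB/s value below is only an example, tune it to what your disks can handle):

watch -n 5 cat /proc/mdstat
echo 50000 > /proc/sys/dev/raid/speed_limit_min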

When it is all finished it should look like this:

cat /proc/mdstat
Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md2 : active raid1 sdc8[2] sdb8[1](F) sda8[0]
 284019576 blocks super 1.0 [2/2] [UU]
 bitmap: 3/3 pages [12KB], 65536KB chunk

md0 : active raid1 sdc5[2] sda5[0] sdb5[1](F)
 528372 blocks super 1.0 [2/2] [UU]
 bitmap: 0/1 pages [0KB], 65536KB chunk

md3 : active raid1 sdc7[2] sda7[0] sdb7[1](F)
 4200436 blocks super 1.0 [2/2] [UU]
 bitmap: 0/1 pages [0KB], 65536KB chunk

md1 : active raid1 sdc6[2] sda6[0] sdb6[1](F)
 4199412 blocks super 1.0 [2/2] [UU]
 bitmap: 1/1 pages [4KB], 65536KB chunk

unused devices: <none>

Now it is time to install GRUB on the newly added disk: if the working disk ever fails, we will need to boot from this new one, and without a boot loader on it we would have to boot from a rescue CD.

First you should fix the device.map file, where GRUB maps its drive names to physical hard drives. In my case it looks like this:

cat /boot/grub/device.map
(hd1) /dev/disk/by-id/scsi-35000c5003b2fda87
(hd0) /dev/disk/by-id/scsi-35000c5003b2f452f

To find out the ID of your newly inserted disk:

ls -la /dev/disk/by-id/
 ...
lrwxrwxrwx 1 root root 9 Jan 13 11:52 scsi-35000c500742c35cf -> ../../sdc
lrwxrwxrwx 1 root root 10 Jan 13 11:52 scsi-35000c500742c35cf-part1 -> ../../sdc1
lrwxrwxrwx 1 root root 10 Jan 13 11:52 scsi-35000c500742c35cf-part5 -> ../../sdc5
lrwxrwxrwx 1 root root 10 Jan 13 11:52 scsi-35000c500742c35cf-part6 -> ../../sdc6
lrwxrwxrwx 1 root root 10 Jan 13 11:52 scsi-35000c500742c35cf-part7 -> ../../sdc7
lrwxrwxrwx 1 root root 10 Jan 13 11:52 scsi-35000c500742c35cf-part8 -> ../../sdc8

So you should change the hd1 entry in device.map to the new ID:

(hd1) /dev/disk/by-id/scsi-35000c500742c35cf
(hd0) /dev/disk/by-id/scsi-35000c5003b2f452f

Now start the GRUB shell with the grub command and do the following:

GNU GRUB version 0.97 (640K lower / 3072K upper memory)

[ Minimal BASH-like line editing is supported. For the first word, TAB
  lists possible command completions. Anywhere else TAB lists the possible
  completions of a device/filename. ]

grub> root (hd
  Possible disks are: hd0 hd1

grub> find /boot/grub/stage2
  (hd0,4)
  (hd1,4)

grub> setup --stage2=/boot/grub/stage2 (hd1) (hd1,4)
  Checking if "/boot/grub/stage1" exists... yes
  Checking if "/boot/grub/stage2" exists... yes
  Checking if "/boot/grub/e2fs_stage1_5" exists... yes
  Running "embed /boot/grub/e2fs_stage1_5 (hd1)"... 17 sectors are embedded.
 succeeded
  Running "install --stage2=/boot/grub/stage2 /boot/grub/stage1 (hd1) (hd1)1+17 p (hd1,4)/boot/grub/stage2 /boot/grub/menu.lst"... succeeded
 Done.

You have now installed the GRUB boot loader on the newly added hd1 (sdc).
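As a side note, on distributions that ship GRUB 2 instead of GRUB legacy, the whole session above should boil down to a single command (not used here, since these servers run GRUB 0.97):

grub-install /dev/sdc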

Now we come to the part that I missed: properly removing the old sdb disk from the SCSI subsystem. As you can see, sdb still shows up in /proc/mdstat. You need to remove all devices declared faulty with the following commands:

mdadm --manage /dev/md0 --remove faulty
mdadm --manage /dev/md1 --remove faulty
mdadm --manage /dev/md2 --remove faulty
mdadm --manage /dev/md3 --remove faulty
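If your mdadm version does not accept the faulty keyword (recent man pages document failed and detached as the special names for --remove), you can always remove the failed members explicitly by device name instead:

mdadm --manage /dev/md0 --remove /dev/sdb5
mdadm --manage /dev/md1 --remove /dev/sdb6
mdadm --manage /dev/md3 --remove /dev/sdb7
mdadm --manage /dev/md2 --remove /dev/sdb8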

Now your /proc/mdstat will be clear of faulty devices:

cat /proc/mdstat
Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md2 : active raid1 sdc8[2] sda8[0]
 284019576 blocks super 1.0 [2/2] [UU]
 bitmap: 3/3 pages [12KB], 65536KB chunk

md0 : active raid1 sdc5[2] sda5[0]
 528372 blocks super 1.0 [2/2] [UU]
 bitmap: 0/1 pages [0KB], 65536KB chunk

md3 : active raid1 sdc7[2] sda7[0]
 4200436 blocks super 1.0 [2/2] [UU]
 bitmap: 0/1 pages [0KB], 65536KB chunk

md1 : active raid1 sdc6[2] sda6[0]
 4199412 blocks super 1.0 [2/2] [UU]
 bitmap: 1/1 pages [4KB], 65536KB chunk

unused devices: <none>

Note: here is what I should have done before physically removing the hard drive I was replacing:

echo "scsi remove-single-device 0 0 2 0" > /proc/scsi/scsi

And after inserting the new drive, add it back with:

echo "scsi add-single-device 0 0 2 0" > /proc/scsi/scsi

This will ensure your new drive is recognized properly again as sdb.
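The four numbers are the host, channel, id and lun of the SCSI slot; you can read them from the matching entry in cat /proc/scsi/scsi. On kernels where the /proc/scsi interface is not available, the sysfs interface should do the same job (sdb and host0 are just the names from my example, use the ones that match your failing disk):

echo 1 > /sys/block/sdb/device/delete
echo "- - -" > /sys/class/scsi_host/host0/scan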

In case you forgot, like me, the change from sdc back to sdb will happen the next time you reboot the server.
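The rename itself is harmless for the RAID, because mdadm assembles arrays by UUID rather than by device name, and the by-id path we put into device.map keeps pointing at the same physical disk. If you want to double-check the UUIDs, compare the output of:

mdadm --detail --scan

with the ARRAY lines in your mdadm.conf.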
