mdadm RAID5 random read errors. Dying disk?


First the long story:

I have a RAID5 with mdadm on Debian 9. The array has 5 disks of 4 TB each. Four of them are HGST Deskstar NAS, and one that was added later is a Toshiba N300 NAS.

Over the past few days I noticed some read errors from that array. For example, I had a 10 GB RAR archive in multiple parts. When I try to extract it, I get CRC errors on some of the parts. If I try a second time, I get these errors on other parts. The same happens with torrents when I re-check them after download.

After a reboot, my BIOS notified me that the S.M.A.R.T. status of an HGST drive on SATA port 3 is bad. smartctl told me that there were DMA CRC errors, but claimed that the drive was OK.

After another reboot, I can no longer see the CRC errors in the SMART data. But now I get this output:

smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-4-amd64] (local build)

Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org



=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: FAILED!

Drive failure expected in less than 24 hours. SAVE ALL DATA.

Failed Attributes:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

  5 Reallocated_Sector_Ct   0x0033   001   001   005    Pre-fail  Always   FAILING_NOW 1989

As the HGST drives aren't available at normal prices anymore, I bought another Toshiba N300 to replace the HGST. Both are labeled as 4 TB. I tried to create a partition of exactly the same size, but it didn't work: the partitioning program claimed that my number was too big (I tried it with bytes and with sectors). So I just made the partition as big as possible. But now it looks like it is the same size, and I'm a bit confused.

sdc is the old disk, and sdh is the new one:

Disk /dev/sdc: 3,7 TiB, 4000787030016 bytes, 7814037168 sectors

Units: sectors of 1 * 512 = 512 bytes

Sector size (logical/physical): 512 bytes / 4096 bytes

I/O size (minimum/optimal): 4096 bytes / 4096 bytes

Disklabel type: gpt

Disk identifier: 4CAD956D-E627-42D4-B6BB-53F48DF8AABC



Device     Start        End    Sectors  Size Type

/dev/sdc1   2048 7814028976 7814026929  3,7T Linux RAID





Disk /dev/sdh: 3,7 TiB, 4000787030016 bytes, 7814037168 sectors

Units: sectors of 1 * 512 = 512 bytes

Sector size (logical/physical): 512 bytes / 512 bytes

I/O size (minimum/optimal): 512 bytes / 512 bytes

Disklabel type: gpt

Disk identifier: 3A173902-47DE-4C96-8360-BE5DBED1EAD3



Device     Start        End    Sectors  Size Type

/dev/sdh1   2048 7814037134 7814035087  3,7T Linux filesystem

Currently I have added the new one as a spare disk. The RAID is still running with the old drive. I still get some read errors, especially on big files.

This is how my RAID currently looks:

/dev/md/0:

        Version : 1.2

  Creation Time : Sun Dec 17 22:03:20 2017

     Raid Level : raid5

     Array Size : 15627528192 (14903.57 GiB 16002.59 GB)

  Used Dev Size : 3906882048 (3725.89 GiB 4000.65 GB)

   Raid Devices : 5

  Total Devices : 6

    Persistence : Superblock is persistent



  Intent Bitmap : Internal



    Update Time : Sat Jan  5 09:48:49 2019

          State : clean

 Active Devices : 5

Working Devices : 6

 Failed Devices : 0

  Spare Devices : 1



         Layout : left-symmetric

     Chunk Size : 512K



           Name : SERVER:0  (local to host SERVER)

           UUID : 16ee60d0:f055dedf:7bd40adc:f3415deb

         Events : 25839



    Number   Major   Minor   RaidDevice State

       0       8       49        0      active sync   /dev/sdd1

       1       8       33        1      active sync   /dev/sdc1

       3       8        1        2      active sync   /dev/sda1

       4       8       17        3      active sync   /dev/sdb1

       5       8       80        4      active sync   /dev/sdf



       6       8      113        -      spare   /dev/sdh1

And the disk structure is this

NAME                       MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT

sda                          8:0    0   3,7T  0 disk

└─sda1                       8:1    0   3,7T  0 part

  └─md0                      9:0    0  14,6T  0 raid5

    └─storageRaid          253:4    0  14,6T  0 crypt

      └─vg_raid-raidVolume 253:5    0  14,6T  0 lvm   /media/raidVolume

sdb                          8:16   0   3,7T  0 disk

└─sdb1                       8:17   0   3,7T  0 part

  └─md0                      9:0    0  14,6T  0 raid5

    └─storageRaid          253:4    0  14,6T  0 crypt

      └─vg_raid-raidVolume 253:5    0  14,6T  0 lvm   /media/raidVolume

sdc                          8:32   0   3,7T  0 disk

└─sdc1                       8:33   0   3,7T  0 part

  └─md0                      9:0    0  14,6T  0 raid5

    └─storageRaid          253:4    0  14,6T  0 crypt

      └─vg_raid-raidVolume 253:5    0  14,6T  0 lvm   /media/raidVolume

sdd                          8:48   0   3,7T  0 disk

└─sdd1                       8:49   0   3,7T  0 part

  └─md0                      9:0    0  14,6T  0 raid5

    └─storageRaid          253:4    0  14,6T  0 crypt

      └─vg_raid-raidVolume 253:5    0  14,6T  0 lvm   /media/raidVolume

sdf                          8:80   1   3,7T  0 disk

└─md0                        9:0    0  14,6T  0 raid5

  └─storageRaid            253:4    0  14,6T  0 crypt

    └─vg_raid-raidVolume   253:5    0  14,6T  0 lvm   /media/raidVolume

sdh                          8:112  1   3,7T  0 disk

└─sdh1                       8:113  1   3,7T  0 part

  └─md0                      9:0    0  14,6T  0 raid5

    └─storageRaid          253:4    0  14,6T  0 crypt

      └─vg_raid-raidVolume 253:5    0  14,6T  0 lvm   /media/raidVolume

I'm a bit confused that the spare disk (sdh) already shows up under the crypt volume.

Questions:

Under what criteria will mdadm mark a disk as failed?

Can the random read errors come from one broken disk?

Doesn't the RAID detect it when a disk returns wrong data?

Is it dangerous to manually mark a disk as failed when the spare disk is not exactly the same size?

asked Jan 5 at 13:05

kevinq

283

Besides being "too broad", it's also off-topic ("consumer workstations or networking (which belong on our sister site, Super User)"), see serverfault.com/help/on-topic

– poige
Jan 5 at 18:35


lvm mdadm raid5 smart


2 Answers

MD raid is far too conservative with kicking out disks, in my opinion. I always watch for ATA exceptions in the syslog/dmesg (I set rsyslog to notify me on those).
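
As a rough illustration of watching for ATA exceptions, the kernel log can be filtered for libata error lines. This is only a sketch: the function name `filter_ata_errors` is made up for this example, and the pattern assumes the usual `ataN.NN: exception ...` / `ataN.NN: failed command: ...` message shape, which may differ between kernel versions.

```shell
#!/bin/sh
# Hypothetical helper: keep only lines that look like libata errors.
# It reads log text on stdin, so it works with `dmesg` or a saved log file.
filter_ata_errors() {
    # `|| true` keeps the exit status 0 when nothing matches.
    grep -E 'ata[0-9]+(\.[0-9]+)?: (exception|failed command|error)' || true
}

# Illustrative usage (reading the kernel log may need root):
#   dmesg | filter_ata_errors
```

The same pattern could be used in an rsyslog rule or a cron job that mails its output; how the notification is delivered is left open here.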

I must say I am surprised that you get errors at the application level. RAID5 should use the parity information to detect errors (edit: apparently it doesn't, except during verification). Having said that, whether or not the disk is the cause, this is bad. Nearly 2000 reallocated sectors is really bad.

Partitions can be bigger (otherwise you couldn't have added it as a spare either), but to be sure everything is fine, you can clone partition tables using fdisk, sfdisk or gdisk. You have GPT, so let's use its backup feature. If you run gdisk /dev/sdX, you can use b to back the partition table up to a file. Then, on the new disk, run gdisk /dev/sdY, use r for the recovery options, then l to load the backup. You should then have an identical partition, and all mdadm --manage --add commands should work. (You will need to take the new disk out of the array before changing its partition table.)
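
To make the parity point concrete, here is a toy sketch using single bytes in place of whole chunks (the values are arbitrary). It shows that XOR parity can rebuild a chunk when the array knows which member is missing, while a silently wrong chunk only shows up as a mismatch when the whole stripe, parity included, is read and compared. That comparison is what a verification pass does; a normal read only fetches the data chunks.

```shell
#!/bin/sh
# Toy stripe: four data "chunks" (one byte each) plus one XOR parity chunk.
d0=23; d1=200; d2=7; d3=91
parity=$(( d0 ^ d1 ^ d2 ^ d3 ))

# Known failure: the member holding d2 is gone, so rebuild it from the rest.
rebuilt_d2=$(( d0 ^ d1 ^ d3 ^ parity ))
echo "rebuilt d2 = $rebuilt_d2 (original was $d2)"

# Silent corruption: d1 comes back with one bit flipped and no I/O error.
bad_d1=$(( d1 ^ 4 ))
residue=$(( d0 ^ bad_d1 ^ d2 ^ d3 ^ parity ))
echo "stripe check residue = $residue (non-zero means a mismatch somewhere)"
```

Note that the non-zero residue says the stripe is inconsistent but not which chunk is the wrong one, which is why single parity alone cannot repair silent corruption.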
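
The size question can be checked with the sector counts from the fdisk output in the question, and the table cloning can also be scripted. The sketch below swaps in `sgdisk` (the non-interactive tool from the gdisk package) for the interactive `b` / `r` / `l` steps; `run` and `clone_gpt` are hypothetical helper names, and commands are only printed unless `DRY_RUN=0` is set. It assumes the target disk has already been taken out of the array.

```shell
#!/bin/sh
# Sector counts copied from the fdisk output in the question.
old_part_sectors=7814026929   # /dev/sdc1
new_part_sectors=7814035087   # /dev/sdh1
extra=$(( new_part_sectors - old_part_sectors ))
if [ "$extra" -ge 0 ]; then
    echo "new partition is $extra sectors larger than the old one"
else
    echo "new partition is $(( -extra )) sectors too small"
fi

# Print commands by default; set DRY_RUN=0 to really execute them (as root).
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then echo "+ $*"; else "$@"; fi
}

# Hypothetical helper: copy the GPT of disk $1 onto disk $2,
# then randomize the GUIDs on $2 so the two disks do not share identifiers.
clone_gpt() {
    run sgdisk -R "$2" "$1"
    run sgdisk -G "$2"
}

clone_gpt /dev/sdc /dev/sdh
```

Randomizing the GUIDs afterwards addresses the concern about duplicate UUIDs when a partition table backup is restored onto a second disk.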

I actually tend to keep those backup partition tables around on servers. It makes for fast disk replacements.

And a final piece of advice: don't use RAID5. RAID5 with such huge disks is flaky. You should be able to add a disk and dynamically migrate to RAID6. I'm not sure how off the top of my head, but you can Google that.
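
For orientation, a RAID5 to RAID6 migration with mdadm is typically a single `--grow` reshape. The sketch below is an assumption-laden outline, not a tested procedure for this array: it assumes the failing disk has already been replaced, that a sixth device is present in the array as a spare, and that the backup file path (made up here) lives on a disk outside the array. `run` and `raid5_to_raid6` are hypothetical helpers, and nothing is executed unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Print commands by default; set DRY_RUN=0 to really execute them (as root).
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then echo "+ $*"; else "$@"; fi
}

# Hypothetical helper: reshape a 5-device RAID5 into a 6-device RAID6.
# $1 = array device, $2 = backup file stored outside the array.
raid5_to_raid6() {
    run mdadm --grow "$1" --level=6 --raid-devices=6 --backup-file="$2"
    # The reshape runs in the background; progress is visible here.
    run cat /proc/mdstat
}

raid5_to_raid6 /dev/md0 /root/md0-reshape.backup
```

A reshape of disks this size can take a long time, so a current backup of the data is advisable before starting.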

edited Jan 5 at 22:03

answered Jan 5 at 13:17

Halfgaar

5,33543062

I already added the new disk (sdh1) with mdadm --manage --add and it is there as a spare disk now. Can I just mark the old device as failed? Or should I remove the spare again and create a partition from a gdisk backup? I read somewhere that I need to set a new UUID if I restore the partition backup.

– kevinq
Jan 5 at 13:39

1

“RAID5 should use the parity information to detect errors. ” — what do you mean “should”? Linux software RAID doesn’t work that way

– poige
Jan 5 at 14:46

1

@kevinq, it will work if you mark the old device as failed. You will just have to remember that in the future, new disks only need to have partitions as big as the smallest drive. As for the UUID, I think you're right. If you have a new enough sfdisk (with GPT support), you can see the UUIDs with sfdisk -d /dev/sdX. I used to clone MBR partition tables with sfdisk -d /dev/sda | sfdisk /dev/sdb, but with UUIDs, I need to rethink this.

– Halfgaar
Jan 5 at 15:57
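
The fail-and-rebuild step discussed in the comment above could look roughly like the sketch below. It assumes the device names from the question (array `/dev/md0`, old member `/dev/sdc1`, spare `/dev/sdh1` already added); `run` and `swap_member` are hypothetical helpers, and commands are only printed unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Print commands by default; set DRY_RUN=0 to really execute them (as root).
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then echo "+ $*"; else "$@"; fi
}

# Hypothetical helper: mark the old member failed and remove it.
# With a spare present, md starts rebuilding onto the spare on its own.
swap_member() {
    run mdadm --manage "$1" --fail "$2"
    run mdadm --manage "$1" --remove "$2"
    run cat /proc/mdstat   # shows rebuild progress
}

swap_member /dev/md0 /dev/sdc1

# Alternative available in mdadm 3.3 and later: copy onto the spare while
# the old member stays active, instead of failing it first:
#   mdadm /dev/md0 --replace /dev/sdc1 --with /dev/sdh1
```

While a RAID5 rebuilds after a member is failed, the array has no redundancy, so a read error on any remaining disk during that window cannot be recovered from parity.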

@poige can you elaborate? You're saying (Linux software) RAID5 doesn't use the parity to verify what it just read? If not, I stand corrected, and what the OP sees is not unexpected, though the drive should have given an error instead of returning wrong data.

– Halfgaar
Jan 5 at 16:02

1

As far as I'm aware, nobody's RAID uses the parity information to verify the read. The closest I'm aware of are ZFS RAID-Z and Btrfs RAID, which use checksums to verify the read, and then the parity information to recover if the checksum fails.

– Mark
Jan 6 at 0:40


It's pretty common to have a cron task initiating parity mismatch checks. I'm pretty sure Debian 9 sets this up by default when the mdadm package is installed, so your system's logs should contain reports about it.

Besides, if your system's RAM is failing, that might be the primary cause.
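
A parity check like the one the cron task runs can also be started and inspected by hand through the md sysfs files. The sketch below assumes the array is `md0`, as in the question; `run`, `start_check` and `show_mismatches` are hypothetical helpers, and the write to sysfs is only printed unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Print commands by default; set DRY_RUN=0 to really execute them (as root).
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then echo "+ $*"; else "$@"; fi
}

md_sysfs=/sys/block/md0/md   # assumes the array is md0

# Ask md to read every stripe and compare data against parity.
start_check() {
    run sh -c "echo check > $md_sysfs/sync_action"
}

# Report the mismatch counter that the last check left behind.
show_mismatches() {
    if [ -r "$md_sysfs/mismatch_cnt" ]; then
        echo "mismatch_cnt: $(cat "$md_sysfs/mismatch_cnt")"
    else
        echo "mismatch_cnt: not available (no $md_sysfs on this machine)"
    fi
}

start_check
show_mismatches
```

A non-zero `mismatch_cnt` after a completed check means some stripes were inconsistent, which would fit a disk (or RAM) silently returning wrong data.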

answered Jan 5 at 14:58

poige

7,09211437

To test the system RAM, the OP can run memtest86.

– Halfgaar
Jan 5 at 15:43

whatever, that's another story

– poige
Jan 5 at 18:20

add a comment |

Your Answer

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "2"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fserverfault.com%2fquestions%2f947708%2fmdadm-raid5-random-read-errors-dying-disk%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

2 Answers
2

active

oldest

votes

2 Answers
2

active

oldest

votes

MD raid is far too conservative with kicking out disks, in my opinion. I always watch for ATA exceptions in the syslog/dmesg (I set rsyslog to notify me on those).

I actually tend to keep those backup partition tables around on servers. It makes for fast disk replacements.

edited Jan 5 at 22:03

answered Jan 5 at 13:17

Halfgaar

5,33543062

I already added the new disk (sdh1) with mdadm --manage --add and it is there as as spare disk now. Can i Just mark the old device as failed? Or should I remove the spare again, and vreate a Partition with a gdisk backup? I read somewhere that I need to set a new uuid if I install the parition backup.

– kevinq
Jan 5 at 13:39

1

“RAID5 should use the parity information to detect errors. ” — what do you mean “should”? Linux software RAID doesn’t work that way

– poige
Jan 5 at 14:46

1

@kevinq, if will work if you mark the old device as failed. You will just have to remember that in the future, new disks only need to have partitions has big as the smallest drive. As for the UUID, I think you're right. If you have a new enough sfdisk (with gpt support), you can see the UUIDs with sfdisk -d /dev/sdX. I used to clone MBR partition tables with sfdisk -d /dev/sda | sfdisk /dev/sdb, but with UUIDs, I need to rethink this.

– Halfgaar
Jan 5 at 15:57

@poige can you elaborate? You're saying (Linux software) RAID5 doesn't use the parity to verify what it just read? If not, I stand corrected, and what the OP sees is not unexpected, though the drive should have given an error instead of return wrong data.

– Halfgaar
Jan 5 at 16:02

1

As far as I'm aware, nobody's RAID uses the parity information to verify the read. The closest I'm aware of is ZRAID and BTRFS RAID, which use checksums to verify the read, and then the parity information to recover if the checksum fails.

– Mark
Jan 6 at 0:40

|
show 1 more comment

MD raid is far too conservative with kicking out disks, in my opinion. I always watch for ATA exceptions in the syslog/dmesg (I set rsyslog to notify me on those).

I actually tend to keep those backup partition tables around on servers. It makes for fast disk replacements.

edited Jan 5 at 22:03

answered Jan 5 at 13:17

Halfgaar

5,33543062

I already added the new disk (sdh1) with mdadm --manage --add and it is there as as spare disk now. Can i Just mark the old device as failed? Or should I remove the spare again, and vreate a Partition with a gdisk backup? I read somewhere that I need to set a new uuid if I install the parition backup.

– kevinq
Jan 5 at 13:39

1

“RAID5 should use the parity information to detect errors. ” — what do you mean “should”? Linux software RAID doesn’t work that way

– poige
Jan 5 at 14:46

1

@kevinq, if will work if you mark the old device as failed. You will just have to remember that in the future, new disks only need to have partitions has big as the smallest drive. As for the UUID, I think you're right. If you have a new enough sfdisk (with gpt support), you can see the UUIDs with sfdisk -d /dev/sdX. I used to clone MBR partition tables with sfdisk -d /dev/sda | sfdisk /dev/sdb, but with UUIDs, I need to rethink this.

– Halfgaar
Jan 5 at 15:57

@poige can you elaborate? You're saying (Linux software) RAID5 doesn't use the parity to verify what it just read? If not, I stand corrected, and what the OP sees is not unexpected, though the drive should have given an error instead of return wrong data.

– Halfgaar
Jan 5 at 16:02

1

As far as I'm aware, nobody's RAID uses the parity information to verify the read. The closest I'm aware of is ZRAID and BTRFS RAID, which use checksums to verify the read, and then the parity information to recover if the checksum fails.

– Mark
Jan 6 at 0:40

|
show 1 more comment

MD raid is far too conservative with kicking out disks, in my opinion. I always watch for ATA exceptions in the syslog/dmesg (I set rsyslog to notify me on those).

I actually tend to keep those backup partition tables around on servers. It makes for fast disk replacements.

edited Jan 5 at 22:03

answered Jan 5 at 13:17

Halfgaar

5,33543062

Besides, if your system's RAM is failing, that might be the primary cause.

answered Jan 5 at 14:58

poige

To test the system RAM, the OP can run memtest86.

– Halfgaar
Jan 5 at 15:43

Whatever, that's another story.

– poige
Jan 5 at 18:20

