[linux] Faulty disk in RAID - how to find the cause or the bad sector

patrik at foral.sk
Tue Jun 2 15:20:34 CEST 2009


> Thanks, that was exactly the problem.
> I am running a self-test right now:
> smartctl -t long -d ata /dev/sda
> smartctl -t long -d ata /dev/sdb
> and the test is already under way.

SMART results:

ID# ATTRIBUTE_NAME          VALUE WORST THRESH TYPE      RAW_VALUE
  1 Raw_Read_Error_Rate     200   200   051    Pre-fail  5
  3 Spin_Up_Time            185   184   021    Pre-fail  1750
  4 Start_Stop_Count        100   100   000    Old_age   16
  5 Reallocated_Sector_Ct   200   200   140    Pre-fail  0
  7 Seek_Error_Rate         200   200   000    Old_age   0
  9 Power_On_Hours          095   095   000    Old_age   4341
 10 Spin_Retry_Count        100   253   000    Old_age   0
 11 Calibration_Retry_Count 100   253   000    Old_age   0
 12 Power_Cycle_Count       100   100   000    Old_age   16
192 Power-Off_Retract_Count 200   200   000    Old_age   3
193 Load_Cycle_Count        200   200   000    Old_age   16
194 Temperature_Celsius     117   108   000    Old_age   26
196 Reallocated_Event_Count 200   200   000    Old_age   0
197 Current_Pending_Sector  200   200   000    Old_age   0
198 Offline_Uncorrectable   200   200   000    Old_age   0
199 UDMA_CRC_Error_Count    200   200   000    Old_age   0
200 Multi_Zone_Error_Rate   200   200   000    Old_age   0

...I dropped some columns so that each attribute fits on one line.
All the values look OK to me except Raw_Read_Error_Rate. I am not
sure about that one, but from what I found by googling, the number
in VALUE is higher than the number in THRESH, so it should be OK.
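
The VALUE versus THRESH comparison can be scripted. Below is a minimal
sketch, assuming the trimmed column layout shown above (ID, name, VALUE,
WORST, THRESH, TYPE, RAW_VALUE) and the usual SMART rule that an
attribute counts as failed once VALUE is less than or equal to THRESH.
The function name check_smart_table is made up for illustration; full
smartctl -A output has more columns, so the field numbers would need
adjusting there.

```shell
# check_smart_table (hypothetical helper): reads attribute rows in the
# trimmed layout "ID NAME VALUE WORST THRESH TYPE RAW_VALUE" on stdin and
# prints "STATUS NAME VALUE THRESH" for each row.
# STATUS is FAIL when VALUE <= THRESH, otherwise ok.
check_smart_table() {
    awk '$1 ~ /^[0-9]+$/ {
        value  = $3 + 0     # "051" becomes 51; input strings are read as decimal
        thresh = $5 + 0
        status = (value <= thresh) ? "FAIL" : "ok"
        print status, $2, value, thresh
    }'
}
```

Fed with the table above (for example pasted into a file and redirected
to stdin), every row should come out as ok, including
Raw_Read_Error_Rate with VALUE 200 against THRESH 51.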

The test output, however, contains several error records, which I am
attaching. I hope it is not a problem that it is a bit long:


Error 18 occurred at disk power-on lifetime: 3302 hours (137 days + 14 hours)
   When the command that caused the error occurred, the device was active or idle.

   After command completion occurred, registers were:
   ER ST SC SN CL CH DH
   -- -- -- -- -- -- --
   04 51 00 34 cf f3 a3

   Commands leading to the command that caused the error were:
   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
   -- -- -- -- -- -- -- --  ----------------  --------------------
   ea 00 00 c6 f4 7f 66 08  49d+17:02:41.210  FLUSH CACHE EXT
   ea 00 00 c6 f4 7f 66 08  49d+17:02:30.772  FLUSH CACHE EXT
   ca 00 08 bf f4 7f 00 08  49d+17:02:30.772  WRITE DMA
   ea 00 00 66 f9 00 5e 08  49d+17:02:30.772  FLUSH CACHE EXT
   ca 00 08 5f f9 00 00 08  49d+17:02:30.568  WRITE DMA

Error 17 occurred at disk power-on lifetime: 2109 hours (87 days + 21 hours)
   When the command that caused the error occurred, the device was active or idle.

   After command completion occurred, registers were:
   ER ST SC SN CL CH DH
   -- -- -- -- -- -- --
   10 51 08 bf f4 7f e0

   Commands leading to the command that caused the error were:
   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
   -- -- -- -- -- -- -- --  ----------------  --------------------
   ec 00 08 34 cf f3 00 08  49d+17:02:44.358  IDENTIFY DEVICE
   ea 00 08 b7 83 25 00 08  49d+17:02:44.350  FLUSH CACHE EXT
   ec 00 08 b7 83 25 00 08  49d+17:02:44.147  IDENTIFY DEVICE
   ec 00 08 6f 18 24 00 08  49d+17:02:44.137  IDENTIFY DEVICE
   ec 00 00 34 cf f3 00 08  49d+17:02:44.127  IDENTIFY DEVICE

Error 16 occurred at disk power-on lifetime: 2109 hours (87 days + 21 hours)
   When the command that caused the error occurred, the device was active or idle.

   After command completion occurred, registers were:
   ER ST SC SN CL CH DH
   -- -- -- -- -- -- --
   04 51 08 34 cf f3 a3

   Commands leading to the command that caused the error were:
   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
   -- -- -- -- -- -- -- --  ----------------  --------------------
   ea 00 08 b7 83 25 00 08  49d+17:02:44.350  FLUSH CACHE EXT
   ec 00 08 b7 83 25 00 08  49d+17:02:44.147  IDENTIFY DEVICE
   ec 00 08 6f 18 24 00 08  49d+17:02:44.137  IDENTIFY DEVICE
   ec 00 00 34 cf f3 00 08  49d+17:02:44.127  IDENTIFY DEVICE
   ea 00 00 c6 f4 7f 37 08  49d+17:02:44.119  FLUSH CACHE EXT

Error 15 occurred at disk power-on lifetime: 2109 hours (87 days + 21 hours)
   When the command that caused the error occurred, the device was active or idle.

   After command completion occurred, registers were:
   ER ST SC SN CL CH DH
   -- -- -- -- -- -- --
   10 51 08 b7 83 25 e0

   Commands leading to the command that caused the error were:
   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
   -- -- -- -- -- -- -- --  ----------------  --------------------
   ec 00 08 6f 18 24 00 08  49d+17:02:44.137  IDENTIFY DEVICE
   ec 00 00 34 cf f3 00 08  49d+17:02:44.127  IDENTIFY DEVICE
   ea 00 00 c6 f4 7f 37 08  49d+17:02:44.119  FLUSH CACHE EXT
   ca 00 08 bf f4 7f 00 08  49d+17:02:44.119  WRITE DMA
   ea 00 08 37 89 5e 00 08  49d+17:02:44.119  FLUSH CACHE EXT


I admit I do not really understand how serious these errors are.
I assume, though, that it is nothing that could make the array drop
into degraded mode.
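
As far as I understand the ATA register layout, the ER byte and the
SN/CL/CH/DH bytes of each record can be decoded by hand. The sketch
below assumes 28-bit LBA addressing (only valid when bit 0x40 of DH is
set) and decodes only the four ER bits I am sure about (ABRT, IDNF,
UNC, ICRC). The function names decode_lba28 and decode_er are made up
for illustration.

```shell
# decode_lba28 SN CL CH DH (hypothetical helper): prints the 28-bit LBA
# built from the hex register bytes of one error record.
# Returns 1 without output when the LBA-mode bit (0x40) in DH is not set.
decode_lba28() {
    sn=$((0x$1)); cl=$((0x$2)); ch=$((0x$3)); dh=$((0x$4))
    [ $((dh & 0x40)) -ne 0 ] || return 1
    # The low nibble of DH holds LBA bits 27..24, then CH, CL, SN.
    echo $(( ((dh & 0x0f) << 24) | (ch << 16) | (cl << 8) | sn ))
}

# decode_er ER (hypothetical helper): prints the names of the error bits
# set in the hex ER byte. Other bits are ignored by this sketch.
decode_er() {
    er=$((0x$1)); names=""
    [ $((er & 0x04)) -ne 0 ] && names="$names ABRT"   # command aborted
    [ $((er & 0x10)) -ne 0 ] && names="$names IDNF"   # sector ID not found
    [ $((er & 0x40)) -ne 0 ] && names="$names UNC"    # uncorrectable data
    [ $((er & 0x80)) -ne 0 ] && names="$names ICRC"   # interface CRC error
    echo "${names# }"
}
```

For Error 17 (ER 10, SN bf, CL f4, CH 7f, DH e0) this gives IDNF at
LBA 8385727, the same address that appears in the WRITE DMA commands
listed above. Errors 18 and 16 have ER 04 (ABRT), and their DH value a3
does not have the LBA bit set, so decode_lba28 declines to decode them.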

As for the logs, /var/log/kern.log is still empty. What else should I
check, or where could the problem be? (Debian Linux 4.0 / 2.6.18-6-486)
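
One thing that can be checked independently of kern.log is the array
state itself. The sketch below assumes Linux software RAID (md), where
/proc/mdstat shows a status such as [UU] for a healthy mirror and [U_]
when a member is missing. The function name degraded_arrays is made up
for illustration.

```shell
# degraded_arrays (hypothetical helper): reads /proc/mdstat-style text on
# stdin and prints the name of every md array whose member status
# (e.g. [UU], [U_]) contains an underscore, i.e. a missing member.
degraded_arrays() {
    awk '/^md[^ ]* :/            { name = $1 }   # remember the current array
         /\[[U_]*_[U_]*\]/       { print name }  # status field with a gap
    '
}
```

It would be called as: degraded_arrays < /proc/mdstat
No output would mean that no array is reported as degraded.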
-- 
Patrik Jan (pa3k)

