Some information is better left unseen
The most shocking thing I picked up recently while browsing in a bookstore: "two-thirds of hard disks die without any warning at all." I think it was in 「Googleを支える技術 巨大システムの内側の世界 (WEB+DB PRESSプラスシリーズ)」.
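And knowing that, I should have left well enough alone, but I went and ran smartmontools on my own server. Even though I already knew the disk was in bad shape (laughing through tears)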
# smartctl -s on -a /dev/hda
Something like that. And out it all came crawling (sob)
smartctl version 5.33 [i386-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     ST320410A
Serial Number:    3FG2JGXK
Firmware Version: 5.33
User Capacity:    20,020,396,032 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   6
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Sun Apr 20 00:11:11 2008 JST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF ENABLE/DISABLE COMMANDS SECTION ===
SMART Enabled.

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 420) seconds.
Offline data collection
capabilities:                    (0x1d) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Abort Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        No General Purpose Logging support.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (  22) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   068   064   025    Pre-fail  Always       -       231057925
  3 Spin_Up_Time            0x0003   099   099   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       1
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       10
  7 Seek_Error_Rate         0x000f   082   060   030    Pre-fail  Always       -       170597557
  9 Power_On_Hours          0x0032   087   087   000    Old_age   Always       -       12111
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       4
194 Temperature_Celsius     0x0022   043   055   000    Old_age   Always       -       43
195 Hardware_ECC_Recovered  0x001a   100   253   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   095   095   000    Old_age   Always       -       56
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 TA_Increase_Count       0x0032   100   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 230 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 230 occurred at disk power-on lifetime: 12084 hours (503 days + 12 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 07 b6 82 2b e2  Error: UNC 7 sectors at LBA = 0x022b82b6 = 36405942

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 b5 82 2b e2 00      00:00:01.265  READ DMA
  c8 00 08 35 ff 62 e0 00      00:00:12.945  READ DMA
  ca 00 08 35 a1 14 e1 00      00:00:00.312  WRITE DMA
  ca 00 08 15 a1 14 e1 00      00:00:00.318  WRITE DMA
  ca 00 10 e5 a0 14 e1 00      00:00:00.313  WRITE DMA

(↑ errors like this one are listed, but only the five most recent...)

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

Device does not support Selective Self Tests/Logging
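As a sanity check on that error entry, the LBA that smartctl prints can be rebuilt by hand from the register dump. This is only a minimal sketch: it assumes 28-bit LBA addressing, where SN, CL and CH carry bits 0-23 and the low nibble of DH carries bits 24-27. The function name `lba28_from_registers` is made up for illustration.

```python
def lba28_from_registers(sn: int, cl: int, ch: int, dh: int) -> int:
    """Rebuild a 28-bit LBA from ATA register values (assumes LBA28 layout)."""
    # Only the low nibble of DH is part of the address; the high nibble
    # holds mode/device-select bits.
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# Registers after the failed command in "Error 230": SN=b6 CL=82 CH=2b DH=e2
failed = lba28_from_registers(0xB6, 0x82, 0x2B, 0xE2)
print(f"{failed:#010x} {failed}")  # prints: 0x022b82b6 36405942

# Registers of the READ DMA that triggered it: SN=b5 CL=82 CH=2b DH=e2, SC=08
start = lba28_from_registers(0xB5, 0x82, 0x2B, 0xE2)
print(failed - start)  # prints: 1
```

So the read asked for 8 sectors starting one sector before the bad one, which fits the "UNC 7 sectors" in the error line: the first sector came back and the remaining 7 did not.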
By the way, the hard disk in the SC440 I bought just the other day has no errors. It looks like this.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   253   253   025    Pre-fail  Always       -       4352
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       26
  5 Reallocated_Sector_Ct   0x0033   253   253   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000e   253   253   000    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0024   253   253   000    Old_age   Offline      -       0
  9 Power_On_Half_Minutes   0x0032   100   100   000    Old_age   Always       -       3h+53m
 10 Spin_Retry_Count        0x0032   253   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0012   253   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       20
 13 Read_Soft_Error_Rate    0x000e   100   100   000    Old_age   Always       -       36915103
184 Unknown_Attribute       0x0033   253   253   099    Pre-fail  Always       -       0
187 Unknown_Attribute       0x0032   253   253   000    Old_age   Always       -       0
188 Unknown_Attribute       0x0032   253   253   000    Old_age   Always       -       0
190 Unknown_Attribute       0x0022   160   142   000    Old_age   Always       -       537067546
194 Temperature_Celsius     0x0022   160   142   000    Old_age   Always       -       26 (Lifetime Min/Max 0/8195)
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       36915103
196 Reallocated_Event_Count 0x0032   253   253   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   253   253   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   253   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x000a   100   100   000    Old_age   Always       -       0
201 Soft_Read_Error_Rate    0x000a   100   100   000    Old_age   Always       -       0
202 TA_Increase_Count       0x0032   253   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged
The gloriously shining "No Errors Logged".
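To compare the two disks without eyeballing the whole table, one could pull out just a handful of raw values. This is a rough sketch, assuming attribute rows shaped like the ones above (ten whitespace-separated columns with the raw value last). Which attributes to watch is my own pick, not something smartctl prescribes, and the helper name `raw_values` is made up.

```python
# Sector-health counters picked for illustration (an assumption, not a standard).
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def raw_values(smartctl_text: str) -> dict:
    """Return {attribute_name: raw_value} for the watched attributes."""
    found = {}
    for line in smartctl_text.splitlines():
        # ID#, NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW
        fields = line.split(None, 9)
        if len(fields) == 10 and fields[0].isdigit() and fields[1] in WATCHED:
            # The raw column can carry trailing notes; keep the leading number.
            found[fields[1]] = int(fields[9].split()[0])
    return found


# Rows copied from the two outputs above.
old_disk = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       10
197 Current_Pending_Sector  0x0012   095   095   000    Old_age   Always       -       56
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""
new_disk = """\
  5 Reallocated_Sector_Ct   0x0033   253   253   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   253   253   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   253   253   000    Old_age   Offline      -       0
"""

print(raw_values(old_disk))
print(raw_values(new_disk))
```

On the old disk this picks up the 10 reallocated and 56 pending sectors; on the new one everything is 0. Note that the old disk still reports "PASSED" overall, so the raw counts tell a gloomier story than the health summary does.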
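Well, in practical terms, I think the dependable setup is to build a RAID-1 array, reliably detect when one of the disks has died, and then run the server in degraded mode, or rather hand things over to a hot stand-by machine.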
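One thing to watch out for here: depending on the model of hardware RAID card, you may not be able to properly retrieve information about the hard disks attached to it. On top of that, it is often the case that the driver does not support the latest OS.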
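In that case I'm tempted to think I might as well build software RAID with something like drbd and do heartbeat checks, but the procedure for recovering from a failure actually looks like a real hassle, so it feels like "there is no free lunch."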
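As for the "reliably detect that one side has died" part, here is a minimal sketch of what the check could look like. Big assumption: it uses Linux md software RAID (not drbd, and not a hardware card), where /proc/mdstat shows each mirror member as U (up) or _ (missing). The function name `degraded_arrays` and the sample text are made up for illustration.

```python
import re


def degraded_arrays(mdstat_text: str) -> list:
    """Return names of md arrays whose status shows a missing member ('_')."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # The status line ends with something like [UU] (healthy) or [U_] (degraded).
        status = re.search(r"\[([U_]+)\]", line)
        if current and status and "_" in status.group(1):
            degraded.append(current)
    return degraded


# Hypothetical /proc/mdstat contents: md0 is healthy, md1 has lost one disk.
SAMPLE = """\
Personalities : [raid1]
md0 : active raid1 hdb1[1] hda1[0]
      19550976 blocks [2/2] [UU]

md1 : active raid1 hda2[0]
      1048512 blocks [2/1] [U_]

unused devices: <none>
"""

print(degraded_arrays(SAMPLE))  # prints: ['md1']
```

On a real machine you would feed it the contents of /proc/mdstat instead of the sample string. What to do once the list is non-empty (mail someone, fail over to the stand-by) is exactly the troublesome part, and this sketch does not touch it.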