Hard disks are black boxes of data: if something fails, you do not repair, you replace.

About 65% of all hard disk failures can be announced in advance by SMART. Woohoo! That still leaves a third without warning, so don’t rely on it alone 😀

CentOS: https://www.linuxtechi.com/smartctl-monitoring-analysis-tool-hard-drive/

# apt based systems ubuntu/debian
apt-get install smartmontools
# rpm yum based systems fedora/redhat/centos
yum install smartmontools
# enable service (SysV style; on systemd based systems the equivalent should be: systemctl enable --now smartd)
service smartd start ; chkconfig smartd on
lsblk; # check what harddisks are in your system
# enable smart for all
smartctl -s on /dev/sda
smartctl -s on /dev/sdb
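
Typing one smartctl line per disk gets tedious on machines with many drives. A minimal sketch of a loop, assuming lsblk from util-linux is installed; the helper names list_disks and enable_smart_all are made up for this example:

```shell
# list_disks: read "NAME TYPE" pairs (as printed by `lsblk -dno NAME,TYPE`)
# on stdin and print the /dev path of every whole disk.
list_disks() {
    awk '$2 == "disk" { print "/dev/" $1 }'
}

# enable_smart_all: hypothetical helper, turns SMART on for every disk found.
# Needs root and smartmontools installed.
enable_smart_all() {
    lsblk -dno NAME,TYPE | list_disks | while read -r dev; do
        smartctl -s on "$dev"
    done
}
```

Calling enable_smart_all as root should then do the same as the two smartctl lines above, for however many disks lsblk reports.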

# some commands
smartctl -i /dev/sda
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-8-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model: LITEONIT LCS-256M6S 2.5 7mm 256GB
Serial Number: TW0X.....
Firmware Version: DC8110D
User Capacity: 256,060,514,304 bytes [256 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ATA8-ACS, ATA/ATAPI-7 T13/1532D revision 4a
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Tue Jan 15 15:20:13 2019 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
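
The last two lines of that output are the ones to look at before going any further. A small sketch that checks them in a script; the function name smart_enabled is an assumption of this example, not part of smartmontools:

```shell
# smart_enabled: read `smartctl -i` output on stdin; succeed only if the
# drive reports SMART as both available and enabled.
smart_enabled() {
    awk '
        /^SMART support is:/ && /Available/ { avail = 1 }
        /^SMART support is:/ && /Enabled/   { enabled = 1 }
        END { exit !(avail && enabled) }
    '
}
```

Usage would look like: smartctl -i /dev/sda | smart_enabled && echo "SMART is on"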

# this is probably all you want to know: whether the drive PASSED its overall health test
smartctl -H /dev/sda

smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.18.13] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
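
For a cron job or a monitoring script the PASSED line can be checked automatically. A minimal sketch; health_passed is a hypothetical name, and the pattern is an assumption that matches ATA output like the one shown above (SCSI/SAS drives word this line differently):

```shell
# health_passed: read `smartctl -H` output on stdin; succeed if the overall
# self-assessment line says PASSED.
health_passed() {
    grep -q 'overall-health self-assessment test result: PASSED'
}
```

Usage would look like: smartctl -H /dev/sda | health_passed || echo "check /dev/sda now"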

# more intensive testing
smartctl --test=long /dev/sda
# you can also redirect the output to a log file (this only captures the start message; the test results are read later with -l selftest)
smartctl --test=long /dev/sdb > /var/log/sdb.smartl.txt
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-8-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 10 minutes for test to complete.
# (depending on hard disk size and speed this could take longer or shorter)
Test will complete after Tue Jan 15 16:09:36 2019

Use smartctl -X to abort test.

# when test is finished view results like this
smartctl -l selftest /dev/sdb
# check error log of drive
smartctl -l error /dev/sdb
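
The self-test log lists the newest test first, on the line starting with "# 1". A sketch that checks only that entry; last_selftest_ok is a made-up name for this example:

```shell
# last_selftest_ok: read `smartctl -l selftest` output on stdin; succeed if
# the most recent entry (the line starting with "# 1") completed cleanly.
last_selftest_ok() {
    grep -E '^# *1 ' | grep -q 'Completed without error'
}
```

Usage would look like: smartctl -l selftest /dev/sdb | last_selftest_ok && echo "last self-test was clean"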

# show temp of hd
smartctl -A /dev/sda | grep -i temperature
194 Temperature_Celsius 0x0002 176 176 000 Old_age Always - 34 (Min/Max 25/39)
# same query, forcing the device type to ATA
smartctl -d ata -A /dev/sda | grep -i temperature
194 Temperature_Celsius 0x0002 176 176 000 Old_age Always - 34 (Min/Max 25/39)
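
The actual temperature is the raw value, the 10th column of that line (34 here). A sketch that pulls it out and compares it against a limit; disk_temp and temp_too_hot are hypothetical helpers, and not every drive reports attribute 194 (some use 190 instead), so treat the ID as an assumption:

```shell
# disk_temp: read `smartctl -A` output on stdin and print the raw value
# (10th column) of attribute 194, Temperature_Celsius.
disk_temp() {
    awk '$1 == 194 { print $10 }'
}

# temp_too_hot LIMIT: succeed if the temperature read from stdin is above
# LIMIT degrees Celsius; fail if it is not or if no temperature was found.
temp_too_hot() {
    t=$(disk_temp)
    [ -n "$t" ] && [ "$t" -gt "$1" ]
}
```

Usage would look like: smartctl -A /dev/sda | temp_too_hot 50 && echo "disk is running hot"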

# enable permanent supervision with the smartd daemon
# (the debian part was tested on Debian GNU/Linux 8 (jessie) and should be the same for ubuntu)

# on rpm based distros (fedora/redhat/centos) the main config resides here; on debian/ubuntu it is /etc/smartd.conf
vim /etc/smartmontools/smartd.conf

# The word DEVICESCAN will cause any remaining lines in this
# configuration file to be ignored: it tells smartd to scan for all
# ATA and SCSI devices. DEVICESCAN may be followed by any of the
# Directives listed below, which will be applied to all devices that
# are found. Most users should comment out DEVICESCAN and explicitly
# list the devices that they wish to monitor.
DEVICESCAN -H -m root -M exec /usr/libexec/smartmontools/smartdnotify -n standby,10,q
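
The comment in the config file recommends listing devices explicitly instead of using DEVICESCAN. A sketch of how such a config could be generated; explicit_devices is a hypothetical helper that only prints to stdout, and the directives -a (monitor everything) and -m root (mail warnings to root) are one possible choice, not the author's setup:

```shell
# explicit_devices CONF DEV...: print the smartd.conf file CONF with every
# DEVICESCAN line commented out, followed by one explicit line per device.
# Nothing is written back to CONF.
explicit_devices() {
    conf=$1
    shift
    sed 's/^DEVICESCAN/#&/' "$conf"
    for dev in "$@"; do
        # -a: monitor all SMART properties, -m root: mail warnings to root
        printf '%s -a -m root\n' "$dev"
    done
}
```

Usage would look like: explicit_devices /etc/smartmontools/smartd.conf /dev/sda /dev/sdb > /tmp/smartd.conf.new, then review the new file before copying it into place and restarting smartd.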

# on debian/ubuntu the defaults for the init script are here
vim /etc/default/smartmontools; # uncomment the lines shown below
# Defaults for smartmontools initscript (/etc/init.d/smartmontools)
# This is a POSIX shell fragment

# List of devices you want to explicitly enable S.M.A.R.T. for
# Not needed (and not recommended) if the device is monitored by smartd
enable_smart="/dev/sda"

# uncomment to start smartd on system startup
start_smartd=yes

# uncomment to pass additional options to smartd on startup
smartd_opts="--interval=1800"

# save and quit
/etc/init.d/smartmontools start
/etc/init.d/smartmontools status

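# check the log to see whether smartd came up; this example is from a VM whose virtual disks offer no SMART data, so smartd exits
grep smartd /var/log/syslog;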
Jul 10 14:09:29 debian9 smartd[2214]: smartd 6.6 2016-05-31 r4324 [x86_64-linux-4.12.0cuztom] (local build)
Jul 10 14:09:29 debian9 smartd[2214]: Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
Jul 10 14:09:29 debian9 smartd[2214]: Opened configuration file /etc/smartd.conf
Jul 10 14:09:29 debian9 smartd[2214]: Drive: DEVICESCAN, implied '-a' Directive on line 21 of file /etc/smartd.conf
Jul 10 14:09:29 debian9 smartd[2214]: Configuration file /etc/smartd.conf was parsed, found DEVICESCAN, scanning devices
Jul 10 14:09:29 debian9 smartd[2214]: Device: /dev/sda, opened
Jul 10 14:09:29 debian9 smartd[2214]: Device: /dev/sda, [Msft Virtual Disk 1.0 ], lu id: 0x600224809dc1e216e67567192f1dfd5d, 136 GB
Jul 10 14:09:29 debian9 smartd[2214]: Device: /dev/sda, Bad IEC (SMART) mode page, err=5, skip device
Jul 10 14:09:29 debian9 smartd[2214]: Device: /dev/sdb, opened
Jul 10 14:09:29 debian9 smartd[2214]: Device: /dev/sdb, [Msft Virtual Disk 1.0 ], lu id: 0x60022480e1ae953e1aeb2f0ea469dca8, 136 GB
Jul 10 14:09:29 debian9 smartd[2214]: Device: /dev/sdb, Bad IEC (SMART) mode page, err=5, skip device
Jul 10 14:09:29 debian9 smartd[2214]: Unable to monitor any SMART enabled devices. Try debug (-d) option. Exiting...
Jul 10 14:09:29 debian9 systemd[1]: smartd.service: Main process exited, code=exited, status=17/n/a
Jul 10 14:09:29 debian9 systemd[1]: smartd.service: Unit entered failed state.
Jul 10 14:09:29 debian9 systemd[1]: smartd.service: Failed with result 'exit-code'.
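
When smartd refuses to start like this, the interesting lines are the ones ending in "skip device". A sketch that lists the affected devices; skipped_devices is a made-up name, and the pattern assumes log lines shaped like the ones above:

```shell
# skipped_devices: read smartd log lines on stdin and print each device
# that smartd gave up on (lines ending in "skip device").
skipped_devices() {
    sed -n 's/.*Device: \([^, ]*\),.*skip device$/\1/p'
}
```

Usage would look like: grep smartd /var/log/syslog | skipped_devices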

# on rpm based systems the smartd messages end up in /var/log/messages
grep smartd /var/log/*
grep: /var/log/anaconda: Is a directory
grep: /var/log/atop: Is a directory
grep: /var/log/audit: Is a directory
grep: /var/log/chrony: Is a directory
/var/log/messages-20190106:Jan  5 20:02:40 domainName smartd[896]: smartd 6.5 2016-05-07 r4318 [x86_64-linux-4] (local build)
/var/log/messages-20190106:Jan  5 20:02:40 domainName smartd[896]: Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
/var/log/messages-20190106:Jan  5 20:02:40 domainName smartd[896]: Opened configuration file /etc/smartmontools/smartd.conf
/var/log/messages-20190106:Jan  5 20:02:40 domainName smartd[896]: Configuration file /etc/smartmontools/smartd.conf was parsed, found DEVICESCAN, scanning devices
/var/log/messages-20190106:Jan  5 20:02:40 domainName smartd[896]: Device: /dev/sda, type changed from 'scsi' to 'sat'
/var/log/messages-20190106:Jan  5 20:02:40 domainName smartd[896]: Device: /dev/sda [SAT], opened
/var/log/messages-20190106:Jan  5 20:02:40 domainName smartd[896]: Device: /dev/sda [SAT], TOSHIBA DT01ACA300, S/N: WWN:5-000039-ff4e20a1

# very detailed dump of all SMART information the drive offers
smartctl -a /dev/sda

smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-8-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model: LITEONIT LCS-256M6S 2.5 7mm 256GB
Serial Number: ....
Firmware Version: DC8110D
User Capacity: 256,060,514,304 bytes [256 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ATA8-ACS, ATA/ATAPI-7 T13/1532D revision 4a
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Tue Jan 15 15:49:47 2019 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (  10) seconds.
Offline data collection
capabilities:                    (0x15) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Abort Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0002) Does not save SMART data before
                                        entering power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x00) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (  10) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0003   100   100   000    Pre-fail  Always       -       0 (0 24)
 12 Power_Cycle_Count       0x0003   100   100   000    Pre-fail  Always       -       2011
175 Program_Fail_Count_Chip 0x0003   100   100   000    Pre-fail  Always       -       0
176 Erase_Fail_Count_Chip   0x0003   100   100   000    Pre-fail  Always       -       0
177 Wear_Leveling_Count     0x0003   100   100   000    Pre-fail  Always       -       319134
178 Used_Rsvd_Blk_Cnt_Chip  0x0003   100   100   000    Pre-fail  Always       -       3
179 Used_Rsvd_Blk_Cnt_Tot   0x0003   100   100   000    Pre-fail  Always       -       96
180 Unused_Rsvd_Blk_Cnt_Tot 0x0003   093   093   005    Pre-fail  Always       -       1408
181 Program_Fail_Cnt_Total  0x0003   100   100   000    Pre-fail  Always       -       0
182 Erase_Fail_Count_Total  0x0003   100   100   000    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0003   100   100   000    Pre-fail  Always       -       96
195 Hardware_ECC_Recovered  0x0003   100   100   000    Pre-fail  Always       -       0
241 Total_LBAs_Written      0x0003   100   100   000    Pre-fail  Always       -       328352
242 Total_LBAs_Read         0x0003   100   100   000    Pre-fail  Always       -       492805
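
An attribute counts as failed once its normalized VALUE drops to or below THRESH; in the table above none has (the closest is 180 with 093 against 005). A sketch that scans the table for such rows; failing_attrs is a hypothetical name, and the column positions are an assumption based on the layout shown here:

```shell
# failing_attrs: read the attribute table from `smartctl -A` on stdin and
# print ID and name of every attribute whose normalized VALUE (column 4)
# is at or below its THRESH (column 6). Prints nothing if all is well.
failing_attrs() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 && ($4 + 0) <= ($6 + 0) { print $1, $2 }'
}
```

Usage would look like: smartctl -A /dev/sda | failing_attrs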

SMART Error Log Version: 1
Warning: ATA error count 0 inconsistent with error log pointer 1

ATA Error Count: 0
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
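
The odd looking 49.710 days is what you get when a millisecond counter is 32 bits wide. That reading is an assumption (the output does not say how the counter is stored), but the arithmetic fits; wrap_days is a made-up name for this sketch:

```shell
# wrap_days: print how many days it takes a 32-bit millisecond counter
# to overflow: 2^32 ms divided by the milliseconds in one day.
wrap_days() {
    LC_ALL=C awk 'BEGIN { printf "%.3f\n", 2^32 / (1000 * 60 * 60 * 24) }'
}
```

Running wrap_days prints 49.710, so Powered_Up_Time values in the error log only make sense relative to the last wrap.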

Error 0 occurred at disk power-on lifetime: 768 hours (32 days + 0 hours)
When the command that caused the error occurred, the device was in an unknown state.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
00 62 00 00 08 80 fc

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
62 00 08 00 08 80 fc 01 27d+14:41:32.897 [RESERVED FOR SERIAL ATA]
00 00 08 00 00 00 00 01 27d+14:39:35.936 NOP [Abort queued commands]
00 00 08 00 00 00 00 01 27d+14:38:30.400 NOP [Abort queued commands]
00 00 08 00 00 00 00 01 27d+14:37:24.864 NOP [Abort queued commands]
00 00 08 00 00 00 00 00 27d+14:36:19.328 NOP [Abort queued commands]

Error -3 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
When the command that caused the error occurred, the device was in an unknown state.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
00 00 00 00 00 00 00

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
00 03 00 00 00 00 00 00 00:00:00.000 NOP [Reserved subcommand] [OBS-ACS-2]

Error -4 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
When the command that caused the error occurred, the device was in an unknown state.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
00 80 00 00 00 00 08

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
00 eb 01 00 00 00 00 ce 00:00:00.000 NOP [Reserved subcommand] [OBS-ACS-2]
80 00 00 00 08 00 08 00 39d+00:10:14.140 [VENDOR SPECIFIC]

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 0 -

Links:

https://www.linuxtechi.com/smartctl-monitoring-analysis-tool-hard-drive/
