[geeks] Disks: recommendations?
Jonathan Patschke
jp at celestrion.net
Fri Oct 30 11:02:44 CDT 2020
On Fri, 30 Oct 2020, Mouse wrote:
>> On pretty much any decent modern SSD, wear leveling really isn't an
>> issue any more, anything not bargain-basement third tier is typically
>> now rated for multiple full device writes per DAY for longer than the
>> entire rated service life of the drive.
>
> What _is_ "the entire rated service life of the drive"? I would be
> surprised if it were not far shorter than the time I want to keep my
> data.
The seemingly-accelerated wear rate of modern SSDs is exaggerated by how
fast they are. In active use, an SSD under an extreme write-heavy load
will fail faster than a mechanical disk in terms of calendar time, but
generally not in terms of blocks written.
Unlike spinning media, though, SSDs do not have mechanical degradation at
rest, as there's no lubrication to dry up.
SSD lifetimes are rated on total-device-writes-per-day over a given number
of years, assuming an end-to-end wear pattern. If you totally fill a
device, and then rewrite the last block infinitely, you'll exhaust the
spare block pool faster, but that's a pathological use case.
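Back-of-the-envelope, the rating works out like this (a rough sketch in
Python; the capacity, DWPD, and warranty figures are made-up examples, not
any particular drive's spec):

    # Rough endurance math for a drive rated in drive-writes-per-day (DWPD).
    capacity_tb = 3.84       # example usable capacity, in TB
    dwpd = 1.0               # example rated drive writes per day
    warranty_years = 5       # example warranty period

    total_writes_tb = capacity_tb * dwpd * 365 * warranty_years
    print(f"~{total_writes_tb:.0f} TB (~{total_writes_tb / 1000:.1f} PB) "
          f"of rated writes over {warranty_years} years")

With those example numbers you get roughly 7 PB of rated writes, which is
why an end-to-end wear pattern takes a long time to burn through.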
At the day job, we write to SSDs constantly (I work on a datacenter-scale
NVMe RAID product), format the drives *way* more often than any reasonable
customer would do, and generally abuse them to make sure our redundancy
layer deals with exceptions reasonably.
I think I've seen three device failures out of several hundred in two
years. SSDs (from Micron/Crucial, Intel, WDC/SanDisk, and Samsung,
anyway) are astoundingly durable. Remember to issue discard/dsm (a.k.a.
"trim") commands for the unused portions of the media, and they'll last
ages.
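On Linux, something like the following run periodically will do that
(an untested sketch; it assumes util-linux's fstrim is installed and that
the listed mount points are examples of SSD-backed filesystems):

    import subprocess

    # Ask each filesystem to issue discard/TRIM for its unused blocks.
    # Mount points here are placeholders; substitute your own.
    for mountpoint in ("/", "/home"):
        subprocess.run(["fstrim", "-v", mountpoint], check=True)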
There are lots of low-end players in the SSD space (Adata, PNY, etc.), and
my experience with those drives is that they are about as reliable as
cheap thumbdrives. Do not trust them for data you care about.
> Do SSDs fail similarly, or do they just cross a line and go from
> "working fine" to "completely dead" when their firmware decides it's had
> enough?
SSD failure modes are manufacturer-dependent, but, generally speaking, the
failures are either media, controller, or DRAM. Media failures manifest as
stuck bits or a whole lost page (64MB or so at a time). DRAM failures
manifest as unreliable transports. Controller failures manifest as a dead
drive.
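If you want to catch the media variety before it bites, the NVMe SMART log
exposes counters for it (a sketch, assuming Linux with nvme-cli installed;
the device path is just an example):

    import subprocess

    # Dump the SMART/health log; percentage_used and media_errors are the
    # fields that track media wear and uncorrectable blocks.
    subprocess.run(["nvme", "smart-log", "/dev/nvme0"], check=True)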
> I would hope they'd instead flip from "working fine" to "read-only", but
> I have little faith such hopes would be realized.
That'd only be possible if wear were totally predictable (either
detectably or artificially so).
For what it's worth, I off-site with LTO, but that's only because of the
price of SSDs and because speed doesn't matter for disaster recovery in my
use case. I'm slowly migrating from spinning rust to SSDs for active data
because it's hard to argue with a seek time of nearly zero.
--
Jonathan Patschke | "The more you mess with it, the more you're
Austin, TX | going to *have* to mess with it."
USA | --Gearhead Proverb