A. Linux
B. Windows
C. MacOS
As far as I can tell, the correct answer is “none of these”. In the case of Linux, the fine manual sums up the problem pretty well:
If the underlying hard disk has write caching enabled, then the data may not really be on permanent storage when fsync() / fdatasync() return.
Well, that sucks, doesn’t it? Getting the data into the drive’s write cache doesn’t really help very much, unless perhaps it’s a battery-backed cache. It seems that you’ll need to fiddle with the hdparm utility, or perhaps some other tool if you have something other than an IDE hard drive, to turn off write caching. Your choice of filesystem and filesystem mount options can also have an impact on behavior, and apparently most Linux file systems don’t handle this very well, though Greg Smith has some good things to say about the newest versions of ext4, at least in this limited regard.
MacOS X has a similar problem. On the MacBook Pro provided to me by my employer, EnterpriseDB, I get the following pgbench results with an out-of-the-box configuration.
starting vacuum…end.
transaction type: TPC-B (sort of)
scaling factor: 25
query mode: simple
number of clients: 4
number of threads: 4
number of transactions per client: 100000
number of transactions actually processed: 400000/400000
tps = 1292.258304 (including connections establishing)
tps = 1292.281493 (excluding connections establishing)
Hmm. Even if we rather optimistically assume that a single disk flush writes the commit records for all four threads to disk in a single go, that’s still 323 writes per second. Each commit should require a full revolution of the hard drive, so apparently this puppy has a 20K RPM drive in it? Doesn’t sound likely. MacOS X is kind enough to provide an system call that bypasses write caching and really writes data to disk. If I set wal_sync_method=fsync_writethrough in postgresql.conf, performance drops from 1292 tps to 27 tps. Ouch. I guess that’s why people don’t run critical production databases on laptop-class machines.
I’m not terribly familiar with Windows, except as an (occasional) user, so I can’t comment on the situation there in detail. However, the latest PostgreSQL documentation on this topic reads:
On Windows if wal_sync_method is open_datasync (the default), write caching is disabled by unchecking My Computer\Open\{select disk drive}\Properties\Hardware\Properties\Policies\Enable write caching on the disk. Also on Windows, fsync and fsync_writethrough never do write caching.
What that means is that if you’re running the default configuration, you’d better make sure to have unchecked that box; otherwise, you’d better change the default fsync method. Assuming this documentation reflects the actual behavior of Windows, it sounds like they’re a bit ahead of the curve, actually making fsync() force a flush of the drive’s write cache out of the box. On the other hand, apparently they only do this for fsync(), and not for similar methods such as opening the file with O_DSYNC, which is supposed to be similar in effect to an fsync() after every write, but more efficient.
I think the reason why there’s so much operating system (and database!) misbehavior in these cases is that everyone is afraid that doing the right thing will lead to massive loss of performance on benchmarks. You can understand why. In the above example with MacOS X, performance dropped by 50x when doing it the right way. If the next release of MacOS X were to enable this behavior by default, or if the next release of PostgreSQL were to use wal_sync_method=fsync_writethrough on MacOS X by default, someone would doubtless complain that there’d been a huge performance regression “for no reason”. The same is presumably true for Linux or Windows. This is an unfortunate state of affairs, and it’s not clear whether it will ever get fixed, but unless and until it does, watch out! Don’t get fooled into believing that your database is safe without very careful verification.
If you need to improve performance (and don’t have the budget for a machine with a battery-backed write cache), consider setting up the system for reliable operation and then configuring synchronous_commit=off. This should recapture much of the performance you’ve lost through making your write ahead log reliable, without risking database corruption. You may lose a few transactions in the event of a power failure, but for many applications that’s an acceptable risk, and certainly better than losing the whole database.
We’ve been working on updating the documentation on these topics lately, so that it will be more clear what you need to do on your system to be certain that data is being properly flushed to disk. If you haven’t read the first section of Chapter 29, on Reliability and the Write-Ahead Log, it’s worth your time to do so. You may want to read the version in the developer documentation, which has been recently updated with some additional details that aren’t present in the released versions.
If you're new here, you may want to subscribe to our RSS feed or follow us on Twitter for product announcements, event information, and industry news.
WAL Reliability






