Thread: wdavdaemon / Microsoft Defender for Endpoint on Linux and slow Postgres recovery?
wdavdaemon / Microsoft Defender for Endpoint on Linux and slow Postgres recovery?
From
"Colin 't Hart"
Date:
Hi,
One of my clients has Microsoft Defender for Endpoint on Linux installed on their Postgres servers.
I was testing a database restore from pgBackRest. The restore itself seemed to complete in a reasonable amount of time, but then the Postgres recovery started and it was extremely slow to retrieve and apply the WAL files.
I noticed wdavdaemon taking most of the CPU, and Postgres getting very little.
I wonder if anyone here has any experience with configuring exclusions so that the WAL files can be processed faster?
AND
Any advice on what to communicate with their IT department about using this on their database servers? I've never encountered it on Linux before...
Thanks,
Colin
Re: wdavdaemon / Microsoft Defender for Endpoint on Linux and slow Postgres recovery?
From
Adrian Klaver
Date:
On 12/2/25 06:47, Colin 't Hart wrote:
> Hi,
>
> One of my clients has Microsoft Defender for Endpoint on Linux installed
> on their Postgres servers.
>
> I was testing a database restore from pgBackRest. The restore itself
> seemed to complete in a reasonable amount of time, but then the Postgres
> recovery started and it was extremely slow to retrieve and apply the WAL
> files.
>
> I noticed wdavdaemon taking most of the CPU, and Postgres getting very
> little.
>
> I wonder if anyone here has any experience with configuring exclusions
> so that the WAL files can be processed faster?
>
> AND
>
> Any advice on what to communicate with their IT department about using
> this on their database servers? I've never encountered it on Linux before...
Advice, don't let any Microsoft product contact anything you care about.
>
> Thanks,
>
> Colin
--
Adrian Klaver
adrian.klaver@aklaver.com
Re: wdavdaemon / Microsoft Defender for Endpoint on Linux and slow Postgres recovery?
From
Christoph Moench-Tegeder
Date:
## Colin 't Hart (colinthart@gmail.com):
> I wonder if anyone here has any experience with configuring exclusions so
> that the WAL files can be processed faster?
https://learn.microsoft.com/en-us/defender-endpoint/linux-exclusions
mind this:
https://learn.microsoft.com/en-us/defender-endpoint/linux-exclusions#supported-exclusion-scopes
and work from these examples (if you're allowed to):
https://learn.microsoft.com/en-us/defender-endpoint/linux-exclusions#example-3-add-or-remove-a-folder-exclusion
> Any advice on what to communicate with their IT department about using this
> on their database servers? I've never encountered it on Linux before...
"Be glad it only slows your database down. All too often, AV/Endpoint
Protection Products just don't like the access pattern and eat your
database for breakfast." There is this joke "it has been 0 days since
Anti-Virus ate a database".
Regards,
Christoph
--
Spare Space
Re: wdavdaemon / Microsoft Defender for Endpoint on Linux and slow Postgres recovery?
From
"Colin 't Hart"
Date:
Thanks. I just get
This setting is managed by your organization
so I'm going to have to talk with the IT guys... we have a meeting scheduled tomorrow.
/Colin
On Tue, Dec 2, 2025 at 3:35 PM Christoph Moench-Tegeder <cmt@burggraben.net> wrote:
> ## Colin 't Hart (colinthart@gmail.com):
>> I wonder if anyone here has any experience with configuring exclusions so
>> that the WAL files can be processed faster?
> https://learn.microsoft.com/en-us/defender-endpoint/linux-exclusions
> mind this:
> https://learn.microsoft.com/en-us/defender-endpoint/linux-exclusions#supported-exclusion-scopes
> and work from these examples (if you're allowed to):
> https://learn.microsoft.com/en-us/defender-endpoint/linux-exclusions#example-3-add-or-remove-a-folder-exclusion
>> Any advice on what to communicate with their IT department about using this
>> on their database servers? I've never encountered it on Linux before...
> "Be glad it only slows your database down. All too often, AV/Endpoint
> Protection Products just don't like the access pattern and eat your
> database for breakfast." There is this joke "it has been 0 days since
> Anti-Virus ate a database".
Things must have improved: we had Carbon Black for a number of years, and now use Cortex XDR.
CB would quite often consume 300% CPU, while XDR "only" uses 100% on occasion, but neither has ever corrupted or crashed a PG instance. (These are standard installations, with no exclusions.)
Death to <Redacted>, and butter sauce.
Don't boil me, I'm still alive.
<Redacted> lobster!
Re: wdavdaemon / Microsoft Defender for Endpoint on Linux and slow Postgres recovery?
From
Thomas Munro
Date:
On Wed, Dec 3, 2025 at 3:48 AM Colin 't Hart <colinthart@gmail.com> wrote:
> One of my clients has Microsoft Defender for Endpoint on Linux installed on their Postgres servers.
>
> I was testing a database restore from pgBackRest. The restore itself seemed to complete in a reasonable amount of time, but then the Postgres recovery started and it was extremely slow to retrieve and apply the WAL files.
>
> I noticed wdavdaemon taking most of the CPU, and Postgres getting very little.
These days, tools like that work by monitoring every read, write etc via kernel event queues (fanotify on Linux, ESF on macOS, IDK on Windows, it might still be using something more efficient but less isolated with tentacles inside the kernel). Those queues usually have a fixed size and when they overflow because the event consumer isn't keeping up, the monitored process can be blocked. That's probably true even if running in a mode where it doesn't have to reply to allow the operation to proceed. Presumably the consumer is running some kind of rolling fingerprint check over the data looking for things from its database of malware, which you'd hope would be very well optimised...
My pet theory is that PostgreSQL suffers from these systems more than anything else not because of the total bandwidth but because of the per-operation overheads and our historical 8KB-at-a-time disk and network I/O. Your report about pgBackRest supports that idea: it probably copies a larger total size in big chunks, while recovery reads the WAL 8KB at a time (and evicts data 8KB at a time if your buffer pool is small), and then finally the checkpointer writes back 8KB at a time. Another factor is that it might be using only one fanotify queue for each process, or worse, but IDK if that matters, it sounds like the CPU might be saturated anyway?
Future releases should improve all of that with bigger I/Os for WAL (read through an 8KB drinking straw, dunno if it's spying on reads too?) and data (I/O combining, various strategies, various prototypes[1][2], watch this space). It's also been proposed a few times that we should have an option to skip the end-of-recovery checkpoint, so then you'd get a regular "spread" checkpoint that the spyware could keep up with (assuming that it normally keeps up, just not in crash recovery).
Another thing that probably makes this worse in this strange environment, if we assume it is due to small writes and reads are not affected, is that crash recovery currently dirties all pages that the WAL touches, forgetting progress that already made it to disk: it overwrites the LSN with an FPW and then replays all changes on top, when it could instead read the page in and skip a lot of work if the LSN is high enough, thereby often avoiding dirtying and re-writing the page, whenever checksums are on (as they are now by default). The checksum could be used as proof that the page wasn't torn by a non-atomic write interrupted by a power outage.
I doubt anyone is really that interested in optimising for such setups per se when anyone will tell you to just turn it off, but the reason I've thought about it enough to take a guess is that my corporate-managed Mac was running the PostgreSQL test suite so slowly it would time out, and I was sufficiently nerd-sniped to figure out that it could keep up with bursts of I/O pretty well, but everything turned to custard under sustained workloads, notably in the recovery tests which deliberately run with a tiny buffer pool. As someone working on bits of our I/O plumbing, I couldn't help speculating that something that is objectively terrible about PostgreSQL is really just being magnified by strange new overheads that mess with the economics. It may not be a goal but I will still be happy if it copes with this stuff as a by-product of general improvements like generalised I/O combining. (Funnily enough I've actually got a bunch of unpublished tooling to simulate, detect and manage invisible I/O queuing.)
> I wonder if anyone here has any experience with configuring exclusions so that the WAL files can be processed faster?
Yep, it entirely fixed the cliff and vastly reduced the CPU usage on my corporate Mac. There is still a small measurable slowdown, but the recovery test suite couldn't even complete without timing out while monitored. I expect exactly the same on Linux but haven't tried it.
> Any advice on what to communicate with their IT department about using this on their database servers? I've never encountered it on Linux before...
There is lots of writing on the internet about excluding pgdata from these types of tools. Much of it is concerned with Windows-specific problems: opening files and directories or mapping files at bad times can cause various PostgreSQL file operations to fail on that OS. I don't know of any reason why periodic scans of pgdata should interfere with PostgreSQL on Linux other than consuming I/O bandwidth, it seems to be just the per-syscall stuff that is unworkable.
You might be able to show "meson test" failing as some kind of evidence that PostgreSQL is allergic to it. Or if you want to try to find a one-liner demonstration independent of PostgreSQL, you could test the can't-keep-up-with-stream-of-tiny-writes theory by experimenting with "dd" at different block sizes. I expect you'll find a size below which the fanotify queue quickly overflows and performance falls off a cliff.
Current versions of PostgreSQL assumed fast and consistent buffered writes and pretended the system calls were free. These monitoring tools make them expensive and also non-linear by sending messages around with carrier pigeons.
[1] https://www.postgresql.org/message-id/flat/CAAKRu_bcWRvRwZUop_d9vzF9nHAiT%2B-uPzkJ%3DS3ShZ1GqeAYOw%40mail.gmail.com
[2] https://www.postgresql.org/message-id/flat/CA%2BhUKGK1in4FiWtisXZ%2BJo-cNSbWjmBcPww3w3DBM%2BwhJTABXA%40mail.gmail.com
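Editor's sketch: the "dd at different block sizes" experiment suggested in the message above could be approximated in Python roughly as follows. This is only an illustration, not something posted in the thread. The function names, the 64 MiB total and the list of block sizes are all assumptions chosen for the example. It writes the same total amount of data with progressively larger writes and reports throughput, so a cliff at small sizes (if there is one) would show up as a sharp drop in MiB/s.

```python
import os
import tempfile
import time


def timed_write(path, total_bytes, block_size):
    """Write total_bytes to path in block_size chunks; return elapsed seconds."""
    block = b"\0" * block_size
    start = time.perf_counter()
    # buffering=0 means each f.write() is issued as its own write system call,
    # which is the per-operation cost the thread is speculating about.
    with open(path, "wb", buffering=0) as f:
        for _ in range(total_bytes // block_size):
            f.write(block)
    return time.perf_counter() - start


def sweep(directory, total_bytes=64 * 1024 * 1024,
          block_sizes=(1024, 8192, 65536, 1024 * 1024)):
    """Return {block_size: MiB/s} for writes of the same total size.

    The default sizes are illustrative assumptions; 8192 mirrors the 8KB
    I/O size mentioned in the thread.
    """
    results = {}
    for bs in block_sizes:
        fd, path = tempfile.mkstemp(dir=directory)
        os.close(fd)
        try:
            elapsed = timed_write(path, total_bytes, bs)
        finally:
            os.remove(path)
        results[bs] = total_bytes / (1024 * 1024) / max(elapsed, 1e-9)
    return results


if __name__ == "__main__":
    # Point this at a directory on the monitored filesystem of interest.
    for bs, rate in sweep(tempfile.gettempdir()).items():
        print(f"{bs:>8} bytes/write: {rate:10.1f} MiB/s")
```

Running it once against a monitored directory and once against an excluded one (if IT permits an exclusion) would give two tables to compare. The numbers say nothing about fanotify directly; they only show whether small writes are disproportionately slow on that machine.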