Обсуждение: RADIUS tests and improvements
Hi, Here's a draft patch to tackle a couple of TODOs in the RADIUS code in auth.c. The first change is to replace select() with a standard latch loop that responds to interrupts, postmaster death etc promptly. It's not really too much of a big deal because the timeout was only 3 seconds (hardcoded), but it's not good to have places that ignore ProcSignal, and it's good to move code to our modern pattern for I/O multiplexing. We know from experience that we have to crank timeouts up to be able to run tests reliably on slow/valgrind/etc systems, so the second change is to change the timeout to a GUC, as also requested by a comment. One good side-effect is that it becomes easy and fast to test the timed-out code path too, with a small value. While adding the GUC I couldn't help wondering why RADIUS even needs a timeout separate from authentication_timeout; another way to go here would be to remove it completely, but that'd be a policy change (removing the 3 second timeout we always had). Thoughts? The patch looks bigger than it really is because it changes the indentation level. But first, some basic tests to show that it works. We can test before and after the change and have a non-zero level of confidence about whacking the code around. Like existing similar tests, you need to install an extra package (FreeRADIUS) and opt in with PG_EXTRA_TESTS=radius. I figured out how to do that for our 3 CI Unixen, so cfbot should run the tests and pass once I add this to the March commitfest. FreeRADIUS claims to work on Windows too, but I don't know how to set that up; maybe someday someone will fix that for all the PG_EXTRA_TESTS tests. I've also seen this work on a Mac with MacPorts. There's only one pathname in there that's a wild guess: non-Debianoid Linux systems; if you know the answer there please LMK.
Вложения
On 1/3/23 04:11, Thomas Munro wrote: > Here's a draft patch to tackle a couple of TODOs in the RADIUS code in auth.c. Nice to see someone working on this! I know of one company which could have used the configurable timeout for radius because the 3 second timeout is too short for 2FA. I think they ended up using PAM or some other solution in the end, but I am not 100% sure. > [...] While adding > the GUC I couldn't help wondering why RADIUS even needs a timeout > separate from authentication_timeout; another way to go here would be > to remove it completely, but that'd be a policy change (removing the 3 > second timeout we always had). Thoughts? It was some time since I last looked at the code but my impression was that the reason for having a separate timeout is that you can try the next server after the first one timed out (multiple radius servers are allowed). But I wonder if that really is a useful feature or if someone just was too clever or it just was an accidental feature. Andreas
On 1/3/23 22:03, Andreas Karlsson wrote: > On 1/3/23 04:11, Thomas Munro wrote: >> Here's a draft patch to tackle a couple of TODOs in the RADIUS code in >> auth.c. > > Nice to see someone working on this!. Another thing: shouldn't we set some wait event to indicate that we are waiting the RADIUS server or is that pointless during authentication since there are no queries running anyway? Andreas
On Wed, Jan 4, 2023 at 10:07 AM Andreas Karlsson <andreas@proxel.se> wrote: > Another thing: shouldn't we set some wait event to indicate that we are > waiting the RADIUS server or is that pointless during authentication > since there are no queries running anyway? I initially added a wait_event value, but I couldn't see it anywhere... there is no entry in pg_stat_activity for a backend that is in that phase of authentication, so I just set it to zero.
On Wed, Jan 4, 2023 at 10:03 AM Andreas Karlsson <andreas@proxel.se> wrote: > On 1/3/23 04:11, Thomas Munro wrote: > > [...] While adding > > the GUC I couldn't help wondering why RADIUS even needs a timeout > > separate from authentication_timeout; another way to go here would be > > to remove it completely, but that'd be a policy change (removing the 3 > > second timeout we always had). Thoughts? > > It was some time since I last looked at the code but my impression was > that the reason for having a separate timeout is that you can try the > next server after the first one timed out (multiple radius servers are > allowed). But I wonder if that really is a useful feature or if someone > just was too clever or it just was an accidental feature. Ah! Thanks, now that makes sense.
On 1/3/23 22:16, Thomas Munro wrote: > On Wed, Jan 4, 2023 at 10:07 AM Andreas Karlsson <andreas@proxel.se> wrote: >> Another thing: shouldn't we set some wait event to indicate that we are >> waiting the RADIUS server or is that pointless during authentication >> since there are no queries running anyway? > > I initially added a wait_event value, but I couldn't see it > anywhere... there is no entry in pg_stat_activity for a backend that > is in that phase of authentication, so I just set it to zero. Thanks for the explanation, that makes a lot of sense! Andreas
New improved version: * fixed stupid misuse of PG_FINALLY() (oops, must have been thinking of another language) * realised that it was strange to have a GUC for the timeout, and made a new HBA parameter instead * added documentation for that * used TimestampDifferenceMilliseconds() instead of open-coded TimestampTz maths I don't exactly love the PG_TRY()/PG_CATCH() around the CHECK_FOR_INTERRUPTS(). In fact this kind of CFI-with-cleanup problem has been haunting me across several projects. For cases that memory contexts and resource owners can't help with, I don't currently know what else to do here. Better ideas welcome. If I just let that socket leak because I know this backend will soon exit, I'd expect a knock at the door from the programming police. I don't actually know why we have src/test/authentication/t/...{password,sasl,peer}..., but then src/test/{kerberos,ldap,ssl}/t/001_auth.pl. For this one, I just copied the second style, creating src/test/radius/t/001_auth.pl. I can't explain why it should be like that, though. If I propose another test for PAM, where should it go?
Вложения
On Sat, Mar 04, 2023 at 02:23:10PM +1300, Thomas Munro wrote: > I don't exactly love the PG_TRY()/PG_CATCH() around the > CHECK_FOR_INTERRUPTS(). > In fact this kind of CFI-with-cleanup problem > has been haunting me across several projects. For cases that memory > contexts and resource owners can't help with, I don't currently know > what else to do here. Better ideas welcome. Like adding a Open/CloseSocket() in fd.c to control the leaks? > If I just let that > socket leak because I know this backend will soon exit, I'd expect a > knock at the door from the programming police. Hmm. It seems to me that you'd better have two patches instead of one here? First, one to introduce the new parameter to control the timeout, and a second to improve the responsiveness with a WaitLatch()? If the CFI proves to be an issue, it would be sad to have to revert the configuration part, which is worth on its own. > I don't actually know why we have > src/test/authentication/t/...{password,sasl,peer}..., but then > src/test/{kerberos,ldap,ssl}/t/001_auth.pl. For this one, I just > copied the second style, creating src/test/radius/t/001_auth.pl. I > can't explain why it should be like that, though. If I propose > another test for PAM, where should it go? My take would be to keep the number of directories in src/test/ to a minimum in the long run. Still, this is a case-by-case, as it depends on if a set of tests needs an expanded set of modules, configuration files and/or multiple scripts. ssl has its own set of configuration files with its module, so it makes sense to be independent. ldap has its LdapServer.pm with a configuration file, again I'm OK with a separate case. Kerberos has its own README, but IMO it could also be moved to src/test/authentication/ as it has a simple structure, with its requirements moved into a different README. What this patch set does for the RADIUS test is simple enough in structure that I would also add it in src/test/authentication/. That means less Make-fu and less Meson-fu. In 0001, PG_TEST_EXTRA requires radius for the test. This needs an update of regress.sgml where the values available are listed. I think that you'd better document that freeradius is required for the test in one of the README (either create a new one in radius/, or add this information to the one in authentication, as you feel). -- Michael
Вложения
Hi Thomas, Have you have a chance to look at and address the feedback given in this thread? -- Daniel Gustafsson