Discussion: Microsecond sleeps with select()
A comment on microsecond delays using select(). Most Unix kernels run at 100 Hz, meaning that they have a programmable timer that interrupts the CPU every 10 milliseconds. The kernel only gets control of the CPU during those tick interrupts or when a user application makes a kernel call. Therefore, it is no surprise that most Unix kernels can't do 5-microsecond sleeps. The only way they could do it would be to reprogram the timer interrupt if a wakeup was going to occur in less than 10 milliseconds. I doubt many kernels do that, because I don't think timer interrupt programming is a very quick or accurate operation. Also, reprogramming it would make the actual 100 Hz timer unreliable.

Now, kernels could check on return from kernel to user code to see if someone is ready to be woken up, but I doubt they do that either. Looking at the BSDI kernel, all timeouts are expressed in ticks, which are 10 milliseconds. Obviously there is no checking during kernel call returns, because they don't even store the sleeps with enough resolution to perform a check. In fact, the kernel doesn't even have a way to measure microsecond timings.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Bruce Momjian wrote:
> In fact, the kernel doesn't even have a way
> to measure microsecond timings.

Linux has patches available to do microsecond timings, but they're nonportable, of course.

-- 
Lamar Owen
WGCR Internet Radio
1 Peter 4:11
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> A comment on microsecond delays using select().  Most Unix kernels run
> at 100 Hz, meaning that they have a programmable timer that interrupts
> the CPU every 10 milliseconds.

Right --- this probably also explains my observation that some kernels seem to add an extra 10 msec to the requested sleep time. Actually they're interpreting a one-clock-tick select() delay as "wait till the next clock tick, plus one tick". The actual delay will be between one and two ticks depending on just when you went to sleep.

I have been thinking some more about the s_lock() delay loop in connection with this. We currently have

	/*
	 * Each time we busy spin we select the next element of this array as the
	 * number of microseconds to wait. This accomplishes pseudo random back-off.
	 * Values are not critical but 10 milliseconds is a common platform
	 * granularity.
	 *
	 * Total time to cycle through all 20 entries might be about .07 sec,
	 * so the given value of S_MAX_BUSY results in timeout after ~70 sec.
	 */
	#define S_NSPINCYCLE	20
	#define S_MAX_BUSY	1000 * S_NSPINCYCLE

	int s_spincycle[S_NSPINCYCLE] =
	{	0, 0, 0, 0, 10000, 0, 0, 0, 10000, 0,
		0, 10000, 0, 0, 10000, 0, 10000, 0, 10000, 10000
	};

Having read the select(2) man page more closely, I now realize that it is *defined* not to yield the processor when the requested delay is zero: it just checks the file ready status and returns immediately. Therefore, on a single-CPU machine, the zero entries in this array are a gold-plated, guaranteed waste of time: we'll cycle through the kernel call, go back and retest the spinlock, and find it still locked.

On a multi-CPU machine, the time wasted in the kernel call might possibly be enough to allow the lock holder (if running on another CPU) to finish its work and release the lock. But it's more likely that we're just wasting CPU. On either single or multi CPUs, there is no "pseudo random backoff" behavior here whatsoever.
Spinning through the zero entries will take some small number of microseconds, and then we hit the one-clock-tick select() delay. In reality, every backend waiting for a spinlock will be awoken on every clock tick. If your kernel is one of the ones that interprets a one-tick delay request as "rest of the current tick plus a tick", then all the delays are actually two ticks; in that case, on average half the population of waiting backends will be awoken on each alternate tick. A little better, but not much.

In short: s_spincycle in its current form does not do anything anywhere near what the author thought it would. It's wasted complexity.

I am thinking about simplifying s_lock_sleep down to simple wait-one-tick-on-every-call logic. An alternative is to keep s_spincycle, but populate it with, say, 10000, 20000 and larger entries, which would offer some hope of actual random-backoff behavior. Either change would clearly be a win on single-CPU machines, and I doubt it would hurt on multi-CPU machines.

Comments?

			regards, tom lane
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > A comment on microsecond delays using select().  Most Unix kernels run
> > at 100 Hz, meaning that they have a programmable timer that interrupts
> > the CPU every 10 milliseconds.
>
> Right --- this probably also explains my observation that some kernels
> seem to add an extra 10 msec to the requested sleep time.  Actually
> they're interpreting a one-clock-tick select() delay as "wait till
> the next clock tick, plus one tick".  The actual delay will be between
> one and two ticks depending on just when you went to sleep.

The BSDI code would be pselect():

	/*
	 * If poll wait was tiny, this could be zero; we will
	 * have to round it up to avoid sleeping forever.  If
	 * we retry below, the timercmp above will get us out.
	 * Note that if wait was 0, the timercmp will prevent
	 * us from getting here the first time.
	 */
	timo = hzto(&atv);
	if (timo == 0)
		timo = 1;

-- Bruce Momjian <pgman@candle.pha.pa.us>
> I have been thinking some more about the s_lock() delay loop in
> connection with this.  We currently have
>
> 	/*
> 	 * Each time we busy spin we select the next element of this array as the
> 	 * number of microseconds to wait. This accomplishes pseudo random back-off.
> 	 * Values are not critical but 10 milliseconds is a common platform
> 	 * granularity.
> 	 *
> 	 * Total time to cycle through all 20 entries might be about .07 sec,
> 	 * so the given value of S_MAX_BUSY results in timeout after ~70 sec.
> 	 */
> 	#define S_NSPINCYCLE	20
> 	#define S_MAX_BUSY	1000 * S_NSPINCYCLE
>
> 	int s_spincycle[S_NSPINCYCLE] =
> 	{	0, 0, 0, 0, 10000, 0, 0, 0, 10000, 0,
> 		0, 10000, 0, 0, 10000, 0, 10000, 0, 10000, 10000
> 	};
>
> Having read the select(2) man page more closely, I now realize that
> it is *defined* not to yield the processor when the requested delay
> is zero: it just checks the file ready status and returns immediately.

Actually, a kernel call is something. On kernel call return, process priorities are checked, and the CPU may be yielded to a higher-priority backend that perhaps just had its I/O completed.

I think the 0 and 10000 are correct. They would be zero ticks and one tick. You think 5000 and 10000 would be better? I can see that.

-- Bruce Momjian <pgman@candle.pha.pa.us>
Bruce Momjian <pgman@candle.pha.pa.us> writes:
>> Having read the select(2) man page more closely, I now realize that
>> it is *defined* not to yield the processor when the requested delay
>> is zero: it just checks the file ready status and returns immediately.

> Actually, a kernel call is something.  On kernel call return, process
> priorities are checked and the CPU may be yielded to a higher-priority
> backend that perhaps just had its I/O completed.

So *if* some I/O just completed, the call *might* do what we need, which is yield the CPU. Otherwise we're just wasting cycles, and will continue to waste them until we do a select with a nonzero delay. I propose we cut out the spinning and just do a nonzero delay immediately.

> I think the 0 and 10000 are correct.  They would be zero ticks and one
> tick.  You think 5000 and 10000 would be better?  I can see that.

No, I am not suggesting that, because there is no difference between 5000 and 10000.

All of this stuff probably ought to be replaced with a less-bogus mechanism (POSIX semaphores maybe?), but not in late beta.

			regards, tom lane
> So *if* some I/O just completed, the call *might* do what we need,
> which is yield the CPU.  Otherwise we're just wasting cycles, and
> will continue to waste them until we do a select with a nonzero
> delay.  I propose we cut out the spinning and just do a nonzero delay
> immediately.

Well, any backend with a higher priority would be run in preference to the current process. The question is how that would happen. I will say that because of CPU cache issues, the system tries _not_ to change processes if the current one still needs the CPU, so the zero may be bogus.

> > I think the 0 and 10000 are correct.  They would be zero ticks and one
> > tick.  You think 5000 and 10000 would be better?  I can see that.
>
> No, I am not suggesting that, because there is no difference between
> 5000 and 10000.
>
> All of this stuff probably ought to be replaced with a less-bogus
> mechanism (POSIX semaphores maybe?), but not in late beta.

Good question. We have sched_yield(), but that is a threads function, or at least exists only in the pthreads library.

-- Bruce Momjian <pgman@candle.pha.pa.us>
On Sat, Feb 17, 2001 at 12:26:31PM -0500, Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > A comment on microsecond delays using select().  Most Unix kernels run
> > at 100hz, meaning that they have a programmable timer that interrupts
> > the CPU every 10 milliseconds.
>
> Right --- this probably also explains my observation that some kernels
> seem to add an extra 10msec to the requested sleep time.  Actually
> they're interpreting a one-clock-tick select() delay as "wait till
> the next clock tick, plus one tick".  The actual delay will be between
> one and two ticks depending on just when you went to sleep.
> ...
> In short: s_spincycle in its current form does not do anything anywhere
> near what the author thought it would.  It's wasted complexity.
>
> I am thinking about simplifying s_lock_sleep down to simple
> wait-one-tick-on-every-call logic.  An alternative is to keep
> s_spincycle, but populate it with, say, 10000, 20000 and larger entries,
> which would offer some hope of actual random-backoff behavior.
> Either change would clearly be a win on single-CPU machines, and I doubt
> it would hurt on multi-CPU machines.
>
> Comments?

I don't believe that most kernels schedule only on clock ticks. They schedule on a clock tick *or* whenever a process yields, which on a loaded system may be much more frequent. The question is whether, when scheduling, the kernel considers processes that have requested to sleep less than a clock tick as "ready" once their actual requested time expires. On V7 Unix, the answer was no, because the kernel had no way to measure any time shorter than a tick, so it rounded all sleeps up to "the next tick".

Certainly there are machines and kernels that count time more precisely (isn't PG ported to QNX?). We do users of such kernels no favors by pretending they only count clock ticks. Furthermore, a 1 ms clock tick is pretty common, e.g. on Alpha boxes.
A 10 ms initial delay is ten clock ticks there, far longer than seems appropriate. This argues for yielding the minimum discernable amount of time (1 us) and then backing off to a less-minimal time (1 ms).

On systems that chug at 10 ms, this is equivalent to a sleep of up-to-10 ms (i.e., until the next tick) followed by a sequence of 10 ms sleeps; on dumb-OS Alphas, it's equivalent to a sequence of 1 ms sleeps; and on a smart OS on an Alpha it's equivalent to a short, variable time (long enough for other runnable processes to run and yield) followed by a sequence of 1 ms sleeps. (Some of the numbers above are doubled on really dumb kernels, as Tom noted.)

Nathan Myers
ncm@zembu.com
ncm@zembu.com (Nathan Myers) writes:
> Certainly there are machines and kernels that count time more precisely
> (isn't PG ported to QNX?).  We do users of such kernels no favors by
> pretending they only count clock ticks.  Furthermore, a 1ms clock
> tick is pretty common, e.g. on Alpha boxes.

Okay, I didn't know there were any popular systems that did that.

> This argues for yielding the minimum discernable amount of time (1us)
> and then backing off to a less-minimal time (1ms).

Fair enough. As you say, it's the same result on machines with coarse time resolution, and it should help on smarter boxes. The main thing is that I want to change the zero entries in s_spincycle, which clearly aren't doing what the author intended.

			regards, tom lane