Discussion: regex cache
Folks,

I'm doing some analysis of PostgreSQL site traffic, and am being frequently hung up by the compile-time-fixed size of our regex cache (32 regexes, per MAX_CACHED_RES). Is there a reason why it would be hard to use work_mem or some other dynamically changeable limit for regex caching?

--
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco
Josh Berkus <josh@agliodbs.com> writes:
> I'm doing some analysis of PostgreSQL site traffic, and am being frequently
> hung up by the compile-time-fixed size of our regex cache (32 regexes, per
> MAX_CACHED_RES). Is there a reason why it would be hard to use work_mem
> or some other dynamically changeable limit for regex caching?

Hmmm ... Spencer's regex library makes a point of hiding its internal representation of a compiled regex from the calling code. So measuring the size of the regex cache in bytes would involve doing a lot of violence to that API. We could certainly allow the size of the cache measured in number-of-regexes to be controlled, though.

Having said that, I'm not sure it'd help your problem. If your query is using more than 32 regexes concurrently, it likely is using $BIGNUM regexes concurrently. How do we fix that?

			regards, tom lane
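For context, here is a minimal standalone sketch of the kind of cache being discussed: a fixed-size array of compiled regexes with a move-to-front policy and eviction of the least-recently-used entry. It uses POSIX <regex.h> as a stand-in for Spencer's library, and every name in it (re_cache, get_cached_re, and so on) is illustrative, not the actual regexp.c code.

    /*
     * Minimal sketch of a fixed-size compiled-regex cache with a
     * move-to-front policy, in the spirit of the MAX_CACHED_RES scheme
     * discussed above.  POSIX <regex.h> stands in for Spencer's library;
     * none of these names are the actual PostgreSQL code.
     */
    #include <regex.h>
    #include <stdlib.h>
    #include <string.h>

    #define MAX_CACHED_RES 32       /* compile-time cap, as discussed */

    typedef struct
    {
        char    *pattern;           /* pattern text, used as the cache key */
        regex_t  re;                /* compiled form */
    } cached_re;

    static cached_re re_cache[MAX_CACHED_RES];
    static int       num_cached = 0;

    /* Return a compiled regex for 'pattern', compiling and caching on a miss. */
    static regex_t *
    get_cached_re(const char *pattern)
    {
        int         i;

        for (i = 0; i < num_cached; i++)
        {
            if (strcmp(re_cache[i].pattern, pattern) == 0)
            {
                /* hit: move this entry to the front of the list */
                cached_re   hit = re_cache[i];

                memmove(&re_cache[1], &re_cache[0], i * sizeof(cached_re));
                re_cache[0] = hit;
                return &re_cache[0].re;
            }
        }

        /* miss: evict the least-recently-used entry if the cache is full */
        if (num_cached == MAX_CACHED_RES)
        {
            num_cached--;
            regfree(&re_cache[num_cached].re);
            free(re_cache[num_cached].pattern);
        }

        /* shift everything down one slot and compile into slot 0 */
        memmove(&re_cache[1], &re_cache[0], num_cached * sizeof(cached_re));
        re_cache[0].pattern = strdup(pattern);
        if (regcomp(&re_cache[0].re, pattern, REG_EXTENDED) != 0)
        {
            /* undo the shift and let the caller report the compile error */
            free(re_cache[0].pattern);
            memmove(&re_cache[0], &re_cache[1], num_cached * sizeof(cached_re));
            return NULL;
        }
        num_cached++;
        return &re_cache[0].re;
    }

The fixed cap is what bites in Josh's workload: once a query cycles through more distinct patterns than the cache holds, every lookup misses and recompiles.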
Tom,

> Having said that, I'm not sure it'd help your problem. If your query is
> using more than 32 regexes concurrently, it likely is using $BIGNUM
> regexes concurrently. How do we fix that?

Hmmm. I think there's a lot of ground between 32 and $BIGNUM. For example, where I'm hitting a wall is 300 regexes. Some quick testing on my Opteron test machine right now shows that the execution time difference between 20 regexes and 50 regexes is around 20x.

--
Josh Berkus
PostgreSQL @ Sun
San Francisco
Josh Berkus <josh@agliodbs.com> writes:
>> Having said that, I'm not sure it'd help your problem. If your query is
>> using more than 32 regexes concurrently, it likely is using $BIGNUM
>> regexes concurrently. How do we fix that?

> Hmmm. I think there's a lot of ground between 32 and $BIGNUM. For example,
> where I'm hitting a wall is 300 regexes. Some quick testing on my Opteron
> test machine right now shows that the execution time difference between
> 20 regexes and 50 regexes is around 20x.

Hmm. Well, I still don't want to tie it to work_mem; how do you feel about a new GUC to determine the max number of cached REs?

			regards, tom lane
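A hedged sketch of what that knob could look like, building on the toy cache above: the capacity becomes a run-time variable rather than a compile-time constant, and shrinking it frees the least-recently-used entries. The setting name regex_cache_size and the resize hook here are assumptions for illustration only; the real change would go through PostgreSQL's GUC machinery, which is not shown.

    /*
     * Sketch of a run-time cache capacity replacing the compile-time
     * MAX_CACHED_RES.  The setting name regex_cache_size and this resize
     * hook are assumptions for illustration, not an existing interface.
     */
    #include <regex.h>
    #include <stdlib.h>

    typedef struct
    {
        char    *pattern;
        regex_t  re;
    } cached_re;

    static int        regex_cache_size = 32;  /* hypothetical GUC-backed setting */
    static cached_re *re_cache = NULL;        /* allocated to regex_cache_size slots */
    static int        num_cached = 0;

    /*
     * Apply a new cache size: free least-recently-used entries that no
     * longer fit, then reallocate the array to the new capacity.
     */
    static void
    resize_re_cache(int new_size)
    {
        cached_re  *new_cache;

        if (new_size < 1)
            new_size = 1;

        while (num_cached > new_size)
        {
            num_cached--;
            regfree(&re_cache[num_cached].re);
            free(re_cache[num_cached].pattern);
        }

        new_cache = realloc(re_cache, new_size * sizeof(cached_re));
        if (new_cache != NULL)
        {
            re_cache = new_cache;
            regex_cache_size = new_size;
        }
    }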
Tom Lane wrote:
> Josh Berkus <josh@agliodbs.com> writes:
>>> Having said that, I'm not sure it'd help your problem. If your query is
>>> using more than 32 regexes concurrently, it likely is using $BIGNUM
>>> regexes concurrently. How do we fix that?
>
>> Hmmm. I think there's a lot of ground between 32 and $BIGNUM. For example,
>> where I'm hitting a wall is 300 regexes. Some quick testing on my Opteron
>> test machine right now shows that the execution time difference between
>> 20 regexes and 50 regexes is around 20x.
>
> Hmm. Well, I still don't want to tie it to work_mem; how do you feel
> about a new GUC to determine the max number of cached REs?

Yeah. You know me, I was just trying to avoid having more GUCs.

--Josh
Josh Berkus <josh@agliodbs.com> writes:
> Tom Lane wrote:
>> Hmm. Well, I still don't want to tie it to work_mem; how do you feel
>> about a new GUC to determine the max number of cached REs?

> Yeah. You know me, I was just trying to avoid having more GUCs.

I'm not excited about it either, but I think if we're going to make this adjustable it does need its own knob. I can easily believe that a large list of precompiled REs could be counterproductive given a workload where you don't get much reuse, so I don't want the list size going to the moon just because someone cranked up work_mem for other purposes.

(I'm not real sure that the "self-organizing list" data structure would work well beyond 1000 or so entries even if you did have enough re-use to justify them all. Anyone want to try to do some performance testing? In particular I think we might want to drop the move-to-front approach in favor of move-up-one, just to avoid O(N^2) memmove costs.)

			regards, tom lane
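To make the two promotion policies concrete, here is a small illustrative sketch of both, operating on a plain array of entries. Move-to-front shifts every entry ahead of the hit, so the cost of a hit grows with its position in the list; move-up-one (the classic "transpose" heuristic) swaps the hit with its predecessor and is O(1) per hit, at the price of promoting hot entries more slowly. All names are made up for the example.

    /*
     * Illustrative comparison of the two promotion policies.  Move-to-front
     * shifts every entry ahead of the hit down one slot, so the cost of a
     * hit grows with its position in the list; move-up-one just swaps the
     * hit with its predecessor.  Names are made up for this sketch.
     */
    #include <string.h>

    typedef struct
    {
        int         id;             /* stand-in for a cached, compiled regex */
    } entry;

    /* Promote entries[i] with move-to-front: O(i) entries shuffled per hit. */
    static void
    promote_move_to_front(entry *entries, int i)
    {
        entry       hit = entries[i];

        memmove(&entries[1], &entries[0], i * sizeof(entry));
        entries[0] = hit;
    }

    /* Promote entries[i] with move-up-one ("transpose"): one swap per hit. */
    static void
    promote_move_up_one(entry *entries, int i)
    {
        if (i > 0)
        {
            entry       tmp = entries[i - 1];

            entries[i - 1] = entries[i];
            entries[i] = tmp;
        }
    }

With a list of a thousand entries and a workload that keeps hitting near the tail, the move-to-front version moves a large slice of the array on every hit, which is where the O(N^2) memmove cost mentioned above comes from; the transpose version avoids that by promoting entries one slot at a time.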
Tom,

> I'm not excited about it either, but I think if we're going to make
> this adjustable it does need its own knob. I can easily believe
> that a large list of precompiled REs could be counterproductive
> given a workload where you don't get much reuse, so I don't want
> the list size going to the moon just because someone cranked up
> work_mem for other purposes.

Yes. I was just trying to avoid thinking about it. ;-)

> (I'm not real sure that the "self-organizing list" data structure
> would work well beyond 1000 or so entries even if you did have
> enough re-use to justify them all. Anyone want to try to do some
> performance testing? In particular I think we might want to drop
> the move-to-front approach in favor of move-up-one, just to avoid
> O(N^2) memmove costs.)

Hmmm. Yeah, I can see that. Well, I have a test case here (the PostgreSQL download logs), or I wouldn't have brought up the issue. I just need to find a way to multi-thread it so I can get the effect of multiple clients.

--Josh