Re: Accounting for between table correlation
From | Adrian Klaver
---|---
Subject | Re: Accounting for between table correlation
Date |
Msg-id | 65a6db71-9182-1f0a-02c7-39cd011b34dd@aklaver.com
In response to | Re: Accounting for between table correlation (Alexander Stoddard <alexander.stoddard@gmail.com>)
List | pgsql-general
On 1/15/21 9:21 AM, Alexander Stoddard wrote:

> Unfortunately I'm not free to share the specific schema or the query plans. They derive from an upstream vendor that is 'protective' of their data model. To get to a proper example I'll need to recreate the behavior with generic data in a generified schema.
>
> I apologize for being frustratingly vague. I do feel like an idiot for not at least saying this was with version PG 11.10. It has been beneficial to me that the replies so far appear to validate my understanding that no version of Postgres has cross-table correlation statistics in the planner.
>
> ANALYZE is done frequently, and I think at sufficient sample size. The context is a large data analysis setting, and the data is changed via bulk ETL, not OLTP. The effect of analyzing is just instability: the plan can flip in either direction (without the underlying data changing) between giving an answer in minutes and timing out after tens of hours. That could be indicative of too small a sample, but I think in this case it is more that the statistics aren't useful, because it is cross-table correlation that can't be accounted for.

So to be clear: the process imports the data, then you run a query and it completes in x time; you then ANALYZE the same data and it runs in y time. Is that correct?

> The 'fast plans' use parallel seq scans. The 'slow plans' use index scans. It appears a good query plan correctly predicts it should be bulk processing the tables, but bad ones get fooled into thrashing (hard disk, not SSD) by mispredicting too few rows to join between the tables.
>
> If anyone has similar experiences and is generous enough with their time to share possible solutions/workarounds then I'm most grateful. If my description is too vague to be worth answering then I quite understand and apologize for the time wasted in reading.
>
> Thank you.
>
> On Fri, Jan 15, 2021 at 10:26 AM David G. Johnston <david.g.johnston@gmail.com> wrote:
>
>> On Fri, Jan 15, 2021 at 9:10 AM Thomas Kellerer <shammat@gmx.net> wrote:
>>
>>> Atul Kumar wrote on 15.01.2021 at 16:29:
>>>
>>>> As per Ron, you are not supposed to ask your questions here. According to him, we should keep on doing research on the internet rather than asking for support directly, even when you have done enough research, and unless "Ron" is satisfied you have to keep on researching.
>>>
>>> Ron's question was perfectly valid. Missing and wrong statistics are one reason for the planner to choose a bad execution plan.
>>
>> Yeah, at first blush I didn't think analyze really mattered (and it mostly doesn't, because while you can keep the statistics up-to-date, the multi-table nature of the problem means they are only marginally helpful here), but that just points out the under-specified nature of the original posting. Taken as a simple question of "is there a way to work around the lack of multi-table statistics", the analyze, and even the specific queries, don't matter all that much. But it also would be much more useful if the OP would choose a single problematic query and show the schema, query, and explain results, hopefully both good and bad, and comment on how analyze seems to affect the plan choice. But for the general question about overcoming our statistics limitations, the analyze point is not relevant.
>>
>> David J.

--
Adrian Klaver
adrian.klaver@aklaver.com
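[Editor's sketch, not from the thread.] One commonly suggested workaround for the planner's lack of cross-table correlation statistics is to pre-join the correlated tables into a materialized view, so that ANALYZE gathers statistics on the post-join row distribution directly. The table and column names below are hypothetical stand-ins for the vendor schema the OP could not share:

```sql
-- Hypothetical tables orders/customers; names are illustrative only.
-- Pre-join the correlated tables so ANALYZE sees the joint distribution.
CREATE MATERIALIZED VIEW orders_enriched AS
SELECT o.*, c.region, c.segment
FROM orders o
JOIN customers c ON c.id = o.customer_id;

-- Statistics gathered here reflect actual post-join row counts, so the
-- planner no longer has to guess the join selectivity.
ANALYZE orders_enriched;

-- Extended statistics (PostgreSQL 10+) can additionally capture
-- correlation between columns within one table -- though, as noted in
-- the thread, still not across separate tables.
CREATE STATISTICS orders_region_segment (dependencies)
  ON region, segment FROM orders_enriched;
ANALYZE orders_enriched;
```

The trade-off is that the materialized view must be refreshed after each bulk ETL load, which may be acceptable in an analytics setting where the base tables only change in bulk anyway.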