Re: tsvector pg_stats seems quite a bit off.
| From | Jan Urbański |
|---|---|
| Subject | Re: tsvector pg_stats seems quite a bit off. |
| Date | |
| Msg-id | 1275230500.1541.4.camel@Nokia-N900-42-11 |
| In reply to | Re: tsvector pg_stats seems quite a bit off. (Tom Lane <tgl@sss.pgh.pa.us>) |
| Responses | Re: tsvector pg_stats seems quite a bit off. |
| List | pgsql-hackers |
> Jesper Krogh <jesper@krogh.cc> writes:
> > On 2010-05-29 15:56, Jan Urbański wrote:
> > > AFAIK statistics for everything other than tsvectors are built based
> > > on the values of whole rows.
>
> > Wouldn't it make sense to treat array types like the tsvectors?

Yeah, I have a personal TODO item to look into that in the future. There were plans to generalise the functions in ts_typanalyze and use LC (Lossy Counting) for array types as well. If one day I find myself with a lot of free time, I'll take a stab at that.

> > > The results are attached in a text (CSV) file, to preserve
> > > formatting. Based on them I'd like to propose top_stopwords and
> > > error_factor to be 100.
>
> > I know it is not perceived the correct way to do things, but I would
> > really like to keep the "stop words" in the dataset and have
> > something that is robust to that.
>
> Any stop words would already have been eliminated in the transformation
> to tsvector (or not, if none were configured in the dictionary setup).
> We should not assume that there are any in what ts_typanalyze is seeing.

Yes, and as a side note, if you want to index stopwords, just don't pass a stopword file when creating the text search dictionary (or pass a custom one).

> I think the only relevance of stopwords to the current problem is that
> *if* stopwords have been removed, we would see a Zipfian distribution
> with the first few entries removed, and I'm not sure if it's still
> really Zipfian afterwards. However, we only need the assumption of
> Zipfianness to compute a target frequency cutoff, so it's not like
> things will be completely broken if the distribution isn't quite
> Zipfian.

That's why I was proposing to take s = 0.07 / (MCE-count + 10). But that probably doesn't matter much.

Jan
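For readers following the arithmetic, here is a minimal Python sketch of the cutoff s = 0.07 / (MCE-count + 10) mentioned in the message. It is an illustration only, not the ts_typanalyze code. The function names `frequency_cutoff` and `zipf_scale` and the parameter `skipped_ranks` are hypothetical. The reading that 0.07 approximates 1/H(W) for a vocabulary of about one million words, and that the +10 stands in for top-ranked words missing from the input, is an assumption and is not stated in this message.

```python
import math

# Euler-Mascheroni constant, used in the approximation H(W) ~ ln(W) + gamma.
EULER_GAMMA = 0.5772156649015329


def zipf_scale(vocabulary_size: int) -> float:
    """Approximate 1/H(W), the frequency of the rank-1 word under Zipf's law
    with exponent 1. ASSUMPTION: this is one way to arrive at a constant
    close to 0.07; the message itself does not derive it."""
    if vocabulary_size < 1:
        raise ValueError("vocabulary_size must be at least 1")
    return 1.0 / (math.log(vocabulary_size) + EULER_GAMMA)


def frequency_cutoff(mce_count: int, skipped_ranks: int = 10,
                     scale: float = 0.07) -> float:
    """Cutoff s = scale / (mce_count + skipped_ranks).

    mce_count     -- target number of most-common-element entries
    skipped_ranks -- hypothetical name for the "+ 10" in the message
    scale         -- the 0.07 constant from the message
    """
    if mce_count < 0 or skipped_ranks < 0:
        raise ValueError("counts must be non-negative")
    if mce_count + skipped_ranks == 0:
        raise ValueError("mce_count + skipped_ranks must be positive")
    return scale / (mce_count + skipped_ranks)


if __name__ == "__main__":
    # Compare the approximated scale for a one-million-word vocabulary
    # with the constant used in the message.
    print("1/H(10^6) ~", zipf_scale(10**6))
    # Show how the cutoff shrinks as the target entry count grows.
    for k in (10, 100, 1000):
        print("MCE-count =", k, "-> s =", frequency_cutoff(k))
```

Because s depends on the target entry count only through the sum MCE-count + 10, the offset matters less and less as the target grows, which fits the remark that the choice probably doesn't matter much.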