However, while that paper is in general very clear, I don't think it gives a very clear explanation of that particular formula, and it doesn't explain what it represents. It merely says that if "i" can be ignored "for some reason (e.g. i << Nr)", then that formula is an approximation to the exact "without replacement" formula, which is the subject of that paper.
But actually, that formula is the exact formula for the expected number of distinct values when selecting *with replacement* from a collection where each of the distinct values occurs an equal number of times. So I think we should say that.
Perhaps something along the lines of:
/* * Update the estimate based on the restriction selectivity. * * This uses the formula for the expected number of distinct values * when selecting with replacement from a collection where each of * the distinct values occurs an equal number of times (a uniform * distribution of values). This is a very close approximation to * the more correct method of selecting without replacement when * the number of distinct values is quite large --- for example, * see http://www.adellera.it/investigations/distinct_balls/. * Additionally, it also turns out to work very well even when the * number of distinct values is small. */
+1
Thank you for work on this patch. The formula you propose and explanation look great!