Re: [HACKERS] Floating point comparison inconsistencies of thegeometric types

Поиск
Список
Период
Сортировка
От Kyotaro HORIGUCHI
Тема Re: [HACKERS] Floating point comparison inconsistencies of thegeometric types
Дата
Msg-id 20170126.181230.142084427.horiguchi.kyotaro@lab.ntt.co.jp
обсуждение исходный текст
Ответ на Re: [HACKERS] Floating point comparison inconsistencies of thegeometric types  (Emre Hasegeli <emre@hasegeli.com>)
Ответы Re: [HACKERS] Floating point comparison inconsistencies of thegeometric types  (Emre Hasegeli <emre@hasegeli.com>)
Список pgsql-hackers
Hello, Emre.

At Wed, 25 Jan 2017 12:24:18 +0100, Emre Hasegeli <emre@hasegeli.com> wrote in
<CAE2gYzzLWCdYDPLi_UQXLSJ+pqc7g4WGNHsb8tHkmosKK+rO7g@mail.gmail.com>
> I am responding both of your emails together.
> 
> > Perhaps I don't understand it. Many opclass are defined for
> > btree. But since (ordinary) btree handles only one-dimentional,
> > totally-orderable sets, geometric types even point don't fit
> > it. Even if we relaxed it by removing EPSILON, multidimentional
> > data still doesn't fit.
> 
> Yes, multidimentional data doesn't fit into btree.  Let's say we would
> have to introduce operators called *<, *<=, *>, *>= to support btree
> opclass for point.  I agree those operators would be useless, but
> btree opclass is used for other purposes too.  "ORDER BY point",
> merge-joins, btree index support for ~= would be useful.

Btree cannot be used unless any kind of total-orderedness is
defined but hash index would be usable so.

> Lack of ORDER BY, GROUP BY, and DISTINCT support is the major
> inconvenience about the geometric types.  There are many complaints
> about this on the mailing list.  Many frameworks and ORMs are not
> prepared to deal with getting an error when they use certain types on
> those clauses.

Even though I'm not sure but I don't see a "natural" (or
agreeable by many poeple) ordering of geometric types in
general. Anyway it's quite application (not application program
but the relationship with the real world) specific.

> >> - Only some operators are using the epsilon.
> >>
> >> I included the list of the ones which doesn't use the epsilon on
> >> initial post of this thread.  This makes the operators not compatible
> >> with each other.  For example, you cannot say "if box_a @> box_b and
> >> box_b @> point then box_a @> box_b".
> >
> > (It must be "then box_a @> point".)
> 
> Yes, I am sorry.
> 
> >  Both of the operators don't
> > seem to use EPSILON so the transitivity holds, but perhaps we
> > should change them to more appropriate shape, that is, to use
> > EPSILON. Even having the tolerance, we can define the operators
> > to keep the transitivity.
> 
> Yes, that would be another way.  In my view, this would make things
> worse, because I believe epsilon approach is completely broken.  The
> operators which are not using the epsilon are the only ones people can
> sanely use to develop applications.

What we should not forget is that PostGIS does the same thing and
it is widly used (I believe..). This means it not broken at least
on a certain context. But it is a fact that it also quite
inconvenient to get performance from, say, indexes.

> >> > - The application of epsilon is implementation dependent.
> >> >
> >> > Epsilon works between two numbers.  There is not always a single way
> >> > of implementing geometric operators.  This is surprising to the users.
> >> > For example, you cannot say "if box_a @> box_b then box_a <-> box_b <=
> >> > epsilon".
> >>
> >> Maybe it should be like "if box_a ~= box_b && box_b @< box_a then
> >> ..". I'm not sure off-hand. But I don't see the relationship is
> >> significant.
> 
> What I meant was the behaviour being unclear to even people who knows
> about the epsilon.  If two boxes are overlapping with each other, one
> who knows about EPSILON can expect the distance between them to be
> less than EPSILON, but this wouldn't be true.

Yeah, the EPSILON is mere a fuzz factor so we cannot expect any
kind of stable behavior for differences under the level. But on
the analogy of comparisons of floating point numbers, I suppose
that inequality comparison could be done without the tolerance.

> >> - Some operators are violating commutative property.
> >>
> >> For example, you cannot say "if line_a ?|| line_b then line_b ?|| line_a".
> >
> > Whether EPSILON is introduced or not, commutativity cannot be
> > assured for it from calculation error, I suppose.
> 
> It can easily be assured by treating both sides of the operator the
> same.  It is actually assured on my patch.

It surely holds for certain cases. Even how many applicable cases
we guess, finally we cannot proof that it works generally. Just
three times of 1/3 rotation breakes it.

> >> - Some operators are violating transitive property.
> >>
> >> For example, you cannot say "if point_a ~= point_b and point_b ~=
> >> point_c then point_a ~= point_c".
> >
> > It holds only in ideal (or mathematical) world, or for similars
> > of integer (which are strictly implemented mathematica world on
> > computer).  Without the EPSILON, two points hardly match by
> > floating point error, as I mentioned. I don't have an evidence
> > though (sorry), mere equality among three floating point numbers
> > is not assured.
> 
> Of course, it is assured.  Floating point numbers are deterministic.

Hmm, I have nothing more to say if you don't agree that floating
point numbers involving any kind of arithmetic is hardly
deterministic especially not defining its usage.

> >> - The operators with epsilon are not compatible with float operators.
> >>
> >> This is also surprising for the users.  For example, you cannot say
> >> "if point_a << point_b then point_a[0] < point_b[0]".
> >
> > It is because "is strictly left of" just doesn't mean their
> > x-coordinates are in such a relationship. The difference might be
> > surprising but is a specification (truely not a bug:).
> 
> Then what does that mean?  Every operator with EPSILON is currently
> surprising in different way.  People use databases to build
> applications.  They often need to implement same operations on the
> application side.  It is very hard to be bug-compatible with our
> geometric types.

The world of the geometric types in PostgreSQL *is* built
so. There is nothing different with that Windows client can make
different results from PostgreSQL on a Linux server for a simple
floating point arithmetics, or even different binaries made by
different version of compilers on the same platoform can. Relying
on such coherency by accident is a bad policy.

> >> = Missing Bits on the Operator Classes
> 
> > This final section seems to mention the application but sorry, it
> > still don't describe for me what application that this patch
> > aiming to enable. But I can understand that you want to handle
> > floating point numbers like integers. The epsilon is surely quite
> > annoying for the purpose.
> 
> I will try to itemise what applications I am aiming to enable:
> 
> - Applications with very little GIS requirement
> 
> PostGIS can be too much work for an application in s different
> business domain but has a little requirement to GIS.  The built-in
> geometric types are incomplete to support real world geography.
> Getting rid of epsilon would make this worse. In contrary, it would
> allow people to deal with the difficulties on their own.
> 
> - Applications with geometric objects
> 
> I believe people could make use the geometric types in many different
> business areas, if they would be in better shape.  I am working for a
> gaming company.  There is great overlap between the logic of many
> games and the geometric types.  I could support using them more, if
> they wouldn't be so buggy and impractical.
> 
> - Demonstrating indexes
> 
> We require indexing features to be committed with an example.
> Geometric types have served this purpose many times.  Many operators
> are actually added to demonstrate indexes.  These examples are also
> used by other projects like PostGIS.  They can re-use part of the
> implementations, or the implementation as a starting point.  I believe
> we should make those examples as clean as possible.
> 
> > Ok, returning to the basis of the current geometric types, it is
> > described as below. It is not perfect (so needs amendment) but
> > achived to a certain degree.
> >
> > 1. For geometric types, equality needs to be defined with some
> >    tolerance. Concretely an absolute value of 1E-06, which is
> >    quite random but would be suitable to ignore error of
> >    longitude and latitude.
> 
> I don't agree this requirement as it is put.  Returning true to
> something like "point(0.000001, 0.000001) ~= point(0.000002,
> 0.000002)" is obviously wrong.

No. It can be right. For example, it is right for an application
where the precision of data is 0.1 or 1E-6. But it is wrong if
the tolerance designed for its purpose is 1E-7.

> > 2. Translative comparisons should consider the tolerance to make
> >    a result that does not contradict the equality.
> 
> The current implementation fails to provide it:
> 
> > hasegeli=# select point(0.000001, 0.000001) ~= point(0.000002, 0.000002);
> > ?column?
> > ----------
> > t
> > (1 row)
> >
> > hasegeli=# select point(0.000001, 0.000001) << point(0.0000021, 0.0000021);
> > ?column?
> > ----------
> > t
> > (1 row)
> >
> > hasegeli=# select point(0.000002, 0.000002) << point(0.0000021, 0.0000021);
> > ?column?
> > ----------
> > f
> > (1 row)

So this example without any concrete application is
irrelevant.

> > 3. It is not sure about other comparsons such like parallel and
> >    perpendicular, but it is hardly usable without any amount of
> >    tolerance.
> 
> I agree.  I don't think they are useable with our tolerance either.
> 
> > On the other hand, some people complains about the following.
> 
> I believe have listed enough complaints on my previous post.
> 
> > As a presupposition, we have the common knowlegde, or principle,
> > about floating point handling on a digital computer.
> >
> > X. We shouldn't detect equalty of two floating point numbers just
> >    using '=' for real usage. It hardly works for arbitrary
> >    numbers including results of any arithmetics, so it should
> >    incorporate any amount of tolerance that differs among
> >    applications.
> >
> > Sorry for some refrains from the past discussion, but the above X
> > and 1-3 are the immovable basis in this area. (I belive it . But
> > any objection are welcome)
> >
> > You might want to object that float has an equality operator, but
> > it is because it is an atomic type, which has no specific
> > purpose. On the contrary geometric types are application types,
> > which is assumed specific purpose and it should behave
> > reasonablly against the expected purpose.
> 
> Are they really?  We have really simple geometric types which are not

PostGIS or libgeos seems to prove it. They are designed exactly
for this purpose and actually used.

> for any specific kind of application.  We are not claiming that they
> are for earth scale or any other scale.  I think we should aim of
> making them predictable, extensible, as general purpose as possible to
> let people use them for their purpose.

If settable tolerance, and preferablly, with definable method
were introduced, it could be said that generic geometric types.


> > That being said, if we want to, for example, use btree index for
> > each coordinate of geometric types, it is doable by normalizing
> > the values before applying operators, instead of applying epsilon
> > arbitrarily in operators.
> >
> > Normalization may be done both on storing or on-the-fly just
> > befor usage. It depends on whether we want to preserve precision
> > of stored values. If we want, normalization should be done just
> > before comparisons, or on creating indexes.
> >
> > I don't know much about the normalization method for floating
> > point numbers but libgeos might give some hints.
> 
> Me neither.  I have been told that storing the minimum and and maximum
> possible value is a good strategy for tolerance.  Then all operators
> can return true, false, or maybe.  This would be a much bigger and
> incompatible change that I don't think I can implement.
> 
> We cannot come into an agreement.  Geometric types are not well
> maintained.  I am trying my best to put them into a shape.  Please
> help me do that.  Let's try to find a way to leave them better than we
> found.

Although some operators have apparently wrong implement and
fixing it would be a progress, it is not a progress toward your
objective if I don't agree to just remove the tolerance.

# Sorry for the long detour..

So, the union of the two requirements seems to be having such
parameter as a GUC.

Geometric types don't work without any amount of tolerance for
many cases, and surely some people needs the EPSILON but the
amout is quite arbitrary. On the other hand, some other people
want no-tolerance. This seems enough reason to make it variable.


If there's no objection for the above, how about proceeding by
the following steps?

1: Replace fixed EPSILON with a guc parameter 'geometric_tolerance' or such and its default value is 1E-6. This also
shouldhave documentation.
 

2: Fix some currently broken operators to give sane results, considring geometric_tolerance.

Once we get to this stage, setting geometric_tolerance to 0 would
give the preferable behavior for you, and it is mere a concrete
value for a specific application. No problem.

After that, perhaps we may add some indexing method for geometric
types not considering tolerance. Then someone may want make them
support geometric_tolerance.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





В списке pgsql-hackers по дате отправления:

Предыдущее
От: vinayak
Дата:
Сообщение: Re: [HACKERS] Transactions involving multiple postgres foreignservers
Следующее
От: Dilip Kumar
Дата:
Сообщение: Re: [HACKERS] multivariate statistics (v19)