Re: Does people favor to have matrix data type?

From Kouhei Kaigai
Subject Re: Does people favor to have matrix data type?
Msg-id 9A28C8860F777E439AA12E8AEA7694F8011F554A@BPXM15GP.gisp.nec.co.jp
In response to Re: Does people favor to have matrix data type?  (Simon Riggs <simon@2ndQuadrant.com>)
Responses Re: Does people favor to have matrix data type?  ("ktm@rice.edu" <ktm@rice.edu>)
List pgsql-hackers
> -----Original Message-----
> From: Simon Riggs [mailto:simon@2ndQuadrant.com]
> Sent: Wednesday, May 25, 2016 4:39 PM
> To: Kaigai Kouhei(海外 浩平)
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Does people favor to have matrix data type?
> 
> On 25 May 2016 at 03:52, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> 
> 
>     In a few days, I'm working for a data type that represents matrix in
>     mathematical area. Does people favor to have this data type in the core,
>     not only my extension?
> 
> 
> If we understood the use case, it might help understand whether to include it or not.
> 
> Multi-dimensionality of arrays isn't always useful, so this could be good.
>
As you may expect, I've been working on a matrix data type as part of the
groundwork for GPU acceleration, though its use is not limited to that.

What I am trying to do is run some analytic algorithms inside the database,
rather than exporting the entire dataset to the client side.
My first target is k-means clustering, which is often used in data mining.
When we categorize N items that have M attributes into k clusters, the master
data can be represented as an NxM matrix; that is equivalent to N vectors in
M dimensions.
The cluster centroids are also located in the same M-dimensional space, so they
can be represented as a kxM matrix; that is equivalent to k vectors in
M dimensions.
The k-means algorithm needs to calculate the distance from each item to every
cluster centroid, so it produces an Nxk matrix, usually called the distance
matrix. Next, it updates the cluster centroids using the distance matrix, then
repeats the entire process until convergence.
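
To make the steps above concrete, here is a minimal sketch in plain Python.
It is not the SQL + R script linked in this mail and not PG-Strom code; the
function names (distance_matrix, update_centroids, kmeans), the Euclidean
metric, and the convergence tolerance are assumptions chosen for illustration.

```python
import math


def distance_matrix(items, centroids):
    """Return the N x k matrix of Euclidean distances (items x centroids)."""
    return [[math.dist(x, c) for c in centroids] for x in items]


def update_centroids(items, dist, centroids):
    """Move each centroid to the mean of the items nearest to it."""
    k = len(centroids)
    m = len(items[0])
    sums = [[0.0] * m for _ in range(k)]
    counts = [0] * k
    for x, row in zip(items, dist):
        j = row.index(min(row))  # index of the nearest centroid for this item
        counts[j] += 1
        for d in range(m):
            sums[j][d] += x[d]
    # A centroid with no assigned items keeps its previous position.
    return [
        [s / counts[j] for s in sums[j]] if counts[j] else list(centroids[j])
        for j in range(k)
    ]


def kmeans(items, centroids, max_iter=100, tol=1e-9):
    """Repeat distance-matrix and centroid-update steps until convergence."""
    for _ in range(max_iter):
        dist = distance_matrix(items, centroids)
        new_centroids = update_centroids(items, dist, centroids)
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:
            break
    return centroids, distance_matrix(items, centroids)
```

Here items is the NxM master data, centroids is the kxM matrix, and the
second return value is the Nxk distance matrix. Most of the arithmetic happens
inside distance_matrix(), which does N*k*M work on every iteration.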

The heart of the workload is the calculation of the distance matrix. When I
tried to write the k-means algorithm using SQL + R, its performance was poor.
https://github.com/kaigai/toybox/blob/master/Rstat/pgsql-kmeans.r

If we had native functions to use instead of complicated SQL expressions, it
would make sense for people who try in-database analytics.

Also, fortunately, PostgreSQL's 2-D array format is binary compatible with
what the BLAS library requires. That will allow a GPU to process large
matrices with HPC-grade performance.
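
As a rough illustration of what a BLAS-style routine expects, the sketch below
flattens an NxM matrix into one contiguous buffer of doubles in row-major
order, so that element (i, j) sits at offset i*M + j. It assumes a dense 2-D
array of float8 values with no NULLs; the helper names to_row_major and
element are hypothetical, and the code does not read the actual PostgreSQL
on-disk array header.

```python
from array import array


def to_row_major(matrix):
    """Flatten an N x M nested list into one contiguous float64 buffer."""
    n, m = len(matrix), len(matrix[0])
    flat = array("d", (v for row in matrix for v in row))
    return flat, n, m


def element(flat, m, i, j):
    """Read element (i, j) of a row-major buffer with M columns."""
    return flat[i * m + j]
```

A routine that takes a pointer to the data plus the matrix dimensions can work
on such a buffer directly, without first copying the elements into another
layout.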

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

