Обсуждение: how to effectively SELECT new "customers"

Поиск

Список

Период

Сортировка

how to effectively SELECT new "customers"

От

Jan Ostrochovsky

Дата:

27 февраля 2014 г., 17:20:23

<div style="font-family: times new roman,new york,times,serif; font-size: 12pt; color: #000000"><pre>Hello,</pre><pre>I
amsolving following task and it seems hard to me to find effective solution. Maybe somebody knows how to help
me:</pre><pre>Wehave table "purchases" and each one record is identified by "customer_id".</pre><pre>We want to know
notonly how many different customers did at least one purchase per time period, grouped by time periods</pre><pre>(easy
task:"COUNT(DISTINCT customer_id)" with "GROUP BY period"), but also to know how many NEW customers there
were.</pre><pre>Wedefine new customer as customer_id, which had first record in table "purchases" after 12 month of
inactivity(no record</pre><pre>in table "purchases" previous 12 months).</pre><pre>I have found one solution, but it is
veryslow and ugly. I tried several other concepts, but without success.</pre><pre>Any hint could be helpful for me.
Thanksin advance!</pre><pre>Jano</pre></div>

Re: how to effectively SELECT new "customers"

От

David Johnston

Дата:

27 февраля 2014 г., 18:09:39

Jan Ostrochovsky wrote
> Hello, I am solving following task and it seems hard to me to find
> effective solution. Maybe somebody knows how to help me: We have table
> "purchases" and each one record is identified by "customer_id". We want to
> know not only how many different customers did at least one purchase per
> time period, grouped by time periods (easy task: "COUNT(DISTINCT
> customer_id)" with "GROUP BY period"), but also to know how many NEW
> customers there were. We define new customer as customer_id, which had
> first record in table "purchases" after 12 month of inactivity (no record
> in table "purchases" previous 12 months). I have found one solution, but
> it is very slow and ugly. I tried several other concepts, but without
> success. Any hint could be helpful for me. Thanks in advance! Jano

Without incorporating additional meta-data about the purchases onto the
customer table the most basic solution would be:

SELECT DISTINCT customer_id FROM products WHERE date > (now() - '12
months'::interval)
EXCEPT
SELECT DISTINCT customer_id FROM products WHERE date <= (now() - '12
months'::interval)

---

Another solution:
WHERE ... >12 AND NOT EXISTS (SELECT ... WHERE <= 12)

---

Depending on the frequency that you need to run this query it may be
worthwhile to create a materialized view that captures the necessary data
and then whenever a new sale is generated you simply update that view by
changing the attributes of that single customer.  At any point you can
quickly determine, using the view, which customers were active at the time
of last purchases and which ones were dormant or non-existent.

David J.





--
View this message in context:
http://postgresql.1045698.n5.nabble.com/how-to-effectively-SELECT-new-customers-tp5793867p5793882.html
Sent from the PostgreSQL - sql mailing list archive at Nabble.com.

Re: how to effectively SELECT new "customers"

От

Jan Ostrochovsky

Дата:

28 февраля 2014 г., 14:44:23

From: "David Johnston" <polobo@yahoo.com>
To: pgsql-sql@postgresql.org
Sent: Thursday, February 27, 2014 4:09:34 PM
Subject: Re: [SQL] how to effectively SELECT new "customers"

Jan Ostrochovsky wrote
> Hello, I am solving following task and it seems hard to me to find
> effective solution. Maybe somebody knows how to help me: We have table
> "purchases" and each one record is identified by "customer_id". We want to
> know not only how many different customers did at least one purchase per
> time period, grouped by time periods (easy task: "COUNT(DISTINCT
> customer_id)" with "GROUP BY period"), but also to know how many NEW
> customers there were. We define new customer as customer_id, which had
> first record in table "purchases" after 12 month of inactivity (no record
> in table "purchases" previous 12 months). I have found one solution, but
> it is very slow and ugly. I tried several other concepts, but without
> success. Any hint could be helpful for me. Thanks in advance! Jano

Without incorporating additional meta-data about the purchases onto the
customer table the most basic solution would be:

SELECT DISTINCT customer_id FROM products WHERE date > (now() - '12
months'::interval)
EXCEPT
SELECT DISTINCT customer_id FROM products WHERE date <= (now() - '12
months'::interval)

---

Another solution:
WHERE ... >12 AND NOT EXISTS (SELECT ... WHERE <= 12)

---

Depending on the frequency that you need to run this query it may be
worthwhile to create a materialized view that captures the necessary data
and then whenever a new sale is generated you simply update that view by
changing the attributes of that single customer. At any point you can
quickly determine, using the view, which customers were active at the time
of last purchases and which ones were dormant or non-existent.

David J.

Re: how to effectively SELECT new "customers"

От

Jan Ostrochovsky

Дата:

28 февраля 2014 г., 14:45:46

From: "David Johnston" <polobo@yahoo.com>
To: pgsql-sql@postgresql.org
Sent: Thursday, February 27, 2014 4:09:34 PM
Subject: Re: [SQL] how to effectively SELECT new "customers"

Jan Ostrochovsky wrote
> Hello, I am solving following task and it seems hard to me to find
> effective solution. Maybe somebody knows how to help me: We have table
> "purchases" and each one record is identified by "customer_id". We want to
> know not only how many different customers did at least one purchase per
> time period, grouped by time periods (easy task: "COUNT(DISTINCT
> customer_id)" with "GROUP BY period"), but also to know how many NEW
> customers there were. We define new customer as customer_id, which had
> first record in table "purchases" after 12 month of inactivity (no record
> in table "purchases" previous 12 months). I have found one solution, but
> it is very slow and ugly. I tried several other concepts, but without
> success. Any hint could be helpful for me. Thanks in advance! Jano

Without incorporating additional meta-data about the purchases onto the
customer table the most basic solution would be:

SELECT DISTINCT customer_id FROM products WHERE date > (now() - '12
months'::interval)
EXCEPT
SELECT DISTINCT customer_id FROM products WHERE date <= (now() - '12
months'::interval)

---

Another solution:
WHERE ... >12 AND NOT EXISTS (SELECT ... WHERE <= 12)

---

Depending on the frequency that you need to run this query it may be
worthwhile to create a materialized view that captures the necessary data
and then whenever a new sale is generated you simply update that view by
changing the attributes of that single customer. At any point you can
quickly determine, using the view, which customers were active at the time
of last purchases and which ones were dormant or non-existent.

David J.

Thank you David, your alternatives seem as smart alternatives of my solution, what is based on the difference between COUNT(DISTINCT customer_id) in e.g. 13 months (in case when reporting period is 1 month over previous 12 months) and COUNT(DISTINCT customer_id) in 12 months prior to reporting period (but without the reporting period). But I would also like to incorporate other data in the output, which should seem e.g.:

town, period, payment_channel_id, purchases_count, purchases_total_revenue, distinct_customer_count, distinct_NEW_customer_count

BA, 201401, A, 495, 847, 103, 25

BA, 201401, B, 345, 456, 99, 21

BA, 201312, A, 554, 1021, 105, 30

ZA, 201401, A, 323, 987, 95, 23
...

there is 12-month-long time-window for new customer determination, sliding according to particular period, back to the past

how to achieve this? e.g. by running your or my concept repeatedly (manually or in the higher level loop), but this is too slow... or to find more sophisticated SQL query to accomplish this in one step... any other ideas for such SQL query running once, not repeatedly (in loop) for each reporting period? (I was considering window functions, connectby, grouping sets, ..., but no success yet)

materialized view could be part of that solution, when we will have the right query and uprgade from 9.0 to 9.3 (as I see it: http://www.postgresql.org/docs/9.3/static/rules-materializedviews.html)... but it should be scalable to several to tens purchases every minute

thanks again for your contribution, David

Jano

Re: how to effectively SELECT new "customers"

От

David Johnston

Дата:

28 февраля 2014 г., 18:39:43

Jan Ostrochovsky wrote
> materialized view could be part of that solution, when we will have the
> right query and uprgade from 9.0 to 9.3 (as I see it:
> http://www.postgresql.org/docs/9.3/static/rules-materializedviews.html)... 

9.3 adds materialized views syntax and functionality directly into
PostgreSQL but you can "roll your own" in any version and that is what I
would suggest.

I would probably focus on getting a single reporting period to execute
performantly and just use a loop to build up the materialized view
period-by-period.

I don't know how you want to go about dealing with your payment channel
since it depends on whether a customer can make use of more than one and
whether their "new-ness" is impacted by such.

Incorporating other data is as simple as building the different pieces and
joining them together into a final output; usually through a series of
CTEs/WITH sub-queries.

David J.





--
View this message in context:
http://postgresql.1045698.n5.nabble.com/how-to-effectively-SELECT-new-customers-tp5793867p5794056.html
Sent from the PostgreSQL - sql mailing list archive at Nabble.com.

Re: how to effectively SELECT new "customers"

От

Jan Ostrochovsky

Дата:

28 февраля 2014 г., 19:12:29

From: "David Johnston" <polobo@yahoo.com>

9.3 adds materialized views syntax and functionality directly into
PostgreSQL but you can "roll your own" in any version and that is what I
would suggest.

I would probably focus on getting a single reporting period to execute
performantly and just use a loop to build up the materialized view
period-by-period.

I don't know how you want to go about dealing with your payment channel
since it depends on whether a customer can make use of more than one and
whether their "new-ness" is impacted by such.

Incorporating other data is as simple as building the different pieces and
joining them together into a final output; usually through a series of
CTEs/WITH sub-queries.

David J.

--
View this message in context: http://postgresql.1045698.n5.nabble.com/how-to-effectively-SELECT-new-customers-tp5793867p5794056.html
Sent from the PostgreSQL - sql mailing list archive at Nabble.com.

--
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql

customer may have various payment channels during the time... new-ness is not impacted by the channel, it does not matter from which channel, all customer_id occurences count (in determined filtration criteria, e.g. town, service, subservice)

and there are also other filtration and grouping criteria (town, service, subservice) and user of reporting tool should have possibility to select from those... there are dozens of services and subservices, cca 4 payment channels, dozens of towns... therefore preprocessing through materialized view (if I understand your suggestion correctly), would contain a lot of combinations, it seems quite complex for me in these circumstances

I also considered WITH (CTEs) previously, I will rethink it yet, after these your recommendations

thanks

Re: how to effectively SELECT new "customers"

От

David Johnston

Дата:

28 февраля 2014 г., 19:33:14

Jan Ostrochovsky wrote
> customer may have various payment channels during the time... new-ness is
> not impacted by the channel, it does not matter from which channel, all
> customer_id occurences count (in determined filtration criteria, e.g.
> town, service, subservice) 
> 
> and there are also other filtration and grouping criteria (town, service,
> subservice) and user of reporting tool should have possibility to select
> from those... there are dozens of services and subservices, cca 4 payment
> channels, dozens of towns... therefore preprocessing through materialized
> view (if I understand your suggestion correctly), would contain a lot of
> combinations, it seems quite complex for me in these circumstances 
> 
> I also considered WITH (CTEs) previously, I will rethink it yet, after
> these your recommendations 

I'm not sure what you are going for since you keep adding additional
criteria/constraints to your problem.  At this point you are faced with a
series of trade-offs between caching, speed, and flexibilty, complexity.  I
would suggest you break up your requirements into smaller pieces and not go
looking for some kind of magic bullet that will solve all your problems in a
single query.  It likely does not exist.

I would also suggest that you look into resources on data warehousing and
the star schema; doing what you are trying directly within the OLTP is
probably not the best solution - especially not on front-end servers.  My
experience in this area is thin but maybe someone else can make some
suggestions and/or provide some useful resource links.

David J.




--
View this message in context:
http://postgresql.1045698.n5.nabble.com/how-to-effectively-SELECT-new-customers-tp5793867p5794066.html
Sent from the PostgreSQL - sql mailing list archive at Nabble.com.

Re: how to effectively SELECT new "customers"

От

Jan Ostrochovsky

Дата:

01 марта 2014 г., 17:24:55

I'm not sure what you are going for since you keep adding additional
criteria/constraints to your problem. At this point you are faced with a
series of trade-offs between caching, speed, and flexibilty, complexity. I
would suggest you break up your requirements into smaller pieces and not go
looking for some kind of magic bullet that will solve all your problems in a
single query. It likely does not exist.

I would also suggest that you look into resources on data warehousing and
the star schema; doing what you are trying directly within the OLTP is
probably not the best solution - especially not on front-end servers. My
experience in this area is thin but maybe someone else can make some
suggestions and/or provide some useful resource links.

David J.

Probably you are right. It seems, that there is no such single-query solution, just only with PostgreSQL. I will break it up into smaller pieces. There is medium-tern plan for data-warehousing, just this one task I will solve yet without it. Maybe this is the way to go then: http://www.jaspersoft.com/tour.

Thanks.

Jano

Re: how to effectively SELECT new "customers"

От

Jan Ostrochovsky

Дата:

01 марта 2014 г., 19:09:12

Without incorporating additional meta-data about the purchases onto the
customer table the most basic solution would be:

SELECT DISTINCT customer_id FROM products WHERE date > (now() - '12
months'::interval)
EXCEPT
SELECT DISTINCT customer_id FROM products WHERE date <= (now() - '12
months'::interval)

---

Another solution:
WHERE ... >12 AND NOT EXISTS (SELECT ... WHERE <= 12)

David J.

subsidiary matter: in what circumstances is better to use EXCEPT and in what NOT EXISTS?

are those equivalents? tried to google their comparison, but no relevant results found for PostgreSQL

Re: how to effectively SELECT new "customers"

От

David Johnston

Дата:

02 марта 2014 г., 07:08:44

Jan Ostrochovsky wrote
>> Without incorporating additional meta-data about the purchases onto
>> the
>> customer table the most basic solution would be:
> 
>> SELECT DISTINCT customer_id FROM products WHERE date > (now() - '12
>> months'::interval)
>> EXCEPT
>> SELECT DISTINCT customer_id FROM products WHERE date <= (now() - '12
>> months'::interval)
> 
>> ---
> 
>> Another solution:
>> WHERE ... >12 AND NOT EXISTS (SELECT ... WHERE <= 12)
> 
>> David J.
> 
> subsidiary matter: in what circumstances is better to use EXCEPT and in
> what NOT EXISTS? 
> 
> are those equivalents? tried to google their comparison, but no relevant
> results found for PostgreSQL

I don't know; it somewhat depends on how smart the planner is which is out
of my league.  I would expect that NOT EXISTS is typically a better first
option since EXCEPT needs to do sorting and de-duplicating (maybe?) of large
amounts of data while the NOT EXISTS method seems to require some level of
nested looping to process but only needs to find a single matching record to
return false so less memory constraints.

Someone more familiar with the internals may be able to give a more detailed
answer from the top of their head.

But in a critical (or under-performing) piece of code you should probably
test both to see in your reality which one perform better as I would guess
hardware is going to have an impact.

David J.





--
View this message in context:
http://postgresql.1045698.n5.nabble.com/how-to-effectively-SELECT-new-customers-tp5793867p5794267.html
Sent from the PostgreSQL - sql mailing list archive at Nabble.com.

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: how to effectively SELECT new "customers"