Discussion: Select where id in (LARGE LIST)?


Select where id in (LARGE LIST) ?

From
Jasper Potts
Date:
I am working on a GUI application which has a list of ids of selected
items. To perform an operation on the selected items I do a
"select/update ... where id in (...)". There seems to be a limit of
100-200 items in the list. Is there a way of doing this with large
lists, maybe 10,000 or 100,000 items long?

The best solution I have so far is to create a selection table, write the
selection out to it, and perform a join, but the 100,000 inserts are slow.

The ids are int8 (long), so we're not talking about large amounts of data:
a couple of MB at most. The database and client are running on the same
machine over localhost.

Many Thanks

Jasper

Re: Select where id in (LARGE LIST) ?

From
Csaba Nagy
Date:
Jasper,

You can chunk your operation: use only 100 entries per run, and repeat
until all ids have been processed. Use a prepared statement; that will
save you some overhead. The last chunk will have fewer entries than there
are parameter placeholders, so you will have to either build a special
final statement or set the superfluous parameters to null or to one of
the values from the last chunk, depending on what kind of query you have.
We do all our data import/export this way, and it works fine.
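
A minimal Java/JDBC sketch of this chunking idea (the "items" table, its
"id" and "value" columns, and the pad-with-the-last-id strategy are
illustrative assumptions, not code from this thread):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Chunked "WHERE id IN (...)" with a single reusable PreparedStatement.
// The short final chunk is padded by repeating its last id, which is
// harmless here because IN is a membership test and ignores duplicates.
class ChunkedSelect {

    static final int CHUNK = 100;

    // Builds "?,?,...,?" with n parameter placeholders.
    static String placeholders(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append(i == 0 ? "?" : ",?");
        }
        return sb.toString();
    }

    // Sums the "value" column over all selected ids, CHUNK ids at a time.
    static long sumForIds(Connection conn, long[] ids) throws SQLException {
        String sql = "SELECT sum(value) FROM items WHERE id IN ("
                + placeholders(CHUNK) + ")";
        long total = 0;
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (int start = 0; start < ids.length; start += CHUNK) {
                for (int i = 0; i < CHUNK; i++) {
                    // Clamp past-the-end slots to the last real id.
                    int idx = Math.min(start + i, ids.length - 1);
                    ps.setLong(i + 1, ids[idx]);
                }
                try (ResultSet rs = ps.executeQuery()) {
                    if (rs.next()) {
                        total += rs.getLong(1);
                    }
                }
            }
        }
        return total;
    }
}
```

Padding keeps the placeholder count constant, so one prepared statement
serves every chunk; this is safe for membership-style queries like the
sum above, but not for queries where a repeated value would change the result.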

Cheers,
Csaba.




Re: Select where id in (LARGE LIST) ?

From
Jasper Potts
Date:
I have been trying the chunk method, and it wins in cases where the
number of ids is not too high. I was comparing:

(1) joining the main table with a table that has an id column and a
boolean "selected" column, with both id columns indexed.
(2) a select over the main table with "where id in (...)", done in chunks
of 100 with a prepared statement, with the results accumulated in Java.

With a select that computed sum() over another column, the results were:

No. of Selected Items | Join Select Time (sec) | Chunk Select Time (sec)
30                    | 0.4                    | 0.007
4000                  | 0.5                    | 0.24
30000                 | 0.7                    | 1.12

All of these were with 30,000 rows in the main table, measured from JDBC.

These results don't take into account the time it takes to clear and then
repopulate the selection table with the selection data from the client. At
the moment this can take 1-2 seconds, which is far from interactive for
the user, and is why I am looking for a better method.

Any other suggestions? :-)

I would like to push the data size up to a million rows, which means the
user could in theory select a million rows and apply an operation. This is
going to get very slow with the chunk method.

Many Thanks

Jasper



Re: Select where id in (LARGE LIST) ?

From
Oliver Jowett
Date:
Jasper Potts wrote:

> The best solution I have so far is to create a selection table, write the
> selection out to it, and perform a join, but the 100,000 inserts are slow.

How are you constructing this table? If you use batched inserts with
autocommit off, it should be fairly fast; I get 8000+ inserts per second
over larger data sets.
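
A sketch of such a batched load (the "selection" table name, batch size,
and the `flushes` helper are assumptions for illustration, not code from
this thread):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Loads the selection table with batched inserts and autocommit off,
// committing once at the end of the whole load.
class SelectionLoader {

    static final int BATCH = 1000;

    // Number of executeBatch() calls load() makes for n ids.
    static int flushes(int n) {
        return (n + BATCH - 1) / BATCH;
    }

    static void load(Connection conn, long[] ids) throws SQLException {
        boolean oldAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false); // one transaction for the whole load
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO selection (id) VALUES (?)")) {
            int pending = 0;
            for (long id : ids) {
                ps.setLong(1, id);
                ps.addBatch();
                if (++pending == BATCH) { // flush every BATCH rows
                    ps.executeBatch();
                    pending = 0;
                }
            }
            if (pending > 0) {
                ps.executeBatch(); // flush the short final batch
            }
            conn.commit();
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        } finally {
            conn.setAutoCommit(oldAutoCommit);
        }
    }
}
```

Batching avoids a network round trip per row, and disabling autocommit
avoids a commit per row, which is where most of the per-insert cost goes.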

Can you move the selection list into the DB semi-permanently? Really,
the underlying problem is that you need to move a 100,000-item list to
the database on every SELECT. If you can have the list "already there",
it's going to be faster.

-O

Re: Select where id in (LARGE LIST) ?

From
Csaba Nagy
Date:
I don't quite understand what you're doing here, but it smells to me like
something which shouldn't be interactive in the first place. If a query
is expected to exceed a few seconds, we make the operation asynchronous:
the user starts it, gets a page saying the work is in progress, and then
receives some kind of notification when it finishes, either on the web
page (by periodic reloading) or by email for really long-running things.
The chunking method does have overhead, but it also has a set of good
properties: it's easily interruptible, it won't bog your database down
(it lets other work proceed between the chunks), and you won't get
connection timeouts for really long operations. It's just more manageable
in many respects.
And if you do have some long-running operations, make sure you don't let
the user hit reload 50 times. An asynchronous approach helps in this
regard too, because then you know what the user did, and a reload will
only show the status.

HTH,
Csaba.





Re: Select where id in (LARGE LIST) ?

From
Jasper Potts
Date:
We are working on a desktop (Swing) application; PostgreSQL runs embedded
in the application and is hence single-user. We have a scrollable grid
view of icons, much like Explorer. The data set is a list of item ids
resulting from a complex query that can take 30 seconds; the query
results are stored in a table in the database. The user then needs to be
able to browse (scroll) through the list of items and make a selection,
and can then perform operations on that selection. Those operations
require the selection to be usable in a database query. The options are:

(1) Store the selection in memory on the client.
(2) Store the selection as a column in the query-results table (a
boolean, maybe).
(3) A hybrid solution where the selection is both in memory and in the
database.

The problem with (1) is how to transfer the selection to the db if it
gets large; this is where the chunking comes in.
The problem with the pure in-database route (2) is that simple selection
operations, like clicking on an item to select only it, are too slow.
(3) could be good, but is complex to implement (threading/locking etc.).

My current plan is to go with (1) but change the storage model for the
selection. The first idea was just a set of ids of selected items; the
problem is that if that is 100,000 ids, it is hard to transfer to the
WHERE part of a query. The new idea is to store the selection as a list
of ranges, e.g. (26-32, 143-198, 10922-10923), which could then be sent
as "WHERE (row >= 26 AND row <= 32) OR (row >= 143 AND row <= 198) OR
(row >= 10922 AND row <= 10923)". As the user has to choose each range by
hand, it is unlikely there would be more than 100 ranges.
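
The range representation described above can be sketched in Java like
this (the `RangeClause` class and its method names are assumptions for
illustration; only the "row" column name comes from the message):

```java
import java.util.ArrayList;
import java.util.List;

// Collapses a sorted, duplicate-free list of row numbers into ranges and
// renders them as the OR'd range predicates from the message above.
class RangeClause {

    // e.g. {26..32, 143} -> [[26,32], [143,143]]
    static List<long[]> toRanges(long[] sortedRows) {
        List<long[]> ranges = new ArrayList<>();
        for (long r : sortedRows) {
            if (!ranges.isEmpty() && ranges.get(ranges.size() - 1)[1] == r - 1) {
                ranges.get(ranges.size() - 1)[1] = r; // extend the open range
            } else {
                ranges.add(new long[] { r, r });      // start a new range
            }
        }
        return ranges;
    }

    // "(row >= a AND row <= b) OR (...)"
    static String whereClause(List<long[]> ranges) {
        StringBuilder sb = new StringBuilder();
        for (long[] r : ranges) {
            if (sb.length() > 0) {
                sb.append(" OR ");
            }
            sb.append("(row >= ").append(r[0])
              .append(" AND row <= ").append(r[1]).append(")");
        }
        return sb.toString();
    }
}
```

The clause length now grows with the number of ranges rather than the
number of selected rows, which is what makes a 100,000-row selection
transferable in a single query.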

The big question here is: what's the maximum length of a query sent
through JDBC to PostgreSQL?

Many Thanks

Jasper


Re: Select where id in (LARGE LIST) ?

From
Oliver Jowett
Date:
Jasper Potts wrote:

> The big question here is: what's the maximum length of a query sent
> through JDBC to PostgreSQL?

You'll hit backend limits (I'm not sure there are any explicit limits
there other than available memory) before you hit any protocol limits;
the protocol limits kick in at around 1 GB.

-O

Re: Select where id in (LARGE LIST) ?

From
Csaba Nagy
Date:
In this case I wouldn't go with a boolean flag: setting it on a row will
actually insert a new row into your table (that's what updates really
do). You're better off with the range approach, keeping your ranges in
memory. I have no idea how big the query can get; I guess it should be
somewhere in the documentation, in a chapter dealing with limitations...

Cheers,
Csaba.



Re: Select where id in (LARGE LIST) ?

From
Jasper Potts
Date:
I know from when I was testing "SELECT ... WHERE id IN (...,...)" that I
got a buffer overflow or similar exception from the postgresql native
process at about 200-300 longs in the IN clause. I could do with finding
out what the real limits are so I can protect against them. I will have a
dig through the documentation later and see what I can find :-)

Thanks

Jasper


Re: Select where id in (LARGE LIST) ?

From
Oliver Jowett
Date:
Jasper Potts wrote:
> I know from when I was testing "SELECT ... WHERE id IN (...,...)" I got
> a buffer over flow or similar exception from the postgresql native
> process at about 200-300 longs in the IN clause.

Well, I'd need to see the actual exception, not a vague description of
it, to comment further.

> Could do with finding
> what the real limits are so I can protect against it. I will have a dig
> through the documentation later and see what I can find :-)

There may be a (backend) limit on the number of parameters in a query,
not sure.

-O

Re: Select where id in (LARGE LIST) ?

From
Tom Lane
Date:
Oliver Jowett <oliver@opencloud.com> writes:
> There may be a (backend) limit on the number of parameters in a query,
> not sure.

I'm pretty certain there's no hard limit.  However, I doubt anyone has
tested with thousands of parameters ... there might well be performance
issues, such as loops that are O(N^2) in the number of parameters.
Feel free to send along problem cases.

            regards, tom lane

Re: Select where id in (LARGE LIST) ?

From
Oliver Jowett
Date:
Tom Lane wrote:
> Oliver Jowett <oliver@opencloud.com> writes:
>
>>There may be a (backend) limit on the number of parameters in a query,
>>not sure.
>
>
> I'm pretty certain there's no hard limit.  However, I doubt anyone has
> tested with thousands of parameters ... there might well be performance
> issues, such as loops that are O(N^2) in the number of parameters.
> Feel free to send along problem cases.

With a really big IN list (16384 triggered it, 8192 didn't) I get
"ERROR: stack depth limit exceeded", though I guess this is more a limit
in the handling of IN than a number-of-parameters limit per se.
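
For reference (an assumption about the server configuration, not
something stated in this thread): the recursion guard that produces this
error is the backend's max_stack_depth setting, which can be inspected
and, on versions that support it, raised per session as long as the value
stays safely below the OS stack limit (ulimit -s):

```sql
-- Show the current backend stack limit (an integer, in kilobytes)
SHOW max_stack_depth;

-- Raise it for this session; must stay below the OS limit (ulimit -s)
SET max_stack_depth = 7000;
```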

-O

Re: Select where id in (LARGE LIST) ?

От
Oliver Jowett
Дата:
Oliver Jowett wrote:

> With a really big IN list (16384 triggered it, 8192 didn't) I get
> "ERROR: stack depth limit exceeded", though I guess this is more a limit
> in the handling of IN than a number-of-parameters limit per se.

And the end of the stack trace is this:

(... lots of transformExpr() recursion ...)
#11909 0x000000000048d4d2 in transformExpr ()
#11910 0x000000000048d4d2 in transformExpr ()
#11911 0x000000000048d4d2 in transformExpr ()
#11912 0x000000000048aa0e in transformWhereClause ()
#11913 0x0000000000479efc in transformStmt ()
#11914 0x000000000047b596 in do_parse_analyze ()
#11915 0x000000000047cad9 in parse_analyze_varparams ()
#11916 0x00000000005446f5 in exec_parse_message ()
#11917 0x0000000000545fc3 in PostgresMain ()
#11918 0x000000000051da0d in ServerLoop ()
#11919 0x000000000051ee6f in PostmasterMain ()
#11920 0x00000000004ee0af in main ()

-O