Discussion: Slow select times on select with xpath
I've got a table set up with an XML field that I would like to search on, with 2.5 million records. The XML values are serialized objects from my application which are too complex to break out into separate tables. I'm trying to run a query similar to this:

SELECT serialized_object AS outVal
FROM object
WHERE (
    array_to_string(xpath('/a:root/a:Identification/b:ObjectId/text()',
        serialized_object,
        ARRAY[
            ARRAY['a', 'http://schemas.datacontract.org/2004/07/Objects'],
            ARRAY['b', 'http://schemas.datacontract.org/2004/07/Security']
        ]), ' ') = 'fdc3da1f-060f-4c34-9c30-d9334d9272ae'
)
LIMIT 1000;

I've also set up an index on the xpath expression like this:

CREATE INDEX CONCURRENTLY idx_object_nodeid
ON object
USING btree (
    cast(xpath('/a:root/a:Identification/b:ObjectId/text()',
        serialized_object,
        ARRAY[
            ARRAY['a', 'http://schemas.datacontract.org/2004/07/Objects'],
            ARRAY['b', 'http://schemas.datacontract.org/2004/07/Security']
        ]) AS text[])
);

The query takes around 30 minutes to complete, with or without the index in place, and the results are not cached. Additionally, EXPLAIN says the index is not being used. I've looked everywhere but can't seem to find solid info on how to achieve this. Any ideas would be greatly appreciated.

--
View this message in context: http://www.nabble.com/Slow-select-times-on-select-with-xpath-tp25259351p25259351.html
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.
astro77 <astro_coder@yahoo.com> wrote:

> I've got a table set up with an XML field that I would like to search
> on, with 2.5 million records. The XML values are serialized objects
> from my application which are too complex to break out into separate
> tables. I'm trying to run a query similar to this:
>
> SELECT serialized_object AS outVal
> FROM object
> WHERE (
>     array_to_string(xpath('/a:root/a:Identification/b:ObjectId/text()',
>         serialized_object,
>         ARRAY[
>             ARRAY['a', 'http://schemas.datacontract.org/2004/07/Objects'],
>             ARRAY['b', 'http://schemas.datacontract.org/2004/07/Security']
>         ]), ' ') = 'fdc3da1f-060f-4c34-9c30-d9334d9272ae'
> )
> LIMIT 1000;

I would try to minimize how many XML values it had to read, parse, and search. The best approach that comes to mind would be to use tsearch2 techniques (with a GIN or GiST index on the tsvector) to identify which rows contain 'fdc3da1f-060f-4c34-9c30-d9334d9272ae', and use AND to combine that with your xpath search.

-Kevin
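Kevin's suggestion, sketched against the table from this thread. The 'simple' text-search configuration, the cast of the xml column to text, and the index name are illustrative assumptions, not something stated in the thread:

```sql
-- Hedged sketch of the tsearch-style pre-filter described above.
-- GIN index over a tsvector built from the XML document's text:
CREATE INDEX CONCURRENTLY idx_object_tsv
    ON object
    USING gin (to_tsvector('simple', serialized_object::text));

-- The @@ pre-filter can be answered from the GIN index; the expensive
-- xpath() comparison then runs only on the rows that survive it.
SELECT serialized_object AS outVal
FROM object
WHERE to_tsvector('simple', serialized_object::text)
          @@ plainto_tsquery('simple', 'fdc3da1f-060f-4c34-9c30-d9334d9272ae')
  AND array_to_string(xpath('/a:root/a:Identification/b:ObjectId/text()',
          serialized_object,
          ARRAY[ARRAY['a', 'http://schemas.datacontract.org/2004/07/Objects'],
                ARRAY['b', 'http://schemas.datacontract.org/2004/07/Security']]),
          ' ') = 'fdc3da1f-060f-4c34-9c30-d9334d9272ae'
LIMIT 1000;
```

Note that the `to_tsvector` expression in the WHERE clause must repeat the indexed expression exactly for the planner to consider the index.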
On Wed, Sep 2, 2009 at 11:04 AM, astro77 <astro_coder@yahoo.com> wrote:

> I've got a table set up with an XML field that I would like to search
> on, with 2.5 million records. The XML values are serialized objects
> from my application which are too complex to break out into separate
> tables. I'm trying to run a query similar to this:
>
> SELECT serialized_object AS outVal
> FROM object
> WHERE (
>     array_to_string(xpath('/a:root/a:Identification/b:ObjectId/text()',
>         serialized_object,
>         ARRAY[
>             ARRAY['a', 'http://schemas.datacontract.org/2004/07/Objects'],
>             ARRAY['b', 'http://schemas.datacontract.org/2004/07/Security']
>         ]), ' ') = 'fdc3da1f-060f-4c34-9c30-d9334d9272ae'
> )
> LIMIT 1000;
>
> I've also set up an index on the xpath expression like this:
>
> CREATE INDEX CONCURRENTLY idx_object_nodeid
> ON object
> USING btree (
>     cast(xpath('/a:root/a:Identification/b:ObjectId/text()',
>         serialized_object,
>         ARRAY[
>             ARRAY['a', 'http://schemas.datacontract.org/2004/07/Objects'],
>             ARRAY['b', 'http://schemas.datacontract.org/2004/07/Security']
>         ]) AS text[])
> );
>
> The query takes around 30 minutes to complete, with or without the
> index in place, and the results are not cached. Additionally, EXPLAIN
> says the index is not being used. I've looked everywhere but can't
> seem to find solid info on how to achieve this. Any ideas would be
> greatly appreciated.

Why do you have a cast in the index definition?

...Robert
I was receiving an error that an XML field does not support the various indexes available in PostgreSQL. Is there an example of how to do this properly?

Robert Haas wrote:
> On Wed, Sep 2, 2009 at 11:04 AM, astro77 <astro_coder@yahoo.com> wrote:
>> I've got a table set up with an XML field that I would like to search
>> on, with 2.5 million records. [...]
>> The query takes around 30 minutes to complete, with or without the
>> index in place, and the results are not cached. Additionally, EXPLAIN
>> says the index is not being used.
>
> Why do you have a cast in the index definition?
>
> ...Robert
On Thu, Sep 3, 2009 at 4:06 PM, astro77 <astro_coder@yahoo.com> wrote:

> I was receiving an error that an XML field does not support the various
> indexes available in PostgreSQL.

Please post what happens when you try.

> Is there an example of how to do this properly?

Not sure.

...Robert
CREATE INDEX CONCURRENTLY idx_serializedxml
ON "object" (serialized_object ASC NULLS LAST);

yields the error:

ERROR: data type xml has no default operator class for access method "btree"

The same error occurs when I try to use the other access methods as well.

> On Thu, Sep 3, 2009 at 4:06 PM, astro77 <astro_coder@yahoo.com> wrote:
>> I was receiving an error that an XML field does not support the various
>> indexes available in PostgreSQL.
>
> Please post what happens when you try.
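The error above is expected: the xml type has no comparison operators, so no index access method can order, hash, or compare it directly. A hedged sketch of the usual workaround, assuming the same table; the index name and the whole-document cast are illustrative:

```sql
-- Sketch: btree cannot index the xml column itself, but it can index
-- a text-typed expression computed from it.
CREATE INDEX CONCURRENTLY idx_serializedxml_txt
    ON "object" ((serialized_object::text));
```

With multi-kilobyte documents this whole-document key would exceed btree's index-row size limit, so in practice the expression should extract only the field being searched, as the xpath-based index in the first message attempts.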
Thanks Kevin. I thought about using tsearch2, but I need to be able to select exact values on other numerical queries and cannot use "contains" queries. It's got to be fast, so I cannot have lots of records returned and then do secondary processing on the XML for the records which contain the exact value I'm looking for. This is one of the reasons I moved away from using Lucene for searching. I hope this makes sense.

Kevin Grittner wrote:
> I would try to minimize how many XML values it had to read, parse, and
> search. The best approach that comes to mind would be to use tsearch2
> techniques (with a GIN or GiST index on the tsvector) to identify
> which rows contain 'fdc3da1f-060f-4c34-9c30-d9334d9272ae', and use AND
> to combine that with your xpath search.
>
> -Kevin
As a follow-up, when I try to create the index like this:

CREATE INDEX CONCURRENTLY idx_object_nodeid2
ON object
USING btree (
    xpath('/a:root/a:Identification/b:ObjectId/text()',
        serialized_object,
        ARRAY[
            ARRAY['a', 'http://schemas.datacontract.org/2004/07/Objects'],
            ARRAY['b', 'http://schemas.datacontract.org/2004/07/Security']
        ])
);

the index begins to build but fails after about 90 seconds with this error:

ERROR: could not identify a comparison function for type xml
SQL state: 42883

Robert Haas wrote:
> On Thu, Sep 3, 2009 at 4:06 PM, astro77 <astro_coder@yahoo.com> wrote:
>> I was receiving an error that an XML field does not support the various
>> indexes available in PostgreSQL.
>
> Please post what happens when you try.
>
>> Is there an example of how to do this properly?
>
> Not sure.
>
> ...Robert
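This failure follows from the same limitation: the expression returns xml[], and btree has no comparison function for xml elements. The usual workaround, and presumably why the index in the first message used a cast, is to convert the xpath result to a comparable type inside the index expression, then repeat exactly the same expression in the query so the planner can match it. A hedged sketch, untested against this schema; the index name is illustrative:

```sql
-- Sketch: take the first node of the xpath result and cast it to
-- text, which btree can compare. The query must repeat this exact
-- expression for the index to be considered.
CREATE INDEX CONCURRENTLY idx_object_nodeid_text
    ON object (((xpath('/a:root/a:Identification/b:ObjectId/text()',
                       serialized_object,
                       ARRAY[ARRAY['a', 'http://schemas.datacontract.org/2004/07/Objects'],
                             ARRAY['b', 'http://schemas.datacontract.org/2004/07/Security']]))[1]::text));

SELECT serialized_object AS outVal
FROM object
WHERE (xpath('/a:root/a:Identification/b:ObjectId/text()',
             serialized_object,
             ARRAY[ARRAY['a', 'http://schemas.datacontract.org/2004/07/Objects'],
                   ARRAY['b', 'http://schemas.datacontract.org/2004/07/Security']]))[1]::text
      = 'fdc3da1f-060f-4c34-9c30-d9334d9272ae'
LIMIT 1000;
```

This would also explain why the original query never used the original index: the WHERE clause compared `array_to_string(xpath(...), ' ')` while the index stored `cast(xpath(...) as text[])`, and the planner only uses an expression index when the query's expression matches the indexed one.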
astro77 wrote:
> Thanks Kevin. I thought about using tsearch2, but I need to be able to
> select exact values on other numerical queries and cannot use
> "contains" queries.

You might be able to make use of a custom parser for tsearch2 that creates something like a single "word" for XML fragments like <whatever>1</whatever>, which would let you quickly find exact matches for those words/phrases.

> It's got to be fast, so I cannot have lots of records returned and
> then do secondary processing on the XML for the records which contain
> the exact value I'm looking for. This is one of the reasons I moved
> away from using Lucene for searching. I hope this makes sense.
>
> Kevin Grittner wrote:
>> I would try to minimize how many XML values it had to read, parse,
>> and search. [...]
astro77 <astro_coder@yahoo.com> writes:
> Kevin Grittner wrote:
>> I would try to minimize how many XML values it had to read, parse, and
>> search. The best approach that comes to mind would be to use tsearch2
>> techniques (with a GIN or GiST index on the tsvector) to identify
>> which rows contain 'fdc3da1f-060f-4c34-9c30-d9334d9272ae', and use AND
>> to combine that with your xpath search.
>
> Thanks Kevin. I thought about using tsearch2 but I need to be able to
> select exact values on other numerical queries and cannot use
> "contains" queries. It's got to be fast so I cannot have lots of
> records returned and have to do secondary processing on the xml for
> the records which contain the exact value I'm looking for. This is one
> of the reasons I moved from using Lucene for searching. I hope this
> makes sense.

I think he meant something following this skeleton:

SELECT ...
  FROM (SELECT ...
          FROM ...
         WHERE /* insert preliminary filtering here */) AS sub
 WHERE /* insert xpath-related filtering here */;

Hopefully you have preliminary filtering available that's restrictive enough for the xpath filtering to only have to check a few rows. Kevin proposes that this preliminary filtering be based on tsearch with an adequate index (GiST for data that changes a lot, GIN for a pretty static set). As you can see, the two-step filtering can be done in a single SQL query.

Regards,
--
dim
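The skeleton, filled in with the table and xpath expression from this thread. The tsvector pre-filter is an illustrative assumption (any sufficiently selective indexed predicate would serve), and it presumes a GIN index over the same `to_tsvector` expression:

```sql
-- Sketch of the two-step filter as one query: the inner SELECT is
-- answered from an index on the tsvector expression, and the outer
-- WHERE runs xpath() only on the few surviving rows.
SELECT serialized_object AS outVal
FROM (
    SELECT serialized_object
    FROM object
    WHERE to_tsvector('simple', serialized_object::text)
              @@ plainto_tsquery('simple', 'fdc3da1f-060f-4c34-9c30-d9334d9272ae')
) AS prefiltered
WHERE array_to_string(xpath('/a:root/a:Identification/b:ObjectId/text()',
          serialized_object,
          ARRAY[ARRAY['a', 'http://schemas.datacontract.org/2004/07/Objects'],
                ARRAY['b', 'http://schemas.datacontract.org/2004/07/Security']]),
          ' ') = 'fdc3da1f-060f-4c34-9c30-d9334d9272ae'
LIMIT 1000;
```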