Обсуждение: HTTP Frontend? (and a brief thought on materialized views)
			
				Hi guys,
			
		
		
	Something from Josh's recent blog post about summer of code clicked with me - the HTTP / SQL concept.
It was something I'd been thinking about earlier, how people really like HTTP APIs and this is one of the drivers behind adoption of  some "NoSQL" databases out there.
Some things that I think are great about NoSQL that are missing in PostgreSQL are:
- The "reduce" function which could be modeled as a kind of materialized view with an index on it. With materialized views and the ability to pull fields from JSON and XML data structures you could easily do a "NoSQL" database inside of PostgreSQL by saving the document as a big string and then using materialized views to have all kinds of derived data from those documents. Sounds cool on paper, anyway.
- HTTP RESTful API. This is obviously not as useful as a good admin tool like pgAdmin or a fast binary protocol like we have now but it's way more buzzword compliant and would come in handy sometimes. With CouchDB I was able to allow "public" data in the database to be accessed directly from the browser using JSONP and/or a simple HTTP proxy in the server instead of doing any of that work within the app server. So, that saves a step somewhere. With some basic support for stored procedures and serving files directly via this HTTP thing you could potentially eliminate the app server for public, read-only parts of a site/application.
- The perception of extremely low latency and/or high throughput - like fetching a row in less than 1ms or whatever. I don't know the exact numbers I just know they have to be insanely low (latency) or impressively high (throughput). We could use PostgreSQL as a query cache to accelerate your MySQL! Something like that :-P.
- Rebelliousness against "the man". In our case SQL can't be "the man", but we can find something PostgreSQL isn't and position against that.
Anyway, 1&2 seem inspiring to me and I'd love to help out with both.  3&4 seem more like marketing tasks of some sort...
I think I could probably start hacking up an HTTP API of some sort, I wasn't able to log into the PostgreSQL Wiki so I created a page with some notes here:
For materialized views I think that can be a bit of a can of worms, especially if you want to do incremental updates of some sort because you have to figure out what rows need to be recalculated when you update a row in a "source" table, and how best to update them.  But if someone thinks they know how I can assist in implementation.
Anyway, I hope I can help AND that I posted this in the correct mailing list!
Cheers,
Dobes
On 03/29/2012 10:37 AM, Dobes Vandermeer wrote: > Hi guys, > > Something from Josh's recent blog post about summer of code clicked > with me - the HTTP / SQL concept. > > It was something I'd been thinking about earlier, how people really > like HTTP APIs and this is one of the drivers behind adoption of some > "NoSQL" databases out there. > > Some things that I think are great about NoSQL that are missing in > PostgreSQL are: > > 1. The "reduce" function which could be modeled as a kind of > materialized view with an index on it. With materialized views > and the ability to pull fields from JSON and XML data structures > you could easily do a "NoSQL" database inside of PostgreSQL by > saving the document as a big string and then using materialized > views to have all kinds of derived data from those documents. > Sounds cool on paper, anyway. > 2. HTTP RESTful API. This is obviously not as useful as a good admin > tool like pgAdmin or a fast binary protocol like we have now but > it's way more buzzword compliant and would come in handy > sometimes. With CouchDB I was able to allow "public" data in the > database to be accessed directly from the browser using JSONP > and/or a simple HTTP proxy in the server instead of doing any of > that work within the app server. So, that saves a step somewhere. > With some basic support for stored procedures and serving files > directly via this HTTP thing you could potentially eliminate the > app server for public, read-only parts of a site/application. > 3. The perception of extremely low latency and/or high throughput - > like fetching a row in less than 1ms or whatever. I don't know > the exact numbers I just know they have to be insanely low > (latency) or impressively high (throughput). We could use > PostgreSQL as a query cache to accelerate your MySQL! Something > like that :-P. > 4. Rebelliousness against "the man". In our case SQL can't be "the > man", but we can find something PostgreSQL isn't and position > against that. > > Anyway, 1&2 seem inspiring to me and I'd love to help out with both. > 3&4 seem more like marketing tasks of some sort... > > I think I could probably start hacking up an HTTP API of some sort, I > wasn't able to log into the PostgreSQL Wiki so I created a page with > some notes here: > > http://dobesv.com/docs/postgresql-http-api.html > > For materialized views I think that can be a bit of a can of worms, > especially if you want to do incremental updates of some sort because > you have to figure out what rows need to be recalculated when you > update a row in a "source" table, and how best to update them. But if > someone thinks they know how I can assist in implementation. > > Anyway, I hope I can help AND that I posted this in the correct > mailing list! > 1. I've been in discussion with some people about adding simple JSON extract functions. We already have some (i.e. xpath()) for XML. 2. You might find htsql <http://htsql.org/> interesting. cheers andrew
On Thu, Mar 29, 2012 at 8:04 AM, Andrew Dunstan <andrew@dunslane.net> wrote:1. I've been in discussion with some people aboutadding simple JSON extract > functions. We already have some (i.e. xpath()) for XML. > > 2. You might find htsql <http://htsql.org/> interesting. My colleagues and myself have thought about this quite a lot. Even without changing any of FEBE's semantics (so far, I think they're basically fine, with the glaring problem of hidden session state) I think it may be a good time to experiment in supporting SPDY (clearly, 9.3+), which solves some of the problems of HTTP that make it pretty unusable for interactive database workload. For a long time, it looked like SPDY would become the IETF's HTTP 2.0.But the future is still clouded, as just recently Microsoft introduced their own mostly-compatible design. So that's a bit scary. More concretely, here's what I really want. The first two concerns are almost entirely technical, the latter two social, in that they rely on notion of "common" or "standard': * A standard frame extension format. For example, last I checked Postgres-XC, it required snapshot information to be passed, and it'd be nice that instead of having to hack the protocol that they could attach an X-Snapshot-Info header, or whatever. This could also be a nice way to pass out-of-band hint information of all sorts. * Something similar to the HTTP "Host" header, so that one can route to databases without having to conflate database identity with the actual port being connected to. Yes, theoretically it can be done with weird startup packet gyrations, but that is firmly in the "weird" category. * HTTP -- and *probably* its hypothetical progeny -- are more common than FEBE packets, and a lot of incidental complexity of writing routers is reduced by the commonality of routing HTTP traffic. * Protocol compression -- but a bit of sand in the gears is *which* compression -- for database workloads, the performance of zlib can be a meaningful bottleneck. Lastly, a case that can not as easily be fixed without some more thinking is leveraging caching semantics of HTTP. think people would really, really like that, if they could get away from having to hand-roll their own cache regeneration in common cases. -- fdr
On Thu, Mar 29, 2012 at 12:57 PM, Daniel Farina <daniel@heroku.com> wrote: D'oh, I munged the order. More technical concerns: > * Protocol compression -- but a bit of sand in the gears is *which* > compression -- for database workloads, the performance of zlib can be > a meaningful bottleneck. > * Something similar to the HTTP "Host" header, so that one can route > to databases without having to conflate database identity with the > actual port being connected to. Yes, theoretically it can be done > with weird startup packet gyrations, but that is firmly in the "weird" > category. Socialish (but no less important): > * A standard frame extension format. For example, last I checked > Postgres-XC, it required snapshot information to be passed, and it'd > be nice that instead of having to hack the protocol that they could > attach an X-Snapshot-Info header, or whatever. This could also be a > nice way to pass out-of-band hint information of all sorts. > > > * HTTP -- and *probably* its hypothetical progeny -- are more common > than FEBE packets, and a lot of incidental complexity of writing > routers is reduced by the commonality of routing HTTP traffic. -- fdr
On Thu, Mar 29, 2012 at 11:04 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
 
			
		
On 03/29/2012 10:37 AM, Dobes Vandermeer wrote:Hi guys,
Something from Josh's recent blog post about summer of code clicked with me - the HTTP / SQL concept.
1. I've been in discussion with some people about adding simple JSON extract functions. We already have some (i.e. xpath()) for XML.
2. You might find htsql <http://htsql.org/> interesting.
As a reference, or should we just bundle / integrate that with PostgreSQL somehow?
On Thu, Mar 29, 2012 at 6:25 PM, Dobes Vandermeer <dobesv@gmail.com> wrote: >> 2. You might find htsql <http://htsql.org/> interesting. > > > As a reference, or should we just bundle / integrate that with PostgreSQL > somehow? It's a totally different language layer without wide-spread popularity and, as of last year, still figuring out how to model updates and transactions, with a Python dependency. It is good work, but to build it in and to commit maintaining it seems very premature, to my eyes. Also, it would tie someone -- possibly the htsql maintainers -- to a upgrade and maintenance policy not unlike Postgres itself, and that is a very expensive proposition while they are still figuring things out. -- fdr
On Fri, Mar 30, 2012 at 3:59 AM, Daniel Farina <daniel@heroku.com> wrote:
 
 
			
		On Thu, Mar 29, 2012 at 12:57 PM, Daniel Farina <daniel@heroku.com> wrote:
More technical concerns:> * Protocol compression -- but a bit of sand in the gears is *which*
> compression -- for database workloads, the performance of zlib can be
> a meaningful bottleneck.
I think if performance is the issue, people should use the native protocol.  This HTTP thing should be more of a RAD / prototyping thing, I think.  So people can be in their comfort zone when talking to the server.
> * Something similar to the HTTP "Host" header, so that one can route
> to databases without having to conflate database identity with the
> actual port being connected to. Yes, theoretically it can be done
> with weird startup packet gyrations, but that is firmly in the "weird"
> category.
Isn't the URL good enough (/databases/<dbname>) or are you talking about having some some of "virtual host" setup where you have multiple sets of databases available on the same port?
Socialish (but no less important):
> * A standard frame extension format. For example, last I checked
> Postgres-XC, it required snapshot information to be passed, and it'd
> be nice that instead of having to hack the protocol that they could
> attach an X-Snapshot-Info header, or whatever. This could also be a
> nice way to pass out-of-band hint information of all sorts.
I am sorry to admit I don't understand the terms "frame extension format" or "Postgres-XC" in this paragraph ... help?
> * HTTP -- and *probably* its hypothetical progeny -- are more common> than FEBE packets, and a lot of incidental complexity of writing
> routers is reduced by the commonality of routing HTTP traffic.
This is an interesting comment.  I'm not sure how to test whether an HTTP based protocol will be better supported than a proprietary one, but I think we have enough other reasons that we can move forward.  Well we have the reason that there's some kind of love affair with HTTP based protocols going on out there ... might as well ride the wave while it's still rising (I hope).
As for SPDY I can see how it may be helpful but as it is basically a different way to send HTTP requests (from what I understand) the migration to SPDY is mainly a matter of adding support for it to whatever HTTP library is used.
Anyone have a thought on whether, for the HTTP server itself, it should be integrated right into the PostgreSQL server itself?  Or would it be better to have a separate process that proxies requests to PostgreSQL using the existing protocol?  Is there an API that can be used in both cases semi-transparently (i.e. the functions have the same name when linked right in, or when calling via a socket)?
Cheers,
Dobes
On Fri, Mar 30, 2012 at 3:57 AM, Daniel Farina <daniel@heroku.com> wrote:
 
			
		On Thu, Mar 29, 2012 at 8:04 AM, Andrew Dunstan <andrew@dunslane.net> wrote:
Lastly, a case that can not as easily be fixed without some morethinking is leveraging caching semantics of HTTP. think people would
really, really like that, if they could get away from having to
hand-roll their own cache regeneration in common cases.
I think this could be an interesting possibility.  I wonder if the cost of a round-trip makes the cost of sending the actual data (vs a 304 response) irrelevant - as long as PostgreSQL is caching effectively internally it's possible it can send back the actual content as fast as it can calculate the ETag for it, so doing an extra query to check for changes could possibly slow things down, or at least eliminate the benefit.  Probably worth trying, though.
Supporting this "automagically" would require some kind of generic algorithm for calculating the Last-Modifed time or ETag for a given query.  As a default we may be able to just fall back on some internal global value that is guaranteed to change if the database has changed (I think the WAL files have some kind of serial number system we might use) so at the very least you could send back a 304 Not Modified if literally nothing in the database has changed.  Narrowing that down to specific table timestamps might be possible, too, for simple queries.  It depends what data is already available, I wouldn't want to add any extra book keeping for it.
A more pragmatic may be to have the HTTP request include SQL code to generate an ETag or Last-Modified value to test with; the app could run that first and it would be used for caching.  Something like calcLastModified=max(modified_date) on the query string.
On Thu, Mar 29, 2012 at 6:37 PM, Dobes Vandermeer <dobesv@gmail.com> wrote: > On Fri, Mar 30, 2012 at 3:59 AM, Daniel Farina <daniel@heroku.com> wrote: >> >> On Thu, Mar 29, 2012 at 12:57 PM, Daniel Farina <daniel@heroku.com> >> wrote: >> >> More technical concerns: >> > * Protocol compression -- but a bit of sand in the gears is *which* >> > compression -- for database workloads, the performance of zlib can be >> > a meaningful bottleneck. > > > I think if performance is the issue, people should use the native protocol. > This HTTP thing should be more of a RAD / prototyping thing, I think. So > people can be in their comfort zone when talking to the server. No. I do not think so. I think a reasonable solution (part of what MS is actually proposing to the IETF) is to make compression optional, or have clients support an LZ77 family format like Snappy. I get the impression that SPDY is waffling a little on this fact, it mandates compression, and definitely zlib, but is less heavy handed about pushing, say Snappy. Although I can understand why a Google-originated technology probably doesn't want to push another Google-originated implementation so hard, it would have been convenient for me for Snappy to have become a more common format. > Isn't the URL good enough (/databases/<dbname>) or are you talking about > having some some of "virtual host" setup where you have multiple sets of > databases available on the same port? Virtual hosts. Same port. >> > * A standard frame extension format. For example, last I checked >> > Postgres-XC, it required snapshot information to be passed, and it'd >> > be nice that instead of having to hack the protocol that they could >> > attach an X-Snapshot-Info header, or whatever. This could also be a >> > nice way to pass out-of-band hint information of all sorts. > > > I am sorry to admit I don't understand the terms "frame extension format" or > "Postgres-XC" in this paragraph ... help? I'm being vague. Postgres-XC is a project to provide a shared-nothing sync-rep cluster for Postgres. My last understanding of it is that it needed to pass snapshot information between nodes, and FEBE was expanded to make room for this, breaking compatibility, as well as probably being at least a small chore. It'd be nice if it wasn't necessary to do that and they had a much easier path to multiplex additional information into the connection. For my own purposes, I'm intensely interest in lacing the connection with: * EXPLAIN ANALYZE returns when the query has already run, getting both the actual timings *and* the results to the client. * Partition IDs, whereby you can find the right database and (potentially!) even influence how the queries are scoped to a tenant * Read-only vs. Write workload: As is well established, it's hard to know a-priori if a query is going to do a write. Fine. Let the client tag it, signal an error if something is wrong. Yes, theoretically all these features -- or just a general multiplexing scheme -- can be added to FEBE, but if we're even going to consider such an upheaval, maybe we can get *lot* more bang for our buck by trying to avoid being unnecessarily different from the most common application-level protocol in existence, causing extraneous work for router and proxy authors. Notably, a vanilla Postgres database knows nothing about these extension headers. >> > * HTTP -- and *probably* its hypothetical progeny -- are more common >> > than FEBE packets, and a lot of incidental complexity of writing >> > routers is reduced by the commonality of routing HTTP traffic. > > This is an interesting comment. I'm not sure how to test whether an HTTP > based protocol will be better supported than a proprietary one, but I think > we have enough other reasons that we can move forward. Well we have the > reason that there's some kind of love affair with HTTP based protocols going > on out there ... might as well ride the wave while it's still rising (I > hope). At its core, what may be growing unnecessary is FEBE's own mechanism for delimiting messages. All the other protocol actions -- such as shipping Binds, Executes, Describes, et al, are not going to be obsoleted or even changed by laying web-originated technologies under FEBE. Consider the wealth of projects, products, and services that filter HTTP vs FEBE, many quite old, now. In my mind, "wave" might be better rendered "tsunami". The very real problem, as I see it, is that classic, stateless HTTP would be just too slow to be practical. > As for SPDY I can see how it may be helpful but as it is basically a > different way to send HTTP requests (from what I understand) the migration > to SPDY is mainly a matter of adding support for it to whatever HTTP library > is used. I think SPDY or like-protocols (there's only one other I can think of, the very recently introduced and hilariously branded "S&M" from Microsoft. It's memorable, at least) are the only things that appear to give a crisp treatment to interactive, stateful workloads involving back-and-forth between client and server with multiplexing, fixing some problems with the hacks in HTTP-land from before. > Anyone have a thought on whether, for the HTTP server itself, it should be > integrated right into the PostgreSQL server itself? Or would it be better > to have a separate process that proxies requests to PostgreSQL using the > existing protocol? Is there an API that can be used in both cases > semi-transparently (i.e. the functions have the same name when linked right > in, or when calling via a socket)? If SPDY/HTTP2.0 were more common/existent in the latter case, I'd advocate for swallowing it and making it just part of the monolithic system. But it is still a time of transition, and jumping the gun on it would be expensive. -- fdr
On Fri, Mar 30, 2012 at 10:55 AM, Daniel Farina <daniel@heroku.com> wrote:
 
 
 
			
		On Thu, Mar 29, 2012 at 6:37 PM, Dobes Vandermeer <dobesv@gmail.com> wrote:No. I do not think so. I think a reasonable solution (part of what MS
> On Fri, Mar 30, 2012 at 3:59 AM, Daniel Farina <daniel@heroku.com> wrote:
>>
>> On Thu, Mar 29, 2012 at 12:57 PM, Daniel Farina <daniel@heroku.com>
>> wrote:
>>
>> More technical concerns:
>> > * Protocol compression -- but a bit of sand in the gears is *which*
>> > compression -- for database workloads, the performance of zlib can be
>> > a meaningful bottleneck.
>
> I think if performance is the issue, people should use the native protocol.
is actually proposing to the IETF) is to make compression optional, or
have clients support an LZ77 family format like Snappy. I get the
impression that SPDY is waffling a little on this fact, it mandates
compression, and definitely zlib, but is less heavy handed about
pushing, say Snappy. Although I can understand why a
Google-originated technology probably doesn't want to push another
Google-originated implementation so hard, it would have been
convenient for me for Snappy to have become a more common format.Virtual hosts. Same port.
> Isn't the URL good enough (/databases/<dbname>) or are you talking about
> having some some of "virtual host" setup where you have multiple sets of
> databases available on the same port?
In that case, the frontend would not be tied to a specific PostgreSQL server, then?  I think initially this might complicate things a bit, and you could solve it by putting an HTTP proxy in front to do the virtual hosts for you.
>> > * A standard frame extension format. For example, last I checkedIt'd be nice if it wasn't
>> > Postgres-XC, it required snapshot information to be passed, and it'd
>> > be nice that instead of having to hack the protocol that they could
>> > attach an X-Snapshot-Info header, or whatever. This could also be a
>> > nice way to pass out-of-band hint information of all sorts.
>
>
> I am sorry to admit I don't understand the terms "frame extension format" or
> "Postgres-XC" in this paragraph ... help?
necessary to do that and they had a much easier path to multiplex
additional information into the connection.
Ah, I get it - you want a way to add some extra information to the protocol in a backwards compatible way.  HTTP (and SPDY) provides a "standard" way to do that.  Makes sense.
For my own purposes, I'm intensely interest in lacing the connection with:
* EXPLAIN ANALYZE
* Partition IDs
* Read-only vs. Write workload
I'll make a note of these and hash out the details a bit more once there's something working to add them to.
> As for SPDY I can see how it may be helpful but as it is basically a
> different way to send HTTP requestsI think SPDY or like-protocols [...] give a crisp treatment to interactive, stateful workloads involving
back-and-forth between client and server with multiplexing, fixing
some problems with the hacks in HTTP-land from before.
It sounds like at some level you're really talking about replacing the built-in protocol with SPDY because SPDY is possibly a better baseline than updating the existing protocol.  That's an interesting idea, I think this project could evolve in that direction if there's demand for it.
On Thu, Mar 29, 2012 at 10:55 PM, Dobes Vandermeer <dobesv@gmail.com> wrote: >> Virtual hosts. Same port. > > In that case, the frontend would not be tied to a specific PostgreSQL > server, then? I think initially this might complicate things a bit, and you > could solve it by putting an HTTP proxy in front to do the virtual hosts for > you. I think these problems are treatable, as you mention, and if there is to be any value at all by using mostly off-the-shelf components (which, at this time, are rarer for the new-generation stuff). That's the draw, for me. > Ah, I get it - you want a way to add some extra information to the protocol > in a backwards compatible way. HTTP (and SPDY) provides a "standard" way to > do that. Makes sense. > > I'll make a note of these and hash out the details a bit more once there's > something working to add them to. A lot of them are old ideas, but it would be nice to encourage experimentation by getting over some of the small-matter-of-programming and backwards-compatibility issues. >> I think SPDY or like-protocols [...] give a crisp treatment to >> interactive, stateful workloads involving >> >> back-and-forth between client and server with multiplexing, fixing >> some problems with the hacks in HTTP-land from before. > > It sounds like at some level you're really talking about replacing the > built-in protocol with SPDY because SPDY is possibly a better baseline than > updating the existing protocol. That's an interesting idea, I think this > project could evolve in that direction if there's demand for it. If only so there is a smaller set of arbitrary decisions to make about how to delimit messages...but if SPDY doesn't get widely deployed, or exacts an unacceptable performance penalty, it is game over. The worst possible outcome is the thing that becomes common also is impractical for session-oriented sequential SQL execution, but I am hopeful that given the use cases driving this standardization process that this is less likely to happen. FEBE's framing/message format doesn't seem like an area where people are itching to try wild and crazy changes unique to Postgres (maybe someone has...) so using a more commonly seen delimitation format seems like a concession that opens a lot of more useful interesting doors. This discussion is somewhat premature because HTTP 2.0 isn't standardized, nor has SPDY become a wide-spread defacto format (although the percentage of well-known web-infrastructure projects implementing it has grown both impressively both recently and overall), and there are not even straw-man bandwidth/latency measurements (in terms of vs. FEBE), but as long as someone is bringing up HTTP, I thought it worth discussing in a little more depth, because it's something I poll regularly mentally, looking for a sign that It's Time. It wasn't ready enough for me to start a thread, but clearly I couldn't quite resist replying to one... -- fdr
On Thu, Mar 29, 2012 at 10:04 AM, Andrew Dunstan <andrew@dunslane.net> wrote: > 1. I've been in discussion with some people about adding simple JSON extract > functions. We already have some (i.e. xpath()) for XML. I've built a couple of applications that push data in and out of xml via manual composition going out and xpath coming in. TBH, I found this to be a pretty tedious way of developing a general application structure and a couple of notches down from the more sql driven approach. Not that jsonpath/xpath aren't wonderful functions -- but I thing for general information passing there's a better way. Your json work is a great start in marrying document level database features with a relational backend. My take is that storing rich data inside the database in json format, while tempting, is generally a mistake. Unless the document is black box it should be decomposed and stored relationally and marked back up into a document as it goes out the door. This is why brevity and flexibility of syntax is so important when marshaling data in and out of transport formats. It encourages people to take the right path and get the best of both worlds -- a rich backend with strong constraints that can natively speak such that writing data driven web services is easy. What I'm saying is that jsonpath probably isn't the whole story: another way of bulk moving json into native backend structures without parsing would also be very helpful. For example, being able to cast a json document into a record or a record array would be just amazing. merlin
On 03/30/2012 09:57 AM, Merlin Moncure wrote: > What I'm saying is that jsonpath probably isn't the whole story: > another way of bulk moving json into native backend structures without > parsing would also be very helpful. For example, being able to cast a > json document into a record or a record array would be just amazing. It should be possible. After all, hstore's populate_record() does that. If we married that logic up with the json parser we should be half way there. cheers andrew
On Fri, Mar 30, 2012 at 4:30 PM, Daniel Farina <daniel@heroku.com> wrote:
			
		On Thu, Mar 29, 2012 at 10:55 PM, Dobes Vandermeer <dobesv@gmail.com> wrote:
>> Virtual hosts. Same port.>> I think SPDY or like-protocols [...] give a crisp treatment to>> interactive, stateful workloads involvingIf only so there is a smaller set of arbitrary decisions to make about
>>
>> back-and-forth between client and server with multiplexing, fixing
>> some problems with the hacks in HTTP-land from before.
>
> It sounds like at some level you're really talking about replacing the
> built-in protocol with SPDY because SPDY is possibly a better baseline than
> updating the existing protocol. That's an interesting idea, I think this
> project could evolve in that direction if there's demand for it.
how to delimit messages...but if SPDY doesn't get widely deployed, or
exacts an unacceptable performance penalty, it is game over.
Well, in our case HTTP is a clear win (but not replacement) and SPDY a potential one (even as a replacement).  Even if SPDY is not widely adopted it could still replace FEBE if there's a clear advantage to using it, I don't know enough to make the call right now.
On Fri, Mar 30, 2012 at 9:57 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
			
		On Thu, Mar 29, 2012 at 10:04 AM, Andrew Dunstan <andrew@dunslane.net> wrote:Your json work is a great start in marrying document level database
> 1. I've been in discussion with some people about adding simple JSON extract
> functions. We already have some (i.e. xpath()) for XML.
features with a relational backend. My take is that storing rich data
inside the database in json format, while tempting, is generally a
mistake. Unless the document is black box it should be decomposed and
stored relationally and marked back up into a document as it goes out
the door.
I found storing documents (with no schema) in CouchDB very sexy.  For analytics purposes it's very handy to throw whatever data you've got into a document and save it to the database and sort out the schema later, when you have a need for it.  It could be you want to allow hundreds of different fields in the document, and the document has a hierarchal structure, and so on ... and yet there's no interest in spending time figuring out how to map all that effectively into a SQL schema.  Some things might start out a black box but become interesting for reporting purposes later.
While I was doing this I always thought this would have been a better approach for my previous project, an accounting application.  If I could just have stored entities like invoice & customer as a single document that is inserted, updated, etc. atomically it would be a lot simpler and faster than having to break things out into columns and rows spread over various tables.  Some fields are just "information" that isn't searched for or used relationally but ends up getting a field (or a whole bunch of fields, like your address) in the DB schema anyway.  Adding, removing, and changing columns is always scary to me in the SQL database - when there's no schema you can fix these things as part of the definition of your "view" instead.
So I kind of think the document database kind of bridges the gap between an OODBMS and the RDBMS because the document is like a little cluster of objects and then you populate your relations by mapping a view of those documents and use that for analysis and search.
What I'm saying is that jsonpath probably isn't the whole story:
another way of bulk moving json into native backend structures without
parsing would also be very helpful. For example, being able to cast a
json document into a record or a record array would be just amazing.
Hmm whatever that exactly is, it does sound cool! 
On Fri, Mar 30, 2012 at 10:55 AM, Dobes Vandermeer <dobesv@gmail.com> wrote: > Well, in our case HTTP is a clear win (but not replacement) and SPDY a > potential one (even as a replacement). Even if SPDY is not widely adopted > it could still replace FEBE if there's a clear advantage to using it, I > don't know enough to make the call right now. I can see that there are some advantages to having an HTTP interface to the database, but I think throwing our existing protocol out the window or relegating it to the status of a second-class citizen would be foolish. HTTP is a non-trivial protocol that tends to impose lots of escaping and de-escaping overhead which is unnecessary for people who just want to connect to the database and run queries. I can completely understand that someone might want the ability to do GET /db/table/pk and have that return an answer very, very quickly, by bypassing the usual parser and planner and just firing off an index-scan and returning the results as JSON or somesuch. But I think it would be a serious mistake to assume that GET /q?q=myquery is going to come out better than what we have now in the general case. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 03/30/2012 11:41 AM, Robert Haas wrote: > On Fri, Mar 30, 2012 at 10:55 AM, Dobes Vandermeer<dobesv@gmail.com> wrote: >> Well, in our case HTTP is a clear win (but not replacement) and SPDY a >> potential one (even as a replacement). Even if SPDY is not widely adopted >> it could still replace FEBE if there's a clear advantage to using it, I >> don't know enough to make the call right now. > I can see that there are some advantages to having an HTTP interface > to the database, but I think throwing our existing protocol out the > window or relegating it to the status of a second-class citizen would > be foolish. Right, I can't imagine it happening. And I wouldn't really be keen to add an alternative protocol either. I could imagine a client which presented a SPDY interface to the world and translated it into standard calls, possibly via libpq. It's well to remember that we are not a green fields project here. > HTTP is a non-trivial protocol that tends to impose lots > of escaping and de-escaping overhead which is unnecessary for people > who just want to connect to the database and run queries. I can > completely understand that someone might want the ability to do GET > /db/table/pk and have that return an answer very, very quickly, by > bypassing the usual parser and planner and just firing off an > index-scan and returning the results as JSON or somesuch. But I think > it would be a serious mistake to assume that GET /q?q=myquery is going > to come out better than what we have now in the general case. Indeed. cheers andrew
On Fri, Mar 30, 2012 at 9:11 AM, Andrew Dunstan <andrew@dunslane.net> wrote: > > > On 03/30/2012 11:41 AM, Robert Haas wrote: >> >> On Fri, Mar 30, 2012 at 10:55 AM, Dobes Vandermeer<dobesv@gmail.com> >> wrote: >>> >>> Well, in our case HTTP is a clear win (but not replacement) and SPDY a >>> potential one (even as a replacement). Even if SPDY is not widely >>> adopted >>> it could still replace FEBE if there's a clear advantage to using it, I >>> don't know enough to make the call right now. >> >> I can see that there are some advantages to having an HTTP interface >> to the database, but I think throwing our existing protocol out the >> window or relegating it to the status of a second-class citizen would >> be foolish. > > > > Right, I can't imagine it happening. And I wouldn't really be keen to add an > alternative protocol either. > > I could imagine a client which presented a SPDY interface to the world and > translated it into standard calls, possibly via libpq. Any enhancement here that can't be used with libpq via, say, drop-in .so seems unworkable to me, and that's why any solution that is basically proxying to the database is basically a non-starter outside the very earliest prototyping stages. The tuple scanning and protocol semantics can and even should remain the same, especially at first. What I really want is for a mechanism to push FEBE messages down into a transport in a extendable way (not unlike SSL, except this time sensitive to FEBE message boundaries), taking unmodified libpq binaries (but with a hook, of course...) into consideration. > It's well to remember that we are not a green fields project here. Perhaps not, which is a good reason to not put cart before horse -- I've tried to be very careful to suggest that real useful caching/proxy/routing software needs to appear that can work, unmodified, with hypothetical FEBE-over-SPDY communications, and accomplish some interesting use cases, and also very careful to suggest this is a *transport*-level feature. It think would be fair to demand: "Okay, when Apache or NGINX or a credible embedded web server in some language can accomplish something useful for us if we absorb this complexity, let's put this on the table". To have to write continuously write poolers, cachers and redirectors (for any reason: HA, partitioning, or even just clean-looking domain names...) is a burden on community resources. The criteria for success is to expand the number of projects attacking these problems and decrease the cost of writing one, and I think demanding a straw-man implementation of, say, a query cacher or pooler and seeing how that implementation feels is not unreasonable along with accepting the burden of maintenance. Robert Haas Wrote: >> HTTP is a non-trivial protocol that tends to impose lots >> of escaping and de-escaping overhead which is unnecessary for people >> who just want to connect to the database and run queries. I can >> completely understand that someone might want the ability to do GET >> /db/table/pk and have that return an answer very, very quickly, by >> bypassing the usual parser and planner and just firing off an >> index-scan and returning the results as JSON or somesuch. But I think >> it would be a serious mistake to assume that GET /q?q=myquery is going >> to come out better than what we have now in the general case. I don't think this holds for SPDY, which is the only way I even entertain thoughts of doing this. I want FEBE, at least at first and probably even then for quite a long while, just with framing and headers (during start-up) that other software will understand. Control frames are like this: +----------------------------------+ |C| Version(15bits) | Type(16bits) | +----------------------------------+ | Flags (8) | Length (24 bits) | +----------------------------------+ | Data | +----------------------------------+ Data frames are like this: +----------------------------------+ |C| Stream-ID (31bits) | +----------------------------------+ | Flags (8) | Length (24 bits) | +----------------------------------+ | Data | +----------------------------------+ The (XML, not HTML, I read this with "view source", or try "curl") RFC draft is here: http://mbelshe.github.com/SPDY-Specification/draft-mbelshe-spdy-00.xml -- fdr
On Fri, Mar 30, 2012 at 10:21 AM, Daniel Farina <daniel@heroku.com> wrote: > Any enhancement here that can't be used with libpq via, say, drop-in > .so seems unworkable to me, and that's why any solution that is > basically proxying to the database is basically a non-starter outside > the very earliest prototyping stages. The tuple scanning and protocol > semantics can and even should remain the same, especially at first. I should add: proxying could work well if libpq had all the right hooks. The server could remain ignorant. Regardless, upstream changes result. -- fdr
On Sat, Mar 31, 2012 at 1:44 AM, Daniel Farina <daniel@heroku.com> wrote:
 
Just to be clear, what you are saying that writing a process that accepts requests by HTTP and translates them into requests using the existing protocol to send to the server would have unacceptable performance?  Or is there something else about it that is a non-starter?On Fri, Mar 30, 2012 at 10:21 AM, Daniel Farina <daniel@heroku.com> wrote:I should add: proxying could work well if libpq had all the right
> Any enhancement here that can't be used with libpq via, say, drop-in
> .so seems unworkable to me, and that's why any solution that is
> basically proxying to the database is basically a non-starter outside
> the very earliest prototyping stages. The tuple scanning and protocol
> semantics can and even should remain the same, especially at first.
hooks. The server could remain ignorant. Regardless, upstream changes
result.
On Sat, Mar 31, 2012 at 6:37 AM, Dobes Vandermeer <dobesv@gmail.com> wrote: > On Sat, Mar 31, 2012 at 1:44 AM, Daniel Farina <daniel@heroku.com> wrote: >> >> On Fri, Mar 30, 2012 at 10:21 AM, Daniel Farina <daniel@heroku.com> wrote: >> > Any enhancement here that can't be used with libpq via, say, drop-in >> > .so seems unworkable to me, and that's why any solution that is >> > basically proxying to the database is basically a non-starter outside >> > the very earliest prototyping stages. The tuple scanning and protocol >> > semantics can and even should remain the same, especially at first. >> >> I should add: proxying could work well if libpq had all the right >> hooks. The server could remain ignorant. Regardless, upstream changes >> result. > > > Just to be clear, what you are saying that writing a process that accepts > requests by HTTP and translates them into requests using the existing > protocol to send to the server would have unacceptable performance? Or is > there something else about it that is a non-starter? I don't think it's so much an unworkable performance regression as getting people to seriously try experimenting with useful extensions to FEBE, and so that real applications (and their maintainers) can justify a little time to test and experience (for good or for ill) those enhancements in a production or at least staging setting and giving them a fair shake. As a reminder, this is based on a conjecture that there is a large dimension of useful experimentation that involves simply interlacing information between FEBE messages, and then intercepting and processing those on both server and client, and that these message can be composed in many situations (i.e. multiple extensions can work well). For me, HTTP2 figures into all of this because it may be one way to paint some aspects of the protocol-extension bikeshed with the same color more people might use, and as long as that color is basically functional we can seek to understand if a standard bikeshed-color allows us to take advantage of anticipated large, low-cost reserves of paint. Consider this analogy stretched. -- fdr
> While I was doing this I always thought this would have been a better
> approach for my previous project, an accounting application.  If I could
> just have stored entities like invoice & customer as a single document that
> is inserted, updated, etc. atomically it would be a lot simpler and faster
> than having to break things out into columns and rows spread over various
> tables. 
Actually, an accounting application is the *worst* candidate for
document-oriented storage.  For example, it's pretty normal for
customers to have multiple invoices, and for each invoice to have
multiple line-items which relate to inventory items.  Without relational
structure, answering questions like "Do I have enough green shoes in
stock to cover pending orders within the next 30 days in Texas?" goes
from a relatively simple query to a major exercise in frustration.  And
it's not possible to implement required business constraints ("A
customers balance may not go below zero") at all in a document database.
There's strong arguments to be made for document-structured storage for
some applications.  But RDBMS covers financial applications very well;
it's what RDBMS was designed for in the first place.
> Some fields are just "information" that isn't searched for or used
> relationally but ends up getting a field (or a whole bunch of fields, like
> your address) in the DB schema anyway.  Adding, removing, and changing
> columns is always scary to me in the SQL database - when there's no schema
> you can fix these things as part of the definition of your "view" instead.
Yes, having "documents" *in addition* to relational data gives you the
best of both worlds.  You can use relational structures to store data
which is well-defined and business-critical, and document structures to
store data which is undefined and not critical.
> So I kind of think the document database kind of bridges the gap between an
> OODBMS and the RDBMS because the document is like a little cluster of
OODBMS != DocumentDB
-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
			
		On Wed, Apr 4, 2012 at 6:26 AM, Josh Berkus <josh@agliodbs.com> wrote:
  
			
		Actually, an accounting application is the *worst* candidate for
> While I was doing this I always thought this would have been a better
> approach for my previous project, an accounting application. If I could
> just have stored entities like invoice & customer as a single document that
> is inserted, updated, etc. atomically it would be a lot simpler and faster
> than having to break things out into columns and rows spread over various
> tables.
document-oriented storage.
I guess I didn't go enough into detail.  As it's a small business bookkeeping system the records are added after the fact.  Constraint checking isn't a priority; we would allow someone to enter sales before purchases and things like that which means the constraint checking has to be more about flagging issues (we didn't get around to that yet, either).  It wasn't an ERP and didn't support inventory so there's no worry about searching for inventory counts in particular locations.  The idea is to input source documents like invoices & receipts and generate reports for stakeholders.
I think there is something to be gained by having a first-class concept of a "document" in the database.  It might save some trouble managing parent/child relations, versioning, things like that.
I hand-craft some materialized views back then, too, since the conversion from a document (like an invoice) to the actual impact of that on account ledgers and balances was non-trivial and evolving as the feature set expanded, so it wasn't something you wanted to try and build into your reporting queries.
Yes, having "documents" *in addition* to relational data gives you the
best of both worlds. You can use relational structures to store data
which is well-defined and business-critical, and document structures to
store data which is undefined and not critical.
Well that's exactly what I was trying to get at in the first place :-).  I'd love to see this kind of functionality in PostgreSQL and I think materialized views are a pretty powerful way to do that when you are automatically pulling fields out of the document to make the relational tables.
> So I kind of think the document database kind of bridges the gap between anOODBMS != DocumentDB
> OODBMS and the RDBMS because the document is like a little cluster of
Yes, I know.  I was just saying that a document DB is a bit more OO because the document itself is stored as an object graph rather than just tuples.  Thus it fits in between RDBMS and OODBMS in a way.  It makes sense in my head somehow, no need to agree with me on this one.
Regards,
Dobes
On Wed, Apr 4, 2012 at 9:53 AM, Dobes Vandermeer <dobesv@gmail.com> wrote: > I think there is something to be gained by having a first-class concept of a > "document" in the database. It might save some trouble managing > parent/child relations, versioning, things like that. Methinks this needs a *lot* more specific description of what you mean by "document." The thought that is occurring to me in this context is that the "document" is simply an image (.png, .jpeg, .pdf) of a paper document which might get associated with some of the 'business transactions' in the database. Thus, I'd be happy to be able to capture images of invoices, receipts, and such, and associate them with the highly structured data for the accounting transactions that they are associated with. I'm not sure that this is the same thing that you are thinking of. I suspect that you might be thinking of a "document" as being a "loosely structured set of data". Though with the similarity that such documents would get associated with the highly structured accounting transaction data that they relate to. It's not a ludicrously bad idea to have a series of supplementary data tables that can get tied to transactions... create table supplementary_bitmap ( h_id serial primary key, created_on timestamptz not null default now(), metadata text not null, bitmap bytea ); create table supplementary_xml ( x_id serial primary key, created_on timestamptz not null default now(), metadata text not null, data xml ); create table supplementary_hstore ( hs_id serial primary key, created_on timestamptz not null default now(), metadata text not null, data hstore ); And add some optional references to these to some of your tables. That doesn't notably lend itself to doing a lot of work with the relationships between bits of supplementary data. There's not much that I *can* do if I'm attaching images of pictures I took of invoices with my phone; there's not much about that that's amenable to further automatic analysis. It's still pretty useful, if someone wants some proof that there was an invoice; I can produce a copy that's tied to the transaction. That's rather less than "OODBMS" :-). -- When confronted by a difficult problem, solve it by reducing it to the question, "How would the Lone Ranger handle this?"