Thread: Shorter archive URLs

Shorter archive URLs

From
Magnus Hagander
Date:
We've been discussing shortening the archives URLs for some time, but never got around to actually doing it.

I've now got a preliminary patch going, that basically replaces the message-id portion of the URL with a sha1 hash (base64 encoded) of the message-id.

This brings the messageid part down to 28 characters. Looking at the current archives, that will make 92% of the URLs shorter and 6% longer than they are today. Looking at just 2018 and 2019, the numbers are 99% shorter and 0.5% longer (only 429 messages in total get longer).

(The average messageid length has consistently increased from 39 characters in 1996, through 41 in 2006 and 51 in 2016 and on -- probably much thanks to gmail)
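As a rough illustration of the length comparison above, here is a minimal sketch. The helper names hashed_id and url_part_lengths are hypothetical, and it assumes the current URLs percent-encode "@" as "%40"; how the percentages above were actually counted is not stated in this thread.

```python
import base64
import hashlib
from urllib.parse import quote


def hashed_id(message_id):
    """URL-safe base64 of the raw SHA-1 digest of the message-id."""
    digest = hashlib.sha1(message_id.encode("utf-8")).digest()  # 20 raw bytes
    return base64.urlsafe_b64encode(digest).decode("ascii")     # always 28 chars


def url_part_lengths(message_id):
    """Return (current length, hashed length) of the message-id URL part.

    Assumption: the current URL form percent-encodes "@" and other
    reserved characters.
    """
    current = quote(message_id, safe="")
    return len(current), len(hashed_id(message_id))


if __name__ == "__main__":
    mid = "CABUevEyqGVV-s1yXQBsTpoPDCHy79j-yDtJcucrPb9Hh4CFTNg@mail.gmail.com"
    print(url_part_lengths(mid))
```

A SHA-1 digest is always 20 bytes, so the base64 form is always 28 characters (including one "=" of padding), regardless of how long the message-id is.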

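This means that instead of being:
https://www.postgresql.org/message-id/CABUevEyqGVV-s1yXQBsTpoPDCHy79j-yDtJcucrPb9Hh4CFTNg%40mail.gmail.com

The url would be:
https://www.postgresql.org/message-id/Z0oaTfo56bV4tke6-r_PKJstHF8=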

We could of course also use the internal surrogate key that each message has and make it even shorter, but if we expose that then it becomes a lot harder to change the underlying representation. (we'd need big annoying mapping tables like we still carry from the old implementation). Using the sha hash means we can still generate the URL just from the contents of the message itself -- which also means that the URLs can be generated offline for those that prefer to.

Any access to the old message-id based URLs would automatically receive a permanent redirect to the new ones.

The postgr.es redirector would keep working, and accept both old and new style URLs (the old ones would just cause a double redirect step).
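The redirect behaviour described above might look roughly like the sketch below. This is not the actual patch: the function route and the rule "a part containing @ is an old-style message-id" are assumptions made for illustration only.

```python
import base64
import hashlib
from urllib.parse import unquote

BASE = "https://www.postgresql.org/message-id/"


def hashed_id(message_id):
    """URL-safe base64 of the raw SHA-1 digest of the message-id."""
    digest = hashlib.sha1(message_id.encode("utf-8")).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii")


def route(url_part):
    """Return (status, location) for the last path component of a request.

    Assumption: old-style parts contain "@" (possibly written "%40");
    anything else is treated as an already-hashed part.
    """
    decoded = unquote(url_part)
    if "@" in decoded:
        # Old-style URL: permanent redirect to the hashed form.
        return 301, BASE + hashed_id(decoded)
    # New-style URL: serve as-is.
    return 200, BASE + url_part
```

Under that assumption, both "@" and "%40" spellings of an old URL end up at the same new location.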

I think the only thing we'd really lose is the ability to "determine from the url who sent an email", which in itself only works for some people, but definitely does work for some who send a lot of mail (hi, Tom!).

Thoughts on this? Do people think that's a big enough win to go for?

Re: Shorter archive URLs

From
Tom Lane
Date:
Magnus Hagander <magnus@hagander.net> writes:
> This means that instead of being:
> https://www.postgresql.org/message-id/CABUevEyqGVV-s1yXQBsTpoPDCHy79j-yDtJcucrPb9Hh4CFTNg%40mail.gmail.com
> The url would be:
> https://www.postgresql.org/message-id/Z0oaTfo56bV4tke6-r_PKJstHF8=

FWIW, I don't care for that one bit. Yeah, message IDs are pretty
opaque in many cases, but at least they're not designed and built to
be opaque.  An example of what would be lost is the ability to find
a message given one of these URLs in any other archive, such as one's
personal mail archive.  (Unless one sets up a mapping table to match
this transform, which would be a big PITA.)
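For illustration, such a mapping table for a personal archive could be built along the lines below. This is a sketch only: the helper names are hypothetical, it assumes the hash is the URL-safe base64 of the raw SHA-1 of the message-id without angle brackets, and it assumes the local archive is in mbox format.

```python
import base64
import hashlib
import mailbox


def hashed_id(message_id):
    """URL-safe base64 of the raw SHA-1 digest of the message-id."""
    digest = hashlib.sha1(message_id.encode("utf-8")).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii")


def message_ids_from_mbox(path):
    """Yield message-ids (without angle brackets) from a local mbox file."""
    for msg in mailbox.mbox(path):
        mid = msg["Message-ID"]
        if mid:
            yield mid.strip().strip("<>")


def build_lookup(message_ids):
    """Map each hashed URL part back to the message-id it came from."""
    return {hashed_id(mid): mid for mid in message_ids}
```

The table would have to be rebuilt or extended whenever new mail arrives, which is the maintenance burden referred to above.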

> Any access to the old message-id based URLs would automatically receive a
> permanent redirect to the new ones.

If you remove the ability to find a message in the archive from its
original message ID, I will be REALLY unhappy, because that will break
lookups in the other direction (ie, given a message in my local files,
go find it --- and its thread --- in the PG archives).

On the whole I don't see any good reason to change this.

            regards, tom lane



Re: Shorter archive URLs

From
Alvaro Herrera
Date:
On 2019-Jul-14, Tom Lane wrote:

> Magnus Hagander <magnus@hagander.net> writes:
> > This means that instead of being:
> > https://www.postgresql.org/message-id/CABUevEyqGVV-s1yXQBsTpoPDCHy79j-yDtJcucrPb9Hh4CFTNg%40mail.gmail.com
> > The url would be:
> > https://www.postgresql.org/message-id/Z0oaTfo56bV4tke6-r_PKJstHF8=
> 
> FWIW, I don't care for that one bit. Yeah, message IDs are pretty
> opaque in many cases, but at least they're not designed and built to
> be opaque.  An example of what would be lost is the ability to find
> a message given one of these URLs in any other archive, such as one's
> personal mail archive.  (Unless one sets up a mapping table to match
> this transform, which would be a big PITA.)

+1

> > Any access to the old message-id based URLs would automatically receive a
> > permanent redirect to the new ones.
> 
> If you remove the ability to find a message in the archive from its
> original message ID, I will be REALLY unhappy, because that will break
> lookups in the other direction (ie, given a message in my local files,
> go find it --- and its thread --- in the PG archives).

Me too.

> On the whole I don't see any good reason to change this.

+1

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Shorter archive URLs

From
Magnus Hagander
Date:


On Sun, Jul 14, 2019 at 6:35 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Magnus Hagander <magnus@hagander.net> writes:
> > This means that instead of being:
> > https://www.postgresql.org/message-id/CABUevEyqGVV-s1yXQBsTpoPDCHy79j-yDtJcucrPb9Hh4CFTNg%40mail.gmail.com
> > The url would be:
> > https://www.postgresql.org/message-id/Z0oaTfo56bV4tke6-r_PKJstHF8=
>
> FWIW, I don't care for that one bit. Yeah, message IDs are pretty
> opaque in many cases, but at least they're not designed and built to
> be opaque.  An example of what would be lost is the ability to find
> a message given one of these URLs in any other archive, such as one's
> personal mail archive.  (Unless one sets up a mapping table to match
> this transform, which would be a big PITA.)

You mean going from an URL into actually finding the message, without looking at the actual archives site? Yeah, that wouldn't work. With access to the website, the message-id is right there of course.

But that's true, that's definitely a "lost feature" I didn't think of.


> > Any access to the old message-id based URLs would automatically receive a
> > permanent redirect to the new ones.
>
> If you remove the ability to find a message in the archive from its
> original message ID, I will be REALLY unhappy, because that will break
> lookups in the other direction (ie, given a message in my local files,
> go find it --- and its thread --- in the PG archives).

Oh, we will absolutely not do that. There's a reason we're still keeping redirects in place for the old style archives URL, which we stopped using in 2012...


> On the whole I don't see any good reason to change this.

It is something that's fairly frequently requested, because they look bad. For one thing, it's regularly mentioned when discussing commit messages, because the "discussions:" links tend to wrap...




Re: Shorter archive URLs

From
Bruce Momjian
Date:
On Sun, Jul 14, 2019 at 12:52:46PM +0200, Magnus Hagander wrote:
> We've been discussing shortening the archives URLs for some time, but never got
> around to actually doing it.
> 
> I've now got a preliminary patch going, that basically replaces the message-id
> portion of the URL with a sha1 hash (base64 encoded) of the message-id.
> 
> This brings the messageid part down to 28 characters. Looking at the current
> archives, that will make 92% of the URLs shorter and 6% longer than they are
> today. Looking at just 2018 and 2019, the number is 99% shorter and 0.5% longer
> (only 429 messages in total get longer).
> 
> (The average messageid length has consistently increased from 39 characters in
> 1996, through 41 in 2006 and 51 in 2016 and on -- probably much thanks to
> gmail)
> 
> This means that instead of being:
> https://www.postgresql.org/message-id/
> CABUevEyqGVV-s1yXQBsTpoPDCHy79j-yDtJcucrPb9Hh4CFTNg%40mail.gmail.com
> 
> The url would be:
> https://www.postgresql.org/message-id/Z0oaTfo56bV4tke6-r_PKJstHF8=

It would be nice if I could easily compute the hash if I know the
message-id --- I assume I can just run it through sha1.  This would
allow me to shorten commit URLs, which would be a win for GMail.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: Shorter archive URLs

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> On Sun, Jul 14, 2019 at 12:52:46PM +0200, Magnus Hagander wrote:
>> This means that instead of being:
>> https://www.postgresql.org/message-id/
>> CABUevEyqGVV-s1yXQBsTpoPDCHy79j-yDtJcucrPb9Hh4CFTNg%40mail.gmail.com
>> 
>> The url would be:
>> https://www.postgresql.org/message-id/Z0oaTfo56bV4tke6-r_PKJstHF8=

> It would be nice if I could easily compute the hash if I know the
> message-id --- I assume I can just run it through sha1.  This would
> allow me to shorten commit URLs, which would be a win for GMail.

Now that I look closer, Magnus' example shows that this proposal
is underspecified: exactly how would the message-ID be rendered
before being fed into sha1?  In particular it's not clear from
this whether "@" should be spelled "@" or "%40".  The existing
archive website is quite forgiving about that, you can write
either --- but the sha1 transform would be utterly unforgiving.
Instead of opaque hash X you'd get opaque hash Y, and there'd
be no way even to see what caused the mismatch.
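The mismatch described above is easy to demonstrate. The sketch below also shows one possible way to remove the ambiguity, namely percent-decoding the URL part before hashing. That normalization step is an assumption for illustration, not something the proposal specifies, and the helper names are hypothetical.

```python
import base64
import hashlib
from urllib.parse import unquote


def hashed_id(text):
    """URL-safe base64 of the raw SHA-1 digest of the given text."""
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii")


def normalized_hashed_id(url_part):
    """Percent-decode first, so "%40" and "@" hash identically.

    Caveat: a message-id that legitimately contains a literal "%"
    sequence would be altered by this step.
    """
    return hashed_id(unquote(url_part))


if __name__ == "__main__":
    plain = "CABUevEyqGVV-s1yXQBsTpoPDCHy79j-yDtJcucrPb9Hh4CFTNg@mail.gmail.com"
    encoded = plain.replace("@", "%40")
    print(hashed_id(plain) == hashed_id(encoded))                        # False
    print(normalized_hashed_id(plain) == normalized_hashed_id(encoded))  # True
```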

(BTW, after some experimentation I'm totally unable to reproduce
Magnus' example using sha1sum(1) and base64(1), so that is not
the only underspecified point here.)

            regards, tom lane



Re: Shorter archive URLs

From
Magnus Hagander
Date:


On Tue, Jul 16, 2019 at 5:49 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > On Sun, Jul 14, 2019 at 12:52:46PM +0200, Magnus Hagander wrote:
> >> This means that instead of being:
> >> https://www.postgresql.org/message-id/
> >> CABUevEyqGVV-s1yXQBsTpoPDCHy79j-yDtJcucrPb9Hh4CFTNg%40mail.gmail.com
> >>
> >> The url would be:
> >> https://www.postgresql.org/message-id/Z0oaTfo56bV4tke6-r_PKJstHF8=
>
> > It would be nice if I could easily compute the hash if I know the
> > message-id --- I assume I can just run it through sha1.  This would
> > allow me to shorten commit URLs, which would be a win for GMail.
>
> Now that I look closer, Magnus' example shows that this proposal
> is underspecified: exactly how would the message-ID be rendered
> before being fed into sha1?  In particular it's not clear from
> this whether "@" should be spelled "@" or "%40".  The existing
> archive website is quite forgiving about that, you can write
> either --- but the sha1 transform would be utterly unforgiving.
> Instead of opaque hash X you'd get opaque hash Y, and there'd
> be no way even to see what caused the mismatch.

It should always be @. The %40 is a side effect of @ not being allowed in a URL.

 

> (BTW, after some experimentation I'm totally unable to reproduce
> Magnus' example using sha1sum(1) and base64(1), so that is not
> the only underspecified point here.)

The problem is that sha1sum generates a hex version of the sum, not the binary version. You also need to be careful about the newlines.
How I've done it is simply (in python):

>>> import hashlib, base64
>>> base64.urlsafe_b64encode(hashlib.sha1(b'CABUevEyqGVV-s1yXQBsTpoPDCHy79j-yDtJcucrPb9Hh4CFTNg@mail.gmail.com').digest())
b'Z0oaTfo56bV4tke6-r_PKJstHF8='


We could use a hex digest instead of a base64 of course, but that would make the URLs longer.
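The pitfalls mentioned above (hex versus binary digest, and stray newlines) can be checked with a short sketch. The variable names are illustrative only; the message-id is the one from the example in this thread.

```python
import base64
import hashlib

MESSAGE_ID = b"CABUevEyqGVV-s1yXQBsTpoPDCHy79j-yDtJcucrPb9Hh4CFTNg@mail.gmail.com"

raw = hashlib.sha1(MESSAGE_ID).digest()          # 20 raw bytes
hex_text = hashlib.sha1(MESSAGE_ID).hexdigest()  # 40 hex characters, the form sha1sum prints

# Base64 of the raw digest: the 28-character form used in the proposal.
from_raw = base64.urlsafe_b64encode(raw)

# Base64 of the hex text: what piping sha1sum output into base64 encodes.
from_hex = base64.urlsafe_b64encode(hex_text.encode("ascii"))

# A trailing newline (as added by a plain "echo") changes the digest entirely.
with_newline = base64.urlsafe_b64encode(hashlib.sha1(MESSAGE_ID + b"\n").digest())

print(len(from_raw), len(hex_text), len(from_hex))
print(from_raw == with_newline)
```

The hex form is 40 characters against 28 for base64, which is the length trade-off referred to above.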

(FWIW, I'm not wedded to making this change -- that's why I posted here first -- this is just explaining how it was actually calculated)
 

Re: Shorter archive URLs

From
Andres Freund
Date:
Hi,

On 2019-07-14 18:46:28 +0200, Magnus Hagander wrote:
> On Sun, Jul 14, 2019 at 6:35 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Magnus Hagander <magnus@hagander.net> writes:
> > > This means that instead of being:
> > >
> > https://www.postgresql.org/message-id/CABUevEyqGVV-s1yXQBsTpoPDCHy79j-yDtJcucrPb9Hh4CFTNg%40mail.gmail.com
> > > The url would be:
> > > https://www.postgresql.org/message-id/Z0oaTfo56bV4tke6-r_PKJstHF8=
> >
> > FWIW, I don't care for that one bit. Yeah, message IDs are pretty
> > opaque in many cases, but at least they're not designed and built to
> > be opaque.  An example of what would be lost is the ability to find
> > a message given one of these URLs in any other archive, such as one's
> > personal mail archive.  (Unless one sets up a mapping table to match
> > this transform, which would be a big PITA.)
> >
> 
> You mean going from an URL into actually finding the message, without
> looking at the actual archives site? Yeah, that wouldn't work. With access
> to the website, the message-id is right there of course.

The ability to do that is crucial for me as well.


> It is something that's fairly frequently requested, because they look bad.
> For one thing, it's regularly mentioned when discussing commit messages,
> because the "discussions:" links tend to wrap...

I kind of don't buy that that's a real problem, fwiw. It's not like one
has to read them to the end all the time. And if they wrap into the next
line, then that's fine too? They're not at the start of commit messages,
after all.

Greetings,

Andres Freund



Re: Shorter archive URLs

From
Magnus Hagander
Date:


On Tue, Jul 16, 2019 at 9:30 PM Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2019-07-14 18:46:28 +0200, Magnus Hagander wrote:
> > On Sun, Jul 14, 2019 at 6:35 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > > Magnus Hagander <magnus@hagander.net> writes:
> > > > This means that instead of being:
> > > >
> > > https://www.postgresql.org/message-id/CABUevEyqGVV-s1yXQBsTpoPDCHy79j-yDtJcucrPb9Hh4CFTNg%40mail.gmail.com
> > > > The url would be:
> > > > https://www.postgresql.org/message-id/Z0oaTfo56bV4tke6-r_PKJstHF8=
> > >
> > > FWIW, I don't care for that one bit. Yeah, message IDs are pretty
> > > opaque in many cases, but at least they're not designed and built to
> > > be opaque.  An example of what would be lost is the ability to find
> > > a message given one of these URLs in any other archive, such as one's
> > > personal mail archive.  (Unless one sets up a mapping table to match
> > > this transform, which would be a big PITA.)
> > >
> >
> > You mean going from an URL into actually finding the message, without
> > looking at the actual archives site? Yeah, that wouldn't work. With access
> > to the website, the message-id is right there of course.
>
> The ability to do that is crucial for me as well.
>
>
> > It is something that's fairly frequently requested, because they look bad.
> > For one thing, it's regularly mentioned when discussing commit messages,
> > because the "discussions:" links tend to wrap...
>
> I kind of don't buy that that's a real problem, fwiw. It's not like one
> has to read them to the end all the time. And if they wrap into the next
> line, then that's fine too? They're not at the start of commit messages,
> after all.

I think it's clear that the majority opinion here is that we don't want this change, and I will thus not complete the patch.  (And reference back to this discussion the next time somebody asks for it -- because it now has some good explanations on why)
