Обсуждение: serial + db key, or guid?

Поиск
Список
Период
Сортировка

serial + db key, or guid?

От
Mark Phillips
Дата:
Given four instances of posgres, each with a database, each instance receiving new data, and desiring a data “merge” a
laBDR or similar multiple database solutions, my team has been discussing the pros and cons of generating unique keys
ineach table. 

1. create a unique “database” id for each database, add a column to all tables for this id value and combine that with
aserial id 
2. use guid from pg functions
3. create id ranges for each database, e.g. db1 gets 1 to 1M, db2 gets 1M+1 to 2M, and so on

All get us to the finish line, but each has its drawbacks.

Advice, cautionary tales, suggestions and such will be warmly received.

 - Mark


Re: serial + db key, or guid?

От
Christophe Pettus
Дата:

> On Aug 10, 2020, at 15:19, Mark Phillips <mark.phillips@mophilly.com> wrote:
> Advice, cautionary tales, suggestions and such will be warmly received.

Here's one solution a company found for this; it seems to work very well:

    https://instagram-engineering.com/sharding-ids-at-instagram-1cf5a71e5a5c

--
-- Christophe Pettus
   xof@thebuild.com




Re: serial + db key, or guid?

От
Mark Phillips
Дата:
Thank you for the reply.

The article is a good one. I posed the question on the chance things had evolved since 2012, specifically as it relates
topostgres. 

> On Aug 10, 2020, at 3:21 PM, Christophe Pettus <xof@thebuild.com> wrote:
>
>
>
>> On Aug 10, 2020, at 15:19, Mark Phillips <mark.phillips@mophilly.com> wrote:
>> Advice, cautionary tales, suggestions and such will be warmly received.
>
> Here's one solution a company found for this; it seems to work very well:
>
>     https://instagram-engineering.com/sharding-ids-at-instagram-1cf5a71e5a5c
>
> --
> -- Christophe Pettus
>   xof@thebuild.com
>




Re: serial + db key, or guid?

От
Christophe Pettus
Дата:

> On Aug 11, 2020, at 09:37, Mark Phillips <mark.phillips@mophilly.com> wrote:
>
> I posed the question on the chance things had evolved since 2012, specifically as it relates to postgres.

The essentials haven't changed.  Keys (such as UUIDs, especially UUID v4) that have most of their randomness in the
mostsignificant bits can cause significant cache hit problems on large indexes.  128 bit keys are usually overkill for
mostapplications, unless you need actual *global* uniqueness across more than a single database or installation; 64 bit
keysare usually sufficient. 

UUIDs (and similar very large random keys) do have the advantage that they are somewhat self-secure: You can expose
themto outsiders without having to worry about other keys being guessable. 
--
-- Christophe Pettus
   xof@thebuild.com




Re: serial + db key, or guid?

От
Christopher Browne
Дата:
On Tue, 11 Aug 2020 at 12:40, Christophe Pettus <xof@thebuild.com> wrote:
> On Aug 11, 2020, at 09:37, Mark Phillips <mark.phillips@mophilly.com> wrote:
>
> I posed the question on the chance things had evolved since 2012, specifically as it relates to postgres.

The essentials haven't changed.  Keys (such as UUIDs, especially UUID v4) that have most of their randomness in the most significant bits can cause significant cache hit problems on large indexes.  128 bit keys are usually overkill for most applications, unless you need actual *global* uniqueness across more than a single database or installation; 64 bit keys are usually sufficient.

Thus, if performance is highly significant (e.g. - there's lots of data in the table, and it is heavily read/written) then it may be tempting to use a sequential value instead because that can be smaller, faster to compare, and won't trash caches (e.g. - with UUIDs, seemingly adjacent data will be spread wider across indexes and will need more cache accesses to get to the data).

If the table is small, or data is infrequently queried/updated, these costs may be irrelevant.
 
UUIDs (and similar very large random keys) do have the advantage that they are somewhat self-secure: You can expose them to outsiders without having to worry about other keys being guessable.

Not overly obvious from this; the "not guessable" part comes in that chronologically adjacent records won't have any apparent similarity.

With serially assigned transaction IDs, if you, as a user, buy something, and discover that your transaction ID was 1460795, you might well guess that other recent purchases were on transactions 1460794, 1460793, and such, and maybe get at someone else's data by messing with a web URL or such.   Whereas, here's 5 uuids I just generated (dbus-uuidgen isn't generating RFC 4122 compliant values, but in context of a little illustration, who cares?)
$ for i in 1 2 3 4 5; do
\ dbus-uuidgen
\ done
0ff745301515c646498cd1165f32cc6e
a9ca459ab6330f24d24af5095f32cc6e
b1cff235d77b1f4d8504920a5f32cc6e
58773af20b34b3c550f4eebf5f32cc6e
f9a13ce961b28751b102c5545f32cc6e

There are some identical low-order bits, but they are pretty well hidden by the high-order stuff.

It's somewhat cache-destroying, but not especially well guessable.

There is something which has evolved since 2012; see <https://www.postgresql.org/docs/9.5/uuid-ossp.html>, notably
F.43.2. Building uuid-ossp

It used to be that adding UUID generator support required the OSSP library, which was sufficiently inconvenient that this would often not be built-in.  Since 9.4 (released in 2014), uuid-ossp can use common built-in libraries on Linux, OSX, BSD systems, so it's more likely that it will be included "out of the box" on package-managed deployments of PostgreSQL.
--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"

Re: serial + db key, or guid?

От
Adam Brusselback
Дата:
I mentioned this in another email thread yesterday about a similar topic, but I'd highly suggest if you do go the UUID route, do not use the standard UUID generation functions, they all suck for database use (v1 also sucks).


I don't mind having a time component correlated with my UUID's because it's simply not a threat model that matters for my use case, so I use the time based variant. It helped me immensely with FPW and 
write amplification when I switched from V4 UUIDs. It is still not as fast as an int, but it is much much better than random UUIDs.

 

Re: serial + db key, or guid?

От
Mark Phillips
Дата:
Thanks to everyone who replied. All helpful. I learned and have new ideas to work with.

On Aug 11, 2020, at 10:42 AM, Adam Brusselback <adambrusselback@gmail.com> wrote:

I mentioned this in another email thread yesterday about a similar topic, but I'd highly suggest if you do go the UUID route, do not use the standard UUID generation functions, they all suck for database use (v1 also sucks).


I don't mind having a time component correlated with my UUID's because it's simply not a threat model that matters for my use case, so I use the time based variant. It helped me immensely with FPW and 
write amplification when I switched from V4 UUIDs. It is still not as fast as an int, but it is much much better than random UUIDs.