Обсуждение: random() generates collisions too early

Поиск
Список
Период
Сортировка

random() generates collisions too early

От
Honza Horak
Дата:
Hi guys,

after playing a bit with "select random();", it turned out that numbers
get repeated quite early in a sequence. Originally I set lower PID range
(echo "2048" >/proc/sys/kernel/pid_max), but it doesn't seem to affect
the results.

So, what I observed... First, I generated a set including 1000 randomly
generated numbers without setting a seed.

   touch numbers
   for i in {1..1000} ; do
     echo "select random();"|psql|head -n 3|tail -n 1 >>numbers
   done

Then, I continued in generating random numbers and tried to find the new
one in the set:

   for i in {1..10000} ; do
     if grep `echo "select random();"|psql|head -n 3|tail -n 1` numbers
; then
       echo "SUCCESS: $i" ; break
     fi
   done

To my surprise I'm able to find a collision very quickly, in first 1000
numbers usually.

Originally, I used psql calls to get random() value from different
processes on purpose, but it seems Noah got similar results when
random() is called in one process:

On 10/18/2013 02:10 AM, Noah Misch wrote:
 > sudo sysctl -w kernel.pid_max=2048
 > psql -c 'create unlogged table samp(c float8)'
 > for n in `seq 1 200000`; do psql -qc 'insert into samp values
(random())'; done
 >
 > The results covered only 181383 distinct values, and 68 values
repeated four
 > or five times each.  We should at least consider using a
higher-entropy seed.

As I was told this is not taken as a security issue, since random() is
not considered as a CSPRNG in any case, but as Noah said, we should
probably try to make it a bit better.

Also, I'd suggest to state explicitly in the doc, that random()
shouldn't be taken as CSPRNG, since I can imagine people blindly
believing that random() can be good enough for such use cases, just
because they see how many possible values they get from double-precision
type:
http://www.postgresql.org/docs/9.3/static/functions-math.html

Regards,
Honza

Re: random() generates collisions too early

От
Heikki Linnakangas
Дата:
On 18.10.2013 14:55, Honza Horak wrote:
> On 10/18/2013 02:10 AM, Noah Misch wrote:
>  > sudo sysctl -w kernel.pid_max=2048
>  > psql -c 'create unlogged table samp(c float8)'
>  > for n in `seq 1 200000`; do psql -qc 'insert into samp values
> (random())'; done
>  >
>  > The results covered only 181383 distinct values, and 68 values
> repeated four
>  > or five times each. We should at least consider using a
> higher-entropy seed.
>
> As I was told this is not taken as a security issue, since random() is
> not considered as a CSPRNG in any case, but as Noah said, we should
> probably try to make it a bit better.

Interesting. PostgreSQL's random() function just calls the underlying
libc random() function. I assume you tested this on with Linux and glibc.

> Also, I'd suggest to state explicitly in the doc, that random()
> shouldn't be taken as CSPRNG, since I can imagine people blindly
> believing that random() can be good enough for such use cases, just
> because they see how many possible values they get from double-precision
> type:
> http://www.postgresql.org/docs/9.3/static/functions-math.html

Yeah, that seems like a good idea. A patch would be welcome.

- Heikki

Re: random() generates collisions too early

От
Tom Lane
Дата:
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> On 18.10.2013 14:55, Honza Horak wrote:
>> The results covered only 181383 distinct values, and 68 values
>> repeated four or five times each. We should at least consider using a
>> higher-entropy seed.

> Interesting. PostgreSQL's random() function just calls the underlying
> libc random() function. I assume you tested this on with Linux and glibc.

I agree with the theory that this probably isn't the fault of the random()
function as such, but with our code to reset the random seed when forking
a postmaster child process.  Note that the test case is only examining the
first random value created in each process.  So basically what this is
measuring is the number of different seed values we use.

            regards, tom lane

Re: random() generates collisions too early

От
Joe Van Dyk
Дата:
Oddly enough, I'm debugging a problem with a function that uses random()
and that is now generating collisions after I upgraded to 9.3.

CREATE OR REPLACE FUNCTION public.random_characters(length integer)
 RETURNS text
 LANGUAGE sql
AS $function$
SELECT array_to_string(array((
      SELECT SUBSTRING('abcdefghjkmnpqrstuvwxyz23456789'
        FROM mod((random()*31)::int, 31)+1 FOR 1)
      FROM generate_series(1, $1))),'');
$function$;

It's being used in a column definition like:
citext unique not null default upper(random_characters(8))

I can't make a self-contained reproducible test case now, I've tried but no
luck yet.

In the actual app, I'm getting collisions after only a couple hundred rows
are inserted. I never had a problem with < 9.3. The rows are being inserted
in a trigger, if that matters.

Joe


On Fri, Oct 18, 2013 at 4:55 AM, Honza Horak <hhorak@redhat.com> wrote:

> Hi guys,
>
> after playing a bit with "select random();", it turned out that numbers
> get repeated quite early in a sequence. Originally I set lower PID range
> (echo "2048" >/proc/sys/kernel/pid_max), but it doesn't seem to affect the
> results.
>
> So, what I observed... First, I generated a set including 1000 randomly
> generated numbers without setting a seed.
>
>   touch numbers
>   for i in {1..1000} ; do
>     echo "select random();"|psql|head -n 3|tail -n 1 >>numbers
>   done
>
> Then, I continued in generating random numbers and tried to find the new
> one in the set:
>
>   for i in {1..10000} ; do
>     if grep `echo "select random();"|psql|head -n 3|tail -n 1` numbers ;
> then
>       echo "SUCCESS: $i" ; break
>     fi
>   done
>
> To my surprise I'm able to find a collision very quickly, in first 1000
> numbers usually.
>
> Originally, I used psql calls to get random() value from different
> processes on purpose, but it seems Noah got similar results when random()
> is called in one process:
>
> On 10/18/2013 02:10 AM, Noah Misch wrote:
> > sudo sysctl -w kernel.pid_max=2048
> > psql -c 'create unlogged table samp(c float8)'
> > for n in `seq 1 200000`; do psql -qc 'insert into samp values
> (random())'; done
> >
> > The results covered only 181383 distinct values, and 68 values repeated
> four
> > or five times each.  We should at least consider using a higher-entropy
> seed.
>
> As I was told this is not taken as a security issue, since random() is not
> considered as a CSPRNG in any case, but as Noah said, we should probably
> try to make it a bit better.
>
> Also, I'd suggest to state explicitly in the doc, that random() shouldn't
> be taken as CSPRNG, since I can imagine people blindly believing that
> random() can be good enough for such use cases, just because they see how
> many possible values they get from double-precision type:
>
http://www.postgresql.org/**docs/9.3/static/functions-**math.html<http://www.postgresql.org/docs/9.3/static/functions-math.html>
>
> Regards,
> Honza
>
>
> --
> Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/**mailpref/pgsql-bugs<http://www.postgresql.org/mailpref/pgsql-bugs>
>

Re: random() generates collisions too early

От
Honza Horak
Дата:
On 10/21/2013 04:19 PM, Heikki Linnakangas wrote:
> On 18.10.2013 14:55, Honza Horak wrote:
>> On 10/18/2013 02:10 AM, Noah Misch wrote:
>>  > sudo sysctl -w kernel.pid_max=2048
>>  > psql -c 'create unlogged table samp(c float8)'
>>  > for n in `seq 1 200000`; do psql -qc 'insert into samp values
>> (random())'; done
>>  >
>>  > The results covered only 181383 distinct values, and 68 values
>> repeated four
>>  > or five times each. We should at least consider using a
>> higher-entropy seed.
>>
>> As I was told this is not taken as a security issue, since random() is
>> not considered as a CSPRNG in any case, but as Noah said, we should
>> probably try to make it a bit better.
>
> Interesting. PostgreSQL's random() function just calls the underlying
> libc random() function. I assume you tested this on with Linux and glibc.
>
>> Also, I'd suggest to state explicitly in the doc, that random()
>> shouldn't be taken as CSPRNG, since I can imagine people blindly
>> believing that random() can be good enough for such use cases, just
>> because they see how many possible values they get from double-precision
>> type:
>> http://www.postgresql.org/docs/9.3/static/functions-math.html
>
> Yeah, that seems like a good idea. A patch would be welcome.

I don't think we need to tell some long stories here, so what about this
one:
"pseudo-random value in the range 0.0 < x < 1.0 (characteristic of
randomness depends on the system implementation and is usually limited,
thus not considered as a CSPRNG in any case)"

Corresponding patch of doc is attached.

Regards,
Honza

Вложения

Re: random() generates collisions too early

От
Honza Horak
Дата:
On 10/21/2013 05:14 PM, Tom Lane wrote:
> Heikki Linnakangas <hlinnakangas@vmware.com> writes:
>> On 18.10.2013 14:55, Honza Horak wrote:
>>> The results covered only 181383 distinct values, and 68 values
>>> repeated four or five times each. We should at least consider using a
>>> higher-entropy seed.
>
>> Interesting. PostgreSQL's random() function just calls the underlying
>> libc random() function. I assume you tested this on with Linux and glibc.
>
> I agree with the theory that this probably isn't the fault of the random()
> function as such, but with our code to reset the random seed when forking
> a postmaster child process.  Note that the test case is only examining the
> first random value created in each process.  So basically what this is
> measuring is the number of different seed values we use.
>
>             regards, tom lane
>

After playing a bit more with random() value and the seed defined in
src/backend/postmaster/postmaster.c:4038 it seems really like the seed
doesn't have very good characteristic. The PID number used there ensures
the value to be different in child processes, but when only xor-ed with
usecs, it doesn't enlarge the total data domain of the seed, which stays
1M values. Then having in mind the birthday paradox and probably not
ideal PRNG, it doesn't seem unrealistic to get two same random numbers
in only hundreds/thousands of tries.

In order to enlarge possible data domain of the seed we can include sec
count as well and shift some xor-ed variables to use the whole range or
unsigned int. The attached patch uses a way that gave me much better
results (collision found approx. after a hundred thousands of values):

-       srandom((unsigned int) (MyProcPid ^ usecs));
+       srandom((unsigned int) (MyProcPid ^ (usecs << 12) ^ secs));

I'm also thinking of a reproducer -- does the test-suite allow to
"reconnect" the client during a test?

Regards,
Honza

Вложения

Re: random() generates collisions too early

От
Heikki Linnakangas
Дата:
On 22.10.2013 14:55, Honza Horak wrote:
> On 10/21/2013 04:19 PM, Heikki Linnakangas wrote:
>> On 18.10.2013 14:55, Honza Horak wrote:
>>> Also, I'd suggest to state explicitly in the doc, that random()
>>> shouldn't be taken as CSPRNG, since I can imagine people blindly
>>> believing that random() can be good enough for such use cases, just
>>> because they see how many possible values they get from double-precision
>>> type:
>>> http://www.postgresql.org/docs/9.3/static/functions-math.html
>>
>> Yeah, that seems like a good idea. A patch would be welcome.
>
> I don't think we need to tell some long stories here, so what about this
> one:
> "pseudo-random value in the range 0.0 < x < 1.0 (characteristic of
> randomness depends on the system implementation and is usually limited,
> thus not considered as a CSPRNG in any case)"

I had to look up what CSPRNG stands for, so we probably should spell it
out. Also not sure what it means for the characteristic of the
randomness to be limited. How about something like:

> random value in the range 0.0 <= x < 1.0 (the characteristics of the
> returned values depends on the system implementation. This function
> is not suitable for cryptographic applications; use pgcrypto
> instead.)

Or perhaps it would be even better to move random() and setseed to a
separate table. They are somewhat different from the rest of the
functions listed in the table of Mathematical Functions, and it would be
nice to list them together; currently the round() functions fall between
them in the alphabetically ordered table. What do you think of the attached?

- Heikki

Вложения

Re: random() generates collisions too early

От
Tom Lane
Дата:
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> Or perhaps it would be even better to move random() and setseed to a
> separate table. They are somewhat different from the rest of the
> functions listed in the table of Mathematical Functions, and it would be
> nice to list them together; currently the round() functions fall between
> them in the alphabetically ordered table. What do you think of the attached?

+1, but (a) I think the tgroup has 3 cols not 2; (b) s/depends/depend/
for grammar reaons.

            regards, tom lane

Re: random() generates collisions too early

От
Heikki Linnakangas
Дата:
On 23.10.2013 16:08, Tom Lane wrote:
> Heikki Linnakangas<hlinnakangas@vmware.com>  writes:
>> Or perhaps it would be even better to move random() and setseed to a
>> separate table. They are somewhat different from the rest of the
>> functions listed in the table of Mathematical Functions, and it would be
>> nice to list them together; currently the round() functions fall between
>> them in the alphabetically ordered table. What do you think of the attached?
>
> +1, but (a) I think the tgroup has 3 cols not 2; (b) s/depends/depend/
> for grammar reaons.

Fixed and committed, thanks.

- Heikki

Re: random() generates collisions too early

От
Heikki Linnakangas
Дата:
On 22.10.2013 15:41, Honza Horak wrote:
> On 10/21/2013 05:14 PM, Tom Lane wrote:
>> Heikki Linnakangas <hlinnakangas@vmware.com> writes:
>>> On 18.10.2013 14:55, Honza Horak wrote:
>>>> The results covered only 181383 distinct values, and 68 values
>>>> repeated four or five times each. We should at least consider using a
>>>> higher-entropy seed.
>>
>>> Interesting. PostgreSQL's random() function just calls the underlying
>>> libc random() function. I assume you tested this on with Linux and
>>> glibc.
>>
>> I agree with the theory that this probably isn't the fault of the
>> random()
>> function as such, but with our code to reset the random seed when forking
>> a postmaster child process. Note that the test case is only examining the
>> first random value created in each process. So basically what this is
>> measuring is the number of different seed values we use.
>>
>> regards, tom lane
>>
>
> After playing a bit more with random() value and the seed defined in
> src/backend/postmaster/postmaster.c:4038 it seems really like the seed
> doesn't have very good characteristic. The PID number used there ensures
> the value to be different in child processes, but when only xor-ed with
> usecs, it doesn't enlarge the total data domain of the seed, which stays
> 1M values. Then having in mind the birthday paradox and probably not
> ideal PRNG, it doesn't seem unrealistic to get two same random numbers
> in only hundreds/thousands of tries.
>
> In order to enlarge possible data domain of the seed we can include sec
> count as well and shift some xor-ed variables to use the whole range or
> unsigned int. The attached patch uses a way that gave me much better
> results (collision found approx. after a hundred thousands of values):
>
> - srandom((unsigned int) (MyProcPid ^ usecs));
> + srandom((unsigned int) (MyProcPid ^ (usecs << 12) ^ secs));

Seems reasonable, committed. Thanks!

- Heikki

Re: random() generates collisions too early

От
Tom Lane
Дата:
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> On 22.10.2013 15:41, Honza Horak wrote:
>> In order to enlarge possible data domain of the seed we can include sec
>> count as well and shift some xor-ed variables to use the whole range or
>> unsigned int. The attached patch uses a way that gave me much better
>> results (collision found approx. after a hundred thousands of values):
>>
>> - srandom((unsigned int) (MyProcPid ^ usecs));
>> + srandom((unsigned int) (MyProcPid ^ (usecs << 12) ^ secs));

> Seems reasonable, committed. Thanks!

While that change seems reasonable as far as it goes, we could do more.
In particular, including the current seconds count only provides a few
extra bits of variability over the short term, so I don't think this has
really moved the ball very far downfield.

I had been wondering about the idea of taking the next random value in the
postmaster's sequence and xor'ing that into the seed, along with the
components suggested above.  This should add close to 32 bits worth of
hard-to-predict randomness into each seed.

There are a couple of arguments to be made against this idea:

1. To work properly, the random() call would have to be made before the
fork(), with the value being passed down to the child process.  This is
no big deal in the fork() case but would add a little bit of complexity
in EXEC_BACKEND mode.  Still, we pass a lot of stuff to the child already,
and one more int isn't very much.

2. In principle, an attacker with free access to the state of a child
process might be able to infer the postmaster's random() value that had
been passed down.  I'm not sure about the potential security consequences
of that, but in a system with a very bad implementation of random() it
seems possible that the attacker might be able to guess nearby values
in the postmaster's random() sequence, which would lead to being able to
guess the salt values used in subsequently-launched children's password
challenges.  OTOH, even if all this was really feasible, what does that
buy for the attacker?

            regards, tom lane