Обсуждение: valgrind issues on Fedora 28
Hi, I've recently updated to Fedora 28, and in that environment I get quite a few new valgrind issues (see the attached log). Essentially, all the reports start with either ==5971== Invalid read of size 32 ==5971== at 0x5957EB1: __wcsnlen_avx2 (in /usr/lib64/libc-2.27.so) ==5971== by 0x589E871: wcsrtombs (in /usr/lib64/libc-2.27.so) ==5971== by 0x5834000: wcstombs (in /usr/lib64/libc-2.27.so) ==5971== by 0x97DD82: wchar2char (pg_locale.c:1641) or ==5971== Conditional jump or move depends on uninitialised value(s) ==5971== at 0x5822123: __gconv_transform_internal_utf8 (in /usr/lib64/libc-2.27.so) ==5971== by 0x589E8A4: wcsrtombs (in /usr/lib64/libc-2.27.so) ==5971== by 0x5834000: wcstombs (in /usr/lib64/libc-2.27.so) ==5971== by 0x97DD82: wchar2char (pg_locale.c:1641) or some other combination of that. In all cases the call stack is wchar2char > wcstombs > wcsrtombs > something I don't see these issues on the old (Fedora 26) environment, so it seems to be caused by updating either gcc or glibc: old system: * glibc-2.25-13.fc26.x86_64 * gcc version 7.3.1 20180130 (Red Hat 7.3.1-2) (GCC) new system: * glibc-2.27-32.fc28.x86_64 * gcc version 8.2.1 20181011 (Red Hat 8.2.1-4) (GCC) Valgrind is the same (valgrind-3.13.0) in both cases. I'm not sure if it's due to some glibc changes, gcc generating different code, or something else. I've tried to look at wcsrtombs in glibc, but it's not particularly readable piece of code. Also, my eyes bleed ... Thank god for pgindent! I suspect it's either due to some glibc bug and/or gcc optimization that ends up touching memory that was not actually initialized (judging by __wcsnlen_avx2), or something like that. Not sure. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Вложения
On 2018-11-06 18:24:55 +0100, Tomas Vondra wrote: > I've recently updated to Fedora 28, and in that environment I get quite a > few new valgrind issues (see the attached log). > > Essentially, all the reports start with either > > ==5971== Invalid read of size 32 > ==5971== at 0x5957EB1: __wcsnlen_avx2 (in /usr/lib64/libc-2.27.so) > ==5971== by 0x589E871: wcsrtombs (in /usr/lib64/libc-2.27.so) > ==5971== by 0x5834000: wcstombs (in /usr/lib64/libc-2.27.so) > ==5971== by 0x97DD82: wchar2char (pg_locale.c:1641) I think this isn't actually a bug, just a missing suppression. The avx2 code uses instructions to scan for 0 bytes in multiple bytes at the same time. Therefore it can encounter a byte marked as undefined, even if it never actually uses that value. > or > > ==5971== Conditional jump or move depends on uninitialised value(s) > ==5971== at 0x5822123: __gconv_transform_internal_utf8 (in > /usr/lib64/libc-2.27.so) > ==5971== by 0x589E8A4: wcsrtombs (in /usr/lib64/libc-2.27.so) > ==5971== by 0x5834000: wcstombs (in /usr/lib64/libc-2.27.so) > ==5971== by 0x97DD82: wchar2char (pg_locale.c:1641) > > or some other combination of that. In all cases the call stack is > > wchar2char > wcstombs > wcsrtombs > something I think I came to the same conclusion here, but I'm not quite sure. FWIW, I've supressed these on my valgrind animal a while ago. Greetings, Andres Freund
On 2018-Nov-06, Tomas Vondra wrote: > Hi, > > I've recently updated to Fedora 28, and in that environment I get quite a > few new valgrind issues (see the attached log). Hmm, related to https://postgr.es/m/20180220150838.GD18315@e733.localdomain ? Apparently the right answer is to add local suppressions ... -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 11/6/18 6:35 PM, Andres Freund wrote: > On 2018-11-06 18:24:55 +0100, Tomas Vondra wrote: >> I've recently updated to Fedora 28, and in that environment I get quite a >> few new valgrind issues (see the attached log). >> >> Essentially, all the reports start with either >> >> ==5971== Invalid read of size 32 >> ==5971== at 0x5957EB1: __wcsnlen_avx2 (in /usr/lib64/libc-2.27.so) >> ==5971== by 0x589E871: wcsrtombs (in /usr/lib64/libc-2.27.so) >> ==5971== by 0x5834000: wcstombs (in /usr/lib64/libc-2.27.so) >> ==5971== by 0x97DD82: wchar2char (pg_locale.c:1641) > > > I think this isn't actually a bug, just a missing suppression. The avx2 > code uses instructions to scan for 0 bytes in multiple bytes at the same > time. Therefore it can encounter a byte marked as undefined, even if it > never actually uses that value. > OK, my thoughts exactly. >> or >> >> ==5971== Conditional jump or move depends on uninitialised value(s) >> ==5971== at 0x5822123: __gconv_transform_internal_utf8 (in >> /usr/lib64/libc-2.27.so) >> ==5971== by 0x589E8A4: wcsrtombs (in /usr/lib64/libc-2.27.so) >> ==5971== by 0x5834000: wcstombs (in /usr/lib64/libc-2.27.so) >> ==5971== by 0x97DD82: wchar2char (pg_locale.c:1641) >> >> or some other combination of that. In all cases the call stack is >> >> wchar2char > wcstombs > wcsrtombs > something > > I think I came to the same conclusion here, but I'm not quite sure. > Looking at gconv code at [1], it seems it's reading the data as int32 values and using shifts to extract individual bytes. I'm pretty sure this confuses valgrind so it thinks it's accessing all the bytes. [1] https://sourceware.org/git/?p=glibc.git;a=blob;f=iconv/gconv_simple.c;h=506c92caf228d61f92986c39a2ddf9c0a134b4c0;hb=HEAD > FWIW, I've supressed these on my valgrind animal a while ago. > OK, I propose to add these suppressions into the current list. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 11/6/18 6:35 PM, Alvaro Herrera wrote: > On 2018-Nov-06, Tomas Vondra wrote: > >> Hi, >> >> I've recently updated to Fedora 28, and in that environment I get quite a >> few new valgrind issues (see the attached log). > > Hmm, related to https://postgr.es/m/20180220150838.GD18315@e733.localdomain ? > Apparently the right answer is to add local suppressions ... > Yep, seems like that. I'll check if the patch proposed by Aleksander resolves all the issues reported by valgrind. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 11/6/18 6:51 PM, Tomas Vondra wrote: > On 11/6/18 6:35 PM, Alvaro Herrera wrote: >> On 2018-Nov-06, Tomas Vondra wrote: >> >>> Hi, >>> >>> I've recently updated to Fedora 28, and in that environment I get >>> quite a >>> few new valgrind issues (see the attached log). >> >> Hmm, related to >> https://postgr.es/m/20180220150838.GD18315@e733.localdomain ? >> Apparently the right answer is to add local suppressions ... >> > > Yep, seems like that. I'll check if the patch proposed by Aleksander > resolves all the issues reported by valgrind. > OK, I propose we add suppressions per the attached patch. The rules are clearly incomplete, due to dependence on optimization used. For example now the stack includes __wcsnlen_avx2, but I see it was __wcsnlen_sse4_1 in Aleksander's report in February. Not sure what to do about this - maybe we should wildcard this first frame, to accept all wcsnlen variants. Opinions? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Вложения
On 11/8/18 1:07 PM, Tomas Vondra wrote: > On 11/6/18 6:51 PM, Tomas Vondra wrote: >> On 11/6/18 6:35 PM, Alvaro Herrera wrote: >>> On 2018-Nov-06, Tomas Vondra wrote: >>> >>>> Hi, >>>> >>>> I've recently updated to Fedora 28, and in that environment I get >>>> quite a >>>> few new valgrind issues (see the attached log). >>> >>> Hmm, related to >>> https://postgr.es/m/20180220150838.GD18315@e733.localdomain ? >>> Apparently the right answer is to add local suppressions ... >>> >> >> Yep, seems like that. I'll check if the patch proposed by Aleksander >> resolves all the issues reported by valgrind. >> > > OK, I propose we add suppressions per the attached patch. > > The rules are clearly incomplete, due to dependence on optimization > used. For example now the stack includes __wcsnlen_avx2, but I see it > was __wcsnlen_sse4_1 in Aleksander's report in February. > > Not sure what to do about this - maybe we should wildcard this first > frame, to accept all wcsnlen variants. Opinions? > I've committed this with the first frame wirldcarded, and backpatched it all the way back to 9.4. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes: > On 11/8/18 1:07 PM, Tomas Vondra wrote: >> Not sure what to do about this - maybe we should wildcard this first >> frame, to accept all wcsnlen variants. Opinions? > I've committed this with the first frame wirldcarded, and backpatched it > all the way back to 9.4. So I just got around to trying to do some valgrind testing for the first time in a couple of months, and what I find is that this patch completely breaks valgrind for me: it exits immediately with complaints like ==00:00:00:00.222 14821== FATAL: in suppressions file "/home/postgres/pgsql/src/tools/valgrind.supp" near line 227: ==00:00:00:00.222 14821== unknown tool suppression type ==00:00:00:00.222 14821== exiting now. After a bit of hair-pulling I found that the line number report is a lie, and what it's really unhappy about is the "Memcheck:Addr32" specification on line 236. This is not terribly surprising perhaps considering that "Addr32" isn't even documented on what should be the authoritative page: http://valgrind.org/docs/manual/mc-manual.html#mc-manual.suppfiles This is valgrind 3.8.1 (RHEL6's version), so not bleeding edge exactly, but we have very little evidence about how far back the committed patch works. I'm inclined to think we can't use this solution. regards, tom lane
On 12/14/18 5:10 AM, Tom Lane wrote: > Tomas Vondra <tomas.vondra@2ndquadrant.com> writes: >> On 11/8/18 1:07 PM, Tomas Vondra wrote: >>> Not sure what to do about this - maybe we should wildcard this first >>> frame, to accept all wcsnlen variants. Opinions? > >> I've committed this with the first frame wirldcarded, and backpatched it >> all the way back to 9.4. > > So I just got around to trying to do some valgrind testing for the first > time in a couple of months, and what I find is that this patch completely > breaks valgrind for me: it exits immediately with complaints like > > ==00:00:00:00.222 14821== FATAL: in suppressions file "/home/postgres/pgsql/src/tools/valgrind.supp" near line 227: > ==00:00:00:00.222 14821== unknown tool suppression type > ==00:00:00:00.222 14821== exiting now. > > After a bit of hair-pulling I found that the line number report is > a lie, and what it's really unhappy about is the "Memcheck:Addr32" > specification on line 236. This is not terribly surprising perhaps > considering that "Addr32" isn't even documented on what should be > the authoritative page: > http://valgrind.org/docs/manual/mc-manual.html#mc-manual.suppfiles > Hmm :-( FWIW the Addr32 comes directly from valgrind itself, which explicitly suggested to add suppression like this: { <insert_a_suppression_name_here> Memcheck:Addr32 fun:__wcsnlen_avx2 fun:wcsrtombs fun:wcstombs fun:wchar2char fun:str_tolower fun:lower fun:DirectFunctionCall1Coll fun:Generic_Text_IC_like ... } My guess is that documentation is either somewhat obsolete, or is meant more like a general description than an exhaustive and authoritative source of truth regarding suppressions. > This is valgrind 3.8.1 (RHEL6's version), so not bleeding edge exactly, > but we have very little evidence about how far back the committed patch > works. I'm inclined to think we can't use this solution. > Fedora 28 has 3.14.0, so this seems to be due to some improvements since 3.8.1. I'm not sure how to deal with this - surely we don't want to just revert the change because that would start generating noise again. For a moment I thought we might wildcard the error type somehow, so that it would accept Memcheck:* but AFAICs that is not supported. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes: > On 12/14/18 5:10 AM, Tom Lane wrote: >> So I just got around to trying to do some valgrind testing for the first >> time in a couple of months, and what I find is that this patch completely >> breaks valgrind for me ... >> This is valgrind 3.8.1 (RHEL6's version), so not bleeding edge exactly, >> but we have very little evidence about how far back the committed patch >> works. I'm inclined to think we can't use this solution. > Fedora 28 has 3.14.0, so this seems to be due to some improvements since > 3.8.1. I'm not sure how to deal with this - surely we don't want to just > revert the change because that would start generating noise again. For a > moment I thought we might wildcard the error type somehow, so that it > would accept Memcheck:* but AFAICs that is not supported. Meh. There are lots of situations where it's necessary to carry some platform-specific valgrind suppressions; I have half a dozen on my RHEL6 box because of various weirdness in its old version of perl, for example. I think this is one where people running code new enough to have this problem need to carry private suppressions of their own. In general, I'm not particularly on board with our valgrind.supp carrying suppressions for code outside our own code base: I think that's assuming WAY too much about which version of what is installed on a particular box. Maybe we could do something to make it simpler to have custom suppressions? Not sure what, though. regards, tom lane
On 12/14/18 4:58 PM, Tom Lane wrote: > ... > > In general, I'm not particularly on board with our valgrind.supp > carrying suppressions for code outside our own code base: I think > that's assuming WAY too much about which version of what is installed > on a particular box. > Fair point. > Maybe we could do something to make it simpler to have custom > suppressions? Not sure what, though. > I was thinking that perhaps we could allows specifying path to extra suppressions and pass that to valgrind. But we don't actually invoke valgrind, that's something people do on their own anyway - so we don't have anywhere to pass the path to. And whoever invokes valgrind can simply stick it directly into the command they're using (as it allows specifying multiple --suppressions=<file> options). Or perhaps just put it into ~/.valgrindrc. So perhaps we should simply revert that commit and be done with it. One place that will need to solve it is buildfarm client, but it could pick either of the options I mentioned. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 12/14/18 4:35 PM, Tomas Vondra wrote: > On 12/14/18 4:58 PM, Tom Lane wrote: >> ... >> >> In general, I'm not particularly on board with our valgrind.supp >> carrying suppressions for code outside our own code base: I think >> that's assuming WAY too much about which version of what is installed >> on a particular box. >> > Fair point. > >> Maybe we could do something to make it simpler to have custom >> suppressions? Not sure what, though. >> > I was thinking that perhaps we could allows specifying path to extra > suppressions and pass that to valgrind. > > But we don't actually invoke valgrind, that's something people do on > their own anyway - so we don't have anywhere to pass the path to. And > whoever invokes valgrind can simply stick it directly into the command > they're using (as it allows specifying multiple --suppressions=<file> > options). Or perhaps just put it into ~/.valgrindrc. > > So perhaps we should simply revert that commit and be done with it. > > One place that will need to solve it is buildfarm client, but it could > pick either of the options I mentioned. > > The buildfarm client has a parameter in the config file for valgrind options. All you would have to do is an an extra --suppressions setting in there. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 12/15/18 12:32 AM, Andrew Dunstan wrote: > > On 12/14/18 4:35 PM, Tomas Vondra wrote: >> On 12/14/18 4:58 PM, Tom Lane wrote: >>> ... >>> >>> In general, I'm not particularly on board with our valgrind.supp >>> carrying suppressions for code outside our own code base: I think >>> that's assuming WAY too much about which version of what is installed >>> on a particular box. >>> >> Fair point. >> >>> Maybe we could do something to make it simpler to have custom >>> suppressions? Not sure what, though. >>> >> I was thinking that perhaps we could allows specifying path to extra >> suppressions and pass that to valgrind. >> >> But we don't actually invoke valgrind, that's something people do on >> their own anyway - so we don't have anywhere to pass the path to. And >> whoever invokes valgrind can simply stick it directly into the command >> they're using (as it allows specifying multiple --suppressions=<file> >> options). Or perhaps just put it into ~/.valgrindrc. >> >> So perhaps we should simply revert that commit and be done with it. >> >> One place that will need to solve it is buildfarm client, but it could >> pick either of the options I mentioned. >> >> > > The buildfarm client has a parameter in the config file for valgrind > options. All you would have to do is an an extra --suppressions setting > in there. > OK, makes sense. I'll revert the commit tomorrow. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services