Re: No easy way to join discussion in existing thread when not subscribed

From	Stefan Kaltenbrunner
Subject	Re: No easy way to join discussion in existing thread when not subscribed
Date
Msg-id	560AEBBA.7070802@kaltenbrunner.cc
In response to	Re: No easy way to join discussion in existing thread when not subscribed  ("Amir Rohan" <amir.rohan@mail.com>)
Responses	Re: No easy way to join discussion in existing thread when not subscribed  (Alvaro Herrera <alvherre@2ndquadrant.com>)
List	pgsql-www
On 09/29/2015 09:34 PM, Amir Rohan wrote:
> On 09/29/2015 08:44 PM, Stefan Kaltenbrunner wrote:
> 
>> Hi Amir!
>>
> 
> Hi Stefan, thanks for reviewing.

np - you are very welcome ;)

> 
>>>
>>> Please, see attached patch adding a "whole thread[as mbox] link"
>>> to pgarchives, by popular request upthread.
>>
>> Thanks a lot for the patch - I took a quick look at the patch and have a
>> few comments to make..
>>
>>>
> 
>>>
>>> If you'd like changes (hard limits, strip attachments, etc'),
>>> do let me know.
>>
>> some other things:
>>
>> * while this is a preexisting issue in the code (most of the http auth
>> requests are handled directly in lighttpd so nobody noticed so far i
>> guess) please use "Please authenticate with user 'archives' and
>> 'password' antispam"
> 
> I don't follow. The Auth handling is duplicated from the "raw" message
> view, and I verified basic auth is required, prompt and all.

for most accesses to the archives, the string for the basic auth reply
quotes the "archives" and "password" strings with ' - see
http://www.postgresql.org/message-id/20150827215248.GC14857@momjian.us
That one is "mostly" enforced by lighttpd and not the django code - my
"fix" back then missed the django part (hence my comment on it being
a preexisting issue), but we can fix it now while we are at it.


> 
>> * have you verified that the resulting mbox actually contains the
>> newline separator after each message (I have not checked whether the
>> source data has it)?
> 
> Yes.
> 
> I also checked that both Mutt and Thunderbird can import
> the generated files.
> AFAICT, adding an extra newline actually ends up being one too
> many. That may be a quirk of the test data I used from
> the loader scripts and mboxes provided, but I can't be sure,
> so I kept it in, guessing that email clients will tolerate
> this sort of benign spec violation. TB and Mutt do.

hmm ok - will see if I can do some tests on the real data in the next days...

> 
> btw, there's something off with the mbox processing chain you use.
> I think it is non-compliant with the spec (as per qmail manpage of
> yore), which requires so called ">From quoting":
> 
> http://qmail.org/man/man5/mbox.html
> 
> See for example <20150802150506.GH11473@alap3.anarazel.de> in
> pgsql-hackers.201508, which includes email messages as mime
> attachments and triggers (I believe) "missing Message-Ids"
> warnings from your tool, and is perhaps mangled in the archives,
> I've seen a few dozen of those while testing.

we have a number of current issues where data in the archives gets
mangled/corrupted that we are looking into. We are currently working on some
infrastructure to "test" parsing fixes across all the messages in the
archives, to get a better understanding of what kind of effect a change has.
For this specific message, I'm curious how you found it though?
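As a side note, the ">From quoting" described in that man page (the mboxrd variant) amounts to prepending one more '>' to any body line matching />*From / on write. A hypothetical sketch, not the archives' actual parser:

```python
import re

# mboxrd-style ">From quoting" sketch: any body line that is "From " preceded
# by zero or more '>' characters gets one extra '>' prepended, so messages
# embedded in a body can't be mistaken for new mbox entries.
FROM_RE = re.compile(br"^(>*)From ", re.MULTILINE)

def mbox_quote_body(body):
    # body is the raw message body as bytes; returns the quoted form.
    return FROM_RE.sub(br">\1From ", body)
```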

> 
>> * are you sure that using unicode() for building the output is going to
>> work on all input? - I don't think you can assume that the source data is
>> ASCII clean and/or has only valid unicode code points for mapping
>>
> 
> Good catch, but I caught it earlier and sent v2, which keeps everything in
> bytes() and avoids encoding issues altogether.

yeah, missed v2 because it was shown as a separate thread in my mua -
will look into it soon if I can find time...

> 
>>> It checks hiddenstatus, but does materialize the entire raw thread
>>> in memory (including attachments) to form the response, which
>>> is unbounded in principle and can be sizable in practice.
>>>
>>> Perhaps django can do streaming requests, so we can bound
>>> the memory usage + timeout. It's been a while.
>>
>> this is dangerous - the box we are running on has limited <RAM>
>> <...>
>> Have you done any (approximate) measurements on what the additional
>> in-memory overhead in both pg (to build the response) and in django is
>> compared to the resulting  mbox?
>>
> 
> That's more complicated to answer. First, the "whole thread" view
> already materializes all the messages in a thread, excepting
> attachments, so whether we include those is the main (only?) issue.
> We can drop them (stripping them may cost some cpu) if all else fails.
> 
> The average email length in -hackers was about 10k in august.
> The largest thread contained 91 messages, the median was 3.
> So, say it takes 1M to store an mbox file for a large thread,
> assuming august is a representative sample.
> 
> Although python (py2, especially) is quite wasteful about strings, it
> has only constant overhead per object for bytes() objects, and psycopg2
> returns buffer objects which have the same property. Both these
> assertions check out with `sys.getsizeof`.
> So we're left with pretty much 1:1 memory to final mbox size on the
> python side of things, which seems tolerable even with a safety factor
> of 5-10 (I don't know if python does zero-copy).
> 
> If that's a problem, we can slap a size limit and still keep the 98%
> percentile or so of threads accessible.
> We can also add a rate limit to mitigate this as a DOS vector, if you're
> really concerned about that. Or a captcha, if we must.

good data - will look into the entire archives to see what the "largest
thread ever" (in terms of octet bytes) was and whether the current
system would be able to cope. My concern mostly stems from operational
experience (on the sysadmin team): some operations on the archives
currently are fairly compute- and memory-intensive, causing issues
with availability, and we would want to not add more vectors that can
cause that.
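If a size cap goes in, the thread could also be streamed message-by-message rather than materialized; in django this generator would feed a StreamingHttpResponse. A sketch under stated assumptions - the helper name and the limit value are hypothetical:

```python
MAX_MBOX_BYTES = 50 * 1024 * 1024  # hypothetical hard cap on response size

def stream_mbox(messages, limit=MAX_MBOX_BYTES):
    # Sketch: yield the mbox one message at a time so only a single message
    # is ever held in memory, stopping once the byte cap is reached.
    # In django: return StreamingHttpResponse(stream_mbox(rows), ...).
    sent = 0
    for raw in messages:
        chunk = raw if raw.endswith(b"\n") else raw + b"\n"
        chunk += b"\n"  # mbox blank-line separator after each message
        sent += len(chunk)
        if sent > limit:
            break       # truncate rather than exhaust memory/bandwidth
        yield chunk
```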

> 
> On the postgres side (and psycopg2 internals), I'm less knowledgeable.
> Maybe someone else can jump in? Does postgres (or psycopg2) use
> inordinately large amounts of memory when fetching a meg or two of text?
> What's the simplest way to profile postgres's memory usage during
> a query?
> 
> The large data lives in a field which is tellingly schema'd as "bytea".
> I bet that's the good stuff, considering who wrote this.

well bytea is the only really useful datatype we can store strings of
bytes in (other than blobs) ;)


Stefan


