Google Summer of Code 2008

From Jan Urbański
Subject Google Summer of Code 2008
Date
Msg-id 47CC53C1.5000609@students.mimuw.edu.pl
Responses Re: Google Summer of Code 2008  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Hi PostgreSQL!

Although this year's GSoC is just starting, I thought getting in touch a bit
earlier would only be of benefit.

I study Computer Science at the Faculty of Mathematics, Informatics
and Mechanics of Warsaw University. I'm currently in my fourth year of
studies. Having chosen Databases for my degree course, I plan to write my thesis
concentrating at least partially on PostgreSQL. This will (hopefully) be my
first GSoC.

For the past one and a half years I've also been working at a privately held
company, Fiok LLP. Among other things, the company develops custom
Web applications, which all use PostgreSQL as their database solution. During my
time at Fiok I have taken part in creating an accounting system for a large
Polish university, capable of generating financial reports required by the
European Union, a publishing platform for editors working in the Polish
Catholic Press Agency, and a custom-tailored CRM application.

All of these projects use unique PostgreSQL features, like PITR and full-text
search to name a few. You can glimpse the implemented FTS functionality by
looking here:

http://system.ekai.pl/kair/?_tw_DepeszeKlientaTable_0__search_plainfulltext=kalendarz&_tw_DepeszeKlientaTable_0__search_rank_orderby=on&screen=depesze
It's the public part of the publishing platform, which allows subscribed
readers to view published messages. The link takes you to search results for
the word 'kalendarz' (which is Polish for calendar), ordered by rank() and
highlighted by headline() (our client uses 8.2, hence the old function names).

I do my work at Fiok almost exclusively from home, showing up at the office
once every two or three weeks, so working in a distributed environment using
SCM tools is natural to me.

I'm also engaged in an open source project called Kato, being one of the key
developers. It's a small project that started as my company's requirement for a
new Web application framework and ended up being released under the New BSD
License. Of course, its native database engine is PostgreSQL. You can take a
look at the source here:
http://kato.googlecode.com/
or play around with a simple demo here:
http://sahara.fiok.pl/~jurbanski/kato-demo/kato-demo.en.php

Speaking of open source contributions, I also wrote an FTS-related patch for Postgres that
made its way into 8.3:
http://archives.postgresql.org/pgsql-patches/2007-11/msg00081.php

I try to follow -patches, occasionally read -hackers and sometimes make
excursions around the pgsql source, trying to learn more and more of it.

About my programming skills, particularly in C - one piece of code I'd like to show
you was written for an Operating Systems course. It's a kernel patch
implementing I/O operations throttling on a per-process basis through a /proc
based interface. The code lacks comments, as they were in Polish, but it's just
to assure you I'm able to write some good C:
http://students.mimuw.edu.pl/~wulczer/linux-2.6.17.13-iolimits-ju219721.patch

And now for the SoC. As this year's PostgreSQL Ideas are not set up yet, I
thought I'd give you the two projects floating through my mind:

1. WAL segment files explorer / mangler

While preparing a presentation about PITR and warm standby in PostgreSQL for my
degree course, I thought it would be nice if one had a command-line tool to
examine the contents of a WAL segment file and determine, for example, what
commands were recorded in it, what transaction IDs they belonged to,
etc. This would make it possible, for instance, to replay the WAL sequence up until a
function went haywire and wrecked one's data - without the need to know *when*
the accident happened. It could also be useful as an alternative method of logging
operations - since WAL files are written anyway, one could imagine a process
periodically looking through them and reporting (perhaps not all) operations
to some external listener. If for instance you were curious which column in a
table is updated most, instead of writing a trigger to log updates to it, you
could use the WAL explorer to find updates to that column and log them over the
network, thus reducing disk I/O.
Being even bolder, I thought about allowing the contents of a WAL file to be edited,
so if the proverbial junior DBA drops a crucial table and gets caught the next
morning, you don't have to throw away all transactions that got committed over
the night. Maybe you could *overwrite* his DROP TABLE with something neutral
and replay the WAL up to its end.
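
To make the idea a bit more concrete, here is a minimal sketch of the two use
cases: finding the point up to which a replay is safe, and counting updates
without a trigger. It assumes the records have already been decoded into a
simple structure; the Record type, its fields and both helper functions are
hypothetical and do not reflect the real WAL layout or any existing tool. For
simplicity it counts updates per table rather than per column.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical, already-decoded log record (NOT the real WAL layout)."""
    lsn: int       # position of the record in the log
    xid: int       # transaction that wrote the record
    op: str        # e.g. "UPDATE", "DROP TABLE", "COMMIT"
    relation: str  # table touched by the record ("" for COMMIT)


def replay_stop_point(records, bad_xid):
    """Return the LSN of the last record written before bad_xid first appears."""
    stop = None
    for rec in records:
        if rec.xid == bad_xid:
            break
        stop = rec.lsn
    return stop


def updates_per_relation(records):
    """Count UPDATE records per table, as a stand-in for trigger-based logging."""
    return Counter(rec.relation for rec in records if rec.op == "UPDATE")


# Made-up sample log: transaction 503 is the accidental DROP TABLE.
log = [
    Record(100, 501, "UPDATE", "accounts"),
    Record(110, 501, "COMMIT", ""),
    Record(120, 502, "UPDATE", "accounts"),
    Record(130, 502, "UPDATE", "orders"),
    Record(140, 502, "COMMIT", ""),
    Record(150, 503, "DROP TABLE", "accounts"),
    Record(160, 503, "COMMIT", ""),
    Record(170, 504, "UPDATE", "orders"),
]

stop_lsn = replay_stop_point(log, bad_xid=503)
update_counts = updates_per_relation(log)
```

In this toy log the replay would stop right after transaction 502 commits,
without anyone having to know the wall-clock time of the accident.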

2. Implement better selectivity estimates for FTS.

If I'm not mistaken, the @@ operator still uses the contsel selectivity
function, returning 0.001 * <total_row_count> as the expected number of rows
left after applying the @@ operator. I have in the past been bitten by
performance problems that I think could be traced back to row count estimates
being horribly wrong (i.e. much too low) for FTS queries asking for a very
popular word. Maybe we could do better than just return one-thousandth?
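
A small sketch of the arithmetic behind this complaint. The 0.001 constant is
the one quoted above; the table size, the per-word document frequencies and
the frequency_based_rows helper are made up for illustration and are not an
existing PostgreSQL API. The point is only that a flat constant ignores how
common the searched word is, while even a crude frequency table does not.

```python
CONTSEL_DEFAULT = 0.001  # flat selectivity quoted in the text above


def contsel_rows(total_rows):
    """Row estimate under the flat default, regardless of the searched word."""
    return total_rows * CONTSEL_DEFAULT


def frequency_based_rows(total_rows, word, doc_frequencies,
                         fallback=CONTSEL_DEFAULT):
    """Hypothetical alternative: use the fraction of documents containing
    the word when it is known, and fall back to the flat default otherwise."""
    selectivity = doc_frequencies.get(word, fallback)
    return total_rows * selectivity


total_rows = 1_000_000

# Made-up statistics: fraction of documents that contain each word.
doc_frequencies = {"kalendarz": 0.25, "depesza": 0.40}

flat_estimate = contsel_rows(total_rows)
informed_estimate = frequency_based_rows(total_rows, "kalendarz", doc_frequencies)
unknown_word_estimate = frequency_based_rows(total_rows, "xylofon", doc_frequencies)

# How far off the flat estimate is for the popular word.
underestimate_factor = informed_estimate / flat_estimate
```

With these invented numbers the flat estimate is off by a factor of about 250
for the popular word, which is the kind of error that can push the planner
towards a poor plan.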

I myself am more for the first idea, but both seem good concepts to me. Also,
both are implementable as contrib modules, with the WAL explorer possibly
requiring modification to the WAL structure, and thus having to wait
for 8.4 to get into core.

As of now, these are just loose ideas, but ones I believe are possible to
implement in the time boundaries of GSoC coding. Before digging deeper into the
source and giving them more thought, I wanted to consult some more experienced
PostgreSQL hackers and get their opinions - after all, that's what the
community is for.

To wrap it up: do you find any of these ideas worthwhile? Could they be good
candidates for a GSoC project? Of course doing some stuff from the TODO list
would still be fun, if you believe those items are more promising/needed. Basically,
any kind of involvement in PostgreSQL is something that gets me excited.

Hope to hear from you,
Cheers,
Jan Urbanski 

-- 
Jan Urbanski
GPG key ID: E583D7D2

ouden estin

