Discussion: Re: no mailing list hits in google
Merlin Moncure <mmoncure@gmail.com> writes:
> [apologies if this is the incorrect list or is already discussed material]

It's not the right list; redirecting to pgsql-www.

> I've noticed that mailing list discussions in -hackers and other
> mailing lists appear to not be indexed by google -- at all. We are
> also not being tracked by any mailing list aggregators -- in contrast
> to a decade ago where we had nabble and other systems to collect and
> organize results (tbh, often better than we do) we are now at an
> extreme disadvantage; mailing list activity was formerly and
> absolutely fantastic research via google to find solutions to obscure
> technical problems in the database. Limited access to this
> information will directly lead to increased bug reports, lack of
> solution confidence, etc.
>
> My test case here is the query: pgsql-hackers ExecHashJoinNewBatch
>
> I was searching out a link to recent bug report for copy/paste into
> corporate email. In the old days this would fire right up but now
> returns no hits even though the discussion is available in the
> archives (which I had to find by looking up the specific day the
> thread was active). Just a heads up.

Hm. When I try googling that, the first thing I get is

    pgsql-hackers - PostgreSQL
    https://www.postgresql.org › list › pgsql-hackers
    No information is available for this page.
    Learn why

and the "learn why" link says that "You are seeing this result because the
page is blocked by a robots.txt file on your website."

So somebody has blocked the archives from being indexed.
Seems like a bad idea.

regards, tom lane
On Wed, Aug 28, 2019 at 6:51 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Merlin Moncure <mmoncure@gmail.com> writes:
>> [apologies if this is the incorrect list or is already discussed material]
>
> It's not the right list; redirecting to pgsql-www.
>
>> I've noticed that mailing list discussions in -hackers and other
>> mailing lists appear to not be indexed by google -- at all. We are
>> also not being tracked by any mailing list aggregators -- in contrast
>> to a decade ago where we had nabble and other systems to collect and
>> organize results (tbh, often better than we do) we are now at an
>> extreme disadvantage; mailing list activity was formerly and
>> absolutely fantastic research via google to find solutions to obscure
>> technical problems in the database. Limited access to this
>> information will directly lead to increased bug reports, lack of
>> solution confidence, etc.
>>
>> My test case here is the query: pgsql-hackers ExecHashJoinNewBatch
>>
>> I was searching out a link to recent bug report for copy/paste into
>> corporate email. In the old days this would fire right up but now
>> returns no hits even though the discussion is available in the
>> archives (which I had to find by looking up the specific day the
>> thread was active). Just a heads up.
>
> Hm. When I try googling that, the first thing I get is
>
>     pgsql-hackers - PostgreSQL
>     https://www.postgresql.org › list › pgsql-hackers
>     No information is available for this page.
>     Learn why
>
> and the "learn why" link says that "You are seeing this result because the
> page is blocked by a robots.txt file on your website."
>
> So somebody has blocked the archives from being indexed.
> Seems like a bad idea.
It blocks /list/ which has the subjects only. The actual emails in /message-id/ are not blocked by robots.txt. I don't know why they stopped appearing in the searches... Nothing has been changed around that for many years from *our* side.
Hi,

On 2019-08-28 19:09:40 +0200, Magnus Hagander wrote:
> It blocks /list/ which has the subjects only.

Yea. But there's no way to actually get to all the individual messages
without /list/? Sure, some will be linked to from somewhere else, but
without the content below /list/, most won't be reached?

Why is that /list/ exclusion there in the first place?

> Nothing has been changed around that for many years from *our* side.

Any chance that there previously still was an archives.postgresql.org
view or such that allowed to reach the individual messages without being
blocked by robots.txt?

Greetings,

Andres Freund
Magnus Hagander <magnus@hagander.net> writes:
> It blocks /list/ which has the subjects only. The actual emails in
> /message-id/ are not blocked by robots.txt. I don't know why they stopped
> appearing in the searches... Nothing has been changed around that for many
> years from *our* side.

If I go to https://www.postgresql.org/message-id/ I get a page saying
"Not Found". So I'm not clear on how a web crawler would descend through
that to individual messages. Even if it looks different to a robot, what
would it look like exactly? A flat space of umpteen zillion
immediate-child pages? It seems not improbable that Google's search
engine would intentionally decide not to index that, or unintentionally
just fail due to some internal resource limit. (This theory can explain
why it used to work and no longer does: we got past whatever the limit
is.)

Andres' idea of allowing access to /list/ would allow the archives to be
traversed in more bite-size pieces, which might fix the issue.

regards, tom lane
On Wed, Aug 28, 2019 at 7:45 PM Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2019-08-28 19:09:40 +0200, Magnus Hagander wrote:
>> It blocks /list/ which has the subjects only.
>
> Yea. But there's no way to actually get to all the individual messages
> without /list/? Sure, some will be linked to from somewhere else, but
> without the content below /list/, most won't be reached?

That is indeed a good point. But it has been that way for many years, so
something must've changed. We last modified this in 2013....

Maybe Google used to load the pages under /list/ and crawl them for links
but just not include the actual pages in the index or something.

I wonder if we can inject these into Google using a sitemap. I think that
should work -- will need some investigation on exactly how to do it, as
sitemaps also have individual restrictions on the number of urls per file,
and we do have quite a few messages.

> Why is that /list/ exclusion there in the first place?

Because there is basically an infinite number of pages in that space, due
to the fact that you can pick an arbitrary point in time to view from.

>> Nothing has been changed around that for many years from *our* side.
>
> Any chance that there previously still was an archives.postgresql.org
> view or such that allowed to reach the individual messages without being
> blocked by robots.txt?
That one had a robots.txt blocking this going back even further in time.
On 2019-Aug-29, Magnus Hagander wrote:

> Maybe Google used to load the pages under /list/ and crawl them for links
> but just not include the actual pages in the index or something
>
> I wonder if we can inject these into Google using a sitemap. I think that
> should work -- will need some investigation on exactly how to do it, as
> sitemaps also have individual restrictions on the number of urls per file,
> and we do have quite a few messages.
>
>> Why is that /list/ exclusion there in the first place?
>
> Because there are basically infinite number of pages in that space, due to
> the fact that you can pick an arbitrary point in time to view from.

Maybe we can create a new page that's specifically to be used by
crawlers, that lists all emails, each only once. Say (unimaginatively)
/list_crawlers/2019-08/ containing links to all emails of all public
lists occurring during August 2019.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Aug 29, 2019 at 3:32 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
On 2019-Aug-29, Magnus Hagander wrote:
>> Maybe Google used to load the pages under /list/ and crawl them for links
>> but just not include the actual pages in the index or something
>>
>> I wonder if we can inject these into Google using a sitemap. I think that
>> should work -- will need some investigation on exactly how to do it, as
>> sitemaps also have individual restrictions on the number of urls per file,
>> and we do have quite a few messages.
>>
>>> Why is that /list/ exclusion there in the first place?
>>
>> Because there are basically infinite number of pages in that space, due to
>> the fact that you can pick an arbitrary point in time to view from.
>
> Maybe we can create a new page that's specifically to be used by
> crawlers, that lists all emails, each only once. Say (unimaginatively)
> /list_crawlers/2019-08/ containing links to all emails of all public
> lists occurring during August 2019.
That's pretty much what I'm suggesting but using a sitemap so it's directly injected.
//Magnus
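To make the sitemap idea concrete, here is a rough sketch (hypothetical code, not pgweb's actual implementation; the function and file names are made up) of how message URLs could be split across sitemap files, since the sitemap protocol caps each file at 50,000 URLs and requires a sitemap index above that:

```python
# Sketch only: chunk archive message URLs into sitemap files that respect
# the sitemap protocol's 50,000-URL-per-file limit, plus a sitemap index
# pointing at the chunks. Names and URL layout are illustrative.
from xml.sax.saxutils import escape

SITEMAP_URL_LIMIT = 50_000  # per https://www.sitemaps.org/protocol.html


def build_sitemaps(message_ids, base="https://www.postgresql.org"):
    """Return (index_xml, [sitemap_xml, ...]) covering every message-id."""
    urls = [f"{base}/message-id/{m}" for m in message_ids]
    # Split the flat URL list into protocol-sized chunks.
    chunks = [urls[i:i + SITEMAP_URL_LIMIT]
              for i in range(0, len(urls), SITEMAP_URL_LIMIT)]
    sitemaps = []
    for chunk in chunks:
        entries = "".join(f"<url><loc>{escape(u)}</loc></url>" for u in chunk)
        sitemaps.append(
            '<?xml version="1.0" encoding="UTF-8"?>'
            '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
            f"{entries}</urlset>")
    # The index file lists one <sitemap> entry per generated chunk.
    index_entries = "".join(
        f"<sitemap><loc>{base}/sitemap-messages-{n}.xml</loc></sitemap>"
        for n in range(len(sitemaps)))
    index = ('<?xml version="1.0" encoding="UTF-8"?>'
             '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
             f"{index_entries}</sitemapindex>")
    return index, sitemaps
```

The generated files would then be advertised to crawlers, e.g. via a `Sitemap:` line in robots.txt or by submitting the index in Search Console.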
Hi,

On 2019-08-29 13:12:00 +0200, Magnus Hagander wrote:
> On Wed, Aug 28, 2019 at 7:45 PM Andres Freund <andres@anarazel.de> wrote:
>> Hi,
>>
>> On 2019-08-28 19:09:40 +0200, Magnus Hagander wrote:
>>> It blocks /list/ which has the subjects only.
>>
>> Yea. But there's no way to actually get to all the individual messages
>> without /list/? Sure, some will be linked to from somewhere else, but
>> without the content below /list/, most won't be reached?
>
> That is indeed a good point. But it has been that way for many years, so
> something must've changed. We last modified this in 2013....

Hm. I guess it's possible that most pages were found due to the next/prev
links in individual messages, once one of them is linked from somewhere
externally. Any chance there's enough logs around to see from where to
where the indexers currently move?

> I wonder if we can inject these into Google using a sitemap. I think that
> should work -- will need some investigation on exactly how to do it, as
> sitemaps also have individual restrictions on the number of urls per file,
> and we do have quite a few messages.

Hm. You mean in addition to allowing /list/ or solely?

>> Why is that /list/ exclusion there in the first place?
>
> Because there are basically infinite number of pages in that space, due to
> the fact that you can pick an arbitrary point in time to view from.

You mean because of the per-day links, that aren't really per-day? I
think the number of links due to that would still be manageable traffic
wise? Or are they that expensive to compute? Perhaps we could make the
"jump to day" links smarter in some way? Perhaps by not including
content for the following days in the per-day pages?

Greetings,

Andres Freund
Hi,

On 2019-08-29 09:32:35 -0400, Alvaro Herrera wrote:
> On 2019-Aug-29, Magnus Hagander wrote:
>
>> Maybe Google used to load the pages under /list/ and crawl them for links
>> but just not include the actual pages in the index or something
>>
>> I wonder if we can inject these into Google using a sitemap. I think that
>> should work -- will need some investigation on exactly how to do it, as
>> sitemaps also have individual restrictions on the number of urls per file,
>> and we do have quite a few messages.
>>
>>> Why is that /list/ exclusion there in the first place?
>>
>> Because there are basically infinite number of pages in that space, due to
>> the fact that you can pick an arbitrary point in time to view from.
>
> Maybe we can create a new page that's specifically to be used by
> crawlers, that lists all emails, each only once. Say (unimaginatively)
> /list_crawlers/2019-08/ containing links to all emails of all public
> lists occurring during August 2019.

Hm. Weren't there occasionally downranking rules for pages that were
clearly aimed just at search engines?

Honestly, I find the current navigation with the overlapping content to be
not great for humans too, so I think it might be worthwhile to rather
improve the general navigation and allow robots for /list/. But if that's
too much / not well specified enough: perhaps we could mark the per-day
links as rel=nofollow, but not the prev/next links when starting at
certain boundaries?

Greetings,

Andres Freund
> On 29 Aug 2019, at 16:55, Andres Freund <andres@anarazel.de> wrote:
>
> On 2019-08-29 09:32:35 -0400, Alvaro Herrera wrote:
>> On 2019-Aug-29, Magnus Hagander wrote:
>>
>>> Maybe Google used to load the pages under /list/ and crawl them for links
>>> but just not include the actual pages in the index or something
>>>
>>> I wonder if we can inject these into Google using a sitemap. I think that
>>> should work -- will need some investigation on exactly how to do it, as
>>> sitemaps also have individual restrictions on the number of urls per file,
>>> and we do have quite a few messages.
>>>
>>>> Why is that /list/ exclusion there in the first place?
>>>
>>> Because there are basically infinite number of pages in that space, due to
>>> the fact that you can pick an arbitrary point in time to view from.
>>
>> Maybe we can create a new page that's specifically to be used by
>> crawlers, that lists all emails, each only once. Say (unimaginatively)
>> /list_crawlers/2019-08/ containing links to all emails of all public
>> lists occurring during August 2019.
>
> Hm. Weren't there occasionally downranking rules for pages that were
> clearly aimed just at search engines?

I think that's mainly been for pages which are clearly keyword spamming; I
doubt our content would get caught there. The sitemap, as proposed
upthread, is the solution to this, and it is also Google's recommended
approach for sites with lots of content.

Google does, however, explicitly downrank duplicated/similar content, or
content which can be reached via multiple URLs and which doesn't list a
canonical URL in the page. A single message and the whole-thread link do
contain the same content, and neither is canonical, so we might be
incurring penalties from that. Also, the postgr.es/m/ shortener makes
content available via two URLs, without a canonical URL specified.
That being said, since we haven't changed anything, and DuckDuckGo happily
indexes the mailing list posts, this smells a lot more like a policy change
than a technical change, if my experience with Google SEO is anything to go
by. The Webmaster Tools Search Console can quite often give insights as to
why a page is missing; that's probably a better place to start than
second-guessing Google SEO. AFAICR, using it requires proving that one owns
the site/domain, but doesn't require adding any Google trackers or similar
things.

cheers ./daniel
On Fri, Aug 30, 2019 at 11:40 AM Daniel Gustafsson <daniel@yesql.se> wrote:
> On 29 Aug 2019, at 16:55, Andres Freund <andres@anarazel.de> wrote:
>> On 2019-08-29 09:32:35 -0400, Alvaro Herrera wrote:
>>> On 2019-Aug-29, Magnus Hagander wrote:
>>>
>>>> Maybe Google used to load the pages under /list/ and crawl them for links
>>>> but just not include the actual pages in the index or something
>>>>
>>>> I wonder if we can inject these into Google using a sitemap. I think that
>>>> should work -- will need some investigation on exactly how to do it, as
>>>> sitemaps also have individual restrictions on the number of urls per file,
>>>> and we do have quite a few messages.
>>>>
>>>>> Why is that /list/ exclusion there in the first place?
>>>>
>>>> Because there are basically infinite number of pages in that space, due to
>>>> the fact that you can pick an arbitrary point in time to view from.
>>>
>>> Maybe we can create a new page that's specifically to be used by
>>> crawlers, that lists all emails, each only once. Say (unimaginatively)
>>> /list_crawlers/2019-08/ containing links to all emails of all public
>>> lists occurring during August 2019.
>>
>> Hm. Weren't there occasionally downranking rules for pages that were
>> clearly aimed just at search engines?
>
> I think that's mainly been for pages which are clearly keyword spamming; I
> doubt our content would get caught there. The sitemap, as proposed
> upthread, is the solution to this, and it is also Google's recommended
> approach for sites with lots of content.
>
> Google does, however, explicitly downrank duplicated/similar content, or
> content which can be reached via multiple URLs and which doesn't list a
> canonical URL in the page. A single message and the whole-thread link do
> contain the same content, and neither is canonical, so we might be
> incurring penalties from that. Also, the postgr.es/m/ shortener makes
> content available via two URLs, without a canonical URL specified.

But robots.txt blocks the whole-thread view (and this is the reason for it).
And postgr.es/m/ does not actually make the content available there, it
redirects. So I don't think those should actually have an effect?

> That being said, since we haven't changed anything, and DuckDuckGo happily
> indexes the mailing list posts, this smells a lot more like a policy change
> than a technical change, if my experience with Google SEO is anything to go
> by. The Webmaster Tools Search Console can quite often give insights as to
> why a page is missing; that's probably a better place to start than
> second-guessing Google SEO. AFAICR, using it requires proving that one owns
> the site/domain, but doesn't require adding any Google trackers or similar
> things.
I've tried but failed to get any relevant data out of it. It does clearly show large amounts of URLs blocked because they are in /flat/ or /raw/, but nothing at all about the regular messages.
> On 30 Aug 2019, at 12:08, Magnus Hagander <magnus@hagander.net> wrote:
>
> On Fri, Aug 30, 2019 at 11:40 AM Daniel Gustafsson <daniel@yesql.se> wrote:
>> On 29 Aug 2019, at 16:55, Andres Freund <andres@anarazel.de> wrote:
>>> On 2019-08-29 09:32:35 -0400, Alvaro Herrera wrote:
>>>> On 2019-Aug-29, Magnus Hagander wrote:
>>>>
>>>>> Maybe Google used to load the pages under /list/ and crawl them for links
>>>>> but just not include the actual pages in the index or something
>>>>>
>>>>> I wonder if we can inject these into Google using a sitemap. I think that
>>>>> should work -- will need some investigation on exactly how to do it, as
>>>>> sitemaps also have individual restrictions on the number of urls per file,
>>>>> and we do have quite a few messages.
>>>>>
>>>>>> Why is that /list/ exclusion there in the first place?
>>>>>
>>>>> Because there are basically infinite number of pages in that space, due to
>>>>> the fact that you can pick an arbitrary point in time to view from.
>>>>
>>>> Maybe we can create a new page that's specifically to be used by
>>>> crawlers, that lists all emails, each only once. Say (unimaginatively)
>>>> /list_crawlers/2019-08/ containing links to all emails of all public
>>>> lists occurring during August 2019.
>>>
>>> Hm. Weren't there occasionally downranking rules for pages that were
>>> clearly aimed just at search engines?
>>
>> I think that's mainly been for pages which are clearly keyword spamming; I
>> doubt our content would get caught there. The sitemap, as proposed
>> upthread, is the solution to this, and it is also Google's recommended
>> approach for sites with lots of content.
>>
>> Google does, however, explicitly downrank duplicated/similar content, or
>> content which can be reached via multiple URLs and which doesn't list a
>> canonical URL in the page. A single message and the whole-thread link do
>> contain the same content, and neither is canonical, so we might be
>> incurring penalties from that. Also, the postgr.es/m/ shortener makes
>> content available via two URLs, without a canonical URL specified.
>
> But robots.txt blocks the whole-thread view (and this is the reason for it).

Maybe that's part of the explanation, since Google no longer wants sites to
use robots.txt for restricting crawlers on what to index (contrary to much
indexing advice, which is vague at best, they actually say so explicitly)?
Being in robots.txt doesn't restrict a page from being indexed if it is
linked to from somewhere else with enough context (for example, if a thread
is reproduced on a forum with a link to /message-id/raw). Their recommended
way is to mark the page with noindex:

    <meta name="robots" content="noindex" />

> And postgr.es/m/ does not actually make the content available there, it
> redirects.

Right, but a 301 redirect is considered by Google as deprecating the old
page, which may or may not throw the indexer off, since we continue to use
postgr.es/m/ without a canonicalization?

> So I don't think those should actually have an effect?

That could very well be true; as with most things SEO, it's all a guessing
game.

> That being said, since we haven't changed anything, and DuckDuckGo happily
> indexes the mailing list posts, this smells a lot more like a policy change
> than a technical change, if my experience with Google SEO is anything to go
> by. The Webmaster Tools Search Console can quite often give insights as to
> why a page is missing; that's probably a better place to start than
> second-guessing Google SEO. AFAICR, using it requires proving that one owns
> the site/domain, but doesn't require adding any Google trackers or similar
> things.
>
> I've tried but failed to get any relevant data out of it. It does clearly
> show large amounts of URLs blocked because they are in /flat/ or /raw/, but
> nothing at all about the regular messages.

That's disappointing; I've gotten quite good advice there in the past.

cheers ./daniel
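Daniel's two recommendations taken together (one canonical URL per message, and noindex rather than robots.txt for pages that should stay out of the index) would look roughly like this in a message page's `<head>`; this is illustrative markup only, not pgweb's actual template, and the message-id in the URL is made up:

```html
<head>
  <!-- Hypothetical example: declare one canonical URL so that aliases
       (the postgr.es/m/ shortener, alternate views) collapse into it -->
  <link rel="canonical"
        href="https://www.postgresql.org/message-id/example-message-id" />
  <!-- For views that should stay unindexed (e.g. /flat/), Google's
       recommended mechanism is noindex; unlike a robots.txt Disallow,
       it still applies when the page is linked from elsewhere -->
  <meta name="robots" content="noindex" />
</head>
```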
Hi,

This got brought up again in a Twitter discussion, see
https://twitter.com/AndresFreundTec/status/1403418002951794688

On 2019-08-29 07:50:13 -0700, Andres Freund wrote:
>>> Why is that /list/ exclusion there in the first place?
>>
>> Because there are basically infinite number of pages in that space, due to
>> the fact that you can pick an arbitrary point in time to view from.
>
> You mean because of the per-day links, that aren't really per-day? I
> think the number of links due to that would still be manageable traffic
> wise? Or are they that expensive to compute? Perhaps we could make the
> "jump to day" links smarter in some way? Perhaps by not including
> content for the following days in the per-day pages?

I still don't understand why all of /list/ is in robots.txt. I understand
why we don't necessarily want to index /list/.../since/..., but prohibiting
all of /list/ seems like an extremely poorly aimed big hammer. Can't we use
wildcards to at least allow everything but the /since/ links? E.g.
Disallow: /list/*/since/*. Or is it because some less common crawler
doesn't implement wildcards at all?

Or slap rel=nofollow on links / add a meta tag preventing /since/ pages
from being indexed.

Yes, that'd not be perfect for the bigger lists, because there's no
"direct" way to get from the month's archive to all the month's emails
when paginated. But there's still the next/prev links. And it'd be much
better than what we have right now.

Greetings,

Andres Freund
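For reference, the wildcard variant suggested here would look something like the following in robots.txt. This is a sketch only: wildcard matching in Disallow is a de-facto extension honored by the major crawlers (Google, Bing) rather than part of the original robots exclusion standard, and the real pgweb rules may differ:

```
# Sketch: open /list/ to crawlers, but keep the unbounded
# arbitrary-point-in-time views out of the crawl.
User-agent: *
Disallow: /list/*/since/
```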