Discussion: Re: no mailing list hits in google


Re: no mailing list hits in google

From:
Tom Lane
Date:
Merlin Moncure <mmoncure@gmail.com> writes:
> [apologies if this is the incorrect list or is already discussed material]

It's not the right list; redirecting to pgsql-www.

> I've noticed that mailing list discussions in -hackers and other
> mailing lists appear to not be indexed by google -- at all.  We are
> also not being tracked by any mailing list aggregators -- in contrast
> to a decade ago where we had nabble and other systems to collect and
> organize results (tbh, often better than we do) we are now at an
> extreme disadvantage; mailing list activity was formerly and
> absolutely fantastic research via google to find solutions to obscure
> technical problems in the database.  Limited access to this
> information will directly lead to increased bug reports, lack of
> solution confidence, etc.

> My test case here is the query: pgsql-hackers ExecHashJoinNewBatch
> I was searching out a link to recent bug report for copy/paste into
> corporate email. In the old days this would fire right up but now
> returns no hits even though the discussion is available in the
> archives (which I had to find by looking up the specific day the
> thread was active).  Just a heads up.

Hm.  When I try googling that, the first thing I get is

    pgsql-hackers - PostgreSQL

    https://www.postgresql.org › list › pgsql-hackers
    No information is available for this page.
    Learn why

and the "learn why" link says that "You are seeing this result because the
page is blocked by a robots.txt file on your website."

So somebody has blocked the archives from being indexed.
Seems like a bad idea.
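
The file in question is public, so the rules can be checked directly:

    curl -s https://www.postgresql.org/robots.txt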

            regards, tom lane



Re: no mailing list hits in google

From:
Magnus Hagander
Date:
On Wed, Aug 28, 2019 at 6:51 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Merlin Moncure <mmoncure@gmail.com> writes:
> > [apologies if this is the incorrect list or is already discussed material]
>
> It's not the right list; redirecting to pgsql-www.
>
> > I've noticed that mailing list discussions in -hackers and other
> > mailing lists appear to not be indexed by google -- at all.  We are
> > also not being tracked by any mailing list aggregators -- in contrast
> > to a decade ago where we had nabble and other systems to collect and
> > organize results (tbh, often better than we do) we are now at an
> > extreme disadvantage; mailing list activity was formerly and
> > absolutely fantastic research via google to find solutions to obscure
> > technical problems in the database.  Limited access to this
> > information will directly lead to increased bug reports, lack of
> > solution confidence, etc.
>
> > My test case here is the query: pgsql-hackers ExecHashJoinNewBatch
> > I was searching out a link to recent bug report for copy/paste into
> > corporate email. In the old days this would fire right up but now
> > returns no hits even though the discussion is available in the
> > archives (which I had to find by looking up the specific day the
> > thread was active).  Just a heads up.
>
> Hm.  When I try googling that, the first thing I get is
>
>         pgsql-hackers - PostgreSQL
>
>         https://www.postgresql.org › list › pgsql-hackers
>         No information is available for this page.
>         Learn why
>
> and the "learn why" link says that "You are seeing this result because the
> page is blocked by a robots.txt file on your website."
>
> So somebody has blocked the archives from being indexed.
> Seems like a bad idea.

It blocks /list/ which has the subjects only. The actual emails in /message-id/ are not blocked by robots.txt.  I don't know why they stopped appearing in the searches... Nothing has been changed around that for many years from *our* side. 
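
For reference, a robots.txt with the behaviour described here would look roughly like this (a sketch for illustration; the live file may differ in its exact rules):

    User-agent: *
    Disallow: /list/
    Disallow: /message-id/flat/
    Disallow: /message-id/raw/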

--

Re: no mailing list hits in google

From:
Andres Freund
Date:
Hi,

On 2019-08-28 19:09:40 +0200, Magnus Hagander wrote:
> It blocks /list/ which has the subjects only.

Yea. But there's no way to actually get to all the individual messages
without /list/? Sure, some will be linked to from somewhere else, but
without the content below /list/, most won't be reached?

Why is that /list/ exclusion there in the first place?


> Nothing has been changed around that for many years from *our* side.

Any chance that there previously still was an archives.postgresql.org
view or such that allowed to reach the individual messages without being
blocked by robots.txt?

Greetings,

Andres Freund



Re: no mailing list hits in google

From:
Tom Lane
Date:
Magnus Hagander <magnus@hagander.net> writes:
> It blocks /list/ which has the subjects only. The actual emails in
> /message-id/ are not blocked by robots.txt.  I don't know why they stopped
> appearing in the searches... Nothing has been changed around that for many
> years from *our* side.

If I go to

https://www.postgresql.org/message-id/

I get a page saying "Not Found".  So I'm not clear on how a web crawler
would descend through that to individual messages.

Even if it looks different to a robot, what would it look like exactly?
A flat space of umpteen zillion immediate-child pages?  It seems not
improbable that Google's search engine would intentionally decide not to
index that, or unintentionally just fail due to some internal resource
limit.  (This theory can explain why it used to work and no longer does:
we got past whatever the limit is.)

Andres' idea of allowing access to /list/ would allow the archives to be
traversed in more bite-size pieces, which might fix the issue.

            regards, tom lane



Re: no mailing list hits in google

From:
Magnus Hagander
Date:
On Wed, Aug 28, 2019 at 7:45 PM Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2019-08-28 19:09:40 +0200, Magnus Hagander wrote:
> > It blocks /list/ which has the subjects only.
>
> Yea. But there's no way to actually get to all the individual messages
> without /list/? Sure, some will be linked to from somewhere else, but
> without the content below /list/, most won't be reached?

That is indeed a good point. But it has been that way for many years, so something must've changed.   We last modified this in 2013....

Maybe Google used to load the pages under /list/ and crawl them for links but just not include the actual pages in the index or something

I wonder if we can inject these into Google using a sitemap. I think that should work -- will need some investigation on exactly how to do it, as sitemaps also have individual restrictions on the number of urls per file, and we do have quite a few messages.
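
For scale: the sitemap protocol caps each sitemap file at 50,000 URLs and 50 MB uncompressed, so the archives would need a sitemap index pointing at a series of smaller files, along these lines (file names are illustrative, not an existing layout):

    <?xml version="1.0" encoding="UTF-8"?>
    <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <sitemap>
        <loc>https://www.postgresql.org/sitemap-messages-0001.xml</loc>
      </sitemap>
      <sitemap>
        <loc>https://www.postgresql.org/sitemap-messages-0002.xml</loc>
      </sitemap>
    </sitemapindex>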


> Why is that /list/ exclusion there in the first place?

Because there are basically infinite number of pages in that space, due to the fact that you can pick an arbitrary point in time to view from.


> > Nothing has been changed around that for many years from *our* side.
>
> Any chance that there previously still was an archives.postgresql.org
> view or such that allowed to reach the individual messages without being
> blocked by robots.txt?

That one had a robots.txt blocking this going back even further in time. 

--

Re: no mailing list hits in google

From:
Alvaro Herrera
Date:
On 2019-Aug-29, Magnus Hagander wrote:

> Maybe Google used to load the pages under /list/ and crawl them for links
> but just not include the actual pages in the index or something
> 
> I wonder if we can inject these into Google using a sitemap. I think that
> should work -- will need some investigation on exactly how to do it, as
> sitemaps also have individual restrictions on the number of urls per file,
> and we do have quite a few messages.
> 
> > Why is that /list/ exclusion there in the first place?
> 
> Because there are basically infinite number of pages in that space, due to
> the fact that you can pick an arbitrary point in time to view from.

Maybe we can create a new page that's specifically to be used by
crawlers, that lists all emails, each only once.  Say (unimaginatively)
/list_crawlers/2019-08/ containing links to all emails of all public
lists occurring during August 2019.
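
A generator for such a page (or for the sitemap files discussed above) would be short; this is a sketch only, assuming a hypothetical messages table with messageid and date columns rather than the real archives schema:

    # Sketch: print every message URL for one month.
    # Table and column names here are assumptions, not the actual schema.
    import psycopg2

    conn = psycopg2.connect("dbname=archives")
    cur = conn.cursor()
    cur.execute(
        "SELECT messageid FROM messages "
        "WHERE date >= %s AND date < %s ORDER BY date",
        ("2019-08-01", "2019-09-01"),
    )
    for (messageid,) in cur:
        print("https://www.postgresql.org/message-id/%s" % messageid)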

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: no mailing list hits in google

From:
Magnus Hagander
Date:

On Thu, Aug 29, 2019 at 3:32 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> On 2019-Aug-29, Magnus Hagander wrote:
>
> > Maybe Google used to load the pages under /list/ and crawl them for links
> > but just not include the actual pages in the index or something
> >
> > I wonder if we can inject these into Google using a sitemap. I think that
> > should work -- will need some investigation on exactly how to do it, as
> > sitemaps also have individual restrictions on the number of urls per file,
> > and we do have quite a few messages.
> >
> > > Why is that /list/ exclusion there in the first place?
> >
> > Because there are basically infinite number of pages in that space, due to
> > the fact that you can pick an arbitrary point in time to view from.
>
> Maybe we can create a new page that's specifically to be used by
> crawlers, that lists all emails, each only once.  Say (unimaginatively)
> /list_crawlers/2019-08/ containing links to all emails of all public
> lists occurring during August 2019.


That's pretty much what I'm suggesting, but using a sitemap so it's directly injected.

//Magnus
 

Re: no mailing list hits in google

From:
Andres Freund
Date:
Hi,

On 2019-08-29 13:12:00 +0200, Magnus Hagander wrote:
> On Wed, Aug 28, 2019 at 7:45 PM Andres Freund <andres@anarazel.de> wrote:
> 
> > Hi,
> >
> > On 2019-08-28 19:09:40 +0200, Magnus Hagander wrote:
> > > It blocks /list/ which has the subjects only.
> >
> > Yea. But there's no way to actually get to all the individual messages
> > without /list/? Sure, some will be linked to from somewhere else, but
> > without the content below /list/, most won't be reached?
> >
> 
> That is indeed a good point. But it has been that way for many years, so
> something must've changed.   We last modified this in 2013....

Hm. I guess it's possible that most pages were found due to the
next/prev links in individual messages, once one of them is linked from
somewhere externally.  Any chance there are enough logs around to see
from where to where the indexers currently move?


> I wonder if we can inject these into Google using a sitemap. I think that
> should work -- will need some investigation on exactly how to do it, as
> sitemaps also have individual restrictions on the number of urls per file,
> and we do have quite a few messages.

Hm. You mean in addition to allowing /list/ or solely?


> > Why is that /list/ exclusion there in the first place?

> Because there are basically infinite number of pages in that space, due to
> the fact that you can pick an arbitrary point in time to view from.

You mean because of the per-day links, that aren't really per-day? I
think the number of links due to that would still be manageable traffic
wise? Or are they that expensive to compute?  Perhaps we could make the
"jump to day" links smarter in some way? Perhaps by not including
content for the following days in the per-day pages?

Greetings,

Andres Freund



Re: no mailing list hits in google

From:
Andres Freund
Date:
Hi,

On 2019-08-29 09:32:35 -0400, Alvaro Herrera wrote:
> On 2019-Aug-29, Magnus Hagander wrote:
> 
> > Maybe Google used to load the pages under /list/ and crawl them for links
> > but just not include the actual pages in the index or something
> > 
> > I wonder if we can inject these into Google using a sitemap. I think that
> > should work -- will need some investigation on exactly how to do it, as
> > sitemaps also have individual restrictions on the number of urls per file,
> > and we do have quite a few messages.
> > 
> > > Why is that /list/ exclusion there in the first place?
> > 
> > Because there are basically infinite number of pages in that space, due to
> > the fact that you can pick an arbitrary point in time to view from.
> 
> Maybe we can create a new page that's specifically to be used by
> crawlers, that lists all emails, each only once.  Say (unimaginatively)
> /list_crawlers/2019-08/ containing links to all emails of all public
> lists occurring during August 2019.

Hm. Weren't there occasionally downranking rules for pages that were
clearly aimed just at search engines? Honestly, I find the current
navigation with the overlapping content to be not great for humans either,
so I think it might be worthwhile to improve the general navigation
instead, and allow robots for /list/.  But if that's too much / not well
specified enough: perhaps we could mark the per-day links as
rel=nofollow, but not the prev/next links when starting at certain
boundaries?
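
Marking the per-day links that way would be a one-attribute change on the anchors, e.g. (URL shape illustrative):

    <a href="/list/pgsql-hackers/since/20190828" rel="nofollow">28 August</a>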

Greetings,

Andres Freund



Re: no mailing list hits in google

From:
Daniel Gustafsson
Date:
> On 29 Aug 2019, at 16:55, Andres Freund <andres@anarazel.de> wrote:
> On 2019-08-29 09:32:35 -0400, Alvaro Herrera wrote:
>> On 2019-Aug-29, Magnus Hagander wrote:
>>
>>> Maybe Google used to load the pages under /list/ and crawl them for links
>>> but just not include the actual pages in the index or something
>>>
>>> I wonder if we can inject these into Google using a sitemap. I think that
>>> should work -- will need some investigation on exactly how to do it, as
>>> sitemaps also have individual restrictions on the number of urls per file,
>>> and we do have quite a few messages.
>>>
>>>> Why is that /list/ exclusion there in the first place?
>>>
>>> Because there are basically infinite number of pages in that space, due to
>>> the fact that you can pick an arbitrary point in time to view from.
>>
>> Maybe we can create a new page that's specifically to be used by
>> crawlers, that lists all emails, each only once.  Say (unimaginatively)
>> /list_crawlers/2019-08/ containing links to all emails of all public
>> lists occurring during August 2019.
>
> Hm. Weren't there occasionally downranking rules for pages that were
> clearly aimed just at search engines?

I think that’s mainly been for pages which are clearly keyword spamming; I
doubt our content would get caught there.  The sitemap, as proposed upthread,
is the solution to this, however, and is also the recommended way from Google for
sites with lots of content.

Google does, however, explicitly downrank duplicated/similar content, or content
which can be reached via multiple URLs and which doesn’t list a canonical URL
in the page.  A single message and the whole-thread link contain the same
content, and neither is canonical, so we might be incurring penalties from
that.  Also, the postgr.es/m/ shortener makes content available via two URLs,
without a canonical URL specified.
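
Declaring a canonical is a single element in each page's head; a sketch with a placeholder message-id:

    <link rel="canonical" href="https://www.postgresql.org/message-id/MESSAGE-ID" />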

That being said, since we haven’t changed anything, and DuckDuckGo happily
indexes the mailing list posts, this smells a lot more like a policy change than a
technical change if my experience with Google SEO is anything to go by.  The
Webmaster Tools Search Console can quite often give insights as to why a page
is missing; that’s probably a better place to start than second-guessing Google
SEO.  AFAICR, using that requires proving that one owns the site/domain, but
doesn’t require adding any Google trackers or similar things.

cheers ./daniel


Re: no mailing list hits in google

From:
Magnus Hagander
Date:


On Fri, Aug 30, 2019 at 11:40 AM Daniel Gustafsson <daniel@yesql.se> wrote:
> > On 29 Aug 2019, at 16:55, Andres Freund <andres@anarazel.de> wrote:
> > On 2019-08-29 09:32:35 -0400, Alvaro Herrera wrote:
> >> On 2019-Aug-29, Magnus Hagander wrote:
> >>
> >>> Maybe Google used to load the pages under /list/ and crawl them for links
> >>> but just not include the actual pages in the index or something
> >>>
> >>> I wonder if we can inject these into Google using a sitemap. I think that
> >>> should work -- will need some investigation on exactly how to do it, as
> >>> sitemaps also have individual restrictions on the number of urls per file,
> >>> and we do have quite a few messages.
> >>>
> >>>> Why is that /list/ exclusion there in the first place?
> >>>
> >>> Because there are basically infinite number of pages in that space, due to
> >>> the fact that you can pick an arbitrary point in time to view from.
> >>
> >> Maybe we can create a new page that's specifically to be used by
> >> crawlers, that lists all emails, each only once.  Say (unimaginatively)
> >> /list_crawlers/2019-08/ containing links to all emails of all public
> >> lists occurring during August 2019.
> >
> > Hm. Weren't there occasionally downranking rules for pages that were
> > clearly aimed just at search engines?
>
> I think that’s mainly been for pages which are clearly keyword spamming; I
> doubt our content would get caught there.  The sitemap, as proposed upthread,
> is the solution to this, however, and is also the recommended way from Google for
> sites with lots of content.
>
> Google does, however, explicitly downrank duplicated/similar content, or content
> which can be reached via multiple URLs and which doesn’t list a canonical URL
> in the page.  A single message and the whole-thread link contain the same
> content, and neither is canonical, so we might be incurring penalties from
> that.  Also, the postgr.es/m/ shortener makes content available via two URLs,
> without a canonical URL specified.

But robots.txt blocks the whole-thread view (and this is the reason for it).
And postgr.es/m/ does not actually make the content available there; it redirects.

So I don't think those should actually have an effect?


> That being said, since we haven’t changed anything, and DuckDuckGo happily
> indexes the mailing list posts, this smells a lot more like a policy change than a
> technical change if my experience with Google SEO is anything to go by.  The
> Webmaster Tools Search Console can quite often give insights as to why a page
> is missing; that’s probably a better place to start than second-guessing Google
> SEO.  AFAICR, using that requires proving that one owns the site/domain, but
> doesn’t require adding any Google trackers or similar things.

I've tried but failed to get any relevant data out of it. It does clearly show large numbers of URLs blocked because they are in /flat/ or /raw/, but nothing at all about the regular messages.

--

Re: no mailing list hits in google

From:
Daniel Gustafsson
Date:
> On 30 Aug 2019, at 12:08, Magnus Hagander <magnus@hagander.net> wrote:
> On Fri, Aug 30, 2019 at 11:40 AM Daniel Gustafsson <daniel@yesql.se> wrote:
> > On 29 Aug 2019, at 16:55, Andres Freund <andres@anarazel.de> wrote:
> > On 2019-08-29 09:32:35 -0400, Alvaro Herrera wrote:
> >> On 2019-Aug-29, Magnus Hagander wrote:
> >>
> >>> Maybe Google used to load the pages under /list/ and crawl them for links
> >>> but just not include the actual pages in the index or something
> >>>
> >>> I wonder if we can inject these into Google using a sitemap. I think that
> >>> should work -- will need some investigation on exactly how to do it, as
> >>> sitemaps also have individual restrictions on the number of urls per file,
> >>> and we do have quite a few messages.
> >>>
> >>>> Why is that /list/ exclusion there in the first place?
> >>>
> >>> Because there are basically infinite number of pages in that space, due to
> >>> the fact that you can pick an arbitrary point in time to view from.
> >>
> >> Maybe we can create a new page that's specifically to be used by
> >> crawlers, that lists all emails, each only once.  Say (unimaginatively)
> >> /list_crawlers/2019-08/ containing links to all emails of all public
> >> lists occurring during August 2019.
> >
> > Hm. Weren't there occasionally downranking rules for pages that were
> > clearly aimed just at search engines?
>
> I think that’s mainly been for pages which are clearly keyword spamming; I
> doubt our content would get caught there.  The sitemap, as proposed upthread,
> is the solution to this, however, and is also the recommended way from Google for
> sites with lots of content.
>
> Google does, however, explicitly downrank duplicated/similar content, or content
> which can be reached via multiple URLs and which doesn’t list a canonical URL
> in the page.  A single message and the whole-thread link contain the same
> content, and neither is canonical, so we might be incurring penalties from
> that.  Also, the postgr.es/m/ shortener makes content available via two URLs,
> without a canonical URL specified.
>
> But robots.txt blocks the whole-thread view (and this is the reason for it).

Maybe that’s part of the explanation, since Google no longer wants sites to use
robots.txt for restricting crawlers on what to index (contrary to much indexing
advice, which is vague at best, they actually say so explicitly)?  Being in
robots.txt doesn’t restrict a page from being indexed if it is linked to
from somewhere else with enough context etc. (for example, if a thread is
reproduced on a forum with a link to /message-id/raw).  Their recommended way
is to mark the page with noindex:

    <meta name="robots" content="noindex" />
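
Google also honours the same directive as an HTTP response header, which works for non-HTML resources (such as raw message downloads) where a meta tag is not possible:

    X-Robots-Tag: noindex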

> And postgr.es/m/ does not actually make the content available there; it redirects.

Right, but a 301 redirect is considered by Google as deprecating the old page,
which may or may not throw the indexer off since we continue to use
postgr.es/m/ without a canonicalization?

> So I don't think those should actually have an effect?

That could very well be true, as with most things SEO it’s all a guessing game.

> That being said, since we haven’t changed anything, and DuckDuckGo happily
> indexes the mailing list posts, this smells a lot more like a policy change than a
> technical change if my experience with Google SEO is anything to go by.  The
> Webmaster Tools Search Console can quite often give insights as to why a page
> is missing; that’s probably a better place to start than second-guessing Google
> SEO.  AFAICR, using that requires proving that one owns the site/domain, but
> doesn’t require adding any Google trackers or similar things.
>
> I've tried but failed to get any relevant data out of it. It does clearly show large numbers of URLs blocked because
> they are in /flat/ or /raw/, but nothing at all about the regular messages.

That’s disappointing; I’ve gotten quite good advice there in the past.

cheers ./daniel


Re: no mailing list hits in google

From:
Andres Freund
Date:
Hi,

This got brought up again in a twitter discussion; see
https://twitter.com/AndresFreundTec/status/1403418002951794688

On 2019-08-29 07:50:13 -0700, Andres Freund wrote:
> > > Why is that /list/ exclusion there in the first place?
> 
> > Because there are basically infinite number of pages in that space, due to
> > the fact that you can pick an arbitrary point in time to view from.
> 
> You mean because of the per-day links, that aren't really per-day? I
> think the number of links due to that would still be manageable traffic
> wise? Or are they that expensive to compute?  Perhaps we could make the
> "jump to day" links smarter in some way? Perhaps by not including
> content for the following days in the per-day pages?

I still don't understand why all of /list/ is in robots.txt. I
understand why we don't necessarily want to index /list/.../since/...,
but prohibiting all of /list/ seems like an extremely poorly aimed
big hammer.

Can't we use wildcards to at least allow everything but the /since/
links? E.g. Disallow: /list/*/since/*. Is it because we're worried that
some less common crawler doesn't implement wildcards at all?
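
Concretely, a narrower rule would look like this (Googlebot and Bingbot both honour the * wildcard, though the original robots.txt standard does not guarantee it):

    User-agent: *
    Disallow: /list/*/since/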

Or slap rel=nofollow on links / add a meta tag preventing /since/ pages
from being indexed.

Yes, that'd not be perfect for the bigger lists, because there's no
"direct" way to get from the month's archive to all the month's emails
when paginated. But there are still the next/prev links. And it'd be much
better than what we have right now.

Greetings,

Andres Freund