Discussion: Re: no mailing list hits in google
Merlin Moncure <mmoncure@gmail.com> writes:
> [apologies if this is the incorrect list or is already discussed material]

It's not the right list; redirecting to pgsql-www.

> I've noticed that mailing list discussions in -hackers and other
> mailing lists appear to not be indexed by google -- at all. We are
> also not being tracked by any mailing list aggregators -- in contrast
> to a decade ago where we had nabble and other systems to collect and
> organize results (tbh, often better than we do) we are now at an
> extreme disadvantage; mailing list activity was formerly and
> absolutely fantastic research via google to find solutions to obscure
> technical problems in the database. Limited access to this
> information will directly lead to increased bug reports, lack of
> solution confidence, etc.
>
> My test case here is the query: pgsql-hackers ExecHashJoinNewBatch
>
> I was searching out a link to recent bug report for copy/paste into
> corporate email. In the old days this would fire right up but now
> returns no hits even though the discussion is available in the
> archives (which I had to find by looking up the specific day the
> thread was active). Just a heads up.

Hm. When I try googling that, the first thing I get is

    pgsql-hackers - PostgreSQL
    https://www.postgresql.org › list › pgsql-hackers
    No information is available for this page.
    Learn why

and the "learn why" link says that "You are seeing this result because the
page is blocked by a robots.txt file on your website."

So somebody has blocked the archives from being indexed.
Seems like a bad idea.

regards, tom lane
On Wed, Aug 28, 2019 at 6:51 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Merlin Moncure <mmoncure@gmail.com> writes:
>> [apologies if this is the incorrect list or is already discussed material]
>
> It's not the right list; redirecting to pgsql-www.
>
>> I've noticed that mailing list discussions in -hackers and other
>> mailing lists appear to not be indexed by google -- at all. We are
>> also not being tracked by any mailing list aggregators -- in contrast
>> to a decade ago where we had nabble and other systems to collect and
>> organize results (tbh, often better than we do) we are now at an
>> extreme disadvantage; mailing list activity was formerly and
>> absolutely fantastic research via google to find solutions to obscure
>> technical problems in the database. Limited access to this
>> information will directly lead to increased bug reports, lack of
>> solution confidence, etc.
>>
>> My test case here is the query: pgsql-hackers ExecHashJoinNewBatch
>>
>> I was searching out a link to recent bug report for copy/paste into
>> corporate email. In the old days this would fire right up but now
>> returns no hits even though the discussion is available in the
>> archives (which I had to find by looking up the specific day the
>> thread was active). Just a heads up.
>
> Hm. When I try googling that, the first thing I get is
>
>     pgsql-hackers - PostgreSQL
>     https://www.postgresql.org › list › pgsql-hackers
>     No information is available for this page.
>     Learn why
>
> and the "learn why" link says that "You are seeing this result because the
> page is blocked by a robots.txt file on your website."
>
> So somebody has blocked the archives from being indexed.
> Seems like a bad idea.
It blocks /list/ which has the subjects only. The actual emails in /message-id/ are not blocked by robots.txt. I don't know why they stopped appearing in the searches... Nothing has been changed around that for many years from *our* side.
Hi,

On 2019-08-28 19:09:40 +0200, Magnus Hagander wrote:
> It blocks /list/ which has the subjects only.

Yea. But there's no way to actually get to all the individual messages
without /list/? Sure, some will be linked to from somewhere else, but
without the content below /list/, most won't be reached?

Why is that /list/ exclusion there in the first place?

> Nothing has been changed around that for many years from *our* side.

Any chance that there previously still was an archives.postgresql.org
view or such that allowed to reach the individual messages without being
blocked by robots.txt?

Greetings,

Andres Freund
Magnus Hagander <magnus@hagander.net> writes:
> It blocks /list/ which has the subjects only. The actual emails in
> /message-id/ are not blocked by robots.txt. I don't know why they stopped
> appearing in the searches... Nothing has been changed around that for many
> years from *our* side.

If I go to https://www.postgresql.org/message-id/ I get a page saying
"Not Found". So I'm not clear on how a web crawler would descend through
that to individual messages. Even if it looks different to a robot, what
would it look like exactly? A flat space of umpteen zillion
immediate-child pages? It seems not improbable that Google's search
engine would intentionally decide not to index that, or unintentionally
just fail due to some internal resource limit. (This theory can explain
why it used to work and no longer does: we got past whatever the limit
is.)

Andres' idea of allowing access to /list/ would allow the archives to be
traversed in more bite-size pieces, which might fix the issue.

regards, tom lane
On Wed, Aug 28, 2019 at 7:45 PM Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2019-08-28 19:09:40 +0200, Magnus Hagander wrote:
>> It blocks /list/ which has the subjects only.
>
> Yea. But there's no way to actually get to all the individual messages
> without /list/? Sure, some will be linked to from somewhere else, but
> without the content below /list/, most won't be reached?

That is indeed a good point. But it has been that way for many years, so
something must've changed. We last modified this in 2013....

Maybe Google used to load the pages under /list/ and crawl them for links
but just not include the actual pages in the index or something.

I wonder if we can inject these into Google using a sitemap. I think that
should work -- will need some investigation on exactly how to do it, as
sitemaps also have individual restrictions on the number of urls per file,
and we do have quite a few messages.

> Why is that /list/ exclusion there in the first place?

Because there is basically an infinite number of pages in that space, due
to the fact that you can pick an arbitrary point in time to view from.

>> Nothing has been changed around that for many years from *our* side.
>
> Any chance that there previously still was an archives.postgresql.org
> view or such that allowed to reach the individual messages without being
> blocked by robots.txt?
That one had a robots.txt blocking this going back even further in time.
On 2019-Aug-29, Magnus Hagander wrote:

> Maybe Google used to load the pages under /list/ and crawl them for links
> but just not include the actual pages in the index or something
>
> I wonder if we can inject these into Google using a sitemap. I think that
> should work -- will need some investigation on exactly how to do it, as
> sitemaps also have individual restrictions on the number of urls per file,
> and we do have quite a few messages.
>
>> Why is that /list/ exclusion there in the first place?
>
> Because there are basically infinite number of pages in that space, due to
> the fact that you can pick an arbitrary point in time to view from.

Maybe we can create a new page that's specifically to be used by
crawlers, that lists all emails, each only once. Say (unimaginatively)
/list_crawlers/2019-08/ containing links to all emails of all public
lists occurring during August 2019.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Aug 29, 2019 at 3:32 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
On 2019-Aug-29, Magnus Hagander wrote:
>> Maybe Google used to load the pages under /list/ and crawl them for links
>> but just not include the actual pages in the index or something
>>
>> I wonder if we can inject these into Google using a sitemap. I think that
>> should work -- will need some investigation on exactly how to do it, as
>> sitemaps also have individual restrictions on the number of urls per file,
>> and we do have quite a few messages.
>>
>>> Why is that /list/ exclusion there in the first place?
>>
>> Because there are basically infinite number of pages in that space, due to
>> the fact that you can pick an arbitrary point in time to view from.
>
> Maybe we can create a new page that's specifically to be used by
> crawlers, that lists all emails, each only once. Say (unimaginatively)
> /list_crawlers/2019-08/ containing links to all emails of all public
> lists occurring during August 2019.
That's pretty much what I'm suggesting but using a sitemap so it's directly injected.
//Magnus
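To make the sitemap idea concrete, here is a rough sketch (hypothetical code, not pgweb's actual implementation; the function and file names are made up) of how message URLs could be split across sitemap files, since the sitemap protocol caps each file at 50,000 URLs and requires a sitemap index above that:

```python
# Sketch only: chunk archive message URLs into sitemap files that respect
# the sitemap protocol's 50,000-URL-per-file limit, plus a sitemap index
# pointing at the chunks. Names and URL layout are illustrative.
from xml.sax.saxutils import escape

SITEMAP_URL_LIMIT = 50_000  # per https://www.sitemaps.org/protocol.html


def build_sitemaps(message_ids, base="https://www.postgresql.org"):
    """Return (index_xml, [sitemap_xml, ...]) covering every message-id."""
    urls = [f"{base}/message-id/{m}" for m in message_ids]
    # Split the flat URL list into protocol-sized chunks.
    chunks = [urls[i:i + SITEMAP_URL_LIMIT]
              for i in range(0, len(urls), SITEMAP_URL_LIMIT)]
    sitemaps = []
    for chunk in chunks:
        entries = "".join(f"<url><loc>{escape(u)}</loc></url>" for u in chunk)
        sitemaps.append(
            '<?xml version="1.0" encoding="UTF-8"?>'
            '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
            f"{entries}</urlset>")
    # The index file lists one <sitemap> entry per generated chunk.
    index_entries = "".join(
        f"<sitemap><loc>{base}/sitemap-messages-{n}.xml</loc></sitemap>"
        for n in range(len(sitemaps)))
    index = ('<?xml version="1.0" encoding="UTF-8"?>'
             '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
             f"{index_entries}</sitemapindex>")
    return index, sitemaps
```

The generated files would then be advertised to crawlers, e.g. via a `Sitemap:` line in robots.txt or by submitting the index in Search Console.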
Hi,

On 2019-08-29 13:12:00 +0200, Magnus Hagander wrote:
> On Wed, Aug 28, 2019 at 7:45 PM Andres Freund <andres@anarazel.de> wrote:
>> Hi,
>>
>> On 2019-08-28 19:09:40 +0200, Magnus Hagander wrote:
>>> It blocks /list/ which has the subjects only.
>>
>> Yea. But there's no way to actually get to all the individual messages
>> without /list/? Sure, some will be linked to from somewhere else, but
>> without the content below /list/, most won't be reached?
>
> That is indeed a good point. But it has been that way for many years, so
> something must've changed. We last modified this in 2013....

Hm. I guess it's possible that most pages were found due to the next/prev
links in individual messages, once one of them is linked from somewhere
externally. Any chance there's enough logs around to see from where to
where the indexers currently move?

> I wonder if we can inject these into Google using a sitemap. I think that
> should work -- will need some investigation on exactly how to do it, as
> sitemaps also have individual restrictions on the number of urls per file,
> and we do have quite a few messages.

Hm. You mean in addition to allowing /list/ or solely?

>> Why is that /list/ exclusion there in the first place?
>
> Because there are basically infinite number of pages in that space, due to
> the fact that you can pick an arbitrary point in time to view from.

You mean because of the per-day links, that aren't really per-day? I
think the number of links due to that would still be manageable traffic
wise? Or are they that expensive to compute? Perhaps we could make the
"jump to day" links smarter in some way? Perhaps by not including
content for the following days in the per-day pages?

Greetings,

Andres Freund
Hi,

On 2019-08-29 09:32:35 -0400, Alvaro Herrera wrote:
> On 2019-Aug-29, Magnus Hagander wrote:
>
>> Maybe Google used to load the pages under /list/ and crawl them for links
>> but just not include the actual pages in the index or something
>>
>> I wonder if we can inject these into Google using a sitemap. I think that
>> should work -- will need some investigation on exactly how to do it, as
>> sitemaps also have individual restrictions on the number of urls per file,
>> and we do have quite a few messages.
>>
>>> Why is that /list/ exclusion there in the first place?
>>
>> Because there are basically infinite number of pages in that space, due to
>> the fact that you can pick an arbitrary point in time to view from.
>
> Maybe we can create a new page that's specifically to be used by
> crawlers, that lists all emails, each only once. Say (unimaginatively)
> /list_crawlers/2019-08/ containing links to all emails of all public
> lists occurring during August 2019.

Hm. Weren't there occasionally downranking rules for pages that were
clearly aimed just at search engines?

Honestly, I find the current navigation with the overlapping content to be
not great for humans too, so I think it might be worthwhile to rather
improve the general navigation and allow robots for /list/. But if that's
too much / not well specified enough: perhaps we could mark the per-day
links as rel=nofollow, but not the prev/next links when starting at
certain boundaries?

Greetings,

Andres Freund
> On 29 Aug 2019, at 16:55, Andres Freund <andres@anarazel.de> wrote:
>
> On 2019-08-29 09:32:35 -0400, Alvaro Herrera wrote:
>> On 2019-Aug-29, Magnus Hagander wrote:
>>
>>> Maybe Google used to load the pages under /list/ and crawl them for links
>>> but just not include the actual pages in the index or something
>>>
>>> I wonder if we can inject these into Google using a sitemap. I think that
>>> should work -- will need some investigation on exactly how to do it, as
>>> sitemaps also have individual restrictions on the number of urls per file,
>>> and we do have quite a few messages.
>>>
>>>> Why is that /list/ exclusion there in the first place?
>>>
>>> Because there are basically infinite number of pages in that space, due to
>>> the fact that you can pick an arbitrary point in time to view from.
>>
>> Maybe we can create a new page that's specifically to be used by
>> crawlers, that lists all emails, each only once. Say (unimaginatively)
>> /list_crawlers/2019-08/ containing links to all emails of all public
>> lists occurring during August 2019.
>
> Hm. Weren't there occasionally downranking rules for pages that were
> clearly aimed just at search engines?

I think that's mainly been for pages which are clearly keyword spamming; I
doubt our content would get caught there. The sitemap, as proposed
upthread, is the solution to this, and it is also Google's recommended
approach for sites with lots of content.

Google does, however, explicitly downrank duplicated/similar content, or
content which can be reached via multiple URLs and which doesn't list a
canonical URL in the page. A single message and the whole-thread link do
contain the same content, and neither is canonical, so we might be
incurring penalties from that. Also, the postgr.es/m/ shortener makes
content available via two URLs, without a canonical URL specified.
That being said, since we haven't changed anything, and DuckDuckGo happily
indexes the mailing list posts, this smells a lot more like a policy change
than a technical change, if my experience with Google SEO is anything to go
by. The Webmaster Tools Search Console can quite often give insights as to
why a page is missing; that's probably a better place to start than
second-guessing Google SEO. AFAICR, using it requires proving that one owns
the site/domain, but doesn't require adding any Google trackers or similar
things.

cheers ./daniel
On Fri, Aug 30, 2019 at 11:40 AM Daniel Gustafsson <daniel@yesql.se> wrote:
> On 29 Aug 2019, at 16:55, Andres Freund <andres@anarazel.de> wrote:
>> On 2019-08-29 09:32:35 -0400, Alvaro Herrera wrote:
>>> On 2019-Aug-29, Magnus Hagander wrote:
>>>
>>>> Maybe Google used to load the pages under /list/ and crawl them for links
>>>> but just not include the actual pages in the index or something
>>>>
>>>> I wonder if we can inject these into Google using a sitemap. I think that
>>>> should work -- will need some investigation on exactly how to do it, as
>>>> sitemaps also have individual restrictions on the number of urls per file,
>>>> and we do have quite a few messages.
>>>>
>>>>> Why is that /list/ exclusion there in the first place?
>>>>
>>>> Because there are basically infinite number of pages in that space, due to
>>>> the fact that you can pick an arbitrary point in time to view from.
>>>
>>> Maybe we can create a new page that's specifically to be used by
>>> crawlers, that lists all emails, each only once. Say (unimaginatively)
>>> /list_crawlers/2019-08/ containing links to all emails of all public
>>> lists occurring during August 2019.
>>
>> Hm. Weren't there occasionally downranking rules for pages that were
>> clearly aimed just at search engines?
>
> I think that's mainly been for pages which are clearly keyword spamming; I
> doubt our content would get caught there. The sitemap, as proposed
> upthread, is the solution to this, and it is also Google's recommended
> approach for sites with lots of content.
>
> Google does, however, explicitly downrank duplicated/similar content, or
> content which can be reached via multiple URLs and which doesn't list a
> canonical URL in the page. A single message and the whole-thread link do
> contain the same content, and neither is canonical, so we might be
> incurring penalties from that. Also, the postgr.es/m/ shortener makes
> content available via two URLs, without a canonical URL specified.

But robots.txt blocks the whole-thread view (and this is the reason for it).
And postgr.es/m/ does not actually make the content available there, it
redirects. So I don't think those should actually have an effect?

> That being said, since we haven't changed anything, and DuckDuckGo happily
> indexes the mailing list posts, this smells a lot more like a policy change
> than a technical change, if my experience with Google SEO is anything to go
> by. The Webmaster Tools Search Console can quite often give insights as to
> why a page is missing; that's probably a better place to start than
> second-guessing Google SEO. AFAICR, using it requires proving that one owns
> the site/domain, but doesn't require adding any Google trackers or similar
> things.
I've tried but failed to get any relevant data out of it. It does clearly show large amounts of URLs blocked because they are in /flat/ or /raw/, but nothing at all about the regular messages.
> On 30 Aug 2019, at 12:08, Magnus Hagander <magnus@hagander.net> wrote:
>
> On Fri, Aug 30, 2019 at 11:40 AM Daniel Gustafsson <daniel@yesql.se> wrote:
>> On 29 Aug 2019, at 16:55, Andres Freund <andres@anarazel.de> wrote:
>>> On 2019-08-29 09:32:35 -0400, Alvaro Herrera wrote:
>>>> On 2019-Aug-29, Magnus Hagander wrote:
>>>>
>>>>> Maybe Google used to load the pages under /list/ and crawl them for links
>>>>> but just not include the actual pages in the index or something
>>>>>
>>>>> I wonder if we can inject these into Google using a sitemap. I think that
>>>>> should work -- will need some investigation on exactly how to do it, as
>>>>> sitemaps also have individual restrictions on the number of urls per file,
>>>>> and we do have quite a few messages.
>>>>>
>>>>>> Why is that /list/ exclusion there in the first place?
>>>>>
>>>>> Because there are basically infinite number of pages in that space, due to
>>>>> the fact that you can pick an arbitrary point in time to view from.
>>>>
>>>> Maybe we can create a new page that's specifically to be used by
>>>> crawlers, that lists all emails, each only once. Say (unimaginatively)
>>>> /list_crawlers/2019-08/ containing links to all emails of all public
>>>> lists occurring during August 2019.
>>>
>>> Hm. Weren't there occasionally downranking rules for pages that were
>>> clearly aimed just at search engines?
>>
>> I think that's mainly been for pages which are clearly keyword spamming; I
>> doubt our content would get caught there. The sitemap, as proposed
>> upthread, is the solution to this, and it is also Google's recommended
>> approach for sites with lots of content.
>>
>> Google does, however, explicitly downrank duplicated/similar content, or
>> content which can be reached via multiple URLs and which doesn't list a
>> canonical URL in the page. A single message and the whole-thread link do
>> contain the same content, and neither is canonical, so we might be
>> incurring penalties from that. Also, the postgr.es/m/ shortener makes
>> content available via two URLs, without a canonical URL specified.
>
> But robots.txt blocks the whole-thread view (and this is the reason for it).

Maybe that's part of the explanation, since Google no longer wants sites to
use robots.txt for restricting crawlers on what to index (contrary to much
indexing advice, which is vague at best, they actually say so explicitly)?
Being in robots.txt doesn't restrict a page from being indexed if it is
linked to from somewhere else with enough context (for example, if a thread
is reproduced on a forum with a link to /message-id/raw). Their recommended
way is to mark the page with noindex:

    <meta name="robots" content="noindex" />

> And postgr.es/m/ does not actually make the content available there, it
> redirects.

Right, but a 301 redirect is considered by Google as deprecating the old
page, which may or may not throw the indexer off, since we continue to use
postgr.es/m/ without a canonicalization?

> So I don't think those should actually have an effect?

That could very well be true; as with most things SEO, it's all a guessing
game.

> That being said, since we haven't changed anything, and DuckDuckGo happily
> indexes the mailing list posts, this smells a lot more like a policy change
> than a technical change, if my experience with Google SEO is anything to go
> by. The Webmaster Tools Search Console can quite often give insights as to
> why a page is missing; that's probably a better place to start than
> second-guessing Google SEO. AFAICR, using it requires proving that one owns
> the site/domain, but doesn't require adding any Google trackers or similar
> things.
>
> I've tried but failed to get any relevant data out of it. It does clearly
> show large amounts of URLs blocked because they are in /flat/ or /raw/, but
> nothing at all about the regular messages.

That's disappointing; I've gotten quite good advice there in the past.

cheers ./daniel
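Daniel's two recommendations taken together (one canonical URL per message, and noindex rather than robots.txt for pages that should stay out of the index) would look roughly like this in a message page's `<head>`; this is illustrative markup only, not pgweb's actual template, and the message-id in the URL is made up:

```html
<head>
  <!-- Hypothetical example: declare one canonical URL so that aliases
       (the postgr.es/m/ shortener, alternate views) collapse into it -->
  <link rel="canonical"
        href="https://www.postgresql.org/message-id/example-message-id" />
  <!-- For views that should stay unindexed (e.g. /flat/), Google's
       recommended mechanism is noindex; unlike a robots.txt Disallow,
       it still applies when the page is linked from elsewhere -->
  <meta name="robots" content="noindex" />
</head>
```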
Hi,

This got brought up again in a Twitter discussion, see
https://twitter.com/AndresFreundTec/status/1403418002951794688

On 2019-08-29 07:50:13 -0700, Andres Freund wrote:
>>> Why is that /list/ exclusion there in the first place?
>>
>> Because there are basically infinite number of pages in that space, due to
>> the fact that you can pick an arbitrary point in time to view from.
>
> You mean because of the per-day links, that aren't really per-day? I
> think the number of links due to that would still be manageable traffic
> wise? Or are they that expensive to compute? Perhaps we could make the
> "jump to day" links smarter in some way? Perhaps by not including
> content for the following days in the per-day pages?

I still don't understand why all of /list/ is in robots.txt. I understand
why we don't necessarily want to index /list/.../since/..., but prohibiting
all of /list/ seems like an extremely poorly aimed big hammer. Can't we use
wildcards to at least allow everything but the /since/ links? E.g.
Disallow: /list/*/since/*. Or is it because some less common crawler
doesn't implement wildcards at all?

Or slap rel=nofollow on links / add a meta tag preventing /since/ pages
from being indexed.

Yes, that'd not be perfect for the bigger lists, because there's no
"direct" way to get from the month's archive to all the month's emails
when paginated. But there's still the next/prev links. And it'd be much
better than what we have right now.

Greetings,

Andres Freund
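For reference, the wildcard variant suggested here would look something like the following in robots.txt. This is a sketch only: wildcard matching in Disallow is a de-facto extension honored by the major crawlers (Google, Bing) rather than part of the original robots exclusion standard, and the real pgweb rules may differ:

```
# Sketch: open /list/ to crawlers, but keep the unbounded
# arbitrary-point-in-time views out of the crawl.
User-agent: *
Disallow: /list/*/since/
```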