Thread: Fixing Google Search on the docs (redux)


Fixing Google Search on the docs (redux)

From
Dave Page
Date:
I was looking at our analytic data, and saw that the vast majority of inbound traffic to the docs, hits the 9.1 version. We've known this has been an issue for years and have tried various remedies, clearly none of which are working.

Should we try an experiment for a couple of months, in which we simply block anything that matches \/docs\/((\d+)|(\d.\d))\/ in robots.txt? It's a much more drastic option, but at least it might force Google into indexing the latest doc version with the highest priority.
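
A minimal sketch of what that pattern would cover, in Python. Two caveats apply. First, robots.txt rules are path prefixes (Google also honours `*` and `$`), not regular expressions, so a pattern like this could only be used to generate or check the rules. Second, the `.` in `\d.\d` is unescaped and therefore matches any character. The `STRICT` variant, the helper name and the sample paths are illustrative assumptions, not anything the site uses today.

```python
import re

# The pattern as posted; the unescaped "." matches any single character.
POSTED = re.compile(r"\/docs\/((\d+)|(\d.\d))\/")

# Assumed stricter variant: the dot is escaped, so only "12"-style or
# "9.1"-style version directories match.
STRICT = re.compile(r"/docs/(\d+|\d\.\d)/")


def is_versioned(path: str, pattern: re.Pattern = STRICT) -> bool:
    """Return True if the path starts with a numbered docs version."""
    return pattern.match(path) is not None


# Illustrative paths only.
paths = [
    "/docs/9.1/sql-createtrigger.html",
    "/docs/12/sql-createtrigger.html",
    "/docs/current/sql-createtrigger.html",
    "/docs/devel/sql-createtrigger.html",
]

# Paths that the experiment would hide from crawlers.
blocked = [p for p in paths if is_versioned(p)]
print(blocked)
```

Under these assumptions only the numbered versions are caught; /docs/current/ and /docs/devel/ stay crawlable.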
 
--
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EDB: http://www.enterprisedb.com

Re: Fixing Google Search on the docs (redux)

From
"Jonathan S. Katz"
Date:
On 11/18/20 11:20 AM, Dave Page wrote:
> I was looking at our analytic data, and saw that the vast majority of
> inbound traffic to the docs, hits the 9.1 version. We've known this has
> been an issue for years and have tried various remedies, clearly none of
> which are working.
>
> Should we try an experiment for a couple of months, in which we simply
> block anything that matches \/docs\/((\d+)|(\d.\d))\/ in robots.txt?
> It's a much more drastic option, but at least it might force Google into
> indexing the latest doc version with the highest priority.

If we're going down this road, I would suggest borrowing a concept from
the Django Project documentation which has a similar issue to us. In
their codebase, use a <link> tag with rel="canonical" to point to the
latest version of docs on their page[1].

So for example, given 3.1 is their latest release, you will find
something similar to this:

<link rel="canonical"
href="https://docs.djangoproject.com/en/3.1/ref/templates/builtins/">

From a quick test of searching various Django concepts, it seems that
the 3.1 pages tend to turn up first.

Our equivalent would be "current".

Jonathan

[1]
https://developers.google.com/search/docs/advanced/crawling/consolidate-duplicate-urls


Re: Fixing Google Search on the docs (redux)

From
Magnus Hagander
Date:
On Wed, Nov 18, 2020 at 5:44 PM Jonathan S. Katz <jkatz@postgresql.org> wrote:
>
> On 11/18/20 11:20 AM, Dave Page wrote:
> > I was looking at our analytic data, and saw that the vast majority of
> > inbound traffic to the docs, hits the 9.1 version. We've known this has
> > been an issue for years and have tried various remedies, clearly none of
> > which are working.
> >
> > Should we try an experiment for a couple of months, in which we simply
> > block anything that matches \/docs\/((\d+)|(\d.\d))\/ in robots.txt?
> > It's a much more drastic option, but at least it might force Google into
> > indexing the latest doc version with the highest priority.
>
> If we're going down this road, I would suggest borrowing a concept from
> the Django Project documentation which has a similar issue to us. In
> their codebase, use a <link> tag with rel="canonical" to point to the
> latest version of docs on their page[1].
>
> So for example, given 3.1 is their latest release, you will find
> something similar to this:
>
> <link rel="canonical"
> href="https://docs.djangoproject.com/en/3.1/ref/templates/builtins/">
>
> From a quick test of searching various Django concepts, it seems that
> the 3.1 pages tend to turn up first.
>
> Our equivalent would be "current".
>
> Jonathan
>
> [1]
> https://developers.google.com/search/docs/advanced/crawling/consolidate-duplicate-urls

We've discussed this many times before, and I think so far they've all
bogged down at "google suck" :) The problem is that they don't even
consider the case like we have where the pages *aren't* identical, but
yet related.

The problem it usually comes down to is that if we do that, then you
will no longer be able to say search for something in the old docs *at
all*. A good example right now might be that recovery.conf stuff goes
away. Even if you explicitly search for "postgresql recovery.conf 11".
And I'd guess the majority of people are actually looking for things
in versions that are NOT the latest (though an even bigger majority of
people will be looking for things in versions that are not 9.1).

FWIW, I find the django example absolutely terrible -- in fact, it's a
great example of how the canonical URL handling sucks. There is AFAICT
no way to actually search for information about old versions. You have
to search for it in the new version and then hope that the same info
happens to be on the same page in an earlier version, and then
manually browse your way back to that version (also through very
annoying js popover stuff, but that's a different thing)

I don't know of any way to actually tell google to prioritise the new
versions. You used to be able to do this using the sitemap.xml stuff,
which is why we do that, but at some point they just stopped caring
about those, even in the cases where we're *lowering* our own
priority, under the argument of not letting us increase our priority.

It's not that what we have now for this is especially great. It might
be that going down that route is still the least bad. But we have to
make that decision while knowing this means that *nobody* will be able
to search for things in our older documentation even if they
explicitly ask for it. At all. Their only chance is to search for
something else that might hit our docs, then in that click over to the
correct version they actually asked for, and then search *again* using
our site-search and hope that it shows up there. I'm willing to bet
very few users will figure that part out...

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/



Re: Fixing Google Search on the docs (redux)

From
Christophe Pettus
Date:

> On Nov 18, 2020, at 09:28, Magnus Hagander <magnus@hagander.net> wrote:
> Their only chance is to search for
> something else that might hit our docs, then in that click over to the
> correct version they actually asked for, and then search *again* using
> our site-search and hope that it shows up there. I'm willing to bet
> very few users will figure that part out...

I'm not sure that is a worse situation than searching for something and having the first page be 9.1 hits.
--
-- Christophe Pettus
   xof@thebuild.com




Re: Fixing Google Search on the docs (redux)

From
Tom Lane
Date:
Christophe Pettus <xof@thebuild.com> writes:
>> On Nov 18, 2020, at 09:28, Magnus Hagander <magnus@hagander.net> wrote:
>> Their only chance is to search for
>> something else that might hit our docs, then in that click over to the
>> correct version they actually asked for, and then search *again* using
>> our site-search and hope that it shows up there. I'm willing to bet
>> very few users will figure that part out...

> I'm not sure that is a worse situation than searching for something and having the first page be 9.1 hits.

Maybe, rather than trying to force google to index "current", we should
force them to index current minus one or two releases, so that what they
index is in the middle of the range of supported releases.  That would
represent a decent compromise between "info too old" and "info too new".

Another idea is to block, via robots.txt, any out-of-support branches.
We won't know which of the supported branches they then prioritize,
but at least it won't be 9.1.
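
A rough sketch of how such rules could be generated, in Python. The list of out-of-support branches here is a hypothetical placeholder (the real list would come from wherever the site tracks supported versions), and `robots_rules` is an illustrative helper name.

```python
# Hypothetical list of out-of-support branches; an assumption for
# illustration, not the project's actual support matrix.
UNSUPPORTED = ["9.1", "9.2", "9.3", "9.4"]


def robots_rules(versions, user_agent="*"):
    """Build a robots.txt group with one Disallow prefix per version."""
    lines = [f"User-agent: {user_agent}"]
    lines += [f"Disallow: /docs/{v}/" for v in versions]
    return "\n".join(lines) + "\n"


print(robots_rules(UNSUPPORTED))
```

Each rule is a plain path prefix, so /docs/current/ and the supported branches are left untouched.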

            regards, tom lane



Re: Fixing Google Search on the docs (redux)

From
Magnus Hagander
Date:
On Wed, Nov 18, 2020 at 6:33 PM Christophe Pettus <xof@thebuild.com> wrote:
>
>
>
> > On Nov 18, 2020, at 09:28, Magnus Hagander <magnus@hagander.net> wrote:
> > Their only chance is to search for
> > something else that might hit our docs, then in that click over to the
> > correct version they actually asked for, and then search *again* using
> > our site-search and hope that it shows up there. I'm willing to bet
> > very few users will figure that part out...
>
> I'm not sure that is a worse situation than searching for something and having the first page be 9.1 hits.

Today you can append "12" to your search and get the results for v12
most of the time.

So today the default is really bad, but the exact right thing is possible.
With the change, the default would be less bad (but not necessarily
exactly right), and the exact right thing would be impossible.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/



Re: Fixing Google Search on the docs (redux)

From
Christophe Pettus
Date:

> On Nov 18, 2020, at 10:03, Magnus Hagander <magnus@hagander.net> wrote:
> So today the default is really bad, but the exact right thing is possible.
> With the change, the default would be less bad (but not necessarily
> exactly right), and the exact right thing would be impossible.

We're kind of speculating that "oh, right, I have to slap a version number on when I search by Google" is less
frustrating or more common than just "click on the wrong version result and then navigate to the right version result."

I also think it's a benefit to prioritize the most recent version on external search hits.

I haven't seen significant complaints within the Django community about the way they handle it, and that's with the Django
documentation being *much* less well-organized than the PostgreSQL documentation (and thus more reliant on external
search engines to find the right thing).
--
-- Christophe Pettus
   xof@thebuild.com




Re: Fixing Google Search on the docs (redux)

From
Dave Page
Date:


On Wed, Nov 18, 2020 at 5:29 PM Magnus Hagander <magnus@hagander.net> wrote:
> On Wed, Nov 18, 2020 at 5:44 PM Jonathan S. Katz <jkatz@postgresql.org> wrote:
> >
> > On 11/18/20 11:20 AM, Dave Page wrote:
> > > I was looking at our analytic data, and saw that the vast majority of
> > > inbound traffic to the docs, hits the 9.1 version. We've known this has
> > > been an issue for years and have tried various remedies, clearly none of
> > > which are working.
> > >
> > > Should we try an experiment for a couple of months, in which we simply
> > > block anything that matches \/docs\/((\d+)|(\d.\d))\/ in robots.txt?
> > > It's a much more drastic option, but at least it might force Google into
> > > indexing the latest doc version with the highest priority.
> >
> > If we're going down this road, I would suggest borrowing a concept from
> > the Django Project documentation which has a similar issue to us. In
> > their codebase, use a <link> tag with rel="canonical" to point to the
> > latest version of docs on their page[1].
> >
> > So for example, given 3.1 is their latest release, you will find
> > something similar to this:
> >
> > <link rel="canonical"
> > href="https://docs.djangoproject.com/en/3.1/ref/templates/builtins/">
> >
> > From a quick test of searching various Django concepts, it seems that
> > the 3.1 pages tend to turn up first.
> >
> > Our equivalent would be "current".
> >
> > Jonathan
> >
> > [1]
> > https://developers.google.com/search/docs/advanced/crawling/consolidate-duplicate-urls
>
> We've discussed this many times before, and I think so far they've all
> bogged down at "google suck" :) The problem is that they don't even
> consider the case like we have where the pages *aren't* identical, but
> yet related.

Sure, but we need to do something, regardless of whether Google suck in this case. The current situation is ridiculous; I don't remember the last time I searched on something and didn't have to click an alternate version link if I chose a result from our docs.
 

> The problem it usually comes down to is that if we do that, then you
> will no longer be able to say search for something in the old docs *at
> all*. A good example right now might be that recovery.conf stuff goes
> away. Even if you explicitly search for "postgresql recovery.conf 11".
> And I'd guess the majority of people are actually looking for things
> in versions that are NOT the latest (though an even bigger majority of
> people will be looking for things in versions that are not 9.1).

The irony is that that example would be far less of an issue if we hadn't removed all the release notes for older versions (see https://www.enterprisedb.com/edb-docs/s?q=recovery.conf&c=&p=19&v=272 as an example). The older release notes would give users a hint as to where to look.

 
> FWIW, I find the django example absolutely terrible -- in fact, it's a
> great example of how the canonical URL handling sucks. There is AFAICT
> no way to actually search for information about old versions. You have
> to search for it in the new version and then hope that the same info
> happens to be on the same page in an earlier version, and then
> manually browse your way back to that version (also through very
> annoying js popover stuff, but that's a different thing)

That is true, however the *vast* majority of cases will be present in older versions.
 

> I don't know of any way to actually tell google to prioritise the new
> versions. You used to be able to do this using the sitemap.xml stuff,
> which is why we do that, but at some point they just stopped caring
> about those, even in the cases where we're *lowering* our own
> priority, under the argument of not letting us increase our priority.
>
> It's not that what we have now for this is especially great. It might
> be that going down that route is still the least bad. But we have to
> make that decision while knowing this means that *nobody* will be able
> to search for things in our older documentation even if they
> explicitly ask for it. At all.

On public search engines. They will still be able to using our own site search.
 
> Their only chance is to search for
> something else that might hit our docs, then in that click over to the
> correct version they actually asked for, and then search *again* using
> our site-search and hope that it shows up there. I'm willing to bet
> very few users will figure that part out...

The issue for me is that the current situation sucks for the vast majority of users, as evidenced by our analytics. If we blocked indexing of all but the current version of the docs, it would suck in the same way only for those that specifically want to look at an older version, and those that search for one of the very few things that have been removed from the latest version. In short, I think the current situation is worse.

--
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EDB: http://www.enterprisedb.com

Re: Fixing Google Search on the docs (redux)

From
Dave Page
Date:


On Wed, Nov 18, 2020 at 5:45 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Christophe Pettus <xof@thebuild.com> writes:
> >> On Nov 18, 2020, at 09:28, Magnus Hagander <magnus@hagander.net> wrote:
> >> Their only chance is to search for
> >> something else that might hit our docs, then in that click over to the
> >> correct version they actually asked for, and then search *again* using
> >> our site-search and hope that it shows up there. I'm willing to bet
> >> very few users will figure that part out...
>
> > I'm not sure that is a worse situation than searching for something and having the first page be 9.1 hits.
>
> Maybe, rather than trying to force google to index "current", we should
> force them to index current minus one or two releases, so that what they
> index is in the middle of the range of supported releases.  That would
> represent a decent compromise between "info too old" and "info too new".

That'll stop people searching about the new features in the latest, which I think is likely a common pattern.
 

> Another idea is to block, via robots.txt, any out-of-support branches.
> We won't know which of the supported branches they then prioritize,
> but at least it won't be 9.1.

That I could get on board with.
 
--
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EDB: http://www.enterprisedb.com

Re: Fixing Google Search on the docs (redux)

From
Magnus Hagander
Date:
On Thu, Nov 19, 2020 at 10:40 AM Dave Page <dpage@pgadmin.org> wrote:
>
>
>
> On Wed, Nov 18, 2020 at 5:29 PM Magnus Hagander <magnus@hagander.net> wrote:
>>
>> On Wed, Nov 18, 2020 at 5:44 PM Jonathan S. Katz <jkatz@postgresql.org> wrote:
>> >
>> > On 11/18/20 11:20 AM, Dave Page wrote:
>> > > I was looking at our analytic data, and saw that the vast majority of
>> > > inbound traffic to the docs, hits the 9.1 version. We've known this has
>> > > been an issue for years and have tried various remedies, clearly none of
>> > > which are working.
>> > >
>> > > Should we try an experiment for a couple of months, in which we simply
>> > > block anything that matches \/docs\/((\d+)|(\d.\d))\/ in robots.txt?
>> > > It's a much more drastic option, but at least it might force Google into
>> > > indexing the latest doc version with the highest priority.
>> >
>> > If we're going down this road, I would suggest borrowing a concept from
>> > the Django Project documentation which has a similar issue to us. In
>> > their codebase, use a <link> tag with rel="canonical" to point to the
>> > latest version of docs on their page[1].
>> >
>> > So for example, given 3.1 is their latest release, you will find
>> > something similar to this:
>> >
>> > <link rel="canonical"
>> > href="https://docs.djangoproject.com/en/3.1/ref/templates/builtins/">
>> >
>> > From a quick test of searching various Django concepts, it seems that
>> > the 3.1 pages tend to turn up first.
>> >
>> > Our equivalent would be "current".
>> >
>> > Jonathan
>> >
>> > [1]
>> > https://developers.google.com/search/docs/advanced/crawling/consolidate-duplicate-urls
>>
>> We've discussed this many times before, and I think so far they've all
>> bogged down at "google suck" :) The problem is that they don't even
>> consider the case like we have where the pages *aren't* identical, but
>> yet related.
>
>
> Sure, but we need to do something, regardless of whether Google suck in this case. The current situation is ridiculous; I don't remember the last time I searched on something and didn't have to click an alternate version link if I chose a result from our docs.
>
>>
>>
>> The problem it usually comes down to is that if we do that, then you
>> will no longer be able to say search for something in the old docs *at
>> all*. A good example right now might be that recovery.conf stuff goes
>> away. Even if you explicitly search for "postgresql recovery.conf 11".
>> And I'd guess the majority of people are actually looking for things
>> in versions that are NOT the latest (though an even bigger majority of
>> people will be looking for things in versions that are not 9.1).
>
>
> The irony is that that example would be far less of an issue if we hadn't removed all the release notes for older versions (see https://www.enterprisedb.com/edb-docs/s?q=recovery.conf&c=&p=19&v=272 as an example). The older release notes would give users a hint as to where to look.

The release notes themselves are still under for example
https://www.postgresql.org/docs/release/12.0/ as well, so we should be
able to keep *that* searchable still. So for this particular case it
would at least tell people that "yeah, you're right, it used to be
called recovery conf" when they're searching for documentation about
11 and earlier... They still won't get to the actual documentation for
it though -- but neither does your example from edb :)


>> FWIW, I find the django example absolutely terrible -- in fact, it's a
>> great example of how the canonical URL handling sucks. There is AFAICT
>> no way to actually search for information about old versions. You have
>> to search for it in the new version and then hope that the same info
>> happens to be on the same page in an earlier version, and then
>> manually browse your way back to that version (also through very
>> annoying js popover stuff, but that's a different thing)
>
>
> That is true, however the *vast* majority of cases will be present in older versions.

Yes, but one could also argue that specifically the things that people
search for might be less likely to be present across versions.


>> I don't know of any way to actually tell google to prioritise the new
>> versions. You used to be able to do this using the sitemap.xml stuff,
>> which is why we do that, but at some point they just stopped caring
>> about those, even in the cases where we're *lowering* our own
>> priority, under the argument of not letting us increase our priority.
>>
>> It's not that what we have now for this is especially great. It might
>> be that going down that route is still the least bad. But we have to
>> make that decision while knowing this means that *nobody* will be able
>> to search for things in our older documentation even if they
>> explicitly ask for it. At all.
>
>
> On public search engines. They will still be able to using our own site search.

Yes, of course.


>> Their only chance is to search for
>> something else that might hit our docs, then in that click over to the
>> correct version they actually asked for, and then search *again* using
>> our site-search and hope that it shows up there. I'm willing to bet
>> very few users will figure that part out...
>
>
> The issue for me is that the current situation sucks for the vast majority of users, as evidenced by our analytics. If we blocked indexing of all but the current version of the docs, it would suck in the same way only for those that specifically want to look at an older version, and those that search for one of the very few things that have been removed from the latest version. In short, I think the current situation is worse.

Or we need a somewhat in between level. Like, right now I bet most
people would actually want version 11 or 12, not 13. So do we need to
define a "most likely wants to search for this" version as well, which
would then trail the actual latest-release version, and point the
search engines to that?

That said, I also agree with the suggestion to start by at least
blocking those that are unsupported. However, we should monitor the
results carefully so that doesn't end up with google just zapping
*everything* -- we need them to realize the newer versions are there.
Doing the canonical-URL-setup that Jonathan suggested would make
google update it, the question is what happens if they just "go away".
Do we *lose* all the existing "google power" of those links? If so,
it might be a very costly experiment...

--
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/



Re: Fixing Google Search on the docs (redux)

From
Dave Page
Date:


On Thu, Nov 19, 2020 at 9:58 AM Magnus Hagander <magnus@hagander.net> wrote:

> > The issue for me is that the current situation sucks for the vast majority of users, as evidenced by our analytics. If we blocked indexing of all but the current version of the docs, it would suck in the same way only for those that specifically want to look at an older version, and those that search for one of the very few things that have been removed from the latest version. In short, I think the current situation is worse.
>
> Or we need a somewhat in between level. Like, right now I bet most
> people would actually want version 11 or 12, not 13. So do we need to
> define a "most likely wants to search for this" version as well, which
> would then trail the actual latest-release version, and point the
> search engines to that?

Perhaps an interesting datapoint is this

====
If you have a single page accessible by multiple URLs, or different pages with similar content (for example, a page with both a mobile and a desktop version), Google sees these as duplicate versions of the same page. Google will choose one URL as the canonical version and crawl that, and all other URLs will be considered duplicate URLs and crawled less often.

If you don't explicitly tell Google which URL is canonical, Google will make the choice for you, or might consider them both of equal weight, which might lead to unwanted behavior, as explained below in Why should I choose a canonical URL?
==== 


I think this is interesting because it makes the point that non-canonical URLs will still be indexed, just less often. I wonder if we can do something like the following, but still retain the ability to do a search like "postgresql 12 create trigger":

- Remove (by default) all doc URLs from the sitemap that aren't under /current/ (note that evidence indicates Google will still index pages not in the sitemap if it finds them, if a sitemap is present).
- Include a canonical URL in all doc pages that points to the /current/ version
- Where a page has been removed entirely, mark the most recent version of it as the canonical one instead of the /current/ version.

If the Google docs are correct, it'll still index the older versions (and presumably use them in results if it needs to, e.g. because the user included a version number), but it'll prefer the canonical one.
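
The canonical-selection rule in the list above could be sketched roughly as follows, in Python. The function names, the in-memory page inventory and the sample page names are assumptions made for illustration; on the real site the inventory would presumably come from the docs database during page generation.

```python
from typing import Dict, List, Optional, Set

BASE = "https://www.postgresql.org/docs"


def canonical_url(
    page: str,
    pages_by_version: Dict[str, Set[str]],
    versions_newest_first: List[str],
) -> Optional[str]:
    """Pick the canonical URL for a docs page.

    If the page exists in the newest release, point at /current/.
    Otherwise point at the most recent version that still has it.
    """
    newest = versions_newest_first[0]
    if page in pages_by_version.get(newest, set()):
        return f"{BASE}/current/{page}"
    for version in versions_newest_first[1:]:
        if page in pages_by_version.get(version, set()):
            return f"{BASE}/{version}/{page}"
    return None  # page unknown in every version


def canonical_link_tag(url: str) -> str:
    """Render the <link> element to place in the page head."""
    return f'<link rel="canonical" href="{url}">'


# Illustrative inventory only.
pages = {
    "13": {"sql-createtrigger.html"},
    "12": {"sql-createtrigger.html"},
    "11": {"sql-createtrigger.html", "recovery-config.html"},
}
order = ["13", "12", "11"]

print(canonical_link_tag(canonical_url("sql-createtrigger.html", pages, order)))
print(canonical_link_tag(canonical_url("recovery-config.html", pages, order)))
```

With this inventory, every version of the CREATE TRIGGER page would point at /current/, while the removed recovery page would name its version 11 copy as canonical.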


> That said, I also agree with the suggestion to start by at least
> blocking those that are unsupported. However, we should monitor the
> results carefully so that doesn't end up with google just zapping
> *everything* -- we need them to realize the newer versions are there.
> Doing the canonical-URL-setup that Jonathan suggested would make
> google update it, the question is what happens if they just "go away".
> Do we *lose* all the existing "google power" of those links? If so,
> it might be a very costly experiment...

I think there's a risk here whatever we do. I'm not sure that's a good enough reason to do nothing though.
 
--
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EDB: http://www.enterprisedb.com

Re: Fixing Google Search on the docs (redux)

From
Greg Stark
Date:
> all other URLs will be considered duplicate URLs and crawled less often

What Google crawls and what Google considers a valid search result to
serve users are two independent questions. Google may well crawl the
non-canonical results but never serve them. The crawl would still, for
example, add weight to pages linked from it. It's always really hard
to tell when reading Google docs whether they're talking about crawl
behaviour or search results behaviour.

> - Where a page has been removed entirely, mark the most recent version of it as the canonical one instead of the /current/ version.

This seems like a significant advance on previous ideas. If we have
enough meta data available to do this that would be a big win. I think
it's rare that we remove information from a page but keep the same
page. Generally things like recovery.conf would mean removing whole
pages replacing them with new pages that document new functionality.



Re: Fixing Google Search on the docs (redux)

From
Magnus Hagander
Date:
On Thu, Nov 19, 2020 at 3:19 PM Greg Stark <stark@mit.edu> wrote:
>
> > - Where a page has been removed entirely, mark the most recent version of it as the canonical one instead of the /current/ version.
>
> This seems like a significant advance on previous ideas. If we have
> enough meta data available to do this that would be a big win. I think
> it's rare that we remove information from a page but keep the same
> page. Generally things like recovery.conf would mean removing whole
> pages replacing them with new pages that document new functionality.

It's actually the other way around. We very seldom remove pages, but
more often change the information that's on them.

But yes, we definitely have the metadata to do that. It'll take some
SQL magic in the page generation I think, but luckily we know one or
two people who can write such things :)

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/



Re: Fixing Google Search on the docs (redux)

From
Peter Geoghegan
Date:
On Thu, Nov 19, 2020 at 6:23 AM Magnus Hagander <magnus@hagander.net> wrote:
> It's actually the other way around. We very seldom remove pages, but
> more often change the information that's on them.

I agree.

-- 
Peter Geoghegan



Re: Fixing Google Search on the docs (redux)

From
Andres Freund
Date:
Hi,

On 2020-11-18 18:28:49 +0100, Magnus Hagander wrote:
> We've discussed this many times before, and I think so far they've all
> bogged down at "google suck" :) The problem is that they don't even
> consider the case like we have where the pages *aren't* identical, but
> yet related.

Is any search engine better at this? I don't think so?


> The problem it usually comes down to is that if we do that, then you
> will no longer be able to say search for something in the old docs *at
> all*.

I think that'd still be better than the current situation. But I hope we
can do better:

> A good example right now might be that recovery.conf stuff goes
> away. Even if you explicitly search for "postgresql recovery.conf 11".
> And I'd guess the majority of people are actually looking for things
> in versions that are NOT the latest (though an even bigger majority of
> people will be looking for things in versions that are not 9.1).

E.g. not applying canonical when there's no newer version.


> I don't know of any way to actually tell google to prioritise the new
> versions. You used to be able to do this using the sitemap.xml stuff,
> which is why we do that, but at some point they just stopped caring
> about those, even in the cases where we're *lowering* our own
> priority, under the argument of not letting us increase our priority.

Have we evaluated not using canonical, but not including old versions in
the sitemap?

Greetings,

Andres Freund



Re: Fixing Google Search on the docs (redux)

From
Magnus Hagander
Date:
On Thu, Nov 19, 2020 at 8:50 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2020-11-18 18:28:49 +0100, Magnus Hagander wrote:
> > We've discussed this many times before, and I think so far they've all
> > bogged down at "google suck" :) The problem is that they don't even
> > consider the case like we have where the pages *aren't* identical, but
> > yet related.
>
> Is any search engine better at this? I don't think so?

I doubt it, most tend to copy Google. And in either case it doesn't
matter that much -- the *vast* majority of our inbound search traffic
is google vs the other searches. By such a margin that it's not even a
point in considering the others.


> > The problem it usually comes down to is that if we do that, then you
> > will no longer be able to say search for something in the old docs *at
> > all*.
>
> I think that'd still be better than the current situation. But I hope we
> can do better:
>
> > A good example right now might be that recovery.conf stuff goes
> > away. Even if you explicitly search for "postgresql recovery.conf 11".
> > And I'd guess the majority of people are actually looking for things
> > in versions that are NOT the latest (though an even bigger majority of
> > people will be looking for things in versions that are not 9.1).
>
> E.g. not applying canonical when there's no newer version.

That we can definitely do. So for recovery.conf it would still work,
but anything that goes on a page where the page still exists, I don't
see how we could separate that out and not do a canonical for that...


> > I don't know of any way to actually tell google to prioritise the new
> > versions. You used to be able to do this using the sitemap.xml stuff,
> > which is why we do that, but at some point they just stopped caring
> > about those, even in the cases where we're *lowering* our own
> > priority, under the argument of not letting us increase our priority.
>
> Have we evaluated not using canonical, but not including old versions in
> the sitemap?

AIUI from my reading, Google mostly ignores sitemaps these days. The
only thing it's used for is seeding *new* URLs into the search engine,
not removing old and not having any effect on priority. Probably
because it was abused too much.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/



Re: Fixing Google Search on the docs (redux)

From
Andres Freund
Date:
Hi,

On 2020-11-21 15:57:28 +0100, Magnus Hagander wrote:
> On Thu, Nov 19, 2020 at 8:50 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2020-11-18 18:28:49 +0100, Magnus Hagander wrote:
> > > We've discussed this many times before, and I think so far they've all
> > > bogged down at "google suck" :) The problem is that they don't even
> > > consider the case like we have where the pages *aren't* identical, but
> > > yet related.
> >
> > Is any search engine better at this? I don't think so?
> 
> I doubt it, most tend to copy Google. And in either case it doesn't
> matter that much -- the *vast* majority of our inbound search traffic
> is google vs the other searches. By such a margin that it's not even a
> point in considering the others.

I was more wondering whether it's "search engines suck" or "google
sucks" - obviously g search is dominant...


> > > The problem it usually comes down to is that if we do that, then you
> > > will no longer be able to say search for something in the old docs *at
> > > all*.
> >
> > I think that'd still be better than the current situation. But I hope we
> > can do better:
> >
> > > A good example right now might be that recovery.conf stuff goes
> > > away. Even if you explicitly search for "postgresql recovery.conf 11".
> > > And I'd guess the majority of people are actually looking for things
> > > in versions that are NOT the latest (though an even bigger majority of
> > > people will be looking for things in versions that are not 9.1).
> >
> > E.g. not applying canonical when there's no newer version.
> 
> That we can definitely do. So for recovery.conf it would still work,
> but anything that goes on a page where the page still exists, I don't
> see how we could separate that out and not do a canonical for that...

Compute a similarity metric ;). No, I'm not serious.


I wonder if it's worth adding some more metadata to our pages for
google's benefit. Perhaps it'd be *slightly* less annoying to navigate
to the right version of the docs if we added breadcrumb annotations
https://developers.google.com/search/docs/data-types/breadcrumb#json-ld_1

I can imagine - but have nothing but intuition to back that up - that we
also make google's job harder by having very recent timestamp for each
version of the docs. Perhaps we ought to add datePublished /
dateModified annotations, and freeze datePublished to the release?

And probably also not update dateModified when the page didn't change,
but I think you were discussing that elsewhere.
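
A small sketch of what those annotations might look like, generated in Python. The property names (BreadcrumbList, ListItem, datePublished, dateModified) are schema.org vocabulary; the helper names, the choice of WebPage as the type, the breadcrumb trail and the dateModified value are assumptions for illustration, and whether Google would act on any of this is untested.

```python
import json


def breadcrumb_jsonld(crumbs):
    """Build a schema.org BreadcrumbList from (name, url) pairs."""
    return {
        "@context": "https://schema.org",
        "@type": "BreadcrumbList",
        "itemListElement": [
            {"@type": "ListItem", "position": i, "name": name, "item": url}
            for i, (name, url) in enumerate(crumbs, start=1)
        ],
    }


def page_dates_jsonld(url, release_date, last_changed):
    """Describe a docs page with datePublished frozen to the release date."""
    return {
        "@context": "https://schema.org",
        "@type": "WebPage",  # assumed type; other CreativeWork types also carry these dates
        "url": url,
        "datePublished": release_date,
        "dateModified": last_changed,
    }


def script_tag(data):
    """Wrap a JSON-LD object in the script element that carries it."""
    return '<script type="application/ld+json">' + json.dumps(data) + "</script>"


# Illustrative values; the dateModified below is made up.
crumbs = [
    ("Documentation", "https://www.postgresql.org/docs/"),
    ("PostgreSQL 12", "https://www.postgresql.org/docs/12/"),
    ("CREATE TRIGGER", "https://www.postgresql.org/docs/12/sql-createtrigger.html"),
]
print(script_tag(breadcrumb_jsonld(crumbs)))
print(script_tag(page_dates_jsonld(crumbs[-1][1], "2019-10-03", "2020-02-13")))
```

Keeping dateModified unchanged when a rebuild leaves the page content identical would need a content hash or similar stored alongside each page, which this sketch does not attempt.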

Greetings,

Andres Freund