Re: no mailing list hits in google

From Daniel Gustafsson
Subject Re: no mailing list hits in google
Msg-id E6414B35-D288-4958-BEFB-42FF56F4BB2F@yesql.se
In response to Re: no mailing list hits in google  (Magnus Hagander <magnus@hagander.net>)
List pgsql-www
> On 30 Aug 2019, at 12:08, Magnus Hagander <magnus@hagander.net> wrote:
> On Fri, Aug 30, 2019 at 11:40 AM Daniel Gustafsson <daniel@yesql.se> wrote:
> > On 29 Aug 2019, at 16:55, Andres Freund <andres@anarazel.de> wrote:
> > On 2019-08-29 09:32:35 -0400, Alvaro Herrera wrote:
> >> On 2019-Aug-29, Magnus Hagander wrote:
> >>
> >>> Maybe Google used to load the pages under /list/ and crawl them for links
> >>> but just not include the actual pages in the index or something
> >>>
> >>> I wonder if we can inject these into Google using a sitemap. I think that
> >>> should work -- will need some investigation on exactly how to do it, as
> >>> sitemaps also have individual restrictions on the number of urls per file,
> >>> and we do have quite a few messages.
> >>>
> >>>> Why is that /list/ exclusion there in the first place?
> >>>
> >>> Because there are basically infinite number of pages in that space, due to
> >>> the fact that you can pick an arbitrary point in time to view from.
> >>
> >> Maybe we can create a new page that's specifically to be used by
> >> crawlers, that lists all emails, each only once.  Say (unimaginatively)
> >> /list_crawlers/2019-08/ containing links to all emails of all public
> >> lists occurring during August 2019.
> >
> > Hm. Weren't there occasionally downranking rules for pages that were
> > clearly aimed just at search engines?
>
> I think that’s mainly been for pages which are clearly keyword spamming, I
> doubt our content would get caught there.  The sitemap, as proposed upthread,
> is the solution to this however and is also the recommended way from Google for
> sites with lots of content.
>
> Google does however explicitly downrank duplicated/similar content, or content
> which can be reached via multiple URLs and which doesn’t list a canonical URL
> in the page.  A single message and the whole-thread link does contain the same
> content, and neither are canonical so we might be incurring penalties from
> that.  Also, the postgr.es/m/ shortener makes content available via two URLs,
> without a canonical URL specified.
>
> But robots.txt blocks the whole-thread view (and this is the reason for it).

Maybe that’s part of the explanation, since Google no longer wants sites to use
robots.txt to control what gets indexed (unlike much indexing advice, which is
vague at best, they say so explicitly)?  Being in robots.txt only stops the
page from being crawled; it can still be indexed if it is linked to from
somewhere else with enough context (for example if a thread is reproduced on a
forum with a link to /message-id/raw).  Their recommended way is to mark the
page with noindex:

    <meta name="robots" content="noindex" />
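
To illustrate the difference between blocking a crawl and blocking indexing,
here is a minimal Python sketch using the standard library's
urllib.robotparser.  The robots.txt rules are an assumption modelled on the
paths mentioned in this thread, not a copy of the real file, and crawl_allowed
and the message path are hypothetical.  It only shows that a Disallow rule
stops a well-behaved crawler from fetching the page, which also means the
crawler never gets to read a noindex tag placed on it.

```python
from urllib.robotparser import RobotFileParser

# Assumed rules, modelled on the paths discussed in this thread.
# This is not the real robots.txt of the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /list/
Disallow: /message-id/flat/
Disallow: /message-id/raw/
"""


def crawl_allowed(path, agent="Googlebot"):
    """Return True if the assumed rules let `agent` fetch `path`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, "https://www.postgresql.org" + path)


if __name__ == "__main__":
    # Hypothetical message path, used only for illustration.
    for path in ("/message-id/example-id", "/message-id/flat/example-id"):
        print(path, crawl_allowed(path))
```

A page for which crawl_allowed returns False can still end up in the index
through external links, which is why noindex on a crawlable page is the more
reliable signal.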

> And postgr.es/m/ does not actually make the content available there, it redirects.

Right, but a 301 redirect is considered by Google as deprecating the old page,
which may or may not throw the indexer off since we continue to use
postgr.es/m/ without a canonicalization?
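
As a rough sketch of what canonicalization could look like, the Python below
builds a rel="canonical" link element for a message page.  ARCHIVE_BASE and
canonical_link are hypothetical names, and treating the /message-id/ URL as
the canonical form is my assumption, not something we have decided.

```python
from html import escape
from urllib.parse import quote

# Assumption: the single-message archive URL is the canonical form.
ARCHIVE_BASE = "https://www.postgresql.org/message-id/"


def canonical_link(msgid):
    """Build a <link rel="canonical"> element for a message-id."""
    # Percent-encode the message-id for use in a URL path, keeping "@" as is.
    href = ARCHIVE_BASE + quote(msgid, safe="@")
    # Escape the URL for use inside an HTML attribute value.
    return '<link rel="canonical" href="{}" />'.format(escape(href, quote=True))


if __name__ == "__main__":
    print(canonical_link("E6414B35-D288-4958-BEFB-42FF56F4BB2F@yesql.se"))
```

The same element would go into the head of every view that shows the message,
so that the variants all point at one URL.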

> So I don't think those should actually have an effect?

That could very well be true, as with most things SEO it’s all a guessing game.

> That being said, since we haven’t changed anything, and DuckDuckGo happily
> indexes the mailing list posts, this smells a lot more like a policy change than a
> technical change if my experience with Google SEO is anything to go by.  The
> Webmaster Tools Search Console can quite often give insights as to why a page
> is missing, that’s probably a better place to start than second guessing Google
> SEO.  AFAICR, using that requires proving that one owns the site/domain, but
> doesn’t require adding any google trackers or similar things.
>
> I've tried but failed to get any relevant data out of it. It does clearly show large amounts of URLs blocked because
> they are in /flat/ or /raw/, but nothing at all about the regular messages.

That’s disappointing, I’ve gotten quite good advice there in the past.
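
On the sitemap idea from upthread and the per-file URL restriction mentioned
there: the sitemaps.org protocol caps a sitemap file at 50,000 URLs and lets a
sitemap index file list the individual files.  A minimal Python sketch of the
splitting follows; the function names are hypothetical, and where the list of
message URLs comes from is left open.

```python
from xml.sax.saxutils import escape

MAX_URLS_PER_SITEMAP = 50000  # per-file URL limit in the sitemaps.org protocol
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def chunk(urls, size=MAX_URLS_PER_SITEMAP):
    """Yield consecutive slices of `urls`, each with at most `size` entries."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def sitemap_xml(urls):
    """Render one sitemap file listing the given URLs."""
    entries = "".join(
        "  <url><loc>{}</loc></url>\n".format(escape(u)) for u in urls
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="{}">\n{}</urlset>\n'.format(SITEMAP_NS, entries)
    )


def sitemap_index_xml(sitemap_urls):
    """Render a sitemap index pointing at the individual sitemap files."""
    entries = "".join(
        "  <sitemap><loc>{}</loc></sitemap>\n".format(escape(u))
        for u in sitemap_urls
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<sitemapindex xmlns="{}">\n{}</sitemapindex>\n'.format(SITEMAP_NS, entries)
    )
```

Each slice from chunk() would become one file rendered by sitemap_xml(), and
the index from sitemap_index_xml() would be the single URL submitted to the
search engine.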

cheers ./daniel

