Discussion: robots.txt sometimes disallowing all?
I noticed an unusual search result shown as the top result by Google (search query "POSTGRESQL DROP TRIGGER", first result for me leads to www.postgresql.org/docs/8.3/static/sql-droptrigger.html ). The title of the result is somehow "英語 - PostgreSQL", and below that title reads: "A description for this result is not available because of this site's robots.txt – learn more."

Sure enough, when I checked http://www.postgresql.org/robots.txt in Chrome on OS X, I see:

User-agent: *
Disallow: /

though when I check in other browsers (Safari, wget), I see a more reasonable robots.txt:

===
User-agent: *
Disallow: /admin/
Disallow: /account/
Disallow: /docs/devel/
Disallow: /list/
Disallow: /search/
Disallow: /message-id/raw/
Disallow: /message-id/flat/

Sitemap: http://www.postgresql.org/sitemap.xml
===

Is it intentional that we're serving up that first robots.txt to (apparently) Googlebot and Chrome?

Josh
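To make the difference between the two files concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. It only illustrates how a generic parser reads the two variants quoted above; it is not a claim about how Googlebot itself evaluates them. The names `can_crawl`, `BLOCK_ALL`, `NORMAL` and `DOC_URL` are made up for this example.

```python
from urllib.robotparser import RobotFileParser

# The two robots.txt variants quoted in the message above.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

NORMAL = """\
User-agent: *
Disallow: /admin/
Disallow: /account/
Disallow: /docs/devel/
Disallow: /list/
Disallow: /search/
Disallow: /message-id/raw/
Disallow: /message-id/flat/

Sitemap: http://www.postgresql.org/sitemap.xml
"""

# The documentation page from the search result in the report.
DOC_URL = "http://www.postgresql.org/docs/8.3/static/sql-droptrigger.html"


def can_crawl(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


print("normal file allows docs page:   ", can_crawl(NORMAL, DOC_URL))
print("block-all file allows docs page:", can_crawl(BLOCK_ALL, DOC_URL))
```

Under the block-all variant a crawler that honours robots.txt may not fetch the documentation page at all, which would fit the "description not available" text in the search result.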
This behavior still seems to be going on, but I think I have a clue. I noticed while experimenting with:

wget -O robots.txt http://www.postgresql.org/robots.txt && cat robots.txt

that wget tells me the available servers for www.postgresql.org it has found in DNS:

Resolving www.postgresql.org... 87.238.57.232, 217.196.149.50, 174.143.35.230

When I land on 217.196.149.50 or 87.238.57.232, I get the normal robots.txt. When I land on 174.143.35.230, I get the bad version disallowing all access to the site. BTW, this behavior does not seem to depend on the user-agent string, contrary to my earlier speculation.

Could someone please check out what's going on with robots.txt on 174.143.35.230, as it seems to be seriously screwing with our Google search results.

Josh

On Wed, Jun 18, 2014 at 9:26 AM, Josh Kupershmidt <schmiddy@gmail.com> wrote:
> Is it intentional that we're serving up that first robots.txt to
> (apparently) Googlebot and Chrome?
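The per-address check described above can be automated rather than waiting for wget to happen to pick each server. The following is a rough sketch, assuming plain HTTP on port 80 as in the 2014 setup discussed here. The helper names (`resolve_ipv4`, `fetch_robots`, `group_by_body`, `report`) are hypothetical, and the addresses DNS returns today may differ from the ones in this thread.

```python
import http.client
import socket
from collections import defaultdict


def resolve_ipv4(host: str) -> list[str]:
    """Return every IPv4 address DNS currently advertises for `host`."""
    infos = socket.getaddrinfo(host, 80, family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def fetch_robots(ip: str, host: str, timeout: float = 10.0) -> str:
    """Fetch /robots.txt from one specific address, sending the real Host header."""
    conn = http.client.HTTPConnection(ip, 80, timeout=timeout)
    try:
        conn.request("GET", "/robots.txt", headers={"Host": host})
        return conn.getresponse().read().decode("utf-8", errors="replace")
    finally:
        conn.close()


def group_by_body(bodies: dict[str, str]) -> dict[str, list[str]]:
    """Group addresses by the robots.txt text they served."""
    groups: dict[str, list[str]] = defaultdict(list)
    for ip, body in sorted(bodies.items()):
        groups[body.strip()].append(ip)
    return dict(groups)


def report(host: str) -> None:
    """Print which addresses serve which robots.txt (performs network requests)."""
    bodies = {ip: fetch_robots(ip, host) for ip in resolve_ipv4(host)}
    for body, ips in group_by_body(bodies).items():
        print(", ".join(ips))
        print(body)
        print("---")
```

Calling `report("www.postgresql.org")` would print one group per distinct file; more than one group means the servers behind the name disagree.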
Thanks for the diagnostics! I suspected it was something like that, but somehow managed to misplace your original report and therefore didn't investigate it further.
I will take a look at it tonight.
//Magnus
On Tue, Jun 24, 2014 at 4:05 PM, Josh Kupershmidt <schmiddy@gmail.com> wrote:
> This behavior seems to still be going on, but I think I have a clue. I
> noticed while experimenting with:
>
> wget -O robots.txt http://www.postgresql.org/robots.txt && cat robots.txt
>
> that wget tells me the available servers for www.postgresql.org it has
> found in DNS:
>
> Resolving www.postgresql.org... 87.238.57.232, 217.196.149.50, 174.143.35.230
>
> When I fall to 217.196.149.50 and 87.238.57.232, I get the normal
> robots.txt. When I fall to 174.143.35.230, I get the bad version
> disallowing all access to the site. BTW, this behavior seems to not be
> dependent on the user-agent string, contrary to my earlier
> speculation. Could someone please check out what's going on with
> robots.txt on 174.143.35.230, as it seems to seriously be screwing
> with our Google search results.
>
> Josh
--
Sent via pgsql-www mailing list (pgsql-www@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-www
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
Yup, the problem was what I thought it was - the list of frontend IPs wasn't properly updated when we renumbered the network in one of our hosting centers.
Fixed now, thanks for being persistent! Now we just have to wait for Google to pick it up.
//Magnus
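A drift like this between DNS and a hand-maintained address list can be caught with a simple set comparison. The sketch below is illustrative only: the thread does not say where the frontend list lives or what it contained, so `CONFIGURED_FRONTENDS` is a hypothetical stale list (192.0.2.10 is a documentation-only address standing in for a pre-renumbering entry), and `unlisted_frontends` is a made-up helper name. The DNS addresses are the ones wget reported earlier in the thread.

```python
def unlisted_frontends(dns_ips, configured_ips):
    """Return addresses advertised in DNS that the configured list does not know."""
    return sorted(set(dns_ips) - set(configured_ips))


# Addresses wget reported for www.postgresql.org in this thread.
DNS_FRONTENDS = ["87.238.57.232", "217.196.149.50", "174.143.35.230"]

# Hypothetical stale configuration, invented for illustration.
CONFIGURED_FRONTENDS = ["87.238.57.232", "217.196.149.50", "192.0.2.10"]

print(unlisted_frontends(DNS_FRONTENDS, CONFIGURED_FRONTENDS))
```

Run after any renumbering, a check of this kind flags addresses that DNS hands out but the configuration has not caught up with.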
On Tue, Jun 24, 2014 at 4:15 PM, Magnus Hagander <magnus@hagander.net> wrote:
> Thanks for the diagnostics! I was expecting it was something like that, but somehow managed to misplace your original report and therefore didn't investigate it further.
>
> I will take a look at it tonight.
>
> //Magnus