• ashtrix@lemmy.ca
    1 year ago

    Yeah, it’s already too late. Why didn’t they provide this before they scraped all those websites?

    • P03 Locke@lemmy.dbzer0.com
      1 year ago

      You think Google thought about robots.txt before they developed their search engine? Nah, it’s all public Internet, and they scraped away.

      A non-zero percentage of web sites will bother to follow these instructions, but it might as well be zero.

      • Scrubbles@poptalk.scrubbles.tech
        1 year ago

        Yeah, I always assumed robots.txt only told them to hide it from search results, but Google still scrapes everything they can from you. It just gives the illusion that they skipped over you.

      • The Doctor@beehaw.org
        1 year ago

        Very early on, at least, their spiders respected robots.txt.

        I know there are folks who have all of the Big G in their robots.txt files on principle. Might want to ask them whether it works or not.

        • chameleon@kbin.social
          1 year ago

          I do, and I can confirm there are no requests (except for robots.txt and the odd /favicon.ico). Google sorta respects robots.txt. They do have a weird gotcha though: they still put the URLs in search, they just appear with a useless description. Their suggestion to avoid that can be summarized as “don’t block us, let us crawl and just tell us not to use the result, just trust us!”, when they could very easily change that behavior to make more sense. Not a single damn person with Google blocked in robots.txt wants to be indexed. Their logic on password protecting kind of makes sense, but my concern isn’t security, it’s that I don’t like them (or Bing or Yandex).
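
For anyone wanting to sanity-check a block like the one described above, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents, the `is_allowed` helper, and the example.com URL are made up for illustration. It only shows what a rule-following crawler would conclude about crawling; it says nothing about whether the URL still ends up in a search index, which is the gotcha described above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block Google's main crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /

User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a compliant crawler with this user agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Illustrative checks against a placeholder URL.
googlebot_allowed = is_allowed(ROBOTS_TXT, "Googlebot", "https://example.com/page")
other_allowed = is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/page")

print("Googlebot may crawl:", googlebot_allowed)
print("Other bots may crawl:", other_allowed)
```

A crawl block and an index block are separate things: the "let us crawl and tell us not to use the result" route is presumably a `noindex` directive, which a crawler can only see if it is allowed to fetch the page.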

          Another gotcha I’ve seen linked is that their ad targeting bot for Google AdSense (different crawler) doesn’t respect a * exclusion, but that kind of makes sense since it will only ever visit your site if you place AdSense ads on it.
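
If the `*` group really is ignored by the AdSense crawler, the usual workaround would be to name it in its own group. A rough sketch follows, assuming the crawler's token is `Mediapartners-Google` (the token Google's crawler documentation lists for AdSense; worth double-checking). The `build_robots_txt` helper is hypothetical, and the parser check only shows what a rule-following crawler would do, not what Google's bot actually does.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(blocked_agents: list[str]) -> str:
    """Emit one explicit Disallow group per named agent, plus a catch-all."""
    groups = [f"User-agent: {agent}\nDisallow: /\n" for agent in blocked_agents]
    # Catch-all group for crawlers that do honor the wildcard.
    groups.append("User-agent: *\nDisallow: /\n")
    return "\n".join(groups)


# Assumed token for the AdSense crawler; not taken from the thread itself.
robots_txt = build_robots_txt(["Mediapartners-Google"])

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
adsense_allowed = parser.can_fetch("Mediapartners-Google", "https://example.com/page")

print(robots_txt)
print("AdSense crawler may crawl:", adsense_allowed)
```

As the comment says, this only matters on a site that actually carries AdSense ads, and blocking the crawler there would likely hurt ad targeting.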

          And I suppose they’ll train Bard on all data they scraped because of course. Probably no way to opt out of that without opting out of Google Search as well.