Now you can block OpenAI’s web crawler

source link: https://www.theverge.com/2023/8/7/23823046/openai-data-scrape-block-ai
Internet users can block GPTBot and keep their site out of ChatGPT.

By Emilia David, a reporter who covers AI. Prior to joining The Verge, she covered the intersection between technology, finance, and the economy.

Aug 7, 2023, 5:36 PM UTC

OpenAI now lets you block its web crawler from scraping your site to help train GPT models.

In a blog post, OpenAI said website operators can specifically disallow its GPTBot crawler in their site’s robots.txt file or block its IP address. “Web pages crawled with the GPTBot user agent may potentially be used to improve future models and are filtered to remove sources that require paywall access, are known to gather personally identifiable information (PII), or have text that violates our policies,” OpenAI said in the blog post. For sources that don’t fit the excluded criteria, “allowing GPTBot to access your site can help AI models become more accurate and improve their general capabilities and safety.”
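As a rough illustration of the robots.txt route, the sketch below uses Python's standard `urllib.robotparser` module to check how a set of rules treats a given crawler. The rules shown are the generic robots.txt form for disallowing one user agent across a whole site; they are an assumption written for this example, not text quoted from OpenAI's post, and `example.com` and the `is_allowed` helper are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt content (an assumption for this sketch):
# disallow the GPTBot user agent everywhere, leave other crawlers alone.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules let user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/some/article"  # placeholder URL
    print("GPTBot allowed:", is_allowed(ROBOTS_TXT, "GPTBot", page))
    print("Googlebot allowed:", is_allowed(ROBOTS_TXT, "Googlebot", page))
```

A check like this only shows what the rules say. robots.txt is a voluntary convention, so it works only to the extent that a crawler chooses to honor it.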

Blocking GPTBot may be the first step toward OpenAI letting internet users opt out of having their data used to train its large language models. It follows some early attempts at creating a flag that would exclude content from training, like a “NoAI” tag conceived by DeviantArt last year. Blocking the crawler does not, however, retroactively remove a site's previously scraped content from ChatGPT’s training data.

The internet provided much of the training data for large language models such as OpenAI’s GPT models and Google’s Bard. However, OpenAI won’t confirm whether it got its data from social media posts or copyrighted works, or which parts of the internet it scraped for information. Sourcing data for AI training has also become increasingly contentious. Sites including Reddit and Twitter have pushed to crack down on the free use of their users’ posts by AI companies, while authors and other creatives have sued over alleged unauthorized use of their works. Lawmakers also latched onto data privacy and consent questions in several Senate hearings around AI regulation last month.

As reported by Axios, companies like Adobe have floated the idea of marking data as not for training through an anti-impersonation law. AI companies, including OpenAI, signed an agreement with the White House to develop a watermarking system to let people know if something was generated by AI but made no promises to stop using internet data for training.


