Creator of AI Image Scraping Tool Tells Unhappy Website Owners to Just Opt Out


The creator of a tool that scrapes the internet for images to feed AI image generators like Stable Diffusion is telling website owners who want him to stop that they have to actively opt out, and says it’s “sad” they are fighting the inevitable rise of AI.

“It’s unfortunate that most of you don’t realize the potential of AI and, rather than unlocking it, decide to fight it,” said Romain Beaumont, creator of the img2dataset image-scraping tool, on the project’s GitHub page. “You will have many opportunities to benefit from AI in the coming years. I hope you see it sooner rather than later.”

Img2dataset is a free tool Beaumont shares on GitHub that automatically downloads and converts the images behind a list of URLs. The result is an image dataset of the kind used to train image-generating AI models such as OpenAI’s DALL-E, the open-source Stable Diffusion model, and Google’s Imagen. Beaumont is also an open-source contributor to LAION-5B, one of the world’s largest image datasets, which contains over 5 billion images and was used to train Imagen and Stable Diffusion.

Img2dataset will try to scrape images from any site whose owners haven’t added HTTP headers like “X-Robots-Tag: noai” and “X-Robots-Tag: noindex”. That puts the onus on site owners, many of whom probably don’t even know img2dataset exists, to opt out.
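For site owners who do want to send those opt-out headers, here is a minimal sketch of one way to do it, as WSGI middleware that attaches the header to every response. The helper names (`add_noai_header`, `hello_app`) are illustrative, not part of img2dataset; whether a given scraper honors the header is entirely up to that scraper.

```python
# Sketch: WSGI middleware that adds the "noai"/"noindex" opt-out
# header to every response the wrapped app produces.

def add_noai_header(app):
    """Wrap a WSGI app so every response carries X-Robots-Tag."""
    def middleware(environ, start_response):
        def patched_start(status, headers, exc_info=None):
            # Append the opt-out header alongside the app's own headers.
            headers = list(headers) + [("X-Robots-Tag", "noai, noindex")]
            return start_response(status, headers, exc_info)
        return app(environ, patched_start)
    return middleware

def hello_app(environ, start_response):
    # Stand-in for a real site.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]

app = add_noai_header(hello_app)
```

The same header can equally be set at the web-server or CDN layer; the point is only that the opt-out is a response header, not anything visible in the page itself.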

On Sunday, Terence Eden commented on the tool’s GitHub page that it had “hammered” several of his sites, and asked Beaumont to make it opt-in instead.

“I don’t understand why opting out of this tool obligates me to add a new header to my site,” Eden said. “Can you please change the default behavior so that it only scrapes sites that send X-Robots-Tag: YesAI?”

“If you don’t want people to get the images from your website, the best way is to not put them online,” Beaumont replied. Beaumont did not respond to a request for comment.

When Eden and other GitHub commenters pushed back, Beaumont said it would be “unethical” to make img2dataset opt-in instead of opt-out.

“It is definitely unethical to let a small minority who do not want to share their images keep the benefits of the latest generation of AI tools from everyone else,” he said on GitHub. “Opting out is not immoral. You can withhold your consent for anything you want. It seems that millions are trying to decide [sic] without asking other people’s permission.”

In an email to Motherboard, Eden noted that img2dataset was scraping his own website, OpenBenches, which invites users to upload photos and locations of memorial benches from around the world. OpenBenches currently maps 27,629 benches and hosts 250GB of photos.

“I got a notification from my host that the site was under sustained attack,” Eden said. “I had to spend part of my weekend upgrading my server, paying extra for outbound traffic, and blocking abuse from this particular bot.”

Beaumont also defended img2dataset by comparing it to the way Google indexes websites to power its search engine.

“I benefit directly from search engines when they drive useful traffic to me,” Eden told Motherboard. “But, most importantly, Google’s bot is respectful and won’t hammer my site. And most bots respect the robots.txt directive. Romain’s tool does not. It appears to be deliberately set up to ignore the guidelines website owners have in place. And, frankly, it doesn’t do me any direct good.” The robots.txt file tells crawlers like Google’s which parts of a site they are allowed to access.
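The robots.txt convention Eden refers to is simple enough to show in a few lines. This sketch uses Python’s standard `urllib.robotparser` to ask whether a given crawler may fetch a given URL; the example robots.txt, the `/photos/` path, and the use of “img2dataset” as a user-agent string are all illustrative assumptions, not taken from the tool itself.

```python
# Sketch: how a well-behaved crawler consults robots.txt before fetching.
from urllib import robotparser

# Hypothetical robots.txt: any crawler may visit the site,
# but nothing under /photos/ should be fetched.
ROBOTS_TXT = """\
User-agent: *
Disallow: /photos/
"""

def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` is permitted to fetch `url`."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url)
```

A crawler that honors the convention would call something like `allowed(...)` before every request and skip disallowed URLs; Eden’s complaint is that img2dataset performs no such check by default.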

The recent popularity of AI tools raises questions about permission and ownership that are as old as the internet. Google’s featured snippets excerpt the most relevant content from websites, often removing the need to visit them at all. Facebook boosted engagement with news stories in its News Feed, then cornered most of the advertising dollars, squeezing media companies (some countries, like Australia, now require Facebook to pay media companies for this practice).

Tools like ChatGPT and Stable Diffusion work similarly in that they scrape vast swaths of the internet—articles, forum posts, art, photos, and more—without ever giving the people who shared that material with friends or fans online a chance to opt out. Much of this data predates the existence of OpenAI, Stability AI, or the LAION datasets.

The leaders of the new crop of AI companies believe their technology could affect as much as 80 percent of jobs in the U.S. and cause “huge” disruptions to society. We should be skeptical of these claims, but it’s worth noting that the people building tools they believe will be this disruptive want internet users’ data to fuel the technology, and they are taking it without asking the internet users who power the AI.

Big companies watching how AI is shaping up are no fools, either. Executives see new revenue potential in AI and want their cut. Last week, Reddit said it would change its API so that Google, OpenAI, and other companies can no longer scrape it for free. A few days earlier, Stack Overflow, which ChatGPT may one day largely replace as a resource for programmers, said it would do the same. Elon Musk has threatened to sue OpenAI for scraping Twitter for data.

The logic is simple: Why would these companies sit idly by while a new generation of technology scrapes their data to build tools that may later compete with them? Why should they hand over that data for free?

Individual internet users like Eden have been asking the same questions all the while the AI boom has played out. They simply don’t have an easy way to fight back.

“Thousands of tools are released every day,” Eden said. “Am I expected to play Whac-a-Mole and block every new one that appears? Expecting people to opt out is a perverse way to behave. These bots cost people time and money without providing any tangible benefit… Consent is the bedrock of ethics. Datasets built on non-consensual data pose a clear threat to the owners and users of that model.”
