• The Chatbots May Poison Themselves

    From ltlee1@21:1/5 to All on Thu Jun 22 08:20:42 2023
    https://www.theatlantic.com/technology/archive/2023/06/generative-ai-future-training-models/674478/

    Generative AI is utterly reliant on the sustenance it gets from the web: Computers mime intelligence by processing almost unfathomable amounts of data and deriving patterns from them. ChatGPT can write a passable high-school essay because it has read
    libraries’ worth of digitized books and articles, while DALL-E 2 can produce Picasso-esque images because it has analyzed something like the entire trajectory of art history. The more they train on, the smarter they appear.

    Eventually, these programs will have ingested almost every human-made bit of digital material. And they are already being used to engorge the web with their own machine-made content, which will only continue to proliferate—across TikTok and Instagram,
    on the sites of media outlets and retailers, and even in academic experiments. To develop ever more advanced AI products, Big Tech might have no choice but to feed its programs AI-generated content, or just might not be able to sift human fodder from the
    synthetic—a potentially disastrous change in diet for both the models and the internet, according to researchers.

    The problem with using AI output to train future AI is straightforward. Despite stunning advances, chatbots and other generative tools such as the image-making Midjourney and Stable Diffusion remain sometimes shockingly dysfunctional—their outputs
    filled with biases, falsehoods, and absurdities. ... In a recent study on this phenomenon, which has not been peer-reviewed, Ilia Shumailov and his co-authors describe the conclusion of those amplified errors as model collapse: “a degenerative process
    whereby, over time, models forget,” almost as if they were growing senile.

    Generative AI produces outputs that, based on its training data, are most probable. ... Each successive AI trained on past AI would lose information on improbable events and compound those errors, Aditi Raghunathan, a computer scientist at Carnegie
    Mellon University, told me. You are what you eat.
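
    As a rough illustration of the dynamic Raghunathan describes (a toy
    sketch, not anything from the article or from Shumailov's study):
    treat each "model" as nothing more than a Gaussian fitted to the
    previous model's samples. Repeated fitting and resampling steadily
    erodes the tails of the distribution, which is the statistical
    skeleton of model collapse.

        import numpy as np

        rng = np.random.default_rng(0)
        n_samples = 100     # size of each generation's "training set"
        n_generations = 51  # rounds of training on the previous model's output

        # Generation 0: "human" data drawn from the true distribution.
        data = rng.normal(loc=0.0, scale=1.0, size=n_samples)

        for gen in range(n_generations):
            # "Train" a model: maximum-likelihood Gaussian fit to the data.
            mu_hat, sigma_hat = data.mean(), data.std()
            if gen % 10 == 0:
                print(f"gen {gen:2d}: mean={mu_hat:+.3f}  std={sigma_hat:.3f}")
            # "Generate" the next training set entirely from the fitted model.
            data = rng.normal(mu_hat, sigma_hat, n_samples)

    Run long enough, the printed standard deviation drifts toward zero:
    improbable tail events vanish first, and each generation compounds the
    loss until the model can only reproduce a narrow average.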
    ...
    In other words, the programs could only spit back out a meaningless average—like a cassette that, after being copied enough times on a tape deck, sounds like static. As the science-fiction author Ted Chiang has written, if ChatGPT is a condensed
    version of the internet, akin to how a JPEG file compresses a photograph, then training future chatbots on ChatGPT’s output is “the digital equivalent of repeatedly making photocopies of photocopies in the old days. The image quality only gets worse.”

    ...
    The potential for AI-generated data to cause model collapse, then, underscores the need to curate training datasets. “Filtering is a whole research area right now,” Dimakis told me. “And we see it has a huge impact on the quality of the models”:
    given enough data to filter from, a program trained on a smaller amount of high-quality inputs can outperform a bloated one. Just as synthetic data aren’t inherently bad, “human-generated data is not a gold standard,” Shumailov said. “We need data
    that represents the underlying distribution well.” Human and machine outputs are just as likely to be misaligned with reality (many existing discriminatory AI products were trained on human creations).
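
    The excerpt leaves what such filtering looks like abstract. As a loose
    sketch (every heuristic and threshold below is invented for
    illustration, not taken from the article or from any real pipeline), a
    curation pass often starts with cheap rule-based filters before any
    learned quality scoring:

        def repetition_ratio(text: str) -> float:
            """Share of duplicated 3-word shingles; heavy repetition is a
            common symptom of low-quality or machine-generated filler."""
            words = text.split()
            shingles = [" ".join(words[i:i + 3]) for i in range(len(words) - 2)]
            return (1.0 - len(set(shingles)) / len(shingles)) if shingles else 0.0

        def keep(text: str, min_words: int = 8, max_repetition: float = 0.3) -> bool:
            # Two toy heuristics; real pipelines stack many such rules plus
            # deduplication and learned quality classifiers.
            return (len(text.split()) >= min_words
                    and repetition_ratio(text) <= max_repetition)

        docs = [
            "Too short to keep.",
            "buy now " * 8,  # 16 words of pure repetition
            "A longer passage with enough distinct words and structure to "
            "pass both of the toy quality checks defined above.",
        ]
        print([keep(d) for d in docs])  # [False, False, True]

    Dimakis's point carries through even in this toy: shrinking the corpus
    is acceptable when what gets removed is mostly noise.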

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)