Penguin Random House, one of the world’s largest publishers, has taken action to prevent firms from using its vast portfolio to train AI systems, according to a report from publishing trade publication The Bookseller. AI companies often scrape sources like books, newspapers, and social media to train their AI models, which has led to legal controversies. Along with Simon & Schuster, Hachette, HarperCollins, and Macmillan Publishers, Penguin Random House is one of the “Big Five” English-language publishers, which together controlled 80% of the US book trade as of 2022.
Penguin has updated the copyright wording on all its titles worldwide and across all its imprints. It now states: “No part of this book may be used or reproduced in any manner to train artificial intelligence technologies or systems.” According to The Bookseller, this new wording will appear on all new titles and any reprinted old titles.
The statement refers to a European Parliament directive released earlier this year, which lets copyright holders protect their material from text and data mining by AI companies, provided they have opted their work out of such use.
Not all book publishers have taken such a strict stance on how their material is used. Wiley, Oxford University Press, and Taylor & Francis have all signed agreements to allow their content to be used to train AI under certain conditions, The Bookseller reported in August.
Media publications are also divided on AI scraping. In December 2023, The New York Times sued OpenAI and Microsoft for copyright infringement, claiming that millions of its articles were used to train the companies’ AI models. Other outlets have signed deals with companies like OpenAI and Microsoft that permit them to scrape content.