In late January, Dean of Libraries Leo Lo kicked off UVA Library’s new “Ethical Dimension of AI Literacy” series, which this spring will feature numerous presentations by AI scholars from across the University. Lo’s talk, titled “Memory Without Origin: Provenance, Consent, and Trust in the Age of Generative AI,” was originally scheduled for the Shannon Library Seminar Room, but so many people registered that it had to be moved to a larger venue: the auditorium in Harrison/Small, just outside the Special Collections Library.
This turned out to be a perfect setting, as Lo’s talk focused on archives and the importance of protecting them from AI services that could “ingest” them. “Think of all the sensitive materials donated to archives — letters, personal items,” Lo said, gesturing to the Special Collections vault nearby. “Academics care about citation and evidence. Archivists care about context. If we do not set enforceable boundaries for AI use in cultural heritage archives now, we will lose provenance and then lose trust.”
Just say no (to training data)
Generative AI tools (like ChatGPT, Claude, Copilot, and Gemini) use generative models, a class of artificial neural network, to create text, video, images, code, digital audio, and more. These models learn to generate content, make predictions, and identify patterns from “training data”: massive datasets of material gathered from across the open internet (books, webpages, articles, spreadsheets, etc.) that the AI has absorbed, or “ingested.”
AI companies, Lo said, are running out of training data, so they are making deals with content holders to feed their models. For example, Oxford University is partnering with OpenAI (maker of ChatGPT). As part of that agreement, OpenAI will digitize public domain materials in Oxford’s Bodleian Libraries, including dissertations from the 15th to 19th centuries that were previously unavailable online. There are certainly some positive elements to this partnership, Lo said: AI can make inaccessible archival materials publicly accessible, modernize workflows, and improve metadata. “Without this enormous influx of money [from AI companies], it would take forever to digitize and transcribe this stuff,” he said.
However, partnerships like these come at an enormous cost. If archival materials are absorbed into training data for generative models, it is usually impossible to isolate the original source of items once training is complete, Lo said. And once your data has been absorbed by AI, it cannot be removed. “Information without origin is a liability. We are at risk of trading long-term control for short-term convenience,” he said. “Libraries need to be AI-literate in order to maintain control of our archives as well as institutional trust.”
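Lo’s irreversibility point can be illustrated with a toy example (a hypothetical sketch, not something presented in the talk). Even in the simplest trainable model, every item’s contribution is summed into shared parameters, leaving no per-source record that could later be deleted:

```python
import numpy as np

# Toy "training": each document nudges one shared weight vector.
# The documents and 4-dimensional features are illustrative only.
rng = np.random.default_rng(0)
documents = {
    "donor_letter_1921": rng.normal(size=4),
    "oral_history_tape": rng.normal(size=4),
    "public_pamphlet": rng.normal(size=4),
}

weights = np.zeros(4)
for features in documents.values():
    weights += 0.1 * features  # gradient-style update mixes all sources together

# After training, the weights are one blended vector with no per-document
# record inside. In this linear toy the update could still be subtracted out,
# but real models are nonlinear and trained over many interleaved steps, so a
# single item's contribution cannot be isolated or removed after the fact.
print(weights)
```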

The UVA Archival AI Protocol
Lo recently worked with the UVA Special Collections team to develop “The University of Virginia Archival AI Protocol,” a document that provides a practical framework for evaluating AI requests involving cultural heritage collections. It sets AI training and access standards for archival organizations and is organized around a simple rule: “Irreversible models do not get access unless item-level provenance and meaningful attribution can be demonstrated in practice, and the archival organization retains contractually enforceable control to stop further use.”
The protocol is built on three foundational pillars. The first, Provenance and Attribution, highlights the importance of traceability, linking, and credit in AI systems. The second, Donor, Community, and Ethical Obligations, outlines the imperative to honor all commitments made in deeds of gift, transfer documentation, and purchase agreements. The third, Institutional Control, explains the importance of an institution asserting a “right to stop” before entering into an agreement with any AI company.
The document also draws a clear distinction between two types of AI use, which Lo described in his “Memory Without Origin” talk as “the great divide.” General-purpose training, as described above, should be avoided: it absorbs data into the model, often without context, and reversal is not realistic. The other category, retrieval and controlled internal models, is generally permitted under the protocol; these systems consult data without absorbing it, and source items remain stored and controlled by the archive.
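The protocol itself is a prose document, but its core rule reads almost like executable logic. As a purely hypothetical sketch (the function and field names below are illustrative, not part of the protocol), an archive’s decision process might be encoded like this:

```python
from dataclasses import dataclass

@dataclass
class AIRequest:
    """Hypothetical summary of an AI access request; fields are illustrative."""
    absorbs_data_into_model: bool    # general-purpose training vs. retrieval
    item_level_provenance: bool      # traceability demonstrated in practice
    meaningful_attribution: bool     # linking and credit in the AI system
    honors_donor_agreements: bool    # deeds of gift, transfer docs, purchases
    right_to_stop_in_contract: bool  # institution can halt further use

def evaluate(request: AIRequest) -> str:
    # Pillar 2: donor, community, and ethical obligations are non-negotiable.
    if not request.honors_donor_agreements:
        return "deny: violates donor or community commitments"
    # Pillar 3: institutional control must be contractually enforceable.
    if not request.right_to_stop_in_contract:
        return "deny: no enforceable right to stop further use"
    # "The great divide": irreversible training faces the strictest bar (Pillar 1).
    if request.absorbs_data_into_model:
        if request.item_level_provenance and request.meaningful_attribution:
            return "review: training considered only with provenance and attribution"
        return "deny: irreversible model without provenance or attribution"
    # Retrieval and controlled internal models: data is consulted, not absorbed.
    return "permit: retrieval-style use, items remain under archival control"
```

Under this sketch, a request to train a general-purpose model without demonstrated item-level provenance would be denied, while a retrieval system that leaves items in the archive’s custody would be permitted.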

An opportunity for all libraries, museums, and repositories
Lo did not mince words in the closing moments of his talk. “Every time there is a huge new technology it’s an existential threat for libraries,” he said. “There is a great opportunity to set some boundaries around AI rather than be reactive.”
In terms of protecting archives and special collections, Lo said that whatever is not yet open or digitized is what institutions can still control. “I’m trying to protect whatever we have left,” he said. “I don’t like being a passive passenger on this ride. Let’s do something. It doesn’t cost us anything.”
Lo encouraged attendees to spread the word about the Archival AI Protocol beyond the Library, across Grounds, and even to outside institutions. “This protocol is something for all repositories to grab hold of,” he said. “My goal is a shared standard that libraries, archives, and museums can use in real decisions and contracts.”