Machine Learning, Archives and Special Collections: A high level view

Article from the Flash journal, n°38, part of the issue's focus on Artificial Intelligence and Archives.

Extravagant predictions are being made about what the popular media calls “artificial intelligence.” It will eliminate millions of jobs; it will give rise to self-driving cars; it will take over medical diagnosis and the prescription of treatments, as well as corporate and government decision-making. There is a sense that it will do something – usually poorly defined – to transform knowledge work and stewardship, and the activities of cultural memory organizations. This brief article is an attempt to provide some reasonably sober and concrete sense of what actual and relevant changes might occur within the next decade or so, without going into technical details, and what these changes might imply for the practices of archives and special collections, or cultural memory organizations more broadly.

Dr Clifford A. Lynch, Executive Director of the Coalition for Networked Information (CNI), at Jisc/CNI Conference in 2018, Photo Courtesy Jisc

Remarkable strides have been made in recent years, primarily in the specific and limited sub-discipline of machine learning. Roughly speaking, machine learning uses collections of examples to train software to recognize patterns, and to act on that recognition. For example, it has allowed a computer program to become the world champion in the game of Go, which is generally considered much more complex than chess, and it has allowed computers to learn how to excel at various video games. Software now matches human performance in screening various kinds of medical imaging to identify some diseases. Many of the most celebrated predicted breakthroughs couple machine learning with various forms of robotics and “computer vision” (really, a broad range of imaging and other environmental sensors), most notably in applications such as autonomous cars, trucks, ships, drones, or military devices.
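To make the “learning from examples” idea concrete, here is a minimal sketch in Python, using the scikit-learn library and its small bundled dataset of handwritten digits: a classifier is trained on labeled examples and then measured on examples it has never seen. Real systems of the kind described above operate at a vastly larger scale, but the pattern is the same.

```python
# Minimal illustration of learning from labeled examples, using scikit-learn's
# small bundled dataset of handwritten digits. Purely a sketch of the idea.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

digits = load_digits()  # ~1,800 labeled 8x8 images of handwritten digits

# Hold back a portion of the labeled examples to test the trained model.
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

model = LogisticRegression(max_iter=2000)
model.fit(X_train, y_train)           # "training": fit the model to examples
predictions = model.predict(X_test)   # "recognition": label unseen images

print(f"Accuracy on unseen examples: {accuracy_score(y_test, predictions):.2%}")
```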

There are three drivers for the introduction of machine learning: reducing costs by eliminating human labor (autonomous vehicles), outperforming human capabilities (games), or doing things that cannot be accomplished today at the desired scale and with acceptable costs (ubiquitous surveillance). It is this last driver that also offers opportunities for memory organizations.

Some applications where machine learning has led to breakthroughs that are highly relevant to memory organizations include translation from one language to another; transcription from printed or handwritten text to computer representation (sometimes called optical character recognition); conversion of spoken words to text; classification of images by their content (for example, finding images containing dogs, or enumerating all the objects that the software can recognize within an image); and, as a specific and important special case of image identification, human facial recognition. Advances in all of these areas are being driven and guided by the government or commercial sectors, which are vastly better funded than cultural memory; for example, many nation-states and major corporations are intensely interested in facial recognition. The key strategy for the cultural memory sector will be to exploit these advances, adapting and tuning the technologies around the margins for its own needs.
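As a small illustration of exploiting tools built and funded elsewhere, the following sketch runs the open-source Tesseract OCR engine, via the pytesseract wrapper, over a single digitized page with no custom training at all; the file name is invented for the example.

```python
# Off-the-shelf transcription of a digitized printed page.
# The image path is hypothetical; any scanned page would do.
from PIL import Image
import pytesseract

page = Image.open("digitized/box12_folder3_page001.tif")
text = pytesseract.image_to_string(page)  # imperfect, but cheap and immediate
print(text[:500])
```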

It’s important to note that, when speaking of technology investment in machine learning, this investment comes in several forms. There’s investment in software, and in the computational algorithms that underlie the software. Critically, there’s investment in training data: the data used to train and validate machine learning models. It generally consists of large collections of cases that have been evaluated by the best human experts – for example, radiology images with annotations as to whether there is a tumor; summaries of the contents of photographs; pictures of faces with names attached. Gathering these collections of training data can be very challenging, and often involves repurposing other data – facial images and names collected by governments for drivers’ licenses or passports, for example. The cultural memory sector needs to think very carefully about what datasets exist that can be similarly repurposed or adapted (perhaps through crowdsourcing) for its own training purposes.
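What such training data looks like in practice can be sketched very simply: pairs of items and expert-supplied labels, with a portion held back to validate the trained model. The catalogue file, its column names, and the 80/20 split below are all hypothetical, meant only to show the shape of the thing.

```python
# Hypothetical sketch: build a training set from existing catalogue metadata
# where curators or volunteers have already named the people in photographs.
import csv
import random

with open("catalogue.csv", newline="", encoding="utf-8") as f:
    labeled_examples = [
        (row["image_file"], row["person_name"])
        for row in csv.DictReader(f)
        if row["person_name"]  # keep only items a human expert has labeled
    ]

random.shuffle(labeled_examples)
split = int(0.8 * len(labeled_examples))
training_set = labeled_examples[:split]    # used to fit the model
validation_set = labeled_examples[split:]  # used to check it on unseen items

print(f"{len(training_set)} training examples, {len(validation_set)} validation examples")
```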

One of the great, largely unexplored challenges for cultural memory organizations is the extent to which it is advantageous to “customize” or specifically train machine learning on individual collections – an individual’s handwriting, as opposed to Victorian copperplate script broadly; or the set of family members likely to appear in a collection of photographs. Creating these training sets will be expensive, and the cost and workflow trade-offs will be critical.
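By way of illustration, a collection-specific customization might look roughly like the sketch below: a publicly available pretrained handwritten-text-recognition model (here, a TrOCR checkpoint distributed through the Hugging Face transformers library) is trained a little further on a handful of expert-transcribed lines from one writer's papers. The page images and transcriptions are invented, and a real fine-tuning effort would involve far more data and care; the point is only that the expensive ingredient is the transcribed examples, not the software.

```python
# Hedged sketch of collection-specific fine-tuning: start from a generic
# pretrained handwriting model and continue training on one writer's hand.
import torch
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

# A handful of expert-transcribed lines from the individual's papers
# (file paths and text are invented for illustration).
collection_examples = [
    ("letters/1893_page1_line3.png", "My dear Margaret, the harvest is late this year"),
    ("letters/1893_page1_line4.png", "and the north field remains under water."),
]

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for image_path, transcription in collection_examples:
    pixel_values = processor(Image.open(image_path).convert("RGB"),
                             return_tensors="pt").pixel_values
    labels = processor.tokenizer(transcription, return_tensors="pt").input_ids
    loss = model(pixel_values=pixel_values, labels=labels).loss
    loss.backward()       # nudge the generic model toward this writer's hand
    optimizer.step()
    optimizer.zero_grad()
```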

For the next few decades, machine learning technologies in cultural memory institutions are only going to be relevant to material that is already in digital form; they will require collections to be digitized, or to be born digital and thus accessioned as digital materials. This is critical, because it limits the scope for applying these technologies: in some institutions, not much material is digital. But it is also worth recognizing that, increasingly, new material entering archives and special collections is arriving in digital form. I do not believe that we will soon see many scholar/archivist/curator robots roaming among our physical collections, selecting and examining and analyzing materials.

Consider the barriers to deploying self-driving vehicles: while this would save money, the cost of human drivers is currently built into the economy; to justify capturing these savings, it must be conclusively proven that autonomous vehicles are substantially safer than current human drivers. In contrast, the existing state of access to collections is poor due to a lack of resources to hire people; if machine learning is used to improve this access, the risk of errors is usually low compared to current practice. Think of a special collection holding a large number of photographs of people. This is a relatively low-risk environment in which to deploy facial recognition. Most commonly, there is no indexing today of the people in the photographs, so even moderately good recognition will be a substantial improvement. Also, the cost of error is low: a failure to identify a person in a photo will not create a national security risk, and a false identification won’t cause an innocent person to be detained and questioned, or worse. Indeed, the biggest challenge for the curators of a collection indexed by less-than-perfect facial recognition software will be getting a cadre of grateful users to understand that the software is in fact imperfect, and to absorb a sense of the false-positive and false-negative rates and the most common failure scenarios.
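Communicating those failure rates need not be mysterious; in essence it is a matter of comparing the software's identifications against a sample that experts have already checked, as in this small sketch (the sample data is invented):

```python
# Estimate false-negative and false-positive rates for one person of interest
# by checking the software against an expert-verified sample of photographs.
def failure_rates(expert_labels, software_labels):
    misses = false_alarms = positives = negatives = 0
    for expert_present, software_present in zip(expert_labels, software_labels):
        if expert_present:
            positives += 1
            if not software_present:
                misses += 1        # person is in the photo, software missed them
        else:
            negatives += 1
            if software_present:
                false_alarms += 1  # software "found" someone who is not there
    return misses / positives, false_alarms / negatives

# Hypothetical spot check on 10 photographs for one named person.
expert   = [True, True, False, True, False, False, True, False, True, False]
software = [True, False, False, True, True, False, True, False, True, False]
fnr, fpr = failure_rates(expert, software)
print(f"Missed the person in {fnr:.0%} of photos where they appear; "
      f"false identification in {fpr:.0%} of photos where they do not.")
```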

Let me conclude with three points. First, workflows, and the appropriate organization and structuring of data, will be critical to progress. In many machine learning and analysis applications today, the vast majority of time is spent collecting and cleaning up the data and putting workflows in place, rather than in the core machine learning work. Memory organizations will face these challenges, and they are likely to severely constrain progress. In addition, some machine learning is computationally intensive, and hence expensive to train and subsequently operate.

Second, improved access is going to lead to many debates about privacy and best practices. Facial recognition will be a driver here. Just consider the experience of many universities that have digitized old school yearbooks. These can feed databases to train facial recognition in other contexts (they contain pictures with names attached), but also, once indexed, they can surface very embarrassing or regrettable images linked to people who perhaps became public figures many years later. The appropriateness of applying facial-recognition-based indexing will be a subject of major debate in the coming years; this is already a very real issue in the social media context, and it will spread to archives and special collections.

Finally, consider one additional scenario for machine learning in memory organizations, this one leveraging investments in machine learning by the intelligence, law enforcement and forensics sectors. More and more often, donations of “personal papers” are dominated by a collection of digital storage devices – laptops, external hard drives, and the like. Other than securing the bits onto modern storage media, the general reaction to these acquisitions is despair: curatorial staff will never catch up with evaluating and describing these materials. Imagine a machine learning application that could at least perform first-level triage and classification of these digital materials. I believe this is well within reach in the next few years.
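Even a very simple first pass is useful. The sketch below (with a hypothetical accession path) walks the files copied from a donated drive and summarizes them by broad type and year, the kind of scaffolding on which a machine-learned classification of document content could later be layered.

```python
# First-level triage of a donated drive: group copied files by broad type and
# by year last modified, so curators can see at a glance what a donation holds.
# The accession directory is hypothetical.
import mimetypes
from collections import Counter
from datetime import datetime
from pathlib import Path

donation_root = Path("/archives/accessions/2024-017/disk-image-contents")

summary = Counter()
for path in donation_root.rglob("*"):
    if path.is_file():
        mime, _ = mimetypes.guess_type(path.name)
        broad_type = (mime or "unknown/unknown").split("/")[0]  # e.g. image, text, video
        year = datetime.fromtimestamp(path.stat().st_mtime).year
        summary[(broad_type, year)] += 1

for (broad_type, year), count in sorted(summary.items()):
    print(f"{year}  {broad_type:<12} {count} files")
```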

This scenario typifies what I believe to be the overall near-term effects of machine learning applications, to the extent that memory organizations can develop the skills and workflows to apply them: it will substantially improve the ability to process and provide access to digital collections, which has historically been greatly limited by a shortage of expert human labor. But this will be at the cost of accepting quality and consistency that will often fall short of what human experts have been able to offer when available.

My thanks to Cecilia Preston, Mary Lee Kennedy, Joan Lippincott and Diane Goldenberg-Hart for helpful comments on drafts of this essay.

Clifford A. Lynch

Executive Director at the Coalition for Networked Information (CNI)

Licensed under CC BY-NC-SA