Dark Archive

PDFs are difficult for computer systems to decipher into something that can be queried: although it is improving, the Optical Character Recognition (OCR) software that reads PDF image files still makes errors.

Even beyond error-prone full-text searches, computational analyses of PDFs suffer insofar as the character encoding (ASCII or UTF-8) chunks text insufficiently: it records characters but does not mark meaningful units such as place names. For example, at a workshop sponsored by the University of Toronto's Critical Digital Humanities Institute,¹ Sarah McConnell and Julia Flanders demonstrated the value of XML tags when researching and visualizing place names in the corpus of early modern women writers that has been TEI-encoded for the Women Writers Project.
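To make the contrast concrete, here is a minimal sketch in Python of what tagged text allows. The sample passage is invented for illustration and is not drawn from the Women Writers Project corpus; the helper name count_place_names is likewise hypothetical. It assumes only that places are marked with the TEI placeName element in the standard TEI namespace.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# The standard TEI namespace; the prefix "tei" is a local choice.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample for illustration only (not from the WWP corpus).
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <p>She sailed from <placeName>London</placeName> to
    <placeName>Paris</placeName>, and returned to
    <placeName>London</placeName> with Mr. <persName>Paris</persName>.</p>
  </body></text>
</TEI>"""


def count_place_names(tei_xml: str) -> Counter:
    """Count the text of every placeName element in a TEI document."""
    root = ET.fromstring(tei_xml)
    names = (
        # Join nested text and collapse whitespace inside each element.
        " ".join("".join(element.itertext()).split())
        for element in root.iterfind(".//tei:placeName", TEI_NS)
    )
    return Counter(names)


# Tag-aware count: only strings marked up as places are included.
place_counts = count_place_names(SAMPLE)

# Plain-text count: strip the tags and search for the bare string.
plain_text = " ".join("".join(ET.fromstring(SAMPLE).itertext()).split())
plain_paris_hits = plain_text.count("Paris")

print(place_counts)
print(plain_paris_hits)
```

In this sample the tag-aware count finds Paris once as a place, while the plain-text search also matches the person named Paris. Unencoded text extracted from a PDF cannot make that distinction.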

Compare Ted Underwood's preliminary conclusion that novels frequently discuss money: "What we see in Google’s “Fiction” collection is something that happens in volumes of fiction, but not exactly in the genre of fiction — the rise and fall of publishers’ catalogs in the backs of books."²

EMOP video

Notes

1. Praxis Workshop: Exploring Data Modeling, Digital Tools, and Research at the Women Writers Project. 13 March 2023, 2:00 pm - 4:00 pm EDT.

2. How to find English-language fiction, poetry, and drama in HathiTrust. The Stone and the Shell. December 29, 2014.
