Dark Archive
PDFs are difficult for computer systems to decipher into something that can be queried: although it is improving, the Optical Character Recognition (OCR) software that reads PDF image files still makes errors.
Even beyond error-prone full-text searches, computational analyses of PDFs suffer insofar as the underlying character encoding (ASCII or UTF-8) does not chunk the text into meaningful units. For example, at a workshop sponsored by the University of Toronto's Critical Digital Humanities Institute,1 Sarah McConnell and Julia Flanders demonstrated the value of XML tags when researching and visualizing place names in the corpus of early modern women writers that has been TEI-encoded for the Women Writers Project.
Compare Ted Underwood's comment on the preliminary conclusion that novels frequently discuss money: "What we see in Google’s ‘Fiction’ collection is something that happens in volumes of fiction, but not exactly in the genre of fiction — the rise and fall of publishers’ catalogs in the backs of books."2
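The difference between searching a stream of characters and querying tagged text can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample document is invented and is not drawn from the Women Writers Project corpus, the function names are hypothetical, and the project's actual encoding practice may differ. The sketch assumes only the standard TEI namespace and the TEI elements `placeName`, `body`, and `back`.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Standard TEI namespace; the "tei" prefix is just a local label.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample (an assumption for illustration, not WWP data).
# The <back> section stands in for a publisher's catalog.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <p>She sailed from <placeName>London</placeName> to
         <placeName>Jamaica</placeName>, then wrote home to
         <placeName>London</placeName> of Mr. London's kindness.</p>
    </body>
    <back>
      <p>New books sold in <placeName>London</placeName>.</p>
    </back>
  </text>
</TEI>"""


def count_place_names(tei_xml, section="body"):
    """Count tagged place names inside one section of a TEI document."""
    root = ET.fromstring(tei_xml)
    names = root.iterfind(f".//tei:{section}//tei:placeName", TEI_NS)
    # Collapse internal whitespace so line breaks do not split a name.
    return Counter(" ".join("".join(el.itertext()).split()) for el in names)


def count_string(tei_xml, word):
    """Count raw occurrences of a string, ignoring all markup."""
    root = ET.fromstring(tei_xml)
    return "".join(root.itertext()).count(word)


body_places = count_place_names(SAMPLE)
plain_hits = count_string(SAMPLE, "London")

print(body_places)  # Counter({'London': 2, 'Jamaica': 1})
print(plain_hits)   # 4
```

The plain string search finds "London" four times because it also counts the surname and the back matter, whereas the tagged query restricted to the body counts only the two places the encoder marked there. Passing `section="back"` isolates the catalog-like material instead.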
Notes
1. Praxis Workshop: Exploring Data Modeling, Digital Tools, and Research at the Women Writers Project. 13 March 2023, 2:00 pm - 4:00 pm EDT.
2. How to find English-language fiction, poetry, and drama in HathiTrust. The Stone and the Shell. December 29, 2014.