Machine Learning and The Historical Novel
This project came about because I wanted to know if the historical novel had become more or less common over time.
It seemed like a simple question, but it proved to be anything but. One problem is that the historical novel is a loosely defined form, so any effort to count historical novels first requires defining it. Another is that libraries do not track genre in their catalogues. James F. English has shown that literary novels are increasingly set in the past, but this tells us nothing about the historical novel as a form.
With that in mind, I wrote a script that could identify historical novels (I borrowed from Andrew Piper’s work). You can see my code here. I trained the model using a Support Vector Machine algorithm and was able to return results with around 80% accuracy. The main problem I ran into was a lack of material: this approach only works if there is a sizable number of texts for the model to learn from. The best source I could find was TxtLab’s LIWC for Literature, which covers 25,000 novels from the last 200 years. Its biggest limitation was that only a fraction of these novels were historical novels, which meant that I had to train the model using only 60 texts.
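As a rough illustration of this kind of classifier, and not the project's actual code, here is a minimal sketch of training a Support Vector Machine to separate historical from non-historical texts. It assumes scikit-learn, TF-IDF word features, and a handful of made-up one-sentence "novels"; the real features, texts, and labels would differ, and the scores from such a tiny toy set mean nothing in themselves.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy texts standing in for full novels (assumption, not project data).
texts = [
    "The king rode to the castle with his knights in the year 1415.",
    "In the reign of Queen Anne the duke marched his regiment to war.",
    "The legion crossed the river as the emperor watched from his chariot.",
    "She checked her phone on the train and texted her flatmate about dinner.",
    "He parked the car outside the office and opened his laptop.",
    "They met at the airport cafe and argued about the television show.",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = historical novel, 0 = not

# TF-IDF turns each text into a weighted word-count vector;
# LinearSVC is a linear Support Vector Machine classifier.
model = make_pipeline(TfidfVectorizer(), LinearSVC())

# Stratified 3-fold cross-validation gives one accuracy score per fold.
scores = cross_val_score(model, texts, labels, cv=3)
print("mean accuracy:", scores.mean())

# Fit on everything, then label an unseen text.
model.fit(texts, labels)
prediction = model.predict(["The duke summoned his knights to the castle."])
print("predicted label:", prediction[0])
```

With only 60 training texts, cross-validation of this kind is one way to check that an accuracy figure is not an artefact of a single lucky train/test split.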
I presented this project at the DH and University of Virginia Library Fall Open House, and a number of people pointed me to the tools available at the HathiTrust. I attended the HathiTrust classes to learn how to write a script that would let me download textual material and, in the next iteration, I will use the HathiTrust texts to train my model (what I learned there also informed my Booker Prize project). This should improve the accuracy of the model. With this new approach, I will then use random sampling to estimate how many historical novels were published each decade over the last 200 years. This is an ongoing project.
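The random-sampling step could look something like the sketch below: draw a simple random sample from each decade's titles, classify the sample, and scale the proportion of hits up to the size of the decade. Everything here is an assumption for illustration: the function name, the made-up catalogue records, and the toy classifier standing in for the trained model.

```python
import random

def estimate_historical_count(catalogue, classify, sample_size, seed=0):
    """Estimate how many records in `catalogue` are historical novels by
    classifying a simple random sample and scaling the hit rate up."""
    if not catalogue:
        return 0
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    n = min(sample_size, len(catalogue))
    sample = rng.sample(catalogue, n)  # sampling without replacement
    hits = sum(1 for record in sample if classify(record))
    return round(hits / n * len(catalogue))

# Hypothetical catalogue: each decade maps to a list of records.
catalogue_by_decade = {
    1850: [{"id": i, "historical": i % 5 == 0} for i in range(1000)],
    1860: [{"id": i, "historical": i % 4 == 0} for i in range(1200)],
}

def toy_classifier(record):
    # Stand-in for the trained model: just reads the made-up label.
    return record["historical"]

estimates = {
    decade: estimate_historical_count(records, toy_classifier, sample_size=200)
    for decade, records in catalogue_by_decade.items()
}
print(estimates)
```

Because the classifier itself is only about 80% accurate, estimates produced this way would carry both sampling error and classification error, so they are best read as approximate trends across decades.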