In humanistic research, Named Entity Recognition is highly useful, but it mines surface data, rather than revealing the complex nature of relationships between these entities. Named Entity Recognition (NER) extracts the names of people, locations, organizations, and, depending on the model, may also extract references to money, percentages, dates, and times, in addition to a miscellaneous class. Although this is certainly useful, NER does not represent the richness of the documents with which we work.
As a test corpus, this project uses four digitized, OCRed, and hand-cleaned nineteenth-century French chronicles of Ottoman Algerian history. The volumes range between 41,341 words and 170,737 words and cover the period 1567 to 1837 with a focus on Constantine, the easternmost province in Algeria. The challenge is to extract not only named entities and their relations to one another, but to extract unnamed persons and their relationships as well. In simple NER, the names Moustafa and Namoun, would be the only extracted data in the following sentence: ‘Moustafa avait épousé une des filles de Namoun,’ but the daughter of Namoun who married Moustafa would not appear. (‘Moustafa married one of Namoun’s daughters.’ Vayssettes, 2003: 52. Author’s translation.) The goal of this project is to uncover the positions and roles of women in Algerian society, so it is essential to locate and retrieve data about unnamed people.