We then keep all isolated proper nouns. For each of them, we store an index of the sentences they appear in for further processing. This approach allows us to recover all occurrences of each capitalized word, as long as they do not systematically appear at the start of sentences.
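A minimal sketch of this indexing step, under our own assumptions (the function and the simple whitespace tokenization are illustrative, not the paper's actual implementation):

```python
from collections import defaultdict

def index_capitalized_words(sentences):
    """Map each capitalized word to the indices of the sentences containing it,
    skipping sentence-initial tokens, where capitalization is uninformative."""
    index = defaultdict(set)
    for i, sentence in enumerate(sentences):
        tokens = sentence.split()
        for token in tokens[1:]:  # skip the sentence-initial token
            word = token.strip(".,;:!?\"'()")
            if word and word[0].isupper():
                index[word].add(i)
    return dict(index)

sentences = [
    "Martin walked to Paris.",
    "The road to Paris was long.",
    "He met Martin again near the river.",
]
print(index_capitalized_words(sentences))  # {'Paris': {0, 1}, 'Martin': {2}}
```

Note that the first mention of "Martin" is lost because it opens a sentence, which is exactly the limitation stated above.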
In practice, the resulting list needs further refinement, as some capitalized common nouns still end up in it. The reasons for this are multiple, ranging from sentence tokenization errors to typos in the source text or other stylistic effects that may influence punctuation and case. This refinement can be done accurately and efficiently by combining three strategies:
Figure 1 illustrates that the typical distributions often allow for easy separation, with very few outliers.

Figure 1. Typical mean positions of uppercased words in their respective tokenized sentences vs.

Once identified, proper nouns usually fall into three main categories that serve different narrative purposes: they can be characters, places, or others (brands, abstract concepts, acronyms…).
We designed and evaluated six independent classifiers. Each classifier receives one word at a time as input, together with the context that is necessary and relevant for its way of processing data, and returns the predicted category, namely character, place, or other. We first present the implementation characteristics of each component before looking in detail at the resulting scores. When one encounters a proper noun in a sentence, a good guess about its nature can sometimes easily be made from the immediate context.
The simplest case, which we will refer to here as obvious context, is when the noun is immediately preceded by a title or a predicate that hints at what it refers to.
For this classifier, we compiled a simple list of obvious context words that allow us to make good guesses about the nature of the immediately following (or nearly following) proper noun:

In French, as in many other languages, the grammatical structure makes it more likely for sentences to follow a pattern that puts the subject of the action at the beginning and the location toward the end.
This characteristic can be used, when one looks at enough examples, to make a simple yet quite powerful guess about the global roles of the proper nouns. The accuracy of this classifier is, however, strongly dependent on the writing style of the author, as the frequent use of specific figures of speech may break its working hypothesis, and longer sentences may narrow the gap between the categories or blur their boundaries.
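The positional heuristic described above can be sketched as follows (a simplified illustration under our own assumptions, using whitespace tokenization; not the paper's code):

```python
def mean_relative_position(word, sentences):
    """Average the relative position (0 = first token, 1 = last token)
    of `word` over all sentences that contain it."""
    positions = []
    for sentence in sentences:
        tokens = sentence.strip(".").split()
        for i, tok in enumerate(tokens):
            if tok == word:
                positions.append(i / (len(tokens) - 1) if len(tokens) > 1 else 0.0)
    return sum(positions) / len(positions) if positions else None

sents = ["Julien marche vers Paris", "Julien aime Paris"]
# Character names tend toward 0, place names toward 1:
print(mean_relative_position("Julien", sents))  # 0.0
print(mean_relative_position("Paris", sents))   # 1.0
```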
This can be seen clearly in Figure 2, where we show the relative positions of identified classes of names for three different stories.

Figure 2. Relative mean positions of character and place names for three classical French novels.
This approach is very different from section 3. For this implementation, we compiled lists of words that are more likely (but not exclusively) to appear near characters, places, or abstract concepts, respectively. For instance, we expect names of characters to be more often surrounded by words related to emotions, bodily functions, speech, or professions, whereas names of places should be more closely related to motion verbs, place features, and prepositions.
Starting from common nouns that are unambiguously related to one of the categories we are interested in, we used a French synonyms dictionary service to build a list aiming to be as extensive as possible. The final files contained 4, words for characters, for places, and 50 for concepts (see Appendix B for the complete list). The script then looks for these words in the neighborhood of the nouns to be disambiguated and returns the most probable category.
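A minimal sketch of this neighborhood lookup, with tiny illustrative lexicons standing in for the full lists of Appendix B (all names and word lists here are our own assumptions):

```python
# Hypothetical miniature lexicons; the paper's full lists are far larger.
LEXICONS = {
    "character": {"dit", "pensa", "sourit", "docteur"},
    "place": {"vers", "dans", "rue", "arriva"},
    "concept": {"idée", "notion"},
}

def classify_by_neighborhood(name, tokenized_sentences, window=5):
    """Count lexicon hits within `window` tokens of each occurrence of `name`
    and return the category with the most hits (None if no hit at all)."""
    scores = {cat: 0 for cat in LEXICONS}
    for tokens in tokenized_sentences:
        for i, tok in enumerate(tokens):
            if tok == name:
                neighborhood = tokens[max(0, i - window): i + window + 1]
                for cat, lexicon in LEXICONS.items():
                    scores[cat] += sum(1 for t in neighborhood if t.lower() in lexicon)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

sents = [["Julien", "dit", "bonjour"], ["Il", "arriva", "dans", "Paris"]]
print(classify_by_neighborhood("Julien", sents))  # character
print(classify_by_neighborhood("Paris", sents))   # place
```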
As characters and places serve different narrative purposes, one may expect the grammatical constructs surrounding them to differ significantly. For instance, place names are often preceded by prepositions or determiners, whereas character names are expected to be more often directly followed by verbs. We thus introduced a script that classifies names based on its knowledge of the full text, grammatically tagged using TreeTagger and tokenized into sentences. To guess the nature of the names, it matches all sentences containing them against a set of rules representing typical constructions one uses when writing about a person or a place.
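A minimal sketch of such rule matching over POS tags (the rules and tag names below are illustrative simplifications of ours, not the actual rules of Table 1):

```python
def classify_by_rules(occurrences):
    """`occurrences` is a list of (prev_tag, next_tag) POS-tag pairs observed
    around a name. Each rule adjusts category scores; the category with the
    highest accumulated score wins."""
    scores = {"character": 0.0, "place": 0.0, "other": 0.0}
    for prev_tag, next_tag in occurrences:
        if prev_tag == "PRP":        # preceded by a preposition -> likely a place
            scores["place"] += 1.0
        if next_tag == "VER":        # directly followed by a verb -> likely a character
            scores["character"] += 1.0
        if prev_tag == "PUN":        # right after punctuation: suspect tokenization
            for cat in scores:       # penalize every category
                scores[cat] -= 0.5
    return max(scores, key=scores.get)

# "à Paris", "dans Paris" -> place; "Julien dit", "Julien pensa" -> character
print(classify_by_rules([("PRP", "PUN"), ("PRP", "NOM")]))  # place
print(classify_by_rules([("DET", "VER"), (None, "VER")]))   # character
```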
We tried out a set of seven manually established rules covering the most straightforward grammatical constructs (described in detail in Table 1), plus two that help filter out tokenization errors at the sentence level by flagging words that are preceded by a punctuation mark or that are alone in their sentence. When matched, they increase or decrease the probability score of one or several classifications, and the category yielding the highest score is chosen and returned in the end.

Many proper nouns can be unambiguously, or with high probability, related to one or several categories based on general knowledge.
But the same knowledge may equivocally tell that those same words could also refer to a ship (RMS Queen Elisabeth), an abstract concept (project Manhattan), or a place (the Amur river), probably with a lower likelihood if no other context is available. For many nouns, the knowledge we are looking for is well captured in the categorization of their related Wikipedia pages.
Using categories instead of the text of the articles also has the advantage of being very straightforward and greatly reduces the noise associated with text processing techniques. To test this idea, we implemented a simple algorithm that gathers the categories of the page whose name is closest to the noun we are looking for and searches them for tags denoting people, places, or abstract concepts. In case no category gives a hint (which tends to happen with both very complex and very precise pages), it recursively walks up the hierarchy until the necessary clues are found.
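The recursive category walk can be sketched as follows. We use a hard-coded category graph in place of real Wikipedia API calls, and the category names and hint tags are illustrative assumptions of ours:

```python
# Toy stand-in for the Wikipedia category hierarchy (illustrative only).
CATEGORY_PARENTS = {
    "French novelists": ["Writers"],
    "Writers": ["People"],
    "Capitals in Europe": ["Cities"],
    "Cities": ["Populated places"],
}
HINTS = {"People": "character", "Populated places": "place", "Concepts": "other"}

def walk_categories(categories, depth=0, max_depth=3):
    """Return the first category hint found, walking up the hierarchy
    until a hint appears or max_depth is exceeded."""
    if depth > max_depth:
        return None
    for cat in categories:
        if cat in HINTS:
            return HINTS[cat]
    parents = [p for cat in categories for p in CATEGORY_PARENTS.get(cat, [])]
    return walk_categories(parents, depth + 1, max_depth) if parents else None

print(walk_categories(["French novelists"]))    # character
print(walk_categories(["Capitals in Europe"]))  # place
```

A real implementation would fetch page categories from the Wikipedia API at each step; the depth limit keeps the walk from drifting into unrelated parts of the hierarchy.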
Several works have already shown the relevance of locating direct and indirect speech parts to identify characters in novels (Glass and Bangay, ; Goh et al.).
Most of these approaches rely heavily on the lexical database WordNet to find speech-related verbs and refine their accuracy, but for performance reasons, and since we wanted the classifiers to remain efficient even on very long texts, we implemented a simpler version that merely checks the proximity of detected proper nouns to quotation marks.
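A minimal sketch of such a proximity check, counting for each noun the fraction of its mentions that fall near quotation marks (the character window and function name are our own assumptions):

```python
def quote_proximity_ratio(name, sentences, window=20):
    """Fraction of a name's mentions that occur within `window` characters
    of a quotation mark (« », or a straight double quote)."""
    mentions = 0
    near_quotes = 0
    for sentence in sentences:
        start = 0
        while (i := sentence.find(name, start)) != -1:
            mentions += 1
            context = sentence[max(0, i - window): i + len(name) + window]
            if any(q in context for q in '«»"'):
                near_quotes += 1
            start = i + len(name)
    return near_quotes / mentions if mentions else 0.0

sents = ['« Bonjour », dit Julien.', 'Julien partit vers la gare.']
print(quote_proximity_ratio("Julien", sents))  # 0.5
```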
For each proper noun w appearing m_w times, the system essentially counts the number q_w of mentions that appear near quotations. It then computes the ratio q_w / m_w.

Once all classifiers have returned their answer for a given word, the last step is to compare these results and decide on a final answer. This meta-classification step can be done by voting systems, choosing the final result according to the majority of predictions using various strategies, or by a meta-recognition system aiming to discard classifiers that seem to have encountered a problem on the considered text file.
We implemented and discuss the performance of four distinct meta-classification methods. The easiest and most obvious way to average the different classifications is a simple voting system, i.e., choosing the category predicted by the majority of classifiers. However, since there is an even number of classifiers, ties are to be expected. This situation is quite unlikely, since it would require exactly three classifiers deciding correctly and the three others agreeing on the same wrong categorization.
Still, in case this situation occurred, the final choice would be non-deterministic, for lack of a model to support one option or the other. For this reason, we introduced a second meta-classification, which requires each classifier to compute a confidence self-assessment score. For most classifiers, their internal mechanics allow them to evaluate the extent to which the strategy they are using is likely to return reliable results, given the current working context.
Hence, a simple strategy to help the voting process in the case of ties is for each classifier to return a confidence index between 0 and 1. This index is expected to equal 1 if the decision was made with no ambiguity and 0 if the clues were equally distributed. For instance, considering the Quotes classifier computes a ratio of 0.
Again, this index is expected to tend toward 0 for ambiguous cases and toward 1 for more definite ones. On top of that, some classifiers are given the possibility to return 0 to mark their results as known to be invalid, and thus irrelevant at voting time. This can happen, for instance, when no known title precedes a word throughout the text, when no grammatical rule could be matched, or when Wikipedia does not return any result for the searched word. The improved voting algorithm then first discards all classifications with a confidence mark of 0 and proceeds to a simple vote among the remaining ones for each noun.
In case of a tie, the results rated with the highest confidence are privileged. Not all classifiers exhibit the same behavior regarding precision and recall. It can thus be justified to put more confidence in some of them in cases where we know they are more likely to succeed. For this test, we used manually set weights putting more importance on the obvious context classifier (section 3.). With the help of the confidence rating (section 4.), classifications marked as invalid are already excluded; hence, those cases will be discarded regardless of the coefficient. A good compromise can be reached by giving 3 times more weight to the obvious context classifier, allowing the others to still easily overpower it in the unlikely case that a majority of them reach a contradictory agreement.
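The voting scheme described above, with invalid votes discarded, optional weights, and ties broken by confidence, can be sketched as follows (classifier names and the exact tie-break are our own illustrative assumptions):

```python
from collections import defaultdict

def vote(predictions, weights=None):
    """`predictions`: mapping classifier name -> (category, confidence in [0, 1]).
    Votes with confidence 0 (known-invalid results) are discarded; remaining
    votes are weighted, and ties are broken by the highest confidence seen."""
    weights = weights or {}
    scores = defaultdict(float)
    best_confidence = defaultdict(float)
    for clf, (category, confidence) in predictions.items():
        if confidence == 0:
            continue  # classifier flagged its own result as invalid
        scores[category] += weights.get(clf, 1.0)
        best_confidence[category] = max(best_confidence[category], confidence)
    if not scores:
        return None
    return max(scores, key=lambda c: (scores[c], best_confidence[c]))

preds = {
    "obvious_context": ("character", 0.9),
    "position": ("place", 0.4),
    "lexicon": ("place", 0.3),
    "wikipedia": ("other", 0.0),   # no page found: discarded
}
# Unweighted, "place" wins two votes to one:
print(vote(preds))  # place
# Weighting the obvious context classifier 3x flips the decision:
print(vote(preds, weights={"obvious_context": 3.0}))  # character
```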
A meta-recognition algorithm improves accuracy by entirely removing one classifier when it detects that it is consistently failing, typically due to stylistic biases or other broken assumptions on the considered book. Our hypothesis here is that, since the remaining classifiers reached a higher agreement, the discarded one must have globally failed in some way and needs to be put aside.
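One way to realize this idea is a leave-one-out agreement check: drop the classifier whose removal raises the agreement of the others the most. This is a sketch under our own assumptions, not necessarily the paper's exact criterion:

```python
def meta_recognition(predictions_per_noun, classifiers):
    """`predictions_per_noun`: list of dicts classifier -> category, one per noun.
    Return the classifier whose exclusion most improves the agreement of the
    remaining ones, or None if no exclusion helps."""
    def agreement(excluded):
        total, agreed = 0, 0
        for preds in predictions_per_noun:
            votes = [cat for clf, cat in preds.items() if clf != excluded]
            total += len(votes)
            agreed += max(votes.count(v) for v in set(votes))  # size of majority
        return agreed / total
    baseline = agreement(None)
    worst = max(classifiers, key=agreement)
    return worst if agreement(worst) > baseline else None

# Classifier "c" systematically contradicts "a" and "b":
predictions = [
    {"a": "character", "b": "character", "c": "place"},
    {"a": "place", "b": "place", "c": "character"},
]
print(meta_recognition(predictions, ["a", "b", "c"]))  # c
```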
Let us consider in Figure 3 the precision vs. recall scores of each classifier. One can immediately see a pattern typical of any information retrieval system: one parameter is detrimental to the other, and no two classifiers behave in a similar way. We can also see that, for each of them, some books get remarkably good results, while a few others turn out very badly.
Interestingly, and as backed up by the full numerical values shown in Table 2, those are almost never the same, confirming our hypothesis that some methods may work much better or worse on some texts and giving a strong justification for the multi-classifier approach. The averaged results seem to confirm this intuition. In Figures 4 and 5, we can see that all meta-classification schemes pushed the overall results toward the top and at the same time made the clustering denser, hence reducing the differences between the books and producing more consistent results by removing the worst outliers.
Figure 3. Comparison between precision and recall for each classifier, on each book.
Looking at the numerical results (Tables 2 and 3), several interesting facts can be stated about each classifier. However, one has to note that with a precision and recall of respectively 0. Its precision of 0. One can notice it performed best, with a very satisfying 0. Its F1 score 0. Regarding the meta-classification schemes (Table 4), we first wanted to compare them to the OpeNER baseline. On our test corpus and using a similar evaluation, OpeNER averaged a precision and recall of 0.
This may seem surprisingly low compared to the standards usually set by this tool, but it is actually a good illustration of how difficult it may be to find the correct tagging in fictional texts. The two worst cases, Les Malheurs de Sophie and Germinie Lacerteux, are shown in detail in Tables A1 and A2 (Appendix A), and show that these problems are due about as much to bad classifications as to missing entities. Our meta-classification methods perform better than that on average, and we can see an encouraging trend: the various strategies we tried out tended to get increasingly better results and to close the standard deviation gap.
That being said, as far as the standard deviation is concerned, this improvement was not statistically significant. Yet, the meta-recognition method managed an F1 score no worse than 0.

Table 4. Averaged precision, recall, and F1 scores for each classifier.

The contribution of this paper is a set of efficient and autonomous tools that can be run in a limited environment (such as a web server) on any French novel, without the need for training or manual user input, while keeping reliable results.
We showed that combining different classifiers, especially using a meta-recognition technique, allowed us to attain an overall better score than each of them would separately, and to outperform some state-of-the-art tools in the very narrow use case considered. Yet, we are aware that this process implies several strong assumptions that may break under certain circumstances and cause the system to fail, such as the systematic extraction being limited to capitalized proper nouns. Future work may focus on coreference resolution by merging entities that refer to the same character, or conversely, on disambiguation of homonyms, even if this does not happen very often in closed environments like fictional works.
In parallel to a reliable extraction of character networks, further textual analytics may try to uncover the nature of the relations between them. CB initiated this research, wrote the software components, and elaborated the test sets and methods described in this report. FK supervised this research and advised on the tests and developments needed to improve its relevance.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. This research was made possible by a very precious and close collaboration with Daniel de Roulet, Swiss architect, computer scientist, and author.
Albanese, A. Small demons makes big splash at Frankfurt book fair. Publishers Weekly.

Azpeitia, A. NERC-fr: supervised named entity recognition for French. In Text, Speech and Dialogue, eds Sojka and Pala. Cham: Springer International Publishing.

Dumais, S. Latent semantic analysis. Annual Review of Information Science and Technology.

Fleiss, J. The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability.