Methodologies and Challenges in Forensic Linguistic Casework. Group of authors
JG also took a second approach, which involved a stylometric, computational approach to feature selection. The basic idea in such an analysis is to compare the frequencies of a range of well-established feature types or other textual measurements across the possible authors' writing samples, such as the relative frequency of word, character, and part-of-speech n-grams.5 The main advantage of this approach is that it does not require the expertise or attention of the analyst, allowing many more features to be analyzed, including features that might otherwise be missed or whose relative use is real and consistently different but not sufficiently distinctive to be identified by hand. It is also replicable and should give us far more confidence that we are looking at an unbiased feature set. Such an approach is not generally taken in forensic authorship analysis for two reasons: the courts are generally interested in the expertise and explanation of the linguist and in the presence of categorical patterns; and the questioned texts and comparison texts are generally too short to warrant quantitative analysis.
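As a rough illustration of the kind of comparison described above, the following sketch computes relative frequencies of word and character n-grams and ranks word n-grams by how much their relative use differs between two writing samples. It is a minimal sketch, not the procedure used in the case: the function names, the whitespace tokenization, and the ranking by absolute gap are all assumptions made for the example.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return the word n-grams of a text as a list of tuples (whitespace tokens, lowercased)."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Return the character n-grams of a text as a list of strings."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def relative_frequencies(items):
    """Map each item to its count divided by the total number of items."""
    counts = Counter(items)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {item: count / total for item, count in counts.items()}


def compare(sample_a, sample_b, n=1):
    """Rank word n-grams by the absolute gap in relative frequency (A minus B)."""
    freq_a = relative_frequencies(word_ngrams(sample_a, n))
    freq_b = relative_frequencies(word_ngrams(sample_b, n))
    gaps = {
        gram: freq_a.get(gram, 0.0) - freq_b.get(gram, 0.0)
        for gram in set(freq_a) | set(freq_b)
    }
    return sorted(gaps.items(), key=lambda pair: abs(pair[1]), reverse=True)
```

With samples as short as typical forensic texts, gaps of this kind are noisy, which is one reason the passage above notes that such texts are often too short to warrant quantitative analysis.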
In total, JG identified 51 different feature types that appeared to distinguish between the possible writings of Debbie and Jamie Starbuck, which were informally classified as belonging to nine levels of analysis:
Text level (average text length, common email openings and closings)
Paragraph level (average paragraph length, common paragraph initial words)
Sentence level (average sentence length, common sentence initial words)
Phrase level (common two-word n-grams, common three-word n-grams)
Word level (average word length, common function words)
Abbreviations, acronyms, and emoticons (common text messaging acronyms, common emoticons)
Contractions (common standard contracted forms, common nonstandard contracted forms)
Spelling and case (common spelling errors, repetition of letters for emphasis)
Punctuation (common use of exclamation marks, nonstandard semicolon usage)
Some of these feature types consist of a single measurement (e.g., average word length in characters or spelling of “a lot” as one or two words), whereas others consist of a large number of individual features (e.g., frequency of common function words or words that are commonly used in sentence-initial position). Examples of these features are provided in Table 2.1. In addition, JG recorded general holistic impressions of the two authors. For example, he found Debbie’s style to be more narrative and informal than Jamie’s.
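The single-measurement feature types mentioned above can be sketched in a few lines. This is only an illustration under simple assumptions: the function names are hypothetical, words are taken to be runs of letters and apostrophes, and sentences are split naively on full stops, exclamation marks, and question marks, which will miscount abbreviations and similar cases.

```python
import re


def average_word_length(text):
    """Mean word length in characters (words are runs of letters and apostrophes)."""
    words = re.findall(r"[A-Za-z']+", text)
    return sum(len(w) for w in words) / len(words) if words else 0.0


def average_sentence_length(text):
    """Mean number of words per sentence, splitting naively on . ! ?"""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)


def a_lot_spellings(text):
    """Count the two-word and one-word spellings of "a lot"."""
    lowered = text.lower()
    return {
        "a lot": len(re.findall(r"\ba lot\b", lowered)),
        "alot": len(re.findall(r"\balot\b", lowered)),
    }
```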
We believe that combining the stylistic and the stylometric is the best way to get a meaningful, reliable, and robust feature set. Both approaches can miss important features. The computational approach is bad at identifying the highly idiosyncratic features that are often most meaningful, whereas a manual approach is bad at identifying subtle patterns that are often most robust. Using a mixed approach allows us to look at a broad range of evidence and should give more confidence that we have not missed important evidence. That said, it is important to remember that almost every feature is subject to selection bias. No stylometric analysis uses all possible stylometric features, and no two linguists will produce identical hand analyses. Furthermore, the analyses of the same feature set will not necessarily align. There is generally more than one way to count a feature, as the following examples illustrate.
ATTRIBUTION
After the known writings had been analyzed and JG had identified the feature set that distinguished between the known emails of Jamie and Debbie Starbuck, only then did TG pass to JG the remaining set of 29 potentially disputable emails. In this stage, JG went through each of the texts by hand and searched for each of the feature types systematically. As there were relatively few texts, and because a number of the features were difficult to search computationally, this process was primarily carried out by hand to ensure that no evidence was missed. For each text, the number of features that matched each author was recorded, and then the length of these lists and the nature of the features were considered to come to an attribution judgment.
In some emails, the evidence was mixed: they contained features associated with both Debbie’s and Jamie’s known styles. For example, some disputed emails contained some features associated with Debbie’s known emails (they were relatively long or started sentences with the word “And”). However, where these emails predominantly contained features associated with Jamie’s style, we still attributed the email to Jamie. If the mix was more balanced or if there were very few features in either direction, no attribution was made.
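The decision logic described above can be written down as a simple rule, shown here only to make its shape explicit. In the case itself the judgment was made by hand and also took the nature of the features into account; the numeric thresholds `min_features` and `min_share` below are hypothetical values chosen for the example, not figures from the analysis.

```python
def attribute(debbie_matches, jamie_matches, min_features=3, min_share=0.7):
    """Return "Jamie", "Debbie", or None from per-author matching-feature counts.

    min_features and min_share are illustrative assumptions, not values
    taken from the casework.
    """
    total = debbie_matches + jamie_matches
    # Very few features in either direction: no attribution.
    if total < min_features:
        return None
    # Features predominantly match one author: attribute to that author.
    if jamie_matches / total >= min_share:
        return "Jamie"
    if debbie_matches / total >= min_share:
        return "Debbie"
    # The mix is too balanced: no attribution.
    return None
```

For instance, under these assumed thresholds an email with two Debbie-like and eight Jamie-like features would be attributed to Jamie, whereas one with five of each would receive no attribution.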
Overall, the feature set proved clear enough to attribute a substantial number of the texts across the timeline to Jamie Starbuck. For example, one of the most distinctive features was sentence length, as well as the way that longer sentences were constructed. In general, Debbie used considerably longer sentences, with frequent sentence coordination and often run-on sentences linked with comma splices and dashes. In contrast, Jamie often used short sentences, including one-word sentences, and very rarely used run-on sentences (Table 2.1). These types of sentential patterns were far more common in the questioned documents, offering strong evidence that they were more likely to have been written by Jamie. Various other features strongly associated with Jamie’s writing samples were also present in these texts, including the deletion of apostrophes before the genitive marker in possessives, the use of highly informal features like interjections and emoticons, and the presence of several common compounded words including awhile, in between, and up-to-date, with those specific spellings.
However, perhaps the strongest evidence came from the closings of the known emails. Two patterns stood out. The first was that when Debbie had a two-line signoff (e.g., Lots of love, [line break] Debbie xxx), she would always put a single line break between the two lines, whereas Jamie would always put a double line break in his two-line signoffs (e.g., Many thanks, [line break] [line break] Jamie). The emails which occurred later in the questioned series as provided by TG always contained double line breaks like Jamie’s and, indeed, the switch in this feature strongly coincided with other features to indicate the date of a shift in style. The second closing feature was that where Debbie generally signed off emails with “Debbie”, the questioned documents generally signed off as “Deb”, which was never used by Debbie in the known set but was often used by Jamie in his own emails.
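The line-break pattern in the closings is one of the few features here that is easy to check mechanically. The sketch below assumes that the email body is available as plain text and that the last non-empty line is the name line of the signoff; the function name is hypothetical. A result of 0 corresponds to a single line break between the closing phrase and the name, and 1 to a double line break.

```python
def signoff_gap(email_body):
    """Count blank lines between the final line and the preceding non-empty line.

    Returns None if the body has no earlier non-empty line to compare against.
    """
    lines = email_body.rstrip().splitlines()
    blank = 0
    # Walk upward from the line just above the final (name) line.
    for line in reversed(lines[:-1]):
        if line.strip():
            return blank
        blank += 1
    return None
```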
These two pieces of evidence clearly pointed to Jamie as the author of the questioned documents. The break point was indicated at the second email in the disputed group. Following this second email, that is to say by May 3, 2010, it appeared that Jamie alone was writing emails from Debbie’s account. This placed the date of the account takeover before the couple were supposed to have left the UK. This second email is particularly