Aspects of linguistic ageing in literary authors across time

This work investigates linguistic changes in a corpus of literary authors that have been hypothesised to be attributable to the effects of ageing. In part, the analysis replicates an earlier study of these effects, but extends it by explicitly analysing and modelling competing factors, specifically the influence of background language change. Our results suggest that this underlying change in language usage is likely to be the primary force behind the change observed in the linguistic variables that was previously attributed to linguistic ageing.


INTRODUCTION
Language is subject to constant change, both at the level of a linguistic variety, where change affects all of its speakers, and at the level of the individual speaker over the course of a lifetime. "Stylometry" is the study of a writer's stylistic fingerprint based on collected writings over his or her lifetime. Due to the sequential and long-term nature of publishing, stylometric studies may be influenced by temporal language development, and its effects might be misconstrued as a result. More recently this issue has given rise to a temporal variant of stylometric analysis, i.e. "stylochronometry", that studies changes in style over time, as exemplified, for instance, by the work of Forsyth (1999), Stamou (2007) and Klaussner and Vogel (2018b). However, even though stylochronometric studies consider the temporal dimension, these analyses still conflate individual stylistic changes with those induced by ageing, such as changes in authors' vocabulary size over time, and, most importantly, with influences that affect speakers of the same language variety equally, such as general underlying language shifts. Although previous studies have examined sets of linguistic variables with respect to both healthy and pathological ageing (Pennebaker and Stone 2003; Le et al. 2011; Kemper et al. 2001), to the best of our knowledge there do not yet exist composite studies considering all three aforementioned factors.
The current work extends a research paradigm created by Pennebaker and Stone (2003) who analyzed linguistic ageing both in emotional disclosure studies and in a corpus of literary authors. In this work, we build on previous results by examining a larger literary corpus as well as controlling for background language change. Our objective is to replicate the earlier study on a different literary corpus that is temporally-aligned with a reference corpus for that same time period, thus allowing us to investigate possible influences of general language shifts. We also propose some methods that can be used to attempt to disentangle general effects from those that are individual.
Within this paper, Section 2 discusses previous work in the area, Section 3 presents the literary authors data set, Section 4 and Section 5 discuss methods and experiments respectively and Section 6 and Section 7 analyze and summarize the results.

RELATED WORK
Patterns of general, underlying language change have been studied by, for instance, Lieberman et al. (2007), who found that whether an irregular verb in English will acquire the "-ed" regularization largely depends on its token frequency. Highly entrenched, changed his style over his creative lifespan. Based on literary scholars' findings (e.g. Beach 1918), Hoover investigates natural partitions of James' style into three different temporal divisions of early (1877-1881), intermediate (1886-1890) and late style using the most frequent word unigrams and a variety of different methods, such as Cluster Analysis, Burrows' Delta, Principal Component Analysis and Distinctiveness Ratio. Apart from these divisions, Hoover also notes the existence of gradual transitions in between, with for instance the first novels of the late period being somewhat different from the rest of them.
Le et al. (2011) contrasted the writings of three female British novelists for detecting markers of dementia, specifically Iris Murdoch, who died with Alzheimer's disease, Agatha Christie, who was suspected of having it, and P. D. James, who aged healthily. Previous research (Kemper et al. 2001; Bird et al. 2000; Burke and Shafto 2008, as cited by Le et al. 2011) indicated that, for instance, vocabulary and syntactic complexity declined more rapidly in the presence of dementia, particularly with respect to words of lower frequency and higher specificity as well as passive constructions. Simultaneously, the occurrence of lexical repetitions and disfluencies would increase. Analyzing a variety of lexical and syntactic measures, Le et al. (2011) could largely confirm their hypotheses with regard to more rapid lexical decline in Murdoch. More than 20 years before any Alzheimer's symptoms became apparent, her vocabulary started to decline, resulting in a significant increase in lexical repetitions of content words.
However, her lexical specificity, measured through the proportion of specific indefinite nouns and verbs, remained intact throughout. All but two of Christie's lexical types showed an overall decline. In contrast, the vocabulary, repetition and specificity scores vary only slightly across James' novels. Thus, it is noted that although Murdoch does not share Christie's increase in indefinite nouns, they both show common lexical decline not found in James, validating the hypotheses with respect to lexical markers.
Although the analysis and data preparation were very carefully conducted, as the authors note, the data set is somewhat small with only 1-2 people for each of the two conditions, leaving it unclear which aspects of the results are reliable, as each of the examined women could potentially be unrepresentative of her group. In addition, general language shift or stylistic change could also have had an influence on the observed change.
While the work by Le et al. (2011) considered symptoms of pathological linguistic decline, the study by Pennebaker and Stone (2003) (hereafter also: P&S) focused on aspects of regular and expected linguistic ageing. In particular, they proposed four hypotheses about the effect of ageing on language. Firstly, they suggested that ageing was associated with a drop in negative affect words and a slight increase in positive affect words (hypothesis 1). Further, social words and first-person plural pronouns were hypothesized to decrease relative to a person's decrease in social networks (hypothesis 2). If ageing was associated with a greater concern with the past relative to the future, linguistic shifts from future to past tense as well as a reduction in references to time altogether could be expected (hypothesis 3). Finally, older people were predicted to use fewer cognitively complex words (cognitive mechanisms and causal, insight, and exclusive words), whereas markers of verbal ability were not expected to show either monotonic increases or decreases (hypothesis 4). P&S investigated how the age of a person affected these linguistic categories with respect to two very different data sets: one based on self-reports from emotional disclosure studies (the 'Disclosure project'; hereafter also: DP) and the other based on collected works of ten different authors across their individual life spans, hereafter also referred to as the 'Author project' (AP).
The Disclosure project featured 3,280 participants from 45 separate studies, of which 32 were traditional emotional disclosure experiments in which participants were randomly assigned to write about either a traumatic or emotional topic, or a superficial topic in the case of the controls (for details, see Pennebaker and Stone 2003). Although this data is ordered by age of participants, the samples may have originated from the same time period. Both DP and AP were assessed using correlation analysis and in addition the DP was also analyzed through simple linear and quadratic regression. Table 1 shows P&S's results for individual age-variable correlations for both data sets (limited to those variables that are also analyzed as part of the current research, as this paper only details a partial replication of the original study). For this, the results for the two DP conditions ('Experimentals'/'Controls') were based on between-subject analyses correlating each of the Linguistic Inquiry and Word Count (LIWC) variables with age. For the Author project, the correlation coefficient is based on mean within-author correlations between each author's age and the LIWC analyses for the works written at that age. As can be observed from the table, for first-person singular pronouns, present and future tense, and big words, hereafter also referred to as long-letter sequences, all three correlations are in the same direction across the two sets, although only in two cases are all of them also significant (first-person singular pronouns correlate negatively with age, whereas the other variables correlate positively).
Table 2 shows the collection of authors in the AP of the P&S study. Although it is balanced across genders, it contains some idiosyncrasies, such as the fact that most authors originated from Great Britain (England and Scotland), except for the writers Louisa May Alcott and Edna St. Vincent Millay, who were of American origin.
Genre types include novels, plays and poetry, a fact that could present a confounding factor specifically for the analysis of pronouns, which are usually distributed somewhat differently across these text types. The most relevant issue in this context is that authors' works are spread across five centuries, and language use would be expected to vary somewhat between the 16th and 20th centuries. It is to be assumed that this design was deliberate in order to extract very diverse samples; nevertheless, this may render them less comparable, and results could be spurious. In particular, if language has been affected by a continuous shift throughout this time, a significant effect in authors who did not compose language in parallel may still be attributable to general language change rather than ageing.
The final column in Table 2 shows the result of using regression weights for the LIWC variables based on the DP data to create an ageing coefficient for each individual author, which was then correlated with age. Thus, larger correlations signify more similarity to the DP analysis regarding the ageing variables. It is noticeable that five out of six significant correlations, i.e. those for Joanna Baillie, Robert Graves, Edna St. Vincent Millay, William Wordsworth and William Butler Yeats, are based on genre types that could be more prone to irregularities, e.g. poetry and plays. Overall, neither the DP-based nor the AP-based analysis is reported to have evaluated the influence of general language change.
Thus, apart from general language shifts, other possible confounding factors for the P&S study could have been introduced by the differences in pronoun distributions across varying text types as well as individual stylistic differences and developments, irrespective of any particular ageing process. In this work, we revisit the question of linguistic ageing for six variables previously analyzed. Specifically, we do not reanalyze P&S's data, but conduct a comparable experiment on a more temporally and genre-homogeneous data set. We then compare our findings on the same variables to P&S's earlier results.
Note: For most novels, stories, and plays, each work was analyzed separately. For poetry, a work was defined by the various poems written within a given year. Exceptions include poems or collections that were known to have been written over several years, which were entered as separate text files.

DATA
The data analyzed for this research is divided into two main sets: twenty-two literary authors, comprising ten women and twelve men, and a corresponding reference corpus for the same time period. Table 3 shows the set of literary authors, all of whom published work between 1847-1923. 6 The corpus was populated in the following way: first, the prolific authors Mark Twain and Henry James were chosen, a choice inspired by several sources suggesting that they may be interesting to contrast (Beach 1918; Canby 1951). The remaining contemporaneous authors were selected by first assembling a list of male and female American authors of the 19th-20th century using Wikipedia and then selecting a subset of these authors, all of whom had a few long works publicly available and spread out over at least twenty years. Also, for the purpose of estimating stable word distributions, shorter works of less than 150 kilobytes in length were excluded. In terms of temporal alignment, a fair subset of the authors wrote largely in parallel. The literary prose texts were mainly collected from Project Gutenberg (PG): this part of the corpus consists of 397 hand-transcribed works; it was supplemented with 158 scanned works from the Internet Archive (IA). In general, we might prefer to choose a hand-transcribed version of a text from Project Gutenberg rather than the possibly noisier OCR version from the Internet Archive. However, in this case acquiring data with a time stamp close to the first publication date was essential. For this reason, and especially when the equivalent PG version did not have a time stamp, the IA version was chosen instead if available. On occasion, the OCR versions were manually corrected, but this was determined on an individual basis and through human inspection only.
6 The corpus is motivated and described in more detail by Klaussner and Vogel (2018a). The data set is available at http://www.scss.tcd.ie/clg/DCLSA/ (last verified October 2021).
All data was prepared by manually removing parts that were written at a different time from the main work, along with material not by the author, such as copyright headers/footers and notes, comments or introductions by editors. Tables of contents were also removed, as these do not usually follow a normal sentence structure. Klaussner and Vogel (2018a) provide more specific descriptions of the data and its basic pre-processing. The publication date of a text was set by taking the first documented date, e.g. first copyright or publication date, unless a preface clearly stated that the work had been subject to explicit revisions. The issue with dating in this case is that dating a work either too early or too late would distort the results.
The reference language corpus for the current work was assembled by taking an extract from The Corpus of Historical American English (COHA: Davies 2012). 11 COHA is a 475-million word corpus that contains samples of American English from 1810-2009, balanced in size, genre and sub-genre in each decade (1000-2500 files each). Depending on the particular type of analysis, different excerpts from the entire data set were used. The corpus contains balanced language samples from fiction, popular magazines, newspapers and non-fiction books, which are again balanced across sub-genre, such as drama and poetry. 12 While the corpus is balanced overall, some years contain proportionally more data from certain genres than others, and in these years we observed unusual frequency effects. However, to the best of our knowledge, for our current requirements of providing an approximation to general language usage at the time, this corpus still provided the best option.

METHODS
Section 4.1 describes how features were extracted, and Section 4.2 presents the statistical models used for the analysis.
11 A free web-based version is accessible at https://www.english-corpora.org/coha/ (last verified October 2021).
12 There is an Excel file with a detailed list of sources available at https://www.english-corpora.org/coha/ (last verified October 2021).

Feature extraction
We begin by describing the feature extraction adopted by Pennebaker and Stone (2003), interlaced with our own design where modifications were deemed necessary. As previously mentioned, Pennebaker and Stone (2003) based their analysis on the LIWC system, whose categorization scheme is generally not openly accessible. This renders replication of less objective linguistic variables, such as negative or positive emotion words, difficult. 13 Table 1 only lists examples of non-reflexive uses of pronouns and main tenses, so it is unclear whether reflexive pronouns were included and how complex verb forms also indicating aspect, such as present perfect or future perfect, were treated in their analysis. To extract 1SG/1PL pronouns in the current work, each word form was considered together with its part-of-speech tag to identify the correct items, e.g. to avoid uses of I that refer to numbering. 14 As our experiments did not show differences between including or excluding reflexive pronouns, this analysis only reports on non-reflexive pronoun types.
Originally, P&S also included what they refer to as "time-related" words, such as clock, hour and soon. One can assume that they would also include temporal adverbs in general like yesterday or today. These temporal expressions may change the interpretation of regular tenses and could result in shifts between them. However, this may not be a trivial problem, as sometimes the overall tense would be more strongly signaled by the temporal adverb, e.g. examples (1) and (2), whereas in other cases the verb would be the determining factor, as in example (3).
(1) She's there tomorrow.
(3) She was there today.
13 To the best of our knowledge, these words were classified by several different students and can be (indirectly) accessed through the LIWC program. Research papers usually only provide examples rather than exhaustive lists.
14 For all the computations in this work, the statistical programming language R (R. Core Team 2014) and associated packages were used. For POS-tagging the NLP (Hornik 2016) and openNLP (Hornik 2015) packages were used.
This suggests the need for a more intricate classification system than could be done justice as part of the present work. Here we resort to only using verb tenses to approximate the overall tenses. The main effect of not including temporal adverbs may be a shift from future to present tense counts. In order to approximate tense representation, we adopted the following classification: while POS tags could be used to directly identify some of the simpler tenses, this would not suffice to always correctly determine the difference between the present or present perfect tense usage of have, and neither could it identify occurrences of the going-to future tense, as this is not marked explicitly on going-to. To be able to make these distinctions, we used chunk tags to extract verb phrases and then analyzed the combination of tags within to determine the type of tense. In this, several sub-types corresponding to finer shades of difference in meaning are classified into the three main categories (past/present/future), as follows. The present type includes: simple present, present progressive, and conditional and modal variants, such as can/could/may go. The past type captures simple past, present perfect, past perfect, past progressive and, as with the present type, conditional and modal variants, such as could have gone. The future type covers simple future constructions, such as will/shall go and going to go, but also will have gone. Finally, we define long-letter sequences as previously, as words whose length is greater than or equal to six letters.
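The chunk-based tense classification described above can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the R/openNLP pipeline used in this work; the function name `classify_tense`, the use of Penn Treebank tags, the rule ordering and the treatment of forms such as "was going to go" as future are all assumptions.

```python
# Hypothetical sketch: map one verb chunk to past/present/future.
# Assumes Penn Treebank POS tags; the rule order is an illustrative choice.

FUTURE_MODALS = {"will", "shall", "'ll"}
PERFECT_AUX = {"have", "has", "had", "'ve"}


def classify_tense(chunk):
    """Return 'past', 'present', 'future' or None for a list of (word, tag)."""
    words = [word.lower() for word, _ in chunk]
    tags = [tag for _, tag in chunk]

    # Simple future and future perfect: will/shall go, will have gone.
    if any(w in FUTURE_MODALS and t == "MD" for w, t in zip(words, tags)):
        return "future"

    # Going-to future: "going to" directly followed by a base-form verb.
    for i in range(len(words) - 2):
        if words[i] == "going" and words[i + 1] == "to" and tags[i + 2] == "VB":
            return "future"

    has_perfect = "VBN" in tags and any(w in PERFECT_AUX for w in words)

    # Other modals: could have gone -> past, can/could/may go -> present.
    if "MD" in tags:
        return "past" if has_perfect else "present"

    # Present perfect and past perfect both count as past in this scheme.
    if has_perfect or "VBD" in tags:
        return "past"

    # Simple present and present progressive.
    if "VBP" in tags or "VBZ" in tags:
        return "present"

    return None


print(classify_tense([("is", "VBZ"), ("going", "VBG"), ("to", "TO"), ("go", "VB")]))
print(classify_tense([("has", "VBZ"), ("gone", "VBN")]))
```

Checking the future patterns first keeps will have gone from being absorbed by the perfect rule, which mirrors the grouping of sub-types into the three main categories given in the text.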
After extracting the relevant features, texts in each corpus were combined by considering the year of publication, thereby reducing each set to one file per year per (author) corpus. Relative frequencies for each feature type were calculated by considering the ratio of the occurrence of the feature and all tokens for the same year. In addition, ordinal variables were created corresponding to year of publication (year), age of author at publication of text (age) and a categorical variable indicating the author (A) of a text.
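The aggregation step can be sketched as follows, again as an illustrative Python example rather than the code used for this work. The function names, the toy tokens and the decision to count only purely alphabetic tokens as long-letter sequences are assumptions.

```python
from collections import defaultdict


def is_long_letter_sequence(token, min_len=6):
    """Assumed definition: an alphabetic token with at least six letters."""
    return token.isalpha() and len(token) >= min_len


def relative_frequency_by_year(texts, feature):
    """Pool texts per year and return {year: feature count / token count}.

    texts   -- iterable of (year, list_of_tokens) pairs for one corpus
    feature -- predicate deciding whether a token counts as the feature
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for year, tokens in texts:
        totals[year] += len(tokens)
        hits[year] += sum(1 for tok in tokens if feature(tok))
    return {year: hits[year] / totals[year] for year in totals if totals[year]}


# Toy data: two texts published in 1880 are merged before the ratio is taken.
toy_texts = [
    (1880, ["she", "travelled", "abroad"]),
    (1880, ["quietly", "he", "left"]),
    (1881, ["we", "go"]),
]
print(relative_frequency_by_year(toy_texts, is_long_letter_sequence))
```

Summing counts before dividing means that longer works contribute proportionally more to a year's relative frequency than shorter ones.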

Statistical modelling
This section describes aspects connected to the statistical analysis, i.e. regression models and standardization techniques, before moving on to model assessment.
Temporally-ordered data can be analyzed in different ways, for instance by relating a variable to itself at different points in time as part of a "time-series" model or, as in the present case, by considering other variables at the same point in time, thereby using an "explanatory model". Consequently, the prediction of a variable $y$ is based on a function over a set of distinct variables $X = \{x_1, x_2, \ldots, x_{p-1}, x_p\}$, with $y \notin X$, at the same time point $t \in \{1, \ldots, n\}$, and some error term.
The regression models computed in the following experiments vary with respect to the data set used and whether individual author variation had to be accounted for. The reference corpus (RC) does not contain an age variable and is only evaluated with respect to year of publication, which serves to check whether a particular variable of interest is likely to have changed in relative frequency over time.
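For the reference corpus, the simplest explanatory model relates a feature's relative frequency to the year of publication. The sketch below fits such a line by ordinary least squares in Python. It is an illustration only: the function name and toy values are hypothetical, and no significance testing or assumption checking is performed.

```python
def fit_simple_linear(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Toy example: a feature that rises by exactly 3 units per time step.
years = [0, 1, 2, 3]
rel_freq = [2, 5, 8, 11]
intercept, slope = fit_simple_linear(years, rel_freq)
print(intercept, slope)
```

A slope that is clearly different from zero would indicate that the variable changed in relative frequency over time in the reference language.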
However, when analyzing the literary authors corpus, both age and year have to be considered as predictors, since the authors will align differently depending on the variable, i.e. James and Twain were not the same age in the same year. Thus, in order to argue for an ageing effect to be present for an individual, it has to (also) be found in a combined model of the authors, clearly outperforming the equivalent year-based model that does not depend on age, but may capture stylistic changes over time instead.
When analyzing different authors at the same time, one may have to resort to random effects models to account for individual variation between authors, as shown by Equation (1), where $y_{tj}$ is the response variable for author $j$ at time $t$, $x_{tj}$ is the individual-specific random effect and $A_j$ is the author-specific random effect; $\epsilon_{tj}$ represents the error term. Similarly, Equation (2) shows the same for the quadratic model, adding the predictor $\beta_2 x_{tj}^2$.
(1) $y_{tj} = \beta_0 + \beta_1 x_{tj} + \beta_2 A_j + \epsilon_{tj}$
(2) $y_{tj} = \beta_0 + \beta_1 x_{tj} + \beta_2 x_{tj}^2 + \beta_3 A_j + \epsilon_{tj}$
For fitting linear and normally distributed models, the nlme R package was used (Pinheiro et al. 2013). Data that was only log-normal was fitted through the glmmPQL function in the MASS package (Venables and Ripley 2002). In order to preserve similarity with P&S's study, the predictors age and year were standardized in two ways: by computing z-scores, i.e. subtracting the mean and dividing by one standard deviation, for the simple linear regression models, and by taking the absolute value of the difference from the sample mean for the quadratic models. For correlation analysis, either the Pearson correlation coefficient r or Spearman's ρ was used, for normally and non-normally distributed data, respectively.
The decision as to what type of model and correlation measure to use, i.e. parametric or non-parametric, was based on whether the linear model fulfilled all model assumptions: all models were tested for normality, kurtosis, skewness, nonlinear link function (for testing linearity) and heteroscedasticity.
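The two standardizations and the two correlation measures mentioned in this section can be sketched in Python as follows. The helper names are hypothetical, the z-score uses the sample standard deviation (n - 1) as an assumption, and Spearman's ρ is computed as the Pearson correlation of average ranks. This is not the R code used in this work.

```python
from statistics import mean, stdev


def z_scores(values):
    """Subtract the mean and divide by the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def abs_deviation_from_mean(values):
    """Absolute difference from the sample mean (used for quadratic models)."""
    m = mean(values)
    return [abs(v - m) for v in values]


def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the rank vectors."""
    return pearson_r(average_ranks(x), average_ranks(y))


ages = [40, 50, 60]
print(z_scores(ages))
print(abs_deviation_from_mean(ages))
```

Because Spearman's ρ depends only on ranks, it does not require the normality assumption that motivates the choice between the parametric and non-parametric measures.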

EXPERIMENTS
This section begins by examining background language change with respect to the six linguistic variables outlined in Table 1. Having considered background language change, Section 5.2 then investigates how these effects can be explicitly modelled in the case of the literary authors. This also allows us to determine to what extent background language may be responsible for effects observed in the individuals.

Background language change
Examining the change in linguistic variables over time raises the question to what extent these variables were subject to other outside influences, especially when considering a time span of ∼40 years or more. To be able to assign meaning to measures of linguistic ageing, a separate analysis of the change in the background language is conducted as part of this section. In general, observed individual effects could be either subsumed by language change or rendered more significant if they happen to be in the opposite direction. Thus, taking background language into account can both lessen and strengthen individual effects. Table 4 shows correlation results for the reference corpus and both P&S's Disclosure project and Author project. The results for computing simple linear (β) and quadratic (β₂) models are displayed only for the reference corpus alongside the DP, as the same model computations were not available for the AP. Our reference corpus shares characteristics with both of P&S's studies in that it covers a similar length of time as the DP (∼70 years) and years contain multiple individual samples rather than a strict within-subject design. However, it is more comparable to the AP design in that it is genuinely sampled from different time periods, whereas some of the DP's data representing different age groups could have originated from the same time period. For this reason, we aim for a general comparison or replication rather than remaining very close to the original study.
Language change effects can be observed with respect to at least three of the six variables, and this is specifically notable in the case of 1PL pronouns and past tense, where the effect is in the same direction as for the DP, and the case of long-letter sequences, where effects are in the opposite direction for both of P&S's studies.

5.1.1
Change in pronouns
Figure 1 depicts 1SG and 1PL pronouns in the RC over the time span from 1830-1919. 17 As can be observed, 1SG pronouns slightly increase in relative frequency over time. All model parameters in Table 4 show a positive but non-significant trend over time.
Both P&S's studies have significant, but negative associations for 1SG pronouns over time. 1PL pronouns experience a highly significant decrease in relative frequency over the reference corpus. There is less evidence of background language interference in the case of 1SG pronouns, but stronger indications in the case of 1PL pronouns.
17 As there were some sampling irregularities in the reference corpus around 1923, the years after 1919 were excluded, resulting in 90 years of data.

5.1.2
Change in tenses
Figure 2 shows relative frequencies for past, present and future tense. Future tense shows little variation over time or at least not at a significant level, while examining Table 4 shows that both P&S's data sets have a positive association for future tense over time. Present tense appears stable in relative frequency and has a significant positive quadratic trend as can also be observed in P&S's DP. Past tense in the RC has a highly significant positive correlation (0.7***) and highly significant regression coefficient β, and while r is also positive and significant in P&S's DP, it is reported to have a significant negative linear regression coefficient (−0.16**). Their AP has a non-significant positive correlation for both present and past tense. Both visual and statistical analysis indicate that the tenses, but especially the past tense, underwent change in frequency in background language use for the time period examined, and as with 1PL pronouns could therefore introduce noise into stylistic or ageing analyses.

Change in long-letter sequences
The development of long-letter sequences over the RC is shown in Figure 3. There is a continuous downward trend visible, which is confirmed by both a highly significant correlation coefficient ρ (−0.51***) and a linear regression coefficient β (−0.02***) in Table 4. Both P&S's DP and AP have positive trends and therefore trends in the opposite direction (r of 0.13** and 0.10 respectively).
This section has examined six linguistic variables in a continuous section of general language usage that have been hypothesized in the literature to be affected by ageing in individual writers. Pennebaker and Stone (2003) found significant decreases in all their data sets with respect to 1SG pronouns. For the time frame examined here, no significant trend for 1SG pronouns based on publication year was observed in the reference language. This adds weight to the interpretation of the P&S 1SG effect as being one of ageing. 1PL pronouns were negatively associated with age for the Disclosure study and our reference corpus also showed a highly significant negative trend over time. This suggests that the 1PL effects may not be due to ageing. Pennebaker and Stone's work observed a significant decrease in past tense verbs in the DP, while this variable could be observed to increase in the RC. Present tense was not found to be a likely factor in ageing by P&S, which can be partially confirmed as the relative frequency did not seem to undergo a very pronounced shift. Similarly, there did not appear to be a very strong effect for future tense in our reference corpus, whereas it was found to increase over all of P&S's data sets, possibly implicating this as a real ageing effect. Long-letter sequences are comparable to the past tense situation: Pennebaker and Stone (2003) report a significant increase over their Disclosure project, whereas there is a significant decrease over the background language sample examined here. If their data were subject to similar effects, then this could render the linguistic ageing results more pronounced.
This analysis has shown significant language change in most of the ageing variables examined. To what extent this challenges or amplifies results in the original study is not further examined here. Rather, the next section addresses how these underlying influences can be taken into account when examining linguistic ageing variables in the literary authors corpus, by attempting to estimate the impact of background language change more systematically for the literary authors. We then consider to what extent this underlying change influences interpretation of effects previously only attributed to ageing.

5.2
Estimating impact of language change
In this section, we aim to investigate the ageing hypotheses with respect to the literary authors corpus while controlling for background language influence. For instance, a random effects model as shown in Equation (3) can be used, taking into account reference language, where $ref_{ij}$ is the relative frequency of the reference language for author $j$ ($A_j$) at age $i$ and $\epsilon_{ij}$ is the random error. Equation (4) shows the equivalent quadratic model.
The set of literary authors varied somewhat and for most variables only a subset of authors produced a normal or log-normal fit. For this reason different subsets of the entire data were used to test individual variables' hypotheses. Table 5 shows the results of computing simple linear random effects models for the six linguistic variables. The first two columns show model coefficients for the age and background language predictors. The third column specifies what model type was used, i.e. normal (N) or log-normal (LN), and the final column lists the respective size of the author set. Overall, there is little evidence for either a very strong influence of background language change or linguistic ageing. The only nearly significant reference language coefficient is that for 1PL pronouns. Figure 4 presents evidence for some language change influence, i.e. removing the reference language predictor causes the Year.std predictor to become significant, while the ageing predictor Age.std in the equivalent model does not become more important, indicating that time of publication remains more salient than age of author. The only significant ageing predictor is for future tense; however, considering the equivalent model using year (of publication) instead of age (at time of publication) renders an even more significant model, calling into question the validity of age as a main cause of the observed effect.
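One way of writing the reference-language models referred to as Equations (3) and (4) at the start of this section, consistent with Equations (1) and (2) and with the definitions of $ref_{ij}$, $A_j$ and $\epsilon_{ij}$, is sketched below. The numbering and ordering of the coefficients, and the use of $x_{ij}$ for the standardized age of author $j$ at age $i$, are assumptions rather than a statement of the exact models fitted.

```latex
% Assumed forms: Equations (1) and (2) extended by a reference-language term.
\begin{align}
y_{ij} &= \beta_0 + \beta_1 x_{ij} + \beta_2\, \mathit{ref}_{ij} + \beta_3 A_j + \epsilon_{ij} \tag{3} \\
y_{ij} &= \beta_0 + \beta_1 x_{ij} + \beta_2 x_{ij}^2 + \beta_3\, \mathit{ref}_{ij} + \beta_4 A_j + \epsilon_{ij} \tag{4}
\end{align}
```

Under this reading, a significant coefficient on $\mathit{ref}_{ij}$ alongside a non-significant coefficient on $x_{ij}$ would point to background language change rather than ageing.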
Table 6 shows the results of computing quadratic random effects models for the six variables based on Equation (4). As with the simple linear models, the quadratic models do not yield significant predictors for either age or background language. For 1PL pronouns, the reference language predictor very nearly crosses the threshold for statistical significance at the 95% confidence level, as in the simple linear model in Table 5. Although the ageing predictor for future tense in Table 6 is not significant, the equivalent quadratic year predictor is.
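For orientation, random effects models of the kind described in this section could be written along the following lines. This is only a sketch under assumptions: the response y_ij (the relative frequency of a variable, or its logarithm in the log-normal case), the β coefficients, the treatment of A_j as a per-author random intercept, and the choice to square only the age term are illustrative and are not taken from Equations (3) and (4) themselves.

```latex
% Illustrative sketch only; all symbols except ref_{ij}, A_j and
% \epsilon_{ij} are assumptions introduced for this example.
\begin{align*}
  % linear model with age and reference-language predictors
  y_{ij} &= \beta_0 + \beta_1\,\mathrm{age}_{ij}
            + \beta_2\,\mathrm{ref}_{ij} + A_j + \epsilon_{ij} \\
  % quadratic counterpart, assuming only age enters as a squared term
  y_{ij} &= \beta_0 + \beta_1\,\mathrm{age}_{ij}
            + \beta_2\,\mathrm{age}_{ij}^{2}
            + \beta_3\,\mathrm{ref}_{ij} + A_j + \epsilon_{ij}
\end{align*}
```

Under this reading, replacing the age term with year of publication gives the comparison models referred to for Year.std.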
Finally, we turn to the last part of this analysis, namely the question of stylistic differences between authors. For instance, one could ask whether there is anything particular about Mark Twain's and Henry James' style development compared to the other authors, given that these two have received considerable attention from literary scholars. Here, we consider the specific case of first-person pronouns. Figures 5 to 8 show 1SG and 1PL pronouns for Twain and James alongside some of the other authors in the set, as well as a line representing the average over all authors in the set. This 'aut-ref' line is computed as follows: for each year and for each author separately, the raw frequencies for that year and the two years before and after are taken and averaged over all tokens in those years. Given this set of relative frequencies for a feature, the final frequency for a given year is obtained by averaging over all authors. Hereafter, this is also referred to as the 'author reference corpus' or 'ARC'.
Figure 5 shows James and William Dean Howells, and Figure 6 shows Twain and Elizabeth Stuart Phelps Ward. For neither Twain nor James does there appear to be a particular trend for 1SG pronouns. Nor is their level of variation around the authors' average among the highest. As the plots indicate, Howells and Ward show more variation for 1SG pronouns than either Twain or James.
[Figure: ... for James, Howells and the ARC]
[...] somewhat more variation around the authors' average. However, neither author is unique in these tendencies. Like James, Alice Brown deviates comparatively little from the average, while Timothy Shay Arthur's relative frequency also increases in his last works, similarly to Twain's.
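The 'aut-ref' averaging described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used for the figures: the function names, the data layout (per-author dictionaries mapping a year to a pair of feature count and token count), the toy numbers, and the decision to skip authors with no text inside the window are all assumptions made for this example.

```python
from statistics import mean


def windowed_rel_freq(counts, year, half_width=2):
    """Relative frequency for one author, pooled over year +/- half_width.

    counts maps a publication year to (feature_count, token_count).
    Returns None if the author has no tokens inside the window.
    """
    feature_total = 0
    token_total = 0
    for y in range(year - half_width, year + half_width + 1):
        if y in counts:
            feature_count, token_count = counts[y]
            feature_total += feature_count
            token_total += token_count
    if token_total == 0:
        return None
    return feature_total / token_total


def author_reference(corpus, year, half_width=2):
    """Average the windowed relative frequencies over all authors.

    Authors without any text in the window are skipped (an assumption).
    Returns None if no author contributes to the given year.
    """
    freqs = []
    for counts in corpus.values():
        rel_freq = windowed_rel_freq(counts, year, half_width)
        if rel_freq is not None:
            freqs.append(rel_freq)
    return mean(freqs) if freqs else None


# Invented toy data: two hypothetical authors, one feature.
toy_corpus = {
    "author_a": {1880: (10, 1000), 1882: (30, 1000)},
    "author_b": {1881: (5, 500)},
}

for year in (1881, 1890):
    print(year, author_reference(toy_corpus, year))
```

Pooling counts within the window before dividing weights each year by its token count, which matches the description of averaging over all tokens in those years; the subsequent step gives every author equal weight regardless of output volume.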
Thus, there appears to be little evidence that Twain and James are decidedly different from their contemporaries in terms of style change. In the previous section, we identified an effect of background language for 1PL pronouns, yet overall there is little evidence of a systematic influence of age or background language on these literary authors, at least for the variables examined.
This could indicate that literary authors have a higher command over their language usage and may be more impervious to outside influences.

DISCUSSION
This work has considered aspects of linguistic ageing and how it influences literary authors. In part, the study presented here was a replication of an earlier study by Pennebaker and Stone (2003) investigating ageing effects in emotional disclosure studies and a corpus of literary authors. Although significant effects were found in their study with respect to pronouns, future and past tense, and long-letter sequences, these results did not replicate for the authors examined here in a uniform fashion that would suggest that a rise or fall in frequency is due to age rather than to the stylistic variation of individual authors. The fact that the results of the earlier study could not be replicated may be due to properties of this particular data set, but it could also hint at the possibility that this linguistic ageing effect does not exist for professional writers, who could conceivably possess a higher command over their language style than non-professional writers. This would be consistent with P&S's findings insofar as their results for literary authors were also less significant than those for non-professional writers. This does not necessarily challenge the existence of linguistic ageing as a phenomenon, but rather suggests that the variables analysed here do not provide good proxy measures for it, at least not with respect to literary writers. However, no non-linear models other than the quadratic one were examined in this analysis; this would have to be done to completely refute the proposed ageing hypotheses.
The other purpose of this study was to examine these six variables for evidence of language change, and the results indicate significant change in the usage of at least 1PL pronouns, past and present tense verbs, and long-letter sequences. Overall, the models computed above for the literary authors present little evidence that background language change had a strong influence on them. However, the models built for 1PL pronouns present some evidence of background language influence, which indicates the need to control for it in general. A final result of this analysis was the diversity among the literary authors, which interestingly was not (only) caused by the prominent writers Mark Twain and Henry James. Instead, our analysis suggests that overall they align well with their contemporaries.
Based on this analysis, it appears that there could be some variation between authors for the six variables examined, possibly indicating stylistic differences with respect to other variables. These differences could be explored in more depth by looking more generally at stylistic change in the literary authors against the backdrop of general language shifts.

CONCLUSION
This work has considered to what extent ageing affects language development, examining six linguistic variables that had been reported as significant in the literature. While effects in previous studies were mainly found for non-professional writers, even the significant effects reported by P&S for literary authors could not be replicated here. This does not necessarily prove the absence of the previously identified effects, but calls for additional research. There is strong evidence of background language change for these variables, calling for explicit modelling of this influence, as exemplified in this work.