Evaluation of Automatic Updates of Roget’s Thesaurus

Thesauri and similarly organised resources attract increasing interest of Natural Language Processing researchers. Thesauri age fast, so there is a constant need to update their vocabulary. Since a manual update cycle takes considerable time, automated methods are required. This work presents a tuneable method of measuring semantic relatedness, trained on Roget's Thesaurus, which generates lists of terms related to words not yet in the Thesaurus. Using these lists of terms, we experiment with three methods of adding words to the Thesaurus. We add, with high confidence, over 5500 and 9600 new word senses to versions of Roget's Thesaurus from 1911 and 1987 respectively. We evaluate our work both manually and by applying the updated thesauri in three NLP tasks: selection of the best synonym from a set of candidates, pseudo-word-sense disambiguation and SAT-style analogy problems. We find that the newly added words are of high quality. The additions significantly improve the performance of Roget's-based methods in these NLP tasks. The performance of our system compares favourably with that of WordNet-based methods. Our methods are general enough to work with different versions of Roget's Thesaurus.


introduction
Thesauri and other similarly organised lexical knowledge bases play a major role in applications of Natural Language Processing (NLP). While Roget's Thesaurus, whose original form is 160 years old, has been applied successfully, the NLP community turns most often to WordNet (Fellbaum 1998). WordNet's intrinsic advantages notwithstanding, one of the reasons is that no other similar resource, including Roget's Thesaurus, has been publicly available in a suitable software package. It is, however, important to note that WordNet represents one of the methods of organising the English lexicon, and need not be the superior resource for every task. Roget's Thesaurus updated with the most recent vocabulary can become a competitive resource whose quality measures up to WordNet's on a variety of NLP applications. In this paper, we describe and evaluate a few variations on an innovative method of updating the lexicon of Roget's Thesaurus.
Work on learning to construct or enhance a thesaurus by clustering related words goes back over two decades (Tsurumaru et al. 1986; Crouch 1988; Crouch and Yang 1992). Few methods use an existing resource in the process of updating that same resource. We employ Roget's Thesaurus in two ways when creating its updated versions. First, we construct a measure of semantic relatedness between terms, and tune a system to place a word in the Thesaurus. Next, we use the resource to "learn" how to place new words in the correct locations. This paper focusses on finding how to place a new word appropriately.
We evaluate our lexicon-updating methods on two versions of Roget's Thesaurus, with the vocabulary from 1911 and from 1987. Printed versions are periodically updated, but new releases - neither easily available to NLP researchers nor NLP-friendly - have had little effect on the community. The 1911 version of Roget's Thesaurus is freely available through Project Gutenberg. We also work with the 1987 edition of Penguin's Roget's Thesaurus (Kirkpatrick 1987). An open Java API for the 1911 Roget's Thesaurus and its updated versions - including every addition we discuss in this paper - is available on the Web as the Open Roget's Project. The API has been built on the work of Jarmasz (2003).

The process of adding new words to Roget's Thesaurus

Figure 1 outlines the process of updating Roget's Thesaurus. We work with Wikipedia as a corpus and with the parser MINIPAR (Lin 1998a). Raw text is parsed, and a word-context matrix is constructed and re-weighted in both a supervised and an unsupervised manner. The nearest synonyms of each word in the matrix are generated, and a location for them in Roget's Thesaurus is deduced using it as a source of tuning data. The last step can be applied iteratively to update the lexicon of Roget's Thesaurus.
This work makes six main contributions:

• apply the supervised measures of semantic relatedness from Kennedy and Szpakowicz (2011, 2012) to the updating of Roget's Thesaurus, and evaluate them carefully;
• propose and compare three methods of automatically adding words to Roget's Thesaurus;
• build the updated editions of the 1911 and 1987 versions of Roget's Thesaurus;
• create new datasets for pseudo-word-sense disambiguation and the selection of the best synonym;
• propose and evaluate a new method for solving SAT-style analogy problems;
• compare semantic similarity calculation with Roget's Thesaurus and WordNet on accuracy and on runtime.

About Roget's Thesaurus
In the early 1800s, Peter Mark Roget, a physician, began to categorise terms and phrases for his personal use in writing. The ensuing Roget's Thesaurus, first published in 1852, has gone through many revisions continuing to this day (Kendall 2008). A nine-level hierarchy makes up most of the structure of Roget's Thesaurus:

1. Class
2. Section
3. Sub-Section
4. Head Group
5. Head
6. Part of Speech
7. Paragraph
8. Semicolon Group
9. Words and Phrases

Eight Classes are subdivided into Sections and Sub-Sections. There are around 1000 Heads - the main category in Roget's Thesaurus, corresponding to major concepts. Heads with opposing or complementary concepts form a Head Group. A Part of Speech (POS) groups all noun/verb/adjective/adverb realisations of the Head's concept. The closest counterpart of WordNet's synsets is a Semicolon Group (SG). An SG contains closely related words (usually near-synonyms); a Paragraph contains related SGs. Note the division by part-of-speech quite low in the hierarchy, not at the very top as in WordNet. We define a Roget's grouping to be the set of words contained within an instance of any of these levels. A Section or even a Class is also a Roget's grouping, but usually we talk about words in the same POS, Paragraph or SG.

Figure 2 shows an example of a Head. Head #586 in the 1911 Roget's Thesaurus contains terms pertaining to language. A number before a word refers to a Head in which that word-sense may also be found. Although a thorough update of Roget's Thesaurus should include such cross-references, they are beyond the scope of this work. (They do not figure in any of the applications we consider here to test the quality of the updated versions of the Thesaurus.) Evaluation of a new semantic distance measure should then test how well it identifies words in the same POS, Paragraph or SG.
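To make the lower levels of this structure concrete, here is a minimal sketch in Python; the class names are illustrative only and do not mirror the Open Roget's Project API.

    from dataclasses import dataclass, field

    @dataclass
    class SemicolonGroup:              # closely related words (near-synonyms)
        words: list = field(default_factory=list)

    @dataclass
    class Paragraph:                   # related Semicolon Groups
        sgs: list = field(default_factory=list)

    @dataclass
    class PartOfSpeech:                # noun/verb/adj/adv realisations of a Head
        paragraphs: list = field(default_factory=list)

    @dataclass
    class Head:                        # one of the ~1000 main categories
        name: str
        pos_blocks: list = field(default_factory=list)

Any instance of these levels - or of the Head Group, Sub-Section, Section and Class levels above them - is a Roget's grouping in the sense just defined.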

previous work on updating thesauri
There have been few attempts to expand the lexicon of Roget's Thesaurus thus far. Cassidy (2000) manually added a few hundred words to the 1911 edition of Roget's Thesaurus. Kennedy and Szpakowicz (2007) disambiguated hypernym instances in the 1987 Roget's Thesaurus. Both projects augmented Roget's Thesaurus, but did not offer insight into how to update the lexicon automatically.
Other related work includes mapping word senses between Roget's Thesaurus, WordNet and LDOCE (Procter 1978). The contexts where a word appears - whether words in the same Paragraph, a WordNet synset or an LDOCE definition - are used to deduce which words are likely to be related (Kwong 1998a,b; Nastase and Szpakowicz 2001).

Updating WordNet
The automatic expansion of WordNet's lexicon has been attempted several times. Snow et al. (2006) extracted thousands of new words from a corpus for possible inclusion in WordNet (though that expansion never materialised in practice due to its low accuracy). Many of the new terms were proper nouns found in a corpus by a machine learning system (Snow et al. 2005) which was used to discover is-a relations using dependency paths generated by MINIPAR (Lin 1998b). Pantel (2005) created semantic vectors for each word in WordNet by disambiguating contexts which appeared with different senses of a word. The building of semantic vectors is described in (Pantel 2003). WordNet's hierarchy was used to propagate contexts where words may appear throughout the network. A word sense was then represented by contexts from its semantic vector not shared with its parents. Pantel did not attempt to place new words into the resource; he only evaluated the method on existing words. This technique was only examined for nouns. It presumably applied to verbs as well, but could not be tried on adjectives or adverbs, for which there was no usable hypernym hierarchy.
A folksonomy is a Web service which allows users to annotate Web sites (among other things) with strings of their choice. One such folksonomy was Delicious, where users categorised Web pages. Hypernym/hyponym relations can be extracted from folksonomies by identifying tags subsuming other tags. Zheng et al. (2008) describe how to use folksonomies to discover instances of hypernymy and so help put new words into WordNet.
Not directly applicable but relevant to our work is semi-automatic enhancement of WordNet with sentiment and affect information. Esuli and Sebastiani (2006) used machine learning to build SentiWordNet by labelling synsets in WordNet 2.0 as objective, positive or negative. In WordNet Affect (Strapparava and Valitutti 2004), synsets got one or more labels, often related to emotion. An initial set of words marked with emotions was built manually. Next, those emotions were propagated to other synsets via WordNet relations. This work was based on WordNet Domains (Magnini and Cavaglià 2000), a framework which allows a user to augment WordNet by adding domain labels to synsets. No new words were added, but these projects highlight some of the more successful experiments with enhancing WordNet.
There is a reasonable amount of work on mining hypernym relations from text, which could then be used to update WordNet. This includes using set patterns (Hearst 1992; Sombatsrisomboon et al. 2003) or discovering new patterns using a few seed sets of hypernyms (Morin and Jacquemin 1999). Languages other than English for which hypernym mining has been attempted include Swedish (Rydin 2002), Dutch (Sang 2007) and Japanese (Shinzato and Torisawa 2004). There has also been research on hierarchically related verbs (Girju et al. 2003, 2006).

Wordnets in other languages
There has been much work on building wordnets for languages other than English, loosely coordinated by the Global Wordnet Association. One strategy is to take the Princeton WordNet (Fellbaum 1998) as a starting point. That was the mode of operation in the EuroWordNet project (Vossen 1998), an early initiative meant to build wordnets for several European languages. One of its offshoots is BalkaNet (http://www.dblab.upatras.gr/balkanet/). The other wordnet-building strategy is to avoid the influence of Princeton WordNet. Polish WordNet (Piasecki et al. 2009) is one such resource built from scratch. Its development was supported, among others, by WordNet Weaver, a tool which helps increase the vocabulary of a wordnet. The tool implements a two-phase algorithm. Phase I identifies a network vicinity in which to place a new word, while phase II connects possible candidate synsets. Phase II is semi-automatic: it is the linguists who decide what additions are ultimately made to the growing Polish WordNet.

Lemnitzer et al. (2008) discuss adding semantic relationships between nouns and verbs to GermaNet, a German wordnet. Those were verb-object relationships believed to be useful in applications such as text summarisation or anaphora resolution. Sagot and Fišer (2011) present an automatic, language-independent method (tested on Slovene and French) of extending a wordnet by "recycling" freely available bilingual resources such as machine-readable dictionaries and on-line encyclopaedias.

measuring semantic relatedness
Distributional measures of semantic relatedness (MSRs) use a word's context to help determine its meaning. Words which frequently appear in similar contexts are assumed to have similar meaning. Such MSRs usually re-weight contexts by considering some measure of their importance, usually the association between a context and the terms it contains. One of the most successful measures is Pointwise Mutual Information (PMI). PMI increases the weight of contexts where a word appears regularly but other words do not, and decreases the weight of contexts where many words may appear. Essentially, it is unsupervised feature weighting. Kennedy and Szpakowicz (2011, 2012) discussed introducing supervision into the process of context re-weighting. Their method identifies contexts where pairs of words known to be semantically related frequently appear, and then uses a measure of association to re-weight these contexts by how often they contain closely related words. The method, very general, can work with any thesaurus as a source of known synonym pairs and with measures of association other than PMI. Here, this measure will help update Roget's Thesaurus. This section describes in general how this method is applied.
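As a reference point, here is a minimal sketch of the unsupervised PMI re-weighting step in Python with NumPy; clipping negative weights to zero (the common PPMI variant) is our assumption, not a detail stated above.

    import numpy as np

    def pmi_reweight(counts):
        """PMI(w, c) = log(P(w, c) / (P(w) * P(c))): high where a context
        is distinctive for a word, low where many words may appear."""
        total = counts.sum()
        p_word = counts.sum(axis=1, keepdims=True) / total
        p_ctx = counts.sum(axis=0, keepdims=True) / total
        with np.errstate(divide="ignore", invalid="ignore"):
            pmi = np.log((counts / total) / (p_word * p_ctx))
        pmi[~np.isfinite(pmi)] = 0.0       # cells with zero counts
        return np.maximum(pmi, 0.0)        # assumed PPMI clipping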

Building a word-context matrix for semantic relatedness
We used Wikipedia as a source of data and parsed it with MINIPAR (Lin 1998a). The choice of dependency triples instead of all neighbouring words favours contexts which most directly affect a word's meaning. Examples of triples are 〈time, mod, unlimited〉 and 〈time, conj, motion〉: "time" appears in contexts with the modifier "unlimited" and in a conjunction with "motion". Some 900 million dependency triples generated by parsing Wikipedia took up ≈20GB.
Three matrices were built, one each for nouns, verbs and adjectives/adverbs. For each word-relation-word triple 〈w₁, r, w₂〉 we generated two word-context pairs, (w₁, 〈r, w₂〉) and (w₂, 〈w₁, r〉). Words w₁ and w₂ could be of any part of speech. All relations r were considered, with the direction of r retained. When w₁ or w₂ was an individual term, it had to be a noun, verb, adjective or adverb, written in lower case (MINIPAR only leaves proper nouns capitalised).
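A minimal sketch of this step in Python (the pair format is taken directly from the text above):

    from collections import Counter

    def word_context_counts(triples):
        """Each triple 〈w1, r, w2〉 yields the pairs (w1, 〈r, w2〉) and
        (w2, 〈w1, r〉); the position of r preserves its direction."""
        counts = Counter()
        for w1, r, w2 in triples:
            counts[(w1, (r, w2))] += 1
            counts[(w2, (w1, r))] += 1
        return counts

    # word_context_counts([("time", "mod", "unlimited"),
    #                      ("time", "conj", "motion")])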
With these constraints we used all of the Wikipedia dump when building the matrices for verbs and adjectives/adverbs, but only 50% for nouns. This limit was chosen both because it was the most data which could be held in a system with 4GB of RAM, and because the leftover data could be used in later evaluation.
Very infrequent words and contexts tend to be unreliable, and often appear because of spelling errors. We established thresholds for how often a word or context needs to appear. We measured the quality of synonyms generated for a set of randomly selected words which appear with different frequencies in the matrix. Next, in a series of straightforward experiments, we selected a cutoff after which the quality of the synonyms does not appear to improve: 35 for nouns and for adjectives, 10 for verbs. Also, an entry must appear in a context at least twice for the context to count. Table 1 shows the counts of words and contexts in each matrix before and after the cutoff. Non-zero entries are cells with positive values. While the reduction in matrix dimensionality was quite large, the decrease in the number of non-zero entries was very small. So, we lost little information, but created a much denser and more informative matrix.
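A sketch of the filtering step, with threshold names ours:

    from collections import Counter

    def filter_matrix(counts, min_word_freq, min_ctx_freq, min_cell=2):
        """Drop cells seen fewer than min_cell times, then drop words and
        contexts whose totals fall below the cutoffs (35 for nouns and
        adjectives, 10 for verbs in the text)."""
        cells = {k: v for k, v in counts.items() if v >= min_cell}
        word_freq, ctx_freq = Counter(), Counter()
        for (w, c), v in cells.items():
            word_freq[w] += v
            ctx_freq[c] += v
        return {(w, c): v for (w, c), v in cells.items()
                if word_freq[w] >= min_word_freq and ctx_freq[c] >= min_ctx_freq}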

Measures of semantic relatedness
We explored two complementary methods of re-weighting the word-context matrix. An unsupervised method measures association between words and contexts; a supervised method uses known pairs of synonyms in Roget's Thesaurus to determine which contexts have a higher tendency to contain pairs of known synonyms (Kennedy and Szpakowicz 2011, 2012). Supervision can be conducted on each individual context, or on groups of contexts with a syntactic relation in common. It was found that supervision at the context level worked best for nouns and verbs, while grouping contexts by relation worked best for adjectives (Kennedy and Szpakowicz 2012).
Both supervised and unsupervised methods employ measures of association; Kennedy and Szpakowicz (2012) found that in all cases PMI was the most successful. The two kinds of methods are complementary: it is possible to use the supervised method of matrix re-weighting and then apply the unsupervised method on top of it. This was generally found to yield the best results, so this is how we report the results.
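The published supervision scores contexts with a measure of association (PMI) over synonym-pair counts; the sketch below substitutes a simple smoothed ratio, so it preserves the idea rather than the exact scoring.

    from collections import defaultdict

    def supervised_context_scores(matrix, synonym_pairs):
        """Weight each context by the fraction of word pairs it links
        that are known synonyms in the Thesaurus."""
        words_in_ctx = defaultdict(set)
        for (word, ctx) in matrix:        # matrix keys: (word, context)
            words_in_ctx[ctx].add(word)
        syn = set(synonym_pairs) | {(b, a) for (a, b) in synonym_pairs}
        scores = {}
        for ctx, words in words_in_ctx.items():
            ws = sorted(words)
            pairs = [(a, b) for i, a in enumerate(ws) for b in ws[i + 1:]]
            hits = sum(p in syn for p in pairs)
            scores[ctx] = (1 + hits) / (1 + len(pairs))   # smoothed ratio
        return scores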
To evaluate this work, we created a random set of 1000 nouns, 600 verbs and 600 adjectives, and generated lists of neighbouring words for each of them. (There were not enough adverbs to construct such a set; adverbs are left for future work.) Those words were left out of the training process. We then measured the precision - how many neighbouring words appeared in the same SG, Paragraph or POS - in the 1987 Roget's Thesaurus. Precision was measured at several recall points: the top 1, 5, 10, 20, 50 and 100 words retrieved.

Table 2 shows the results for the unsupervised baseline, using PMI weighting, and the results for the combined supervised methods using synonyms from either the 1911 or the 1987 version of Roget's Thesaurus as training data. Statistically significant improvement over the baseline appears in bold, while significantly worse results are italicised; we applied Student's t-test. With a few small exceptions, we found that the supervised system performs better. The number of times the scores were better, unchanged, or worse can be found in Table 3.
In general, we concluded that the combination of supervised and unsupervised context weighting created a superior MSR, better suited to updating Roget's Thesaurus than the unsupervised method alone. We used the supervised method of generating lists of related words when adding new terms to the Thesaurus.

placing new words in roget's thesaurus
In this section, we evaluate a variety of systems for adding new words to Roget's Thesaurus. The baseline method places a word in the same POS, Paragraph and Semicolon Group as its closest neighbour in the Thesaurus. We improve on this baseline by using multiple words to deduce one or more better locations.

Methods of adding new words
We took advantage of the hierarchy of Roget's Thesaurus to select the best place to add words. We found first the POS, then the Paragraph, then the SG. 9 We refer to the word to be added to Roget's Thesaurus as the target word. A word already in the Thesaurus may be an anchor, acting as a "magnet" for a given target. For every target word t, we generated a list of nearest neighbours NN(t), along with similarity scores, and identified anchors using NN(t).
We experimented with three methods, evaluated against the following baseline: the target t is placed in the same POS, Paragraph and SG as wᵢ, where wᵢ is the first word in NN(t) found in Roget's Thesaurus. Since wᵢ may be polysemous, t can go into multiple locations in Roget's Thesaurus. Often wᵢ will be w₁, if the first neighbour of t is found in the Thesaurus. For the values in Table 4, this baseline has been calculated using the MSRs built with combined weighting, trained with the 1911 or the 1987 Thesaurus. Table 4 reports the number of POSs, Paragraphs and SGs where the target t was placed, and the precision of placing the word into those POSs, Paragraphs and SGs.
9 Identifying the POS effectively gives us the correct Head as well.
The first method is to apply a nearest-neighbour model. X nearest neighbours from NN(t) are identified for each target word t. If W of these X words appear in the same Roget's grouping, the target word is placed there. It is a weakness that this method considers - somewhat unrealistically - the same number of neighbours for every target word.
In the second method, scores replace rank. Words with scores of Y or higher are identified. If W of them are in the same Roget's grouping, the target word is placed there. This allows for varying numbers of neighbours, but similarity scores partially depend on the target word, so the same score for two different word pairs may indicate different degrees of similarity. A very frequent word which appears in many contexts may have more highly related neighbours than a word which appears in few contexts. Such a frequent word may thus have inordinately many synonyms.
The third method considers relative scores. It assumes that the first similar word w₁ is very closely related to t, then takes all synonyms within Z% of the similarity score for w₁. This means that if wᵢ has a score within Z% of w₁'s, then it can be used as an anchor of t for determining the correct Roget's grouping. Once again, if W of these words in the same Roget's grouping have a relative score of Z% or higher, then the target word can be placed there as well.
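A minimal sketch of the first method in Python; nn stands for the ranked list NN(t) of (word, score) pairs, and groupings_of is a hypothetical lookup returning the Roget's groupings a word belongs to:

    from collections import Counter

    def place_by_nearest_neighbours(nn, groupings_of, X, W):
        """Method 1: any grouping holding at least W of the target's
        top X neighbours is a candidate location for the target."""
        votes = Counter()
        for word, _score in nn[:X]:
            for g in groupings_of(word):
                votes[g] += 1
        return [g for g, n in votes.items() if n >= W]

Methods 2 and 3 differ only in how the slice of NN(t) is chosen: neighbours scoring at least Y, or neighbours whose score falls within Z% of the top neighbour's.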
We also considered how to optimise the measures. In placing words into a Roget's grouping, each method has two parameters to optimise: W and one of X, Y or Z. One possibility is to base the F-measure on the precision with which words are placed in Roget's Thesaurus, and recall on the number of words from the test set which could actually be placed. Another possibility for counting recall would be to identify the number of places where a word appears in the Thesaurus and see in how many of them it was placed. This measure has some problems.
For one, rare senses are not well represented by the vectors in the word-context matrix, so synonyms will be found only for the most dominant senses. Also, an even balance of precision and recall is not appropriate for this task. Adding incorrect words could be quite detrimental, so we assume that identifying the POS must weight precision more highly than recall. We set a 0.33 ratio of recall to precision (an F0.33 measure rather than F1). Once the POS has been identified, the Paragraph and SG are identified using the F1 measure. The choice of F0.33 is somewhat arbitrary, but favouring precision over recall should mostly bring advantages. A high-precision system is, in theory, more likely to place words in the correct Roget's grouping, at the cost of lower recall. Any method of adding new words to Roget's Thesaurus, however, could be run iteratively and thus make up for the lower recall. Rather than attempting to add a lot of words in one pass, our method adds fewer words in each of multiple passes.

When using this method to actually add new words, it is sometimes necessary to create new Paragraphs or SGs. If a POS is identified but no Paragraph, then a new Paragraph is created. Likewise, if a Paragraph but not an SG can be identified, then the word is placed in a new SG in the selected Paragraph.
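For reference, the weighted F-measure has the standard form F_β = (1 + β²) · P · R / (β² · P + R), so F0.33 = 1.109 · P · R / (0.109 · P + R); with β = 0.33, recall is treated as roughly one third as important as precision.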
The methods were tuned on the same dataset as that used to evaluate the MSR in Section 3. For evaluation, we constructed a test set equal in size to the tuning set. We evaluated all methods on the task of identifying the correct POS in which to place a target word t. The best method is then applied to the task of placing a word in the appropriate Paragraph and SG.

Baseline
Table 4 shows the results of the baseline experiments, measured for the 1911 and 1987 versions of Roget's Thesaurus. The former did not contain all the words for evaluation that the latter did - hence

Tuning parameters for adding new words
Table 5 shows the parameters, optimised for F0.33, for the three non-baseline methods. Tables 6-8 present the results on the tuning and test data.
When optimising for the X nearest neighbours (Table 6), the results show a large improvement over the baseline (Table 4). The results for nouns were actually better on the test dataset than on the tuning data. Generally, about one third to half of the words found in the top X needed to be present in the same Roget's grouping in order to select the correct grouping accurately.
Table 7 shows optimising word placement with scores Y or higher. The optimal scores were noticeably lower than when we optimised for the X nearest neighbours (Table 6). The minimum score Y appeared to be lower for nouns than for verbs or adjectives, though more words were required in order to identify the Roget's grouping positively. This method is not as successful as simply selecting the X nearest neighbours. For verbs added to the 1911 Roget's Thesaurus, there was actually no improvement over the baseline (Table 4). This is the least successful method of the three.
Table 8 reports on optimising for the relative score Z. We found that most neighbouring words had to be within 80-90% of the closest neighbour in terms of score. This improved the results noticeably over a simple selection of a hard score cut-off (Table 7). Nonetheless, we did not improve on simply taking the X nearest neighbours (Table 6). For determining relatedness, it would appear, rank is often a more important feature than score. With this in mind, we applied the nearest-neighbour function using X to find the best parameters for identifying the POS, Paragraph and SG. The parameter W discussed so far applies at the POS level; we distinguish three versions, W-POS, W-Para and W-SG, for the POS, Paragraph and SG respectively. Table 9 shows the optimal values of X, W-POS, W-Para and W-SG. The same value of X was used for identifying groupings at the POS, Paragraph and SG levels. There is a bit of variance in the measures. The values of W-POS, W-Para and W-SG decrease as the groupings become smaller. To identify the correct SG, only 2 or 3 words were used. For the 1911 Roget's Thesaurus, the same number of words was used to identify the Paragraph as the SG. More words could be used to identify the POS for the 1987 Thesaurus than for the 1911 version.
Tables 10 and 11 show the precision, recall and F1 measure at the POS, Paragraph and SG level for the 1987 and 1911 Thesauri. The results show clearly that the F1 measure is highest when identifying the Paragraph level; this is largely because the POS level is optimised for the F0.33 measure. Once again, the scores for the 1987 version tend to be better than those for the 1911 version. Most of the time it is possible to identify the correct POS with at least 40% accuracy. The recall for the 1987 Thesaurus was 0.233 or higher at the POS level. This is important, because it indicates how many new word additions to the Thesaurus can be expected. For the 1911 Thesaurus, the results tend to be much lower, with scores from 0.097 to 0.182 on the test set.
The number for verbs is very low; for nouns and adjectives it is better, but still lower than the corresponding results for the 1987 Thesaurus.

Adding words to the Thesaurus
We now show how the method described in Section 4.3 adds words to Roget's Thesaurus. In practice, a few small modifications were needed. First, we only let a word be placed in a POS if it was not already present in either that POS or in another POS within the same Head Group. This reduced the possibility of entering antonyms, which may be distributionally similar, into the same POS. Within each POS, we let a word be placed only in one Paragraph.

Once a new word has been added to Roget's Thesaurus, it can be used as an anchor to help add subsequent words. We built two updated versions of each Thesaurus, one with a single pass to update the Thesaurus, another with five updating passes. We considered each word in each matrix, excluding stop words, to be a target, and generated a list of the nearest 100 neighbours for each of these words. It was from these lists that we attempted to add new words to the Thesaurus.
Several measures are of interest when adding new words to the Thesaurus. The first is the number of times a target word has sufficient X and W values to be placed in Roget's Thesaurus, regardless of whether it was already present. The second measure is the total number of words added to the Thesaurus. The third measure is the number of unique words added. These two are likely to be similar, since most often a target word is only added to a single location in the Thesaurus. The fourth measure counts new words whose derivational form already exists in the Thesaurus. The fifth measure counts new words which have no derivationally related words in the Thesaurus. The last measure is the number of Heads where a new word was added. The results for all five passes can be seen in Table 12.
In addition to the five passes of adding new words, we experimented with random addition. All process parameters are the same, up to the point when our system determines a location where it believes a word belongs. Before checking whether that word already appears at this location, it is swapped for a random word. The counts appear in Table 13. Since the random word is selected after a location has been decided, it is very rare for this word already to be in that Head Group. As a result, the number of attempted placements is very close to the total number of words added, much closer than for the counts from Table 12.
Ultimately, three updated versions each of the 1911 and 1987 Thesauri were created: those updated with one pass, five passes and one random pass - X1, X5 and R in Table 14. The updated versions are referred to as 1911X1, 1911X5, 1911R, 1987X1, 1987X5 and 1987R. The new thesauri have been evaluated manually (Section 5) and through selected NLP applications (Section 6).
Another statistic to consider is the total number of words, SGs and Paragraphs added to each version of Roget's Thesaurus, shown in Table 14. Overall, some 5500 new words were added to 1911X5 and 9600 to 1987X5. In the 1911 Thesaurus, approximately two thirds of the new words were placed in a new SG, while about a quarter were added to a new Paragraph. For the 1987 Thesaurus, a little under half of the new words were placed in new SGs, while around one fifth were added to new Paragraphs.

manual evaluation
To determine the quality of the additions reliably, one needs manual evaluation. In the next subsection, we describe several possibilities and explain how we chose our evaluation method.

Methods considered
The first evaluation method would test how well people can identify newly added words. Given a set of Paragraphs from Roget's Thesaurus, the annotator would be asked to identify which words she thought were added automatically and which were originally in the Thesaurus.
The percentage of times the annotator correctly identifies newly added words can be used to evaluate the additions. If a word already in the Thesaurus were as likely to be picked as one newly added, then the additions would be indistinguishable - an ideal outcome. We could also perform a "placebo test": the annotator gets a Paragraph where no words have been added, and decides whether to remove any words at all. A drawback is that the annotator may be more likely to select words whose meaning she does not know, especially in the 1911 Thesaurus, where there are many outdated words. Even the 1987 version has many words infrequently used today.
The second method of manual evaluation we considered was to ask the annotator to assign a new word to the correct location in the Thesaurus. A weighted edit-distance score could then tell how many steps the system's placement is from that location. Such a method would be labour-intensive: the annotator would need to read an entire Head before deciding how far a word is from its correct location. Larger Heads, where most new words are added, could contain thousands of words.
Identifying whether there is an SG more appropriate for a given word could also take a fair bit of effort. It might not be feasible to annotate enough data to perform a meaningful evaluation.
The strategy we finally adopted combines elements of the two preceding methods. The first step of this evaluation exercise is to decide whether new words added to an existing SG or a new SG in an existing Paragraph are in the correct location. The annotator is given the name of the Head, the part of speech and the text of the Paragraph where the word has been added. The new term is specially highlighted, and other terms in its SG are in bold. The annotator is asked to decide whether the new word is in the correct SG, wrong SG but correct Paragraph, wrong Paragraph but correct Head, or incorrect Head. Figure 3 shows a sample question.
The second evaluation step determines whether a word added to a new Paragraph is in the correct Head. As context, we provide the first word in every Paragraph in the same POS. It is too onerous to determine precisely in which SG or Paragraph a new word would belong, because some POSs are very large. Instead, we only ask whether the word is in the correct Head. A sample question appears in Figure 4.

We manually evaluated only the additions to the 1911 Roget's Thesaurus. We allowed Paragraphs of at most 250 characters, thus limiting the number of words the annotators had to look at. The evaluation was completed by the first author and four volunteers. We chose enough samples to guarantee a 5% confidence interval at a 95% confidence level. 12 We also included a high baseline and a low baseline: words already present in the Thesaurus 13 and words randomly added to it. There are enough samples from the baselines to guarantee a 5% confidence interval at a 95% confidence level if the samples from all three parts of speech are combined, though individually the confidence interval exceeds 5%.
Every new word in 1911X1 appears in 1911X5, 14 so a percentage of the samples needed to evaluate 1911X5 can be selected from the samples used to evaluate 1911X1. We thus must evaluate only a selection of the words from 1911X5 not present in 1911X1. We randomly selected words from the sample set for 1911X1 to make up the rest of the samples for the 1911X5 evaluation.
Random selection was made from each annotator's dataset: 40 tests for adding words to existing Paragraphs and 40 tests for adding words to new Paragraphs. These data points were added to each annotator's test sets so that there would be an overlap of 200 samples for each experiment, on which to calculate inter-annotator agreement.
The positive examples are words already present in Roget's Thesaurus.
The negative examples are words randomly placed in the Thesaurus.
12 http://www.macorr.com/sample-size-calculator.htm
13 They are referred to as "pre-existing" in Tables 15-16, in Figures 5-6 and in the discussion in Section 5.2.
14 We remind the reader that X1 and X5 denote updating with one pass and five passes respectively.

Manual annotation results
Tables 15 and 16 show the combined manual annotation results for new words added to existing Paragraphs and for new Paragraphs. A number of interesting observations can be taken from Table 15. The results are summarised in Figure 5. In the case of pre-existing examples, around 60% of the time the annotators could correctly determine when a word belonged in the SG in which it was found. The annotators agreed on the correct Head approximately 80-90% of the time. One reason why annotators might believe the words belonged in a different grouping was that many of the words were difficult to understand. A high number of words which the annotators could not label fell into the pre-existing category. For the randomly assigned words, 70-80% of the time the annotators correctly stated that those words did not belong in that Head. For nouns there were numerous cases when the annotators could not answer. It would appear that the meaning of words pre-existing in the Thesaurus, and of those randomly added, is harder to determine than the meaning of automatically added words.
We now turn to the quality of additions. The distribution of 1911X1 scores in Table 15 is very close to the distribution for words pre-existing in Roget's Thesaurus. This suggests that after one pass the added words are nearly indistinguishable from those already in the Thesaurus. This is very good news: it confirms that our process of updating the lexicon has succeeded. The distribution of 1911X5 scores suggests that those additions were less reliable. The scores are worse than for 1911X1, but still much closer to the pre-existing baseline than to the random baseline. Multiple passes increase the error, but not by much.
The results are a bit different when it comes to inserting words into new Paragraphs. These results are summarised in Figure 6. Once again the high and low baselines appeared to be fairly easy for the annotators, who usually got around 80% of the questions right. Also, a solid majority of the unknown words appeared in these two groups. The additions to 1911X1 showed high scores, too, comparable to the high baseline, sometimes even exceeding it slightly. It may be that for this baseline the annotators were unaware of the sense of some words, so they mistakenly labelled those words as incorrect.
The 1911X5 results - multi-pass update - clearly fall a fair distance from the scores for 1911X1. It would appear that multiple passes introduce considerable error into the Thesaurus when words are placed into new Paragraphs. This is in stark contrast to the result of adding words to existing Paragraphs, where the drop in scores between 1911X1 and 1911X5 was relatively small.

Inter-annotator agreement
Each annotator was given 200 examples which reoccurred between the annotations. Inter-annotator agreement was measured on these overlaps, using Krippendorff's α (Krippendorff 2004), a measure designed to work with various kinds of data, including nominal, ordinal and interval annotations. We used the ordinal variant in our experiments. The value of α was calculated for adding words both to existing Paragraphs and to new Paragraphs. When adding words to an existing Paragraph, we obtained a score of α = 0.340; when adding words to new Paragraphs, the score was α = 0.358. Such scores are often considered a "fair" amount of agreement (Landis and Koch 1977).

automatic evaluation
We now examine how the various versions of Roget's Thesaurus, as well as WordNet 3.0, perform on several NLP applications. The problems we selected are designed to evaluate Roget's Thesaurus on a diverse cross-section of NLP tasks: synonym identification, pseudo-word-sense disambiguation and SAT-style analogy problems. We use WordNet 3.0 and all available versions of the Thesaurus: 1911, 1911X1, 1911X5, 1911R, 1987, 1987X1, 1987X5 and 1987R. Although the updated versions of Roget's Thesaurus are larger than the original, and new words have been added with relatively high precision, there is no a priori guarantee that they will give higher scores on any NLP applications. Before we harness these resources into NLP applications, we will very briefly compare the structure of Roget's Thesaurus to that of WordNet.

A major difference between WordNet and Roget's Thesaurus is that the former is built around a hypernym hierarchy of arbitrary depth. Words appear at all levels, rather than only at the bottom level, as in Roget's Thesaurus. Words are grouped into synsets. Synsets are similar to SGs in the Thesaurus, but are often smaller and contain only close synonyms. Synsets are linked by a variety of explicitly named semantic relations, while in the Thesaurus the SGs in a Paragraph are loosely related by a variety of possible implicit relations.

Synonym identification
Synonym identification is a means of evaluating the quality of newly added words in Roget's Thesaurus. In this problem one is given a term q and seeks its best synonym s in a set of words C. The system from Jarmasz and Szpakowicz (2003b, 2004) identifies synonyms using the Thesaurus as the lexical resource. This method relies on a simple function which counts the number of edges in the Thesaurus between q and words in C:

    edgeScore(x, y) = 18 − edgesBetween(x, y)    (1)

In Equation 1, 18 is the highest possible distance in the Thesaurus, so the closest words have the highest scores (edgesBetween simply counts the edges). We treat a word X as a lexeme: a set of word senses. The best synonym is selected in two steps. First, we find the set of terms B ⊆ C with the maximum relatedness between q and each word sense x ∈ C (Equation 2):

    B = arg max_{c ∈ C} edgeScore(q, c)    (2)
Next, we take the set of terms A ⊆ B where each a ∈ A has the largest number of shortest paths between a and q (Equation 3):

    A = arg max_{b ∈ B} numShortestPaths(q, b)    (3)
The correct synonym s has been selected if s ∈ A and |A| = 1. Often the sets A and B will both contain one item, but if s ∈ A and |A| > 1, there is a tie. If s ∉ A, the selected synonyms are incorrect. If an n-word phrase c ∈ C is found, its words c₁, c₂, ..., cₙ are considered in turn; the cᵢ closest to q is chosen to represent c. A sought word can be of any part of speech, though only some WordNet-based methods allow for adjectives or adverbs, and none can measure distance between different parts of speech. In these problems, we do not consider a word and its morphological variant to be the same.

We generated synonym selection problems specifically for words newly added to Roget's Thesaurus. We took all words which appeared either in 1987X5 or in 1911X5, but were not present in the original 1987 or 1911 versions, and used them as query words q for the new problems. We then found in WordNet synsets which contain at least one of q's synonyms found in the original (not updated) version of the Thesaurus. We completed the problem by finding in the original Thesaurus three distractors from q's co-hyponym synsets. This was done for nouns and for verbs, but not for adjectives, for which WordNet does not have a strong hypernym hierarchy.
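A minimal sketch of this two-step selection in Python; edges_between and num_shortest_paths are hypothetical accessors over the Thesaurus graph, and single words stand in for lexemes for simplicity:

    def best_synonyms(q, candidates, edges_between, num_shortest_paths):
        """Equations 1-3: keep the candidates with the highest edgeScore,
        then keep those with the most shortest paths to q."""
        score = {c: 18 - edges_between(q, c) for c in candidates}
        top = max(score.values())
        B = [c for c in candidates if score[c] == top]
        paths = {b: num_shortest_paths(q, b) for b in B}
        most = max(paths.values())
        A = [b for b in B if paths[b] == most]
        return A           # the guess is correct iff A == [s]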
Four different versions of this problem were generated for the 1911 and 1987 Roget's Thesauri, using nouns and verbs. The linking structure for adjectives in WordNet precludes the creation of a data set in this manner. We present the final scores as precision and recall. The precision excludes questions where q is not in Roget's Thesaurus, while recall is the score over the entire data set. Precision is thus the proportion of correct guesses out of the questions attempted, while recall is the proportion of correct guesses out of the maximum number of attempted questions. This method of evaluating such work was proposed by Turney (2006).
Table 17 shows the results for nouns and verbs added to both the 1987 and the 1911 versions of Roget's Thesaurus. The results are quite similar for all four data sets. Obviously, a precision and recall of 0 is attained for the original versions of the Thesaurus. The randomly updated versions did poorly as well. Versions updated after one pass had recall between 18% and 26%, while the versions updated in 5 passes had 40% or more. The random baseline is 25% if all of the questions can be answered. The thesauri updated in 5 passes significantly beat this baseline. The thesauri updated in one pass tended not to show statistically significant improvement, though many problems were unsolvable (q was absent from 1911X1 or 1987X1).
The recall improvement for Roget's Thesaurus updated in 5 passes was significantly better (at p < 0.05) than for the Thesaurus updated in one pass. In turn, the Thesaurus updated in one pass was significantly better than the original Thesaurus (again at p < 0.05). The exception was the 1911 verb data set, for which the improvement could only be measured as significant at p < 0.065. This is largely because the dataset was fairly small. Another observation is that the randomly updated Thesaurus only once had a significant improvement over the original Thesaurus, in the case of the 1911 noun data set. These results suggest that the words newly added to Roget's Thesaurus are close to the correct location. The newly added words and their synonyms were closer than the newly added words and their co-hyponyms. Generally, the precision measure showed words added to the 1911X1 and 1987X1 thesauri to be approximately as accurate as, if not slightly more accurate than, those added in passes 2-5. The randomly updated Thesaurus did not perform as well, usually falling below the 25% baseline on the precision measure. The results for nouns added to the 1911 Thesaurus are a noticeable exception. In the other datasets at most one question was answered correctly by the randomly updated Thesaurus, but in this case there were three correct answers. It should be noted, however, that the evaluated sample was very small, so this is likely to have been a coincidence.

Pseudo-word-sense disambiguation
Pseudo-word-sense disambiguation (PWSD) is a somewhat contrived task, meant to evaluate the quality of a word-sense disambiguation (WSD) system. The set-up for this task is to take two words and merge them into a pseudo-word. A WSD system then has the goal of identifying which of the two words actually belongs in a given context in which the whole pseudo-word appears. We have had a chance to create a very large dataset for PWSD. This is an opportunity to consider WordNet and the versions of Roget's Thesaurus in PWSD, and to compare them not only for accuracy but also for runtime.
We used PWSD instead of real WSD for two main reasons. Firstly, as far as we know, there is no WSD data set annotated with Roget's word senses, so one would have to be built from scratch. Worse still, to compare WSD systems built using Roget's Thesaurus and WordNet we would need a dataset labelled with senses from both. Secondly, PWSD gives us a fast way of building a dataset which can be used to evaluate the WSD systems based on the Thesaurus and on WordNet.
A common variation on this task is to make triples out of a noun and two verbs, then determine which of the verbs takes the noun as its object. The aim is to create a kind of verb disambiguation system which incorporates the edge count distance between nouns. In theory, this measure can help indicate how well a system identifies contexts (verb object) in which a verb appears. That can be useful in real WSD. Others who have worked on variations of PWSD include Gale et al. (1992); Schütze (1998); Lee (1999); Dagan et al. (1999); Rooth et al. (1999); Clark and Weir (2002); Weeds and Weir (2005); Zhitomirsky-Geffet and Dagan (2009). The methodology we followed was similar to that of Weeds and Weir.
The data set was constructed in four steps, which create two data sets. One is a training set of noun-verb pairs 〈n, v〉. The other is a test set made up of noun-verb-verb triples 〈n, v, v′〉. Examples of such triples are 〈task, assign, rock〉 and 〈data, extract, anticipate〉. We selected v′ so that its frequency is v's frequency ± 1. We also ensured that the pair 〈n, v′〉 does not appear anywhere in the training or test data. To reduce noise and decrease the overall size of the dataset, we removed from both the test and training sets all noun-verb object pairs which appeared fewer than five times. This produced a test set of 3327 triples and a training set of 464,303 pairs. We only used half of Wikipedia to generate this data set - the half not used in constructing the noun matrix.
We employed edgeScore (Equation 1) for all versions of Roget's Thesaurus. The methods implemented in the WordNet::Similarity software package (Pedersen et al. 2004) determine how close two words are in WordNet. These methods are J&C (Jiang and Conrath 1997), Res (Resnik 1995), Lin (Lin 1998a), W&P (Wu and Palmer 1994), L&C (Leacock and Chodorow 1998), H&SO (Hirst and St-Onge 1998), Path (counts edges between synsets), Lesk (Banerjee and Pedersen 2002), and finally Vector and Vector Pair (Patwardhan et al. 2003). The measure most similar to the edgeScore method is the Path measure in WordNet. J&C, Res, Lin, W&P, L&C and Path can only measure relatedness between nouns and verbs, because they only make use of hypernym links. H&SO uses all available WordNet relations in finding a path between two words. The Lesk and Vector methods use glosses and so might be just as easily implemented using a dictionary. They need not take advantage of WordNet's hierarchical structure.
To perform the PWSD task for each triple 〈n, v, v′〉, we found in the training corpus the k nouns which were the closest to n. Every such noun m got a vote: the number of occurrences of the pair 〈m, v〉 minus the number of occurrences of 〈m, v′〉. Any value of k could potentially be used. This means comparing each noun n in the test data to every noun m in the training set if these nouns share a common verb v or v′. Such a computation is feasible in Roget's Thesaurus, but it takes a very long time for any WordNet-based measure. To ensure that a fair value is selected, we divided the test set into 30 sets. We use 29 folds to find the optimal value of k and apply it to the 30th fold.
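A minimal sketch of the voting step in Python; nearest_nouns and pair_count are hypothetical lookups over the training pairs:

    def disambiguate(n, v, v_prime, nearest_nouns, pair_count, k=1):
        """The k training nouns closest to n vote by how much more often
        they occur as objects of v than of v_prime (k = 1 proved optimal)."""
        vote = sum(pair_count(m, v) - pair_count(m, v_prime)
                   for m in nearest_nouns(n, k))
        if vote == 0:
            return None    # a tie
        return v if vote > 0 else v_prime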
The score for the PWSD task is typically measured as an error rate, where T is the number of test cases (Equation 4):

    Error rate = (1/T) · (#incorrect + ½ · #ties)    (4)

Table 18 shows the results of this experiment. When it comes to the values of k, k = 1 was always found to be the optimal value on this dataset. So, the best way to perform PWSD is to select the nearest noun taken as the object of either v or v′.
The CPU usage was perhaps the most pronounced difference: the Roget's-based methods ran in a tiny fraction of the time which WordNet-based methods required. H&SO took around 28 days to run, so this measure simply is not an option for large-scale semantic relatedness problems. Even Lin, the fastest WordNet-based method, took around 5.5 hours, over 340 times longer than the method based on the 1911 Thesaurus.
For all systems, a total of 193,192 word pairs must be compared. We also examined the number of necessary comparisons between word senses. If one resource contains a larger number of senses for each word it is measuring distance on, then it will necessarily have to perform many more comparisons. The method based on the 1987 Thesaurus required nearly 120 million comparisons. The method based on the 1911 Thesaurus needed 14.7 million comparisons. For the WordNet-based methods, only 3.5 million comparisons were necessary. Clearly the implementation of Roget's Thesaurus has a very strong advantage when it comes to runtime.

SAT analogies
The last class of problems to which we applied Roget's Thesaurus were analogy problems in the style of Scholastic Aptitude Tests (SAT). In an SAT analogy task, one is given a target pair 〈A, B〉 and then, from a list of possible candidates, selects the pair 〈C, D〉 most similar to the target pair. Ideally, the relation between the pair 〈A, B〉 should be the same as the relation between the pair 〈C, D〉. For example:

Target pair: word, language
Candidates:  paint, portrait
             poetry, rhythm
             note, music
             tale, story
             week, year

Roget's Thesaurus performs well on problems of selecting synonyms and pseudo-word-sense disambiguation, but it is not clear just how well it will do on tasks of identifying analogies. That is because relations in the Thesaurus are unlabelled. We explore two methods of solving such problems with both the Thesaurus and WordNet. The first method attempts to identify a few kinds of relations in the Thesaurus and then apply them to identifying analogies. The second method uses edge distance between the pairs 〈A, B〉-〈C, D〉 and 〈A, C〉-〈B, D〉 as a heuristic for guessing whether two word pairs contain the same relation.
The dataset contains 374 analogy problems extracted from real SAT tests and practice tests (Turney 2005). A problem contains a target pair 〈A, B〉 and several pairs to choose from: testᵢ = 〈Xᵢ, Yᵢ〉, i = 1..5. In evaluation, we consider seven scores: correct, ties, incorrect, filtered out, precision, recall and equal-weighted F-score. We define precision and recall in the same way as in Section 6.1. In the case of an n-way tie, the correct answer counts as 1/n towards the precision and recall. We consider recall as the most important measure, because it evaluates each method over the entire data set.

Matching relations
Unlike WordNet, Roget's Thesaurus contains no explicitly labelled semantic relations, but certain implicit relations can be inferred from its structure. Near-synonyms tend to appear in the same SG. Near-antonyms usually appear in different Heads in the same Head Group. One can also infer a hierarchical relation between two words if (1) they are in the same Paragraph and one of them is in the first SG, or (2) they are in the same POS and one of them is in the first SG of the first Paragraph. So, three relations can be deduced from the Thesaurus. Two words can be near-synonymous, near-antonymous or hierarchically related. From WordNet, we allow words to be related by any of the explicit semantic relations. We also apply hypernymy/hyponymy transitively.
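A minimal sketch of this relation inference in Python; t is a hypothetical interface exposing the structural tests just described:

    def roget_relation(w1, w2, t):
        """Infer an implicit relation from the structure alone."""
        if t.same_sg(w1, w2):
            return "near-synonym"
        if t.same_head_group(w1, w2) and not t.same_head(w1, w2):
            return "near-antonym"
        if t.same_paragraph(w1, w2) and (t.in_first_sg(w1) or t.in_first_sg(w2)):
            return "hierarchical"
        if t.same_pos(w1, w2) and (t.in_first_sg_of_first_paragraph(w1)
                                   or t.in_first_sg_of_first_paragraph(w2)):
            return "hierarchical"
        return None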
Using these semantic relations, the analogy problem is solved by identifying a candidate analogy which contains the same relation as the target pair. There will be no solution if no relation can be found for the target pair. This experiment is interesting in that it helps test whether the narrower semantic relations in WordNet are more useful or less useful than the broader relations in Roget's Thesaurus. Table 19 shows the results; "Filtered" shows the number of pairs which were not scored because no relation could be established between the words in the target or candidate pairs.
The WordNet-based method has high precision, but recall is low compared to that of the Roget's-based versions. Interestingly, the precision and recall both increase as more words are added to the 1911 and 1987 versions of Roget's. We consider recall as more important in this evaluation, so it is clear that the most updated versions of Roget's Thesaurus outperform WordNet by a fair margin. Although the original 1911 version gave a lower F-score than WordNet, all other versions performed better. The existence of very specific semantic relations in WordNet did give it an edge in precision, but WordNet was only able to answer a few questions. This suggests that the relations between pairs in analogy tests are not only of the type encountered in WordNet. While the broader relations identified in the Thesaurus appear to be less reliable and give lower precision, the recall is much higher.

Edge distance
The second method of solving analogy problems uses edge distance between words as a heuristic. Analogy problems have been solved in this way using Equation 5, proposed by Turney (2006):

    score(〈Xᵢ, Yᵢ〉) = ½ · (sim_a(A, Xᵢ) + sim_a(B, Yᵢ))    (5)
The highest-scoring pair 〈Xᵢ, Yᵢ〉 is guessed as the correct analogy. This method assumes that A and Xᵢ should be closely related, and so should B and Yᵢ. An illustrative example is 〈carpenter, wood〉 and 〈mason, stone〉.
In Equation 5, sim_a is the attributional similarity. We replaced it with an edge distance measure r, either edgeScore (Equation 1) or one of the measures built on WordNet. Because edgeScore only returns even numbers between 0 and 18, it tends to give many ties. We used a formula with a tie-breaker based on the edge distance between A and B and between Xᵢ and Yᵢ:

    score(〈Xᵢ, Yᵢ〉) = r(A, Xᵢ) + r(B, Yᵢ) + (18 − |r(A, B) − r(Xᵢ, Yᵢ)|) / 18    (6)

The last term of the sum in Equation 6 acts as a tie-breaker which favours candidates 〈Xᵢ, Yᵢ〉 with an edge distance similar to the target 〈A, B〉. We include another constraint: A and Xᵢ must be in the same part of speech, and so must B and Yᵢ. Only one sense of each of A, B, Xᵢ and Yᵢ can be used in the calculation of Equation 6. For example, the same sense of A is used when calculating r(A, Xᵢ) and r(A, B).
We applied Equation 6 to the 374 analogy problems using all versions of Roget's Thesaurus and the WordNet-based edge distance measures. The results are shown in Table 20. The "Filtered" column shows how many SAT problems could not be solved because at least one of the words needed was absent from either the Thesaurus or WordNet. Unfortunately, expanding Roget's Thesaurus did not reduce the number of filtered results. That said, both precision and recall increased when more words were added to the Thesaurus. Overall, we found that in absolute numbers the updated 1987X5 Roget's Thesaurus performed better than any other resource examined. Even the updated versions of the 1911 Thesaurus performed on par with the best WordNet-based systems. We must note, however, that none of the improvements of the 1987X5-based method over any given WordNet method are statistically significant.

Summary
We have described a method of automatically updating Roget's Thesaurus with new words. The process has two main steps: lists of semantically related words are generated, and then those lists are used to find a place for a new word in the Thesaurus. We have enhanced both steps by leveraging the structure of Roget's Thesaurus.
When creating lists of related words, we have evaluated a technique for measuring semantic relatedness which enhances distributional methods with lists of known synonyms. We have shown this to have a statistically significant effect on the quality of measures of semantic relatedness.
In the second step, the actual addition of new words to Roget's Thesaurus, we generated a list of neighbouring words and used them as anchors to identify where in the Thesaurus to place a new word. This process benefits from tuning on the actual Thesaurus. The task here is to find words which will be good anchors for determining where to place a new word. A related line of work learns a mapping between the context spaces of two different languages (Kennedy and Hirst 2012). Methods of this kind could be used to emphasise similarities between words based on sentiment, emotion or formality, rather than simply on synonymy. Using emotionally related words as a source of training data could enable the creation of a measure of semantic relatedness which favours words of the same emotional class over other, nearer, synonyms conveying a different emotion.
Perhaps other, more complex methods of adding new words to Roget's Thesaurus can be considered. For example, mixing rank and score (perhaps using machine learning) might lead to an even more accurate method. Other methods of identifying where in the Thesaurus to place a word could also be considered. In particular, Pantel's (2005) method could potentially be modified to work for Roget's Thesaurus.
Our method only adds individual words to Roget's Thesaurus. It should be possible to expand it to adding multi-word phrases. Many dependency parsers can identify noun phrases, and so can be used to create distributional vectors for such phrases. Adding multi-word phrases to verb or adjective Roget's groupings may be possible by identifying n-grams which are frequent in a text. Two problems arise. One is determining whether high frequency alone is a good enough reason to add a multi-word phrase. The second is how to represent such multi-word phrases. It could be possible to represent them by vectors of word-relation pairs for syntactically related words in the same sentence, but outside of the phrase being considered. The meaning of a phrase may also be deduced by composing the distributional vectors of its individual words. There is ongoing, and very interesting, research in this area (Mitchell and Lapata 2008; Turney 2012).
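To make the compositional option concrete, here is a small sketch using the two simplest operators from Mitchell and Lapata (2008); the toy vectors are invented for illustration, and a real system would derive them from corpus counts.

```python
import numpy as np

# Toy co-occurrence vectors over a fixed set of context features;
# the values are invented for illustration only.
mobile = np.array([0.7, 0.1, 0.4, 0.0])
phone  = np.array([0.6, 0.3, 0.1, 0.2])

# Two simple composition operators (Mitchell and Lapata 2008):
additive       = mobile + phone   # blends the contexts of both words
multiplicative = mobile * phone   # keeps only the shared contexts

def cosine(u, v):
    """Cosine similarity, the usual way to compare such vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# A composed vector for "mobile phone" could then be compared with
# the vectors of candidate anchor words already in the Thesaurus.
```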
A problem which we have not tackled yet is that of adding cross-references: if the same word appears in two places in Roget's Thesaurus, then a cross-reference often links the two occurrences. Making use of these cross-references could be a considerable undertaking, because it requires, amongst other things, some form of effective word-sense disambiguation.
The manual annotation has only been conducted on the 1911 version of Roget's Thesaurus, because it is the only version which can be released to the public, and because the annotation experiment has been very time-consuming. In the interest of completeness, the updates to the 1987 version could be evaluated similarly. We expect that those updates should actually be more accurate, because the 1911 version is both older and smaller. This would be in line with the automatic evaluation from Section 4, but it is yet to be confirmed manually.
It should be possible to adapt our methods of placing words in Roget's Thesaurus to work for WordNet. Instead of identifying words in the same POS, then Paragraph, then SG, word groupings could be created from WordNet's hypernym hierarchy. We envisage two ways of doing this. The first would be to pick a relatively high level within the hierarchy and classify each word into one or more of the synsets at that level, much as we did with the POS level. A synset could be represented by all the words in the transitive closure of its hyponym synsets. Next, the word would be propagated down the hierarchy, as we do with Paragraphs and SGs, until it can go no further, and then added to the synset there.
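A minimal sketch of this descent, using NLTK's WordNet interface; the overlap heuristic, the starting synset and the stopping rule are illustrative assumptions rather than a tested design.

```python
from nltk.corpus import wordnet as wn

def place_word(neighbours, start_synset):
    """Walk down the hypernym hierarchy from start_synset, at each
    step moving to the hyponym whose lemmas (plus those of all its
    descendants) overlap most with the new word's neighbour list."""
    neighbours = set(neighbours)
    current = start_synset
    while True:
        best, best_overlap = None, 0
        for hyp in current.hyponyms():
            # Represent the hyponym by the words in the transitive
            # closure of its own hyponyms, as described above.
            words = set(hyp.lemma_names())
            for s in hyp.closure(lambda s: s.hyponyms()):
                words.update(s.lemma_names())
            overlap = len(words & neighbours)
            if overlap > best_overlap:
                best, best_overlap = hyp, overlap
        if best is None:        # no hyponym matches: stop descending
            return current
        current = best

# Hypothetical usage:
# place_word({"smartphone", "handset"}, wn.synset("device.n.01"))
```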
This first method could not (yet) be applied to adjectives, and would only take one kind of relation into account when placing a word in WordNet. Another option is to create a neighbourhood of words for each synset, based on a variety of relations. A word could then be placed in a larger grouping of multiple synsets before the particular synset it belongs to is determined. If no synset can be picked, then a new synset can be created, with some sort of ambiguous link joining it to the other synsets in its neighbourhood. A hybrid of these two methods is also possible. Our first method could be enhanced by using not only a synset's terms, but also its close neighbours. This would expand the set of anchor words at the cost of introducing words common to multiple synsets.
It should also be possible to port our method to thesauri and wordnets in other languages. The main problem might be our method's reliance on a dependency parser. Such parsers are not yet available for many languages. Nonetheless, it could be possible to replicate much of the relevant functionality of a dependency parser with a part-of-speech tagger, and taggers are quite widely available. For example, one may assume that a noun can only be modified by other nouns or adjectives in its vicinity, and so use only those terms in constructing a distributional vector.
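A rough sketch of this tagger-only approximation, assuming NLTK's tokenizer and tagger; the window size and the tag filter are illustrative choices, not settings from our experiments.

```python
from collections import Counter
import nltk  # assumes the punkt and averaged_perceptron_tagger data

def noun_context_vectors(sentences, window=3):
    """Build bag-of-words context vectors for nouns using only a POS
    tagger: nearby nouns and adjectives stand in for the modifiers a
    dependency parser would identify."""
    vectors = {}
    for sent in sentences:
        tagged = nltk.pos_tag(nltk.word_tokenize(sent))
        for i, (word, tag) in enumerate(tagged):
            if not tag.startswith("NN"):
                continue
            for j in range(max(0, i - window),
                           min(len(tagged), i + window + 1)):
                if j == i:
                    continue
                w, t = tagged[j]
                # Keep only likely modifiers: nouns and adjectives.
                if t.startswith(("NN", "JJ")):
                    vectors.setdefault(word.lower(), Counter())[w.lower()] += 1
    return vectors
```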
Another direction which this kind of research could take would be to test the methods on adding words in a particular domain. Most of the words in Roget's Thesaurus come from everyday English, as opposed to, say, medical terminology. The nearest synonyms of such technical words will be other technical words. This could make it more difficult to actually add domain-specific terms to Roget's Thesaurus. That said, the trainable measure of semantic relatedness from Kennedy and Szpakowicz (2011, 2012) could be built using words of a particular domain. If domain-specific and everyday words could be grouped as near-synonyms, then an MSR could be trained for adding domain-specific terms to Roget's Thesaurus.
Similar to adding domain-specific words is the challenge of adding brand-new coinage to Roget's Thesaurus. Very new words may not have close synonyms in the Thesaurus, which is why we add words in multiple passes. It would be interesting to investigate how many passes are required before, say, the word "iPhone" is added to the Thesaurus. Closely related phrases like "mobile phone" or "smart phone" would need to be present already. Other terms, such as "cellular network", "texting" or "Apple", could also be useful in choosing where to place a word like "iPhone".
Finally, note that we have applied Roget's Thesaurus to only three NLP tasks, to demonstrate the value of both its structure and its language coverage. Many other applications of the Thesaurus are possible. Some obvious ones include real word-sense disambiguation and lexical substitution. Roget's Thesaurus has already been used in the construction of lexical chains (Morris and Hirst 1991; Jarmasz and Szpakowicz 2003a). Lexical chains might be applied to summarisation or text segmentation. Since the Thesaurus contains a large number of opposing concepts, it may be possible to apply it to lexical entailment as well.
NLP researchers are always on the hunt for newer and larger data sets on which to train and evaluate their experiments. Many of these experiments will require measuring semantic distance among huge sets of words. In the coming years, the trend towards analysing big data will drive the need for fast calculation of semantic relatedness. Roget's Thesaurus is uniquely suited for that.

Figure 1: The process of adding new words to Roget's Thesaurus

Figure 2: Head 586: Language, from the 1911 Roget's Thesaurus; the same word may be added to multiple SGs within the same Paragraph, or indeed to multiple Paragraphs in the same POS
Figure 3: Example of the annotator task for adding a word to a Paragraph
Figure 4: Example of the annotator task for adding a word to a POS
Table 4: Baseline for identifying the POS of a word on the tuning and test data
Table 7: … but somewhat worse for verbs and adjectives. As with the baseline, the results were better for the 1987 Roget's Thesaurus than for the 1911.
Table 9: Optimal parameters for X (the number of nearest neighbours) and W (the number of neighbours needed to insert a word into a Roget's grouping) at the POS, Paragraph and SG levels
Table 18: The improvement of 1911X1 and 1911X5 over the original 1911 version of the Thesaurus was statistically significant at p < 0.05, according to Student's t-test. The improvement of the updated 1987 versions was not statistically significant for 1987X1 (p ≈ 0.15), but it was significant for 1987X5. The 1911X5 version gave results comparable to the 1987 version. The Roget's-based methods were comparable to the best WordNet-based methods.