High-level methodologies for grammar engineering. Introduction to the special issue

Grammar Engineering is the task of designing and implementing linguistically motivated electronic descriptions of natural language (so-called grammars). These grammars are expressed within well-defined theoretical frameworks, and offer a fine-grained description of natural language. While grammars were first used to describe syntax, that is to say, the relations between constituents in a sentence, they often go beyond syntax and include semantic information. Grammar engineering provides precise descriptions which can be used for natural language understanding and generation, making these valuable resources for various natural language applications, including textual entailment, dialogue systems, or machine translation. The first attempts at designing large-scale resource grammars were costly because of the complexity of the task (Erbach et al. 1990) and of the number of persons that were needed (see e.g. Doran et al. 1997). Advances in the field have led to the development of environments for semi-automatic grammar engineering, borrowing ideas from compilation (grammar engineering is compared with software development) and machine learning. This special issue reports on new trends in the field, where grammar engineering benefits from elaborate high-level methodologies and techniques, dealing with various issues (both theoretical and practical).


Grammar engineering
Grammar engineering, the task of designing and implementing linguistically motivated electronic grammars, has been an active field for decades, following seminal work by Chomsky (1957) on formal languages. The deep structures produced in this field contain rich information, which makes them a valuable resource for various NLP applications, including natural language parsing and generation, textual entailment, and dialogue systems. Among the reasons that make grammar engineering a complex task, one may cite the variety of theoretical frameworks used to represent linguistic information, and the intrinsic complexity arising from interactions between rules within large grammars.
In this context, the classical model of hand-crafted grammar has been replaced with more advanced techniques, which we may call semi-automatic grammar production. These techniques vary depending on the target formalism, the target language, or the target linguistic dimensions (e.g. syntax, semantics, morphology). In the following, we first report on the production of formal grammars (Section 1.1). We then report on the main resource grammars that are available (Section 1.2), and give a brief overview of the current issues in grammar engineering (Section 1.3). Finally, in Section 2, we summarise the contributions of this special issue, and in Section 3, we conclude on the current status of grammar engineering.

Semi-automatic production of formal grammars
Over the last decades, several approaches to formally describing natural language syntax have been proposed, starting with relatively basic string rewriting systems such as Context-Free Grammar (Chomsky 1956), and continuing with more elaborate constraint-based systems such as Head-driven Phrase Structure Grammar (Sag and Pollard 1987). These formal grammars differ in terms of expressive power and computational complexity. While theoretical research on formal grammar addresses the question of what expressive power is needed to describe natural language syntax (and of what computational cost it implies), more practical research is concerned with issues arising from building large grammars.
Large grammars often exhibit a high structural redundancy, especially when using lexicalised formalisms where each grammar rule is associated with at least one lexical item. 2 This high redundancy heavily affects grammar production and maintenance. Indeed, a single representation choice typically applies to many grammatical structures. Should it be modified, a costly revision of the grammar would be required.
In order to facilitate grammar engineering, two main approaches have been considered. The first approach, which we may call knowledge-driven grammar engineering, aims at formally describing the structures belonging to a grammar (which in turn describes natural language syntax). Such a formal description is defined by linguists using a description language. 3 Examples of such description languages include PATR-II (Shieber 1984), DATR (Evans and Gazdar 1996), and more recently XMG (Crabbé et al. 2013).
Such languages offer a well-defined syntax and semantics to express the relations between grammar structures. These relations are then automatically processed to build a set of structures (i.e., the target grammar). While working with description languages, grammar engineering becomes similar to software engineering. Indeed, both rely on developers working together on a source code, which can be processed to produce some information (e.g. some binary code in one case, or some syntactic structures in the other case).
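The compilation step described above can be illustrated with a minimal sketch. It assumes a toy description language in which a class is either a leaf fragment, a conjunction of classes, or a disjunction of classes; the class names and fragment labels are hypothetical, and the code does not reproduce the syntax or semantics of PATR-II, DATR, or XMG.

```python
from itertools import product

# Toy description (hypothetical names): shared fragments are written once
# and reused by every class that refers to them.
DESCRIPTION = {
    "CanonicalSubject": ("leaf", "subj:NP-before-V"),
    "RelativisedSubject": ("leaf", "subj:relative-pronoun"),
    "Subject": ("or", ["CanonicalSubject", "RelativisedSubject"]),
    "CanonicalObject": ("leaf", "obj:NP-after-V"),
    "WhObject": ("leaf", "obj:fronted-wh"),
    "Object": ("or", ["CanonicalObject", "WhObject"]),
    "Intransitive": ("and", ["Subject"]),
    "Transitive": ("and", ["Subject", "Object"]),
}


def expand(name, description):
    """Return every structure (a tuple of fragments) described by class `name`."""
    kind, body = description[name]
    if kind == "leaf":
        return [(body,)]
    if kind == "or":
        # A disjunction contributes the structures of each alternative.
        return [s for child in body for s in expand(child, description)]
    if kind == "and":
        # A conjunction combines one structure from each child class.
        child_sets = [expand(child, description) for child in body]
        return [sum(combo, ()) for combo in product(*child_sets)]
    raise ValueError(f"unknown class kind: {kind}")


def compile_grammar(description, families):
    """Process the description into the target grammar, one entry per family."""
    return {family: expand(family, description) for family in families}


grammar = compile_grammar(DESCRIPTION, ["Intransitive", "Transitive"])
for family, structures in grammar.items():
    print(family, len(structures))
```

In this sketch, changing the single `CanonicalSubject` fragment changes every compiled structure that uses it, which is the kind of factorisation that makes revisions cheaper than editing each structure by hand.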
The second approach to grammar engineering, which we may call data-driven, aims at acquiring the structures belonging to a grammar from annotated corpora (so-called treebanks) (Abeillé 2003). The complexity of grammar engineering is moved from designing grammar rules to designing learning algorithms. Examples of such grammar induction include work by Charniak (1994, Chapter 7), Villavicencio (2002), or Cahill et al. (2005). As is the case with statistical approaches in general, grammar learning is sensitive to the corpus used to infer the grammar, and it requires large annotated corpora, which may be lacking when working on under-resourced languages. Still, data-driven approaches to grammar learning showed promising results for English, in particular in terms of coverage (Cahill et al. 2008).
2 Such lexicalised formalisms are particularly interesting for the lexicon and can be seen as a mapping between a word and its various uses in a sentence, and parsing complexity is reduced since only the grammar rules associated with the input words need to be considered.
3 In this respect, one may consider that the grammar description itself is a linguistically motivated description of natural language.
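As an illustration of the data-driven approach, the following minimal sketch reads context-free rules off bracketed trees and estimates rule probabilities by relative frequency. This is only the simplest form of treebank grammar induction; the two-tree treebank is invented for the example, and the works cited above target richer formalisms with more elaborate algorithms.

```python
from collections import Counter, defaultdict


def parse_tree(text):
    """Parse a bracketed tree such as '(S (NP she) (VP (V sleeps)))'.

    Internal nodes become (label, children) pairs; words stay plain strings.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] == "(":
            label = tokens[pos + 1]
            pos += 2
            children = []
            while tokens[pos] != ")":
                children.append(read())
            pos += 1  # skip the closing bracket
            return (label, children)
        word = tokens[pos]
        pos += 1
        return word

    return read()


def rules(tree):
    """Yield one (lhs, rhs) rule per internal node of the tree."""
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, rhs)
    for child in children:
        if not isinstance(child, str):
            yield from rules(child)


def induce_pcfg(treebank):
    """Estimate P(rhs | lhs) by relative frequency over the treebank."""
    counts = Counter(r for text in treebank for r in rules(parse_tree(text)))
    lhs_totals = defaultdict(int)
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}


# Invented toy treebank, for illustration only.
TREEBANK = [
    "(S (NP she) (VP (V sleeps)))",
    "(S (NP he) (VP (V reads) (NP books)))",
]
pcfg = induce_pcfg(TREEBANK)
for (lhs, rhs), prob in sorted(pcfg.items()):
    print(f"{lhs} -> {' '.join(rhs)}  [{prob:.2f}]")
```

The sketch also makes the corpus sensitivity visible: any construction absent from `TREEBANK` receives no rule at all, so the induced grammar cannot describe it.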
These two approaches can also be seen as complementary. While knowledge-based methods make it possible to design precision grammars into which one can integrate various extra information (e.g. semantic structures), they often struggle to scale up to unrestricted text. On the other hand, while data-driven methods make it possible to build robust grammars, which can achieve very good results in terms of coverage, these automatically acquired grammars sometimes fail to describe very infrequent linguistic phenomena.

Resource grammars
Among the many formalisms which have been proposed to describe natural language syntax, some have been used in practice to develop core or large grammars for a wide range of languages. Formalisms for which electronic grammars are available include (in chronological order of publication): Tree Adjoining Grammar (TAG, Joshi et al. 1975), Lexical Functional Grammar (LFG, Kaplan and Bresnan 1982), Head-driven Phrase Structure Grammar (HPSG, Sag and Pollard 1987), Combinatory Categorial Grammar (CCG, Steedman 1987), Interaction Grammar (IG, Perrier 2000), and Property Grammar (PG, Blache 2005).
Many efficient integrated grammar development environments based on description languages have been created for these formalisms, such as XLE (Butt et al. 1999) for LFG, ALE (Carpenter and Penn 1999), TRALE (Meurers et al. 2002) and LKB (Copestake 2002) for HPSG, or DotCCG (Baldridge et al. 2007) and GF (Ranta 2011) for CCG. Such environments made it possible to develop large grammars for several languages, see Table 1. Such grammars have been used in practical applications such as machine translation (Lønning and Oepen 2006), textual adventure games (Benotti 2009), or second language learning (Perez-Beltrachini et al. 2012).
At the same time, efficient learning algorithms have been developed to induce large grammars from annotated corpora for some of these formalisms, see e.g. Xia (1999) for TAG, Cahill et al. (2002) for LFG, Miyao et al. (2005) for HPSG, or Hockenmaier and Steedman (2002) for CCG. These automatically learned grammars have been evaluated on test suites, and often used in practical applications such as semantic construction (Bos et al. 2004), dialogue systems (Foster et al. 2005), or machine translation (Birch et al. 2007). 4

Current issues
As mentioned above, the field of grammar engineering has been active for several decades. It has succeeded in providing the research community with both large resources for a wide range of languages, and techniques for efficient grammar production. By efficient, we mean that knowledge-based approaches now offer expressive and modular description languages, together with tools for computer-aided grammar design. In the same spirit, data-driven approaches now offer generic algorithms and frameworks which can be applied to the induction of grammars for many formalisms and languages (provided treebanks are available for these languages).
Still, the field has a lot more to offer: ongoing projects aim at removing existing barriers in grammar engineering, such as the lack of enhanced grammar development techniques and tools, which would facilitate grammar debugging, grammar evaluation, or collaborative grammar design. Attempts at providing such techniques include work by Gardent and Kruszewski (2012) on debugging and by Hoetmer (2005) and Sygal and Wintner (2011) on grammar design.
Another current issue in grammar engineering concerns parsing efficiency. Indeed, parsing complexity depends not only on the length of the input sentence, but also on the grammar size. In order to parse sentences using large grammars, several options have been considered, including the on-line (symbolic or probabilistic) selection of a sub-part of the grammar (Zhang et al. 2009; Gardent et al. 2014), 6 or parsing using factorised grammars (Carroll et al. 2000; Villemonte De La Clergerie 2010).
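The idea of selecting a sub-part of the grammar before parsing can be sketched as follows, assuming a lexicalised grammar stored as a mapping from words to the structures they anchor. The grammar entries, structure names, and scores are hypothetical, and the score-based pruning is only a crude stand-in for a trained probabilistic selector; it is not the method of the works cited above.

```python
# Toy lexicalised grammar (hypothetical): each word anchors some structures.
LEXICALISED_GRAMMAR = {
    "sleeps": ["intransitive-verb-tree"],
    "reads": ["intransitive-verb-tree", "transitive-verb-tree"],
    "she": ["noun-phrase-tree"],
    "books": ["noun-phrase-tree", "transitive-verb-tree"],
    "quickly": ["adverb-tree"],
}

# Hypothetical scores, standing in for the output of a trained model.
SCORES = {
    ("reads", "transitive-verb-tree"): 0.8,
    ("reads", "intransitive-verb-tree"): 0.2,
    ("books", "noun-phrase-tree"): 0.9,
    ("books", "transitive-verb-tree"): 0.1,
}


def select_subgrammar(sentence, grammar):
    """Symbolic selection: keep only structures anchored by words of the sentence."""
    words = sentence.lower().split()
    return {w: list(grammar[w]) for w in words if w in grammar}


def prune(subgrammar, scores, keep=1):
    """Probabilistic selection: keep the `keep` best-scored structures per word."""
    return {
        word: sorted(
            structures,
            key=lambda s: scores.get((word, s), 0.0),
            reverse=True,
        )[:keep]
        for word, structures in subgrammar.items()
    }


subgrammar = select_subgrammar("She reads books", LEXICALISED_GRAMMAR)
pruned = prune(subgrammar, SCORES)
print(sum(len(v) for v in LEXICALISED_GRAMMAR.values()), "structures in the full grammar")
print(sum(len(v) for v in subgrammar.values()), "after symbolic selection")
print(sum(len(v) for v in pruned.values()), "after score-based pruning")
```

The parser then only has to consider the structures that survive selection, which reduces the effective grammar size for each input sentence; the price of aggressive pruning is that a needed structure may be discarded.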
Other challenges include multilingual and cross-framework grammar engineering. While there exist several projects aiming at building parallel grammars (see e.g. Butt et al. 2002; Flickinger et al. 2012), cross-framework grammar engineering has not (yet) achieved the same results. One may cite seminal work by Clément and Kinyon (2003) on the description of parallel TAG-LFG from a common abstract description (called metagrammar), or more recent work by Crabbé et al. (2014) on the design of a constraint-based description language which could be applied to the description of grammars belonging to distinct formalisms. In the latter, the authors show how to enrich the description language to support several target formalisms, while in the former the authors show how to project a common description to several target formalisms (the metagrammar could be seen to some extent as a universal grammar). Without going as far as designing a universal grammar, grammar reusability (i.e., sharing information between grammars) remains an important challenge.
4 Some of these automatically acquired grammars are available on-line, see e.g. http://web.engr.illinois.edu/~juliahmr/CCGlexicon/index.html for CCG, or http://lfg-demo.computing.dcu.ie/lfgparser.html for LFG.
6 Following seminal work by Bangalore and Joshi (1999), this selection is often called supertagging.
Another interesting topic concerns grammar interfaces. One of the motivations behind grammar engineering is the possibility to build rich semantic representations. The definition and implementation of a syntax / semantics interface within large grammars is an active field (see e.g. Gardent 2008; Kallmeyer and Osswald 2013), for both theoretical (definition / selection of an adequate semantic formalism) and technical (limited grammar readability and extensibility) reasons.
Describing under-resourced languages is also an active field within the grammar engineering community. The objective is twofold. Grammar engineering can help to (i) better understand e.g. minority languages (by implementing linguistic theories and checking how this implementation compares with field data), and (ii) provide electronic grammars (which would make it possible to develop NLP applications for these languages, and/or build core treebanks, which could in turn be beneficial to grammar engineering). Recent knowledge-based attempts at creating linguistic resources for under-resourced languages include work by Bender (2008) and Duchier et al. (2012).
Last but not least, in order to improve grammar coverage, novel ideas are needed. As mentioned above, a first step towards a better grammar coverage was to automatically learn the grammar from annotated corpora. In order to get grammars with a better coverage while keeping a high precision, hybrid techniques involving both knowledge-based and data-driven methods are needed. Seminal work by Baldwin et al. (2005) expresses the same concerns.
In order to improve grammar coverage, one major issue needs to be addressed, namely, Multi-Word Expressions (MWEs). Such expressions are often ignored when designing core grammars, while they frequently appear in unrestricted text. Work on MWE detection for enhancing parsing with HPSG has been done by Zhang et al. (2006), where the authors use parsing error mining techniques to detect whether unknown words belong to some MWE which is in turn included in the lexicon. Further work in this field is needed to improve grammar precision and coverage.

Contributions to this special issue
This special issue contains contributions dealing with several aspects of grammar engineering, namely description languages (Clément et al.), grammar extraction (Le-Hong et al.), syntax / semantics interface (Lichte and Petitjean), grammar coverage (Moot), multilingual grammars (Müller), and grammar development and maintenance (Perrier and Guillaume).
Clément, Kirman, and Salvati present a logic-based grammar description formalism. They use this formalism to describe both mildly context-sensitive grammars and their semantic interpretation. As an illustration, this formalism is applied to the (syntactic and semantic) description of several linguistic phenomena related to extraction in Dutch, English, and German.
Le-Hong, Roussanaly and Nguyen present the development of a linguistic resource for Vietnamese using the TAG formalism. The authors first show how to semi-automatically extract such a grammar from a treebank of Vietnamese. In a second step, they use this grammar for deep parsing. In particular, they present a complete pipeline for parsing Vietnamese sentences to produce constituent and dependency structures.
Lichte and Petitjean present an extension of the XMG description language with a new linguistic dimension based on semantic frames. In their approach, the authors aim at offering a description language which can be used to express various constraints on types. They apply this formalism to the definition of a syntax / semantics interface within an English TAG.
Moot reports on the development of a type-logical treebank for French, and its use for wide-coverage syntactic and semantic parsing. This article contains information about the various tasks involved in the development of a competitive type-logical parser for French using an automatically-extracted broad-coverage type-logical grammar.
Müller presents the CoreGram project, which aims at providing HPSG grammars for various typologically distinct languages. In this approach, a multilingual grammar is used to represent a common core shared by these languages. This article gives theoretical linguistic motivations behind multilingual grammars, along with theoretical grammar development concepts, and information about the concrete implementation of the corresponding HPSG grammars.
Perrier and Guillaume present FriGram, a broad-coverage French IG, which relies on a modular architecture and can be interfaced with various lexicons. This article also addresses grammar design and maintenance issues by presenting grammar-consistency principles which are implemented within FriGram. The authors also report on the current status of the grammar (coverage, comparison with other resource grammars for French, evaluation).

Conclusion
In this introduction, we gave an overview of past and recent advances in the field of grammar engineering. We presented the main approaches for semi-automatic grammar production, namely knowledge-based approaches, which rely on linguistically motivated descriptions of formal grammar designed by experts, and data-driven approaches, which rely on robust broad-coverage grammars extracted from large annotated corpora.
We also reported on existing available resource grammars for various languages and grammar formalisms, and summarised current issues in grammar engineering. These issues include the lack of techniques and tools for easier grammar extension and maintenance (e.g. debugging facilities), a sometimes low parsing efficiency when dealing with large grammars, the limited coverage of hand-crafted grammars (especially regarding multi-word expressions), the difficulty of interfacing syntax with other linguistic dimensions, and a weak reusability between grammars belonging to different formalisms or describing different languages.
Finally, we gave a brief overview of the contributions to this special issue, which cover both knowledge-based and data-driven approaches, along with several grammar formalisms (namely CCG, HPSG, TAG, IG), several linguistic dimensions (syntax and semantics), and several languages (including English, Dutch, German, French, Danish, and Persian).

Acknowledgements
As guest editors of this special issue, we would like to thank the members of the editorial board and editorial team of the Journal of Language Modelling for their work regarding the reviewing, copy-editing, and typesetting of the articles submitted to this issue. We are also grateful to the members of the guest editorial board. This issue would not have been possible without their valuable contribution to the reviewing process. Finally, we would like to express our gratitude to Adam Przepiórkowski and Agnieszka Mykowiecka for their support throughout the editorial process.