Automated Extraction of Tree Adjoining Grammars from a Treebank for Vietnamese

Le Hong Phuong, Nguyen Thi Minh Huyen, Nguyen Phuong Thai, Azim Roussanaly


In this paper, we present a system that automatically extracts lexicalized tree adjoining grammars (LTAG) from treebanks. We first discuss the extraction algorithms in detail and compare them to previous work. We then report the first LTAG extraction result for Vietnamese, using a recently released Vietnamese treebank. The implementation of an open source, language-independent system for automatic extraction of LTAG grammars is also discussed.

1. Introduction

Grammars in general, and lexicalized tree adjoining grammars in particular, are among the most important elements of natural language processing (NLP). Since the development of hand-crafted grammars is a time-consuming and labor-intensive task, many studies on automatic and semi-automatic grammar development have been carried out during the last decades.

After decades of NLP research concentrated mostly on English and other well-studied languages, recent years have seen an increased interest in less common languages, notably because of their growing presence on the Internet. Vietnamese, which is among the 20 most spoken languages, is one of those new focuses of interest. Obstacles remain, however, for NLP research in general and grammar development in particular: Vietnamese does not yet have large, readily available linguistic resources upon which to build effective statistical models, nor reference works against which new ideas may be tested.

Moreover, most existing research has focused on testing the applicability of methods and tools developed for English or other Western languages, under the assumption that their logical or statistical well-foundedness guarantees cross-language validity. In fact, such tools always make assumptions about the structure of a language, and these assumptions must be amended to adapt the tools to different linguistic phenomena. For an isolating language such as Vietnamese, techniques developed for inflectional languages cannot be applied “as is”.

The primary motivation for developing a system that can automatically extract an LTAG grammar for Vietnamese is the need for a wide-coverage grammar rich in statistical information, which may contribute more effectively to the development of basic linguistic resources and tools for automatic processing of Vietnamese text.

We present in this article a system that automatically extracts lexicalized tree adjoining grammars from treebanks. We first discuss the extraction algorithms in detail and compare them to previous work. We then report the first LTAG extraction result for Vietnamese, using the recently released Vietnamese treebank. The implementation of an open source, language-independent system for automatic extraction of LTAG grammars from treebanks is also discussed.

