A Case Study in POS Tagging of Vietnamese Texts
Thi Minh Huyen Nguyen (1), Laurent Romary (1) and Xuan Luong Vu (2)
Abstract
In this article, we discuss the construction of tagsets for the morpho-syntactic analysis of Vietnamese, taking into account the linguistic specificities of this language. This construction is inspired by the MULTEXT model, with a view to multilingual applications and to the reusability of tagsets. Finally, we describe an experiment in POS tagging of Vietnamese texts using QTAG (Mason, 1998), a language-independent probabilistic tagger.
Keywords
part-of-speech (POS), text corpus, POS tagging, MULTEXT, standardization, QTAG
Each word in a language potentially has one or more parts of speech (POS), depending on the context in which it is used. POS tagging is the identification of morpho-syntactic categories over a continuous stream of word tokens. This task is essential for any further step of language processing: syntactic analysis, semantic processing, or even pragmatic processing.
A word token in POS tagging generally does not correspond to a traditional word, because text is segmented blindly, without syntactic or semantic information. A traditional word may be divided into several word tokens or morphemes (in the case of amalgams or compound words, for example). Conversely, a sequence of words may be grouped into a single word token: locutions, compound proper nouns, compound numerals, compound words, etc. As for the tagset, depending on the token definition and/or the application, descriptions of morpho-syntactic classes may include one or several features such as syntactic category, lemma, gender, number, etc. In (Przepiórkowski, 2003), the authors propose a new, purely morpho-syntactic classification. They argue that several existing Polish tagsets are linguistically naïve because they uncritically adopt traditional POS classes, which leads to a lack of reusability. (Tufis, 1998) proposes a two-layer tagset in order to reduce the time and memory costs of tagging with a large tagset of 700 tags.
There currently exist various tools for POS tagging or morpho-syntactic annotation, as well as a large body of annotated corpora built for different purposes and for many languages. Treebank projects are examples of large-scale annotated corpus building. This also means that, depending on the purpose, various token definitions and tagset definitions coexist. In the framework of Multext (Ide, 1994) and Multext-East (Ide, 1996), tagsets for about ten languages are defined with a high degree of consensus on the description structure.
Consequently, questions of great interest are the reusability of these linguistic resources in an increasing number of applications, their comparability in a multilingual framework, and the extensibility of a tool to other languages. Many projects have been initiated with this in view: tool evaluation, and standardization of morpho-syntactic description structures and their representation (Ide, 2001).
For Vietnamese texts, POS tagging is a new and difficult task for computer scientists, particularly because the linguistic community disagrees on the traditional classification of words. To date there is no recognized standard for Vietnamese word categories. Our research has two concurrent objectives: on the one hand, to create tools and linguistic resources for automated Vietnamese text processing in engineering applications; on the other hand, to make these tools available as a support for linguists working on Vietnamese.
After a brief state of the art in POS tagging, we present the important linguistic specificities of Vietnamese in order to define a POS tagset. We deliberately take the MULTEXT model into account in our tagset construction because of the trend towards multilingual applications. This tagset is evaluated with the stochastic tagger QTAG (Mason, 1998).
2. Previous Works on POS Tagging
2.1. Methodology and Evaluation
POS tagging generally consists of three steps: tokenization, which identifies the lexical units in the text; lexical lookup, which produces the possible tags for each word; and disambiguation, which selects the appropriate tag for every word. There are two principal approaches to disambiguation: rule-based methods and probabilistic methods.
Rule-based methods use a set of grammatical rules to solve the tagging problem. Unsupervised methods use constraint rules written by linguists and a dictionary in which each word is listed with its possible tags. Such a tagger resembles a parser. The supervised method (Brill, 1992) learns tags and transformation rules from a manually tagged corpus. In the lexical lookup step, each word is assigned the tag most frequently recorded for it in the lexicon. Transformation rules then iteratively correct this initial tagging.
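The supervised rule-based idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Brill's actual implementation: the toy corpus, the default tag, the rule format (from_tag, to_tag, previous_tag) and all function names are hypothetical, and the rule is written by hand rather than learned.

```python
from collections import Counter, defaultdict

def train_most_frequent(tagged_corpus):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_corpus:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def initial_tagging(words, lexicon, default_tag="N"):
    """Lexical lookup step: assign every word its most frequent tag."""
    return [lexicon.get(w, default_tag) for w in words]

def apply_rules(tags, rules):
    """Apply rules in order; each rule is (from_tag, to_tag, previous_tag)."""
    tags = list(tags)
    for from_tag, to_tag, prev_tag in rules:
        for i in range(1, len(tags)):
            if tags[i] == from_tag and tags[i - 1] == prev_tag:
                tags[i] = to_tag
    return tags

# Toy training data (hypothetical, for illustration only).
corpus = [
    [("the", "D"), ("run", "N")],
    [("the", "D"), ("run", "N")],
    [("to", "S"), ("run", "V")],
]
lexicon = train_most_frequent(corpus)
rules = [("N", "V", "S")]  # change N to V when the previous tag is S
print(apply_rules(initial_tagging(["to", "run"], lexicon), rules))
```

Here "run" first receives its most frequent tag N, and the contextual rule then corrects it to V after an adposition. A real transformation-based learner would select such rules automatically by measuring how many errors each candidate rule removes.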
Probabilistic methods make use of a probability distribution over the space of possible associations between word sequences and tag sequences. This distribution is estimated from a training corpus, with or without tags. Tag disambiguation then amounts to choosing the tag sequence that maximizes the conditional probability of association with the word sequence. These methods rely on two probabilistic hypotheses: the probability of a word given its tag depends only on that tag, and the probability of a tag depends only on its neighbouring tags within a window of fixed size.
The performance of tagging systems is generally measured by the precision rate (at word level), which depends strongly on the nature and size of the tagset. Reported results are almost always above 90%. The best results obtained in the GRACE³ evaluation were 97.8%, 96.7% and 94.8%.
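In symbols, the disambiguation criterion and the two hypotheses can be written as follows for the simplest case, where the fixed-size context is the single preceding tag (a bigram model). The notation is ours and is an assumption, not taken from the original text: w_i are the words, t_i the tags, n the sentence length, and t_0 a sentence-start marker.

```latex
\hat{t}_{1:n}
  = \arg\max_{t_{1:n}} P(t_{1:n} \mid w_{1:n})
  = \arg\max_{t_{1:n}} P(w_{1:n} \mid t_{1:n})\, P(t_{1:n})
  \approx \arg\max_{t_{1:n}} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})
```

The second equality follows from Bayes' rule, since P(w_{1:n}) does not depend on the tag sequence. For a larger window, P(t_i | t_{i-1}) is replaced by a probability conditioned on the k preceding tags.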
2.2. Standardization aspect of POS tagging
Many efforts have been made towards the standardization of data, tools, and linguistic resources, in order to maximize reusability in corpus-based language engineering research and applications. One such effort is the MULTEXT (Multilingual Text Tools and Corpora) project. In the framework of this project, a morphosyntactic model was developed to harmonize multilingual corpus tagging and to make tagged corpora comparable. The project emphasizes that, in a multilingual context, identical phenomena should be encoded in a similar way to facilitate multiple applications (e.g. automatic alignment, multilingual terminology extraction, etc.).
One principle of the model is to separate lexical descriptions, which are generally stable, from corpus tags. For lexical descriptions, the model uses two layers: a kernel layer for common morphosyntactic categories, and a private layer containing additional information that is specific to each language or to particular applications. The compromise adopted for the morphosyntactic tagset in the common kernel is a set of 11 categories: Noun (N), Verb (V), Adjective (A), Pronoun (P), Determiner (D), Adverb (R), Adposition (S), Conjunction (C), Numeral (M), Interjection (I), Residual (X). Optional information in the second layer is represented by attribute-value pairs (typed feature structures). For example, a singular masculine common noun is represented by N[type=common gender=masculine number=singular case=n/a] (contracted form Ncms-).
Still, to cover a wider variety of languages, more flexibility clearly has to be introduced into this underlying framework. The study we present on the Vietnamese tagset shows that some categories may or may not match the actual linguistic objects of this language. From a standardization point of view, this means that one further step is either to describe a whole ontology of categories (as suggested by Farrar et al., 2002), or to register the variety of possible descriptors across languages by implementing a meta-data registry (cf. Ide & Romary, 2001). These two options are not necessarily contradictory, since elementary data categories may point to nodes in the ontology, making it possible to compare tagsets across languages, but also within one given language. Indeed, for a given language like Vietnamese, an annotation scheme may rely on several layers of tagset granularity, and this should be taken into account. The following section demonstrates such a strategy, which could lead in particular to a proposal for a reference set of descriptors for Vietnamese, in the context of ISO committee TC37/SC4.
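The contraction from an attribute-value description to a compact tag can be illustrated with a short Python sketch. The attribute order and the one-letter value codes below are inferred from the single example "Ncms-" and are assumptions; the table `NOUN_ATTRIBUTES` and the function `encode` are hypothetical names, not part of any MULTEXT tool.

```python
# Attribute order and one-letter value codes for nouns. Only the values needed
# for the example in the text are listed; the codes are inferred from "Ncms-".
NOUN_ATTRIBUTES = [
    ("type", {"common": "c"}),
    ("gender", {"masculine": "m"}),
    ("number", {"singular": "s"}),
    ("case", {}),
]

def encode(category, attributes, features):
    """Build a positional tag: category letter, then one character per attribute."""
    chars = [category]
    for name, codes in attributes:
        value = features.get(name)
        # "-" marks an attribute that is absent, unlisted or not applicable.
        chars.append(codes.get(value, "-"))
    return "".join(chars)

noun = {"type": "common", "gender": "masculine", "number": "singular"}
print(encode("N", NOUN_ATTRIBUTES, noun))  # Ncms-
```

Because each attribute has a fixed position, two tags can be compared attribute by attribute, which is what makes such descriptions comparable across languages.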
3. Vietnamese Tagset Definition
Vietnamese is an isolating language, in which each word has a single form and cannot be modified by derivation or inflection. Grammatical relations are expressed not by inflection but by word order. Hence POS classification cannot rely on morphology and is far from trivial.
Vietnamese has a special unit called "tiếng", which corresponds at the same time to a syllable from the phonological point of view, a morpheme from the syntactic point of view, a semanteme with respect to word structure, and a word with respect to the formation of sentence constituents. There are three kinds of "tiếng":
1. "tiếng"s with a real meaning, such as sông (river), núi (mountain), đi (go), đứng (stand), nhớ (remember), thương (love tenderly)..., which can stand alone as sentence constituents and have full semantic and syntactic behaviour. These are called typical words.
2. "tiếng"s such as nhưng (but), mà (that), tuy (though), nên (so)..., which cannot form a sentence constituent on their own but are used to build sentence constituents and carry syntactic meaning as typical words do. These are called tool words.
3. "tiếng"s that come from Chinese, such as sơn (mountain), thuỷ (water), gia (home), bất (not)..., or that have an unclear meaning and usually combine with another syllable, such as cộ (xe cộ - vehicle), đẽ (đẹp đẽ - beautiful), vẻ (vui vẻ - joyful)... These serve to form words and can occasionally be used as words.
Among the various definitions of the Vietnamese word, linguists unanimously agree in regarding the word as the smallest unit that has a fully specified meaning and a stable structure and that is used to compose sentence constituents. The Vietnamese lexicon contains:
● Simple words, or monosyllabic words, corresponding to "tiếng" of kinds 1 and 2.
● Complex words, which have more than one syllable. There are three main types of syllable combination: phonetic reduplication (e.g. trắng/white - trăng trắng/whitish), semantic coordinated compounds (e.g. quần/trousers, áo/shirt - quần áo/clothes) and semantic major/minor compounds (e.g. xe/vehicle, đạp/pedal - xe đạp/bicycle). There are also some compound words whose syllable combination is no longer recognizable (bồ nông/pelican).
● Idioms and locutions, which are generally considered lexical units in sentence constituents.
Because compound words are very frequent, tokenizing Vietnamese text is rather complicated.
3.2. POS Tagset
The classification of grammatical categories for Vietnamese is still debated within the linguistic community. The difficulty comes from the unclear boundaries between the grammatical roles of many words. Verb-noun category shifts are rather systematic (without any morphological variation). In general, every "determiner" can be used as a noun. Even a "preposition" (e.g. trên/above, trong/in) can appear in a noun role (trên/the superior, trong/the interior), etc. In this section we present our approach to defining a POS tagset suitable for applications in Natural Language Processing.
Grammatical categories reflect various oppositions in the syntactic system, so the main criterion for our tagset definition is syntactic distribution. Reflecting all syntactic relations exactly would require a huge tagset, but the larger the tagset, the harder the annotation task. We therefore need a compromise: a tagset that is good enough and of acceptable size. We start with a small tagset that is generally agreed upon in the literature and appears in different Vietnamese dictionaries. This tagset includes nine categories: Noun (N), Verb (V), Adjective (A), Pronoun (P), Adjunct (J), Conjunction (C), Interjection (I), Modal Word (E) and Residual (X). Our task is then to define a new tagset by subdividing each of these categories into more specific tags.
Inspired by the construction principles of the Multext model, we try to elaborate Vietnamese lexical specifications in a schema comparable with that model. Compared with the Multext model, numerals and determiners fall into the Vietnamese Noun class and adpositions into the Vietnamese Conjunction class, while the Vietnamese Adjunct class contains adverbs and noun adjuncts (pluralizing nouns), and the Modal Word class is specific to Vietnamese. To conform to Multext, we add a Numeral category to the tagset above. This class takes all cardinal and ordinal numbers from the Noun class and all noun adjuncts from the Adjunct class. Consequently only adverbs remain in the Adjunct class. We do not try to recover the Determiner and Adposition classes of the Multext model because of the particularities of Vietnamese grammar. Finally we obtain this first-level tagset: Noun (N), Verb (V), Adjective (A), Pronoun (P), Adverb (J), Conjunction (C), Numeral (S), Interjection (I), Modal Particle (M), Residual (X). Below, we present some basic lexical specifications for each category, based on possible lexical combinations.
● Noun: Only the type attribute (common or proper) of the Multext model is relevant for Vietnamese. We therefore define some new attributes⁵, with their values given between square brackets: collective [yes (cây cối/vegetation), no (cây/plant)], sense [object (nhà/house), plant (lúa/rice), animal (mèo/cat), human (học sinh/student), material (sắt/iron), abstract (tình cảm/sentiment), fact (sự/event), space (trong/inside), time (ngày/day), senses (màu/colour), style (giáo sư/professor)], countable [absolute (cái/thing⁶), partial (bàn/table), no (nhân dân/people)], and unit [classifier (cái/the-a), collective (bộ/set), exact measurement (lít/liter), rough measurement (nắm/handful)].
● Verb: As Vietnamese verbs are not inflected at all, none of the attributes defined in the Multext model is relevant for Vietnamese. We therefore create the following new attributes specific to Vietnamese: transitive [yes (viết/write), no (ngủ/sleep)], sense [psychology (tin/believe), discourse (nói/tell), direction (lên/rise), movement (chạy/run), existence (mất/lose), transformation (trở thành/become), volition (muốn/want), acceptation (bị/suffer), comparison (bằng/equal), residual (viết/write)]. Additionally, the special verb "là" has its own tag.
● Adjective: The sole attribute for Vietnamese adjectives is type [quality, quantity] (e.g. đẹp/nice, cao/high).
● Pronoun: The main attribute of interest for this category is its type, as there is no case, gender or number in Vietnamese grammar. A suitable value set for this attribute is: personal (e.g. tôi/I, chị/you-sister), temporal (e.g. bây giờ/now), demonstrative (e.g. đây/here, này/this), quantitative (e.g. tất cả/all, bấy nhiêu/that much), predicative (e.g. thế/that), and interrogative (e.g. ai/who, gì/what).
● Adverb: Vietnamese adverbs are very important tool words for expressing tense, modifying the degree of a predicate, etc. We define a new value set for the type attribute: time (e.g. đã/already, sẽ/be going to), degree (e.g. rất/very, quá/too), continuation/similarity (e.g. cũng/also, vẫn/still), negation (e.g. không/not) and imperative (e.g. hãy/let, đừng/do not). A further attribute is added: position [pre, post] (e.g. đã/already, rồi/already).
● Conjunction: This category has a type attribute with two values: subordinating (e.g. của/of, do/because of, để/before) and coordinating (e.g. và/and, nhưng/but, nếu ... thì/if ... then).
● Numeral: The values of the type attribute for the Numeral class are: cardinal (một/one), ordinal (nhất/first), adjunct (những/pluralizing).
● Interjection: No attribute for this category.
● Modal Word: We distinguish two types of words in this category: particles, which are words added to a sentence in order to change its intensity, and copulatives, which are words added at the beginning or at the end of a sentence in order to express the speaker's feelings.
● Residual: Lexical units such as expressions, which have no specific classification.
Other features could also be added for different purposes (e.g. information on compound form, reduplication form, etc.). From these lexical specifications, we derive a second-level tagset of 48 POS tags. In the next section, we present the application of a probabilistic tagger to Vietnamese text with the two tagsets defined above.
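The relation between the two tagset levels can be sketched as a simple projection. The sketch assumes that a second-level tag begins with the letter of its first-level category, which is inferred from the example tags given later in the text (Nt, Vto, Sc, Pp, Jt, Aa) and is not stated explicitly by the authors; `FIRST_LEVEL` and `first_level` are hypothetical names.

```python
# First-level categories as listed in the text.
FIRST_LEVEL = {"N", "V", "A", "P", "J", "C", "S", "I", "M", "X"}

def first_level(tag):
    """Return the first-level category of a second-level tag such as 'Vto'."""
    category = tag[:1]
    if category not in FIRST_LEVEL:
        raise ValueError(f"unknown category in tag {tag!r}")
    return category

print([first_level(t) for t in ["Nt", "Vto", "Sc", "Pp", "Jt", "Aa"]])
```

Under this assumption, a corpus annotated with the second-level tagset can be reduced to the first-level tagset without re-annotation.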
4. The Tagging Process
No concrete initiative on Vietnamese POS tagging has been reported until now. We started our research with a Vietnamese lexicon in which each word is given its possible tags from the above-mentioned tagset. As discussed in section 3.2, the second-level tagset that we chose is a compromise intended to avoid too large a tagset. The only way to justify the choice of a tagset is to apply it to real corpora, so that the syntactic distribution can be checked. A tool that automatically tags a corpus with a given tagset is therefore useful. We use the QTAG tool (Mason, 1998) for this purpose.
QTAG is a language-independent stochastic tagger. It learns the lexicon, the tagset, and the lexical and contextual probabilities from a manually tagged corpus. Based on these learned data, the tagger looks up the possible tags, with their frequencies, for each token of a new tokenized corpus. If a token is not found in the learned lexicon, the tagger guesses its possible tags from its morphological form. In the worst case, all tags in the tagset are assigned to that token. Finally, the tagger performs disambiguation using the learned probability distributions.
When QTAG is used for Vietnamese, a language without morphological variation in which a word's form carries no linguistic information, the morphological guesser is disabled. We now concentrate on the construction of the lexicon and the training corpus, and then evaluate the results obtained.
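The lookup step described above can be sketched as follows. This is not QTAG's code or API: the function name `candidate_tags`, the lexicon layout (token mapped to tag counts), the sample counts and the choice of equal weights for unknown tokens are all assumptions made for illustration, and the morphological guesser is left out.

```python
def candidate_tags(token, lexicon, tagset):
    """Return (tag, relative frequency) pairs for a token.

    lexicon maps a token to {tag: count} as learned from the tagged corpus.
    Unknown tokens receive every tag of the tagset with equal weight,
    which corresponds to the worst case; no morphological guessing is done.
    """
    counts = lexicon.get(token)
    if not counts:
        return [(tag, 1.0 / len(tagset)) for tag in sorted(tagset)]
    total = sum(counts.values())
    return [(tag, n / total) for tag, n in sorted(counts.items())]

lexicon = {"đã": {"J": 9, "V": 1}}  # hypothetical counts
tagset = {"N", "V", "J"}
print(candidate_tags("đã", lexicon, tagset))
print(candidate_tags("xyz", lexicon, tagset))
```

The disambiguation step would then combine these lexical frequencies with contextual tag probabilities to pick one tag per token.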
4.1. Language Resources & Training Corpus
Based on the Vietnamese Dictionary (Hoang Phe, 2002), we built a lexicon of 37454 headwords, in which each lexical unit has its own tags. This lexicon basically includes words commonly used in daily life and in journals, words frequent in literature, frequently used dialectal words, scientific and technical terms from popular science documents, commonly used expressions, special syllables used only for word composition, and abbreviations in common use. The lexicon is gradually enriched with new words encountered in the processed corpora.
Before manual or automatic POS tagging of a corpus, the first step is tokenization, i.e. the identification in the corpus of lexical units as defined in section 3.1. Vietnamese is monosyllabic, but compound words are frequent, so simple tokenization on blank spaces is not possible. To solve this problem, we adopted a finite-state automaton model to recognize the possible segmentations of each sentence segment (delimited by punctuation). In practice, the most probable correct segmentation is the shortest path in the graph. In ambiguous cases (several paths of the same length), human intervention is required. This simple solution proves adequate in almost all cases. Some improvements to this method could be made in the near future (e.g. identification of reduplication forms, disambiguation using POS information, etc.). Another tokenization approach is described in (Dinh Dien, 2001).
Next, the tokenized corpus used for training the tagger is manually tagged with the help of a lexicon lookup tool. For the experiment, we annotated a corpus of nearly 64000 tokens (not counting punctuation), in which about 10000 lexical units occur. One fifth of this corpus is journal text, and the rest is from literature.
Our modified tagger takes into account the lexicon built as described in section 4.1. We ran 6 tests on the two tagsets defined above, with training corpora of increasing size. The remaining text of the manually tagged corpus is used for evaluation.
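The shortest-path criterion can be illustrated with a small Python sketch. It uses dynamic programming over the segmentation graph rather than an explicit finite-state automaton, so it is an equivalent formulation under our assumptions and not the authors' implementation. The function name, the `max_len` limit and the tiny lexicons are hypothetical; the sketch returns every segmentation with the fewest tokens, so more than one result signals an ambiguous case that would be passed to a human.

```python
def shortest_segmentations(syllables, lexicon, max_len=4):
    """Return all segmentations of `syllables` with the fewest tokens.

    A token is a single syllable or a run of up to `max_len` syllables
    whose space-joined form is in `lexicon`.
    """
    n = len(syllables)
    # best[i] = (token count, list of segmentations) for syllables[i:]
    best = {n: (0, [[]])}
    for i in range(n - 1, -1, -1):
        options = []
        for j in range(i + 1, min(n, i + max_len) + 1):
            word = " ".join(syllables[i:j])
            if j == i + 1 or word in lexicon:
                count, tails = best[j]
                options.append((count + 1, word, tails))
        fewest = min(count for count, _, _ in options)
        best[i] = (fewest, [[word] + tail
                            for count, word, tails in options if count == fewest
                            for tail in tails])
    return best[0][1]

# Unambiguous: the compound from the text is preferred over two syllables.
print(shortest_segmentations(["xe", "đạp"], {"xe đạp"}))

# Ambiguous: several paths of the same length (illustrative lexicon).
paths = shortest_segmentations(
    ["học", "sinh", "học", "sinh", "học"], {"học sinh", "sinh học"})
print(len(paths), paths)
```

Preferring the fewest tokens favours compound words over sequences of single syllables, which matches the observation that compounds are frequent in Vietnamese.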
" hồi / lên / sáu / , / có / lần / tôi / đã / nhìn / thấy / một / bức / tranh / tuyệt / đẹp "
" when / go up / six / , / have / time / I / already / look / see / one-a / [classifier] / picture / extreme / beautiful "
(Once, when I was six years old, I saw a magnificent picture)
in which: Nt - temporal noun, Vto - transitive direction verb, Sc - cardinal number, Pp - personal pronoun, Jt - temporal adverb, Vtx - transitive verb (residual), Nc - classifier noun, No - object noun, Jd - degree adverb, Aa - quality adjective.
The experiment confirms that the larger the training corpus, the more accurate the result. The best precision rate for the first-level tagset is around 94% (9 lexical tags and 10 punctuation marks) with a training corpus of 50000 lexical units (60000 tokens in total). For the second-level tagset, the precision rate is around 85% (48 lexical tags and 10 punctuation marks) with the same training corpus size. Without the lexicon built above, these precision rates are 80% and 60% respectively. The distribution of incorrectly assigned tags seems to be correlated with the distribution of tags in the corpus. A small portion of the errors is due to mistakes in the manually tagged corpus. Although the result is rather modest, especially for the second tagset, it is not as disappointing as it appears, because the training corpus is very small (50000 tokens, compared with corpora of at least several hundred thousand words in other work on POS tagging).
We have presented our work on Vietnamese POS tagging. Because Vietnamese researchers have only very recently become involved in Natural Language Processing (NLP), we had to construct all the necessary linguistic resources and define all the data structures from scratch. Nevertheless, we benefit from some advantages: the many existing methodologies for morpho-syntactic annotation, and a strong awareness of the trend towards standardization. Our tagset can easily be readjusted and extended thanks to the underlying base of lexical descriptions. With the assistance of automatic tagging, we can easily increase the size of the annotated corpus. The results obtained lay the foundation for further research in NLP for Vietnamese: syntactic analysis, information retrieval, multilingual alignment, machine translation, etc.
This work would not have been possible without the enthusiastic collaboration of all the linguists at the Vietnam Lexicography Centre. The research reported here was financially supported by the Vietnamese national project KC01.
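The word-level precision rate and the evaluation protocol (training on a growing part of the tagged corpus, testing on the rest) can be sketched as below. The prefix/remainder split is our reading of the protocol and is an assumption, as are the function names and the toy tag sequences; no figure from the experiment is reproduced here.

```python
def accuracy(predicted, gold):
    """Fraction of tokens whose predicted tag equals the reference tag."""
    if len(predicted) != len(gold):
        raise ValueError("tag sequences must have the same length")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

def learning_curve_splits(tokens, train_sizes):
    """Yield (train, held_out) pairs for increasing training sizes."""
    for size in train_sizes:
        yield tokens[:size], tokens[size:]

# Toy example with hypothetical tags: three of four tokens are correct.
gold = ["N", "V", "S", "P"]
predicted = ["N", "V", "N", "P"]
print(accuracy(predicted, gold))

for train, held_out in learning_curve_splits(list(range(10)), [2, 5, 8]):
    print(len(train), len(held_out))
```

Reporting the precision for each training size gives the learning curve from which the effect of corpus size can be read.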
 Uỷ ban Khoa học Xã hội Việt Nam (1983), Ngữ pháp tiếng Việt (Vietnamese Grammar), Hanoi, NXB Khoa học Xã hội.
 Brill E. (1992), A Simple Rule-based Part-of-Speech tagger, Proceedings of the Third Annual Conference on Applied Natural Language Processing, ACL.
 Ide, N., Véronis, J. (1994). MULTEXT: Multilingual Text Tools and Corpora. Proceedings of the 15th International Conference on Computational Linguistics, COLING'94, Kyoto, Japan, 588-92.
 Erjavec, T., Ide, N., Petkevic, V., Véronis, J. (1996) Multext-East: Multilingual Text, Tools and Corpora for Central and Eastern European Languages. Proceedings of the First TELRI European Seminar, 87-98.
 MacMahon J.G., Smith F.J. (1996), Improving statistical language model performance with automatically generated word hierarchies, Computational Linguistics, 22(2), p. 217-247.
 Abney S. (1997), Part-of-Speech Tagging and Partial Parsing, in Young S., Bloothooft G. (eds.) Corpus Based Methods in Language and Speech Processing (p. 118-136) Text, Speech and Language Technology Series, Dordrecht (The Netherlands), Kluwer Academic Publishers.
 Nguyễn Tài Cẩn (1998), Ngữ pháp tiếng Việt (Vietnamese Grammar), Hanoi, NXB Đại học Quốc gia Hà Nội.
 Hữu Đạt, Trần Trí Dõi, Đào Thanh Lan (1998), Cơ sở tiếng Việt (Basis of Vietnamese), Hanoi, NXB Giáo dục.
 Nguyễn Minh Thuyết, Nguyễn Văn Hiệp (1998), Thành phần câu tiếng Việt (Vietnamese sentence constituents), Hanoi, NXB Đại học Quốc gia Hà Nội.
 Mason O., Tufis D. (1998), Tagging Romanian Texts: a Case Study for QTAG, a Language Independent Probabilistic Tagger, Proceedings of First International Conference on Language Resources and Evaluation (LREC), Granada (Spain), 28-30 May 1998, p. 589-596.
 Tufis D. (1998), Tiered Tagging, in International Journal on Information Science and Technology, vol. 1, no. 2, Editura Academiei, Bucharest, 1998.
 Diệp Quang Ban, Hoàng Văn Thung (1999), Ngữ pháp tiếng Việt (Vietnamese Grammar, vol. 1-2), Hanoi, NXB Giáo dục.
 Cao Xuân Hạo (2000), Tiếng Việt - mấy vấn đề ngữ âm, ngữ pháp, ngữ nghĩa (Vietnamese - Some Questions on Phonetics, Syntax and Semantics), Hanoi, NXB Giáo dục.
 Paroubek P., Rajman M. (2000), Etiquetage morpho-syntaxique, Ingénierie des langues (p. 131-150) Paris, HERMES Science Europe.
 Dinh Dien, Hoang Kiem, Nguyen Van Toan (2001), Vietnamese Word Segmentation, Proceedings of the Sixth Natural Language Processing Pacific Rim Symposium (NLPRS2001), Tokyo (Japan), 27-30 November 2001, p. 749-756.
 Ide N., Romary L. (2001), Standards for Language Resources, Proceedings of the IRCS Workshop on Linguistic Databases, Philadelphia, 141-9.
 Farrar, S., W. D. Lewis, D. T. Langendoen (2002), An Ontology for Linguistic Annotation, AAAI '02 Workshop: Semantic Web Meets Language Resources.
 Hoàng Phê (2002), Từ điển tiếng Việt (Vietnamese Dictionary), Vietnam Lexicography Centre, NXB Đà Nẵng.
 Przepiórkowski A., Woliński M. (2003, to appear), The Unbearable Lightness of Tagging: A Case Study in Morphosyntactic Tagging of Polish, Proceedings of 4th International Workshop on Linguistically Interpreted Corpora (LINC-03), Budapest (Hungary), 13-14 April 2003.
(Source: TALN 2003, Batz-sur-Mer, 11-14 June 2003)
(1) LORIA BP 239, 54506 Vandoeuvre lès Nancy. email@example.com, firstname.lastname@example.org
(2) Vietnam Lexicography Centre (Vietlex). email@example.com
3 GRACE project: Grammaires et Ressources pour les Analyseurs de Corpus et leur Évaluation.
5 According to the Multext model convention, the "-" symbol is used for an irrelevant attribute.
6 “cái” is a classifier noun. For example: cái/thing bàn/table = the table, một/one cái/thing bàn/table = a table, cái/thing này/this = this thing.