Language Processing
Lexical descriptions for Vietnamese language processing
Thanh Bon Nguyen, Thi Minh Huyen Nguyen, Laurent Romary, Xuan Luong Vu
Abstract
Only very recently have Vietnamese researchers begun to be involved in the domain of Natural Language Processing. As there does not exist any published work in formal linguistics or any recognizable standard for Vietnamese word categories, fundamental tasks in Vietnamese text analysis such as part-of-speech tagging and parsing are very difficult for computer scientists. All necessary linguistic resources have to be built from scratch, and until now almost no resources have been shared in public research. The aim of our project is to build a common linguistic database that is freely and easily exploitable for the automatic processing of Vietnamese. In this paper, we propose an extensible set of Vietnamese syntactic descriptions that can be used for tagset definition and corpus annotation. These descriptors are designed to serve as a reference set proposal for Vietnamese in the context of ISO subcommittee TC37/SC4 (Language Resource Management; see note 1).
1. Introduction
Over the last 20 years, the field of Natural Language Processing (NLP) has seen numerous achievements in domains as diverse as part-of-speech (POS) tagging, topic detection, or information retrieval. However, most of this work was carried out on occidental languages (roughly corresponding to the Indo-European family), and the results lose their validity when applied to other language families. Today, there is a clear need to develop tools and resources for those other languages. Furthermore, an issue of great interest is the reusability of these linguistic resources in an increasing number of applications, and their comparability in a multilingual framework. This paper focuses on the case of Vietnamese.
Only very recently have Vietnamese researchers begun to be involved in the domain of NLP. As there does not exist any published work in formal linguistics or any recognizable standard for Vietnamese word categories, fundamental tasks in Vietnamese text analysis such as POS tagging and parsing are very difficult for computer scientists. All necessary linguistic resources have to be built from scratch, and until now almost no resources have been shared in public research.
The aim of our project is to build a common linguistic database that is freely and easily exploitable for the automatic processing of Vietnamese. In this paper, we propose an extensible set of Vietnamese morpho-syntactic descriptions that can be used for tagset definition and corpus annotation. These descriptors are designed to serve as a reference set proposal for Vietnamese in the framework of ISO subcommittee TC37/SC4 (Language Resource Management). Before detailing the set of descriptions we are proposing (Section 3), we present an overview of the specificities of the Vietnamese language and of the context of our research (Section 2).
2. Language Resources for Vietnamese
2.1. Characteristics of Vietnamese
To begin with, we present some basic characteristics of Vietnamese (Cao X. Hạo, 2000; Hữu Đạt et al., 1998).
Language family
Vietnamese is classified in the Viet-Muong group of the Mon-Khmer branch, which belongs to the Austro-Asiatic language family. Vietnamese is also known to share similarities with languages of the Tai family. The Vietnamese vocabulary features a large number of Sino-Vietnamese words. Moreover, through contact with the French language, Vietnamese was enriched not only in vocabulary but also in syntax, by the calque (or loan translation) of French grammar.
Language Type
Vietnamese is an isolating language, which is characterized by the following specificities:
● It is a monosyllabic language.
● Its word forms never change, which is contrary to occidental languages that make use of morphological variations (plural form, conjugation...).
● Hence, all grammatical relations are manifested by word order and tool words.
Vocabulary
Vietnamese has a special unit called “tiếng” that corresponds at the same time to a syllable with respect to phonology, a morpheme with respect to morpho-syntax, and a word with respect to sentence constituent creation. For convenience, we call these “tiếng” syllables. The Vietnamese vocabulary contains:
● Simple words, which are monosyllabic.
● Reduplicated words composed by phonetic reduplication (e.g. trắng/white - trăng trắng / whitish).
● Compound words composed by semantic coordination (e.g. quần/trousers, áo/shirt - quần áo/clothes).
● Compound words composed by semantic subordination (e.g. xe/vehicle, đạp/pedal - xe đạp/bicycle).
● Some compound words whose syllable combination is no longer recognizable (bồ nông/pelican).
● Complex words phonetically transcribed from foreign languages (cà phê/coffee).
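Because a word may span several syllables, a text written as space-separated syllables has to be regrouped into words before any lexical description can be attached. The following is a minimal sketch of one common way to do this, greedy longest-match lookup against a lexicon. It is an illustration only, not the procedure used in this project; the function name `group_syllables`, the `max_len` limit, and the toy lexicon are all assumptions.

```python
def group_syllables(syllables, lexicon, max_len=4):
    """Group syllables into words by greedy longest-match lookup.

    `lexicon` holds multi-syllable words written with single spaces.
    A syllable that starts no known multi-syllable word is kept as a
    one-syllable word.
    """
    words = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first, then shrink down to one syllable.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = " ".join(syllables[i:i + n])
            if n == 1 or candidate in lexicon:
                words.append(candidate)
                i += n
                break
    return words


# Toy lexicon built from the examples in the list above (hypothetical data).
TOY_LEXICON = {"xe đạp", "quần áo", "cà phê", "trăng trắng", "bồ nông"}

sentence = "tôi mua xe đạp và quần áo"
words = group_syllables(sentence.split(), TOY_LEXICON)
print(words)
```

Greedy matching is ambiguous in real text (a syllable sequence may be a compound in one context and two simple words in another), so this sketch only shows the syllable/word distinction, not a complete segmenter.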
Grammar
The issue of syntactic category classification for Vietnamese is still in debate amongst the linguistic community (Cao X. Hạo, 2000; Hữu Đạt et al., 1998; Diệp Q. Ban & Hoàng V. Thung, 1999; Uỷ ban KHXHVN, 1983). That lack of consensus is due to the unclear limit between the grammatical roles of many words as well as the frequent phenomenon of syntactic category mutation. Vietnamese dictionaries (Hoàng Phê, 2002) use a set of 8 parts of speech proposed by the Vietnam Committee of Social Science (Uỷ ban KHXHVN, 1983).
As in other isolating languages, the most important source of syntactic information in Vietnamese is word order. The basic word order is Subject - Verb - Object. There are prepositions but no postpositions. In a noun phrase, the main noun precedes the adjectives, and the genitive follows the governing noun.
The other syntactic means are tool words, reduplication, and intonation.
From the viewpoint of functional grammar, the syntactic structure of Vietnamese follows a topic-comment structure. It belongs to the class of topic-prominent languages described by Charles N. Li and Sandra A. Thompson (1976). In these languages, topics are coded in the surface structure and they tend to control co-referentiality (cf. Cây đó lá to nên tôi không thích / Tree that leaves big so I not like, which means This tree, the leaves are big, so I don't like it); the topic-oriented “double subject” construction is a basic sentence type (cf. Tôi tên là Nam, sinh ở Hà Nội / I name be Nam, born in Hanoi, which means My name is Nam, I was born in Hanoi), while such subject-oriented constructions as the passive and “dummy” subject sentences are rare or non-existent (cf. There is a cat in the garden should be translated as Có một con mèo trong vườn / exist one
• Countability [countable (seed), partially countable, non-countable (rice)] - Note that among mass nouns, only material nouns and aggregate nouns (e.g. people) are absolutely non-countable; nouns that generally have a non-countable meaning but can combine directly with numerals in certain specific contexts are called “partially countable”.
• Unit [natural (cup), conventional (meter), collective (herd), administrative (county)] - provides attributes relevant for unit nouns, including classifier nouns.
• Meaning [object (table), plant (tree), animal (cow), part (head), material (fabric), perception (color), location (place), time (month), turn, substantivizer, abstract (feeling), other] - turn is defined for words such as lần (time in Repeat 5 times) or lượt (turn in It is my turn); substantivizer describes words used to turn a verb into a nominal group (e.g. the fact of...). This attribute reflects the combination abilities within various nouns. The specification could be finer-grained, but we have no ambition to go any further for the time being.
Verbs (V)
• Transitivity [intransitive, transitive, any]
• Grade [gradable, non-gradable] - a gradable verb can be used with an adverb of degree (e.g. very).
• Meaning [copula (be), modal (can), passive (undergo), existence (remain), transformation (become), process stage (begin), comparison (equal), opinion (think), imperative (order), giving (offer), directive movement (enter), non directive movement (go), moving (push), other transitive, other intransitive] - This Meaning attribute encodes the distinction of verb valence (number of complements) and categories (noun, verb, clause, etc.) of the complements in the verb phrases.
Adjectives (A)
• Type [qualitative (nice), quantitative (high)] - a quantitative adjective can have a complement specifying a quantity (e.g. high two meters), and in that case it cannot be used with adverbs of degree (e.g. very).
• Grade [gradable (good), non-gradable (absolute)] - cf. the Grade attribute of Verb.
Pronouns (P)
• Type [personal (he), pronominal (myself), indefinite (one), time (that moment), amount (all), demonstrative (that), interrogative (who), predicative (that), reflexive (one another)]
• Person [first, second, third]
• Number [singular, plural]
Determiners/Articles (D)
• Type [definite, indefinite]
• Number [singular, plural]
Numerals (M)
• Type [cardinal (four), approximate (one dozen), fractional (quarter), ordinal (fourth)]
Adverbs (R)
• Type [time (already), degree (very), continuity (still), negation (not), imperative, effect, other (suddenly)]
• Position [pre, post, undefined]
Adpositions (S)
• Type [locative (in), directive (across), time (since), aim (for), destination (to), relative (of), means (by)]
Conjunctions (C)
• Type [coordinating (however), consequence (if ... then), enumeration (..., ..., and ...)]
• Position [initial, non-initial] - necessary in case of discontinuous conjunctions.
Interjections (I)
• Type [exclamation, onomatopoeia]
Modal Particles (T)
• Type [global, local] - reflects the scope of a particle: whole sentence or one word only.
• Meaning [opinion, strengthening, exclamation, interrogation, call, imperative] – reflects different sentence types (exclamation, interrogation, etc.), determined by these particles.
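The attribute lists above can be read as a small schema: each category code has a fixed set of attributes, and each attribute has a closed set of values. The sketch below encodes a few of the categories in that form and checks a lexical description against them. It is only an illustration; the names `DESCRIPTORS` and `validate` are assumptions, and only four categories are transcribed to keep the example short.

```python
# A partial transcription of the descriptor lists above:
# category code -> attribute -> allowed values.
DESCRIPTORS = {
    "V": {
        "Transitivity": {"intransitive", "transitive", "any"},
        "Grade": {"gradable", "non-gradable"},
    },
    "M": {
        "Type": {"cardinal", "approximate", "fractional", "ordinal"},
    },
    "R": {
        "Type": {"time", "degree", "continuity", "negation",
                 "imperative", "effect", "other"},
        "Position": {"pre", "post", "undefined"},
    },
    "T": {
        "Type": {"global", "local"},
        "Meaning": {"opinion", "strengthening", "exclamation",
                    "interrogation", "call", "imperative"},
    },
}


def validate(category, features):
    """Return a list of problems; an empty list means the description fits the schema."""
    schema = DESCRIPTORS.get(category)
    if schema is None:
        return [f"unknown category: {category}"]
    problems = []
    for attr, value in features.items():
        if attr not in schema:
            problems.append(f"unknown attribute for {category}: {attr}")
        elif value not in schema[attr]:
            problems.append(f"invalid value for {category}.{attr}: {value}")
    return problems


print(validate("R", {"Type": "degree", "Position": "pre"}))
print(validate("V", {"Grade": "plural"}))
```

Attributes may be left out of a description, which matches the idea that the descriptor set is extensible and that not every attribute is specified for every entry.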
3.3. Data examples
Using the descriptors presented above, we have constructed a lexicon in which each entry is associated with its lexical description. For the private layer, this construction was performed manually by the linguists of the Vietnam Lexicography Centre, based on the descriptions of each entry in the print Vietnamese dictionary (Hoàng Phê, 2002). Note that this print dictionary was first converted from its MS Word format to XML. Each entry in the dictionary contains distinct information about its grammatical category and a description of its various meanings, with examples. With respect to the kernel layer, we first automatically extract the 8 categories recorded there, and then manually process the categories that should be revised, as described in 3.1. The data are available in two formats: simple text, as in the MULTEXT model, and XML. For the time being, we have chosen a simple XML scheme that explicitly represents the feature structure corresponding to the private layer.
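For the simple text format, MULTEXT-style descriptions are usually written as compact positional strings: the first character gives the category, each following position corresponds to one attribute in a fixed order, and a hyphen marks an attribute that is not specified. The sketch below shows how such a string could be derived from a feature set. The attribute order and the one-letter value codes are invented for this illustration and are not the codes used in the actual lexicon.

```python
# Hypothetical attribute order per category (position in the tag string).
ATTRIBUTE_ORDER = {
    "R": ["Type", "Position"],
    "V": ["Transitivity", "Grade"],
}

# Hypothetical one-letter codes for each attribute value.
VALUE_CODES = {
    ("R", "Type"): {"time": "t", "degree": "d", "negation": "n"},
    ("R", "Position"): {"pre": "b", "post": "a", "undefined": "u"},
    ("V", "Transitivity"): {"intransitive": "i", "transitive": "t", "any": "x"},
    ("V", "Grade"): {"gradable": "g", "non-gradable": "n"},
}


def encode_tag(category, features):
    """Build a positional tag: category letter, then one code per attribute."""
    codes = [category]
    for attr in ATTRIBUTE_ORDER[category]:
        value = features.get(attr)
        if value is None:
            codes.append("-")  # attribute left unspecified
        else:
            codes.append(VALUE_CODES[(category, attr)][value])
    return "".join(codes)


print(encode_tag("R", {"Type": "degree", "Position": "pre"}))
print(encode_tag("V", {"Grade": "gradable"}))
```

A positional string is convenient as a corpus tag, while the XML format keeps attribute names explicit and is easier to extend when new attributes are added.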
Here are some entries illustrating the data encoded in XML format.
Example 1. The word chạy in three uses: 1) run in the horse runs. 2) run in run ultra-violet rays. 3) good in the sale is very good.
Example 2. The syllable hoá has the same role as the suffix ize (e.g. in industrialize) in English.
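As a rough illustration of what an explicit feature-structure entry of this kind might look like, the sketch below builds one with Python's standard `xml.etree.ElementTree` module. The element and attribute names (`entry`, `fs`, `f`, `form`, `cat`, `name`, `value`) are assumptions and do not reproduce the actual scheme of the lexicon; the feature value shown for the first use of chạy (run in the horse runs) is likewise only a plausible guess.

```python
import xml.etree.ElementTree as ET


def build_entry(form, category, features):
    """Build an <entry> element holding one feature structure.

    The markup used here is hypothetical:
    <entry form="..."><fs cat="..."><f name="..." value="..."/></fs></entry>
    """
    entry = ET.Element("entry", {"form": form})
    fs = ET.SubElement(entry, "fs", {"cat": category})
    for name, value in features.items():
        ET.SubElement(fs, "f", {"name": name, "value": value})
    return entry


# First use of "chạy": an intransitive verb (assumed description).
entry = build_entry("chạy", "V", {"Transitivity": "intransitive"})
xml_text = ET.tostring(entry, encoding="unicode")
print(xml_text)
```

A word with several uses, such as chạy in Example 1, would carry one feature structure per use under the same entry.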
4. Conclusion
We have presented our proposal of a reference set for Vietnamese lexical descriptors, following the standardization activities of the ISO subcommittee TC 37 SC 4. These descriptors are expressed, for the time being, in a two-layer model comparable with the MULTEXT model, which was developed for various European languages. In the kernel layer, we have added the modal particle category, which contains modal words appearing frequently in Vietnamese. The other categories remain the same. In the private layer, where specific features of Vietnamese are recorded, we have proposed various attributes that are syntactically important for this analytic language, in which morphology does not help to analyze syntactic structures. With the help of the Vietnam Lexicography Centre, we applied all these descriptions to a lexicon that contains all the entries (about 40,000) of the Vietnamese dictionary (Hoàng Phê, 2002). These resources are represented in a common format that ensures their extensibility and is widely adopted by the international research community, with the aim of sharing them with all researchers in the domain of NLP. This base can help us define tagsets for various applications using morphosyntactically annotated corpora, as well as construct a syntactic lexicon for a given grammatical formalism.
Acknowledgements
This work would not have been possible without the enthusiastic collaboration of all the linguists at the Vietnam Lexicography Centre, especially Hoang T. T. Linh, Dang T. Hoa, Dao M. Thu and Pham T. Thuy. Great thanks to them!
References
[1] Cao Xuân Hạo. 2000. Tiếng Việt - mấy vấn đề ngữ âm, ngữ pháp, ngữ nghĩa (Vietnamese - Some Questions on Phonetics, Syntax and Semantics). NXB Giáo dục, Hanoi, VN.
[2] Diệp Quang Ban, Hoàng Văn Thung. 1999. Ngữ pháp tiếng Việt (Vietnamese Grammar), volume 1. NXB Giáo dục, Hanoi, VN.
[3] Dinh Dien, Hoang Kiem. 2003a. POS-Tagger for English - Vietnamese Bilingual Corpus. Workshop: Building and Using Parallel Texts: Data Driven Machine Translation and Beyond, Edmonton, CA.
[4] Dinh Dien, Hoang Kiem, Nguyen Van Toan. 2001. Vietnamese Word Segmentation. Proceedings of the Sixth Natural Language Processing Pacific Rim Symposium (NLPRS2001), Tokyo, JP.
[5] Dinh Dien, Pham Phu Hoi, Ngo Quoc Hung. 2003b. Some Lexical Issues in Electronic Vietnamese Dictionary, PAPILLON-2003 Workshop on Multilingual Lexical Databases, Hokkaido University, JP.
[6] Tomaž Erjavec, Nancy Ide, Dan Tufis. 1998. Development and Assessment of Common Lexical Specifications for Six Central and Eastern European Languages. Proceedings of the First International Conference on Language Resources and Evaluation, Granada, SP.
[7] Hữu Đạt, Trần Trí Dõi, Đào Thanh Lan. 1998. Cơ sở tiếng Việt (Basis of Vietnamese). NXB Giáo dục, Hanoi, VN.
[8] Hoàng Phê. 2002. Từ điển tiếng Việt (Vietnamese Dictionary). Vietnam Lexicography Centre, NXB Đà Nẵng, VN.
[9] Nancy Ide, Laurent Romary. 2001. Standards for Language Resources. Proceedings of the IRCS Workshop on Linguistic Databases, Philadelphia, US.
[10] Nancy Ide, Jean Véronis. 1994. MULTEXT: Multilingual Text Tools and Corpora. Proceedings of the 15th International Conference on Computational Linguistics, COLING'94, Kyoto, JP.
[11] Charles N. Li, Sandra A. Thompson. 1976. Subject and Topic: A new Typology of Language. In Charles N. Li (ed.). Subject and Topic, London/New York: Academic Press, pp. 457-489.
[12] Nguyễn Tài Cẩn. 1998. Ngữ pháp tiếng Việt (Vietnamese Grammar), NXB Đại học Quốc gia, Hanoi, VN.
[13] Thi Minh Huyen Nguyen, Laurent Romary, Xuan Luong Vu. 2003. Une étude de cas pour l'étiquetage morpho-syntaxique de textes vietnamiens, The TALN Conference, Batz-sur-mer, FR.
[14] Uỷ ban Khoa học Xã hội Việt Nam. 1983. Ngữ pháp tiếng Việt (Vietnamese Grammar). NXB Khoa học Xã hội, Hanoi, VN.
Notes: Thanh Bon Nguyen: IFI, Hanoi. Thi Minh Huyen Nguyen: Laboratory LORIA, Nancy. Laurent Romary: Laboratory LORIA, Nancy. Xuan Luong Vu: Vietnam Lexicography Centre, Hanoi.
1 http://www.tc37sc4.org
2 http://www.loria.fr/equipes/led/outils.php