BKD-Prosub
-NLP PROJECT-
DESCRIPTION
BKD-Prosub is a Thai corpus annotated for pronoun substitutes and address terms. It is part of the BKD (Bangkok Data) corpus collection.
Abstract
A pronoun substitute is a word or a phrase that is used to refer to the speaker or the addressee of a conversation instead of using a personal pronoun.
Pronoun substitutes are frequently and naturally used in Thai conversation. Pronoun substitutes and address terms are important linguistic cues that help us understand the intended referent of a noun (phrase) in a sentence, such as the word “girl” in “That’s a heavy load, girl,” which indicates that the addressee is someone possessing properties described by the word "girl." In natural language processing tasks such as machine translation and dialogue systems, accurate anaphora resolution is crucial for generating coherent and meaningful responses. By leveraging these linguistic cues, language models can improve their ability to accurately identify the intended referent of a pronoun and generate more accurate and contextually appropriate responses.
The development of a corpus with pronoun substitute and address term annotations aims at exploring the actual use of these linguistic elements in dialogues. To create the corpus, dialogue sentences were extracted from a collection of the scripts of popular TV dramas and novels. These sources were chosen in order to provide a comprehensive overview of conversation settings. One key feature of the corpus is the inclusion of a detailed definition of kinship relations. This was done to gain a better understanding of the usage and meaning of kinship terms in dialogues. To ensure the accuracy of the corpus annotations, each sentence was annotated by two annotators, and the results were compared. This approach helps to minimize errors and personal biases, leading to a high-quality corpus that accurately reflects the usage of pronoun substitutes and address terms in natural language conversations. The comparison of annotations can also highlight any discrepancies, which can then be resolved through further discussion and collaboration between the annotators. This rigorous process helps to ensure that the corpus is a reliable and trustworthy resource. The resulting corpus will provide a wealth of information about these linguistic elements and the way in which they are used in real-life dialogues, which makes the present corpus a useful resource for researchers in this field.
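The double-annotation comparison described above can be illustrated with a minimal sketch. It assumes each annotator's output is a list of tag labels aligned token by token; the function name and the sample labels are hypothetical and are not taken from the corpus tooling.

```python
def find_discrepancies(tags_a, tags_b):
    """Return (index, tag_a, tag_b) for every position where two annotators disagree.

    Assumption: both lists cover the same tokens in the same order.
    """
    if len(tags_a) != len(tags_b):
        raise ValueError("annotations must cover the same number of tokens")
    return [(i, a, b) for i, (a, b) in enumerate(zip(tags_a, tags_b)) if a != b]


# Hypothetical labels for one sentence, borrowed from the tagset names.
annotator_1 = ["Title", "Kinship Term 1", "Pronoun"]
annotator_2 = ["Title", "Kinship Term 2", "Pronoun"]

# Positions listed here are the ones the annotators would need to discuss.
discrepancies = find_discrepancies(annotator_1, annotator_2)
```

Each returned tuple points to one disagreement, which could then be resolved through discussion between the annotators.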
• Tagset and Criteria for Annotation •
| Main | Sub | Criteria |
|---|---|---|
| Speaker, Addressee | | • Subject or object in a sentence |
| Address Term | | • Not subject nor object in a sentence • Found in the initial, medial, or final position of a sentence |
| | Kinship Term 1 | • Original meaning • Family by blood including relations between parents • Reference point = speaker e.g. The speaker calls an elder brother "พี่" (phii, elder sibling). |
| | Kinship Term 2 | • Derived meaning • Not family by blood e.g. The speaker calls a senior or a clerk who looks older than the speaker "พี่" (phii, elder sibling). Reference point = younger child |
| | Kinship Term 3 | • Original meaning • Family by blood including relations between parents • Reference point = a family member of the speaker (mostly the youngest family member) e.g. A mother calls her elder son "พี่" (phii, elder sibling). Reference point = younger child e.g. A mother calls herself "แม่" (mee, mother). Reference point = child e.g. A father calls his wife "แม่" (mee, mother). Reference point = child |
| | Kinship Term 4 | • Original meaning • Not family by blood • Reference point = a person outside the speaker's family (mostly the youngest person) e.g. The speaker calls the elder child of the speaker's friend "พี่" (phii, elder sibling). Reference point = younger child of the speaker's friend |
| | Kinship Term 5 | • Thai annotation only • Derived meaning, not a family relation meaning e.g. The speaker calls an idol, regardless of age, "พี่" (phii, elder sibling) / "น้อง" (noog, younger sibling) + "เค้า" (kaw, personal name) e.g. The speaker calls a friend of the same age "ไอ้น้อง" (Pay noog, little brother) e.g. The speaker calls his/her mother "พี่" (phii, elder sibling) |
| | Title | • Equivalent to "คำนำหน้านาม" (kham namnaa naam) as described in Section II • Found in front of a personal name e.g. "คุณ" (khun, Mr./Ms.) / "บอส" (boot, boss) + personal name • Double titles are possible e.g. "พี่" (phii, elder sibling) + "ทนาย" (thanaay, advocate) + personal name = kinship term + occupation term |
| | Particle | • A part of an address term expression |
| | Pronoun | • Tagged as a personal pronoun to mark that the word is not considered to be a pronoun substitute • According to "The Royal Institute Dictionary" [12] |
• Thai Conversational Text Collection for Pronoun Substitute and Address Term Annotation •
| No. | Title | Type | Features | No. of Sentences | No. of Words |
|---|---|---|---|---|---|
| 1 | nakii (1 episode) [8] นาคี | TV Drama script | Contemporary spoken Thai, Fantasy | 255 | 3,016 |
| 2 | phiphophimmaphaan (1 episode) [8] พิภพหิมพานต์ | TV Drama script | Contemporary spoken Thai, Fantasy | 207 | 2,289 |
| 3 | plerngnaakhaa (1 episode) [8] เพลิงนาคา | TV Drama script | Contemporary spoken Thai, Fantasy | 174 | 1,673 |
| 4 | Dare to love (24 episodes) [9] ให้รักพิพากษา | TV Drama script | Contemporary spoken Thai, Fantasy | 9,940 | 127,220 |
| 5 | khwaamsuk khoog kathi [10] ความสุขของกระทิ | Novel | Contemporary spoken Thai | 92 | 1,459 |
| 6 | teepaankoon [11] แต่ปางก่อน | Novel | Spoken Thai from 1910 to present day including aristocratic words | 2,356 | 38,888 |
| 7 | namsaycaycin [11] น้ำใสใจจริง | Novel | Contemporary spoken Thai | 3,676 | 41,609 |
| 8 | phaathoog [11] ผ้าทอง | Novel | Contemporary spoken Thai | 3,906 | 56,188 |
| Total | | | | 20,606 | 272,342 |
• Format
The annotation is provided in a tab-delimited text format. The first line is a header explaining the content in each column.
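As a minimal sketch of how such a file might be loaded, the snippet below reads tab-delimited text and uses the first line as the header. The column names in the sample are hypothetical placeholders, since the actual header is defined by the file itself, and UTF-8 encoding is an assumption.

```python
import csv
import io


def read_annotations(source):
    """Read tab-delimited annotation text from an open text stream.

    Returns the header (a list of column names taken from the first line)
    and the remaining non-empty lines as dicts keyed by those names.
    """
    # QUOTE_NONE keeps quotation marks inside the text as literal characters.
    reader = csv.reader(source, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    rows = [dict(zip(header, row)) for row in reader if row]
    return header, rows


# Hypothetical two-column sample; the real files define their own columns.
sample = io.StringIO("text\ttag\nพี่\tKinship Term 1\nคุณ\tTitle\n")
header, rows = read_annotations(sample)

# With a downloaded file (UTF-8 is assumed here):
# with open("A-all-prosub.txt", encoding="utf-8", newline="") as f:
#     header, rows = read_annotations(f)
```

Reading the header from the file rather than hard-coding it keeps the sketch independent of the exact column layout.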
• Acknowledgements
This work was supported by the Thailand Science Research and Innovation Fundamental Fund, Contract Number TUFF19/2564 and TUFF24/2565, and by JSPS KAKENHI Grant Number JP20H01255. We would like to extend our appreciation to Dr. Sorarat Jirabawornvisut, Khunying Vinita Diteeyont (aka V. Vinichaikul), and Tipthida Satthathip for the valuable contributions of their writings for this study. Additionally, we would like to acknowledge the publishers, Amarin Printing and Publishing PCL. and BEC World PCL., who have graciously allowed us to use their works in our research.
References
[1] V. Sornlertlamvanich, et al. “Collaborative Collection of Multilingual Pronoun Substitutes and Address Terms,” In Proc. 7th Int. Conf. on Business and Industrial Research (ICBIR2022), Bangkok, Thailand, May 19-20, 2022, pp. 36-40.
[2] S. Wittayapanyanon. “A Review of Studies of Pronoun Substitute and Address Term,” In Southeast Asian Studies Tokyo University of Foreign Studies, No. 26, Dec 2020, pp. 1-23.
License
Apache 2.0
DATA
File Download
A-all-prosub.txt (0.62 MB)
DTL-all-new-prosub.txt (0.81 MB)
LICENSE.txt (0.01 MB)
PS-all-no-setting-prosub.txt (0.05 MB)
Prosub-tagset.png (0.09 MB)
Thai-text.png (0.09 MB)