BKD-Prosub

-NLP PROJECT-

DESCRIPTION

BKD-Prosub is a Thai corpus annotated with pronoun substitutes and address terms. It is part of the BKD (Bangkok Data) corpus collection.

▶ Abstract

A pronoun substitute is a word or a phrase that is used to refer to the speaker or the addressee of a conversation instead of using a personal pronoun.

Pronoun substitutes are frequently and naturally used in Thai conversation. Pronoun substitutes and address terms are important linguistic cues that help us understand the intended referent of a noun (phrase) in a sentence, such as the word “girl” in “That’s a heavy load, girl,” which indicates that the addressee is someone possessing properties described by the word "girl." In natural language processing tasks such as machine translation and dialogue systems, accurate anaphora resolution is crucial for generating coherent and meaningful responses. By leveraging these linguistic cues, language models can improve their ability to accurately identify the intended referent of a pronoun and generate more accurate and contextually appropriate responses.

The corpus was developed to explore how pronoun substitutes and address terms are actually used in dialogue. Dialogue sentences were extracted from the scripts of popular TV dramas and from novels, sources chosen to cover a broad range of conversation settings. One key feature of the corpus is a detailed definition of kinship relations, included to give a better understanding of the usage and meaning of kinship terms in dialogue.

To ensure annotation accuracy, each sentence was annotated by two annotators and the results were compared. This helps to minimize errors and personal bias. The comparison also highlights discrepancies, which can then be resolved through further discussion between the annotators. The resulting corpus documents how these linguistic elements are used in dialogue, which makes it a useful resource for researchers in this field.

• Tagset and Criteria for Annotation •

Main / Sub / Criteria

Speaker / Addressee
• Subject or object in a sentence

Address Term
• Neither subject nor object in a sentence
• Found in the initial, medial, or final position of a sentence

Kinship Term 1
• Original meaning
• Family by blood, including relations between parents
• Reference point = speaker
e.g. The speaker calls an elder brother "พี่" (phii, elder sibling).

Kinship Term 2
• Derived meaning
• Not family by blood
e.g. The speaker calls a senior or a clerk who looks older than the speaker "พี่" (phii, elder sibling). Reference point = younger child

Kinship Term 3
• Original meaning
• Family by blood, including relations between parents
• Reference point = a family member of the speaker (mostly the youngest family member)
e.g. A mother calls her elder son "พี่" (phii, elder sibling). Reference point = younger child
e.g. A mother calls herself "แม่" (mee, mother). Reference point = child
e.g. A father calls his wife "แม่" (mee, mother). Reference point = child

Kinship Term 4
• Original meaning
• Not family by blood
• Reference point = a non-family member of the speaker (mostly the youngest person)
e.g. The speaker calls the elder child of the speaker's friend "พี่" (phii, elder sibling). Reference point = younger child of the speaker's friend

Kinship Term 5
• Thai annotation only
• Derived meaning, not a family relation meaning
e.g. The speaker calls an idol, regardless of age, "พี่" (phii, elder sibling) / "น้อง" (noog, younger sibling) + "เค้า" (kaw, personal name)
e.g. The speaker calls a friend of the same age "ไอ้น้อง" (Pay noog, little brother)
e.g. The speaker calls his/her mother "พี่" (phii, elder sibling)

Title
• Equivalent to "คำนำหน้านาม" (kham namnaa naam) as described in Section II
• Found in front of a personal name
e.g. "คุณ" (khun, Mr./Ms.) / "บอส" (boot, boss) + personal name
• Double titles are possible
e.g. "พี่" (phii, elder sibling) + "ทนาย" (thanaay, advocate) + personal name (= kinship term + occupation term)

Particle
• A part of an address term expression

Pronoun
• Tagged as a personal pronoun to mark that the word is not considered to be a pronoun substitute
• According to "The Royal Institute Dictionary" [12]

• Thai Conversational Text Collection for Pronoun Substitute and Address Term Annotation •

| No. | Title | Type | Features | No. of Sentences | No. of Words |
|---|---|---|---|---|---|
| 1 | nakii (1 episode) [8] นาคี | TV drama script | Contemporary spoken Thai, Fantasy | 255 | 3,016 |
| 2 | phiphophimmaphaan (1 episode) [8] พิภพหิมพานต์ | TV drama script | Contemporary spoken Thai, Fantasy | 207 | 2,289 |
| 3 | plerngnaakhaa (1 episode) [8] เพลิงนาคา | TV drama script | Contemporary spoken Thai, Fantasy | 174 | 1,673 |
| 4 | Dare to love (24 episodes) [9] ให้รักพิพากษา | TV drama script | Contemporary spoken Thai, Fantasy | 9,940 | 127,220 |
| 5 | khwaamsuk khoog kathi [10] ความสุขของกระทิ | Novel | Contemporary spoken Thai | 92 | 1,459 |
| 6 | teepaankoon [11] แต่ปางก่อน | Novel | Spoken Thai from 1910 to present day, including aristocratic words | 2,356 | 38,888 |
| 7 | namsaycaycin [11] น้ำใสใจจริง | Novel | Contemporary spoken Thai | 3,676 | 41,609 |
| 8 | phaathoog [11] ผ้าทอง | Novel | Contemporary spoken Thai | 3,906 | 56,188 |
| Total | | | | 20,606 | 272,342 |
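
The totals in the table above can be cross-checked with a few lines of Python. This is only a small sketch: the per-source counts are copied from the table, while the dictionary name `COUNTS` and the shortened keys are illustrative and not part of the corpus.

```python
# Per-source counts copied from the table: (sentences, words).
# The dictionary and its keys are illustrative, not part of the corpus files.
COUNTS = {
    "nakii": (255, 3016),
    "phiphophimmaphaan": (207, 2289),
    "plerngnaakhaa": (174, 1673),
    "Dare to love": (9940, 127220),
    "khwaamsuk khoog kathi": (92, 1459),
    "teepaankoon": (2356, 38888),
    "namsaycaycin": (3676, 41609),
    "phaathoog": (3906, 56188),
}

total_sentences = sum(sentences for sentences, _ in COUNTS.values())
total_words = sum(words for _, words in COUNTS.values())

print(total_sentences, total_words)  # 20606 272342
```

Both sums agree with the Total row, which also confirms how the fused sentence and word figures in each row should be split.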

• Format

The annotation is provided in a tab-delimited text format. The first line is a header explaining the content in each column.
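
As a minimal sketch of how such a file could be loaded, the following Python reads a tab-delimited text with a header line using only the standard library. The function name `read_annotations` and the two-column sample are hypothetical; the actual column names are whatever the header line of each downloaded file declares.

```python
import csv
import io


def read_annotations(handle):
    """Read a tab-delimited annotation file whose first line is a header.

    Returns (header, rows): the column names and one dict per data line.
    """
    # QUOTE_NONE keeps quotation marks in the text as literal characters.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = list(reader)
    return reader.fieldnames, rows


# Hypothetical two-column sample; the real files define their own header.
SAMPLE = "text\tlabel\nพี่ไปไหน\texample-tag\nแม่มาแล้ว\texample-tag\n"

header, rows = read_annotations(io.StringIO(SAMPLE))
print(header)
print(len(rows))
```

For a downloaded file, the same function can be given an open file handle, for example one obtained with `open("A-all-prosub.txt", encoding="utf-8", newline="")`. The UTF-8 encoding is an assumption here and should be checked against the actual files.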

• Acknowledgements

This work was supported by the Thailand Science Research and Innovation Fundamental Fund, Contract Numbers TUFF19/2564 and TUFF24/2565, and by JSPS KAKENHI Grant Number JP20H01255. We would like to extend our appreciation to Dr. Sorarat Jirabawornvisut, Khunying Vinita Diteeyont (aka V. Vinichaikul), and Tipthida Satthathip for the valuable contributions of their writings to this study. Additionally, we would like to acknowledge the publishers, Amarin Printing and Publishing PCL. and BEC World PCL., who have graciously allowed us to use their works in our research.

References

[1] V. Sornlertlamvanich, et al. “Collaborative Collection of Multilingual Pronoun Substitutes and Address Terms,” In Proc. 7th Int. Conf. on Business and Industrial Research (ICBIR2022), Bangkok, Thailand, May 19-20, 2022, pp. 36-40.

[2] S. Wittayapanyanon. “A Review of Studies of Pronoun Substitute and Address Term,” In Southeast Asian Studies Tokyo University of Foreign Studies, No. 26, Dec 2020, pp. 1-23.

License

Apache 2.0

DATA

File Download

A-all-prosub.txt (0.62 MB)

DTL-all-new-prosub.txt (0.81 MB)

LICENSE.txt (0.01 MB)

PS-all-no-setting-prosub.txt (0.05 MB)

Prosub-tagset.png (0.09 MB)

Thai-text.png (0.09 MB)
