NTOU NLP Lab Resources
TOCAB: NTOU Chinese Abusive Language Dataset
Chinese Abusive Language Processing Dataset
Introduction
The TOCAB (NTOU Chinese Abusive Language) dataset is a large dataset of Chinese abusive language, created for research purposes.
It contains 1,000 posts and 121,344 comment sentences collected from PTT, a well-known bulletin board system (BBS) in Taiwan.
Format
Data in the TOCAB dataset are stored in JSON format with the following fields:
- article: content of a PTT post
- permalink_info: the permanent ID of this post
- board: the source PTT board of this post
- comment: a list of comment sentences posted in reply to this post, each with the fields:
  - sentence: the text of a comment sentence
  - abusive: whether this comment contains abusive language ("yes" or "no")
  - abusive_class: a list of the abusive classes that this comment belongs to
  - abusive_vote: number of votes for this comment being abusive
  - class_vote: number of votes for this comment belonging to each abusive class:
    - sex: related to gender, sexual orientation, or gender identity
    - body_mind: related to body, outlook, age, disease, mind, or mental state
    - politics: related to a political party, politician, government, or supporters of a political entity
    - race_nationality: related to race, nationality, region, mother tongue, or skin color
    - offense: offending someone by cursing, profanity, disesteem, or name-calling
    - other: related to minor groups, including religious entities, schools, special-interest groups, etc.; or depriving other users of their right to speak by telling them to shut up or leave the discussion
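The field descriptions can also be written down as Python type hints, which may help when writing code against the data. The following is a minimal sketch, assuming the nesting shown in the dataset's example record (comment-level fields inside each entry of "comment", and the six class names as keys of "class_vote"). The names ABUSIVE_CLASSES, Comment, Post, and is_well_formed are illustrative and not part of the dataset.

```python
from typing import Dict, List, TypedDict

# The six abusive classes named in the field list.
ABUSIVE_CLASSES = (
    "sex", "body_mind", "politics", "race_nationality", "offense", "other",
)

class Comment(TypedDict):
    sentence: str
    abusive: str                # "yes" or "no"
    abusive_class: List[str]    # subset of ABUSIVE_CLASSES
    abusive_vote: int
    class_vote: Dict[str, int]  # one vote count per abusive class

class Post(TypedDict):
    article: str
    permalink_info: str
    board: str
    comment: List[Comment]

REQUIRED_COMMENT_KEYS = {
    "sentence", "abusive", "abusive_class", "abusive_vote", "class_vote",
}

def is_well_formed(comment: dict) -> bool:
    """Structural check only: field presence, types, and class names."""
    if not REQUIRED_COMMENT_KEYS <= comment.keys():
        return False
    return (
        isinstance(comment["sentence"], str)
        and comment["abusive"] in ("yes", "no")
        and set(comment["abusive_class"]) <= set(ABUSIVE_CLASSES)
        and isinstance(comment["abusive_vote"], int)
        and set(comment["class_vote"]) == set(ABUSIVE_CLASSES)
    )

example: Comment = {
    "sentence": "以言論自由名義製造假新聞~可笑",
    "abusive": "no",
    "abusive_class": [],
    "abusive_vote": 0,
    "class_vote": {name: 0 for name in ABUSIVE_CLASSES},
}
print(is_well_formed(example))
```

The check is deliberately shallow: it does not test whether the vote counts and the abusive_class list agree with each other, since the rule connecting them is not stated here.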
An example of the TOCAB data is as follows:
{
  "article": "看著凱道群魔亂舞\n明目張膽的打擊言論自由...",
  "permalink_info": "M.1561282423.A.D87",
  "board": "Gossiping",
  "comment": [
    {
      "sentence": "以言論自由名義製造假新聞~可笑",
      "abusive": "no",
      "abusive_class": [],
      "abusive_vote": 0,
      "class_vote": {
        "sex": 0,
        "body_mind": 0,
        "politics": 0,
        "race_nationality": 0,
        "offense": 0,
        "other": 0
      }
    },
    {
      "sentence": "智障嗎對支那亡國感才重啦",
      "abusive": "yes",
      "abusive_class": [ "body_mind", "race_nationality" ],
      "abusive_vote": 5,
      "class_vote": {
        "sex": 0,
        "body_mind": 3,
        "politics": 0,
        "race_nationality": 5,
        "offense": 2,
        "other": 0
      }
    }, …
  ], …
}, …
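As a rough illustration of how records in this format might be read, the sketch below parses a small inline sample that mirrors the example record and tallies abusive comments per class. It assumes the top level of the file is a JSON array of post objects, which the example suggests but does not confirm; the file name tocab.json in the comment and the helper names iter_comments and summarize are hypothetical.

```python
import json
from collections import Counter

# A tiny stand-in for the real file, mirroring the example record.
# Assumption: the top level of the dataset is a JSON array of posts.
SAMPLE = """
[
  {
    "article": "...",
    "permalink_info": "M.1561282423.A.D87",
    "board": "Gossiping",
    "comment": [
      {
        "sentence": "以言論自由名義製造假新聞~可笑",
        "abusive": "no",
        "abusive_class": [],
        "abusive_vote": 0,
        "class_vote": {"sex": 0, "body_mind": 0, "politics": 0,
                       "race_nationality": 0, "offense": 0, "other": 0}
      },
      {
        "sentence": "智障嗎對支那亡國感才重啦",
        "abusive": "yes",
        "abusive_class": ["body_mind", "race_nationality"],
        "abusive_vote": 5,
        "class_vote": {"sex": 0, "body_mind": 3, "politics": 0,
                       "race_nationality": 5, "offense": 2, "other": 0}
      }
    ]
  }
]
"""

def iter_comments(posts):
    """Yield (board, comment) pairs for every comment in every post."""
    for post in posts:
        for comment in post["comment"]:
            yield post["board"], comment

def summarize(posts):
    """Count all comments, abusive comments, and class label frequencies."""
    total = 0
    abusive = 0
    per_class = Counter()
    for _board, comment in iter_comments(posts):
        total += 1
        if comment["abusive"] == "yes":
            abusive += 1
            per_class.update(comment["abusive_class"])
    return total, abusive, per_class

posts = json.loads(SAMPLE)
# For the real data one would instead read the downloaded file, e.g.
#   with open("tocab.json", encoding="utf-8") as f:  # hypothetical name
#       posts = json.load(f)
total, abusive, per_class = summarize(posts)
print(total, abusive, dict(per_class))
# prints: 2 1 {'body_mind': 1, 'race_nationality': 1}
```

Reading with encoding="utf-8" is a guess about the file's encoding; adjust it if the downloaded file is stored differently.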
Download
Currently available version:
How to Cite the Corpus
Please cite the following paper when referring to the TOCAB dataset in academic publications and papers.
I Chung and Chuan-Jie Lin (2021) "TOCAB: A Dataset for Chinese Abusive Language Processing," Proceedings of the IEEE 22nd International Conference on Information Reuse and Integration for Data Science (IRI), IEEE International Workshop on Empirical Methods for Recognizing Inference in Text (EMRITE), August 10-12, 2021, pp. 445-452.
Acknowledgment
This research was funded by the Ministry of Science and Technology of Taiwan (grant: MOST 108-2221-E-019-045).
Contact
Please send comments and suggestions to Chuan-Jie Lin.
This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.