NTOU NLP Lab Resources

TOCAB: NTOU Chinese Abusive Language Dataset
A Dataset for Chinese Abusive Language Processing

Introduction

The TOCAB (NTOU Chinese Abusive Language) dataset is a large dataset of Chinese abusive language created for research purposes.

The dataset contains 1,000 posts and 121,344 comment sentences collected from PTT, a well-known bulletin board system (BBS) in Taiwan.

Format

The TOCAB data are stored in JSON format. Each post is an object with the following fields:

article: content of a PTT post
permalink_info: the permanent ID of this post
board: the source PTT board of this post
comment: a list of comment sentences posted in reply to this post, each with the following fields:
- sentence: the text of the comment sentence
- abusive: whether this comment contains abusive language ("yes" or "no")
- abusive_class: a list of the abusive classes that this comment belongs to
- abusive_vote: the number of votes for this comment being abusive
- class_vote: for each abusive class (sex, body_mind, politics, race_nationality, offense, other), the number of votes for this comment belonging to that class
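The field list above can be written down as type definitions, which helps when checking downloaded records. This is a minimal sketch, not an official schema: the names `Comment`, `Post`, and `is_abusive` are hypothetical, and the Python types are inferred from the field descriptions and the sample record on this page.

```python
from typing import Dict, List, TypedDict


class Comment(TypedDict):
    sentence: str               # text of the comment sentence
    abusive: str                # "yes" or "no"
    abusive_class: List[str]    # abusive classes assigned to the comment
    abusive_vote: int           # votes for the comment being abusive
    class_vote: Dict[str, int]  # vote count per abusive class


class Post(TypedDict):
    article: str                # content of the PTT post
    permalink_info: str         # permanent ID of the post
    board: str                  # source PTT board
    comment: List[Comment]      # comment sentences replying to the post


def is_abusive(comment: Comment) -> bool:
    """Return True when the comment is labelled abusive ("yes")."""
    return comment["abusive"] == "yes"
```

`TypedDict` only documents the expected shape for type checkers; it does not validate data at runtime.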

An example of the TOCAB data is as follows:

{
	"article": "看著凱道群魔亂舞\n明目張膽的打擊言論自由...",
	"permalink_info": "M.1561282423.A.D87",
	"board": "Gossiping",
	"comment": [
		{
			"sentence": "以言論自由名義製造假新聞~可笑",
			"abusive": "no",
			"abusive_class": [],
			"abusive_vote": 0,
			"class_vote": {
				"sex": 0,
				"body_mind": 0,
				"politics": 0,
				"race_nationality": 0,
				"offense": 0,
				"other": 0
			}
		},
		{
			"sentence": "智障嗎對支那亡國感才重啦",
			"abusive": "yes",
			"abusive_class": [ "body_mind", "race_nationality" ],
			"abusive_vote": 5,
			"class_vote": {
				"sex": 0,
				"body_mind": 3,
				"politics": 0,
				"race_nationality": 5,
				"offense": 2,
				"other": 0
			}
		}, …
	], …
}, …
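To show how records in this format might be read, the sketch below parses an abridged copy of the example and counts comments, abusive comments, and abusive-class labels. It assumes the data form a JSON array of post objects, which the trailing "}, …" suggests but this page does not confirm. `SAMPLE` and `summarize` are hypothetical names introduced for illustration.

```python
import json
from collections import Counter

# Abridged, valid-JSON copy of the example above (elisions removed).
SAMPLE = r'''
[
  {
    "article": "看著凱道群魔亂舞\n明目張膽的打擊言論自由...",
    "permalink_info": "M.1561282423.A.D87",
    "board": "Gossiping",
    "comment": [
      {
        "sentence": "以言論自由名義製造假新聞~可笑",
        "abusive": "no",
        "abusive_class": [],
        "abusive_vote": 0,
        "class_vote": {"sex": 0, "body_mind": 0, "politics": 0,
                       "race_nationality": 0, "offense": 0, "other": 0}
      },
      {
        "sentence": "智障嗎對支那亡國感才重啦",
        "abusive": "yes",
        "abusive_class": ["body_mind", "race_nationality"],
        "abusive_vote": 5,
        "class_vote": {"sex": 0, "body_mind": 3, "politics": 0,
                       "race_nationality": 5, "offense": 2, "other": 0}
      }
    ]
  }
]
'''


def summarize(posts):
    """Count all comments, abusive comments, and abusive-class labels."""
    total = 0
    abusive = 0
    classes = Counter()
    for post in posts:
        for comment in post["comment"]:
            total += 1
            if comment["abusive"] == "yes":
                abusive += 1
                classes.update(comment["abusive_class"])
    return total, abusive, classes


posts = json.loads(SAMPLE)
total, abusive, classes = summarize(posts)
print(total, abusive, dict(classes))
# prints: 2 1 {'body_mind': 1, 'race_nationality': 1}
```

For the downloaded training set, `json.loads(SAMPLE)` would be replaced by `json.load` on the file opened with `encoding="utf-8"`, provided the file really is a single JSON array.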

Download

Currently available version:

TOCAB v1.0 training set

How to Cite the Corpus

Please cite the following paper when referring to the TOCAB dataset in academic publications.

I Chung and Chuan-Jie Lin (2021) "TOCAB: A Dataset for Chinese Abusive Language Processing," Proceedings of the IEEE 22nd International Conference on Information Reuse and Integration for Data Science (IRI), IEEE International Workshop on Empirical Methods for Recognizing Inference in Text (EMRITE), August 10-12, 2021, pp. 445-452.

Acknowledgment

This research was funded by the Ministry of Science and Technology of Taiwan (grant: MOST 108-2221-E-019-045).

Contact

Please send comments and suggestions to Chuan-Jie Lin.

This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.