The TOCP (NTOU Chinese Profanity) dataset is a large dataset of Chinese profanity, created for research purposes.
There are 16,450 sentences in this dataset, collected from two mainstream social media sites: PTT and Twitch.
Data in the TOCP dataset are saved in JSON format containing the following fields:
- ID: identifier of the sentence
- original_sentence: the sentence as collected from the source website
- source_website: the site the sentence was collected from (PTT or Twitch)
- source_link: URL of the source post
- profane_expression: a list of annotated profane expressions, each with start and end character offsets, the original_expression itself, and its rephrased_expression (which may be empty)
An example of the TOCP data is as follows:
{ "ID": "03166_63", "original_sentence": "幹你又要中離了喔?真他媽笑死,講不贏就跑這招你要", "source_website": "PTT", "source_link": "https://www.ptt.cc/bbs/Marginalman/M.1521774109.A.5D4.html", "profane_expression": [ { "start": 0, "end": 1, "orginal_expression": "幹", "rephrased_expression": "" }, { "start": 10, "end": 12, "original_expression": "他媽", "rephrased_expression": "的" } ] }
Currently available version:
Please cite the following paper when referring to the TOCP dataset in academic publications.
Hsu Yang and Chuan-Jie Lin (2020). "TOCP: A Dataset for Chinese Profanity Processing." In Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying (TRAC-2), Language Resources and Evaluation Conference (LREC 2020), Marseille, 11-16 May 2020, pages 6-12.
This research was funded by the Ministry of Science and Technology of Taiwan (grant: MOST 107-2221-E-019-038).
Please send comments and suggestions to Chuan-Jie Lin.