The TOCP (NTOU Chinese Profanity) dataset is a large dataset of Chinese profanity created for research purposes.
There are 16,450 sentences in this dataset, collected from two mainstream social media sites: PTT and Twitch.
Data in the TOCP dataset are saved in JSON format. Each record contains an ID, the original_sentence, the source_website and source_link, and a list of profane_expression annotations giving the character offsets (start, end) of each profane span, its original_expression, and its rephrased_expression (which may be empty).
An example of the TOCP data is as follows:
{
    "ID": "03166_63",
    "original_sentence": "幹你又要中離了喔?真他媽笑死,講不贏就跑這招你要",
    "source_website": "PTT",
    "source_link": "https://www.ptt.cc/bbs/Marginalman/M.1521774109.A.5D4.html",
    "profane_expression": [
        {
            "start": 0,
            "end": 1,
            "original_expression": "幹",
            "rephrased_expression": ""
        },
        {
            "start": 10,
            "end": 12,
            "original_expression": "他媽",
            "rephrased_expression": "的"
        }
    ]
}
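As a quick illustration, the Python sketch below reads the dataset and extracts each annotated span by its character offsets. The file name "TOCP.json" and the assumption that the file holds a single JSON array of records are hypothetical; the offsets are read as exclusive-end character indices, which matches the example above (0-1 covers "幹", 10-12 covers "他媽").

import json

# A minimal sketch, assuming the dataset is distributed as a JSON array of
# records in a file named "TOCP.json" (both assumptions; adjust as needed).
with open("TOCP.json", encoding="utf-8") as f:
    records = json.load(f)

for record in records:
    sentence = record["original_sentence"]
    for expr in record["profane_expression"]:
        # "start"/"end" are taken to be character offsets with "end" exclusive.
        span = sentence[expr["start"]:expr["end"]]
        print(span, "->", expr["rephrased_expression"] or "(removed)")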
Currently available version:
Please cite the following paper when referring to the TOCP dataset in academic publications.
Hsu Yang and Chuan-Jie Lin (2020) "TOCP: A Dataset for Chinese Profanity Processing," Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying (TRAC-2), Language Resources and Evaluation Conference (LREC 2020), Marseille, 11-16 May 2020, pages 6-12.
This research was funded by the Ministry of Science and Technology of Taiwan (grant: MOST 107-2221-E-019-038).
Please send comments and suggestions to Chuan-Jie Lin.