The TOCP (NTOU Chinese Profanity) dataset is a large dataset of Chinese profanity, created for research purposes.
There are 16,450 sentences in this dataset, collected from two mainstream social media sites: PTT and Twitch.
Data in the TOCP dataset are saved in JSON format containing the following fields:
- ID: identifier of the sentence
- original_sentence: the sentence as collected from the source website
- source_website: the site the sentence was collected from (PTT or Twitch)
- source_link: URL of the source post
- profane_expression: a list of annotated profane expressions, each with start and end character offsets, the original_expression itself, and its rephrased_expression (which may be empty)
An example of the TOCP data is as follows:
{ "ID": "03166_63", "original_sentence": "幹你又要中離了喔?真他媽笑死,講不贏就跑這招你要", "source_website": "PTT", "source_link": "https://www.ptt.cc/bbs/Marginalman/M.1521774109.A.5D4.html", "profane_expression": [ { "start": 0, "end": 1, "orginal_expression": "幹", "rephrased_expression": "" }, { "start": 10, "end": 12, "original_expression": "他媽", "rephrased_expression": "的" } ] }
Currently available version:
Please cite the following paper when referring to the TOCP dataset in academic publications.
Hsu Yang and Chuan-Jie Lin (2020). "TOCP: A Dataset for Chinese Profanity Processing." In Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying (TRAC-2), Language Resources and Evaluation Conference (LREC 2020), Marseille, 11-16 May 2020, pages 6-12.
This research was funded by the Ministry of Science and Technology of Taiwan (grant: MOST 107-2221-E-019-038).
Please send comments and suggestions to Chuan-Jie Lin.