We have collected a corpus of British English SMS messages. The data come from two sources:
1. GrumbleText (http://www.grumbletext.co.uk) for SMS Spam.
GrumbleText is a website where people submit complaints about SMS spam. People who receive SMS spam voluntarily post the messages on the site. We collected the data manually from the website on June 14, 2010, and obtained 425 spam messages.
2. Caroline Tagg’s PhD thesis (http://etheses.bham.ac.uk/253/1/Tagg09PhD.pdf) for SMS Ham.
Caroline Tagg completed her PhD thesis, titled “A CORPUS LINGUISTICS STUDY OF SMS TEXT MESSAGING”. She read 11,000 text messages, containing 190,000 words, sent by 235 people. However, not all 11,000 messages are reproduced in the thesis. We collected every SMS that appears in the thesis, which gave us 450 messages.
To avoid privacy problems, Caroline Tagg and the GrumbleText users removed some private data, such as names, addresses, and phone numbers.
You can create your own data set from both sources, or you can download ours. We provide the data in CSV format and as 875 text files divided into two folders (Spam and Legitimate). Each file contains the text of one SMS.
Download the corpora here.
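Once the text files are unpacked, the two folders can be read into labelled pairs with a few lines of Python. This is only a minimal sketch: the folder names Spam and Legitimate come from the description above, but the function name load_corpus, the label strings "spam" and "ham", and the UTF-8 decoding are our own assumptions, since the file encoding and file names are not documented here.

```python
from pathlib import Path

# Folder names follow the description above; the label strings are an
# illustrative choice, not part of the corpus itself.
LABEL_BY_FOLDER = {"Spam": "spam", "Legitimate": "ham"}


def load_corpus(root):
    """Return a list of (label, text) pairs, one per message file under root."""
    messages = []
    for folder, label in LABEL_BY_FOLDER.items():
        # Assumption: every regular file in the folder holds exactly one SMS.
        for path in sorted((Path(root) / folder).iterdir()):
            if path.is_file():
                # The encoding is not documented, so undecodable bytes are
                # replaced instead of raising an error.
                text = path.read_text(encoding="utf-8", errors="replace").strip()
                messages.append((label, text))
    return messages
```

Calling load_corpus on the unpacked directory should yield 875 pairs if the download matches the counts given above (425 spam and 450 legitimate messages).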
If you use our data set, please cite our published paper:
Independent and Personal SMS Spam Filtering. In Proc. of IEEE Conference on Computer and Information Technology, Paphos, Cyprus, Aug 2011, pp. 429-435. Link
Thank you and good luck.
If you have any questions, do not hesitate to contact me at firstname.lastname@example.org.