Jettisoning Junk Messaging in the Era of End-to-End Encryption: A Case Study of WhatsApp


Abstract

WhatsApp is a popular messaging app used by over a billion users around the globe. Due to this popularity, understanding misbehavior on WhatsApp is an important issue. The sending of unwanted junk messages by unknown contacts via WhatsApp remains understudied by researchers, in part because of the end-to-end encryption offered by the platform. We address this gap by studying junk messaging on a multilingual dataset of 2.6M messages sent to 5K public WhatsApp groups in India. We characterise both junk content and senders. We find that nearly 1 in 10 messages is unwanted content sent by junk senders, and a number of unique strategies are employed to reflect challenges faced on WhatsApp, e.g., the need to change phone numbers regularly. We finally experiment with on-device classification to automate the detection of junk, whilst respecting end-to-end encryption.

Dataset and Code

An anonymized version of the dataset and the code for the machine learning models used in our paper are available to the research community.

  1. Annotation Dataset:
    (a) Embeddings (no text and no user data) with spam and ham labels.
    (b) Spam-word dictionary in multiple languages.
    (c) Top URLs list in spam messages.

  2. Code for machine learning models:
    (a) SpamAssassin: Scripts
    (b) MuRIL word embeddings: Python code
    (c) Metadata-based classifiers, as presented in our paper.

  3. Code release: The code and additional information about the files mentioned above will be available soon (TBA). Feel free to write to us directly about collaborations.

The format of the dataset is described here.
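
As a rough illustration of how the labelled embeddings in item 1(a) might be used, the sketch below fits a nearest-centroid classifier on (embedding, label) pairs and assigns a new embedding to the closer class by cosine similarity. The record layout, the function names and the toy two-dimensional vectors are all assumptions made for illustration. The released files may be organised differently, and this is not the classifier evaluated in the paper.

```python
from math import sqrt

# Hypothetical record layout: (embedding vector, label), with label "spam" or "ham".
# The toy 2-D vectors below stand in for the real embeddings, which are not reproduced here.
train = [
    ([1.0, 0.1], "spam"),
    ([0.9, 0.2], "spam"),
    ([0.1, 1.0], "ham"),
    ([0.2, 0.8], "ham"),
]


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def fit_centroids(records):
    """Group embeddings by label and return one centroid per label."""
    by_label = {}
    for vec, label in records:
        by_label.setdefault(label, []).append(vec)
    return {label: centroid(vecs) for label, vecs in by_label.items()}


def predict(centroids, vec):
    """Return the label whose centroid is most similar to vec."""
    return max(centroids, key=lambda label: cosine(centroids[label], vec))


centroids = fit_centroids(train)
print(predict(centroids, [0.8, 0.3]))
print(predict(centroids, [0.0, 1.0]))
```

A nearest-centroid rule is used here only because it needs no external libraries. With the real embeddings, any standard classifier could be substituted at the `fit_centroids` and `predict` steps.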


Contact Us


If you are interested in using this data, please fill in the form to request specific data. You will then receive a link where you can download the data.

We are sharing the dataset under the terms and conditions specified below. Please note that submitting the form indicates that you accept these terms and conditions. In the form, please indicate which part of the dataset you need. If you do not receive an email notification for your logged request within 24 hours, please e-mail us at netsys.noreply[at]gmail.com.

Dataset Terms and Conditions

  1. You will use the data solely for the purpose of non-profit research or non-profit education.

  2. You will respect the privacy of end users and organizations that may be identified in the data. You will not attempt to reverse engineer, decrypt, de-anonymize, derive or otherwise re-identify anonymized information.

  3. You will not distribute the data beyond your immediate research group.

  4. If you create a publication using our datasets, please cite our paper as follows:


@inproceedings{agarwal2022Jettisoning,
  title={Jettisoning Junk Messaging in the Era of End-to-End Encryption: A Case Study of WhatsApp},
  author={Agarwal, Pushkal and Raman, Aravindh and Ibosiola, Damilola and Tyson, Gareth and Sastry, Nishanth and Garimella, Kiran},
  booktitle={Proceedings of the 2022 world wide web conference},
  year={2022}
}