Instant Messaging Traffic Dataset for Training AI models
- Project
- 20020 ENTA
- Type
- New service
- Description
Encrypted traffic from six widely used Instant Messaging Applications (IMAs) was collected on an Android device. These applications are: 1. Microsoft Teams, 2. Discord, 3. Facebook Messenger, 4. Signal, 5. Telegram, and 6. WhatsApp. The encrypted traffic collected from these applications is stored in individual .pcap files.
The dataset can be used for research in the area of Encrypted Traffic Analysis. It is available for download from IEEE Dataport: https://ieee-dataport.org/documents/encrypted-mobile-instant-messaging-traffic-dataset
- Contact
- Nur Zincir-Heywood
- nzincirh@dal.ca
- Research area(s)
- Encrypted Traffic Classification, Activity Detection, Traffic Analysis
- Technical features
The encrypted traffic collected from these applications is stored in individual .pcap files. Flows are extracted from these .pcap files using the Tranalyzer tool. The flow dataset for each IMA is also included in this dataset.
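As a quick sanity check after downloading, a capture file can be inspected without extra dependencies. The following is a minimal sketch, assuming the files use the classic libpcap format rather than pcapng (the description above does not say which). `summarize_pcap` is a hypothetical helper name and is not part of the dataset or of Tranalyzer.

```python
import struct

# Magic numbers of the classic libpcap file format, keyed by their on-disk bytes.
_MAGICS = {
    b"\xd4\xc3\xb2\xa1": "<",  # little-endian, microsecond timestamps
    b"\xa1\xb2\xc3\xd4": ">",  # big-endian, microsecond timestamps
    b"\x4d\x3c\xb2\xa1": "<",  # little-endian, nanosecond timestamps
    b"\xa1\xb2\x3c\x4d": ">",  # big-endian, nanosecond timestamps
}


def summarize_pcap(stream):
    """Return (packet_count, captured_bytes) for a classic pcap byte stream."""
    header = stream.read(24)  # the global header is always 24 bytes
    if len(header) < 24 or header[:4] not in _MAGICS:
        raise ValueError("not a classic pcap file (it may be pcapng)")
    endian = _MAGICS[header[:4]]

    packets = 0
    captured = 0
    while True:
        record = stream.read(16)  # per-packet record header
        if len(record) < 16:
            break
        _ts_sec, _ts_frac, incl_len, _orig_len = struct.unpack(endian + "IIII", record)
        payload = stream.read(incl_len)
        if len(payload) < incl_len:
            break  # truncated final record, stop counting
        packets += 1
        captured += incl_len
    return packets, captured
```

The function takes any binary stream, so a downloaded capture can be passed as `summarize_pcap(open(path, "rb"))`, where `path` points to one of the .pcap files. It only counts packets and bytes; flow extraction still requires a tool such as Tranalyzer.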
Encrypted mobile traffic that does not result from any IMA was also collected. Such traffic can be used to test whether IMA and non-IMA traffic are distinguishable. In particular, traffic resulting from web browsing, video streaming, and sending e-mails was collected. All background traffic that does not correspond to any of these activities was saved as well. These sets of encrypted traffic are also saved as .pcap files, and their flow datasets are included in this dataset.
The text conversations used to generate this dataset are included to support reproduction. Number of flows per case:

| (non)IMA name | Number of flows | Total Size |
| ----------- | ----------- | ---------- |
| Teams | 40562 | 526 Mb |
| Discord | 8996 | 99 Mb |
| Messenger | 8904 | 85 Mb |
| Signal | 10804 | 195 Mb |
| Telegram | 12740 | 154 Mb |
| WhatsApp | 6636 | 17 Mb |
| E-mail | 1120 | 14 Mb |
| Web-Browsing | 9040 | 357 Mb |
| Streaming | 1442 | 154 Mb |
| Background | 9188 | 126 Mb |
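
The flow counts above are uneven across classes, which matters when training a classifier. The sketch below is one possible way to quantify that imbalance. It copies the counts from the table and computes per-class shares, inverse-frequency class weights, and the IMA versus non-IMA split. The weighting scheme and the helper names are illustrative assumptions, not the method used by the dataset authors.

```python
# Flow counts copied from the table above.
FLOWS = {
    "Teams": 40562,
    "Discord": 8996,
    "Messenger": 8904,
    "Signal": 10804,
    "Telegram": 12740,
    "WhatsApp": 6636,
    "E-mail": 1120,
    "Web-Browsing": 9040,
    "Streaming": 1442,
    "Background": 9188,
}

# Classes that do not come from an instant messaging application.
NON_IMA = {"E-mail", "Web-Browsing", "Streaming", "Background"}


def class_shares(flows):
    """Fraction of all flows that belongs to each class."""
    total = sum(flows.values())
    return {name: count / total for name, count in flows.items()}


def balanced_weights(flows):
    """Inverse-frequency weights: total / (n_classes * count)."""
    total = sum(flows.values())
    n_classes = len(flows)
    return {name: total / (n_classes * count) for name, count in flows.items()}


# Totals for the binary IMA vs. non-IMA task.
ima_total = sum(count for name, count in FLOWS.items() if name not in NON_IMA)
non_ima_total = sum(count for name, count in FLOWS.items() if name in NON_IMA)
```

With these counts, the six IMA classes account for 88,642 flows and the four non-IMA classes for 20,790, so a binary IMA versus non-IMA experiment is also imbalanced and may benefit from weighting or resampling.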
File Naming Convention: Files named {IMA-name}_encrypted_traffic.pcap contain the encrypted traffic resulting from the corresponding IMA. Similarly, files named {IMA-name}_encrypted_traffic_flows.txt contain the text-based description of the flows we used to build our models. No timestamps are added, as they are already contained in the .pcap files.
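The naming convention can be applied programmatically when loading the data. The following is a minimal sketch; `dataset_files` and the `root` directory are hypothetical. The exact spelling of each `{IMA-name}` prefix in the archive, and whether the non-IMA captures follow the same pattern, are not stated here and should be checked against the downloaded files.

```python
from pathlib import Path


def dataset_files(name, root="."):
    """Build the expected capture and flow file paths for one traffic source.

    `name` is the {IMA-name} prefix; its exact spelling in the archive is an
    assumption and should be verified against the downloaded files.
    """
    root = Path(root)
    return {
        "pcap": root / f"{name}_encrypted_traffic.pcap",
        "flows": root / f"{name}_encrypted_traffic_flows.txt",
    }
```

For example, `dataset_files("Signal", "data")` returns paths ending in `Signal_encrypted_traffic.pcap` and `Signal_encrypted_traffic_flows.txt` under a `data` directory.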
- Integration constraints
The data can be used for model training for Machine Learning or Deep Learning experiments.
- Targeted customer(s)
Researchers in the area of Network Traffic Analysis, from either academia or industry.
- Conditions for reuse
The terms of reuse are dictated by the IEEE Dataport licensing agreement.
- Confidentiality
- Public
- Publication date
- 30-07-2023
- Involved partners
- Dalhousie University (CAN)