ITEA is the Eureka Cluster on software innovation

Instant Messaging Traffic Dataset for Training AI models

Project
20020 ENTA
Type
New service
Description

Encrypted traffic from six widely used Instant Messaging Applications (IMAs) was collected on an Android device. These applications are: 1. Microsoft Teams, 2. Discord, 3. Facebook Messenger, 4. Signal, 5. Telegram, and 6. WhatsApp. The encrypted traffic collected from these applications is stored as individual .pcap files.

The dataset can be used for research purposes in the area of Encrypted Traffic Analysis. The dataset is available for download from IEEE Dataport: https://ieee-dataport.org/documents/encrypted-mobile-instant-messaging-traffic-dataset

Contact
Nur Zincir-Heywood
Email
nzincirh@dal.ca
Research area(s)
Encrypted Traffic Classification, Activity Detection, Traffic Analysis
Technical features

The encrypted traffic collected from these applications is stored as individual .pcap files. Flows are extracted from these .pcap files using the Tranalyzer tool. The flow dataset for each IMA is also included in this dataset.
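As a quick sanity check before flow extraction, a capture file can be inspected with nothing but the Python standard library. The following is a minimal sketch, assuming the files use the classic libpcap format (not pcapng); the function name `count_pcap_packets` is hypothetical and is not part of the dataset or of Tranalyzer.

```python
import struct

# Magic numbers of the classic libpcap format
# (microsecond and nanosecond timestamp variants).
_PCAP_MAGICS = {0xA1B2C3D4, 0xA1B23C4D}


def count_pcap_packets(path):
    """Return (packet_count, captured_bytes) for a classic .pcap file."""
    with open(path, "rb") as f:
        header = f.read(24)  # the global header is 24 bytes long
        if len(header) < 24:
            raise ValueError("file too short for a pcap global header")
        # The magic number tells us the byte order used by the writer.
        if struct.unpack("<I", header[:4])[0] in _PCAP_MAGICS:
            endian = "<"
        elif struct.unpack(">I", header[:4])[0] in _PCAP_MAGICS:
            endian = ">"
        else:
            raise ValueError("not a classic pcap file (pcapng is not handled)")

        packets = 0
        captured = 0
        while True:
            record = f.read(16)  # each per-packet record header is 16 bytes
            if len(record) < 16:
                break
            _ts_sec, _ts_frac, incl_len, _orig_len = struct.unpack(
                endian + "IIII", record
            )
            data = f.read(incl_len)
            if len(data) < incl_len:
                break  # truncated final packet, stop counting
            packets += 1
            captured += incl_len
    return packets, captured
```

Calling `count_pcap_packets` on one of the downloaded .pcap files returns the number of packets and the number of captured bytes, which can be compared against the flow files as a rough consistency check.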

Encrypted mobile traffic that does not result from any IMA was also collected. Such traffic can be used to test whether IMA and non-IMA traffic are distinguishable. In particular, traffic resulting from web browsing, video streaming, and sending emails was collected. All background traffic that does not correspond to any of these activities was also saved. These sets of encrypted traffic are likewise saved as .pcap files, and their flow datasets are included in this dataset.
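For the IMA versus non-IMA question, one simple setup is to collapse the ten traffic classes into a binary label. The sketch below assumes the class names listed in this description; the helper `binary_label` and the 1/0 encoding are illustrative choices, not something prescribed by the dataset.

```python
# Class names as listed in this description, lowercased for matching.
IMA_CLASSES = {"teams", "discord", "messenger", "signal", "telegram", "whatsapp"}
NON_IMA_CLASSES = {"e-mail", "web-browsing", "streaming", "background"}


def binary_label(class_name):
    """Map a traffic class name to 1 (IMA) or 0 (non-IMA)."""
    key = class_name.strip().lower()
    if key in IMA_CLASSES:
        return 1
    if key in NON_IMA_CLASSES:
        return 0
    raise KeyError(f"unknown traffic class: {class_name!r}")
```

Every flow taken from an IMA capture would then carry label 1 and every flow from the e-mail, web-browsing, streaming, or background captures label 0.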

The text conversations used to generate this dataset are included for reproducibility. Number of flows and total size per traffic class:

| (Non-)IMA name | Number of flows | Total size |
| -------------- | --------------- | ---------- |
| Teams          | 40562           | 526 Mb     |
| Discord        | 8996            | 99 Mb      |
| Messenger      | 8904            | 85 Mb      |
| Signal         | 10804           | 195 Mb     |
| Telegram       | 12740           | 154 Mb     |
| WhatsApp       | 6636            | 17 Mb      |
| E-mail         | 1120            | 14 Mb      |
| Web-Browsing   | 9040            | 357 Mb     |
| Streaming      | 1442            | 154 Mb     |
| Background     | 9188            | 126 Mb     |
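The flow counts are strongly imbalanced, which matters when training a classifier. The sketch below uses the counts from the table to compute each class's share of all flows and a set of inverse-frequency class weights. The weighting formula (total flows divided by the number of classes times the class count) is a common balancing heuristic and an assumption here, not a method stated by the dataset authors.

```python
# Flow counts copied from the table in this description.
FLOW_COUNTS = {
    "Teams": 40562,
    "Discord": 8996,
    "Messenger": 8904,
    "Signal": 10804,
    "Telegram": 12740,
    "WhatsApp": 6636,
    "E-mail": 1120,
    "Web-Browsing": 9040,
    "Streaming": 1442,
    "Background": 9188,
}


def class_shares(counts):
    """Fraction of all flows that belongs to each class."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def balanced_weights(counts):
    """Inverse-frequency weights: total / (number_of_classes * class_count)."""
    total = sum(counts.values())
    k = len(counts)
    return {name: total / (k * n) for name, n in counts.items()}


if __name__ == "__main__":
    shares = class_shares(FLOW_COUNTS)
    weights = balanced_weights(FLOW_COUNTS)
    for name in FLOW_COUNTS:
        print(f"{name:13s} share={shares[name]:.3f} weight={weights[name]:.2f}")
```

The table sums to 109432 flows, of which 88642 (roughly 81%) come from the six IMAs, so a model that ignores the imbalance may favor the larger classes such as Teams.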

File Naming Convention: Each file named {IMA-name}_encrypted_traffic.pcap contains the encrypted traffic resulting from the corresponding IMA. Similarly, each file named {IMA-name}_encrypted_traffic_flows.txt contains the text-based description of the flows we used to build our models. No timestamps are added, as they are already contained in the .pcap files.
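The naming convention can be applied programmatically when loading the dataset. This is a minimal sketch: the helper names `expected_files` and `ima_name_from_file` are hypothetical, and the exact spelling used for {IMA-name} in the published files is an assumption that should be checked against the download.

```python
from pathlib import Path

PCAP_SUFFIX = "_encrypted_traffic.pcap"
FLOWS_SUFFIX = "_encrypted_traffic_flows.txt"


def expected_files(ima_name, root="."):
    """Build the capture and flow file paths for one traffic class."""
    root = Path(root)
    return {
        "pcap": root / f"{ima_name}{PCAP_SUFFIX}",
        "flows": root / f"{ima_name}{FLOWS_SUFFIX}",
    }


def ima_name_from_file(filename):
    """Recover the {IMA-name} part of a file name, or None if it does not match."""
    name = Path(filename).name
    for suffix in (FLOWS_SUFFIX, PCAP_SUFFIX):
        if name.endswith(suffix) and len(name) > len(suffix):
            return name[: -len(suffix)]
    return None
```

For example, `ima_name_from_file("data/Telegram_encrypted_traffic_flows.txt")` yields `"Telegram"`, which can serve directly as the class label for every flow read from that file.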

Integration constraints

The data can be used for model training for Machine Learning or Deep Learning experiments.

Targeted customer(s)

Researchers in the area of Network Traffic Analysis. Researchers can be from Academia or Industry.

Conditions for reuse

The terms of reuse are dictated by the IEEE Dataport licensing agreement.

Confidentiality
Public
Publication date
30-07-2023
Involved partners
Dalhousie University (CAN)