Wang Chao, Alessandro Finamore, Lixuan Yang, Kevin Fauvel, Dario Rossi
Abstract
The recent success of Artificial Intelligence (AI) is rooted in several concomitant factors, namely theoretical progress coupled with the practical availability of data and computing power. It is therefore not surprising that the lack of high-quality data is often recognized as one of the major factors limiting AI research in several domains, and the networking domain is no exception. Large companies have access to large data assets, which would constitute interesting benchmarks for algorithmic research in the broader scientific community. However, such datasets are private assets that are generally very difficult to share due to privacy or business sensitivity concerns.
Following numerous requests from the scientific community, we release AppClassNet, a commercial-grade dataset for benchmarking traffic classification and management methodologies. AppClassNet is significantly larger than the datasets generally available to the academic community in terms of both the number of samples and the number of classes, and reaches a scale similar to that of the popular ImageNet dataset commonly used in the computer vision literature.
To avoid leaking user- and business-sensitive information, we appropriately anonymized the dataset, while empirically showing that it still represents a relevant benchmark for algorithmic research. In this paper, we describe the public dataset as well as the steps we took to avoid leakage of sensitive information while retaining its relevance as a benchmark. We hope that AppClassNet can help other researchers address more complex commercial-grade problems in the broad field of traffic classification and management.