Federated Learning (FL) with Differential Privacy (DP) is an advanced approach that allows training models across decentralized data sources while ensuring individual data privacy. To facilitate research and development in this field, several datasets are commonly used. Here are some of the best datasets for federated learning with differential privacy:

MNIST (Modified National Institute of Standards and Technology database): Description: A large database of handwritten digits commonly used for training image processing systems. Usage: Due to its simplicity and well-understood nature, MNIST is often used for initial testing of FL and DP algorithms (a client-partitioning sketch appears after the list below). Link: MNIST Dataset
CIFAR-10 and CIFAR-100 (Canadian Institute For Advanced Research): Description: CIFAR-10 consists of 60,000 32x32 color images in 10 classes, with 6,000 images per class; CIFAR-100 is similar but has 100 classes with 600 images each. Usage: Used to evaluate the performance of FL and DP algorithms on more complex image classification tasks. Link: CIFAR-10 and CIFAR-100
Fashion-MNIST: Description: A dataset of Zalando's article images, intended as a drop-in replacement for the original MNIST dataset but with greater classification difficulty. Usage: Provides a more challenging benchmark for image classification tasks in FL and DP settings. Link: Fashion-MNIST
EMNIST: Description: An extension of MNIST that adds handwritten letters alongside digits. Usage: Useful for FL tasks that go beyond digit classification to include character recognition. Link: EMNIST Dataset
Shakespeare: Description: A text dataset derived from the works of William Shakespeare, used for text prediction tasks. Usage: Commonly used for evaluating FL algorithms on natural language processing (NLP) tasks. Link: Shakespeare Dataset
Google Landmark Dataset: Description: A large-scale dataset for landmark recognition and retrieval. Usage: Suitable for evaluating FL algorithms on large-scale image recognition tasks. Link: Google Landmark Dataset
LFW (Labeled Faces in the Wild): Description: A database of face photographs designed for studying the problem of unconstrained face recognition. Usage: Good for experimenting with privacy-preserving techniques in face recognition tasks. Link: LFW Dataset
PUMS (Public Use Microdata Sample): Description: Datasets provided by the U.S. Census Bureau that contain individual records from the census, anonymized to protect privacy. Usage: Ideal for socio-economic and demographic research under FL and DP frameworks. Link: PUMS Dataset
Texas Hospital Discharge Data: Description: Hospital discharge data including patient demographics, diagnoses, and treatments. Usage: Useful for healthcare-related federated learning scenarios that require robust privacy protections (a DP-SGD training sketch appears after this list). Link: Texas Hospital Discharge Data
MovieLens: Description: A collection of datasets from the MovieLens website containing user ratings of movies. Usage: Suitable for recommendation system research, allowing evaluation of FL and DP methods in collaborative filtering. Link: MovieLens Dataset
FEMNIST (Federated Extended MNIST): Description: A federated version of EMNIST in which the handwritten characters are partitioned by writer, so each writer naturally acts as a client. Usage: Specifically designed for federated learning research, making it highly relevant for experiments with DP (see the federated loading sketch after this list). Link: FEMNIST Dataset
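Centralized benchmarks such as MNIST, CIFAR-10/100, and Fashion-MNIST have no inherent client structure, so federated experiments usually simulate clients by partitioning the training set. Below is a minimal sketch of one common approach, a Dirichlet-based non-IID split; the client count, the concentration parameter alpha, and the use of torchvision for loading are illustrative assumptions rather than part of any dataset above.

```python
import numpy as np
from torchvision import datasets, transforms

def dirichlet_partition(labels, num_clients=100, alpha=0.5, seed=0):
    """Split example indices into non-IID client shards.

    For each class, its examples are divided among clients according to
    proportions drawn from a Dirichlet(alpha) distribution; smaller alpha
    gives more skewed (more non-IID) clients.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for cls in np.unique(labels):
        cls_idx = np.where(labels == cls)[0]
        rng.shuffle(cls_idx)
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        # Cumulative proportions become split points within this class.
        split_points = (np.cumsum(proportions)[:-1] * len(cls_idx)).astype(int)
        for client_id, shard in enumerate(np.split(cls_idx, split_points)):
            client_indices[client_id].extend(shard.tolist())
    return client_indices

if __name__ == "__main__":
    # MNIST here; CIFAR-10/100 or Fashion-MNIST work the same way.
    train = datasets.MNIST("./data", train=True, download=True,
                           transform=transforms.ToTensor())
    shards = dirichlet_partition(train.targets, num_clients=100, alpha=0.5)
    print("client 0 holds", len(shards[0]), "examples")
```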
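Datasets that already ship with a per-user split, such as Federated EMNIST, Shakespeare, and FEMNIST, can be loaded directly in federated form. The sketch below assumes TensorFlow Federated is installed and uses its tff.simulation.datasets helpers; the element keys and download behavior should be verified against the installed version.

```python
import tensorflow_federated as tff

# Federated EMNIST: images are already grouped by the writer who produced them.
emnist_train, emnist_test = tff.simulation.datasets.emnist.load_data()
print("number of writers (clients):", len(emnist_train.client_ids))

# Each client id maps to a tf.data.Dataset of that client's examples.
first_client = emnist_train.client_ids[0]
client_ds = emnist_train.create_tf_dataset_for_client(first_client)
for example in client_ds.take(1):
    print(example["label"], example["pixels"].shape)

# The Shakespeare dataset is exposed the same way, with one client per speaking role.
shakespeare_train, shakespeare_test = tff.simulation.datasets.shakespeare.load_data()
```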
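On the privacy side, a common pattern is to make each client's local update differentially private with DP-SGD (per-example gradient clipping plus Gaussian noise). The following sketch assumes PyTorch with the Opacus library; the toy model, noise multiplier, and clipping norm are illustrative choices, not recommended settings.

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

# Toy stand-in for one client's local MNIST shard (28x28 grayscale images).
images = torch.rand(512, 1, 28, 28)
labels = torch.randint(0, 10, (512,))
loader = DataLoader(TensorDataset(images, labels), batch_size=64)

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
optimizer = optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

# Attach DP-SGD: per-example gradient clipping plus Gaussian noise.
privacy_engine = PrivacyEngine()
model, optimizer, loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=loader,
    noise_multiplier=1.0,   # illustrative; tune for the target epsilon
    max_grad_norm=1.0,      # per-example gradient clipping bound
)

for epoch in range(1):          # one local epoch of the client update
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()

# Report the privacy budget spent so far (delta is an illustrative choice).
print("epsilon:", privacy_engine.get_epsilon(delta=1e-5))
```

In a full FL pipeline, the resulting noisy local updates would then be aggregated by the server, for example with federated averaging, optionally combined with secure aggregation.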
When selecting datasets for federated learning with differential privacy, it’s crucial to consider the specific research goals, such as the type of data (images, text, numerical), the complexity of the task, and the need for privacy preservation. These datasets provide a good starting point for exploring various aspects of FL and DP in different application domains.