- Navigating the Sea of Datasets
- The All-in-One Bookmark: Your Gateway to Knowledge
- What to Expect in This Resource
- Where to Find Your Next Dataset
- Conclusion: Your Data Journey Begins Here
In machine learning and deep learning, access to high-quality datasets can make or break a project. Whether you’re a seasoned data scientist or a curious beginner, having a reliable go-to list of datasets is like having a treasure trove at your fingertips.
Navigating the Sea of Datasets
The world of datasets is vast and varied, ranging from text and image datasets to those tailored for specific domains like healthcare and finance. This post presents a list of web links to diverse datasets that cater to a wide range of machine learning and deep learning applications.
The All-in-One Bookmark: Your Gateway to Knowledge
I’ve curated a list that spans from well-known platforms like Kaggle and TensorFlow to hidden gems like the OECD Database and the UK government open data repository. This collection is for enthusiasts and professionals alike, providing a diverse array of datasets to fuel your next AI endeavor.
What to Expect in This Resource
- Efficiency: Save time searching for datasets by having a curated list at your disposal.
- Versatility: Whether you’re working on computer vision, natural language processing, or any other ML/DL task, you’ll find relevant datasets here.
- Discover New Horizons: Uncover datasets you might not have stumbled upon otherwise, broadening your scope and potential applications.
Where to Find Your Next Dataset
| Name | Description | Web Link |
|---|---|---|
| 20 Newsgroups | A collection of approximately 20,000 newsgroup documents. | 20 Newsgroups |
| Amazon Product Reviews | Large collection of Amazon product reviews. | Amazon Product Reviews |
| American Economic Association (AEA) | Economic datasets provided by the American Economic Association. | American Economic Association (AEA) |
| AWS datasets | Public datasets provided by Amazon Web Services. | AWS datasets |
| Awesome public datasets | Curated list of public datasets on GitHub. | Awesome public datasets |
| BDD Data | Berkeley DeepDrive Dataset for autonomous vehicles. | BDD Data |
| Blog Authorship Corpus | A dataset of blog posts for authorship attribution research. | Blog Authorship Corpus |
| BuzzFeed News Datasets | Datasets released by BuzzFeed News. | BuzzFeed News Datasets |
| Cancer Imaging Archive | Medical imaging datasets for cancer research. | Cancer Imaging Archive |
| CelebFaces | Large-scale celebrity faces dataset. | CelebFaces |
| Chars74k | A labeled dataset of 74,000 images of characters. | |
| ChestX-Det Dataset | Large-scale dataset for chest X-ray detection. | ChestX-Det Dataset |
| CheXpert Competitions Dataset | Chest X-ray images for disease detection. | CheXpert Competitions Dataset |
| CIFAR-10 | 60,000 32×32 color images in 10 different classes, with 6,000 images per class. | CIFAR-10 |
| Cityscapes Dataset | High-quality semantic understanding of urban street scenes. | Cityscapes Dataset |
| COIL100 | A dataset for object recognition containing 100 different objects. | COIL100 |
| Comma Dataset | Dataset for developing algorithms for self-driving cars. | Comma Dataset |
| Common crawl data | Open repository of web crawl data available for research purposes. | Common crawl data |
| CompCars | Dataset for car recognition and fine-grained categorization. | CompCars |
| Components.one | Miscellaneous datasets. | Components.one |
| COVID Datasets | Collection of datasets related to COVID-19. | COVID Datasets, Johns Hopkins University, Our World in Data, COVID X-Ray, COVID-19 image dataset, COVID-19 CT scans |
| Crowdflower Twitter Airline Sentiment | Tweets about major U.S. airlines labeled for sentiment analysis. | Twitter Airline Sentiment |
| Cycling data | Dataset of Santander bicycle rentals in London. | Cycling data |
| Data USA | Visualizations and data about the United States. | Data USA |
| Dataset search engine from Google | Search engine specifically for datasets provided by Google. | Dataset search engine from Google |
| Datasets by QuantumStat | Various datasets provided by QuantumStat. | QuantumStat Datasets |
| Datasets for streaming | Curated list of public streaming datasets on GitHub. | Datasets for streaming |
| Datasets from Azure | Public datasets provided by Microsoft Azure. | Datasets from Azure |
| Datasets from BigQuery | Public datasets provided by Google BigQuery. | Datasets from BigQuery |
| Deeplake | Data lake for deep learning from Activeloop, with access to public datasets. | Deeplake |
| DeepLesion | CT images for lesion detection. | DeepLesion |
| Drive.ai Dataset | Dataset for autonomous vehicles provided by Drive.ai. | Drive.ai Dataset |
| English Corpora Wiki | Information and links to various English corpora for linguistic research. | English Corpora Wiki |
| Enron Email Dataset | A dataset of emails from the Enron Corporation. | Enron Email Dataset |
| European Data Portal | Single access point to datasets from European countries. | European Data Portal |
| European statistics datasets | Datasets provided by Eurostat, the statistical office of the European Union. | European statistics datasets |
| Facial Recognition Technology Database | A facial recognition dataset from NIST. | FERET |
| Fashion-MNIST | Fashion product images dataset. | Fashion-MNIST |
| Financial Times | Financial news and market data. | Financial Times Markets Data |
| Fishnet Open Images Dataset | FishNet.ai dataset for fish species recognition. | Fishnet Open Images Dataset |
| Flowers | Oxford 102 Flowers dataset for flower recognition. | Flowers |
| Free Music Archive (FMA) | A dataset for music analysis containing audio and metadata for over 100,000 tracks. | FMA – Free Music Archive |
| GitHub archive | Archive of GitHub activity data. | GitHub archive |
| Global Financial Development (GFD) | World Bank’s Global Financial Development Database. | Global Financial Development (GFD) |
| Google AudioSet | Large-scale dataset for audio event recognition. | Google AudioSet |
| Google Books Ngram Viewer | Ngram datasets from a large collection of books scanned by Google. | Google Books Ngram Viewer |
| Google Cloud Public Datasets | Various datasets provided by Google Cloud Platform. | Google Cloud Public Datasets |
| Google Open Images | A dataset of images with labeled objects, publicly released by Google. | Google Open Images |
| GroupLens MovieLens Dataset | Movie ratings and recommendations dataset. | MovieLens |
| HCI Traffic Light Recognition Dataset | Dataset for traffic light recognition in autonomous vehicles. | Traffic Light Recognition Dataset |
| Household Objects | Annotated Image Dataset of Household Objects by the RoboFeiHome Team. | Household Objects |
| ImageClef | Image retrieval and classification evaluation initiative. | ImageClef |
| ImageNet | An image database organized according to the WordNet hierarchy. | ImageNet |
| IMDB Movie Reviews Dataset | Large dataset of movie reviews labeled for sentiment analysis. | IMDb Movie Reviews Dataset |
| IMF Data | Economic and financial data from the International Monetary Fund. | IMF Data |
| ISMIR Tempo Contest Dataset | Dataset for tempo estimation in music. | ISMIR Tempo Contest Dataset |
| India government open data | Open data provided by the Indian government. | India government open data |
| Indoor Scene Recognition | Dataset for indoor scene recognition collected by the MIT Vision group. | Indoor Scene Recognition |
| Jester Jokes Dataset | Dataset of jokes labeled with their humorousness. | Jester Jokes Dataset |
| Kaggle | Data science competitions and datasets. | Kaggle |
| KDnuggets | Collection of data repositories from agriculture and finance to government. | KDnuggets part 1 |
| KDnuggets | Collection of data repositories from healthcare to transportation. | KDnuggets part 2 |
| Kairos Facial Recognition Databases | Facial recognition datasets by Kairos. | Kairos Databases |
| Kinetics-700 | Large-scale video dataset for action recognition. | Kinetics-700 |
| Labeled Faces in the Wild (LFW) | A dataset of labeled face photographs collected from the web for unconstrained face recognition. | Labeled Faces in the Wild (LFW) |
| Labelme | Image datasets labeled using the Labelme annotation tool. | Labelme |
| Legal Case Reports Dataset | Legal case reports dataset for text analysis. | Legal Case Reports Dataset |
| Lego Bricks | Dataset of LEGO brick images. | Lego Bricks |
| LISA Traffic Light Dataset | Traffic light dataset for autonomous vehicles. | LISA Traffic Light Dataset |
| LSUN | Large-scale scene understanding dataset. | LSUN |
| MedPix | Medical image database for teaching and research. | MedPix |
| Microsoft COCO | Common Objects in Context dataset for computer vision. | Microsoft COCO |
| Microsoft Research Papers Dataset | A dataset of academic papers from Microsoft Research. | Microsoft Research Papers Dataset |
| Million songs dataset | Dataset for music analysis. | Million songs dataset |
| MIMIC-III Clinical Database | Medical Information Mart for Intensive Care III (MIMIC-III) clinical database. | MIMIC-III |
| MNIST | Handwritten digit database. | MNIST |
| Mozilla Common Voice Dataset | Multilingual dataset of human voices for training speech recognition systems. | Mozilla Common Voice |
| MovieLens | Movie ratings dataset for collaborative filtering research. | MovieLens |
| MS-Celeb-1M | Large-scale celebrity face recognition dataset released by Microsoft. | |
| MURA Dataset | Musculoskeletal Radiographs dataset for abnormality detection. | MURA Dataset |
| MUSAN Dataset | Corpus of music, speech, and noise recordings. | MUSAN Dataset |
| MRNet Dataset | Knee MRI dataset from the Stanford ML Group for detecting knee abnormalities. | MRNet Dataset |
| NASA’s EarthData | Earth observation datasets provided by NASA. | NASA’s EarthData |
| Natural Images Dataset | A collection of natural images for object recognition. | Natural Images Dataset |
| Natural Language Toolkit (NLTK) Corpora | Datasets for natural language processing from the NLTK library. | NLTK Corpora |
| New Zealand AI Public Datasets | Public datasets curated by New Zealand AI community. | NZ AI Public Datasets |
| NIH Chest X-rays | Chest X-ray images for disease classification. | NIH Chest X-rays |
| NLP Progress Data | A collection of datasets for natural language processing tasks. | NLP Progress Data |
| NLP Stanford Sentiment Analysis Code | Code for Stanford’s sentiment analysis tools. | Stanford Sentiment Code |
| NLP Stanford Sentiment140 Dataset | A collection of tweets labeled for sentiment analysis. | Sentiment140 Twitter Dataset |
| OECD Database | Data from the Organisation for Economic Co-operation and Development. | OECD Database |
| Open Neuro | A platform for sharing and analyzing neuroscience datasets. | Open Neuro |
| OpenML | Platform for sharing and organizing machine learning datasets. | OpenML |
| Open Images Dataset (Waymo) | Dataset for autonomous vehicles by Waymo (formerly Google’s self-driving car project). | Waymo Open Images |
| Our World in Data | A global dataset covering a wide range of topics related to our world. | Our World in Data |
| Oxford-IIIT Pet Images Dataset | Dataset of images of pets for fine-grained classification. | Oxford-IIIT Pet Images Dataset |
| Oxford RobotCar Dataset | Dataset for autonomous vehicles from Oxford’s RobotCar project. | Oxford RobotCar Dataset |
| Papers with code | Repository of research papers with associated code implementations and datasets. | Papers with code |
| Places | Large-scale scene-centric dataset. | Places |
| ProPublica Datasets | Datasets released by ProPublica for investigative journalism. | ProPublica Datasets |
| Public datasets offered by different GCP services | Various datasets provided by Google Cloud Platform services. | Public datasets by GCP |
| Quantitative Plant dataset | Datasets for plant science research. | Quantitative Plant dataset |
| Reddit Jeopardy Questions Dataset | Dataset containing 200,000 Jeopardy questions in a JSON file. | Jeopardy Questions Dataset |
| Re3Data | A global registry of research data repositories. | Re3Data |
| Scikit-Learn | Simple and efficient tools for data mining and data analysis. | Scikit-Learn |
| SIIM Medical Images Dataset | Medical imaging dataset for various radiology tasks. | SIIM Medical Images |
| SMS Spam Collection Dataset | A dataset of SMS messages labeled as spam or non-spam. | SMS Spam Collection Dataset |
| Snap Amazon Product Reviews Dataset | Amazon product reviews dataset. | Amazon Product Reviews |
| Soccer Data | Various datasets related to soccer and football. | Soccer Data |
| Stanford Dogs Dataset | Large dataset of annotated images of 120 breeds of dogs. | Stanford Dogs Dataset |
| SWC Speaker Recognition Dataset | Speaker verification dataset for speech recognition. | SWC Speaker Recognition Dataset |
| The UK Data Service | Data collections for research and teaching purposes about economic and social data from the Economic and Social Data Service (ESDS), Census Programme, and others. Some international data sets are included as well. | The UK Data Service |
| The US National Center for Education Statistics | Data on educational institutions and education demographics in the U.S. and internationally. | The US National Center for Education Statistics |
| TensorFlow | An open-source machine learning framework. | TensorFlow |
| Quandl | Financial and economic data. | Quandl |
| UC Irvine Machine Learning Repository | Diverse datasets for machine learning. | UCI ML Repository |
| UCI Legal Case Reports Dataset | Legal case reports dataset from the UCI Machine Learning Repository. | Legal Case Reports Dataset |
| UK government open data | Open data provided by the UK government. | UK government open data |
| US government open data | Open data provided by the US government. | US government open data |
| US Healthcare Data | Statistics and datasets about population health, diseases, drugs, and health plans collected from the FDA and USDA Food composition databases. | US Healthcare Data |
| Visual Data | A search engine for computer vision datasets. | Visual Data |
| Visual Genome | Visual Genome is a dataset, a knowledge base, an ongoing effort to connect structured image concepts to language. | Visual Genome |
| VisualQA | A dataset for visual question answering. | VisualQA |
| VGG | Visual Geometry Group’s datasets for computer vision. | VGG |
| VGG Face2 | VGGFace2 dataset for face recognition. | VGG Face2 |
| VoxForge | Open speech dataset that allows users to share speech data. | VoxForge |
| xView | High-resolution overhead imagery dataset for object detection in complex environments. | xView |
| Wikipedia | The free encyclopedia with structured data. | Wikipedia |
| World Bank | The open data contains data concerning population demographics, macroeconomic data, and key indicators for development. | World Bank |
| Yelp Dataset | User reviews and ratings for businesses on Yelp. | Yelp Dataset |
| YFCC100M | Yahoo Flickr Creative Commons 100 Million dataset. | YFCC100M |
| YouTube-8M | A large-scale video dataset. | YouTube-8M |
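
With a list this long, it can help to search it by keyword instead of scrolling. The snippet below is a minimal sketch of that idea, assuming you have copied a few rows of the table into a string. The names `DatasetEntry`, `parse_rows`, and `search` are hypothetical helpers invented for this illustration, not part of any library, and the link column is left out for brevity.

```python
from typing import List, NamedTuple


class DatasetEntry(NamedTuple):
    """One row of the table: dataset name plus its short description."""
    name: str
    description: str


# A few rows copied from the table above (link column omitted).
TABLE_ROWS = """\
| Enron Email Dataset | A dataset of emails from the Enron Corporation. |
| Fashion-MNIST | Fashion product images dataset. |
| MNIST | Handwritten digit database. |
| SMS Spam Collection Dataset | A dataset of SMS messages labeled as spam or non-spam. |
| Stanford Dogs Dataset | Large dataset of annotated images of 120 breeds of dogs. |
"""


def parse_rows(markdown: str) -> List[DatasetEntry]:
    """Turn pipe-delimited table rows into DatasetEntry records."""
    entries = []
    for line in markdown.splitlines():
        # Drop the outer pipes, then split the remaining text into cells.
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) >= 2 and cells[0]:
            entries.append(DatasetEntry(cells[0], cells[1]))
    return entries


def search(entries: List[DatasetEntry], keyword: str) -> List[DatasetEntry]:
    """Return entries whose name or description contains the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [
        entry
        for entry in entries
        if needle in entry.name.lower() or needle in entry.description.lower()
    ]


if __name__ == "__main__":
    catalog = parse_rows(TABLE_ROWS)
    for entry in search(catalog, "images"):
        print(f"{entry.name}: {entry.description}")
```

Searching these five rows for "images" matches Fashion-MNIST and the Stanford Dogs Dataset. A plain substring match keeps the sketch short; for the full table you might prefer tagging each entry by task (vision, NLP, audio) and filtering on the tag.
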
Conclusion: Your Data Journey Begins Here
As you embark on your machine learning and deep learning projects, let this resource be your guiding light. Bookmark this page, and let the power of diverse, high-quality datasets elevate your work to new heights.
Happy exploring, and may your data be ever insightful!