12 Amazing Sources to Find Free Datasets for Your Next ML Project

Or Hillel
Published in Startups Nation
4 min read · Nov 15, 2023



With the abundance of free datasets available online, ranging from extensive governmental and economic records to niche areas like Major League Baseball statistics and video game sales, the potential for insightful data science projects is vast. This guide aims to help you navigate these resources, whether you’re assembling a portfolio project or honing your SQL and data analysis skills.

12 Sources for Free Datasets Anyone Can Use

Iguazio: Top 22 Free Healthcare Datasets for Machine Learning

Provides an overview of 22 open datasets for developing and training machine learning models in healthcare. The article presents them as valuable starting points for data scientists and engineers, especially since open, free healthcare data can be hard to find.

Tableau: Free Public Data Sets For Analysis

Highlights the importance of data in decision-making, emphasizing its role in providing insight into the implications of choices at a granular level. One example on the page is a COVID-19 data visualization, which is representative of the kinds of visualizations possible with these free datasets.

Interview Query: 90+ Free Datasets for Data Science

Provides a comprehensive overview and categorization of various free datasets useful for data science projects. These datasets cover a broad spectrum of subjects, ranging from governmental and economic data to more specific topics such as Major League Baseball (MLB) statistics and video game sales.

Iguazio: Best 10 Free Datasets for Manufacturing

Lists 10 open manufacturing datasets and data sources for training ML models. The article stresses how much open, free datasets matter to data scientists and engineers developing ML models for manufacturing, especially given how hard manufacturing data is to access.

365 Data Science: Top 10 Free Dataset Resources for Data Science Projects in 2023

The page from 365 Data Science lists its top 10 free dataset resources for data science projects in 2023 and guides readers, especially beginners, through each one. The resources include well-known platforms such as Kaggle, Google Dataset Search, GitHub, World Bank Open Data, Data.world, DataHub, Humanitarian Data Exchange, FiveThirtyEight, UCI Machine Learning Repository, and Academic Torrents.

Iguazio: Best 13 Free Financial Datasets for Machine Learning

Provides a curated list of 13 open financial and economic datasets. These datasets are valuable resources for data scientists and engineers working on developing and training machine learning models in the finance sector.

Harvard College: Harvard Dataverse

Find datasets across research fields, preview metadata, and download files from Harvard Dataverse.

Iguazio: 23 Best Free NLP Datasets for Machine Learning

These datasets are categorized into various groups, including Q&A, Reviews and Ratings, Sentiment Analysis, Synonyms, Emails, Long-form Content, and Audio. They are intended for data scientists and professionals to use in training their NLP models for a variety of applications.

Column Five: 100+ of the Best Free Data Sources For Your Next Project

Recognizing how hard it can be to find free data, the page offers a curated list of over 100 free data sources from reputable organizations worldwide. The sources are grouped by category, so users can quickly find the specific data they need for their projects.

Yeshiva University Libraries: Datasets for Computer Science Capstone Projects

The page from Yeshiva University Libraries provides a comprehensive list of datasets for Computer Science Capstone Projects. It includes various categories like General Datasets, Subject-Specific Datasets, Datasets Provided by Cloud Providers, and additional options for further exploration.

G2: 50 Best Open Data Sources Ready to be Used Right Now

These sources are categorized under various headers including government and global data, financial and economic data, crime and drug data, health and scientific data, academic data, environmental data, business directory data, media and journalism, marketing and social media, and miscellaneous data.

Wikipedia: List of Datasets for Machine-learning Research

The Wikipedia page provides an extensive compilation of datasets applied in ML research. They include high-quality labeled training datasets for supervised and semi-supervised learning algorithms, which are typically challenging and expensive to produce due to the extensive time required for data labeling.

Should You Trust This Data Source?

Reputation: A source’s credibility can often be gauged by its reputation. Government and academic institutions typically provide trustworthy data.

Transparency: Reliable sources are usually transparent about their data collection and update methods. A lack of this information can be a red flag.

Updates: Regular updates to a dataset can be a good sign of its reliability and relevance.
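
One way to put the update check into practice is to measure how old the newest record in a dataset is. The snippet below is a minimal sketch, not a standard method: the sample dates, the helper name days_since_last_update, and the 180-day threshold are all assumptions made for illustration.

```python
from datetime import date


def days_since_last_update(record_dates, today):
    """Return the number of days between the newest record and `today`."""
    newest = max(date.fromisoformat(d) for d in record_dates)
    return (today - newest).days


# Hypothetical date column pulled from a downloaded dataset.
sample_dates = ["2023-01-15", "2023-06-30", "2023-09-01"]

age_in_days = days_since_last_update(sample_dates, today=date(2023, 11, 15))

# The 180-day cutoff is an arbitrary assumption; pick one that fits your topic.
is_stale = age_in_days > 180
print(age_in_days, is_stale)
```

A fast-moving topic such as market prices may call for a much shorter cutoff than a slow-moving one such as census data.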

Can the Dataset Be Inaccurate?

Data Collection Methods: Investigate how the data was collected. Biases in the methodology can lead to skewed results.

Historical Changes: Remember that data is a snapshot of the past. As conditions change, so does the relevancy of historical data.

Cross-Verification: Whenever possible, verify the data with other credible sources to ensure accuracy.
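
Cross-verification can be partly automated when two sources report the same quantity under a shared key. The snippet below is a rough sketch of that idea: the figures are made up, the function name find_discrepancies is hypothetical, and the 5% tolerance is an assumption rather than an accepted threshold.

```python
def find_discrepancies(primary, reference, rel_tolerance=0.05):
    """Compare two {key: number} mappings.

    Returns (mismatched, missing): keys whose values differ by more than
    rel_tolerance, and keys in primary that reference does not contain.
    """
    mismatched = {}
    missing = []
    for key, value in primary.items():
        if key not in reference:
            missing.append(key)
            continue
        ref_value = reference[key]
        # Relative difference against the larger magnitude; skip if both are 0.
        denom = max(abs(value), abs(ref_value))
        if denom and abs(value - ref_value) / denom > rel_tolerance:
            mismatched[key] = (value, ref_value)
    return mismatched, missing


# Made-up figures for illustration only.
source_a = {"region_1": 1000, "region_2": 2500, "region_3": 400}
source_b = {"region_1": 1010, "region_2": 3100}

mismatched, missing = find_discrepancies(source_a, source_b)
print(mismatched, missing)
```

A mismatch does not say which source is wrong. It only marks the records worth investigating by hand.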

Using Free Datasets for Projects

When choosing a dataset for your project, consider its relevance to your goals. Are you trying to demonstrate a specific skill, like data cleaning or complex SQL queries? Pick datasets that allow you to showcase these abilities. Working with a variety of datasets can also broaden your experience and enhance your analytical skills.
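
As a small illustration of combining data cleaning with SQL practice, the sketch below loads a CSV into an in-memory SQLite database and runs an aggregate query. The rows are invented stand-ins for a downloaded file, and the table and column names (sales, title, genre, units_sold) are assumptions, not taken from any of the sources above.

```python
import csv
import io
import sqlite3

# Made-up rows standing in for a downloaded CSV file.
RAW_CSV = """title,genre,units_sold
Game A,Puzzle,120
Game B,Racing,
Game C,Puzzle,90
Game D,Racing,200
"""


def load_rows(csv_text):
    """Parse CSV text and drop rows with a missing units_sold value."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        (row["title"], row["genre"], int(row["units_sold"]))
        for row in reader
        if row["units_sold"].strip()
    ]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (title TEXT, genre TEXT, units_sold INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", load_rows(RAW_CSV))

# Total units per genre, largest first.
totals = conn.execute(
    "SELECT genre, SUM(units_sold) AS total FROM sales "
    "GROUP BY genre ORDER BY total DESC"
).fetchall()
conn.close()
print(totals)
```

Dropping incomplete rows is only one cleaning choice. In a portfolio project it is worth stating why you dropped, filled, or kept them.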
