The Best (FREE) Data Repositories for Aspiring Data Scientists in 2020

 4 years ago
source link: https://towardsdatascience.com/the-best-free-data-repositories-for-aspiring-data-scientists-in-2020-886d8785ebac?gi=aa4fca623c20

A Quick Reference guide to data for any and every industry imaginable

Earlier this week, Google announced that its Dataset Search engine is now out of beta. This is a great accomplishment for the world and an invaluable tool for any aspiring Data Scientist in 2020.

In honor of the news, I thought I’d put together a list of my favorite data repositories that I’ve used in the past to create a quick reference guide for any and all aspiring Data Scientists. No matter what industry you want to get into, there’s definitely a dataset for it here :)

Awesome Public Datasets

Awesome Public Datasets is a GitHub repository of high-quality, topic-centric public data sources. They are collected and tidied from blogs, answers, and user responses. Almost all of them are free, with a few exceptions here and there.

Data is Plural

Data is Plural is a weekly newsletter of useful and curious datasets. You can find a huge archive of datasets in their Google Doc. Just hit Ctrl+F, type a topic you'd like to look into, and see the dozens of results that pop up.
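If you would rather script that Ctrl+F step, here is a minimal sketch of the same idea in Python. It assumes you have exported the archive to CSV yourself; the sample rows and the column names ("edition", "headline", "text") are made up for illustration and are not the newsletter's actual layout, and search_archive is a hypothetical helper name.

```python
import csv
import io

# Made-up sample rows standing in for an exported archive. The column
# names ("edition", "headline", "text") are assumptions for illustration.
SAMPLE_ARCHIVE = """edition,headline,text
2020.01.01,Hospital ratings,Quality scores for health facilities
2020.01.08,Bike share trips,Trip-level records from a city bike program
2020.01.15,Health spending,National health expenditure by year
"""


def search_archive(csv_text, keyword):
    """Return every row in which the keyword appears in any field.

    The match is case-insensitive, like a default Ctrl+F search.
    """
    needle = keyword.lower()
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row
        for row in reader
        if any(needle in (value or "").lower() for value in row.values())
    ]


matches = search_archive(SAMPLE_ARCHIVE, "health")
for row in matches:
    print(row["edition"], "-", row["headline"])
```

To run it against a real export, read the file's contents and pass them in place of SAMPLE_ARCHIVE, adjusting the column names to whatever the export actually contains.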

Data World

Data World is an open data repository containing data contributed by thousands of users and organizations all across the world.

What I love about this site is that it contains data that is really hard to find elsewhere. In particular, healthcare is one of the more difficult industries to get publicly available data from (due to privacy concerns). But luckily, Data World has 3667 free health datasets you can use for your next project.

Google Dataset Search

A dataset search engine… powered by Google. No further explanation needed.

Kaggle

Kaggle enables data scientists and other developers to run machine learning contests, write and share code, and host datasets. The data science problems posted on Kaggle range from predicting cancer occurrence by examining patient records to analyzing the sentiment evoked by movie reviews and how it affects audience reaction.

Makeover Monday

This repository is mostly for data visualizations, but I think what they do is a lot of fun.

Makeover Monday was an initiative started in the first week of 2016 by Andy Kriebel (Head Coach, The Information Lab UK, @vizwizbi) and Andy Cotgreave (Tableau Evangelist, @acotgreave).

Every week, usually on a Sunday, Andy K will post (via blog and Twitter) an original visualization to be “made over”. Some are awful; some are already great, in which case the challenge is to present a different angle on the original.

When you are done, post a link to your visualization and/or a picture of it, using the hashtag #MakeoverMonday. All the individual screenshots are compiled into one big Pinterest collage of combined visualizations.

r/datasets/

A place to share, find, and discuss datasets. You can request datasets from other subscribers as well as share and contribute your own.

UCI Machine Learning Repository

The UCI Machine Learning Repository is a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms. The archive was created as an FTP archive in 1987 by David Aha and fellow graduate students at UC Irvine. Since that time, it has been widely used by students, educators, and researchers all over the world as a primary source of machine learning datasets. As an indication of the impact of the archive, it has been cited over 1,000 times, making it one of the top 100 most cited “papers” in all of computer science.

United States Government

Under the terms of the 2013 Federal Open Data Policy, newly generated government data is required to be made available in open, machine-readable formats, while continuing to ensure privacy and security.
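To give a feel for why machine-readable formats matter, here is a small sketch that loads the same records from CSV and from JSON using only the Python standard library. The sample data, the field names ("state", "year", "value"), and the helper names load_records and total are all made up for illustration; they do not come from any actual government dataset.

```python
import csv
import io
import json

# Made-up records in two common machine-readable formats.
CSV_TEXT = "state,year,value\nCA,2019,10\nNY,2019,7\n"
JSON_TEXT = (
    '[{"state": "CA", "year": 2019, "value": 10},'
    ' {"state": "NY", "year": 2019, "value": 7}]'
)


def load_records(text, fmt):
    """Parse CSV or JSON text into a list of dictionaries."""
    if fmt == "csv":
        # csv.DictReader yields every value as a string.
        return list(csv.DictReader(io.StringIO(text)))
    if fmt == "json":
        # json.loads keeps numbers as int or float.
        return json.loads(text)
    raise ValueError("unsupported format: " + fmt)


def total(records, field):
    """Sum a numeric field, converting string values from CSV as needed."""
    return sum(float(record[field]) for record in records)


csv_records = load_records(CSV_TEXT, "csv")
json_records = load_records(JSON_TEXT, "json")
print(total(csv_records, "value"), total(json_records, "value"))
```

Either format ends up as the same list of dictionaries, which is the practical benefit: a few lines of parsing and the data is ready for analysis.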

That’s going to be all for now. Please feel free to bookmark this article and use it as a quick reference for your data pursuits.

Did I miss your favorite repository? Let me know below so I can add it to the guide. Until next time everyone, happy coding.

My name is Kishen Sharma and I am a Data Scientist based in the Bay Area. I create content to educate and motivate aspiring Data Scientists all across the world.

Links to my blog and social media: https://linktr.ee/keesh_codes

