💌 [email protected]

Datasets Collection

Anh-Thi Dinh
Data Science

Articles

  • Elite Data Science -- Datasets for Data Science and Machine Learning

Create artificial dataset

  • sklearn datasets module: from sklearn import datasets. It also contains some popular reference datasets.
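
As a minimal sketch of what "artificial" data from this module can look like, the snippet below calls three of scikit-learn's generators: make_classification, make_blobs and make_regression. The sample counts, feature counts, noise level and random_state values are arbitrary choices for illustration and do not come from this note.

```python
from sklearn import datasets

# Synthetic binary classification problem: 200 samples, 5 features,
# of which 3 are informative (sizes chosen only for illustration).
X_clf, y_clf = datasets.make_classification(
    n_samples=200,
    n_features=5,
    n_informative=3,
    n_redundant=0,
    n_classes=2,
    random_state=0,
)

# Three Gaussian clusters in 2D, handy for testing clustering algorithms.
X_blobs, y_blobs = datasets.make_blobs(
    n_samples=150, centers=3, n_features=2, random_state=0
)

# Linear regression target with a little Gaussian noise added.
X_reg, y_reg = datasets.make_regression(
    n_samples=100, n_features=4, noise=0.5, random_state=0
)

print(X_clf.shape, y_clf.shape)
print(X_blobs.shape, y_blobs.shape)
print(X_reg.shape, y_reg.shape)
```

Fixing random_state makes the generated data reproducible between runs, which helps when comparing algorithms on the same synthetic set.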

Source of datasets

  • awesome-public-datasets -- A topic-centric list of HQ open datasets.
  • Built-in datasets in Scikit-Learn.
  • BuzzFeedNews/everything -- data from BuzzFeed.
  • COCO -- Common Objects in Context.
  • Data Hub Datasets collection -- high quality data and datasets organized by topic.
  • data.gov -- a large dataset aggregator and the home of the US Government's open data.
  • data.world -- The Cloud-Native Data Catalog.
  • FiveThirtyEight -- hard data and statistical analysis to tell stories about politics, sports, societal matters and more.
  • Google Dataset Search.
  • Google Trends Datastore
  • Google AI Datasets -- In order to contribute to the broader research community, Google periodically releases data of interest to researchers in a wide range of computer science disciplines.
  • Kaggle Datasets.
  • NLP-progress.
  • Open Images V6
  • Quandl -- a good choice for testing your machine learning algorithms without wasting time on cleaning data.
  • r/datasets.
  • Stanford Large Network Dataset Collection.
  • UCI
  • TensorFlow Datasets
  • The Yahoo Webscope Program
  • torchvision.datasets
  • WHU-RS Datasets -- Dataset Collection by Group of Photogrammetry and Computer Vision (GPCV) at Wuhan University.

Specific Datasets

  • COCO Dataset -- a large-scale object detection, segmentation, and captioning dataset.
  • Dataset samples from Machine Learning Mastery.
  • Fruit-Images-Dataset -- A dataset of images containing fruits and vegetables.
  • google-landmark -- Dataset with 5 million images depicting human-made and natural landmarks spanning 200 thousand classes.
  • ImageNet -- ImageNet is an image database organized according to the WordNet hierarchy.
  • Insight - BBC News Datasets
  • Large-scale CelebFaces Attributes (CelebA) Dataset
  • Large Movie Review Dataset (IMDB)
  • MIT Places Database for Scene Recognition.
  • Sarcasm detection dataset.
  • UEA & UCR Time Series Classification Repository
  • WordNet -- A Lexical Database for English.

Vietnamese

  • IWSLT'15 English-Vietnamese data (small version, from Stanford).
  • NLP-progress - Vietnamese
  • PhoBERT -- Pre-trained language models for Vietnamese.
  • PhoW2V (2020): Pre-trained Word2Vec syllable- and word-level embeddings for Vietnamese.
  • ViText2SQL (EMNLP 2020 Findings): A dataset for Vietnamese Text2SQL semantic parsing.
  • VnCoreNLP (NAACL 2018): A Vietnamese NLP pipeline of word (and sentence) segmentation, POS tagging, named entity recognition and dependency parsing.

Sample datasets

  • Iris flower dataset (from sklearn.datasets import load_iris).
  • Labeled Faces in the Wild Home (from sklearn.datasets import fetch_lfw_people).
  • pydatafaker -- A Python package to create fake data with relationships between tables.
  • The digits dataset (sklearn.datasets.load_digits).
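
The two bundled loaders mentioned above can be tried with a short sketch like the one below. It only uses load_iris and load_digits, which ship with scikit-learn and need no download. fetch_lfw_people is left out on purpose because it downloads the images on first use. The variable names are illustrative.

```python
from sklearn.datasets import load_digits, load_iris

# Iris: 150 flowers, 4 measurements each, 3 species.
iris = load_iris()
X_iris, y_iris = iris.data, iris.target
iris_classes = [str(name) for name in iris.target_names]
print(X_iris.shape)      # (150, 4)
print(iris_classes)      # ['setosa', 'versicolor', 'virginica']

# Digits: 1797 grayscale images of handwritten digits, 8x8 pixels each.
# `data` holds the flattened pixels, `images` keeps the 2D layout.
digits = load_digits()
print(digits.data.shape)    # (1797, 64)
print(digits.images.shape)  # (1797, 8, 8)
```

Both loaders return a Bunch object, so fields such as data, target and target_names can be read as attributes or as dictionary keys.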

Tools

  • TimeSynth -- A Multipurpose Library for Synthetic Time Series Generation in Python.