Datasets Collection

Anh-Thi Dinh


Create artificial dataset

Source of datasets

  • COCO -- Common Objects in Context.
  • Data.gov -- a large dataset aggregator and the home of the US Government’s open data.
  • FiveThirtyEight -- hard data and statistical analysis to tell stories about politics, sports, societal matters and more.
  • Google AI Datasets -- In order to contribute to the broader research community, Google periodically releases data of interest to researchers in a wide range of computer science disciplines.
  • Quandl -- a good choice for testing your machine learning algorithms without wasting time on cleaning data.
  • WHU-RS Datasets -- Dataset Collection by Group of Photogrammetry and Computer Vision (GPCV) at Wuhan University.

Specific Datasets

  • COCO Dataset -- a large-scale object detection, segmentation, and captioning dataset.
  • google-landmark -- Dataset with 5 million images depicting human-made and natural landmarks spanning 200 thousand classes.
  • ImageNet -- an image database organized according to the WordNet hierarchy.
  • WordNet -- A Lexical Database for English.


  • PhoBERT -- Pre-trained language models for Vietnamese.
  • PhoW2V (2020) -- Pre-trained Word2Vec syllable- and word-level embeddings for Vietnamese.
  • ViText2SQL (EMNLP 2020 Findings) -- A dataset for Vietnamese Text2SQL semantic parsing.
  • VnCoreNLP (NAACL 2018) -- A Vietnamese NLP pipeline for word (and sentence) segmentation, POS tagging, named entity recognition and dependency parsing.

Sample datasets

  • pydatafaker -- A Python package to create fake data with relationships between tables.


  • TimeSynth -- A Multipurpose Library for Synthetic Time Series Generation in Python.