These datasets are used in machine learning (ML) research and have been cited in peer-reviewed academic journals. Datasets are an integral part of the...
257 KB (14,298 words) - 20:40, 25 September 2024
Democracy-Dictatorship Index (redirect from DD dataset)
index of democracy and dictatorship or simply the DD index or the DD datasets was the binary measure of democracy and dictatorship first proposed by...
32 KB (1,708 words) - 21:58, 25 September 2024
The Pile is an 886.03 GB diverse, open-source dataset of English text created as a training dataset for large language models (LLMs). It was constructed...
13 KB (1,268 words) - 09:47, 17 July 2024
MNIST database (redirect from MNIST dataset)
original datasets. The creators felt that since NIST's training dataset was taken from American Census Bureau employees, while the testing dataset was taken...
22 KB (2,049 words) - 11:38, 29 September 2024
This is a list of datasets for machine-learning research. It is part of the list of datasets for machine-learning research. These datasets consist primarily...
102 KB (6,344 words) - 22:03, 27 September 2024
A national lidar dataset refers to a high-resolution lidar dataset comprising most—and ideally all—of a nation's terrain. Datasets of this type typically...
4 KB (97 words) - 08:12, 31 May 2024
EPSG Geodetic Parameter Dataset (also EPSG registry) is a public registry of geodetic datums, spatial reference systems, Earth ellipsoids, coordinate...
5 KB (450 words) - 10:37, 21 July 2024
Apache Spark (redirect from Resilient distributed dataset)
followed by the Dataset API. In Spark 1.x, the RDD was the primary application programming interface (API), but as of Spark 2.x use of the Dataset API is encouraged...
30 KB (2,735 words) - 05:25, 30 September 2024
Google Dataset Search is a search engine from Google that helps researchers locate online data that is freely available for use. The company launched...
4 KB (383 words) - 21:45, 14 August 2023
database ReAnalysis) is a global oceanographic temperature and salinity dataset produced and maintained by the French institute IFREMER. Most of those...
7 KB (571 words) - 21:48, 25 September 2023
V-Dem Institute (redirect from V-Party Dataset)
datasets that describe qualities of different governments, published annually and publicly available for free. These datasets are popular among...
12 KB (764 words) - 22:42, 19 September 2024
80 Million Tiny Images (redirect from Tiny Images dataset)
80 Million Tiny Images is a dataset intended for training machine learning systems. It contains 79,302,017 32×32 pixel color images, scaled down from...
3 KB (366 words) - 08:48, 23 May 2024
coordinating efforts across multiple agencies towards a National LIDAR Dataset. The first meeting, a National LIDAR Initiative Strategy Meeting, was held...
18 KB (395 words) - 00:58, 20 June 2024
training dataset size, and training cost. In general, a neural model can be characterized by 4 parameters: size of the model, size of the training dataset, cost...
37 KB (4,946 words) - 03:17, 1 October 2024
problem, a model is usually given a dataset of known data on which training is run (the training dataset), and a dataset of unknown (or first-seen) data...
42 KB (5,623 words) - 18:40, 25 June 2024
Iris flower data set (redirect from Iris dataset)
The iris data set is widely used as a beginner's dataset for machine learning purposes. The dataset is included in base R and in Python in the machine learning...
18 KB (930 words) - 11:28, 29 September 2024
COVID-19 datasets are public databases for sharing case data and medical information related to the COVID-19 pandemic. Johns Hopkins Coronavirus Resource...
13 KB (880 words) - 04:28, 26 July 2024
Large language model (section Dataset preprocessing)
became prevalent, some researchers constructed Internet-scale language datasets ("web as corpus"), upon which they trained statistical language models...
156 KB (13,394 words) - 11:01, 4 October 2024
ImageNet (category Datasets in computer vision)
these whitens the input data. There are various subsets of the ImageNet dataset used in various contexts, sometimes referred to as "versions". One of the...
17 KB (1,841 words) - 04:07, 19 September 2024
Enron Corpus (redirect from Enron email dataset)
machine learning. Klimt, Bryan; Yiming Yang (2004). "The Enron Corpus: A New Dataset for Email Classification Research". pp. 217–226. CiteSeerX 10.1.1.61.1645...
7 KB (715 words) - 20:35, 28 September 2024
method of measuring how many different types (e.g. species) there are in a dataset (e.g. a community). Some more sophisticated indices also account for the...
24 KB (3,321 words) - 15:34, 18 August 2024
Data set (IBM mainframe) (redirect from Partitioned dataset)
IBM mainframe computers in the S/360 line, a data set (IBM preferred) or dataset is a computer file having a record organization. Use of this term began...
14 KB (1,576 words) - 20:08, 17 May 2024
Anscombe's quartet comprises four datasets that have nearly identical simple descriptive statistics, yet have very different distributions and appear very...
10 KB (759 words) - 16:26, 3 October 2024
LAION (section Image datasets)
open-sourced artificial intelligence models and datasets. It is best known for releasing a number of large datasets of images and captions scraped from the web...
11 KB (993 words) - 06:10, 31 August 2024
Training, validation, and test data sets (redirect from Dataset (machine learning))
a sheep if located on a grassland. Statistical classification List of datasets for machine learning research Hierarchical classification Ron Kohavi; Foster...
20 KB (2,209 words) - 21:56, 5 September 2024
BioGRID (redirect from General Repository for Interaction Datasets)
The Biological General Repository for Interaction Datasets (BioGRID) is a curated biological database of protein-protein interactions, genetic interactions...
19 KB (1,911 words) - 21:45, 26 August 2024
In health informatics, a national minimum dataset is a database of health encounters held by a central repository. "Minimum" implies that the data fields...
1 KB (92 words) - 17:23, 20 August 2023
articles and then request a dataset containing word and n-gram frequencies and basic metadata. They are notified when the dataset is ready and may download...
33 KB (2,992 words) - 03:41, 9 September 2024
Llama (language model) (section Training datasets)
Chinchilla-optimal dataset for Llama 3 8B is 200 billion tokens, but performance continued to scale log-linearly to the 75-times larger dataset of 15 trillion...
35 KB (3,612 words) - 04:02, 29 September 2024