The LAION2B and projections

About the LAION5B

The LAION dataset is an openly available image collection that has been used to train very large vision and language deep neural models; for instance, the well-known Stable Diffusion generative model used it as its training set.

A more detailed description can be found here:

Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., ... & Jitsev, J. (2022). LAION-5B: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402.

The challenge uses a 100M subset of the English subset, often called LAION2B; the objects we selected are not marked as NSFW. We use 768-dimensional vector embeddings.

Some notes about the data


768d CLIP embeddings (clip768)

file | description | size | checksum
laion2B-en-clip768v2-n=100M.h5 | 100M subset | 147G | 9d8ee3347b1edf136b3ef38162ac05c3
laion2B-en-clip768v2-n=10M.h5 | 10M subset, for development purposes | 15G | c05e4b1d2b2a0c7663ac9767753e25e1
laion2B-en-clip768v2-n=300K.h5 | 300K subset, for development purposes | 440M | d238b4b037c32bae41e497f95dffa895

Gold standard files

file | description | size | checksum
gold-standard-dbsize=100M--public-queries-2024-laion2B-en-clip768v2-n=10k.h5 | gold standard for the 100M subset (public queries 2024) | 77M | f88c534ee03f1b2d0adbb972c90bb970
gold-standard-dbsize=10M--public-queries-2024-laion2B-en-clip768v2-n=10k.h5 | gold standard for the 10M subset (public queries 2024) | 77M | 342794391dafed7bd90dabb740fc15ba
gold-standard-dbsize=1M--public-queries-2024-laion2B-en-clip768v2-n=10k.h5 | gold standard for the 1M subset (public queries 2024) | 77M | f8e15cec8172451919a01aaa9815919d
gold-standard-dbsize=300K--public-queries-2024-laion2B-en-clip768v2-n=10k.h5 | gold standard for the 300K subset (public queries 2024) | 77M | d73ec238f8b4321778f4cbb80a562d23

Public queries

file | description | size | checksum
public-queries-2024-laion2B-en-clip768v2-n=10k.h5 | public queries 2024 (this query set corresponds to the 2023 private query set) | 30M | f8f3e61bd22d7d64234a0f587ead9fcf

For instance, you can download and prepare the development data using the following commands from a typical Linux terminal:

mkdir data2024  # we will use this directory name for the evaluation, so it is a good idea to use the same structure
cd data2024
curl -O
curl -O  # this url will be updated soon
curl -O # this url will be updated soon
# curl -O
# curl -O
# curl -O
# curl -O
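After downloading, it can be worth checking that the files arrived intact. The following is a minimal sketch, assuming the 32-character hexadecimal strings listed next to each file above are MD5 digests (their length matches, but this page does not name the algorithm) and that `md5sum` from GNU coreutils is available. The helper name `verify_md5` is hypothetical and only serves this illustration.

```shell
# Compare a file's MD5 digest with an expected value.
# Returns 0 on a match, 1 on a mismatch or a missing file.
verify_md5() {
    local file="$1" expected="$2" actual
    if [ ! -f "$file" ]; then
        echo "MISSING  $file"
        return 1
    fi
    # Read from stdin so the output is just "<digest>  -".
    actual=$(md5sum < "$file" | cut -d' ' -f1)
    if [ "$actual" = "$expected" ]; then
        echo "OK       $file"
    else
        echo "MISMATCH $file (got $actual)"
        return 1
    fi
}

# Two of the development files with the digests listed above.
# Run from inside data2024; files that are not present are skipped.
# Further files can be added as "<digest> <file name>" lines.
while read -r expected file; do
    if [ -f "$file" ]; then
        verify_md5 "$file" "$expected"
    fi
done <<'EOF'
d238b4b037c32bae41e497f95dffa895 laion2B-en-clip768v2-n=300K.h5
f8f3e61bd22d7d64234a0f587ead9fcf public-queries-2024-laion2B-en-clip768v2-n=10k.h5
EOF
```

A MISMATCH line usually points to an interrupted or truncated download, in which case downloading the file again is the simplest remedy.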
CC BY-SA 4.0 sisap challenge committee. Last modified: April 08, 2024. Website built with Franklin.jl and the Julia programming language.