The LAION2B dataset and its projections

About LAION5B

The LAION5B dataset is an openly available image collection that has been used to train very large vision-language deep neural models; for instance, the well-known Stable Diffusion generative model used it as its training set. Each image in the collection is referenced by a URL, which makes it easy to build demonstrations.

A more detailed description can be found here:

Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., ... & Jitsev, J. (2022). Laion-5b: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402.

The English subset, often called LAION2B, contains more than 2 billion entries.

Subsets used in the challenge

The dataset is divided into parts containing close to 1M vectors each. We selected the first 112 parts (0000 to 0111); we used the first part to extract the public query set and the rest to extract the database. These 112 parts use approximately 160GB of space, and their associated metadata about 20GB. Embeddings are distributed as half-precision (16-bit) floating-point vectors bundled in the NumPy-specific .npz format, which can be loaded on most platforms due to the format's popularity.
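As a sketch of how such embedding bundles can be handled, the snippet below stores and reloads a small batch of 16-bit floating-point vectors through NumPy's .npz container (the file name, array key, and matrix sizes are illustrative, not the ones used in the actual parts):

```python
import numpy as np

# Illustrative example: a small batch of half-precision 768d embeddings,
# written to and read back from NumPy's .npz container format.
embeddings = np.random.rand(1000, 768).astype(np.float16)
np.savez("part-0000.npz", embeddings=embeddings)

with np.load("part-0000.npz") as bundle:
    restored = bundle["embeddings"]

print(restored.shape, restored.dtype)  # (1000, 768) float16
```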

The challenge works with three benchmark subsets (10M, 30M, and 100M), plus two smaller subsets (100K and 300K) for development. To build each subset, the parts are concatenated in order, removing the NSFW entries marked in the metadata files.

Subsets

We provide access to different subsets of the dataset and also created three lower-dimensional projections. In particular, we computed two PCA projections, using 32 and 96 dimensions, and one projection into 1024-bit binary sketches designed to work with the bit-level Hamming distance. The URLs to download these bundles are listed below.
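For the binary sketches, distances are computed at the bit level. The following is a minimal sketch of the Hamming distance over 1024-bit vectors packed into bytes (the exact storage layout inside the bundles may differ and should be checked against the files):

```python
import numpy as np

def hamming_distances(queries, database):
    """Hamming distance between packed bit vectors.

    queries:  (q, 128) uint8 array -- 1024 bits per row, packed into bytes
    database: (n, 128) uint8 array
    Returns a (q, n) matrix of bit-level Hamming distances.
    """
    # XOR each query against every database row, then count differing bits.
    diff = np.bitwise_xor(queries[:, None, :], database[None, :, :])
    return np.unpackbits(diff, axis=2).sum(axis=2)

rng = np.random.default_rng(0)
db = rng.integers(0, 256, size=(5, 128), dtype=np.uint8)
q = db[:2].copy()
q[0, 0] ^= 0b00000011  # flip two bits in the first query

print(hamming_distances(q, db)[0, 0])  # 2
```

On large subsets, a vectorized XOR-and-popcount such as this one is memory-hungry; batching the queries keeps the intermediate `diff` array small.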

768d clip embeddings (clip768)

| dataset | description | size | md5 |
| --- | --- | --- | --- |
| laion2B-en-clip768v2-n=100M.h5 | 100M subset | 147G | 9d8ee3347b1edf136b3ef38162ac05c3 |
| laion2B-en-clip768v2-n=30M.h5 | 30M subset | 44G | 15a24d28d2304e14711e23baf7fe86a4 |
| laion2B-en-clip768v2-n=10M.h5 | 10M subset | 15G | c05e4b1d2b2a0c7663ac9767753e25e1 |
| laion2B-en-clip768v2-n=300K.h5 | 300K subset, for development purposes | 440M | d238b4b037c32bae41e497f95dffa895 |
| laion2B-en-clip768v2-n=100K.h5 | 100K subset, for development purposes | 147M | daef38a64e3cd1c5233231f8be882a64 |
| public-queries-10k-clip768v2.h5 | 10k public query set (original 768d embeddings) | 30M | 257b9eb3f7f25776e0d33b22451b7b32 |
| private-queries-10k-clip768v2.h5 | 10k private query set (original 768d embeddings) | 30M | f8f3e61bd22d7d64234a0f587ead9fcf |

32d PCA projections (pca32)

| dataset | description | size | md5 |
| --- | --- | --- | --- |
| laion2B-en-pca32v2-n=100M.h5 | 100M subset | 13G | 02c5726ba41cbfd3320d75ad113ef008 |
| laion2B-en-pca32v2-n=30M.h5 | 30M subset | 3.7G | cf34551e4a80689a155052de640874b1 |
| laion2B-en-pca32v2-n=10M.h5 | 10M subset | 1.3G | 799dfd317976012a9b768aea123ce6b0 |
| laion2B-en-pca32v2-n=300K.h5 | 300K subset, for development purposes | 37M | aeffa3290eedd6063f138d5a81489128 |
| laion2B-en-pca32v2-n=100K.h5 | 100K subset, for development purposes | 13M | 45a6c4e3774430d6318f808b43053895 |
| public-queries-10k-pca32v2.h5 | 10k public query set for 32d PCA projection | 1.3M | 8c0fa4fff523d6263a246f7553d2b92f |
| private-queries-10k-pca32v2.h5 | 10k private query set for 32d PCA projection | 1.3M | 57dc078229325b6c161521512585738e |

96d PCA projections (pca96)

| dataset | description | size | md5 |
| --- | --- | --- | --- |
| laion2B-en-pca96v2-n=100M.h5 | 100M subset | 37G | 715c1f5bfa3da61eaf5e2e8735052043 |
| laion2B-en-pca96v2-n=30M.h5 | 30M subset | 11G | 17b783ca3714b4b8084d93d59bac4611 |
| laion2B-en-pca96v2-n=10M.h5 | 10M subset | 3.7G | 4f2520b152929bcd34fb3912d4db025e |
| laion2B-en-pca96v2-n=300K.h5 | 300K subset, for development purposes | 110M | 97faba380163a5ec2e1a441c3a6d21b6 |
| laion2B-en-pca96v2-n=100K.h5 | 100K subset, for development purposes | 37M | 73d464eccd6a6695d1f78f67bfbc7b46 |
| public-queries-10k-pca96v2.h5 | 10k public query set for 96d PCA projection | 3.7M | f7d0b77f336f8f63803ddb59b4d4b8ed |
| private-queries-10k-pca96v2.h5 | 10k private query set for 96d PCA projection | 3.7M | 301330e6d3963dd2db923fd4e858aa4e |

1024-bit binary sketches (hamming)

| dataset | description | size | md5 |
| --- | --- | --- | --- |
| laion2B-en-hammingv2-n=100M.h5 | 100M subset | 13G | 36030a46f0792d8c520b85a39ea64dfc |
| laion2B-en-hammingv2-n=30M.h5 | 30M subset | 3.7G | 9f438fd469e21313684f191d375c63ed |
| laion2B-en-hammingv2-n=10M.h5 | 10M subset | 1.3G | 13a28c054a351c2b2cdd8fd918b006ed |
| laion2B-en-hammingv2-n=300K.h5 | 300K subset, for development purposes | 37M | 03533c23fcc18c806cd42653e46fda89 |
| laion2B-en-hammingv2-n=100K.h5 | 100K subset, for development purposes | 13M | 0dcb6fc72284439f67debcb34080b282 |
| public-queries-10k-hammingv2.h5 | 10k public query set for 1024-bit binary sketch projection | 1.3M | cd93f7bf61a436b5a45d0b3e1a002667 |
| private-queries-10k-pca96v2.h5 | 10k private query set for 1024-bit binary sketch projection | 3.7M | 301330e6d3963dd2db923fd4e858aa4e |

Gold standard list (computed with 32-bit floating point arithmetic, 100 nearest neighbors)

| dataset | description | size | md5 |
| --- | --- | --- | --- |
| laion2B-en-public-gold-standard-v2-100M.h5 | 100M gold standard | 7.7M | 35de58992c6446c85c56e710b144c90c |
| laion2B-en-public-gold-standard-v2-30M.h5 | 30M gold standard | 7.7M | 1726691372d2f62d7b0b97d8bf4f6189 |
| laion2B-en-public-gold-standard-v2-10M.h5 | 10M gold standard | 7.7M | b68b17693253d95e1fc94c217af25e95 |
| laion2B-en-public-gold-standard-v2-300K.h5 | 300K gold standard | 7.7M | 258654f2a34a1bdbfa031862b4e6cfae |
| laion2B-en-public-gold-standard-v2-100K.h5 | 100K gold standard | 7.7M | fe39725772f487e4c86af68e18e87c88 |

Gold standard for public queries (computed with 64-bit IEEE floating point arithmetic, 1000 nearest neighbors)

| dataset | description | size | md5 |
| --- | --- | --- | --- |
| laion2B-en-public-gold-standard-v2-100M-F64-IEEE754.h5 | 100M gold standard | 77M | 59321e7e33b5469a5b435ff11305257f |
| laion2B-en-public-gold-standard-v2-30M-F64-IEEE754.h5 | 30M gold standard | 77M | a445f32702aa43a176b56c54bf3f03f9 |
| laion2B-en-public-gold-standard-v2-10M-F64-IEEE754.h5 | 10M gold standard | 77M | 45b05e4d60b8a66088b378ae7e0d278f |
| laion2B-en-public-gold-standard-v2-300K-F64-IEEE754.h5 | 300K gold standard | 77M | 5d635f26630cced971358fd76f37c32e |

Gold standard for private queries (computed with 64-bit IEEE floating point arithmetic, 1000 nearest neighbors)

| dataset | description | size | md5 |
| --- | --- | --- | --- |
| laion2B-en-private-gold-standard-v2-10M-F64-IEEE754.h5 | 10M private gold standard | 783K | f384beecb5dddcddca8efc00a7fcd911 |
| laion2B-en-private-gold-standard-v2-30M-F64-IEEE754.h5 | 30M private gold standard | 783K | 3b43d7b1251bd1387419a245bec8ba55 |
| laion2B-en-private-gold-standard-v2-100M-F64-IEEE754.h5 | 100M private gold standard | 783K | 0ab272fd7b0eee8beec378e67da85b65 |
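The gold standard files contain the identifiers of the true nearest neighbors of each query, which makes scoring a result set straightforward. The following is a minimal sketch of recall@k against such a ground truth (array names are illustrative; check the bundled demo for the actual HDF5 layout):

```python
import numpy as np

def recall_at_k(gold_knns, found_knns, k):
    """Average fraction of the true k nearest neighbors recovered.

    gold_knns:  (n_queries, >=k) array of true neighbor ids per query
    found_knns: (n_queries, >=k) array of ids returned by the index
    """
    hits = 0
    for gold, found in zip(gold_knns, found_knns):
        hits += len(set(gold[:k]) & set(found[:k]))
    return hits / (k * len(gold_knns))

# Toy check: one query recovers all 3 true neighbors, the other only 2 of 3.
gold = np.array([[1, 2, 3], [4, 5, 6]])
found = np.array([[3, 2, 1], [4, 5, 9]])
print(recall_at_k(gold, found, 3))  # 5/6 ≈ 0.8333
```

Set-based recall ignores the ordering of the returned neighbors, which is the usual convention when ties at the k-th distance are possible.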

Associated captions and image URLs (tab-delimited files)

| dataset | description | size | md5 |
| --- | --- | --- | --- |
| meta-10M.tsv | metadata for the 10M subset | 1.8G | a9abbe13fb19207fb240f74fc03e2476 |
| meta-30M.tsv | metadata for the 30M subset | 5.2G | a3205400411f6b82c8748e1a187d87fb |
| meta-100M.tsv | metadata for the 100M subset | 18G | 323d0cf4cf22ae6edbc71e18e8110100 |

For instance, you can download the 10M subset and the public query set using the following commands from a typical Linux terminal:

curl -O https://sisap-23-challenge.s3.amazonaws.com/SISAP23-Challenge/laion2B-en-clip768v2-n=10M.h5
curl -O https://sisap-23-challenge.s3.amazonaws.com/SISAP23-Challenge/public-queries-10k-clip768v2.h5

Demonstrations are much more compelling with the captions and image URLs, and that is where the metadata comes in. You can download a subset of the associated metadata with the following command:

curl -O https://sisap-23-challenge.s3.amazonaws.com/SISAP23-Challenge/meta-10M.tsv

Please review the simple Jupyter-based demo to see how it can be used.

The -C - flag can be added if you need to resume a broken download.

Note that the 100K and 300K subsets do not correspond to the first 100K and 300K elements of the larger subsets. More precisely, the 100K and 300K subsets include records with missing NSFW values, while the larger subsets remove such records.

Note that our projection models were trained on our 10M subset; applying them to other data may yield different quality.

Note: the projections reduce result quality relative to the original embeddings, but these datasets are useful for fast prototyping and hyperparameter optimization. Please email us if you are interested in the associated metadata (which can also be obtained as described elsewhere in this document).
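For quick prototyping on the small development subsets, a brute-force exact search is often enough to sanity-check a pipeline before indexing. The following is a sketch using inner product over L2-normalized vectors (CLIP-style embeddings are commonly compared with cosine similarity; verify the metric against the demo before relying on it):

```python
import numpy as np

def knn_bruteforce(queries, database, k):
    """Exact k-NN by inner product over L2-normalized vectors.

    Returns (ids, scores), each of shape (n_queries, k),
    with neighbors sorted by decreasing similarity.
    """
    qn = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    dn = database / np.linalg.norm(database, axis=1, keepdims=True)
    sims = qn @ dn.T
    ids = np.argsort(-sims, axis=1)[:, :k]
    scores = np.take_along_axis(sims, ids, axis=1)
    return ids, scores

# Toy check: queries are lightly perturbed copies of database rows 0..2,
# so each query's nearest neighbor should be its own source row.
rng = np.random.default_rng(42)
db = rng.standard_normal((100, 768)).astype(np.float32)
q = db[:3] + 0.01 * rng.standard_normal((3, 768)).astype(np.float32)
ids, _ = knn_bruteforce(q, db, k=5)
print(ids[:, 0])  # [0 1 2]
```

This O(nq) scan is only practical on the 100K/300K bundles, but it provides an exact baseline against which an approximate index can be scored.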

The original dataset can be downloaded and processed into different subsets as described on the downloading and preprocessing LAION page. We encourage challenge participants to use the provided bundles for consistency.
CC BY-SA 4.0 sisap challenge committee. Last modified: August 22, 2023. Website built with Franklin.jl and the Julia programming language.