The LAION5B dataset is an openly available image collection that has been used to train very large vision-and-language deep neural models; for instance, the well-known Stable Diffusion generative model used it as its training set. The collection provides each image as a URL handle, which makes it easy to build demonstrations.
A more detailed description can be found here:
The English subset, often called LAION2B, contains over 2 billion objects.
The dataset is divided into parts containing close to 1M vectors each. We selected the first 112 parts (0000 to 0111); we used the first part to extract the public query set and the rest to extract the database. These 112 parts take approximately 160GB of space, and their associated metadata about 20GB. Embeddings are distributed as half-precision (16-bit) floating point vectors bundled in NumPy's .npz format, which can be loaded on most platforms thanks to the format's popularity.
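For instance, a single part can be inspected with NumPy as follows; the file name below is only a placeholder for one of the distributed .npz parts:

```python
import numpy as np

# Placeholder file name: substitute one of the actual .npz parts you downloaded.
part = np.load("img_emb_0001.npz")
print(part.files)                  # names of the arrays stored in the bundle

emb = part[part.files[0]]          # embedding matrix: one 768-d vector per row
print(emb.shape, emb.dtype)        # roughly (1_000_000, 768), float16
```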
The challenge uses three database subsets plus a public query set:

- 10M subset: concatenation of parts 1-11.
- 30M subset: concatenation of parts 1-33.
- 100M subset: concatenation of parts 1-111.
- public queries: computed from part 0.
All parts should be concatenated in order, and NSFW entries (marked in the metadata files) must be removed.
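As a sketch of this assembly step, assuming the metadata for part i ships as a Parquet file with an NSFW flag per row (the file names, column name, and flag values here are assumptions; check the actual metadata schema):

```python
import numpy as np
import pandas as pd

def load_part(i):
    """Load one embeddings part and its metadata (file names are placeholders)."""
    bundle = np.load(f"img_emb_{i:04d}.npz")
    vectors = bundle[bundle.files[0]]                 # (n_i, 768) float16 matrix
    meta = pd.read_parquet(f"metadata_{i:04d}.parquet")
    return vectors, meta

def build_subset(part_ids):
    """Concatenate the given parts in order, dropping rows marked as NSFW."""
    chunks = []
    for i in part_ids:
        vectors, meta = load_part(i)
        keep = meta["NSFW"].to_numpy() != "NSFW"      # column name and values are assumptions
        chunks.append(vectors[keep])
    return np.concatenate(chunks, axis=0)

# 10M subset: parts 1 to 11, concatenated in order.
db = build_subset(range(1, 12))
```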
Note 1: You will get 768-dimensional 16-bit floating point vectors; they may need to be converted to a 32-bit format to get full speed on legacy hardware.
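With NumPy this promotion is a one-liner; the array below is only a stand-in for the loaded embedding matrix:

```python
import numpy as np

X16 = np.zeros((1_000, 768), dtype=np.float16)  # stand-in for the loaded embeddings
X32 = X16.astype(np.float32)                    # one-time promotion for faster arithmetic
```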
Note 2: Our gold standards were computed using $\ell_2$-normalized vectors (i.e., unit norms) and the cosine distance $1 - \cos(\cdot, \cdot)$ as the distance function.
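As a sketch of this distance, assuming `X` is a float32 matrix of database vectors and `q` a query vector (both stand-ins):

```python
import numpy as np

def l2_normalize(X):
    """Scale every row of X to unit Euclidean norm."""
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def cosine_distance(Xn, qn):
    """1 - cos(x, q) for unit-norm database rows Xn and a unit-norm query qn."""
    return 1.0 - Xn @ qn

X = np.random.rand(5, 768).astype(np.float32)    # stand-in database
q = np.random.rand(768).astype(np.float32)       # stand-in query
Xn = l2_normalize(X)
qn = q / np.linalg.norm(q)
print(cosine_distance(Xn, qn))                   # five distances in [0, 2]
```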
Note 3: Our gold-standard `.h5` files contain the 100 nearest neighbors of each query in two associated matrices, `knns` and `dists`; columns correspond to queries and rows to the nearest neighbors of each query. The `knns` identifiers start indexing at 1. The `dists` matrix contains the raw distance values for each corresponding query and object, i.e., $1 - \cos(\cdot, \cdot)$; please note that this is not a proper metric distance. Approaches that rely on metric properties can use the angle instead with minor changes.
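A minimal sketch of reading a gold-standard file with h5py, shifting the 1-based identifiers, and converting $1 - \cos$ to the angle (a proper metric); the file name is just one of the bundles listed below:

```python
import h5py
import numpy as np

with h5py.File("laion2B-en-public-gold-standard-v2-10M.h5", "r") as f:
    knns = f["knns"][:]     # nearest-neighbor identifiers, 1-based
    dists = f["dists"][:]   # 1 - cos(query, neighbor)

knns0 = knns - 1                                     # shift to 0-based indexing if needed
angles = np.arccos(np.clip(1.0 - dists, -1.0, 1.0))  # angle in radians, a proper metric
```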
We provide access to different subsets of the dataset and also created three lower-dimensional projections that can be used instead of the full embeddings. In particular, we computed two PCA projections, into 32 and 96 dimensions, and one projection into 1024-bit binary sketches designed to work with the bit-level Hamming distance (a small sketch of that distance follows this paragraph). The URLs to download these bundles are listed in the tables below.
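The Hamming distance between two binary sketches can be computed with XOR followed by a population count. A minimal NumPy sketch, assuming the 1024-bit sketches are packed into unsigned 64-bit words (the actual storage layout inside the .h5 files is an assumption; inspect the files to confirm):

```python
import numpy as np

def hamming_distance(a, b):
    """Number of differing bits between two packed bit vectors."""
    x = np.bitwise_xor(a, b)
    return int(np.unpackbits(x.view(np.uint8)).sum())   # per-byte popcount, summed

rng = np.random.default_rng(0)
a = np.frombuffer(rng.bytes(128), dtype=np.uint64)      # 128 bytes = 1024 bits
b = np.frombuffer(rng.bytes(128), dtype=np.uint64)
print(hamming_distance(a, b))                           # about 512 for random sketches
```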
dataset | description | size | md5 |
---|---|---|---|
laion2B-en-clip768v2-n=100M.h5 | 100M subset | 147G | 9d8ee3347b1edf136b3ef38162ac05c3 |
laion2B-en-clip768v2-n=30M.h5 | 30M subset | 44G | 15a24d28d2304e14711e23baf7fe86a4 |
laion2B-en-clip768v2-n=10M.h5 | 10M subset | 15G | c05e4b1d2b2a0c7663ac9767753e25e1 |
laion2B-en-clip768v2-n=300K.h5 | 300K subset, for developing purposes | 440M | d238b4b037c32bae41e497f95dffa895 |
laion2B-en-clip768v2-n=100K.h5 | 100K subset, for developing purposes | 147M | daef38a64e3cd1c5233231f8be882a64 |
public-queries-10k-clip768v2.h5 | 10k public query set (original 768d embeddings) | 30M | 257b9eb3f7f25776e0d33b22451b7b32 |
private-queries-10k-clip768v2.h5 | 10k private query set (original 768d embeddings) | 30M | f8f3e61bd22d7d64234a0f587ead9fcf |
dataset | description | size | md5 |
---|---|---|---|
laion2B-en-pca32v2-n=100M.h5 | 100M subset | 13G | 02c5726ba41cbfd3320d75ad113ef008 |
laion2B-en-pca32v2-n=30M.h5 | 30M subset | 3.7G | cf34551e4a80689a155052de640874b1 |
laion2B-en-pca32v2-n=10M.h5 | 10M subset | 1.3G | 799dfd317976012a9b768aea123ce6b0 |
laion2B-en-pca32v2-n=300K.h5 | 300K subset, for developing purposes | 37M | aeffa3290eedd6063f138d5a81489128 |
laion2B-en-pca32v2-n=100K.h5 | 100K subset, for developing purposes | 13M | 45a6c4e3774430d6318f808b43053895 |
public-queries-10k-pca32v2.h5 | 10k public query set for 32d PCA projection | 1.3M | 8c0fa4fff523d6263a246f7553d2b92f |
private-queries-10k-pca32v2.h5 | 10k private query set for 32d PCA projection | 1.3M | 57dc078229325b6c161521512585738e |
dataset | description | size | md5 |
---|---|---|---|
laion2B-en-pca96v2-n=100M.h5 | 100M subset | 37G | 715c1f5bfa3da61eaf5e2e8735052043 |
laion2B-en-pca96v2-n=30M.h5 | 30M subset | 11G | 17b783ca3714b4b8084d93d59bac4611 |
laion2B-en-pca96v2-n=10M.h5 | 10M subset | 3.7G | 4f2520b152929bcd34fb3912d4db025e |
laion2B-en-pca96v2-n=300K.h5 | 300K subset, for developing purposes | 110M | 97faba380163a5ec2e1a441c3a6d21b6 |
laion2B-en-pca96v2-n=100K.h5 | 100K subset, for developing purposes | 37M | 73d464eccd6a6695d1f78f67bfbc7b46 |
public-queries-10k-pca96v2.h5 | 10k public query set for 96d PCA projection | 3.7M | f7d0b77f336f8f63803ddb59b4d4b8ed |
private-queries-10k-pca96v2.h5 | 10k private query set for 96d PCA projection | 3.7M | 301330e6d3963dd2db923fd4e858aa4e |
dataset | description | size | md5 |
---|---|---|---|
laion2B-en-hammingv2-n=100M.h5 | 100M subset | 13G | 36030a46f0792d8c520b85a39ea64dfc |
laion2B-en-hammingv2-n=30M.h5 | 30M subset | 3.7G | 9f438fd469e21313684f191d375c63ed |
laion2B-en-hammingv2-n=10M.h5 | 10M subset | 1.3G | 13a28c054a351c2b2cdd8fd918b006ed |
laion2B-en-hammingv2-n=300K.h5 | 300K subset, for developing purposes | 37M | 03533c23fcc18c806cd42653e46fda89 |
laion2B-en-hammingv2-n=100K.h5 | 100K subset, for developing purposes | 13M | 0dcb6fc72284439f67debcb34080b282 |
public-queries-10k-hammingv2.h5 | 10k public query set for 1024-bit binary sketch projection | 1.3M | cd93f7bf61a436b5a45d0b3e1a002667 |
private-queries-10k-pca96v2.h5 | 10k private query set for 1024-bit binary sketch projection | 3.7M | 301330e6d3963dd2db923fd4e858aa4e |
dataset | description | size | md5 |
---|---|---|---|
laion2B-en-public-gold-standard-v2-100M.h5 | 100M gold standard | 7.7M | 35de58992c6446c85c56e710b144c90c |
laion2B-en-public-gold-standard-v2-30M.h5 | 30M gold standard | 7.7M | 1726691372d2f62d7b0b97d8bf4f6189 |
laion2B-en-public-gold-standard-v2-10M.h5 | 10M gold standard | 7.7M | b68b17693253d95e1fc94c217af25e95 |
laion2B-en-public-gold-standard-v2-300K.h5 | 300K gold standard | 7.7M | 258654f2a34a1bdbfa031862b4e6cfae |
laion2B-en-public-gold-standard-v2-100K.h5 | 100K gold standard | 7.7M | fe39725772f487e4c86af68e18e87c88 |
dataset | description | size | md5 |
---|---|---|---|
laion2B-en-public-gold-standard-v2-100M-F64-IEEE754.h5 | 100M gold standard | 77M | 59321e7e33b5469a5b435ff11305257f |
laion2B-en-public-gold-standard-v2-30M-F64-IEEE754.h5 | 30M gold standard | 77M | a445f32702aa43a176b56c54bf3f03f9 |
laion2B-en-public-gold-standard-v2-10M-F64-IEEE754.h5 | 10M gold standard | 77M | 45b05e4d60b8a66088b378ae7e0d278f |
laion2B-en-public-gold-standard-v2-300K-F64-IEEE754.h5 | 300K gold standard | 77M | 5d635f26630cced971358fd76f37c32e |
dataset | description | size | md5 |
---|---|---|---|
laion2B-en-private-gold-standard-v2-10M-F64-IEEE754.h5 | 10M private gold standard | 783K | f384beecb5dddcddca8efc00a7fcd911 |
laion2B-en-private-gold-standard-v2-30M-F64-IEEE754.h5 | 30M private gold standard | 783K | 3b43d7b1251bd1387419a245bec8ba55 |
laion2B-en-private-gold-standard-v2-100M-F64-IEEE754.h5 | 100M private gold standard | 783K | 0ab272fd7b0eee8beec378e67da85b65 |
For instance, you can download the 10M subset and the public query set with the following commands from a typical Linux terminal:
```sh
curl -O https://sisap-23-challenge.s3.amazonaws.com/SISAP23-Challenge/laion2B-en-clip768v2-n=10M.h5
curl -O https://sisap-23-challenge.s3.amazonaws.com/SISAP23-Challenge/public-queries-10k-clip768v2.h5
```
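Once downloaded, the bundles can be opened with h5py. The dataset key used below (`emb`) is an assumption; list the keys first to confirm the actual name:

```python
import h5py

with h5py.File("laion2B-en-clip768v2-n=10M.h5", "r") as f:
    print(list(f.keys()))        # inspect the stored datasets first
    X = f["emb"][:100_000]       # 'emb' is an assumed key name; adjust after inspecting
    print(X.shape, X.dtype)      # expected: (100000, 768), float16
```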
Note that our projection models were trained on our 10M subset; training them on other subsets may change the resulting quality.
Note: Projections reduce result quality with respect to the original embeddings, but you can use these datasets to prototype your solution quickly and to optimize hyperparameters. Please email us if you are interested in the associated metadata (which can also be obtained as described in the rest of the document).
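For such hyperparameter tuning, a simple recall check against the gold standard could look like the following sketch, where `found` stands for the (1-based) neighbor identifiers returned by your index:

```python
import numpy as np

def recall_at_k(found, gold, k):
    """Average fraction of the k true neighbors recovered per query.

    Both arguments are sequences of neighbor-identifier arrays, one per query,
    using the same (1-based) identifiers as the gold-standard files.
    """
    hits = [len(np.intersect1d(f[:k], g[:k])) for f, g in zip(found, gold)]
    return float(np.mean(hits)) / k

# Toy example with three queries and k=3:
gold  = [np.array([1, 5, 9]), np.array([2, 4, 8]), np.array([3, 6, 7])]
found = [np.array([1, 9, 5]), np.array([2, 8, 10]), np.array([7, 3, 11])]
print(recall_at_k(found, gold, k=3))   # (3 + 2 + 2) / (3 * 3) ≈ 0.78
```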