challenge
© sisap challenge committee.
The LAION dataset is an openly available image collection that has been used for training very large visual and language deep neural models; for instance, the well-known Stable Diffusion generative model used it as its training set.
A more detailed description can be found here:
The challenge uses a 100M subset of the English subset, often called LAION2B; our objects are not marked as NSFW. We use 768-dimensional vector embeddings.
You will get 768-dimensional 16-bit floating-point vectors, which may be converted to a 32-bit format to get full speed on legacy hardware.
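The 16-bit to 32-bit conversion mentioned above can be sketched with NumPy as follows. This is only an illustration: the helper name `to_float32` and the random stand-in batch are hypothetical, and nothing here reads the challenge files.

```python
import numpy as np


def to_float32(vectors_f16: np.ndarray) -> np.ndarray:
    """Return a float32 copy of a float16 array.

    Every float16 value is exactly representable in float32,
    so the conversion itself loses no information.
    """
    return np.asarray(vectors_f16, dtype=np.float32)


# Hypothetical stand-in for a small batch of 768-dimensional float16 embeddings.
rng = np.random.default_rng(0)
batch_f16 = rng.standard_normal((4, 768)).astype(np.float16)

batch_f32 = to_float32(batch_f16)
print(batch_f32.dtype, batch_f32.shape)
```

Keep in mind that the 32-bit copy takes twice the memory of the 16-bit original, so for the larger subsets it may be preferable to convert in chunks.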
Our gold standards were computed using the dot product as similarity; vectors are almost normalized, so you can also use the cosine distance or the angle distance to get a good approximation.
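To make the relation between the three measures concrete, here is a minimal sketch assuming vectors of unit length. The function names are hypothetical and the vectors are random stand-ins, not challenge data. For unit vectors, the cosine distance equals one minus the dot product, and the angle is the arc cosine of the dot product, so all three induce the same neighbor ordering.

```python
import numpy as np


def dot_similarity(q: np.ndarray, x: np.ndarray) -> float:
    """Raw dot product (larger means more similar)."""
    return float(np.dot(q, x))


def cosine_distance(q: np.ndarray, x: np.ndarray) -> float:
    """1 - cosine similarity; equals 1 - dot product when both norms are 1."""
    cos = np.dot(q, x) / (np.linalg.norm(q) * np.linalg.norm(x))
    return 1.0 - float(cos)


def angle_distance(q: np.ndarray, x: np.ndarray) -> float:
    """Angle between the vectors in radians; this one is a proper metric."""
    cos = np.dot(q, x) / (np.linalg.norm(q) * np.linalg.norm(x))
    # Clip to guard against rounding slightly outside [-1, 1].
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))


# Hypothetical unit-length stand-ins for a query and a database object.
rng = np.random.default_rng(0)
q = rng.standard_normal(768)
q /= np.linalg.norm(q)
x = rng.standard_normal(768)
x /= np.linalg.norm(x)

print(dot_similarity(q, x), cosine_distance(q, x), angle_distance(q, x))
```

With vectors that are only approximately normalized, the dot product and the cosine can differ slightly, which is why the text above speaks of an approximation.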
Our gold-standard `.h5` files contain the 1000 nearest neighbors of each query in two associated matrices, `knns` and `dists`; columns correspond to queries and rows to the nearest neighbors of each query.
The `knns` identifiers are indexed starting at 1.
The `dists` matrix contains the raw similarity values for each corresponding query and object; please note that this is not a proper metric distance. People who rely on metric properties can use the angle with minor changes. We will not check distance values for the final ranking.
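As an illustration of how the `knns` layout can be used, the following sketch compares a result matrix against a gold-standard matrix on identifiers only. It assumes recall as the quality measure and uses tiny hand-made matrices with columns as queries and 1-based identifiers, as described above. The function name `recall_at_k` is hypothetical. Depending on the library used to read the file, the matrices may appear transposed, so it is worth checking their shape first.

```python
import numpy as np


def recall_at_k(gold_knns: np.ndarray, result_ids: np.ndarray, k: int) -> float:
    """Fraction of the gold top-k identifiers found in the result top-k.

    Both matrices hold one query per column and one neighbor rank per row,
    with identifiers in the same (here 1-based) numbering.
    """
    n_queries = gold_knns.shape[1]
    hits = 0
    for j in range(n_queries):
        gold = set(gold_knns[:k, j].tolist())
        found = set(result_ids[:k, j].tolist())
        hits += len(gold & found)
    return hits / (k * n_queries)


# Toy gold standard: 3 neighbors (rows) for 2 queries (columns), 1-based ids.
gold = np.array([[1, 4],
                 [2, 5],
                 [3, 6]])

# Toy search result in the same layout and numbering.
result = np.array([[1, 6],
                   [3, 9],
                   [7, 4]])

print(recall_at_k(gold, result, k=3))

# Subtract 1 before using the identifiers as positions in a 0-based array.
gold_zero_based = gold - 1
```

If your index returns 0-based identifiers, add 1 to them (or subtract 1 from `knns`) before comparing, otherwise every match will be off by one.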
Please use the "Save link as" option if you have problems downloading the files.
dataset | description | size | md5 |
---|---|---|---|
laion2B-en-clip768v2-n=100M.h5 | 100M subset | 147GB | 9d8ee3347b1edf136b3ef38162ac05c3 |
laion2B-en-clip768v2-n=10M.h5 | 10M subset, for developing purposes | 15GB | c05e4b1d2b2a0c7663ac9767753e25e1 |
laion2B-en-clip768v2-n=300K.h5 | 300K subset, for developing purposes | 440MB | d238b4b037c32bae41e497f95dffa895 |
dataset | description | size | md5 |
---|---|---|---|
gold-standard-dbsize=100M--public-queries-2024-laion2B-en-clip768v2-n=10k.h5 | gold standard for the 100M subset (public queries 2024) | 77MB | f88c534ee03f1b2d0adbb972c90bb970 |
gold-standard-dbsize=10M--public-queries-2024-laion2B-en-clip768v2-n=10k.h5 | gold standard for the 10M subset (public queries 2024) | 77MB | 342794391dafed7bd90dabb740fc15ba |
gold-standard-dbsize=1M--public-queries-2024-laion2B-en-clip768v2-n=10k.h5 | gold standard for the 1M subset (public queries 2024) | 77MB | f8e15cec8172451919a01aaa9815919d |
gold-standard-dbsize=300K--public-queries-2024-laion2B-en-clip768v2-n=10k.h5 | gold standard for the 300K subset (public queries 2024) | 77MB | d73ec238f8b4321778f4cbb80a562d23 |
dataset | description | size | md5 |
---|---|---|---|
public-queries-2024-laion2B-en-clip768v2-n=10k.h5 | public queries 2024 (this query set corresponds to the 2023 private query set) | 30MB | f8f3e61bd22d7d64234a0f587ead9fcf |
dataset | description | size | md5 |
---|---|---|---|
gold-standard-dbsize=100M--private-queries-2024-laion2B-en-clip768v2-n=10k-epsilon=0.2.h5 | gold standard for the 100M dataset | 77MB | 0086340c91b9414e8e52095e1f41f49c |
dataset | description | size | md5 |
---|---|---|---|
private-queries-2024-laion2B-en-clip768v2-n=10k-epsilon=0.2.h5 | private queries 2024 | 30MB | 3a6757bfd9a5525fe1064bad9bdfc1dd |
For instance, you can download and prepare the development data using the following commands from a typical Linux terminal:
mkdir data2024 # we will use this directory name for the evaluation, so it is a good idea to use the same structure
cd data2024
curl -O https://sisap-23-challenge.s3.amazonaws.com/SISAP23-Challenge/laion2B-en-clip768v2-n=300K.h5
curl -O http://ingeotec.mx/~sadit/sisap2024-data/public-queries-2024-laion2B-en-clip768v2-n=10k.h5 # this url will be updated soon
curl -O http://ingeotec.mx/~sadit/sisap2024-data/gold-standard-dbsize=300K--public-queries-2024-laion2B-en-clip768v2-n=10k.h5 # this url will be updated soon
# curl -O https://sisap-23-challenge.s3.amazonaws.com/SISAP23-Challenge/laion2B-en-clip768v2-n=10M.h5
# curl -O http://ingeotec.mx/~sadit/sisap2024-data/gold-standard-dbsize=10M--public-queries-2024-laion2B-en-clip768v2-n=10k.h5
# curl -O https://sisap-23-challenge.s3.amazonaws.com/SISAP23-Challenge/laion2B-en-clip768v2-n=100M.h5
# curl -O http://ingeotec.mx/~sadit/sisap2024-data/gold-standard-dbsize=100M--public-queries-2024-laion2B-en-clip768v2-n=10k.h5
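After downloading, the md5 values listed in the tables above can be used to check that a file arrived intact. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `md5_of_file` and `matches_published_md5` are hypothetical. The file is read in chunks, so even the largest subset does not need to fit in memory.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal md5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_md5(path: str, expected_md5: str) -> bool:
    """Compare a file's md5 digest against a published hexadecimal value."""
    return md5_of_file(path) == expected_md5.strip().lower()


# Example call (assumes the 300K subset was downloaded into the current directory):
# matches_published_md5("laion2B-en-clip768v2-n=300K.h5",
#                       "d238b4b037c32bae41e497f95dffa895")
```

A `False` result usually points to an interrupted or corrupted download, in which case fetching the file again is the simplest remedy.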