© SISAP Challenge Committee.
The LAION2B deep-features dataset contains 768-dimensional vectors stored as 16-bit floating-point numbers, with the dot product as the similarity measure. This task asks for solutions that work on subsets of this dataset. Any transformation of the dataset to load, index, and solve nearest-neighbor queries is allowed; transformations include, but are not limited to, packing into different data types, dimensionality reduction, locality-sensitive hashing, product quantization, and binary sketches. Indexing algorithms can be an original part of the contribution, or alternatively, already published indexes can be used. The entire pipeline should be included and tested in GitHub Actions (GHA) to reproduce the results; the hyperparameters must be clearly specified in the GHA workflow for all reported subsets.
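As a point of reference for what approximate solutions must approach, the following is a minimal sketch of exact k-nearest-neighbor search under dot-product similarity on toy fp16 data (function and variable names are illustrative, not part of the challenge code; results are sorted by similarity in descending order, which is equivalent to ascending distance):

```python
import numpy as np

def exact_knn_dot(db: np.ndarray, queries: np.ndarray, k: int):
    """Exact k-NN under dot-product similarity (higher = more similar).
    Returns 1-based identifiers and the corresponding similarities,
    one row per query; the layout is illustrative only."""
    # compute in float32 to avoid fp16 accumulation error
    sims = queries.astype(np.float32) @ db.astype(np.float32).T  # (nq, n)
    order = np.argsort(-sims, axis=1)[:, :k]          # top-k, descending
    dists = np.take_along_axis(sims, order, axis=1)
    return order + 1, dists  # 1-based indexing, as the challenge requires

# toy usage with random fp16 vectors standing in for clip768 data
rng = np.random.default_rng(0)
db = rng.standard_normal((100, 8)).astype(np.float16)
q = db[:3]
knns, dists = exact_knn_dot(db, q, k=5)
```

This brute-force scan is only feasible for the small subsets; the point of the task is to beat it on speed while keeping recall high.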
Input queries: A set of query vectors for which the nearest neighbors must be retrieved.
Input dataset: A subset of the LAION2B dataset (see the clip768 files). Specifically, a 300K subset for the partial ranking and 10M, 30M, and 100M subsets for the final ranking.
Output: Two matrices in a single HDF5 file, one for object identifiers (key knns, integers) and the other for distances (key dists, floating-point numbers). Note that results should follow column-major order, i.e., the i-th column should contain the identifiers and distances of the approximate nearest neighbors of the i-th query, sorted by distance in ascending order, plus additional metadata. Also, knns identifiers must follow 1-based indexing (0-based solutions must add 1 to evaluate correctly). See https://github.com/sisap-challenges/sisap23-laion-challenge-evaluation for more details.
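A minimal sketch of writing such a result file with the common h5py library is shown below on toy matrices (the metadata attribute names are illustrative; consult the evaluation repository for the exact attributes and memory-order conventions it expects):

```python
import numpy as np
import h5py

# toy result matrices: 4 queries, k=3 neighbors each
knns = np.array([[1, 5, 9],
                 [2, 6, 10],
                 [3, 7, 11],
                 [4, 8, 12]], dtype=np.int64)        # 1-based identifiers
dists = np.sort(np.random.rand(4, 3), axis=1)        # ascending per query

with h5py.File("result.h5", "w") as f:
    f.create_dataset("knns", data=knns)
    f.create_dataset("dists", data=dists)
    # illustrative metadata; the evaluation scripts define the real names
    f.attrs["algo"] = "my-index"
    f.attrs["buildtime"] = 12.3
```

Note that whether a (queries × k) array appears row- or column-major in the file depends on the writing library's conventions (e.g., NumPy is row-major, Julia is column-major), so verify the orientation against the evaluation scripts.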
Ranking: Best queries-per-second among results whose recall exceeds the required threshold, w.r.t. a public gold standard (partial ranking, running on GitHub Actions) and w.r.t. a private gold standard (final ranking).
We will reproduce your results after the challenge and produce rankings of quality and search time using a different query set, so it is essential that your solution can be replicated.
Please visit the examples section for working examples. In particular, please follow the guidelines of https://github.com/sisap-challenges/sisap23-laion-challenge-evaluation to simplify evaluating your solution. The available examples can be used as starting points.
Search methods working in main memory can take advantage of fast RAM accesses instead of slower secondary-memory accesses. Nevertheless, large high-dimensional datasets may limit what can be kept in main memory simultaneously, even on high-end hardware. Binary sketches under the Hamming distance are compact representations of objects: they have a small memory footprint, and distances can be computed efficiently using instructions like SIMD popcount, available in almost any modern processor.
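As a minimal illustration of the Hamming distance on packed sketches (pure Python; the `hamming` helper is hypothetical, and production code would instead use vectorized popcount, e.g. NumPy 2.0's `np.bitwise_count` or C intrinsics):

```python
def hamming(a, b):
    """Hamming distance between two sketches given as equal-length
    sequences of unsigned 64-bit words (16 words = 1024 bits)."""
    # XOR marks the differing bits; popcount tallies them per word
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

a = [0x0] * 16
b = [0xF] * 16          # differs from a in 4 bits per word
d = hamming(a, b)       # 4 bits * 16 words = 64
```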
This task asks for methods that use a dataset and a distance function to project the dataset into binary sketches whose dissimilarity can be measured with the Hamming distance. The goal is to provide fast methods that achieve high-quality binary mappings under these terms.
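One classical projection of this kind, shown here only as a hedged sketch (the function name is illustrative and the bit/byte packing order must be checked against the challenge's conventions), is the random-hyperplane mapping: each bit is the sign of a random Gaussian projection, so Hamming distance approximately preserves angular similarity:

```python
import numpy as np

def random_hyperplane_sketches(X, n_bits=1024, seed=0):
    """Map vectors to n_bits-bit binary sketches via random hyperplanes,
    packed into unsigned 64-bit words (1024 bits -> 16 words)."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((X.shape[1], n_bits)).astype(np.float32)
    bits = (X.astype(np.float32) @ planes) > 0       # (n, n_bits) booleans
    packed = np.packbits(bits, axis=1)               # (n, n_bits // 8) uint8
    # NOTE: byte/bit order is a convention; verify it matches the task spec
    return packed.view(np.uint64)                    # (n, n_bits // 64)

X = np.random.default_rng(1).standard_normal((10, 768)).astype(np.float16)
sk = random_hyperplane_sketches(X)                   # shape (10, 16)
```

A data-dependent mapping learned from the dataset would normally beat this data-oblivious baseline in quality, which is precisely what the task asks for.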
Input queries: Idem Task A.
Input dataset: Idem Task A.
Output: Idem Task A, with queries solved with an exact algorithm under the Hamming distance (i.e., brute force). Additionally, two matrices of binary sketches of 1024 bits packed into 16 unsigned 64-bit integers, stored in an .h5 file with groups named queries and db for the binary sketches of the queries and the database, respectively.
Ranking: Recall of the computed nearest neighbors versus the gold standard. The partial ranking will be computed w.r.t. the public gold standard; the final ranking will be computed w.r.t. a private gold standard.
Note 1: It is well known that other approaches, such as product quantization, locality-sensitive hashing, and dimensionality reduction, also aim at reducing the memory footprint. However, they are beyond the scope of Task B.
Note 2: There is a complementary task below.
This task asks for solutions for fast indexing of binary sketches under the Hamming distance. Participants can use their Task B solutions to compute the projections, or a set of binary sketches produced with a variant of the brief permutation binary sketches [1].
Input queries: Idem Task A.
Input dataset: Idem Task A.
Output: Similar to Task B, but teams are expected to use an index working with the Hamming distance instead of brute force.
Ranking: Best queries-per-second among results whose recall exceeds the required fraction of our baseline recall [1], w.r.t. a public gold standard (partial ranking, running on GitHub Actions) and w.r.t. a private gold standard (final ranking).
[1] Santoyo, F., Chávez, E., & Téllez, E. S. (2014). A compressed index for Hamming distances. In Similarity Search and Applications: 7th International Conference, SISAP 2014, Los Cabos, Mexico, October 29–31, 2014, Proceedings (pp. 113–126). Springer International Publishing.
We will impose hard wall times (e.g., 12 hours) for construction and searching during verification and reproduction (final ranking). We expect participants to specify the hyperparameters in their solutions for each subset and task in advance.