FiS Project – Data Science and Engineering Blog

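It has been a while since my last post. I tried out Qdrant (v1.0.1), an approximate nearest neighbor search engine written in Rust, so here are my notes.

I encoded review texts from the Amazon Customer Reviews Dataset, a dataset of product reviews on Amazon.com, into embedding vectors with Sentence-BERT, loaded them into Qdrant, and ran vector searches against them.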

Environment

The environment is macOS Monterey, Docker (for Qdrant), and Python 3.8.16, with Python packages managed by poetry.

$ poetry add pandas qdrant-client sentence-transformers jupyterlab pyarrow

The resulting pyproject.toml is as follows.

[tool.poetry]
name = "qdrant-example"
version = "0.1.0"
description = ""
authors = []

[tool.poetry.dependencies]
python = ">=3.8,<=3.11"
pandas = "^1.5.3"
qdrant-client = "^1.0.0"
sentence-transformers = "^2.2.2"
jupyterlab = "^3.6.1"
pyarrow = "^11.0.0"

[tool.poetry.dev-dependencies]

[build-system]
requires = ["poetry-core>=1.0.0"]
build-backend = "poetry.core.masonry.api"

Amazon Customer Reviews Dataset

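The Amazon Customer Reviews Dataset is a dataset of product reviews on Amazon.com and can be downloaded from Amazon S3. It covers more than 40 product categories, but since the goal here is only a quick evaluation of Qdrant, I download just part of the Toys category rather than every category.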

$ aws s3 ls s3://amazon-reviews-pds/parquet/product_category=Toys/
2018-04-09 15:39:59  127618720 part-00000-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
...
2018-04-09 15:40:03  126858343 part-00009-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet

$ aws s3 cp s3://amazon-reviews-pds/parquet/product_category=Toys/part-00000-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet data/

The dataset has the following 15 columns.

marketplace
customer_id
review_id
product_id
product_parent
product_title
star_rating
helpful_votes
total_votes
vine
verified_purchase
review_headline
review_body
review_date
year

With pandas, I extracted the first 10,000 rows from the English-speaking marketplaces (UK or US).

In [1]: import pandas as pd

In [2]: df = pd.read_parquet("data/part-00000-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet")

In [3]: df_en = df[df["marketplace"].isin(["UK", "US"])].head(10000).reset_index(drop=True)

In [4]: df_en
Out[4]:
     marketplace customer_id  ... review_date  year
0             UK    17537021  ...  2015-05-03  2015
1             UK    32767579  ...  2015-05-03  2015
2             US    15351528  ...  2012-06-30  2012
3             US    13023551  ...  2015-05-03  2015
4             US    11451833  ...  2012-06-30  2012
...          ...         ...  ...         ...   ...
9995          US    12035422  ...  2012-09-16  2012
9996          US    43262931  ...  2015-05-13  2015
9997          US    44581097  ...  2012-09-16  2012
9998          US     2582516  ...  2015-05-13  2015
9999          US    26449708  ...  2012-09-16  2012

[10000 rows x 15 columns]

In [5]: review_sentences = df_en["review_body"].values

In [6]: review_sentences
Out[6]:
array(["We love Schleich's range of animals and dinosaurs. This wolf doesn't disappoint.",
       'Great game, speedy delivery.',
       "Demolition Dummy a.k.a Crash Test Dummy... in Lego Form! My first from Series 1. It is a pretty decent mini figure that I've obtained. Not knowing what I was getting, I have to admit that I was expecting the Zombie, but it'll do for my collection. Now I can make crash tests with it lol. Overall, it's great.",
       ...,
       'My son loves it! He was able to build it by himself and plays with it regularly. He is wanting every Lego star Wars kit he sees!',
       'This is the perfect gift for my son-in-law in Australia.',
       'After seeing all the positive reviews I decided to purchase this copter. Since I have received this heli, the only real flying I have done with it is retrieving it after it quickly crashes 2 or 3 mins later. It will only take remote signals if I am within 1 or 2 feet. The copter itself may be ok but the remote is junk!!!'],
      dtype=object)

Sentence-BERT

I encode the review texts (the review_body column) into embedding vectors with Sentence-BERT. [1]
As the pretrained model I use all-MiniLM-L6-v2, which is designed as a general-purpose model. [2]

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(review_sentences)

Qdrant

Qdrant is a scalable approximate nearest neighbor search engine written in Rust. It implements its own customized version of the HNSW algorithm [3].

Pull the Qdrant Docker image (v1.0.1 at the time of writing).

$ docker pull qdrant/qdrant

Start the Docker container and confirm that the Qdrant server is up.

$ mkdir -p data/qdrant/storage data/qdrant/snapshots
$ docker run -p 6333:6333 \
  -v $(pwd)/data/qdrant/storage:/qdrant/storage \
  -v $(pwd)/data/qdrant/snapshots:/snapshots \
  qdrant/qdrant

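In Qdrant, a vector together with additional information called a Payload forms a data structure called a Point, and a set of Points is called a Collection.
If you think of a Collection as a table in an RDB and a Point as a row, the model may be easier to grasp.

A Payload is a dictionary, and you can filter on it at search time. This is useful, for example, when you want the top n reviews written by reviewers whose age is close to that of the query's reviewer.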
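To make the Point / Payload / filtering idea concrete, here is a minimal brute-force sketch in plain Python. It is not how Qdrant works internally (Qdrant uses an HNSW-based index), and the names `filtered_search` and `keep`, as well as the 2-dimensional vectors and payloads, are hypothetical and exist only for illustration. With qdrant-client, the equivalent condition would be expressed as a `models.Filter` passed to `client.search` through its `query_filter` argument.

```python
import math
from typing import Any, Callable, Dict, List, Tuple

# A toy "point": an id, a vector, and a payload dict (Qdrant's terminology only).
Point = Dict[str, Any]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(
    collection: List[Point],
    query: List[float],
    keep: Callable[[Dict[str, Any]], bool],
    limit: int = 3,
) -> List[Tuple[float, int]]:
    """Drop points whose payload fails `keep`, then rank the rest by cosine similarity."""
    scored = [
        (cosine(query, p["vector"]), p["id"])
        for p in collection
        if keep(p["payload"])
    ]
    return sorted(scored, reverse=True)[:limit]


# Hypothetical data, for illustration only.
collection = [
    {"id": 1, "vector": [1.0, 0.0], "payload": {"star_rating": 5}},
    {"id": 2, "vector": [0.9, 0.1], "payload": {"star_rating": 2}},
    {"id": 3, "vector": [0.0, 1.0], "payload": {"star_rating": 4}},
]

# Only points rated 4 stars or higher are candidates; point 2 is excluded
# even though its vector is close to the query.
hits = filtered_search(collection, [1.0, 0.0], keep=lambda pl: pl["star_rating"] >= 4)
print(hits)  # [(1.0, 1), (0.0, 3)]
```

The point to notice is that the payload condition restricts the candidate set, and the similarity score only orders whatever survives the filter.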

Using qdrant-client, the Python client for Qdrant, I connect to the Qdrant server and create a Collection. The vectors have 384 dimensions, and cosine similarity is the metric used to compare them.

from qdrant_client import QdrantClient
from qdrant_client.http import models

client = QdrantClient(host="localhost", port=6333)

client.recreate_collection(
    collection_name="amazon-reviews",
    vectors_config=models.VectorParams(
        size=embeddings.shape[1], distance=models.Distance.COSINE
    ),
    on_disk_payload=True,
)

The Qdrant server log for the call above is shown below: recreate_collection() first deletes the Collection with DELETE and then creates it again with PUT.

[2023-02-10T14:34:14.016Z INFO  actix_web::middleware::logger] 172.17.0.1 "DELETE /collections/amazon-reviews HTTP/1.1" 200 72 "-" "python-httpx/0.23.3" 0.007637
[2023-02-10T14:34:15.155Z INFO  actix_web::middleware::logger] 172.17.0.1 "PUT /collections/amazon-reviews HTTP/1.1" 200 71 "-" "python-httpx/0.23.3" 1.102767

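Next, load the data into Qdrant. I first tried to insert all 10,000 points in a single batch, but the request timed out every time, so I split the points into chunks and inserted them chunk by chunk.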

from typing import Any, Iterator, List
from qdrant_client import QdrantClient
from qdrant_client.http import models
from qdrant_client.http.models import PointStruct

def make_chunks(lst: List[Any], n: int = 100) -> Iterator[List[Any]]:
    # Lazily yield consecutive slices of at most n items (the last one may be shorter).
    n = max(1, n)
    return (lst[i : i + n] for i in range(0, len(lst), n))

points = []

for vector, row in zip(embeddings, df_en.itertuples()):
    payload = {
        "customer_id": row.customer_id,
        "review_id": row.review_id,
        "product_id": row.product_id,
        "review_body": row.review_body,
        "star_rating": row.star_rating,
    }
    points.append(PointStruct(id=row.Index+1, vector=vector.tolist(), payload=payload))

points_list = make_chunks(points, n=100)

for chunk in points_list:
    # Upsert one chunk at a time so that each request stays small.
    operation_info = client.upsert(
        collection_name="amazon-reviews",
        points=chunk
    )
    if operation_info.status.value != "completed":
        print(f"Failed: {operation_info.status.value}")
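The chunking step can be tried without a running Qdrant server. The sketch below repeats the slicing helper and swaps the actual upsert call for a stand-in callback; `upload_in_chunks` and the `received` list are hypothetical names introduced here for illustration. It uses a total that is not a multiple of the chunk size to show that the last chunk is simply shorter.

```python
from typing import Any, Callable, Iterator, List


def make_chunks(items: List[Any], n: int = 100) -> Iterator[List[Any]]:
    """Lazily yield consecutive slices of at most n items."""
    n = max(1, n)
    return (items[i : i + n] for i in range(0, len(items), n))


def upload_in_chunks(
    items: List[Any],
    upload: Callable[[List[Any]], None],
    n: int = 100,
) -> int:
    """Call `upload` once per chunk and return the number of chunks sent."""
    sent = 0
    for chunk in make_chunks(items, n):
        upload(chunk)
        sent += 1
    return sent


# Stand-in for the real upsert call: just remember what was "sent".
received: List[List[int]] = []
sent = upload_in_chunks(list(range(10_050)), received.append, n=100)

print(sent, len(received[0]), len(received[-1]))  # 101 100 50
```

Because the helper returns a generator, the chunks are produced one at a time as the loop consumes them, and the generator can be iterated only once.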

Fetch the Collection info and confirm that 10,000 points are stored.

client.get_collection(collection_name="amazon-reviews")

Search

As test data I use review texts that are not among the 10,000 inserted ones, and run an approximate nearest neighbor (ANN) search with them.

df_en_test = df[df["marketplace"].isin(["UK", "US"])].tail(10).reset_index(drop=True)
review_test_sentences = df_en_test["review_body"].values
embeddings_test = model.encode(review_test_sentences)

Look up the sentences (vectors) closest to the test sentence ‘Small but nice’.

search_result = client.search(
    collection_name="amazon-reviews",
    query_vector=embeddings_test[7].tolist(),
    limit=10,
)
for i, point in enumerate(search_result):
    print(f"[{i+1}] score: {point.score}, id: {point.id}, star_rating: {point.payload['star_rating']}, review_body: {point.payload['review_body']}")

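The top 10 results by score are shown below. Most of them are reviews about product size, so the search does retrieve sentences that are semantically similar to some extent. Note that several lines in this recorded output are missing their final character (for example "Very smal"), which is consistent with a trailing backspace (\b) in the print call that produced it.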

[1] score: 0.87326646, id: 234, star_rating: 5, review_body: A little small, but nice
[2] score: 0.85846907, id: 3376, star_rating: 3, review_body: Very smal
[3] score: 0.8049094, id: 5944, star_rating: 5, review_body: very nice so small in siz
[4] score: 0.7880233, id: 1944, star_rating: 5, review_body: medium sized and goo
[5] score: 0.7797675, id: 3834, star_rating: 2, review_body: Small, not as nice as the old fashioned one
[6] score: 0.7217399, id: 7843, star_rating: 4, review_body: Smaller than what I though
[7] score: 0.71946704, id: 1227, star_rating: 2, review_body: tin
[8] score: 0.71628106, id: 8291, star_rating: 5, review_body: Small but adorable. Good size for little hands
[9] score: 0.71628106, id: 8485, star_rating: 5, review_body: Small but adorable. Good size for little hands
[10] score: 0.7013992, id: 4257, star_rating: 3, review_body: A little smaller than I wanted them to be

Snapshots

While I am at it, I also check how to create and restore a snapshot.
Creating a snapshot through the Python client timed out, so from here on I work from the command line.
Create a snapshot with POST.

$ QDRANT_COLLECTION_NAME="amazon-reviews"
$ curl -X POST http://localhost:6333/collections/$QDRANT_COLLECTION_NAME/snapshots

{"result":{"name":"amazon-reviews-17322170446177638066-2023-02-10-22-20-43.snapshot","creation_time":"2023-02-10T22:21:05","size":86245376},"status":"ok","time":22.791109188}

List the snapshots with GET.

$ curl  http://localhost:6333/collections/$QDRANT_COLLECTION_NAME/snapshots | jq
{
  "result": [
    {
      "name": "amazon-reviews-17322170446177638066-2023-02-10-22-20-43.snapshot",
      "creation_time": "2023-02-10T22:21:05",
      "size": 86245376
    }
  ],
  "status": "ok",
  "time": 0.017437166
}
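Rather than copying the snapshot name by hand, the name can be read from the listing response. The following is a small sketch using only the standard library; the JSON is the response shown above, while `latest_snapshot_name` is a hypothetical helper and the download URL simply follows the endpoint pattern used in this post.

```python
import json

# Response body copied from the snapshot listing above.
response_text = """
{
  "result": [
    {
      "name": "amazon-reviews-17322170446177638066-2023-02-10-22-20-43.snapshot",
      "creation_time": "2023-02-10T22:21:05",
      "size": 86245376
    }
  ],
  "status": "ok",
  "time": 0.017437166
}
"""


def latest_snapshot_name(body: str) -> str:
    """Return the name of the most recently created snapshot in a listing response."""
    data = json.loads(body)
    if data.get("status") != "ok":
        raise ValueError(f"unexpected status: {data.get('status')!r}")
    snapshots = data["result"]
    if not snapshots:
        raise ValueError("no snapshots found")
    # Timestamps in this fixed ISO-like format sort correctly as plain strings.
    return max(snapshots, key=lambda s: s["creation_time"])["name"]


name = latest_snapshot_name(response_text)
url = f"http://localhost:6333/collections/amazon-reviews/snapshots/{name}"
size_mib = json.loads(response_text)["result"][0]["size"] / 2**20

print(url)
print(size_mib)  # 82.25
```

The reported size of 86245376 bytes is 82.25 MiB, which agrees with the 82.2M that curl reports when downloading the snapshot.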

Download the snapshot by specifying its name.

$ QDRANT_SNAPSHOT_INPUT_NAME="amazon-reviews-17322170446177638066-2023-02-10-22-20-43.snapshot"
$ QDRANT_SNAPSHOT_OUTPUT_PATH="./data/qdrant/snapshots/$QDRANT_SNAPSHOT_INPUT_NAME"
$ curl http://localhost:6333/collections/$QDRANT_COLLECTION_NAME/snapshots/$QDRANT_SNAPSHOT_INPUT_NAME --output $QDRANT_SNAPSHOT_OUTPUT_PATH
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 82.2M  100 82.2M    0     0  55.2M      0  0:00:01  0:00:01 --:--:-- 55.5M

Stop and remove the running Docker container, then start a new one, passing the path of the downloaded snapshot and the Collection name to the snapshot option of the qdrant command.

$ rm -rf data/qdrant/storage/*
$ docker run -p 6333:6333 \
  -v $(pwd)/data/qdrant/storage:/qdrant/storage \
  -v $(pwd)/data/qdrant/snapshots:/snapshots \
  qdrant/qdrant ./qdrant --snapshot /snapshots/$QDRANT_SNAPSHOT_INPUT_NAME:$QDRANT_COLLECTION_NAME

Fetch the Collection info. It holds 10,000 points, which confirms that the restore succeeded.

$ curl http://localhost:6333/collections/$QDRANT_COLLECTION_NAME | jq
{
  "result": {
    "status": "green",
    "optimizer_status": "ok",
    "vectors_count": 10000,
    "indexed_vectors_count": 0,
    "points_count": 10000,
    "segments_count": 5,
    "config": {
      "params": {
        "vectors": {
          "size": 384,
          "distance": "Cosine"
        },
        "shard_number": 1,
        "replication_factor": 1,
        "write_consistency_factor": 1,
        "on_disk_payload": true
      },
      "hnsw_config": {
        "m": 16,
        "ef_construct": 100,
        "full_scan_threshold": 10000,
        "max_indexing_threads": 0
      },
      "optimizer_config": {
        "deleted_threshold": 0.2,
        "vacuum_min_vector_number": 1000,
        "default_segment_number": 0,
        "max_segment_size": null,
        "memmap_threshold": null,
        "indexing_threshold": 20000,
        "flush_interval_sec": 5,
        "max_optimization_threads": 1
      },
      "wal_config": {
        "wal_capacity_mb": 32,
        "wal_segments_ahead": 0
      }
    },
    "payload_schema": {}
  },
  "status": "ok",
  "time": 8.0758e-05
}

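The above is the restore procedure for a single-node deployment; the procedure for a cluster is different. In production, I can imagine situations where the whole vector space is swapped out and distributed across nodes.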

Conclusion

In this post I tried approximate nearest neighbor search with Qdrant, using the Amazon Customer Reviews Dataset as the subject.

[1] Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
[2] MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers
[3] Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs