Quick Start

This document explains how to install the Aeca SDK and how to use it.

Installing Aeca and the Python SDK

Install the SDK via pip:

pip install aeca

Then run the Aeca server using Docker:

mkdir data
docker run --rm -it --name aeca \
    -p 10080:10080 -v $(pwd)/data:/app/data \
    aeca/aeca-server server

Once started, the server is ready to accept requests on port 10080.

Execution results:
Aeca v1.1.0

[2024-04-17T08:08:01.138] [general] [info] [1] Initializing gRPC service...
[2024-04-17T08:08:01.138] [general] [info] [1] Initializing KeyValueDBService...
[2024-04-17T08:08:01.138] [general] [info] [1] KeyValueDBService has been successfully initialized.
[2024-04-17T08:08:01.138] [general] [info] [1] Initializing DocumentDBService...
[2024-04-17T08:08:01.138] [general] [debug] [159] Long-running query monitor is enabled.
[2024-04-17T08:08:01.138] [general] [info] [1] DocumentDBService has been successfully initialized.
[2024-04-17T08:08:01.138] [general] [info] [1] Initializing FTSAnalysisPipelineService...
[2024-04-17T08:08:01.138] [general] [info] [1] FTSAnalysisPipelineService has been successfully initialized.
[2024-04-17T08:08:01.138] [general] [info] [1] Initializing SentenceTransformerService...
[2024-04-17T08:08:01.138] [general] [debug] [160] Long-running query monitor is enabled.
[2024-04-17T08:08:01.139] [general] [info] [1] SentenceTransformerService has been successfully initialized.
[2024-04-17T08:08:01.143] [general] [info] [1] gRPC service has been successfully initialized.
[2024-04-17T08:08:01.143] [general] [info] [1] Server listening on 0.0.0.0:10080 (Insecure)

Defining Schema

Create a DocumentDB object to connect to the server.

from aeca import Channel, DocumentDB
 
channel = Channel("localhost", 10080)
doc_db = DocumentDB(channel)

The data to be stored is defined as follows:

  • doc_id: document id
  • content: text data
  • embedding: embedding vector for the content

To achieve this, define the indexes as follows and create a collection named test.

indexes = [
    {
        "name": "__primary_key__",
        "fields": [
            "doc_id"
        ],
        "unique": True,
        "index_type": "kPrimaryKey"
    },
    {
        "name": "sk_fts",
        "fields": [
            "doc_id",
            "content",
            "embedding"
        ],
        "unique": False,
        "index_type": "kFullTextSearchIndex",
        "options": {
            "doc_id": {
                "analyzer": {
                    "type": "KeywordAnalyzer"
                },
                "index_options": "doc_freqs"
            },
            "content": {
                "analyzer": {
                    "type": "StandardCJKAnalyzer",
                    "options": {
                        "tokenizer": "icu",
                        "ngram_filter": {
                            "min_size": 2,
                            "max_size": 4
                        }
                    }
                },
                "index_options": "offsets"
            },
            "embedding": {
                "analyzer": {
                    "type": "DenseVectorAnalyzer",
                    "options": {
                        "index_type": "HNSW",
                        "dims": 768,
                        "m": 64,
                        "ef_construction": 200,
                        "ef_search": 32,
                        "metric": "inner_product",
                        "normalize": True,
                        "shards": 1
                    }
                },
                "index_options": "doc_freqs"
            }
        }
    }
]
doc_db.create_collection("test", indexes=indexes)

The indexes list in the code above defines two indexes. The first is the primary key, built on the doc_id field.

    {
        "name": "__primary_key__",
        "fields": [
            "doc_id"
        ],
        "unique": True,
        "index_type": "kPrimaryKey"
    },

Next is the index for Full-Text Search (FTS). A different analyzer is defined for each field: doc_id uses KeywordAnalyzer for exact matching, content uses StandardCJKAnalyzer for FTS, and embedding uses DenseVectorAnalyzer for vector search.

    {
        "name": "sk_fts",
        "fields": [
            "doc_id",
            "content",
            "embedding"
        ],
        "unique": False,
        "index_type": "kFullTextSearchIndex",
        "options": {
            "doc_id": {
                "analyzer": {
                    "type": "KeywordAnalyzer"
                },
                "index_options": "doc_freqs"
            },
            "content": {
                "analyzer": {
                    "type": "StandardCJKAnalyzer",
                    "options": {
                        "tokenizer": "icu",
                        "ngram_filter": {
                            "min_size": 2,
                            "max_size": 4
                        }
                    }
                },
                "index_options": "offsets"
            },
            "embedding": {
                "analyzer": {
                    "type": "DenseVectorAnalyzer",
                    "options": {
                        "index_type": "HNSW",
                        "dims": 768,
                        "m": 64,
                        "ef_construction": 200,
                        "ef_search": 32,
                        "metric": "inner_product",
                        "normalize": True,
                        "shards": 1
                    }
                },
                "index_options": "doc_freqs"
            }
        }
    }

Detailed information about schema definition can be found in Schema for Index Definition.

Data Input

The collection is now ready for data input. Aeca also provides ML model serving features, so you can use models from the Hugging Face Hub with SentenceTransformers directly, as shown below.

from aeca import SentenceTransformerEncoder
 
encoder = SentenceTransformerEncoder(channel, "ko-sbert-sts")
 
sentences = [
    "남자가 달걀을 그릇에 깨어 넣고 있다.",
    "한 남자가 기타를 치고 있다.",
    "목표물이 총으로 맞고 있다.",
    "남자가 냄비에 국물을 부어 넣고 있다."
]
embeddings = encoder.encode(sentences)
 
docs = []
for doc_id, (sentence, embedding) in enumerate(zip(sentences, embeddings)):
    doc = {
        "doc_id": doc_id,
        "content": sentence,
        "embedding": embedding.tolist()
    }
    docs.append(doc)
doc_db.insert("test", docs)

An ID is assigned to each sentence, paired with its embedding vector, and the resulting documents are inserted with the insert function.
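
If you later want to add another document, the same pattern works with a one-element list. This is a minimal, optional sketch (not part of the results shown below); the sentence is a made-up example value.

# Hedged sketch: insert one more document, reusing the existing encoder and collection.
# The sentence is a hypothetical example ("A woman is playing the violin.").
new_sentence = "한 여자가 바이올린을 연주하고 있다."
new_embedding = encoder.encode(new_sentence)  # 2D array, same shape convention as in the search example below

new_doc = {
    "doc_id": len(sentences),               # next unused primary key value
    "content": new_sentence,
    "embedding": new_embedding[0].tolist()
}
doc_db.insert("test", [new_doc])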

Detailed information about data input can be found in Data Input and Management.

Search

After the data is inserted, it is ready to be searched. Now write a query that matches the defined schema.

query = "그 여자는 달걀 하나를 그릇에 깨어 넣었다."
query_embedding = encoder.encode(query)
query_embed = ", ".join([str(e) for e in query_embedding[0]])
search_query = {
    "$search": {
        "query": f"(content:({query}))^0.2 AND (embedding:[{query_embed}])^20"
    },
    "$hint": "sk_fts",
    "$project": [
        "doc_id",
        "content",
        {"_meta": "$unnest"}
    ]
}
 
df = doc_db.find("test", search_query)
print(df)

You created an FTS index named sk_fts earlier. Define $search to perform an FTS query against it. The syntax used here is similar to Lucene's.

    "$search": {
        "query": f"(content:({query}))^0.2 AND (embedding:[{query_embed}])^20"
    },

Breaking down the query:

  • content:({query}): searches the text in the content field, which StandardCJKAnalyzer splits into n-grams.
  • embedding:[{query_embed}]: searches for similar vectors using the query_embed generated from query, via HNSW.
  • ^0.2, ^20: assign a boosting weight to each clause. In this example, the vector search results are given more weight.
  • AND: combines FTS and vector search into a hybrid search (each clause can also be run on its own, as sketched below).
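
To see what each side of the hybrid query contributes, the two clauses can also be run separately. This is a minimal sketch that reuses query, query_embed, and the same projection as above; the clause syntax follows the query shown earlier, with the boost factors simply omitted.

# Hedged sketch: run the FTS clause and the vector clause as separate queries.
fts_only = {
    "$search": {"query": f"content:({query})"},
    "$hint": "sk_fts",
    "$project": ["doc_id", "content", {"_meta": "$unnest"}]
}
vector_only = {
    "$search": {"query": f"embedding:[{query_embed}]"},
    "$hint": "sk_fts",
    "$project": ["doc_id", "content", {"_meta": "$unnest"}]
}

print(doc_db.find("test", fts_only))     # n-gram keyword matches only
print(doc_db.find("test", vector_only))  # nearest-neighbor vector matches only

Comparing the two result sets against the combined query makes it easier to decide how to tune the boost weights.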

Next, select the fields to be returned using $project. _meta is specified to expose the internally calculated scores; since this data is hierarchical, it is flattened into top-level fields with $unnest.

    "$project": [
        "doc_id",
        "content",
        {"_meta": "$unnest"}
    ]

Now, pass the query to the find function and execute it to get the following results.

   _meta.doc_id  _meta.score  content                          doc_id
0             0     0.313446  남자가 달걀을 그릇에 깨어 넣고 있다.    0
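
Since find returns the results as a pandas DataFrame (as the printed output above suggests), they can be post-processed with ordinary pandas operations. A small sketch, assuming the column names shown above:

# Hedged sketch: sort hits by the unnested relevance score and print them.
ranked = df.sort_values("_meta.score", ascending=False)
for _, row in ranked.iterrows():
    print(row["doc_id"], round(row["_meta.score"], 4), row["content"])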

Detailed information about search can be found in Search.

Cleanup

Finally, the inserted data can be deleted by dropping the collection with drop_collection.

doc_db.drop_collection("test")