Quick Start
This document explains how to install the Aeca SDK and how to use it.
Installing Aeca and the Python SDK
Install the SDK via pip:
pip install aeca
Then run the Aeca server using Docker:
mkdir data
docker run --rm -it --name aeca \
-p 10080:10080 -v $(pwd)/data:/app/data \
aeca/aeca-server server
Once started, the server is ready to accept connections on port 10080.
Aeca v1.1.0
[2024-04-17T08:08:01.138] [general] [info] [1] Initializing gRPC service...
[2024-04-17T08:08:01.138] [general] [info] [1] Initializing KeyValueDBService...
[2024-04-17T08:08:01.138] [general] [info] [1] KeyValueDBService has been successfully initialized.
[2024-04-17T08:08:01.138] [general] [info] [1] Initializing DocumentDBService...
[2024-04-17T08:08:01.138] [general] [debug] [159] Long-running query monitor is enabled.
[2024-04-17T08:08:01.138] [general] [info] [1] DocumentDBService has been successfully initialized.
[2024-04-17T08:08:01.138] [general] [info] [1] Initializing FTSAnalysisPipelineService...
[2024-04-17T08:08:01.138] [general] [info] [1] FTSAnalysisPipelineService has been successfully initialized.
[2024-04-17T08:08:01.138] [general] [info] [1] Initializing SentenceTransformerService...
[2024-04-17T08:08:01.138] [general] [debug] [160] Long-running query monitor is enabled.
[2024-04-17T08:08:01.139] [general] [info] [1] SentenceTransformerService has been successfully initialized.
[2024-04-17T08:08:01.143] [general] [info] [1] gRPC service has been successfully initialized.
[2024-04-17T08:08:01.143] [general] [info] [1] Server listening on 0.0.0.0:10080 (Insecure)
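Before defining a schema, you may want to confirm the server is reachable. A minimal sketch using only Python's standard library (the host and port are assumptions matching the Docker command above):

```python
import socket

def is_server_up(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Check the Aeca server started by the Docker command above.
print(is_server_up("localhost", 10080))
```

This only checks that the port is open; the actual gRPC handshake happens when the SDK connects.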
Defining Schema
Create a DocumentDB object to connect to the server.
from aeca import Channel, DocumentDB
channel = Channel("localhost", 10080)
doc_db = DocumentDB(channel)
Suppose you want to store data with the following fields:
- doc_id: document id
- content: text data
- embedding: embedding vector for the content
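For illustration, a single document with this layout might look like the following sketch (the short vector is a stand-in; the index defined below uses 768-dimensional embeddings):

```python
# Hypothetical example document; a real embedding would have 768 dimensions.
doc = {
    "doc_id": 0,
    "content": "남자가 달걀을 그릇에 깨어 넣고 있다.",
    "embedding": [0.12, -0.03, 0.55, 0.07],  # truncated for illustration
}
print(sorted(doc))  # → ['content', 'doc_id', 'embedding']
```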
To achieve this, define the indexes as follows and create a collection named test.
indexes = [
    {
        "name": "__primary_key__",
        "fields": [
            "doc_id"
        ],
        "unique": True,
        "index_type": "kPrimaryKey"
    },
    {
        "name": "sk_fts",
        "fields": [
            "doc_id",
            "content",
            "embedding"
        ],
        "unique": False,
        "index_type": "kFullTextSearchIndex",
        "options": {
            "doc_id": {
                "analyzer": {
                    "type": "KeywordAnalyzer"
                },
                "index_options": "doc_freqs"
            },
            "content": {
                "analyzer": {
                    "type": "StandardCJKAnalyzer",
                    "options": {
                        "tokenizer": "icu",
                        "ngram_filter": {
                            "min_size": 2,
                            "max_size": 4
                        }
                    }
                },
                "index_options": "offsets"
            },
            "embedding": {
                "analyzer": {
                    "type": "DenseVectorAnalyzer",
                    "options": {
                        "index_type": "HNSW",
                        "dims": 768,
                        "m": 64,
                        "ef_construction": 200,
                        "ef_search": 32,
                        "metric": "inner_product",
                        "normalize": True,
                        "shards": 1
                    }
                },
                "index_options": "doc_freqs"
            }
        }
    }
]
doc_db.create_collection("test", indexes=indexes)
The indexes list in the above code defines two indices. The first is the primary key on the doc_id field.
{
    "name": "__primary_key__",
    "fields": [
        "doc_id"
    ],
    "unique": True,
    "index_type": "kPrimaryKey"
},
Next is the index for Full-Text Search (FTS). A different analyzer is defined for each field: doc_id uses KeywordAnalyzer for exact matching, content uses StandardCJKAnalyzer for FTS, and embedding is configured with DenseVectorAnalyzer for vector search.
{
    "name": "sk_fts",
    "fields": [
        "doc_id",
        "content",
        "embedding"
    ],
    "unique": False,
    "index_type": "kFullTextSearchIndex",
    "options": {
        "doc_id": {
            "analyzer": {
                "type": "KeywordAnalyzer"
            },
            "index_options": "doc_freqs"
        },
        "content": {
            "analyzer": {
                "type": "StandardCJKAnalyzer",
                "options": {
                    "tokenizer": "icu",
                    "ngram_filter": {
                        "min_size": 2,
                        "max_size": 4
                    }
                }
            },
            "index_options": "offsets"
        },
        "embedding": {
            "analyzer": {
                "type": "DenseVectorAnalyzer",
                "options": {
                    "index_type": "HNSW",
                    "dims": 768,
                    "m": 64,
                    "ef_construction": 200,
                    "ef_search": 32,
                    "metric": "inner_product",
                    "normalize": True,
                    "shards": 1
                }
            },
            "index_options": "doc_freqs"
        }
    }
}
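To get a feel for what the ngram_filter with min_size 2 and max_size 4 produces, here is a rough character n-gram sketch. Note this is an approximation: in the actual analyzer, tokenization is first done by the ICU tokenizer, which this sketch skips.

```python
def char_ngrams(token: str, min_size: int = 2, max_size: int = 4) -> list[str]:
    """All character n-grams of token with lengths from min_size to max_size."""
    return [
        token[i:i + n]
        for n in range(min_size, max_size + 1)
        for i in range(len(token) - n + 1)
    ]

print(char_ngrams("달걀을"))  # → ['달걀', '걀을', '달걀을']
```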
Detailed information about schema definition can be found in Schema for Index Definition.
Data Input
The collection is now ready for data input. Aeca ships with features for ML model serving, so you can directly use models from the Huggingface hub with SentenceTransformers, as shown below.
from aeca import SentenceTransformerEncoder
encoder = SentenceTransformerEncoder(channel, "ko-sbert-sts")
sentences = [
    "남자가 달걀을 그릇에 깨어 넣고 있다.",
    "한 남자가 기타를 치고 있다.",
    "목표물이 총으로 맞고 있다.",
    "남자가 냄비에 국물을 부어 넣고 있다."
]
embeddings = encoder.encode(sentences)
docs = []
for doc_id, (sentence, embedding) in enumerate(zip(sentences, embeddings)):
    doc = {
        "doc_id": doc_id,
        "content": sentence,
        "embedding": embedding.tolist()
    }
    docs.append(doc)
doc_db.insert("test", docs)
The IDs, sentences, and embedding vectors are assembled into documents and inserted with the insert function.
Detailed information about data input can be found in Data Input and Management.
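The embedding index defined earlier uses metric inner_product together with normalize: True. For unit-length vectors, the inner product equals cosine similarity, which the following pure-Python sketch illustrates:

```python
import math

def normalize(v: list[float]) -> list[float]:
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def inner_product(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

a, b = [3.0, 4.0], [4.0, 3.0]
# Inner product of normalized vectors == cosine similarity of the originals.
print(round(inner_product(normalize(a), normalize(b)), 4))  # → 0.96
```

This is why normalize: True makes inner_product behave as cosine similarity in the HNSW index.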
Search
Once the data is inserted, it is ready for search. Now write query syntax appropriate to the defined schema.
query = "그 여자는 달걀 하나를 그릇에 깨어 넣었다."
query_embedding = encoder.encode(query)
query_embed = ", ".join([str(e) for e in query_embedding[0]])
search_query = {
    "$search": {
        "query": f"(content:({query}))^0.2 AND (embedding:[{query_embed}])^20"
    },
    "$hint": "sk_fts",
    "$project": [
        "doc_id",
        "content",
        {"_meta": "$unnest"}
    ]
}
df = doc_db.find("test", search_query)
print(df)
You created an FTS index named sk_fts. Define $search to perform an FTS query. The syntax used here is similar to Lucene.
"$search": {
    "query": f"(content:({query}))^0.2 AND (embedding:[{query_embed}])^20"
},
If you break down the query:
- content:({query}): searches the text data in the content field, which StandardCJKAnalyzer splits into n-grams.
- embedding:[{query_embed}]: searches for similar vectors with HNSW, using the query_embed generated from query.
- ^0.2 and ^20: assign a boosting score to each clause; in this example, results from vector search are weighted more heavily.
- AND: combines FTS and vector search into a hybrid search.
Next, select the fields to output using $project. _meta is specified to expose the internally calculated scores; since this data is hierarchical, it is flattened into one-dimensional fields with $unnest.
"$project": [
    "doc_id",
    "content",
    {"_meta": "$unnest"}
]
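The effect of $unnest can be mimicked with a small flattening helper; this sketch only reproduces the output shape and is not Aeca's implementation:

```python
def unnest(doc: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into dotted keys, e.g. {"_meta": {"score": 1}} -> {"_meta.score": 1}."""
    flat = {}
    for key, value in doc.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(unnest(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat

row = {"doc_id": 0, "content": "남자가 달걀을 그릇에 깨어 넣고 있다.",
       "_meta": {"doc_id": 0, "score": 0.313446}}
print(unnest(row))
```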
Now pass the query to the find function and execute it to get the following results.
_meta.doc_id _meta.score content doc_id
0 0 0.313446 남자가 달걀을 그릇에 깨어 넣고 있다. 0
Detailed information about search can be found in Search.
Cleanup
The data that has been inserted and searched can now be deleted using drop_collection.
doc_db.drop_collection("test")