This code defines a class, `MilvusDBHandler`, that:
- Uses Milvus as the database to store and retrieve embeddings.
- Leverages HuggingFaceEmbeddings to transform text into vector embeddings before storing them in the database.
- Provides methods to add, retrieve, and search text and metadata.
- Milvus: A vector database optimized for similarity search, which stores high-dimensional vector embeddings.
- HuggingFaceEmbeddings: A module that transforms raw text into numerical embeddings (vectors).
```python
self.embeddings = HuggingFaceEmbeddings(model_name=embedding_model_name)
self.vector_db = Milvus(embedding_function=self.embeddings,
                        connection_args=self.connection,
                        collection_name=self.collection_name,
                        auto_id=True,
                        drop_old=False)
```
- `self.embeddings`: Initializes the HuggingFace embedding model (e.g., `all-MiniLM-L6-v2`).
- `self.vector_db`: Connects to Milvus and binds the embedding function (`HuggingFaceEmbeddings`) to automatically transform raw text into vectors.
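For context, here is a minimal sketch of how the constructor might be wired up end to end, assuming the LangChain Milvus integration; the `host`/`port` parameters and their defaults are assumptions, and import paths vary across LangChain versions:

```python
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Milvus

class MilvusDBHandler:
    def __init__(self, collection_name, embedding_model_name,
                 host="localhost", port="19530"):  # hypothetical signature
        self.collection_name = collection_name
        self.connection = {"host": host, "port": port}
        # Embedding model that turns raw text into dense vectors.
        self.embeddings = HuggingFaceEmbeddings(model_name=embedding_model_name)
        # Vector store bound to that embedding function; reuses any existing
        # collection (drop_old=False) and lets Milvus assign primary keys.
        self.vector_db = Milvus(embedding_function=self.embeddings,
                                connection_args=self.connection,
                                collection_name=self.collection_name,
                                auto_id=True,
                                drop_old=False)
```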
```python
doc = Document(page_content=summary, metadata=metadata)
self.vector_db.add_documents([doc])
```
- Input: A text `summary` and its associated `metadata`.
- Steps (assembled into a full method below):
  - Wraps the text and metadata in a `Document` object.
  - Calls `self.vector_db.add_documents`, which:
    - Automatically transforms the text (`page_content`) into a vector embedding using the bound `embedding_function` (`HuggingFaceEmbeddings`).
    - Stores the vector, along with its metadata, in Milvus.
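Assembled as a method, this is roughly what it looks like (a minimal sketch; the `Document` import path varies across LangChain versions):

```python
from langchain_core.documents import Document

def add_summary(self, summary, metadata):
    # Wrap the raw text and its metadata; add_documents embeds page_content
    # via the bound embedding function and stores the vector in Milvus.
    doc = Document(page_content=summary, metadata=metadata)
    self.vector_db.add_documents([doc])
```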
```python
search_results = self.vector_db.similarity_search(search_query, k=top_k)
return [result.metadata for result in search_results]
```
- Input: A `search_query` (text) and the number of top results (`top_k`).
- Steps (sketched as a method below):
  - Transforms the `search_query` into a vector embedding using `HuggingFaceEmbeddings`.
  - Queries Milvus to find the most similar vectors to the query vector.
  - Returns the `metadata` of the closest matches.
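As a method, this might read as follows (a sketch; the default value for `top_k` is an assumption):

```python
def search(self, search_query, top_k=5):
    # similarity_search embeds the query and returns the top_k closest Documents.
    search_results = self.vector_db.similarity_search(search_query, k=top_k)
    # Only the metadata of each hit is returned to the caller.
    return [result.metadata for result in search_results]
```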
```python
results = self.vector_db.col.query(expr=filter, output_fields=fields)
return [{k: v for k, v in result.items() if k != 'pk'} for result in results]
```
- Input: A filter expression and the fields to retrieve.
- Steps:
  - Executes a query directly on the Milvus collection to retrieve stored metadata or other fields.
  - Filters out internal fields like `pk` (the primary key).
This does not deal with vector data directly; it retrieves the metadata you stored alongside vectors.
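A plausible shape for the full method (a sketch based on the two lines above):

```python
def get_text_data(self, fields, filter):
    # vector_db.col is the underlying pymilvus Collection; query() evaluates
    # the boolean filter expression and returns only the requested fields.
    results = self.vector_db.col.query(expr=filter, output_fields=fields)
    # Strip the auto-generated primary key from each row before returning.
    return [{k: v for k, v in result.items() if k != 'pk'} for result in results]
```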
```python
def drop_data(self):
    if self.vector_db.col:
        self.vector_db.col.delete(expr="pk > 0")

def drop_collection(self):
    self.vector_db = Milvus(..., drop_old=True)
```
- `drop_data`: Deletes all entries from the current collection.
- `drop_collection`: Deletes the entire collection and reinitializes it.
Yes, there is an underlying transformation happening automatically:
- Text Embedding:
  - When you call `add_documents` or `similarity_search`, the text is transformed into a vector representation using the `HuggingFaceEmbeddings` model.
  - These vector embeddings are stored in Milvus.
- Metadata:
  - Metadata is stored alongside the vectors, enabling you to retrieve context or additional information about the vectorized entries.
- Vectors: The actual text (`page_content`) is transformed into vector embeddings, which are stored in Milvus.
- Metadata: The metadata is stored as auxiliary data linked to the vectors.
- The raw text is not stored in Milvus unless explicitly added as metadata.
```python
summarydb.add_summary("A man wearing a yellow vest...", {"chunkIdx": 1, "start_time": 0, "end_time": 30})
```
- The summary text is converted to a vector embedding.
- The embedding and metadata (`chunkIdx`, `start_time`, `end_time`) are stored in Milvus.
```python
chunks = summarydb.search("people in yellow vest and jumping?")
```
- The query text is transformed into a vector embedding.
- Milvus performs similarity search to find the closest stored vectors.
- Returns the metadata of the closest matches.
```python
print(summarydb.get_text_data(fields=["text", "start_time", "end_time"], filter="0 <= start_time and end_time <= 70"))
```
- Queries metadata of summaries where the start and end times are between 0 and 70.
If you want to customize the transformation process or storage behavior:
- Modify the `HuggingFaceEmbeddings` model or replace it with a custom embedding function (see the example below).
- Adjust how metadata is handled in the `add_summary` and `get_text_data` methods.
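For example, swapping in a different sentence-transformers model only requires changing the constructor argument (the model name, connection details, and collection name below are illustrative assumptions):

```python
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
vector_db = Milvus(embedding_function=embeddings,
                   connection_args={"host": "localhost", "port": "19530"},
                   collection_name="summaries",
                   auto_id=True)
```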
With the information provided, there is sufficient detail to understand the `metadata` structure derived from `vars(chunk)` and its purpose. Here’s a structured breakdown:
The `vars()` function in Python returns the `__dict__` attribute of an object, which is a dictionary containing all the attributes and their current values for the given object.
Given the definition of `ChunkInfo` (a sketch of the class follows the list), calling `vars(chunk)` will produce a dictionary with the following keys and values:

- `streamId` (`str`): The unique identifier for the stream this chunk belongs to.
- `chunkIdx` (`int`): The index of the chunk, indicating its position in the sequence.
- `file` (`str`): The file path or stream URL associated with this chunk.
- `pts_offset_ns` (`int`): The presentation timestamp offset for this chunk in nanoseconds.
- `start_pts` (`int`): The start presentation timestamp of this chunk in nanoseconds.
- `end_pts` (`int`): The end presentation timestamp of this chunk in nanoseconds.
- `start_ntp` (`str`): The RFC3339 string representation of the start time in Network Time Protocol (NTP) format.
- `end_ntp` (`str`): The RFC3339 string representation of the end time in NTP format.
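The class definition itself is not shown here; a minimal sketch consistent with the attributes above (the default values are assumptions):

```python
class ChunkInfo:
    """Plain attribute container describing one video chunk (sketch)."""
    def __init__(self):
        self.streamId = ""       # unique stream identifier
        self.chunkIdx = 0        # position of the chunk in the sequence
        self.file = ""           # file path or stream URL
        self.pts_offset_ns = 0   # PTS offset in nanoseconds
        self.start_pts = 0       # start PTS in nanoseconds
        self.end_pts = 0         # end PTS in nanoseconds
        self.start_ntp = ""      # RFC3339 start time (NTP-synchronized)
        self.end_ntp = ""        # RFC3339 end time (NTP-synchronized)
```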
For a specific chunk:
```python
chunk = ChunkInfo()
chunk.streamId = "stream123"
chunk.chunkIdx = 5
chunk.file = "/path/to/video.mp4"
chunk.pts_offset_ns = 0
chunk.start_pts = 1000000000   # 1 second
chunk.end_pts = 2000000000     # 2 seconds
chunk.start_ntp = "2024-12-20T10:00:00.000Z"
chunk.end_ntp = "2024-12-20T10:00:01.000Z"

print(vars(chunk))
```
Output:
```python
{
    "streamId": "stream123",
    "chunkIdx": 5,
    "file": "/path/to/video.mp4",
    "pts_offset_ns": 0,
    "start_pts": 1000000000,
    "end_pts": 2000000000,
    "start_ntp": "2024-12-20T10:00:00.000Z",
    "end_ntp": "2024-12-20T10:00:01.000Z"
}
```
The `metadata` is constructed as a combination of `vars(chunk)` and additional fields. Specifically:
```python
metadata = vars(chunk_) | {
    "start_ntp_float": start_ntp_as_float,
    "end_ntp_float": end_ntp_as_float
}
```
- `vars(chunk_)`: Includes all attributes of the `ChunkInfo` object as shown above.
- `"start_ntp_float"`: The `start_ntp` value converted to a UNIX timestamp (float); see the conversion sketch after this list.
- `"end_ntp_float"`: The `end_ntp` value converted to a UNIX timestamp (float).
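The conversion helper is not shown in the snippets above; one standard-library way to do it (an assumption, not the handler's actual code):

```python
from datetime import datetime, timezone

def rfc3339_to_unix(ts: str) -> float:
    # Parse an RFC3339 string like "2024-12-20T10:00:00.000Z" into a UTC-aware
    # datetime, then convert it to a UNIX timestamp (seconds since the epoch).
    dt = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")
    return dt.replace(tzinfo=timezone.utc).timestamp()

start_ntp_as_float = rfc3339_to_unix(chunk.start_ntp)  # 1734688800.0
end_ntp_as_float = rfc3339_to_unix(chunk.end_ntp)      # 1734688801.0
```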
Using the previous example `chunk`:

```python
start_ntp_as_float = 1734688800.0  # UNIX timestamp for 2024-12-20T10:00:00.000Z
end_ntp_as_float = 1734688801.0    # UNIX timestamp for 2024-12-20T10:00:01.000Z

metadata = vars(chunk) | {
    "start_ntp_float": start_ntp_as_float,
    "end_ntp_float": end_ntp_as_float
}
```
Result:
```python
{
    "streamId": "stream123",
    "chunkIdx": 5,
    "file": "/path/to/video.mp4",
    "pts_offset_ns": 0,
    "start_pts": 1000000000,
    "end_pts": 2000000000,
    "start_ntp": "2024-12-20T10:00:00.000Z",
    "end_ntp": "2024-12-20T10:00:01.000Z",
    "start_ntp_float": 1734688800.0,
    "end_ntp_float": 1734688801.0
}
```
The `metadata` provides a detailed description of a video chunk for storage in the Milvus vector database. It serves multiple roles:
- Identification: Attributes like `streamId`, `chunkIdx`, and `file` uniquely identify the chunk.
- Time Information: Timestamps (`start_pts`, `end_pts`, `start_ntp`, `end_ntp`) and their float equivalents (`start_ntp_float`, `end_ntp_float`) allow precise querying and filtering (see the range-filter example after this list).
- Playback/Reference: The `file` path and time offsets (`pts_offset_ns`) facilitate playback or reprocessing of specific chunks.
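For instance, the float timestamps make range filters over wall-clock time straightforward (a sketch reusing the example values above):

```python
# Retrieve chunks whose NTP window lies inside a one-minute range.
results = summarydb.get_text_data(
    fields=["streamId", "chunkIdx", "file"],
    filter="start_ntp_float >= 1734688800.0 and end_ntp_float <= 1734688860.0",
)
```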
Yes, the information is sufficient to:
- Understand the structure and purpose of `metadata`.
- Detail the use of `vars(chunk)` and its output.
- Explain how `metadata` is extended with additional fields like `start_ntp_float`.
If you need further insights into how this metadata integrates into downstream operations or the Milvus database, feel free to ask!