yachiz/learning_note

1. Overview

This code defines a class, MilvusDBHandler, that:

  • Uses Milvus as the database to store and retrieve embeddings.
  • Leverages HuggingFaceEmbeddings to transform text into vector embeddings before storing them in the database.
  • Provides methods to add, retrieve, and search text and metadata.
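
Pieced together from the snippets in the sections below, the class shape looks roughly like this (signatures are inferred, not copied verbatim from the repo):

class MilvusDBHandler:
    def __init__(self, embedding_model_name, collection_name, connection): ...
    def add_summary(self, summary, metadata): ...   # embed one summary and store it
    def search(self, search_query, top_k=5): ...    # similarity search; returns metadata
    def get_text_data(self, fields, filter): ...    # direct scalar-field query
    def drop_data(self): ...                        # delete all rows
    def drop_collection(self): ...                  # drop and recreate the collection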

2. Key Components

Milvus and Vector Embeddings

  • Milvus: A vector database optimized for similarity search, which stores high-dimensional vector embeddings.
  • HuggingFaceEmbeddings: A module that transforms raw text into numerical embeddings (vectors).

Initialization

# Typical imports for these classes (package layout varies across LangChain versions):
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Milvus

self.embeddings = HuggingFaceEmbeddings(model_name=embedding_model_name)
self.vector_db = Milvus(embedding_function=self.embeddings, connection_args=self.connection,
                        collection_name=self.collection_name,
                        auto_id=True,
                        drop_old=False)
  • self.embeddings: Initializes the HuggingFace embedding model (e.g., all-MiniLM-L6-v2).
  • self.vector_db: Connects to Milvus and binds the embedding function (HuggingFaceEmbeddings) to automatically transform raw text into vectors.
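
For reference, connection_args for a self-hosted Milvus instance is typically a host/port dict (the values below are assumptions, not from the repo):

self.connection = {"host": "localhost", "port": "19530"}  # 19530 is Milvus's default port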

3. How the Code Works

Adding Text (add_summary)

doc = Document(page_content=summary, metadata=metadata)
self.vector_db.add_documents([doc])
  • Input: A text summary and its associated metadata.
  • Steps:
    1. Wraps the text and metadata in a Document object.
    2. Calls self.vector_db.add_documents, which:
      • Automatically transforms the text (page_content) into a vector embedding using the bound embedding_function (HuggingFaceEmbeddings).
      • Stores the vector, along with its metadata, in Milvus.
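
Wrapped into a plausible full method (the signature is assumed; Document is LangChain's document class):

from langchain_core.documents import Document

def add_summary(self, summary, metadata):
    # Wrap text + metadata; add_documents embeds page_content and stores it.
    doc = Document(page_content=summary, metadata=metadata)
    self.vector_db.add_documents([doc])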

Searching for Similar Text (search)

search_results = self.vector_db.similarity_search(search_query, k=top_k)
return [result.metadata for result in search_results]
  • Input: A search_query (text) and the number of top results (top_k).
  • Steps:
    1. Transforms the search_query into a vector embedding using HuggingFaceEmbeddings.
    2. Queries Milvus to find the most similar vectors to the query vector.
    3. Returns the metadata of the closest matches.
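
As a full method, this plausibly looks like the following (the signature and the default for top_k are assumptions):

def search(self, search_query, top_k=5):
    # similarity_search embeds the query and returns the k nearest Documents;
    # only their metadata is handed back to the caller.
    search_results = self.vector_db.similarity_search(search_query, k=top_k)
    return [result.metadata for result in search_results]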

Retrieving Stored Data (get_text_data)

results = self.vector_db.col.query(expr=filter, output_fields=fields)
return [{k: v for k, v in result.items() if k != 'pk'} for result in results]
  • Input: A filter expression and fields to retrieve.
  • Steps:
    1. Executes a query on the Milvus collection directly to retrieve stored metadata or other fields.
    2. Filters out internal fields like pk (primary key).

This does not deal with vector data directly; it retrieves the metadata you stored alongside vectors.
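
A plausible full method (signature assumed; vector_db.col is the underlying pymilvus Collection):

def get_text_data(self, fields, filter):
    # Query scalar fields directly, then strip the internal primary key.
    results = self.vector_db.col.query(expr=filter, output_fields=fields)
    return [{k: v for k, v in result.items() if k != 'pk'} for result in results]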


Dropping Data or Collections

def drop_data(self):
    # Delete every entity; assumes auto-generated integer primary keys (auto_id=True).
    if self.vector_db.col:
        self.vector_db.col.delete(expr="pk > 0")

def drop_collection(self):
    # Reinitialize with drop_old=True so the existing collection is dropped first
    # (arguments mirror the constructor shown in section 2).
    self.vector_db = Milvus(embedding_function=self.embeddings,
                            connection_args=self.connection,
                            collection_name=self.collection_name,
                            auto_id=True, drop_old=True)
  • drop_data: Deletes all entries from the current collection.
  • drop_collection: Deletes the entire collection and reinitializes it.

4. Underlying Transformation: Text to Vector

Yes, there is an underlying transformation happening automatically:

  • Text Embedding:
    • When you call add_documents or similarity_search, the text is transformed into a vector representation using the HuggingFaceEmbeddings model.
    • These vector embeddings are stored in Milvus.
  • Metadata:
    • Metadata is stored alongside the vectors, enabling you to retrieve context or additional information about the vectorized entries.
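
To see the transformation in isolation, you can call the embedding model directly; embed_query is the standard LangChain embeddings method (model name taken from the example above):

from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vector = embeddings.embed_query("A man wearing a yellow vest...")
print(len(vector))  # 384 for all-MiniLM-L6-v2 -- a list of floats, not text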

5. Does it Store Text or Vectors?

  • Vectors: The text (page_content) is transformed into a vector embedding, which is what Milvus indexes and searches.
  • Text: The LangChain Milvus wrapper also stores the raw page_content in a scalar field (named text by default); this is why get_text_data can request a "text" output field and why similarity_search returns full Documents.
  • Metadata: The metadata is stored as additional scalar fields on the same entity, linked to the vector.

6. Example Execution

Adding Summaries

summarydb.add_summary("A man wearing a yellow vest...", {"chunkIdx": 1, "start_time": 0, "end_time": 30})
  • The summary text is converted to a vector embedding.
  • The embedding and metadata (chunkIdx, start_time, end_time) are stored in Milvus.

Searching

chunks = summarydb.search("people in yellow vest and jumping?")
  • The query text is transformed into a vector embedding.
  • Milvus performs similarity search to find the closest stored vectors.
  • Returns the metadata of the closest matches.

Retrieving Metadata

print(summarydb.get_text_data(fields=["text", "start_time", "end_time"], filter="0 <= start_time and end_time <= 70"))
  • Queries metadata of summaries where the start and end times are between 0 and 70.

7. How to Modify It

If you want to customize the transformation process or storage behavior:

  • Modify the HuggingFaceEmbeddings model or replace it with a custom embedding function.
  • Adjust how metadata is handled in the add_summary and get_text_data methods.
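
For example, swapping the embedding model should only require a constructor change (the model name and constructor signature below are illustrative assumptions):

db = MilvusDBHandler(embedding_model_name="sentence-transformers/all-mpnet-base-v2",
                     collection_name="summaries",
                     connection={"host": "localhost", "port": "19530"})

Keep in mind that embeddings from different models live in different vector spaces (and often have different dimensions), so changing the model means re-ingesting existing data.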

The metadata stored alongside each summary is built from vars(chunk). Here's a structured breakdown of what that produces and why:


What is vars(chunk)?

The vars() function in Python returns the __dict__ attribute of an object, which is a dictionary containing all the attributes and their current values for that object.

For ChunkInfo Instances:

Given the definition of ChunkInfo, calling vars(chunk) will produce a dictionary with the following keys and values (a stand-in class definition appears after the list):

  • streamId (str): The unique identifier for the stream this chunk belongs to.
  • chunkIdx (int): The index of the chunk, indicating its position in the sequence.
  • file (str): The file path or stream URL associated with this chunk.
  • pts_offset_ns (int): The presentation timestamp offset for this chunk in nanoseconds.
  • start_pts (int): The start presentation timestamp of this chunk in nanoseconds.
  • end_pts (int): The end presentation timestamp of this chunk in nanoseconds.
  • start_ntp (str): The RFC3339 string representation of the start time in Network Time Protocol (NTP) format.
  • end_ntp (str): The RFC3339 string representation of the end time in NTP format.
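
A minimal stand-in definition consistent with these attributes (hypothetical; the real ChunkInfo is defined elsewhere in the codebase):

from dataclasses import dataclass

@dataclass
class ChunkInfo:
    streamId: str = ""
    chunkIdx: int = -1
    file: str = ""
    pts_offset_ns: int = 0
    start_pts: int = 0
    end_pts: int = 0
    start_ntp: str = ""
    end_ntp: str = ""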

Example Output:

For a specific chunk:

chunk = ChunkInfo()
chunk.streamId = "stream123"
chunk.chunkIdx = 5
chunk.file = "/path/to/video.mp4"
chunk.pts_offset_ns = 0
chunk.start_pts = 1000000000  # 1 second
chunk.end_pts = 2000000000    # 2 seconds
chunk.start_ntp = "2024-12-20T10:00:00.000Z"
chunk.end_ntp = "2024-12-20T10:00:01.000Z"

print(vars(chunk))

Output:

{
    "streamId": "stream123",
    "chunkIdx": 5,
    "file": "/path/to/video.mp4",
    "pts_offset_ns": 0,
    "start_pts": 1000000000,
    "end_pts": 2000000000,
    "start_ntp": "2024-12-20T10:00:00.000Z",
    "end_ntp": "2024-12-20T10:00:01.000Z"
}

What is metadata?

The metadata is constructed as a combination of vars(chunk) and additional fields. Specifically:

Code Context:

metadata = vars(chunk_) | {
    "start_ntp_float": start_ntp_as_float,
    "end_ntp_float": end_ntp_as_float
}
  • vars(chunk_): Includes all attributes of the ChunkInfo object as shown above.
  • "start_ntp_float": The start_ntp value converted to a UNIX timestamp (float).
  • "end_ntp_float": The end_ntp value converted to a UNIX timestamp (float).

Expanded Metadata Example:

Using the previous example chunk:

start_ntp_as_float = 1734688800.0  # UNIX timestamp for 2024-12-20T10:00:00.000Z
end_ntp_as_float = 1734688801.0    # UNIX timestamp for 2024-12-20T10:00:01.000Z

metadata = vars(chunk) | {
    "start_ntp_float": start_ntp_as_float,
    "end_ntp_float": end_ntp_as_float
}

Result:

{
    "streamId": "stream123",
    "chunkIdx": 5,
    "file": "/path/to/video.mp4",
    "pts_offset_ns": 0,
    "start_pts": 1000000000,
    "end_pts": 2000000000,
    "start_ntp": "2024-12-20T10:00:00.000Z",
    "end_ntp": "2024-12-20T10:00:01.000Z",
    "start_ntp_float": 1734746400.0,
    "end_ntp_float": 1734746401.0
}

Purpose of metadata

The metadata provides a detailed description of a video chunk for storage in the Milvus vector database. It serves multiple roles:

  1. Identification: Attributes like streamId, chunkIdx, and file uniquely identify the chunk.
  2. Time Information: Timestamps (start_pts, end_pts, start_ntp, end_ntp) and their float equivalents (start_ntp_float, end_ntp_float) allow precise querying and filtering.
  3. Playback/Reference: The file path and time offsets (pts_offset_ns) facilitate playback or reprocessing of specific chunks.

Sufficient Information for Analysis?

Yes, the information is sufficient to:

  1. Understand the structure and purpose of metadata.
  2. Detail the use of vars(chunk) and its output.
  3. Explain how metadata is extended with additional fields like start_ntp_float.

