This repository was archived by the owner on Dec 9, 2024. It is now read-only.

Bug: Duplicate Text (Chunks) in Payloads Due to Shared Metadata Reference in build_payloads Function #88

@ghost

Description of the Issue:

The build_payloads function is intended to generate a unique payload for each document chunk. Currently, every payload has a unique ID but contains the same text: the last chunk in doc.chunks. The cause is that doc.metadata is referenced directly and updated on each iteration, so all payloads are the same dictionary object and reflect its final state.

See the example in the attached screenshot.

Steps to Reproduce:

1. Call the build_payloads function with a Document object containing multiple chunks.
2. Observe that the payloads list contains different IDs but identical text (matching the last chunk).
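
The steps above can be sketched as a small standalone script. The project's actual build_payloads source is not quoted in this issue, so the `Document` dataclass, the `build_payloads_buggy` function, and the ID scheme below are hypothetical stand-ins that only mimic the described behavior.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Document:
    # Hypothetical stand-in for the project's Document model (assumption).
    id: str
    chunks: list[str]
    metadata: dict[str, Any] = field(default_factory=dict)


def build_payloads_buggy(doc: Document) -> tuple[list[str], list[dict[str, Any]]]:
    # Mimics the behavior described in this issue; not the project's real code.
    ids, payloads = [], []
    for i, chunk in enumerate(doc.chunks):
        payload = doc.metadata            # no copy: the same dict on every iteration
        payload.update({"text": chunk})   # overwrites "text" on the shared dict
        ids.append(f"{doc.id}-{i}")       # IDs are still unique (hypothetical scheme)
        payloads.append(payload)
    return ids, payloads


doc = Document(
    id="news-1",
    chunks=["first chunk", "second chunk", "third chunk"],
    metadata={"date": "2024-01-01T13:15:32+00:00"},
)
ids, payloads = build_payloads_buggy(doc)
texts = [p["text"] for p in payloads]

print(ids)    # ['news-1-0', 'news-1-1', 'news-1-2']
print(texts)  # ['third chunk', 'third chunk', 'third chunk']
```

Every entry in payloads is the very same object as doc.metadata, which is why only the last chunk's text survives.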

Expected Behavior: Each payload should contain the unique text for its corresponding chunk, along with the associated metadata.

Actual Behavior: All payloads contain the same text, resulting in incorrect data.

Additional Context: Python dictionaries are mutable, and the assignment payload = doc.metadata does not copy the dictionary; it binds a second name to the same object. Every later update to payload therefore modifies doc.metadata in place.
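
One possible fix, shown here as a minimal sketch rather than the project's actual patch, is to copy the metadata for each chunk before adding the text. The function name `build_payloads_fixed` and its parameters are hypothetical, chosen only for illustration.

```python
from typing import Any


def build_payloads_fixed(chunks: list[str], metadata: dict[str, Any]) -> list[dict[str, Any]]:
    # Hypothetical helper (assumption): one independent payload per chunk.
    payloads = []
    for chunk in chunks:
        payload = dict(metadata)   # shallow copy: a new dict for every chunk
        payload["text"] = chunk    # only this chunk's payload is changed
        payloads.append(payload)
    return payloads


metadata = {"date": "2024-01-01T13:15:32+00:00"}
alias = metadata                   # plain assignment: same object, no copy
print(alias is metadata)           # True

fixed = build_payloads_fixed(["first chunk", "second chunk"], metadata)
fixed_texts = [p["text"] for p in fixed]
print(fixed_texts)                 # ['first chunk', 'second chunk']
print("text" in metadata)          # False: the original metadata is untouched
```

A shallow copy is enough when only top-level keys such as "text" are added. If the metadata holds nested lists or dicts that are also mutated per chunk, copy.deepcopy would be the safer choice.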

Example

News Document from Alpaca

image

Payload Uploaded to Qdrant

image

Qdrant query used to identify the issue:

POST collections/alpaca_news/points/scroll
{
  "filter": {
    "must": [
      {
        "key": "date",
        "match": {
          "text": "2024-01-01T13:15:32+00:00"
        }
      }
    ]
  }
}
