-
Notifications
You must be signed in to change notification settings - Fork 2k
[ENH]: Force statistics wrapper callers to specify an output collection #5961
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Reviewer ChecklistPlease leverage this checklist to ensure your code review is thorough before approving Testing, Bugs, Errors, Logs, Documentation
System Compatibility
Quality
|
This stack of pull requests is managed by Graphite. Learn more about stacking. |
|
Explicit statistics output collection parameter introduced Updates the statistics utilities to require callers to explicitly supply a statistics output collection name when attaching or retrieving statistics. Corresponding integration tests now provide explicit names and include minor typing cleanups, ensuring the new API contract is exercised. Key Changes• Changed Affected Areas• This summary was automatically generated by @propel-code-bot |
29a7d6c to
cb626b1
Compare
| stats_collection_model = collection._client.get_collection( | ||
| name=af.output_collection, | ||
| name=stats_collection_name, | ||
| tenant=collection.tenant, | ||
| database=collection.database, | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Logic] The previous implementation of get_statistics used get_statistics_fn(collection) to verify that the statistics function was actually attached to the collection before attempting to fetch the results. This check has been removed.
While the goal is to make the stats_collection_name explicit, removing this validation weakens the function's contract. It's now possible to call get_statistics on a collection where statistics are not enabled, or to pass a stats_collection_name that doesn't correspond to the one configured for the collection. This could lead to confusing CollectionNotFound errors or incorrect empty results if an unrelated collection with that name exists.
Consider reintroducing a check to ensure that statistics are enabled for the given collection, for example by calling get_statistics_fn(collection) before this block. This would make the function more robust and its behavior clearer.
Context for Agents
The previous implementation of `get_statistics` used `get_statistics_fn(collection)` to verify that the statistics function was actually attached to the collection before attempting to fetch the results. This check has been removed.
While the goal is to make the `stats_collection_name` explicit, removing this validation weakens the function's contract. It's now possible to call `get_statistics` on a collection where statistics are not enabled, or to pass a `stats_collection_name` that doesn't correspond to the one configured for the collection. This could lead to confusing `CollectionNotFound` errors or incorrect empty results if an unrelated collection with that name exists.
Consider reintroducing a check to ensure that statistics are enabled for the given collection, for example by calling `get_statistics_fn(collection)` before this block. This would make the function more robust and its behavior clearer.
File: chromadb/utils/statistics.py
Line: 186dae1739 to
33d24e6
Compare
cb626b1 to
9a11f92
Compare
33d24e6 to
0d7c297
Compare
9a11f92 to
8822537
Compare
0d7c297 to
a1e139d
Compare
8822537 to
5e6d93b
Compare
5e6d93b to
954de4c
Compare
a1e139d to
c10e9be
Compare
bca5dfd to
5e6d93b
Compare
| def get_statistics( | ||
| collection: "Collection", key: Optional[str] = None | ||
| collection: "Collection", stats_collection_name: str, key: Optional[str] = None | ||
| ) -> Dict[str, Any]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Reliability] The previous implementation of get_statistics verified that a statistics function was attached to the collection by calling get_statistics_fn. This check has been removed, which could lead to confusing behavior. For example, a user could accidentally pass the name of a regular collection to stats_collection_name and receive either an error about malformed statistics data or nonsensical results.
Consider reintroducing a validation step to ensure that a statistics function is attached and that it points to the provided stats_collection_name. This would make the function more robust and user-friendly.
For example, you could add a check at the beginning of the function:
from chromadb.errors import NotFoundError
try:
af = get_statistics_fn(collection)
if af.output_collection != stats_collection_name:
raise ValueError(
f"Statistics function for collection '{collection.name}' outputs to "
f"'{af.output_collection}', not '{stats_collection_name}'."
)
except NotFoundError:
raise ValueError(
f"Statistics are not enabled for collection '{collection.name}'. "
f"Please call \`attach_statistics_function\` first."
) from NoneThis would restore the safety of the previous implementation while still respecting the new explicit stats_collection_name parameter.
Context for Agents
The previous implementation of `get_statistics` verified that a statistics function was attached to the collection by calling `get_statistics_fn`. This check has been removed, which could lead to confusing behavior. For example, a user could accidentally pass the name of a regular collection to `stats_collection_name` and receive either an error about malformed statistics data or nonsensical results.
Consider reintroducing a validation step to ensure that a statistics function is attached and that it points to the provided `stats_collection_name`. This would make the function more robust and user-friendly.
For example, you could add a check at the beginning of the function:
```python
from chromadb.errors import NotFoundError
try:
af = get_statistics_fn(collection)
if af.output_collection != stats_collection_name:
raise ValueError(
f"Statistics function for collection '{collection.name}' outputs to "
f"'{af.output_collection}', not '{stats_collection_name}'."
)
except NotFoundError:
raise ValueError(
f"Statistics are not enabled for collection '{collection.name}'. "
f"Please call \`attach_statistics_function\` first."
) from None
```
This would restore the safety of the previous implementation while still respecting the new explicit `stats_collection_name` parameter.
File: chromadb/utils/statistics.py
Line: 1255e6d93b to
41e5172
Compare
Merge activity
|

Description of changes
Summarize the changes made by this PR.
Test plan
How are these changes tested?
pytestfor python,yarn testfor js,cargo testfor rustMigration plan
Are there any migrations, or any forwards/backwards compatibility changes needed in order to make sure this change deploys reliably?
Observability plan
What is the plan to instrument and monitor this change?
Documentation Changes
Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the _docs section?_