dluc (Collaborator) commented Dec 2, 2025

Summary

Implements embedding generation infrastructure with caching support:

Core Components

  • IEmbeddingGenerator: Core interface for all embedding providers
  • EmbeddingConstants: Known model dimensions (OpenAI, Ollama, HuggingFace models)
  • CachedEmbeddingGenerator: Decorator pattern for transparent caching
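
The decorator idea behind CachedEmbeddingGenerator can be sketched in a few lines. This is a minimal Python illustration, not the PR's C# code: `CachingGenerator`, `fake_embed`, and the in-memory dictionary are hypothetical stand-ins, and the real interface methods may differ.

```python
import hashlib
from typing import Callable, Dict, List

Vector = List[float]


class CachingGenerator:
    """Hypothetical decorator: wraps any embedding function and memoizes results."""

    def __init__(self, inner: Callable[[str], Vector]) -> None:
        self._inner = inner
        self._cache: Dict[str, Vector] = {}  # in-memory stand-in for the SQLite cache
        self.inner_calls = 0  # counts how often the wrapped generator is invoked

    def generate(self, text: str) -> Vector:
        # Key by hash so the raw text is never kept in the cache.
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        cached = self._cache.get(key)
        if cached is not None:
            return cached
        self.inner_calls += 1
        vector = self._inner(text)
        self._cache[key] = vector
        return vector


def fake_embed(text: str) -> Vector:
    # Deterministic placeholder for a real provider call.
    return [float(len(text)), float(sum(text.encode("utf-8")) % 97)]


generator = CachingGenerator(fake_embed)
first = generator.generate("hello")
second = generator.generate("hello")  # served from the cache
third = generator.generate("world")   # cache miss, calls fake_embed again
```

Callers only see `generate`, so swapping a plain generator for the cached one requires no changes on their side, which is what "transparent" means here.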

Cache System

  • IEmbeddingCache: Cache interface with SQLite implementation
  • EmbeddingCacheKey: SHA256 hashing (privacy: input text never stored)
  • CachedEmbedding: Model with provider/model metadata and timestamps
  • CacheModes: ReadWrite, ReadOnly, WriteOnly options
  • Composite primary key (provider, model, dimensions, normalized, text_hash), chosen for easier debugging and selective queries
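
One plausible reading of the three cache modes is sketched below. The semantics are an assumption, not confirmed by the PR text: ReadWrite both reads and writes, ReadOnly reads without ever adding entries, and WriteOnly populates the cache without ever reading from it. `CacheMode`, `embed_with_mode`, and `counting_embed` are hypothetical names.

```python
from enum import Enum
from typing import Callable, Dict, List

Vector = List[float]


class CacheMode(Enum):
    READ_WRITE = "ReadWrite"
    READ_ONLY = "ReadOnly"
    WRITE_ONLY = "WriteOnly"


def embed_with_mode(
    key: str,
    text: str,
    mode: CacheMode,
    cache: Dict[str, Vector],
    embed: Callable[[str], Vector],
) -> Vector:
    # Assumed semantics: reads are allowed in ReadWrite/ReadOnly,
    # writes are allowed in ReadWrite/WriteOnly.
    can_read = mode in (CacheMode.READ_WRITE, CacheMode.READ_ONLY)
    can_write = mode in (CacheMode.READ_WRITE, CacheMode.WRITE_ONLY)
    if can_read and key in cache:
        return cache[key]
    vector = embed(text)
    if can_write:
        cache[key] = vector
    return vector


calls: List[str] = []


def counting_embed(text: str) -> Vector:
    # Placeholder provider that records every invocation.
    calls.append(text)
    return [float(len(text))]


read_only_cache: Dict[str, Vector] = {}
embed_with_mode("k1", "abc", CacheMode.READ_ONLY, read_only_cache, counting_embed)

write_only_cache: Dict[str, Vector] = {}
embed_with_mode("k1", "abc", CacheMode.WRITE_ONLY, write_only_cache, counting_embed)
embed_with_mode("k1", "abc", CacheMode.WRITE_ONLY, write_only_cache, counting_embed)

read_write_cache: Dict[str, Vector] = {}
embed_with_mode("k1", "abc", CacheMode.READ_WRITE, read_write_cache, counting_embed)
embed_with_mode("k1", "abc", CacheMode.READ_WRITE, read_write_cache, counting_embed)
```

Under these assumed semantics, only the second ReadWrite call is served from the cache; the other four calls reach the provider.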

Embedding Providers

  • OllamaEmbeddingGenerator: Local models via Ollama API
  • OpenAIEmbeddingGenerator: OpenAI API (text-embedding-3-small/large, ada-002)
  • AzureOpenAIEmbeddingGenerator: Azure OpenAI deployments
  • HuggingFaceEmbeddingGenerator: HuggingFace Inference API

Security

  • Input text is never stored in the cache; only its SHA256 hash is
  • Cache key includes: hash + provider + model + dimensions + normalization flag
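
As an illustration of how such a key could be built, here is a small Python sketch (not the PR's C# code). The `CacheKey` and `make_key` names, the UTF-8 encoding, and the hex digest format are assumptions; the treatment of null text as an empty string follows a later commit in this PR.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CacheKey:
    provider: str
    model: str
    dimensions: int
    is_normalized: bool
    text_hash: str  # SHA-256 of the input; the input itself is not kept


def make_key(
    provider: str,
    model: str,
    dimensions: int,
    is_normalized: bool,
    text: Optional[str],
) -> CacheKey:
    # None is treated as the empty string before hashing.
    digest = hashlib.sha256((text or "").encode("utf-8")).hexdigest()
    return CacheKey(provider, model, dimensions, is_normalized, digest)


key_a = make_key("openai", "text-embedding-3-small", 1536, False, "hello world")
key_b = make_key("openai", "text-embedding-3-small", 1536, False, "hello world")
key_c = make_key("openai", "text-embedding-3-large", 3072, False, "hello world")
key_empty = make_key("openai", "text-embedding-3-small", 1536, False, None)
```

Identical inputs give equal keys, while the same text under a different model or dimension count gives a distinct key, so vectors from different configurations never collide.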

Test Plan

  • 133 new tests covering all components
  • Test coverage: 86.33% (above 80% threshold)
  • All 770 tests pass (0 skipped)
  • format.sh passes
  • build.sh passes with 0 warnings, 0 errors
  • coverage.sh passes

dluc added 2 commits December 2, 2025 16:44
Implement embedding generation infrastructure with caching support:

Core Components:
- IEmbeddingGenerator interface for all providers
- EmbeddingConstants with known model dimensions
- CachedEmbeddingGenerator decorator for transparent caching

Cache System:
- IEmbeddingCache interface with SQLite implementation
- EmbeddingCacheKey with SHA256 hashing (privacy: text never stored)
- CachedEmbedding model with provider/model metadata
- CacheModes enum: ReadWrite, ReadOnly, WriteOnly

Embedding Providers:
- OllamaEmbeddingGenerator for local models
- OpenAIEmbeddingGenerator for OpenAI API
- AzureOpenAIEmbeddingGenerator for Azure OpenAI
- HuggingFaceEmbeddingGenerator for HuggingFace Inference API

Configuration:
- HuggingFaceEmbeddingsConfig with JSON discriminator
- Updated EmbeddingsTypes enum with HuggingFace value

Test coverage: 86.33% (133 new tests)

Changed from single TEXT key to composite primary key with individual
columns (provider, model, dimensions, normalized, text_hash) for:
- Easier debugging when inspecting SQLite database contents
- Selective SELECT/DELETE with individual discriminators
- Better query flexibility for cache management

Added indexes on provider and (provider, model) for common query patterns.
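
The benefit of individual key columns can be seen in a small SQLite session. This is a sketch under assumptions: the table name `embeddings_cache`, the column types, and the sample rows are hypothetical; only the column names in the primary key come from the commit message.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE embeddings_cache (
        provider   TEXT    NOT NULL,
        model      TEXT    NOT NULL,
        dimensions INTEGER NOT NULL,
        normalized INTEGER NOT NULL,
        text_hash  TEXT    NOT NULL,
        vector     BLOB    NOT NULL,
        PRIMARY KEY (provider, model, dimensions, normalized, text_hash)
    )
    """
)

# Placeholder hashes and vectors; real rows would hold SHA256 digests and embeddings.
rows = [
    ("openai", "text-embedding-3-small", 1536, 0, "hash-a", b"\x00"),
    ("openai", "text-embedding-3-small", 1536, 0, "hash-b", b"\x00"),
    ("openai", "text-embedding-3-large", 3072, 0, "hash-a", b"\x00"),
    ("ollama", "nomic-embed-text", 768, 0, "hash-a", b"\x00"),
]
conn.executemany("INSERT INTO embeddings_cache VALUES (?, ?, ?, ?, ?, ?)", rows)

# Selective DELETE: drop every cached vector for one provider/model pair.
deleted = conn.execute(
    "DELETE FROM embeddings_cache WHERE provider = ? AND model = ?",
    ("openai", "text-embedding-3-small"),
).rowcount

# Selective SELECT: count what is left per provider.
remaining = conn.execute(
    "SELECT provider, COUNT(*) FROM embeddings_cache GROUP BY provider ORDER BY provider"
).fetchall()
```

With a single opaque TEXT key, the same cleanup would require parsing or pattern-matching the key string instead of a plain WHERE clause.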
Copilot AI left a comment

Pull request overview

This PR adds embedding generation and caching infrastructure to the KernelMemory project, with support for multiple embedding providers and transparent caching.

Key Changes:

  • Adds four embedding provider implementations (OpenAI, Azure OpenAI, Ollama, HuggingFace)
  • Implements SQLite-based caching with SHA256 hashing for privacy (input text never stored)
  • Provides decorator pattern for transparent caching with configurable modes (ReadWrite, ReadOnly, WriteOnly)

Reviewed changes

Copilot reviewed 29 out of 29 changed files in this pull request and generated 26 comments.

| File | Description |
| --- | --- |
| tests/Core.Tests/GlobalUsings.cs | Adds CA2000 and CA1861 suppressions for test code |
| tests/Core.Tests/Embeddings/Providers/*.cs | Comprehensive unit tests for all four embedding providers |
| tests/Core.Tests/Embeddings/EmbeddingConstantsTests.cs | Tests for known model dimensions and defaults |
| tests/Core.Tests/Embeddings/EmbeddingCacheKeyTests.cs | Tests for SHA256 hashing and cache key generation |
| tests/Core.Tests/Embeddings/CachedEmbeddingTests.cs | Tests for cached embedding model |
| tests/Core.Tests/Embeddings/CachedEmbeddingGeneratorTests.cs | Tests for caching decorator with mixed cache hits/misses |
| tests/Core.Tests/Embeddings/Cache/SqliteEmbeddingCacheTests.cs | Integration tests for SQLite cache with all modes |
| tests/Core.Tests/Config/*.cs | Tests for HuggingFace config and cache modes enum |
| src/Core/Embeddings/Providers/*.cs | Implementation of OpenAI, Ollama, HuggingFace, and Azure OpenAI generators |
| src/Core/Embeddings/IEmbeddingGenerator.cs | Core interface for all embedding providers |
| src/Core/Embeddings/EmbeddingConstants.cs | Known model dimensions and default configurations |
| src/Core/Embeddings/CachedEmbeddingGenerator.cs | Decorator for transparent caching with mode support |
| src/Core/Embeddings/Cache/*.cs | Cache interface, SQLite implementation, cache key, and cached embedding models |
| src/Core/Config/Enums/*.cs | CacheModes enum and HuggingFace addition to EmbeddingsTypes |
| src/Core/Config/Embeddings/*.cs | HuggingFace configuration and type discriminator updates |

Changes:
- Allow null text in EmbeddingCacheKey.Create(), normalize to empty string
- Remove unnecessary indexes (idx_timestamp, idx_provider, idx_model)
- Rename 'normalized' column to 'is_normalized' for clarity
- Rename 'timestamp' column to 'created_at' for clarity
- Simplify timestamp documentation (debugging only, no invalidation)
dluc force-pushed the embeddings-feature branch from b9ec398 to 12eb472 on December 2, 2025 17:05
dluc added 3 commits December 2, 2025 18:09
Embeddings are immutable: for a given provider, model, dimensions, and normalization setting, the same text always produces the same vector.
Cache invalidation and timestamps are therefore unnecessary, which simplifies the schema.
CachedEmbedding now only stores the vector:
- Removed TokenCount (token counting should be done before embedding)
- Removed timestamp (embeddings are immutable, no invalidation needed)
- StoreAsync signature simplified: (key, vector, ct)

Final schema:
  provider, model, dimensions, is_normalized, text_length, text_hash, vector
  PRIMARY KEY (provider, model, dimensions, is_normalized, text_hash)
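
A store/lookup round trip against this final schema might look like the following Python sketch. It is illustrative only: the table name, column types, little-endian float32 packing of the vector, and the `store`/`try_get` helpers are assumptions, not the PR's C# StoreAsync implementation.

```python
import sqlite3
import struct
from typing import List, Optional, Tuple

# (provider, model, dimensions, is_normalized, text_length, text_hash)
Key = Tuple[str, str, int, int, int, str]

SCHEMA = """
CREATE TABLE IF NOT EXISTS embeddings_cache (
    provider      TEXT    NOT NULL,
    model         TEXT    NOT NULL,
    dimensions    INTEGER NOT NULL,
    is_normalized INTEGER NOT NULL,
    text_length   INTEGER NOT NULL,
    text_hash     TEXT    NOT NULL,
    vector        BLOB    NOT NULL,
    PRIMARY KEY (provider, model, dimensions, is_normalized, text_hash)
)
"""


def store(conn: sqlite3.Connection, key: Key, vector: List[float]) -> None:
    # Assumed encoding: the vector is packed as little-endian float32 values.
    blob = struct.pack(f"<{len(vector)}f", *vector)
    conn.execute(
        "INSERT OR REPLACE INTO embeddings_cache VALUES (?, ?, ?, ?, ?, ?, ?)",
        (*key, blob),
    )


def try_get(conn: sqlite3.Connection, key: Key) -> Optional[List[float]]:
    provider, model, dimensions, is_normalized, _text_length, text_hash = key
    row = conn.execute(
        "SELECT vector FROM embeddings_cache WHERE provider = ? AND model = ? "
        "AND dimensions = ? AND is_normalized = ? AND text_hash = ?",
        (provider, model, dimensions, is_normalized, text_hash),
    ).fetchone()
    if row is None:
        return None
    blob = row[0]
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

hit_key: Key = ("ollama", "nomic-embed-text", 3, 0, 11, "placeholder-hash-1")
miss_key: Key = ("ollama", "nomic-embed-text", 3, 0, 11, "placeholder-hash-2")

store(conn, hit_key, [0.5, -1.0, 0.25])
hit = try_get(conn, hit_key)
miss = try_get(conn, miss_key)
```

Note that text_length is stored but is not part of the primary key, so the lookup filters on the five key columns only.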
dluc merged commit 1bf7b42 into microsoft:main Dec 2, 2025
3 checks passed
dluc deleted the embeddings-feature branch December 2, 2025 17:33