feat: add embedding generators and cache #1111

dluc · 2025-12-02T15:44:57Z

Summary

Implements embedding generation infrastructure with caching support:

Core Components

IEmbeddingGenerator: Core interface for all embedding providers
EmbeddingConstants: Known model dimensions (OpenAI, Ollama, HuggingFace models)
CachedEmbeddingGenerator: Decorator pattern for transparent caching
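
A minimal sketch of the decorator idea, written in Python purely for illustration (the PR itself is C#). The names `EmbeddingGenerator`, `FakeGenerator`, and `CachedGenerator` are hypothetical stand-ins, and an in-memory dict replaces the SQLite cache:

```python
from __future__ import annotations

import hashlib
from typing import Protocol


class EmbeddingGenerator(Protocol):
    """Hypothetical stand-in for the PR's IEmbeddingGenerator."""

    def generate(self, text: str) -> list[float]: ...


class FakeGenerator:
    """Toy provider that counts calls; a real one would call an embedding API."""

    def __init__(self) -> None:
        self.calls = 0

    def generate(self, text: str) -> list[float]:
        self.calls += 1
        return [float(len(text)), 1.0]


class CachedGenerator:
    """Decorator: exposes the same method as the wrapped generator."""

    def __init__(self, inner: EmbeddingGenerator) -> None:
        self._inner = inner
        # Assumption: a dict keyed by the SHA256 hex digest stands in for the cache.
        self._cache: dict[str, list[float]] = {}

    def generate(self, text: str) -> list[float]:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in self._cache:
            return self._cache[key]  # cache hit: the inner generator is not called
        vector = self._inner.generate(text)
        self._cache[key] = vector
        return vector
```

Because the wrapper has the same shape as the generator it wraps, callers do not need to know whether caching is enabled.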

Cache System

IEmbeddingCache: Cache interface with SQLite implementation
EmbeddingCacheKey: SHA256 hashing (privacy: input text never stored)
CachedEmbedding: Model with provider/model metadata and timestamps
CacheModes: ReadWrite, ReadOnly, WriteOnly options
Composite primary key: provider, model, dimensions, normalized, text_hash for easier debugging and selective queries

Embedding Providers

OllamaEmbeddingGenerator: Local models via Ollama API
OpenAIEmbeddingGenerator: OpenAI API (text-embedding-3-small/large, ada-002)
AzureOpenAIEmbeddingGenerator: Azure OpenAI deployments
HuggingFaceEmbeddingGenerator: HuggingFace Inference API

Security

Input text is never stored in the cache, only its SHA256 hash
Cache key includes: hash + provider + model + dimensions + normalization flag
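
As a rough illustration of how such a key could be built (Python sketch, not the PR's C# `EmbeddingCacheKey`): the `CacheKey` class, `create_key` function, UTF-8 encoding, and hex digest format are all assumptions. Treating missing text as an empty string mirrors a later commit in this PR.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CacheKey:
    """Hypothetical key: every field that can change the resulting vector."""

    provider: str
    model: str
    dimensions: int
    normalized: bool
    text_hash: str  # SHA256 hex digest; the text itself is not kept


def create_key(
    provider: str,
    model: str,
    dimensions: int,
    normalized: bool,
    text: Optional[str],
) -> CacheKey:
    # None is treated as an empty string before hashing.
    digest = hashlib.sha256((text or "").encode("utf-8")).hexdigest()
    return CacheKey(provider, model, dimensions, normalized, digest)
```

Two requests share a cache entry only when all five fields match, so the same text embedded by a different provider or model gets its own entry.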

Test Plan

133 new tests covering all components
Test coverage: 86.33% (above 80% threshold)
All 770 tests pass (0 skipped)
format.sh passes
build.sh passes with 0 warnings, 0 errors
coverage.sh passes

Implement embedding generation infrastructure with caching support:

Core Components:
- IEmbeddingGenerator interface for all providers
- EmbeddingConstants with known model dimensions
- CachedEmbeddingGenerator decorator for transparent caching

Cache System:
- IEmbeddingCache interface with SQLite implementation
- EmbeddingCacheKey with SHA256 hashing (privacy: text never stored)
- CachedEmbedding model with provider/model metadata
- CacheModes enum: ReadWrite, ReadOnly, WriteOnly

Embedding Providers:
- OllamaEmbeddingGenerator for local models
- OpenAIEmbeddingGenerator for OpenAI API
- AzureOpenAIEmbeddingGenerator for Azure OpenAI
- HuggingFaceEmbeddingGenerator for HuggingFace Inference API

Configuration:
- HuggingFaceEmbeddingsConfig with JSON discriminator
- Updated EmbeddingsTypes enum with HuggingFace value

Test coverage: 86.33% (133 new tests)

Changed from single TEXT key to composite primary key with individual columns (provider, model, dimensions, normalized, text_hash) for:
- Easier debugging when inspecting SQLite database contents
- Selective SELECT/DELETE with individual discriminators
- Better query flexibility for cache management

Added indexes on provider and (provider, model) for common query patterns.
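
To show what selective queries on individual columns could look like, here is a small Python `sqlite3` sketch. The table name `embeddings_cache`, the `vector` column, the column types, and the sample rows are assumptions made for the example; only the key column names come from the commit message.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE embeddings_cache (
        provider   TEXT    NOT NULL,
        model      TEXT    NOT NULL,
        dimensions INTEGER NOT NULL,
        normalized INTEGER NOT NULL,
        text_hash  TEXT    NOT NULL,
        vector     BLOB    NOT NULL,
        PRIMARY KEY (provider, model, dimensions, normalized, text_hash)
    )
    """
)

# Placeholder rows: the same text hash cached under three provider/model pairs.
rows = [
    ("openai", "text-embedding-3-small", 1536, 1, "hash-a", b"\x00"),
    ("openai", "text-embedding-ada-002", 1536, 1, "hash-a", b"\x00"),
    ("ollama", "some-local-model", 768, 0, "hash-a", b"\x00"),
]
conn.executemany("INSERT INTO embeddings_cache VALUES (?, ?, ?, ?, ?, ?)", rows)

# Inspect the cache per provider, which a single opaque TEXT key would not allow.
per_provider = dict(
    conn.execute("SELECT provider, COUNT(*) FROM embeddings_cache GROUP BY provider")
)

# Drop every entry for one model without touching the rest.
deleted = conn.execute(
    "DELETE FROM embeddings_cache WHERE provider = ? AND model = ?",
    ("openai", "text-embedding-ada-002"),
).rowcount
remaining = conn.execute("SELECT COUNT(*) FROM embeddings_cache").fetchone()[0]
```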

src/Core/Embeddings/Cache/EmbeddingCacheKey.cs

src/Core/Embeddings/Cache/SqliteEmbeddingCache.cs

Copilot

Pull request overview

This PR implements a comprehensive embedding generation and caching infrastructure for the KernelMemory project, adding support for multiple embedding providers with transparent caching capabilities.

Key Changes:

Adds four embedding provider implementations (OpenAI, Azure OpenAI, Ollama, HuggingFace)
Implements SQLite-based caching with SHA256 hashing for privacy (input text never stored)
Provides decorator pattern for transparent caching with configurable modes (ReadWrite, ReadOnly, WriteOnly)
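
One plausible reading of the three modes, sketched in Python for illustration. The semantics here are an assumption (ReadOnly looks up but never stores, WriteOnly stores but never looks up, ReadWrite does both), and `CacheMode` and `embed_with_cache` are hypothetical names, not the PR's API:

```python
from enum import Enum


class CacheMode(Enum):
    READ_WRITE = "ReadWrite"
    READ_ONLY = "ReadOnly"
    WRITE_ONLY = "WriteOnly"


def embed_with_cache(key, text, generate, cache, mode):
    """Return a vector for text, consulting and updating cache as mode allows.

    key is the precomputed cache key, generate is a callable taking the text,
    and cache is any dict-like store (a stand-in for the SQLite cache).
    """
    can_read = mode in (CacheMode.READ_WRITE, CacheMode.READ_ONLY)
    can_write = mode in (CacheMode.READ_WRITE, CacheMode.WRITE_ONLY)

    if can_read and key in cache:
        return cache[key]

    vector = generate(text)
    if can_write:
        cache[key] = vector
    return vector
```

Under this reading, ReadOnly suits a shared, pre-populated cache that should not grow, while WriteOnly can warm a cache while always taking fresh results from the provider.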

Reviewed changes

Copilot reviewed 29 out of 29 changed files in this pull request and generated 26 comments.


| File | Description |
| --- | --- |
| `tests/Core.Tests/GlobalUsings.cs` | Adds CA2000 and CA1861 suppressions for test code |
| `tests/Core.Tests/Embeddings/Providers/*.cs` | Comprehensive unit tests for all four embedding providers |
| `tests/Core.Tests/Embeddings/EmbeddingConstantsTests.cs` | Tests for known model dimensions and defaults |
| `tests/Core.Tests/Embeddings/EmbeddingCacheKeyTests.cs` | Tests for SHA256 hashing and cache key generation |
| `tests/Core.Tests/Embeddings/CachedEmbeddingTests.cs` | Tests for cached embedding model |
| `tests/Core.Tests/Embeddings/CachedEmbeddingGeneratorTests.cs` | Tests for caching decorator with mixed cache hits/misses |
| `tests/Core.Tests/Embeddings/Cache/SqliteEmbeddingCacheTests.cs` | Integration tests for SQLite cache with all modes |
| `tests/Core.Tests/Config/*.cs` | Tests for HuggingFace config and cache modes enum |
| `src/Core/Embeddings/Providers/*.cs` | Implementation of OpenAI, Ollama, HuggingFace, and Azure OpenAI generators |
| `src/Core/Embeddings/IEmbeddingGenerator.cs` | Core interface for all embedding providers |
| `src/Core/Embeddings/EmbeddingConstants.cs` | Known model dimensions and default configurations |
| `src/Core/Embeddings/CachedEmbeddingGenerator.cs` | Decorator for transparent caching with mode support |
| `src/Core/Embeddings/Cache/*.cs` | Cache interface, SQLite implementation, cache key, and cached embedding models |
| `src/Core/Config/Enums/*.cs` | CacheModes enum and HuggingFace addition to EmbeddingsTypes |
| `src/Core/Config/Embeddings/*.cs` | HuggingFace configuration and type discriminator updates |


tests/Core.Tests/Embeddings/Cache/SqliteEmbeddingCacheTests.cs

tests/Core.Tests/Embeddings/Providers/AzureOpenAIEmbeddingGeneratorTests.cs

tests/Core.Tests/Embeddings/Providers/OpenAIEmbeddingGeneratorTests.cs

Changes:
- Allow null text in EmbeddingCacheKey.Create(), normalize to empty string
- Remove unnecessary indexes (idx_timestamp, idx_provider, idx_model)
- Rename 'normalized' column to 'is_normalized' for clarity
- Rename 'timestamp' column to 'created_at' for clarity
- Simplify timestamp documentation (debugging only, no invalidation)

Embeddings are treated as immutable: for a given provider, model, dimensions, and normalization setting, the same text always produces the same vector. No need for cache invalidation or timestamps. Simplifies the schema.

CachedEmbedding now only stores the vector:
- Removed TokenCount (token counting should be done before embedding)
- Removed timestamp (embeddings are immutable, no invalidation needed)
- StoreAsync signature simplified: (key, vector, ct)

Final schema: provider, model, dimensions, is_normalized, text_length, text_hash, vector
PRIMARY KEY (provider, model, dimensions, is_normalized, text_hash)
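
The final schema can be sketched with Python's `sqlite3` module as below. The column names and primary key come from the commit message. The table name, the column types, the little-endian float32 blob encoding, and the `store` / `try_get` helpers are assumptions made for the example.

```python
import sqlite3
import struct

SCHEMA = """
CREATE TABLE IF NOT EXISTS embeddings_cache (
    provider      TEXT    NOT NULL,
    model         TEXT    NOT NULL,
    dimensions    INTEGER NOT NULL,
    is_normalized INTEGER NOT NULL,
    text_length   INTEGER NOT NULL,
    text_hash     TEXT    NOT NULL,
    vector        BLOB    NOT NULL,
    PRIMARY KEY (provider, model, dimensions, is_normalized, text_hash)
)
"""


def pack_vector(vector):
    # Assumed encoding: consecutive little-endian 32-bit floats.
    return struct.pack(f"<{len(vector)}f", *vector)


def unpack_vector(blob):
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


def store(conn, key, vector):
    # key = (provider, model, dimensions, is_normalized, text_length, text_hash)
    conn.execute(
        "INSERT OR REPLACE INTO embeddings_cache VALUES (?, ?, ?, ?, ?, ?, ?)",
        (*key, pack_vector(vector)),
    )


def try_get(conn, key):
    provider, model, dimensions, is_normalized, _text_length, text_hash = key
    row = conn.execute(
        "SELECT vector FROM embeddings_cache "
        "WHERE provider = ? AND model = ? AND dimensions = ? "
        "AND is_normalized = ? AND text_hash = ?",
        (provider, model, dimensions, is_normalized, text_hash),
    ).fetchone()
    return unpack_vector(row[0]) if row else None
```

Note that `text_length` is stored but is not part of the primary key, so the lookup matches on the five key columns only.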

dluc added 2 commits December 2, 2025 16:44