1 change: 1 addition & 0 deletions package.json
@@ -188,6 +188,7 @@
"twemoji": "14.0.2",
"uslug": "1.0.4",
"uuid": "9.0.0",
"weaviate-client": "^3.10.0",
"validate.js": "0.13.1",
"winston": "3.8.2",
"xss": "1.0.14",
176 changes: 176 additions & 0 deletions server/modules/search/weaviate/README.md
@@ -0,0 +1,176 @@
# Weaviate Search Module

Semantic search engine for Wiki.js, backed by the Weaviate vector database.

## Features

- **Hybrid Search**: Combined BM25 keyword + semantic vector search
- **Multiple Search Modes**: hybrid, bm25, nearText
- **Incremental Rebuild**: Only reindex changed pages (hash-based detection)
- **Orphan Cleanup**: Automatically removes deleted pages from index
- **Rate Limit Handling**: Exponential backoff with configurable batch delays
- **Result Caching**: Configurable TTL with cluster-wide invalidation
- **Result Highlighting**: Query terms highlighted with `<mark>` tags
- **Search Analytics**: Track top searches and zero-result queries
- **Health Monitoring**: Periodic health checks with metrics

## Requirements

- Wiki.js 2.5+
- Weaviate 1.32+
- Node.js 18+
- A Weaviate class (collection) created in advance with a vectorizer configured (see Weaviate Schema)

## Installation

Copy the module to your Wiki.js installation:

```bash
cp -r server/modules/search/weaviate/ /path/to/wikijs/server/modules/search/
```

Restart Wiki.js. The module then appears under Administration > Search.

## Configuration

### Connection

| Setting | Description | Default |
|---------|-------------|---------|
| host | Weaviate hostname (without protocol) | localhost |
| httpPort | HTTP/REST API port | 18080 |
| httpSecure | Use HTTPS | true |
| grpcPort | gRPC port | 50051 |
| grpcSecure | Use TLS for gRPC | true |
| skipTLSVerify | Skip TLS certificate verification (see warning below) | false |
| apiKey | Authentication key | - |
| timeout | Connection timeout (ms) | 10000 |

> **Warning**: `skipTLSVerify` sets `NODE_TLS_REJECT_UNAUTHORIZED=0` globally because of weaviate-client limitations. This disables certificate verification for every HTTPS connection made by the Wiki.js process, not only connections to Weaviate.
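
As a rough illustration of how the connection settings fit together, the sketch below derives the REST base URL and the gRPC address from a settings object. `resolveEndpoints` is a hypothetical helper written for this README, not part of the module or of weaviate-client; it only shows why `host` must be given without a protocol.

```javascript
// Hypothetical helper (not part of the module): derives endpoints from the
// connection settings documented in the table above.
function resolveEndpoints(config) {
  // The host setting is documented as "without protocol", so reject values
  // such as "https://weaviate.example.com".
  if (/^[a-z][a-z0-9+.-]*:\/\//i.test(config.host)) {
    throw new Error('host must not include a protocol');
  }
  const scheme = config.httpSecure ? 'https' : 'http';
  return {
    restUrl: `${scheme}://${config.host}:${config.httpPort}`,
    grpcAddress: `${config.host}:${config.grpcPort}`,
    grpcTls: Boolean(config.grpcSecure)
  };
}

// Example with the documented defaults.
const endpoints = resolveEndpoints({
  host: 'localhost',
  httpPort: 18080,
  httpSecure: true,
  grpcPort: 50051,
  grpcSecure: true
});
console.log(endpoints.restUrl, endpoints.grpcAddress);
```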

### Schema

| Setting | Description | Default |
|---------|-------------|---------|
| className | Weaviate collection name | WikiPage |

### Search

| Setting | Description | Default |
|---------|-------------|---------|
| searchType | hybrid / bm25 / nearText | hybrid |
| alpha | Hybrid balance: 0=keyword, 1=semantic | 0.5 |
| searchLimit | Max results per query | 50 |
| cacheTtl | Cache duration (seconds) | 300 |
| boostTitle | Title field boost | 3 |
| boostDescription | Description field boost | 2 |
| boostTags | Tags field boost | 1 |
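
To make the `cacheTtl` setting concrete, here is a minimal sketch of a time-based result cache. The `TtlCache` class and its injectable clock are assumptions made for illustration; the module's actual cache, including its cluster-wide invalidation, may be implemented differently.

```javascript
// Illustrative TTL cache (hypothetical, not the module's implementation).
class TtlCache {
  // ttlSeconds corresponds to the cacheTtl setting; `now` is injectable for tests.
  constructor(ttlSeconds, now = Date.now) {
    this.ttlMs = ttlSeconds * 1000;
    this.now = now;
    this.entries = new Map();
  }

  set(key, value) {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }

  get(key) {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      // Expired entries are dropped lazily on read.
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }

  // Called when the index changes so stale results are not served.
  clear() {
    this.entries.clear();
  }
}
```

Keying the cache on the query string plus any locale or path filters would keep results for different filters from colliding.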

### Indexing

| Setting | Description | Default |
|---------|-------------|---------|
| batchSize | Documents per batch | 100 |
| batchDelayMs | Delay between batches (ms) | 1000 |
| maxBatchBytes | Max batch size (bytes) | 10485760 (10 MB) |
| forceFullRebuild | Delete all before rebuild | false |
| debugSql | Log sync table SQL queries | false |
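
The interaction between `batchSize` and `maxBatchBytes` can be sketched as follows: a batch is closed as soon as either limit would be exceeded. `planBatches` is a hypothetical helper, and measuring size as the UTF-8 length of the JSON-serialized document is an assumption; the module may measure payload size differently.

```javascript
// Hypothetical sketch: split documents into batches bounded by count and bytes.
function* planBatches(docs, batchSize, maxBatchBytes) {
  let batch = [];
  let bytes = 0;
  for (const doc of docs) {
    // Assumption: size is approximated by the serialized JSON length in bytes.
    const size = Buffer.byteLength(JSON.stringify(doc), 'utf8');
    const full = batch.length >= batchSize || bytes + size > maxBatchBytes;
    if (batch.length > 0 && full) {
      yield batch;
      batch = [];
      bytes = 0;
    }
    // A single oversized document still forms its own batch.
    batch.push(doc);
    bytes += size;
  }
  if (batch.length > 0) yield batch;
}

// Example: five small documents with batchSize 2 and the default byte limit.
const docs = [0, 1, 2, 3, 4].map((id) => ({ id }));
for (const batch of planBatches(docs, 2, 10485760)) {
  console.log(batch.length);
}
```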

## Weaviate Schema

Create this class in Weaviate **before** enabling the module:

```json
{
  "class": "WikiPage",
  "vectorizer": "text2vec-transformers",
  "properties": [
    { "name": "pageId", "dataType": ["int"], "indexFilterable": true },
    { "name": "title", "dataType": ["text"], "indexSearchable": true },
    { "name": "description", "dataType": ["text"], "indexSearchable": true },
    { "name": "content", "dataType": ["text"], "indexSearchable": true },
    { "name": "path", "dataType": ["text"], "indexFilterable": true },
    { "name": "locale", "dataType": ["text"], "indexFilterable": true },
    { "name": "tags", "dataType": ["text[]"], "indexSearchable": true }
  ]
}
```
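
One way to create the class is to send the definition to Weaviate's REST schema endpoint (`POST /v1/schema`). The sketch below only builds the request; `buildCreateClassRequest` is a hypothetical helper, and the base URL and environment variable name in the usage comment are assumptions to adapt to your deployment.

```javascript
// Hypothetical helper: builds a request that creates a class via POST /v1/schema.
function buildCreateClassRequest(baseUrl, classDefinition, apiKey) {
  const headers = { 'Content-Type': 'application/json' };
  // API-key authentication is sent as a bearer token.
  if (apiKey) headers.Authorization = `Bearer ${apiKey}`;
  return {
    url: `${baseUrl.replace(/\/+$/, '')}/v1/schema`,
    init: { method: 'POST', headers, body: JSON.stringify(classDefinition) }
  };
}

// Usage (needs a reachable Weaviate instance; Node.js 18+ provides global fetch):
// const { url, init } = buildCreateClassRequest(
//   'https://localhost:18080', classDefinition, process.env.WEAVIATE_API_KEY);
// const res = await fetch(url, init);
// if (!res.ok) throw new Error(`Schema creation failed: ${res.status}`);
```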

## Search Types

### Hybrid (recommended)

Combined keyword + semantic search. Adjust `alpha`:

| Alpha | Behavior |
|-------|----------|
| 0.0 | Pure BM25 keyword |
| 0.5 | Balanced (default) |
| 1.0 | Pure semantic |
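
The effect of `alpha` can be pictured as a weighted blend of a keyword score and a vector score. This is a simplified sketch that assumes both scores are already normalized to the same range; Weaviate's actual fusion algorithms normalize or rank results before combining them, so real scores will differ. `blendScores` is a hypothetical name.

```javascript
// Simplified illustration of how alpha weights keyword vs. semantic relevance.
// Assumes both scores are already normalized to [0, 1].
function blendScores(alpha, keywordScore, vectorScore) {
  if (alpha < 0 || alpha > 1) {
    throw new RangeError('alpha must be between 0 and 1');
  }
  return (1 - alpha) * keywordScore + alpha * vectorScore;
}

// A page that matches the keywords well but is semantically distant:
for (const alpha of [0, 0.5, 1]) {
  console.log(alpha, blendScores(alpha, 0.8, 0.2));
}
```

Lower values of `alpha` favor exact term matches such as identifiers or product names, while higher values favor pages that are conceptually related even when the wording differs.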

### BM25

Pure keyword search with configurable field boosting.

### NearText

Pure semantic search based on embedding similarity.

## Rebuild Modes

### Incremental (default)

- Compares content hash to detect changes
- Only reindexes modified pages
- Tracks sync status in `weaviate_sync_status` table
- Removes orphaned entries (pages that no longer exist in the wiki) from the index after the rebuild
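
The incremental steps above can be sketched as a comparison between current page hashes and previously stored ones. The SHA-256 hash, the hashed fields, and the `planIncrementalSync` helper are assumptions for illustration; the module's real hash inputs and the layout of `weaviate_sync_status` are not shown in this README.

```javascript
const crypto = require('node:crypto');

// Assumption: the hash covers title, description, and content.
function contentHash(page) {
  return crypto
    .createHash('sha256')
    .update([page.title, page.description, page.content].join('\u0000'))
    .digest('hex');
}

// pages: current wiki pages; syncedHashes: Map of pageId -> hash stored at last sync.
function planIncrementalSync(pages, syncedHashes) {
  const toIndex = [];
  const seen = new Set();
  for (const page of pages) {
    seen.add(page.id);
    // New pages have no stored hash; changed pages have a different one.
    if (syncedHashes.get(page.id) !== contentHash(page)) toIndex.push(page.id);
  }
  // Orphans are tracked in the index but no longer exist in the wiki.
  const orphans = [...syncedHashes.keys()].filter((id) => !seen.has(id));
  return { toIndex, orphans };
}
```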

### Full (`forceFullRebuild: true`)

- Deletes all documents from Weaviate
- Reindexes all pages from scratch
- Use when the schema changes or the index is corrupted

Both modes use streaming to avoid memory issues with large wikis.

## API Reference

| Method | Description |
|--------|-------------|
| `query(q, opts)` | Search for pages |
| `created(page)` | Index a new page |
| `updated(page)` | Update indexed page |
| `deleted(pageId)` | Remove from index |
| `renamed(page)` | Handle path change |
| `rebuild()` | Rebuild index |
| `isHealthy()` | Health check |
| `getStats()` | Index statistics |
| `getMetrics()` | Module metrics |
| `getSearchAnalytics(opts)` | Search analytics |
| `suggest(prefix, opts)` | Auto-complete |

## Troubleshooting

### Class not found

The module does not create the class. Create it manually, with your vectorizer configured, before enabling the module.

### Connection timeout

1. Verify both HTTP and gRPC ports are accessible
2. Check TLS certificates
3. Check firewall rules

### Empty results after rebuild

1. Check logs for indexing errors
2. Verify vectorizer is configured
3. Try full rebuild (`forceFullRebuild: true`)

### Rate limiting during rebuild

Increase `batchDelayMs` (e.g., 2000ms) or decrease `batchSize`.
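
For reference, exponential backoff doubles the wait after each failed attempt up to a cap. The sketch below is a generic illustration, not the module's code: `backoffDelayMs`, `withRetry`, the 30-second cap, and the retry count are all assumed values.

```javascript
// Generic exponential backoff: baseMs, 2x baseMs, 4x baseMs, ... capped at maxMs.
function backoffDelayMs(attempt, baseMs = 1000, maxMs = 30000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Retries an async operation, sleeping between attempts.
// `sleep` is injectable so the function can be tested without real delays.
async function withRetry(operation, options = {}) {
  const {
    retries = 5,
    baseMs = 1000,
    sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))
  } = options;
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (err) {
      // Give up once the retry budget is spent.
      if (attempt >= retries) throw err;
      await sleep(backoffDelayMs(attempt, baseMs));
    }
  }
}
```

With a 1000 ms base, the waits grow as 1 s, 2 s, 4 s, 8 s, so a larger `batchDelayMs` reduces how often this retry path is hit at all.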

## License

AGPL-3.0 (same as Wiki.js)
147 changes: 147 additions & 0 deletions server/modules/search/weaviate/definition.yml
@@ -0,0 +1,147 @@
key: weaviate
title: Weaviate
description: Weaviate vector search engine with semantic search capabilities.
author: YM
logo: https://weaviate.io/img/site/weaviate-logo-light.png
website: https://weaviate.io
isAvailable: true
props:
  # === Connection ===
  host:
    type: String
    title: Host
    hint: 'Weaviate hostname without protocol (e.g., weaviate.example.com)'
    default: 'localhost'
    order: 1
  httpPort:
    type: Number
    title: HTTP Port
    hint: 'Weaviate HTTP port'
    default: 18080
    order: 2
  httpSecure:
    type: Boolean
    title: HTTP Secure (HTTPS)
    hint: 'Use HTTPS for HTTP connections'
    default: true
    order: 3
  grpcPort:
    type: Number
    title: gRPC Port
    hint: 'Weaviate gRPC port'
    default: 50051
    order: 4
  grpcSecure:
    type: Boolean
    title: gRPC Secure (TLS)
    hint: 'Use TLS for gRPC connections'
    default: true
    order: 5
  skipTLSVerify:
    type: Boolean
    title: Skip TLS Verification
    hint: 'Skip TLS certificate verification (for self-signed certs)'
    default: false
    order: 6
  apiKey:
    type: String
    title: API Key
    hint: 'API key for authentication'
    sensitive: true
    order: 7
  timeout:
    type: Number
    title: Timeout (ms)
    hint: 'Connection timeout in milliseconds (default: 10000)'
    default: 10000
    order: 8

  # === Schema ===
  className:
    type: String
    title: Class Name
    hint: 'Weaviate class name (must exist with vectorizer pre-configured)'
    default: 'WikiPage'
    order: 9

  # === Search Settings ===
  searchType:
    type: String
    title: Search Type
    hint: 'Type of search to perform'
    default: 'hybrid'
    enum:
      - 'nearText'
      - 'bm25'
      - 'hybrid'
    order: 10
  alpha:
    type: Number
    title: Hybrid Alpha
    hint: '0 = keyword only, 1 = semantic only'
    default: 0.5
    order: 11
  searchLimit:
    type: Number
    title: Search Results Limit
    hint: 'Maximum number of search results to return (default: 50)'
    default: 50
    order: 12
  cacheTtl:
    type: Number
    title: Cache TTL (seconds)
    hint: 'Time to live for search cache in seconds (default: 300 = 5 minutes)'
    default: 300
    order: 13

  # === Boost Settings ===
  boostTitle:
    type: Number
    title: Title Boost
    hint: 'Boost factor for title field (default: 3)'
    default: 3
    order: 14
  boostDescription:
    type: Number
    title: Description Boost
    hint: 'Boost factor for description field (default: 2)'
    default: 2
    order: 15
  boostTags:
    type: Number
    title: Tags Boost
    hint: 'Boost factor for tags field (default: 1)'
    default: 1
    order: 16

  # === Indexing Settings ===
  batchSize:
    type: Number
    title: Batch Size
    hint: 'Number of documents per batch during indexing (default: 100)'
    default: 100
    order: 17
  batchDelayMs:
    type: Number
    title: Batch Delay (ms)
    hint: 'Delay between batches to avoid rate limiting (default: 1000)'
    default: 1000
    order: 18
  maxBatchBytes:
    type: Number
    title: Max Batch Size (bytes)
    hint: 'Maximum batch size in bytes (default: 10485760 = 10MB)'
    default: 10485760
    order: 19
  forceFullRebuild:
    type: Boolean
    title: Force Full Rebuild
    hint: 'If enabled, rebuild will delete all and reindex. Otherwise, incremental rebuild (only changed pages).'
    default: false
    order: 20
  debugSql:
    type: Boolean
    title: Debug SQL Queries
    hint: 'Log all SQL queries to the sync status table (for debugging)'
    default: false
    order: 21