feat(ci): add Pinecone docs ingestion to deploy workflow #1620
base: main
Conversation
Walkthrough

Adds a new TypeScript ingestion script that uploads local docs into a Pinecone assistant with configurable behavior (file filtering, MDX conversion, retries, optional deletion), and updates the CI workflow to install dependencies and run that script during deployment using environment variables.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
  participant GHA as GitHub Actions
  participant Installer as Install deps (fast-glob, tsx, pinecone)
  participant Ingest as ingest-docs-pinecone.ts
  participant Pinecone as Pinecone Assistant API
  Note over GHA,Installer: CI run for docs deploy
  GHA->>Installer: install dependencies
  Installer-->>GHA: deps installed
  GHA->>Ingest: run TSX script with env vars
  Note over Ingest,Pinecone: Ingestion flow
  Ingest->>Ingest: scan DOCS_DIR (fast-glob)
  Ingest->>Ingest: filter/size-check files
  alt CONVERT_MDX = true
    Ingest->>Ingest: convert .mdx -> temp .md
  end
  alt DELETE_ALL = true
    Ingest->>Pinecone: list & delete remote records matching source_path
    Pinecone-->>Ingest: deletion ack
  end
  loop for each file
    Ingest->>Pinecone: upsert file (with metadata), retry on transient errors
    Pinecone-->>Ingest: upsert ack / error
  end
  Ingest-->>GHA: logs summary (uploaded vs skipped)
```
Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Potential attention areas:
Actionable comments posted: 4
🧹 Nitpick comments (4)
.github/workflows/deploy-docs.yaml (1)
32-40: Consider adding error handling for the ingestion step.

If the Pinecone ingestion fails, the workflow continues to deploy. This could result in deployed docs being out of sync with the Pinecone index.

Consider adding continue-on-error: false (which is the default) and potentially adding a retry mechanism or failure notification.

scripts/ingest-docs-pinecone.ts (3)
1-22: Consider standardizing comment language.

The comments mix Portuguese ("obrigatórios", "por tamanho") and English. For consistency and broader accessibility, consider using English throughout.
50-53: Validate numeric environment variables.

Using Number() to parse environment variables will silently return NaN for invalid input, which could cause unexpected behavior or runtime errors. For example, Number("abc") returns NaN, which would then be used in chunking calculations.

Consider adding validation:

```typescript
function parseNumber(value: string | undefined, defaultValue: number, name: string): number {
  if (!value) return defaultValue;
  const parsed = Number(value);
  if (isNaN(parsed)) {
    throw new Error(`Invalid number for ${name}: ${value}`);
  }
  return parsed;
}

const MAX_FILE_MB = parseNumber(process.env.MAX_FILE_MB, 2, "MAX_FILE_MB");
const CHUNK_SIZE = parseNumber(process.env.CHUNK_SIZE, 2800, "CHUNK_SIZE");
const CHUNK_OVERLAP = parseNumber(process.env.CHUNK_OVERLAP, 300, "CHUNK_OVERLAP");
const BATCH_SIZE = parseNumber(process.env.BATCH_SIZE, 500, "BATCH_SIZE");
```
118-122: Consider adding error handling for batch upserts.

The batch upsert loop doesn't handle potential failures from individual upsert operations. If an upsert fails, the entire script will fail without indicating which batch caused the issue.

Consider wrapping the upsert in a try-catch to log which batch failed:

```typescript
for (let i = 0; i < records.length; i += BATCH_SIZE) {
  const batch = records.slice(i, i + BATCH_SIZE);
  try {
    await index.namespace(NAMESPACE).upsertRecords(batch);
    console.log(`Upserted ${Math.min(i + BATCH_SIZE, records.length)}/${records.length}`);
  } catch (error) {
    console.error(`Failed to upsert batch ${i}-${i + BATCH_SIZE}:`, error);
    throw error;
  }
}
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- .github/workflows/deploy-docs.yaml (1 hunks)
- scripts/ingest-docs-pinecone.ts (1 hunks)
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{ts,tsx}
📄 CodeRabbit inference engine (.cursor/rules/data-flow.mdc)
**/*.{ts,tsx}: MCP tools must use Zod schemas for input and, when applicable, output validation
Register tools with server.registerTool providing description, inputSchema.shape, and optional outputSchema.shape
In every MCP tool handler, perform authorization checks first and then call context.resourceAccess.grant() before business logic
Name tools using the {RESOURCE}_{ACTION} pattern (e.g., AGENTS_CREATE, THREADS_LIST)
Group related tools into typed collections (e.g., GLOBAL_TOOLS, WORKSPACE_TOOLS) and export them as const
Always check workspace/team access before executing operations that touch workspace resources
Return proper authorization errors and handle forbidden operations explicitly
Prefer specific TypeScript types over any
**/*.{ts,tsx}: Write concise, maintainable, and technically accurate TypeScript code
Use functional and declarative programming patterns; avoid classes
Favor iteration and modularization to follow DRY and avoid code duplication
Use descriptive variable names with auxiliary verbs (e.g., isLoading, hasError)
Organize files so each contains only related content (components, subcomponents, helpers, static content, types)
Prefer interfaces over types for object shapes
Avoid enums; use maps (objects/records) instead
Use the function keyword for pure functions to benefit from hoisting and clarity
Ensure dependency arrays use stable references; do not inline new objects/arrays/functions
Memoize options/objects used in hooks (e.g., useMemo for options passed to useMemo/useCallback)
Favor named exports for functions
Files:
scripts/ingest-docs-pinecone.ts
**/*.ts
📄 CodeRabbit inference engine (.cursor/rules/data-flow.mdc)
Log authorization failures to aid debugging and auditing
Files:
scripts/ingest-docs-pinecone.ts
scripts/**
📄 CodeRabbit inference engine (AGENTS.md)
Keep infrastructure helper scripts under scripts/
Files:
scripts/ingest-docs-pinecone.ts
**/*.{ts,tsx,js,jsx}
📄 CodeRabbit inference engine (AGENTS.md)
**/*.{ts,tsx,js,jsx}: Biome formatting: use two-space indentation and double quotes
Keep imports sorted
Name hooks and utility functions using camelCase
Files:
scripts/ingest-docs-pinecone.ts
{apps/api,scripts}/**/*.{ts,tsx}
📄 CodeRabbit inference engine (AGENTS.md)
Use explicit file extensions in imports where Deno requires them
Files:
scripts/ingest-docs-pinecone.ts
🔇 Additional comments (8)
.github/workflows/deploy-docs.yaml (2)
41-46: LGTM! The deploy step is correctly configured with the deployment token from secrets.
29-30: The original review comment's concern is not valid.

The two install steps are intentionally in different contexts:

- Lines 24-27 install docs app dependencies to docs/node_modules
- Lines 29-30 install root-level script dependencies to the root node_modules

The script scripts/ingest-docs-pinecone.ts is located at the repository root and requires its dependencies installed there. This is correct monorepo behavior with no risk of version conflicts; each context has its own isolated node_modules.

Minor note: the step name "Install rag-docs script dependencies" is misleading, since the script is actually ingest-docs-pinecone.ts.

Likely an incorrect or invalid review comment.
scripts/ingest-docs-pinecone.ts (6)
58-61: LGTM! The extension filtering logic is correct and straightforward.
78-81: LGTM! The text splitter is correctly initialized with the configured chunk size and overlap.
87-90: Verify deleteAll behavior.

The script calls deleteAll() on a namespace, which will remove all existing documents. Ensure this aligns with your intended behavior, especially since it runs on every deployment (per the workflow configuration).

Consider adding a confirmation log or safety check before deletion, especially if this runs automatically.
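A minimal sketch of such a guard, assuming the script's existing index and NAMESPACE bindings (the CONFIRM_DELETE flag is hypothetical, not part of this PR):

```typescript
// Hypothetical safety check: refuse to wipe the namespace unless the
// operator opted in explicitly via a second environment variable.
const isDeleteAllRequested = process.env.DELETE_ALL === "true";
const isDeleteConfirmed = process.env.CONFIRM_DELETE === "yes";

if (isDeleteAllRequested) {
  if (!isDeleteConfirmed) {
    throw new Error("DELETE_ALL=true requires CONFIRM_DELETE=yes; refusing to wipe namespace");
  }
  console.warn(`Deleting ALL records in namespace "${NAMESPACE}"...`);
  await index.namespace(NAMESPACE).deleteAll();
}
```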
92-96: LGTM! The file discovery and filtering logic is correct. Using fast-glob with the normalized pattern and then filtering by extension is a clean approach.
100-114: LGTM! The file processing logic is well-structured:

- Skips empty/large files gracefully
- Normalizes paths for consistency
- Chunks documents with metadata
- Creates unique IDs using path and chunk index (a sketch follows below)
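For reference, the ID scheme described in the last bullet plausibly reduces to something like this (a hypothetical reconstruction, not the script's actual code):

```typescript
// Hypothetical sketch: one stable ID per (source path, chunk index) pair,
// so re-running ingestion upserts over the same records instead of duplicating.
function makeChunkId(normalizedPath: string, chunkIndex: number): string {
  return `${normalizedPath}#${chunkIndex}`;
}

// e.g. makeChunkId("docs/view/src/content/pt-br/intro.mdx", 3)
//   -> "docs/view/src/content/pt-br/intro.mdx#3"
```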
127-130: LGTM! The top-level error handling correctly logs errors and exits with a non-zero code, which will properly fail the CI workflow if something goes wrong.
```yaml
PINECONE_INDEX_NAME: ${{ secrets.PINECONE_INDEX_NAME }}
NAMESPACE: docs
DOCS_DIR: docs/view/src/content/pt-br
DELETE_ALL: "true"
```
Reconsider DELETE_ALL default in production.

Setting DELETE_ALL: "true" means every deployment wipes all existing documents from Pinecone before re-ingesting. This could cause temporary data unavailability and potential data loss if the ingestion step fails partway through.

Consider either:

- Setting DELETE_ALL: "false" and relying on upserts to update existing chunks
- Moving deletion to a separate manual workflow
- Adding a conditional based on branch or tag (a sketch follows below)
🤖 Prompt for AI Agents
.github/workflows/deploy-docs.yaml around line 39: the workflow currently sets
DELETE_ALL: "true" which force-deletes all Pinecone documents on every
deployment; change this to a safer default and/or add conditional control: set
DELETE_ALL: "false" by default, or gate deletion behind a branch/tag condition
(e.g., only true for main/release or when a manual input/secret is present), or
move the delete step to a separate manual workflow; implement one of these
options so routine CI deploys use upserts without wiping data unless explicitly
requested.
better than nothing, we can make it good later
@viktormarinho Sounds good! Feel free to circle back to this when you're ready to make it more robust. 👍
```typescript
const PINECONE_API_KEY = process.env.PINECONE_API_KEY!;
const PINECONE_INDEX_NAME = process.env.PINECONE_INDEX_NAME!;

if (!PINECONE_API_KEY) throw new Error("Define PINECONE_API_KEY");
if (!PINECONE_INDEX_NAME) throw new Error("Define PINECONE_INDEX_NAME (index name)");
```
🛠️ Refactor suggestion | 🟠 Major

Remove redundant non-null assertions.

The non-null assertion operator (!) on lines 37-38 tells TypeScript these values are never null/undefined, but then lines 40-41 immediately check if they're falsy. This is redundant.

Apply this diff to validate before assignment:

```diff
-const PINECONE_API_KEY = process.env.PINECONE_API_KEY!;
-const PINECONE_INDEX_NAME = process.env.PINECONE_INDEX_NAME!;
-
-if (!PINECONE_API_KEY) throw new Error("Define PINECONE_API_KEY");
-if (!PINECONE_INDEX_NAME) throw new Error("Define PINECONE_INDEX_NAME (index name)");
+if (!process.env.PINECONE_API_KEY) throw new Error("Define PINECONE_API_KEY");
+if (!process.env.PINECONE_INDEX_NAME) throw new Error("Define PINECONE_INDEX_NAME (index name)");
+
+const PINECONE_API_KEY = process.env.PINECONE_API_KEY;
+const PINECONE_INDEX_NAME = process.env.PINECONE_INDEX_NAME;
```
🤖 Prompt for AI Agents
In scripts/ingest-docs-pinecone.ts around lines 37 to 42, remove the redundant
non-null assertions on PINECONE_API_KEY and PINECONE_INDEX_NAME and instead
validate the environment variables before assigning: read the raw values from
process.env without "!", check if each is present (throwing the existing errors
if missing), and only then assign the consts (or use a guaranteed-typed value).
This eliminates the pointless "!" while preserving the runtime checks and
type-safety.
scripts/ingest-docs-pinecone.ts (Outdated)
```typescript
async function readIfSmall(file: string): Promise<string | null> {
  try {
    const stat = await fs.stat(file);
    if (stat.size > bytesLimit) {
      console.warn(`Skip por tamanho (${(stat.size / 1024 / 1024).toFixed(2)}MB): ${file}`);
      return null;
    }
    const raw = await fs.readFile(file, "utf8");
    if (!raw.trim()) return null;
    return raw;
  } catch {
    return null;
  }
}
```
🛠️ Refactor suggestion | 🟠 Major

Improve error handling in readIfSmall.

The catch block silently swallows all errors (lines 73-74), which could mask legitimate issues like permission errors, disk failures, or file system problems. While returning null for unreadable files is acceptable, the errors should be logged for debugging.

Apply this diff to add error logging:

```diff
   } catch {
+    console.warn(`Failed to read file: ${file}`);
     return null;
   }
```

Or for more detailed logging:

```diff
-  } catch {
+  } catch (error) {
+    console.warn(`Failed to read file: ${file}`, error);
     return null;
   }
```
🤖 Prompt for AI Agents
In scripts/ingest-docs-pinecone.ts around lines 63 to 76, the readIfSmall
function currently swallows all errors in its catch block; update the catch to
log the error (including the file path and error message/stack) before returning
null so permission/IO errors are visible during debugging, e.g., capture the
caught error and call console.error (or the project's logger) with a clear
message and the error, then return null as before.
scripts/ingest-docs-pinecone.ts (Outdated)
```typescript
// - Embedding: Pinecone automatically generates via integrated embedding
// - Upsert: uses upsertRecords() sending text + metadata
//
// ENV obrigatórios:
```
pt-br -> en
Actionable comments posted: 1
♻️ Duplicate comments (2)
scripts/ingest-docs-pinecone.ts (1)
26-31: Remove redundant non-null assertions.

The non-null assertions on lines 26-27 followed by validation checks on lines 29-30 are redundant. Validate before assignment instead.

Apply this diff:

```diff
-const PINECONE_API_KEY = process.env.PINECONE_API_KEY!;
-const PINECONE_ASSISTANT_NAME = process.env.PINECONE_ASSISTANT_NAME!;
-
-if (!PINECONE_API_KEY) throw new Error("Define PINECONE_API_KEY");
-if (!PINECONE_ASSISTANT_NAME) throw new Error("Define PINECONE_ASSISTANT_NAME (assistant name)");
+if (!process.env.PINECONE_API_KEY) throw new Error("Define PINECONE_API_KEY");
+if (!process.env.PINECONE_ASSISTANT_NAME) throw new Error("Define PINECONE_ASSISTANT_NAME (assistant name)");
+
+const PINECONE_API_KEY = process.env.PINECONE_API_KEY;
+const PINECONE_ASSISTANT_NAME = process.env.PINECONE_ASSISTANT_NAME;
```

.github/workflows/deploy-docs.yaml (1)
38-38: Reconsider DELETE_ALL in production deployments.

Setting DELETE_ALL: "true" on every deployment wipes all existing documents before re-ingesting, which could cause temporary data unavailability if ingestion fails mid-process. Consider making deletion opt-in or conditional.
🧹 Nitpick comments (4)
scripts/ingest-docs-pinecone.ts (4)
20-23: Add explicit file extensions to imports.

Scripts under scripts/** are executed in a Deno-like environment where explicit file extensions are required. The imports from node:fs/promises and node:path are correct, but third-party imports should follow this pattern when applicable.

As per coding guidelines
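For illustration, the guideline amounts to this pattern (the local module name is hypothetical):

```typescript
// Built-in modules keep their node: specifiers.
import fs from "node:fs/promises";
import path from "node:path";

// Local imports carry an explicit extension, as Deno-style resolution requires.
import { makeChunkId } from "./chunk-id.ts"; // hypothetical local helper
```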
67-69: Log errors for debugging.

The catch block silently swallows all errors, which could mask permission errors, disk failures, or other file system issues. Add error logging to aid debugging.

Apply this diff:

```diff
-  } catch {
+  } catch (error) {
+    console.warn(`Failed to stat file: ${file}`, error);
     return false;
   }
```
130-135: Log cleanup failures for debugging.

The catch block in the finally section silently swallows errors when deleting temporary files. While it's acceptable to continue on cleanup failures, logging these errors helps diagnose file system issues.

Apply this diff:

```diff
   if (tempFile) {
     try {
       await fs.unlink(tempFile);
-    } catch {
+    } catch (error) {
+      console.warn(`Failed to delete temporary file: ${tempFile}`, error);
     }
   }
```
139-198: Consider adding progress feedback for deletion phase.

The deletion phase (lines 150-169) iterates through all existing files but only logs when a file is deleted. For large assistants, this phase could appear to hang. Consider adding periodic progress updates.

Example enhancement:

```typescript
console.log(`Checking ${existingFiles.length} existing file(s)...`);
let deleted = 0;
for (const file of existingFiles) {
  const sourcePath = file.metadata?.source_path;
  if (sourcePath && localPaths.has(sourcePath)) {
    await assistant.deleteFile(file.id);
    console.log(`Deleted: ${file.name} (${sourcePath})`);
    deleted++;
  }
}
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- .github/workflows/deploy-docs.yaml (1 hunks)
- scripts/ingest-docs-pinecone.ts (1 hunks)
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{ts,tsx}
📄 CodeRabbit inference engine (.cursor/rules/data-flow.mdc)
**/*.{ts,tsx}: MCP tools must use Zod schemas for input and, when applicable, output validation
Register tools with server.registerTool providing description, inputSchema.shape, and optional outputSchema.shape
In every MCP tool handler, perform authorization checks first and then call context.resourceAccess.grant() before business logic
Name tools using the {RESOURCE}_{ACTION} pattern (e.g., AGENTS_CREATE, THREADS_LIST)
Group related tools into typed collections (e.g., GLOBAL_TOOLS, WORKSPACE_TOOLS) and export them as const
Always check workspace/team access before executing operations that touch workspace resources
Return proper authorization errors and handle forbidden operations explicitly
Prefer specific TypeScript types over any
**/*.{ts,tsx}: Write concise, maintainable, and technically accurate TypeScript code
Use functional and declarative programming patterns; avoid classes
Favor iteration and modularization to follow DRY and avoid code duplication
Use descriptive variable names with auxiliary verbs (e.g., isLoading, hasError)
Organize files so each contains only related content (components, subcomponents, helpers, static content, types)
Prefer interfaces over types for object shapes
Avoid enums; use maps (objects/records) instead
Use the function keyword for pure functions to benefit from hoisting and clarity
Ensure dependency arrays use stable references; do not inline new objects/arrays/functions
Memoize options/objects used in hooks (e.g., useMemo for options passed to useMemo/useCallback)
Favor named exports for functions
Files:
scripts/ingest-docs-pinecone.ts
**/*.ts
📄 CodeRabbit inference engine (.cursor/rules/data-flow.mdc)
Log authorization failures to aid debugging and auditing
Files:
scripts/ingest-docs-pinecone.ts
scripts/**
📄 CodeRabbit inference engine (AGENTS.md)
Keep infrastructure helper scripts under scripts/
Files:
scripts/ingest-docs-pinecone.ts
**/*.{ts,tsx,js,jsx}
📄 CodeRabbit inference engine (AGENTS.md)
**/*.{ts,tsx,js,jsx}: Biome formatting: use two-space indentation and double quotes
Keep imports sorted
Name hooks and utility functions using camelCase
Files:
scripts/ingest-docs-pinecone.ts
{apps/api,scripts}/**/*.{ts,tsx}
📄 CodeRabbit inference engine (AGENTS.md)
Use explicit file extensions in imports where Deno requires them
Files:
scripts/ingest-docs-pinecone.ts
🔇 Additional comments (2)
.github/workflows/deploy-docs.yaml (2)
29-30: LGTM: Dependencies installed correctly.

The dependencies are installed at the repository root, which is appropriate since the script is located at scripts/ingest-docs-pinecone.ts (root level).
37-37: Verify if pt-br is your intended documentation language, or if it should be en.

Both docs/view/src/content/pt-br/ and docs/view/src/content/en/ directories exist in the repository. The current workflow targets Portuguese (pt-br), but a past review comment suggested using English (en) instead. Confirm this is your intended documentation source, or update to en if that's preferred.
```typescript
}

async function uploadWithRetry(
  assistant: any,
```
🛠️ Refactor suggestion | 🟠 Major

Replace any with specific type.

The assistant parameter is typed as any, which bypasses type safety. Use the specific Pinecone Assistant type from the SDK.

As per coding guidelines

Apply this diff:

```diff
+import type { Assistant } from "@pinecone-database/pinecone";
+
 async function uploadWithRetry(
-  assistant: any,
+  assistant: Assistant<any>,
   file: string,
   normalizedPath: string,
   retries = MAX_RETRIES
```

🤖 Prompt for AI Agents
In scripts/ingest-docs-pinecone.ts around line 77, the assistant parameter is
currently typed as "any"; replace it with the specific Assistant type exported
by the Pinecone SDK (use the exact exported type name from the SDK you depend
on), update the function signature to use that type, and add the corresponding
import at the top of the file. Ensure you remove the "any" usage everywhere this
parameter is referenced so TypeScript enforces proper typings (adjust related
code if the stricter type surfaces required changes).
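Since the review references uploadWithRetry without showing its body, here is a rough sketch of what a helper of that shape could look like; the backoff values and the uploadFile({ path, metadata }) call are assumptions about the Pinecone assistant API, not the PR's actual implementation:

```typescript
import type { Assistant } from "@pinecone-database/pinecone"; // type name per the suggestion above

const MAX_RETRIES = 3;

// Hypothetical retry wrapper: exponential backoff on transient upload errors.
async function uploadWithRetry(
  assistant: Assistant,
  file: string,
  normalizedPath: string,
  retries = MAX_RETRIES
): Promise<void> {
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      await assistant.uploadFile({ path: file, metadata: { source_path: normalizedPath } });
      return;
    } catch (error) {
      if (attempt === retries) throw error;
      const delayMs = 1000 * 2 ** (attempt - 1); // 1s, 2s, 4s, ...
      console.warn(`Upload failed (attempt ${attempt}/${retries}); retrying in ${delayMs}ms`, error);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```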
plz run format, lint check etc @igoramf |
What is this contribution about?
This PR adds automated Pinecone documentation ingestion to the CI/CD pipeline. The
changes include:
- A new TypeScript ingestion script (scripts/ingest-docs-pinecone.ts) that handles the ingestion process into the Pinecone vector database
- CI workflow updates to install dependencies and run the script during docs deployment
This enables automatic synchronization of documentation into Pinecone whenever changes
are deployed.
Screenshots/Demonstration
N/A - Backend/CI changes only
Review Checklist
Summary by CodeRabbit