-
Notifications
You must be signed in to change notification settings - Fork 643
Feature/OpenAI markitdown integration #1147
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Feature/OpenAI markitdown integration #1147
Conversation
Implements user-selectable document conversion models (OpenAI, Mistral, Local) for Google Drive sources. - Adds model selection dropdown to SourceConfigView - Updates GoogleDriveConfig to include model field with UI hints - Refactors SmartConverter to respect preferred_model - Updates EntityPipeline to pass user preference during sync
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
3 issues found across 15 files
Prompt for AI agents (all 3 issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="backend/airweave/platform/destinations/weaviate.py">
<violation number="1" location="backend/airweave/platform/destinations/weaviate.py:52">
P2: Collection names are declared to be alphanumeric only, yet the code inserts an underscore (Airweave_<id>), so collection creation will be rejected by Weaviate.</violation>
<violation number="2" location="backend/airweave/platform/destinations/weaviate.py:58">
P1: Port parsing treats the scheme separator as a port, so URLs without an explicit port (e.g., https://host) crash with ValueError and the destination can never be created.</violation>
</file>
<file name="backend/airweave/platform/converters/smart_converter.py">
<violation number="1" location="backend/airweave/platform/converters/smart_converter.py:58">
P1: Selecting the “Local” conversion model still uses OpenAI whenever an OPENAI_API_KEY is configured, so the new preference cannot disable remote processing.</violation>
</file>
Since this is your first cubic review, here's how it works:
- cubic automatically reviews your code and comments on bugs and improvements
- Teach cubic by replying to its comments. cubic learns from your replies and gets better over time
- Ask questions if you need clarification on any suggestion
Reply to cubic to teach it or ask questions. Re-run a review with @cubic-dev-ai review this PR
| # Weaviate collection names must start with a capital letter and be alphanumeric | ||
| # We'll use a prefix + sanitized collection ID | ||
| sanitized_id = str(collection_id).replace("-", "") | ||
| instance.collection_name = f"Airweave_{sanitized_id}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
P2: Collection names are declared to be alphanumeric only, yet the code inserts an underscore (Airweave_), so collection creation will be rejected by Weaviate.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/airweave/platform/destinations/weaviate.py, line 52:
<comment>Collection names are declared to be alphanumeric only, yet the code inserts an underscore (Airweave_<id>), so collection creation will be rejected by Weaviate.</comment>
<file context>
@@ -0,0 +1,174 @@
+ # Weaviate collection names must start with a capital letter and be alphanumeric
+ # We'll use a prefix + sanitized collection ID
+ sanitized_id = str(collection_id).replace("-", "")
+ instance.collection_name = f"Airweave_{sanitized_id}"
+
+ if credentials:
</file context>
| instance.collection_name = f"Airweave_{sanitized_id}" | |
| + instance.collection_name = f"Airweave{sanitized_id}" |
| instance.client = weaviate.use_async_with_params( | ||
| client=weaviate.connect.ConnectionParams.from_params( | ||
| http_host=credentials.cluster_url.replace("http://", "").replace("https://", "").split(":")[0], | ||
| http_port=int(credentials.cluster_url.split(":")[-1]) if ":" in credentials.cluster_url else 8080, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
P1: Port parsing treats the scheme separator as a port, so URLs without an explicit port (e.g., https://host) crash with ValueError and the destination can never be created.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/airweave/platform/destinations/weaviate.py, line 58:
<comment>Port parsing treats the scheme separator as a port, so URLs without an explicit port (e.g., https://host) crash with ValueError and the destination can never be created.</comment>
<file context>
@@ -0,0 +1,174 @@
+ instance.client = weaviate.use_async_with_params(
+ client=weaviate.connect.ConnectionParams.from_params(
+ http_host=credentials.cluster_url.replace("http://", "").replace("https://", "").split(":")[0],
+ http_port=int(credentials.cluster_url.split(":")[-1]) if ":" in credentials.cluster_url else 8080,
+ http_secure=credentials.cluster_url.startswith("https"),
+ grpc_host=credentials.cluster_url.replace("http://", "").replace("https://", "").split(":")[0],
</file context>
| # 3. Local Strategy (Fallback) | ||
| # Always available via MarkItDown (it falls back to local if no LLM client) | ||
| try: | ||
| self.local_converter = MarkItDownConverter() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
P1: Selecting the “Local” conversion model still uses OpenAI whenever an OPENAI_API_KEY is configured, so the new preference cannot disable remote processing.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/airweave/platform/converters/smart_converter.py, line 58:
<comment>Selecting the “Local” conversion model still uses OpenAI whenever an OPENAI_API_KEY is configured, so the new preference cannot disable remote processing.</comment>
<file context>
@@ -0,0 +1,132 @@
+ # 3. Local Strategy (Fallback)
+ # Always available via MarkItDown (it falls back to local if no LLM client)
+ try:
+ self.local_converter = MarkItDownConverter()
+ logger.info("SmartConverter: Local strategy available")
+ except Exception as e:
</file context>
Summary
This PR implements a user-selectable document conversion model feature for Google Drive sources. Users can now choose between "OpenAI", "Mistral", or "Local" (MarkItDown) when configuring a source.
Changes
SourceConfigViewto select the conversion model.GoogleDriveConfigto include themodelfield with UI hints.SmartConverterto accept and prioritize apreferred_modelargument.EntityPipelineto pass the user's selection from the source config to the converter.Summary by cubic
Added a user-selectable document conversion model for Google Drive and introduced a SmartConverter that supports MarkItDown (local), OpenAI, and Mistral for better control and quality.
New Features
Migration
Written for commit 908fb64. Summary will update automatically on new commits.