@sambhavnoobcoder sambhavnoobcoder commented Nov 23, 2025

Issue

Issue #1117 requested integration with Neo4j, a leading graph database platform. The request was minimal ("neo4j as a source would also be helpful"), without specifying requirements, use cases, or what data should be synced.

Problem Identification

After analyzing the request and Neo4j's capabilities, I identified the following requirements:

  1. Data to Index: Neo4j stores graph data with two primary components:

    • Nodes: Entities with labels (types) and properties
    • Relationships: Connections between nodes (not indexed in this PR; can be added in a future enhancement)
  2. Authentication: Neo4j uses URI + username/password authentication:

    • URI: Connection string (e.g., neo4j://localhost:7687)
    • Username & Password: Database credentials
  3. Key Use Cases:

    • "Find all Person nodes in my graph database"
    • "Search for Company entities with specific properties"
    • "What products are stored in my Neo4j database?"
    • Enable AI agents to query graph database contents
  4. Technical Challenges:

    • Dynamic schema discovery (labels and properties vary by database)
    • Handling different Neo4j property types
    • Memory-efficient streaming of large graph datasets
    • URI validation and normalization
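
The URI validation and normalization challenge can be illustrated with a minimal sketch. The helper name `normalize_neo4j_uri` and the exact normalization rules (lowercasing the scheme and host, filling in the default Bolt port 7687) are assumptions for illustration, not the connector's actual code; the scheme list is the set accepted by the official Neo4j drivers.

```python
from urllib.parse import urlsplit

# URI schemes accepted by the official Neo4j drivers.
ALLOWED_SCHEMES = {"neo4j", "neo4j+s", "neo4j+ssc", "bolt", "bolt+s", "bolt+ssc"}
DEFAULT_BOLT_PORT = 7687


def normalize_neo4j_uri(raw: str) -> str:
    """Hypothetical helper: validate a Neo4j URI and fill in the default port."""
    uri = raw.strip()
    if not uri:
        raise ValueError("Neo4j URI must not be empty")

    parts = urlsplit(uri)
    scheme = parts.scheme.lower()
    if scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"Unsupported URI scheme: {parts.scheme!r}")
    if not parts.hostname:
        raise ValueError("Neo4j URI must include a host")

    # urlsplit strips the brackets from IPv6 literals, so restore them.
    host = f"[{parts.hostname}]" if ":" in parts.hostname else parts.hostname
    port = parts.port or DEFAULT_BOLT_PORT
    return f"{scheme}://{host}:{port}"
```

Rejecting unknown schemes up front means a misconfigured connection (for example an `http://` URL) fails with a clear message before any driver is created.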

Solution Design

I designed a solution following Airweave's existing PostgreSQL connector pattern, as Neo4j is also a database:

Architecture Decisions

  1. Dynamic Entity Creation:

    • Use PolymorphicEntity (like PostgreSQL) for dynamic label-based entities
    • Each Neo4j label becomes a separate entity class
    • Properties become entity fields with appropriate types
  2. Schema Discovery:

    • Query CALL db.labels() to discover all node labels
    • Sample 100 nodes per label to infer property types
    • Create entity class dynamically for each label
  3. Configuration:

    • Simple auth config with URI, username, password
    • Empty source config (can be extended for label filtering)
  4. Streaming Strategy:

    • Batch processing with 1000-node buffers
    • Async iteration for memory efficiency
    • Per-label processing to avoid memory overload
  5. Entity ID Generation:

    • Prefer common ID properties (id, uuid, guid)
    • Fall back to Neo4j's internal node ID
    • Hash-based IDs for very long identifiers (>2000 chars)
  6. Error Handling:

    • Graceful degradation for auth failures
    • Specific error messages for Neo4j driver exceptions
    • Connection health checks before processing each label
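
The streaming strategy in item 4 can be sketched as an async generator that buffers nodes and processes one label at a time. This is a simplified illustration under stated assumptions: `batched`, `stream_entities`, and `fake_fetch_nodes` are hypothetical names, and `fake_fetch_nodes` stands in for a real driver query so the sketch runs without a database.

```python
import asyncio
from typing import Any, AsyncIterator, Callable, Dict, Iterable, List

BATCH_SIZE = 1000  # buffer size named in the design above


async def batched(source: AsyncIterator[Any], size: int = BATCH_SIZE) -> AsyncIterator[List[Any]]:
    """Group items from an async iterator into lists of at most `size` items."""
    buffer: List[Any] = []
    async for item in source:
        buffer.append(item)
        if len(buffer) >= size:
            yield buffer
            buffer = []
    if buffer:  # flush the final partial batch
        yield buffer


async def stream_entities(
    labels: Iterable[str],
    fetch_nodes: Callable[[str], AsyncIterator[Dict[str, Any]]],
    size: int = BATCH_SIZE,
) -> AsyncIterator[Dict[str, Any]]:
    """Process one label at a time so only one buffer is held in memory."""
    for label in labels:
        async for batch in batched(fetch_nodes(label), size):
            for node in batch:
                yield {"label": label, **node}


async def fake_fetch_nodes(label: str) -> AsyncIterator[Dict[str, Any]]:
    """Hypothetical stand-in for a per-label driver query."""
    for i in range(3):
        yield {"id": i}
```

Consumers iterate with `async for entity in stream_entities(labels, fetch_nodes)`, so at most one batch per label is materialized at any time.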

Implementation

Components Delivered

1. Authentication Configuration

Enhanced Neo4jAuthConfig with proper field validation:

  • URI field with min_length=1 constraint
  • Username and password validation
  • Clear field descriptions for UI integration

2. Source Configuration

Simple Neo4jConfig class:

  • Uses empty SourceConfig for now
  • Can be extended for label filtering or custom queries

3. Neo4j Source Connector

Built a comprehensive connector (backend/airweave/platform/sources/neo4j.py, 492 lines):

  • Async driver management with connection pooling
  • Label discovery via db.labels() procedure
  • Property type inference from sample nodes
  • Dynamic PolymorphicEntity creation per label
  • Streaming entity generation with buffering
  • Proper field name normalization (reserved field handling)
  • Entity ID generation with multiple fallback strategies
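
Two of the steps listed above, property type inference from sampled nodes and entity ID generation with fallbacks, can be sketched as follows. This is not the connector's code: the function names, the fallback to `str` when sampled values disagree, the `label:` prefix, and the choice of SHA-256 are all assumptions made for illustration. Only the preferred ID properties (id, uuid, guid), the fallback to the internal node ID, and the 2000-character threshold come from the description.

```python
import hashlib
from typing import Any, Dict, Iterable, Mapping

MAX_ID_LENGTH = 2000  # threshold named in the design
ID_PROPERTIES = ("id", "uuid", "guid")  # preferred ID properties, in order


def infer_property_types(samples: Iterable[Mapping[str, Any]]) -> Dict[str, type]:
    """Infer one Python type per property from sampled node properties.

    Assumption: properties with conflicting or only-null values fall back to str.
    """
    all_keys: Dict[str, None] = {}  # dict used as an ordered set
    inferred: Dict[str, type] = {}
    for props in samples:
        for key, value in props.items():
            all_keys[key] = None
            if value is None:
                continue
            seen = inferred.get(key)
            if seen is None:
                inferred[key] = type(value)
            elif seen is not type(value):
                inferred[key] = str
    return {key: inferred.get(key, str) for key in all_keys}


def make_entity_id(label: str, props: Mapping[str, Any], internal_id: str) -> str:
    """Prefer a common ID property, else the internal node ID; hash if too long."""
    raw = None
    for key in ID_PROPERTIES:
        if props.get(key) is not None:
            raw = str(props[key])
            break
    if raw is None:
        raw = str(internal_id)

    entity_id = f"{label}:{raw}"  # label prefix is an assumption
    if len(entity_id) > MAX_ID_LENGTH:
        digest = hashlib.sha256(entity_id.encode("utf-8")).hexdigest()
        entity_id = f"{label}:{digest}"
    return entity_id
```

Hashing only when the ID exceeds the limit keeps ordinary IDs human-readable while still bounding their length.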

4. Comprehensive Testing

Wrote 15 unit tests (backend/tests/unit/platform/sources/test_neo4j.py, 539 lines):

  • Source creation and configuration
  • Driver connection and validation
  • Label discovery and property inference
  • Entity ID generation (with fallbacks)
  • Field name normalization
  • Entity class creation
  • Full entity generation workflow
  • Error handling (auth errors, service unavailable)
  • Entity ID length enforcement
  • Empty database handling
  • Auth config validation

All tests use exact mock responses matching Neo4j's driver API format with custom MockNode and MockResult classes.

5. Frontend Integration

  • Added Neo4j icon (SVG) with official Baltic Blue branding (#0A6190)
  • Updated README with Neo4j in supported integrations (alphabetically ordered)

Testing Strategy

As with n8n, the tests take a comprehensive mocking approach:

  1. Created MockNode class mimicking Neo4j's node structure
  2. Created MockResult class for async iteration of query results
  3. Mocked AsyncGraphDatabase.driver() and session management
  4. Tested all core functionality with exact API mocks
  5. Achieved 100% test pass rate (15/15 tests)
  6. Test runtime: ~2 seconds
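
The mocking approach above can be sketched roughly as follows. The PR's actual `MockNode` and `MockResult` classes are not shown in this conversation, so the shapes below are assumptions: they imitate the parts of the driver's node interface (`labels`, `element_id`, mapping-style property access) and async result iteration that a connector would typically touch.

```python
import asyncio
from typing import Any, Dict, Iterable, List


class MockNode:
    """Stand-in for a driver node: labels, an element ID, dict-like properties."""

    def __init__(self, labels: Iterable[str], properties: Dict[str, Any], element_id: str = "0"):
        self.labels = frozenset(labels)
        self.element_id = element_id
        self._properties = dict(properties)

    # keys() plus __getitem__ is enough for dict(node) to work.
    def keys(self):
        return self._properties.keys()

    def items(self):
        return self._properties.items()

    def __getitem__(self, key: str) -> Any:
        return self._properties[key]


class MockResult:
    """Async-iterable stand-in for a query result; yields dict records."""

    def __init__(self, records: List[Dict[str, Any]]):
        self._records = list(records)

    def __aiter__(self):
        self._iter = iter(self._records)
        return self

    async def __anext__(self) -> Dict[str, Any]:
        try:
            return next(self._iter)
        except StopIteration:
            raise StopAsyncIteration from None


async def collect_names(result: MockResult) -> List[str]:
    """Example consumer: read the `name` property of each returned node."""
    return [dict(record["n"])["name"] async for record in result]
```

A test can then hand a `MockResult` to the code under test wherever a session's query result would normally appear, without a running Neo4j instance.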

Results

Test Coverage

  • 15/15 unit tests passing (100%)
  • ~2 second test runtime
  • Covers all core functionality
  • Exact Neo4j driver API mocks
  • Error scenarios covered: auth failures and service unavailability

Fixes #1117


Summary by cubic

Adds a Neo4j connector to sync graph nodes by label into dynamic entities with type mapping and efficient streaming. Enables quick indexing and search of Neo4j data; addresses #1117.

  • New Features

    • Direct auth with URI, username, and password validation.
    • Schema discovery via db.labels() and sampled nodes; maps Neo4j types to entity fields.
    • Per-label PolymorphicEntity with field normalization and property-to-field mapping.
    • Streaming sync with a 1000-node buffer and connection health checks.
    • Stable entity IDs (id/uuid/guid → fallback to node ID; hash when too long).
    • Error handling for auth failures and service unavailability, with reconnects.
    • UI/docs: added Neo4j icon and README entry.
  • Testing

    • 15 unit tests using mocked async driver responses.
    • Covers label discovery, property inference, entity creation, ID rules, errors, and empty DB.
    • ~2s runtime; ensures driver closes after sync.
    • Validates auth config field constraints.

Written for commit 4d1f3f5. Summary will update automatically on new commits.


@cubic-dev-ai cubic-dev-ai bot left a comment


2 issues found across 6 files

Prompt for AI agents (all 2 issues)

Understand the root cause of the following 2 issues and fix them.


<file name="backend/tests/unit/platform/sources/test_neo4j.py">

<violation number="1" location="backend/tests/unit/platform/sources/test_neo4j.py:368">
`test_neo4j_entity_class_creation` should assert that reserved properties use the normalized field name (e.g., `name_field`) instead of the raw `name`, otherwise the test cannot catch collisions.</violation>

<violation number="2" location="backend/tests/unit/platform/sources/test_neo4j.py:419">
The full-sync test should assert that exactly four entities were yielded to match the documented expectation, otherwise missing labels go undetected.</violation>
</file>



# Verify we got entities from all labels
# 2 Person + 1 Company + 1 Product = 4 entities
assert len(entities) >= 3

@cubic-dev-ai cubic-dev-ai bot Nov 23, 2025


The full-sync test should assert that exactly four entities were yielded to match the documented expectation, otherwise missing labels go undetected.

Prompt for AI agents
Address the following comment on backend/tests/unit/platform/sources/test_neo4j.py at line 419:

<comment>The full-sync test should assert that exactly four entities were yielded to match the documented expectation, otherwise missing labels go undetected.</comment>

<file context>
@@ -0,0 +1,539 @@
+
+        # Verify we got entities from all labels
+        # 2 Person + 1 Company + 1 Product = 4 entities
+        assert len(entities) >= 3
+
+        # Verify entity properties
</file context>
Suggested change
- assert len(entities) >= 3
+ assert len(entities) == 4

assert entity_class is not None
assert hasattr(entity_class, "model_fields")
assert "id_" in entity_class.model_fields # 'id' normalized to 'id_'
assert "name" in entity_class.model_fields

@cubic-dev-ai cubic-dev-ai bot Nov 23, 2025


test_neo4j_entity_class_creation should assert that reserved properties use the normalized field name (e.g., name_field) instead of the raw name, otherwise the test cannot catch collisions.

Prompt for AI agents
Address the following comment on backend/tests/unit/platform/sources/test_neo4j.py at line 368:

<comment>`test_neo4j_entity_class_creation` should assert that reserved properties use the normalized field name (e.g., `name_field`) instead of the raw `name`, otherwise the test cannot catch collisions.</comment>

<file context>
@@ -0,0 +1,539 @@
+    assert entity_class is not None
+    assert hasattr(entity_class, "model_fields")
+    assert "id_" in entity_class.model_fields  # 'id' normalized to 'id_'
+    assert "name" in entity_class.model_fields
+    assert "email" in entity_class.model_fields
+    assert "age" in entity_class.model_fields
</file context>
Suggested change
- assert "name" in entity_class.model_fields
+ assert "name_field" in entity_class.model_fields


Development

Successfully merging this pull request may close these issues.

[Connector Request] neo4j as a source would also be helpful