@aldaonggar aldaonggar commented Nov 23, 2025

Optimize CTTI entity generation with concurrent batching and Azure storage checks

Implements batched, concurrent entity creation for the CTTI source to improve performance on large datasets.

Changes:

  • Added concurrent processing using process_entities_concurrent (batch size: 50) to parallelize entity creation
  • Added Azure storage existence checks before creating entities; skip entities already in storage to avoid redundant processing
  • Improved error handling: storage check failures are logged but don't stop the sync
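As a rough illustration of the changes listed above, here is a minimal sketch of batched, concurrent existence checks that skip entities already in storage and continue when a check fails. It does not reproduce the PR's code or the signature of process_entities_concurrent: the names filter_new_ids and exists are hypothetical, and treating a failed check as "not in storage" (so the entity is still processed) is an assumption.

```python
import asyncio
import logging
from typing import Awaitable, Callable, Iterable, List

logger = logging.getLogger(__name__)


async def filter_new_ids(
    ids: Iterable[str],
    exists: Callable[[str], Awaitable[bool]],  # hypothetical storage check
    batch_size: int = 50,
) -> List[str]:
    """Return the ids that still need an entity, checking storage in batches."""
    pending = list(ids)
    new_ids: List[str] = []
    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]
        # Run up to batch_size checks concurrently; collect errors instead of raising.
        results = await asyncio.gather(
            *(exists(entity_id) for entity_id in batch),
            return_exceptions=True,
        )
        for entity_id, result in zip(batch, results):
            if isinstance(result, BaseException):
                # Assumption: a failed check is logged and the entity is still processed.
                logger.warning("Storage check failed for %s: %s", entity_id, result)
                new_ids.append(entity_id)
            elif not result:
                new_ids.append(entity_id)
    return new_ids
```

With a real storage client, exists would wrap whatever existence call that client provides; only ids returned by filter_new_ids would then go on to entity creation.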

Performance impact:

  • Parallelizes Azure storage I/O (50 concurrent checks vs sequential)
  • Reduces downstream work by skipping already-processed entities
  • Faster incremental syncs when many entities already exist

Summary by cubic

Speeds up CTTI entity generation by batching work and skipping entities already in Azure storage. This reduces redundant processing and makes incremental syncs faster.

  • New Features
    • Concurrent batching for entity creation (batch size 50); order not required.
    • Azure storage existence check before creation; skip existing entities and continue on check errors.

Written for commit 6ff166f. Summary will update automatically on new commits.


@cubic-dev-ai cubic-dev-ai bot left a comment

No issues found across 1 file

@aldaonggar

@orhanrauf @lennertjansen

aldaonggar commented Nov 26, 2025

Benchmark (10,000 studies)

  • Original: 13.1407 seconds
  • Optimized with 0 Azure storage hits: 13.7565 seconds
  • Optimized with 50% Azure storage hits: 2.0019 seconds
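For reference, the relative differences implied by these timings can be worked out with a few lines of arithmetic. This is only a sketch over the three numbers reported above; the variable names are illustrative.

```python
# Timings reported above for 10,000 studies, in seconds.
original = 13.1407
optimized_no_hits = 13.7565
optimized_half_hits = 2.0019

# Overhead of the optimized path when nothing is in storage yet.
overhead_pct = (optimized_no_hits - original) / original * 100

# Speedup of the optimized path when half the entities are already stored.
speedup_half_hits = original / optimized_half_hits

print(f"Overhead with 0 hits: {overhead_pct:.1f}%")
print(f"Speedup with 50% hits: {speedup_half_hits:.2f}x")
```

That works out to roughly a 5% slowdown with no storage hits and about a 6.5x speedup at a 50% hit rate.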

Conclusion

The parallelization itself has little effect (13.7565 s vs. 13.1407 s with no storage hits), but the Azure storage lookups are quite useful (2.0019 s at a 50% hit rate). I want to emphasize that this is a premature optimization: the actual bottleneck is embedding, which takes most of the time.

Important

The CTTI connector currently doesn't work on PROD; this PR fixes that issue as well.

@orhanrauf @lennertjansen
