# TurboDedup

High-Performance File Deduplication Scanner with Intelligent Optimization
TurboDedup is the definitive file deduplication solution featuring intelligent three-phase optimization, smart caching, GPU acceleration, and similarity detection. Achieve 10-100x performance improvements over traditional duplicate detection tools.
## Features

- Three-Phase Optimization: Intelligent partial hashing for large files (10-100x faster)
- Smart Caching System: Persistent SQLite cache eliminates redundant I/O (80-95% hit rates)
- GPU Acceleration: CUDA/OpenCL support for massive performance gains (5-15x speedup)
- Similarity Detection: Find near-duplicates with perceptual hashing (images, audio, documents)
- Multiple Hash Algorithms: MD5, SHA1, SHA256, and xxHash support (see the sketch after this list)
- Intelligent Deletion Strategies: Keep newest, oldest, original files with pattern recognition
- Progress Tracking: Real-time progress with ETA and throughput metrics
- Export Options: CSV/JSON export with comprehensive metadata
- Resume Capability: Interrupt and resume scans seamlessly
- Cross-Platform: Windows, Linux, macOS support
- Scalable: Handles TB-scale datasets efficiently
- Safe Operations: Dry-run mode, backups, and verification
- Comprehensive Logging: Detailed error handling and audit trails
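The hash algorithm is selected with the `--algorithm` flag (see Performance Tuning below). As a rough illustration of what such a dispatch looks like, here is a minimal sketch assuming the optional `xxhash` package; the helper names are hypothetical, not TurboDedup's API:

```python
# Illustrative hash-algorithm dispatch; helper names are hypothetical,
# not TurboDedup's API. xxHash requires `pip install xxhash`.
import hashlib

def make_hasher(algorithm):
    """Return a fresh incremental hasher for the given algorithm name."""
    if algorithm == "xxhash":
        import xxhash  # optional dependency: non-cryptographic, very fast
        return xxhash.xxh64()
    return hashlib.new(algorithm)  # "md5", "sha1", "sha256"

def hash_file(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in fixed-size chunks so memory use stays constant."""
    h = make_hasher(algorithm)
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()
```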
## Performance

| Feature | Benefit | Improvement |
|---|---|---|
| Partial Hashing | Large file optimization | 10-100x faster |
| Smart Caching | Eliminates redundant I/O | 10-50x on repeat scans |
| GPU Acceleration | Parallel hash computation | 5-15x speedup |
| Combined | All optimizations together | Up to 1000x improvement |
## Installation

```bash
# Basic installation
pip install turbodedup
# With GPU acceleration
pip install turbodedup[gpu]
# With similarity detection
pip install turbodedup[similarity]
# Full installation with all features
pip install turbodedup[all]
```

### Development Installation

```bash
git clone https://github.com/arjaygg/TurboDedup.git
cd TurboDedup
# Basic usage (no installation required)
python3 turbodedup.py --help
# Install in development mode
pip install -e .
# Install with all features for development
pip install -e .[all,dev]
```

## Quick Start

```bash
# Using installed package
turbodedup --enable-cache --enable-gpu
# Using direct script (development)
python3 turbodedup.py --enable-cache --enable-gpu
# High-performance scan with smart deletion
turbodedup --path /data --enable-cache --enable-gpu --delete-strategy keep_newest --delete-live
# Find similar images with GPU acceleration
turbodedup --path /photos --image-similarity --enable-gpu --delete-strategy keep_original
```

### Performance Tuning

```bash
# Maximum performance configuration
turbodedup --workers 16 --chunk-size 4MB --algorithm xxhash --enable-gpu --enable-cache
# Memory-constrained systems
turbodedup --workers 4 --chunk-size 1MB --enable-cache
# Network drives
turbodedup --workers 4 --chunk-size 256KB --retry-attempts 5 --enable-cache
```
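Conceptually, `--workers` maps to the classic pattern of farming file hashing out to a pool of workers. A minimal sketch of that pattern (illustrative only, not TurboDedup's internals):

```python
# Illustrative worker-pool hashing in the spirit of --workers/--chunk-size;
# not TurboDedup's internal code.
import hashlib
from concurrent.futures import ThreadPoolExecutor

def hash_one(path, chunk_size=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return path, h.hexdigest()

def hash_many(paths, workers=4):
    """Hashing large files is mostly disk-bound, so a thread pool scales well."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(hash_one, paths))
```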
### Deletion Strategies

```bash
# Interactive selection (default)
turbodedup --delete-strategy interactive
# Automatic strategies
turbodedup --delete-strategy keep_newest # Keep most recent
turbodedup --delete-strategy keep_original # Smart pattern recognition
turbodedup --delete-strategy keep_priority   # Priority directories
```
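The `keep_newest` strategy boils down to ranking each duplicate group by modification time. A simplified sketch of the selection logic (illustrative, not TurboDedup's code; pair any real deletion with dry-run mode first):

```python
# Simplified keep_newest selection: keep the most recently modified copy
# in each group of identical files; illustrative, not TurboDedup's code.
import os

def plan_keep_newest(group):
    """Return (keeper, deletable) for one group of duplicate file paths."""
    ranked = sorted(group, key=os.path.getmtime, reverse=True)
    return ranked[0], ranked[1:]
```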
### Utilities

```bash
# Check GPU capabilities
turbodedup --gpu-info
# Cache statistics and management
turbodedup --cache-stats
turbodedup --clear-cache
# Export results
turbodedup --export csv --export-path results.csv
```

## Use Cases

- Photo Libraries: Find duplicate photos across devices and cloud storage
- Music Collections: Identify duplicate songs in different formats/bitrates
- Download Cleanup: Remove duplicate downloads with intelligent original detection
- Storage Optimization: Reclaim storage space across enterprise filesystems
- Backup Deduplication: Identify redundant backup files and archives
- Migration Projects: Clean up duplicate files during system migrations
- Media Libraries: Organize video/audio libraries with similarity detection
- Project Archives: Identify duplicate project files and assets
- Client Deliverables: Ensure unique deliverables without duplicates
## How It Works

### Three-Phase Optimization

1. Discovery Phase: Fast filesystem traversal with intelligent filtering
2. Partial Hash Phase: Smart sampling for large files (head + tail segments; see the sketch after this list)
3. Full Hash Phase: Complete hashing only for potential duplicates
4. Similarity Phase: Advanced algorithms for near-duplicate detection
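To make the partial-hash idea concrete, here is a minimal sketch of head + tail sampling with an assumed 1 MB segment size (TurboDedup's actual segment sizes and signature layout may differ). Files whose partial signatures differ cannot be byte-identical, so only matching signatures advance to the full-hash phase:

```python
# Head + tail partial hashing: a cheap filter before full hashing.
# SEGMENT is an illustrative value, not TurboDedup's actual setting.
import hashlib
import os

SEGMENT = 1 << 20  # 1 MB from each end

def partial_signature(path):
    size = os.path.getsize(path)
    h = hashlib.sha256()
    with open(path, "rb") as f:
        h.update(f.read(SEGMENT))          # head segment
        if size > 2 * SEGMENT:
            f.seek(-SEGMENT, os.SEEK_END)  # jump over the middle entirely
            h.update(f.read(SEGMENT))      # tail segment
    return (size, h.hexdigest())           # group by size + partial hash
```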
### Smart Caching System

- Persistent SQLite Database: Stores computed hashes with metadata (sketched after this list)
- Automatic Validation: File size and modification time verification
- Performance Tracking: Hit rates, I/O savings, and efficiency metrics
- Intelligent Cleanup: Automatic cache maintenance and optimization
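The cache contract is simple: a stored hash is trusted only while the file's size and modification time still match. A minimal sketch of that validation logic with `sqlite3` (the schema here is illustrative, not TurboDedup's actual one):

```python
# Minimal sketch of a persistent hash cache keyed by path and validated by
# size + mtime; the schema is illustrative, not TurboDedup's actual one.
import sqlite3

con = sqlite3.connect("hash_cache.db")
con.execute("""CREATE TABLE IF NOT EXISTS hashes (
    path TEXT PRIMARY KEY, size INTEGER, mtime REAL, digest TEXT)""")

def cached_digest(path, size, mtime):
    """Return the cached digest, or None if the file changed or is unknown."""
    row = con.execute(
        "SELECT size, mtime, digest FROM hashes WHERE path = ?", (path,)
    ).fetchone()
    if row and row[0] == size and row[1] == mtime:
        return row[2]  # cache hit: no file I/O needed
    return None        # miss or stale entry: rehash and store

def store_digest(path, size, mtime, digest):
    con.execute("INSERT OR REPLACE INTO hashes VALUES (?, ?, ?, ?)",
                (path, size, mtime, digest))
    con.commit()
```

Because a stale entry simply falls through to a rehash, the cache can never serve a wrong answer for a modified file.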
### GPU Acceleration

- Multi-Backend Support: CUDA (NVIDIA) and OpenCL (AMD/Intel)
- Batch Processing: Optimized GPU utilization with configurable batch sizes
- Automatic Fallback: Seamless CPU fallback when GPU unavailable (see the probe sketch after this list)
- Memory Management: Smart memory allocation and cleanup
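Automatic fallback usually amounts to probing for a GPU backend at startup and defaulting to the CPU path on any failure. A sketch of that probe, assuming the common `pycuda`/`pyopencl` bindings (TurboDedup's detection may differ):

```python
# Sketch of the automatic-fallback idea: probe for a GPU backend at startup
# and silently use the CPU path when none is available. Probing via
# pycuda/pyopencl is an assumption; TurboDedup's detection may differ.
def detect_gpu_backend():
    try:
        import pycuda.driver as cuda  # NVIDIA path
        cuda.init()
        if cuda.Device.count() > 0:
            return "cuda"
    except Exception:
        pass
    try:
        import pyopencl as cl         # AMD/Intel path
        if any(p.get_devices() for p in cl.get_platforms()):
            return "opencl"
    except Exception:
        pass
    return None                       # fall back to CPU hashing

backend = detect_gpu_backend()
print(f"hashing backend: {backend or 'cpu'}")
```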
## System Requirements

### Minimum

- Python: 3.8+ (3.9+ recommended for best performance)
- RAM: 1GB (4GB+ recommended for large datasets)
- Storage: 100MB for application, additional space for cache
### Recommended

- GPU: NVIDIA (CUDA 11.8+), AMD/Intel (OpenCL 2.0+)
- RAM: 8GB+ for GPU acceleration, 16GB+ for TB-scale datasets
- Storage: SSD recommended for optimal performance
### Supported Platforms

- Windows: 10/11 (x64)
- Linux: Ubuntu 20.04+, CentOS 8+, Debian 11+
- macOS: 11+ (Intel and Apple Silicon)
## Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

```bash
git clone https://github.com/arjaygg/TurboDedup.git
cd TurboDedup
pip install -r requirements_enhanced.txt
turbodedup --gpu-info   # Test installation
```

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Support
- Documentation: Full Documentation
- Issues: GitHub Issues
- Discussions: GitHub Discussions
If TurboDedup helps you reclaim storage space and improve performance, please give us a star! ⭐
TurboDedup - The fastest way to find and manage duplicate files