TurboDedup

High-Performance File Deduplication Scanner with Intelligent Optimization

TurboDedup is the definitive file deduplication solution featuring intelligent three-phase optimization, smart caching, GPU acceleration, and similarity detection. Achieve 10-100x performance improvements over traditional duplicate detection tools.

🚀 Key Features

Core Engine

Three-Phase Optimization: Intelligent partial hashing for large files (10-100x faster)
Smart Caching System: Persistent SQLite cache eliminates redundant I/O (80-95% hit rates)
GPU Acceleration: CUDA/OpenCL support for massive performance gains (5-15x speedup)
Similarity Detection: Find near-duplicates with perceptual hashing (images, audio, documents)

Advanced Capabilities

Multiple Hash Algorithms: MD5, SHA1, SHA256, xxHash support
Intelligent Deletion Strategies: Keep newest, oldest, original files with pattern recognition
Progress Tracking: Real-time progress with ETA and throughput metrics
Export Options: CSV/JSON export with comprehensive metadata
Resume Capability: Interrupt and resume scans seamlessly

Enterprise Ready

Cross-Platform: Windows, Linux, macOS support
Scalable: Handles TB-scale datasets efficiently
Safe Operations: Dry-run mode, backups, and verification
Comprehensive Logging: Detailed error handling and audit trails

📊 Performance

Feature	Benefit	Improvement
Partial Hashing	Large file optimization	10-100x faster
Smart Caching	Eliminates redundant I/O	10-50x on repeat scans
GPU Acceleration	Parallel hash computation	5-15x speedup
Combined	All optimizations together	Up to 1000x improvement

⚡ Quick Start

Installation Options

Option 1: Package Installation (Recommended)

# Basic installation
pip install turbodedup

# With GPU acceleration
pip install turbodedup[gpu]

# With similarity detection
pip install turbodedup[similarity]

# Full installation with all features
pip install turbodedup[all]

Option 2: Development Installation

git clone https://github.com/arjaygg/TurboDedup.git
cd TurboDedup

# Basic usage (no installation required)
python3 turbodedup.py --help

# Install in development mode
pip install -e .

# Install with all features for development
pip install -e .[all,dev]

Basic Usage

# Using installed package
turbodedup --enable-cache --enable-gpu

# Using direct script (development)
python3 turbodedup.py --enable-cache --enable-gpu

# High-performance scan with smart deletion
turbodedup --path /data --enable-cache --enable-gpu --delete-strategy keep_newest --delete-live

# Find similar images with GPU acceleration
turbodedup --path /photos --image-similarity --enable-gpu --delete-strategy keep_original

🔧 Core Commands

Performance Optimization

# Maximum performance configuration
turbodedup --workers 16 --chunk-size 4MB --algorithm xxhash --enable-gpu --enable-cache

# Memory-constrained systems
turbodedup --workers 4 --chunk-size 1MB --enable-cache

# Network drives
turbodedup --workers 4 --chunk-size 256KB --retry-attempts 5 --enable-cache

Deletion Strategies

# Interactive selection (default)
turbodedup --delete-strategy interactive

# Automatic strategies
turbodedup --delete-strategy keep_newest    # Keep most recent
turbodedup --delete-strategy keep_original  # Smart pattern recognition
turbodedup --delete-strategy keep_priority  # Priority directories

System Management

# Check GPU capabilities
turbodedup --gpu-info

# Cache statistics and management
turbodedup --cache-stats
turbodedup --clear-cache

# Export results
turbodedup --export csv --export-path results.csv

🎯 Use Cases

Home Users

Photo Libraries: Find duplicate photos across devices and cloud storage
Music Collections: Identify duplicate songs in different formats/bitrates
Download Cleanup: Remove duplicate downloads with intelligent original detection

IT Professionals

Storage Optimization: Reclaim storage space across enterprise filesystems
Backup Deduplication: Identify redundant backup files and archives
Migration Projects: Clean up duplicate files during system migrations

Content Creators

Media Libraries: Organize video/audio libraries with similarity detection
Project Archives: Identify duplicate project files and assets
Client Deliverables: Ensure unique deliverables without duplicates

🏗️ Architecture

Three-Phase Optimization

Discovery Phase: Fast filesystem traversal with intelligent filtering
Partial Hash Phase: Smart sampling for large files (head + tail segments)
Full Hash Phase: Complete hashing only for potential duplicates
Similarity Phase: Advanced algorithms for near-duplicate detection

Smart Caching System

Persistent SQLite Database: Stores computed hashes with metadata
Automatic Validation: File size and modification time verification
Performance Tracking: Hit rates, I/O savings, and efficiency metrics
Intelligent Cleanup: Automatic cache maintenance and optimization

GPU Acceleration

Multi-Backend Support: CUDA (NVIDIA) and OpenCL (AMD/Intel)
Batch Processing: Optimized GPU utilization with configurable batch sizes
Automatic Fallback: Seamless CPU fallback when GPU unavailable
Memory Management: Smart memory allocation and cleanup

📋 System Requirements

Minimum Requirements

Python: 3.8+ (3.9+ recommended for best performance)
RAM: 1GB (4GB+ recommended for large datasets)
Storage: 100MB for application, additional space for cache

Recommended for Enhanced Features

GPU: NVIDIA (CUDA 11.8+), AMD/Intel (OpenCL 2.0+)
RAM: 8GB+ for GPU acceleration, 16GB+ for TB-scale datasets
Storage: SSD recommended for optimal performance

Platform Support

Windows: 10/11 (x64)
Linux: Ubuntu 20.04+, CentOS 8+, Debian 11+
macOS: 11+ (Intel and Apple Silicon)

🤝 Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

Development Setup

git clone https://github.com/arjaygg/TurboDedup.git
cd TurboDedup
pip install -r requirements_enhanced.txt
turbodedup --gpu-info  # Test installation

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🔗 Links

Documentation: Full Documentation
Issues: GitHub Issues
Discussions: GitHub Discussions

⭐ Support

If TurboDedup helps you reclaim storage space and improve performance, please give us a star! ⭐

TurboDedup - The fastest way to find and manage duplicate files

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
turbodedup		turbodedup
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.md		README.md
SMART_SYMLINK_EXAMPLES.md		SMART_SYMLINK_EXAMPLES.md
TASKS.md		TASKS.md
pyproject.toml		pyproject.toml
requirements-dev.txt		requirements-dev.txt
requirements.txt		requirements.txt
requirements_enhanced.txt		requirements_enhanced.txt
setup.py		setup.py
test_smart_symlink_patterns.py		test_smart_symlink_patterns.py
test_symlink_feature.py		test_symlink_feature.py
turbodedup.py		turbodedup.py

License

arjaygg/TurboDedup

Folders and files

Latest commit

History

Repository files navigation