Skip to content

arjaygg/TurboDedup

Repository files navigation

TurboDedup

High-Performance File Deduplication Scanner with Intelligent Optimization

Python 3.8+ License: MIT Performance

TurboDedup is the definitive file deduplication solution featuring intelligent three-phase optimization, smart caching, GPU acceleration, and similarity detection. Achieve 10-100x performance improvements over traditional duplicate detection tools.

πŸš€ Key Features

Core Engine

  • Three-Phase Optimization: Intelligent partial hashing for large files (10-100x faster)
  • Smart Caching System: Persistent SQLite cache eliminates redundant I/O (80-95% hit rates)
  • GPU Acceleration: CUDA/OpenCL support for massive performance gains (5-15x speedup)
  • Similarity Detection: Find near-duplicates with perceptual hashing (images, audio, documents)

Advanced Capabilities

  • Multiple Hash Algorithms: MD5, SHA1, SHA256, xxHash support
  • Intelligent Deletion Strategies: Keep newest, oldest, original files with pattern recognition
  • Progress Tracking: Real-time progress with ETA and throughput metrics
  • Export Options: CSV/JSON export with comprehensive metadata
  • Resume Capability: Interrupt and resume scans seamlessly

Enterprise Ready

  • Cross-Platform: Windows, Linux, macOS support
  • Scalable: Handles TB-scale datasets efficiently
  • Safe Operations: Dry-run mode, backups, and verification
  • Comprehensive Logging: Detailed error handling and audit trails

πŸ“Š Performance

Feature Benefit Improvement
Partial Hashing Large file optimization 10-100x faster
Smart Caching Eliminates redundant I/O 10-50x on repeat scans
GPU Acceleration Parallel hash computation 5-15x speedup
Combined All optimizations together Up to 1000x improvement

⚑ Quick Start

Installation Options

Option 1: Package Installation (Recommended)

# Basic installation
pip install turbodedup

# With GPU acceleration
pip install turbodedup[gpu]

# With similarity detection
pip install turbodedup[similarity]

# Full installation with all features
pip install turbodedup[all]

Option 2: Development Installation

git clone https://github.com/arjaygg/TurboDedup.git
cd TurboDedup

# Basic usage (no installation required)
python3 turbodedup.py --help

# Install in development mode
pip install -e .

# Install with all features for development
pip install -e .[all,dev]

Basic Usage

# Using installed package
turbodedup --enable-cache --enable-gpu

# Using direct script (development)
python3 turbodedup.py --enable-cache --enable-gpu

# High-performance scan with smart deletion
turbodedup --path /data --enable-cache --enable-gpu --delete-strategy keep_newest --delete-live

# Find similar images with GPU acceleration
turbodedup --path /photos --image-similarity --enable-gpu --delete-strategy keep_original

πŸ”§ Core Commands

Performance Optimization

# Maximum performance configuration
turbodedup --workers 16 --chunk-size 4MB --algorithm xxhash --enable-gpu --enable-cache

# Memory-constrained systems
turbodedup --workers 4 --chunk-size 1MB --enable-cache

# Network drives
turbodedup --workers 4 --chunk-size 256KB --retry-attempts 5 --enable-cache

Deletion Strategies

# Interactive selection (default)
turbodedup --delete-strategy interactive

# Automatic strategies
turbodedup --delete-strategy keep_newest    # Keep most recent
turbodedup --delete-strategy keep_original  # Smart pattern recognition
turbodedup --delete-strategy keep_priority  # Priority directories

System Management

# Check GPU capabilities
turbodedup --gpu-info

# Cache statistics and management
turbodedup --cache-stats
turbodedup --clear-cache

# Export results
turbodedup --export csv --export-path results.csv

🎯 Use Cases

Home Users

  • Photo Libraries: Find duplicate photos across devices and cloud storage
  • Music Collections: Identify duplicate songs in different formats/bitrates
  • Download Cleanup: Remove duplicate downloads with intelligent original detection

IT Professionals

  • Storage Optimization: Reclaim storage space across enterprise filesystems
  • Backup Deduplication: Identify redundant backup files and archives
  • Migration Projects: Clean up duplicate files during system migrations

Content Creators

  • Media Libraries: Organize video/audio libraries with similarity detection
  • Project Archives: Identify duplicate project files and assets
  • Client Deliverables: Ensure unique deliverables without duplicates

πŸ—οΈ Architecture

Three-Phase Optimization

  1. Discovery Phase: Fast filesystem traversal with intelligent filtering
  2. Partial Hash Phase: Smart sampling for large files (head + tail segments)
  3. Full Hash Phase: Complete hashing only for potential duplicates
  4. Similarity Phase: Advanced algorithms for near-duplicate detection

Smart Caching System

  • Persistent SQLite Database: Stores computed hashes with metadata
  • Automatic Validation: File size and modification time verification
  • Performance Tracking: Hit rates, I/O savings, and efficiency metrics
  • Intelligent Cleanup: Automatic cache maintenance and optimization

GPU Acceleration

  • Multi-Backend Support: CUDA (NVIDIA) and OpenCL (AMD/Intel)
  • Batch Processing: Optimized GPU utilization with configurable batch sizes
  • Automatic Fallback: Seamless CPU fallback when GPU unavailable
  • Memory Management: Smart memory allocation and cleanup

πŸ“‹ System Requirements

Minimum Requirements

  • Python: 3.8+ (3.9+ recommended for best performance)
  • RAM: 1GB (4GB+ recommended for large datasets)
  • Storage: 100MB for application, additional space for cache

Recommended for Enhanced Features

  • GPU: NVIDIA (CUDA 11.8+), AMD/Intel (OpenCL 2.0+)
  • RAM: 8GB+ for GPU acceleration, 16GB+ for TB-scale datasets
  • Storage: SSD recommended for optimal performance

Platform Support

  • Windows: 10/11 (x64)
  • Linux: Ubuntu 20.04+, CentOS 8+, Debian 11+
  • macOS: 11+ (Intel and Apple Silicon)

🀝 Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

Development Setup

git clone https://github.com/arjaygg/TurboDedup.git
cd TurboDedup
pip install -r requirements_enhanced.txt
turbodedup --gpu-info  # Test installation

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ”— Links

⭐ Support

If TurboDedup helps you reclaim storage space and improve performance, please give us a star! ⭐


TurboDedup - The fastest way to find and manage duplicate files

About

High-Performance File Deduplication Scanner with GPU Acceleration & Smart Caching

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages