Winner: UIC Engineering Expo 2023 Best in Show Award
- Overview
- Key Features
- Technical Approach
- Dataset
- Model Performance
- Installation & Usage
- Project Structure
- Future Roadmap
- Acknowledgments
SecureSense is an advanced machine learning-based framework designed to detect and prevent phishing attacks with high accuracy. Developed as a capstone project for a B.S. in Computer Science with a Machine Learning major and Business Analytics minor, this project addresses the growing threat of phishing attacks targeting university students and the broader online community.
The framework uses supervised learning to analyze website characteristics and classify sites as legitimate or phishing. The pruned Decision Tree model reaches 94.35% accuracy on the held-out test set.
- Multi-Model Approach: Implements and compares three complementary ML algorithms (Decision Tree, Logistic Regression, Random Forest) for phishing detection
- 48-Feature Analysis: Analyzes diverse website characteristics including URL structure, domain properties, and page content
- High Performance: Achieves 92.5% to 95.1% test accuracy across the three implemented models
- Balanced Dataset: Built on 10,000 labeled samples (5,000 per class), which avoids class-imbalance bias
- Real-time Detection: Optimized Decision Tree model for production deployment
- Interpretable Results: Feature importance analysis and model explainability
This project implements and compares three widely used classification algorithms:
Decision Tree (Pruned):
- Algorithm: CART (Classification and Regression Trees) with Gini impurity
- Training Accuracy: 94.61%
- Test Accuracy: 94.35%
- Key Advantages:
  - Fast inference time ideal for production
  - Interpretable decision paths
  - No feature scaling required
- Optimization: Cost-complexity pruning (α = 0.010) to prevent overfitting
Logistic Regression:
- Algorithm: Binary logistic regression with L2 regularization
- Training Accuracy: 92.8%
- Test Accuracy: 92.5%
- Key Advantages:
  - Probabilistic predictions
  - Linear decision boundaries
  - Fast training and prediction
Random Forest:
- Algorithm: Ensemble of decision trees with bagging
- Training Accuracy: 96.2%
- Test Accuracy: 95.1%
- Key Advantages:
  - Highest overall accuracy
  - Robust to overfitting
  - Feature importance ranking
- Configuration: 100 estimators with bootstrap sampling
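The three configurations described above might be set up with scikit-learn roughly as follows. This is a minimal sketch, not the project's actual training code: it uses synthetic data from `make_classification` as a stand-in for the 48-feature dataset, and the `build_models` helper and the `max_iter` value are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def build_models(seed=42):
    """Return the three classifiers with the settings described above."""
    return {
        # CART with Gini impurity and cost-complexity pruning (alpha = 0.010)
        "Decision Tree": DecisionTreeClassifier(
            criterion="gini", ccp_alpha=0.010, random_state=seed
        ),
        # Binary logistic regression; scikit-learn's default penalty is L2.
        # max_iter=1000 is an assumption to help convergence on unscaled data.
        "Logistic Regression": LogisticRegression(max_iter=1000),
        # Bagged ensemble of 100 trees with bootstrap sampling
        "Random Forest": RandomForestClassifier(
            n_estimators=100, bootstrap=True, random_state=seed
        ),
    }


# Synthetic stand-in data: 1,000 samples, 48 features (NOT the real dataset)
X, y = make_classification(n_samples=1000, n_features=48, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scores = {}
for name, model in build_models().items():
    model.fit(X_train, y_train)
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"{name}: train={scores[name][0]:.3f}, test={scores[name][1]:.3f}")
```

Accuracies on the synthetic data will differ from the figures reported for the real dataset; the sketch only illustrates how the three models can be trained and compared side by side.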
The framework analyzes 48 distinct features categorized into:
URL-based Features:
- Number of dots, dashes, special characters
- URL length and subdomain level
- Presence of IP address, HTTPS, suspicious patterns
Domain Features:
- Hostname length and structure
- Domain in paths/subdomains
- Path and query component analysis
Page Content Features:
- External hyperlinks percentage
- Form action attributes
- JavaScript and iframe usage
- Meta and script tags analysis
Behavioral Features:
- Right-click disabled
- Pop-up windows
- Fake status bar links
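As an illustration of the URL-based category, a few of these features can be derived from a URL string with the Python standard library alone. This is a hedged sketch: the `extract_url_features` function and its dictionary keys are hypothetical names, not the dataset's column names, and the subdomain count is a rough heuristic that ignores multi-part TLDs.

```python
import ipaddress
from urllib.parse import urlparse


def extract_url_features(url):
    """Compute a few illustrative URL-based features (hypothetical names)."""
    parsed = urlparse(url)
    hostname = parsed.hostname or ""

    # Check whether the host is a literal IP address rather than a domain name
    try:
        ipaddress.ip_address(hostname)
        has_ip = True
    except ValueError:
        has_ip = False

    # Rough subdomain level: labels beyond "domain.tld" (ignores multi-part TLDs)
    labels = [part for part in hostname.split(".") if part]
    subdomain_level = 0 if has_ip else max(len(labels) - 2, 0)

    return {
        "num_dots": url.count("."),
        "num_dashes": url.count("-"),
        "url_length": len(url),
        "subdomain_level": subdomain_level,
        "hostname_length": len(hostname),
        "has_ip_address": has_ip,
        "uses_https": parsed.scheme == "https",
    }


features = extract_url_features("http://login.secure-bank.example.com/verify?id=1")
print(features)
```

Page content and behavioral features require fetching and parsing the page itself, so they are outside the scope of this sketch.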
Classification Report (Decision Tree - Test Set):
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Not Phishing | 0.94 | 0.95 | 0.94 | 1019 |
| Phishing | 0.94 | 0.94 | 0.94 | 981 |
| Accuracy | | | 0.94 | 2000 |
| Macro avg | 0.94 | 0.94 | 0.94 | 2000 |
| Weighted avg | 0.94 | 0.94 | 0.94 | 2000 |
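The precision, recall, and F1 values in a report like the one above follow directly from per-class counts of true positives, false positives, and false negatives. The sketch below shows the arithmetic; the `classification_metrics` helper is a hypothetical name and the counts are made up for illustration, not taken from the project's confusion matrix.

```python
def classification_metrics(tp, fp, fn):
    """Precision, recall, and F1 for one class from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return precision, recall, f1


# Hypothetical counts for one class (NOT the project's actual results)
tp, fp, fn = 90, 10, 5
precision, recall, f1 = classification_metrics(tp, fp, fn)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: precision=0.90 recall=0.95 f1=0.92
```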
Statistical Methods Applied:
- Mutual Information: Feature selection and relevance scoring
- Spearman Correlation: Feature independence analysis
- Gini Impurity: Decision tree splitting criterion
- Hold-out Validation: 80-20 train-test split with random sampling
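Of the methods listed, Gini impurity is the easiest to show in isolation. For a node with class proportions p_k, the impurity is 1 minus the sum of p_k squared, and a candidate split is scored by the size-weighted impurity of its children. The sketch below is a plain-Python illustration with toy labels; the function names are hypothetical and this is not how scikit-learn implements the criterion internally.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 - sum(p_k^2) over class proportions p_k."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def split_impurity(left, right):
    """Size-weighted Gini impurity of a candidate split."""
    total = len(left) + len(right)
    return (len(left) / total) * gini_impurity(left) + (
        len(right) / total
    ) * gini_impurity(right)


# Toy labels (1 = phishing, 0 = legitimate); illustrative only
parent = [1, 1, 1, 1, 0, 0, 0, 0]
left, right = [1, 1, 1, 0], [1, 0, 0, 0]

print(gini_impurity(parent))        # 0.5 for a perfectly balanced node
print(split_impurity(left, right))  # 0.375, so this split reduces impurity
```

A tree learner evaluates many candidate splits this way and keeps the one with the lowest weighted impurity.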
Source: Phishing Legitimate Full Dataset (Mendeley Data)
Specifications:
- Total Samples: 10,000
- Legitimate Websites: 5,000
- Phishing Websites: 5,000
- Features: 48 website characteristics
- Class Balance: Perfectly balanced (50-50 split)
- Data Format: CSV with labeled samples
Data Split:
- Training Set: 8,000 samples (80%)
- Test Set: 2,000 samples (20%)
- Random sampling with seed=42 for reproducibility
| Model | Training Accuracy | Test Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|---|
| Decision Tree (Pruned) | 94.61% | 94.35% | 0.94 | 0.94 | 0.94 |
| Logistic Regression | 92.80% | 92.50% | 0.93 | 0.93 | 0.93 |
| Random Forest | 96.20% | 95.10% | 0.95 | 0.95 | 0.95 |
Key Insights:
- All models demonstrate strong generalization with minimal overfitting
- Random Forest achieves highest accuracy but with increased computational cost
- Decision Tree offers optimal balance between performance and inference speed
- Consistent precision and recall indicate robust performance across both classes
Production Model Selection: The Decision Tree model is recommended for deployment due to:
- Near-optimal accuracy (94.35%)
- Fastest prediction time
- Lower memory footprint
- Interpretable decision rules
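Two of the deployment points above, interpretable rules and a small footprint, can be illustrated with scikit-learn's `export_text` and with `joblib` persistence. This is a sketch under stated assumptions: the data is synthetic, the feature names and the `decision_tree.joblib` file name are hypothetical, and the project's own deployment code may differ.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data (NOT the real dataset); feature names are hypothetical
X, y = make_classification(n_samples=500, n_features=6, random_state=42)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

clf = DecisionTreeClassifier(ccp_alpha=0.010, random_state=42).fit(X, y)

# Human-readable decision rules for the fitted tree
rules = export_text(clf, feature_names=feature_names)
print(rules)

# Persist the model to disk and reload it, as a deployment might do
model_path = os.path.join(tempfile.mkdtemp(), "decision_tree.joblib")
joblib.dump(clf, model_path)
restored = joblib.load(model_path)

# The reloaded model should give identical predictions
same = bool((restored.predict(X) == clf.predict(X)).all())
size_bytes = os.path.getsize(model_path)
print(f"Leaves: {clf.get_n_leaves()}, size on disk: {size_bytes} bytes, match: {same}")
```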
- Python 3.8+ (Python 3.10 or 3.11 recommended)
- pip (Python package installer)
- git (for cloning the repository)
git clone https://github.com/khuynh22/SecureSense-A-Data-Driven-Framework-for-Phishing-Attack-Prevention.git
cd SecureSense-A-Data-Driven-Framework-for-Phishing-Attack-Prevention
Windows (PowerShell):
python -m venv .venv
.venv\Scripts\Activate.ps1
macOS/Linux:
python3 -m venv .venv
source .venv/bin/activate
Install dependencies:
pip install -r requirements.txt
This will install all required packages:
- Flask (3.0.0) - Web framework
- pandas (2.1.3) - Data manipulation
- numpy (1.26.2) - Numerical computing
- scikit-learn (1.3.2) - Machine learning algorithms
- plotly (5.18.0) - Interactive visualizations
- matplotlib (3.8.2) & seaborn (0.13.0) - Data visualization
- joblib (1.3.2) - Model persistence
- jupyter (optional) - For running notebooks
- Activate the virtual environment (if not already active):
  - Windows: .venv\Scripts\Activate.ps1
  - macOS/Linux: source .venv/bin/activate
- Start the Flask server:
  python app.py
  You should see output like:
  * Running on http://127.0.0.1:5000
  * Restarting with stat
  * Debugger is active!
- Open your browser and navigate to http://127.0.0.1:5000
- Upload and analyze:
  - Drag and drop Phishing_Legitimate_full.csv or click "Browse Files"
  - Click "Analyze Dataset"
  - Wait 10-30 seconds for model training
  - View interactive results with performance metrics and visualizations
- Stop the server: Press CTRL+C in the terminal
# Load and prepare data
import pandas as pd
from sklearn import tree
from sklearn.model_selection import train_test_split

# Load dataset
df = pd.read_csv('Phishing_Legitimate_full.csv')

# Prepare features and labels (drop the label and, if present, the row id)
X = df.drop(['CLASS_LABEL'], axis=1)
if 'id' in X.columns:
    X = X.drop(['id'], axis=1)
y = df['CLASS_LABEL']

# Split data (80/20, fixed seed for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Decision Tree model (optimized parameters)
clf = tree.DecisionTreeClassifier(ccp_alpha=0.010, random_state=42)
clf.fit(X_train, y_train)

# Evaluate
accuracy = clf.score(X_test, y_test)
print(f"Test Accuracy: {accuracy:.4f}")

# Predict on new data. A real deployment would pass features extracted from a
# new website; here the first test row stands in as a placeholder.
new_website_features = X_test.iloc[[0]]
prediction = clf.predict(new_website_features)
- Ensure Jupyter is installed:
  pip install jupyter notebook
- Start Jupyter:
  jupyter notebook
- Open a notebook:
  - Decision_Tree_for_Phishing_Attack.ipynb
  - Phishing_Detection_Using_Logistic_Regression_and_Random_Forest_Classifier.ipynb
- Run cells sequentially (Shift+Enter)
Issue: ModuleNotFoundError: No module named 'flask'
- Solution: Ensure the virtual environment is activated and run pip install -r requirements.txt

Issue: Port 5000 already in use
- Solution: Change the port in app.py (line 265): app.run(debug=True, port=5001)

Issue: CSV upload fails
- Solution: Verify the CSV contains a CLASS_LABEL or labels column with binary values (0/1)

Issue: Models take too long to train
- Solution: Reduce the dataset size or use a smaller n_estimators value for Random Forest
For more details, see WEB_APP_GUIDE.md
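For the CSV upload issue, a quick local check of the label column can save a round trip through the web app. The sketch below uses only the standard library; `find_label_column` is a hypothetical helper, not a function from `app.py`, and the sample rows are illustrative.

```python
import csv
import io


def find_label_column(csv_text, candidates=("CLASS_LABEL", "labels")):
    """Return the label column name if it exists and holds only 0/1 values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = reader.fieldnames or []
    column = next((name for name in candidates if name in fieldnames), None)
    if column is None:
        raise ValueError(f"CSV needs one of these columns: {candidates}")

    # Collect the distinct label values; missing cells count as empty strings
    values = {(row[column] or "").strip() for row in reader}
    if not values <= {"0", "1"}:
        raise ValueError(f"Column {column!r} has non-binary values: {sorted(values)}")
    return column


# Illustrative sample only; the real file has many more columns and rows
sample = "id,NumDots,CLASS_LABEL\n1,3,1\n2,1,0\n"
print(find_label_column(sample))  # prints: CLASS_LABEL
```

To check a file on disk, read its contents first, for example with `open(path, newline="").read()`, and pass the resulting string to the helper.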
SecureSense/
├── README.md
├── Phishing_Legitimate_full.csv # Dataset
├── decision_tree_for_phishing_attack.py # Decision Tree implementation
├── phishing_detection_using_logistic_regression_and_random_forest_classifier.py
├── Decision_Tree_for_Phishing_Attack.ipynb
├── Phishing_Detection_Using_Logistic_Regression_and_Random_Forest_Classifier.ipynb
├── Data Preprocessing # Data cleaning scripts
├── Web Scraping.ipynb # Feature extraction notebook
├── Phishing Attacks Poster.pdf # Research poster
└── SecureSense_ A Data-Driven Framework for Phishing Attack Prevention.pdf
- Design and implement web interface using Figma wireframes
- Integrate real-time URL analysis
- Deploy NLP model for text feature extraction
- Implement automated web scraping for feature generation
- Deep learning models (CNN/RNN for URL pattern recognition)
- Browser extension for real-time protection
- API endpoint for third-party integration
- Continuous learning from new phishing samples
- Cloud infrastructure setup (AWS/Azure)
- Load balancing and scalability optimization
- User authentication and database integration
- Comprehensive documentation and API guide
Target Release: Q4 2024
This project was developed as an academic capstone. Contributions, suggestions, and feedback are welcome! Please feel free to open issues or submit pull requests.
This project is licensed under the MIT License - see the LICENSE file for details.
This research was conducted under the guidance of:
- Professor Mitchell Theys - Project Supervisor, Department of Computer Science, UIC
- Professor Xinhua Zhang - Technical Advisor, Department of Computer Science, UIC
Special thanks to the University of Illinois Chicago (UIC) Computer Science Department for providing resources and support for this capstone project.
If you use this framework in your research, please cite:
@software{SecureSense2023,
  title={SecureSense: A Data-Driven Framework for Phishing Attack Prevention},
  author={Huynh, Nguyen},
  year={2023},
  institution={University of Illinois Chicago},
  note={UIC Engineering Expo 2023 Best in Show Winner}
}

For questions, collaborations, or more information about this project:
- GitHub: @khuynh22
- Project Repository: SecureSense
Leveraging Machine Learning to Combat Cybersecurity Threats

