Risk Analysis: Credit Card Fraud Detection

This study explores methods for detecting fraudulent credit card transactions using a dataset of 284,807 transactions made by European cardholders over two days. Only 492 of these transactions (0.173%) were fraudulent, a severe class imbalance. The research compares logistic regression, random forest, and support vector machine (SVM) classifiers, evaluating their accuracy, reliability, and applicability to fraud detection.
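
To make the imbalance concrete, the short calculation below is a sketch that uses only the two counts quoted above. It shows the fraud rate and the accuracy a trivial classifier would reach by labeling every transaction as legitimate. The variable names are illustrative and do not come from the study.

```python
# Counts quoted in the summary above.
total_transactions = 284_807
fraud_transactions = 492

# Share of transactions that are fraudulent.
fraud_rate = fraud_transactions / total_transactions

# A classifier that always predicts "legitimate" is correct on every
# non-fraud row and wrong on every fraud row.
baseline_accuracy = (total_transactions - fraud_transactions) / total_transactions

print(f"Fraud rate: {fraud_rate:.3%}")                          # 0.173%
print(f"Always-legitimate accuracy: {baseline_accuracy:.3%}")   # 99.827%
```

Because a model that never flags fraud already scores about 99.8% accuracy, plain accuracy says little here, which is one reason a ranking metric such as AUC is more informative on this kind of data.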

Key Findings

1. Dataset Characteristics: The dataset is highly imbalanced and anonymized, with most features derived through Principal Component Analysis (PCA). Two features are notable: Time, the number of seconds elapsed since the first transaction, and Amount, the transaction value. The study kept the dataset's original distribution for analysis.

2. Logistic Regression: The logistic regression model demonstrated strong performance with AUC values of 0.976 (training) and 0.985 (test). While it effectively handled the imbalanced data, its simplicity limited its ability to model complex relationships between variables.

3. Random Forest: The random forest model was tuned with grid search and validated with cross-validation. Because of computational constraints, it was trained on a smaller subsample (N=7,000), and its performance was inconsistent (AUC: 0.722 for training and 1.000 for testing). The study attributes this discrepancy to overfitting and poor generalizability to the full dataset, although a test AUC above the training AUC is the reverse of the usual overfitting pattern and may instead reflect how few fraud cases a subsample of this size contains.

4. Support Vector Machine (SVM): Using a radial basis function kernel, the SVM model achieved an AUC of 0.928 on the training set but only 0.648 on the test set, reflecting instability and overfitting. Parameter tuning and better handling of the class imbalance might improve its performance.
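
Each finding above compares a training AUC with a test AUC. As a rough illustration of what that number measures, the sketch below computes AUC from scratch as the fraction of (fraud, legitimate) pairs in which the fraud case receives the higher score, with ties counted as half. The function name `auc_score` and the toy labels and scores are hypothetical and are not taken from the study; in practice a library routine such as scikit-learn's `roc_auc_score` would normally be used.

```python
def auc_score(labels, scores):
    """AUC as the fraction of (fraud, legitimate) pairs ranked correctly.

    labels: 1 for fraud, 0 for legitimate.
    scores: model output, where higher means more suspicious.
    A tie between a fraud score and a legitimate score counts as half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")

    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Toy data (hypothetical): 2 fraud cases among 8 transactions.
labels = [0, 0, 1, 0, 0, 1, 0, 0]
scores = [0.10, 0.30, 0.80, 0.20, 0.90, 0.60, 0.05, 0.40]

# 10 of the 12 (fraud, legitimate) pairs are ranked correctly.
print(f"AUC: {auc_score(labels, scores):.3f}")
```

In this toy example a single high-scoring legitimate transaction removes 2 of the 12 pairs. When an evaluation set contains only a handful of fraud cases, one misranked transaction shifts the AUC noticeably, so estimates from small subsamples can swing widely between training and test sets.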