Build Fraud Detection Model
Say you are at Stripe, working on a fraud detection model for online transactions made on the websites of small businesses. Fraud means that the credit card usage was either erroneous (the cardholder will contest the transaction) or invalid (someone stole the credit card and used it, and there will be a dispute). How would you build such a model, how would you evaluate it, and what trade-offs are involved in the evaluation?
First you need to define the features and the target of interest to decide what kind of model to run.
Then you want to think about the relevant evaluation metrics for this type of model, tied to real-life business outcomes.
Finally, based on those business outcomes, weigh the trade-offs in trying to maximize or minimize certain metrics.
How does the model work? Is it providing a continuous or categorical output?
Do your evaluation metrics tie into real-life business outcomes?
What trade-offs are being evaluated based on the evaluation metrics?
We want to build a classifier whose target variable is whether a given transaction is fraudulent. The model produces a probability score for each transaction being fraudulent; if the score is above a certain threshold, it predicts 1 (fraud), and otherwise 0 (non-fraud). The labels, fraud or not, can be identified by looking at which past transactions were actually refunded or disputed. Features can describe the user (any known background data, IP address, associated Stripe activity), the transaction itself (amount, time of day), and so on, all incorporated at the transaction level.
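As a concrete illustration, the thresholding step described above can be sketched as follows (the 0.5 cutoff is an arbitrary placeholder; in practice the threshold would be tuned to the business's tolerance for each error type):

```python
def classify(prob_fraud, threshold=0.5):
    """Map a model's fraud probability to a hard label:
    1 = predicted fraud, 0 = predicted non-fraud."""
    return 1 if prob_fraud > threshold else 0

print(classify(0.87))  # -> 1 (flagged as fraud)
print(classify(0.12))  # -> 0 (allowed through)
```

Raising the threshold makes the model flag fewer transactions (fewer false positives, more false negatives); lowering it does the reverse.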
To build our model, we split the data into a training set and a testing set. Regardless of the classifier chosen (SVM, logistic regression, etc.), we can use k-fold cross-validation and regularization to address overfitting. After the model has been trained on the features of the training set, we evaluate it on the testing set by having it generate a probability of fraud for every test transaction.
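A minimal sketch of this pipeline, assuming scikit-learn and synthetic data standing in for real transaction features:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

# Synthetic stand-in data: 1,000 transactions, 5 features, ~5% fraud.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.05).astype(int)

# Hold out a test set; stratify so both sets keep the fraud rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Logistic regression with L2 regularization (C controls its strength),
# sanity-checked with 5-fold cross-validation on the training set.
clf = LogisticRegression(C=1.0)
cv_scores = cross_val_score(clf, X_train, y_train, cv=5)

# Fit on the full training set, then score every test transaction.
clf.fit(X_train, y_train)
prob_fraud = clf.predict_proba(X_test)[:, 1]  # P(fraud) per transaction
```

The `prob_fraud` scores are what get compared against the decision threshold at prediction time.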
To evaluate our classifier, we can look at precision and recall:
- Precision is defined as the number of true positives divided by the sum of true positives and false positives. A true positive, for this business context, is when the model predicts a fraudulent transaction and the transaction was actually fraudulent. A false positive is when the model predicts a fraudulent transaction but the transaction was actually not fraudulent. There is a cost to a false positive: the business will lose out on the particular transaction. Therefore, it is important to maximize precision, since the higher it is, the fewer false positives there are.
- Recall is defined as the number of true positives divided by the sum of true positives and false negatives. A false negative is when the model predicts that the transaction is not fraudulent, but the transaction actually turned out to be fraudulent. The cost of a false negative is the cost of the transaction itself. Therefore, it is important to maximize recall, since the higher it is, the fewer false negatives there are.
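The two definitions above can be written out directly; the counts below are hypothetical:

```python
def precision(tp, fp):
    """Of the transactions flagged as fraud, what fraction truly were?"""
    return tp / (tp + fp)

def recall(tp, fn):
    """Of the truly fraudulent transactions, what fraction were caught?"""
    return tp / (tp + fn)

# Suppose the model flags 100 transactions: 80 truly fraudulent (TP)
# and 20 legitimate (FP), while missing 40 fraudulent ones (FN).
print(precision(tp=80, fp=20))  # -> 0.8
print(recall(tp=80, fn=40))     # -> 0.666...
```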
The relevant trade-off here is how to weigh reducing false positives against reducing false negatives. One can eliminate false negatives entirely by blocking every payment, or eliminate false positives entirely by not blocking any payment. However, neither is feasible: in the first case no legitimate transactions would be allowed, and in the second case there would be no effort to stop fraud. The trade-off between precision and recall can be plotted as a precision-recall curve (seen below). In practice, businesses often care more about false negatives (fraud that was not caught) than false positives, since uncaught fraud carries both a revenue hit and a reputational hit. Therefore, it may be sensible to weigh the two error types accordingly when training and testing the model.
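A sketch, again assuming scikit-learn and synthetic data, of tracing the precision-recall curve and of penalizing missed fraud more heavily via `class_weight` (the 10x weight is an illustrative choice, not a recommendation):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve

# Synthetic data where the first feature is predictive of fraud.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + rng.normal(size=1000) > 1.5).astype(int)

# Penalize misclassifying fraud (class 1) 10x more than non-fraud,
# trading some precision for higher recall.
clf = LogisticRegression(class_weight={0: 1, 1: 10}).fit(X, y)
scores = clf.predict_proba(X)[:, 1]

# Precision and recall at every candidate threshold; plotting
# prec against rec yields the precision-recall curve.
prec, rec, thresholds = precision_recall_curve(y, scores)
```

Sweeping the threshold in this way lets the business pick the operating point whose false-positive/false-negative mix matches its costs.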