How We Built a Model-Backed First Party Fraud Product at SentiLink
January 29, 2026
This essay describes how we built a successful 0-to-1 fraud risk score product at SentiLink, using a machine learning model despite having to train on heterogeneous sets of labels whose quality and relevance we couldn't directly verify. The problem was the broad category of "first party fraud" (FPF), a collection of fraud variants that our customers (FIs offering lending and depository products) typically discovered only after the fact: they would observe unusually bad behavior from one of their customers and, possibly after additional manual review, conclude that first party fraud had occurred at the time of the application.
The result was that our labels were a collection of customer-defined "cohorts," in each of which a different "FPF-adjacent" behavior had been observed. Most were instances where our FI customers had observed behaviors highly suggestive of FPF; some were cases where those customers had reached an FPF determination through manual review. Examples included sets of accounts where bad checks were passed, sets of accounts with large customer-initiated ACH returns, and cases where an FI determined that its customer had charged off with no intent to pay (possibly because the customer planned to engage in credit washing later on).
This presented three potential problems for building a score. The first was the heterogeneity of these labels: while we could have just thrown all these labels into a model, we ultimately had to sell these scores to conservative banks and credit unions that would want a coherent story for what our model was targeting, which this approach would be unlikely to provide. The second was mislabeling: some labels, especially those based mostly on our customers observing a bad outcome, may not have been due to FPF at all (the behaviors may have occurred due to identity theft, for example). Finally, in cases where a broad swath of applications was labeled "bad" during a fraud attack, some legitimate applications may have been swept up in the labeling.
These properties were new for SentiLink. At the time we started working on these first party fraud scores in mid-2022, SentiLink had two successful products: scores to help FIs stop synthetic fraud and identity theft. SentiLink's two founders had built solutions to address these problems at Affirm, the online lender, and thus had strong priors on the building blocks needed to enable machine learning models tailored to solving them.
We licensed data sources relevant to the patterns associated with synthetic fraud and identity theft and used these sources, along with data collected via our APIs, to build data structures relevant to detecting these fraud M.O.s. We built internal tooling to visualize these data structures and hired an internal team of fraud analysts (several of whom came from Affirm) to use this tooling to label subsets of applications our partners sent us.
These curated labels, each based on manual review of on-prem data relevant to the application, were then used by the data science team to train our models. This set of approaches drove the company from a few months after founding (when the decision to focus on "scores" was made) through the Series B fundraise from Craft Ventures in May 2021.
With first party fraud, we could not determine through manual review that an individual application was FPF in the same way that we could for synthetic fraud and ID theft. The main reason was that, as described above, FPF is typically discovered through unusually bad behavior on the part of an existing customer.
Two other reasons relate to the definitions of the fraud types themselves. First, synthetic fraud and identity theft can often be detected due to "mismatches" in the PII on an application: a brand new social security number on a person in their 40s is a sign of first party synthetic fraud, and an address, phone number, and social security number all from different states is a sign of identity theft. Second, the data we had licensed and the data structures we built were highly tailored toward detecting synthetic fraud and identity theft.
For example:
So we had to rely on labels from our partners, which had the issues of heterogeneity and accuracy described above.
An analogy: suppose your task is to train a classifier to detect whether a car is in an image, given several "chunks" of images handed to you by individuals of varying expertise. Maybe one person gave you two chunks (one of 1,000 images and another of 2,000 images), a second person gave you a chunk of 3,000 images, and so on. However, looking at the images yourself wouldn't actually confirm whether they contained cars! While most chunks presumably do contain cars, some may instead contain motorcycles, some may contain trucks, and some chunks may not contain vehicles at all.
How to proceed? Suppose you can manually define certain "patterns" and check whether each pattern is present in each image. For example, you can detect whether there is a sheet of metal (which could be part of a door), whether there is a pane of glass (which would be a window or a windshield), whether there is a metal circle inside a rubber circle (which could indicate a wheel), and so on. You can also define patterns that might be present in an image containing a motorcycle or a truck (handlebars, a large cab), but not in one containing a car.
This information, even taken all together, would not be sufficient to determine whether a set of images actually contains cars. However, from the prevalence of these patterns among the images, and from comparing this to the prevalence among a random set of images pulled from the internet, you can infer which chunks of images are likely to contain cars and thus are likely to be beneficial if included as "targets" in the "car" model.
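To make the inference step in this analogy concrete, here is a minimal sketch (all pattern names and data are made up for illustration): compute each pattern's prevalence within a chunk, compare it to the prevalence in a random baseline set of images, and treat chunks where car-like patterns are heavily over-represented as likely to contain cars.

```python
# Illustrative sketch of the analogy only; pattern names and data are made up.

def prevalence(images: list[set[str]], pattern: str) -> float:
    """Fraction of images in which a given pattern was detected."""
    return sum(pattern in img for img in images) / len(images)

def pattern_lift(chunk: list[set[str]],
                 baseline: list[set[str]],
                 patterns: list[str]) -> dict[str, float]:
    """How over-represented each pattern is in a chunk vs. a random baseline."""
    return {
        p: prevalence(chunk, p) / max(prevalence(baseline, p), 1e-6)
        for p in patterns
    }

# A chunk where "car-like" patterns (metal sheet, glass pane, wheel) are heavily
# over-represented relative to the baseline, while "motorcycle/truck" patterns
# (handlebars, large cab) are not, probably contains cars and is a good
# candidate to include as a target.
```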
For us, the analogues to these patterns were signals we could observe in the data tied to these applications (in the "identity database," the "graph," and other on-prem data structures) that we hypothesized could be good signals for first party fraud.
There was a virtually infinite universe of these potential signals we could engineer; we refined our possible set of hypotheses through:
While each of these plausibly could be a signal for first party fraud, we then had to validate them quantitatively using the heterogeneous, unreliable labels we had received from our customers. The next section describes how we did this.
Suppose we started with three sets of potential first party fraud labels:
For each dataset, we produced a table showing how each potential signal behaved on both the fraud labels and the relevant population of which the applications were a subset:
| label category | n_total | fpf_signal_1 recall | fpf_signal_2 recall | fpf_signal_3 recall | fpf_signal_4 recall | high_synthetic_score recall | high_id_theft_score recall | n_pos_fpf_signal_1 | n_pos_fpf_signal_2 | n_pos_fpf_signal_3 | n_pos_fpf_signal_4 | n_high_synthetic_score | n_high_id_theft_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| bad_labels_overall | 4,000 | 32.9% | 39.0% | 2.6% | 3.0% | 4.4% | 1.3% | 1,317 | 1,560 | 105 | 119 | 177 | 50 |
| bad_label_sub_category_1 | 2,500 | 37.9% | 30.0% | 3.3% | 1.4% | 1.3% | 1.2% | 948 | 749 | 82 | 36 | 33 | 30 |
| bad_label_sub_category_2 | 1,000 | 3.8% | 3.8% | 2.8% | 1.7% | 2.7% | 2.8% | 38 | 38 | 28 | 17 | 27 | 28 |
| bad_label_sub_category_3 | 500 | 32.8% | 22.8% | 3.2% | 4.0% | 2.4% | 2.6% | 164 | 114 | 16 | 20 | 12 | 13 |
| overall_approvals | 200,000 | 4.2% | 2.8% | 4.0% | 3.8% | 1.5% | 1.7% | 8,422 | 5,676 | 7,908 | 7,581 | 3,096 | 3,368 |
| label category | n_total | fpf_signal_1 recall | fpf_signal_2 recall | fpf_signal_3 recall | fpf_signal_4 recall | high_synthetic_score recall | high_id_theft_score recall | n_pos_fpf_signal_1 | n_pos_fpf_signal_2 | n_pos_fpf_signal_3 | n_pos_fpf_signal_4 | n_high_synthetic_score | n_high_id_theft_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| bad_labels_overall | 2,000 | 4.3% | 15.1% | 21.2% | 27.4% | 1.2% | 3.2% | 86 | 302 | 425 | 547 | 24 | 63 |
| overall_approvals | 165,000 | 4.4% | 2.1% | 2.5% | 3.6% | 4.3% | 3.4% | 7,179 | 3,469 | 4,192 | 5,981 | 7,076 | 5,578 |
| label category | n_total | fpf_signal_1 recall | fpf_signal_2 recall | fpf_signal_3 recall | fpf_signal_4 recall | high_synthetic_score recall | high_id_theft_score recall | n_pos_fpf_signal_1 | n_pos_fpf_signal_2 | n_pos_fpf_signal_3 | n_pos_fpf_signal_4 | n_high_synthetic_score | n_high_id_theft_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| bad_labels_overall | 18,000 | 3.5% | 3.4% | 3.5% | 1.1% | 1.9% | 3.8% | 622 | 606 | 639 | 200 | 350 | 683 |
| bad_label_sub_category_1 | 12,500 | 4.1% | 3.4% | 2.2% | 3.2% | 3.9% | 17.8% | 513 | 419 | 269 | 405 | 482 | 2,227 |
| bad_label_sub_category_2 | 5,500 | 27.6% | 30.3% | 4.5% | 2.0% | 2.9% | 4.0% | 1,517 | 1,666 | 248 | 109 | 160 | 218 |
| overall_apps | 240,000 | 2.6% | 8.2% | 8.9% | 8.4% | 2.6% | 5.8% | 2,580 | 8,239 | 8,947 | 8,474 | 2,753 | 5,818 |
These tables show, for each individual chunk of fraud labels, the percent of those labels "flagged" by each individual signal (i.e., the "recall" of that signal on that chunk). For comparison, we also show in the bottom row of each table the percent of the overall dataset that was flagged by the signal.
The ratio of these numbers was an important quantity for us. We called it "relative likelihood":
relative_likelihood = P(signal = 1 | label = fraud) / P(signal = 1)
This is a proxy for precision that we would both look at internally and present along with recall. We found that it was easier for potential customers to translate "This feature flags 30% of your fraud while flagging 2% of your approved applications, a 15x ratio" into business value than "This feature has a 30% recall and a 15% precision." 1
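As a rough sketch of how the numbers in these tables can be computed, suppose each label chunk and the baseline population are pandas DataFrames with one boolean column per signal (the column and function names here are illustrative, not our internal code):

```python
import pandas as pd

def signal_summary(labeled: pd.DataFrame,
                   baseline: pd.DataFrame,
                   signal_cols: list[str]) -> pd.DataFrame:
    """Recall and relative likelihood of each boolean signal on one label chunk.

    `labeled` holds the applications in one chunk of (putative) FPF labels;
    `baseline` holds the relevant overall population (e.g., all approvals).
    """
    rows = []
    for col in signal_cols:
        recall = labeled[col].mean()        # P(signal = 1 | label = fraud)
        base_rate = baseline[col].mean()    # P(signal = 1) in the population
        rows.append({
            "signal": col,
            "n_pos": int(labeled[col].sum()),
            "recall": recall,
            "relative_likelihood": recall / base_rate if base_rate > 0 else float("nan"),
        })
    return pd.DataFrame(rows)
```

Multiplying relative likelihood by the portfolio's base fraud rate gives (approximately) the implied precision, which is why we treated it as a precision proxy.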
| label category | n_total | fpf_signal_1 relative likelihood | fpf_signal_2 relative likelihood | fpf_signal_3 relative likelihood | fpf_signal_4 relative likelihood | high_synthetic_score relative likelihood | high_id_theft_score relative likelihood |
|---|---|---|---|---|---|---|---|
| bad_labels_overall | 4,000 | 7.83 | 13.93 | 0.65 | 0.79 | 2.93 | 0.76 |
| bad_label_sub_category_1 | 2,500 | 9.02 | 10.71 | 0.82 | 0.37 | 0.87 | 0.71 |
| bad_label_sub_category_2 | 1,000 | 0.90 | 1.36 | 0.70 | 0.45 | 1.80 | 1.65 |
| bad_label_sub_category_3 | 500 | 7.81 | 8.14 | 0.80 | 1.05 | 1.60 | 1.53 |
| overall_approvals | 200,000 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| label category | n_total | fpf_signal_1 relative likelihood | fpf_signal_2 relative likelihood | fpf_signal_3 relative likelihood | fpf_signal_4 relative likelihood | high_synthetic_score relative likelihood | high_id_theft_score relative likelihood |
|---|---|---|---|---|---|---|---|
| bad_labels_overall | 2,000 | 0.98 | 7.19 | 8.48 | 7.61 | 0.28 | 0.94 |
| overall_approvals | 165,000 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| label category | n_total | fpf_signal_1 relative likelihood | fpf_signal_2 relative likelihood | fpf_signal_3 relative likelihood | fpf_signal_4 relative likelihood | high_synthetic_score relative likelihood | high_id_theft_score relative likelihood |
|---|---|---|---|---|---|---|---|
| bad_labels_overall | 18,000 | 1.35 | 0.41 | 0.39 | 0.13 | 0.73 | 0.66 |
| bad_label_sub_category_1 | 12,500 | 1.58 | 0.41 | 0.25 | 0.38 | 1.50 | 3.07 |
| bad_label_sub_category_2 | 5,500 | 10.62 | 3.70 | 0.51 | 0.24 | 1.12 | 0.69 |
| overall_apps | 240,000 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
You'll note that we also included in these tables the recalls and relative likelihoods for scoring high on our synthetic fraud and identity theft models. Extending the cars, motorcycles, and trucks analogy above: in addition to knowing whether certain material patterns were present, imagine you also had models that you knew were 80%–90% accurate at telling you whether an image contained a motorcycle or a truck.
Looking at these six tables, several things jump out:
- In the first dataset, fpf_signal_1 and fpf_signal_2 flag a large share of the bad labels at roughly 8x–14x their background rates, while signals 3 and 4 show no lift at all; one sub-category (bad_label_sub_category_2) shows little lift on any signal.
- In the second dataset, the pattern is nearly the opposite: signals 2, 3, and 4 each show a 7x–8x lift, while signal 1 shows none.
- In the third dataset, the aggregate bad labels show little lift on anything, but the sub-categories diverge: bad_label_sub_category_1 is flagged disproportionately by our identity theft score (17.8% recall, roughly 3x lift), suggesting many of those labels may not be FPF at all, while bad_label_sub_category_2 looks much more like the first dataset, with strong lift on signal 1 and meaningful lift on signal 2.
This analysis certainly helped us understand our data deeply, but how did we go from running these numbers to producing first party fraud scores?
When modeling, before making technical decisions like which model structure to use (logistic regression or gradient-boosted trees) and how to select hyperparameters, there are decisions closer to business strategy that end up having a much bigger impact on the efficacy of the model you build:
- What, exactly, the model should target: which labels to use as "bads" and which applications to use as "goods."
- Which signals to engineer and include as features.
The analysis described above directly informs both of these questions. Without breaking down how key individual signals are (or are not) associated with various potential targets, you are largely taking shots in the dark. (An alternative approach is to feed every conceivable signal you can think of into the model. We learned some hard lessons about doing this with our synthetic and ID theft models: while it did not necessarily degrade aggregate model performance, it led to less explainable SHAP values and occasional misses that sometimes raised uncomfortable questions among our largest and most conservative customers.)
Of the 10–20 potential signals we evaluated, we ended up finding eight that were conceptually distinct from one another and individually predictive on many of the datasets we'd received: they were prevalent among many of the FPF label sets (high recall) and much less prevalent in the overall populations (and therefore had high relative likelihood). Each of these eight signals was a boolean true/false flag accompanied by numeric "sub-signals" (e.g., the number of distinct SSNs someone had committed synthetic fraud with, or the number of demand deposit accounts (DDAs) applied for in different time windows). These signals ultimately became ~40 features in our model.
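To give a sense of the shape of the feature set (the signal and column names below are hypothetical, not our production schema), each boolean flag traveled with a handful of numeric sub-signals, which is roughly how eight signals expanded into ~40 model features:

```python
# Hypothetical sketch: boolean signals plus their numeric sub-signals expand
# into model features. Names are illustrative, not SentiLink's actual schema.

def build_features(app: dict) -> dict:
    feats = {}

    # One signal: prior synthetic-fraud history, as a flag plus a count.
    feats["synthetic_history_flag"] = app["n_ssns_with_synthetic_history"] > 0
    feats["n_ssns_with_synthetic_history"] = app["n_ssns_with_synthetic_history"]

    # Another signal: DDA application velocity, with counts over several windows.
    for days in (7, 30, 90):
        feats[f"n_dda_apps_{days}d"] = app[f"n_dda_apps_{days}d"]
    feats["dda_velocity_flag"] = feats["n_dda_apps_30d"] >= 3

    return feats
```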
To choose which datasets to use as our "goods" and "bads," we used the analysis above as a starting point and followed it up with primary research into the fraud M.O.s. This led us to conclude that the "bad" labels we'd received could be broadly grouped into two distinct FPF M.O.s (one related to check fraud and one related to ACH fraud). We saw these bad labels consistently associated with distinctive, though overlapping, sets of the eight signals we'd found.
Note that we did not use statistical techniques such as clustering on the tables above. We had on the order of 10–20 datasets and 10–20 signals (of which we ended up including eight in the initial models), so we simply manually reviewed the performance of each boolean signal on each dataset to determine what to include.
Once we had decided to build a separate model for each of these two M.O.s and had a general sense of the labels we wanted to include in each, we were able to go back to the customers from whom we'd received those labels, share our analysis and research, and dive deeper into the labels themselves. This deep engagement allowed us to refine the labels to fit the fraud M.O.s we were targeting.
Refining the labels with input from the customers who had provided them proved so valuable for label quality that we ended up training each of the two initial scores on a single customer's data. Collaborating with these customers on the sub-categories of labels they had sent led to concentrated sets of 1,000–2,000 high-quality labels as the "bads" for each model; the "goods" were approved applications from the same time period, from those same customers.
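A minimal sketch of how one of these per-customer models could be put together, assuming gradient-boosted trees (one of the model families mentioned above) and hypothetical column names:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

def train_fpf_model(bads: pd.DataFrame,
                    goods: pd.DataFrame,
                    feature_cols: list[str]) -> GradientBoostingClassifier:
    """Train one per-M.O. score.

    `bads` is the refined set of ~1,000-2,000 FPF labels from a single customer;
    `goods` is that same customer's approved applications from the same period;
    `feature_cols` are the ~40 engineered features.
    """
    train = pd.concat([bads.assign(label=1), goods.assign(label=0)])
    model = GradientBoostingClassifier()
    model.fit(train[feature_cols], train["label"])
    return model
```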
The scores launched in early 2024; by the time I left SentiLink in late 2025, many top FIs were using them as part of their fraud decisioning, including:
One additional FI had signed to use them but had not yet integrated. Furthermore, nearly all of the top FIs SentiLink worked with were engaged in the retrostudy process with these scores.
These scores became SentiLink's fastest-growing product launched during my last two years at the company, accounting for roughly 5% of ARR by the time I left despite no marketing push. We largely offered existing customers a way to test these products alongside our synthetic fraud and identity theft scores, and more often than not, they found that the scores caught fraud our other products missed. In my final quarter, four of the top FIs mentioned above went live via API.
We eventually created a single score that combined the two underlying models via a simple linear transformation. As I was transitioning out, the team was exploring training a unified model that included labels from multiple partners. Still, the fact that the simple approach we took got us as far as it did holds several lessons.
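For illustration only (the post above says the transformation was a simple linear one; the weights and clipping here are assumptions), combining the two per-M.O. scores might look like this:

```python
def combined_fpf_score(check_fraud_score: float,
                       ach_fraud_score: float,
                       w_check: float = 0.5,
                       w_ach: float = 0.5) -> float:
    """Blend the two per-M.O. model scores into a single score in [0, 1].

    The weights are placeholders; the actual transformation was only
    described as a simple linear combination.
    """
    blended = w_check * check_fraud_score + w_ach * ach_fraud_score
    return min(max(blended, 0.0), 1.0)
```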
Evaluation illuminates everything. Just as how a SaaS company defines and breaks down its ARR both reflects and shapes how it operates, how you choose to evaluate your models shapes everything downstream. In particular, building a more granular evaluation framework, as we did here (breaking performance down into 10–15 individual datasets along a couple of key metrics), affected not just our understanding of how well we were solving first party fraud but how we defined the problem itself. The framework was nothing fancy technically: we stored the datasets in S3, with a couple of version-controlled Python scripts that could produce tables like the ones above, add new datasets as needed, and evaluate new features and scores against the existing datasets.
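A skeleton of what such a harness might look like (the bucket name, file layout, and dataset names are assumptions; the actual scripts were internal):

```python
import pandas as pd

# Hypothetical layout: one parquet file per labeled dataset plus one baseline
# population, all in S3 (pd.read_parquet on an s3:// path requires s3fs).
DATASETS = {
    "customer_a_check_fraud": "s3://fpf-eval/customer_a_check_fraud.parquet",
    "customer_b_ach_returns": "s3://fpf-eval/customer_b_ach_returns.parquet",
}
BASELINE_PATH = "s3://fpf-eval/overall_approvals.parquet"

def evaluate_signals(signal_cols: list[str]) -> pd.DataFrame:
    """Recall / relative-likelihood table for every dataset and signal."""
    baseline = pd.read_parquet(BASELINE_PATH)
    rows = []
    for name, path in DATASETS.items():
        labeled = pd.read_parquet(path)
        for col in signal_cols:
            recall = labeled[col].mean()
            base_rate = baseline[col].mean()
            rows.append({
                "dataset": name,
                "signal": col,
                "recall": recall,
                "relative_likelihood": recall / base_rate if base_rate > 0 else float("nan"),
            })
    return pd.DataFrame(rows)
```

Adding a new customer dataset or a new candidate signal then amounts to adding an entry to the dataset registry or a column name to the signal list and regenerating the tables.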
These scores allowed us to initiate collaborations with major U.S. FIs around first party fraud that continue to this day. SentiLink has used these relationships to explore more "determinative" solutions for first party fraud, and a couple of the eight signals we developed later became useful building blocks for adjacent fraud efforts that were in progress when I left and that SentiLink may be announcing soon.
Most importantly, this work established at SentiLink a set of practices for evaluating fraud signals on external datasets in a systematic way. As I was departing, the business had launched an initiative to evaluate our identity theft score and some of its key underlying signals on external datasets, using the same approach we'd developed for FPF. We even built a similar evaluation framework for a new fraud area, and it helped us see that we likely couldn't produce compelling scores or flags there.
In a future post, I'll discuss how this framework applies to evaluating quantized models.
Seth Weidman worked at SentiLink for about six years, from when the company was about two-and-a-half years old (December 2019) to when it was eight-and-a-half years old (November 2025).
1 Assuming an overall fraud rate of 1%, flagging 30% of all fraud while flagging 2% of approved applications implies a precision of roughly (0.30 × 1%) / 2% = 15% (the precision value referenced in the main text).