Building Fraud Models with Heterogeneous Labels of Unknown Quality (draft)

How We Built a Model-Backed First Party Fraud Product at SentiLink

January 29, 2026

This essay describes how we successfully built a 0-to-1 fraud risk score product at SentiLink. We built this utilizing a machine learning model despite having to train on heterogeneous sets of labels where we couldn't directly verify quality or relevance. The problem was the broad category of "first party fraud" (FPF), a collection of fraud variants that our customers (FIs offering lending and depository products) typically discovered only "after the fact": they would observe unusually bad behavior from one of their customers and, possibly after doing additional manual review, conclude that first party fraud had occurred at the time of the application.

The result was that our labels were a collection of customer-defined "cohorts"; in each one, a different "FPF-adjacent" behavior had been observed. Most were instances where our FI customers had observed behaviors highly suggestive of FPF, but some were cases where those customers had reached the FPF determination through manual review. Examples included sets of accounts where bad checks were passed, sets of accounts with large customer-initiated ACH returns, and cases where a customer determined that their customer had charged off and never had intent to pay (possibly because they planned to engage in credit washing later on).

This presented three potential problems for building a score. The first was the heterogeneity of these labels: while we could have just thrown all these labels into a model, we ultimately had to sell these scores to conservative banks and credit unions that would want a coherent story for what our model was targeting, which this approach would be unlikely to provide. The second was mislabeling: some labels, especially those based mostly on our customers observing a bad outcome, may not have been due to FPF at all (the behaviors may have occurred due to identity theft, for example). Finally, in cases where a broad swath of applications was labeled "bad" during a fraud attack, some legitimate applications may have been swept up in the labeling.

These properties were new for SentiLink. At the time we started working on these first party fraud scores in mid-2022, SentiLink had two successful products: scores to help FIs stop synthetic fraud and identity theft. SentiLink's two founders had built solutions to address these problems at Affirm, the online lender, and thus had strong priors on the building blocks needed to enable machine learning models tailored to solving them.

We licensed data sources relevant to the patterns associated with synthetic fraud and identity theft and used these sources, along with data collected via our APIs, to build data structures relevant to detecting these fraud M.O.s. We built internal tooling to visualize these data structures and hired an internal team of fraud analysts (several of whom came from Affirm) to use this tooling to label subsets of applications our partners sent us.

These curated labels, each based on manual review of on-prem data relevant to the application, were then used by the data science team to train our models. This set of approaches drove the company from a few months after founding (when the decision to focus on "scores" was made) through the Series B fundraise from Craft Ventures in May 2021.

Our FPF Problem and an Analogy to Understand It

With first party fraud, we could not determine through manual review that an individual application was FPF in the same way that we could for synthetic fraud and ID theft. The main reason was that, as described above, FPF is typically discovered through unusually bad behavior on the part of an existing customer.

Two other reasons relate to the definitions of the fraud types themselves. First, synthetic fraud and identity theft can often be detected due to "mismatches" in the PII on an application: a brand new social security number on a person in their 40s is a sign of first party synthetic fraud, and an address, phone number, and social security number all from different states is a sign of identity theft. Second, the data we had licensed and the data structures we built were highly tailored toward detecting synthetic fraud and identity theft.

For example:

We had an "identity database," based mostly on licensed data, that was structured to "natively" show if someone was committing first party synthetic fraud (where one person uses multiple SSNs with their own name and date of birth).
We had an "application graph" based mostly on first party data (e.g., our FI customers calling our API) that let us see if one phone number was being used on multiple applications with different pieces of core PII, a sign that it might be the phone of a fraudster who had stolen multiple identities.

So we had to rely on labels from our partners, which had the issues of heterogeneity and accuracy described above.

An analogy: suppose your task is to train a classifier to detect whether a car is in an image, given several "sets" of images handed to you by individuals of varying expertise. Maybe one person gave you two chunks (one of 1,000 images and another of 2,000 images), a second person gave you a chunk of 3,000 images, and so on. However, looking at the images yourself wouldn't actually confirm whether there was a car in the image! While most chunks presumably do have cars in them, some may instead have motorcycles, some may have trucks, and some chunks may not even be images of vehicles at all.

How to proceed? Suppose you can manually define certain "patterns" and check whether each pattern is present in each image. For example, you can detect whether there is a sheet of metal (which could be part of a door), whether there is a pane of glass (which would be a window or a windshield), whether there is a metal circle inside a rubber circle (which could indicate a wheel), and so on. You can also define patterns that might be present in an image containing a motorcycle or a truck (handlebars, a large cab), but not in one containing a car.

This information, even taken all together, would not be sufficient to determine whether a set of images actually contains cars. However, from the prevalence of these patterns among the images, and from comparing this to the prevalence among a random set of images pulled from the internet, you can infer which chunks of images are likely to contain cars and thus are likely to be beneficial if included as "targets" in the "car" model.

For us, the analogues to these patterns were signals we could observe in the data tied to these applications (in the "identity database," the "graph," and other on-prem data structures) that we hypothesized could be good signals for first party fraud.

There was a virtually infinite universe of these potential signals we could engineer; we refined our possible set of hypotheses through:

Primary research into first party fraud M.O.s. A common type of FPF involving ACH transactions requires multiple checking accounts, so we hypothesized that applying for multiple checking accounts in a short period might be predictive. We were able to observe this behavior in the application graph, even though it was built to identify identity theft.
Reviewing cases from the ambiguous labels we'd received. Even though these labels were imperfect, we found certain anomalous patterns coming up repeatedly. For example, we could see in our identity database that a disproportionately high number of the supposed first party fraudsters had previously committed synthetic fraud, even if they were not committing synthetic fraud on the specific applications where they were labeled as first party fraud.

While each of these plausibly could be a signal for first party fraud, we then had to validate them quantitatively using the heterogeneous, unreliable labels we had received from our customers. The next section describes how we did this.

Evaluating hypothesized signals on heterogeneous, unreliable labels

Suppose we started with three sets of potential first party fraud labels:

One from a credit union, with 4,000 overall "first party fraud" labels divided into three subcategories.
One from a large bank, consisting of a single chunk of 2,000 first party fraud labels.
One from another large bank, with 18,000 first party fraud labels divided into two subcategories.

For each dataset, we produced a table showing how each potential signal behaved on both the fraud labels and the relevant population of which the applications were a subset:

Credit Union 1

label category	n_total	fpf_signal_1 recall	fpf_signal_2 recall	fpf_signal_3 recall	fpf_signal_4 recall	high_synthetic_score_recall	high_id_theft_score_recall	n_pos_fpf_signal_1	n_pos_fpf_signal_2	n_pos_fpf_signal_3	n_pos_fpf_signal_4	n_high_synthetic_score	n_high_id_theft_score
bad_labels_overall	4,000	32.9%	39.0%	2.6%	3.0%	4.4%	1.3%	1,317	1,560	105	119	177	50
bad_label_sub_category_1	2,500	37.9%	30.0%	3.3%	1.4%	1.3%	1.2%	948	749	82	36	33	30
bad_label_sub_category_2	1,000	3.8%	3.8%	2.8%	1.7%	2.7%	2.8%	38	38	28	17	27	28
bad_label_sub_category_3	500	32.8%	22.8%	3.2%	4.0%	2.4%	2.6%	164	114	16	20	12	13
overall_approvals	200,000	4.2%	2.8%	4.0%	3.8%	1.5%	1.7%	8,422	5,676	7,908	7,581	3,096	3,368

Large Bank 1

label category	n_total	fpf_signal_1 recall	fpf_signal_2 recall	fpf_signal_3 recall	fpf_signal_4 recall	high_synthetic_score_recall	high_id_theft_score_recall	n_pos_fpf_signal_1	n_pos_fpf_signal_2	n_pos_fpf_signal_3	n_pos_fpf_signal_4	n_high_synthetic_score	n_high_id_theft_score
bad_labels_overall	2,000	4.3%	15.1%	21.2%	27.4%	1.2%	3.2%	86	302	425	547	24	63
overall_approvals	165,000	4.4%	2.1%	2.5%	3.6%	4.3%	3.4%	7,179	3,469	4,192	5,981	7,076	5,578

Large Bank 2

label category	n_total	fpf_signal_1 recall	fpf_signal_2 recall	fpf_signal_3 recall	fpf_signal_4 recall	high_synthetic_score_recall	high_id_theft_score_recall	n_pos_fpf_signal_1	n_pos_fpf_signal_2	n_pos_fpf_signal_3	n_pos_fpf_signal_4	n_high_synthetic_score	n_high_id_theft_score
bad_labels_overall	18,000	3.5%	3.4%	3.5%	1.1%	1.9%	3.8%	622	606	639	200	350	683
bad_label_sub_category_1	12,500	4.1%	3.4%	2.2%	3.2%	3.9%	17.8%	513	419	269	405	482	2,227
bad_label_sub_category_2	5,500	27.6%	30.3%	4.5%	2.0%	2.9%	4.0%	1,517	1,666	248	109	160	218
overall_apps	240,000	2.6%	8.2%	8.9%	8.4%	2.6%	5.8%	2,580	8,239	8,947	8,474	2,753	5,818

What these tables tell us

These tables show, for each individual chunk of fraud labels, the percent of those labels "flagged" by each individual signal (i.e., the "recall" of that signal on that chunk). For comparison, we also show in the bottom row of each table the percent of the overall dataset that was flagged by the signal.

The ratio of these numbers was an important quantity for us. We called it "relative likelihood":

relative_likelihood =
  P(signal = 1 | label = fraud) / P(signal = 1)

This is a proxy for precision that we would both look at internally and present along with recall. We found that it was easier for potential customers to translate "This feature flags 30% of your fraud while flagging 2% of your approved applications, a 15x ratio" into business value than "This feature has a 30% recall and a 15% precision." ¹

Credit Union 1 (relative likelihood)

label category	n_total	fpf_signal_1 relative likelihood	fpf_signal_2 relative likelihood	fpf_signal_3 relative likelihood	fpf_signal_4 relative likelihood	high_synthetic_score relative likelihood	high_id_theft_score relative likelihood
bad_labels_overall	4,000	7.83	13.93	0.65	0.79	2.93	0.76
bad_label_sub_category_1	2,500	9.02	10.71	0.82	0.37	0.87	0.71
bad_label_sub_category_2	1,000	0.90	1.36	0.70	0.45	1.80	1.65
bad_label_sub_category_3	500	7.81	8.14	0.80	1.05	1.60	1.53
overall_approvals	200,000	1.00	1.00	1.00	1.00	1.00	1.00

Large Bank 1 (relative likelihood)

label category	n_total	fpf_signal_1 relative likelihood	fpf_signal_2 relative likelihood	fpf_signal_3 relative likelihood	fpf_signal_4 relative likelihood	high_synthetic_score relative likelihood	high_id_theft_score relative likelihood
bad_labels_overall	2,000	0.98	7.19	8.48	7.61	0.28	0.94
overall_approvals	165,000	1.00	1.00	1.00	1.00	1.00	1.00

Large Bank 2 (relative likelihood)

label category	n_total	fpf_signal_1 relative likelihood	fpf_signal_2 relative likelihood	fpf_signal_3 relative likelihood	fpf_signal_4 relative likelihood	high_synthetic_score relative likelihood	high_id_theft_score relative likelihood
bad_labels_overall	18,000	1.35	0.41	0.39	0.13	0.73	0.66
bad_label_sub_category_1	12,500	1.58	0.41	0.25	0.38	1.50	3.07
bad_label_sub_category_2	5,500	10.62	3.70	0.51	0.24	1.12	0.69
overall_apps	240,000	1.00	1.00	1.00	1.00	1.00	1.00

You'll note that we also included in these tables the recalls and relative likelihoods for scoring high on our synthetic fraud and identity theft models. Extending the cars, trucks, and bicycles analogy above: in addition to telling you whether there were certain material patterns present, imagine you also had models that you knew were 80%–90% accurate at telling you whether the image contained a truck or a bicycle.

Making sense of these tables

Looking at these six tables, several things jump out:

Signals 1 and 2 tend to "pop" on sub-categories 1 and 3 from the credit union, and sub-category 2 from the second large bank. These signals often "fire together," and they tend to fire on distinct sets of fraud-label datasets.
Signal 1 does not meaningfully "pop" on Large Bank 1 overall (relative likelihood near 1), whereas Signals 2–4 do.
Large Bank 2's sub-category 1 has a high rate of ID theft (measured by the percentage of these applications that score highly on our ID theft score); this indicated to us that this label set was likely mostly identity theft rather than first party fraud.
Sub-category 2 from the credit union did not fire on any of our hypothesized FPF signals, suggesting either a different underlying M.O. or substantial label noise.

From signals to scores

This analysis certainly helped us understand our data deeply, but how did we go from running these numbers to producing first party fraud scores?

When modeling, before making technical decisions like which model structure to use (logistic regression or gradient-boosted trees) and how to select hyperparameters, there are decisions closer to business strategy that end up having a much bigger impact on the efficacy of the model you build:

Which features (or "signals" as we tended to refer to them) to use, i.e., what the "columns" of the dataset behind the model will be.
Which dataset or datasets you'll train on; this includes deciding both what the "bads" and the "not bads" should be.

The analysis described above directly informs both of these questions. Without breaking down how key individual signals are (or are not) associated with various potential targets, you are largely taking shots in the dark. (An alternative approach is to feed every conceivable signal you can think of into the model. We learned some hard lessons about doing this with our synthetic and ID theft models: while it did not necessarily degrade aggregate model performance, it led to less explainable SHAP values and occasional misses that sometimes raised uncomfortable questions among our largest and most conservative customers.)

Of the 10–20 potential signals we evaluated, we ended up finding eight that were conceptually distinct from one another and individually predictive on many of the datasets we'd received. By this, we mean that they were prevalent among many of the FPF label datasets we'd received and not so prevalent overall (in other words, high recall and high relative likelihood, respectively). Each of these eight signals was a boolean true/false flag, accompanied by numeric "sub-signals" (e.g., the number of distinct SSNs someone had committed synthetic fraud with, or the number of DDAs applied for in different time windows). These signals ultimately became ~40 features in our model.

To choose which datasets to use as our "goods" and "bads," we used the analysis above as a starting point and followed it up with primary research into the fraud M.O.s. This led us to conclude that the "bad" labels we'd received could be broadly grouped into two distinct FPF M.O.s (one related to check fraud and one related to ACH fraud). We saw these bad labels consistently associated with distinctive, though overlapping, sets of the eight signals we'd found.

Note that we did not use statistical techniques such as clustering on the tables above. We had on the order of 10–20 datasets and 10–20 signals (of which we ended up including eight in the initial models), so we simply manually reviewed the performance of each boolean signal on each dataset to determine what to include.

Once we had decided on these two distinct models and had a general sense of the labels we wanted to include in each, we were able to go back to the customers from whom we'd received these labels, share our analysis and research, and dive deeper into these labels. Deep engagement allowed us to refine the labels we had down to fit the fraud M.O.s that we were targeting.

This ability to refine the labels with input from the customers we received them from was so critical in increasing label quality that we ended up training each of the initial scores on one customer's data. Our collaboration with these customers about the sub-categories of labels they had sent led to concentrated sets of 1,000–2,000 high-quality labels as the "bads" for each model; the "goods" were approved applications from the same time period, from those same customers.

The scores did well

Launched in early 2024, by the time I left SentiLink in late 2025, many top FIs were using them as part of their fraud decisioning, including:

Four top fifteen banks
Two top ten credit card issuers
A top five credit union

One additional FI had signed to use them but had not yet integrated. Furthermore, nearly all of the top FIs SentiLink worked with were engaged in the retrostudy process with these scores.

These scores became SentiLink's fastest-growing product launched during my last two years at the company, accounting for roughly 5% of ARR by the time I left despite no marketing push. We largely offered existing customers a way to test these products alongside our synthetic fraud and identity theft scores, and more often than not, they found that the scores caught fraud our other products missed. In my final quarter, four of the top FIs mentioned above went live via API.

We eventually created a single score that combined the two underlying models via a simple linear transformation. As I was transitioning out, the team was exploring training a unified model that included labels from multiple partners. Still, the fact that the simple approach we took got us as far as it did holds several lessons.

Lessons

Evaluation illuminates everything. Just as, at a macro level, how a SaaS company defines and breaks down its ARR can affect and elucidate how it is operating, deciding how to evaluate your models can affect everything. In particular, creating a more granular evaluation framework, as we did here (breaking down performance into 10–15 individual datasets along a couple of key metrics) affected not just how we understood how well we were solving the problem of first party fraud, but even how we defined the problem itself. This framework was nothing fancy from a technical perspective: we stored the datasets themselves in S3, with a couple of version-controlled Python scripts that could produce tables like the ones above, add new datasets as needed, and evaluate new features and scores on our existing datasets.

What came next

These scores allowed us to initiate collaborations with major U.S. FIs around first party fraud that continue to this day. SentiLink has used these relationships to explore more "determinative" solutions for first party fraud, and a couple of the eight signals we developed later became useful building blocks for adjacent fraud efforts that were in progress when I left and that SentiLink may be announcing soon.

Most importantly, this work established at SentiLink a set of practices for evaluating fraud signals on external datasets in a systematic way. As I was departing, the business had launched an initiative to evaluate our identity theft score and some of its key underlying signals on external datasets, using the same approach we'd developed for FPF. We even built a similar evaluation framework for a new fraud area, and it helped us see that we likely couldn't produce compelling scores or flags there.

Evaluations beyond fraud models

In a future post, I'll discuss how this framework applies to evaluating quantized models.

About the author

Seth Weidman worked at SentiLink for about six years, from when the company was about two-and-a-half years old (December 2019) to when it was eight-and-a-half years old (November 2025).

Footnotes

¹ Assuming the overall fraud rate is 1%, flagging 30% of all fraud while flagging 2% of approvals implies roughly 15% precision (the precision value referenced in the main text). ↩