A Machine Learning Approach to Deduping Salesforce

When we look at the Salesforce deduplication apps on the AppExchange, they all have one thing in common: they are rule-based. Salesforce admins have to set up all kinds of rules to catch duplicates in all of their shapes, forms, and sizes. However, there is a much better approach to solving duplicate issues, and it involves machine learning. In this article, we will discuss how machine learning algorithms are trained to spot duplicates and what makes such an approach superior to the rule-based one. Let’s start off by learning how machine learning algorithms identify similar records.

Identifying Similar Records

If we take a look at the two records below, it is fairly obvious to a human being that they are duplicates:

First Name	Last Name	Address
Benjamin	Harding	123 Wisconsin Avenue
Ben	harding	123 Wisconsin Ave

However, if we ask a machine to identify these records as duplicates, we have to explain exactly why they are duplicates. This is much harder than it may seem. We could start by listing all of the similarities, but then we would have to explain what we mean by “similar”. Are there gradations of “similar”? If so, what are they? Which similarities automatically indicate that two records are duplicates?

One way researchers go about solving this problem is by using string metrics. There are many different types of string metrics, but for the purpose of this article, we will go into only a couple of them. First of all, a string metric is a way of taking two strings and returning a number that is low if the strings are similar and high if they are not. One of the most widely used string metrics is the Hamming distance, which counts the number of substitutions that must be made to turn one string into another string of the same length. For example, if we return to our table of records above, 1 substitution is needed to turn “harding” into “Harding”. Therefore the Hamming distance is 1.
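As a minimal sketch of the idea, here is the Hamming distance in Python. The function name is our own choice for illustration and is not taken from any particular deduplication tool.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance is only defined for strings of equal length")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))


print(hamming_distance("harding", "Harding"))  # 1: only the first letter differs
```

Because “Benjamin” and “Ben” have different lengths, the Hamming distance cannot compare them. Edit distances such as the Levenshtein distance, which also count insertions and deletions, handle that case.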

There are also learnable distance metrics, which take into consideration that different edit operations have varying significance in different domains. For example, if we change even one digit in the house number, we are effectively changing the entire address. But a letter substitution in the street name may not be as significant, since it could be a typo or an abbreviation. From this example, we see that certain similarities need to be given greater emphasis than others. We will talk more about this later on, but for now, let’s take a look at how these metrics are used to dedupe your Salesforce environment.
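To make the house number example concrete, here is a rough sketch of an edit distance in which edits that touch digits cost more than edits that touch letters. The costs (5.0 and 1.0) are set by hand and are purely an assumption for illustration; a learnable metric would fit such costs from labeled data instead.

```python
def weighted_edit_distance(a: str, b: str,
                           digit_cost: float = 5.0,
                           letter_cost: float = 1.0) -> float:
    """Levenshtein-style distance where edits touching digits cost more."""

    def cost(ch: str) -> float:
        # Hand-set costs; a learnable metric would estimate these from data
        return digit_cost if ch.isdigit() else letter_cost

    # prev[j] is the distance between the processed prefix of a and b[:j]
    prev = [0.0]
    for ch_b in b:
        prev.append(prev[-1] + cost(ch_b))

    for ch_a in a:
        curr = [prev[0] + cost(ch_a)]
        for j, ch_b in enumerate(b, start=1):
            substitute = prev[j - 1] + (0.0 if ch_a == ch_b else max(cost(ch_a), cost(ch_b)))
            delete = prev[j] + cost(ch_a)
            insert = curr[j - 1] + cost(ch_b)
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]


# One changed digit in the house number
print(weighted_edit_distance("123 Wisconsin Ave", "124 Wisconsin Ave"))  # 5.0
# One changed letter in the street name
print(weighted_edit_distance("123 Wisconsin Ave", "123 Wisconsim Ave"))  # 1.0
```

With these costs, a single changed digit pushes two addresses further apart than a single changed letter does, which matches the intuition described above.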

Using Machine Learning to Dedupe Your Salesforce

A machine learning algorithm can view a Salesforce record either as a single block or by each field individually. Below is a representation of the block approach:

Record 1	Record 2
Benjamin Harding 123 Wisconsin Avenue	Ben harding 123 Wisconsin Ave

This is how a field-by-field approach would look:

	Record 1	Record 2
First Name	Benjamin	Ben
Last Name	Harding	harding
Address	123 Wisconsin Avenue	123 Wisconsin Ave

The field-by-field approach is a lot more useful since it allows you to assign a weight to each individual field. For example, the field “Last Name” will be given more weight than the “First Name” field. Salesforce deduping tools based on machine learning will allow you to set the weights for each individual field and use those weights when comparing future records.
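A field-by-field comparison with weights could be sketched roughly as follows. The weights, the lowercasing step, the third record, and the use of Python's standard `difflib.SequenceMatcher` as the per-field similarity are all assumptions made for illustration, not the method of any specific tool.

```python
from difflib import SequenceMatcher

# Hypothetical hand-picked weights: "Last Name" counts twice as much as "First Name"
FIELD_WEIGHTS = {"First Name": 1.0, "Last Name": 2.0, "Address": 1.5}


def field_similarity(a: str, b: str) -> float:
    """Similarity between 0 and 1 for one field, ignoring letter case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(r1: dict, r2: dict, weights: dict = FIELD_WEIGHTS) -> float:
    """Weighted average of the per-field similarities."""
    total = sum(weights.values())
    score = sum(w * field_similarity(r1[f], r2[f]) for f, w in weights.items())
    return score / total


record_1 = {"First Name": "Benjamin", "Last Name": "Harding", "Address": "123 Wisconsin Avenue"}
record_2 = {"First Name": "Ben", "Last Name": "harding", "Address": "123 Wisconsin Ave"}
# A made-up, unrelated record for contrast
record_3 = {"First Name": "Maria", "Last Name": "Lopez", "Address": "77 Elm Street"}

print(record_similarity(record_1, record_2))  # high: likely duplicates
print(record_similarity(record_1, record_3))  # low: likely distinct
```

Dividing by the sum of the weights keeps the final score between 0 and 1, so a single threshold can be applied regardless of how the weights are chosen.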

The Benefits of Using Machine Learning to Dedupe Your Salesforce

One of the biggest benefits you get with a machine learning-based approach is active learning. Basically, when you label two records as duplicates (or not) the system will “learn” from these actions and will adjust its algorithms accordingly to identify such records as duplicates or unique in the future. The weights that we talked about in the previous section will continuously be modified to fit your particular data.
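In the machine learning literature, active learning also covers how the system decides which record pairs to ask you about. One common strategy is uncertainty sampling: ask about the pair the model is least sure of. The sketch below assumes the model has already assigned a duplicate probability to each candidate pair; the names and probabilities are made up for illustration.

```python
def most_uncertain_pair(scored_pairs):
    """Return the (pair, probability) entry whose probability is closest to 0.5."""
    return min(scored_pairs, key=lambda item: abs(item[1] - 0.5))


# Made-up candidate pairs with made-up duplicate probabilities
candidates = [
    (("Benjamin Harding", "Ben harding"), 0.93),
    (("Benjamin Harding", "Benita Hardin"), 0.52),
    (("Benjamin Harding", "Maria Lopez"), 0.04),
]

pair, probability = most_uncertain_pair(candidates)
print(pair, probability)  # the pair a human reviewer would be asked to label next
```

Labeling the borderline pair teaches the model more than confirming a pair it is already confident about, which is why relatively few labels can go a long way.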

Speaking of weights, exactly how much more weight should be given to the “Last Name” field than the “First Name” field? Is it 2 times more or 1.7? A human being would struggle to work out the appropriate weights by hand. A machine learning algorithm, on the other hand, can process very large amounts of data and has no problem coming up with the necessary weights for each field. One common technique for calculating these weights from labeled examples is regularized logistic regression.
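Here is a minimal sketch of how regularized logistic regression could learn field weights from labeled pairs, written in plain Python with gradient descent. Each training example is a list of per-field similarity scores (first name, last name, address) and each label is 1 for a duplicate or 0 for a distinct pair. The training data, the function names, and the hyperparameters are all invented for illustration.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x) -> float:
    """Probability that a pair with field similarities x is a duplicate."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def train_field_weights(examples, labels, l2=0.01, learning_rate=0.5, epochs=2000):
    """Fit one weight per field with L2-regularized logistic regression."""
    n, n_fields = len(examples), len(examples[0])
    weights, bias = [0.0] * n_fields, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_fields, 0.0
        for x, y in zip(examples, labels):
            error = predict(weights, bias, x) - y
            for i, xi in enumerate(x):
                grad_w[i] += error * xi
            grad_b += error
        for i in range(n_fields):
            # The l2 term is the regularization: it pulls each weight toward zero
            weights[i] -= learning_rate * (grad_w[i] / n + l2 * weights[i])
        bias -= learning_rate * grad_b / n
    return weights, bias


# Invented similarity scores: [first name, last name, address]
examples = [
    [0.55, 1.00, 0.92], [0.30, 1.00, 0.85], [0.90, 0.95, 1.00],  # duplicates
    [0.90, 0.20, 0.10], [0.40, 0.30, 0.25], [1.00, 0.10, 0.30],  # distinct
]
labels = [1, 1, 1, 0, 0, 0]

weights, bias = train_field_weights(examples, labels)
print(weights)  # on this toy data, the last name weight ends up above the first name weight
```

In this toy data, first names match about as often in distinct pairs as in duplicate pairs, so the model learns to rely on the last name and address instead. The regularization term keeps the weights from growing without bound when the labeled examples are few.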

Additional Benefits of Using Machine Learning

When you are using a traditional rule-based tool to dedupe your Salesforce, you need to set up complex rules to catch all of the “fuzzy” duplicates. This is very time-consuming and ultimately futile, since there is no way to set up a rule for every possible scenario. Machine learning does all of this work for you. There are no complex rules to set up. All you have to do is download and install the tool and you can get started right away. The algorithm will be fully customizable for your individual needs and will adjust all of the field weights automatically.

If the tool you are currently using has let you down, consider switching over to the machine learning approach. It will help you identify and remove duplicates more quickly and efficiently, which makes it a lot more attractive than the rule-based one. Like most apps on the AppExchange, such tools offer a limited free trial, so you can try one out for yourself and see firsthand the difference machine learning can make in improving the health of your data.