Let's say you have two Data Extensions, one that contains customer information and one that contains order information as shown below:
|ID||First Name||Last Name|
The two Data Extensions are matched using the ID column from the Customers Data Extension and the CustomerID column from the Orders Data Extension. In total, you have 60 rows in your Customers Data Extension and 300 rows in your Orders Data Extension. One customer can have many orders and all CustomerIDs inside Orders Data Extension exist in the ID column of Customers Data Extension.
You want to get all customers that have placed an order, alongside an order's information, so you add those two Data Extensions to your Selected Data Extensions and only keep the matched fields of the two Data Extensions (INNER JOIN) and store the results in a Data Extension that looks as follows:
|CUSTOMERS WITH ORDERS
|CustomerID||First Name||Last Name||Order Date||Amount|
Where CustomerID is the primary key of the Data Extension.
After running your selection you get 60 results. But afterward, you decide you would like to perform an A/B test so you decide to use the Sort & Limit functionality to randomly select 50% of your results and rerun, but still get 60 results on your selection.
The reason behind this is that Sort & Limit is applied to the query level before the data are saved inside the Target Data Extension.
When the Data Extensions are combined they actually generate 300 results, and the CustomerID field is not unique. When the data are getting saved inside the Target Data Extension the duplicates are being removed, since CustomerID is the primary key and it must be unique, and you end up seeing only 60 results.
So, when you apply Sort & Limit the number of results you see will be half of the originally generated data. In this case, what happened is that the original query generated 300 results, so half of that would be 150 rows, and when those were stored in the Target Data Extension the duplicates were removed and you got again 60 rows.
Keep in mind that the opposite might happen as well, you might see a smaller amount of results than the one expected in case of duplicates in a primary key column that got removed.
To avoid such unpredictable behavior, it is recommended to either use our Prio Deduplication feature or make sure that when dealing with one to many relationships, you use as the primary key of your Target Data Extension the primary key of the latter (in the example provided ID of Orders Data Extension)
Find out more about how to use Prio Deduplication on the links below:
- How can I deduplicate the results of my selection?
Please sign in to leave a comment.