Match and Merge duplicates using rules

osaka · 22 February 2024 05:55

Is it possible to identify duplicates using rules and then merge or create new records.
For. e.g. I have contacts from several sources with name, email, phone number.
Some sources may give nam, email. Some can give name and phone.
Is it possible to identify the duplicates using rules.
and then tag all the duplicates with one common id.
Then take the best of values based on some score again calculated via rule.
Thinking how to query the mastered or golden record values vs source only values.
And how to represent that each attribute came from which source and source id.

e.g.

record 1: 
name: Bob
email : bob@gmail.com
source: Facebook
sourceId:889001

record 2: Robert
email: bob@gmail.com
source: Twitter
sourceId: @bob23
phone: "+1-214-111-1111"

Result: Option 1: Created via rule

New Record 3:  
name: Robert, Bob: Best Value: Robert
email: bob@gmail.com
sources: [{
source: Twitter
sourceId: @bob23
}, {
source: Facebook
sourceId:889001
}
]
phone: "+1-214-111-1111"
email: bob@gmail.com
mastered id: 101 or select one id from one of the sources based on some logic

Result option 2: Edit one of the existing records and enrich values there

krishnan · 22 February 2024 10:27

Do you have a deterministic way to determine which the duplicates are?

At the moment, there’s not a great way to do it because you can’t extend relations using rules, nor can you do aggregate queries within rules (We’re looking into it for a later version of 3.x)

You can have a set of rules of the form duplicate-because-same-<x>
E.g.:

rule duplicate-because-same-email:
when {
  $r1 isa record, has email $e;
  $r2 isa record, has email $e;
} then {
  (record: $r1, record: $r2) isa duplication; 
}

Something like this should get you as far as returning all duplicate records for a given record:

match 
$r1 isa record, has sourceId "889001";
fetch
duplicates:  { 
   match 
     { $r1 is $r2; } or { ($r1, $r2) isa duplication; };
   fetch:
     $r2: attribute;

Without aggregation in rules, the scoring bit is a bit hard to do. You might have to do it client side.

Also remember that rules are an extension to queries and are meant for transactional workloads. Is there a reason you’d prefer to use rules than having an external, periodic batch-process sweeping through your database and merging records?

krishnan · 22 February 2024 10:38

Oops. I forgot a transitive rule. You’ll want:

rule duplicate-because-same-email:
when {
  (record: $r1, record: $r2) isa duplication; 
  (record: $r2, record: $r3) isa duplication; 
} then {
  (record: $r1, record: $r3) isa duplication; 
}

osaka · 22 February 2024 14:51

Thanks for your response.
Periodic is the right way as you said.
But the challenges are, if sources changes the data we need identify and rematch things.
And redo merges/unmerges and all.
Thinking if we can make it query time population so that the new data generated is not stale.

Also about changing data.
Is there a way to prevent data to be modified as there is a reference of the data used by some other data. E.g. someone created a list of contacts they are interested for some purpose say campaign. But now what if the data from the source changed, e.g name and email changed.
There is a stale copy in campaign or if we used reference, modified data which was not reviewed.
Is there to stop or warn if the update is going to cause data issue.

krishnan · 22 February 2024 15:11

About changing data: There isn’t a good way to enforce the data to be “read-only” if it’s been referred to in a copy somewhere.

There is a stale copy in campaign or if we used reference, modified data which was not reviewed.

^ This will indeed happen. I can see why query-time makes sense for you.
How would you determine what the “best-value” is? Is it dependent only on the value itself, or on the other attributes of the sources too? Can you assign each value a ‘score’ so you can pick the one with the highest score?
Or can you only determine whether one value is better than the other?

osaka · 23 February 2024 01:48

About the score, few attributes are solely scored based on the value it has. Majority will be based on the source.
Very rarely used, based on the frequency of occurence of the value among all the duplicate records.

Yeah I would want to do it that way.

krishnan · 29 February 2024 12:50

You might be able to do it with the subqueries in fetch.
Assuming you have a rule to compute the rule of an attribute (here, email):

 match 
$r isa record, has sourceId "889001";
fetch
best_email:  { 
   match 
     { $r is $r_best; } or { ($r, $r_best) isa duplication; }; 
     $r_best has email $e_best;
     $e_best has score $es_best;
     not {
         ($r_other, $r_best) isa duplication;
         $r_other has email $e_other;
         $e_other has score $es_other;
         $es_other > $es_best;
     };
     
   fetch:
     $e_best;

Topic		Replies	Views
Inference taking forever, any suggestions? TypeDB	8	174	17 May 2024
Infer existence of an entity? TypeQL	1	582	13 July 2022
How many rules does the TypeDB support? TypeDB	19	760	8 December 2022
Include all explanations for an inferred relation? TypeDB typedb-server	4	562	10 February 2023
Why inference doesn't work? TypeDB	2	132	15 May 2024

Services

Platform

Tools

What is TypeDB?

Try cloud free

Pricing and plans

Learn

Community

Learn

Content

Match and Merge duplicates using rules

Services

Platform

Tools

Learn

Community

Learn

Content

Match and Merge duplicates using rules

Related topics