Is it possible to identify duplicates using rules and then merge or create new records.
For. e.g. I have contacts from several sources with name, email, phone number.
Some sources may give nam, email. Some can give name and phone.
Is it possible to identify the duplicates using rules.
and then tag all the duplicates with one common id.
Then take the best of values based on some score again calculated via rule.
Thinking how to query the mastered or golden record values vs source only values.
And how to represent that each attribute came from which source and source id.
e.g.
record 1:
name: Bob
email : bob@gmail.com
source: Facebook
sourceId:889001
record 2: Robert
email: bob@gmail.com
source: Twitter
sourceId: @bob23
phone: "+1-214-111-1111"
Result: Option 1: Created via rule
New Record 3:
name: Robert, Bob: Best Value: Robert
email: bob@gmail.com
sources: [{
source: Twitter
sourceId: @bob23
}, {
source: Facebook
sourceId:889001
}
]
phone: "+1-214-111-1111"
email: bob@gmail.com
mastered id: 101 or select one id from one of the sources based on some logic
Result option 2: Edit one of the existing records and enrich values there
Do you have a deterministic way to determine which the duplicates are?
At the moment, thereās not a great way to do it because you canāt extend relations using rules, nor can you do aggregate queries within rules (Weāre looking into it for a later version of 3.x)
You can have a set of rules of the form duplicate-because-same-<x>
E.g.:
rule duplicate-because-same-email:
when {
$r1 isa record, has email $e;
$r2 isa record, has email $e;
} then {
(record: $r1, record: $r2) isa duplication;
}
Something like this should get you as far as returning all duplicate records for a given record:
match
$r1 isa record, has sourceId "889001";
fetch
duplicates: {
match
{ $r1 is $r2; } or { ($r1, $r2) isa duplication; };
fetch:
$r2: attribute;
Without aggregation in rules, the scoring bit is a bit hard to do. You might have to do it client side.
Also remember that rules are an extension to queries and are meant for transactional workloads. Is there a reason youād prefer to use rules than having an external, periodic batch-process sweeping through your database and merging records?
Thanks for your response.
Periodic is the right way as you said.
But the challenges are, if sources changes the data we need identify and rematch things.
And redo merges/unmerges and all.
Thinking if we can make it query time population so that the new data generated is not stale.
Also about changing data.
Is there a way to prevent data to be modified as there is a reference of the data used by some other data. E.g. someone created a list of contacts they are interested for some purpose say campaign. But now what if the data from the source changed, e.g name and email changed.
There is a stale copy in campaign or if we used reference, modified data which was not reviewed.
Is there to stop or warn if the update is going to cause data issue.
About changing data: There isnāt a good way to enforce the data to be āread-onlyā if itās been referred to in a copy somewhere.
There is a stale copy in campaign or if we used reference, modified data which was not reviewed.
^ This will indeed happen. I can see why query-time makes sense for you.
How would you determine what the ābest-valueā is? Is it dependent only on the value itself, or on the other attributes of the sources too? Can you assign each value a āscoreā so you can pick the one with the highest score?
Or can you only determine whether one value is better than the other?
About the score, few attributes are solely scored based on the value it has. Majority will be based on the source.
Very rarely used, based on the frequency of occurence of the value among all the duplicate records.
You might be able to do it with the subqueries in fetch.
Assuming you have a rule to compute the rule of an attribute (here, email):
match
$r isa record, has sourceId "889001";
fetch
best_email: {
match
{ $r is $r_best; } or { ($r, $r_best) isa duplication; };
$r_best has email $e_best;
$e_best has score $es_best;
not {
($r_other, $r_best) isa duplication;
$r_other has email $e_other;
$e_other has score $es_other;
$es_other > $es_best;
};
fetch:
$e_best;