Sunday 4 June 2017

Aggregation Framework - finding & removing the duplicate elements in MongoDB

 Given a collection  name product having following documents in the collection.


{ "_id" : NumberInt("1"),"account" : "abc","vendor" : "amazon","category" : "mobile"},
{"_id" : NumberInt("2"), "account" : "abc","vendor" : "overstock","category" : "mobile"},
{"_id" : NumberInt("3"),"account" : "xyz","vendor" : "fashionStreet","category" : "fashion"},
{"_id" : NumberInt("4"),"account" : "pqr","vendor" : "foodTruck","category" : "food"},
{"_id" : NumberInt("5"),"account" : "abc","vendor" : "amazon", "category" : "mobile"}

If one notice there are duplicate elements exist in the collection with the composite key(account+ vendor+ category) marked in blue on colour , so how do we identify the duplicates in a collection and how do we remove the duplicate keys in the MongoDB?

This problem is easily solved using Aggregation pipeline framework which provides  $group operator to group an compound or composite key along with the $match operator to match a criteria.

db.product.aggregate([

{ $group: { 

_id: { account: "$account", vendor: "$vendor", category: "$category" }, 
count: { $sum: 1 } 
}}, 
{
    $match: {
        count: { $gt: 1 }
    }
}, 
{
 $sort: { 
    count : -1 }
 }
]);

Above query find all the duplicates elements whose account, vendor , category  values are same and having duplicate values and output of the above query is as follow  with the descending order of count : 

Output

{
"_id" : {
"account" : "abc",
"vendor" : "amazon",
"category" : "mobile"
},
"count" : 2
}

Note that _id is made composite of  3 columns account , vendor , category so in general 

_id :  { key1 : "$key1",  key2 : "$key2", key3 : "$key3", key4 : "$key4", key5 : "$key5" }

and count plays an important role to find out the number of duplicates in the collection.

With the above example , we have identified the duplicate elements in the collection , so the next question arises , how to remove duplicate keys in a collection in MongoDB?

var duplicates = [];
db.product.aggregate([
  { $group: { 
    _id: { account: "$account" , vendor : "$vendor", category : "$category"}, // can be grouped on multiple properties/keys/columns 
    dups: { "$addToSet": "$_id" }, 
    count: { "$sum": 1 } 
  }}, 
  { $match: { 
    count: { "$gt": 1 }    // Duplicates considered as count greater than one
  }}
])             
.forEach(function(doc) {
    doc.dups.shift();      // First element skipped for deleting and important line here
    doc.dups.forEach( function(dupId){ 
        duplicates.push(dupId);   // Getting all duplicate ids
        }
    )    
});

// If you want to Check all "_id" which you are deleting
printjson(duplicates); 

Output : 

[ NumberInt("1") ]



Now we have identified the array of duplicate "_id"   as shown in the output of the above query.
To remove all the duplicates from the collection  , need to execute the following 

// Remove all duplicates in one go    
db.product.remove({_id:{$in:duplicates}});

db.product.find({});

Output

{"_id" : NumberInt("2"), "account" : "abc","vendor" : "overstock","category" : "mobile"},
{"_id" : NumberInt("3"),"account" : "xyz","vendor" : "fashionStreet","category" : "fashion"},
{"_id" : NumberInt("4"),"account" : "pqr","vendor" : "foodTruck","category" : "food"},
{"_id" : NumberInt("5"),"account" : "abc","vendor" : "amazon", "category" : "mobile"}


So you will get an amazing result with the all the records are unique in the collection.

Be alert while applying an index on the composite columns with {unique: true, dropDups: true}). 

Suppose in the above scenario we have applied these condition then it will throw an error as 


{
    "createdCollectionAutomatically" : false,
    "numIndexesBefore" : 1,
    "errmsg" : "exception: E11000 duplicate key error index
    "code" : 11000,
    "ok" : 0
}
Note that dropDups has been removed in the Mongo 3.x release and has a severe disadvantage of removing the any random duplicate element.

Even if you apply ensureIndex() or createIndex()  at the time of collection having duplicate keys then it will prompt   exception "exception: E11000 duplicate key error index"  , so its important to remove the duplicate elements with the above mentioned aggregation example and  then applying the {unique: true}  on the composite index created with the fields say here {account, vendor , category}.


Enjoy the MongoDB  and the brilliant  Aggregation Framework !!!!

Enjoy  the coding with tek9g!!! 

1 comment:

  1. The most effective method to Solve MongoDB Error Message 11000 through MongoDB Technical Support
    At whatever point you confront this Error 11000 which basically signifies "Copy Key Error Index" it implies the file has two passages for a similar ID. To get very best arrangement with respect to this issue, contact to MongoDB Online Support or MongoDB Customer Support USA. We at Cognegic empower proactive checking and observing of your whole database and help to keep the issue before they happen. We just consider what is essential for the best practice and diminish single purpose of disappointment.
    For More Info: https://cognegicsystems.com/
    Contact Number: 1-800-450-8670
    Email Address- info@cognegicsystems.com
    Company’s Address- 507 Copper Square Drive Bethel Connecticut (USA) 06801

    ReplyDelete