Given a collection name
product having following documents in the collection.
{ "_id" : NumberInt("1"),
"account" : "abc","vendor" : "amazon","category" : "mobile"},
{"_id" : NumberInt("2"), "account" : "abc","vendor" : "overstock","category" : "mobile"},
{"_id" : NumberInt("3"),"account" : "xyz","vendor" : "fashionStreet","category" : "fashion"},
{"_id" : NumberInt("4"),"account" : "pqr","vendor" : "foodTruck","category" : "food"},
{"_id" : NumberInt("5"),
"account" : "abc","vendor" : "amazon", "category" : "mobile"}
If one notice there are duplicate elements exist in the collection with the composite key(account+ vendor+ category) marked in blue on colour , so how do we identify the duplicates in a collection and how do we remove the duplicate keys in the MongoDB?
This problem is easily solved using Aggregation pipeline framework which provides $group operator to group an compound or composite key along with the $match operator to match a criteria.
db.product.aggregate([
{ $group: {
_id: { account: "$account", vendor: "$vendor", category: "$category" },
count: { $sum: 1 }
}},
{
$match: {
count: { $gt: 1 }
}
},
{
$sort: {
count : -1 }
}
]);
Above query find all the duplicates elements whose account, vendor , category values are same and having duplicate values and output of the above query is as follow with the descending order of count :
Output :
{
"_id" : {
"account" : "abc",
"vendor" : "amazon",
"category" : "mobile"
},
"count" : 2
}
Note that _id is made composite of 3 columns account , vendor , category so in general
_id : { key1 : "$key1", key2 : "$key2", key3 : "$key3", key4 : "$key4", key5 : "$key5" }
and count plays an important role to find out the number of duplicates in the collection.
With the above example , we have identified the duplicate elements in the collection , so the next question arises , how to remove duplicate keys in a collection in MongoDB?
var duplicates = [];
db.product.aggregate([
{ $group: {
_id: { account: "$account" , vendor : "$vendor", category : "$category"}, // can be grouped on multiple properties/keys/columns
dups: { "$addToSet": "$_id" },
count: { "$sum": 1 }
}},
{ $match: {
count: { "$gt": 1 } // Duplicates considered as count greater than one
}}
])
.forEach(function(doc) {
doc.dups.shift(); // First element skipped for deleting and important line here
doc.dups.forEach( function(dupId){
duplicates.push(dupId); // Getting all duplicate ids
}
)
});
// If you want to Check all "_id" which you are deleting
printjson(duplicates);
Output :
[ NumberInt("1") ]
Now we have identified the array of duplicate "_id" as shown in the output of the above query.
To remove all the duplicates from the collection , need to execute the following
// Remove all duplicates in one go
db.product.remove({_id:{$in:duplicates}});
db.product.find({});
Output :
{"_id" : NumberInt("2"), "account" : "abc","vendor" : "overstock","category" : "mobile"},
{"_id" : NumberInt("3"),"account" : "xyz","vendor" : "fashionStreet","category" : "fashion"},
{"_id" : NumberInt("4"),"account" : "pqr","vendor" : "foodTruck","category" : "food"},
{"_id" : NumberInt("5"),
"account" : "abc","vendor" : "amazon", "category" : "mobile"}
So you will get an amazing result with the all the records are unique in the collection.
Be alert while applying an index on the composite columns with
{unique: true, dropDups: true}).
Suppose in the above scenario we have applied these condition then it will throw an error as
{
"createdCollectionAutomatically" : false,
"numIndexesBefore" : 1,
"errmsg" : "exception: E11000 duplicate key error index
"code" : 11000,
"ok" : 0
}
Note that dropDups has been removed in the Mongo 3.x release and has a severe disadvantage of removing the any random duplicate element.
Even if you apply ensureIndex() or createIndex() at the time of collection having duplicate keys then it will prompt exception "exception: E11000 duplicate key error index" , so its important to remove the duplicate elements with the above mentioned aggregation example and then applying the {unique: true} on the composite index created with the fields say here {account, vendor , category}.
Enjoy the MongoDB and the brilliant Aggregation Framework !!!!
Enjoy the coding with tek9g!!!