Data is everywhere, but?: Dropping Duplicates

Wednesday, February 2, 2011

Dropping Duplicates

If you search for "Duplicate rows sql server", in Google you will get about 136,000 results. This will tell how series the problem is. I am not going to add another post to above huge number of articles. What I going to explain here is how some other database system does this.

I was able to attend to a MongoDB conference last month and came across few features which are nice to have features in SQL Server.

In MongoDB, there are few ways for creating indexes, one is stated below.

db.users.ensureIndex({UserID:1},{"unique":true, "dropDups":true})

Just to tell you what the above script is, users is the "table" (Collection in MongoDB) and UserID is the "column" (Attribute in MongoDB") and ensureIndex is the command to create the index.

so, in this statement, unique:true will create a unique index BUT if there are duplicates records, it will retain only the first "record" ( Document in MongoDB" ) and others will be deleted because of the dropDups:true statment.

Now, this is very efficient way of deleting duplicates, but you need to pay more attention before issuing this command.

5 comments:

Dinesh AsankaFebruary 4, 2011 at 2:19 PM
Here is a general SQL method that will log deletes and that should work with modern, ANSI compliant databases supporting standard OLAP functions. In this example tab1 has 3 nullable smallint columns, named not very imaginatively col1, col2, col3. I ran this in DB2 Personal Edition V9.7.1 just now:

delete from
(select col1,col2,col3,
row_number() over (order by col1,col2,col3) RN,
rank() over (order by col1, col2, col3) RNK
from tab1) temp
where RN != RNK

Here is a general SQL method that will log deletes and that should work with modern, ANSI compliant databases supporting standard OLAP functions. In this example tab1 has 3 nullable smallint columns, named not very imaginatively col1, col2, col3. I ran this in DB2 Personal Edition V9.7.1 just now:

delete from
(select col1,col2,col3,
row_number() over (order by col1,col2,col3) RN,
rank() over (order by col1, col2, col3) RNK
from tab1) temp
where RN != RNK

You can easily modify this query to select and verify qualifying rows before they are deleted. In fact, in DB2, you can do both report the deleted rows and run the delete at the same time. No idea if there is a SQL Server equivalent as I eschew Micro$loth.Note: this is a general method, posted for interest only, and I would not use it in DB2 for tables of significant cardinality.

By Alex Levy
ReplyDelete
Replies
Dinesh AsankaFebruary 4, 2011 at 2:19 PM
• Here's a SQL Server method:

1. Create a temp table with the same structure as the source table.
2. Create a unique index on the temp table with "ignore dupe key"
3. Insert the source table into the temp table (duplicates will be removed)
4. Truncate the source table
5. Insert the temp table into the source table.

By Michael Austin
ReplyDelete
Replies
Dinesh AsankaFebruary 4, 2011 at 3:47 PM
I don't think I'd want to do it that way with a large table- would at least double the TLOG use and the physical space used by the table's data.

There are ways to dedupe without making a new copy of the data. Research rownum and windowing functions...

Having said that any database designer who creates DDL that allows dupes in a table should be fired anyway.
regards

By Kenneth Ambrose
ReplyDelete
Replies
Dinesh AsankaFebruary 4, 2011 at 3:48 PM
Michael Austin described one way to remove duplicates in MS SQL Server. There are other ways of accomplishing the same goal and that is the beauty of most products in our industry. Dinesh described a way of doing it using MongoDB and Kenneth Ambrose made some recommendations for Oracle.

My only advice is to make sure you analyze the whole data row before removing duplicate key values and that entire rows are duplicate and not just the columns to be used as a unique constraint.

By Sal Young
ReplyDelete
Replies
Dinesh AsankaFebruary 5, 2011 at 5:18 PM
You need read MS-SQL book online, and understand T-SQL;)

http://msdn.microsoft.com/en-us/library/aa299742(v=SQL.80).aspx

Dropping statement for deleting a table not a duplicates in a row?

- You can delete duplicated row by folowing http://support.microsoft.com/kb/139444
temp table can be created by Select into statement

-Use DISTINCTROW

Posted by Osama Abu-Arisheh
ReplyDelete
Replies

Add comment

Translate

Wednesday, February 2, 2011

Dropping Duplicates

5 comments: