PDA

View Full Version : Find duplicates in each column, then delete rows containing only Subsequent duplicate



Diakonos1984
09-06-2013, 07:12 AM
I have a large spreadsheet (5000+ rows, with 50+ columns). It is a database report of items that have relationships, and it ends up listing every possible permutation of the related items. Roughly half of this spreadsheet ends up being rows made up entirely of duplicate data.

What I need to be able to do is search selected columns and identify duplicate data, differentiating between first instance and subsequent instances. Then I need to delete each row that contains ONLY subsequent duplicate data. If a row contains even one unique or first instance item, it is NOT to be deleted.

I have written a series of subroutines to do this, and they mostly work. It's just super slow. Can you identify some performance killers and help me speed this up?

I have uploaded a half-size version of the spreadsheet if you want to play with it. The full version is too big! 10552
On this half-size version with 3727 rows, the whole routine takes about 20 minutes and deletes 2549 rows.

Complete code is below:


Sub SKMMainRPurgeDups() 'Main wrapper for finding duplicates in each column and then deleting rows containing only Subsequent duplicates

Dim wsTarget As Worksheet
Dim strStarTime As String
Dim strStopTime As String

strStartTime = Now()
Set wsTarget = ActiveSheet
SKMMainRIDDups wsTarget, vbGreen, vbRed
SKMMainRDeleteDups wsTarget, vbGreen, vbRed
wsTarget.ListObjects(1).DataBodyRange.Interior.ColorIndex = xlNone
strStopTime = Now()
Debug.Print "Started at " & strStartTime & "; finished at " & strStopTime
Beep 'since I'm probably not paying attention....


End Sub

'####################################################

Sub SKMMainRIDDups(wsTarget As Worksheet, First_Color As Variant, Subsequent_Color As Variant)
'Identifies duplicates in each column

Dim indTarCol As Long
Dim indTarCols As Long

Debug.Print "Starting SKMMainRIDDups: " & Now()
Application.ScreenUpdating = False
Set wsTarget = ActiveSheet

indTarCols = wsTarget.ListObjects(1).ListColumns.Count
'This For statement and the following two If statements limit my searching to specific columns. I don't care about duplicates in other columns.
For indTarCol = 5 To indTarCols
If Not wsTarget.ListObjects(1).HeaderRowRange.Cells(1, indTarCol).Find("tbl", LookIn:=xlValues, LookAt:=xlPart) Is Nothing Then
If Not wsTarget.ListObjects(1).HeaderRowRange.Cells(1, indTarCol).Find("Status", LookIn:=xlValues, LookAt:=xlPart) Is Nothing Then
Else
'wsTarget.ListObjects(1).ListColumns(indTarCol).DataBodyRange.Select 'for watching
MiscHighlightDup wsTarget.ListObjects(1).ListColumns(indTarCol).DataBodyRange, First_Color, Subsequent_Color
End If
End If
Next indTarCol

Application.ScreenUpdating = True
Debug.Print "Finished SKMMainRIDDups: " & Now()

End Sub

'####################################################

Sub SKMMainRDeleteDups(wsTarget As Worksheet, First_Color As Variant, Subsequent_Color As Variant)
'Deletes rows containing only subsequent duplicates, as determined by SKMMainRIDDups

Dim rngTarget As Range
Dim tarCell As Range
Dim indTarRow As Long 'row index in Target
Dim indTarCol As Long 'column index in Target
Dim indTarRows As Long 'number of rows in Target
Dim indTarCols As Long 'number of columns in Target
Dim flDontDelete As Boolean
Dim indRowsDeleted As Long 'statistics

Debug.Print "Starting SKMMainRDeleteDups: " & Now()
Application.ScreenUpdating = False

Set rngTarget = wsTarget.ListObjects(1).DataBodyRange
indTarRows = rngTarget.Rows.Count
indTarCols = rngTarget.Columns.Count
indRowsDeleted = 0

For indTarRow = indTarRows To 1 Step (-1) 'To delete from Table need to start at bottom
flDontDelete = False 'we're going to delete this row unless we find a good reason not to!

'This For statement and the following two If statements limit my searching to specific columns. I don't care about duplicates in other columns.
For indTarCol = 5 To indTarCols
If Not wsTarget.ListObjects(1).HeaderRowRange.Cells(1, indTarCol).Find("tbl", LookIn:=xlValues, LookAt:=xlPart) Is Nothing Then
If Not wsTarget.ListObjects(1).HeaderRowRange.Cells(1, indTarCol).Find("Status", LookIn:=xlValues, LookAt:=xlPart) Is Nothing Then
Else
'This If statement prevents wasted time from checking empty cells
If rngTarget.Cells(indTarRow, indTarCol) <> "" Then
Debug.Print "Checking Row " & indTarRow & " of " & indTarRows
'rngTarget.Cells(indTarRow, indTarCol).Select
If rngTarget.Cells(indTarRow, indTarCol).Interior.Color = Subsequent_Color Then
flDontDelete = False 'We found a Subsequent duplicate
'As long as we get to the end of this row and find no Unique items or First duplicates, we can delete the row
Else
flDontDelete = True 'If it's not a Subsequent duplicate, it must be a Unique item or a First duplicate
GoTo MoveAlong: 'No sense searching the rest of this row, bail out here
End If
End If
End If
End If
Next indTarCol

MoveAlong: 'We either got to the end of the row, or bailed out on a row that we know we won't delete
If flDontDelete = False Then 'row contained only Subsequent duplicates. IT'S OUTA HERE!
Debug.Print "Deleting Row " & indTarRow & " of " & indTarRows
Application.ScreenUpdating = True
'rngTarget.Rows(indTarRow).Select 'used for watching
rngTarget.Rows(indTarRow).Delete
indRowsDeleted = indRowsDeleted + 1
Application.ScreenUpdating = False
End If
Next indTarRow

Application.ScreenUpdating = True
Debug.Print "Finished SKMMainRIDDups: " & Now() & " Deleted " & indRowsDeleted & " rows of " & indTarRows

End Sub

'####################################################

Sub MiscHighlightDup(rngTarget As Range, First_Color As Variant, Subsequent_Color As Variant)
'rngTarget should be a single column
'Colors the first instance of a duplicate datum, and all subsequent instances with a different color
Dim tarCell As Range
'initially tried to make it JUST the populated range.
Dim rngFoundPop As Range 'Couldn't get that to work, so it's the whole range, less the parts we've already searched, gets smaller as we go!
Dim rngFoundDup As Range 'the range of duplicates
Dim indTarRows As Long

Debug.Print "Starting MiscHighlightDup for column " & rngTarget.Column & ": " & Now()
indTarRows = rngTarget.Rows.Count
Set rngFoundPop = rngTarget
'rngFoundPop.Select 'used for watching
For Each tarCell In rngTarget
If tarCell.Interior.ColorIndex = xlNone And tarCell.Value <> "" Then 'IsEmpty(tarCell) = False Then
'tarCell.Select 'used for watching
'Debug.Print "Analyzing Row " & tarCell.Row & " Of " & indTarRows 'Clutters the immediate window
Set rngFoundDup = MiscFind_Range(tarCell.Text, rngFoundPop, xlValues, xlWhole, True)
If Not rngFoundDup Is Nothing Then
If rngFoundDup.Address <> tarCell.Address Then
Debug.Print tarCell.Text & " has duplicates. First location: Column " & tarCell.Column & ", Row " & tarCell.Row
'rngFoundDup.Select 'used for watching
rngFoundDup.Interior.Color = Subsequent_Color
tarCell.Interior.Color = First_Color
End If
End If
End If
Set rngFoundPop = rngTarget.Range(Cells(tarCell.Row, 1), Cells(indTarRows, 1)) 'resizes the seached range so that it gets smaller (and we get faster!) as we go
'probably could be resized even more agressively, but too complicated for me right now
'rngFoundPop.Select 'used for watching
Next tarCell
Debug.Print "Finishing MiscHighlightDup for column " & rngTarget.Column & ": " & Now()

End Sub

'####################################################

Function MiscFind_Range(Find_Item As Variant, _
Search_Range As Range, _
Optional LookIn As Variant, _
Optional LookAt As Variant, _
Optional MatchCase As Boolean) As Range
'returns a possibly non-consecutive range containing all found items
'borrowed from http://www.ozgrid.com/forum/showthread.php?t=27240

Dim c As Range
Dim FirstAddress As String

If IsMissing(LookIn) Then LookIn = xlValues 'xlFormulas
If IsMissing(LookAt) Then LookAt = xlPart 'xlWhole
If IsMissing(MatchCase) Then MatchCase = False

With Search_Range
Set c = .Find( _
What:=Find_Item, _
LookIn:=LookIn, _
LookAt:=LookAt, _
SearchOrder:=xlByRows, _
SearchDirection:=xlNext, _
MatchCase:=MatchCase, _
SearchFormat:=False)
If Not c Is Nothing Then
Set MiscFind_Range = c
FirstAddress = c.Address
Do
Set MiscFind_Range = Union(MiscFind_Range, c)
Set c = .FindNext(c)
Loop While Not c Is Nothing And c.Address <> FirstAddress
End If
End With

End Function

Kenneth Hobs
09-06-2013, 07:25 AM
You took care of the screen updates. Another trick is to turn off automatic calculation. Those and other speed tricks are in my speed routines. http://vbaexpress.com/kb/getarticle.php?kb_id=1035

snb
09-06-2013, 07:48 AM
Sub tst()
[a1:a4000].Offset(, 60) = [if(C1:C4000="","",if(countif(offset($C$1,,,row(C1:C4000)),C1:C4000)=1,C1:C4000,""))]
Columns(61).SpecialCells(4).EntireRow.Select ' for illustration purposes only
' Columns(61).SpecialCells(4).EntireRow.Delete
End Sub

Diakonos1984
09-06-2013, 08:25 AM
Wow, snb, that is much more compact and faster than mine. It's not quite right though--it deletes too much. Here's some pictures, maybe we can figure out how to refine it...
This is rows 10-13 of the sample spreadsheet. (See following posts for remaining pictures).

10553

Diakonos1984
09-06-2013, 08:25 AM
This is what your code would delete (shown in red):
10554

Diakonos1984
09-06-2013, 08:28 AM
This is what actually needs to be deleted (shown in red)
Green indicates first instance, blue indicates subsequent.
Only in the fourth row are all items subsequent duplicates. Therefore, only the fourth row should be deleted.

10555

Got any ideas on how to refine your code for that?

Thanks a bunch for the help, btw.

Diakonos1984
09-06-2013, 09:05 AM
That's some nice code, Kenneth. I had been thinking about doing something like that myself (with a Boolean switch passed to it to turn it on or off for troubleshooting), but maybe I'll just co-opt yours!
My macro still took about 20 minutes to run with your SpeedOn... I was hoping to find someway to eliminate or minimize the loops...

Diakonos1984
09-06-2013, 09:19 AM
I'm trying to understand your code, snb...
I see you are using an extra column off to the side to determine whether a row should be kept of deleted. It's this part that is challenging me:

[if(C1:C4000="","",If(countif(offset($C$1,,,row(C1:C4000)),C1:C4000)=1,C1:C4000,""))]
1) Are you only evaluating column C? I guess that's okay--I'd need to wrap it in a loop to evaluate multiple columns...
2) I'm not familiar with this usage of square brackets. What does the [ and ] accomplish for us?
3) I'm familiar with the countif and offset, but that implementation is not straightforward to me (that's how you can keep it so compact!)

I'm going to try to run this evaluation for several more columns, and have it write over column 60 only if it's blank....

snb
09-06-2013, 09:30 AM
Why not indicatiing in the file you posted which rows should be deleted and why ?

After all this is an Excel subforum, no photoshop (your images are too small to be of any value and do not represent the file you posted)

We don't need >3000 rows containing data to write a macro. I'd say 30 representative rows should suffice.

Kenneth Hobs
09-06-2013, 09:34 AM
SNB is tops for concise code.

For me to help best, make say a 10 row workbook with real data only obfuscated if need be. On Sheet 2 or further down the worksheet, manually make it look like you want. Note the parts that you define as duplicate. My guess is that we could help you.

You might find my multi-find routine helpful. Chip Pearson has a good one too. http://www.vbaexpress.com/forum/showthread.php?t=38802

Diakonos1984
09-06-2013, 10:03 AM
Ok, guys!

Here's a spreadsheet with good samples. I should have done that all along (sorry!)
10557

There are 4 sheets:
1) Clean Start: Use this one. It's 17 rows and 12 columns.
2) Duplicates Indicated: Green is first instance, blue is subsequent instances. You'll notice I'm only evaluating 5 of the 12 columns. The code to determine which ones to evaluate is in my original code at the top of this thread. The jist of it is if the column name includes "tbl" but NOT "Status" then it should be evaluated.
3) Deletions Indicated: Red indicates rows that have ONLY subsequent instances. These are to be deleted.
4) Final After Deletions: With Red rows deleted and all cells reset to xlNone (4 rows were deleted).

Thanks for your help.

snb
09-06-2013, 12:53 PM
I think you'd better use formulae in column N in the 'clean sheet'
See the attachment.

shrivallabha
09-06-2013, 11:09 PM
I think you'd better use formulae in column N in the 'clean sheet'
See the attachment.
You can drop all N's if you want to ;)

=IF((COUNTIF(OFFSET($E$1,,,ROW()),E2)>1)*(COUNTIF(OFFSET($F$1,,,ROW()),F2)>1)*(COUNTIF(OFFSET($I$1,,,ROW()),I2)>1)*(COUNTIF(OFFSET($K$1,,,ROW()),K2)>1)*(COUNTIF(OFFSET($L$1,,,ROW()),L2)>1),"X","")

Edit:Which can be whittled down to:

=IF((COUNTIF($E$2:E2,E2)>1)*(COUNTIF($F$2:F2,F2)> 1)*(COUNTIF($I$2:I2,I2)>1)*(COUNTIF($K$2:K2,K2)>1 )*(COUNTIF($L$2:L2,L2)>1),"X","")
All volatile functions thrown off...!

snb
09-07-2013, 03:35 AM
@Shri..

Yes !! :clap:

For our VBA adepts:


Sub M_snb()
sn = Sheets("Clean Start").Cells(1).CurrentRegion

With Application
For j = 2 To UBound(sn)
If (.Match(sn(j, 5), .Index(sn, 0, 5), 0) < j) * (.Match(sn(j, 6), .Index(sn, 0, 6), 0) < j) * (.Match(sn(j, 9), .Index(sn, 0, 9), 0) < j) * (.Match(sn(j, 11), .Index(sn, 0, 11), 0) < j) * (.Match(sn(j, 12), .Index(sn, 0, 12), 0) < j) Then c00 = c00 & "," & j
Next
End With

Sheets("Clean Start").Range(Mid(Replace(c00, ",", ",A"), 2)).EntireRow.Select
End Sub

You can see how you can expand this code easily to many more columns if you write the code alternatively:


Sub M_snb()
sn = Sheets("Clean Start").Cells(1).CurrentRegion

With Application
For j = 2 To UBound(sn)
x1 = (.Match(sn(j, 5), .Index(sn, 0, 5), 0) < j)
x2 = (.Match(sn(j, 6), .Index(sn, 0, 6), 0) < j)
x3 = (.Match(sn(j, 9), .Index(sn, 0, 9), 0) < j)
x4 = (.Match(sn(j, 11), .Index(sn, 0, 11), 0) < j)
x5 = (.Match(sn(j, 12), .Index(sn, 0, 12), 0) < j)
If x1 * x2 * x3 * x4 * x5 Then c00 = c00 & "," & j
Next
End With

Sheets("Clean Start").Range(Mid(Replace(c00, ",", ",A"), 2)).EntireRow.Select
End Sub