PDA

View Full Version : Compare and delete duplicated data from several columns



mpeterson
04-20-2013, 06:10 AM
I have found GTO's masterpiece script on this forum p=289279#post289279, titled "Compare and delete duplicated data" which compares between two columns and delete duplicated values in the second column. It is a masterpiece.

I was wondering if this script can be modified to suit my current data manipulation needs. I have 27 columns of sorted domain names by letter a-z and one extra column (27th) for domain names that start with a number or a hyphen. There's always new domain names to be added to the current worksheet "main_lists" which has about 3.7 million domain names.

First of all, if I am using GTO's script I would copy data column by column from "main_lists" worksheet to "Sheet1" where the new list of domain names are collected in ColB, having to do copy and past and run the script 27 times to delete duplicate data in the new data column B.

Secondly, GTO's script is fast with a small size of data, when I ran it between one column of my real "main_lists" sheet and the new data, it actually never ended and I had to ctrl+break the script.

Accordingly, can GTO's script be modified to quickly compare and delete data between 27 columns in "main_lists" worksheet, and "Sheet1" ColB new data sheet?

If this is possible that would be a dream that came true to me.

Highly appreciating any assistance with this need.

Mark Peterson

Paul_Hossler
04-20-2013, 07:04 AM
can u post post a small workbook with some data and GTO's code you're using?

what version of excel since 2007 has a 'Remove Duplicates' function that is very fast, and can be called from VBA

Paul

mpeterson
04-20-2013, 08:17 AM
Thank you Paul for your prompt reply. I'm attaching a small example with GTO's script in module1. I'm using Excel10. I'm not sure if I can run remove dups from all 27 cols in Sheet1 compared with colB in Sheet2, this is why I initially had to find another way to do such comparison and deletion before another step of data manipulation that precedes final distribution of data to Sheet1 by letter.

I hope attached example will clarify my question. Once again, thank you.

sassora
04-20-2013, 12:27 PM
As Paul said, you can remove duplicates in xl2007 onwards.
The following code removes the duplicates on sheet 1 (within each column). This won't completely solve your problem, but I don't see why you couldn't use the RemoveDuplicates property in some way.

Sub RemoveDuplicates()

Dim ColumnCount As Integer

With ActiveSheet
For ColumnCount = 1 To 27
Application.Intersect(.Columns(ColumnCount), .UsedRange).RemoveDuplicates Columns:=1, Header:=xlYes
Next ColumnCount
End With

End Sub

mpeterson
04-20-2013, 01:14 PM
Thank you sassora for your valuable contribution. In fact, data in my real "main_lists" worksheet are sorted a-z and duplicates are removed. It is actually the new data (domain names) that we want to verify whether we have them already before operating on them for classification purpose. After classifying them and collecting a variety of info about each name, we then add them to our main_lists data sheet. If there's a way that would enable me to do this task utilizing an excel existing capability, please advise me.

Paul_Hossler
04-20-2013, 04:01 PM
Q: Is it really necessary to group the 'inputs' into 27 colums and remove dups?

Q: you example xlsm uses sheet1, sheet2, etc. Which is 'main_lists'?

Q: is sheet2 the 'new' data and you want to add to the Col B list on sheet1 if that domain is not already there?

Possble pseudo code:

List new domains in Col A in WS
Sort
De-dup using the 'Remove Duplicates' function
For each left, use 'Match' against the 'master' list
If not there, add it, else forget it
Sort master list

Excel's RemoveDuplicates and Sort are very quick


Paul

mpeterson
04-20-2013, 11:56 PM
A1: Yes, data are sorted by letter, each has its own column, 27th col is for domain names that start with a digit, a hyphen or underscore.
A2: In my example, Sheet1 is the "main_lists" data sheet, Sheet2 is where the new "raw" data are collected. It is not necessary for it to be in a separate sheet, it could be column 28 on the main_lists sheet. I only put it in a separate sheet to utilize GTO's script. If you have any other idea, it is always welcome.
A3: New data column either it is on a separate sheet or the same sheet in the main_lists one, need to be verified that it has "new" domain names that aren't included in any column of those 27. The entire process is at this point as we need to compare each column data with that new data column and if there is any match, it is deleted from the new data column. This process loops through all the columns, yielding a new data column with lots of gaps due to deletions. At this moment, this column should be sorted a-z, to be ready for some manual manipulation of data. Once it is finished, that column's data chunks are distributed to all other 27 columns by data letter, then sorting of all columns a-z takes place ending the entire operation.

You have intelligently summarized all that I said so far in those pseudo code steps. There's no harm at all if new unique data are distributed by the same script to main_lists before we manually collect other data, as we can still do that at any case.

Thank you Paul very much for your time and effort spent on this matter.