PDA

View Full Version : Completely at a loss...



wa1esy
06-09-2010, 01:57 AM
So we have a big Excel (2007) sheet of data from Pubmed. It has around 20000 rows, with each row representing an author and their associated data. Quite often there can be multiple rows for the same author because of different ways that their name is entered, so we have to group these authors together and sum their data into one row.
This is not a real problem thanks to a nifty little piece of code I got from this forum a few months back, which looks for all records with the same surname and initial and groups them together using the subtotal function.
The real problem comes when there is a name (usually double barrelled but not always) which generates different surname and initials and therefore gets grouped and summed separately, when it is actually the same person. You can see on the example WB I've attached ("Before" tab = raw data, "After" tab = post subtotal and highlight) the bottom name Cohen Solal A, has generated three different surname initial combos which means that they are not brought together in the table and sometimes can be separated by thousands of rows!
My question is how can I find these names quickly and easily and group them together. My current thinking is:
From a table containing thousands of names we usually only need maybe the top 250 at the most so we will end up with a worksheet containing 250 summed names (the blue rows in the "After" tab). What I'm thinking is that there might be a way of looking through the sheet and identifying potential duplicate rows based on common words in column A. For the example above all of the cells in column A contain Solal, Cohen and A is there a way of getting Excel to look through and do a "fuzzy" match on cells in the spreadsheet that might be the same person? I have been looking at Febrl which is an open source piece of software but if it can be done in Excel using a combination of code, functions and formulas then all the better!

Apologies for the length of the post I thought it was necessary to explain it properly, any feedback, questions or discussion would be greatly appreciated. :dunno

mdmackillop
06-09-2010, 06:08 AM
Something like this?

Sub CheckNames()
Dim a
Dim cel As Range
Dim arr() As String
Dim c As Range
Dim j%
For Each cel In Range("Data")
cel.Select
arr = Split(cel)
For Each a In arr
j = j + 1
If Len(a) > 1 Then
For Each c In Range("data")
If c.Offset(, 12) = "" Then
If InStr(1, c, a) > 0 Then
c.Offset(, 12) = j
End If
End If
Next
End If
Next
Next
End Sub

wa1esy
06-09-2010, 06:19 AM
Sorry mdmackillop, I should have explained in my original post that I am an absolute tool when it comes to VBA! :confused4 DO you think you could just talk me through that one please?

mdmackillop
06-09-2010, 10:49 AM
Here's a sample. If the sorting is correct, we can look at totalling.

wa1esy
06-10-2010, 01:08 AM
I'm really sorry but I'm afraid I don't understand what it is doing?? What is the number that appears in column J? Is the macro looking at the names in column A and then assigning a number to the names which are likely to be the same?

mdmackillop
06-10-2010, 05:02 AM
The code checks all cells for each name in the first cell, if found it allocates a number. This repeats for all cells in the list. The code then sorts by number order, bringing "similar" cells together. I think inspection would then be required to confirm sorting is correct, and fixing erroneous entries. I don't see that this can be fully automated.
The numbers column can be deleted.