wa1esy
06-09-2010, 01:57 AM
So we have a big Excel (2007) sheet of data from Pubmed. It has around 20000 rows, with each row representing an author and their associated data. Quite often there can be multiple rows for the same author because of different ways that their name is entered, so we have to group these authors together and sum their data into one row.
This is not a real problem thanks to a nifty little piece of code I got from this forum a few months back, which looks for all records with the same surname and initial and groups them together using the subtotal function.
The real problem comes when there is a name (usually double barrelled but not always) which generates different surname and initials and therefore gets grouped and summed separately, when it is actually the same person. You can see on the example WB I've attached ("Before" tab = raw data, "After" tab = post subtotal and highlight) the bottom name Cohen Solal A, has generated three different surname initial combos which means that they are not brought together in the table and sometimes can be separated by thousands of rows!
My question is how can I find these names quickly and easily and group them together. My current thinking is:
From a table containing thousands of names we usually only need maybe the top 250 at the most so we will end up with a worksheet containing 250 summed names (the blue rows in the "After" tab). What I'm thinking is that there might be a way of looking through the sheet and identifying potential duplicate rows based on common words in column A. For the example above all of the cells in column A contain Solal, Cohen and A is there a way of getting Excel to look through and do a "fuzzy" match on cells in the spreadsheet that might be the same person? I have been looking at Febrl which is an open source piece of software but if it can be done in Excel using a combination of code, functions and formulas then all the better!
Apologies for the length of the post I thought it was necessary to explain it properly, any feedback, questions or discussion would be greatly appreciated. :dunno
This is not a real problem thanks to a nifty little piece of code I got from this forum a few months back, which looks for all records with the same surname and initial and groups them together using the subtotal function.
The real problem comes when there is a name (usually double barrelled but not always) which generates different surname and initials and therefore gets grouped and summed separately, when it is actually the same person. You can see on the example WB I've attached ("Before" tab = raw data, "After" tab = post subtotal and highlight) the bottom name Cohen Solal A, has generated three different surname initial combos which means that they are not brought together in the table and sometimes can be separated by thousands of rows!
My question is how can I find these names quickly and easily and group them together. My current thinking is:
From a table containing thousands of names we usually only need maybe the top 250 at the most so we will end up with a worksheet containing 250 summed names (the blue rows in the "After" tab). What I'm thinking is that there might be a way of looking through the sheet and identifying potential duplicate rows based on common words in column A. For the example above all of the cells in column A contain Solal, Cohen and A is there a way of getting Excel to look through and do a "fuzzy" match on cells in the spreadsheet that might be the same person? I have been looking at Febrl which is an open source piece of software but if it can be done in Excel using a combination of code, functions and formulas then all the better!
Apologies for the length of the post I thought it was necessary to explain it properly, any feedback, questions or discussion would be greatly appreciated. :dunno