PDA

View Full Version : Creating one list by comparing two list of different formats



Pharlap
08-21-2007, 07:27 PM
Hi VBAX masters

This is for someone who enjoys a challenge
..hope you can help

A work colleague of mine who knows very little about excel asked me to help him with this, however it is beyond my basic level of VBA skills, however as follows is what I have been able describe step by step by the way I think it might be done. As you might see it is a straight forward,...but I am sure there are heaps of VBAX members who can solve while blind foldered and with one hand tied behind your back....so here it is......

There is: 1 workbook containing 2 worksheets, and numbers of rows of info an be in the thousands

The task here is to create one list of addressees out of two sources of information – as you will see in the attached example file both sources are not formatted the same –also note that attached file also shows example of wanted results in column T of sheet 1 – (that is before it is tided up as per STEP3 below - by copying to anew worksheet and removing blank cells from the column in the new worksheet -.)

So here goes in terms as the way I am thinking about the probelm:

STEP 1

Compare Values in Column B in sheet1
if duplicated
then delete duplicates (Note: can be more than one duplicate)

STEP 2
NOW Row by row in sheet 1
compare the value remaining in column B of sheet 1 with all values listed in entire column B of sheet 2


If there is not a match then in sheet 1 output and reorder the string to Column T from Column C of sheet 1
Could be a number of cases sof how string will appear in source column B in sheet 1 – for example
In case 1 if in column C FLINTSTONE Fred
Then in column T “Fred FLINTSTONE”
Or if case 2 if in column C FLINTSTONE Fred, Wilma
Then in column T “Fred FLINTSTONE and Wilma FLINTSTONE”
Or if case 3 if in column C FLINTSTONE Fred, Wilma, RUBBLE Barney
Then in Column T “ Fred FLINTSTONE and Wilma FLINSTONE and Barney RUBBLE”
Of if case 4 if in Column C “FLINTSTONE Fred, Wilma, RUBBLE Barney, SLATTER, Joe
Then in Column T “Fred FLINTSTONE and Wilma FLINSTONE and Barney RUBBLE and Joe SLATTER”

‘NOTE – In the source column B of sheet 1 the surname will always be in CAPITALS and if more than one christian name after the same surname then the christian name will always be separated by a comma. However when there is a match and then the sourse comes from sheet 2 then the surname is not in capitals. The end result does not need to be in capitals.

IF there is a Match
Then DON’T use names from column B of sheet 1 but instead then use info from sheet 2 to concatenate (with appropriate single spacing) in Column T in sheet 1 a string from source sheet 2 columns C, D, and E for example attached (you will ahve to imagine that row ID numebrs in column B match fro FLINTSTONE )-then see how row 1 column T has come from sheet 2
And if in sheet 2 a value repeats (can repeat more than once) then also add to the just created string an additional concatenation being sourced from the row of the repeated value that has occurred in sheet 2 . see again in the attached example file how row 1 column T has come from the seconsheet 2


Step 3 Final step required - copy results from column T of sheet 1 to a new worksheet and in the new worksheet remove an blank cells from the copied column

I hope this explains the problem clear enough

Thanks for your help
Pharlap

Bob Phillips
08-22-2007, 12:13 AM
The de-duping part and the matching part is simple, but the name manipulation is extremley complex. How would you know for instan ce that Fredrick becomes Fred, or more difficult, that Wilmalia becomes Wilma? This and character matching in thousands of rows will take enormous times. A Regular Expresssion might do it efficiently, but I am not good enougfh at them personally.

Pharlap
08-22-2007, 02:21 AM
Thanks xld for thinking about this

Yes it is complex ....well for me for sure...but for you ??? never......

What is happening behind this is that people provided formal names and were given the option to provide preferred names - some people provided the prefered names and some people didnt -therefore the two list -what is wanted is one name list to be part of a mailing list - but to use the preferred name in sheet2 where one has been provided or if not use the formal name that was provided as per sheet1.

Again the first step for sheet1 only is to remove those rows that have duplicate row id numbers - so that each row id number in column B of sheet1 only appears ONCE -(one row ID number and one row of info)the duplicate row ID. Note removing duplicates row id numbers is NOT required for sheet 2 if fact they are needed in sheet2 .

Now at this point My own thoughts for the trick to getting a code to work out which name to use is in the matching/comparing of the row id numbers in columns B of each sheet

starting with sheet1 if there is NO matching of same row id number against those in column B of sheet 2 then just use the name in sheet1 otherwise if there is a match then use the preferred name in sheet2.

If there there has been a match of row Id numbers and therefore would then be using is names in sheet2 and if the row id in sheet2 is repeated then need to concatinate the names in each row that uses the same row id in column B of sheet2 which to output concatinated name - going to column T in sheet 1 in the row in which its row id was matched. That way they will still line up with the correct residental address in sheet1. -as per the example of the FLINTSTONES in the attached workbook.

I have attached a more refined test example workbook,....version3 which under the yellow row is what sheet1 would look like after the code has run.

I hope I have explained this OK..

I wont be able to seen your reply for about 12 hours as I have to leave soon as I am doing this after work and the security guys say if I dont go soon they will lock me in the building for the night, and I dont have internet where I am staying In SYD so I will look forward to your reply in about 12 hours time when I am back here again

Thanks again

regards

pharlap

CodeMakr
08-22-2007, 12:31 PM
You're probably better off applying a unique key (employee number??) so you can grab "formal" or "preferred" name accordingly based on key.

Bob Phillips
08-22-2007, 01:57 PM
How abut this as a starting point



Sub PharlapTheData()
Dim LastRow As Long
Dim iMatch As Long
Dim i As Long, j As Long
Dim shB As Worksheet
Dim sSurname As String
Dim sFirstnames As String

Set shB = Worksheets("Extraction Report B")

With Worksheets("Extraction Report A")

LastRow = .Cells(.Rows.Count, "A").End(xlUp).Row
.Columns(3).Insert
.Range("C1").Value = "Temp"
.Range("C2").Resize(LastRow - 1).Formula = "=COUNTIF(R2C[-1]:RC[-1],RC[-1])>1"
.Range("C1").Resize(LastRow).AutoFilter field:=3, Criteria1:="=TRUE"
.Range("C2").Resize(LastRow - 1).SpecialCells(xlCellTypeVisible).EntireRow.Delete
.Columns(3).Delete

LastRow = .Cells(.Rows.Count, "A").End(xlUp).Row
For i = 2 To LastRow
iMatch = Application.CountIf(shB.Columns(2), .Cells(i, "B").Value)
If iMatch = 1 Then
iMatch = Application.Match(shB.Columns(2), .Cells(i, "B").Value, 0)
.Cells(i, "B").Value = shB.Cells(iMatch, "B").Value
ElseIf iMatch > 1 Then
sFirstnames = ""
For j = shB.Range("B2").End(xlDown).Row To 2 Step -1
If .Cells(i, "B").Value = shB.Cells(j, "B").Value Then
sSurname = shB.Cells(j, "D").Value
sFirstnames = sFirstnames & shB.Cells(j, "E").Value & " and "
End If
Next j
sFirstnames = Left$(sFirstnames, Len(sFirstnames) - 4)
.Cells(i, "C").Value = sFirstnames & sSurname
End If
Next i
End With

End Sub

Pharlap
08-22-2007, 04:16 PM
Thanks Codemker for your idea of the use of keys,...however I wasn?t involved in the design how info was collected or maintained. I just have to work with the output as it is..lucky me.

xld...your code is a Brilliant starting point...certainly no way I could do that..perhaps in 5 to 10 more years of VBA experience or sooner with getting my head around code developed by people like yourself.

In my sample file the demonstrated output is in COLUMN T sheet1 under the yellow row.

Eg
Mr Fred Flintstone and Ms Wilma Flintstone
Sue Bush
Robin BATMAN and Frank BATMAN
John MARGOO and Joan MARGOO
George BUSH

So for the first step of the requirement to remove duplicates from column C of sheet 1 your code seems to do that fine , except it left off the last person ? George BUSH ? I know a lot of people would like to see him go , but I wanted to keep him on sheet1, so it seems the code loosing the last person ????

For the next step the code also seems to work fine in terms of using the names in Column C of sheet 2 when there is 1 or more matching row id numbers in sheet 2 and the code has done the concatenated perfectly from several rows in sheet 2 having the same row id number ,and separated the output two names with an ?and?. However the output should go to column T of sheet 1.

Also where there is NOT a match of row Id numbers between sheet1 and sheet2 the code works fine for using the next name in column C of sheet1 however the code at this point does not format the output as above but retaining it such as ?BATMAN Robin, Frank? what is required is ?Robin BATMAN and Frank BATMAN?

So all in all it seems the code is probably almost there,?but I need to rely of the members of the VBAX community for help as there is just no way I can make this work.

Thanks again

Pharlap

Bob Phillips
08-23-2007, 04:59 AM
So for the first step of the requirement to remove duplicates from column C of sheet 1 your code seems to do that fine , except it left off the last person ? George BUSH ? I know a lot of people would like to see him go , but I wanted to keep him on sheet1, so it seems the code loosing the last person ????

No, that is because George Bush has the same Row Id as the previous person, and so it gets de-duped.


For the next step the code also seems to work fine in terms of using the names in Column C of sheet 2 when there is 1 or more matching row id numbers in sheet 2 and the code has done the concatenated perfectly from several rows in sheet 2 having the same row id number ,and separated the output two names with an ?and?. However the output should go to column T of sheet 1.

Okay that should be peasy.


Also where there is NOT a match of row Id numbers between sheet1 and sheet2 the code works fine for using the next name in column C of sheet1 however the code at this point does not format the output as above but retaining it such as ?BATMAN Robin, Frank? what is required is ?Robin BATMAN and Frank BATMAN?

Okay will try.



Sub PharlapTheData()
Dim LastRow As Long
Dim iMatch As Long
Dim i As Long, j As Long
Dim shB As Worksheet
Dim sSurname As String
Dim sFirstnames As String
Dim sOriginal As String

Set shB = Worksheets("Extraction Report B")

With Worksheets("Extraction Report A")

LastRow = .Cells(.Rows.Count, "A").End(xlUp).Row
.Columns(3).Insert
.Range("C1").Value = "Temp"
.Range("C2").Resize(LastRow - 1).Formula = "=COUNTIF(R2C[-1]:RC[-1],RC[-1])>1"
.Range("C1").Resize(LastRow).AutoFilter field:=3, Criteria1:="=TRUE"
.Range("C2").Resize(LastRow - 1).SpecialCells(xlCellTypeVisible).EntireRow.Delete
.Columns(3).Delete

LastRow = .Cells(.Rows.Count, "A").End(xlUp).Row
For i = 2 To LastRow
iMatch = Application.CountIf(shB.Columns(2), .Cells(i, "B").Value)
sOriginal = .Cells(i, "C").Value
If iMatch = 1 Then
iMatch = Application.Match(shB.Columns(2), .Cells(i, "B").Value, 0)
.Cells(i, "T").Value = shB.Cells(iMatch, "C").Value & " " & _
shB.Cells(iMatch, "D").Value & " " & _
shB.Cells(iMatch, "E").Value
ElseIf iMatch > 1 Then
sFirstnames = ""
For j = shB.Range("B2").End(xlDown).Row To 2 Step -1
If .Cells(i, "B").Value = shB.Cells(j, "B").Value Then
sSurname = shB.Cells(j, "D").Value
sFirstnames = sFirstnames & shB.Cells(j, "C").Value & " " & _
shB.Cells(j, "E").Value & " and "
End If
Next j
sFirstnames = Left$(sFirstnames, Len(sFirstnames) - 4)
.Cells(i, "T").Value = sFirstnames & sSurname
Else
iMatch = InStr(sOriginal, ",")
If iMatch > 0 Then
.Cells(i, "T").Value = Trim$(Mid$(sOriginal, InStr(sOriginal, " ") + 1, _
iMatch - InStr(sOriginal, " ") - 1)) & " and " & _
Trim$(Right$(sOriginal, Len(sOriginal) - iMatch)) & " " & _
Trim$(Left$(sOriginal, InStr(sOriginal, " ") - 1))
Else
iMatch = InStr(.Cells(i, "C").Value, " ")
If iMatch > 0 Then
.Cells(i, "T").Value = Trim$(Right$(sOriginal, Len(sOriginal) - iMatch)) & " " & _
Trim$(Left$(sOriginal, InStr(sOriginal, " ") - 1))
Else
.Cells(i, "T").Value = sOriginal
End If
End If
End If
Next i
End With

End Sub

Pharlap
08-24-2007, 12:29 AM
Hi xld

Of course ...The code worked great on the test file ...BRILLANT

However on the real data file I got an error, I believe I missed something that occasional come up in the source data...I am going to have a better look at it over the weekend...I hope I can let you know how I went with it later tomorrow.


Thanks again

Pharlap :-)

Pharlap
08-27-2007, 02:45 AM
Hi xld


:bow: :bow: :bow: :bow:

How Are you VBA master?

Well I had a very late night last night reviewing how this code runs on the real data,...pretty good but there a few things I hope you can still help me to iron out.

Well since I first looked at doing this excel file that I wanted it has changed a little ,..the only 2 changes of any significance with respect to impact on the code is the inclusion of a new column in sheet1 - which has been inserted at column A called "region" so it just changes the columns across by one column but I am happy to delete this column for running the code, but I have left it in the new example test file just in case with your greater expereince/knowledge you might consider it of some signifcance. The real big chance is that the real file has had all the duplicate data removed from sheet1,....so the code now wont run unless I create a duplicate for testing purposes but I wonder if the code can be changed to allow work with or without the presence of duplicates in sheet1 in case there are some there in another extraction at a furhter date.

On the real file that has thousands of rows the code stops running and I get a mismatch message but I feel it is something about the data - that I am further looking into - because if I remove the next row from where it stops it will then run past that point to it gets to another mismatch error. :dunno

Anyway here is sort of points of the issues with which I hope you can provdie further assistance with

1. If there is no duplicate in sheet A column B (row ID number )then code still needs to run even if no duplicates as sometimes there may be no duplicates and other times there could be). Presently if the is no duplicates at all the code wont errors.

2. If there are two surnames using the same row ID the code seems to only use the first appearing surname for that row ID of sheet2. Members of Traditional families will have the same family /surname however in today’s society family members may have different surnames in fact married couples may have different surnames even when married.

So I am thinking the solution here is for any row ID of sheet 1 that is matched to the row id in sheet2 then always use the surname appearing on the SAME row of that of the Christian Name in sheet 2
Currently in the case of a match of row id eg for Flintstone in the test file - the code for Row id produces “Mrs Wilma and Mr Fred Flintstone”

But if the code sourced the surname out of the surname column and from the same row as Christian name next to it then The output would be” Mrs Wilma Flintstone and Mr Fred Flintstone” then if there when there are occurrences a third or more people with the same row id on sheet2 but different surnames then it should accommodate this. So if a Ms Mary Poppins was also a family member of the Flintstones and as such has the same row id on sheet 2 then the output would be “Ms Mary and Mrs Wilma and Mr Fred Flint-stone” as per the test data on attached version 8 of the test file. Whereas waht is required is "Ms Mary Poppins and Mrs Wilma and Mr Fred Flintstone.

3. If on sheet1 there has been no match to sheet 2 and there is a row in sheet1 where in a cell that has two or more people with different surnames then this is not accommodated in the output – for example in the a cell string “MARGOO John, SMITH Joan” is currently output as “John and SMITH Joan MARGOO” whereas what is needed is output “John MARGOO and Joan SMITH”. I am thinking perhaps as in sheet 1 the source data will always have the surname as CAPITALS then is it possible to rearrange names so that those with some mixed case word are put first followed by word of all CAPITALS and when separated by a comma substitute the comma with“and” and then repeat. ??

So I hope you can help :hi:

By the way did you end up getting the resent email ?
I will send you another tomorrow to fill you in


Thanks again :beerchug:

Pharlap

Bob Phillips
08-27-2007, 11:35 AM
Pharlap,

I managed to delet the entire folder this time, so if you send it yet again, I promise to be more careful.



Sub PharlapTheData()
Const TEST_COLUMN As Long = 3
Const NEW_COLUMN As Long = 4
Const NAME_COLUMN As Long = 4
Const TARGET_COLUMN As Long = 20
Dim LastRow As Long
Dim iMatch As Long
Dim nSpaces As Long
Dim i As Long, j As Long
Dim shB As Worksheet
Dim sSurname As String
Dim sFirstnames As String
Dim sOriginal As String
Dim RngDupes As Range
Dim aryNames As Variant

Set shB = Worksheets("Extraction Report B")

With Worksheets("Extraction Report A")

LastRow = .Cells(.Rows.Count, "A").End(xlUp).Row
.Columns(NEW_COLUMN).Insert
.Cells(1, NEW_COLUMN).Value = "Temp"
.Cells(2, NEW_COLUMN).Resize(LastRow - 1).Formula = "=COUNTIF(R2C[-1]:RC[-1],RC[-1])>1"
.Cells(1, NEW_COLUMN).Resize(LastRow).AutoFilter field:=1, Criteria1:="=TRUE"
On Error Resume Next
Set RngDupes = .Cells(2, NEW_COLUMN).Resize(LastRow - 1).SpecialCells(xlCellTypeVisible)
On Error GoTo 0
If Not RngDupes Is Nothing Then
RngDupes.EntireRow.Delete
Else
.Cells.AutoFilter
End If
.Columns(NEW_COLUMN).Delete

LastRow = .Cells(.Rows.Count, "A").End(xlUp).Row
For i = 2 To LastRow
iMatch = Application.CountIf(shB.Columns(2), .Cells(i, TEST_COLUMN).Value)
sOriginal = .Cells(i, NAME_COLUMN).Value
If iMatch = 1 Then
iMatch = Application.Match(shB.Columns(2), .Cells(i, TEST_COLUMN).Value, 0)
.Cells(i, TARGET_COLUMN).Value = shB.Cells(iMatch, "C").Value & " " & _
shB.Cells(iMatch, "D").Value & " " & _
shB.Cells(iMatch, "E").Value
ElseIf iMatch > 1 Then
sFirstnames = ""
For j = shB.Range("B2").End(xlDown).Row To 2 Step -1
If .Cells(i, TEST_COLUMN).Value = shB.Cells(j, "B").Value Then
If LCase(shB.Cells(j, "D").Value) = LCase( _
Left$(.Cells(i, NAME_COLUMN).Value, InStr(.Cells(i, NAME_COLUMN).Value, " ") - 1)) Then
sSurname = shB.Cells(j, "D").Value
sFirstnames = sFirstnames & shB.Cells(j, "C").Value & " " & _
shB.Cells(j, "E").Value & " and "
Else
sFirstnames = sFirstnames & shB.Cells(j, "C").Value & " " & _
shB.Cells(j, "E").Value & " " & shB.Cells(j, "D").Value & " and "
End If
End If
Next j
sFirstnames = Left$(sFirstnames, Len(sFirstnames) - 4)
.Cells(i, TARGET_COLUMN).Value = sFirstnames & sSurname
Else
iMatch = InStr(sOriginal, ",")
If iMatch > 0 Then
nSpaces = Len(sOriginal) - Len(Replace(sOriginal, " ", ""))
If nSpaces > 2 Then
aryNames = Split(Replace(sOriginal, ",", ""), " ")
.Cells(i, TARGET_COLUMN).Value = aryNames(1) & " " & aryNames(0) & " and " & _
aryNames(3) & " " & aryNames(2)
Else
.Cells(i, TARGET_COLUMN).Value = Trim$(Mid$(sOriginal, InStr(sOriginal, " ") + 1, _
iMatch - InStr(sOriginal, " ") - 1)) & " and " & _
Trim$(Right$(sOriginal, Len(sOriginal) - iMatch)) & " " & _
Trim$(Left$(sOriginal, InStr(sOriginal, " ") - 1))
End If
Else
iMatch = InStr(sOriginal, " ")
If iMatch > 0 Then
.Cells(i, TARGET_COLUMN).Value = Trim$(Right$(sOriginal, Len(sOriginal) - iMatch)) & " " & _
Trim$(Left$(sOriginal, InStr(sOriginal, " ") - 1))
Else
.Cells(i, TARGET_COLUMN).Value = sOriginal
End If
End If
End If
Next i
End With

End Sub

Pharlap
08-28-2007, 02:00 AM
:hi: Hi there xld

Thanks again the code of course works GREAT :yes I spent some time today tying to get my mind around some parts of it in terms of how some parts of it work. Some of it makes sense but some parts are little confusing ....but in time I shall of course learn more.

There is still a problem in the data ....still occassionally getting a error type mismatch so I am hoping to sit down over the next day or two and have a hard look at the sourse data to find out why , if I can understand why then a change to the code might overcome that issue once the issue with the data is known.

By the way I have resent the email,...but when I get tiem over the next few days I want to send you an short update...I have had some interesting convesations about VBA in folks here ..all good of course.

Thanks again xld ..you are indeed a VBA legend :bow: :bow: :bow:

cheers

Pharlap :beerchug:

Bob Phillips
08-28-2007, 03:21 AM
Thanks, got the email.

We are now getting into the realms of 'mess' on the rules. We are alreay at 2/3 levels of increased complexity from where we started, due to the evolving nature of the names. The great problem with names is that the combinations are huge.

It may be best to step baak and re-design the parser, as it has been built upon for each new variation, rather than designed for all those variations.

Pharlap
08-28-2007, 02:38 PM
:hi: HI xld,
Yes I agree about the combination mix of names can be huge, and certainly so with non english based cultures - which in a multicultural society needs to be accommodated for. Your point about redesign the parser I am not sure if I realy understand what you mean however I am interpreting it as in reference to look at the data structures themselves and yes it does need to be reviewed.

On that point I will let you know how I go in the next day or so.

Weather here is great -just like the post cards

cheers

Pharlap :beerchug:

Pharlap
10-04-2007, 06:46 PM
hi there XLD,
:hi:
There must be some truth in what people say about having a break from a problem for a while if one cant immediately see what exactly is wrong. Well the other day I finally got back to this code and straight away I finally figured out why it stops on the real data - then of course I tried to fix the code but made more of a mess of it. Anyway seems only one change required - if you could help - in the code you developed it did allow for a single person on worksheet 1 of the test file- and would put them in the new generated output list - however my test file v9 didnt show/test a single person and their row id number being also on the second sheet which is what can happen in the real data sheets. This is way the code stops in its processing there - that is if it finds a match to the row id number on wksh 2 to that of wksh 1 - with wksh 2 only having one person with that same number thus meaning no one to be concatonated in the output string with an "and" -then the code just stops - see Test Worksheet v10 attached. In wksh v10 I have added Sue Rubble to wksh 2 who was is on the only first wksh of test file ver9. Therefore the only modification required to the code it seems it to allow for a single person also being on second worksheet when no other person on second worksheet has same row id number.

Amending the code to allow for this I feel confident will mean it should now work for the real data workbook - but how to amend it escapes me.

Hope you can help.

cheers

Pharlap :beerchug: