PDA

View Full Version : [SOLVED:] Identify Matches between Sheet Columns



gmaxey
01-29-2025, 06:48 AM
This may sound ridiculously complicated, but here a stab.
Looking for a practical/proficient process to identify matches between data in two sheet columns. For example lets say I have Sheet 1 with columns A and B and 300,000 rows.


Ref# Folder Path

1 C:\Folder 1\Test.png
...

243 C:\Folder 2\Subfolder 1\Testing.jpg

... .....

300000 E:\Folder1\SubFolder2\Test.doc





I have Sheet 2 with 5 columns and 100 rows


SN Type Title Date Name

1-23 A Dogs 1//1/25 Test.doc
....

35-456 A Cats 1/2/24 Test.png

...
100-D-3 B Pigs 1/4/20 Testing.jpg


I need to create a third sheet that gets the data from Sheet 1 Column A where the data in Sheet 2 Column 5 "Name" is "found" in Sheet 1 Column B.

So in the example. Sheet 3 would look like this:



A
1 35-456 Cats 1/2/24 Test.png
243 100-D-3 Pigs 1/4/20 Testing.jpg
300000 1-23 Dogs 1/1/25 Test.doc


My initial process was like this
Loop through each row in Sheet 1 (300,000)
Loop through each row in Sheet 2 (100)
Check InStr(Sheet1.Cell, Sheet2,Cell) > 1
Yes we have a match, develop Sheet3 row
This process takes about a minute to run.


I revised the process to:
Create a collection

Loop through each row in Sheet 2 (100)
Check if Sheet2.Column 5 is Found in Sheet1
Yes
For Each find
Add row index to collection
For Each Index in Collection
Develop Sheet3 row
This process takes about 12 seconds.


Either method is working, but the potential exists for the row counts in sheet 1 or sheet 2 to grow considerably and the only result is longer processing time.


Wondering if there is a better way. I know nothing about Power Query and very little about Excel. Hoping some expert knows a more efficient process to minimize looping to identify all row indexes in sheet 1 that requires processing?


Thanks







cross posted at: https://www.msofficeforums.com/excel-programming/53221-identify-matches-between-sheet-columns.html

p45cal
01-29-2025, 07:56 AM
In the attached, Power Query. A guess.
Hopefully Sheets 1 to 3 are close to what you have and want.
Right-click the (currently blank) table on Sheet3 and choose Refresh.

If you mock up a workbook with a bit more sample information with (some) matches between Sheet1 and Sheet2, together with a third Sheet3 containing expected results and attach it here, we can put together something which will more likely work for you.

georgiboy
01-29-2025, 08:01 AM
It doesn't sound complicated, however, 'my' answer may not be what you want to hear...

I use VBA a lot however, when dealing with larger datasets and matches, I tend to use Power Query.

My logic for the attached PQ file is:
1, Put all of your data from both sheet1 and sheet2 into table objects and not just data in a range.
2, Load table1 into PQ as a connection only and separate the file name from the folder path as a new column. This is done by splitting the 'Folder Path' column based on the rightmost delimiter of '\'.
3, Load table2 into PQ as a connection only.
4, Merge table2 with table1 based on the file name.
5, Expand the merged table and select only the 'Ref#' column.
6, Export the completed merge as a table to a new sheet.

PQ is super efficient and has replaced a lot of my use for VBA, IMO it is worth learning some PQ methods. ChatGPT can be helpful when learning how to complete specific tasks in PQ, for example you could ask ChatGPT "How to split a column based on a delimiter with Power Query" or "How to merge two tables using Power Query"

I have a attached the file for reference just in case you wanted to pick through it.

gmaxey
01-29-2025, 08:30 AM
Georgoboy\p45Cal

The PQ.xlsx shows exactly what I am currently doing with VBA (95% of the time). So my first step is to figure out how to do Georgboy's steps 1 through 6 using VBA.

95% of the time the conditions are:
a. Return only the first match in Sheet 1 and b. Discard all Sheet 1 rows where match is not found
The other 5% of the time, there are some conditionals.
a. For any second or subsequent match, create new row in Sheet 3
b. For any second or subsequent match, concatenate any second or subsequent match data in existing Sheet 3 row
c. If a match is not found for any row in Sheet 1, cancel and exit process
d. If a match for any row in Sheet 1, create a new row in Sheet 3 showing only Sheet 1 column 1 data
e. If a match is not found for any row in Sheet 2, cancel and exit process

So, before trying to climb the mountain and learn how to perform georgiboy's steps, do you feel these objectives could also be meet?

I'll try to put together some sanitized examples and post.

Thank you so much!!!

p45cal
01-29-2025, 09:04 AM
So, before trying to climb the mountain and learn how to perform georgiboy's steps, do you feel these objectives could also be meet?
Yes

I'll try to put together some sanitized examples and post.Try to include at least one instance of your a to e points in your sample data and how they should appear on Sheet3.

Mine and georgiboy's queries were remarkably similar.

gmaxey
01-29-2025, 09:37 AM
p45Cal,

Here is the sample file (with code to create Sheet 3 named Overlay).
I have included the overlay results for Normal conditions, Match Dups as New Rows, Match Dup Concatenated. The other conditions can be seen by running the code.
31873


If you are uncomfortable opening the attached and I can send an xlsx version.

31872

31874

p45cal
01-29-2025, 04:15 PM
I get differing results from you for the Overlay (Dup as New Row) sheet. I only get 4 rows with no duplicates:

31875

I suspect this could be because you've removed a bunch of rows which might have matched but I'm not 100% sure of that. In order to check could you tell me what settings you used in your frmOverlay userform for each of the 9 available settings therein so that I can run it and compare?

If this is because you've removed those rows could you re-attach a version with the appropriate rows included?

p45cal
01-29-2025, 04:48 PM
Also, there is one row on the REF_SHEET (row 32001) which has folder separators of \ rather than /.
31877
Is this something we must cater for?

gmaxey
01-29-2025, 04:53 PM
I have reattached the file with Overlay sheets for:

1. Normal use case - Match First use Left side/Discard Ref Row Right Side
2. Duplicate Ref Rows Left Side/Discard Ref Row Right Side
3. Concatenate with delim ; Left side/Discard Ref Row Right Side
4. Concatenate with line break Left side/Discard Ref Row Right side

Either right or left option to Do not create overlay will result in the overlay sheet not be created
The right option to Copy Control ID from Ref will result in an overlay has all REF rows (resulting file is too large to attach).

The YoPauli in the RS_SHEET is what creates the Do Not Create Overlay Cancel with the Left bottom option buttong
The YoAdrian in the NDS_SHeet is what creates the Do Not Create Overlay Cancel with the Right option button.








31876

gmaxey
01-29-2025, 04:57 PM
31878

p45cal
01-29-2025, 05:11 PM
It's bedtime here now (early start tomorrow) so it'll possibly be some 24 to 36 hours before I can come back… unless I get a moment.

ps. there's still only one #266912.1 in the REF_SHEET, doesn't matter, I'll add one myself.

gmaxey
01-29-2025, 05:12 PM
Looking at the picture you sent in your last previous post

My example is the result when:
Duplicate Duplicate Reference Row (Left side option) and Discard the Ref Row from Overlay Sheet (Right Side options) are applied.

You see the rose shaded A:2 and A:3 because row 299 of the REF_SHEET matched rows 24 and 25 of the NDS_SHEET. In the normal use case only the first #2669121 is listed.

Other than yours seems to exclude both double Serial Number 10 10 and 9 9, I don't know why your return is only four rows.

Hope this helps. Thank you.

p45cal
01-29-2025, 05:15 PM
OK I get it now, I'll have a play…

p45cal
01-29-2025, 05:28 PM
Tell me in words (might be easier then me looking through about 800 lines of code) about the likes of:
Sentinel Overlay Demo Data.zip//Sentinel Overlay Demo Data/DemoData_0000015_Import.zip//X Emails/Email 3.pdf
Should I treat it as 3 possible file names (splitting the text at the double forward slashes), then looking for all three of:
Sentinel Overlay Demo Data.zip
DemoData_0000015_Import.zip
Email 3.pdf

or something else?
(It really is bedtime now!)

gmaxey
01-30-2025, 01:43 AM
p45cal,

Yes, it was past my bedtime. Thanks for hanging in here! Well, as I said, it is complicated. In a sense the answer to your question is yes.
My first approach to achieve the required out put was to loop through every row of the RS_SHEET

So when we hit row 299 we have: "Sentinel Overlay Demo Data.zip//Sentinel Overlay Demo Data/DemoData_0000015_Import.zip//X Emails/Email 3.pdf"
Then I looped through every row of NDS_SHEET and checked to see if the data in column 8 was InStr of the last step
So when we hit row 24, we have "DemoData_0000015_Import.zip"
If InStr("Sentinel Overlay Demo Data.zip//Sentinel Overlay Demo Data/DemoData_0000015_Import.zip//X Emails/Email 3.pdf", "DemoData_0000015_Import.zip") > 0
Bingo, we have first match. record it.
If Use First Match Only then
get out
Else
We continue looping
We hit row 25 which again is "DemoData_0000015_Import.zip" and again is InStr so we record the line 25 record

So if you change NDS_SHEET row 25 column 8 from "DemoData_0000015_Import.zip" to "Email 3.pdf" and run another overlay, you will see the same result for a normal overlay. If you run it with the Duplicate Reference Row you will see a different result:


That required a lot of looping. With some actual data with 440,000 RS_SHEET rows it takes about 45 seconds.

My current approach (what you see) works like this:

I loop through each data row of the NDS_SHEET (Rows 5 to 26 rows)
In each loop, I find any RS_SHEET Row index(es) where the NDS_SHEET row index, column 8 data is found and add to a collection
So when NDS_SHEET row index = 24 the RS_SHEET Row 299 is the collection item
I then Loop through the collection (In this case only 1 item)
I add 299 to another DupCollection. If successful , I add the data for row 299 to the Overlay
When NDS_SHEET row index = 25 RS_SHEET Row 299 is again the only index returned
I attempt to add it to the DupCollection and it fails
If Match First Only - Get Out
If Duplicate Ref Row - Record it

With the actual data this process takes about 12 seconds.

I hope that makes sense. Thank you again.






31879

p45cal
01-31-2025, 04:34 PM
In the attached, on sheet Overlay (Dup as New Row), there's a Power Query query output at cell A14 which seems to be correct. I've cleared a few cells in that table so that you can right-click somewhere in that table and choose Refresh and see how long it takes for data to appear. The first time you do this it may take longer than subsequent refreshes. I'd be interested to know how long the refreshes take.
It's very basic at the moment and I should be able to add tables for the other sheets too and make it more streamlined, but I'm out of time right now.
Note that no macros are involved (I should have attached as .xlsx).

gmaxey
02-02-2025, 03:12 PM
p45cal,

I had a chance to look at your attached file. If I change some data in your table and refresh, it takes between 2.5 and 3 seconds nominally to refresh. You mention that there is no VBA. I think for this approach to work there will have to be some VBA involved. The process will need to work for any number of REF_SHEETS and NDS_SHEETS. Sometimes the number of headings and titles in the REF_SHEET will vary. On top of all of the various conditions that must be accounted for, there is another wrinkle called "Family Relationships" that occasionally must be handled. With some help offered in the linked post, I've managed to get the processing time down to nearly a order of magnitude below what I did have. The attached processes normal conditions in less than 4 seconds (with family relationships it is still under 30 seconds.

I want you to be award of this so you don't spend more of your time on an approach that might not be suited for this requirement.
Up to you. If you like a challenge fine, if you wan to leave it be that is fine of course also. Thank you!!

31882

Paul_Hossler
02-03-2025, 06:58 AM
This may sound ridiculously complicated, but here a stab.
Looking for a practical/proficient process to identify matches between data in two sheet columns. For example lets say I have Sheet 1 with columns A and B and 300,000 rows.


Ref# Folder Path

1 C:\Folder 1\Test.png
...
243 C:\Folder 2\Subfolder 1\Testing.jpg
... .....
300000 E:\Folder1\SubFolder2\Test.doc

I have Sheet 2 with 5 columns and 100 rows

SN Type Title Date Name

1-23 A Dogs 1//1/25 Test.doc
....
35-456 A Cats 1/2/24 Test.png
...
100-D-3 B Pigs 1/4/20 Testing.jpg


I need to create a third sheet that gets the data from Sheet 1 Column A where the data in Sheet 2 Column 5 "Name" is "found" in Sheet 1 Column B.


It seems to me (FWIW) that the main question was one of decreasing processing speed??

Using the attachment in #17 can you describe in words and color coding how (I assume) matching between NDS_Sheet and REF_Sheet works to create OVERLAY?

If we come up with an acceptable 'match algorithm' the rest of the process to integrate into the UI could be a homework assignment :yes

p45cal
02-03-2025, 01:52 PM
I only mentioned via because I noticed I'd attached a .xlsm file after posting.
Sure I'm happy to use it if needed.
I said that the Power Query query I used was quite basic and indeed it was, with further coding it can be quite sophisticated and flexible.
You say you will have to handle any number of Ref/Nds sheets; do you mean at the same time?
I have a further question; the data that are in these sheets, do they come as Excel sheets or in some other form (maybe csv or txt files)? I ask this because Power Query can eliminate the step of bringing those files into a worksheet, what's more it would probably be faster.

gmaxey
02-03-2025, 05:57 PM
p45cal,

As I informed you via PM, I'm going to mark this thread solved as I have found a solution that will more than meet the need.
31883

I tried to send you a photo in the PM of the typical column names that are used but it got converted to a xml format. So here it is again.

31885

Thanks for all you time and help with this!

gmaxey
02-04-2025, 07:10 AM
Paul, Sorry for the delayed reply. I got the help I needed in the cross post forum and having managed to take that and reduce processing speed significantly enough to meet requirements. To be honest, I was helping a friend and I while I have managed to reduce the processing speed, I am still a bit muddled by how it all works. So much so that I don't know if I know how describe it in words.

I get your point about homework assignment. My skill (if it can be called that) is with Word VBA. Yes, I sometimes dabble with Excel and can cobble together something that works. But, I don't use it enough (and hope not to) to take on the learning curve of Power Query. At least not right now. Thank you.

Paul_Hossler
02-04-2025, 07:50 AM
Glad you and your friend got something that works

You can mark the thread [Solved] by using [Thread Tools] above post #1