PDA

View Full Version : Sleeper: Extraction of key entities using VBA



synthex
09-16-2018, 06:36 AM
My csv file has next structure
22881My csv file has next structure
22881

The body of text is names of products from check. (GOODS_NAME)

I want to group any similar names.

some phrases have the same and either root-key, key patterns

For example

word MAKFA Makar (it is Ukraine words)

I found 7 rows with



"root or key word MAKFA Makar"




Pasta Makfa snail flow-pack 450 g.
MAKFA Macaroni feathers like. in/ with
2013077 MAKFA Makar.RAKERS 450g
2013077 MAKFA Makar.RAKERS 450g
6788 MAKFA Makar.perya 450g
2049750 MAKFA Makar.SHIGHTS 450g
2049750 MAKFA Makar.SHIGHTS 450g

and so on. There are many phares. So we extract similar root-key pattern
Initially similar pattern
1 Pasta Makfa snail flow-pack 450 g. MAKFA Makar.
2 MAKFA Macaroni feathers like. in/ with MAKFA Makar.
32013077 MAKFA Makar.RAKERS 450g MAKFA Makar.
42013077 MAKFA Makar.RAKERS 450g MAKFA Makar.
56788 MAKFA Makar.perya 450g MAKFA Makar.
62049750 MAKFA Makar.SHIGHTS 450g MAKFA Makar.
72049750 MAKFA Makar.SHIGHTS 450g MAKFA Makar.
8*3398012 DD Kolb.SERV.OKHOTN in/ to v / y0.35 kolb
9*3014084 D.Dym.Spikachki DEREVEN.MINI 1kg Spikachki
10809 Bananas 1kg Bananas
11 Lemons 55+ Lemons
12 Napkins paper color 100pcs PL Napkins paper
13 SOFT Cotton sticks 100 PE (BELL Cotton sticks
14 SHEBEKINSKIE Macaroni Butterfly №40 SHEBEKINSKIE Macaroni
15*3426789 WH.The corn rav guava / yagn.d / Cat SEED 85g CAT seed
16 FetaXa Cheese product 60% 400g ( Cheese
173491144 LIP.NAP.ICE TEA green yellow 0.5 liter TEA
182030918 MARIA TRADITIONAL Biscuit 180g Biscuit
19197 Onion 1 kg Onion
20 TOBUSsteering-wheel 0.5kg flow steering-wheel
21 Package "Magnet" white (Plastiktre) Package (Plastiktre)
22*2108609 SLOB.Mayon.OLIVK.67% 400ml Mayon
23 TENDER AGE Cottage cheese 10 Cottage cheese







here 7 patterns is root-key( MAKFA Makar.) another products in this list are unique, but it can be that in whole list
3491144 LIP.NAP.ICE TEA green yellow 0.5 liter TEA
3491144 LIP.NAP.ICE TEA BLACK yellow 0.5 liter TEA

same root-key pattern is TEA.

How to get all unique root-key patterns. Here 17 uninque patterns.

MAKFA Makar., kolb , Spikachki, Bananas, Lemons, Napkins paper, Cotton sticks, HEBEKINSKIE Macaroni, CAT seed, Cheese ,TEA, Biscuit, Onion, steering-wheel, Package (Plastiktre), Mayon,Cottage cheese

csv file has some million rows and path C:/myfold/goods.csv (or goods.txt)




In code tagm can't arrange pattern, so provide it as image



SIMILAR PATTERN
22882

werafa
09-17-2018, 03:11 AM
Hi Synthex,
I'm not sure if I understand you correctly, but here goes

You can use the Instring (instr? I forget the exact spelling) to test if one string is contained in another. for instance, "Instring("Test", "Test Cricket") would return "1", so


If instring(.....) > 0 then
'string exists inside second string
end if


I think you need to use application.worksheetfunction.instring to get this one to work

You can use a nested loop to test for unique/duplicated items. this should get you started


for lRow = 1 to lLastRow
sString1 = range("A" & lrow).value
bFlag = false
for lRow2 = lrow+1 to lLastRow
sString2 = range("A" & lrow2).value
if sstring1 = sstring2 then
bflag = true 'values match
'do something with matching values
exit for
end if
next lrow2
next lrow

synthex
09-18-2018, 03:00 AM
hm. i got ther compile error
here the screen, what i did wrong22889

werafa
09-18-2018, 03:56 AM
Hi Synthex

you have to put the code in a sub routine
I would also strongly recommend that you go to the VBA editor settings and tick 'Option Explicit', this forces you to declare all your variables, and helps prevent typing mistakes.

also, you need to define what you need to do when you find a duplicated Value - all I did was put a comment at this place.

anyway - try this



option explicit

Public Sub FindDuplicates ()

dim lRow as long
dim lRow2 as long
dim lLastRow as long
dim sString as long
'This use of capitalisation is called 'camel case' - and is good for lots of reasons
' using lots of comments like this helps you to understand your code later

lastrow = activesheet.usedrange.rows.count
for lRow = 1 to lLastRow
sString1 = activesheet.range("A" & lrow).value
if sstring1 <> "" then 'cell is not blank - proceed with test
for lRow2 = lrow+1 to lLastRow
sString2 = range("A" & lrow2).value
if sstring1 = sstring2 then
bflag = true 'values match
'do something with matching values
activesheet.range("A" & lrow2).interior.colorindex = 3 'mark cells with red background fill
'you could also activesheet.range("A" & lrow2).clearcontents 'to delete values
end if

next lrow2
end if
next lrow


if bflag = false then msgbox("no duplicate values found")

end sub

werafa
09-18-2018, 04:02 AM
I've not tested this code, so you probably have to debug it.
press f8 to step through the code line by line.

the VBE will attempt to compile the code, and highlight the line associated with any compile errors.
once it will compile, F8 lets you see how each line works/what it does

good luck
Werafa

synthex
09-19-2018, 02:20 AM
dear, werafa
when i run you code, i got the error, see the screen
22896


also, you need to define what you need to do when you find a duplicated Value - all I did was put a comment at this place.

duplicate value put in column
to be more clearly, here simple example


makar makfa 500
makar makfa


onion
onion


makfa makar circle
makar makfa



makar makfa it is common pattern for row 1 and 3 so opposite it, pattern placed.

How to do it?

synthex
09-19-2018, 03:00 AM
Note, in the first post I showed on the screen the entities that I self identified. (similar patter)I want that vba independently identified these entities without me. I do not ask the initial patterns.
i just provide example what i want vba to do without me.
For this purpose i need vba
let's this similar pattern it finds without me and not just comparing my patterns with the available data.
that's the main goal)))

werafa
09-19-2018, 03:09 AM
this is a 'compile error', and excel is telling you that there is a typo. (this is exactly why you use the 'option explicit' at the top)
The typo is in the text that is highlighted. you will see that the dim statement is lLastRow - but the highlighted text is 'lastrow'. Spell it exactly as written in the dim statement near the top (type it in lowercase, and excel will convert it to the correct capitalisation when it recognises it correctly).

to write a value in a cell, you can use:

activesheet.range("B" & lRow2).value = "some text"
if lRow2 = 5, then this resolves to cell B5. Change "B" to be whatever column you need, and change "Some text" to be something useful. eg.

activesheet.range("B" & lRow2).value = activesheet.range("A" & lRow2).value

You seem to have a couple of choices to make, so you might like to search the internet for "Excel VBA If Then"
I think you would also do well to search for "Excel VBA Instring"

I think these two commands will let you identify the patterns you are looking for

snb
09-19-2018, 05:17 AM
Keep it simple


sub M_snb()
sn=split("MAKFA Makar.,kolb,Spikachki,Bananas,Lemons,Napkins paper,Cotton sticks,HEBEKINSKIE Macaroni,CAT seed,Cheese,TEA,Biscuit,Onion,steering-wheel,Package (Plastiktre),Mayon,Cottage cheese",",")

sp=split(createobject("scripting.filesystemobject).opentextfile("G:\OF\23.csv").readall,vbcrlf)

for each it in sn
msgbox join(fiilter(sp,it),vblf)
next
End Sub

synthex
09-19-2018, 06:31 AM
Keep it simple


sub M_snb()
sn=split("MAKFA Makar.,kolb,Spikachki,Bananas,Lemons,Napkins paper,Cotton sticks,HEBEKINSKIE Macaroni,CAT seed,Cheese,TEA,Biscuit,Onion,steering-wheel,Package (Plastiktre),Mayon,Cottage cheese",",")

sp=split(createobject("scripting.filesystemobject).opentextfile("G:\OF\23.csv").readall,vbcrlf)

for each it in sn
msgbox join(fiilter(sp,it),vblf)
next
End Sub


snb, a little be wrong. Why?
MAKFA Makar.,kolb,Spikachki,Bananas,Lemons,Napkins paper,Cotton sticks,HEBEKINSKIE Macaroni,CAT seed,Cheese,TEA,Biscuit,Onion,steering-wheel,Package (Plastiktre),Mayon,Cottage cheese"
these classes i extract manualle and provide as example of outpur i wanted, so my question was about how to do that VBA self extracts these classe
so as input, i just have




I.E. only 0ne column with data on input

22898

I.E, two column where class was defined by vba on the basis of root-key,key words



22899

Of course, I understand that it is not worth waiting for absolute precision. any accuracy is valuable.
Do you understand my question?:) If no, let me know, i 'll try to explain

snb
09-19-2018, 06:42 AM
You only need to adapt the first code line.

synthex
09-19-2018, 06:54 AM
Can't understand you. Can you show what do you mean

DanielLee2
09-19-2018, 06:54 AM
I have extracted key entities by VBA and also Tried ZetExcel SDK. (https://www.zetexcel.com)

synthex
09-19-2018, 08:12 AM
DanielLee2
just provide example how can i extract key entities by VBA in mu data.
I have any difficulties.