verify start and end tags



saban
01-12-2006, 07:22 AM
I have a question about validating XML tags in Word documents.

How can I check that every start tag has its end tag (<Amend> </Amend> and so on)?

for instance:

<Amend>bla bla bla <NumAm>1</NumAM>
bla bla bla </Amend>

<Amend>bla bla bla <NumAm>1</NumAM>
bla bla bla </Amen

There are many tags in the document. How can I check whether some of them are missing or damaged, and show where the missing or damaged tag is?

saban
01-12-2006, 07:29 AM
Let's assume that text is written in between the tags.

fumei
01-12-2006, 06:41 PM
If you could NOT have nested tags this would not be really all that difficult. Tedious, but not difficult. However, as you CAN nest tags, this vastly increases the logic statements required.

Can be done, but would be SO tedious...it becomes difficult.

Essentially, take just a piece of your sample. I have adjusted the presentation of it to try and make this easier to read. And let's pretend that the first <Amend> is the beginning of the doc.

<Amend>
<Date>{07/12/2005}7.12.2005</Date>
<ANo>A6-0317</ANo>/
<NumAm>49</NumAm>
</Amend>

OK, say you are testing for <Amend> to see if it has a proper </Amend>.

1. Go to the start of the doc.
2. Find the first tag. Search for any text enclosed by < >.
3. Make a string variable for that. Could use wildcards as well.
4. Search forward text for this variable, but with the added "/".
5. Search BACK to see if there is another instance of the original string.
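
A minimal sketch of steps 1 to 5, working on plain text rather than on the Word Find object. The names FirstTagName and FirstTagIsClosed are hypothetical, and the sketch assumes tags are written exactly as <Name> and </Name> with no attributes.

```vba
' Hypothetical helper (steps 1-3): returns the name of the first tag
' in src, or "" when there is no complete <...> pair.
Function FirstTagName(ByVal src As String) As String
    Dim p1 As Long, p2 As Long
    p1 = InStr(1, src, "<")
    If p1 = 0 Then Exit Function
    p2 = InStr(p1, src, ">")
    If p2 = 0 Then Exit Function
    FirstTagName = Mid$(src, p1 + 1, p2 - p1 - 1)
End Function

' Hypothetical helper (steps 4-5): True when the first <tagName> is
' followed by </tagName> with no second <tagName> in between.
Function FirstTagIsClosed(ByVal src As String, ByVal tagName As String) As Boolean
    Dim openTag As String, closeTag As String
    Dim openPos As Long, closePos As Long, nextOpen As Long

    openTag = "<" & tagName & ">"
    closeTag = "</" & tagName & ">"

    openPos = InStr(1, src, openTag, vbBinaryCompare)
    If openPos = 0 Then Exit Function          ' no start tag at all

    ' Step 4: search forward for the same tag with the added "/".
    closePos = InStr(openPos + Len(openTag), src, closeTag, vbBinaryCompare)
    If closePos = 0 Then Exit Function         ' start tag is never closed

    ' Step 5: is there another start tag before that end tag?
    nextOpen = InStr(openPos + Len(openTag), src, openTag, vbBinaryCompare)
    FirstTagIsClosed = (nextOpen = 0 Or nextOpen > closePos)
End Function
```

In Word this could be called as FirstTagIsClosed(ActiveDocument.Content.Text, "Amend"). It only answers the question for the first occurrence; checking every occurrence needs a loop around it.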

LOGIC: the issue is how do you trap an instance of a word (a string) between other strings.

A:
<Amend> text text <Amend>
<Date>textext<Date> <Amend>texttext</Amend> WRONG

B:
<Amend> text text </Amend>
<Date> textext<Amend>text text </Amend> WRONG

So say you search for a tag PLUS that tag again with closing character in it.

In A you end with the </Amend> at the end of this snippet - with tags in between. What do you do? You have to search THOSE tags logically. Is one of them another <Amend>? Is it the first one? If so, then THAT one needs the closing "/".

Now another choice. Do you continue to determine the logic decisions for this initial chunk? In other words, do you do the logic testing for the OTHER tags - in this case, <Date>? Or do you finish with the original tag - this case <Amend>?

In B you end up with the correct closing tag - with NO tags in between. But how do you know that? You don't. The most important point is that one of them may be another <Amend> (or the original tag, whatever it is you are testing). If it is - then that is probably...but may not be...you are going to have to test...the closing tag.

So...you gotta check.

Do you see what I mean? Yes, this could be done. There may be more efficient ways of going about it. Likely there are. Still, it is an issue of the ....booooorrrrrring..tediousness of doing it.

Mind you, if you have a lot of this to do, well it may be worth it.

saban
01-13-2006, 01:46 AM
I was thinking of testing the tags through an input box: the user types the tag to check into the input box. Say he enters the start tag Amend. Word searches for the first instance of the <Amend> tag, and before finding the next <Amend> it should find the closing tag </Amend>. If it finds <Amend> before </Amend>, it means the closing tag is missing. Am I thinking right or not? I am not even sure.

saban
01-13-2006, 04:19 AM
I find your option A OK, but how could I write this in code?

fumei
01-13-2006, 01:21 PM
"word searches for first instance of <Amend> tag and before finding the next <Amend> it should find closing tag </Amend> if it finds <Amend> before </Amend> it means that closing tag is missing."
Yes, this is true, but think about it. A search instruction is a search instruction. You search FOR the next <Amend>. You can not search for two things at once. So you say search for <Amend> but before you find <Amend> find </Amend>...it does not work that way. You can look for <Amend> OR you can look for </Amend>. You can not look for both at the same time.

So again, this has to be done - as I posted - by logic. Find the next </Amend>, go back and check if there is an <Amend> between your starting point and end point....yadda yadda yadda.

How do you code this? By coding it using the needed logic, exactly as I posted. This is the problem, it is tedious, fussy and must be completely air tight logic.
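
One way this logic could be sketched, assuming a tag that is not allowed to nest inside itself: run two separate searches from the same position, one for the start tag and one for the end tag, and compare which position comes first. CheckSimpleTag is a hypothetical name, and the function works on a plain string, not on the Word Find object.

```vba
' Hypothetical helper: scans src for one non-nesting tag and returns
' a report line (with character position) for each problem found.
Function CheckSimpleTag(ByVal src As String, ByVal tagName As String) As String
    Dim openTag As String, closeTag As String
    Dim pos As Long, nextOpen As Long, nextClose As Long
    Dim isOpen As Boolean, openedAt As Long
    Dim report As String

    openTag = "<" & tagName & ">"
    closeTag = "</" & tagName & ">"
    pos = 1

    Do
        ' Two independent searches from the same starting point.
        nextOpen = InStr(pos, src, openTag, vbBinaryCompare)
        nextClose = InStr(pos, src, closeTag, vbBinaryCompare)
        If nextOpen = 0 And nextClose = 0 Then Exit Do

        If nextOpen > 0 And (nextClose = 0 Or nextOpen < nextClose) Then
            ' A start tag comes first; the previous one must be closed by now.
            If isOpen Then report = report & "Unclosed " & openTag & " at " & openedAt & vbCrLf
            isOpen = True
            openedAt = nextOpen
            pos = nextOpen + Len(openTag)
        Else
            ' An end tag comes first; there must be a start tag waiting for it.
            If Not isOpen Then report = report & "Unexpected " & closeTag & " at " & nextClose & vbCrLf
            isOpen = False
            pos = nextClose + Len(closeTag)
        End If
    Loop

    If isOpen Then report = report & "Unclosed " & openTag & " at " & openedAt & vbCrLf
    CheckSimpleTag = report
End Function
```

An empty result means every <Amend> was followed by its </Amend> before the next <Amend>. The sketch reports where the problem is, but it cannot decide whether the fix is to remove a tag or to add a missing one.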

mdmackillop
01-13-2006, 02:40 PM
I don't know if this is any help, but here's some rough code to list all the tag codes. I think with a little (maybe a lot) more work you could record the tags and add/reduce a tab count which could produce an indented listing.
Regards
MD

Sub Tags()
    ' Lists up to 100 tags from the document in a new document.
    Dim MyData(1 To 100) As String
    Dim i As Long, n As Long

    ' Temporarily swap the brackets for plain markers.
    SwapText "<", "xx"
    SwapText ">", "zz"
    Selection.HomeKey Unit:=wdStory

    With Selection.Find
        .ClearFormatting
        .Text = "xx*zz"
        .MatchWildcards = True
        .Forward = True
        .Wrap = wdFindStop
        ' Stop at the last tag instead of re-reading the old selection.
        Do While n < 100
            If Not .Execute Then Exit Do
            n = n + 1
            MyData(n) = Mid(Selection.Text, 3, Len(Selection.Text) - 4)
        Loop
    End With

    ' Put the brackets back.
    SwapText "xx", "<"
    SwapText "zz", ">"

    ' Write only the tags that were actually found.
    Documents.Add
    For i = 1 To n
        Selection.TypeText MyData(i) & vbCr
    Next
End Sub

Private Sub SwapText(ByVal findText As String, ByVal replaceText As String)
    ' Replaces every occurrence of findText in the document.
    With Selection.Find
        .ClearFormatting
        .Replacement.ClearFormatting
        .Text = findText
        .Replacement.Text = replaceText
        .Forward = True
        .Wrap = wdFindContinue
        .MatchWildcards = False
        .Execute Replace:=wdReplaceAll
    End With
End Sub

saban
01-14-2006, 04:04 AM
thnx I will give it a try and let you know

:)

fumei
01-15-2006, 09:37 PM
Nice md, but sorry, that does NOT take into consideration at all the essential issue - which is <whatever> properly followed by a </whatever>.

There is no logic at all to deal with determining if a tag is properly closed. Yes it lists them..which is good I suppose, but there is no logic to deal with incorrect ones. I mean if you want to get the list, simply extract all text strings that are between < and >. Would be much simpler.
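
A rough sketch of that simpler extraction, working on the document text as one string. ListTags is a hypothetical name; it collects every run of text from a "<" to the next ">", in document order, and does no checking of whether the tags match.

```vba
' Hypothetical helper: returns every <...> string found in src, in order.
Function ListTags(ByVal src As String) As Collection
    Dim tags As New Collection
    Dim p1 As Long, p2 As Long

    p1 = InStr(1, src, "<")
    Do While p1 > 0
        p2 = InStr(p1 + 1, src, ">")
        If p2 = 0 Then Exit Do               ' a "<" with no ">" after it
        tags.Add Mid$(src, p1, p2 - p1 + 1)  ' keep the brackets
        p1 = InStr(p2 + 1, src, "<")
    Loop
    Set ListTags = tags
End Function
```

In Word it could be fed with ActiveDocument.Content.Text. A damaged tag such as "</Amen" with no ">" is swallowed together with the text up to the next ">", so an oddly long entry in the list is itself a hint that something is broken.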

saban
01-16-2006, 02:30 AM
And when I extract them, how can I check whether some of them are damaged? Any ideas?
What would be your solution?

geekgirlau
01-16-2006, 05:03 PM
As a starting point, what about a simple count of tags and their matching end tags (for example, <Amend> occurs 20 times, </Amend> occurs 19 times) and only display a list (or highlight in the document) those tags where the count of start and end values do not match?
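
A small sketch of that count comparison on plain text. CountOf and UnbalancedTags are hypothetical names, and the tag list passed in is an assumption; only tags whose start and end counts differ are reported.

```vba
' Hypothetical helper: counts occurrences of what in src.
Function CountOf(ByVal src As String, ByVal what As String) As Long
    If Len(src) = 0 Or Len(what) = 0 Then Exit Function
    ' Split returns one more piece than there are separators.
    CountOf = UBound(Split(src, what, -1, vbBinaryCompare))
End Function

' Hypothetical helper: lists only the tags whose counts do not match.
Function UnbalancedTags(ByVal src As String, ByVal tagNames As Variant) As String
    Dim i As Long, nOpen As Long, nClose As Long
    Dim result As String

    For i = LBound(tagNames) To UBound(tagNames)
        nOpen = CountOf(src, "<" & tagNames(i) & ">")
        nClose = CountOf(src, "</" & tagNames(i) & ">")
        If nOpen <> nClose Then
            result = result & tagNames(i) & ": " & nOpen & " start, " & _
                     nClose & " end" & vbCrLf
        End If
    Next i
    UnbalancedTags = result
End Function
```

For example, UnbalancedTags(ActiveDocument.Content.Text, Array("Amend", "NumAm")) would return an empty string when both tags balance. The count says that something is wrong with a tag, not where.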

saban
01-17-2006, 01:19 AM
I already did the counting of the codes. Now I can't figure out how to highlight the non-matching tags.

Thnx
here is the code:
Sub d()
    ' Counts each start tag and its end tag and reports the totals.
    Dim tagNames As Variant
    Dim msg As String
    Dim i As Long

    tagNames = Array("Amend", "Members", "NumAm", "Article", "Original")
    For i = LBound(tagNames) To UBound(tagNames)
        msg = msg & "<" & tagNames(i) & "> found " & _
            CountTag("<" & tagNames(i) & ">") & " times" & vbCrLf & _
            "</" & tagNames(i) & "> found " & _
            CountTag("</" & tagNames(i) & ">") & " times" & vbCrLf & vbCrLf
    Next i
    MsgBox msg
End Sub

Function CountTag(ByVal tagText As String) As Long
    ' Counts occurrences of tagText formatted with the HideTWBExt style.
    Dim n As Long
    With ActiveDocument.Content.Find
        .Text = tagText
        .Format = False
        .Wrap = wdFindStop
        .Style = ActiveDocument.Styles("HideTWBExt")
        Do While .Execute
            n = n + 1
        Loop
    End With
    CountTag = n
End Function

saban
01-17-2006, 01:21 AM
Would it be possible not to highlight all the tags? (I guess that if one of the "<Amend>" or "</Amend>" tags does not match, it will highlight all the <Amend> and </Amend> codes, not just the ones that are missing their start or end tags?)

fumei
01-17-2006, 08:09 AM
People, people. The actual point is still being missed!

YES - you can count tags. YES - that would identify that there is a missed tag. Say 20 <Amend> and 19 </Amend>. But it tells you NOTHING about the logic.

Does that mean there are really 19 proper tags, and an EXTRA <Amend>?

OR;

Does that mean there are 20 proper tags and a MISSING </Amend>?

See what I mean? There is no way to know unless you parse it. Parsing is a logic operation. A count is a good starting point but it does NOT help (really) at all with the logic needed.

Yes - you can highlight tags....but the logic problem remains.

You need to match, and you need to match in the proper order.

"highlight all the <Amend> and </Amend> codes not just the ones that misses their start or end tags??"
I don't know how many more times I can state this. The answer is YES! You can do this. But it requires very detailed, flawlessly convoluted logic. There is no other way.

Further, as I stated before, the logic is not difficult, but it IS tedious. If you have a real need for a tool like this, then by all means do it...and use it.

I mean you could do a superficial count operation. That would at least warn you that something is wrong - but it would not tell you exactly what it is (is it an extra, or a missing tag), nor would it tell where it is.

A really functional tool requires perfect logic. This logic MUST perform a variable number of loopback operations.

A: <Amend> text text <Amend> text text ettstst </Amend>
B: <Amend> text text text text ettstst </Amend>

Which is correct? B: right? That is easy. But how do you KNOW there is not an improper tag between an <Amend> and an </Amend>, as in A:? You can not know - unless you actually check. There is no other way. You must check. Period.

Further, there must be a logic test to see - is the first <Amend> correct......and the second one needs to be removed; or is the second one correct, and the first one needs to be removed. Further, if the second one is correct...are you sure the first one needs to be removed...or does it actually need a closing tag? Further, if the first one is correct does the second one need to be removed...or does IT need a closing tag?

These are logic tests. And this thing just ain't gonna fly without them.
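
For what it is worth, a different technique from the loopback searches described above is a stack: push each start tag, and pop it when its end tag arrives. This is a sketch of that swapped-in approach, not the method described in this post. FirstTagError is a hypothetical name, tags are assumed to have no attributes, and the comparison is case-sensitive.

```vba
' Hypothetical checker: returns "" when every tag in src is closed in
' the right order, otherwise a message describing the first problem.
Function FirstTagError(ByVal src As String) As String
    Dim stack As New Collection      ' names of the tags that are still open
    Dim p1 As Long, p2 As Long
    Dim tagName As String

    p1 = InStr(1, src, "<")
    Do While p1 > 0
        p2 = InStr(p1 + 1, src, ">")
        If p2 = 0 Then
            FirstTagError = "Tag at " & p1 & " has no closing bracket"
            Exit Function
        End If
        tagName = Mid$(src, p1 + 1, p2 - p1 - 1)

        If Left$(tagName, 1) = "/" Then
            ' End tag: it must match the most recently opened tag.
            tagName = Mid$(tagName, 2)
            If stack.Count = 0 Then
                FirstTagError = "</" & tagName & "> at " & p1 & " has no start tag"
                Exit Function
            ElseIf stack(stack.Count) <> tagName Then
                FirstTagError = "</" & tagName & "> at " & p1 & _
                    " found, but <" & stack(stack.Count) & "> is still open"
                Exit Function
            End If
            stack.Remove stack.Count
        Else
            ' Start tag: remember it until its end tag shows up.
            stack.Add tagName
        End If
        p1 = InStr(p2 + 1, src, "<")
    Loop

    If stack.Count > 0 Then FirstTagError = "<" & stack(stack.Count) & "> is never closed"
End Function
```

This handles nested tags, but it stops at the first problem and still cannot tell whether a tag is extra or its partner is missing. That judgment stays with whoever reads the report.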

saban
01-17-2006, 08:27 AM
Gerry, you seem to know what you are talking about, but I just can't figure out how to write this in code. Can you please write me the code for this parsing logic?

mdmackillop
01-17-2006, 02:33 PM
This is definitely not my area of expertise, but in an effort to assist I'll offer my thoughts.
I totally agree with Gerry, you have to solve the logic; however, I don't know how complicated your web pages are. Do the start tags contain more text than the end tags? This obviously causes comparison problems. How many tags are you actually using? With a limited number, I can see how some array comparisons may assist. Is it really necessary to use VBA to solve your problem? With a printout and a pencil, I'm sure I could check simple web pages quite quickly, using my earlier code.
If I was composing stuff, I'd probably enter start/end codes as a "pair" and infill the text and other tags (in pairs) between as required, but maybe I'm being too simplistic.
Regards
MD

saban
01-18-2006, 12:53 AM
It is not a web page. It is a Word document which has XML tags in it. I will attach an example of it.

saban
01-18-2006, 01:02 AM
Lets say I need to check just <Amend> and </Amend> codes cause I guess it is the same procedure for every other code

Thnx

TonyJollans
01-18-2006, 04:32 AM
This is, as Gerry keeps saying, complex. If I had more time (and knowledge) I would be interested in creating a Word AddIn to do this. Meanwhile I would suggest that you investigate other tools. Have you tried google for xml validators (or similar)? There are tools out there and you should be able to find something either to check for well-formedness (what a horrible word - is it correct) or validity against a style sheet or transform.

saban
01-18-2006, 05:52 AM
ok thnx

I have looked at XML parsers but these are too complex, I guess

saban
01-18-2006, 06:48 AM
If <Amend> is found, then before the next <Amend> it should find </Amend>; else give me an error and select the <Amend> that does not have a matching pair :)

It would be cool if I could tell the computer just that and it would understand.
Anyway, I will try and let you guys know how far I manage to get :)

saban
01-18-2006, 07:37 AM
Is it possible to find one start tag and then another one, and then check whether there is an end tag between those two? Or can it not be done this way?

saban
01-18-2006, 08:11 AM
How is this done: find <Amend> and then another <Amend>, select all the text between these two codes (including the codes), then search in this selection for </Amend>, and if it exists, move on and find the next instance like this, and so on till the end of the document? Any suggestions?
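
A rough sketch of that idea using Word ranges instead of the Selection, assuming <Amend> is never nested inside another <Amend>. HighlightUnclosedAmend and FindIn are hypothetical names. It highlights every <Amend> that has no </Amend> before the next <Amend> (or before the end of the document); it does not look for stray </Amend> tags, and it has not been tried on the sample documents in this thread.

```vba
Sub HighlightUnclosedAmend()
    Dim doc As Document
    Dim openRng As Range, nextOpen As Range, closeRng As Range
    Dim limit As Long, isClosed As Boolean

    Set doc = ActiveDocument
    Set openRng = doc.Content
    Do While FindIn(openRng, "<Amend>")
        ' The end tag must appear before the next start tag,
        ' or before the end of the document if this is the last one.
        Set nextOpen = doc.Range(openRng.End, doc.Content.End)
        If FindIn(nextOpen, "<Amend>") Then
            limit = nextOpen.Start
        Else
            limit = doc.Content.End
        End If

        ' Search only the text between this start tag and that limit.
        isClosed = False
        If limit > openRng.End Then
            Set closeRng = doc.Range(openRng.End, limit)
            If FindIn(closeRng, "</Amend>") Then isClosed = (closeRng.End <= limit)
        End If
        If Not isClosed Then openRng.HighlightColorIndex = wdYellow

        ' Carry on after the current start tag.
        Set openRng = doc.Range(openRng.End, doc.Content.End)
    Loop
End Sub

' Hypothetical helper: searches inside rng; on success rng becomes the match.
Private Function FindIn(ByVal rng As Range, ByVal what As String) As Boolean
    With rng.Find
        .ClearFormatting
        .Text = what
        .Forward = True
        .Wrap = wdFindStop
        .MatchCase = True
        .MatchWildcards = False
        FindIn = .Execute
    End With
End Function
```

Working with Range objects keeps the user's selection where it was, and each search is limited to the stretch of text being tested.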

fumei
01-18-2006, 12:26 PM
OK...all right, all right. Geeeez. I have no real use for this, but because it seems to be a big issue, I will write a SMALL portion of this to give you an idea of how it could be done.

I can not do this immediately, but I will do it shortly. As stated, it is not really difficult, just boring and tedious.

Give me a day or so. I will use your sample doc to demonstrate.

fumei
01-18-2006, 12:29 PM
Oh, and md has the correct idea. It would be MUCH MUCH better to have correct input in the first place. Put in proper opening and closing tags.

Which brings up a point....how are the tags being inserted in the first place????

mdmackillop
01-18-2006, 01:03 PM
Here is a simple way to add your tags.

saban
01-19-2006, 02:56 AM
Thnx gerry

saban
01-19-2006, 03:01 AM
I know it would be better to have correct input, but that is the problem: I do not make these documents. I just get them from other translation divisions and distribute them to our translators.

fumei
01-19-2006, 08:35 AM
Yeah....well, I have started on this. Yikes! So far I have a UserForm that lists ALL tags (in order) so it is fairly easy to see errors; as well as a list of tags by type. Now comes the tricky part of actually doing the logic. As stated....tedious and boring.

To make the demo easier (on me) I am NOT going to make it recursive and check tags within tags. In other words, I am NOT going to check nestings - which if you want a full parsing I would suggest this actually get done. (Good luck with that....) But I ain't gonna do it.

This is pretty complex and I am not sure how long this will take, as I have other more.....hmmmmm....crucial tasks that need to get done.

Patience.

saban
01-19-2006, 08:59 AM
np
Just check the Amend tags i dont need to know if tags inside are correct

thnx again

saban
01-25-2006, 08:45 AM
any luck Gerry?

thnx for all your help
saban

saban
01-25-2006, 09:04 AM
just one question

When I loop through the text looking for </Amend> and it is found, I 'do something
and then I search further. But how can I find this same </Amend> one more time?
(Not another instance of </Amend>, but the same one as before; after I have found it 2 times, I move to the next instance.)

Thnx

fumei
01-26-2006, 09:35 AM
see your other thread

saban
01-31-2006, 02:10 AM
did you manage yet to write code for <Amend> checking

Thnx

mdmackillop
01-31-2006, 11:50 AM
Hi Saban,
I'm playing around with some code. Do you have a larger sample to test?
Regards
MD

mdmackillop
01-31-2006, 01:50 PM
Try the attached. Run Tags in the Word document. To help track down the error, paste the result from the output document into Cell A1 in Excel and run the Layout code.

saban
02-03-2006, 08:43 AM
thanks for all your help
Here is the large sample

saban
02-03-2006, 08:56 AM
i get subscript out of range in :
Selection.TypeText MyData(i, 1) & vbCr

saban
02-03-2006, 09:02 AM
any ideas why

mdmackillop
02-03-2006, 03:06 PM
If you run my original code and paste the result into the spreadsheet of my last attachment, a quick inspection shows that \Amjust is missing. My later code identifies another error, an extra bracket two lines before the last table on Page 5. I'm looking into ways of highlighting the errors, to save the manual inspection.

matthewspatrick
02-03-2006, 08:36 PM
Hey all, please check out my post in:
http://vbaexpress.com/forum/showthread.php?t=6956

I think you'll like it :devil:

Patrick

saban
02-04-2006, 03:50 AM
that would be nice
thnx for all your help

mdmackillop
02-06-2006, 01:39 AM
Here's something to try. It contains your sample as posted.
Running Patrick's "CheckTags" code will identify any mismatched items.
Running my Tags code will identify an extra bracket. Delete this and rerun the code.
This should highlight the end of an open/close which contains an error; in this case an opening Amjust with no \Amjust.

BTW Gerry, I tried to incorporate your neat CountOfWords, but had problems with a search for <*>, so for now, I'm using my clumsy xx*zz alternative.

saban
02-06-2006, 04:28 AM
ok i will try and let you know

saban
02-06-2006, 06:40 AM
If I understand this right, first your sub identifies any extra bracket; then, when I delete the extra bracket and rerun your sub, it will locate any missing or damaged tags and highlight them in yellow.

Am I correct
thnx for all your time

mdmackillop
02-06-2006, 12:40 PM
An extra bracket may not be the error; it could be a missing bracket. The code identifies a problem, not the solution. In your sample, it appears there is one extraneous bracket.

With regard to tags, the code should identify the location of an error, but not necessarily the error itself. The attached Excel version may assist in the debugging of any errors.

saban
02-07-2006, 03:20 AM
Can you please explain how this works? If I add or delete a bracket, there is no data in the Excel sheet, and if I damage or delete some tag, there is also no data in debug.xls.
Why?

thnx

mdmackillop
02-07-2006, 08:57 AM
I don't think I can give you an "all encompassing" solution. There are too many potential variables. My code in post 46 is supplementary to that in Post 43. I've assumed that bracket errors are corrected before running the Excel code, and Patrick's code will highlight any missing tag terms.

saban
02-08-2006, 02:13 AM
But Patrick's code does not highlight any missing term; it just counts them and gives me a msgbox with asterisks where a tag is missing or damaged.

mdmackillop
02-08-2006, 10:31 AM
By "highlight" I mean it lets you know what to look for. The codes provided should assist in your debugging, but a full blown solution is beyond my limits as a helper here. Perhaps you need to commission someone to create a "professional" solution.
Regards
MD

saban
02-09-2006, 11:35 AM
man thnx for all your help i really appreciate it

thnx to all of you
saban