Create HTML From a Word Doc



Bob Phillips
06-15-2005, 11:44 AM
I have a number of papers that I write in Word, and I want to publish them on the web.

BUT ... I am not interested in using Word's HTML output. I want to be able to tweak the output, and I don't need all that garbage. Plus, I want to add my own bells and whistles: metadata, stylesheets, JavaScript, etc.

What I am looking for is a good example of working through a Word doc, examining each paragraph and taking action based upon its style.

I am quite happy that I have the VBA to do the conversion from Word text to HTML text, but I don't have a lot of experience of the Word object model, or how to process it, so it is that part I am looking for.

Any ideas?

Thanks

MOS MASTER
06-15-2005, 01:17 PM
Hi Bob, :yes

I have very little time, but you should loop through the paragraphs.

Off the top of my head, something like:
Sub ProcessHTML()
    Dim oPara As Word.Paragraph
    'Visit every paragraph and hand over its text and style name
    For Each oPara In ActiveDocument.Paragraphs
        Call MakeHTML(oPara.Range.Text, oPara.Style)
    Next oPara
End Sub

Sub MakeHTML(ByVal sRange As String, ByVal sStyle As String)
    MsgBox sRange 'Process the text
    MsgBox sStyle 'The style used
End Sub


Enjoy! :whistle:
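
For anyone following along, here is a minimal sketch of how the loop above could choose an HTML tag per paragraph style. The helper name TagForStyle and the style-to-tag mapping are assumptions for illustration, not something from Bob's documents, and the built-in style names will differ in non-English versions of Word.

```vba
Option Explicit

' Hypothetical helper: maps a Word style name to an HTML tag name.
' The mapping is an assumption; adjust it to your own styles.
Function TagForStyle(ByVal sStyle As String) As String
    Select Case sStyle
        Case "Heading 1": TagForStyle = "h1"
        Case "Heading 2": TagForStyle = "h2"
        Case "Heading 3": TagForStyle = "h3"
        Case Else: TagForStyle = "p"
    End Select
End Function

Sub ListParagraphTags()
    Dim oPara As Word.Paragraph
    Dim sText As String
    Dim sStyle As String
    Dim sTag As String

    For Each oPara In ActiveDocument.Paragraphs
        sStyle = oPara.Style   ' the style's name, via its default property
        sText = oPara.Range.Text

        ' Range.Text includes the trailing paragraph mark; drop it
        If Right$(sText, 1) = vbCr Then sText = Left$(sText, Len(sText) - 1)

        sTag = TagForStyle(sStyle)
        Debug.Print "<" & sTag & ">" & sText & "</" & sTag & ">"
    Next oPara
End Sub
```

This only prints to the Immediate window and leaves the document untouched, so it is safe to try on a real paper.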

Howard Kaikow
06-15-2005, 01:43 PM
A first attempt might be:

1. In Word, save as a filtered HTML file, which is alleged to remove a lot of the crap.

2. Take the resultant file and pass it through a program such as HTML Tidy or CSE HTML Validator.

Start at HTML Tidy Project Page (http://tidy.sourceforge.net/) and CSE HTML Validator (http://www.htmlvalidator.com/) and W3C HTML Validation Service (http://validator.w3.org/) and WDG HTML Validator (http://www.htmlhelp.com/tools/validator/) and ... .

Bob Phillips
06-15-2005, 03:53 PM
Howard,

Thanks for the suggestion, but not really the way I think I want to go. I want to use my 'style' so I think I need to intercept it. I will put it through the W3C validator when I have done with it.

Joost,

That looks the sort of thing I want. Started playing with it and it is going well.

A couple of other questions though, if I may.

I have some lines (drawing objects) in the document. Any idea how I can tell when I come across one?

I also have some paragraphs that are of the usual Normal style, but they have some text within them of another style. Do you know of any easy way to identify text within sRange that is of another style (i.e. not by parsing the whole paragraph)?

How do I identify bullets?

Thanks

Bob

MOS MASTER
06-15-2005, 04:05 PM
Hi Bob, :yes

You're welcome.
I'm almost going to bed so I'm keeping this short; I'll have to come back to yah.

The best method is to make sure the document is strictly formatted (only defined styles are used, with no manual formatting on top).

That way you could make this work 100% (I think; untested, by the way).
If there's manual formatting I'm afraid the Paragraphs collection ain't gonna do it for yah.

In that case you'd have to loop through all characters in the document and examine each one's formatting. (Mucho slow and cumbersome, not to mention NOT foolproof!)

There are 4 types of styles so if the document is strictly formatted it would be possible to check for all four. If you're also using Character styles (to format something bold in a sentence, for example) then I think you'd have to loop through the Words of the document.

I must admit I have no idea of what your documents look like and if you have some kind of formatting guidelines you can depend upon.

Do you have a little example document so I can examine for you what kind of things would be possible to get this done?

To get the shapes you have to loop through the Shapes collection, but if the shapes are InlineShapes (treated as part of a range) then you'd need the InlineShapes collection instead...
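
A minimal sketch of walking both collections, for illustration only. It assumes the shapes live in the main body of the document and just lists what it finds; what to emit as HTML for each one is left open.

```vba
Sub ListShapes()
    Dim oShape As Word.Shape
    Dim oInline As Word.InlineShape

    ' Floating shapes (drawn lines, text boxes, floating pictures).
    ' Each one is anchored to a position in the text.
    For Each oShape In ActiveDocument.Shapes
        Debug.Print "Shape: " & oShape.Name & _
                    ", type " & oShape.Type & _
                    ", anchored at character " & oShape.Anchor.Start
    Next oShape

    ' Inline shapes sit in the text layer like a character
    For Each oInline In ActiveDocument.InlineShapes
        Debug.Print "InlineShape: type " & oInline.Type & _
                    ", at character " & oInline.Range.Start
    Next oInline
End Sub
```

Comparing the anchor or range position with a paragraph's Range.Start and Range.End is one way to tell which paragraph a shape belongs to.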

By now this post isn't as short as I'd hoped, but you're dealing with the biggest object model in Office here...

So the big issue here is defining the scope of what we do and don't need...

I hope I can drop by tomorrow but I'm probably working, so more to come the day after tomorrow.

Have a good one! :whistle:

Bob Phillips
06-16-2005, 02:45 AM
There are 4 types of styles so if the document is strictly formatted it would be possible to check for all four.

Hi Joost,

My document is well formatted (ish! :(), but as ever I am sure it could be better. But what does that quoted statement mean exactly, i.e. 4 styles? I have 40.

The data within a paragraph that I am after will have a style, it isn't just additional formatting. Parsing the lot is not a good idea we both know that, but does my use of style rather than just formats make it easier?

MOS MASTER
06-16-2005, 01:07 PM
Hi Bob, :yes

I'm sorry, I was too tired last night so I forgot one word! :banghead:

I meant...Word has 4 style types:

Paragraph
Character
Table
List
And to be clearer: not all versions have 4 style types to choose from, but the later ones (2002 onwards) do.
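
A small sketch of asking Word which of the four types a given style is, using the Type property of the Style object. The function name StyleTypeName is made up for this example.

```vba
' Hypothetical helper: returns the type of a named style as text.
Function StyleTypeName(ByVal sStyle As String) As String
    Select Case ActiveDocument.Styles(sStyle).Type
        Case wdStyleTypeParagraph: StyleTypeName = "Paragraph"
        Case wdStyleTypeCharacter: StyleTypeName = "Character"
        Case wdStyleTypeTable: StyleTypeName = "Table"
        Case wdStyleTypeList: StyleTypeName = "List"
        Case Else: StyleTypeName = "Other"
    End Select
End Function

Sub ListStyleTypes()
    Dim oStyle As Word.Style

    ' Report only the styles that are actually used in the document
    For Each oStyle In ActiveDocument.Styles
        If oStyle.InUse Then
            Debug.Print oStyle.NameLocal & ": " & StyleTypeName(oStyle.NameLocal)
        End If
    Next oStyle
End Sub
```

Note that InUse is also True for built-in styles that have been modified, so the list may be longer than the styles really applied to text.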

I think there's a tough job ahead anyhow. If you use strict formatting then it will make your life easier because you can check for a style. If you don't use it you have to check each character's format to duplicate that in your HTML-code. (Which will become a mess as well I presume)

Perhaps we can do this in little blocks.
I was thinking if you provide a document (Small) which is formatted in styles then I can try to output for you some code that will reproduce a new document that will state in which style a paragraph is formatted. (If I can make it that is) :rofl:

If that's possible then you should be able to build your HTML generating code in to that code.

It won't be easy, but I like to believe nothing is impossible so I'm willing to try. :whistle:

Bob Phillips
06-16-2005, 01:24 PM
I meant...Word has 4 style types:

Paragraph
Character
Table
List
Okay, that makes more sense. Is there any way that you can identify which type it is? The problem I have is that I have a paragraph with a style of, say, Normal, but some of the text within that paragraph is a different style, named 'FunctionName'. I suppose this must be a character style as it only applies to some characters. When I encounter the paragraph, I can get the style and set up my HTML accordingly. But what I want to do is to add span tags round the 'FunctionName' text.

As an example

This is an example paragraph with some key words embedded within the text, and some more here.

The bold words here ("key words" and "some more") are those of a different style. This would transcribe to HTML as

<p>This is an example paragraph with some <span class="functionname">key words</span> embedded within the text, and <span class="functionname">some more</span> here.</p>


If you don't use it you have to check each character's format to duplicate that in your HTML-code. (Which will become a mess as well I presume)

Exactly what I want to avoid. I do use strict formatting, so can I do it?


I was thinking if you provide a document (Small) which is formatted in styles then I can try to output for you some code that will reproduce a new document that will state in which style a paragraph is formatted. (If I can make it that is)

I don't think that is necessary, thanks anyway. I already have some code that is managing all of my paragraph styles and creating strictly formatted HTML. I have my metadata, stylesheets, JavaScript files etc. so it is a good first step.

The things I still need to do are:

the embedded character styles we have talked about
the shapes, my drawing lines
images
bullets
So I have made progress, it is just those few difficult bits left (bullets will be easy I think).
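
For the bullets, one approach (a sketch only, not tested against these documents) is to look at the ListFormat of each paragraph's range. It assumes the bullets come from Word's list formatting, whether applied directly or through a style, rather than from typed characters.

```vba
Sub ListBulletParagraphs()
    Dim oPara As Word.Paragraph

    For Each oPara In ActiveDocument.Paragraphs
        ' ListType tells us what kind of list, if any, the paragraph is in
        Select Case oPara.Range.ListFormat.ListType
            Case wdListBullet, wdListPictureBullet
                Debug.Print "bullet:   " & oPara.Range.Text
            Case wdListSimpleNumbering, wdListOutlineNumbering, _
                 wdListMixedNumbering
                Debug.Print "numbered: " & oPara.Range.Text
        End Select
    Next oPara
End Sub
```

Consecutive bullet paragraphs would then be wrapped in one ul element, with each paragraph becoming an li.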

Regards

Bob

MOS MASTER
06-16-2005, 02:16 PM
Hi Bob, :yes

Well, the hard part is still finding the character style.
I have an idea that I need to work out in a little example.

Instead of looping through paragraphs we can also combine things with looping through your styles (you have 40, so we can build an array of their names).
Then we can also use the Find object to speed things up...

I have it in my head; I just have to make an example to visualise it and see if it works...

I'm starting on it right now...(if they don't bother me that is...) :rofl:

MOS MASTER
06-16-2005, 03:03 PM
Hi Bob, :yes

Okay, I've cooked something up for you and I'm sure this will get you started.
I've added a (simple) test document for you to see how it works.

I had to make my own HTML converter to do a proper test! :rofl:

The code:
Option Explicit
Const i_STYLE As Integer = 0
Const i_TAG As Integer = 1

Sub ConvertToHTML()
    Dim oPara As Word.Paragraph
    Dim vStyles As Variant
    Dim lCount As Long

    'Pairs of (paragraph style name, HTML tag)
    vStyles = Array(Array("Bold", "b"), _
                    Array("Heading 1", "h1"), _
                    Array("Heading 2", "h2"), _
                    Array("Heading 3", "h3"))

    Application.ScreenUpdating = False

    Call CharacterStyles("C_Italic", "i") 'Do Character style

    For Each oPara In ActiveDocument.Paragraphs
        For lCount = LBound(vStyles) To UBound(vStyles)
            If oPara.Style = vStyles(lCount)(i_STYLE) Then
                With oPara.Range
                    'Leave the paragraph mark outside the tags
                    .MoveEnd Unit:=wdCharacter, Count:=-1
                    .InsertBefore Text:=("<" & vStyles(lCount)(i_TAG) & ">")
                    .InsertAfter Text:=("</" & vStyles(lCount)(i_TAG) & ">")
                End With

                Exit For
            End If
        Next lCount
    Next oPara

    Application.ScreenUpdating = True
End Sub

Sub CharacterStyles(ByVal sStyle As String, ByVal sTag As String)
    'Wrap every run of text in the given style in <tag>...</tag>
    With ActiveDocument.Range.Find
        .ClearFormatting
        .Style = ActiveDocument.Styles(sStyle)
        .Text = ""
        .Replacement.ClearFormatting
        '^& stands for the text that was found
        .Replacement.Text = "<" & sTag & ">^&</" & sTag & ">"
        .Forward = True
        .Execute Replace:=wdReplaceAll
    End With
End Sub

I'm combining methods here (I don't know what's most efficient yet).

I'm using: "CharacterStyles" sub to get those nasty character styles and you could get the list styles with them as well. (Ahem... or you could decide to do everything with "CharacterStyles", which perhaps would be best.)

I guess you'll just have to play with it!

Well, good luck and let me know if this helps! :whistle:

Bob Phillips
06-16-2005, 04:09 PM
I'm using: "CharacterStyles" sub to get those nasty character styles and you could get the list styles with them as well.

Interesting!

I had a look and it seems to me that all I needed to do was to replace the character styles using that little function; I didn't need to change anything else I had. I had a quick play with one style and there were a couple of interesting side-effects.

First, I want to embed the text in

<span class="functionName">text</span>

Unfortunately, it generated in all upper case, so I get

<SPAN CLASS="FUNCTIONNAME">text</SPAN>

which fails as my class is called functionName not FUNCTIONNAME. Is there any way I can get the case I enter?

Secondly, I didn't actually get a ", I got some unprintable character. Are Word quotes not ASC(34)?

Also, what's the Word equivalent of an Excel add-in?

Cheers

Bob

MOS MASTER
06-18-2005, 10:53 AM
Hi Bob, :yes

You're welcome and I thought as well that little function would be a help to you.

About the lcase to ucase problem you're having I have no clue cause I've not seen the code you're using. Please post it so I can look at it.

Also, Chr$(34) should produce a " in Word. But again, something in the code (or perhaps the keyboard settings, dunno) could cause a different result.

Word has add-ins just like XLA in Excel.
XLA is a special file and so is the *.dot file in Word.
*.dot files are templates, and if you put them in a special place, e.g. Word's startup folder or the Office startup folder, they will be loaded globally (just like an XL add-in).

You can attach them at runtime or use the folder placement method I described. You can also attach them in the interface via Tools > Templates and Add-Ins.
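
As a rough sketch of attaching one at runtime: the AddIns collection has an Add method that loads a template globally. The path and file name below are hypothetical, so point them at a real template before running this.

```vba
Sub LoadGlobalTemplate()
    ' Hypothetical path; replace with a real template file
    Const sPath As String = "C:\Templates\HtmlTools.dot"
    Dim oAddIn As Word.AddIn

    ' Adds the template to the Templates and Add-Ins list and loads it
    Set oAddIn = AddIns.Add(FileName:=sPath, Install:=True)
    Debug.Print oAddIn.Name & " installed: " & oAddIn.Installed

    ' Word's own startup folder, whose templates load automatically
    Debug.Print Options.DefaultFilePath(wdStartupPath)
End Sub
```

A template loaded this way stays listed in the dialog but has to be ticked again after Word restarts, whereas a file in the startup folder loads every time.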

Enjoy! :whistle:

Bob Phillips
06-18-2005, 01:03 PM
About the lcase to ucase problem you're having I have no clue cause I've not seen the code you're using. Please post it so I can look at it.

The code is basically what you gave me

With ActiveDocument.Range.Find
.ClearFormatting
.Style = ActiveDocument.Styles(sStyle)
.Text = ""
.MatchCase = True
.Replacement.ClearFormatting
.Replacement.Text = "<span class=""" & LCase(sStyle) & "">^&</span>"
.Forward = True
.Execute Replace:=wdReplaceAll
End With


I am looking to format character-style text such as LOOKUP to

<span class="functionName">LOOKUP</span>

but it comes out as

<SPAN CLASS=?FUNCTIONNAME?>LOOKUP</SPAN>

which is not much use.


Word has add-ins just like XLA in Excel.
XLA is a special file and so is the *.dot file in Word.
*.dot files are templates, and if you put them in a special place, e.g. Word's startup folder or the Office startup folder, they will be loaded globally (just like an XL add-in).

That is not how you install add-ins in Excel. You can, but you should install them properly (Tools>Addins). What you describe is like Personal.xls. Excel template files can be in a template directory, just like Word templates.

I was expecting word add-ins to be a file type of say wda, but didn't find this. Are you saying they are .dot files that you just store in the startup folder?

MOS MASTER
06-18-2005, 01:12 PM
Hi Bob, :yes

I will have a look at your other question when I've had a drink!

I didn't tell you how you should install XLA files, so I don't follow "That is not how you install add-ins in Excel".

I'm just referring to the fact that *.dot files in Word have the same capabilities that XLA files have, without going into too much detail!

Yes, a *.dot file in the startup folder (Office or Word) will load globally and thus work like an add-in.

FYI: you can also set a reference to a *.dot file in the VBE to make a connection to that file and use its functions.

Be back in a while... it's more than 30 degrees over here so I'm going to get a drink! :whistle:

MOS MASTER
06-18-2005, 01:31 PM
Hi Bob, :yes

This line is missing one double quote:
.Replacement.Text = "<span class=""" & LCase(sStyle) & "">^&</span>"

I think it should be:
.Replacement.Text = "<span class=""" & LCase(sStyle) & """>^&</span>"

I've tried it and it works OK over here; on the text files it produced lines like:
<span class="c_italic">quick brown</span>

So to be honest I've got no idea what is causing the upper-case conversion at your side... that's too strange... :whistle:
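
Two Word behaviours might be worth ruling out here, though neither is confirmed as the cause in this case. When Match case is off, Replace adapts the capitalisation of the replacement text to the text it found, and LOOKUP is all capitals. And when "straight quotes with smart quotes" is switched on under AutoFormat As You Type, Replace can turn the typed " into curly quotes, which show up as ? in a plain text file. A minimal sketch that guards against both (the Sub name and the sClass parameter are made up for this example):

```vba
Sub WrapCharacterStyle(ByVal sStyle As String, ByVal sClass As String)
    Dim bSmartQuotes As Boolean

    ' Remember the user's setting, then switch smart quotes off
    bSmartQuotes = Options.AutoFormatAsYouTypeReplaceQuotes
    Options.AutoFormatAsYouTypeReplaceQuotes = False

    With ActiveDocument.Range.Find
        .ClearFormatting
        .Style = ActiveDocument.Styles(sStyle)
        .Text = ""
        .Format = True      ' search on formatting (the style)
        .MatchCase = True   ' keep the replacement text's own case
        .Replacement.ClearFormatting
        .Replacement.Text = "<span class=""" & sClass & """>^&</span>"
        .Forward = True
        .Execute Replace:=wdReplaceAll
    End With

    ' Put the user's setting back
    Options.AutoFormatAsYouTypeReplaceQuotes = bSmartQuotes
End Sub
```

It would be called as WrapCharacterStyle "FunctionName", "functionName", passing the class name separately so its case does not depend on LCase of the style name.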

Bob Phillips
06-18-2005, 02:36 PM
Shall we continue this directly, no-one else seems interested?

MOS MASTER
06-18-2005, 02:38 PM
Sure...How shall we do this? :yes

fumei
06-20-2005, 10:49 AM
Gentlemen, can I play too? I am interested in what comes out of this.

MOS MASTER
06-21-2005, 11:45 AM
Well, I don't mind... how's your throwing arm? :rofl:

fumei
06-22-2005, 08:07 AM
I am more of a catcher...see my avatar photo.

MOS MASTER
06-22-2005, 10:24 AM
Ah Ok sowie....

Now go and fetch! :rotflmao: