PDA

View Full Version : [SOLVED:] Extract only the title of an article (first paragraph) from a Word file using VBA



ovisele
03-18-2021, 07:39 AM
I would appreciate your guidance in the following problem. I need to bulk extract only the articles titles from a series of publications. The idea is that I receive the files in PDF, I extract only the first page (done), and I am stuck in the last phase.


The structure of the Word is as follows (please see two files attached):


--- JOURNAL of MEDICINE and LIFE
JML | REVIEW
The role of novel poly (ADP-ribose) inhibitors in the treatment of locally advanced and metastatic Her-2/neu negative breast cancer with inherited germline BRCA1/2 mutations. A review of the literature
Authors list, etc, etc ---


I assume that this is the first paragraph, so I tried a workaround, but without result.

Only the title (in bold) is needed, from each file, pasted in a different file. I can do the iteration to run the code for multiple files, that is not a problem.


I found a similar code, but it copies every sentence from each paraghraph:



Sub test()
Dim doc As Document
Dim p As Paragraph
Dim s As String


Set doc = ActiveDocument


For Each p In doc.Paragraphs
Debug.Print p.Range.Sentences(1)


Next


End Sub


Many thanks in advance!

macropod
03-18-2021, 02:11 PM
The structure of the Word is as follows (please see two files attached):


--- JOURNAL of MEDICINE and LIFE
JML | REVIEW
The role of novel poly (ADP-ribose) inhibitors in the treatment of locally advanced and metastatic Her-2/neu negative breast cancer with inherited germline BRCA1/2 mutations. A review of the literature
Authors list, etc, etc ---


I assume that this is the first paragraph
An unfounded assumption, as displaying Word's formatting marks would immediately show.

Your files have quite different structure: one has 'JOURNAL of MEDICINE and LIFE' in the document body; the other has it in the header. Moreover, some of your titles span more than one paragraph. Consequently, you can't work from a paragraph # (4 in one document, 7 & 8 in the other). Instead, you need to use Find to locate the text.

Since your two examples have the titles in the Normal Style and a bold font, try:


Sub Demo()Application.ScreenUpdating = False
Dim Rng As Range
With ActiveDocument.Range
With .Find
.ClearFormatting
.Replacement.ClearFormatting
.Text = ""
.Replacement.Text = ""
.Font.Bold = True
.Style = wdStyleNormal
.Forward = True
.Wrap = wdFindStop
.Format = True
.MatchWildcards = True
End With
Do While .Find.Execute
If Rng Is Nothing Then
Set Rng = .Duplicate
Else
If Rng.End + 1 = .Start Then
Rng.End = .End
Else
Exit Do
End If
End If
Loop
MsgBox Rng.Text
End With
Application.ScreenUpdating = True
End Sub

gmayor
03-18-2021, 09:56 PM
As an alternative, both documents have the title in Section 1 and that title is in 16 point TNR font. If that is true for all the documents you wish to process, you could use the following function to grab the paragraphs in question.


Function GetTitle(oDoc As Document) As String
'Graham Mayor - https://www.gmayor.com - Last updated - 19 Mar 2021
Dim oRng As Range
Dim sTitle As String
Dim i As Integer
sTitle = ""
For i = 1 To oDoc.Sections(1).Range.Paragraphs.Count
Set oRng = oDoc.Sections(1).Range.Paragraphs(i).Range
If oRng.Font.Size = 16 Then
sTitle = sTitle & oRng.Text
End If
Next i
GetTitle = sTitle
Set oRng = Nothing
End Function

You can then loop through the documents and process the titles however you wish e.g.


Sub BatchTitles()
'Graham Mayor - https://www.gmayor.com - Last updated - 19 Mar 2021
Dim strFile As String
Dim strPath As String
Dim oDoc As Document, oNewDoc As Document
Dim fDialog As FileDialog
Set fDialog = Application.FileDialog(msoFileDialogFolderPicker)
With fDialog
.Title = "Select folder and click OK"
.AllowMultiSelect = False
.InitialView = msoFileDialogViewList
If .Show <> -1 Then
MsgBox "Cancelled By User", , "List Folder Contents"
Exit Sub
End If
strPath = fDialog.SelectedItems.Item(1)
Do Until Right(strPath, 1) = Chr(92)
strPath = strPath & Chr(92)
Loop
End With
strFile = Dir$(strPath & "*.docx")
Set oNewDoc = Documents.Add
While strFile <> ""
Set oDoc = Documents.Open(strPath & strFile)
oNewDoc.Range.InsertAfter GetTitle(oDoc) & vbCr
oDoc.Close SaveChanges:=wdDoNotSaveChanges
strFile = Dir$()
Wend
lbl_Exit:
Set oDoc = Nothing
Set oNewDoc = Nothing
Set fDialog = Nothing
Exit Sub
End Sub

macropod
03-18-2021, 11:58 PM
As an alternative, both documents have the title in Section 1 and that title is in 16 point TNR font. If that is true for all the documents you wish to process, you could use the following function to grab the paragraphs in question.
Always a problem with documents converted from PDFs generated externally to know which formatting will reliably represent all of them...

gmayor
03-19-2021, 12:52 AM
Always a problem with documents converted from PDFs generated externally to know which formatting will reliably represent all of them...Quite, but you can only make suggestions based on what you are presented with.

ovisele
03-19-2021, 01:08 AM
Always a problem with documents converted from PDFs generated externally to know which formatting will reliably represent all of them...
Yes, that is absolutely right. Will test the above options and get back with a feedback. Thank you so very much for all your answers.

ovisele
03-19-2021, 01:41 AM
Thank you so very much! Sub BatchTitles() works like a charm. I get as a result a Word file that has all the title. If your time allows it, just very briefly, can you please let me know what is the logic behind it. I don't quite grasp the logic and I assume that the key is in these lines - Do Until Right(strPath, 1) = Chr(92), strPath = strPath & Chr(92). Only if you have the time. Many thanks again!

gmayor
03-19-2021, 05:40 AM
That simply ensures that the path is correctly terminated. The batch loop is in fact


While strFile <> ""
Set oDoc = Documents.Open(strPath & strFile)
oNewDoc.Range.InsertAfter GetTitle(oDoc) & vbCr
oDoc.Close SaveChanges:=wdDoNotSaveChanges
strFile = Dir$()
Wend

ovisele
03-19-2021, 06:06 AM
That simply ensures that the path is correctly terminated. The batch loop is in fact


While strFile <> ""
Set oDoc = Documents.Open(strPath & strFile)
oNewDoc.Range.InsertAfter GetTitle(oDoc) & vbCr
oDoc.Close SaveChanges:=wdDoNotSaveChanges
strFile = Dir$()
Wend


Aaa...Ok! Thanks again.