View Full Version : [SOLVED:] Extract only the title of an article (first paragraph) from a Word file using VBA
ovisele
03-18-2021, 07:39 AM
I would appreciate your guidance in the following problem. I need to bulk extract only the articles titles from a series of publications. The idea is that I receive the files in PDF, I extract only the first page (done), and I am stuck in the last phase.
The structure of the Word  is as follows (please see two files attached):
--- JOURNAL of MEDICINE and LIFE
JML | REVIEW
The role of novel poly (ADP-ribose) inhibitors in the treatment of locally advanced and metastatic Her-2/neu negative breast cancer with inherited germline BRCA1/2 mutations. A review of the literature
Authors list, etc, etc ---
I assume that this is the first paragraph, so I tried a workaround, but without result. 
Only the title (in bold) is needed, from each file, pasted in a different file. I can do the iteration to run the code for multiple files, that is not a problem. 
I found a similar code, but it copies every sentence from each paraghraph:
Sub test()
Dim doc As Document
Dim p As Paragraph
Dim s As String
Set doc = ActiveDocument
For Each p In doc.Paragraphs
    Debug.Print p.Range.Sentences(1)
Next
End Sub
Many thanks in advance!
macropod
03-18-2021, 02:11 PM
The structure of the Word  is as follows (please see two files attached):
--- JOURNAL of MEDICINE and LIFE
JML | REVIEW
The role of novel poly (ADP-ribose) inhibitors in the treatment of locally advanced and metastatic Her-2/neu negative breast cancer with inherited germline BRCA1/2 mutations. A review of the literature
Authors list, etc, etc ---
I assume that this is the first paragraph
An unfounded assumption, as displaying Word's formatting marks would immediately show.
Your files have quite different structure: one has 'JOURNAL of MEDICINE and LIFE' in the document body; the other has it in the header. Moreover, some of your titles span more than one paragraph. Consequently, you can't work from a paragraph # (4 in one document, 7 & 8 in the other). Instead, you need to use Find to locate the text.
Since your two examples have the titles in the Normal Style and a bold font, try:
Sub Demo()Application.ScreenUpdating = False
Dim Rng As Range
With ActiveDocument.Range
  With .Find
    .ClearFormatting
    .Replacement.ClearFormatting
    .Text = ""
    .Replacement.Text = ""
    .Font.Bold = True
    .Style = wdStyleNormal
    .Forward = True
    .Wrap = wdFindStop
    .Format = True
    .MatchWildcards = True
  End With
  Do While .Find.Execute
    If Rng Is Nothing Then
      Set Rng = .Duplicate
    Else
      If Rng.End + 1 = .Start Then
        Rng.End = .End
      Else
        Exit Do
      End If
    End If
  Loop
  MsgBox Rng.Text
End With
Application.ScreenUpdating = True
End Sub
gmayor
03-18-2021, 09:56 PM
As an alternative, both documents have the title in Section 1 and that title is in 16 point TNR font. If that is true for all the documents you wish to process, you could use the following function to grab the paragraphs in question. 
Function GetTitle(oDoc As Document) As String
'Graham Mayor - https://www.gmayor.com - Last updated - 19 Mar 2021 
Dim oRng As Range
Dim sTitle As String
Dim i As Integer
    sTitle = ""
    For i = 1 To oDoc.Sections(1).Range.Paragraphs.Count
        Set oRng = oDoc.Sections(1).Range.Paragraphs(i).Range
        If oRng.Font.Size = 16 Then
            sTitle = sTitle & oRng.Text
        End If
    Next i
    GetTitle = sTitle
    Set oRng = Nothing
End Function
You can then loop through the documents and process the titles however you wish e.g.
Sub BatchTitles()
'Graham Mayor - https://www.gmayor.com - Last updated - 19 Mar 2021 
Dim strFile As String
Dim strPath As String
Dim oDoc As Document, oNewDoc As Document
Dim fDialog As FileDialog
    Set fDialog = Application.FileDialog(msoFileDialogFolderPicker)
    With fDialog
        .Title = "Select folder and click OK"
        .AllowMultiSelect = False
        .InitialView = msoFileDialogViewList
        If .Show <> -1 Then
            MsgBox "Cancelled By User", , "List Folder Contents"
            Exit Sub
        End If
        strPath = fDialog.SelectedItems.Item(1)
        Do Until Right(strPath, 1) = Chr(92)
            strPath = strPath & Chr(92)
        Loop
    End With
    strFile = Dir$(strPath & "*.docx")
    Set oNewDoc = Documents.Add
    While strFile <> ""
        Set oDoc = Documents.Open(strPath & strFile)
        oNewDoc.Range.InsertAfter GetTitle(oDoc) & vbCr
        oDoc.Close SaveChanges:=wdDoNotSaveChanges
        strFile = Dir$()
    Wend
lbl_Exit:
    Set oDoc = Nothing
    Set oNewDoc = Nothing
    Set fDialog = Nothing
    Exit Sub
End Sub
macropod
03-18-2021, 11:58 PM
As an alternative, both documents have the title in Section 1 and that title is in 16 point TNR font. If that is true for all the documents you wish to process, you could use the following function to grab the paragraphs in question.
Always a problem with documents converted from PDFs generated externally to know which formatting will reliably represent all of them...
gmayor
03-19-2021, 12:52 AM
Always a problem with documents converted from PDFs generated externally to know which formatting will reliably represent all of them...Quite, but you can only make suggestions based on what you are presented with.
ovisele
03-19-2021, 01:08 AM
Always a problem with documents converted from PDFs generated externally to know which formatting will reliably represent all of them...
Yes, that is absolutely right. Will test the above options and get back with a feedback. Thank you so very much for all your answers.
ovisele
03-19-2021, 01:41 AM
Thank you so very much! Sub BatchTitles() works like a charm. I get as a result a Word file that has all the title. If your time allows it, just very briefly, can you please let me know what is the logic behind it. I don't quite grasp the logic and I assume that the key is in these lines - Do Until Right(strPath, 1) = Chr(92), strPath = strPath & Chr(92). Only if you have the time. Many thanks again!
gmayor
03-19-2021, 05:40 AM
That simply ensures that the path is correctly terminated. The batch loop is in fact
    While strFile <> ""
        Set oDoc = Documents.Open(strPath & strFile)
        oNewDoc.Range.InsertAfter GetTitle(oDoc) & vbCr
        oDoc.Close SaveChanges:=wdDoNotSaveChanges
        strFile = Dir$()
    Wend
ovisele
03-19-2021, 06:06 AM
That simply ensures that the path is correctly terminated. The batch loop is in fact
    While strFile <> ""
        Set oDoc = Documents.Open(strPath & strFile)
        oNewDoc.Range.InsertAfter GetTitle(oDoc) & vbCr
        oDoc.Close SaveChanges:=wdDoNotSaveChanges
        strFile = Dir$()
    Wend
Aaa...Ok! Thanks again.
Powered by vBulletin® Version 4.2.5 Copyright © 2025 vBulletin Solutions Inc. All rights reserved.