Consulting

Results 1 to 9 of 9

Thread: Extract only the title of an article (first paragraph) from a Word file using VBA

  1. #1
    VBAX Newbie
    Joined
    Mar 2021
    Posts
    4
    Location

    Extract only the title of an article (first paragraph) from a Word file using VBA

    I would appreciate your guidance in the following problem. I need to bulk extract only the articles titles from a series of publications. The idea is that I receive the files in PDF, I extract only the first page (done), and I am stuck in the last phase.


    The structure of the Word is as follows (please see two files attached):


    --- JOURNAL of MEDICINE and LIFE
    JML | REVIEW
    The role of novel poly (ADP-ribose) inhibitors in the treatment of locally advanced and metastatic Her-2/neu negative breast cancer with inherited germline BRCA1/2 mutations. A review of the literature
    Authors list, etc, etc ---


    I assume that this is the first paragraph, so I tried a workaround, but without result.

    Only the title (in bold) is needed, from each file, pasted in a different file. I can do the iteration to run the code for multiple files, that is not a problem.


    I found a similar code, but it copies every sentence from each paraghraph:


    Sub test()
    Dim doc As Document
    Dim p As Paragraph
    Dim s As String
    
    
    Set doc = ActiveDocument
    
    
    For Each p In doc.Paragraphs
        Debug.Print p.Range.Sentences(1)
    
    
    Next
    
    
    End Sub
    Many thanks in advance!
    Attached Files Attached Files

  2. #2
    Knowledge Base Approver VBAX Guru macropod's Avatar
    Joined
    Jul 2008
    Posts
    4,435
    Location
    Quote Originally Posted by ovisele View Post
    The structure of the Word is as follows (please see two files attached):


    --- JOURNAL of MEDICINE and LIFE
    JML | REVIEW
    The role of novel poly (ADP-ribose) inhibitors in the treatment of locally advanced and metastatic Her-2/neu negative breast cancer with inherited germline BRCA1/2 mutations. A review of the literature
    Authors list, etc, etc ---


    I assume that this is the first paragraph
    An unfounded assumption, as displaying Word's formatting marks would immediately show.

    Your files have quite different structure: one has 'JOURNAL of MEDICINE and LIFE' in the document body; the other has it in the header. Moreover, some of your titles span more than one paragraph. Consequently, you can't work from a paragraph # (4 in one document, 7 & 8 in the other). Instead, you need to use Find to locate the text.

    Since your two examples have the titles in the Normal Style and a bold font, try:

    Sub Demo()Application.ScreenUpdating = False
    Dim Rng As Range
    With ActiveDocument.Range
      With .Find
        .ClearFormatting
        .Replacement.ClearFormatting
        .Text = ""
        .Replacement.Text = ""
        .Font.Bold = True
        .Style = wdStyleNormal
        .Forward = True
        .Wrap = wdFindStop
        .Format = True
        .MatchWildcards = True
      End With
      Do While .Find.Execute
        If Rng Is Nothing Then
          Set Rng = .Duplicate
        Else
          If Rng.End + 1 = .Start Then
            Rng.End = .End
          Else
            Exit Do
          End If
        End If
      Loop
      MsgBox Rng.Text
    End With
    Application.ScreenUpdating = True
    End Sub
    Last edited by macropod; 03-18-2021 at 02:26 PM.
    Cheers
    Paul Edstein
    [Fmr MS MVP - Word]

  3. #3
    As an alternative, both documents have the title in Section 1 and that title is in 16 point TNR font. If that is true for all the documents you wish to process, you could use the following function to grab the paragraphs in question.

    Function GetTitle(oDoc As Document) As String
    'Graham Mayor - https://www.gmayor.com - Last updated - 19 Mar 2021 
    Dim oRng As Range
    Dim sTitle As String
    Dim i As Integer
        sTitle = ""
        For i = 1 To oDoc.Sections(1).Range.Paragraphs.Count
            Set oRng = oDoc.Sections(1).Range.Paragraphs(i).Range
            If oRng.Font.Size = 16 Then
                sTitle = sTitle & oRng.Text
            End If
        Next i
        GetTitle = sTitle
        Set oRng = Nothing
    End Function
    You can then loop through the documents and process the titles however you wish e.g.

    Sub BatchTitles()
    'Graham Mayor - https://www.gmayor.com - Last updated - 19 Mar 2021 
    Dim strFile As String
    Dim strPath As String
    Dim oDoc As Document, oNewDoc As Document
    Dim fDialog As FileDialog
        Set fDialog = Application.FileDialog(msoFileDialogFolderPicker)
        With fDialog
            .Title = "Select folder and click OK"
            .AllowMultiSelect = False
            .InitialView = msoFileDialogViewList
            If .Show <> -1 Then
                MsgBox "Cancelled By User", , "List Folder Contents"
                Exit Sub
            End If
            strPath = fDialog.SelectedItems.Item(1)
            Do Until Right(strPath, 1) = Chr(92)
                strPath = strPath & Chr(92)
            Loop
        End With
        strFile = Dir$(strPath & "*.docx")
        Set oNewDoc = Documents.Add
        While strFile <> ""
            Set oDoc = Documents.Open(strPath & strFile)
            oNewDoc.Range.InsertAfter GetTitle(oDoc) & vbCr
            oDoc.Close SaveChanges:=wdDoNotSaveChanges
            strFile = Dir$()
        Wend
    lbl_Exit:
        Set oDoc = Nothing
        Set oNewDoc = Nothing
        Set fDialog = Nothing
        Exit Sub
    End Sub
    Graham Mayor - MS MVP (Word) 2002-2019
    Visit my web site for more programming tips and ready made processes
    http://www.gmayor.com

  4. #4
    Knowledge Base Approver VBAX Guru macropod's Avatar
    Joined
    Jul 2008
    Posts
    4,435
    Location
    Quote Originally Posted by gmayor View Post
    As an alternative, both documents have the title in Section 1 and that title is in 16 point TNR font. If that is true for all the documents you wish to process, you could use the following function to grab the paragraphs in question.
    Always a problem with documents converted from PDFs generated externally to know which formatting will reliably represent all of them...
    Cheers
    Paul Edstein
    [Fmr MS MVP - Word]

  5. #5
    Quote Originally Posted by macropod View Post
    Always a problem with documents converted from PDFs generated externally to know which formatting will reliably represent all of them...
    Quite, but you can only make suggestions based on what you are presented with.
    Graham Mayor - MS MVP (Word) 2002-2019
    Visit my web site for more programming tips and ready made processes
    http://www.gmayor.com

  6. #6
    VBAX Newbie
    Joined
    Mar 2021
    Posts
    4
    Location
    Quote Originally Posted by macropod View Post
    Always a problem with documents converted from PDFs generated externally to know which formatting will reliably represent all of them...
    Yes, that is absolutely right. Will test the above options and get back with a feedback. Thank you so very much for all your answers.

  7. #7
    VBAX Newbie
    Joined
    Mar 2021
    Posts
    4
    Location
    Thank you so very much! Sub BatchTitles() works like a charm. I get as a result a Word file that has all the title. If your time allows it, just very briefly, can you please let me know what is the logic behind it. I don't quite grasp the logic and I assume that the key is in these lines - Do Until Right(strPath, 1) = Chr(92), strPath = strPath & Chr(92). Only if you have the time. Many thanks again!

  8. #8
    That simply ensures that the path is correctly terminated. The batch loop is in fact

        While strFile <> ""
            Set oDoc = Documents.Open(strPath & strFile)
            oNewDoc.Range.InsertAfter GetTitle(oDoc) & vbCr
            oDoc.Close SaveChanges:=wdDoNotSaveChanges
            strFile = Dir$()
        Wend
    Graham Mayor - MS MVP (Word) 2002-2019
    Visit my web site for more programming tips and ready made processes
    http://www.gmayor.com

  9. #9
    VBAX Newbie
    Joined
    Mar 2021
    Posts
    4
    Location
    Quote Originally Posted by gmayor View Post
    That simply ensures that the path is correctly terminated. The batch loop is in fact

        While strFile <> ""
            Set oDoc = Documents.Open(strPath & strFile)
            oNewDoc.Range.InsertAfter GetTitle(oDoc) & vbCr
            oDoc.Close SaveChanges:=wdDoNotSaveChanges
            strFile = Dir$()
        Wend
    Aaa...Ok! Thanks again.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •