PDA

View Full Version : Extract text from multiple text files



aries_heng
07-03-2010, 01:56 AM
Hi

I'm completely new to VBA and would also like to know how to extract specific data from thousands of text files. I dun really quite understand shrivallabha's coding (http://www.vbaexpress.com/forum/showpost.php?p=215635&postcount=4). hope someone can enlighten me =)

mdmackillop
07-03-2010, 05:26 AM
Hi

I'm completely new to VBA and would also like to know how to extract specific data from thousands of text files. I dun really quite understand shrivallabha's coding. hope someone can enlighten me =)
Welcome to VBAX
You need to be much more specific.
Extract what? All or part or something specific?
Do what with it, and where?

aries_heng
07-10-2010, 02:56 AM
hi i need to extract data such as part number, serial number and dates into rows into an excel. i had modified shriballabha's coding to extract the data but some of my files have 5 dates in it but i only need the first and the 4th date. (able to extract all 5 dates would also be enough for me)

this is part of my coding:

Sub read_text()

Set wb = Workbooks.Add
workingflnm = ActiveWorkbook.Name
i = 1 'First row in Active Sheet
Set fd = CreateObject("Scripting.Filesystemobject")
pthnm = "C:\Documents and Settings\Heng\Desktop\Test Trial\avi" 'Please change to your desired folder
Set fs = fd.GetFolder(pthnm)
For Each fl In fs.Files

Set Txtobj = CreateObject("Scripting.filesystemobject")
Set Txtfl = Txtobj.getfile(fl)
Set Txtstrm = Txtfl.openastextstream(1, -2)
countDate = 1
Do While Txtstrm.AtEndOfStream <> True
rdln = Txtstrm.readline



If InStr(1, rdln, " DATE ", vbTextCompare) > 1 Then
x1 = InStr(1, " Date : ", vbTextCompare)
strg = Mid(rdln, x1 + Len(" "))
If countDate = 1 Then
Workbooks(workingflnm).Sheets(1).Cells(i - 1, 2) = strg
countDate = 2
Else
Workbooks(workingflnm).Sheets(1).Cells(i - 1, 3) = strg
End If
End If

Is there anyway to extract all 5 dates or maybe just the first and the 4th? either way will be good. Thanks!

GTO
07-10-2010, 08:03 AM
Please zip the text file and a small workbook showing how the data would end up in the worksheet. If the info is actually detailed enough to be 'sensitive', create a textfile layed out exactly the same, but w/fake data. I think this makes it much easier for us to help you :-)

Mark

Edit: Also (sorry, slow thinker on this end), if the sample file is made-up, please indicate if the real one(s) are a few thousand lines, or the lights dim when they load.

aries_heng
07-10-2010, 04:25 PM
hi

attached is the sample file, i have thousands of those kinds of files.there are five date and times in each of those files.

i would need to extract those information into excel (in rows) in this format:

File Name Start Date Start Time End Date End Time Serial Number P/N

start date/time is the first date/time in the file and end date/time would be the 4th date/time in the file. Extracting all 5 date/time would also be good enough. Thanks

GTO
07-10-2010, 06:30 PM
Might we be able to count on the headings in the different areas? Such as:


S T A R T O F C U R R E N T S E L E C T I O N
E N D O F C U R R E N T S E L E C T I O N

That is, are the headings something we can count on, or do they change between files?

Reference your sample file, am I correct that the dates in these areas are the ones you want?

Edit: Sorry, I am now counting six. Which two times would we hope to be extracting?

Also, how many lines (esitmate high) are there in the real files? I am asking, as if they are smaller, we may load all the text into a variable at once. But if they are larger, this may not be a slick idea...


fsedh DATED : FEB 04/08

Under OPERATOR LOG
DATE : APR 28/2010 TIME : 16:16:00

Under START OF CURRRENT SELECTION
DATE : APR 28/2010 TIME : 16:16:29

Under END OF CURRENT SELECTION
DATE : APR 28/2010 TIME : 17:26:10

Under h fjhiojewr ejwio
DATE : APR 28/2010 TIME : 17:28:51

...and in the last couple of lines...

>>> dfhushed >>>> fdhio dhfio hiofhoh f >>>> DATE : Wed Apr 28 16:15:26 2010

aries_heng
07-10-2010, 07:56 PM
Hi

the headings are the same throughout the thousand files. how do we depend on the heading?

this results is great! one file has about 700 lines.

i only need 2 date and times from 2 headings : operator log and h fjhiojewr ejwio (fake title)

i need the data to be extracted into rows of the excel file. Thanks alot!

aries_heng
07-11-2010, 12:58 AM
is macro able to pick out the headers and then do a search on that header's text for the date/time that i wanted?


or is extracting out all date/time makes things easier?

mdmackillop
07-11-2010, 01:32 AM
As a first step try this.

Sub GetData()
Dim fname As String
Sheets(1).Activate
fname = Dir("C:\AAA\*.BR") '<=== Change to suit
Do Until fname = ""
Call Process(fname)
fname = Dir
Loop
Sheets(2).Activate
End Sub

Sub Process(fname As String)

Dim rng As Range
Cells.ClearContents
With ActiveSheet.QueryTables.Add(Connection:="TEXT;C:\aaa\SAMPLE.BR", _
Destination:=Range("A1"))
.Name = fname
.FieldNames = True
.RowNumbers = False
.FillAdjacentFormulas = False
.PreserveFormatting = True
.RefreshOnFileOpen = False
.RefreshStyle = xlInsertDeleteCells
.SavePassword = False
.SaveData = True
.AdjustColumnWidth = True
.RefreshPeriod = 0
.TextFilePromptOnRefresh = False
.TextFilePlatform = xlWindows
.TextFileStartRow = 1
.TextFileParseType = xlDelimited
.TextFileTextQualifier = xlTextQualifierDoubleQuote
.TextFileConsecutiveDelimiter = False
.TextFileTabDelimiter = False
.TextFileSemicolonDelimiter = False
.TextFileCommaDelimiter = False
.TextFileSpaceDelimiter = False
.TextFileColumnDataTypes = Array(1)
.Refresh BackgroundQuery:=False
End With

Set rng = Range(Cells(1, 1), Cells(Rows.Count, 1).End(xlUp)).Offset(, 1)
With rng
.FormulaR1C1 = "=IF(SEARCH(""Date "",RC[-1]),RC[-1],"""")"
.SpecialCells(xlCellTypeFormulas, 16).Delete xlUp
End With

For Each cel In rng
cel.Value = Trim(cel.Value)
Next

Set tgt = Sheets(2).Cells(Rows.Count, 2).End(xlUp)(2)
tgt.Offset(, -1) = fname
rng.Copy tgt

End Sub

aries_heng
07-11-2010, 01:49 AM
i got the dates out but the dates are in column B, i need them in rows instead. This can only work for one .br file?

thanks!

aries_heng
07-11-2010, 01:52 AM
i have this macro with me now, is it possible to edit this macro to get all/specific date/time out?



Sub read_text()

Set wb = Workbooks.Add
workingflnm = ActiveWorkbook.Name
i = 1 'First row in Active Sheet
Set fd = CreateObject("Scripting.Filesystemobject")
pthnm = "C:\Documents and Settings\Heng\Desktop\Test Trial\avi" 'Please change to your desired folder
Set fs = fd.GetFolder(pthnm)
For Each fl In fs.Files
countDate = 1
Set Txtobj = CreateObject("Scripting.filesystemobject")
Set Txtfl = Txtobj.getfile(fl)
Set Txtstrm = Txtfl.openastextstream(1, -2)
Do Until Txtstrm.AtEndOfStream '<> True
rdln = Txtstrm.readline



If InStr(1, rdln, " DATE : ", vbTextCompare) > 1 Then
x1 = InStr(1, " Date : ", vbTextCompare)
strg = Mid(rdln, x1 + Len(" "))
If countDate = 1 Then
Workbooks(workingflnm).Sheets(1).Cells(i - 1, 2) = strg
countDate = 2
Else
Workbooks(workingflnm).Sheets(1).Cells(i - 1, 3) = strg
End If
End If


If InStr(2, rdln, "Serial-Number", vbTextCompare) > 1 Then
x2 = InStr(2, "Serial-Number", vbTextCompare)
strg = Mid(rdln, x2 + Len(" "))
Workbooks(workingflnm).Sheets(1).Cells(i, 4) = strg
End If

If InStr(3, rdln, "UUT PART-NUMBER :", vbTextCompare) > 1 Then
x3 = InStr(3, "UUT PART-NUMBER :", vbTextCompare)
Workbooks(workingflnm).Sheets(1).Cells(i, 1) = fl.Name
strg = Mid(rdln, x3 + Len(" "))
Workbooks(workingflnm).Sheets(1).Cells(i, 5) = strg
End If

If InStr(4, rdln, "Test Justification", vbTextCompare) > 1 Then
x4 = InStr(4, "Test Justification", vbTextCompare)
strg = Mid(rdln, x4 + Len(" "))
Workbooks(workingflnm).Sheets(1).Cells(i, 6) = strg
End If

If InStr(5, rdln, "OPERATOR :", vbTextCompare) > 1 Then
x5 = InStr(5, "OPERATOR :", vbTextCompare)
strg = Mid(rdln, x5 + Len(" "))
Workbooks(workingflnm).Sheets(1).Cells(i, 7) = strg
End If

If InStr(6, rdln, "ATE NAME", vbTextCompare) > 1 Then
x6 = InStr(6, "ATE NAME", vbTextCompare)
strg = Mid(rdln, x6 + Len(" "))
Workbooks(workingflnm).Sheets(1).Cells(i, 8) = strg
i = i + 1
End If

Loop


Next
End Sub

mdmackillop
07-11-2010, 01:56 AM
Can you post a workbook showing desired end result?

mdmackillop
07-11-2010, 01:57 AM
it says object-defined or application-defined error.

=(

What is highlighted?

aries_heng
07-11-2010, 02:10 AM
Hi

attached is how i would need the result to look like... currently my macro can only pick out the first and the last date/time.

it would be good if all the date/time can be showed so that i can record a macro to delete away the useless ones.


it's really urgent and thanks alot for the quick reply! =)

GTO
07-11-2010, 08:37 AM
'"UUT PART-NUMBER :" < Is this supposed to be 'dst PART-NUMBER'?

aries_heng
07-11-2010, 03:59 PM
yup it is

aries_heng
07-11-2010, 04:52 PM
hi GTO

how do we make use of the header to extract out the dates/times?

GTO
07-11-2010, 04:53 PM
Greetings,

I was having a bit of a time following differences in your code vs the sample textfile. Nothing big, but the differences like at #15 can effect the pattern. Here is my best guess using the patterns shown in the sample textfile.

Importing the file as Malcom showed may be faster, but with only 700+- rows, I thought to try and read through it. In short, as we read the lines, as we find one thing, we change to looking for the next (.Pattern change).

In a Standard Module:


Option Explicit

Dim REX As Object

Sub StripDataFromTextFile()
Dim _
FSO As Object, _
fsoTxtSource As Object, _
fsoFolder As Object, _
fsoFiles As Object, _
fsoFile As Object, _
wks As Worksheet, _
FList As Variant, _
TempString As Variant, _
vSubMatch As Variant, _
aryOutput As Variant, _
bolChooseFiles As Boolean, _
strGetOpenTitle As String, _
i As Long, _
lPattern As Long

'Patterns for:
'O P E R A T O R L O G
'PART-NUMBER : 156464189
' SERIAL-NUMBER : 18406156
'OPERATOR :
' TEST JUSTIFICATION : FINAL TEST
' HFH NAME : dbfjke
'DATE : APR 28/2010 TIME : 16:16:00
'h fjhiojewr ejwio *
Const PATTERN_START_HEADER As String = _
"(O\ {0,3}P\ {0,3}E\ {0,3}R\ {0,3}A\ {0,3}T\ {0,3}O\ {0,3}R\ +L\ {0,3}O\ {0,3}G)(\ *)"
Const PATTERN_PART_NO As String = "(PART-NUMBER\s+\:\s+)([0-9]+)"
Const PATTERN_SER_NO As String = "(SERIAL-NUMBER\s+\:\s+)([0-9]+)(\s*)"
Const PATTERN_OPER As String = "(OPERATOR\s+\:\s+)([A-Za-z]+)(\s*)"
Const PATTERN_JUST As String = "(TEST\ {1,3}JUSTIFICATION\s*\:\s+)([A-Za-z\ ]+)"
Const PATTERN_NAME As String = "(NAME\s+\:\s+)([A-Za-zz\ ]+)"
Const PATTERN_DATE As String = _
"(DATE\s{0,3}\:\s{0,3})([A-Z]{3,9})(\ {0,3})([0-9]{1,2})(\/)([0-9]{2,4})" & _
"(\ +TIME\ {0,3}\:\ {0,3})([0-9]{1,2}\:[0-9]{1,2}\:[0-9]{1,2})"
Const PATTERN_END_HEADER As String = "(h\ +fjhiojewr\ +ejwio\s+\*)(\s*)"
'Const MSG_RET_FOL As String = "Pick ANY file in the Folder you want"
'Const MSG_RET_FIL As String = "Pick the file(s) to rip data from"

Select Case MsgBox("Would you like to rip data from all text files in a folder?" & _
vbCrLf & vbCrLf & "Click on <Yes> to run all files." & _
vbCrLf & "Click <No> to pick file(s) to run." & vbCrLf & _
"Click <Cancel> to cancel.", _
vbYesNoCancel + vbDefaultButton1 + vbQuestion, _
"Operations Choice")

Case vbYes: strGetOpenTitle = "Pick ANY file in the Folder you want"
bolChooseFiles = False
Case vbNo: strGetOpenTitle = "Pick the file(s) to rip data from"
bolChooseFiles = True
Case vbCancel: Exit Sub
End Select

FList = Application.GetOpenFilename(FileFilter:="Text Files (*.txt; *.BR), *.txt; *.BR", _
Title:=strGetOpenTitle, _
MultiSelect:=bolChooseFiles)

If bolChooseFiles Then
If Not IsArray(FList) Then Exit Sub
Else
If CStr(FList) = "False" Then Exit Sub
FList = Left$(FList, InStrRev(FList, "\"))
End If

Set FSO = CreateObject("Scripting.FileSystemObject")
Set REX = CreateObject("VBScript.RegExp")
With REX
.Global = False
.MultiLine = False
End With

If Not IsArray(FList) Then
With FSO
TempString = FList
ReDim FList(0 To 0)
Set fsoFolder = .GetFolder(TempString)
Set fsoFiles = fsoFolder.Files

For Each fsoFile In fsoFiles
TempString = UCase(Mid(fsoFile.Name, InStrRev(fsoFile.Name, ".") + 1))
If TempString = "BR" Or TempString = "TXT" Then
ReDim Preserve FList(1 To UBound(FList) + 1)
FList(UBound(FList)) = fsoFile.Path
End If
Next
End With
End If

ReDim aryOutput(1 To UBound(FList), 1 To 10)
For i = 1 To UBound(FList, 1)
aryOutput(i, 1) = Mid(FList(i), InStrRev(FList(i), "\") + 1)
Next

For i = LBound(FList) To UBound(FList)

Set fsoTxtSource = FSO.OpenTextFile(FileName:=FList(i), IOMode:=1, Format:=-2)
lPattern = 0

With fsoTxtSource
Do While Not .AtEndOfStream And lPattern < 9

lPattern = lPattern + 1

REX.Pattern = Choose(lPattern, PATTERN_START_HEADER, PATTERN_PART_NO, _
PATTERN_SER_NO, PATTERN_OPER, PATTERN_JUST, _
PATTERN_NAME, PATTERN_DATE, PATTERN_END_HEADER, _
PATTERN_DATE)

'// Date/Time SubMatches array - 1 month, '3 day, '5 year, '7 time //
vSubMatch = Choose(lPattern, 0, 1, 1, 1, 1, 1, Array(1, 3, 5, 7), 0, _
Array(1, 3, 5, 7))

Do While Not .AtEndOfStream

TempString = .ReadLine

If RetCol(TempString, vSubMatch) Then
If lPattern > 1 And Not lPattern = 8 Then
If Not IsArray(TempString) Then
aryOutput(i, lPattern) = TempString
Else
aryOutput(i, lPattern) = _
DateSerial( _
TempString(2), Switch(TempString(0) = "JAN", 1, _
TempString(0) = "JANUARY", 1, TempString(0) = "FEB", 2, _
TempString(0) = "FEBRUARY", 2, TempString(0) = "MAR", 3, _
TempString(0) = "MARCH", 3, TempString(0) = "APR", 4, _
TempString(0) = "APRIL", 4, TempString(0) = "MAY", 5, _
TempString(0) = "JUN", 6, TempString(0) = "JUNE", 6, _
TempString(0) = "JUL", 7, TempString(0) = "JULY", 7, _
TempString(0) = "AUG", 8, TempString(0) = "AUGUST", 8, _
TempString(0) = "SEP", 9, TempString(0) = "SEPTEMBER", 9, _
TempString(0) = "OCT", 10, TempString(0) = "OCTOBER", 10, _
TempString(0) = "NOV", 11, TempString(0) = "NOVEMBER", 11, _
TempString(0) = "DEC", 12, TempString(0) = "DECEMBER", 12), _
TempString(1))

aryOutput(i, lPattern + 1) = CDate(TempString(3))
End If
End If
Exit Do
End If
Loop
Loop
.Close
End With
Next

With ThisWorkbook
Set wks = .Worksheets.Add(After:=.Worksheets(.Worksheets.Count))
With wks.Range("A1:J1")
.Value = Array("File Name", "Part No.", "Serial Number", "Operator", _
"Justification", "Name", "Start Date", "Start Time", _
"End Date", "End Time")
.Font.Bold = True
.HorizontalAlignment = xlCenter
.Parent.Range("A2").Resize(UBound(aryOutput, 1), 10).Value = aryOutput
.EntireColumn.AutoFit
End With
End With
End Sub

Function RetCol(InputOutput As Variant, SubMatchez As Variant) As Boolean

With REX
If .test(InputOutput) Then
If Not IsArray(SubMatchez) Then

InputOutput = .Execute(InputOutput)(0).SubMatches(SubMatchez)
Else
InputOutput = Array(.Execute(InputOutput)(0).SubMatches(SubMatchez(0)), _
.Execute(InputOutput)(0).SubMatches(SubMatchez(1)), _
.Execute(InputOutput)(0).SubMatches(SubMatchez(2)), _
.Execute(InputOutput)(0).SubMatches(SubMatchez(3)))
End If
RetCol = True
End If
End With
End Function

Hope that helps,

Mark

aries_heng
07-11-2010, 05:02 PM
hi this works fine but it can only extract data one file at a time?

cos i have thousands of those files, it would take up alot of time if i had to do one by one. =(

GTO
07-11-2010, 05:39 PM
hi this works fine but it can only extract data one file at a time?

You are kidding, right? Did you read the msgbox or the code? You can choose to rip from all .txt/.BR files in a folder, or pick certain ones.

aries_heng
07-11-2010, 06:13 PM
oops... so sorry my bad. i was rushing off to work =)

one last question:
how do i change
Const PATTERN_END_HEADER As String = "(h\ +fjhiojewr\ +ejwio\s+\*)(\s*)" to the header as
"* E N D O F U U T T E S T *"?


thanks alot!

aries_heng
07-11-2010, 06:18 PM
thanks alot for the help.

greatly appreciated!

GTO
07-11-2010, 07:05 PM
Not tested; try: "(E\s*N\s*D\s*O\s*F\s*U\s*U\s*T\s*T\s*E\s*S\s*T)(\s*)"

I have no doubt that my patterns could be improved upon, you may wish to study here:
http://msdn.microsoft.com/en-us/library/1400241x(VS.85).aspx