Solved: Get a line from a text file.



Danny
05-31-2010, 03:37 PM
Greetings,

I have a text file with about 20 million rows and I need to get certain rows into Excel. Currently I am using a Scripting.FileSystemObject to open the text file as a stream and loop through each line until a counter gets to whatever row I need. This seems to be a bit faster than ADO selecting a unique column with the record number, but I would really like to be able to just 'goto' or 'get' a specific row number without using a loop. I have read that this can be done with other languages, but I haven't found how to do it in VBA. Any ideas would be appreciated.
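The ADO approach I mention looks roughly like this (a sketch only; the folder, file name, and the RecNo column name are placeholders, not my actual setup):

Sub GetRowViaADO()
    Dim cn As Object, rs As Object
    Set cn = CreateObject("ADODB.Connection")
    'Jet text driver reads delimited text files in the named folder
    cn.Open "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\folder\;" & _
            "Extended Properties=""text;HDR=Yes;FMT=Delimited"""
    'RecNo is a placeholder for the unique record-number column
    Set rs = cn.Execute("SELECT * FROM [file1.txt] WHERE RecNo = 5000000")
    If Not rs.EOF Then Debug.Print rs.Fields(0).Value
    rs.Close: cn.Close
End Sub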

GTO
05-31-2010, 10:47 PM
Hi there,

AFAIK, there is no 'go to' in FSO. With 20 million rows, if you are currently reading each line, .SkipLine might speed it up a bit? Then again, you may already be employing this.
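Something along these lines (untested sketch, assuming a reference to Microsoft Scripting Runtime; the path and row number are placeholders):

'Sketch: skip past the unwanted lines, then read the target row
Dim FSO As New Scripting.FileSystemObject
Dim ts As Scripting.TextStream
Dim n As Long

Set ts = FSO.OpenTextFile("C:\file1.txt", ForReading)
For n = 1 To 999999 'rows to skip (placeholder)
    ts.SkipLine
Next
Debug.Print ts.ReadLine 'this is row 1,000,000
ts.Close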

Mark

mdmackillop
06-01-2010, 12:38 AM
Is there a line break or repeating character you can use to split the text into an array using the Split function? As here (http://www.vbaexpress.com/forum/showthread.php?t=32361)

GTO
06-01-2010, 05:07 AM
Is there a line break or repeating character you can use to split the text into an array using the Split function? As here (http://www.vbaexpress.com/forum/showthread.php?t=32361)

Nice! I will have to try that :thumb

mdmackillop
06-01-2010, 05:47 AM
Hi Mark,
Did you see this method (http://vbaexpress.com/forum/showthread.php?t=32367) to extract a row (column) from an array?

GTO
06-01-2010, 07:41 AM
Hi Malcom,

Yes I did. I tried a couple of different things and will probably ask a question in the other thread (blonde brain wasn't quite wrapping around it).

Have a great day :-)

Mark

Tommy
06-01-2010, 08:23 AM
This will do the same as Malcom's, only without all of the extra overhead of the FileSystemObject. This is assuming that the split will occur on the vbCrLf. It can be changed to your particular flavor. :)


Sub SplitOnLFCR()
    Dim FileData() As String
    Dim FileName As String

    FileName = "C:\SomeDirectory\SomeFile.Ext"
    'Read the whole file in one shot, then split it into lines
    FileData = Split(GetStrFile(FileName), vbCrLf)
    'Do what you need here
End Sub

Public Function GetStrFile(iFName As String) As String
    Dim mFNo As Long, mLen As Long

    mFNo = FreeFile
    Open iFName For Input As #mFNo
    mLen = LOF(mFNo) 'length of the file in bytes
    GetStrFile = Input(mLen, #mFNo) 'pull the entire file into one string
    Close #mFNo
End Function

Danny
06-01-2010, 12:03 PM
Hi All,

I tried the SkipLine method and it does run a little faster. I also tried to pull the entire file into an array and got an error that I was out of memory. Maybe looping is the best I can hope for with such a large file. Here is what I used:
Text file example:
"abc",123,"ROW1",456|
"efg",123,"ROW2",789|

Dim FS As New Scripting.FileSystemObject
Dim Strm As Scripting.TextStream
Dim arr() As String

Set Strm = FS.OpenTextFile("C:\file1.txt", ForReading, False)
arr = Split(Strm.ReadAll, "|")

mdmackillop
06-01-2010, 12:24 PM
How about

Sub test()
    Dim Match, Matches
    Dim RegExp As Object
    Dim t As String
    Dim Strm
    Dim FS

    Set FS = CreateObject("Scripting.FileSystemObject")
    Set Strm = FS.OpenTextFile("C:\AA\test.txt", 1, False) '1 = ForReading
    t = Strm.ReadAll
    Set RegExp = CreateObject("VBScript.RegExp")
    With RegExp
        .Global = True
        .IgnoreCase = True
        .Pattern = ".*\|" 'each match runs up to (and includes) a pipe
    End With
    Set Matches = RegExp.Execute(t)
    MsgBox Matches(10) 'Matches is zero-based, so this is the 11th record
End Sub

GTO
06-01-2010, 01:39 PM
Hi Danny,

None too lucid in thought, but for what it's worth:

20 million rows sounds like a lot to hold in a variable to me, but while I'm not sure how much overhead FSO sucks up, did you give Tommy's a shot?

On Malcom's point (at least if brain cells are still creeping along), did you split your test to see if .ReadAll failed, or the Split thereafter?

Unless the data is sensitive and discernible as such, any shot at us seeing this behemothic textfile? I would think that it would fit if zipped.

Note: I have zero experience with textfiles this size and imagine I would rather stick a needle in my eye than gain said experience. That said, it seems interesting :-)

Mark

Shred Dude
06-01-2010, 02:10 PM
I was curious as to how Tommy's offering fared on the large file too. I liked the simplicity of that approach.

Further, reading the helpfile on the FSO methods it comes right out and says:


For large files, using the ReadAll method wastes memory resources. Other techniques should be used to input a file, such as reading a file line by line.

If .ReadAll is not efficient on large files, I can't imagine reading line by line would be the way to go either. Reading the whole file in one shot with the Input function into a string variable would appear to be the most streamlined.

Shred Dude
06-01-2010, 02:37 PM
The help file suggests that a variable-length string can hold about 2 billion characters (2^31). If your 20,000,000 lines average more than 100 characters per row, that is over 2,000,000,000 characters, which could explain the memory error.

Danny
06-01-2010, 08:43 PM
I have tried both Tommy's and Malcom's code (posts #7 and #9) and I am still getting an out-of-memory error. I don't think I can post the file, but it is well over 100 characters per row... it is a very large file, over 600,000 KB, and cannot even be opened in Notepad.

shrivallabha
06-01-2010, 10:54 PM
It could be the computer's physical memory. In Windows Task Manager, check how much memory the program is eating up!

Instead of testing the code against all 20 million rows at once, you can give the first few thousand lines a shot. If that succeeds, you can always increase the physical memory.

Blade Hunter
06-02-2010, 12:02 AM
Does this help you??


Sub Read_text_File()
    Dim FSO As New FileSystemObject
    Dim FS
    Dim MyText As String
    Dim RowNum As Long
    Dim MyRow As Long

    MyRow = 1260 'Change this to the row number you are looking for
    RowNum = 0
    Set FS = FSO.OpenTextFile("c:\textfile.TXT")
    Do Until RowNum = MyRow Or FS.AtEndOfStream
        RowNum = RowNum + 1
        MyText = FS.ReadLine
    Loop
    If RowNum = MyRow Then
        MsgBox MyText
    Else
        MsgBox "Sorry, File was too short"
    End If
End Sub


Edit: Scratch that it is still reading the file line by line so will be slow.

Edit2: In fact on closer inspection I seem to have done exactly what you are already doing which is too slow :p

GTO
06-02-2010, 07:56 AM
Hi again,

Not to beat a dead horse, but I tried creating an example text file to play with. If I put in anywhere near 100+ characters per line, it would seem that the file would end up far bigger than 600,000 KB. To make a file about the same size with 20 million rows, I limited each line to a max of 26 characters.

Even at 546,875 KB I got to find out that attempting to open it in Word results in a nasty note advising that "I'm not opening anything larger than 32 MB" or similar. (Just curious, what app do you actually use to open the file with?)

Anyways, very sloppy, but here is how I created the textfile:


Option Explicit

Sub Setup()
    Dim FSO As FileSystemObject
    Dim fsoTxtFile As TextStream
    Dim a As Long, b As Long, c As Long, d As Long, e As Long, f As Long, i As Long
    Dim letter_1 As String, letter_2 As String, letter_3 As String, letter_4 As String, _
        letter_5 As String, letter_6 As String

    Dim Start As Double: Start = Timer
    Set FSO = New FileSystemObject

    Set fsoTxtFile = FSO.CreateTextFile(ThisWorkbook.Path & "\BiggestTest.txt", True)

    With fsoTxtFile
        'Six nested A-Z loops (the outer one runs twice) give up to
        '2 * 26^5 = 23,762,752 lines; we bail out at 20,000,000
        For a = 1 To 2
            letter_1 = Letter_Ret(a)
            For b = 1 To 26
                letter_2 = Letter_Ret(b)
                For c = 1 To 26
                    letter_3 = Letter_Ret(c)
                    For d = 1 To 26
                        letter_4 = Letter_Ret(d)
                        For e = 1 To 26
                            letter_5 = Letter_Ret(e)
                            For f = 1 To 26
                                letter_6 = Letter_Ret(f)

                                i = i + 1

                                'Each line is truncated to 25 characters plus a trailing pipe
                                .WriteLine ( _
                                    Left(i & "," & letter_1 & letter_2 & letter_3 & letter_4 & letter_5 & letter_6 & "," & (i * 2) & "," _
                                    & i & "," & letter_1 & letter_2 & letter_3 & letter_4 & letter_5 & letter_6 & "," & (i * 2) & "," _
                                    & i & "," & letter_1 & letter_2 & letter_3 & letter_4 & letter_5 & letter_6 & "," & (i * 2) & "," _
                                    & i & "," & letter_1 & letter_2 & letter_3 & letter_4 & letter_5 & letter_6 & "," & (i * 2) & "," _
                                    & i & "," & letter_1 & letter_2 & letter_3 & letter_4 & letter_5 & letter_6 & ",", 25) & "|" _
                                    )

                                If i >= 20000000 Then
                                    GoTo QuickExit
                                End If

                            Next f
                        Next e
                    Next d
                Next c
            Next b
        Next a
QuickExit:
        .Close
    End With

    Debug.Print Timer - Start
    MsgBox Timer - Start
End Sub

Function Letter_Ret(Index As Long) As String
    Letter_Ret = Application.Choose(Index, "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", _
                                    "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", _
                                    "W", "X", "Y", "Z")
End Function

I used these for testing:


Option Explicit

Sub TestAll()
    Dim i As Long

    For i = 1 To 3
        Call FSOEarly
        Call FSOLate
        Call NoFSO
    Next
End Sub

Sub FSOEarly()
    Dim FSO As FileSystemObject
    Dim fsoTxtFile As TextStream
    Dim i As Long

    Dim Start As Double: Start = Timer

    Set FSO = New FileSystemObject
    Set fsoTxtFile = FSO.OpenTextFile(ThisWorkbook.Path & "\BigTest.txt")

    With fsoTxtFile
        For i = 1 To 4999950 '<--- for 5 mil line test; 19999950 for 20 mil
            .SkipLine
        Next

        Debug.Print .ReadLine

        .Close
    End With

    Debug.Print "FSO Early: " & Timer - Start
End Sub

Sub NoFSO()
    Dim Data$
    Dim fFile&, i&
    Dim Start#

    Start = Timer

    fFile = FreeFile

    Open ThisWorkbook.Path & "\BigTest.txt" For Input As #fFile

    Do While Not EOF(fFile) And Not i = 4999950
        i = i + 1
        Line Input #fFile, Data
    Loop

    Line Input #fFile, Data

    Debug.Print i; Data

    Close #fFile

    Debug.Print "Open: " & Timer - Start
End Sub

Sub FSOLate()
    Dim FSO As Object
    Dim fsoTxtFile As Object
    Dim i As Long

    Dim Start As Double: Start = Timer

    'Late bound: create the object at run time rather than via a reference
    Set FSO = CreateObject("Scripting.FileSystemObject")
    Set fsoTxtFile = FSO.OpenTextFile(ThisWorkbook.Path & "\BigTest.txt")

    With fsoTxtFile
        For i = 1 To 4999950
            .SkipLine
        Next

        Debug.Print .ReadLine

        .Close
    End With

    Debug.Print "FSO Late: " & Timer - Start
End Sub

I ran this to "only" 5 mil rows and tested against FSO, both early and late bound, as well as the Open for Input similar to what Tommy showed (I hope). I also ran against the 20 mil file. Results at bottom.

In regards to early vs late binding for FSO, I've noticed this before: when there is much going on with FSO (looping, reading, writing, grabbing file properties, etc.), early binding seems to make a noticeable difference. Anyways, just in case you are using late binding, I thought to include what I found.

Hope we were able to help a little at least,

Mark

shrivallabha
06-02-2010, 08:59 AM
@GTO PFE32 is one I know of that can handle huge text files.

Apart from this, if this is eating up a lot of working time, then:
1. Create a Somefilename.bat file containing: %Filepath%\ExcelFile
2. Add the Somefilename.bat to Scheduled Tasks and have it run at night (if possible!)
3. For the macro to run automatically, call it from a sub such as auto_open or Workbook_Open() under ThisWorkbook, as sketched below.
Edit: my 2 cents!
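A minimal sketch of step 3 (untested; ProcessBigFile is just a placeholder for the name of your long-running macro):

'In the ThisWorkbook module
Private Sub Workbook_Open()
    'Runs when the scheduled task opens the workbook at night
    Call ProcessBigFile 'placeholder: your import macro
End Sub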

Shred Dude
06-02-2010, 10:32 AM
Could we maybe use the Seek# statement to facilitate breaking this problem up into smaller pieces so it could be handled?

Something along these lines would let you begin breaking the file up into smaller pieces for processing purposes. Code would obviously need to be added to clean this up... maybe to reorient the position at the end of lines... and to deal with the length of the resulting last chunk, if you were to get that far.

Just a thought...


Public Sub splitTextFileintoChunks(iFName As String)
    Dim mFNo As Long, mLen As Long
    Dim chunks As Integer
    Dim lenChunk As Long
    Dim strchunk As String
    Dim c As Integer
    Dim newposition As Long

    mFNo = FreeFile
    Open iFName For Input As #mFNo
    mLen = LOF(mFNo)
    chunks = 10 'set denominator to however many pieces you want
    lenChunk = Int(mLen / chunks)

    For c = 0 To chunks - 1
        'set position in file at beginning of each chunk
        newposition = (c * lenChunk) + 1
        'add code to reorient at a line break, for example

        Seek #mFNo, newposition

        'put that chunk into a string variable
        strchunk = Input(lenChunk, #mFNo) 'NOTE: this moves the file position to newposition + lenChunk

        'process that string
        'if found, exit; else, do next chunk
        '...
    Next c

    Close #mFNo
End Sub
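For the 'reorient at a line break' placeholder, something like this might do (a fragment to slot in after reading strchunk, assuming vbCrLf line endings as in the test files): every chunk after the first will usually begin mid-line, so discard the partial line at its start.

'Sketch: drop the partial first line of a chunk (assumes vbCrLf endings)
If c > 0 Then
    Dim firstBreak As Long
    firstBreak = InStr(strchunk, vbCrLf)
    If firstBreak > 0 Then strchunk = Mid$(strchunk, firstBreak + 2)
End If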

Danny
06-08-2010, 08:44 PM
Hi All,

Sorry for the late reply. Shred Dude's post seemed to be the quickest method with the text files, but I have decided to go another way with this and break the file up into multiple indexed DBF files and query them with ADO. So far this seems to be the fastest solution... once the files were created.
Thanks for all the help.

Shred Dude
06-09-2010, 07:32 AM
Thanks for the feedback on your solution. I'm curious: did you find it necessary to break the large file up into multiple DBF files because of a memory constraint similar to what we ran into with the text files? Did it not work as one large DBF file processed with ADO?

Danny
06-14-2010, 06:00 PM
I had to keep the DBF files under 2 GB or I would get a 'disk or network error'. I tried using two different connection strings when creating the DBF files, but both gave the same error. I would like to be able to have them in one file, or at least fewer files, because I have to create a SQL string (sometimes for multiple files) based on the user's parameters.

Here are the connection strings:

Provider=Microsoft.Jet.OLEDB.4.0;Data Source=c:\folder;Extended Properties=dBASE IV;User ID=Admin;Password=;
&
Driver={Microsoft dBASE Driver (*.dbf)};DriverID=277;Dbq=c:\mydbpath;
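For reference, a minimal sketch of querying one of the DBF files through the first connection string (MyData.dbf and the RecNo column are placeholders, not my actual names; with the dBASE driver the table name is the file name without its extension):

Sub QueryDbf()
    Dim cn As Object, rs As Object
    Set cn = CreateObject("ADODB.Connection")
    cn.Open "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=c:\folder;" & _
            "Extended Properties=dBASE IV;User ID=Admin;Password=;"
    'Indexed lookup on the record-number column (placeholder names)
    Set rs = cn.Execute("SELECT * FROM MyData WHERE RecNo = 5000000")
    If Not rs.EOF Then Debug.Print rs.Fields(0).Value
    rs.Close: cn.Close
End Sub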