There has to be a faster way.



aaronreyna
09-10-2009, 04:57 PM
So I recently wrote my first VBA script which pulls in a list of text files, each one a comma-delimited file containing an ID# and a corresponding Z value.

ex.
1,20
2,19
3,17
4,20
5,22
...

The script pulls the files into Excel, one by one, and then iterates through the comma-delimited data to calculate the 97th and 3rd percentiles. The script calculates the percentiles for the entire text file, then kicks the min and max values to a text file. Here is the kicker... each individual text file has somewhere between 3000 and 50000 entries. The smaller ones don't take so long, but the 50000-entry text files can take upwards of 45 minutes. I calculated the total time to run this on all my files, upwards of 4000 text files, and yeah... it'll be done next week. I know in Python you can stream things to memory, but I have been having a heck of a time trying to figure out memory streaming in VBA.


Sub run_Click()

'Clear Contents
Range("A2:B100000").Select
Selection.ClearContents
Dim i As Long, nd As Long, MaxBuilding As Integer, MinBuilding As Integer
Dim Temp(100000) As Double, Dens(100000) As Double, File_name As String, inc As Long, j As Integer, volume As Long

inc = 1
For j = 0 To inc
Do
If Sheet1.Cells(inc, 17).Value = "" Then Exit Do
File_name = Sheet1.Cells(inc, 17).Value
'fetch data from the file
Open File_name For Input As #1
Do
If EOF(1) Then Exit Do
nd = nd + 1
Input #1, Temp(nd), Dens(nd)
Loop
Close #1
'Display values on the worksheet
Sheets("Export_Output1").Select

Range("a2").Select
For i = 0 To nd
ActiveCell.Value = Temp(i)
ActiveCell.Offset(0, 1).Select
ActiveCell.Value = Dens(i)
ActiveCell.Offset(1, -1).Select
Next i
nd = -1
Erase Dens
Erase Temp
Range("a1").Select
'create a file
Open "c:\z_unit_write.txt" For Append As #2
MaxBuilding = Range("h12")
MinBuilding = Range("k12")
volume = Range("m13")
Write #2, MaxBuilding, MinBuilding, volume, File_name
Close #2
'clear Contents
Range("A2:B100000").Select
Selection.ClearContents
inc = inc + 1
Loop
Next j
End Sub

What is good about this script is that it iterates through a list of text files (in a range) until the range = "", something that took me forever to figure out how to do in a batch process. My question is, does anyone happen to know of a faster way to process all this data? Maybe instead of loading in each file and reading each comma-delimited entry one at a time, you can set something in VBA to load the entire list in at once?
:banghead:

Thanks for any tips!

aaron

nst1107
09-10-2009, 09:36 PM
Not sure about streaming, either, but I can offer one tip for speeding up your procedure. Get rid of the Range.Select, Selection...., or ActiveCell.... manner of doing things. Using .Select is always the slow way. Here's a cleaner version of your code, minus the .Select and a few other things that create useless overhead. It's not much, but it should speed it up at least a little bit. Hope it helps.
Sub run_Click()
'Clear Contents
Range("A2:B100000").ClearContents
Dim i As Long, nd As Long, MaxBuilding As Long, MinBuilding As Long
Dim Temp(100000) As Double, Dens(100000) As Double, File_name As String, inc As Long, j As Long, volume As Long
inc = 1
For j = 0 To inc
Do
If Sheet1.Cells(inc, 17) = vbNullString Then Exit Do
File_name = Sheet1.Cells(inc, 17)
'fetch data from the file
Open File_name For Input As #1
Do
If EOF(1) Then Exit Do
nd = nd + 1
Input #1, Temp(nd), Dens(nd)
Loop
Close #1
'Display values on the worksheet
With Sheets("Export_Output1").Cells(2, 1)
For i = 0 To nd
.Offset(i, 0).Value = Temp(i)
.Offset(i, 1).Value = Dens(i)
Next i
End With
nd = -1
Erase Dens
Erase Temp
'create a file
Open "c:\z_unit_write.txt" For Append As #2
MaxBuilding = Cells(12, 8)
MinBuilding = Cells(12, 11)
volume = Cells(13, 13)
Write #2, MaxBuilding, MinBuilding, volume, File_name
Close #2
'clear Contents
Range("A2:B100000").ClearContents
inc = inc + 1
Loop
Next j
End Sub

Note the integers are all longs in this version, and most of your Range() are now Cells().

Paul_Hossler
09-11-2009, 08:03 AM
Why are you reading the CSV a record at a time? You could open it in Excel, call WS functions on the data, save it in a 'Summary WS', and write that sheet at the end.

All that looping takes time
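
Something along these lines is what I have in mind - an untested sketch only; the output path and the column layout are just guesses based on your code:

Sub SummariseOneCsv(ByVal csvPath As String)
Dim wb As Workbook
Dim p97 As Double, p3 As Double, cnt As Double
'Let Excel parse the CSV instead of reading it record by record
Workbooks.OpenText Filename:=csvPath, DataType:=xlDelimited, Comma:=True
Set wb = ActiveWorkbook
With Application.WorksheetFunction
p97 = .Percentile(wb.Worksheets(1).Columns(2), 0.97) 'assumes the Z values land in column 2
p3 = .Percentile(wb.Worksheets(1).Columns(2), 0.03)
cnt = .Count(wb.Worksheets(1).Columns(2))
End With
'Write one summary line per file instead of dumping every record to a sheet
Open "c:\z_unit_write.txt" For Append As #2
Write #2, p97, p3, cnt, csvPath
Close #2
wb.Close SaveChanges:=False
End Sub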

If you post a Zip with the WB, and 2-3 SMALL csv files, it'd be easier to offer suggestions

Paul

p45cal
09-11-2009, 08:46 AM
What probably takes the longest time is writing to the sheet.
Try this:
Sub run_Click()
'Clear Contents:
Sheets("Export_Output1").Range("A2:B100000").ClearContents
Dim i As Long, nd As Long, MaxBuilding As Long, MinBuilding As Long
Dim File_name As String, inc As Long, j As Long, volume As Long
Dim TempDens() As Double
ReDim TempDens(1 To 2, 1 To 1) 'ReDim TempDens(1 To 2, 0 To 0)
inc = 1
'For j = 0 To inc
Do
If Sheet1.Cells(inc, 17).Value = "" Then Exit Do
File_name = Sheet1.Cells(inc, 17).Value
'fetch data from the file:
Open File_name For Input As #1
Do
If EOF(1) Then Exit Do
nd = nd + 1
ReDim Preserve TempDens(1 To 2, 1 To nd) 'ReDim Preserve TempDens(1 To 2, 0 To nd)
Input #1, TempDens(1, nd), TempDens(2, nd)
Loop
Close #1
'Display values on the worksheet
With Sheets("Export_Output1")
.Range("A2").Resize(nd, 2) = Application.Transpose(TempDens)
nd = -1
Erase TempDens
'create a file
Open "c:\z_unit_write.txt" For Append As #2
MaxBuilding = .Range("h12")
MinBuilding = .Range("k12")
volume = .Range("m13")
Write #2, MaxBuilding, MinBuilding, volume, File_name
Close #2
'clear Contents:
.Range("A2:B100000").ClearContents
End With
inc = inc + 1
Loop
'Next j
End Sub
You had 0,0 as the first values (maybe you have Option Base 1?), and I'm not sure whether you want them. The code above puts the first row of values at row 2. There are commented out lines as suggested replacements for my code to reinstate these zero values.

I took out the j loop (I don't think it's necessary but you may have it there for your own reasons).
I think you'll be very pleased with the speed; I tested with 10,000 values in the text file and some simple max/min/sum formulae in the three cells, and it took 0.062 seconds (6.2 seconds to do it 100 times). (XP Pro/Excel 2003)

ps. you say the script does the percentile stuff? Not in the code you supplied. Is it worksheet formulae which do this? I suspect it is, and if those formulae take any time at all to calculate, then your code, which writes values one at a time to the sheet, will have the sheet recalculating at each change. If (and this only applies to your code, not mine) you were to switch automatic calculation off, then calculate only once all the data is in place, that would be a route to speeding up your routine.
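For illustration only - this is the general pattern, not tested against your workbook:

Application.Calculation = xlCalculationManual 'stop the sheet recalculating at every cell write
'... write all the values to the sheet here ...
Application.Calculate 'recalculate once, then read h12, k12 and m13
Application.Calculation = xlCalculationAutomatic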

Oorang
09-11-2009, 09:03 AM
Hi Aaron,
Welcome to the board. As alluded to, a big part of your problem is the use of Range.Select and Range.Offset. But what I don't see mentioned is that you can also realize huge gains by turning off the interface: Events, Screen-Updating, and most importantly Auto-Calculation. In Excel, all of these things keep happening while your script is running unless you turn them off. Once I corrected the Select/Offset issue and turned off the interface I was able to get your code to do a 50,000 record file in about a second.

However, a few other things are worth noting. You seem to be calculating the percentile and volume from formulas in a cell. You don't really need to do this; you can use Excel.WorksheetFunction.Percentile to get a result directly.

Secondly, you asked about memory. You can move an entire file into memory quite quickly (see the GetFileBytes procedure in the code below), but it will be pulled in in byte form, meaning a number will actually be a series of Unicode values. So 971 in memory would be spread across 6 elements as 57-0-55-0-49-0. And this does not even address extracting data from the second column of a non-fixed-width file. Obviously this is sub-optimal.

It's easier to pull the data in via a loop using the Input method (similar to what you have already done), thus extracting the second column and allowing implicit type coercion to convert the text to numeric values.

That said, you still have the issue of calculating the percentile in memory. This calls for the implementation of a sort algorithm and results in quite a lot of code. However, the end result is faster by about half a second. So if you really want to, you can.

I have set up a few;) examples for you to look over, along with a test harness to see the results. I wasn't really sure what you meant by "Volume" so I just assumed you meant "count".

In my "professional opinion" unless you are processing thousands of files, 1 second per file is adequate, and while a half a second is twice as fast it doesn't create enough productivity gains to justify bloating the code base like that. (See also: "Premature optimization is the root of all evil.")

Option Explicit

Private Declare Function GetTickCount Lib "Kernel32.dll" () As Long
Private Declare Sub CopyMemory Lib "Kernel32.dll" Alias "RtlMoveMemory" (ByRef _
Destination As Any, ByRef Source As Any, ByVal Length As Long)

Public Sub TestHarness()
Const strPath_c As String = "C:\Test\test.csv"
Const lngRecords_c As Long = 50000
Dim lngStartVal As Long
Dim lngEndVal As Long
Dim ws As Excel.Worksheet
Dim dblData() As Double
Dim strOutputPath As String
On Error GoTo Err_Hnd
ToggleInterface False

GenerateTestFile strPath_c, lngRecords_c

strOutputPath = Left$(strPath_c, InStrRev(strPath_c, "\")) & _
"z_unit_write.txt"

'Demonstrate problems with working with memory:
'MsgBox ByteArrayToString(GetFileBytes(strPath_c), 25)

'Try corrected original method:
lngStartVal = GetTickCount
Set ws = Sheet1
ws.UsedRange.Delete
OriginalMethod strPath_c, ws
With Excel.WorksheetFunction
OutputResults strOutputPath, .percentile(ws.Columns(2), 0.97), _
.percentile(ws.Columns(2), 0.03), .Count(ws.Columns(2))
End With
lngEndVal = GetTickCount
Debug.Print lngEndVal - lngStartVal, "Original Input Method"

'Do import method:
lngStartVal = GetTickCount
Set ws = ImportMethod(strPath_c)
OriginalMethod strPath_c, ws
With Excel.WorksheetFunction
OutputResults strOutputPath, .percentile(ws.Columns(1), 0.97), _
.percentile(ws.Columns(1), 0.03), .Count(ws.Columns(1))
End With
ws.Parent.Close False
lngEndVal = GetTickCount
Debug.Print lngEndVal - lngStartVal, "New Import Method"

'Try In Memory Method
lngStartVal = GetTickCount
dblData = GetData(strPath_c)
OutputResults strOutputPath, ArrayPercentile(dblData, 0.97), _
ArrayPercentile(dblData, 0.03), UBound(dblData) + 1&
lngEndVal = GetTickCount
Debug.Print lngEndVal - lngStartVal, "Memory Method"

Exit_Proc:
On Error Resume Next
ToggleInterface True
Exit Sub
Err_Hnd:
MsgBox Err.Description, vbCritical + vbMsgBoxHelpButton, Err.Number, _
Err.HelpFile, Err.HelpContext
Resume Exit_Proc
End Sub

Public Sub GenerateTestFile(ByVal path As String, ByVal records As Long)
Dim lngFileNum As Long
Dim bytHeader(1) As Byte
lngFileNum = FreeFile
Open path For Binary Access Write As lngFileNum
Put lngFileNum, , bytHeader
Close lngFileNum
Open path For Binary Access Write As lngFileNum
Put lngFileNum, , CreateTempValues(records)
Close lngFileNum
End Sub

Private Function CreateTempValues(ByVal records As Long) As Byte()
Const lngRecordWidth_c As Long = 14&
Const bytByte0Unicode_c As Byte = 255
Const bytByte1Unicode_c As Byte = 254
Const bytComma_c As Byte = 44
Const bytVBCR_c As Byte = 13
Const bytVBLF_c As Byte = 10
Dim lngIndex As Long
Dim lngUpperBnd As Long
Dim bytRtnVal() As Byte
Randomize
lngUpperBnd = records * lngRecordWidth_c + 1&
ReDim bytRtnVal(lngUpperBnd) As Byte
'Add Header:
bytRtnVal(0) = bytByte0Unicode_c
bytRtnVal(1) = bytByte1Unicode_c
For lngIndex = 2& To lngUpperBnd Step lngRecordWidth_c
bytRtnVal(lngIndex) = RandBetween(vbKey0, vbKey9)
bytRtnVal(lngIndex + 2&) = RandBetween(vbKey0, vbKey9)
bytRtnVal(lngIndex + 4&) = bytComma_c
bytRtnVal(lngIndex + 6&) = RandBetween(vbKey0, vbKey9)
bytRtnVal(lngIndex + 8&) = RandBetween(vbKey0, vbKey9)
bytRtnVal(lngIndex + 10&) = bytVBCR_c
bytRtnVal(lngIndex + 12&) = bytVBLF_c
Next
CreateTempValues = bytRtnVal
End Function

Private Function RandBetween(ByVal lowerBound As Byte, ByVal upperBound As _
Byte) As Byte
RandBetween = CByte((upperBound - lowerBound) * Rnd + lowerBound)
End Function

Public Sub OriginalMethod(ByVal path As String, ByVal ws As Excel.Worksheet)
Dim nd As Long, inc As Long, i As Long
Dim temp(100000) As Double, Dens(100000) As Double
Dim lngFileNum As Long
lngFileNum = FreeFile
'fetch data from the file
Open path For Input As lngFileNum
Do
If EOF(lngFileNum) Then Exit Do
nd = nd + 1
Input #lngFileNum, temp(nd), Dens(nd)
Loop
Close lngFileNum
'Display values on the worksheet
With ws
For i = 2 To nd + 2
.Cells(i, 1).value = temp(i)
.Cells(i, 2).value = Dens(i)
Next i
End With
Erase Dens
Erase temp
'create a file
'clear Contents
inc = inc + 1
End Sub

Public Sub OutputResults(ByVal path As String, ByVal maxBuilding As Double, _
ByVal minBuilding As Double, ByVal volume As Double)
Dim lngFileNum As Long
lngFileNum = FreeFile
Open path For Append As lngFileNum
Write #lngFileNum, maxBuilding, minBuilding, volume, path
Close lngFileNum
End Sub

Public Sub ToggleInterface(ByVal interfaceOn As Boolean)
With Excel.Application
.EnableEvents = interfaceOn
.ScreenUpdating = interfaceOn
If interfaceOn Then
.Calculation = xlCalculationAutomatic
.Cursor = xlDefault
.EnableCancelKey = xlInterrupt
Else
.Calculation = xlCalculationManual
.Cursor = xlWait
.EnableCancelKey = xlErrorHandler
End If
End With
End Sub

Private Function ImportMethod(ByVal path As String) As Excel.Worksheet
Workbooks.OpenText path, xlWindows
Set ImportMethod = Excel.Workbooks(Excel.Workbooks.Count).Worksheets(1)
End Function

Public Function ByteArrayToString(ByRef vals() As Byte, ByVal stopAtIndex As _
Long) As String
Dim lngIndex As Long
Dim lngUpper As Long
Dim strResult As String
lngUpper = UBound(vals)
strResult = "Index" & vbTab & "Value" & vbTab & "Char Val" & vbLf
If lngUpper > stopAtIndex Then lngUpper = stopAtIndex
For lngIndex = 0& To lngUpper Step 2&
strResult = strResult & (lngIndex & vbTab & vals(lngIndex) & vbLf)
strResult = strResult & ((lngIndex + 1) & vbTab & vals(lngIndex + 1) & _
vbTab & GetChar(vals(lngIndex), vals(lngIndex + 1&)) & vbLf)
Next
ByteArrayToString = strResult
End Function

Private Function GetChar(ByVal byte1 As Byte, ByVal byte2 As Byte) As String
Dim strRtnVal As String
strRtnVal = Excel.WorksheetFunction.Clean(ChrW(byte1 + byte2 * 256&))
If LenB(strRtnVal) = 0& Then strRtnVal = "?"
GetChar = strRtnVal
End Function

Private Function GetFileBytes(ByVal path As String) As Byte()
Dim lngFileNum As Long
Dim bytRtnVal() As Byte
lngFileNum = FreeFile
Open path For Binary Access Read Lock Write As lngFileNum
ReDim bytRtnVal(LOF(lngFileNum)) As Byte
Get lngFileNum, , bytRtnVal
Close lngFileNum
GetFileBytes = bytRtnVal
End Function

Private Function GetData(ByVal path As String) As Double()
Const lngIncrement_c As Long = 40000
Const lngErrSOOR_c As Long = 9& '<Subscript out of range
Dim lngFileNum As Long
Dim dblRtnVal() As Double
Dim dblNum As Double
Dim lngRecord As Long
Dim lngUprBnd As Long
On Error GoTo Err_Hnd
lngFileNum = FreeFile
Open path For Input Access Read Lock Write As lngFileNum
lngUprBnd = lngIncrement_c
ReDim dblRtnVal(lngUprBnd) As Double
Do Until EOF(lngFileNum)
'Value from the first column is discarded in favor of the second:
Input #lngFileNum, dblRtnVal(lngRecord), dblRtnVal(lngRecord)
lngRecord = lngRecord + 1&
Loop
Close lngFileNum
ReDim Preserve dblRtnVal(lngRecord + 1&) As Double
GetData = dblRtnVal
Exit Function
Err_Hnd:
If Err.Number = lngErrSOOR_c Then
If lngRecord > lngUprBnd Then
lngUprBnd = lngUprBnd + lngIncrement_c
ReDim Preserve dblRtnVal(lngUprBnd) As Double
Resume
End If
End If
Close lngFileNum
'Reraise error
Err.Raise Err.Number, Err.Source, Err.Description, Err.HelpFile, _
Err.HelpContext
End Function

Private Function ArrayPercentile(ByRef vals() As Double, ByVal percentile As _
Double) As Double
Dim lngIndx As Long
Dim lngLwrBnd As Long
Dim lngUprBnd As Long
lngLwrBnd = LBound(vals)
lngUprBnd = UBound(vals)
MergeSort vals, lngLwrBnd, lngUprBnd
ArrayPercentile = vals(Excel.WorksheetFunction.RoundUp((lngUprBnd - _
lngLwrBnd + 1) * percentile, 0))
End Function

Public Sub MergeSort(list() As Double, ByVal first_index As Long, ByVal _
last_index As Long)
'Merge sort code adapted to doubles from code located on www.vb-helper.com (http://www.vb-helper.com).
'http://www.vb-helper.com/howto_mergesort.html
Dim middle As Long
If (last_index > first_index) Then
' Recursively sort the two halves of the list.
middle = (first_index + last_index) \ 2
MergeSort list, first_index, middle
MergeSort list, middle + 1, last_index
' Merge the results.
Merge list, first_index, middle, last_index
End If
End Sub

' Merge two sorted sublists.
Public Sub Merge(list() As Double, ByVal beginning As Long, ByVal middle As _
Long, ByVal ending As Long)
'Merge sort code adapted to doubles from code located on www.vb-helper.com (http://www.vb-helper.com).
'http://www.vb-helper.com/howto_mergesort.html
Const lngDataTypeByteLen_c As Long = 8&
Dim temp_array() As Double
Dim temp As Integer
Dim counterA As Long
Dim counterB As Long
Dim counterMain As Long
' Copy the array into a temporary array.
ReDim temp_array(beginning To ending)
CopyMemory temp_array(beginning), list(beginning), (ending - beginning + 1) _
* lngDataTypeByteLen_c
' counterA and counterB mark the next item to save
' in the first and second halves of the list.
counterA = beginning
counterB = middle + 1
' counterMain is the index where we will put the
' next item in the merged list.
counterMain = beginning
Do While (counterA <= middle) And (counterB <= ending)
' Find the smaller of the two items at the front
' of the two sublists.
If (temp_array(counterA) <= temp_array(counterB)) Then
' The smaller value is in the first half.
list(counterMain) = temp_array(counterA)
counterA = counterA + 1
Else
' The smaller value is in the second half.
list(counterMain) = temp_array(counterB)
counterB = counterB + 1
End If
counterMain = counterMain + 1
Loop
' Copy any remaining items from the first half.
If counterA <= middle Then
CopyMemory list(counterMain), temp_array(counterA), (middle - counterA _
+ 1) * Len(list(beginning))
End If
' Copy any remaining items from the second half.
If counterB <= ending Then
CopyMemory list(counterMain), temp_array(counterB), (ending - counterB _
+ 1) * Len(list(beginning))
End If
End Sub

aaronreyna
09-11-2009, 09:10 AM
ok, I added the zip attachment

Ah,

I see. Writing it to the sheet is what's bogging it down. I figured so. Man, you guys are awesome! In the Excel file I attached, I cleared out some of the other equations that were just making the file too big to post. They can always be put back in later.

Yeah, the streaming thing to memory seemed like such an extra step to do the same thing. I never thought about turning certain features off.

Thanks again guys! I appreciate it so much.

Paul_Hossler
09-11-2009, 10:41 AM
Oorang --



It's easier to pull the data in via a loop using the input method ...

That said you still have the issue of calculating the Percentile in Memory. This calls for the implementation of a sort algorithm and results in quite a lot of code. However the end result is faster by about half a second. So if you really want to you can


Question: Is the Input loop faster than opening the CSV file as a new workbook, using the worksheet in that WB to Sort if necessary, and going right to the WorksheetFunction.Percentile, closing the CSV WB, and getting the next one?

I couldn't find the zip file to try it out myself




(See also: "Premature optimization is the root of all evil.")


Can you post a link or reference, please?

Paul

p45cal
09-11-2009, 11:00 AM
I timed my code in your workbook and to do 4000 files with 10,000 records in each takes 15 mins - with screen updating still on. Turning it off yielded only a slight improvement to 12 minutes.


I couldn't find the zip file to try it out myself
It's attached to the first post in this thread.

aaronreyna
09-11-2009, 12:03 PM
I took off the auto calc and changed it to recalculate only when the workbook saves, and it cut down my time to about 4 seconds a cycle.

p45cal,

I ran your code and it processed the first file in no time flat, but upon opening the second file, it returned an error 9 message for the "ReDim Preserve TempDens(1 To 2, 1 To nd)" under the second Do Loop. Originally, I had got around this by declaring Dim Temp(100000) As Double, Dens(100000) As Double,
and when I try to do that again, VBA simply tells me those variables have already been declared.

Any insight?

aaron

sorry,

Subscript out of range (Error 9)

aaron

Oorang
09-11-2009, 01:32 PM
Question: Is the Input loop faster than opening the CSV file as a new workbook, using the worksheet in that WB to Sort if necessary, and going right to the WorksheetFunction.Percentile, closing the CSV WB, and getting the next one?
Yes, the input loop tested faster than opening the file.


Can you post a link or reference, please?
Paul
Premature optimization is the root of all evil. (http://en.wikipedia.org/wiki/Program_optimization#Quotes)

I timed my code in your workbook and to do 4000 files with 10,000 records in each takes 15 mins - with screen updating still on. Turning it off yielded only a slight improvement to 12 minutes.
I have questions:)
1.) Did you turn off events and calculation? Those are where the large gains were.
2.) How fast is your computer?
3.) Are you just processing all files in a particular folder?

Aaron R. -> Are the files already sorted? Because you could shave off even more time there.

Note: The memory method I laid out took 6 minutes on 4000 10000-record files.

aaronreyna
09-11-2009, 02:42 PM
Oorang,

Yes, the files are sorted. The files I posted were not, only because they're just a test example.

Thanks!

aaron

p45cal
09-11-2009, 03:54 PM
but upon opening the second file, it returned an error 9 message for the "ReDim Preserve TempDens(1 To 2, 1 To nd)" under the second Do Loop.
aaronreyna:
I don't get the error; post your exact code or try with the code posted lower down in this message. (Note I've tacked 'xx' to the end of the sub name.)

**************************************************



Yes the input loop tested faster than opening the file.
Premature optimization is the root of all evil. (http://en.wikipedia.org/wiki/Program_optimization#Quotes)
I have questions:)
1.) Did you turn off events and calculation? Those are where the large gains were.
2.) How fast is your computer?
3.) Are you just processing all files in a particular folder?
Oorang:
Answers:
1. No, neither were switched off. Don't forget, there is only one line in my code that changes what's on the sheet, and that line is executed only once per file. viz.
.Range("A2").Resize(nd, 2) = Application.Transpose(TempDens)
which puts the whole array in at once. So I didn't see the point in turning it off at all. (Actually, there is another, the one that clears the first two columns.)

2. I don't know, what sort of info do you need? AMD Athlon 64x2 Dual Core 3800+, 2.00 GHz, 1.96 GB Ram. 3+ years old.

3. It's processing the same file over and over again. The file has been constructed with the code:
For i = 1 To 10000
Write #2, Int((500) * Rnd + 1), Int((50000) * Rnd + 1)
Next i
which produces the likes of:
90,14479
151,38738
8,38037
408,35452
23,20702
432,39525
to 10000 rows.
I checked that the entire array was correctly placed in the sheet- it was.

I had to make a guess at the formulae in columns H and I since I only have xl2003, and when the 2007 compatibility whatsit converted the file it reported that some formulae would contain #REF! because the number of rows available in 2007 is so much larger than in 2003.
So where I saw the likes of
=PERCENTILE(#REF!,0.4)
I put
=PERCENTILE(A:A,0.4)
and
=PERCENTILE(B:B,0.4)
etc.

While writing this I noticed, while debugging, some values appearing in column C, so I changed the formulae to:
=PERCENTILE(B$2:B$65536,0.4)
and
=PERCENTILE(C$2:C$65536,0.4).
Incidentally, while doing this I came across this in the Percentile Help:
"If array is empty or contains more than 8,191 data points, PERCENTILE returns the #NUM! error value."
However I never did get a #NUM! (unless the array was empty), and I got the same result whether I used:
=PERCENTILE(B$2:B$65536,0.4)
or
=PERCENTILE(B$2:B$10001,0.4)
(and to double check I even made sure that
=PERCENTILE(B$2:B$8192,0.4)
gave a different answer, so they may have fixed this limitation)

If I'm wrong with my guesses, please tell me.

So having made these changes, I created 100 different text files, added their names in column Q and ran the code:
Sub run_Clickxx()
Application.ScreenUpdating = False
Starttime = Timer
'Clear Contents
Range("A2:B20000").ClearContents '"A2:B100000"
Dim i As Long, nd As Long, MaxBuilding As Long, MinBuilding As Long
Dim File_name As String, inc As Long, j As Long, volume As Long
Dim TempDens() As Double
inc = 1
'For j = 0 To inc
Do
ReDim TempDens(1 To 2, 1 To 1) 'ReDim TempDens(1 To 2, 0 To 0)
If Sheet1.Cells(inc, 17).Value = "" Then Exit Do
File_name = Sheet1.Cells(inc, 17).Value
'fetch data from the file
Open File_name For Input As #1
Do
If EOF(1) Then Exit Do
nd = nd + 1
ReDim Preserve TempDens(1 To 2, 1 To nd) 'ReDim Preserve TempDens(1 To 2, 0 To nd)
' Input #1, Temp(nd), Dens(nd)
Input #1, TempDens(1, nd), TempDens(2, nd)
Loop
Close #1
'Display values on the worksheet
With Sheets("Export_Output1")
.Range("A2").Resize(nd, 2) = Application.Transpose(TempDens)
nd = 0
Erase TempDens
'create a file
Open "c:\z_unit_write.txt" For Append As #2
MaxBuilding = .Range("h12")
MinBuilding = .Range("k12")
volume = .Range("m13")
Write #2, MaxBuilding, MinBuilding, volume, File_name
Close #2
'clear Contents
.Range("A2:B20000").ClearContents 'A2:B100000
End With
inc = inc + 1
Loop
'Next j
MsgBox (Timer - Starttime)
Application.ScreenUpdating = True
Stop
End Sub
which took 30 secs, which means, sadly, that the 4000 10,000-line files will take 20 mins. :(

aaronreyna
09-11-2009, 04:45 PM
WOW,

p45cal, I ran your code and it works amazingly! HA HA HA HA HA, you guys are so out of my league it's not even funny. Most importantly, I see the error of my ways! Don't write it to the sheet if you don't need to.

thanks again!

aaron

Oorang
09-11-2009, 07:47 PM
Ok, well if the files are already sorted you just need two things: the record count and then the value of the nth record (twice). This method gets us down to less than 2 minutes.
If you have control over the file format, you might suggest switching to a fixed-width file. Then you could calculate the record count from the file size and get the percentile using "Seek" to go straight to the record in question.
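
Roughly like this - an untested sketch that assumes a plain ANSI file where every record (CRLF included) is exactly recLen bytes, with the Z value in the second column:

Sub PercentileBySeek(ByVal path As String, ByVal recLen As Long)
Dim f As Long, recCount As Long, target As Long
Dim strRec As String, z97 As Double
f = FreeFile
Open path For Input As #f
recCount = LOF(f) \ recLen 'record count straight from the file size, no reading needed
target = Application.WorksheetFunction.RoundUp(recCount * 0.97, 0)
Seek #f, (target - 1) * recLen + 1 'jump directly to the 97th-percentile record
Line Input #f, strRec
z97 = CDbl(Split(strRec, ",")(1)) 'second column holds the Z value
Close #f
Debug.Print recCount, z97
End Sub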

BTW if you are wondering why I used the Windows API to iterate the files in the folder, it's because it's way faster than the FSO.
Option Explicit

Private Const MAX_PATH As Long = 255&
Private Const INVALID_HANDLE_VALUE As Long = -1&

Private Type FILETIME
dwLowDateTime As Long
dwHighDateTime As Long
End Type

Private Type WIN32_FIND_DATA
dwFileAttributes As Long
ftCreationTime As FILETIME
ftLastAccessTime As FILETIME
ftLastWriteTime As FILETIME
nFileSizeHigh As Long
nFileSizeLow As Long
dwReserved0 As Long
dwReserved1 As Long
cFileName As String * MAX_PATH
cAlternate As String * 14
End Type

Private Declare Function GetTickCount Lib "Kernel32.dll" () As Long
Private Declare Function FindFirstFile Lib "kernel32" Alias "FindFirstFileA" _
(ByVal lpFileName As String, lpFindFileData As WIN32_FIND_DATA) As Long
Private Declare Function FindNextFile Lib "kernel32" Alias "FindNextFileA" _
(ByVal hFindFile As Long, lpFindFileData As WIN32_FIND_DATA) As Long
Private Declare Function GetFileAttributes Lib "kernel32" Alias _
"GetFileAttributesA" (ByVal lpFileName As String) As Long
Private Declare Function FindClose Lib "kernel32" (ByVal hFindFile As Long) As _
Long


Public Sub Test()
Const strFolder_c As String = "C:\Test_CSVs\"
Const strSearchValue_c As String = "*.csv"
Dim tFileData As WIN32_FIND_DATA
Dim tEmpty As WIN32_FIND_DATA
Dim lngFileHandle As Long
Dim lngResult As Long
Dim strFullPath As String
Dim strFileData() As String
Dim lngRecords As Long
Dim strOutputPath As String
Dim lngStart As Long
Dim lngEnd As Long
lngStart = GetTickCount
strOutputPath = strFolder_c & "z_unit_write.txt"
lngFileHandle = FindFirstFile(strFolder_c & strSearchValue_c, tFileData)
Do
'VB does not have short-circuited Ifs, so check the find handle inside the loop:
If lngFileHandle = INVALID_HANDLE_VALUE Then Exit Do
strFullPath = strFolder_c & RTrimNulls(tFileData.cFileName)
If (GetFileAttributes(strFullPath) And vbDirectory) <> vbDirectory Then
strFileData = Split(CStr(GetFileBytes(strFullPath)), vbNewLine)
lngRecords = UBound(strFileData)
If LenB(strFileData(lngRecords)) Then lngRecords = lngRecords + 1&
OutputResults strOutputPath, Split(strFileData(Fix(lngRecords * _
0.03)), ",")(1&), Split(strFileData(Fix(lngRecords * 0.97)), _
",")(1&), lngRecords, strFullPath
End If
'Clear old values to prevent accidental data merging.
tFileData = tEmpty
lngResult = FindNextFile(lngFileHandle, tFileData)
Loop While lngResult
FindClose lngFileHandle
lngEnd = GetTickCount
Debug.Print lngEnd - lngStart, "Milliseconds"
End Sub

Private Function GetFileBytes(ByVal path As String) As Byte()
Dim lngFileNum As Long
Dim bytRtnVal() As Byte
lngFileNum = FreeFile
Open path For Binary Access Read Lock Write As lngFileNum
ReDim bytRtnVal(LOF(lngFileNum)) As Byte
Get lngFileNum, , bytRtnVal
Close lngFileNum
GetFileBytes = bytRtnVal
End Function

Private Function RTrimNulls(ByVal value As String) As String
RTrimNulls = LeftB$(value, InStrB(value, vbNullChar))
End Function

Public Sub OutputResults(ByVal outputPath As String, ByVal maxBuilding As _
Double, ByVal minBuilding As Double, ByVal volume As Double, ByVal sourcePath _
As String)
Dim lngFileNum As Long
lngFileNum = FreeFile
Open outputPath For Append As lngFileNum
Write #lngFileNum, maxBuilding, minBuilding, volume, sourcePath
Close lngFileNum
End Sub

Paul_Hossler
09-12-2009, 06:20 AM
aaronreyna: (and the others)

This has been a very interesting learning experience, but I was wondering about the Percentile() calculations. They never seem to get saved for any file and get wiped out when the next file is loaded.

The Max, Min, count, and file name seem to be all that's actually saved.

Paul

p45cal
09-12-2009, 07:41 AM
aaronreyna: (and the others)

This has been a very interesting learning experience, but I was wondering about the Percentile() calculations. They never seem to get saved for any file and get wiped out when the next file is loaded.

The Max, Min, count, and file name seem to be all that's actually saved.

Paul
The Max and min saved are the 97th and 3rd percentile values. I suspect the other percentiles showing were only used during development to help choose which percentiles would be best to save.