[SOLVED] FSO filtering on the fly



JKwan
02-11-2016, 07:34 AM
I am wondering if it is possible to read a text file using FSO and filter the data on the fly. The scenario is that I have a data file (CSV) with over 900k lines. I would like to read this data file, filter it by date range, and put the result into Excel.

SamT
02-11-2016, 10:19 AM
FSO? File System Object?

Does this give you any ideas?
Process All CSV Files In SubFolders (http://www.vbaexpress.com/forum/showthread.php?52649-Process-All-CSV-Files-In-SubFolders)

JKwan
02-11-2016, 11:02 AM
Yes, File System Object

I think it is a start.... However, I wanted to filter the data file by a date range and then put that into the worksheet. Instead of loading 900k lines into Excel, I wanted to filter down to a date range; that is more meaningful for trending data and probably more efficient, since a lot of unwanted data will be filtered out.

JKwan
02-11-2016, 11:03 AM
Another idea that I have is to run SQL on the data file to achieve the filtering instead of using FSO. I will see what I can hammer out.

Paul_Hossler
02-11-2016, 11:18 AM
1. It's easy enough to just read the text file one line at a time using VBA, check whether the line 'passes', and if it does, add it to the spreadsheet (this'd be my preference)

2. An FSO way


https://msdn.microsoft.com/en-us/library/h7se9d4f(v=vs.84).aspx

Sub ReadAndFilter()
    Const ForReading = 1, ForWriting = 2, ForAppending = 8
    Dim fso As Object, MyFile As Object
    Dim FileName As String, TextLine As String
    Dim iRow As Long

    FileName = "c:\temp\datafile.csv"    ' placeholder - point this at your CSV
    iRow = 1

    Set fso = CreateObject("Scripting.FileSystemObject")

    ' Open the file for input.
    Set MyFile = fso.OpenTextFile(FileName, ForReading)

    ' Read from the file one line at a time.
    Do While MyFile.AtEndOfStream <> True
        TextLine = MyFile.ReadLine

        ' Check whether you want the line, and put it in the spreadsheet
        ' if it passes (replace this test with your own date-range check).
        If InStr(TextLine, "1980") > 0 Then
            ActiveSheet.Cells(iRow, 1) = TextLine
            iRow = iRow + 1
        End If
    Loop

    MyFile.Close
End Sub

JKwan
02-11-2016, 11:46 AM
Paul, thanks, but that line-by-line approach is way too slow. That is how I am currently doing it; I fell asleep during the load.

p45cal
02-11-2016, 12:05 PM
Re SQL: give us such a data file to play with.

Paul_Hossler
02-11-2016, 01:01 PM
Paul, thanks, but that line-by-line approach is way too slow. That is how I am currently doing it; I fell asleep during the load.

I've always found that reading a text file is pretty fast.

My test CSV is 100K lines, with all but the 100,000th being 'throw-aways', but I read it 10 times for a round million reads. The last line, which starts with 'bbbbb', is written to the next available row in the active sheet.

It only takes a second or less on my PC from a cold start.

Even if you have to parse each line with Split() to get the date field, it should still be faster than falling asleep :-) (see the sketch after the timing code)



Option Explicit

Sub TimingTest()
    Dim iFile As Long, iOut As Long, iIn As Long
    Dim sLine As String
    Dim i As Long
    Dim tStart As Date, tEnd As Date

    tStart = Now

    iOut = 1

    For i = 1 To 10
        iFile = FreeFile
        iIn = 0
        Open "c:\users\daddy\desktop\datafile.csv" For Input As #iFile

        Do While Not EOF(iFile)
            iIn = iIn + 1

            ' Yield to the OS every 1000 lines so Excel stays responsive
            If iIn Mod 1000 = 0 Then DoEvents

            Line Input #iFile, sLine

            ' Only the line starting with "bbbbb" (line 100,000) passes
            If Left(sLine, 5) = "bbbbb" Then
                ActiveSheet.Cells(iOut, 1) = sLine
                iOut = iOut + 1
            End If
        Loop

        Close #iFile
    Next i

    tEnd = Now
    MsgBox tEnd - tStart
End Sub
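
For the Split() route mentioned above, a minimal sketch, assuming the date is the third comma-separated field and in a form CDate can read (both are assumptions; adjust for the real layout):

Sub FilterByDateRange()
    Dim iFile As Long, iOut As Long
    Dim sLine As String, aFields() As String
    Dim dtLow As Date, dtHigh As Date

    dtLow = DateSerial(1978, 11, 12)        ' range to keep
    dtHigh = DateSerial(1984, 6, 24)
    iOut = 1

    iFile = FreeFile
    Open "c:\temp\datafile.csv" For Input As #iFile    ' placeholder path

    Do While Not EOF(iFile)
        Line Input #iFile, sLine
        aFields = Split(sLine, ",")

        ' Assumes the date is the 3rd field; malformed lines are skipped
        If UBound(aFields) >= 2 Then
            If IsDate(aFields(2)) Then
                If CDate(aFields(2)) >= dtLow And CDate(aFields(2)) <= dtHigh Then
                    ActiveSheet.Cells(iOut, 1) = sLine
                    iOut = iOut + 1
                End If
            End If
        End If
    Loop

    Close #iFile
End Sub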

JKwan
02-11-2016, 02:09 PM
Hey Paul, thanks for the demo code. However, with my 200 MB file, it is taking forever to load. I am making headway with the SQL pre-filter.

Paul_Hossler
02-11-2016, 02:51 PM
NP, but like p45cal says, it'd be nice to see a small sample of the file.

Is it a CSV, or is it in a database that exports it as a CSV, where you could use SQL?

JKwan
02-11-2016, 03:31 PM
Here is a snippet of the data file; it is a CSV that I grab from a site. So I want to load a range of records based on the DrillingDate.

SamT
02-11-2016, 05:12 PM
You must pick a book up before you can read it.

A 200 MB file will take a while for the computer to open.

It takes ~4 seconds for my fast binary file editor to open a 2+ MB file.

Until you accept that it is going to take several minutes, possibly even a couple of cups of coffee, to do this...

And is this a one-time thing? You have data here going back to 1955!

How many times will you need to parse this file? That is a serious question, because that knowledge will affect my recommendations.

p45cal
02-11-2016, 05:43 PM
This is work in progress (bedtime here).
The attached has a button in it which asks you to choose a CSV file (to test, first select the file you gave us above). It then adds a new sheet and gives you your filtered data.
At the moment the dates it uses are fixed in the code, from 1978-11-12 to 1984-06-24, so you should be able to test on any file with drilling dates in that range. But I need to know how you want to pass/define the dates between which you want results (will they be in cells somewhere, or do you want to be able to set them while the code runs?).
(The code may be more convoluted than it needs to be, as I haven't done this too often; ODBC may not be the best solution.)
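
In outline, the attachment does something like the sketch below: a rough reconstruction from the description above, using ADO over the text driver. The actual attachment may differ, and the folder handling is an assumption:

Sub GetFilteredData()
    Dim sFile As Variant, sPath As String, sName As String
    Dim ws As Worksheet

    ' Ask the user for a CSV file
    sFile = Application.GetOpenFilename("CSV files (*.csv), *.csv")
    If VarType(sFile) = vbBoolean Then Exit Sub    ' user cancelled

    sPath = Left(sFile, InStrRev(sFile, "\"))          ' folder for Dbq
    sName = Mid(sFile, InStrRev(sFile, "\") + 1)       ' file = table name

    ' New sheet for the filtered output
    Set ws = ActiveWorkbook.Worksheets.Add

    With CreateObject("ADODB.Recordset")
        .Open "SELECT * FROM `" & sName & "` WHERE [DrillingDate] " & _
              "Between #11/12/1978# And #6/24/1984#", _
              "Driver={Microsoft Text Driver (*.txt; *.csv)};Dbq=" & sPath, 3
        ws.Cells(1).CopyFromRecordset .DataSource
        .Close
    End With
End Sub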

JKwan
02-11-2016, 07:43 PM
@Sam - like I said earlier, I fall asleep when I load the data file....
That is why I wanted to pre-filter the data. I know loading the data file will take time; I have accepted that.

I trend the data semi-annually, looking at various year ranges

@P45 - thanks for looking at it, I will take a look tomorrow.

SamT
02-11-2016, 08:16 PM
I trend the data semi-annually, looking at various year ranges
Why not break the data into decade blocks, one file per decade? That would take little more time than parsing the entire file once. Just a guess, but splitting it into years might only take twice as long as parsing it once, and you would never again have to take more than a few seconds to read in ten or twenty files. A rough sketch of the decade idea follows below.

On the other hand, putting it in a database would take about the same setup time.
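
A rough sketch of the decade split, assuming the drilling date is the third comma-separated field and the paths are placeholders (both assumptions):

Sub SplitByDecade()
    ' One output CSV per decade, so later loads only need to read the
    ' files covering the years of interest.
    Dim fso As Object, tsIn As Object, dOut As Object
    Dim sLine As String, aFields() As String
    Dim iDecade As Long, k As Variant

    Set fso = CreateObject("Scripting.FileSystemObject")
    Set dOut = CreateObject("Scripting.Dictionary")        ' decade -> TextStream
    Set tsIn = fso.OpenTextFile("c:\temp\WellList.csv", 1) ' 1 = ForReading

    Do While Not tsIn.AtEndOfStream
        sLine = tsIn.ReadLine
        aFields = Split(sLine, ",")
        If UBound(aFields) >= 2 Then
            If IsDate(aFields(2)) Then
                ' 1978 -> 1970, 1984 -> 1980, etc.
                iDecade = (Year(CDate(aFields(2))) \ 10) * 10
                If Not dOut.Exists(iDecade) Then
                    dOut.Add iDecade, fso.CreateTextFile("c:\temp\Wells_" & iDecade & "s.csv", True)
                End If
                dOut(iDecade).WriteLine sLine
            End If
        End If
    Loop

    tsIn.Close
    For Each k In dOut.Keys
        dOut(k).Close
    Next k
End Sub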

snb
02-12-2016, 02:16 AM
Sub M_snb()
    ' "dd-mm-yyyy" is a placeholder: Filter keeps only the lines containing that substring
    sn = Filter(Split(CreateObject("scripting.filesystemobject").OpenTextFile("G:\OF\sample.csv").ReadAll, vbCrLf), "dd-mm-yyyy")
    Cells(1).Resize(UBound(sn) + 1) = Application.Transpose(sn)
End Sub
or

Sub M_snb()
    Open "G:\OF\sample.csv" For Input As #1
    sn = Filter(Split(Input(LOF(1), #1), vbCrLf), "dd-mm-yyyy")
    Close #1
    Cells(1).Resize(UBound(sn) + 1) = Application.Transpose(sn)
End Sub

If Transpose can't handle this (applies to both approaches):

Sub M_snb()
    With CreateObject("scripting.filesystemobject")
        .CreateTextFile("G:\OF\Filtered.csv").Write Join(Filter(Split(.OpenTextFile("G:\OF\sample.csv").ReadAll, vbCrLf), "dd-mm-yyyy"), vbCrLf)
    End With
    Workbooks.Open "G:\OF\Filtered.csv"
End Sub

or something like this (classic ADO):


Sub M_snb()
    With CreateObject("ADODB.Recordset")
        .Open "SELECT * FROM `sample.csv` WHERE [DrillingDate] < #12/01/1980#", "Driver={Microsoft Text Driver (*.txt; *.csv)};Dbq=G:\downloads\", 3    ' 3 = adOpenStatic (no named constants with late binding)
        Cells(1).CopyFromRecordset .DataSource
    End With
End Sub

JKwan
02-12-2016, 07:19 AM
@Sam - This is only one of my big data files; it is an example of "one of my problems". I have data files from SCADA that produce hourly data on well volume production vs pressure, and those are even bigger! I used the above example because that data is public. I agree with you that a database would be more manageable; however, Excel is the only tool available to me, since I am in user land.

@snb - Thanks for the different ways of filtering; I will definitely look at all of them. I think they are more flexible and very usable ways to pre-filter!

@p45 - I will look at your example as well, but I will look at snb's first, because they are very promising.

Paul_Hossler
02-12-2016, 07:21 AM
It would seem that instead of having Excel (which runs on a small computer) do the filtering, it'd be faster overall to have a SQL query on the larger DBMS computer filter out the records that are not needed, and only send the small subset down to a CSV file for Excel (a rough sketch of that follows below).


I would think that regardless of the actual technique (Excel or something else), all 900K records / 200 MB would still need to be 'looked at' on the smaller PC.
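
A minimal sketch of that server-side idea, with an invented SQL Server name and database purely for illustration; only the connection string really differs from the text-driver examples elsewhere in this thread:

Sub PullFilteredFromServer()
    ' Let the server do the filtering and only ship the subset down.
    ' Server, database, and table names below are invented.
    With CreateObject("ADODB.Recordset")
        .Open "SELECT * FROM WellList WHERE DrillingDate BETWEEN '1954-01-01' AND '1956-12-31'", _
              "Provider=SQLOLEDB;Data Source=MyServer;Initial Catalog=Wells;Integrated Security=SSPI;", 3
        Cells(1).CopyFromRecordset .DataSource
    End With
End Sub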

SamT
02-12-2016, 07:49 AM
I have data files from SCADA that produce hourly data on well volume production vs pressure, and those are even bigger! I used the above example because that data is public.
Can you run the loading at night so that it is ready for you in the morning? I don't think we're going to be able to shorten the actual loading of the files much, although snb writes the fastest code around.

Are you dealing with "raw" SCADA data?
I ask because it probably has tons of info you don't need, and it might be most efficient to have an old ($300) Linux machine do the preliminary filtering. I believe a simple grep script would be blazing fast ("blazing" being a relative term). In any case it would be parallel work and not tie up your main box.

I'll ask a buddy of mine who is into Linux for his opinion.

JKwan
02-12-2016, 08:06 AM
Here is an update:
1 - It seems that the FSO approach is still "dog" slow; the good news is that it is faster to pre-filter and then load. I imagine it is slow because it scans the whole input to apply the filter.
2 - ADO is by far the fastest and most manageable; I can change the load just by altering my SQL, which is very handy. I can select a true date range (and pretty fast).
3 - p45's method works relatively quickly; however, snb's ADO method is more flexible and faster still.

Thanks everyone for the help; by the way, this really cuts down on my napping :-)

Paul_Hossler
02-12-2016, 08:19 AM
I'm not familiar with ADO, but it seems this still requires the 200 MB CSV file to be extracted and downloaded to the PC, correct?



Sub M_snb()
    With CreateObject("ADODB.Recordset")
        .Open "SELECT * FROM `sample.csv` WHERE [DrillingDate] < #12/01/1980#", "Driver={Microsoft Text Driver (*.txt; *.csv)};Dbq=G:\downloads\", 3    ' 3 = adOpenStatic
        Cells(1).CopyFromRecordset .DataSource
    End With
End Sub

p45cal
02-12-2016, 08:24 AM
3 - p45's method works relatively quickly; however, snb's ADO method is much more flexible and faster still.
Could we see the code that you finally used with the ADO method (I was working on that too)?

JKwan
02-12-2016, 09:28 AM
This is the base of my SELECT. I will be asking for a date range rather than hardcoding one (which is simple); see the sketch after the code.

Sub Sample4()
    With CreateObject("ADODB.Recordset")
        .Open "SELECT * FROM WellList.csv WHERE " & _
              "[DrillingDate] Between #1/1/1954# And #12/31/1956#", _
              "Driver={Microsoft Text Driver (*.txt; *.csv)};" & _
              "Dbq=c:\temp\", 3    ' 3 = adOpenStatic (late binding)
        Cells(1).CopyFromRecordset .DataSource
    End With
End Sub
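
A minimal sketch of the non-hardcoded version, assuming the from and to dates are typed into cells A1 and B1 (a hypothetical location):

Sub Sample4Parameterised()
    ' Same query, with the range read from the sheet instead of
    ' hardcoded. Assumes the from/to dates sit in A1 and B1.
    Dim dFrom As Date, dTo As Date

    dFrom = Range("A1").Value
    dTo = Range("B1").Value

    With CreateObject("ADODB.Recordset")
        .Open "SELECT * FROM WellList.csv WHERE [DrillingDate] " & _
              "Between #" & Format(dFrom, "mm/dd/yyyy") & "# " & _
              "And #" & Format(dTo, "mm/dd/yyyy") & "#", _
              "Driver={Microsoft Text Driver (*.txt; *.csv)};" & _
              "Dbq=c:\temp\", 3    ' 3 = adOpenStatic
        Cells(1).CopyFromRecordset .DataSource
    End With
End Sub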

SamT
02-12-2016, 09:57 AM
:thumb

snb
02-12-2016, 10:48 AM
In this case I'd prefer 2 variables:


Sub M_snb()
    t_1 = CLng(DateSerial(1954, 1, 1))      ' dates as serial numbers; Jet compares them to the date column directly
    t_2 = CLng(DateSerial(1975, 12, 31))

    With CreateObject("ADODB.Recordset")
        .Open "SELECT * FROM `sample.csv` WHERE [DrillingDate] Between " & t_1 & " And " & t_2, "Driver={Microsoft Text Driver (*.txt; *.csv)};Dbq=G:\downloads\", 3    ' 3 = adOpenStatic
        Cells(1).CopyFromRecordset .DataSource
    End With
End Sub