PDA

View Full Version : Automatically split a large text file in separate worksheets on import



norway
07-06-2010, 01:33 AM
Hi
this is my first post-and appologies if this has already been solved.
I've searched - but not found any answers.

Anyway - here is my problem.
I have a very large txt file (500 000 lines) and excel 2003.
The file does however contain information which can be used to split the date into separate worksheets.

All information is in a list form
.domain Norway
.set Oslo
Alpha=0
Beta=1,2,3,4
Charlie=6,7,8,9
delta=1,2,2
string="this is just a sample string saying I can have long test strings"
.set Oslo ZZ
.set Bergen
Alpha = 2
Beta = 2,4
.
.
.
.set Bergen ZZ
.domain Germany
.set Berlin
Yankee=2,2,3
Zulu= 3,4
text="hallo hallo"
.set Berlin ZZ
.set Hamburg
.
.
.
.
and so on.

the identifier for splitting into sheets are the .domain identifier
and the name of the sheet is ofcourse the text following this identifier.

Problem 1 is thus to import a text file and split it into different worksheets- all labeled from the values in .domain.

Problem 2: Within a domain - I have sereral subsets of date.
each identified by .set <name> and ended with .set <name> ZZ
these data I would like to have in a table format (horisontal) like this:
name Alpha Beta Charlie
Oslo 0 1,2,3,4 6,7,8,9
Bergen 2 2,4 .................and so on


Does anyone have any good tips for how I can achieve this using VB?

Thanks in advance
André

GTO
07-06-2010, 01:41 AM
Greetings André,

First and formost, welcome to vbaexpress!

As to the issue at hand, the data certainly seems obtuse enough. Could you zip an example textfile? It wouldn't need to be 500k lines worth, but enough of a sampling to cover variations in records.

Mark

PS - a wb showing how a record should look 'After' would probably be a good idea.

norway
07-06-2010, 02:04 AM
Thanks

The data is sensitive -so i made a small sample.
I made 2 domains with 2 subsets.
in real life I have 30+ domains- and may have 50 subsets.
The number of parameters in a subset can be over 100.

the resulting excel should give:
sheet name = .domain

the columns in the sheet should be:
.domain -> .set -> and then the different parameters (alfa-bravo-charlie etc)

Note: within a domain the paramters are constant - but not between domains.

Br
André

GTO
07-06-2010, 04:21 AM
Hi again,

Maybe just the blonde guy, but I'm not sure whether we'll cover all possibilities with such and abbreviated sample. That said, could you put this data in a worksheet, layed out like we want the result?

.domain Norway
.set Oslo
Alpha=0
Beta=1,2,3,4
Charlie=6,7,8,9
delta=1,2,2
string="this is just a sample string saying I can have long test strings"
.set Oslo ZZ

Again, maybe just me, but I'm not getting a clear picture from:

these data I would like to have in a table format (horisontal) like this:
name Alpha Beta Charlie
Oslo 0 1,2,3,4 6,7,8,9
Bergen 2 2,4 .................and so on

Thanks,

Mark

norway
07-06-2010, 05:20 AM
sorry- I saw i had made a fault in the text file as well.
Here is a simple excel along with the corrected txt file.

/André

GTO
07-06-2010, 11:40 AM
Okay, here is a first shot. Please test well and analyze results, more for accuracy than speed, as I'm sure this is not stellar efficiencey-wise. (I am afraid you are victim to my first try at a jagged array)

In a Standard Module:


Option Explicit

Sub exa()
Dim FSO As Object 'FileSystemObject
Dim fsoTStream As Object 'TextStream

Dim _
aryLineData As Variant, _
CountryData As Variant, _
strLine As String, _
strFullNam As String, _
bolNewSheet As Boolean

strFullNam = Application.GetOpenFilename(FileFilter:="Text Files (*.txt), *.txt", _
Title:="Choose File", _
MultiSelect:=False)
If strFullNam = "False" Then Exit Sub

Set FSO = CreateObject("Scripting.FileSystemObject")
Set fsoTStream = FSO.OpenTextFile(Filename:=strFullNam, IOMode:=1)

With fsoTStream

ReDim aryLineData(0 To 0)

Do While Not fsoTStream.AtEndOfStream

strLine = .ReadLine
RESTART:
If strLine Like "*.domain *" And Not bolNewSheet Then
ReDim Preserve aryLineData(1 To UBound(aryLineData, 1) + 1)
bolNewSheet = True
aryLineData(UBound(aryLineData, 1)) = strLine
ElseIf strLine Like "*.domain *" And bolNewSheet Then
CountryData = BuildSheet(aryLineData)
CountryData = RetSheet(CountryData)
Call OutputSheet(CountryData)
ReDim aryLineData(0 To 0)
bolNewSheet = False
GoTo RESTART
ElseIf fsoTStream.AtEndOfStream And bolNewSheet Then
CountryData = BuildSheet(aryLineData)
CountryData = RetSheet(CountryData)
Call OutputSheet(CountryData)
ReDim aryLineData(0 To 0)
bolNewSheet = False
ElseIf bolNewSheet Then
ReDim Preserve aryLineData(1 To UBound(aryLineData, 1) + 1)
aryLineData(UBound(aryLineData, 1)) = strLine
End If
Loop
fsoTStream.Close
End With
End Sub

Sub OutputSheet(Data)
Dim wks As Worksheet

With ThisWorkbook
Set wks = .Worksheets.Add(, .Worksheets(.Worksheets.Count))
End With
With wks
.Name = Data(2, 1)
.Range("A1").Resize(UBound(Data, 1), UBound(Data, 2)).Value = Data
.Range("A1").Resize(UBound(Data, 1)).Font.Bold = True
With .Range("A1").Resize(, UBound(Data, 2))
.Font.Bold = True
.EntireColumn.AutoFit
End With
End With
End Sub

Function BuildSheet(InputArray As Variant) As Variant()
Dim _
strCountry(1 To 1) As String, _
aryCity As Variant, _
aryParameter As Variant, _
aryParaData As Variant, _
aryParaArray As Variant, _
aryTemp As Variant, _
i As Long, _
bolNewCountry As Boolean

ReDim aryCity(0 To 0)
ReDim aryParameter(0 To 0)
ReDim aryParaData(0 To 0)
ReDim aryParaArray(0 To 0)

strCountry(1) = _
RetClean(Mid(InputArray(LBound(InputArray)), _
InStr(1, InputArray(LBound(InputArray)), ".domain") + 7 _
), "\b[A-Za-z]+\b\ {0,1}" _
)

bolNewCountry = True

For i = LBound(InputArray) + 1 To UBound(InputArray)

If InputArray(i) Like "*.set*ZZ*" Then
bolNewCountry = False
ReDim Preserve aryParaArray(1 To UBound(aryParaArray) + 1)
aryParaArray(UBound(aryParaArray)) = aryParaData
ReDim aryParaData(0 To 0)
ElseIf InputArray(i) Like "*.set*" Then
ReDim Preserve aryCity(1 To UBound(aryCity) + 1)

aryCity(UBound(aryCity)) = _
RetClean(Mid(InputArray(i), InStr(1, InputArray(i), ".set") + 4), _
"\b[A-Za-z]+\b\ {0,1}" _
)

ElseIf InputArray(i) Like "*=*" Then

aryTemp = Split(InputArray(i), "=")
If bolNewCountry Then
ReDim Preserve aryParameter(1 To UBound(aryParameter) + 1)
aryParameter(UBound(aryParameter)) = aryTemp(0)
ReDim Preserve aryParaData(1 To UBound(aryParaData) + 1)
aryParaData(UBound(aryParaData)) = aryTemp(1)
Else
ReDim Preserve aryParaData(1 To UBound(aryParaData) + 1)
aryParaData(UBound(aryParaData)) = aryTemp(1)
End If
End If
Next

BuildSheet = Array(strCountry, aryCity, aryParameter, aryParaArray)
End Function

Function RetSheet(InputArray)
Dim _
AllData As Variant, _
i As Long, _
x As Long

ReDim AllData(1 To _
UBound(InputArray(LBound(InputArray) + 1)) - _
LBound(InputArray(LBound(InputArray) + 1)) + 2, _
1 To _
UBound(InputArray(LBound(InputArray) + 2)) - _
LBound(InputArray(LBound(InputArray) + 2)) + 3)

AllData(1, 1) = "Country"
AllData(1, 2) = "City"

For i = 2 To UBound(AllData, 1)
AllData(i, 1) = InputArray(LBound(InputArray))(1)
Next

For i = 1 To UBound(InputArray(LBound(InputArray) + 1))
AllData(i + 1, 2) = InputArray(LBound(InputArray) + 1)(i)

Next

For i = 1 To UBound(InputArray(LBound(InputArray) + 2))
AllData(1, i + 2) = InputArray(LBound(InputArray) + 2)(i)
Next

For i = 1 To UBound(InputArray(LBound(InputArray) + 3))

For x = 1 To UBound(InputArray(LBound(InputArray) + 3)(i))
AllData(i + 1, x + 2) = InputArray(LBound(InputArray) + 3)(i)(x)
Next
Next

RetSheet = AllData
End Function

Function RetClean(ByVal InputText As String, sPattern As String) As Variant
Static REX As Object 'RegExp
Dim rexMatches As Object 'MatchCollection
Dim rexMatch As Object 'Match

If REX Is Nothing Then
Set REX = CreateObject("VBScript.RegExp")
End If

With REX
.Global = True
.Pattern = sPattern
If .Test(InputText) Then

Set rexMatches = .Execute(InputText)
For Each rexMatch In rexMatches
RetClean = RetClean & rexMatch.Value
Next
RetClean = Trim(RetClean)
End If
End With
End Function

Hope that helps,

Mark

norway
07-07-2010, 12:26 AM
Thanks :-)
I am reading the code now and testing it on my real data.
I do however get a error message "400" ?

To a novice - not the most informative response - but I'll try to narrow it down by testing subsets of the data set.

Your program seemed to do the trick with the sample file though - good work!

/André

norway
07-07-2010, 12:52 AM
found an error with importing the sample2 text file.
the values in the last column - there the commas are not kept

GTO
07-07-2010, 01:12 AM
found an error with importing the sample2 text file.
the values in the last column - there the commas are not kept

Hi Andre,

Where? This is what I get:

norway
07-07-2010, 01:53 AM
Hi
I've played around with the code and tested a bit.
I've updated the sample text file with a few more cases - sample3
The script fails with the following values.

Rock=?
Pop="error when having = sign in text string, I can have several = signs inside the exclamation marks"

and then there is the odd fault that the last column did not have commas "," in the sample2 testfile import.

Any chance you can help me fix these bugs?

thanks in advance
/André

norway
07-07-2010, 01:57 AM
Hmm
how strange-here is the excel I get


<img alt="">

norway
07-07-2010, 01:59 AM
I use a norwegian keyboard and number formatting
Don't see why it should matter - as it correctly displays the commas in the earlier cells- but you never know

GTO
07-07-2010, 02:48 AM
Hi
I've played around with the code and tested a bit.
I've updated the sample text file with a few more cases - sample3
The script fails with the following values.

Rock=?
Pop="error when having = sign in text string, I can have several = signs inside the exclamation marks"

Try adding the Limit arg as shown, this is in 'Function BuildSheet'


ElseIf InputArray(i) Like "*=*" Then
'// add the Limit arg>>>v
aryTemp = Split(InputArray(i), "=", 2)
If bolNewCountry Then
ReDim Preserve aryParameter(1 To UBound(aryParameter) + 1)
aryParameter(UBound(aryParameter)) = aryTemp(0)
ReDim Preserve aryParaData(1 To UBound(aryParaData) + 1)
aryParaData(UBound(aryParaData)) = aryTemp(1)
Else
ReDim Preserve aryParaData(1 To UBound(aryParaData) + 1)
aryParaData(UBound(aryParaData)) = aryTemp(1)
End If
End If
Next

BuildSheet = Array(strCountry, aryCity, aryParameter, aryParaArray)
End Function


I use a norwegian keyboard and number formatting
Don't see why it should matter - as it correctly displays the commas in the earlier cells- but you never know

As to the commas disappearing, I do not at this point think it to be code related. After running the code, select a cell in the last column where the commas have disappeared. Check the number formatting of the cell and see if Excel has changed it to something besides 'General'. Has Excel coerced it to something else?

norway
07-07-2010, 03:08 AM
I'll check - and add this new code.
I found another bug
.domain <name>
if the <name> for instance is bergen_city - it will fail.

norway
07-07-2010, 03:18 AM
Hi
the changes to the code fixes the problem with severeal "=" (equal) signs- but not the '?' problem.

And- yes- it changed the format to number format.

GTO
07-07-2010, 03:24 AM
I'll check - and add this new code.
I found another bug
.domain <name>
if the <name> for instance is bergen_city - it will fail.

Andre,

Okay, so far we need to fix around multiple equal signs and there's a problem with commas disappearing. Could we get one thing worked out before adding more?

The above is a good example of why better, more complete test data is needed. Whilst I would 'challenged' at coming up with any particularly advanced .Pattern for RegExp, there are those here who could. But I nor anyone can develop rules for the code without being shown all the possibilities. Does that make sense?

Mark

Edit: What problem for the question mark exists? I return a '?' under volare/Berlin as expected.

norway
07-07-2010, 04:03 AM
Hi Mark
yep,
I anticipated that comment when I detected the problem and i can not say anything else than that you are right.
sorry- I missed that in the samples I made.

I have been studying the data and it seems I have more than 256 data elements for the group I've called city (bergen, oslo etc)
So when i deleted the '?' (including the parameter name) - it worked (since I now could fit the data into excel 2003).

:think:
I need to think a bit how to solve that problem.

GTO
07-07-2010, 04:22 AM
...I have been studying the data and it seems I have more than 256 data elements for the group I've called city (bergen, oslo etc)
So when i deleted the '?' (including the parameter name) - it worked (since I now could fit the data into excel 2003).

Ahh... that makes sense of course.

norway
07-07-2010, 04:27 AM
Hi Mark
is there any way I can make a prefix filter on the parameters to be loaded into the table?
I've made a new text file and added a GO900 series of parameters.
.set Oslo
Alpha="hello"
Bravo=1,2,3,4
Charlie=6,7,8
d=2,3,3
e=1,1,61,8
f=2,133,45
GO900_my = 12
GO900_your ="erika"
GO900_his = "party"
GO900singalong="singsong"
g=2,31,3
h=11,32,4,5
i=243,228,312
.set OSlo ZZ

All parameters starting with GO900 I could then opt to not include in the excel table - as they are not relevant.
Then I could get around the limitations in excel 2003

Thanks

/André

GTO
07-07-2010, 05:11 AM
Andre,

The new textfile still only has 50-60 lines. It seems unlikely that it will show anywhere near a decent representation of varying patterns. As you pointed out, what happens when the "city" is 'bergen_city', it fails.

I would admit straightaway that someone used to developing jagged-arrays could no doubt bang this out quicker, and who knows, there might be a better way anyways. For all I know, someone will point out that if we just clicked on filter + some other option, Excel would do all this for us. (okay, maybe a tiny exaggeration).

I want to help, especially as I am learning along the way. It makes good practice. All that said, I have several hours in muddling my way through what is produced thus far. Please provide a more comprehensive text file. I think its a fair request; would you not agree?


Hi
the changes to the code fixes the problem with severeal "=" (equal) signs- but not the '?' problem.

And- yes- it changed the format to number format.

Did you come up with a solution as to Excel's coercing a change as to the number formatting?


Hi Mark
is there any way I can make a prefix filter on the parameters to be loaded into the table?
I've made a new text file and added a GO900 series of parameters.
.set Oslo
Alpha="hello"
Bravo=1,2,3,4
Charlie=6,7,8
d=2,3,3
e=1,1,61,8
f=2,133,45
GO900_my = 12
GO900_your ="erika"
GO900_his = "party"
GO900singalong="singsong"
g=2,31,3
h=11,32,4,5
i=243,228,312
.set OSlo ZZ

All parameters starting with GO900 I could then opt to not include in the excel table - as they are not relevant.
Then I could get around the limitations in excel 2003

This sounds like a possibility, but I would like to follow the logic a bit better before trying. Off the top, other than opening the file and running thru with a couple of counters until EOF, how would you know whether a certain parameter's disclusion would make the difference?

In about post#3, you mentioned the subsets (cities, right?) being around 50, and maybe 100 parameters. If the parameters exceed 256, have you looked at the 'cities'/subsets? Might we transpose the sheet layout?

Mark

norway
07-07-2010, 05:42 AM
Hi Mark
I can see if i can find a way to make the data anonymous- by deleting all parameters with "GO900" (an example of course) from the dataset -and removing all the underscores in the .domain file - I was able to load the entire structure.
There are still glitches - but it is getting there.
With regards to format - I am reading up on using the style.Numberformat parameter (but as i am just a beginner - it is a lot of reading :-)

I've also noted that as with the issue with "underscore" - it is also valid for parameter values containing ":"
seems like the logic copes with values that are in "" - if not - it stops if it is a value for domain- but leaves it blank if it is a parameter value (not used to name the sheet)


Parameter disclusion: The dataset I have should contain all the paramters- the increase will come in the number of data-sets within the domain (.set - here the cities)

Once I now got the data through the program (with a few missing bits and pieces) - I could look at the .set quantities.
So far the largest domain contains over 4500 subsets - so removing some parameters is the way to go I think.

Thanks
/André

GTO
07-07-2010, 06:10 AM
Hi Mark
I can see if i can find a way to make the data anonymous- by deleting all parameters with "GO900" (an example of course) from the dataset -and removing all the underscores in the .domain file - I was able to load the entire structure.

Okay, but we can include underscores or whatever, we need to change the .Pattern to handle various anomalies. The downside is the wackier and more anomalies that exist, the harder it is to get the pattern to parse stuff correctly.



There are still glitches - but it is getting there.
With regards to format - I am reading up on using the style.Numberformat parameter (but as i am just a beginner - it is a lot of reading :-)

I would look at changing the numberformat of the range the array is getting plunked into, to Text (ie "@").

norway
07-07-2010, 06:12 AM
Hi Mark,
I did the following changes to my datatable
1. removed all GP900 paramters (so each subset is less than 256)
2. removed all underscores in .domain names

it then loaded :-D - and it was quite fast !:clap:

Of course there are still bugs in here - but this is progress :-)

I could now read the data and make heads and tails out of it

Transposing:
nopes- not an option. One of the domains had over 4500 subsets.

another apparent problem is ":"
all parameter values containing ":" are neglected it seems -the cells are blank in excel

In terms of changing the style: I am reading up on style.numberformat - hoping I can incorporate this in any way. Being new to VB it does take some time to figure it out though.
Another way is to send all data to the cells wiht <'> in front- it should then be forced to interpret it as text.
Then the exclamation marks could be removed as well.
When I did the full table loading i did see that it happends also in the middle of large tables and not only at the end of the table.

examplefile: I'll look at anonymizing more data

norway
07-12-2010, 12:52 AM
Hi Mark,
I've been working wiht the code and have sorted some of the issues.
Quite educational to sit and go through the code like this :-)
Am I correct in reading that this segment is the logic for creating new sheets?
Sub OutputSheet(Data)
Dim wks As Worksheet

With ThisWorkbook
Set wks = .Worksheets.Add(, .Worksheets(.Worksheets.Count))
End With
With wks
.Name = Data(2, 1)
.Range("A1").Resize(UBound(Data, 1), UBound(Data, 2)).Value = Data
.Range("A1").Resize(UBound(Data, 1)).Font.Bold = True
With .Range("A1").Resize(, UBound(Data, 2))
.Font.Bold = True
.EntireColumn.AutoFit
End With
End With
End Sub

GTO
07-12-2010, 02:06 AM
Hi Andre,

Yes, that is correct. See if this helps:


'// Pass the two-dimensional array 'CountryData with the info from that country //
Sub OutputSheet(Data)
Dim wks As Worksheet

'// Set a reference to a newly created worksheet, which is placed at the tail //
'// end (rightmost) //
With ThisWorkbook
Set wks = .Worksheets.Add(, .Worksheets(.Worksheets.Count))
End With

With wks
'// Name the sheet //
.Name = Data(2, 1)
'// Plunk the data into a range (re)sized equally to the array //
.Range("A1").Resize(UBound(Data, 1), UBound(Data, 2)).Value = Data
'// Make the first column bold //
.Range("A1").Resize(UBound(Data, 1)).Font.Bold = True
With .Range("A1").Resize(, UBound(Data, 2))
'// Make the first row bold and autofit the columns in our range //
.Font.Bold = True
.EntireColumn.AutoFit
End With
End With
End Sub

By the way, when posting code, please use the green/white VBA button atop the message box. This will place '...Your code here tags, between which you can paste your code.

Mark