PDA

View Full Version : Converting 1 million Rows of data to new Tables and Columns



collector00
02-07-2015, 06:40 AM
Hi, thank you for reading this. I have a database that is not at all structured correctly but is the only way we can receive the data. The table has over a million rows and consists of the following columns:

dataID - just a unique ID; not used by us
system_id - the system the question is valid for
siteName - name of the site
siteAddress - address of site
siteType - type of site
QLevel - level of question
fieldQuestion - question label
fieldAnswer - answer
questionID - question number

dataID | system_id | siteName | siteAddress | siteType | QLevel | fieldQuestion | fieldAnswer | questionID
1234 | 12649 | ABC Location | 123 Main St | Building | 1 | This is a question | This is their answer to77 | 77
1234 | 12649 | ABC Location | 123 Main St | Building | 2 | This is a child question1 | This is their answer to78 | 78
1234 | 12649 | ABC Location | 123 Main St | Building | 3 | This is a child question2 | This is their answer to79 | 79

Each question has been grouped together, so for the above example 77,78,79 would go into a specific table but 80-99 may be in another table and so on. I did this to clean-up the tables and also because I didn't want to put hundreds of questions into a single table. So what I would like when all is said and done are tables that look like this:

system_id | question_77 | question_78 | question_79
12649 | This is their answer to77 | This is their answer to78 | This is their answer to79

(for reference; system_id is a number, and all question fields are memo. Added pipes just to show separation of data)

The other tables would look similar but with their own set of questions.

What I first did was to create all of the new tables with their corresponding "question_#" fields. Each questionID corresponds to a specific table and not all are sequential.

Now I need to move the data from the original table to the new recipient tables. Below is the VBA. However, when I run it I tend to get different issues at different times. I have never successfully gotten all the way through to the end. Sometimes I can get 100K+ through and then it errors out other times only a couple thousand and it errors out. The error is typically 'Run-time error '3049': Cannot open database ". It may be a database that your application recognizes, or the file may be corrupt.' (again almost at random). Below is the VBA that I wrote to move everything to its correct table:



Private Sub Command75_Click()
Dim dbs As DAO.Database
Dim myOriginalTable As Recordset
Dim myRecipientTable As Recordset
Dim newQuestionNum As String
'
Set myOriginalTable = CurrentDb.OpenRecordset("originaltable", dbOpenDynaset)
'
Do Until myOriginalTable.EOF = True
'
If myOriginalTable![questionID] = 77 Then Set myRecipientTable = CurrentDb.OpenRecordset("tableA", dbOpenDynaset)
If myOriginalTable![questionID] = 78 Then Set myRecipientTable = CurrentDb.OpenRecordset("tableA", dbOpenDynaset)
If myOriginalTable![questionID] = 79 Then Set myRecipientTable = CurrentDb.OpenRecordset("tableA", dbOpenDynaset)
If myOriginalTable![questionID] = 80 Then Set myRecipientTable = CurrentDb.OpenRecordset("tableB", dbOpenDynaset)
If myOriginalTable![questionID] = 85 Then Set myRecipientTable = CurrentDb.OpenRecordset("tableC", dbOpenDynaset)
If myOriginalTable![questionID] = 87 Then Set myRecipientTable = CurrentDb.OpenRecordset("tableB", dbOpenDynaset)
If myOriginalTable![questionID] = 92 Then Set myRecipientTable = CurrentDb.OpenRecordset("tableD", dbOpenDynaset)
'
' The above continues for every single question # so that each one will be placed
' in the correct table; there are a couple hundred and not all sequential
'
' Convert number to a field name
'
newQuestionNum = "question_" & myOriginalTable![questionID]
'
' Determine whether this systemID exists in new table
'
myRecipientTable.FindFirst ("[system_id] = " & myOriginalTable![system_id])
'
If myRecipientTable.NoMatch Then
'
' systemID does not exist in new table, create it
'
If Len(myOriginalTable![fieldAnswer]) > 1 Then
myRecipientTable.AddNew
myRecipientTable![system_id] = myOriginalTable![system_id]
myRecipientTable(newQuestionNum).Value = myOriginalTable![fieldAnswer]
myRecipientTable.Update
End If
Else
'
' systemID exists in new table, update it
'
If Len(myOriginalTable![fieldAnswer]) > 1 Then
myRecipientTable.Edit
myRecipientTable(newQuestionNum) = myOriginalTable![fieldAnswer]
myRecipientTable.Update
End If
End If
'
myOriginalTable.MoveNext
End If

Loop

MsgBox "Complete!"
GoTo Cleanup

Cleanup:
myOriginalTable.Close
myRecipientTable.Close
Set myRecipientTable = Nothing
Set dbs = Nothing

End Sub



I added a status bar update but removed it in the above code. It is useful because it tells me whether the process is still running (but does sometimes freeze although it appears to still be running UGH).

The above basically checks to see if the systemID exists; if it does it overwrites the answer for that question with the new data. If it does not it simply adds the new systemID with the corresponding answer in the appropriate question column.

This is also a slow process. I'm hoping someone could tell me whether I'm doing this right or perhaps there is a simpler and quicker way to do this. Since the data is all over the place I figured the above would be the best way to do it. I'm also getting this large file on a weekly basis so I want to have a consistent method to process it and hopefully a reliable and fast way to process it.

PLEASE LET ME KNOW YOUR THOUGHTS!!! I really like to learn from examples and other people's experiences. If I'm doing it wrong or maybe it just needs a tweek, I appreciate any and all feedback.

Jon

jonh
02-09-2015, 10:07 AM
IMHO, the original table design looks better.

Data should be stored in more rows, less columns, not vice versa.

collector00
02-09-2015, 12:45 PM
IMHO, the original table design looks better. Data should be stored in more rows, less columns, not vice versa. I don't necessarily disagree. I guess my issue is that when we go to create reports we want them in columns not individual rows. A single person may be responsible for hundred of IDs. Keeping them in rows there could be hundred of rows just for a single ID. Having them in columns allows a person to scan hundreds of IDs at a time rather than hundreds of questions for the same ID. Hopefully that makes sense. Perhaps there is a way to keep the data as is and just create a report/spreadsheet that is user friendly.

jonh
02-09-2015, 04:50 PM
Do what you need to do to get the job done. But (good) database design is usually based on normalising the data. The only thing i see wrong with your example data is that the data in the first five fields is duplicated.

In your design, if you decide certain questions are no longer required, are you going to delete those fields and lose the existing data? Or are you going to leave them in and have surplus fields taking up valuable space in your file and possibly slowing down queries / reports?

If somebody wants to see answers to questions spanning multiple tables what do you do?

Neither of those are issues in the first design.

The first priority of table design is ease of data retrieval, and the changes you propose don't appear to accomplish that.

Not to mention, the last thing you want to add to an Access database is a memo field. They are notorious for corrupting databases and the more you have the more likely you'll encounter problems.

Best option is to post up an example of the kind of report you want and we'll help you create it.

collector00
02-11-2015, 12:44 PM
Do what you need to do to get the job done. But (good) database design is usually based on normalising the data. The only thing i see wrong with your example data is that the data in the first five fields is duplicated.

Understood. The goal is really for MS Access to be a repository for the data so that we can create a report from that data. As I said before, it is must easier for someone to look at a spreadsheet with several columns than it would to look vertical. In other words, if you were only interested in reviewing question_77, you would simply span that column. In your proposed method it would need to be sorted, etc. The end users will not be using MS Access to view the data.


If somebody wants to see answers to questions spanning multiple tables what do you do?

That is simple. A query from multiple tables all linked back by the system_id.

SELECT [Site List].[system_id], tableA.question_77, tableC.question_223 (and so on)

We know which questions we would want and we know which table they exist in.


Neither of those are issues in the first design.

Perhaps. The first design does not place them into columns as we want them. It just comes down to that and placing them into the new tables does accomplish this (but perhaps you have a solution or recommendation that does not consist of remarking about the poor design; I get that).


Not to mention, the last thing you want to add to an Access database is a memo field. They are notorious for corrupting databases and the more you have the more likely you'll encounter problems.

May be true but one must try and fail to learn.


Best option is to post up an example of the kind of report you want and we'll help you create it.

I thought I had already but essentially we're looking for the report to be column-based to be used in Excel.

system_id | question_77 | question_78 | question_79
12649 | This is their answer to77 | This is their answer to78 | This is their answer to79

jonh
02-12-2015, 03:34 AM
This will try to use any table where the name begins "table".
So if you have other tables called "tablewhatever" you'll need to put this in a new db and link in the ones you want to process.


Private Sub Command0_Click()
Dim db As DAO.Database
Dim rs As DAO.Recordset
Dim t As DAO.TableDef
Dim f As DAO.Field
Dim qid As Integer

Set db = CurrentDb

Set rs = db.OpenRecordset("select name from msysobjects where name like 'table*' and type=1")

Do Until rs.EOF
db.Execute "delete * from " & rs(0)
db.Execute "INSERT INTO " & rs(0) & " ( system_id ) " & _
"SELECT distinct originaltable.system_id FROM originaltable"
Set t = db.TableDefs(rs(0))
For Each f In t.Fields
If f.Name <> "system_id" Then
qid = CInt(Mid(f.Name, InStr(1, f.Name, "_") + 1))
db.Execute "UPDATE " & rs(0) & " INNER JOIN originaltable " & _
" ON originaltable.system_id = " & rs(0) & ".system_id SET " & _
rs(0) & ".question_" & qid & " = [fieldanswer] WHERE originaltable.questionID=" & qid
End If
Next
rs.MoveNext
Loop
rs.Close
Set rs = Nothing
Set db = Nothing
End Sub

collector00
02-12-2015, 06:25 AM
First of all, thank you again for your assistance.

I ran it a few times now and thought it was running without issue but I'm getting the Type Mismatch error on this line:


qid = CInt(Mid(f.Name, Instr(1, f.Name, " ") +1))

I'm not familiar with this function so I apologize in advance. The first time that it did run with success it created all of the system_IDs within the tables but no answers were added. So I have a few tables with system_IDs that are essentially blank. I'm going to keep digging into it but if you have any suggestions please let me know. Thanks again

jonh
02-12-2015, 07:06 AM
That bit is trying to get the question number from the field name.
Do you have other fields in your tables besides system_id and the question fields (question_#) ?

edit+
try changing
If f.Name <> "system_id" Then
to
If Left(f.Name, 9) = "question_" Then

edit2+
the code you've posted has a space
qid = CInt(Mid(f.Name, Instr(1, f.Name, " ") +1))

should be an underscore ??
qid = CInt(Mid(f.Name, InStr(1, f.Name, "_") + 1))

collector00
02-12-2015, 07:35 AM
That bit is trying to get the question number from the field name.
Do you have other fields in your tables besides system_id and the question fields (question_#) ?

edit+
try changing
If f.Name <> "system_id" Then
to
If Left(f.Name, 9) = "question_" Then

edit2+
the code you've posted has a space
qid = CInt(Mid(f.Name, Instr(1, f.Name, " ") +1))

should be an underscore ??
qid = CInt(Mid(f.Name, InStr(1, f.Name, "_") + 1))

The question number is actually obtained in its own field, labeled questionID. So I think that is what I need to use rather than trying to extract the number from the question. The questions are just that.

I left the qid line the same but now that you explain that it is extracting the number may be an issue.

jonh
02-12-2015, 07:48 AM
It gets the number from the field NAME. You created the tables and set up the columns already yes?

so it gets say, 77 from the field name question_77 and builds a query around that.

collector00
02-12-2015, 07:55 AM
Gotcha, my misunderstanding. From the new tables yes, it would be question_77 for example. You are correct. I was able to successfully re-run the script and it seems to be fine now. I may have missed something on the initial programming. Thanks again for your patience and assistance.

jonh
02-12-2015, 07:59 AM
Glad to help :)