PDA

View Full Version : Automation of Data Transformation and Consolidation



LucasLondon
11-04-2009, 11:18 AM
Hello,

I'm currently doing this manually and it's very time consuming for each dataset and easy to make a mistake so I'm looking for a more robust automated procedure. I started initial work on this but realised the VBA I need is a lot more advanced than a novice like me can put together.

I have data arranged in variables (columns) in one worksheet that I want to transform into the same format as the same variables in another worksheet and then consolidate them together in terms of new values.

One sheet ("latestdata") has the raw data - dates in column A and variable names in row 1. Another sheet (Souredata) has the equivalent but transformed variables stored by column. I need to convert the data from the latestdata worksheet into the same format and combine with the sourcedata worksheet by date.

However, there are three major variations between the two sets of data in the different sheets: The example attached will make it clearer.

In souredata worksheet the data has following features:

1) Has different column headings/start names. The column takes on three variants:

a) The column names are the same between the two sheets
b) Names in the souredata sheet start with "Diff-" or "2NDDIFF" and the remainder of the variable name is the same as in the sourecdata worksheet.

I have developed a macro the converts the names in the sourcedata worksheet in the equivalent name in the latest data sheet, i.e. inserts the name in row two of the sheet removing the "Diff-" from the variable name so that I can match the names of the variables across the two sheets for consolidation. (macro is stored in the module). In the attached sheet, I have coloured each equivalent variable across the sheets by the same colour to make clear which variables are related.

2) Some of the data variables are differenced (this means the difference (delta) is calculated between the consecutive values for a given variable). Example of differencing for each variable is given in the sheet "Final Updated Souredata").

Whether a variable is differenced depends on how the variable names starts:

a) "Diff-" - means the data is differenced once, i.e. the delta is calculated, e.g. for a variable in Column B with data starting in row 5, the differenced value in row 6 would be: b6-b5

b) "2NDDIFF" means the variable is differenced twice (the delta of the delta is calculated - example given in the sheet)

c) If the column names are the same - the data is undifferenced and in the same format as the data in the latestdata worksheet

3) The data variables are also lagged, the lag of each variable is shown in row 3 of the sourcedata worksheet. So if b3 = 0, there is no lag but if B3 = 4 then the variable is lagged by 4 (shifted down by 4 from the base row - more details below)

4) The data variables are not arranged in the same order across the sheet, i.e. the columns in which each variable appears differs.

I want to use the data in latest data sheet to create a new dataset (replicate the transformations of the source datasheet) so the data is of the same format of source data sheet.

Here are the steps I'm currently doing manually:

For each variable in the latestdata worksheet:

Step 1) Find the corresponding column in the sourecdata worksheet. Check if it is differenced, if it is, take the data variable from the latestdata worksheet and create a new variable by taking the difference.

Step 2) Lookup the lag number for the variable, if it is different from zero then I would re-arrange the starting/ending position via shifting each series down by a number of cells relative to a base row (currently row 5). So if the lag number is 2, I shift the start of this series down by 2 cells from the base row (row 5).

Step 3) Copy over the newly created variable into the souredata worksheet/consolidate with the data with the existing variable data.

The above steps are shown in the attached workbook in the worksheet "Final Updated Sourcedata"). This is essentially shows what the final data in the Sourcedata worksheet would look like after applying the transformations (Columns A to D). The new added values are colured in red.

Columns F to I show the original data from the latestdata worksheet, while columns K to Q show the calculations/transformations applied to each orginal variable to create the transformed variables. So for the variable called DiffVarXYZ, in column K I take the first difference and then in column L lag the variable by 1. Then copy the values over to column B, aligning by date.

Sorry for the long post, but if someone can help me out, it would save me a lot manual work and mistake making.

Thanks,

Lucas