PDA

View Full Version : Extracting highest value correlations and series names from Correlation Matrix



Nick_London
06-15-2009, 04:27 AM
Hi,

I have calculated a correlation matrix between a data set. What I need to do is for each given series in the results is identify the highest occurring correlation coefficient (other than the correlation between the series itself, which is always one) and then log the pair with such an occurrence to another location in another sheet.

For example in the attached sheet, for the first series in column B (XOM), the highest correlation of that series is with row 3 - CVX.

So I want to store XOM CVX say in cells B21 and C21 and correlation coefficient in D21. For the next series in column C (CVX), the highest positive correlation is with row 8 (DDA). I want to store this combination in B22 and C22 and the corresponsing correlation coefficient in D22.

For the attached example, it is quite easy to do manually for a few series but my final correlation matrix features over 200 variables.

Hope someone can help.

Thanks,

Nick

p45cal
06-15-2009, 08:02 AM
In haste (ie. it could be better); see attached and click the button. In the case of exact duplicates of correlation values it'll only take the first one.
Code:Sub blah()
Dim InputRng As Range
Set InputRng = Application.InputBox(prompt:="Please select the correlation matrix EXCLUDING headers top and left", Type:=8)
Dim OutputRng
Set OutputRng = Application.InputBox(prompt:="Please select the top left cell for where the results will be..(It can be on another sheet)", Type:=8)
DestRowOffset = 0
For Each Colum In InputRng.Columns
PairY = InputRng.Parent.Cells(InputRng.Row - 1, Colum.Column).Value 'Column header
MaxVal = 0: PairX = ""
For Each cll In Colum.Cells
If Not IsEmpty(cll) And cll.Value > 0 And cll.Value <> 1 And MaxVal < Application.Max(MaxVal, cll.Value) Then
MaxVal = Application.Max(MaxVal, cll.Value)
PairX = InputRng.Parent.Cells(cll.Row, InputRng.Column - 1)
End If
Next cll
If MaxVal <> 0 Then 'add the pair to the results
OutputRng.Offset(DestRowOffset) = PairY
OutputRng.Offset(DestRowOffset, 1) = PairX
OutputRng.Offset(DestRowOffset, 2) = MaxVal
DestRowOffset = DestRowOffset + 1
End If
Next Colum
End Sub

p45cal
06-27-2009, 02:30 AM
I wonder if the OP even looked at the response.

Nick_London
06-28-2009, 07:33 AM
Hello P45Cal,

Yes I did look at response but have had to put this project on hold due to
other priorites rethinking in terms of what I want to achieve in terms of the final analysis. Anyway, I've just tested the code out on a sample of data and it works great, so thanks.

In order to get the correlations ranked, I'm simply sorted the data based on the correlation column which is fine. But I noticed that there are no negative correlations and now I am thinking this information would be usefull as well. Would it be easy to include the negative ones as well postive in the output?

p45cal
06-30-2009, 01:11 AM
I think you know, from the tone of my question, why I asked it.

You're not the only one.

When someone asks a question here, and someone clearly spends significant time giving them a solution (giving being the operative word) then the very minimum the original poster can do is acknowledge that effort in a timely fashion. If not, there are several consequences:
1. It p*sses-off the person who supplied the solution (can I use that turn of phrase here?) Why? Because it looks as if the solution provider has so totally wasted his time and effort. He doesn't know if it's been of any use to the OP; has the OP cross-posted elsewhere and already got a solution? Was the solution inadequate? Perhaps the OP has used his solution and is such an ingrate not to thank them?
2. Ultimately the OP, and anyone else posting a question, will lose out as those willing to give solutions will be discouraged from doing so. I, for one, don't even bother to read new posts from those that have treated me that way in the past (if I can remember their handle that is!). Worse, solution providers will stop even looking at forums such as these.

Please note that this isn't specifically a gripe at you, but at a significant number of others (who won't, of course, see this) who behave similarly. It's just that yours was the one that finally took it beyond my tolerance level.

In answer to your question, yes I think it would be fairly easy to include negative correlations, but until I'm sure you've taken the project off on-hold, I'm not even going to begin thinking about it.

Nick_London
08-31-2009, 07:58 AM
Hello P45Cal,

Well I just re-started focusing on this project, have been using the macro you created and now have a much better idea of my requirements, what's missing and what I need.

Seeing the negative correlations would definately be useful but the other more important thing is I just realised that there's a whole bunch of correlations that I should be looking at that are not being included in the final output.

Let me explain. In the orginal example, there are 9 combination of variables that generate postive correlations (excluding correlations of variables with themselves) yet the macro returns only 5 of these.

But in my analysis, I should be including all combinations of variables that yield positive (and now negative correlations). So in fact, all I want to do is extract all pairwise correlations accept for the self correlations - i.e the ones with the value of one

So for example, in the case of a correlation matrix of seven variables, excel will populate 28 correlation coefficents, in my output I want to see all of these accept the self correlations (the ones), i.e 21 correlation coefficents.

I've had a look at the code but really don't know where to start to modify. Would be really helpful if you could help or indicate if you will be able to.

Thanks,

Nick

p45cal
08-31-2009, 01:00 PM
So confirm that we're no longer interested in best correlations (maxima) but all correlations (excluding the 1s). If so:Sub blah()
Dim InputRng As Range
Set InputRng = Application.InputBox(prompt:="Please select the correlation matrix EXCLUDING headers top and left", Type:=8)
Dim OutputRng
Set OutputRng = Application.InputBox(prompt:="Please select the top left cell for where the results will be..(It can be on another sheet)", Type:=8)
DestRowOffset = 0
For Each cll In InputRng.Cells
If cll.Value <> 1 And cll.Value <> "" Then
OutputRng.Offset(DestRowOffset) = InputRng.Parent.Cells(cll.Row, InputRng.Column - 1).Value 'row headetr
OutputRng.Offset(DestRowOffset, 1) = InputRng.Parent.Cells(InputRng.Row - 1, cll.Column).Value
OutputRng.Offset(DestRowOffset, 2) = cll.Value
DestRowOffset = DestRowOffset + 1
End If
Next cll
End Subwhich just runs through all cells in the (square) correlation matrix and reports the value (when not 1 and not empty) along with the row and column headers for that value.

p45cal
08-31-2009, 02:01 PM
but perhaps you're looking for the likes of:


Series Correlates best with Correlation
XOM CVX 0.738976806
CVX XOM 0.738976806
XYZ ZYX 0.954513523
ZYX XYZ 0.954513523
NYZ XYZ 0.358511319
BBC CVX 0.059801189
DDA XOM 0.437733711
?

Nick_London
09-01-2009, 03:40 PM
Hi P45Cal,

Many thanks for changing the code, it gives me exactly what I requested. But yes you are quite right that ultimatley what my goal is to focus on the highest positive (and lowest negative) correlations but I was thinking that if the macro returns all the correlations coefficients I can just sort them and easily identify the highest e.g. Coefficents with a value of say above 0.20 and the negative coefficents less than -0.20.

But from what you're suggesting, it seems it would be easy to build this into the code? Although I think I need to make the threshold criteria for inclusion in the output (0.20) dynamic. By that I mean I might want to change the criteria based on each dataset that I analyse, e.g from 0.20 to 0.40.

Anyway, I can work with the current code, if it doesn't take up too much of your time then a more sophisticated solution would be most welcome.

Thanks,

Nick

p45cal
09-01-2009, 04:21 PM
re:"make the threshold criteria for inclusion in the output (0.20) dynamic.. ..criteria based on each dataset"

In words, how would you make that decision? Would it be a binary decision (either 0.2 or 0.4 and nothing else) or could it be a sliding scale value?

Nick_London
09-02-2009, 03:56 AM
Hello again,

No I didn't envisage a binary threshold as I think they would be highly limitating. Logically it should be be a function of the range of correlation coefficients each dataset returns. So for example, with a data that returns lots of correlations at above say 0.8 then the threhold should be quite high (e.g 0.4), with a data set that returns very few high correlations or none at all, the threshold should be lower so that I have enougth observations to analyse for that dataset.

I was thinking that the easiest way would be to have a line in the code where I can manually change the threshold based on what the correlation matrix is showing for each data or if this is automatically built into the code then have an option of a manual overide so I can see all the correlations if I want to. But you probably have a better idea of the best way to do it. The key thing for me is something with a degree of flexibility so that I can change the criteria easily if I need to.

Thanks,

Nick