PDA

View Full Version : Return combinations of the most popular choices



sassora
01-18-2014, 12:35 AM
Hi all

I have data about choices made by individuals
There are 100 options and people can choose from 1 to 8 of these. For example one may take a1, a2 and a3, another may take a1, a3, a87 and a88 etc. For each set of choices I have the number of people that made those choices.

It's easy to work out how many make each single choice e.g. 32,140 made choice a1 but I'd like to look at combinations (that exist in the data). What I would like to do is to take the 2-choice combinations to identify the largest groups. Ideally there would be code that spits out 1,2,3 and 4-choice combinations.

Here's some sample data.
11112

Is there a good way of doing this? Any help would be appreciated.

sassora
01-19-2014, 06:10 AM
The output would output, for example, the 2-choice combinations (a1 * a2) and the count of individuals. It would do similarly for 3 and 4 choice combinations, say (a1 * a2 * a4) and (a1 * a6 * a35 * a70).

Is the example clear enough to follow?

Thanks - any ideas?

D_Marcel
01-19-2014, 07:35 PM
Hi!


Please, tell me if my understanding is right.
When you say multiplicate the combinations, you want to know the max number of possibilities? For example, people can choose 8 from a universe of 100, right? But if only two values are choosen, it means that:


100 * 99 = 9900 possibilites (Considering that the possible values cannot repeat)
And then, count the number of individuals that choose, for example, 34 and 68? So, you can know how many combinations were made from that universe
In others words, for example: I have 3000 combinations from a universe of 9900, 30,30%. And also know all the values choosen in each of these 3000 combinations.


Douglas Marcel

sassora
01-20-2014, 12:45 AM
Choices data shows the combinations chosen (up to 8 choice per row) with a count next to them. Not all combinations are included but that's just because nobody chose them.

I want a summary table so that I can understand the information better. Currently there's about 20,000 combination lines many with only a few people for each. I want to know the most popular two choices made, 3 choices made and so on.
For a summary table, possible values cannot repeat and also order doesn't matter. So the maximum number of 2-choices is 100*99/2 = 4,500 but many of these may have a count of 0 and so can be excluded.

So if in the summary table shows a2 * a3 people would be counted from e.g.
a2 * a3 * a4
a2 * a3
a1 * a2 * a3 * a88

since they all have a2 * a3 in. The count for a2 * a3 would be the sum of counts for combinations with a2 and a3 in.


The summary table would have a count of people for the 1-choice, 2-choices, 3-choices etc (combinations where the counts are 0 don't need to be shown). For example,

a1
a2
a3
...
a100

a1 * a2
a1 * a3
...
a99 * a100

a1 * a2 * a3
...

Is this any clearer? :)

sassora
01-20-2014, 12:27 PM
I'm surprised there's not a standard solution or otherwise for this kind of problem.

D_Marcel
01-20-2014, 04:57 PM
Hi sassora, thanks for the explanations, I understood your need, but it's quite complex!
I tested here (although that you probably done the same) the formula COUNTIFS, and don't work for this purpose.

In this case, VBA it's the only choice, I guess, but will be a insane challenge because before count the combinations, we need to map those combinations, but how?
I guess we can use an array. Using a loop structure that will read each of the 20.000 rows, we can insert the combinations into an array and load them in the summary table. Each time a row is read, the code will check in the summary table if that combination is already there. The combinations in the summary table would be increased with counters inside this loop.

What you think about this idea? I can't think in a workaround.

Douglas Marcel

sassora
01-21-2014, 12:48 AM
My workaround so far (for 2-choices) has been to list all the choices as column headings and put "yes" or "no" according to whether they are included. Then use VBA and sumifs to create a 100*100 table with "COUNTS" (they are sums!) only in the upper diagonal. From here I was able to use vba to return a list of the most of popular choices in a manageable way. For 3-choices, I was thinking of having a list of 2-choices as rows and 1-choice and columns and repeat the process above.

As you say, mapping the combinations could be hard as well. The number of combinations increases a lot too as the number of choices increases (although in practice many of these won't be populated). I think I'd be fine with knowing up to 4-choices. Using the method for 4-choices there would be 3-choices as rows and 1-choices as columns to give 161700 rows by 100 columns = 16,170,000 calculations. Using an array does sound like a good way to go, I get the feeling we need a neat trick to minimise the processing, although loops will almost definitely be needed.

snb
01-21-2014, 02:02 AM
delete row 1.

delete columns >4

The frequency of combinations of first and second choice:


Sub M_snb()
sn = Cells(1).CurrentRegion

With CreateObject("scripting.dictionary")
For j = 2 To UBound(sn)
.Item(sn(j, 1) & "_" & sn(j, 2)) = .Item(sn(j, 1) & "_" & sn(j, 2)) + 1
Next

Cells(1, 8).Resize(.Count) = Application.Transpose(.keys)
Cells(1, 9).Resize(.Count) = Application.Transpose(.items)
Cells(1, 8).CurrentRegion.Sort Cells(1, 9), 2
End With
End Sub

sassora
01-21-2014, 03:04 AM
Hi snb, I was hoping you'd see this.

I'm looking at what you sent by phone, just to check interpretation:
The order of the choices don't matter. I'm essentially looking across each set of choices (up to 8) and looking at
1) the number of people that chose {a1,a2,...,a100} (not all choices will be taken)
2) the number of people that chose {a1xa2,...,a99xa100} (order does not matter a1xa2 = a2xa1)
3) do this for 3-choice combinations: a1xa2xa3 etc
4) do a many more as possible up to 8.

I'll take a look at the code more closely in a little while :)

snb
01-21-2014, 04:50 AM
I think it's a complex form of data mining.
To write an algorithm to perform what you want seems to me beyond the purpose & scope of this forum.
It looks more as a to be paid for job.

GTO
01-21-2014, 05:24 AM
Am I way off here, or is each "order" one of 203+ billion combinations?



1
100


2
4,950


3
161,700


4
3,921,225


5
75,287,520


6
1,192,052,400


7
16,007,560,800


8
186,087,894,300







203,366,882,995

sassora
01-21-2014, 05:40 AM
Fair enough snb, I will try to see if SQL can make an easier job of it. I was hoping that there was a shortcut I was missing. I guess if it was easy there would already be solutions. How much does this sort of thing cost out of interest.

Yes GTO if you keep every combination that's right. In practice, most cases would have no people choosing them - I would happily ignore these. Also I'm happy to stop at 4 - which is slightly better :P

GTO
01-22-2014, 01:57 AM
...Yes GTO if you keep every combination that's right. In practice, most cases would have no people choosing them - I would happily ignore these. Also I'm happy to stop at 4 - which is slightly better :P

Hi Sassora,

Here is a stab at it. In the attached workbook, you can see how I created some random records of 1 to 8 items. I imagine some items are more popular than others, and thus, certain combinations of items are more likely to be more popular. Thus, I would think that real-life data overview may be returned faster than the attached example data. For me, the example data returns stats in about 6.5 seconds for 10k records of purchase.

I would note that in the way I did it (concatenating item descriptors to produce unique combination identifiers) the combinations must be sorted. Hopefully that is not a big hitch in our giddy-up.

Anyways, in a Standard Module:


Option Explicit

'// NOTE: Requires 'Microsoft Scripting Runtime' (scrrun.dll) to be referenced. //
'// On my copy of WIN7, located at 'C:\Windows\system32\scrrun.dll' //
'// Whilst I normally write late-bound, I believe previous testing showed me a //
'// marked difference in runtime for given code, and I do not believe this library//
'// to have changed... //

Public Sub EvaluateOrders()
Dim DIClarge(1 To 8) As Scripting.Dictionary
Dim arrOrders2Assess As Variant
Dim y As Long
Dim x As Long
Dim n As Long
Dim i As Long
Dim UBnd As Long
Dim lKeyCount As Long
Dim sConcatenated As String
Dim arrOutput() As Variant
Dim wb As Workbook

Dim timerStart As Single: timerStart = Timer

With Sheet2
arrOrders2Assess = _
Range(.Cells(2, "D"), .Cells(.Rows.Count, "D").End(xlUp).Offset(, 8)).Value
End With

For x = 1 To 8
Set DIClarge(x) = New Scripting.Dictionary
Next

For y = 1 To UBound(arrOrders2Assess, 1)
sConcatenated = vbNullString
UBnd = arrOrders2Assess(y, 9)
For x = 1 To UBnd
sConcatenated = sConcatenated & arrOrders2Assess(y, x)
Next

If DIClarge(UBnd).Exists(sConcatenated) Then
DIClarge(UBnd).Item(sConcatenated) = DIClarge(UBnd).Item(sConcatenated) + 1
Else
DIClarge(UBnd).Item(sConcatenated) = 1
End If
Next

For x = 1 To 8
lKeyCount = lKeyCount + DIClarge(x).Count
y = Application.Sum(DIClarge(x).Items)
Next

ReDim arrOutput(1 To UBound(arrOrders2Assess, 1), 1 To 2)
y = 0: n = 1

For x = 1 To 8
arrOutput(n, 1) = "Combinations ordered of " & x & "[" & DIClarge(x).Count & "]"
For i = 1 To DIClarge(x).Count
arrOutput(i + n - 1, 2) = _
DIClarge(x).Keys(i - 1) & "[" & DIClarge(x).Item(DIClarge(x).Keys(i - 1)) & "]"
Next
n = n + DIClarge(x).Count
Next

Set wb = Workbooks.Add(xlWBATWorksheet)
wb.Worksheets(1).Range("A1").Resize(UBound(arrOutput, 1), 2).Value = arrOutput

MsgBox "Took " & _
FormatNumber(Timer - timerStart, 2) & _
" seconds to compute " & _
UBound(arrOrders2Assess) & _
" records", _
vbInformation, _
vbNullString

End Sub

Hope that helps,

Mark

Aussiebear
01-22-2014, 03:26 AM
8
186,087,894,300









And this is why gambling games such as Gold Lotto where you need to pick the correct 8 numbers (out of 45) is stacked against you.

sassora
01-22-2014, 09:43 AM
Mark, that looks amazing and it works very fast. I'll apply it to the actual data and see what happens. There's a lot to understand in the code but it seems to be a succinct way of approaching the task. Thank you

GTO
01-22-2014, 04:00 PM
Hi Sassora,

Not likely amazing, but thank you and of course, you are most welcome. Please mention how it works on real data when you get a chance, I'll keep my fingers crossed in the meantime.

Mark

sassora
01-22-2014, 04:57 PM
In practice, it didn't work how I thought it would. Hopefully the following explains it:

Using a unique list of combinations and the number of people making those options, I was hoping to find the number of people making a subset of options.

E.g.
a1 is in combinations (in the hypothetical combinations data)

a1
a1,a2
a1,a9
a1,a2,a3
a1,a3,a5,a10 etc

and for each of those combinations sum the people in total. There may have been 10 people choosing a1 only, 30 choosing a1 and a2 etc. I want to do 10 + 30 + ...

I'd then like the make the subset to be 2-choices e.g. a1,a2. In the combinations data above, I'd like to add the number of people choosing either a1,a2 or a1,a2,a3.

The idea is to go up to a subset of 4-choices although more if possible.

GTO
01-22-2014, 11:06 PM
Okay, I would like not to be mis-understanding (again) as to exactly what all we want to extrapolate. Could you put together a workbook with both the raw data and any/all returns that we would be looking for. Not thousands of rows of course, but enough raw data where it is easy to see what is returned and the logic.

snb
01-23-2014, 12:50 AM
Now we are getting somewhere....

sassora
01-23-2014, 12:01 PM
Yes snb, they are the 2-option outcomes with the right totals :)

snb
01-23-2014, 12:45 PM
And now the 3 options outcomes added:



I assume you'll manage to do the sorting.

snb
01-24-2014, 04:46 AM
The ultimate version for 8 choices turned out to be less complex than a feared at the start of the inquiry.

assumptions:

- contiguous filled cells per row in column A to H
- choices sorted in each row in the same order

caveat:

I have no idea how much time the macro takes when testing many rows.
I tried to avoid as many unnecessary loops as possible.
When applied to many rows, the 'transpose' method could reach it's limit.



Sub M_snb()
sn = Cells(1).CurrentRegion

With CreateObject("scripting.dictionary")
For y = 2 To 8
For j = 2 To UBound(sn)
If sn(j, y) <> "" Then
For jj = 1 To UBound(sn, 2) - 1 - y
For jjj = jj + y - 1 To UBound(sn, 2) - 1
If sn(j, jjj) = "" Then Exit For
c00 = Join(Application.Transpose(Application.Index(sn, j, Evaluate("row(" & jj & ":" & jj + y - 2 & ")"))), "_") & "_" & sn(j, jjj)
.Item(c00) = .Item(c00) + sn(j, 9)
Next
Next
End If
Next

If .Count > 0 Then
Cells(1, 14).Offset(, 3 * (y - 2)).Resize(.Count) = Application.Transpose(.keys)
Cells(1, 15).Offset(, 3 * (y - 2)).Resize(.Count) = Application.Transpose(.items)
End If
.RemoveAll
Next
End With
End Sub

sassora
01-25-2014, 01:58 AM
Thanks snb - it's great to see a version that works for all 8 that in a short and clear way.

I'll try to make it faster - currently I don't think I could run it on 20,000 combinations say without going for lunch.

snb
01-25-2014, 02:24 AM
I don't think you can make it faster.
The only thing you could do is putting the result in an array and write that array to the worksheet at last.

Where can I send the invoice to ?

sassora
01-25-2014, 03:00 AM
You're probably right, the cheque's in the post :P

snb
01-25-2014, 04:08 AM
If that's the case I think I have to wait longer for your payment than you for the macro to complete it's task ;)