PDA

View Full Version : Solved: Counting sort for multi-dimensional arrays?



skulakowski
05-17-2005, 07:49 AM
I'm using John Walkenbach's counting sort but need to apply it to a multi-dimensional array. The code below beautifully sorts an array of one column. How can I modify this for a two-column array, sorting on column one but keeping the corresponding data in column two?

First, set up 2-column array of data to be sorted.


ReDim MaxPValueArray(0 To iLastRow-1) As Double
'Id like this to be ReDim MaxPValueArray(0 To iLastRow-1, 0 To 1) As Double

Application.Goto "Property"
qCol = ActiveCell.Column
For q = 0 To iLastRow - 2
MaxPValueArray(q, 0) = Cells(q + 2, qCol)
Next q

'code works if MaxPValueArray(0 to iLastRow-1) and I never pick up the Value column
Application.Goto "Value"
qCol = ActiveCell.Column
For q = 0 To iLastRow - 2
MaxPValueArray(q, 1) = Cells(q + 2, qCol)
Next q

Call Countingsort(MaxPValueArray)


Here's John's counting sort.

Sub Countingsort(list)
Dim counts()
Dim i As Long
Dim j As Long
Dim next_index As Long
Dim min, max
Dim min_value As Variant, max_value As Variant

' Allocate the counts array. VBA automatically initialises all entries to 0.
min_value = Minimum(list)
max_value = Maximum(list)

min = LBound(list)
max = UBound(list)

ReDim counts(min_value To max_value)
'ReDim counts(min_value To max_value, 0 to 1) causes a subscript out of range at the Minimum = l(s1) line in Function Minimum.

' Count the values
For i = min To max
counts(list(i)) = counts(list(i)) + 1
Next i

' Write the items back into the list array.
next_index = min
For i = min_value To max_value
For j = 1 To counts(i)
list(next_index) = i
next_index = next_index + 1
Next j
Next i
End Sub


The functions Minimum and Maximum are the same, only different in a lesser-than or greater-than comparison.

Private Function Maximum(l)
Dim s1, s2
Dim i
s1 = LBound(l)
s2 = UBound(l)
Maximum = l(s1)
For i = s1 To s2
If l(i) > Maximum Then Maximum = l(i)
'For Minimum, If l(i) < Minimum Then Minimum = l(i)
Next i
End Function

Many thanks in advance for what will be obvious once someone points it out and explains it to me.:beerchug:

MWE
05-17-2005, 05:53 PM
The "counting sort" approach is very fast, but also rather difficult to understand. Unless you have lots of data (say # points in target column > 5000), I suggest that you use a simplier approach. I am partial to the simple "bubble sort" method because it is so easy to implement and modify.

I have several such routines, one that is probably what you want. It is called MultiSort and sorts an array based on a target "column". I have attached a zip file that demonstrates its use.

Hope this helps

BlueCactus
05-18-2005, 12:19 AM
Seems like the only concern would be getting the second dimension to tag along. You're not sorting based on the second dimension, right?

The first code block should work with your redim statement. BTW, there are faster ways of loading the array if your columns are adjacent.

I've not tested it, but I would think this for the Max / Min functions:
Private Function Maximum(l)
Dim s1, s2
Dim i
' Note that the '1' refers to the first dimension.
s1 = LBound(l,1)
s2 = UBound(l,1)
Maximum = l(s1,0)
For i = s1 To s2
If l(i,0) > Maximum Then Maximum = l(i,0)
'For Minimum, If l(i) < Minimum Then Minimum = l(i)
Next i
End Function

...and then (again untested):
Sub Countingsort(list)
' I'm removing the parantheses because they're probably not needed,
' and rediming a dim'd array with multiple dimensions can be problematic
Dim counts as variant
Dim i As Long
Dim j As Long
Dim next_index As Long
Dim min, max
Dim min_value As Variant, max_value As Variant

' Allocate the counts array. VBA automatically initialises all entries to 0.
min_value = Minimum(list)
max_value = Maximum(list)

min = LBound(list,1)
max = UBound(list,1)

ReDim counts(min_value To max_value, 0 to max-min+1)

' Count the values
For i = min To max
counts(list(i,0),0) = counts(list(i,0),0) + 1
counts(list(i,0),counts(list(i,0),0))=list(i,1)
Next i

' Write the items back into the list array.
next_index = min
For i = min_value To max_value
For j = 1 To counts(i,0)
list(next_index,0) = i
list(next_index,1)=counts(i,0,j)
next_index = next_index + 1
Next j
Next i
End Sub

Damn. I'm making this crap up on the fly, if you can't tell. That algorithm was tricky to understand, like MWE said. OK, here's what I've done in the last section. I'm stacking up your values (second column) in a second dimension of counts(). If this works, the catch is that you don't know in advance how many elements you need in that second dimension. So I've redim'd it as the number of elements in the first dimension of list because the most you'll ever need is when all the property values are identical.

BTW, this sort method only works with integers, right? (Just want to make sure I understood it right - sometimes understanding the logic of the code is not the same as understanding how it works.)

skulakowski
05-18-2005, 09:17 AM
BlueCactus, For something that you made up on the fly, your code is very good...in fact, if my data were perfect, your code would be as well.

I happen to have duplicate Property values (it was a lucky test dataset). As soon as I've written the code to handle duplicates, I'll post it.

Many, many thanks.

MWE, Thanks to you as well but I've got too much data for a Bubble sort to be effective. Sometimes it's a set of 1,500, sometimes it's a set of 22,000.

BlueCactus
05-18-2005, 09:36 AM
I'm actually surprised that it works but doesn't handle duplicates. Unless you mean it should delete duplicates? The one thing it won't do is sort duplicate properties based on the Value - it should spit them out in the same order it found them in the original data. You could do the secondary sort relatively easily by copying the second dimension of counts() into a 1D array, feed that to the original sort code, and copy the results back to the second dimension of counts(). The only problem is that most of that counts() second dimension is unused which will require more code mods to increase efficiency.

Anyway, good luck with that and let us know how it goes. I think I may have some uses for this code myself.

BTW, just noticed an error in the code at the end. Should be:
list(next_index,1)=counts(i,j)

skulakowski
05-18-2005, 09:57 AM
Here's the code that I wrote for duplicates in the first column Property.
For i = min To max
If counts(list(i, 0), counts(list(i, 0), 0)) = 0 Then
'this is your code
counts(list(i, 0), 0) = counts(list(i, 0), 0) + 1
counts(list(i, 0), counts(list(i, 0), 0)) = list(i, 1)
Else
'my test to pick up the larger Value in the second column
Duplicate(0, 0) = counts(list(i, 0), counts(list(i, 0), 0))
If Duplicate(0, 0) > list(i, 1) Then list(i, 1) = Duplicate(list(i, 0), 0)
counts(list(i, 0), counts(list(i, 0), 0)) = 0
counts(list(i, 0), counts(list(i, 0), 0)) = list(i, 1)
End If
Next i

It seems to work...

Now I'm up to a "subscript out of range" error at
next_index = min
For i = min_value To max_value
' For j = 1 To counts(i) 'this is the offending code
For j = 1 To counts(i,0)'it looks this may be the correct code
list(next_index, 0) = i
list(next_index, 1) = counts(i, 0)
next_index = next_index + 1
Next j
Next i


My test sample now is 2 pairs, each with a Property of 7. The code takes the larger of 5 and 6. So far, so good.

I'm left with a counts array that is 3 pairs long (counts(5 to 7, 0 to 1). Both counts(5, 0 to 1) and counts(6, 0 to 1) are empty. I think that the first is empty because it was used as a placeholder for counting and the second is empty because I eliminated the Value of 5 in my test for duplicates.

So, since this doesn't work, I'm going back to test with non-duplicate data. Back in a few.

skulakowski
05-18-2005, 10:16 AM
BTW, just noticed an error in the code at the end. Should be:
list(next_index,1)=counts(i,j)
Hey, me too!:yes

We're getting very close now. On to testing for data with duplicates...

Bob Phillips
05-18-2005, 10:24 AM
MWE, Thanks to you as well but I've got too much data for a Bubble sort to be effective. Sometimes it's a set of 1,500, sometimes it's a set of 22,000.

QuickSort?

skulakowski
05-18-2005, 10:38 AM
A 2-dimension counting sort.
Sorts on the first column and retains the corresponding value in the second.
If anyone would care to test, I think this works.

Sub Countingsort(list)
Dim counts()
Dim i As Long
Dim j As Long
Dim next_index As Long
Dim min, max
Dim min_value As Variant, max_value As Variant
Dim Duplicate(0, 0)

' Allocate the counts array. VBA automatically initialises all entries to 0.
min_value = Minimum(list)
max_value = Maximum(list)

min = LBound(list)
max = UBound(list)

ReDim counts(min_value To max_value, 0 To 1)

'Count the values (i.e., recognize it exists in data; grab the value in column 2.)
For i = min To max

'If this is the first time we see the value in column 1, recognize it.
If counts(list(i, 0), counts(list(i, 0), 0)) = 0 Then
counts(list(i, 0), 0) = counts(list(i, 0), 0) + 1
counts(list(i, 0), counts(list(i, 0), 0)) = list(i, 1)
Else
'If the value in column 1 has already appeared in the data, take the larger of the two possibilities in column 2.
Duplicate(0, 0) = counts(list(i, 0), counts(list(i, 0), 0))
If Duplicate(0, 0) > list(i, 1) Then list(i, 1) = Duplicate(list(i, 0), 0)
counts(list(i, 0), counts(list(i, 0), 0)) = 0
counts(list(i, 0), counts(list(i, 0), 0)) = list(i, 1)
End If
Next i

' Write the items back into the list array.
next_index = min
For i = min_value To max_value
For j = 1 To counts(i, 0)
list(next_index, 0) = i
list(next_index, 1) = counts(i, j)
next_index = next_index + 1
Next j
Next i
End Sub

Private Function Maximum(l)
Dim s1, s2
Dim i
s1 = LBound(l, 1) 'NB &quot;1&quot; => dimension 1 or 1st column
s2 = UBound(l, 1)
Maximum = l(s1, 0) 'but I started the index numbering w/0
For i = s1 To s2
If l(i, 0) > Maximum Then Maximum = l(i, 0)
'or If l(i, 0) < Minimum Then Minimum = l(i, 0) for Minimum function
Next i
End Function

:beerchug:I only wish I could say that I understand it!:beerchug:
Also, BlueCactus, you said you had a better way to read data into arrays if the columns were adjacent. What is it?

skulakowski
05-18-2005, 10:51 AM
Three quick notes.

W/out the test for duplicates, the code generates a "subscript out of range" error. (At least it did when I was testing that bit...but who knows what else I was moving at the time.)

The duplicate test above substitutes the larger column 2 value in place of the smaller => two duplicate entries in the arrays with the same larger value. This works for my purposes but it may not work for others.

In effect, I have an almost-twice-sorted list with the counting sort on column 1 and a "always larger value" in column 2. It'd be nice to modify the duplicate test so that it retained both values in column 2 but I've had too much fun already.

BlueCactus
05-18-2005, 11:49 AM
Don't have time to address most of this right now, but in terms of reading in the array, use:

Dim myArray as Variant
myArray = Sheets(index).Cells(row,col).Resize(numRows,numCols)

where row,col is the top-left cell, and numRows, numCols are the number of rows and cols.

Note that the first cell will be in myArray(1,1), not myArray(0,0), so this will require some minor mods to some of the element indices you use for the second dimension. The first dimension should not be affected since the code already refers to LBound(myArray,1) and UBound(myArray,1).

BlueCactus
05-19-2005, 08:34 AM
I only wish I could say that I understand it!:beerchug:

It's actually not so hard once you've stared at it for a while.

Basically, you find the minimum and maximum values in the data. You create a counts() array spanning from minimum to maximum. This array then has an element for each possible value in the data. You walk through the data, and count how many times each possible value appears in the data. Thus, if counts(7)=12 then there are 12 occurences of the number 7 in the data. Now, you don't actually have to sort anything, because the counts() array is by definition already in ascending order. So you just step through the counts() array: for each n, you list counts(n) occurences of n, and you're automatically sorted.

An important aspect of VBA in the code provided is the handling of
for i = min_value to max_value
for j = 1 to counts(i)
...
next j
next i
If counts(i)=0, there were no occurences of i in the data. Some versions of BASIC will execute the code contained in the For j = 1 to 0 loop once because j is originally set to 1. VBA does not. It pretests the loop, finds that j>0 and skips it. This eliminates having to enclose the For j loop inside a If counts(i) >0 block.

The code as provided only works with integers, because you can have only integer indices for an array. You could easily modify it to work with fixed decimal places by mupltiplying the data by a factor of 10 (and then rounding) before counting, and dividing it by the same factor at the end. But, this could lead to some pretty huge memory-hungry arrays. (Imagine having min_value = 1, max_value= 100000, with 3 decimal places. That would lead to a counting array with 100,000,000 elements.)

MWE
05-19-2005, 02:00 PM
It's actually not so hard once you've stared at it for a while.

Basically, you find the minimum and maximum values in the data. You create a counts() array spanning from minimum to maximum. This array then has an element for each possible value in the data. You walk through the data, and count how many times each possible value appears in the data. Thus, if counts(7)=12 then there are 12 occurences of the number 7 in the data. Now, you don't actually have to sort anything, because the counts() array is by definition already in ascending order. So you just step through the counts() array: for each n, you list counts(n) occurences of n, and you're automatically sorted.

An important aspect of VBA in the code provided is the handling of
for i = min_value to max_value
for j = 1 to counts(i)
...
next j
next i
If counts(i)=0, there were no occurences of i in the data. Some versions of BASIC will execute the code contained in the For j = 1 to 0 loop once because j is originally set to 1. VBA does not. It pretests the loop, finds that j>0 and skips it. This eliminates having to enclose the For j loop inside a If counts(i) >0 block.

The code as provided only works with integers, because you can have only integer indices for an array. You could easily modify it to work with fixed decimal places by mupltiplying the data by a factor of 10 (and then rounding) before counting, and dividing it by the same factor at the end. But, this could lead to some pretty huge memory-hungry arrays. (Imagine having min_value = 1, max_value= 100000, with 3 decimal places. That would lead to a counting array with 100,000,000 elements.)

I never meant to say that it was hard to understand once you have played with it a bit; rather that for average folks, it was overly complicated for the gains achieved unless there are lots of numbers to be sorted. For the vast majority of what I do, the bubble sort approach works fine; it is adequately fast, pretty independent of data type (works fine for integer, long, string, single, etc., with no code fiddling) and is very compact.

Bob Phillips
05-20-2005, 08:40 AM
I haven't followed this thread too deeply, but wouldn't it be easier just to drop the array onto a worksheet and use Excel's built-in sort? It can all be triggered from VBA.

BlueCactus
05-20-2005, 09:21 AM
I haven't followed this thread too deeply, but wouldn't it be easier just to drop the array onto a worksheet and use Excel's built-in sort? It can all be triggered from VBA.
Possibly, but where's the fun in that? :devil:

Bob Phillips
05-20-2005, 09:33 AM
Possibly, but where's the fun in that?
Solving the problem?

BlueCactus
05-20-2005, 11:07 AM
skulakowski added code to throw out duplicates during the search, which makes some of the other code a little redundant because I'd originally anticipated keeping and maybe sorting the duplicates.

However, if you want to show use of Excel's functions, go right ahead.

Bob Phillips
05-20-2005, 11:39 AM
However, if you want to show use of Excel's functions, go right ahead.

I certainly don't have any appetite for working on this, or mnore to the point tracking back and finding out the course of events, but I am intrigued as to two things

- why Excel wasn't used (dropping an array into a worksheet and back is very straight-forward, as is using Excel's sort in VBA)

- why this very peculiar sort algorithm was used. As far as I can see a counting sort is efficient when the range of keys is small and there many duplicate keys (which makes the change odd!). I doubt that it would beat a quicksort (although maybe keeping the sort stable using quicksort would be a challenge?

skulakowski
05-23-2005, 08:00 AM
just back in the office today...

I chose to use a counting sort for two reasons. First, because my users sometimes have tens of thousands of records to sort, none of the other straight-forward sorts I tried were fast enough to suit them (an entertaining yet very demanding group!).

Second, I figured I'd learn something. I have other sorts set up that rely on filtering out and deleting non-unique rows (which makes them fast enough for tens of thousands of records because many records would be filtered out). But I already know how to do that.

Thanks to everybody. I have learned a lot.

Vergence
05-05-2010, 04:36 PM
The counting sort is at least as fast (if not faster) than the quick sort at least based on larger data sets. John Walkenbach also demonstrates this in his "Power Programming with VBA". There is a very good lecture on YouTube via MIT that explains the counting sort process as well.

It's a great sort algorithm and is also much more predictable. I thought the approach was quite clever and it's always interesting to rethink these things from time to time. Often we take great algorithms for granted. The benefit of the counting sort is that it's timing is linear (not exponential) as the data size increases.

I can certainly appreciate the effort expended on the thread exchange above. Great job.

skulakowski
05-06-2010, 03:28 PM
Don't know that we created the next best algorithm for fast sorting of large data sets but it was fun to try. I'm glad you found this thread useful.