PDA

View Full Version : Grouping Data into Buckets



bflorez
10-04-2005, 12:37 PM
Hey Guys,


I'm fairly new to VBA / heavy Office use, but I think I have a problem that could be relatively easily solved by a clever VBA solution. I deal with a lot of data that needs to be grouped into meaningful buckets - for example, a bunch of ages I'd like to group into age groups (e.g. Under 18, 18-21, 22-35, etc.)

I'd like a way to have Excel automatically tell me how many people are in a range of values that I provide and a percentage of the total number of people.

I'd ALSO like it to have the option to automatically determine those value ranges to fit a defined percentage of people in it. For example, I could say I want 3 buckets, with a 25%, 50%, 25% distribution, or 5 buckets with a uniform 10% distribution.

The number of records I'm looking at varies drastically, so the script should be able to handle any number of records (up to 65K or so, of course :)

Any ideas?

Thanks for your help.

Warm Regards,
Brendan

MWE
10-04-2005, 03:09 PM
Hey Guys,


I'm fairly new to VBA / heavy Office use, but I think I have a problem that could be relatively easily solved by a clever VBA solution. I deal with a lot of data that needs to be grouped into meaningful buckets - for example, a bunch of ages I'd like to group into age groups (e.g. Under 18, 18-21, 22-35, etc.)

I'd like a way to have Excel automatically tell me how many people are in a range of values that I provide and a percentage of the total number of people.

I'd ALSO like it to have the option to automatically determine those value ranges to fit a defined percentage of people in it. For example, I could say I want 3 buckets, with a 25%, 50%, 25% distribution, or 5 buckets with a uniform 10% distribution.

The number of records I'm looking at varies drastically, so the script should be able to handle any number of records (up to 65K or so, of course :)

Any ideas?

Thanks for your help.

Warm Regards,
Brendan
you have described histograms of various types. Excel has a native historgram generator but it is not very easy to use and IMO does not provide good flexibility. You can bucketize data quite easily using native Excel capability (not VBA) using the countif function. For example, assume your raw data is in Col A. Then you might put a formula in B1 to count all values in Col A that are < 18, then in B2 those < 21 minus B1, etc.It is pretty easy to do and is very flexible.

The 25,50,25 3-bucket or 20,20,20,20,20 5-bucket is a little more difficult because there is either some recursion tweaking the bucket boundaries or you directly calculate the percentiles.

Let me take a look at some of my historgram procedures and see what I can find for you.

MWE
10-04-2005, 03:38 PM
attached is a very simple example of using the CountIf function and some simple VBA

bflorez
10-05-2005, 06:47 AM
Hey MWE,

Wow, thanks for thorough answer to the first part of my question - I know the second part is harder, but any idea how to do that? If you could do that, it would be Fantastic!

Thanks,
Brendan

MWE
10-05-2005, 08:40 AM
Hey MWE,

Wow, thanks for thorough answer to the first part of my question - I know the second part is harder, but any idea how to do that? If you could do that, it would be Fantastic!

Thanks,
Brendan
Brendan: sounds like the first stuff was useful. Re part 2, it really depends on how you wish to approach the problem. Frequency analysis (histogram is a graphical representation of frequency analysis) with defined bucket "widths" is really pretty easy. As mentioned earlier, it is a bit more difficult when you want defined % in each bucket. Are you concerned about "statistical robustness", i.e., that the results are statistically correct? If so, the problem is yet a bit more difficult. For example, in the defined width case, the # of buckets and the width of each is not arbitrary if you want "statistically meaningful" results. There are several rules for # of buckets. The one I use most is buckets are equal "width" and # buckets = SQRT(# samples).

The general approach for defined % / bucket is to calculate percentiles at a convenient increments and then % buckets is essentially done. Let's say you want the 25/50/25 case. We seek the bounds or "left/right" values for each of 3 buckets:
1. sort the raw data, smallest to largest (this is often the most time consuming step)
2. the 1st 25%: left = smallest value; right = the value 1/4 into the sorted array
3. the 50% bucket: left value 1/4 into the sorted array; right = the value 3/4 into the sorted array
4. the final 25% bucket: left= the value 3/4 into the sorted array; right = the max value

There is some minor tweaking of left/right values depending on how you want to define buckets, i.e., where is the equal sign for a bucket edge. If the 25% is (14 , 25) and the 50% is (25 , 44), where does a value = 25 fall? It can not go into both buckets. If you know that values are only integers, the problem is easier, we might use (14 , 25), (26 , 44), etc. But when real numbers are involved, we have to be careful. IF we used (14 , 25) and (26 , 44) where does 25.4 go? What one normally does is something like 14<= X < 25; 25<= X < 44; etc. In "mathspeak", these bucket boundaries are intervals and the use of parentheses vs (square) brackets vs (curly) braces is used to indicate the type of interval. Rewriting the above the intervals would be [14,25}, [25,44}, etc. More than you wanted to know?:devil:

Given the multiple steps, it is easiest to do defined % freq analysis using VBA. I will dig out my Percentile calcuation code.

bflorez
10-05-2005, 09:25 AM
Hey MWE,

Thanks again - you continue to be extrememly helpful. Don't every worry about giving me MORE than I wanted to know -- that's not possible ;)

I'm not sure what you mean when you talk about 'statistical robustness', but I'll try to explain the two things I'm trying to do. In both cases, I am limited to integers, though you make a good point about the proper interval bounds.

1. The first thing I often have to do is try to group a large group of integers to make something *roughly* approximating a normal distribution, like the 25% 50% 25% or a uniform distribution. In these cases, the buckets may be of significantly different sizes, but provided I can manually adjust the ranges if need be, that should be fine.

2. The other task is to group data into 'meaningful' buckets (between 3 and 15 buckets, usually) - what this means is rather ambiguous, but often times it requires that, again, the buckets are of drastically different sizes. Your method for using percentiles makes complete sense; my problem is more with the automation side of things rather than the conceptual end, as I am brand new to Excel/VBA but not so new to mathematics/statistics.

Thanks,
Brendan

MWE
10-06-2005, 05:36 AM
Hey MWE,

Thanks again - you continue to be extrememly helpful. Don't every worry about giving me MORE than I wanted to know -- that's not possible ;)

I'm not sure what you mean when you talk about 'statistical robustness', but I'll try to explain the two things I'm trying to do. In both cases, I am limited to integers, though you make a good point about the proper interval bounds.Statistical robustness means that the tests performed and statistics computed are "valid". There are assumptions behind each statistical method and if you violate those assumptions, the results may be OK for what you want to do, but they are not statistically valid. For example, you wish to estimate the mean of a population, say the average age for students in a high school. How many samples are required from the population for the estimate to be "valid". You do not need to include the age of every student, but you need a reasonable sample size. If you use every student then obviously you have the true mean. But if you use 10%, is that mean "good enough"? Hard to say without knowing a bit more about the sample. It the sample is truely random, 10% may be OK, particuarly if the actual # of students is quite large. But if the sample is skewed, say, just Freshman, or the total number is actually 10 and you use 1 (10% of 10) ... well, you get the idea


1. The first thing I often have to do is try to group a large group of integers to make something *roughly* approximating a normal distribution, like the 25% 50% 25% or a uniform distribution. In these cases, the buckets may be of significantly different sizes, but provided I can manually adjust the ranges if need be, that should be fine.You want to be a little careful here. The population may not be normal and trying to make it look normal is dangerous. For, say, ages of students in a high school, I would expect something closer to uniform than normal. There are simple tests to determine if a population is normal, or any defined distribution. You bucket the data using an appropriate # of buckets (like SQRT(N)) and consistent bucket width (typically (Xmax-Xmin)/N) and then compare the # of values in each bucket compared to what you would get if the data were truely, say, normal. You add up all those differences and look at the relative size of that number using a ChiSquare Test.

BTW, one of the advantages of a "proper" histogram is that it provides a simple way to "see" the distribution and you can usually tell if the data is, say, normal, or uniform, or U-shaped, or bi-modal, or truncated, etc.


2. The other task is to group data into 'meaningful' buckets (between 3 and 15 buckets, usually) - what this means is rather ambiguous, but often times it requires that, again, the buckets are of drastically different sizes. Your method for using percentiles makes complete sense; my problem is more with the automation side of things rather than the conceptual end, as I am brand new to Excel/VBA but not so new to mathematics/statistics.When you state, "... the buckets are of drastically different sizes.", do you mean that the buckets are different "widths".

I have attached a spreadsheet that does percentile calculations. Sheet1 is a repeat of the previous spreadsheet. Sheet2 does 10%, 20%, ... Sheet3 lets you play with percentiles.

Play around a little. I still need to put together some histogram procedures and should be able to post something later today. I am sure we can build an automated version of "this stuff" that will do what you want and save you a lot of time.

bflorez
10-14-2005, 09:47 AM
Hey MWE,

First of all, thanks again for all your help so far - it's aboslutely great to have people like you who are both incredibly knowledgable AND willing to help (a rare combination). I apologize for taken so long to get back to you, but I was on vacation.

The last file you send, ptile.zip, was great....you said you're pulling more together? I can't wait to see what else you have! The two most immediate modifications on what you've already sent me would be to have Sheet 1 automatically adjust the range to look at all the data I paste in (the other sheets DO do that, but I think b/c you were just using a basic 'countif()' on Sheet1, I have to manually resize the range everytime I paste data), and to allow me to add as many buckets as I want automatically. The other modification is also on Sheet1, to be able to specify the ranges inclusively, e.g.

1 to 5
6 to 10
11 to 35
16 to 99, etc.

Where I could put in '1' and '5', and then 6, 11, 16, etc would be automatically generated, allowing me to put in the high end ranges.

I didn't really understand what the stats were that you had on Sheet2, but a summary like that would be great (if I knew what it meant). Sheet3 obviously rocks the house, and is almost exactly what I was talking about - the only thing that would make it better would be not have to put in culumative percentages, but percentages for that particular bucket.

The ideal scenario would be to be able to drop data in, see a 'natural' distribution of the data, and be able to graphically slide the bucket boundaries, while watching another distribution that reflects the distribution of the data as I have it bucketed, while displaying the bucket edges, counts and percentages in each bucket. Much of the data we look at ISN't going to be normally distributed, but we like looking for data that is. You seem like you have a very good grasp of this stuff, so anything else you think might be useful definitely shoot back.

Thanks a lot. Talk to you soon,
Brendan

MWE
10-16-2005, 08:03 PM
Hey MWE,

First of all, thanks again for all your help so far - it's aboslutely great to have people like you who are both incredibly knowledgable AND willing to help (a rare combination). I apologize for taken so long to get back to you, but I was on vacation.

The last file you send, ptile.zip, was great....you said you're pulling more together? I can't wait to see what else you have! The two most immediate modifications on what you've already sent me would be to have Sheet 1 automatically adjust the range to look at all the data I paste in (the other sheets DO do that, but I think b/c you were just using a basic 'countif()' on Sheet1, I have to manually resize the range everytime I paste data), and to allow me to add as many buckets as I want automatically. The other modification is also on Sheet1, to be able to specify the ranges inclusively, e.g.

1 to 5
6 to 10
11 to 35
16 to 99, etc.

Where I could put in '1' and '5', and then 6, 11, 16, etc would be automatically generated, allowing me to put in the high end ranges.

I didn't really understand what the stats were that you had on Sheet2, but a summary like that would be great (if I knew what it meant). Sheet3 obviously rocks the house, and is almost exactly what I was talking about - the only thing that would make it better would be not have to put in culumative percentages, but percentages for that particular bucket.

The ideal scenario would be to be able to drop data in, see a 'natural' distribution of the data, and be able to graphically slide the bucket boundaries, while watching another distribution that reflects the distribution of the data as I have it bucketed, while displaying the bucket edges, counts and percentages in each bucket. Much of the data we look at ISN't going to be normally distributed, but we like looking for data that is. You seem like you have a very good grasp of this stuff, so anything else you think might be useful definitely shoot back.

Thanks a lot. Talk to you soon,
Brendan
Brendan: glad that we are on the same wavelength(s) and that what I have provided so far is useful. I am out of town for a few (more) days and do not have access to my computer, libraries, etc. I was hoping to generate one more iteration of your "solution" before I left, but I guess that was not to happen. I will review your needs and generate the next iteration when I return on the 19th

bflorez
11-16-2005, 08:23 AM
Hey MWE,

Haven't heard from you in awhile - I'm sure you're very busy, but I was wondering if you had ever gotten around to finding that other histogram stuff you mentioned. I'm also don't know what you're other expertise is, but I'm posting another problem (optimization) that you might have some expertise in, since it sounds like you're the math man ;)

Thanks a lot,
Brendan