# Thread: Help Required : Random Sampling of Data

1. If you have a large number data points with only one or two instances of each, proportionate really isn't going to work

Example - look at the yellow columns. There are 20 data points with 12 different values. Assuming that you want to get a proportionate random sample of 10, many of the values (like '2') need .5 of a sample data point

That was why the lines that you deleted were there, so that if the number of occurrences of a sample value size was so small that no data of that value would be selected, I forced at least 1

The only alternative I can think of (again this is outside my range) would be to use percentiles and proportionately sample within a percentile (the orange columns)

Capture.JPG

2. Understood Paul.
Yes, if the values are numbers then the samples haves to be picked based on the weightage in the ranges as per the list box

What do you think if I do the below code to proportionate again the shortfall in samples. I hope I will get some more samples

```    If n < numRecordsProportional Then
d = numRecordsProportional - n ' shortfall
s = WorksheetFunction.Sum(aryStrataSamples)
For iStrata = 1 To numStrata
If aryStrataSamples(iStrata) = 0 Then
Else
aryStrataSamples(iStrata) = aryStrataSamples(iStrata) + Round(d * aryStrataSamples(iStrata) / s, 0)
End If
Next iStrata
End If```

3. Also to consider/giving importance to amount columns in picking samples, I have added a checkbox to select the column which has numbers/amounts to pick the highest values from each numStrata for both equal and proportionate methods.

I need to think of how this can be addressed in the existing codes. Can you help me in this
I have made few cosmetic changes in the codes and attached the version here

4. Originally Posted by anish.ms
Also to consider/giving importance to amount columns in picking samples, I have added a checkbox to select the column which has numbers/amounts to pick the highest values from each numStrata for both equal and proportionate methods.

I need to think of how this can be addressed in the existing codes. Can you help me in this
I have made few cosmetic changes in the codes and attached the version here
I didn't like some of your changes, especially the part about using Selection

Don't understand the purpose or use of Checkbox.

Using Revenue and Sample Size = 40 with 300 Data Points, one way would be to

1. Sort the data
2. Break it into Sample Size = 40 bins
3. Pick 1 random sample from each bin

That way you get your 40 samples, and by sorting the data points with the same values will most likely be in the same bins

In the picture then, you'd have a random pick from the Green, one from the Yellow, one from the blue, .... and one from the Gray

Capture2.JPG

It'll take a while to do this, but it's do-able

5. Hi Paul, thanks for your response

I changed it to selection because finally this will go as an add-in into my excel and it is not necessary to be the data range always starts from row 1 and column 1 and the number of columns also will vary.

I added check box to give an option to pick samples from the highest values in the strata. For example if Quarter is the selected variable and with equal/proportionate method
With checkbox false - sample selection as per the current random method
With checkbox true - the samples shall be picked in the order of highest to lowest values in the selected column (combx_SelHVField)

attached another sample with 50000 rows data (normally the data range is almost 100k to 200K rows)

Can you add your above example of 'sort and sample size bins' as a third method in addition to the current proportionate and equal methods

Can you also check the possibility of picking samples from the highest to lowest values with reference to my check box option.

6. Have not forgotten

Can you also check the possibility of picking samples from the highest to lowest values with reference to my check box option.
Don't understand. Can you give simple example

7. Originally Posted by Paul_Hossler
Have not forgotten

Don't understand. Can you give simple example
Screenshot 2021-08-05 080115.jpg
As per the above screenshot if the checkbox is ticked and for example if revenue is the filed selected, then it shall pick samples in the order of highest to lowest values from revenue for each strata (Q1 to Q4). The result would be like below-
Screenshot 2021-08-05 081125.jpg
Same for proportionate method, if checkbox is ticked

8. Did not do anything with the check box

Enter the number of bins

Macro sorts Strata in order

Divides sorted list into number of bins

picks one randomly from each bins

Capture.JPG

9. Thanks a Ton Paul!

10. Hi Paul, Please find attached the latest version.
Request your help on the checkbox option to pick highest value samples.

11. Not clear as to the results you're expecting

With no changes, I did this ...

Capture.JPG

... and got this

Capture2.JPG

What would be the 'Highest Values' from Facility Name?

12. Hi Paul, sorry for not making it very clear.
combx_SelHVField was showing all column headers. I have now changed it to show only those column headers which has numeric values.
In the above case, if the field selected is either SubAmt or TAT, then the samples should be the highest 20 values from the filed selected in each Category.
Hope it is clear now.
Maybe by ranking the values for each stratum and then sort and mark Yes for first 20 in each stratum
Thanks in advance!

13. I marked the major changes in the code with ' <<<<<<<<<<<<<<<<<

Capture.JPG

14. Thanks a lot Paul!
But the samples are limited to the number of strata

15. The 190 is ignored

There are 19 strata in "Facility Name" and the check box to pick highest from "SubAmt" is checked

So the highest SubAmt from each of the 19 "Facility Name" gets the "YES"

Capture.JPG

16. Hi Paul,
Sorry for the confusion again.
I wanted the sample size to be considered like how it was working previously. And if the check box is checked, it shall pick in the order of highest to lowest values instead of random pick.
If the checkbox is unchecked, then random pick, the way it was working previously
Thanks for you time.

17. So for example

High Value NOT Checked

Strata = "Facility Name" , 19 values
option = Equal from each
# = 10

There are 190 total sample picked, 10 randomly from each Facility Name

High Value Checked, pick from "SubAmt"

Strata = "Facility Name" , 19 values

There are 19 samples picked, each the largest SubAmt in each Facility Name

18. Hi Paul,
Thanks for your time!

For example

High Value NOT Checked

Strata = "Facility Name" , 19 values
Option = Equal from each
# = 10

There are 190 total sample picked, 10 randomly from each Facility Name

High Value Checked, pick from "SubAmt"

Strata = "Facility Name" , 19 values
Option = Equal from each
# = 10

There shall be 190 samples picked, 10 each from highest to lowest in SubAmt for each Facility Name.
Given below the sample for facility 3

Screenshot 2021-08-27 085039.jpg

Similarly for the the other pick options. Hope I was able to make it clear this time.

I think the latest version is not picking high values

19. So if the High Values checkbox is check, then

From the 200 "Facility1" pick 10 randomly, and THEN only selected the highest from the 10?

Repeat for 10 from Facility2, etc. ???

20. Hi Paul,
Sorry, what is 200 here?
No, there is no random picking if the High Values checkbox is checked. 10 highest values from the entire data range for each facility.
Yes, for all facilities (all Strata)
Thanks!

#### Posting Permissions

• You may not post new threads
• You may not post replies
• You may not post attachments
• You may not edit your posts
•