• Arranging groups of numbers into a normal distribution without histograms

    From marklukawsky@gmail.com@21:1/5 to All on Mon Feb 29 13:07:04 2016
    Hi All,

    I'm trying to figure out a formula or method to organize a list of numbers into a normal distribution shaped like a bell curve.

    I'm working in Excel...

    Currently I have a list of text representing store locations in column A and sales figures in column C. In column B I'm trying to generate a sales tier or bin. I want to organize the bins into a normal data distribution shaped like a bell curve. So the
    middle sales tier would have the highest count of locations.

    I've come up with a few statistical methods which generate the (1) size of each bin, and (2) the number of bins, but I can't seem to get it quite into a normal distribution.

    Is there any way to write a formula to generate the number of bins and bin size to organize the data into a normal distribution? Any help would be appreciated.

    Thank you!

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Bruce Weaver@21:1/5 to marklu...@gmail.com on Mon Feb 29 14:50:55 2016
    On Monday, February 29, 2016 at 4:07:07 PM UTC-5, marklu...@gmail.com wrote:
    Hi All,

    I'm trying to figure out a formula or method to organize a list of numbers into a normal distribution shaped like a bell curve.

    I'm working in Excel...

    Currently I have a list of text representing store locations in column A and sales figures in column C. In column B I'm trying to generate a sales tier or bin. I want to organize the bins into a normal data distribution shaped like a bell curve. So the
    middle sales tier would have the highest count of locations.

    I've come up with a few statistical methods which generate the (1) size of each bin, and (2) the number of bins, but I can't seem to get it quite into a normal distribution.

    Is there any way to write a formula to generate the number of bins and bin size to organize the data into a normal distribution? Any help would be appreciated.

    Thank you!

    Why do you want to do that? I ask, because as George Box famously observed, "in nature there never was a normal distribution, there never was a straight line, yet with normal and linear assumptions, known to be false, [the statistician] can often derive
    results which match, to a useful approximation, those found in the real world."

    Source: Section 2.5 of http://mkweb.bcgsc.ca/pointsofsignificance/img/Boxonmaths.pdf.

    HTH.

  • From Rich Ulrich@21:1/5 to All on Tue Mar 1 03:27:32 2016
    On Mon, 29 Feb 2016 13:07:04 -0800 (PST), marklukawsky@gmail.com
    wrote:

    Hi All,

    I'm trying to figure out a formula or method to organize a list of numbers into a normal distribution shaped like a bell curve.

    I'm working in Excel...

    Currently I have a list of text representing store locations in column A and sales figures in column C. In column B I'm trying to generate a sales tier or bin. I want to organize the bins into a normal data distribution shaped like a bell curve. So the
    middle sales tier would have the highest count of locations.

    I've come up with a few statistical methods which generate the (1) size of each bin, and (2) the number of bins, but I can't seem to get it quite into a normal distribution.

    Is there any way to write a formula to generate the number of bins and bin size to organize the data into a normal distribution? Any help would be appreciated.

    Thank you!

    My! Arbitrary widths? That sounds like someone is trying
    for an entry in some new edition of "How to Lie with Statistics".

    Don't do it.

    I Googled, and I have just glanced at this: Wikipedia has a nice
    article under "Histograms". It has a good discussion of algorithms
    for "number of bins".

    However, I don't see that it mentions what to do with data that
    cover a huge range and deserve to be transformed .... Maybe
    I should add that, or suggest it on the discussion page ....

    If you have a wide range, it could be natural to use transformation.

    Keep it a simple one, and one that is fairly natural for whatever is
    measured. I would have to say that the first choice is always "logs".

    Second, consider inverting the measure, such as Miles-per-gallon
    becoming Gallons-per-100 miles. - I once saw a very nice data
    presentation concerning the records at various distances for
    track meets, which achieved fine unification by converting all
    "times" for the records into "speed". They plotted 100 years of
    world-records, all distances, for males and females.

    For random counts of events, it could be that the square-root
    will be the "natural" transformation. What would inhibit me from
    using it is that it is seldom used by others. However, it is another
    one to consider.
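    The three transformations above -- logs, inverting a rate, and square
    roots -- can be sketched in a few lines of Python. The figures are
    invented purely for illustration; nothing here comes from the
    original poster's data:

    ```python
    import math

    # Hypothetical, right-skewed sales-like figures spanning a wide range.
    sales = [120, 450, 980, 2300, 5100, 11000, 26000, 60000]

    # First choice: logs -- compresses a huge range and often symmetrizes skew.
    log_sales = [math.log10(x) for x in sales]

    # Inverting a rate: miles-per-gallon -> gallons per 100 miles.
    mpg = [12.0, 20.0, 30.0, 50.0]
    gal_per_100mi = [100.0 / m for m in mpg]

    # Square root: a classic variance-stabilizer for random event counts.
    counts = [1, 4, 9, 16, 25]
    root_counts = [math.sqrt(c) for c in counts]
    ```

    The transformed values would then be binned as usual; the point is
    that the binning happens on a scale where the data are closer to
    symmetric.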

    --
    Rich Ulrich

  • From Rich Ulrich@21:1/5 to hrubin@skew.stat.purdue.edu on Tue Mar 1 20:28:17 2016
    On Tue, 1 Mar 2016 21:07:00 -0000 (UTC), Herman Rubin <hrubin@skew.stat.purdue.edu> wrote:



    One should never use a transformation without having a good reason
    not depending on the observed distribution to do it. Transforming to
    a normal distribution makes ALL subsequent analyses suspect.

    I would ask that the second sentence be started with some
    qualifier like "Arbitrarily..."

    Trying to analyze a distribution with tests that have normality
    assumptions, when you can be sure that the metric in use
    has great heterogeneity of variance, is a way to assure
    that your statistical tests are wrong.

    --
    Rich Ulrich

  • From marklukawsky@gmail.com@21:1/5 to All on Wed Mar 2 09:53:10 2016
    Thanks for all the replies, guys. I have read the Wikipedia page on histograms in depth, and I have come up with a few formulas that get my distribution as close to normal as possible. One was Sturges' method.

    The formulas I use are:

    # of bins = log2(# of data points) + 1    (Sturges' rule uses a base-2 logarithm)
    bin size = (max of data points - min of data points) / # of bins

    This always gets me close to a normal distribution. I was just wondering if anyone had anything better. Thanks!
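    A minimal sketch of Sturges' rule as described above, assuming the
    base-2 logarithm that the rule (k = ceil(log2(n) + 1)) actually calls
    for; the sample data are invented:

    ```python
    import math

    def sturges_bins(values):
        """Bin count and width by Sturges' rule: k = ceil(log2(n) + 1)."""
        n = len(values)
        k = math.ceil(math.log2(n) + 1)
        width = (max(values) - min(values)) / k
        return k, width

    # 101 hypothetical sales figures from 0 to 5000 in steps of 50.
    k, width = sturges_bins(list(range(0, 5001, 50)))  # k = 8, width = 625.0
    ```

    In Excel the bin count would be roughly =ROUNDUP(LOG(COUNT(C:C),2)+1,0).
    Note that Sturges' rule only sets how many equal-width bins to use; it
    does not reshape the data, so skewed sales figures will still produce
    a skewed histogram.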

  • From Herman Rubin@21:1/5 to Rich Ulrich on Tue Mar 1 21:07:00 2016
    On 2016-03-01, Rich Ulrich <rich.ulrich@comcast.net> wrote:
    On Mon, 29 Feb 2016 13:07:04 -0800 (PST), marklukawsky@gmail.com
    wrote:

    Hi All,

    I'm trying to figure out a formula or method to organize a list of numbers into a normal distribution shaped like a bell curve.

    I'm working in Excel...

    One of my colleagues had on his door a sign, the exact wording of
    which I do not remember, saying that friends should prevent friends
    from using Excel.

    Currently I have a list of text representing store locations in column
    A and sales figures in column C. In column B I'm trying to generate
    a sales tier or bin. I want to organize the bins into a normal data distribution shaped like a bell curve. So the middle sales tier would
    have the highest count of locations.

    I've come up with a few statistical methods which generate the (1)
    size of each bin, and (2) the number of bins, but I can't seem to get
    it quite into a normal distribution.

    Is there any way to write a formula to generate the number of bins and
    bin size to organize the data into a normal distribution? Any help would
    be appreciated.

    Thank you!

    My! Arbitrary widths? That sounds like someone is trying
    for an entry in some new edition of "How to Lie with Statistics".

    Don't do it.

    I Googled, and I have just glanced at this: Wikipedia has a nice
    article under "Histograms". It has a good discussion of algorithms
    for "number of bins".

    However, I don't see that it mentions what to do with data that
    cover a huge range and deserve to be transformed .... Maybe
    I should add that, or suggest it on the discussion page ....

    If you have a wide range, it could be natural to use transformation.

    Keep it a simple one, and one that is fairly natural for whatever is measured. I would have to say that the first choice is always "logs".

    Second, consider inverting the measure, such as Miles-per-gallon
    becoming Gallons-per-100 miles. - I once saw a very nice data
    presentation concerning the records at various distances for
    track meets, which achieved fine unification by converting all
    "times" for the records into "speed". They plotted 100 years of world-records, all distances, for males and females.

    For random counts of events, it could be that the square-root
    will be the "natural" transformation. What would inhibit me from
    using it is that it is seldom used by others. However, it is another
    one to consider.

    One should never use a transformation without having a good reason
    not depending on the observed distribution to do it. Transforming to
    a normal distribution makes ALL subsequent analyses suspect.

    I suggest you discuss your problem with someone who understands
    statistical theory, and who will try to get from YOU the underlying assumptions. ALL statistical procedures have assumptions, and the
    theorist is in a position to help you find the necessary approximations.


    --
    This address is for information only. I do not claim that these views
    are those of the Statistics Department or of Purdue University.
    Herman Rubin, Department of Statistics, Purdue University
    hrubin@stat.purdue.edu   Phone: (765) 494-6054   FAX: (765) 494-0558

  • From Herman Rubin@21:1/5 to Rich Ulrich on Wed Mar 2 19:57:51 2016
    On 2016-03-02, Rich Ulrich <rich.ulrich@comcast.net> wrote:
    On Tue, 1 Mar 2016 21:07:00 -0000 (UTC), Herman Rubin
    <hrubin@skew.stat.purdue.edu> wrote:



    One should never use a transformation without having a good reason
    not depending on the observed distribution to do it. Transforming to
    a normal distribution makes ALL subsequent analyses suspect.

    I would ask that the second sentence be started with some
    qualifier like "Arbitrarily..."

    Trying to analyze a distribution with tests that have normality
    assumptions, when you can be sure that the metric in use
    has great heterogeneity of variance, is a way to assure
    that your statistical tests are wrong.

    There are a few, such as tests for independence, which might still
    go over, but how much better are they than nonparametric tests?

    Testing for a correlation coefficient of 0 is fairly robust, but
    testing for any other value, or testing whether two are equal, is
    highly dependent on higher moments.

    Once Gauss proved the Gauss-Markov Theorem, he ceased worrying
    about whether the errors of observation were normal. And the
    word "normal" was introduced by Quetelet, who stated that it was
    the distribution of properties of a "normal" person, and the
    biologists and social scientists have been misusing it ever since.


    --
    This address is for information only. I do not claim that these views
    are those of the Statistics Department or of Purdue University.
    Herman Rubin, Department of Statistics, Purdue University
    hrubin@stat.purdue.edu   Phone: (765) 494-6054   FAX: (765) 494-0558

  • From marklukawsky@gmail.com@21:1/5 to All on Wed Mar 2 14:04:02 2016
    Thanks, Herman, but I think we are getting slightly off topic here :).

    I have a list of locations and their related sales volumes.

    Assuming the data follow a normal distribution, all I'm trying to do is determine a formula that gives me:
    (1) the optimal number of bins, and
    (2) bin ranges which create a normal distribution,

    so that the locations fall into the bin ranges in a normal distribution pattern, i.e., the highest count of locations falls in the middle tier.

    Thanks for the interesting commentary though!

  • From marklukawsky@gmail.com@21:1/5 to All on Wed Mar 2 14:19:54 2016
    What are your thoughts on using percentiles to set my ranges? I would group locations and their sales into tiers based on percentiles from the bell curve. Example:

    Tier A Upper Bound: Top 100% or Max of Data Set
    Tier B Upper Bound: Top 97.5%
    Tier C Upper Bound: Top 84%
    Tier D Upper Bound: Top 50%
    Tier E Upper Bound: Bottom 16%
    Tier F Upper Bound: Bottom 2.5%

    That way it forces it into a normal distribution.
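    The tier scheme above can be sketched as follows. The cut points are
    the data's own 2.5th/16th/50th/84th/97.5th percentiles (roughly
    z = -2, -1, 0, +1, +2 on a standard normal), and the sample sales
    figures are invented for illustration:

    ```python
    def percentile(sorted_vals, p):
        """Linear-interpolated percentile (p in [0, 100]) of pre-sorted data."""
        idx = (len(sorted_vals) - 1) * p / 100.0
        lo, hi = int(idx), min(int(idx) + 1, len(sorted_vals) - 1)
        frac = idx - lo
        return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac

    def tier(value, cuts):
        """Assign a tier: cuts are ascending upper bounds for F..B; above -> A."""
        for letter, cut in zip("FEDCB", cuts):
            if value <= cut:
                return letter
        return "A"

    sales = sorted([80, 95, 100, 102, 105, 110, 115, 130, 160, 400])
    # Cut points at the 2.5/16/50/84/97.5 percentiles (z = -2, -1, 0, +1, +2).
    cuts = [percentile(sales, p) for p in (2.5, 16, 50, 84, 97.5)]
    tiers = [tier(v, cuts) for v in sales]
    ```

    By construction about 34% of locations land in each of the middle
    tiers C and D, 13.5% in each of B and E, and 2.5% in each of A and F,
    so the bell shape is forced rather than observed.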

  • From Gordon Sande@21:1/5 to marklukawsky@gmail.com on Wed Mar 2 20:32:49 2016
    On 2016-03-02 22:04:02 +0000, marklukawsky@gmail.com said:

    Thanks Herman, but I think we are getting slightly off topic here :).

    I have a list of locations and their related sales volumes.

    Business data is more likely to be exponential than normal. Firm sizes
    are an accretion of multiplicative effects rather than additive effects
    in the usual economic theory.

    Assuming the data follow a normal distribution, all I'm trying to do is determine a formula that gives me:
    (1) gives me the optimal number of bins and
    (2) bin ranges which create a normal distribution

    so that the locations fall into the bin ranges in a normal distribution pattern, i.e., the highest count of locations falls in the middle tier.

    Thanks for the interesting commentary though!

  • From Rich Ulrich@21:1/5 to marklukawsky@gmail.com on Wed Mar 2 21:49:15 2016
    On Wed, 2 Mar 2016 14:19:54 -0800 (PST), marklukawsky@gmail.com wrote:

    What are your thoughts on using percentiles to set my ranges? I would group locations and their sales into tiers based on percentiles from the bell curve. Example:

    Tier A Upper Bound: Top 100% or Max of Data Set
    Tier B Upper Bound: Top 97.5%
    Tier C Upper Bound: Top 84%
    Tier D Upper Bound: Top 50%
    Tier E Upper Bound: Bottom 16%
    Tier F Upper Bound: Bottom 2.5%

    That way it forces it into a normal distribution.

    Yeah, but nobody does that. Notice, defining the points that
    way makes the histogram totally redundant -- all you need
    to show is the cutoffs for the ranges.

    If you want to show "typical" points along the distribution,
    you are trying to start with "normal" cutoffs, which I think
    most readers will not relate to. (Nobody does that.)

    I suggest using the percentiles 0-25-50-75-100.
    Or, similarly, showing what scores mark the deciles.
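    The quartile/decile alternative suggested here is a one-liner with
    Python's standard library (again with invented sample figures):

    ```python
    import statistics

    sales = [80, 95, 100, 102, 105, 110, 115, 130, 160, 400]

    # Quartile cut points (25th/50th/75th percentiles): three cutoffs, four tiers.
    quartile_cuts = statistics.quantiles(sales, n=4)

    # Decile cut points: nine cutoffs marking each successive 10% of the data.
    decile_cuts = statistics.quantiles(sales, n=10)
    ```

    Unlike the forced-normal scheme, equal-count tiers make no
    distributional assumption: each quartile tier holds about a quarter
    of the locations, which is usually what readers expect.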

    --
    Rich Ulrich
