• Arranging groups of numbers into a normal distribution without histograms

    From marklukawsky@gmail.com@21:1/5 to All on Mon Feb 29 13:07:04 2016
    Hi All,

    I'm trying to figure out a formula or method to organize a list of numbers into a normal distribution shaped like a bell curve.

    I'm working in Excel...

    Currently I have a list of text representing store locations in column A and sales figures in column C. In column B I'm trying to generate a sales tier or bin. I want to organize the bins into a normal data distribution shaped like a bell curve. So the
    middle sales tier would have the highest count of locations.

    I've come up with a few statistical methods which generate the (1) size of each bin, and (2) the number of bins, but I can't seem to get it quite into a normal distribution.

    Is there any way to write a formula to generate the number of bins and bin size to organize the data into a normal distribution? Any help would be appreciated.

    Thank you!

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Bruce Weaver@21:1/5 to marklu...@gmail.com on Mon Feb 29 14:50:55 2016
    On Monday, February 29, 2016 at 4:07:07 PM UTC-5, marklu...@gmail.com wrote:
    Hi All,

    I'm trying to figure out a formula or method to organize a list of numbers into a normal distribution shaped like a bell curve.

    I'm working in Excel...

    Currently I have a list of text representing store locations in column A and sales figures in column C. In column B I'm trying to generate a sales tier or bin. I want to organize the bins into a normal data distribution shaped like a bell curve. So the
    middle sales tier would have the highest count of locations.

    I've come up with a few statistical methods which generate the (1) size of each bin, and (2) the number of bins, but I can't seem to get it quite into a normal distribution.

    Is there any way to write a formula to generate the number of bins and bin size to organize the data into a normal distribution? Any help would be appreciated.

    Thank you!

    Why do you want to do that? I ask, because as George Box famously observed, "in nature there never was a normal distribution, there never was a straight line, yet with normal and linear assumptions, known to be false, [the statistician] can often derive
    results which match, to a useful approximation, those found in the real world."

    Source: Section 2.5 of http://mkweb.bcgsc.ca/pointsofsignificance/img/Boxonmaths.pdf.

    HTH.

  • From Rich Ulrich@21:1/5 to All on Tue Mar 1 03:27:32 2016
    On Mon, 29 Feb 2016 13:07:04 -0800 (PST), marklukawsky@gmail.com
    wrote:

    Hi All,

    I'm trying to figure out a formula or method to organize a list of numbers into a normal distribution shaped like a bell curve.

    I'm working in Excel...

    Currently I have a list of text representing store locations in column A and sales figures in column C. In column B I'm trying to generate a sales tier or bin. I want to organize the bins into a normal data distribution shaped like a bell curve. So the
    middle sales tier would have the highest count of locations.

    I've come up with a few statistical methods which generate the (1) size of each bin, and (2) the number of bins, but I can't seem to get it quite into a normal distribution.

    Is there any way to write a formula to generate the number of bins and bin size to organize the data into a normal distribution? Any help would be appreciated.

    Thank you!

    My! Arbitrary widths? That sounds like someone is trying
    for an entry in some new edition of "How to Lie with Statistics".

    Don't do it.

    I Googled, and I have just glanced at this: Wikipedia has a nice
    article under "Histograms". It has a good discussion of algorithms
    for "number of bins".

    However, I don't see that it mentions what to do with data that
    cover a huge range and deserve to be transformed .... Maybe
    I should add that, or suggest it on the discussion page ....

    If you have a wide range, it could be natural to use transformation.

    Keep it a simple one, and one that is fairly natural for whatever is
    measured. I would have to say that the first choice is always "logs".

    Second, consider inverting the measure, such as Miles-per-gallon
    becoming Gallons-per-100 miles. - I once saw a very nice data
    presentation concerning the records at various distances for
    track meets, which achieved fine unification by converting all
    "times" for the records into "speed". They plotted 100 years of
    world-records, all distances, for males and females.

    For random counts of events, it could be that the square-root
    will be the "natural" transformation. What would inhibit me from
    using it is that it is seldom used by others. However, it is another
    one to consider.
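    The three transformations above -- logs, inverting a rate, and square
    roots -- can be sketched in a few lines of Python. The figures are
    invented purely for illustration; nothing here comes from the
    original poster's data:

    ```python
    import math

    # Hypothetical, right-skewed sales-like figures spanning a wide range.
    sales = [120, 450, 980, 2300, 5100, 11000, 26000, 60000]

    # First choice: logs -- compresses a huge range and often symmetrizes skew.
    log_sales = [math.log10(x) for x in sales]

    # Inverting a rate: miles-per-gallon -> gallons per 100 miles.
    mpg = [12.0, 20.0, 30.0, 50.0]
    gal_per_100mi = [100.0 / m for m in mpg]

    # Square root: a classic variance-stabilizer for random event counts.
    counts = [1, 4, 9, 16, 25]
    root_counts = [math.sqrt(c) for c in counts]
    ```

    The transformed values would then be binned as usual; the point is
    that the binning happens on a scale where the data are closer to
    symmetric.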

    --
    Rich Ulrich

  • From Rich Ulrich@21:1/5 to hrubin@skew.stat.purdue.edu on Tue Mar 1 20:28:17 2016
    On Tue, 1 Mar 2016 21:07:00 -0000 (UTC), Herman Rubin <hrubin@skew.stat.purdue.edu> wrote:



    One should never use a transformation without having a good reason
    not depending on the observed distribution to do it. Transforming to
    a normal distribution makes ALL subsequent analyses suspect.

    I would ask that the second sentence be started with some
    qualifier like "Arbitrarily..."

    Trying to analyze a distribution with tests that have normality
    assumptions, when you can be sure that the metric in use
    has great heterogeneity of variance, is a way to assure
    that your statistical tests are wrong.

    --
    Rich Ulrich

  • From marklukawsky@gmail.com@21:1/5 to All on Wed Mar 2 09:53:10 2016
    Thanks for all the replies, guys. I have read the Wikipedia page on histograms in depth, and I have come up with a few formulas that get my distribution as close to normal as possible. One was Sturges' method.

    The formulas I use are:

    # of bins = log2(# of data points) + 1    (Sturges' rule uses a base-2 logarithm)
    bin size = (max of data points - min of data points) / # of bins

    This always gets me close to a normal distribution. I was just wondering if anyone had anything better. Thanks!
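    A minimal sketch of Sturges' rule as described above, assuming the
    base-2 logarithm that the rule (k = ceil(log2(n) + 1)) actually calls
    for; the sample data are invented:

    ```python
    import math

    def sturges_bins(values):
        """Bin count and width by Sturges' rule: k = ceil(log2(n) + 1)."""
        n = len(values)
        k = math.ceil(math.log2(n) + 1)
        width = (max(values) - min(values)) / k
        return k, width

    # 101 hypothetical sales figures from 0 to 5000 in steps of 50.
    k, width = sturges_bins(list(range(0, 5001, 50)))  # k = 8, width = 625.0
    ```

    In Excel the bin count would be roughly =ROUNDUP(LOG(COUNT(C:C),2)+1,0).
    Note that Sturges' rule only sets how many equal-width bins to use; it
    does not reshape the data, so skewed sales figures will still produce
    a skewed histogram.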

  • From Herman Rubin@21:1/5 to Rich Ulrich on Tue Mar 1 21:07:00 2016
    On 2016-03-01, Rich Ulrich <rich.ulrich@comcast.net> wrote:
    On Mon, 29 Feb 2016 13:07:04 -0800 (PST), marklukawsky@gmail.com
    wrote:

    Hi All,

    I'm trying to figure out a formula or method to organize a list of numbers into a normal distribution shaped like a bell curve.

    I'm working in Excel...

    One of my colleagues had on his door a sign, the exact wording of
    which I do not remember, saying that friends should prevent friends
    from using Excel.

    Currently I have a list of text representing store locations in column
    A and sales figures in column C. In column B I'm trying to generate
    a sales tier or bin. I want to organize the bins into a normal data distribution shaped like a bell curve. So the middle sales tier would
    have the highest count of locations.

    I've come up with a few statistical methods which generate the (1)
    size of each bin, and (2) the number of bins, but I can't seem to get
    it quite into a normal distribution.

    Is there any way to write a formula to generate the number of bins and
    bin size to organize the data into a normal distribution? Any help would
    be appreciated.

    Thank you!

    My! Arbitrary widths? That sounds like someone is trying
    for an entry in some new edition of "How to Lie with Statistics".

    Don't do it.

    I Googled, and I have just glanced at this: Wikipedia has a nice
    article under "Histograms". It has a good discussion of algorithms
    for "number of bins".

    However, I don't see that it mentions what to do with data that
    cover a huge range and deserve to be transformed .... Maybe
    I should add that, or suggest it on the discussion page ....

    If you have a wide range, it could be natural to use transformation.

    Keep it a simple one, and one that is fairly natural for whatever is measured. I would have to say that the first choice is always "logs".

    Second, consider inverting the measure, such as Miles-per-gallon
    becoming Gallons-per-100 miles. - I once saw a very nice data
    presentation concerning the records at various distances for
    track meets, which achieved fine unification by converting all
    "times" for the records into "speed". They plotted 100 years of world-records, all distances, for males and females.

    For random counts of events, it could be that the square-root
    will be the "natural" transformation. What would inhibit me from
    using it is that it is seldom used by others. However, it is another
    one to consider.

    One should never use a transformation without having a good reason
    not depending on the observed distribution to do it. Transforming to
    a normal distribution makes ALL subsequent analyses suspect.

    I suggest you discuss your problem with someone who understands
    statistical theory, and who will try to get from YOU the underlying assumptions. ALL statistical procedures have assumptions, and the
    theorist is in a position to help you find the necessary approximations.


    --
    This address is for information only. I do not claim that these views
    are those of the Statistics Department or of Purdue University.
    Herman Rubin, Department of Statistics, Purdue University
    hrubin@stat.purdue.edu   Phone: (765) 494-6054   FAX: (765) 494-0558

  • From Herman Rubin@21:1/5 to Rich Ulrich on Wed Mar 2 19:57:51 2016
    On 2016-03-02, Rich Ulrich <rich.ulrich@comcast.net> wrote:
    On Tue, 1 Mar 2016 21:07:00 -0000 (UTC), Herman Rubin
    <hrubin@skew.stat.purdue.edu> wrote:



    One should never use a transformation without having a good reason
    not depending on the observed distribution to do it. Transforming to
    a normal distribution makes ALL subsequent analyses suspect.

    I would ask that the second sentence be started with some
    qualifier like "Arbitrarily..."

    Trying to analyze a distribution with tests that have normality
    assumptions, when you can be sure that the metric in use
    has great heterogeneity of variance, is a way to assure
    that your statistical tests are wrong.

    There are a few, such as tests for independence, which might still
    go over, but how much better are they than nonparametric tests?

    Testing for a correlation coefficient of 0 is fairly robust, but
    testing for any other value, or testing whether two are equal, is
    highly dependent on higher moments.

    Once Gauss proved the Gauss-Markov Theorem, he ceased worrying
    about whether the errors of observation were normal. And the
    word "normal" was introduced by Quetelet, who stated that it was
    the distribution of properties of a "normal" person, and the
    biologists and social scientists have been misusing it ever since.


    --
    This address is for information only. I do not claim that these views
    are those of the Statistics Department or of Purdue University.
    Herman Rubin, Department of Statistics, Purdue University
    hrubin@stat.purdue.edu   Phone: (765) 494-6054   FAX: (765) 494-0558

  • From marklukawsky@gmail.com@21:1/5 to All on Wed Mar 2 14:04:02 2016
    Thanks, Herman, but I think we are getting slightly off topic here :).

    I have a list of locations and their related sales volumes.

    Assuming the data follow a normal distribution, all I'm trying to do is determine a formula that gives me:
    (1) the optimal number of bins, and
    (2) bin ranges which create a normal distribution,

    so that the locations fall into the bin ranges in a normal distribution pattern, i.e., the highest count of locations falls in the middle tier.

    Thanks for the interesting commentary though!

  • From marklukawsky@gmail.com@21:1/5 to All on Wed Mar 2 14:19:54 2016
    What are your thoughts on using percentiles to set my ranges? I would group locations and their sales into tiers based on percentiles from the bell curve. Example:

    Tier A Upper Bound: Top 100% or Max of Data Set
    Tier B Upper Bound: Top 97.5%
    Tier C Upper Bound: Top 84%
    Tier D Upper Bound: Top 50%
    Tier E Upper Bound: Bottom 16%
    Tier F Upper Bound: Bottom 2.5%

    That way it forces it into a normal distribution.
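    The tier scheme above can be sketched as follows. The cut points are
    the data's own 2.5th/16th/50th/84th/97.5th percentiles (roughly
    z = -2, -1, 0, +1, +2 on a standard normal), and the sample sales
    figures are invented for illustration:

    ```python
    def percentile(sorted_vals, p):
        """Linear-interpolated percentile (p in [0, 100]) of pre-sorted data."""
        idx = (len(sorted_vals) - 1) * p / 100.0
        lo, hi = int(idx), min(int(idx) + 1, len(sorted_vals) - 1)
        frac = idx - lo
        return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac

    def tier(value, cuts):
        """Assign a tier: cuts are ascending upper bounds for F..B; above -> A."""
        for letter, cut in zip("FEDCB", cuts):
            if value <= cut:
                return letter
        return "A"

    sales = sorted([80, 95, 100, 102, 105, 110, 115, 130, 160, 400])
    # Cut points at the 2.5/16/50/84/97.5 percentiles (z = -2, -1, 0, +1, +2).
    cuts = [percentile(sales, p) for p in (2.5, 16, 50, 84, 97.5)]
    tiers = [tier(v, cuts) for v in sales]
    ```

    By construction about 34% of locations land in each of the middle
    tiers C and D, 13.5% in each of B and E, and 2.5% in each of A and F,
    so the bell shape is forced rather than observed.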

  • From Gordon Sande@21:1/5 to marklukawsky@gmail.com on Wed Mar 2 20:32:49 2016
    On 2016-03-02 22:04:02 +0000, marklukawsky@gmail.com said:

    Thanks Herman, but I think we are getting slightly off topic here :).

    I have a list of locations and their related sales volumes.

    Business data is more likely to be exponential than normal. Firm sizes
    are an accretion of multiplicative effects rather than additive effects
    in the usual economic theory.

    Assuming the data follow a normal distribution, all I'm trying to do is determine a formula that gives me:
    (1) gives me the optimal number of bins and
    (2) bin ranges which create a normal distribution

    so that the locations fall into the bin ranges in a normal distribution pattern, i.e., the highest count of locations falls in the middle tier.

    Thanks for the interesting commentary though!

  • From Rich Ulrich@21:1/5 to marklukawsky@gmail.com on Wed Mar 2 21:49:15 2016
    On Wed, 2 Mar 2016 14:19:54 -0800 (PST), marklukawsky@gmail.com wrote:

    What are your thoughts on using percentiles to set my ranges? I would group locations and their sales into tiers based on percentiles from the bell curve. Example:

    Tier A Upper Bound: Top 100% or Max of Data Set
    Tier B Upper Bound: Top 97.5%
    Tier C Upper Bound: Top 84%
    Tier D Upper Bound: Top 50%
    Tier E Upper Bound: Bottom 16%
    Tier F Upper Bound: Bottom 2.5%

    That way it forces it into a normal distribution.

    Yeah, but nobody does that. Notice, defining the points that
    way makes the histogram totally redundant -- all you need
    to show is the cutoffs for the ranges.

    If you want to show "typical" points along the distribution,
    you are trying to start with "normal" cutoffs, which I think
    most readers will not relate to. (Nobody does that.)

    I suggest using the percentiles 0-25-50-75-100.
    Or, similarly, showing what scores mark the deciles.
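    The quartile/decile alternative suggested here is a one-liner with
    Python's standard library (again with invented sample figures):

    ```python
    import statistics

    sales = [80, 95, 100, 102, 105, 110, 115, 130, 160, 400]

    # Quartile cut points (25th/50th/75th percentiles): three cutoffs, four tiers.
    quartile_cuts = statistics.quantiles(sales, n=4)

    # Decile cut points: nine cutoffs marking each successive 10% of the data.
    decile_cuts = statistics.quantiles(sales, n=10)
    ```

    Unlike the forced-normal scheme, equal-count tiers make no
    distributional assumption: each quartile tier holds about a quarter
    of the locations, which is usually what readers expect.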

    --
    Rich Ulrich
