• Statistics of problem-solution pairs

    From David Stork@21:1/5 to All on Wed Mar 1 17:57:44 2023
    Are there any large-scale simulation studies of the statistics of symbolic problem-solution pairs?

    Let's consider just symbolic integration of some class of functions, for instance the class covered by the Risch algorithm (exponentials, logarithms, trigonometric functions, addition, subtraction, multiplication, and division). Suppose we quantify the "size" of an integration problem by the number of leaves in its tree-based representation, and likewise for the problem's anti-derivative.
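
    The size measure described above can be sketched in a few lines. This is only an illustration under stated assumptions: the nested-tuple encoding and the name leaf_count are hypothetical, and only operands (symbols and constants) are counted as leaves. Other conventions, for example also counting operator nodes, would change the numbers but not the idea.

```python
# Assumed encoding (not from the thread): an expression is a nested tuple
# (operator, arg1, arg2, ...). Anything that is not a tuple, i.e. a symbol
# name or a number, is a leaf.

def leaf_count(expr):
    """Count the leaves (symbols and constants) of an expression tree."""
    if not isinstance(expr, tuple):
        return 1
    # expr[0] is the operator name; the remaining entries are subexpressions.
    return sum(leaf_count(arg) for arg in expr[1:])

# x * sin(x) + 1 has the leaves x, x, and 1.
integrand = ("+", ("*", "x", ("sin", "x")), 1)
print(leaf_count(integrand))  # 3
```

    The same function would be applied to the antiderivative returned by whatever integrator is used, giving one (problem size, solution size) pair per problem.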

    Problems of a given size may have solutions spanning a range of sizes, of course. Thus there is some statistical distribution of the sizes, and thus a statistical relation (perhaps correlation) between problem and solution sizes.

    So if we double the size of an integration problem will the size of its solution double? or increase faster than linear? or slower than linear? What about the variance in sizes? Or other statistics?
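
    One hedged way to make the "faster or slower than linear" question concrete is to fit a slope on log-log axes: a slope near 1 would suggest linear growth, above 1 faster than linear, below 1 slower. The sketch below assumes the (problem size, solution size) pairs have already been collected. The function name loglog_slope is hypothetical, and the data are synthetic placeholders chosen only to exercise the code, not measurements from any integrator.

```python
import math

def loglog_slope(pairs):
    """Least-squares slope of log(solution size) against log(problem size)."""
    xs = [math.log(p) for p, _ in pairs]
    ys = [math.log(s) for _, s in pairs]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

# Synthetic placeholder data: solution size is exactly the square of the
# problem size, so the fitted slope should come out as 2.
synthetic = [(n, n * n) for n in (2, 4, 8, 16, 32)]
print(round(loglog_slope(synthetic), 6))  # 2.0
```

    A single slope says nothing about the variance at a fixed problem size, so a real study would also need to look at the spread of solution sizes within each size class.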

    Naturally there is an infinite number of integration problems of a given size, so any simulation study will have inherent uncertainties. Nevertheless, this seems like an interesting, and potentially fruitful, class of simulation problems, for which we must use large-scale computer-algebra systems.

    Anyone interested in working on this?

    --David G. Stork

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Albert Rich@21:1/5 to David Stork on Wed Mar 1 19:12:43 2023
    From my experience implementing symbolic integrators, there is little to no correlation between the sizes of integrands and the sizes of their optimal antiderivatives. Derivatives of relatively small expressions can be huge. Conversely, antiderivatives of relatively small expressions can be huge.
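
    The remark about derivatives can be illustrated with a toy differentiator. This is a sketch under loud assumptions: the nested-tuple encoding and the names leaf_count and diff are hypothetical, only +, binary *, sin, and cos are handled, and no simplification is applied. The unsimplified growth therefore overstates what a real system would return; it shows only that size is not preserved, and says nothing about optimal forms.

```python
def leaf_count(expr):
    """Count symbols and constants in a nested-tuple expression."""
    if not isinstance(expr, tuple):
        return 1
    return sum(leaf_count(arg) for arg in expr[1:])

def diff(expr, var="x"):
    """Differentiate with respect to var, applying no simplification."""
    if not isinstance(expr, tuple):
        return 1 if expr == var else 0
    op = expr[0]
    if op == "+":
        return ("+",) + tuple(diff(a, var) for a in expr[1:])
    if op == "*":
        _, u, v = expr  # binary products only
        return ("+", ("*", diff(u, var), v), ("*", u, diff(v, var)))
    if op == "sin":
        u = expr[1]
        return ("*", ("cos", u), diff(u, var))
    if op == "cos":
        u = expr[1]
        return ("*", ("*", -1, ("sin", u)), diff(u, var))
    raise ValueError(f"unsupported operator: {op!r}")

# Leaf counts of x*sin(x) and its first two unsimplified derivatives.
expr = ("*", "x", ("sin", "x"))
sizes = []
for _ in range(3):
    sizes.append(leaf_count(expr))
    expr = diff(expr)
print(sizes)  # [2, 5, 15]
```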

    Albert

  • From David Stork@21:1/5 to Albert Rich on Wed Mar 1 22:04:18 2023
    Albert,

    My experience over decades suggests instead that there IS (or at least MAY BE) interesting structure in the relation between problems and solutions, and a principled exploratory study might give unexpected results.

    Here's one:

    Integrate[((c + d Tan[e + f x])^{5/2}(a + b Tan[e + f x] + c Tan[e + f x]^2))/(a + b Tan[ e + f x])^{9/2},x]

    has a solution that takes 36 Mbytes to write out!

    --David Stork

  • From Albert Rich@21:1/5 to David Stork on Thu Mar 2 00:35:30 2023
    Your example validates my point. The antiderivative of this relatively small integrand is huge, but the antiderivatives of many other small integrands are small. Thus there is no correlation.

    I recommend you use parentheses rather than curly brackets around fractional exponents, since curly brackets designate lists in Mathematica.

    BTW, the valid antiderivative that Rubi (the Rule-Based Integrator) returns for your example has a leaf count of only 789. So if you are planning to use antiderivative leaf size in your research, I suggest using the size of optimal antiderivatives like those Rubi usually delivers. Rubi is freely available for downloading at https://rulebasedintegration.org/

    Albert

  • From Richard Fateman@21:1/5 to David Stork on Sun Apr 9 13:30:31 2023
    While there are easy ways of sometimes bounding the size of the derivative of an expression, you rapidly
    run out of rules for more general "problems/solutions". For instance, here is a problem.
    expand ((x+1)^(2^n)).
    characterize the number of terms as a function of the integer n.
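
    For this particular problem the count can be checked directly: by the binomial theorem the expansion has one nonzero term for each power of x from 0 to 2^n, so 2^n + 1 terms, while the unexpanded input stays the same size apart from the digits of n. The sketch below confirms this for small n. The function names are hypothetical, and polynomials are represented as plain coefficient lists rather than through any computer-algebra system.

```python
def poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists (lowest degree first)."""
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def expand_power_of_two(n):
    """Coefficients of (x + 1)**(2**n), obtained by squaring n times."""
    p = [1, 1]  # x + 1
    for _ in range(n):
        p = poly_mul(p, p)
    return p

def term_count(p):
    """Number of nonzero terms in the expanded polynomial."""
    return sum(1 for c in p if c != 0)

for n in range(6):
    print(n, term_count(expand_power_of_two(n)))
# prints 0 2, 1 3, 2 5, 3 9, 4 17, 5 33 (one pair per line)
```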

    Simulation of random expression problems could (in general) produce super-exponential growth in size.

    Another study of "statistics" might be -- given a free on-line computer algebra system that
    does some task (you can pick indefinite integration), what is the distribution of inputs
    from [random?] clients?
    I suspect you will get (a) homework problems and (b) people just trying it out to see if
    it works. People can ask ChatGPT to do math, and it sometimes gets the right answer.
    It sometimes doesn't.
    A (much) earlier experiment we ran at Berkeley, TILU, collected some problems (maybe a few hundred?) and mostly found that people were unlikely to master the first stage of the problem: getting the syntax right. Thus we collected stuff like
    sin x, sinx, sin(x), Sin(x), Sin[x], SinX.

    As for whether this is interesting or not, I would not expect simulation -- where you write
    a program P to generate problems -- to reveal much other than the behavior of program P.
    RJF


