• an AI looks at Nessie sightings (1/2)

    From MrPostingRobot@kymhorsell.com@21:1/5 to All on Sat Apr 3 18:57:56 2021
    I mentioned before I have been working on an AI that is intended to
    "do science research" aka "unmanned science".

    I'll post this brief (?) example of what it does so far as an
    illustration of a "new flavor" of scientific research that I see as
    being a possible growth area over the next few years.

    I have some experience in the area having worked in science and
    engineering over the past 50y. While I'm not a great writer of
    scientific papers, the publishers tell me I have been given credit in
    about 400 published papers so far, mostly in physics. In the past few
    years I have changed careers for the Nth time and retooled as a data
    scientist. A few years ago, when there were about 1/2 mn
    registered participants at KAGGLE, I was ranked #189 in the world. I
    must be doing something approximately right. <kaggle.com/kymhorsell1>

    I've written about the basic philosophy of AI science research
    elsewhere so the following is mostly restricted to a nice area that
    sci.skeptic can get its teeth into. Although I probably won't be
    listening. At an advanced age, and in somewhat ill health on and off
    for the past decade, I'm not all that interested in listening to
    lectures in one of my subject areas from people quoting magazine
    articles that neither understand the subject matter nor cite research
    that represents it accurately. But maybe that's
    just me. :)

    The topic today is some kind of AI s/w examining the basic data for
    sightings of a lake monster in Scottyland. I allays use the totely
    korrect speling so pls no complaints about thet.

    While witnesses over decades, and perhaps a couple of centuries, swear
    black and blue they have seen "something" in the lake, and there is
    even interesting footage of "something" making a wave in the canal
    connecting it with the sea, no-one seems to have landed a sample of the
    beast for everyone to get a good close reproducible look at it.

    This can be galling to students of hard science. But it is an
    everyday situation for "science" of other types where Nature can't
    be made to dance a jig in the lab at will. E.g. astronomy.

    So "working from data only" would seem to be a valid kind of science.
    Theories can be proposed and checked against the data available. From
    the results of one test, further tests can be proposed and validated
    (or not).

    And that's what my little AI s/w does, in a way that is simple at
    its core but growing steadily more complex.

    The basic workhorses are a bunch of robust statistical routines that
    can handle different kinds of data but basically line up one dataset
    with another dataset and determine how closely they match and whether
    the match is too close to be just due to chance.
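
    To give the flavor, here is a toy Python sketch of one such "line
    up and compare" routine -- not the real code, just the idea:
    correlate two equal-length series, then use a permutation test to
    ask how often a match at least that strong turns up by pure chance.

        import numpy as np

        def match_strength(a, b, n_perm=10_000, rng=None):
            """Return (|correlation|, permutation p-value) for 2 series."""
            rng = rng or np.random.default_rng()
            a, b = np.asarray(a, float), np.asarray(b, float)
            observed = abs(np.corrcoef(a, b)[0, 1])
            # Shuffle one series repeatedly; count matches at least
            # as strong as the one observed.
            hits = sum(
                abs(np.corrcoef(a, rng.permutation(b))[0, 1]) >= observed
                for _ in range(n_perm)
            )
            return observed, (hits + 1) / (n_perm + 1)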

    To update anyone that may not know, Data Science (pardon the use of
    capitals) is these days up to the task of establishing causal links at
    least along the lines intro'ed in the 1960s Surgeon General's report
    into the link between tobacco and lung cancer. I.e. it builds
    predictive models that predict forward in time, backwards in time, and
    across a cross section of subsets. If all these pass then causation
    has been scientifically established. In the past few years this has
    been a hot area of research in the D.S. literature, so things have
    grown more complex than that 1960s basis. Way more.
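
    To show the shape of just the forward-in-time leg, here is a toy
    example using the off-the-shelf Granger causality test from
    statsmodels. The series here are synthetic stand-ins, not my s/w's
    actual battery of tests.

        import numpy as np
        from statsmodels.tsa.stattools import grangercausalitytests

        rng = np.random.default_rng(0)
        cause = rng.normal(size=200)
        # "effect" is a lagged copy of "cause" plus noise
        effect = np.roll(cause, 1) + 0.1 * rng.normal(size=200)

        # Column order matters: does column 2 help predict column 1?
        data = np.column_stack([effect, cause])
        for lag, (tests, _) in grangercausalitytests(data, maxlag=3).items():
            print(lag, "p =", round(tests["ssr_ftest"][1], 4))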

    Behind the stats s/w are some learning and reasoning capabilities. In
    designing this s/w we anticipated it would have to sort through 1000s
    of mns of datasets to make a decision. Since the stats for just 1
    test can take several seconds of cpu, it's not a reasonable idea to
    actually run every test.

    And that's where the AI part comes in. A learning module runs tests
    and looks at the results, building a robust predictive model that can
    predict results from metadata of the inputs. Whenever it's asked to
    run a stats test it can decide whether it can predict the answer
    within a given interval of certainty, and if it thinks it can, it
    "guesstimates" the answer rather than doing the work.

    This feature has proved to be a kind of AI backbone for the whole
    s/w. Practically every decision in the whole program is run through
    this "smart memoization" mechanism. If any kind of logical test (and
    there are 1000s in the s/w -- i.e. all the "if statements" for one
    thing) it runs that test first through a learning algorithm that can
    either answer "I know the answer with 99% certainty" and avoid doing
    the work involved or it can not and does the work involved. The
    similarity with a version of limited time travel quantum computation
    is not entirely coincidental. :)
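
    The skeleton of that mechanism looks something like the following
    toy Python, where the learner is a deliberately dumb stand-in (the
    real one is a proper predictive model):

        class FreqLearner:
            """Toy stand-in: remember outcomes per feature key, report
            a Laplace-smoothed confidence."""
            def __init__(self):
                self.counts = {}   # key -> (times_true, times_false)

            def predict(self, key):
                t, f = self.counts.get(key, (0, 0))
                return t >= f, (max(t, f) + 1) / (t + f + 2)

            def learn(self, key, result):
                t, f = self.counts.get(key, (0, 0))
                self.counts[key] = (t + bool(result), f + (not result))

        def smart_memo(learner, threshold=0.99):
            """Wrap an expensive test; answer from the learner when it
            is confident enough, otherwise do the work and learn."""
            def wrap(expensive_test):
                def run(key):
                    guess, conf = learner.predict(key)
                    if conf >= threshold:
                        return guess              # skip the work
                    result = expensive_test(key)  # do the work
                    learner.learn(key, result)
                    return result
                return run
            return wrap

    With the 0.99 threshold this toy learner only starts skipping a
    test after seeing it come out the same way about a hundred times in
    a row; the real model generalizes from metadata instead.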

    Behind these various modules is the database. At present this is a
    quite large 1 TB database maintained using mysql on a unix
    machine. Various associated s/w updates the database from time to time
    from known websites that are maintained by different science groups
    around the place. It can also hunt up and download some kinds of
    dataset given meta-information or a series of specific-enough
    keywords. That's how it got hold of the Nessie sighting data, for
    example. And
    such hunt-up queries can come from me or can be internally generated
    by the problem-solving part of the s/w. Obviously we have to put some
    brakes on this part or my 10 GB monthly net quota will get blown in
    the first days of every month.
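
    Think of the "brakes" as something like a byte budget on the
    fetching side -- roughly this shape, with made-up numbers:

        import urllib.request

        class QuotaFetcher:
            """Refuse to download once a monthly byte budget is spent."""
            def __init__(self, monthly_budget=10 * 2**30):  # ~10 GB
                self.budget = monthly_budget
                self.spent = 0

            def fetch(self, url):
                if self.spent >= self.budget:
                    raise RuntimeError("monthly net quota exhausted")
                data = urllib.request.urlopen(url).read()
                self.spent += len(data)
                return data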

    The results we'll now turn to are the basic "guts" of the s/w to
    date. Given a dataset of some phenomenon, it looks through (currently)
    28,000 base datasets, blowing out to 3 mn derived datasets
    after each base dataset is manipulated in various ways -- e.g. time
    shifting, various kinds of transforms like differentiation,
    integration, scaling, binning, categorizing, de-categorizing,
    de-trending, deseasonalizing, homogenization with respect to some
    criterion, etc etc.
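
    For concreteness, here is a cut-down sketch of that blow-out step
    covering just a handful of the transforms named above; the real
    list is much longer.

        import numpy as np

        def derive(base):
            """Spin one base series into a dict of derived series."""
            base = np.asarray(base, float)
            out = {}
            for lag in range(-12, 13):              # time shifting
                out[f"shift{lag:+d}"] = np.roll(base, lag)
            out["diff"] = np.diff(base)             # differentiation
            out["cumsum"] = np.cumsum(base)         # integration
            out["zscore"] = (base - base.mean()) / base.std()  # scaling
            t = np.arange(len(base))
            trend = np.polyval(np.polyfit(t, base, 1), t)
            out["detrended"] = base - trend         # de-trending
            return out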

    So there's a lot of work to get through. But luckily its accumulated
    experience with 1000s of problems over the past couple years now
    allows it to skip 99% of the work and take a lot of shortcuts with the
    1% it actually needs to process. For the processing it has the
    benefit of an old setup I used to run as a business (exaflops.com) a
    few years back. <kymhorsell.com/garage-pc>. So it can generally get
    through the work in a few minutes despite the increasing decrepitude
    of the hardware available. :}

    So to the results. We are interested in which data series in the big
    database closely match aka "look like" the target dataset -- in this
    case the dates of "registered" Nessie sightings from some web database.

    Some of the results will be shocking if you haven't seen this kind of
    "fringe" stuff before.

    The part of the s/w actively under development is intended to create
    "every possible theory" (up to a given complexity level), evaluate
    how likely each is to be true according to all the results from the
    statistical analysis, and then spit out a more or less natural
    language explanation of around 1 page in length -- something you
    could give to a magazine journo, for example. :)
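
    A sketch of how such an enumeration might look, with the candidate
    claims and the scoring function invented for this illustration:

        from itertools import combinations

        def theories(claims, max_terms=3):
            """Every conjunction of claims up to a complexity level."""
            for k in range(1, max_terms + 1):
                yield from combinations(claims, k)

        def best_theory(claims, likelihood, max_terms=3):
            # likelihood(theory) -> how well the conjunction fits the
            # accumulated statistical test results
            return max(theories(claims, max_terms), key=likelihood)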

    That part will be switched off for today. It's still "highly
    experimental".

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)