From MrPostingRobot@kymhorsell.com@21:1/5 to All on Sat Apr 3 18:57:56 2021
I mentioned before I have been working on an AI that is intended to
"do science research" aka "unmanned science".
I'll post this brief (?) example of what it does so far as an
illustration of a "new flavor" of scientific research that I see as
being a possible growth area over the next few years.
I have some experience in the area having worked in science and
engineering over the past 50y. While I'm not a great writer of
scientific papers, the publishers tell me I have been given credit in
about 400 published papers so far, mostly in physics. In the past few
years I have changed careers for the Nth time and retooled as a data
scientist. At least a few years ago when there were about 1/2 mn
registered participants at KAGGLE I was ranked #189 in the world. I
must be doing something approximately right. <kaggle.com/kymhorsell1>
I've written about the basic philosophy of AI science research
elsewhere so the following is mostly restricted to a nice area that
sci.skeptic can get its teeth into. Although I probably won't be
listening. At an advanced age, and in somewhat ill health on and off
for the past decade, I'm not all that interested in listening to
lectures in one of my subject areas from people quoting magazine
articles who neither understand the subject matter nor cite research
that represents it accurately. But maybe that's just me. :)
The topic today is some kind of AI s/w examining the basic data for
sightings of a lake monster in Scottyland. I allays use the totely
korrect speling so pls no complaints about thet.
While witnesses over decades and perhaps a couple centuries swear
black and blue they have seen "something" in the lake, and there is
even interesting footage of "something" making a wave in the canal
connecting it with the sea, no-one seems to have landed a sample of the
beast for everyone to get a good close reproducible look at it.
This can be galling to students of hard science. But it is quite an
everyday situation for "science" of other types, where Nature can't be
made to dance a jig in the lab at will. E.g. astronomy.
So "working from data only" would seem to be a valid kind of science.
Theories can be proposed and checked against the data available. From
the results of one test, other tests can be proposed and validated (or
refuted). And that's what my little AI s/w does, in a way that is
simple but getting steadily more complex.
The basic workhorse is a bunch of robust statistical routines that
can handle different kinds of data but basically line up one dataset
with another and determine how closely they match and whether the
match is too close to be due to chance alone.
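The line-up-and-test idea can be sketched in a few lines. This is an illustrative stand-in, not the actual routines in the s/w: plain Pearson correlation as the match score, plus a permutation test to estimate how often chance alone matches as well.

```python
import numpy as np

def match_score(a, b):
    """Pearson correlation between two equal-length series."""
    a = (a - a.mean()) / a.std()
    b = (b - b.mean()) / b.std()
    return float(np.mean(a * b))

def permutation_pvalue(a, b, n_perm=1000, seed=0):
    """Fraction of random shuffles of b that match a as well as the
    real alignment does -- a rough 'too close to be chance?' test."""
    rng = np.random.default_rng(seed)
    observed = abs(match_score(a, b))
    hits = 0
    for _ in range(n_perm):
        if abs(match_score(a, rng.permutation(b))) >= observed:
            hits += 1
    return hits / n_perm
```

A tiny p-value here says the alignment survives shuffling, i.e. the match is unlikely to be luck; the real s/w's robust routines would be fancier, but the logic is the same.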
To update anyone that may not know, Data Science (pardon the use of
capitals) is these days up to the task of establishing causal links at
least along the lines intro'ed in the 1960s Surgeon General's report
into the link between tobacco and lung cancer. I.e. it builds
predictive models that predict forward in time, backwards in time, and
across a cross section of subsets. If all these pass then causation
has been scientifically established. In the past few years this has
been a hot area of research in the D.S. literature, so things have
become more complex than that 1960s basis. Way more.
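One simple flavor of the "predict forward in time" test is a Granger-style comparison: does adding series x's past reduce the error in predicting series y? A minimal sketch -- the function names and the least-squares setup are mine, not the s/w's:

```python
import numpy as np

def lagmat(series, nlags):
    """Stack nlags lagged copies of a series into a design matrix."""
    n = len(series) - nlags
    return np.column_stack([series[i:i + n] for i in range(nlags)])

def granger_gain(x, y, nlags=2):
    """Reduction in squared error when x's past is added to a model of
    y built from y's own past. Positive gain = x's history helps
    predict y forward in time."""
    n = len(y) - nlags
    target = y[nlags:]
    base = lagmat(y, nlags)
    full = np.column_stack([base, lagmat(x, nlags)])

    def sse(design):
        ones = np.column_stack([np.ones(n), design])
        coef, *_ = np.linalg.lstsq(ones, target, rcond=None)
        resid = target - ones @ coef
        return float(resid @ resid)

    return sse(base) - sse(full)
```

Running it both ways -- gain of x on y versus gain of y on x -- gives the directional asymmetry that causal arguments of the Surgeon-General kind lean on.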
Behind the stats s/w are some learning and reasoning capabilities. In
designing this s/w we anticipated it would have to sort through 1000s
of mns of datasets to make a decision. Since the stats for just 1
test can run several seconds of cpu, it's not a reasonable idea to
actually run every test.
And that's where the AI part comes in. A learning module runs tests and
looks at the results, building a robust predictive model that can
predict results from metadata of the inputs. Whenever it's asked to
run a stats test it can decide whether it can predict the answer
within a given interval of certainty and if it thinks it can it
"guesstimates" the answer rather than doing the work.
This feature has proved to be a kind of AI backbone for the whole
s/w. Practically every decision in the whole program is run through
this "smart memoization" mechanism. Before any kind of logical test
(and there are 1000s in the s/w -- i.e. all the "if statements", for
one thing) is evaluated, it is first run through a learning algorithm
that can either answer "I know the answer with 99% certainty" and
avoid doing the work involved, or admit it can't and do the work. The
similarity with a version of limited time travel quantum computation
is not entirely coincidental. :)
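A toy illustration of that gate (the class name, thresholds, and metadata-key scheme here are invented for the example, not taken from the actual s/w): remember past answers keyed by the call's metadata, and once the history agrees often enough, guess instead of computing.

```python
from collections import defaultdict

class SmartMemo:
    """Toy 'smart memoization': wrap an expensive test, learn from
    past calls, and skip the work once past answers for the same
    metadata key agree with high enough confidence."""

    def __init__(self, func, min_seen=3, min_agree=0.99):
        self.func = func
        self.history = defaultdict(list)   # metadata key -> past answers
        self.min_seen = min_seen
        self.min_agree = min_agree
        self.skipped = 0                   # how many real runs we avoided

    def __call__(self, *args, key=None):
        past = self.history[key]
        if len(past) >= self.min_seen:
            top = max(set(past), key=past.count)
            if past.count(top) / len(past) >= self.min_agree:
                self.skipped += 1          # confident: guess, skip the work
                return top
        answer = self.func(*args)          # not confident: run the real test
        past.append(answer)
        return answer
```

The real s/w presumably predicts from richer metadata with a proper learned model; the point of the sketch is only the shape of the decision: "can I guess within my certainty interval, or must I compute?"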
Behind these various modules is the database. At present this is a
quite large 1 TB database maintained using mysql on a unix
machine. Various associated s/w updates the database from time to time
from known websites that are maintained by different science groups
around the place. It can also hunt up and download some kinds of
dataset given meta-information or a series of specific-enough
keywords. That's how it got hold of the Nessie sighting data, e.g. And
such hunt-up queries can come from me or can be internally generated
by the problem-solving part of the s/w. Obviously we have to put some
brakes on this part or my 10 GB monthly net quota will get blown in
the first days of every month.
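A brake like that can be as crude as a hard byte budget per month. A hypothetical sketch (nothing here is from the actual s/w):

```python
class QuotaGuard:
    """Hypothetical brake on self-generated downloads: refuse any
    fetch that would push this month's usage past a byte budget."""

    def __init__(self, monthly_budget_bytes):
        self.budget = monthly_budget_bytes
        self.used = 0

    def allow(self, size_bytes):
        """Return True and book the bytes, or False if over budget."""
        if self.used + size_bytes > self.budget:
            return False
        self.used += size_bytes
        return True
```

The problem-solving part would consult `allow()` before each internally generated hunt-up query, so a runaway search can't blow the 10 GB quota in the first days of the month.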
The results we'll now turn to are the basic "guts" of the s/w to
date. Given a dataset of some phenomenon, it looks through (currently)
28,000 base datasets, blowing out to 3 mn derived datasets
after each base dataset is manipulated in various ways -- e.g. time
shifting, various kinds of transforms like differentiation,
integration, scaling, binning, categorizing, de-categorizing,
de-trending, de-seasonalizing, homogenization with respect to some
criterion, etc etc.
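A few of those derivations are easy to sketch with numpy. This is an illustrative subset of the manipulations listed above, not the s/w's actual transform catalogue:

```python
import numpy as np

def derived_variants(series, max_shift=3):
    """Generate some derived datasets from one base series:
    time shifts, differencing ('differentiation'), cumulative sum
    ('integration'), z-scaling, and linear de-trending."""
    out = {}
    for s in range(1, max_shift + 1):
        out[f"shift+{s}"] = series[:-s]            # series lagged by s steps
    out["diff"] = np.diff(series)                  # discrete derivative
    out["cumsum"] = np.cumsum(series)              # discrete integral
    out["zscale"] = (series - series.mean()) / series.std()
    t = np.arange(len(series))
    slope, intercept = np.polyfit(t, series, 1)    # fit and subtract a line
    out["detrend"] = series - (slope * t + intercept)
    return out
```

Each variant is then matched against the target series like any base dataset, which is how 28,000 bases blow out to millions of candidates.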
So there's a lot of work to get through. But luckily its accumulated
experience with 1000s of problems over the past couple years now
allows it to skip 99% of the work and take a lot of shortcuts with the
1% it actually needs to process. For the processing it has the
benefit of an old setup I used to run as a business (exaflops.com) a
few years back. <kymhorsell.com/garage-pc>. So it can generally get
through the work in a few minutes despite the increasing decrepitude
of the hardware available. :}
So, the results. We are interested in which data series in the big
database closely match, aka "look like", the target dataset -- in this
case the dates of "registered" Nessie sightings from some web database.
Some of the results will be shocking if you haven't seen this kind of
"fringe" stuff before.
The part of the s/w actively under development is intended to create
"every possible theory" (up to a given complexity level), evaluate
how likely each is to be true according to all the results from the
statistical analysis, and then spit out a more or less natural
language explanation of around 1 page in length -- something you
could give to a magazine journo, for example. :)
That part will be switched off for today. It's still "highly