We report the synthesis of an open-source test data set and work in progress to expand it. Sophisticated data mining and Machine Learning (ML) techniques can discover statistical associations among variables that may or may not reflect actual causal dependencies. In many applications, systems must discriminate between associations that are mere coincidences and those that are at least plausibly causal. Further, a graph of causal relationships may be complex, with fan-in, fan-out, and transitive dependencies, and various combinations of these. To test a system’s power to filter out non-causal associations and untangle the causal web, suitable synthetic data is needed. We report the development, in Wolfram Mathematica, of code that synthesizes data with subtle, complex causal dependencies among some but not all of the generated observable variables. We implement several simple dissipative chaotic flows. Four (4) are autonomous, six (6) are driven. Among the resulting ten (10) observable state vectors, there are forty-five (45) potential pairwise (1:1) relationships, of which four (4) are strong, five (5) are moderate, and three (3) are weak, for a total of twelve (12) that are actually causal; all others are mere statistical artifacts that a tool under test should reject. Each system’s observables are corrupted by additive Gaussian noise. Each system’s hidden dynamics are disturbed by a normal Wiener process. The levels of these stochastic components are parameterized to make problem difficulty tunable. A set of generated data, and the code for generating more, will be released openly online.
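The generator itself is implemented in Wolfram Mathematica and the specific flows are not named in this abstract. As a loose illustration only, the Python sketch below simulates one driven dissipative chaotic flow (a Lorenz system, chosen here as an assumption) by Euler-Maruyama integration, with Wiener process noise disturbing the hidden state and additive Gaussian noise corrupting the returned observable; the function name, parameters, and coupling scheme are hypothetical, not the authors' code.

```python
import numpy as np

def simulate_driven_lorenz(n_steps=20000, dt=0.005, drive=None, coupling=0.0,
                           process_noise=0.05, obs_noise=0.1, seed=0):
    """Euler-Maruyama integration of a Lorenz flow whose first state equation
    can be driven by an external signal; the hidden state is disturbed by a
    Wiener process and the returned observable by additive Gaussian noise."""
    rng = np.random.default_rng(seed)
    sigma, rho, beta = 10.0, 28.0, 8.0 / 3.0
    state = np.empty((n_steps, 3))
    state[0] = rng.normal(0.0, 1.0, size=3)          # random initial condition
    sqrt_dt = np.sqrt(dt)
    for k in range(1, n_steps):
        x, y, z = state[k - 1]
        u = 0.0 if drive is None else coupling * drive[k - 1]  # unidirectional driving term
        drift = np.array([sigma * (y - x) + u,
                          x * (rho - z) - y,
                          x * y - beta * z])
        dW = rng.normal(0.0, sqrt_dt, size=3)        # Wiener increments on the hidden state
        state[k] = state[k - 1] + drift * dt + process_noise * dW
    observable = state[:, 0] + rng.normal(0.0, obs_noise, size=n_steps)
    return state, observable

# System A (autonomous) drives system B through A's noisy observable, while
# system C is independent: any A-C or B-C association is a statistical artifact.
_, obs_a = simulate_driven_lorenz(seed=1)
_, obs_b = simulate_driven_lorenz(drive=obs_a, coupling=2.0, seed=2)
_, obs_c = simulate_driven_lorenz(seed=3)
```

The process-noise and observation-noise levels play the role of the tunable difficulty parameters described above; raising either makes the causal links harder to detect.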
We report favorable preliminary findings of work in progress bridging the Artificial Intelligence (AI) gap between bottom-up data-driven Machine Learning (ML) and top-down conceptually driven symbolic reasoning. Our overall goal is automatic generation, maintenance and utilization of explainable, parsimonious, plausibly causal, probably approximately correct, hybrid symbolic/numeric models of the world, the self and other agents, for prediction, what-if (counterfactual) analysis and control. Our earlier Evolutionary Learning with Information Theoretic Evaluation of Ensembles (ELITE2) techniques quantify the strengths of arbitrary multivariate nonlinear statistical dependencies, prior to discovering the forms by which observed variables may drive others. We extend these to apply Granger causality, in terms of conditional Mutual Information (MI), to distinguish causal relationships and find their directions. As MI can reflect one observable driving a second directly or via a mediator, two being driven by a common cause, etc., to untangle the causal graph we will apply Pearl causality with its back- and front-door adjustments and criteria. Initial efforts verified that our information theoretic indices detect causality in noise-corrupted data despite complex relationships among hidden variables with chaotic dynamics disturbed by process noise. The next step is to apply these information theoretic filters in Genetic Programming (GP) to reduce the population of discovered statistical dependencies to plausibly causal relationships, represented symbolically for use by a reasoning engine in a cognitive architecture. Success could bring broader generalization, using not just learned patterns but learned general principles, enabling AI/ML based systems to autonomously navigate complex unknown environments and handle “black swans”.
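The ELITE2 indices and GP machinery are not reproduced here. The minimal Python sketch below illustrates only the Granger-style idea of conditional mutual information between a target's present and a candidate driver's past, given the target's own past (i.e., transfer entropy), using a simple equal-frequency histogram estimator; the function name, single-step lag, and binning scheme are assumptions, not the authors' estimator.

```python
import numpy as np

def transfer_entropy(source, target, n_bins=8):
    """Histogram estimate (in bits) of the Granger-style conditional mutual
    information I(target_t ; source_{t-1} | target_{t-1}): positive values
    suggest the source's past adds information about the target beyond what
    the target's own past already provides."""
    def discretize(x):                      # equal-frequency binning via ranks
        ranks = np.argsort(np.argsort(x))
        return (ranks * n_bins) // len(x)
    s_past = discretize(source)[:-1]
    t_now = discretize(target)[1:]
    t_past = discretize(target)[:-1]
    edges = [np.arange(n_bins + 1)] * 3
    joint, _ = np.histogramdd(np.column_stack([t_now, s_past, t_past]), bins=edges)
    p_xyz = joint / joint.sum()
    p_yz = p_xyz.sum(axis=0)                # P(source_{t-1}, target_{t-1})
    p_xz = p_xyz.sum(axis=1)                # P(target_t, target_{t-1})
    p_z = p_xyz.sum(axis=(0, 1))            # P(target_{t-1})
    te = 0.0
    for i, j, k in zip(*np.nonzero(p_xyz)):
        te += p_xyz[i, j, k] * np.log2(p_xyz[i, j, k] * p_z[k]
                                       / (p_xz[i, k] * p_yz[j, k]))
    return te

# Asymmetry between the two directions hints at the causal direction, e.g.:
#   transfer_entropy(obs_a, obs_b)   # expected to exceed
#   transfer_entropy(obs_b, obs_a)   # ...the reverse direction
```

Such a directional index filters out symmetric, merely associative dependencies, but it cannot by itself separate direct from mediated or confounded links, which is where the Pearl-style adjustments described above come in.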
KEYWORDS: Data modeling, Feature selection, Statistical analysis, Process control, Analog electronics, Image information entropy, Facility engineering, Statistical modeling, Algorithm development, Data storage
In evolutionary learning, the sine qua non is evolvability, which requires heritability of fitness and a balance between exploitation and exploration. Unfortunately, commonly used fitness measures, such as root mean squared error (RMSE), often fail to reward individuals whose presence in the population is needed to explain important data variance; and indicators of diversity generally are not only incommensurate with those of fitness but also essentially arbitrary. Thus, due to poor scaling, deception, etc., apparently high-fitness individuals in early generations may not contain the building blocks needed to evolve optimal solutions in later generations. To reward individuals for their potential incremental contributions to the solution of the overall problem, heritable information theoretic functionals are developed that incorporate diversity considerations into fitness, explicitly identifying building blocks suitable for recombination (e.g., for non-random mating). Algorithms for estimating these functionals from either discrete or continuous data are illustrated by application to input selection in a high dimensional industrial process control data set. Multiobjective information theoretic ensemble selection is shown to avoid some known feature selection pitfalls.
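The heritable functionals and ensemble evaluation themselves are not specified in this abstract. As a rough illustration of information-theoretic input selection, the Python sketch below scores candidate inputs by a histogram estimate of mutual information with the target and greedily penalizes redundancy with inputs already chosen; this mRMR-style greedy criterion and the function names are assumptions for illustration, not the authors' multiobjective method.

```python
import numpy as np

def mutual_information(x, y, n_bins=16):
    """Histogram estimate of I(X; Y) in bits; a simple stand-in for the
    estimators described above, which handle discrete or continuous data."""
    joint, _, _ = np.histogram2d(x, y, bins=n_bins)
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    nz = p_xy > 0
    return float(np.sum(p_xy[nz] * np.log2(p_xy[nz] / (p_x @ p_y)[nz])))

def greedy_input_selection(X, y, k=5, n_bins=16):
    """Select up to k input columns of X by relevance I(x_i; y) minus the mean
    redundancy I(x_i; x_j) over inputs already chosen (an mRMR-style rule)."""
    relevance = [mutual_information(X[:, i], y, n_bins) for i in range(X.shape[1])]
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < k:
        def score(i):
            redundancy = (np.mean([mutual_information(X[:, i], X[:, j], n_bins)
                                   for j in selected]) if selected else 0.0)
            return relevance[i] - redundancy
        best = max(remaining, key=score)       # most relevant, least redundant input
        selected.append(best)
        remaining.remove(best)
    return selected
```

Scoring relevance and redundancy in the same units (bits) is one way to keep fitness and diversity commensurate, which is the motivation for the information-theoretic functionals described above.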