SRI Authors: Peter D Karp

Abstract: The Pathway Tools (PTools) software provides a suite of capabilities for storing and analyzing integrated collections of genomic and metabolic information in the form of organism-specific Pathway/Genome Databases (PGDBs). A microbial community is represented in PTools by generating a PGDB from each metagenome-assembled genome (MAG). PTools computes a metabolic […]
We describe work in progress towards deriving a unification algorithm automatically from a declarative specification using deductive methods. The specification is phrased as a logical theorem and the program is extracted from the proof. The theorem is proved using Stickel’s system SNARK, operating over a theory of expressions and substitutions. The theory has been formulated to allow a simpler specification of the algorithm, and the theorem prover has discovered novelties in the implementation. It is hoped that the same techniques may enable the discovery of previously unknown unification algorithms for specific theories.
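To make the target of the specification concrete, the following is a minimal sketch of standard syntactic first-order unification with an occurs check. It is an illustrative baseline, not the SNARK-derived algorithm described above, and all names (`unify`, `substitute`, the uppercase-variable convention) are assumptions for this sketch:

```python
# Terms: a string is a constant or (if it starts uppercase) a variable;
# a tuple (functor, arg1, ...) is a compound term.

def substitute(term, subst):
    """Apply a substitution (dict: variable -> term) to a term."""
    if isinstance(term, str):
        return subst.get(term, term)
    return (term[0],) + tuple(substitute(a, subst) for a in term[1:])

def occurs(var, term):
    """Occurs check: does var appear anywhere inside term?"""
    if term == var:
        return True
    return isinstance(term, tuple) and any(occurs(var, a) for a in term[1:])

def unify(s, t, subst=None):
    """Return a most general unifier of s and t, or None if none exists."""
    if subst is None:
        subst = {}
    s, t = substitute(s, subst), substitute(t, subst)
    if s == t:
        return subst
    if isinstance(s, str) and s[0].isupper():          # s is a variable
        if occurs(s, t):
            return None
        return {**{v: substitute(u, {s: t}) for v, u in subst.items()}, s: t}
    if isinstance(t, str) and t[0].isupper():          # t is a variable
        return unify(t, s, subst)
    if isinstance(s, tuple) and isinstance(t, tuple) \
            and s[0] == t[0] and len(s) == len(t):     # same functor/arity
        for a, b in zip(s[1:], t[1:]):
            subst = unify(a, b, subst)
            if subst is None:
                return None
        return subst
    return None

# unify f(X, g(Y)) with f(a, g(b))
print(unify(('f', 'X', ('g', 'Y')), ('f', 'a', ('g', 'b'))))  # {'X': 'a', 'Y': 'b'}
```

A deductively synthesized version would arise instead as the program extracted from a proof over the theory of expressions and substitutions, which is where the novelties mentioned above appear.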
Pathway Size Matters: The Influence of Pathway Granularity on Over-Representation (Enrichment) Statistics
Background: Enrichment or over-representation analysis is a common method used in bioinformatics studies of transcriptomics, metabolomics, and microbiome datasets. The key idea behind enrichment analysis is: given a set of significantly expressed genes (or metabolites), use that set to infer a smaller set of perturbed biological pathways or processes, in which those genes (or metabolites) play a role. Enrichment computations rely on collections of defined biological pathways and/or processes, which are usually drawn from pathway databases. Although practitioners of enrichment analysis take great care to employ statistical corrections (e.g., for multiple testing), they appear unaware that enrichment results are quite sensitive to the pathway definitions that the calculation uses.
Results: We show that alternative pathway definitions can alter enrichment p-values by up to nine orders of magnitude, whereas statistical corrections typically alter enrichment p-values by only two orders of magnitude. We present multiple examples where the smaller pathway definitions used in the EcoCyc database produce stronger enrichment p-values than the much larger pathway definitions used in the KEGG database; we demonstrate that to attain a given enrichment p-value, KEGG-based enrichment analyses require 1.3–2.0 times as many significantly expressed genes as do EcoCyc-based enrichment analyses. The large pathways in KEGG are problematic for another reason: they blur together multiple (as many as 21) biological processes. When such a KEGG pathway receives a high enrichment p-value, it is unclear which of its component processes is perturbed, and thus the biological conclusions drawn from enrichment of large pathways are also in question.
Conclusions: The choice of pathway database used in enrichment analyses can have a much stronger effect on the enrichment results than the statistical corrections used in these analyses.
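The over-representation statistic underlying such analyses is typically a hypergeometric (Fisher's exact) tail probability. A minimal stdlib sketch, using illustrative numbers rather than figures from the paper, shows how the same number of hits yields a far stronger p-value against a small pathway than against a large one:

```python
from math import comb

def enrichment_pvalue(N, K, n, k):
    """Hypergeometric over-representation p-value: probability of observing
    k or more pathway members among n significant genes, drawn from a
    genome of N genes of which K belong to the pathway."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / comb(N, n)

# Illustrative numbers (not from the paper): a 4000-gene genome, 100
# significant genes, 10 of which fall in the pathway of interest.
small = enrichment_pvalue(4000, 20, 100, 10)    # small, focused pathway
large = enrichment_pvalue(4000, 200, 100, 10)   # same 10 hits, 10x larger pathway
print(f"small pathway p = {small:.2e}, large pathway p = {large:.2e}")
```

With the small pathway the 10 hits cover half its members and the p-value is vanishingly small; with the large pathway the same 10 hits barely exceed the expected count of 5, so the p-value is orders of magnitude weaker, which is the sensitivity to pathway granularity the abstract describes.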
Neural methods for molecule property prediction require an efficient encoding of structure-property relationships to be accurate. Recent work using graph algorithms shows limited generalization in the latent molecule encoding space. We build a Transformer-based molecule encoder and property predictor network with novel input featurization that performs significantly better than existing methods. We adapt our model to semi-supervised learning so that it also performs well on the limited experimental data usually available in practice.
Interestingness Elements for Explainable Reinforcement Learning: Understanding Agents’ Capabilities and Limitations
We propose an explainable reinforcement learning (XRL) framework that analyzes an agent’s history of interaction with the environment to extract interestingness elements that explain its behavior. The framework relies on data readily available from standard RL algorithms, augmented with data that can easily be collected by the agent while learning. We describe how to create visual explanations of an agent’s behavior in the form of short video-clips highlighting key interaction moments, based on the proposed elements. We also report on a user study in which we evaluated the ability of humans to correctly perceive the aptitude of agents with different characteristics, including their capabilities and limitations, given explanations automatically generated by our framework. The results show that the diversity of aspects captured by the different interestingness elements is crucial in helping humans correctly identify the agents’ aptitude in the task and determine when they might need adjustments to improve their performance.
Procedure automation can relieve users of the burden of repetitive, time-consuming, or complex procedures and enable them to focus on more cognitively demanding tasks. Procedural learning is a method by which procedure automation can be achieved by intelligent computational assistants. This paper explores the use of filtering heuristics based on action models for automated planning to augment sequence mining techniques. Sequential pattern mining algorithms rely primarily on frequency of occurrence to identify patterns, leaving them susceptible to discovering patterns that make little sense from a cognitive perspective. In contrast, humans are able to form models of procedures from small numbers of observations, even without explicit instruction. We posit that humans are able to do so because of background knowledge about actions and procedures, which lets them effectively filter out meaningless sequential patterns. The action models foundational to artificial intelligence (AI) planning are one way to provide semantics to actions, supporting the design of heuristics for eliminating spurious patterns discovered from event logs. We present experiments with various filters derived from these action models, the results of which show the value of the filters in greatly reducing the number of sequential patterns discovered without sacrificing the number of correct patterns found, even with small, noisy event logs.
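The filtering idea can be sketched with a toy executability check: a mined pattern is kept only if each action's preconditions can be established by the effects of earlier actions. The action models and state literals below are hypothetical examples, not the filters or domains used in the paper:

```python
# Hypothetical STRIPS-style action models: preconditions and add-effects.
ACTION_MODELS = {
    "open_file": {"pre": set(),              "eff": {"file_open"}},
    "edit_file": {"pre": {"file_open"},      "eff": {"file_modified"}},
    "save_file": {"pre": {"file_modified"},  "eff": {"file_saved"}},
}

def plausible(pattern, initial_state=frozenset()):
    """Keep a mined pattern only if every action's preconditions are
    satisfied by the initial state plus the effects of earlier actions."""
    state = set(initial_state)
    for action in pattern:
        model = ACTION_MODELS[action]
        if not model["pre"] <= state:   # precondition unsatisfied: spurious
            return False
        state |= model["eff"]
    return True

mined = [
    ["open_file", "edit_file", "save_file"],  # sensible procedure
    ["save_file", "open_file"],               # frequent but nonsensical order
]
kept = [p for p in mined if plausible(p)]
print(kept)  # only the executable pattern survives
```

A frequency-based miner alone could report both patterns; the action-model filter discards the second because `save_file` has no established precondition, which is the kind of semantic pruning the experiments evaluate.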
Exact Inference for Relational Graphical Models with Interpreted Functions: Lifted Probabilistic Inference Modulo Theories
Probabilistic Inference Modulo Theories (PIMT) is a recent framework that expands exact inference on graphical models to richer languages that include arithmetic, equalities, and inequalities on both integers and real numbers. In this paper, we expand PIMT to a lifted version that also processes random functions and relations. This enhancement is achieved by adapting Inversion, a method from the lifted first-order probabilistic inference literature, to also be modulo theories. The result is the first algorithm for exact probabilistic inference that efficiently and simultaneously exploits random relations and functions, arithmetic, equalities, and inequalities.
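The computational saving behind Inversion can be illustrated on a toy exchangeable model (this is the generic lifted-inference idea, not the modulo-theories algorithm of the paper): when the factors on a random function's values are independent across domain elements, the sum over all functions factorizes into a product of small sums.

```python
from itertools import product

# Toy illustration of Inversion: for a random function f : X -> V whose
# factors phi(f(x)) are independent across x, the partition function
#     Z = sum over all f of  prod over x of  phi(f(x))
# factorizes as (sum over v of phi(v)) ** |X|,
# avoiding enumeration of all |V|**|X| functions.

X = range(10)                   # domain of the random function
V = [0, 1, 2]                   # range of the random function
phi = {0: 0.5, 1: 0.3, 2: 0.2}  # factor on each function value

# Naive: enumerate all 3**10 = 59049 functions.
naive = 0.0
for f in product(V, repeat=len(X)):
    w = 1.0
    for x in X:
        w *= phi[f[x]]
    naive += w

# Lifted (Inversion): push the sum inside the product.
lifted = sum(phi[v] for v in V) ** len(X)

print(naive, lifted)  # both approximately 1.0, since phi sums to 1
```

The lifted computation is O(|V|) per domain element instead of exponential in |X|; the paper's contribution is performing this kind of inversion while the factors are expressed in richer theories (arithmetic, equalities, inequalities).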
This document compares SRI’s Pathway Tools (PTools) software with the Reactome software. Both software systems serve the pathway bioinformatics area, including representation and analysis of metabolic pathways and signaling pathways. The comparison covers pathway bioinformatics capabilities, but does not cover other major facets of Pathway Tools that are completely absent from the Reactome software: Pathway Tools genome-informatics capabilities, regulatory informatics capabilities, and table-based analysis tools (SmartTables).
Our overall findings are as follows.
(1) PTools is significantly ahead of Reactome in its basic information pages. For example, PTools pathway layout algorithms have been developed to an advanced state over several decades, whereas Reactome pathway layouts are illegible, omit important information, and are created manually and therefore cannot scale to thousands of genomes.
(2) PTools is far ahead of Reactome in omics analysis. PTools includes all of the omics-analysis methods that Reactome provides, and includes multiple methods that Reactome lacks.
(3) PTools contains a metabolic route search tool (searching for paths through the metabolic network), which Reactome lacks.
(4) PTools is significantly ahead of Reactome in inference of metabolic pathways from genome information to create new metabolic databases.
(5) PTools has an extensive complement of metabolic-modeling tools whereas Reactome has none.
(6) PTools is more scalable than Reactome, handling 18,000 genomes versus 90 genomes for Reactome.
(7) PTools has a larger user base than Reactome. PTools powers 17 websites versus two for Reactome. PTools has been licensed by 10,800 users (the Reactome licensed-user count is unknown).
MetaCyc (MetaCyc.org) is a comprehensive reference database of metabolic pathways and enzymes from all domains of life. It contains 2749 pathways derived from more than 60 000 publications, making it the largest curated collection of metabolic pathways. The data in MetaCyc are evidence-based and richly curated, resulting in an encyclopedic reference tool for metabolism. MetaCyc is also used as a knowledge base for generating thousands of organism-specific Pathway/Genome Databases (PGDBs), which are available in BioCyc.org and other genomic portals. This article provides an update on the developments in MetaCyc during September 2017 to August 2019, up to version 23.1. Some of the topics that received intensive curation during this period include cobamides biosynthesis, sterol metabolism, fatty acid biosynthesis, lipid metabolism, carotenoid metabolism, protein glycosylation, antibiotics and cytotoxins biosynthesis, siderophore biosynthesis, bioluminescence, vitamin K metabolism, brominated compound metabolism, plant secondary metabolism and human metabolism. Other additions include modifications to the GlycanBuilder software that enable displaying glycans using symbolic representation, improved graphics and fonts for web displays, improvements in the PathoLogic component of Pathway Tools, and the optional addition of regulatory information to pathway diagrams.