Bjørn Bredesen-Aa

About

Large parts of the complex genomes of multicellular organisms are non-coding. Cis-regulatory elements (CREs) are non-coding sequences that establish or modify gene transcription through multiple mechanisms. Several classes of CREs have been identified, including promoters, enhancers, silencers, insulators and Polycomb/Trithorax Response Elements. CREs can be identified experimentally or by in silico prediction. Experimental identification can depend on the cells used, whereas genome-wide in silico prediction can potentially predict CREs comprehensively across a genome. Applying machine learning to CRE prediction requires a range of functionality, from data preparation to model training and validation. A variety of Python 3 packages exist for machine learning and sequence analysis, but combining them successfully requires implementing interfaces between them. Efficiency is important for large genomes, but achieving it can be challenging for end-users.

Gnocis is a Python 3 system for the interactive and reproducible analysis and modelling of CRE DNA sequences. It implements a broad suite of tools for data preparation, feature set definition, model formulation, training, cross-validation and genome-wide prediction. Gnocis uses Cython and a variety of techniques to efficiently implement the glue needed to apply machine learning to CRE analysis and prediction.

Features

DNA sequence handling
- File format support - Loading and streaming
  - FASTA
  - 2bit
- File format support - Saving
  - FASTA
- Operations
  - Printing
  - Sliding window extraction
  - Reverse complement generation
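As a rough illustration of the two sequence operations listed above, the following plain-Python sketch shows sliding window extraction and reverse complement generation. It is not the Gnocis API: the function names `reverse_complement` and `sliding_windows` and their parameters are hypothetical and chosen only for this example.

```python
# Illustrative sketch only: plain Python, not the Gnocis API.
# Function names and parameters are assumptions made for this example.
from typing import Iterator, Tuple

# Translation table for complementing nucleotides (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def sliding_windows(seq: str, size: int, step: int) -> Iterator[Tuple[int, str]]:
    """Yield (start, window) pairs for fixed-size windows along seq.

    Windows that would run past the end of the sequence are skipped.
    """
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    for start in range(0, len(seq) - size + 1, step):
        yield start, seq[start:start + size]
```

Yielding windows lazily, rather than building a list, keeps memory use flat when scanning whole chromosomes.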
Sequence region handling
- File format support - Loading and saving
  - GFF
  - BED
  - Coordinate lists (chromosome:start..end)
- Operations
  - Overlap acquisition
  - Non-overlap acquisition
  - Merged set generation
  - Exclusion set generation
  - Sequence region extraction
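The region operations above can be sketched in plain Python as follows. This is a minimal illustration, not the Gnocis API: the `Region` type, the helper names, and the choice of inclusive end coordinates are all assumptions made for this example. Only the `chromosome:start..end` text form is taken from the list above.

```python
# Illustrative sketch only: plain Python, not the Gnocis API.
# Region, parse_region, overlaps and merge_regions are hypothetical names.
from typing import Iterable, List, NamedTuple


class Region(NamedTuple):
    seq: str    # chromosome / sequence name
    start: int  # first position (assumed inclusive)
    end: int    # last position (assumed inclusive)


def parse_region(text: str) -> Region:
    """Parse a coordinate string of the form chromosome:start..end."""
    seq, span = text.rsplit(":", 1)
    start, end = span.split("..")
    return Region(seq, int(start), int(end))


def overlaps(a: Region, b: Region) -> bool:
    """True if both regions lie on the same sequence and share a position."""
    return a.seq == b.seq and a.start <= b.end and b.start <= a.end


def merge_regions(regions: Iterable[Region]) -> List[Region]:
    """Merge overlapping regions into a sorted, non-overlapping set."""
    merged: List[Region] = []
    for region in sorted(regions):  # sorts by (seq, start, end)
        if merged and overlaps(merged[-1], region):
            last = merged[-1]
            merged[-1] = Region(last.seq, last.start, max(last.end, region.end))
        else:
            merged.append(region)
    return merged
```

Sorting first means each region only needs to be compared with the most recently merged one, so merging runs in O(n log n) time.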
Modelling
- Generative DNA sequence models, with training and sequence generation
  - I.i.d.
  - Nth-order Markov chains
- Confusion matrices
  - Generation from model statistics
  - Printing
  - Receiver Operating Characteristic curve generation
  - Precision-Recall curve generation
  - Area Under the Curve calculation
- Feature models
  - Log-odds
  - Dummy
  - Support Vector Machines (via sklearn)
  - Random Forest (via sklearn)
- Features
  - k-mer spectrum
  - Motif occurrence spectrum
  - Motif pair occurrence spectrum
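To make two of the modelling terms above concrete, the sketch below computes a k-mer spectrum for a sequence and the area under a Receiver Operating Characteristic curve from classifier scores. It is plain Python rather than the Gnocis API, and it makes simplifying assumptions: k-mers are counted on the forward strand only, and the AUC is computed by direct pairwise comparison rather than from a confusion matrix. The function names are hypothetical.

```python
# Illustrative sketch only: plain Python, not the Gnocis API.
# kmer_spectrum and roc_auc are hypothetical names for this example.
from collections import Counter
from itertools import product
from typing import Dict, Sequence


def kmer_spectrum(seq: str, k: int) -> Dict[str, int]:
    """Count overlapping occurrences of every k-mer over ACGT.

    Assumption: forward strand only; windows containing other
    characters (e.g. N) are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return {kmer: counts.get(kmer, 0) for kmer in kmers}


def roc_auc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """Area under the ROC curve.

    Computed as the probability that a randomly chosen positive scores
    higher than a randomly chosen negative, with ties counted as 0.5.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))
```

The pairwise AUC is quadratic in the number of scores, which is fine for illustration but too slow for genome-wide evaluation.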
Motifs
- Types
  - IUPAC nucleotide motifs
  - Position Weight Matrices
  - k-mer spectra
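As an illustration of the first motif type, the sketch below matches an IUPAC nucleotide motif against a sequence by translating it into a regular expression. This is plain Python using the standard `re` module, not the Gnocis API. The names `iupac_to_regex` and `find_motif` are hypothetical, and only the forward strand is searched.

```python
# Illustrative sketch only: plain Python, not the Gnocis API.
# iupac_to_regex and find_motif are hypothetical names for this example.
import re
from typing import List

# Standard IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def iupac_to_regex(motif: str) -> "re.Pattern[str]":
    """Compile an IUPAC motif into a regex that finds overlapping matches."""
    body = "".join(IUPAC[code] for code in motif.upper())
    # A lookahead makes every match zero-width, so overlapping
    # occurrences are all reported.
    return re.compile("(?=(" + body + "))")


def find_motif(seq: str, motif: str) -> List[int]:
    """Return 0-based start positions of motif occurrences (forward strand)."""
    pattern = iupac_to_regex(motif)
    return [m.start() for m in pattern.finditer(seq.upper())]
```

To cover both strands, the same search could be repeated with the reverse complement of the motif.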
Feature networks
- Directed acyclic graphs of features
- Transformations of feature sets: filtering; concatenation; scaling; square; …
- Feature network nodes for constructing models
- Application to sequences
Optionally integrates with established packages
- NumPy – for integration with external methods
- Pandas – for integration with external methods
- Scikit-learn – for extended analyses and classic machine learning
- TensorFlow – for neural networks
- Jupyter Notebooks – for interactive and reproducible analysis and modelling
Easy to use
Objects are represented by classes, with human-readable descriptions
Optimized with Cython
…

Last time actively developed

2022

System requirements

Ubuntu
- Ubuntu 20.04

Links

GitHub

Publication