MOCCA suite

About

MOCCA (Motif Occurrence Combinatorics Classification Algorithms) is a suite for modelling DNA cis-regulatory element (CRE) sequences. MOCCA includes the first polished, efficient and configurable implementation of the Support Vector Machine Motif Occurrence Combinatorics Classification Algorithm (SVM-MOCCA), a method that we previously presented and found to improve generalization to Polycomb/Trithorax Response Elements (PREs) (Bredesen et al. 2019), a class of CREs that maintains epigenetic memory.

SVM-MOCCA is a hierarchical method based on Support Vector Machines (SVMs) and motifs. One SVM is trained per motif to classify that motif's occurrences, using a feature space of local dinucleotide and motif occurrence frequencies. Positively classified motif occurrences are then combined using a log-odds model to give a final prediction score. This differs from the classical use of SVMs with motifs for modelling CRE sequences, in which SVMs are trained on motif occurrence frequencies or k-spectra: the MOCCA methods instead train one model per motif and combine the predictions.

MOCCA also includes a derivative method based on Random Forests (RFs), the Random Forest Motif Occurrence Combinatorics Classification Algorithm (RF-MOCCA). In addition, MOCCA supports training log-odds models and classical SVM and RF models using a variety of feature space formulations. MOCCA also includes functionality for generating negative data, calibrating thresholds and predicting genome-wide, as well as an automated mode that requires the user to specify only positive sequences, motifs and a genome.
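The final combination step described above can be illustrated with a small Python sketch. It is not MOCCA's implementation: the function names, the pseudocount and the exact weight formula are assumptions made for illustration, and the per-motif classifiers (SVMs or Random Forests) are assumed to have already decided which motif occurrences count as positively classified.

```python
import math


def log_odds_weights(pos_counts, neg_counts, pseudocount=1.0):
    """Hypothetical per-motif weights.

    pos_counts / neg_counts map each motif name to the number of positively
    classified occurrences seen in positive / negative training sequences.
    The pseudocount (an assumption) avoids division by zero and log(0).
    """
    return {
        motif: math.log(
            (pos_counts[motif] + pseudocount) / (neg_counts[motif] + pseudocount)
        )
        for motif in pos_counts
    }


def window_score(accepted_counts, weights):
    """Score one sequence window.

    accepted_counts maps each motif name to the number of its occurrences in
    the window that the motif's own classifier accepted. The score is the
    weighted sum of those counts.
    """
    return sum(weights[motif] * n for motif, n in accepted_counts.items())
```

For example, `window_score({"GAGA": 2, "PHO": 1}, weights)` adds twice the weight of the hypothetical motif `GAGA` to the weight of `PHO`; a window is then called as a candidate CRE if its score exceeds a calibrated threshold.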

Features

  • Models
    • Dummy PREdictor
    • CPREdictor
    • SVM-MOCCA (the Support Vector Machine Motif Occurrence Combinatorics Classification Algorithm)
    • RF-MOCCA (the Random Forest Motif Occurrence Combinatorics Classification Algorithm)
    • Log-odds models with motif-based feature spaces
    • Support Vector Machines with motif-based feature spaces
    • Random Forests with motif-based feature spaces
  • Motif handling
    • Command-line specification of IUPAC motifs
    • Loading of IUPAC motifs from XML
    • Generation of random IUPAC motifs
    • Full k-mer sets
    • Finite State Machine for parsing IUPAC motif occurrences
    • Position Weight Matrix motifs
  • Feature spaces
    • Motif occurrence frequency spectrum
    • Motif pair occurrence frequency spectrum, with distance cutoff, and multiple distancing and overlap modes
    • Motif distancing kernels
    • Periodic motif occurrence kernels
    • Motif pairing kernels that incorporate positional information
  • Core usage features
    • Training with FASTA sequence files
    • Validation with FASTA sequence files
    • Prediction threshold calibration for a desired precision
    • Genome-wide prediction of candidate CREs to General Feature Format files
    • Genome-wide prediction to Wiggle files
    • Saving of sequence scores to table
    • Scoring of sequence files to Wiggle curves
    • Automatic construction of negative training/test/calibration data
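To make the motif-based feature spaces listed above more concrete, the following Python sketch counts IUPAC motif occurrences and motif pair occurrences within a distance cutoff. It is an illustration only: MOCCA parses motif occurrences with a Finite State Machine, whereas this sketch swaps in regular expressions, scans the forward strand only, and measures distance between occurrence start positions. All function names are hypothetical.

```python
import re

# Standard IUPAC nucleotide codes and the bases each one matches.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def motif_regex(motif):
    """Compile an IUPAC motif to a regex; the lookahead lets matches overlap."""
    body = "".join("[%s]" % IUPAC[code] for code in motif.upper())
    return re.compile("(?=%s)" % body)


def occurrences(motif, seq):
    """Start positions (0-based) of all forward-strand motif occurrences."""
    return [m.start() for m in motif_regex(motif).finditer(seq.upper())]


def occurrence_frequency(motif, seq, per=1000.0):
    """Motif occurrences per `per` base pairs (per kilobase by default)."""
    return len(occurrences(motif, seq)) * per / len(seq)


def pair_count(motif_a, motif_b, seq, cutoff):
    """Count (a, b) occurrence pairs whose start positions differ by at most
    `cutoff` bp. Pairs starting at the same position are skipped, which is one
    simple way (an assumption here) of handling complete overlap."""
    starts_a = occurrences(motif_a, seq)
    starts_b = occurrences(motif_b, seq)
    return sum(
        1
        for i in starts_a
        for j in starts_b
        if i != j and abs(i - j) <= cutoff
    )
```

Computing `occurrence_frequency` for every motif gives a motif occurrence frequency spectrum, and computing `pair_count` for every motif pair gives a simple pair spectrum; other distancing and overlap modes would change only the condition inside `pair_count`.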

Last actively developed

2021

System requirements

  • Ubuntu
    • Ubuntu 20.04

Links

GitHub

Publication

Copyright Bjørn Bredesen-Aa, 2022 - E-mail: bjorn at bjornbredesen dot no