Software for experiments reported in
"Classification of non-Coding RNA using graph
representations of secondary structure"
Yan Karklin, Richard F. Meraz, and Stephen R. Holbrook.
(Pacific Symposium on Biocomputing 2005).
This package includes only the single-family classifier code. The
rest (multi-class classification and analysis, for example) is
included in the "rnagraphs_misc.tgz" package. It's not documented or
guaranteed to work - it's only there to make your life easier, not
mine more complicated.
Send questions and comments to yan+@cs.cmu.edu.
yan karklin. nov 2004.
#####################################################################
Here's what's in this distribution:
run.m
loadParams.m
loadGraphs.m
labelGraphs.m
makeSynthRNA.m
doClassification.m
utils/brkt2bpgr.m
utils/bpgr2dual.m
utils/countFreqs.m
utils/getSVMlabels.m
svm/train.m
svm/classify.m
svm/graphKernel.m
svm/edgeKernel.m
svm/vertexKernel.m
analysis/drawBPGraph.m
analysis/drawDual.m
analysis/arrangeVert.m
analysis/plotSVMresult.m
analysis/computeROC.m
data/tRNA.sec
Here's what's not:
RFAM - Data used in experiments.
http://www.sanger.ac.uk/Software/Rfam/
Vienna RNA Package - Used for folding RFAM and synthetic RNAs
http://www.tbi.univie.ac.at/~ivo/RNA/
To duplicate results in the paper, you will have to download RNAs from
RFAM and put them in the ./data/ directory. Then install Vienna RNA,
point this code to it, and fold the RNAs. You might have to
post-process the generated bracket-notation files; I've included an
example of what these should look like, in ./data/tRNA.sec (a few
tRNAs from RFAM).
####################################################################
A brief overview of the included functions. See comments in
individual files for detailed instructions.
run.m
This is the main run function. Use it as an example of
how to run your own classifier.
If you're lazy (like me), ignore all the text below and start
by looking at this function.
loadParams.m
Specify the representation (graph type, label scheme) and
learning (kernel, svm) parameters, as well as parameters for
generating the negative (synthetic) data set.
loadGraphs.m
Read secondary structure files and convert them into graph
structures.
labelGraphs.m
Convert nucleotide labels on the graph vertices to labels for
kernel computation. The scheme used here, together with the exact
form of the vertex/edge kernel computation, specifies the kernel for
the SVM classifier.
makeSynthRNA.m
Take the positive data set (real RNAs), compute their statistics,
generate random sequences, fold them, and write bracket-notation
secondary structure files, to be read by loadGraphs.m.
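
The "compute statistics, generate random sequences" step can be
sketched as follows. This is an independent Python illustration, not
the MATLAB code in makeSynthRNA.m. It assumes the sequences are sampled
from a first-order Markov chain fitted to mono- and dinucleotide counts
with add-one smoothing; the function names are hypothetical, and
folding with Vienna RNA is left out.

```python
import random
from collections import Counter

ALPHABET = "ACGU"

def fit_statistics(seqs):
    """Count mono- and dinucleotide occurrences (add-one smoothed).

    Characters outside ACGU are ignored (an assumption of this sketch).
    """
    mono = Counter({a: 1 for a in ALPHABET})
    di = {a: Counter({b: 1 for b in ALPHABET}) for a in ALPHABET}
    for seq in seqs:
        for a in seq:
            if a in mono:
                mono[a] += 1
        for a, b in zip(seq, seq[1:]):
            if a in di and b in di:
                di[a][b] += 1
    return mono, di

def sample_sequence(mono, di, length, rng):
    """Draw the first base from mono counts, the rest from di counts."""
    def draw(counts):
        return rng.choices(ALPHABET, weights=[counts[a] for a in ALPHABET])[0]
    if length <= 0:
        return ""
    seq = [draw(mono)]
    while len(seq) < length:
        seq.append(draw(di[seq[-1]]))
    return "".join(seq)

mono, di = fit_statistics(["GGGAAACCC", "ACGUACGU"])
synthetic = sample_sequence(mono, di, 50, random.Random(0))
```

The generated sequence would then be folded and written out in
bracket notation before being read by loadGraphs.m.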
doClassification.m
Run the SVM classifier (with cross-validation) on the loaded data.
The returned "Results" structure contains the true labels, the
kernel matrix, the assigned scores, computed SVM quantities
alpha/beta, and the indices used for the randomized cross-validation
procedure.
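
For readers unfamiliar with randomized cross-validation, here is a
minimal Python sketch of how shuffled fold indices might be produced.
It is not taken from doClassification.m; the function names, the seed
argument, and the round-robin fold assignment are all assumptions.

```python
import random

def cv_folds(n_items, n_folds, seed=0):
    """Shuffle item indices and deal them into n_folds groups."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    return [idx[k::n_folds] for k in range(n_folds)]

def train_test_split(folds, k):
    """Use fold k as the test set and all other folds for training."""
    test = list(folds[k])
    train = [i for j, fold in enumerate(folds) if j != k for i in fold]
    return train, test

folds = cv_folds(10, 3)
train_idx, test_idx = train_test_split(folds, 0)
```

Storing the fold indices alongside the scores, as the Results
structure does, makes a run reproducible after the fact.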
utils/brkt2bpgr.m
Converts RNA secondary structures in bracket notation to the
base-pair graph representation, in which one vertex is assigned to
each unpaired nucleotide and one to each nucleotide pair in a stem.
Sort of the way you'd draw an RNA secondary structure, with paired
nucleotides merged together. A, C, G, U are kept as vertex labels
(concatenated for paired nucleotides), and edges are labeled 2 for
within-stem edges and 1 for the regular (backbone) edges. Again, just
like you'd draw a folded molecule, with a double edge for the stem
(helix).
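
The conversion described above can be sketched in Python as follows.
This is an illustration written from the description only, not a port
of brkt2bpgr.m. The output format (a list of vertex labels plus a
dictionary of undirected edges) and the rule that only edges between
directly stacked pairs get label 2 are assumptions.

```python
def bracket_to_bpgraph(seq, struct):
    """Return (vertex_labels, edges) for a bracket-notation structure.

    edges maps (u, v) with u < v to 2 (stem edge) or 1 (backbone edge).
    """
    if len(seq) != len(struct):
        raise ValueError("sequence and structure lengths differ")
    # Match brackets with a stack to find each base's pairing partner.
    pair = [None] * len(struct)
    stack = []
    for i, ch in enumerate(struct):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced brackets")
            j = stack.pop()
            pair[i], pair[j] = j, i
        elif ch != ".":
            raise ValueError("unexpected character: " + ch)
    if stack:
        raise ValueError("unbalanced brackets")
    # One vertex per unpaired base, one shared vertex per base pair.
    vertex_of = [None] * len(seq)
    labels = []
    for i in range(len(seq)):
        if vertex_of[i] is not None:
            continue
        vertex_of[i] = len(labels)
        if pair[i] is None:
            labels.append(seq[i])
        else:
            vertex_of[pair[i]] = len(labels)
            labels.append(seq[i] + seq[pair[i]])
    # Walk the backbone; stacked pairs yield a double (label 2) edge.
    edges = {}
    for i in range(len(seq) - 1):
        u, v = vertex_of[i], vertex_of[i + 1]
        if u == v:
            continue
        stacked = (pair[i] is not None and pair[i + 1] is not None
                   and pair[i] == pair[i + 1] + 1)
        key = (min(u, v), max(u, v))
        edges[key] = max(edges.get(key, 0), 2 if stacked else 1)
    return labels, edges

labels, edges = bracket_to_bpgraph("GGGAAACCC", "(((...)))")
```

For this hairpin, the three base pairs become three "GC" vertices
joined by label-2 edges, and the loop bases become three "A" vertices
joined by label-1 edges.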
utils/bpgr2dual.m
Converts the base-pair graph format to a dual graph. Don't ask me how this
works, it would take too long to figure out and longer to explain.
If you think it's a horrific algorithm/implementation, I agree, and
urge you to write a better one. Oh, and it works by starting with
the base-pair graph representation generated by brkt2bpgr.m, that's
how lame it is.
The resulting representation is a graph described in the paper
(minus the labelling scheme). The edges are labeled with the
lengths of the unpaired strand, and the vertices are labeled with
concatenated nucleotide sequences.
utils/countFreqs.m
Computes the first-order (mononucleotide) and second-order
(dinucleotide) statistics of an RNA sequence.
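
As a quick illustration of these statistics, here is a small Python
sketch. It is not the MATLAB code in countFreqs.m; returning
normalized frequencies rather than raw counts, and counting
overlapping dinucleotides, are assumptions.

```python
from collections import Counter

def count_freqs(seq):
    """Return (mono, di) frequency dictionaries for one sequence."""
    mono = Counter(seq)
    # Overlapping dinucleotides: positions (0,1), (1,2), ...
    di = Counter(a + b for a, b in zip(seq, seq[1:]))
    n_mono = sum(mono.values())
    n_di = sum(di.values())
    mono_freq = {k: c / n_mono for k, c in mono.items()} if n_mono else {}
    di_freq = {k: c / n_di for k, c in di.items()} if n_di else {}
    return mono_freq, di_freq

mono_freq, di_freq = count_freqs("AACG")
```

For "AACG" this gives a mononucleotide frequency of 0.5 for A and one
third for each of the dinucleotides AA, AC, and CG.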
utils/getSVMlabels.m
Compares the RNA family name of all RNAs in the Graph structure to
the 'targetfam' name in order to assign the "true labels" for SVM
classification.
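
In Python terms, the labeling step amounts to something like the
sketch below. The +1/-1 encoding and the function name are
assumptions; check getSVMlabels.m for what the package actually
returns.

```python
def svm_labels(families, target_fam):
    """Label members of target_fam as +1 and everything else as -1."""
    return [1 if fam == target_fam else -1 for fam in families]

labels = svm_labels(["tRNA", "synthetic", "tRNA"], "tRNA")
```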
svm/train.m
Train the SVM. Computes the kernel matrix and the classifier output
parameters.
This code uses the simple MATLAB implementation of least-squares
SVM (http://www.esat.kuleuven.ac.be/sista/lssvmlab/).
svm/classify.m
Using the computed support vectors, classifies the given data.
Again, adapted from LS-SVM.
svm/graphKernel.m
The core of this kernel machine. Implements the kernel between two
labeled graphs as described in Kashima, Tsuda, Inokuchi (2003).
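
To give a feel for this kind of kernel, here is a simplified Python
sketch of a random-walk kernel between two labeled graphs, in the
spirit of the cited paper. It is not the code in graphKernel.m. The
uniform start distribution, constant stopping probability, uniform
transitions over neighbors, delta kernels on labels, and fixed-point
iteration are all assumptions of this sketch. Graphs are given as
(vertex_labels, edges), with edges mapping (u, v) to an edge label.

```python
def delta(a, b):
    """Delta kernel: 1 if the two labels match, 0 otherwise."""
    return 1.0 if a == b else 0.0

def _adjacency(n, edges):
    adj = [[] for _ in range(n)]
    for (u, v), lab in edges.items():
        adj[u].append((v, lab))
        adj[v].append((u, lab))
    return adj

def graph_kernel(g1, g2, p_stop=0.1, tol=1e-12, max_iter=1000,
                 vertex_kernel=delta, edge_kernel=delta):
    (lab1, e1), (lab2, e2) = g1, g2
    n1, n2 = len(lab1), len(lab2)
    if n1 == 0 or n2 == 0:
        return 0.0
    adj1, adj2 = _adjacency(n1, e1), _adjacency(n2, e2)
    # Isolated vertices must stop, so their stop probability is 1.
    stop1 = [p_stop if adj1[v] else 1.0 for v in range(n1)]
    stop2 = [p_stop if adj2[w] else 1.0 for w in range(n2)]
    q = [[stop1[v] * stop2[w] for w in range(n2)] for v in range(n1)]
    R = [row[:] for row in q]
    # Fixed point of R = q + T R, where T couples walks on both graphs.
    for _ in range(max_iter):
        new = [[0.0] * n2 for _ in range(n1)]
        change = 0.0
        for v in range(n1):
            for w in range(n2):
                total = 0.0
                if adj1[v] and adj2[w]:
                    step = (1 - p_stop) ** 2 / (len(adj1[v]) * len(adj2[w]))
                    for u, eu in adj1[v]:
                        for x, ex in adj2[w]:
                            k = vertex_kernel(lab1[u], lab2[x]) * edge_kernel(eu, ex)
                            if k:
                                total += k * R[u][x]
                    total *= step
                new[v][w] = q[v][w] + total
                change = max(change, abs(new[v][w] - R[v][w]))
        R = new
        if change < tol:
            break
    start = 1.0 / (n1 * n2)  # uniform start on both graphs
    return start * sum(vertex_kernel(lab1[v], lab2[w]) * R[v][w]
                       for v in range(n1) for w in range(n2))

g1 = (["GC", "GC", "A"], {(0, 1): 2, (1, 2): 1})
g2 = (["GC", "A", "A"], {(0, 1): 1, (1, 2): 1})
k12 = graph_kernel(g1, g2)
```

The iteration converges because every step of the joint walk is
damped by (1 - p_stop)^2. Swapping in other vertex and edge kernels
corresponds to the choices made in edgeKernel.m and vertexKernel.m.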
svm/edgeKernel.m
Simple routine for computing the kernel between two edge labels. The
type of kernel used is specified in the parameters.
svm/vertexKernel.m
An analogous function that computes the kernel between two vertex labels.
analysis/drawBPGraph.m
Draws the base pair graph. Uses arrangeVert.m to spread out the
graph a bit.
analysis/drawDual.m
Draws the dual graph. Not a particularly intelligent function,
but does the trick.
analysis/arrangeVert.m
Arranges (by spreading out) the vertices of a graph. Starts by
arranging them in a circle, then uses a spring-type
repulsion/attraction mechanism to pull adjacent vertices
together.
analysis/plotSVMresult.m
Computes the true positive, true negative, false positive, and false
negative statistics from the Results structure and plots them as bars.
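
The four counts can be computed as in the Python sketch below. It is
an illustration only, not plotSVMresult.m; the +1/-1 label encoding
and the decision threshold of 0 on the scores are assumptions.

```python
def confusion_counts(true_labels, scores, threshold=0.0):
    """Count TP, TN, FP, FN for +1/-1 labels and real-valued scores."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for y, s in zip(true_labels, scores):
        predicted_positive = s > threshold
        if y == 1:
            counts["TP" if predicted_positive else "FN"] += 1
        else:
            counts["FP" if predicted_positive else "TN"] += 1
    return counts

counts = confusion_counts([1, 1, -1, -1], [0.9, -0.2, 0.3, -0.7])
```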
analysis/computeROC.m
Computes the ROC curve, area under it, and an error estimate if
multiple classification runs (e.g. in cross-validation) are given.
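
A minimal Python sketch of an ROC computation follows. It is not the
code in computeROC.m. The +1/-1 label encoding, the trapezoidal area,
and the use of the standard error of the mean as the error estimate
across runs are assumptions; the package may use a different estimate.

```python
import math
import statistics

def roc_points(true_labels, scores):
    """Return (FPR, TPR) points, sweeping the threshold from high to low."""
    n_pos = sum(1 for y in true_labels if y == 1)
    n_neg = len(true_labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative")
    pairs = sorted(zip(scores, true_labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        s = pairs[i][0]
        # Items with tied scores move across the threshold together.
        while i < len(pairs) and pairs[i][0] == s:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

def mean_and_sem(values):
    """Mean and standard error of the mean over several runs."""
    mean = statistics.mean(values)
    if len(values) < 2:
        return mean, 0.0
    return mean, statistics.stdev(values) / math.sqrt(len(values))

area = auc(roc_points([1, 1, -1, -1], [0.9, 0.2, 0.8, 0.1]))
```

In this example three of the four positive/negative score pairs are
ranked correctly, so the area is 0.75.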
data/tRNA.sec
An example of the secondary structure file used by this
software. This was produced (possibly with some post-processing) by
folding the RFAM RNA sequences with Vienna RNA's folding algorithm.
You'll have to decipher the file input scheme in loadGraphs.m to see
how things really work.