Software for experiments reported in 

    "Classification of non-Coding RNA using graph 
       representations of secondary structure"
    Yan Karklin, Richard F. Meraz, and Stephen R. Holbrook. 
    (Pacific Symposium on Biocomputing 2005).

This package includes only the single-family classifier code.  The
rest (multi-class classification and analysis, for example) is
included in the "rnagraphs_misc.tgz" package.  It's not documented or
guaranteed to work - it's only there to make your life easier, not
mine more complicated.

Send questions and comments to yan+@cs.cmu.edu.  
                                       yan karklin.  nov 2004.

#####################################################################

Here's what's in this distribution:

run.m
loadParams.m
loadGraphs.m
labelGraphs.m
makeSynthRNA.m
doClassification.m

utils/brkt2bpgr.m
utils/bpgr2dual.m
utils/countFreqs.m
utils/getSVMlabels.m

svm/train.m
svm/classify.m
svm/graphKernel.m
svm/edgeKernel.m
svm/vertexKernel.m

analysis/drawBPGraph.m
analysis/drawDual.m
analysis/arrangeVert.m
analysis/plotSVMresult.m
analysis/computeROC.m

data/tRNA.sec


Here's what's not:

RFAM - Data used in experiments.
  http://www.sanger.ac.uk/Software/Rfam/
Vienna RNA Package - Used for folding RFAM and synthetic RNAs
  http://www.tbi.univie.ac.at/~ivo/RNA/
  
To duplicate results in the paper, you will have to download RNAs from
RFAM and put them in the ./data/ directory.  Then install vienna RNA,
point this code to it, and fold the RNAs.  You might have to
post-process the generated bracket-notation files; I've included an
example of what these should look like, in ./data/tRNA.sec (a few
tRNAs from RFAM).


####################################################################

A brief overview of the included functions.  See the comments in the
individual files for detailed instructions.

run.m

  This is the main run function.  Use it as an example of
  how to run your own classifier.

  If you're lazy (like me), ignore all the text below and start 
  by looking at this function.  

loadParams.m

  Specify the representation (graph type, label scheme) and
  learning (kernel, svm) parameters, as well as parameters for
  generating the negative (synthetic) data set.

loadGraphs.m

  Read secondary structure files and convert them into graph
  structures.

labelGraphs.m

  Convert nucleotide labels on the graph vertices to labels for
  kernel computation.  The scheme used here, together with the exact
  form of the vertex/edge kernel computation, specifies the kernel
  for the SVM classifier.

makeSynthRNA.m

  Take the positive data set (real RNAs), compute their statistics,
  generate random sequences, fold them, and write bracket-notation
  secondary structure files, to be read by loadGraphs.m.
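
  The "generate random sequences" step can be pictured with the
  Python sketch below.  It is not the MATLAB code in makeSynthRNA.m:
  the function name, the first-order Markov sampling, and the fallback
  to mononucleotide frequencies are all assumptions made for
  illustration, and the folding step (which needs Vienna RNA) is left
  out.

```python
import random


def sample_sequence(mono, di, length, rng):
    """Draw one sequence from a first-order Markov chain (hypothetical helper).

    mono maps each nucleotide to its frequency; di maps each dinucleotide
    (a 2-character string) to its frequency.
    """
    if length <= 0:
        return ""
    alphabet = sorted(mono)
    mono_w = [mono[a] for a in alphabet]
    # First nucleotide: sampled from the mononucleotide frequencies.
    seq = rng.choices(alphabet, weights=mono_w)
    while len(seq) < length:
        prev = seq[-1]
        # Weight each candidate by the frequency of the dinucleotide prev+candidate.
        weights = [di.get(prev + b, 0.0) for b in alphabet]
        if sum(weights) == 0:
            weights = mono_w  # unseen context: fall back to mononucleotide frequencies
        seq.append(rng.choices(alphabet, weights=weights)[0])
    return "".join(seq)


# Made-up statistics for the example: A is always followed by C, and C by A.
mono = {"A": 0.5, "C": 0.5, "G": 0.0, "U": 0.0}
di = {"AC": 1.0, "CA": 1.0}
synthetic = sample_sequence(mono, di, 12, random.Random(0))
```

  In practice the two frequency tables would come from the positive
  data set rather than being written by hand.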

doClassification.m
  
  Run the SVM classifier (with cross-validation) on the loaded data.
  The returned "Results" structure contains the true labels, the
  kernel matrix, the assigned scores, computed SVM quantities
  alpha/beta, and the indices used for the randomized cross-validation
  procedure.
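
  As a rough picture of what "randomized cross-validation indices"
  means, here is a Python sketch (not the MATLAB code).  The function
  name, the fold count, and the seed are hypothetical; see
  doClassification.m for the actual procedure.

```python
import random


def cross_validation_folds(n_items, n_folds, seed=0):
    """Shuffle item indices and deal them into n_folds disjoint test folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Fold k takes every n_folds-th shuffled index, starting at position k.
    return [indices[k::n_folds] for k in range(n_folds)]


folds = cross_validation_folds(n_items=10, n_folds=3)
for test_idx in folds:
    held_out = set(test_idx)
    train_idx = [i for i in range(10) if i not in held_out]
    # ...train on train_idx, score the items in test_idx...
```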


utils/brkt2bpgr.m

  Converts RNA secondary structures in bracket notation to the
  base-pair graph representation, in which a vertex is devoted to each
  unpaired nucleotide and to each nucleotide pair in a stem.  Sort of
  the way you'd draw an RNA secondary structure, with paired
  nucleotides merged together.  A, C, G, U are kept as vertex labels
  (concatenated for paired nucleotides), and edges are labeled 2 for
  within-stem edges and 1 for the regular (backbone) edges.  Again,
  just like you'd draw a folded molecule, with a double edge for the
  stem (helix).
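
  The conversion described above can be sketched in Python as follows.
  This is an illustration written from the description, not a port of
  brkt2bpgr.m: the function name, the vertex numbering, and the
  returned data layout (a label list plus an edge dictionary) are
  assumptions, and only "(" / ")" are treated as pairing characters.

```python
def bracket_to_bp_graph(seq, struct):
    """Build a base-pair graph from a sequence and its bracket notation."""
    if len(seq) != len(struct):
        raise ValueError("sequence and structure must have equal length")
    # Match brackets with a stack: pair[i] is the partner of position i.
    pair = [None] * len(struct)
    stack = []
    for i, ch in enumerate(struct):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            j = stack.pop()
            pair[i], pair[j] = j, i
    if stack:
        raise ValueError("unbalanced structure")

    # One vertex per unpaired nucleotide, one shared vertex per base pair.
    vert = [None] * len(seq)
    labels = []
    for i in range(len(seq)):
        if vert[i] is not None:
            continue
        vert[i] = len(labels)
        if pair[i] is None:
            labels.append(seq[i])
        else:
            vert[pair[i]] = vert[i]
            labels.append(seq[i] + seq[pair[i]])  # concatenated pair label

    # Backbone neighbours become edges; stacked pairs get label 2, others 1.
    edges = {}
    for i in range(len(seq) - 1):
        u, v = vert[i], vert[i + 1]
        if u == v:
            continue
        stacked = (pair[i] is not None and pair[i + 1] is not None
                   and pair[i] == pair[i + 1] + 1)
        key = (min(u, v), max(u, v))
        edges[key] = max(edges.get(key, 0), 2 if stacked else 1)
    return labels, edges


labels, edges = bracket_to_bp_graph("GGGAAACCC", "(((...)))")
```

  For this small hairpin the three G-C pairs collapse into three "GC"
  vertices joined by label-2 edges, and the loop nucleotides form a
  chain of label-1 edges that closes back onto the innermost pair.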

utils/bpgr2dual.m

  Converts the above format to a dual graph.  Don't ask me how this
  works, it would take too long to figure out and longer to explain.
  If you think it's a horrific algorithm/implementation, I agree, and
  urge you to write a better one.  Oh, and it works by starting with
  the base-pair graph representation generated by brkt2bpgr.m, that's
  how lame it is.
  The resulting representation is the graph described in the paper
  (minus the labelling scheme).  The edges are labeled with the
  lengths of the unpaired strand, and the vertices are labeled with
  concatenated nucleotide sequences.

utils/countFreqs.m

  Computes the 1st-order (mononucleotide) and 2nd-order (dinucleotide)
  statistics of an RNA sequence.
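
  A minimal Python sketch of these two statistics, assuming they are
  plain relative frequencies over the alphabet ACGU (the function name
  and the normalisation are assumptions; countFreqs.m may differ):

```python
from collections import Counter


def count_freqs(seq, alphabet="ACGU"):
    """Return mononucleotide and dinucleotide relative frequencies."""
    mono = Counter(seq)
    di = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    n_mono = max(len(seq), 1)      # avoid dividing by zero on empty input
    n_di = max(len(seq) - 1, 1)
    mono_freq = {a: mono[a] / n_mono for a in alphabet}
    di_freq = {a + b: di[a + b] / n_di for a in alphabet for b in alphabet}
    return mono_freq, di_freq


mono_freq, di_freq = count_freqs("ACGU")
```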
  
utils/getSVMlabels.m

  Compares the RNA family name of all RNAs in the Graph structure to 
  the 'targetfam' name in order to assign the "true labels" for SVM
  classification.
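
  In Python terms the idea is roughly the following.  The +1/-1
  encoding and the flat list of family names are assumptions; the
  actual Graph structure fields are defined in the MATLAB code.

```python
def get_svm_labels(families, target_fam):
    """Label each RNA +1 if it belongs to the target family, else -1."""
    return [1 if fam == target_fam else -1 for fam in families]


# Hypothetical family names, for illustration only.
true_labels = get_svm_labels(["tRNA", "synthetic", "tRNA"], "tRNA")
```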

svm/train.m

  Train the SVM.  Computes the kernel matrix and the classifier output
  parameters.  
  This code uses the simple MATLAB implementation of least-squares
  SVM, LS-SVM (http://www.esat.kuleuven.ac.be/sista/lssvmlab/).

svm/classify.m

  Classifies the given data using the support vectors computed in
  training.  Again, adapted from LS-SVM.
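
  For readers unfamiliar with least-squares SVMs, the Python sketch
  below shows the general idea on a precomputed kernel matrix: training
  reduces to one linear system, and scoring is a weighted kernel sum.
  It is not the code in train.m or classify.m.  The function names, the
  regularisation parameter gamma, and the choice of the
  regression-on-labels form of the system are assumptions.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def lssvm_train(K, y, gamma=10.0):
    """Solve [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y]."""
    n = len(y)
    A = [[0.0] + [1.0] * n]
    for i in range(n):
        A.append([1.0] + [K[i][j] + (1.0 / gamma if i == j else 0.0)
                          for j in range(n)])
    sol = solve_linear(A, [0.0] + [float(v) for v in y])
    return sol[1:], sol[0]  # alpha, b


def lssvm_score(k_row, alpha, b):
    """Score one item given its kernel values against the training items."""
    return sum(a * k for a, k in zip(alpha, k_row)) + b


# Toy example: four 1-D points with a linear kernel (stand-in for the graph kernel).
xs = [0.0, 1.0, 4.0, 5.0]
y = [-1, -1, 1, 1]
K = [[a * c for c in xs] for a in xs]
alpha, b = lssvm_train(K, y)
scores = [lssvm_score(row, alpha, b) for row in K]
```

  The sign of each score gives the predicted class; in the package the
  kernel matrix would come from graphKernel.m instead of a dot product.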

svm/graphKernel.m

  The core of this kernel machine.  Implements the kernel between two
  labeled graphs as described in Kashima, Tsuda, and Inokuchi (2003).
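
  The Python sketch below shows a random-walk kernel between labeled
  graphs in the spirit of that paper, computed by fixed-point
  iteration.  It is an illustration only and should not be read as the
  algorithm in graphKernel.m: the uniform start distribution, the
  constant stopping probability, the delta label kernels, and the
  fixed iteration count are all assumptions.

```python
def adjacency(n, edges):
    """Turn {(u, v): label} into per-vertex lists of (neighbour, edge label)."""
    adj = [[] for _ in range(n)]
    for (u, v), lab in edges.items():
        adj[u].append((v, lab))
        adj[v].append((u, lab))
    return adj


def delta(a, b):
    return 1.0 if a == b else 0.0


def graph_kernel(labels1, edges1, labels2, edges2, stop=0.1, n_iter=50):
    """Expected label agreement of simultaneous random walks on two graphs."""
    n1, n2 = len(labels1), len(labels2)
    if n1 == 0 or n2 == 0:
        return 0.0
    adj1, adj2 = adjacency(n1, edges1), adjacency(n2, edges2)
    # Isolated vertices must stop; others stop with probability `stop`.
    q1 = [stop if adj1[v] else 1.0 for v in range(n1)]
    q2 = [stop if adj2[w] else 1.0 for w in range(n2)]
    r = [[q1[v] * q2[w] for w in range(n2)] for v in range(n1)]
    for _ in range(n_iter):
        new = [[q1[v] * q2[w] for w in range(n2)] for v in range(n1)]
        for v in range(n1):
            for w in range(n2):
                if not adj1[v] or not adj2[w]:
                    continue
                step = ((1.0 - stop) / len(adj1[v])) * ((1.0 - stop) / len(adj2[w]))
                total = 0.0
                for u, e1 in adj1[v]:
                    for x, e2 in adj2[w]:
                        total += delta(labels1[u], labels2[x]) * delta(e1, e2) * r[u][x]
                new[v][w] += step * total
        r = new
    # Uniform start: average over all pairs of matching start vertices.
    return sum(delta(labels1[v], labels2[w]) * r[v][w]
               for v in range(n1) for w in range(n2)) / (n1 * n2)


g1 = (["GC", "A"], {(0, 1): 1})
g2 = (["GC", "A", "A"], {(0, 1): 1, (1, 2): 1})
g3 = (["U", "U"], {(0, 1): 1})
k12 = graph_kernel(*g1, *g2)
```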

svm/edgeKernel.m

  Simple routine for computing the kernel between two edge labels.  The
  type of kernel used is specified in the parameters.

svm/vertexKernel.m

  An analogous routine that computes the kernel between two vertex labels.
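
  As an illustration of a label kernel whose type is chosen by a
  parameter, here is a Python sketch.  The two kernel types shown
  ("delta" and "gaussian") and the parameter names are hypothetical;
  the options actually supported are listed in loadParams.m and the
  kernel files themselves.

```python
import math


def label_kernel(a, b, kind="delta", sigma=1.0):
    """Kernel between two labels; `kind` selects the kernel type."""
    if kind == "delta":
        # Exact-match kernel, suitable for discrete labels.
        return 1.0 if a == b else 0.0
    if kind == "gaussian":
        # Smooth similarity for numeric labels such as strand lengths.
        return math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2))
    raise ValueError("unknown kernel type: %r" % (kind,))


same_vertex = label_kernel("GC", "GC")                  # discrete vertex labels
close_edges = label_kernel(3, 4, kind="gaussian")       # numeric edge labels
```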

analysis/drawBPGraph.m

  Draws the base pair graph.  Uses arrangeVert.m to spread out the
  graph a bit.

analysis/drawDual.m

  Draws the dual graph.  Not a particularly intelligent function,
  but does the trick.

analysis/arrangeVert.m

  Arranges (by spreading out) the vertices of a graph.  Starts by
  arranging them in a circle, then uses a spring-type
  repulsion/attraction mechanism to pull adjacent vertices together.

analysis/plotSVMresult.m

  Computes the TruePositive/TrueNeg/FP/FN statistics from the Results
  structure and plots them as bars.
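
  The four counts can be computed as in the Python sketch below
  (plotting omitted).  The zero decision threshold and the +1/-1 label
  encoding are assumptions, as is the function name.

```python
def confusion_counts(scores, labels, threshold=0.0):
    """Count TP, TN, FP, FN for scores thresholded against +1/-1 labels."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for score, label in zip(scores, labels):
        predicted_positive = score > threshold
        if predicted_positive and label > 0:
            counts["TP"] += 1
        elif predicted_positive:
            counts["FP"] += 1
        elif label > 0:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts


counts = confusion_counts([0.7, -0.2, 0.4, -0.9], [1, 1, -1, -1])
```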

analysis/computeROC.m

  Computes the ROC curve, area under it, and an error estimate if
  multiple classification runs (e.g. in cross-validation) are given.
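
  For a single run, an ROC curve and its area can be obtained as in
  the Python sketch below.  This is a generic construction (threshold
  sweep plus trapezoidal rule), not necessarily what computeROC.m
  does; the function names are hypothetical and the error estimate
  over multiple runs is not shown.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points, labels in +1/-1."""
    n_pos = sum(1 for y in labels if y > 0)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative item")
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    points = [(0.0, 0.0)]
    i = 0
    while i < len(order):
        current = scores[order[i]]
        # Items with tied scores cross the threshold together.
        while i < len(order) and scores[order[i]] == current:
            if labels[order[i]] > 0:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a piecewise-linear curve."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


points = roc_points([0.9, 0.4, 0.6, 0.1], [1, 1, -1, -1])
auc = area_under_curve(points)
```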

data/tRNA.sec

  An example of the secondary structure file used by this
  software. This was produced (possibly with some post-processing) by
  folding the RFAM RNA sequences with Vienna RNA's folding algorithm.
  You'll have to decipher the file input scheme in loadGraphs.m to see
  how things really work.