Software for experiments reported in

  "Classification of non-Coding RNA using graph representations of
  secondary structure"
  Yan Karklin, Richard F. Meraz, and Stephen R. Holbrook.
  (Pacific Symposium on Biocomputing 2005).

This package includes only the single family classifier code. The rest
(multi-class classification and analysis, for example) is included in
the "rnagraphs_misc.tgz" package. It's not documented or guaranteed to
work - it's only there to make your life easier, not mine more
complicated.

Send questions and comments to yan+@cs.cmu.edu.

yan karklin. nov 2004.

#####################################################################

Here's what's in this distribution:

  run.m
  loadParams.m
  loadGraphs.m
  labelGraphs.m
  makeSynthRNA.m
  doClassification.m
  utils/brkt2bpgr.m
  utils/bpgr2dual.m
  utils/countFreqs.m
  utils/getSVMlabels.m
  svm/train.m
  svm/classify.m
  svm/graphKernel.m
  svm/edgeKernel.m
  svm/vertexKernel.m
  analysis/drawBPGraph.m
  analysis/drawDual.m
  analysis/arrangeVert.m
  analysis/plotSVMresult.m
  analysis/computeROC.m
  data/tRNA.sec

Here's what's not:

  RFAM - Data used in experiments.
    http://www.sanger.ac.uk/Software/Rfam/

  Vienna RNA Package - Used for folding RFAM and synthetic RNAs
    http://www.tbi.univie.ac.at/~ivo/RNA/

To duplicate results in the paper, you will have to download RNAs from
RFAM and put them in the ./data/ directory. Then install vienna RNA,
point this code to it, and fold the RNAs. You might have to post-process
the generated bracket-notation files; I've included an example of what
these should look like, in ./data/tRNA.sec (a few tRNAs from RFAM).

####################################################################

A brief overview of the included functions. See comments in individual
files for detailed instruction.

run.m
  This is the main run function. Use it as an example of how to run
  your own classifier. If you're lazy (like me), ignore all the text
  below and start by looking at this function.

loadParams.m
  Specify the representation (graph type, label scheme) and learning
  (kernel, svm) parameters, as well as parameters for generating the
  negative (synthetic) data set.

loadGraphs.m
  Read secondary structure files and convert them into graph
  structures.

labelGraphs.m
  Convert nucleotide labels on the graph vertices to labels for kernel
  computation. The scheme used here, together with the exact form of
  the vertex/edge computation, specifies the kernel for the SVM
  classifier.

makeSynthRNA.m
  Take the positive data set (real RNAs), compute their statistics,
  generate random sequences, fold them, and write bracket-notation
  secondary structure files, to be read by loadGraphs.m.

doClassification.m
  Run the SVM classifier (with cross-validation) on the loaded data.
  The returned "Results" structure contains the true labels, the kernel
  matrix, the assigned scores, computed SVM quantities alpha/beta, and
  the indices used for the randomized cross-validation procedure.

utils/brkt2bpgr.m
  Converts RNA secondary structures in bracket notation to the
  base-pair graph representation, in which there is one vertex for each
  unpaired nucleotide and one for each nucleotide pair in a stem. Sort
  of the way you'd draw an RNA secondary structure, with paired
  nucleotides merged together. A, C, G, U are kept as vertex labels
  (concatenated for paired nucleotides), and edges are labeled 2 for
  within-stem edges and 1 for the regular (backbone) edges. Again, just
  like you'd draw a folded molecule, with a double edge for the stem
  (helix).

utils/bpgr2dual.m
  Converts the above format to a dual graph. Don't ask me how this
  works, it would take too long to figure out and longer to explain. If
  you think it's a horrific algorithm/implementation, I agree, and urge
  you to write a better one. Oh, and it works by starting with the
  base-pair graph representation generated by brkt2bpgr.m, that's how
  lame it is. The resulting representation is the graph described in
  the paper (minus the labelling scheme).
  The edges are labeled with the length of the unpaired strand, and the
  vertices are labeled with concatenated nucleotide sequences.

utils/countFreqs.m
  Computes the 1st-order (mono-nucleotide) and 2nd-order
  (di-nucleotide) statistics of an RNA sequence.

utils/getSVMlabels.m
  Compares the RNA family name of all RNAs in the Graph structure to
  the 'targetfam' name in order to assign the "true labels" for SVM
  classification.

svm/train.m
  Train the SVM. Computes the kernel matrix and the classifier output
  parameters. This code uses the simple matlab implementation of
  least-squares SVM (http://www.esat.kuleuven.ac.be/sista/lssvmlab/).

svm/classify.m
  Using the computed support vectors, classifies the given data. Again,
  adapted from LS-SVM.

svm/graphKernel.m
  The core of this kernel machine. Implements the kernel between two
  labeled graphs as described in Kashima, Tsuda, Inokuchi (2003).

svm/edgeKernel.m
  Simple routine for computing the kernel between two edge labels. The
  type of kernel used is specified in the parameters.

svm/vertexKernel.m
  An analogous routine that computes the kernel between two vertex
  labels.

analysis/drawBPGraph.m
  Draws the base-pair graph. Uses arrangeVert.m to spread out the graph
  a bit.

analysis/drawDual.m
  Draws the dual graph. Not a particularly intelligent function, but
  does the trick.

analysis/arrangeVert.m
  Arranges (by spreading out) the vertices of a graph. Starts by
  arranging them in a circle, then uses a spring-type
  repulsion/attraction mechanism to pull adjacent vertices together.

analysis/plotSVMresult.m
  Computes the TruePositive/TrueNeg/FP/FN statistics from the Result
  structure and plots them as bars.

analysis/computeROC.m
  Computes the ROC curve, the area under it, and an error estimate if
  multiple classification runs (e.g. in cross-validation) are given.

data/tRNA.sec
  An example of the secondary structure file used by this software.
  This was produced (possibly with some post-processing) by folding the
  RFAM RNA sequences with Vienna RNA's folding algorithm. You'll have
  to decipher the file input scheme in loadGraphs.m to see how things
  really work.
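
####################################################################

A rough illustration of the base-pair graph described under
utils/brkt2bpgr.m above. This is a Python sketch, not the MATLAB code
shipped in this package, and it is only one reading of that
description. The function name bracket_to_bp_graph is hypothetical, the
sequence and bracket string are assumed to arrive as two plain strings
of equal length (the actual .sec file layout may differ), and the edge
label is assumed to be the number of backbone links joining two
vertices, which gives 2 between stacked pairs and 1 elsewhere.

```python
from collections import Counter


def bracket_to_bp_graph(seq, brkt):
    """Build a base-pair graph from a sequence and a dot-bracket string.

    Returns (labels, edges). labels[v] is the nucleotide label of vertex v
    (two letters for a base pair). edges maps frozenset({u, v}) to
    1 (backbone edge) or 2 (within-stem edge).
    Hypothetical helper for illustration only.
    """
    if len(seq) != len(brkt):
        raise ValueError("sequence and structure must have the same length")

    # Match brackets with a stack to find each position's pairing partner.
    partner = [None] * len(brkt)
    stack = []
    for i, ch in enumerate(brkt):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced ')' at position %d" % i)
            j = stack.pop()
            partner[i], partner[j] = j, i
        elif ch != ".":
            raise ValueError("unexpected character %r" % ch)
    if stack:
        raise ValueError("unbalanced '(' at position %d" % stack[-1])

    # One vertex per unpaired nucleotide, one shared vertex per base pair.
    vertex_of = [None] * len(seq)
    labels = []
    for i, nt in enumerate(seq):
        j = partner[i]
        if j is not None and j < i:
            # Closing side of a pair: reuse the vertex made at the opening side.
            vertex_of[i] = vertex_of[j]
            continue
        vertex_of[i] = len(labels)
        labels.append(nt if j is None else nt + seq[j])

    # Assumption: an edge label counts the backbone links between two
    # vertices. Stacked pairs are linked on both strands, so they get 2.
    edges = Counter()
    for i in range(len(seq) - 1):
        u, v = vertex_of[i], vertex_of[i + 1]
        if u != v:  # skip self-loops
            edges[frozenset((u, v))] += 1
    return labels, dict(edges)


if __name__ == "__main__":
    # A made-up hairpin: a three-pair stem closing a three-nucleotide loop.
    labels, edges = bracket_to_bp_graph("GGGAAACCC", "(((...)))")
    print(labels)  # ['GC', 'GC', 'GC', 'A', 'A', 'A']
    for pair, weight in sorted(edges.items(), key=lambda kv: sorted(kv[0])):
        print(sorted(pair), weight)
```

Counting backbone links keeps the sketch short: no separate pass is
needed to decide which edges lie inside a stem. Whether brkt2bpgr.m
arrives at its labels the same way is not stated here, so check the
comments in that file before relying on this.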