Coding Potential
Introduction
Thanks to Next Generation Sequencing methods, transcriptomes are becoming increasingly abundant. Once the transcripts have been assembled and the sequences that were transcribed into RNA are available, we must distinguish between coding transcripts (mRNA) and non-coding ones (ncRNA). This classification can be done by assigning each transcript a score based on its nucleotide composition and patterns.
Coding Potential Assessment Tool
The "Coding Potential Assessment Tool" provides an easy and fast way to classify transcripts according to their coding score. This tool integrates the CPAT algorithm within Blast2GO. The CPAT algorithm needs models in order to assign a coding potential score to each sequence. Blast2GO incorporates the standard CPAT models and adds models for some of the organisms most commonly used in molecular biology. In addition to the prebuilt models, this tool offers the option to create your own species-specific model.
Figure 1: Coding Potential Assessment Tool in the Blast2GO Analysis Menu
Run Coding Potential Assessment Tool
This tool can be found under Analysis → Coding Potential → Coding Potential Assessment (CPAT). The wizard allows you to adjust the analysis parameters (Figure 3).
Accuracy: By default, the accuracy is set automatically to reduce both false positives and false negatives: the threshold is the value at which the sensitivity equals the specificity.
If a higher accuracy is desired, it can be set manually. Raising the accuracy classifies the sequences into three categories: coding, non-coding, and transcripts with unknown coding potential (Figure 2). The accuracy can never be lower than the default value; if a lower value is entered, it automatically falls back to the default.
Figure 2: Accuracy: Interpretation of the double ROC Curve
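The threshold logic described for the accuracy parameter can be illustrated with a minimal Python sketch. It is not Blast2GO's or CPAT's implementation: the function names are hypothetical, the default cutoff is approximated by scanning the observed scores for the point where sensitivity and specificity are closest, and the two cutoffs used for the three-way classification are taken as inputs, since how they are derived from a manually chosen accuracy is not specified here.

```python
from typing import List, Tuple


def sens_spec(scores: List[float], labels: List[bool], cutoff: float) -> Tuple[float, float]:
    """Return (sensitivity, specificity) at a cutoff; labels are True for coding."""
    tp = sum(1 for s, coding in zip(scores, labels) if coding and s >= cutoff)
    fn = sum(1 for s, coding in zip(scores, labels) if coding and s < cutoff)
    tn = sum(1 for s, coding in zip(scores, labels) if not coding and s < cutoff)
    fp = sum(1 for s, coding in zip(scores, labels) if not coding and s >= cutoff)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both coding and non-coding examples are required")
    return tp / (tp + fn), tn / (tn + fp)


def balanced_cutoff(scores: List[float], labels: List[bool]) -> float:
    """Pick the observed score where sensitivity and specificity are closest."""
    def gap(cutoff: float) -> float:
        sens, spec = sens_spec(scores, labels, cutoff)
        return abs(sens - spec)
    return min(sorted(set(scores)), key=gap)


def classify(probability: float, noncoding_cutoff: float, coding_cutoff: float) -> str:
    """Three-way tag; with equal cutoffs the 'unknown' band is empty."""
    if probability >= coding_cutoff:
        return "coding"
    if probability < noncoding_cutoff:
        return "non-coding"
    return "unknown"


# Toy data (made up for illustration): coding probabilities with known labels.
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [False, False, False, True, False, True, True, True]
cutoff = balanced_cutoff(scores, labels)
print(cutoff, sens_spec(scores, labels, cutoff))
print(classify(0.5, noncoding_cutoff=0.3, coding_cutoff=0.7))
```

With the default accuracy both cutoffs coincide, so every transcript is tagged either coding or non-coding; moving the cutoffs apart opens the band of transcripts with unknown coding potential.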
Models: The algorithm needs models to calculate the coding potential of each transcript. Here we can choose the origin of these models:
- Prebuilt: Use one of the available prebuilt models. The analysis runs faster when a prebuilt model is selected.

  | Species | Accuracy | Coding Cutoff |
  |---|---|---|
  | Arabidopsis thaliana | 0.984 | 0.415 |
  | Bos taurus | 0.953 | 0.359 |
  | Caenorhabditis elegans | 0.998 | 0.523 |
  | Danio rerio | 0.984 | 0.38 |
  | Drosophila melanogaster | 0.963 | 0.39 |
  | Gallus gallus | 0.93 | 0.402 |
  | Homo sapiens | 0.966 | 0.364 |
  | Mus musculus | 0.955 | 0.440 |
  | Rattus norvegicus | 0.98 | 0.363 |
  | Sus scrofa | 0.946 | 0.467 |
  | Xenopus laevis | 0.963 | 0.415 |

- From files: Create the model by providing two FASTA files: one with coding sequences and another with non-coding sequences.
- From NCBI sequences: Create a new species-specific model from the sequences available in the NCBI database by entering the scientific name or ID of the species in the search box. A minimum of 1000 non-coding and coding sequences are required.
  Note: Checking `Get Parent-Taxa ncRNA` allows non-coding RNA sequences from higher parent taxa to be used until the 1000 required non-coding sequences are reached.
Figure 3: Wizard Page
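Before building a model from files, it can be useful to check how many sequences each FASTA file contains. The following Python sketch counts the records in the coding and the non-coding set. The helper names and the file names are hypothetical, and applying the minimum of 1000 sequences (stated above for NCBI-derived models) to file-based models is an assumption.

```python
from typing import Dict, Iterable, List


def count_records(lines: Iterable[str]) -> int:
    """Count FASTA records by counting header lines, which start with '>'."""
    return sum(1 for line in lines if line.startswith(">"))


def count_fasta_file(path: str) -> int:
    """Count the records of a FASTA file on disk."""
    with open(path, encoding="utf-8") as handle:
        return count_records(handle)


def undersized_sets(counts: Dict[str, int], minimum: int = 1000) -> List[str]:
    """Return the names of the training sets with fewer than `minimum` records.

    The default of 1000 is an assumption borrowed from the NCBI option.
    """
    return [name for name, n in counts.items() if n < minimum]


# In-memory example; for real data use count_fasta_file("coding.fasta") and
# count_fasta_file("noncoding.fasta") (hypothetical file names).
example = [">tx1\n", "ATGGCC\n", "TTTTAA\n", ">tx2\n", "ATGAAATAG\n"]
print(count_records(example))
print(undersized_sets({"coding": 1200, "non-coding": 800}))
```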
Results
Once the analysis has finished, three result types are created automatically:
- Coding Potential Table: Shows the results for each sequence:
  - Tag: Marks each sequence as a coding, non-coding, or unknown coding potential transcript.
  - Sequence: The name of the sequence.
  - mRNA size: The length of the original transcript.
  - ORF size: The size of the potential ORF within the sequence.
  - Fickett score: A linguistic feature that distinguishes protein-coding RNA from ncRNA according to the combined effect of nucleotide composition and codon usage bias.
  - Hexamer score: A log-likelihood ratio that measures differential hexamer usage between coding and non-coding sequences.
  - Coding Probability: The coding probability assigned to each transcript.
- Pie Chart: Shows the distribution of the classification results for the analyzed sequences, according to the provided cutoffs (Figure 4).
- Model Accuracy via a double ROC-Curve chart: This chart opens when a new model is created or when the accuracy is set manually. It allows us to check the quality, the accuracy, and the different thresholds of a model (Figure 5).
Figure 4: Distribution of the coding potential
Figure 5: Double ROC curve showing the model quality, accuracy and threshold
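Two of the columns in the Coding Potential Table, ORF size and Hexamer score, can be illustrated with a short Python sketch. It is a simplified illustration and not the CPAT or Blast2GO implementation: the ORF search only scans the three forward frames for an ATG followed by an in-frame stop codon, the hexamer frequency tables are made-up inputs, and the pseudocount used to avoid division by zero is an assumption.

```python
import math
from typing import Dict

STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq: str) -> str:
    """Longest ATG...stop ORF in the three forward frames (stop codon included)."""
    seq = seq.upper()
    best = ""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                orf = seq[start:i + 3]
                if len(orf) > len(best):
                    best = orf
                start = None
    return best


def hexamer_score(orf: str, coding_freq: Dict[str, float],
                  noncoding_freq: Dict[str, float], pseudo: float = 1e-6) -> float:
    """Mean log-likelihood ratio of in-frame hexamers (coding vs. non-coding)."""
    hexamers = [orf[i:i + 6] for i in range(0, len(orf) - 5, 3)]
    if not hexamers:
        return 0.0
    total = sum(
        math.log((coding_freq.get(h, 0.0) + pseudo) / (noncoding_freq.get(h, 0.0) + pseudo))
        for h in hexamers
    )
    return total / len(hexamers)


orf = longest_orf("CCATGAAATTTTAGGG")
print(orf, len(orf))  # the ORF size is the length of this string

# Made-up frequency tables: every hexamer is twice as frequent in coding sequences.
coding = {"ATGAAA": 0.02, "AAATTT": 0.02, "TTTTAG": 0.02}
noncoding = {"ATGAAA": 0.01, "AAATTT": 0.01, "TTTTAG": 0.01}
print(hexamer_score(orf, coding, noncoding))
```

A positive score means the hexamers of the sequence are more typical of coding sequences, and a negative score means they are more typical of non-coding ones.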