isoformCleaner

Proteomes often contain annotated isoforms. While this is generally a good thing, in some cases it can cause problems (e.g. when estimating the quality of a proteome with DOGMA (https://domainworld.uni-muenster.de/programs/dogma/)). In such cases, a common approach is to keep only the longest isoform of each protein and remove the others from the set. Unfortunately, there is no general agreement on how to mark isoforms, so this program tries to handle the most common conventions.

Options

General options

The general options control the overall behaviour of isoformCleaner:

-h, --help: Prints a simple help message with a small description of all the available options.

-i <FILE>, --in: The sequence file to filter. The format should be FASTA format.

-o <FILE>, --out <FILE>: The output file. If none is provided, the sequences will be printet to the console.

-s <CHAR>, --split-char <CHAR>: The split character to use (default: '-'). Splitting on a character is the default cleaning mode; it is turned off if you use any other available cleaning option. If the split character is not found in a sequence name, a warning message is displayed and the sequence is kept in the output.

--summary: Lists the number of sequences in the input, the number in the output, and the difference between the two.

Regex options

Regular expressions can help to identify the gene name inside the FASTA header of a sequence.

-r <ARG>, --regular <ARG>: A regular expression that determines the gene name of the isoform. For more information on the allowed C++ regular expressions, have a look at the following website: http://www.cplusplus.com/reference/regex/ECMAScript/

-n, --name: Search name only

-c, --comment: Search comment only

-p <ARG>, --preset <ARG>: Currently we have two presets that can be used to identify gene names. Preset regex: Can be either ‘flybase’ or ‘gene’

Simple Usage

If you have a FASTA file of proteins whose isoforms are marked by a suffix after a split character (e.g. name-RA, name-RB), you can use the split character option to identify the different isoforms.

File: proteome.fa

>seq1-RA
ThisIsAShortIsoform
>seq1-RB
ThisIsALongerIsoformOfTheSameProtein

$ isoformCleaner -i proteome.fa -s '-'
>seq1-RB
ThisIsALongerIsoformOfTheSameProtein
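
To make the selection rule concrete, here is a minimal Python sketch of the idea behind the split character mode. It is not the isoformCleaner implementation: the function names parse_fasta and keep_longest_by_split are hypothetical, and splitting at the last occurrence of the character is an assumption made for illustration.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def keep_longest_by_split(records, split_char="-"):
    """Keep the longest sequence for each gene name.

    The gene name is taken as the part of the sequence name before the
    split character. Assumption: the split happens at the LAST occurrence,
    and names without the split character form their own group.
    """
    best = {}
    for header, seq in records:
        parts = header.split()
        name = parts[0] if parts else header
        gene = name.rsplit(split_char, 1)[0] if split_char in name else name
        if gene not in best or len(seq) > len(best[gene][1]):
            best[gene] = (header, seq)
    return list(best.values())


example = """>seq1-RA
ThisIsAShortIsoform
>seq1-RB
ThisIsALongerIsoformOfTheSameProtein
"""

for header, seq in keep_longest_by_split(parse_fasta(example)):
    print(">" + header)
    print(seq)
```

With the proteome.fa content from above, both records share the gene name seq1, so only the longer one (seq1-RB) is kept.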

In some cases a simple split character is not sufficient. In that case a regular expression may help. For some common header formats, predefined expressions (presets) are available:

File: regex.fa

>seq1 gene:1
ThisIsAShortIsoform
>seq2 gene:1
ThisIsALongerIsoformOfTheSameProtein

$ isoformCleaner -i regex.fa -r "gene[:=]\\s*([\\S]+)[\\s]*"
>seq2 gene:1
ThisIsALongerIsoformOfTheSameProtein

# The regular expression above is already provided as a preset:
$ isoformCleaner -i regex.fa -p gene
>seq2 gene:1
ThisIsALongerIsoformOfTheSameProtein
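
The grouping done by the regular expression can be illustrated with a short Python sketch. This is an illustration under assumptions, not the tool's code: gene_from_header and keep_longest_by_regex are hypothetical names, Python's re module stands in for the C++ ECMAScript engine (this simple pattern is valid in both), and keeping headers that do not match is an assumed behaviour.

```python
import re

# Same pattern as in the example above; the first capture group is the gene name.
GENE_PATTERN = re.compile(r"gene[:=]\s*([\S]+)[\s]*")


def gene_from_header(header, pattern=GENE_PATTERN):
    """Return the captured gene name, or None if the pattern does not match."""
    match = pattern.search(header)
    return match.group(1) if match else None


def keep_longest_by_regex(records, pattern=GENE_PATTERN):
    """Keep the longest sequence for each gene name found by the pattern.

    Assumption: headers without a match are kept as their own group.
    """
    best = {}
    for header, seq in records:
        gene = gene_from_header(header, pattern)
        key = ("gene", gene) if gene is not None else ("header", header)
        if key not in best or len(seq) > len(best[key][1]):
            best[key] = (header, seq)
    return list(best.values())


records = [
    ("seq1 gene:1", "ThisIsAShortIsoform"),
    ("seq2 gene:1", "ThisIsALongerIsoformOfTheSameProtein"),
]

for header, seq in keep_longest_by_regex(records):
    print(">" + header)
    print(seq)
```

Both headers yield the gene name 1, so the two sequences are treated as isoforms of one gene and only seq2 is kept.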