seqCheck

Unfortunately sometimes protein sequences sets may contain stop characters (usually ‘*’ or ‘.’). In many programs those chars will cause problems. stopCleaner can help you with this. The default settings will simply remove stopc chars at the end of a sequence. But there are other options that allow you to take care of stop chars in the middle of sequences as well.

Warning

This program allows you to fastly clean a sequence file so that issues that can cause problems are solved. However, you sould have a look at what caused the issues to appear in the first place.

If you decide to correct the problems, the following order is applied:

  1. end stops

  2. all other stops

  3. strange symbols

  4. duplicates

Options

General options

The general options influence the general behaviour of stopCleaner:

-h, --help

Prints a simple help message with a small description of all the available options.

-i <FILE>, --in

The sequence file to filter. The format should be FASTA format.

-o <FILE>, --out <FILE>

The output file. If none is provided, the sequences will be printet to the console.

-r <FILE>, --report <FILE>

The file to store the report in. If not provided report is printed to console.

-A, --check-all

Enables all available checks

--fix-and-keep

Fixes all detected problems and keeps all sequences (enables all checks)

--fix-and-remove

Fixes all detected problems by removing all sequences that were detected to have problems. Only exception are end stops which are still only shortened.

Stop check options

Stop signes can appear for normal reasons (e.g. translated last codon) or for problematic reasons (e.g. wrong gene model). Whatever the reason, Stop signs in protein sequences (e.g. ‘.’ or ‘*’) often cause problems for other programs.

--check-stops

This option will check for stop chars

--fix-ending

Removes all stops at the end of a sequence. This step is done before the other stop corrections!

--remove-stop-genes

Remove all genes with stops

--stop-char

The stop chars to use. (default: .*)

--replace-stop

Replace in sequence stops with ambigious char (‘N’ in case of DNA/RNA sequences, ‘X’ in proteins).

Alphabet check options

Sometimes sequences contain rare amino acids (e.g. U - Selenocystein). These symbols will create problems in some problems. This section contains options to either remove sequences containing the sysmbol or replace it with a less problematic symbol. These options here ignore the stop symbols.

--check-alphabet

Check if a usual alphabet is used

--set-alphabet

The alphabet to use (protein, DNA or RNA).

--replace-char

Replace weird chars with ambigious one (‘N’ in case of DNA/RNA sequences, ‘X’ in proteins).

--remove-alpha

Removes sequences with problematic chars

Duplicate check options

Sequence names should be uniq within a file. However, sometimes due to errors in a GFF sequence names can be duplicated because sequence segments are not merged properly.

--check-duplications

Check for sequences with the identical identifiers

--rename-duplicates

Renames the duplicate enties. The first sequence found will keep the original name, the second will be rename with <name>-2, the third <name>-3 etc…

--remove-duplicates

Removes all but one sequence with the same name. Only the first one encountered is kept.