seqCheck
Unfortunately, protein sequence sets sometimes contain stop characters (usually ‘*’ or ‘.’). In many programs those characters cause problems. stopCleaner can help you with this. The default settings simply remove stop characters at the end of a sequence, but other options let you take care of stop characters in the middle of sequences as well.
Warning
This program lets you quickly clean a sequence file so that issues that can cause problems are resolved. However, you should also look at what caused the issues to appear in the first place.
If you decide to correct the problems, the following order is applied:
end stops
all other stops
strange symbols
duplicates
Options
General options
The general options influence the general behaviour of stopCleaner:
- -h, --help
Prints a simple help message with a small description of all the available options.
- -i <FILE>, --in <FILE>
The sequence file to filter. The file should be in FASTA format.
- -o <FILE>, --out <FILE>
The output file. If none is provided, the sequences are printed to the console.
- -r <FILE>, --report <FILE>
The file to store the report in. If none is provided, the report is printed to the console.
- -A, --check-all
Enables all available checks.
- --fix-and-keep
Fixes all detected problems and keeps all sequences (enables all checks).
- --fix-and-remove
Fixes all detected problems by removing every sequence in which a problem was detected. The only exception is end stops, which are still only trimmed from the sequence.
Stop check options
Stop characters can appear for normal reasons (e.g. the translated last codon) or for problematic reasons (e.g. a wrong gene model). Whatever the reason, stop characters in protein sequences (e.g. ‘.’ or ‘*’) often cause problems for other programs.
- --check-stops
Checks for stop characters.
- --fix-ending
Removes all stops at the end of a sequence. This step is done before the other stop corrections!
- --remove-stop-genes
Removes all genes with stops.
- --stop-char
The stop chars to use. (default: .*)
- --replace-stop
Replaces stops inside a sequence with the ambiguous character (‘N’ for DNA/RNA sequences, ‘X’ for proteins).
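The order of the stop corrections matters: a sequence that only carries a terminal stop should not be treated as a problematic gene. The following Python snippet is a minimal sketch of that logic, not the tool's actual implementation. The function names and the example records are hypothetical; only the default stop characters (`.*`) and the ‘X’ replacement for proteins come from the option descriptions above.

```python
def strip_end_stops(seq: str, stop_chars: str = ".*") -> str:
    """Remove every stop character from the end of the sequence."""
    return seq.rstrip(stop_chars)


def has_stop(seq: str, stop_chars: str = ".*") -> bool:
    """Return True if any stop character is left in the sequence."""
    return any(c in stop_chars for c in seq)


def replace_internal_stops(seq: str, stop_chars: str = ".*", ambiguous: str = "X") -> str:
    """Replace each remaining stop character with the ambiguity symbol."""
    return "".join(ambiguous if c in stop_chars else c for c in seq)


# Hypothetical example records: name -> protein sequence
records = {"gene1": "MKTLLV*", "gene2": "MK*TLLV*"}

# Step 1: end stops are trimmed first.
trimmed = {name: strip_end_stops(seq) for name, seq in records.items()}

# Step 2a: either drop sequences that still contain a stop ...
kept = {name: seq for name, seq in trimmed.items() if not has_stop(seq)}

# Step 2b: ... or replace the remaining stops with the ambiguity symbol.
replaced = {name: replace_internal_stops(seq) for name, seq in trimmed.items()}

print(kept)
print(replaced)
```

Because the end stops are trimmed first, gene1 survives the removal step, while gene2 is dropped (or repaired in the replacement variant) because of its internal stop.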
Alphabet check options
Sometimes sequences contain rare amino acids (e.g. U, selenocysteine). These symbols cause problems in some programs. This section contains options to either remove sequences containing such a symbol or replace it with a less problematic one. The options here ignore stop symbols.
- --check-alphabet
Checks whether the sequences use only the usual alphabet.
- --set-alphabet
The alphabet to use (protein, DNA or RNA).
- --replace-char
Replaces unusual characters with the ambiguous one (‘N’ for DNA/RNA sequences, ‘X’ for proteins).
- --remove-alpha
Removes sequences with problematic characters.
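The alphabet check can be pictured with the following Python sketch. It is an illustration under assumptions, not the tool's code: the exact symbol sets the program accepts are not documented here, so the 20 standard amino acids and the four-letter DNA/RNA alphabets below are assumed, and all names are hypothetical. As described above, stop symbols are ignored by this check.

```python
# Assumed alphabets; the sets used by the real program may differ.
ALPHABETS = {
    "protein": set("ACDEFGHIKLMNPQRSTVWY"),
    "DNA": set("ACGT"),
    "RNA": set("ACGU"),
}
AMBIGUOUS = {"protein": "X", "DNA": "N", "RNA": "N"}


def allowed_symbols(alphabet: str, stop_chars: str = ".*") -> set:
    """Symbols that are not reported: alphabet, ambiguity symbol, stop symbols."""
    return ALPHABETS[alphabet] | {AMBIGUOUS[alphabet]} | set(stop_chars)


def unusual_symbols(seq: str, alphabet: str = "protein") -> set:
    """Return the set of symbols outside the chosen alphabet."""
    allowed = allowed_symbols(alphabet)
    return {c for c in seq.upper() if c not in allowed}


def replace_unusual(seq: str, alphabet: str = "protein") -> str:
    """Replace every unusual symbol with the ambiguity symbol."""
    allowed = allowed_symbols(alphabet)
    return "".join(c if c.upper() in allowed else AMBIGUOUS[alphabet] for c in seq)


seq = "MKUT*"  # contains U (selenocysteine) and a stop symbol
print(unusual_symbols(seq))
print(replace_unusual(seq))
```

A removal variant would simply drop every sequence for which `unusual_symbols` returns a non-empty set.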
Duplicate check options
Sequence names should be unique within a file. However, errors in a GFF file can sometimes lead to duplicated sequence names because sequence segments are not merged properly.
- --check-duplications
Checks for sequences with identical identifiers.
- --rename-duplicates
Renames the duplicate entries. The first sequence found keeps its original name, the second is renamed to <name>-2, the third to <name>-3, etc.
- --remove-duplicates
Removes all but one sequence with the same name. Only the first one encountered is kept.
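The two duplicate strategies can be sketched in Python as follows. This is only an illustration of the naming scheme described above (`<name>-2`, `<name>-3`, ...) and of keeping the first occurrence; the function names and example data are hypothetical, and the sketch does not handle the case where a generated name such as `a-2` already exists in the file.

```python
from collections import Counter


def rename_duplicates(names):
    """Keep the first occurrence of a name; suffix later ones with -2, -3, ..."""
    seen = Counter()
    renamed = []
    for name in names:
        seen[name] += 1
        renamed.append(name if seen[name] == 1 else f"{name}-{seen[name]}")
    return renamed


def remove_duplicates(records):
    """Keep only the first (name, sequence) pair encountered for each name."""
    seen = set()
    kept = []
    for name, seq in records:
        if name not in seen:
            seen.add(name)
            kept.append((name, seq))
    return kept


print(rename_duplicates(["a", "b", "a", "a"]))
print(remove_duplicates([("a", "MKT"), ("b", "MLV"), ("a", "MKK")]))
```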