extractDomains
The purpose of this program is to extract the protein sequences of domains. The input is a domain annotation and the sequences from which to extract the domains.
Options
General options
The general options influence the general behaviour of extractDomains:
- -h, --help
Prints a simple help message with a small description of all the available options.
- -d <FILE>, --domains <FILE>
The domain annotation file(s) to be used. These files are created, for example, by pfam_scan.pl. See domainAnnotation for more information on how to annotate your FASTA file with domains.
- -s <FILE>, --sequences <FILE>
The sequence file to use. Should be the same that has been used to create the domain annotation file.
- -o <FILE>, --out <FILE>
The output file. In this file the extracted domain sequences will be saved in FASTA format.
- -n <ARG>, --name <ARG>
The pattern to be used for the name of each extracted sequence. The following placeholders can be used:
%s The sequence name
%b The start position of the domain in the sequence
%e The end position of the domain in the sequence
%d The domain ID
The default value of this parameter is: %s_%b-%e %d
- --id <ARG>
The Pfam accession number of the protein (sub-)sequences to extract. If it is not provided, all domain sequences described in the domain file will be extracted.
- --min-length <ARG>
The minimum length of the sequences that will be extracted. Shorter domain sequences will simply be ignored.
- -e, --use-evelope
If available (for example, when using Pfam annotations), the envelope coordinates will be used instead of the alignment coordinates.
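The following Python sketch illustrates how the options above can be read together. It is not the implementation of extractDomains, only a minimal model under stated assumptions: coordinates are taken to be 1-based and inclusive, the accession is compared by exact string match, and the `Domain` class, the `format_name` and `extract` functions, and the accession PF99999 are hypothetical names introduced for this illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Domain:
    """One annotated domain; coordinates are assumed 1-based and inclusive."""
    seq_name: str
    accession: str
    start: int
    end: int
    env_start: Optional[int] = None  # envelope coordinates, if annotated
    env_end: Optional[int] = None


def format_name(pattern, seq_name, start, end, domain_id):
    """Expand the %s, %b, %e and %d placeholders of a --name style pattern."""
    values = {"s": seq_name, "b": str(start), "e": str(end), "d": domain_id}
    out = []
    i = 0
    while i < len(pattern):
        if pattern[i] == "%" and i + 1 < len(pattern) and pattern[i + 1] in values:
            out.append(values[pattern[i + 1]])
            i += 2
        else:
            out.append(pattern[i])
            i += 1
    return "".join(out)


def extract(domains, sequences, pattern="%s_%b-%e %d", wanted_id=None,
            min_length=0, use_envelope=False):
    """Yield (name, subsequence) pairs for the domains that pass the filters."""
    for dom in domains:
        # Model of --id: keep only the requested accession (exact match assumed).
        if wanted_id is not None and dom.accession != wanted_id:
            continue
        start, end = dom.start, dom.end
        # Model of -e: prefer envelope coordinates when they are available.
        if use_envelope and dom.env_start is not None and dom.env_end is not None:
            start, end = dom.env_start, dom.env_end
        # Model of --min-length: skip domains that are too short.
        if end - start + 1 < min_length:
            continue
        name = format_name(pattern, dom.seq_name, start, end, dom.accession)
        # Convert 1-based inclusive coordinates to a Python slice.
        yield name, sequences[dom.seq_name][start - 1:end]


sequences = {"hbb": "MVHLTPEEKSAVTALWGKVN"}
domains = [
    Domain("hbb", "PF00042", 3, 12, env_start=1, env_end=15),
    Domain("hbb", "PF99999", 5, 7),  # hypothetical accession
]
for name, seq in extract(domains, sequences, wanted_id="PF00042"):
    print(">" + name)
    print(seq)
```

With the default pattern, the single selected domain would be written under the header `hbb_3-12 PF00042`.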
Simple Usage
extractDomains -s seqs.fasta -d seqs.pfam -o domains.fa --id PF00042
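When the call is scripted, the argument list can be assembled programmatically. The sketch below only uses the options documented above (with -s for the sequence file); `build_command` is a hypothetical helper written for this example, and nothing is executed unless the list is passed to `subprocess.run`.

```python
import shlex


def build_command(sequences, domains, out, pfam_id=None, min_length=None,
                  name_pattern=None, use_envelope=False):
    """Assemble an extractDomains argument list from the documented options."""
    cmd = ["extractDomains", "-s", sequences, "-d", domains, "-o", out]
    if pfam_id is not None:
        cmd += ["--id", pfam_id]
    if min_length is not None:
        cmd += ["--min-length", str(min_length)]
    if name_pattern is not None:
        cmd += ["-n", name_pattern]
    if use_envelope:
        cmd.append("-e")
    return cmd


cmd = build_command("seqs.fasta", "seqs.pfam", "domains.fa",
                    pfam_id="PF00042", min_length=30)
# Print a shell-quoted version of the command for inspection.
print(shlex.join(cmd))
# To run it (requires extractDomains on the PATH):
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids quoting problems with name patterns that contain spaces, such as the default pattern.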
Example Use Case: Building a more sensitive HMM
While HMMs are already quite sensitive, it is sometimes useful to create an HMM that better fits the set of species you want to analyse. The general workflow in this case would be:
- Annotate the proteome(s) of interest and closely related ones with a normal domain annotation program
- Extract the domain sequences using extractDomains
- Combine these extracted sequences with the original Pfam seed sequences into a single sequence set
- Create a new multiple sequence alignment based on the set created in the previous step
- Build a new HMM from this alignment (for example with hmmbuild from HMMER)
You can now rescan your proteomes using the new HMM.
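The combination step of the workflow above can be sketched in Python. This is a minimal illustration, not part of extractDomains: it assumes both inputs are available as FASTA, removes the gap characters `-` and `.` in case the seed sequences come from an alignment, and keeps only the first record for each header. The functions `read_fasta`, `combine` and `to_fasta` are hypothetical helpers; the alignment and HMM building are left to external tools.

```python
def read_fasta(lines):
    """Parse FASTA lines into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def combine(*record_sets):
    """Merge record sets, dropping gap characters and duplicate headers."""
    seen = set()
    merged = []
    for records in record_sets:
        for header, seq in records:
            if header in seen:
                continue
            seen.add(header)
            # Seed sequences may be aligned; remove gaps before realigning.
            merged.append((header, seq.replace("-", "").replace(".", "")))
    return merged


def to_fasta(records, width=60):
    """Format records as FASTA text with a fixed line width."""
    lines = []
    for header, seq in records:
        lines.append(">" + header)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


extracted = read_fasta([">a_1-6 PF00042", "MVH", "LTP"])
seed = read_fasta([">seed1", "MV-H.L", ">a_1-6 PF00042", "XXX"])
print(to_fasta(combine(extracted, seed)), end="")
```

For real files, `read_fasta(open(path))` can be used in place of the inline lists, and the result of `to_fasta` can be written to the file that is passed to the alignment program.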