seqExtract

seqExtract is a program to extract sequences or subsequences from a given set. For large sequence files it can build an index of the file for faster future access.

Options

General options

The general options influence the general behaviour of seqExtract:

-h, --help

Prints a simple help message with a small description of all the available options.

-i <FILE>, --in <FILE>

The sequence file

-I, --index

An index file is used. If none exists, it will be created. If -F is not set, the extension of the provided sequence file is removed and replaced with ‘.sei’.
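
The default naming rule can be sketched in a few lines of Python. This is only an illustration of the rule as described above, not seqExtract's own code; the function name default_index_path is hypothetical, and how seqExtract treats files with several extensions (for example ‘.fasta.gz’) is not covered here.

```python
from pathlib import Path

def default_index_path(sequence_file):
    """Replace the last extension of the sequence file with '.sei'.

    Hypothetical helper: mirrors the documented naming rule only.
    """
    return Path(sequence_file).with_suffix(".sei")

print(default_index_path("data/proteins.fasta"))
```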

-F <FILE>, --indexFile <FILE>

The index file to use. Will be created if it doesn’t exist yet.

Note

If the sequence file is changed, the index file has to be deleted so that a new index file is created. This is not done automatically.

-l <FILE>, --inputList <FILE>

File containing input files

Output options

-o <FILE>, --out <FILE>

The output file

-a, --append

Appends the sequences to an existing file instead of overwriting it

-c, --remove-comments

Remove comments from output sequences

--extract-order

Keeps the order given in the extraction line

Extract options

The following options determine which sequences (or subsequences) are extracted from the whole data set.

-e <ARG>, --extract <ARG>

The (sub)sequence(s) to extract from the sequence file. Providing only the sequence name will extract the whole sequence, but you can also provide coordinates. For example, -e mySeq:1-10 will extract the first ten amino acids of the sequence named “mySeq”. You can also merge several coordinates together: -e mySeq:1-10,21-30 will create a single sequence of length 20 which contains both sequence parts. -e mySeq:1-10 mySeq:21-30, on the other hand, will create two separate sequences.
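
To make the coordinate semantics concrete, here is a minimal Python sketch of how an extraction argument could be interpreted. It assumes 1-based, inclusive coordinates (which is what the examples with lengths 10 and 20 imply) and splits the name at the first delimiter; the function names parse_extract_arg and extract are hypothetical and this is not seqExtract's actual implementation.

```python
def parse_extract_arg(arg, delim=":"):
    """Split 'name:1-10,21-30' into ('name', [(1, 10), (21, 30)]).

    An argument without the delimiter yields an empty range list,
    meaning "the whole sequence".
    """
    name, sep, coords = arg.partition(delim)
    if not sep:
        return name, []
    ranges = []
    for part in coords.split(","):
        start, end = part.split("-")
        ranges.append((int(start), int(end)))
    return name, ranges

def extract(sequence, ranges):
    """Concatenate the 1-based, inclusive ranges into one sequence."""
    if not ranges:
        return sequence
    return "".join(sequence[start - 1:end] for start, end in ranges)

seq = "ACDEFGHIKLMNPQRSTVWY" * 2  # toy sequence of 40 residues
name, ranges = parse_extract_arg("mySeq:1-10,21-30")
print(name, ranges, len(extract(seq, ranges)))
```

Two separate arguments (mySeq:1-10 and mySeq:21-30) would simply be parsed and extracted one after the other, giving two sequences of length 10 each.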

-N <FILE>, --namesFile <FILE>

File with extraction lines to use.

-d <ARG>, --delim-extract <ARG>

The delimiter to use in the extraction file (default: Tab)

-C <ARG>, --column <ARG>

The column in the extraction file to use (default: 1)
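
As an illustration of how the delimiter and column settings work together, the following Python sketch picks one column out of each line of an extraction file. It assumes that columns are counted from 1 (suggested by the default of 1) and that empty lines are skipped; both are assumptions, and read_extraction_column is a hypothetical name, not part of seqExtract.

```python
def read_extraction_column(lines, delim="\t", column=1):
    """Return the chosen column (counted from 1) of every non-empty line."""
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # assumption: blank lines are ignored
        fields = line.split(delim)
        entries.append(fields[column - 1])
    return entries

example = ["mySeq:1-10\tfirst domain\n", "otherSeq\twhole sequence\n"]
print(read_extraction_column(example))
print(read_extraction_column(example, column=2))
```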

-D <ARG>, --delim-pos <ARG>

The delimiter used to separate the name from the positions (default: ‘:’). This might need to be changed if your sequence names already contain a “:”.

-r, --remove

Removes the sequences with names provided in “-e” instead of extracting them. The whole sequence is always removed, even if subsections are provided.

-n <ARG>, --numSeqs <ARG>

The number of random sequences to extract (default: 0)

-s <ARG>, --seed <ARG>

Seed for random extract function

-L <ARG>, --length <ARG>

Length-based extraction. Allowed values are ‘<NUM’, ‘>NUM’ and ‘=NUM’. For example, -L '>3' '<7' would extract all sequences with a length of 4 to 6.
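
The way several length conditions combine can be sketched in Python as follows. The sketch assumes that all given conditions must hold at once, which is what the 4-6 example implies; the names parse_length_filter and passes are hypothetical and this is not seqExtract's own code.

```python
import operator

# Map the documented prefixes to comparison functions.
_OPS = {"<": operator.lt, ">": operator.gt, "=": operator.eq}

def parse_length_filter(expr):
    """Turn an expression such as '>3' into a predicate on a length."""
    op, num = _OPS[expr[0]], int(expr[1:])
    return lambda length: op(length, num)

def passes(length, filters):
    """A sequence is kept only if every condition is satisfied."""
    return all(f(length) for f in filters)

filters = [parse_length_filter(e) for e in (">3", "<7")]
print([n for n in range(1, 10) if passes(n, filters)])  # [4, 5, 6]
```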

-m <ARG>, --ignore-missing <ARG>

Usually a warning message is shown if a sequence cannot be found. This can be disabled using this option.

Modifying options

These options modify the sequences that are extracted.

-t, --translate

Translate into amino acids

-T <ARG>, --table <ARG>

The translation table to use (default: standard)

-R, --revComp

Calculate the reverse complement
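
For readers unfamiliar with the operation, a reverse complement can be sketched in Python as below. This is a generic illustration for plain DNA letters (A, C, G, T and N); how seqExtract treats other symbols is not specified here, and reverse_complement is a hypothetical name.

```python
# Complement each base, leaving N unchanged; upper and lower case are kept.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    """Complement every base, then reverse the resulting string."""
    return seq.translate(_COMPLEMENT)[::-1]

print(reverse_complement("ATGCCG"))  # CGGCAT
```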