POA installation notes
Sept. 2001
Chris Lee
Dept. of Chemistry & Biochemistry
UCLA


I. COMPILATION

To compile this program, simply type 'make poa'. This produces an
executable for sequence alignment (poa) and also a linkable library
liblpo.a. The software has been compiled and tested on LINUX.


II. RUNNING POA

Poa has a variety of command line options. Running poa without any
arguments will print a list of the possible command line arguments.
POA may be used to construct a PO-MSA, or to analyze a PO-MSA.

Usage: poa [OPTIONS] matrixfile

  -read_fasta FILENAME    Reads in FASTA sequence file.
  -toupper                Switches all letters in FASTA sequences to upper case.
  -tolower                Switches all letters in FASTA sequences to lower case.
  -read_po FILENAME       Reads in PO file.
  -fuse_all               Fuses identical letters on align rings.
  -subset FILENAME        Filters PO-MSA to include list of seqs in file.
  -remove FILENAME        Filters PO-MSA to exclude list of seqs in file.
  -hb                     Performs heaviest bundling to generate consensi.
  -hbmin VALUE            Bundles into heaviest bundle seqs with percent id >= value.
  -best                   Restricts MSA output to heaviest bundles.
  -pir FILENAME           Writes out MSA in PIR format.
  -clustal FILENAME       Writes out MSA in CLUSTAL format.
  -po FILENAME            Writes out MSA in PO format.
  -printmatrix LETTERSET  Prints score matrix to stdout.
  -v                      Runs in verbose mode.

  Note: Either the -read_fasta or -read_po argument must be used with
  poa, since a FASTA file or PO file must be read in by poa.

  Thank you for using our software. Please cite our paper:

  Lee, C., Grasso, C., Sharlow, M. (2001) Multiple sequence alignment
  using partial order graphs. Bioinformatics 18: in press.


A. Constructing a PO-MSA
-------------------------

1. Required Input:

i. An Alignment Score Matrix File:

A score matrix file is required, because poa uses it to get the
residue alphabet and indexing. Even if poa is not being used to
perform multiple sequence alignment, this file must be provided.
Any basic alignment matrix which may be used with BLAST may be used
here. This file must be the first command line argument without a
flag in order to be interpreted by poa as the score matrix file. An
example score matrix file, blosum80.mat, is provided in this
directory.

Note: Poa is Case Sensitive.
Poa treats the scoring matrix as case sensitive. For instance, the
entries in the scoring matrix column labelled 'A' are the scores for
matching residue 'A' to another residue, while the entries in the
scoring matrix column labelled 'a' are the scores for matching
residue 'a' to another residue. In this way poa can handle mixed
scoring matrices containing scores for matching amino acid residues
to nucleotide residues. For example, in the blosum80.mat scoring
matrix file amino acids are represented by upper case letters while
nucleotides are represented by lower case letters. The '-toupper' and
'-tolower' flags, described below, may be used with poa to switch the
case of all of the letters in the FASTA file input.

ii. A FASTA File:

A FASTA file is required only if poa is being used to construct a new
PO-MSA from a list of sequences, or to align a list of sequences to
an already existing PO-MSA (see Analyzing a PO-MSA below). This FASTA
file should contain the sequences to be aligned by poa. The command
line argument to get poa to accept a FASTA file as input is
'-read_fasta FILENAME'. Poa will interpret FILENAME as the FASTA
sequence file. An example file, multidom.seq, is provided in this
directory.

Poa is case sensitive (see note above). To force POA to switch the
letters in a FASTA file to upper case, use the '-toupper' command
line argument. To force POA to switch the letters in a FASTA file to
lower case, use the '-tolower' command line argument.

2. MSA Construction Options:

i. Aggressive Fusion:

During the building up of a PO-MSA, if a node i with label 'a' is
aligned to an align ring which already contains a node j with label
'a', poa by default simply adds node i to the align ring. It is
possible to force poa to do aggressive fusion, so that when a node i
with label 'a' is aligned to an align ring which already contains a
node j with label 'a', node i is fused to node j. The command line
argument for accomplishing this is '-fuse_all'.

3. MSA Output Formats:

Poa can output a PO-MSA in several formats simultaneously, including
CLUSTAL, PIR, and PO. The PO format is the most complete, since it
contains all of the information in the PO-MSA. The other formats
accurately represent the MSA, but since they are RC-MSA formats, they
may lose some of the information in the full PO-MSA.

i. CLUSTAL format:

This format is the standard CLUSTAL format. The command line argument
to get the MSA output in this format is '-clustal FILENAME'.

ii. PIR format:

This format is the standard PIR format, which is like FASTA with a
'.' character representing gaps. The command line argument to get the
MSA output in this format is '-pir FILENAME'.

iii. PO format:

This format is the standard PO format. It is described below in the
section PO FILE FORMAT. The command line argument to get the MSA
output in this format is '-po FILENAME'.

Example: Constructing an MSA of Four Protein Sequences

Running poa with the following statement will take the FASTA
formatted sequences in the multidom.seq file, construct a PO-MSA
using the scoring matrix in the file blosum80.mat, and then output
the PO-MSA in CLUSTAL format in the file multidom.aln.

  poa -read_fasta multidom.seq -clustal multidom.aln blosum80.mat

The output should be identical to the results of figs. 6 & 7 in the
paper.

4. Other Output:

i. Score Matrix

Poa will also print to stdout the score matrix stored in the '.mat'
file.
The command line argument to get poa to do this is
'-printmatrix LETTERSET', where LETTERSET is a string of letters to
be printed with the score matrix. For example, if the score matrix is
designed for protein alignment the letter set might be
'ARNDCQEGHILKMFPSTWYV'.

ii. Verbose Mode

Poa can run in verbose mode, printing additional information
generated during the run to stdout. The command line argument to get
poa to run in verbose mode is '-v'.


B. Analyzing a PO-MSA
-----------------------

Poa can also take a PO format file as input and rebuild the PO-MSA
data structure. Once this data structure has been rebuilt, it may be
analyzed for features. In 'liblpo.a', the linkable poa library, we
have included the functions necessary to do heaviest bundling and
thereby find consensus sequences in the PO-MSA (the details of the
heaviest bundling algorithm are described elsewhere). Poa has been
written so that users may create their own functions for analyzing a
PO-MSA. We have not included in the 'liblpo.a' library the functions
that we wrote to analyze PO-MSAs constructed with ESTs and genome
sequence to find SNPs and alternative splice sites. However, it is
possible to design modular library functions that will look for
highly specific biological features in any PO-MSA data structure.

1. Input:

POA requires a scoring matrix file as input. It does not have a
default scoring matrix. Before the PO-MSA data structure can be
analyzed it must be built. It can be built from a PO file, from a
FASTA file, or from both a PO file and a FASTA file.

Note: POA Requires Either A PO File or a FASTA File!
If neither file is read in, POA will terminate early, since it has
not received any data.

i. A PO file:

Poa will read in a PO formatted file. The command line argument to
get poa to read in a PO formatted file and rebuild the PO-MSA data
structure is '-read_po FILENAME'.

It is possible to filter the PO-MSA data structure as it is being
rebuilt.
In order to filter the PO-MSA in the PO file to include only a subset
of sequences, use the command line argument '-subset FILENAME', where
the file named FILENAME contains the list of sequence names to be
included in the new PO-MSA. In order to filter the PO-MSA in the PO
file to exclude a subset of sequences, use the command line argument
'-remove FILENAME', where the file named FILENAME contains the list
of sequences to be excluded from the new PO-MSA. The names of
sequences to be included or excluded should be in the format
'SOURCENAME= *', as they are in the PO file. Lists of sequence source
names can be created by using the unix grep utility on the PO file.

ii. A FASTA File:

The FASTA file should contain the sequences to be aligned by poa. The
command line argument to get poa to accept a FASTA file as input is
'-read_fasta FILENAME'. Poa will interpret FILENAME as the FASTA
sequence file. An example file, multidom.seq, is provided in this
directory. (See note above on case sensitivity.)

Note: POA Can Take Both A PO File And A FASTA File As Input!
If both the '-read_po FILENAME' argument and the
'-read_fasta FILENAME' argument are given to poa on the command line,
then poa will first rebuild the PO-MSA in the PO file, and then it
will align the sequences in the FASTA file to this PO-MSA.

2. Additional PO Utilities:

i. Consensus Generation Via Heaviest Bundling Algorithm:

The heaviest bundling algorithm finds consensus sequences in the
PO-MSA. The command line argument for heaviest bundling is '-hb'.
This function adds the new consensus sequences to the PO-MSA by
storing new consensus sequence indices on the PO-MSA nodes
corresponding to the consensus sequence paths. The sequence source
name for each consensus sequence generated by heaviest bundling is
CONSENS'i', where 'i' is the index of the bundle corresponding to the
consensus sequence.

The heaviest bundling algorithm can also take as input a bundling
threshold value.
The command line argument for setting a bundling threshold value for
heaviest bundling is '-hbmin VALUE'. This threshold is used during
the process of associating sequences with bundles. If a sequence has
a percentage of nodes shared with bundle 'i' greater than this
threshold value, it is associated with bundle 'i'. Iterative heaviest
bundling can also be affected by the bundling threshold. A detailed
description of heaviest bundling and heaviest bundling thresholds is
given elsewhere.

The consensus sequences corresponding to bundles generated by
heaviest bundling are listed in the sequence source list.
Additionally, in the SOURCEINFO line for each sequence the index of
the bundle to which that sequence belongs is given. Finally, using
the command line argument '-best' restricts the MSA output to the
consensus sequences generated by heaviest bundling.


III. PO FILE FORMAT

****************************HEADER****************************************

VERSION= ~Current version of poa~
NAME= ~Name of PO-MSA. Defaults to name of 1st sequence in PO-MSA~
TITLE= ~Title of PO-MSA. Defaults to title of 1st sequence in PO-MSA~
LENGTH= ~Number of nodes in PO-MSA~
SOURCECOUNT= ~Number of sequences in PO-MSA~

*********************SEQUENCE SOURCE LIST*********************************

/* For each sequence in the PO-MSA: */

SOURCENAME= ~Name of sequence taken from FASTA sequence header~
SOURCEINFO= ~Number of nodes in sequence~
            ~Index of first node containing sequence~
            ~Sequence weight~
            ~Index of bundle containing sequence~
            ~Title of sequence taken from FASTA sequence header~

/* Example: */

SOURCENAME=GRB2_HUMAN
SOURCEINFO=217 10 0 3 GROWTH FACTOR RECEPTOR-BOUND PROTEIN 2 (GRB2 ADAPTOR PROTEIN)(SH2)

********************PO-MSA DATA STRUCTURE*********************************

/* For each node in the PO-MSA: */

~Residue label~:~'L' delimited index list of other nodes with edges into node~
                ~'S' delimited index list of sequences stored in each node~
                ~'A' index of next node in same align ring~

NB: align ring indices must form a cycle.
e.g. if two nodes 121 and 122 are aligned, then the line for node 121
indicates "A122", and the line for node 122 indicates "A121".

/* Example: */

F:L156L155L22S2S3S7A158

********************END***************************************************

For more information, see http://www.bioinformatics.ucla.edu/poa.
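
As a rough illustration of the SOURCEINFO and node-line layouts
described in section III, the following Python sketch parses one line
of each and follows an align ring. It is not part of the POA
distribution. The names PONode, parse_node_line, parse_sourceinfo and
align_ring are hypothetical, and the sketch assumes that the 'L', 'S'
and 'A' entries always appear in that order and that the 'A' entry is
absent for a node that is not aligned to any other node.

```python
import re
from typing import List, NamedTuple, Optional


class PONode(NamedTuple):
    residue: str               # residue label before the ':'
    in_edges: List[int]        # 'L' entries: nodes with edges into this node
    sequences: List[int]       # 'S' entries: sequences stored in this node
    align_next: Optional[int]  # 'A' entry: next node in the same align ring


# Assumed layout: label, ':', zero or more L<n>, zero or more S<n>, optional A<n>.
_NODE_RE = re.compile(r"^(.):((?:L\d+)*)((?:S\d+)*)(?:A(\d+))?$")


def parse_node_line(line: str) -> PONode:
    """Parse one node line such as 'F:L156L155L22S2S3S7A158'."""
    m = _NODE_RE.match(line.strip())
    if m is None:
        raise ValueError("not a PO node line: %r" % line)
    residue, l_part, s_part, a_part = m.groups()
    in_edges = [int(x) for x in re.findall(r"L(\d+)", l_part)]
    sequences = [int(x) for x in re.findall(r"S(\d+)", s_part)]
    align_next = int(a_part) if a_part is not None else None
    return PONode(residue, in_edges, sequences, align_next)


def parse_sourceinfo(line: str) -> dict:
    """Split a SOURCEINFO line into its five documented fields."""
    value = line.split("=", 1)[1]
    parts = value.split(None, 4)  # the title may itself contain spaces
    return {
        "length": int(parts[0]),
        "first_node": int(parts[1]),
        "weight": int(parts[2]),
        "bundle": int(parts[3]),
        "title": parts[4].strip() if len(parts) > 4 else "",
    }


def align_ring(nodes: dict, start: int) -> List[int]:
    """Follow 'A' links from start and return the ring members in order.

    Raises ValueError if the links do not cycle back to start.
    """
    ring = [start]
    cur = nodes[start].align_next
    while cur is not None and cur != start:
        if cur in ring:
            raise ValueError("align links do not cycle back to %d" % start)
        ring.append(cur)
        cur = nodes[cur].align_next
    if cur is None and len(ring) > 1:
        raise ValueError("align ring starting at %d is broken" % start)
    return ring
```

Calling parse_node_line on the example line above gives a node with
residue 'F', incoming edges from nodes 156, 155 and 22, sequences 2, 3
and 7, and an align link to node 158. The align_ring helper checks the
cycle requirement stated in the NB note by walking the 'A' links until
they return to the starting node.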