Subsections


2 Module Documentation

The following subsections provide details about how to use specific modules of Pygr functionality.


2.1 Installation

Installing pygr is quite simple:
tar -xzvf pygr-0.3.tar.gz 
cd pygr
python setup.py install

Once the test framework has completed successfully, the setup script will install pygr into python's respective site-packages directory.

If you don't want to install pygr into your system-wide site-packages, replace the "python setup.py install" command with

python setup.py build
This will build pygr but not install it in site-packages.

Pygr contains several modules imported as follows:

from pygr import seqdb # IMPORT SEQUENCE DATABASE MODULE

If you did not install pygr in your system-wide site-packages, you must set your PYTHONPATH to the location of your pygr build. For example, if your top-level pygr source directory is PYGRDIR then you'd type something like:

setenv PYTHONPATH PYGRDIR/build/lib.linux-i686-2.3
where the last directory name depends on your specific architecture.


2.2 graphs and graph query module

The basic idea of Pygr is that all Python data can be viewed as a graph whose nodes are objects and whose edges are object relations (in Python, references from one object to another). This has a number of advantages.

1. All data in a Python program become a database that can be queried through simple but general graph query tools. In many cases the need to write new code for some task can be replaced by a database query.

2. Graph databases are more general and flexible in terms of what they can represent and query than relational databases, which is very important for complex bioinformatics data.

3. Indeed, in Pygr, a query is itself just a graph that can be stored and queried in a database, opening paths to automated query construction.

4. Pygr graphs are fully indexed, making queries about edge relationships (which are often unacceptably slow in relational databases) fast.

5. The interface can be very simple and pythonic: it's just a Mapping. In Python "everything is a dictionary", also known as "the Mapping protocol": a dictionary maps some set of inputs to some set of outputs. e.g. m[a]=b maps a onto b, as a unique relation. In Pygr, if we want to be able to map a node to multiple target nodes (i.e. allow it to have multiple edges), we simply add another layer of mapping: m[a][b]=edgeInfo (where edgeInfo is optional edge info.)

Examples of the Pygr syntax:

graph += node1 # ADD node1 TO graph
graph[node1] += node2 # ADD AN EDGE FROM node1 TO node2
graph[node1][node2]=edge_info # ADD AN EDGE WITH ASSOCIATED edge_info
# ADD SCHEMA BINDING WITH graph[node] BOUND AS node.attr
setschema(node,attr,graph) 
# SEARCH graph FOR SUBGRAPH {1->2; 1->3; 2->3}, 
# I.E. EXONSKIP, WHERE THE SPLICE FROM 2 -> 3 HAS ATTRIBUTE type 'U11/U12' 
for m in GraphQuery(graph,{1:{2:None,3:None},\
                   2:{3:dict(filter=lambda edge,**kwargs:edge.type=='U11/U12')},\
                   3:{}}):
    print m[1].id,m[2].id,m[1,2].id

Let's examine these examples one by one:

2.2.1 dictGraph

dictGraph is Pygr's main graph class. It provides all the standard behaviors described above. The current reference implementation uses standard Python dict objects to store the graph. All the usual Mapping protocol methods can be used on dictGraph objects (top-level interface, in the examples above graph) and dictEdge objects (second-level interface; in the examples above graph[node]). e.g.

2.2.2 Directionality and Reverse Traversal

Note that dictGraph stores directed edges, that is, a->b does not imply b->a; those are two distinct edges that would have to be added separately if you want an edge going both directions. Moreover, the current implementation of dictGraph does not provide a mechanism for traveling an edge backwards. To do so with algorithmic efficiency requires storing each edge twice: once in a forward index and once in a reverse index. Since that doubles the memory requirements for storing a graph, the default dictGraph class does not do this. If you want such a "forward-backwards" graph, use the dictGraphFB subclass that stores both forwad and reverse indexes, and supports the inverse operator ($\sim$). $\sim$ graph gets the reverse mapping, e.g. ($\sim$ graph)[node2] corresponds to the set of nodes that have edges to node2. This area of the code hasn't been tested much yet.

2.2.3 Schema: binding object attributes to graphs

The goal of Pygr is to provide a single consistent model for working with data explicitly modeled as graphs (i.e. dictGraph-like objects) and standard Python objects that were not originally designed to be queried (or thought of) as a "graph". Since Python uses the Mapping concept throughout the language and object model, and provides introspection, there is no reason why Pygr can't work with both kinds of data transparently. One mechanism for making this idea explicit is the idea of binding an object attribute to a graph, via the new method we've called setschema(obj,attr,graph). The idea here is that once you bind an object attribute to a graph, the two different data models obj.attr (object model) or graph[obj] (graph model) are made equivalent and interchangeable. Operating on one affects the other and vice versa; they are two ways of referring to the same relation. This concept can be applied at several different levels

class ExonForm(object): # ADD ATTRIBUTES STORING SCHEMA INFO
    __class_schema__=SchemaDict(((spliceGraph,'next'),(alt5Graph,'alt5'),(alt3Graph,'alt3')))

In this class we bound the next attribute to spliceGraph, alt5 attribute to alt5Graph, and alt3 attribute to alt3Graph. That means, every instance obj of this class will have an attribute obj.next that is equivalent to spliceGraph[obj], etc. Note that this is schema, not the actual operation of adding the object as a node to the graph. Indeed, when obj is first created, it is not automatically added to spliceGraph; that is up to the user. Unless your code has added the node to the graph (e.g. spliceGraph+=obj), obj.next should throw a KeyError exception.

The general method getschema(obj,attr) works regardless of whether the schema was stored on an individual object or at the class level.

2.2.4 GraphQuery

The GraphQuery class implements simple node-to-node matching, in which each new node-set is generated by an iterator associated with a specific node in the query graph. This iterator model is general: since indexes (mappings) support the iterator protocol, a given iterator may actually be an index lookup (or other clever search algorithm). The GraphQuery constructor takes two arguments: the default data graph being queried, and the query graph. The query graph is just a graph; its nodes can be any object that can be a graph node (i.e. any object that is indexible, e.g. by adding a __hash__() method). Its node objects will not be modified in any way by the GraphQuery. Its edges are expected to be dictionaries that can be checked for specific keyword arguments:

The attr - subqueries options are all implemented as extremely simple subclasses of GraphQuery. If you want to see just how easy it is to write new subclasses of GraphQuery functionality, look at the graphquery.py module (the entire graph query module is only 237 lines long).

Note: an easy way to pass keyword dictionaries (e.g. as edge information) is simply using the dict() constructor, e.g. dict(dataGraph=myGraph,filter=my_filter). I think this is a little more readable than 'dataGraph':myGraph, 'filter':my_filter.

Note on current behavior: currently, the GraphQuery iterator returns the same mapping object for each iteration (simply changing its contents). So to save these multiple values safely in a list comprehension we have to copy each one into a new dict object via dict(m).

2.2.5 What is GraphQuery actually doing?

A GraphQuery is basically an iterator that returns all possible mappings of the query graph onto the datagraph that match all of the nodes and edges of the query graph onto nodes and edges of the data graph. As an iterator, it does not instantiate a list of the matches, but simply returns the matches one by one. The current design is very simple. The GraphQuery constructor builds an "iterator stack" of GraphQueryIterators, each representing one node in the query graph; they are enumerated in order by a breadth-first-search of the query graph. The GraphQuery iterator processes the stack of GraphQueryIterators: any match simply pushes the stack to the next level; any match at the deepest level of the stack is a complete match (yield the queryMatch mapping); the end of any GraphQueryIterator simply pops the stack. One obvious idea for improving all this is to replace this "interpreter" with a "compiler" that compiles Python for loops that are equivalent to this stack, and run that... likely to be many fold faster.


2.3 sequence Module

Base classes for representing sequences and sequence intervals.

2.3.1 Overview

Pygr provides one base class representing both sequences and sequence intervals (SeqPath), from which all sequence classes are derived (Sequence, SQLSequence, BlastSequence etc.). In this section we document both the features of the base class, and ways to extend or customize it by creating your own subclasses derived from SeqPath. The IntervalTransform class represents a coordinate system mapping from one interval of a sequence, onto another interval of the same or a different sequence.

2.3.2 SeqPath

This class provides the basic capabilities of a sliceable sequence or sequence interval, widely used in Pygr. It tries to provide core operations on sequences in a highly Pythonic way:

2.3.3 Sequence class

sequence.Sequence( s, id)
The Sequence class provides a SeqPath flavor that stores a sequence string s and identifier id for this sequence.

from pygr import sequence
seq=sequence.Sequence('GPTPCDLMETQ','FOOG_HUMAN')

update( s)
You can change the actual string sequence to a new string s using the update method:

seq.update('TKRRPLEDKMNEPS')

seqtype( )
returns DNA_SEQTYPE for DNA sequences, RNA_SEQTYPE for RNA, and PROTEIN_SEQTYPE for protein.

reverse_complement( s)
returns the reverse complement of the sequence string s.

2.3.4 Coordinate System

SeqPath follows Python slicing conventions (i.e. 0-based indexing, positive indexes count forward from start, negative indexes count backwards from the sequence end, and always s.start<s.stop).

Each SeqPath object has a number of attributes giving information about its ``location'':

2.3.5 Extending and Customizing

There are several methods and attributes you can override to extend or customize the behavior of your own SeqPath-derived classes. Typically you will derive either from the Sequence class, or in some cases from the SeqPath class.

strslice( start, stop)
called to get the string sequence of the interval (start, stop). You can provide your own strslice() method to customize how sequence is stored and accessed. For example, SQLSequence.strslice() gets the sequence via a SQL query, and BlastSequence.strslice() obtains it using the fastacmd -L start,stop UNIX shell command from the NCBI toolkit.

__len__( )
called to compute the length of the sequence. You can customize this to provide an efficient length method for your particular sequence storage. e.g. SQLSequence obtains it via a SQL query; BlastSequence obtains it from a precomputed length index. The default Sequence.__len__() method computes it from len(self.seq), assuming that the sequence can be accessed from the seq attribute.

__getitem__( slice_obj)
if you want to monitor or intercept slicing requests on your sequence, you can do so by providing your own getitem method. See seqdb.BlastSequenceCache class for an example.

__getattr__( attr)
if you subclass a SeqPath-derived class and supply a __getattr__ method for your subclass, it must call the parent class's __getattr__. This is essential for ``delayed evaluation'' of start and stop attributes, which are generated automatically by SeqPath's __getattr__. If your subclass inherits from more than one parent class, check whether both parents supply a __getattr__, in which case your subclass must supply a __getattr__ that explicitly calls both of them. Failing to do so will lead to strange bugs.

2.3.6 IntervalTransform

This class provides a mapping transform between the coordinate systems of a pair of intervals.

xform=IntervalTransform(srcPath,destPath)
d2=xform(s2) # MAPS s2 FROM srcPath coords to destPath coord system
d3=xform[s2] # CLIPS s2 TO NOT EXTEND OUTSIDE srcPath, THEN XFORMS
s3=xform.reverse(d3) # MAP BACK TO srcPath COORD SYSTEM

2.3.7 Seq2SeqEdge

This class represents a segment of alignment between two sequences. It is a temporary object created in association with a MSASlice object (see Alignment Module below).

__init__( msaSlice, targetPath, sourcePath=None)
Create a Seq2SeqEdge for the targetPath, on the specified alignment slice. If sourcePath is None, it will be calculated automatically by calling the slice's methods.

__iter__( sourceOnly=True, **kwargs)
iterate over source intervals within this segment of alignment. kwargs will be passed on to the msaSlice's groupByIntervals and groupBySequences methods.

items( **kwargs)
same as __iter__, but gets tuples of (source_interval,target_interval).

pIdentity( mode=max,trapOverflow=True)
Compute the percent identity between the source and target sequence intervals in this segment of the alignment. mode controls the method used for determining the denominator based on the lengths of the two aligned sequence intervals. trapOverflow controls whether overflow (due to multiple mappings of the query sequence to different regions of the alignment) is trapped as an error. To turn off such error trapping, set trapOverflow=False.

pAligned( mode=max,trapOverflow=True)
Compute the percent alignment between the source and target sequence intervals in this segment of the alignment, i.e. the fraction of residues that are actually aligned as opposed to gaps / insertions, in the two intervals.

Warning: if your query sequence has multiple mappings in the alignment (i.e. it is aligned to two or more different regions in the alignment), pIdentity() and pAligned() may return fractions larger than 1.0. This is because the query sequence may align to a given target sequence via more than one region in the alignment. If you encounter this problem, you can iterate through the individual mappings yourself (by calling the iter(), items() or edges() iterator methods for your alignment slice object), and calculating the percentage identity or alignment (via your own algorithm) individually for each specific mapping. For more background on this problem, see ``Multiple Mappings'', below.

Note that the presence of multiple mappings is not a Pygr bug, but simply reflects the alignment data loaded into Pygr; if the input alignment contains multiple mappings, there is nothing that Pygr can do to make this problem magically vanish.

2.3.8 SeqFilterDict

This dict-like class provides a simple way for masking a set of sequences to specific intervals. It stores a specific interval for each sequence. Subsequent look-up using a sequence interval as a key will return the intersection between that interval and the stored interval for that sequence in the dictionary. If there is no overlap, it raises KeyError.

d=SeqFilterDict(seqIntervalList)
overlap=d[ival] # RETURNS INTERSECTION OF ival AND STORED IVAL, OR KeyError

You can pass a list of intervals to store to the class constructor (as shown above). You can also add a single interval using the syntax d[saveInterval]=saveInterval. (This syntax reflects the actual mapping that the dictionary will perform if later called with the same interval).

2.3.9 LetterEdge

This class represents an edge from origin -> target node.

__iter__( )
iterate over seqpos for sequences that traverse this edge.

iteritems( )
generate origin, target seqpos for sequences that traverse this edge.

__getitem__( seq)
return origin,target seqpos for sequence seq; raise KeyError if not in this edge

2.3.10 Functions

The sequence module also provides convenience functions:

guess_seqtype( s)
based on the letter composition of the string s, returns DNA_SEQTYPE for DNA sequences, RNA_SEQTYPE for RNA, and PROTEIN_SEQTYPE for protein.

absoluteSlice( seq, start, stop)
returns the sequence interval of top-level sequence object associated with seq, interpreting start and stop according to the Pygr convention: a pair of positive values represents an interval on the forward strand; a pair of negative values represents an interval on the reverse strand (see Coordinate System, above).


2.4 Alignment Module

Pygr interface to sequence alignment, and scalable storage for multigenome alignments

2.4.1 Overview

Pygr provides a general model for interfacing with any kind of sequence alignment, and also a uniquely scalable storage system for working with huge multiple sequence alignments such as multigenome alignments. Specifically, it lets you work with an alignment both in the traditional Row-Column model (each row is a sequence, each column is a set of individual letters from different sequences, that are aligned; we will refer to this as the RC-MSA model), and also as a graph structure (known as a Partial Order Alignment, which we will refer to as the PO-MSA model). This supports ``traditional'' alignment analysis, as well as graph-algorithms, and even graph query of alignments.

This model has a few basic classes:

2.4.2 NestedList Storage

Pygr provides a highly scalable storage mechanism for working with multi-genome alignments. One fundamental challenge in working with very large alignments is the interval overlap query problem: to obtain a portion of an alignment (defined by some interval of interest) requires finding all interval elements in the ``alignment database'' that overlap the query interval. Since the intervals can be indexed by start (or end) position, one can typically find the first overlapping element in $O(\log N)$ time, where $N$ is the total number of intervals in the database. The problem is that since standard index structures cannot index both start and end, to obtain all intervals that overlap the query interval, one must scan forwards (or backwards) from that point. Furthermore, one cannot stop at the first non-overlapping interval; there might be an extremely long interval at the very beginning of the index, that extends to overlap the query interval. In this case, one would have to scan the entire database ($O(N)$ time) to guarantee that all overlapping intervals are found.

The nested list data structure solves this problem, by moving any interval in the database that is contained in another interval out of the top-level interval list, into the sublist of the parent interval. Based on this, one can prove that one can stop the scanning operation at the first non-overlapping interval (i.e. the overlapping intervals in any list form a single contiguous block). Overall, this reduces the query time to $O(\log N + n)$, where $n$ is the number of intervals in the database that actually overlap the query (i.e. results to return). Moreover, the nested list data structure can be implemented very well both in computer memory (RAM) or as indexed disk files. Pygr's disk-based cnestedlist database can complete a typical interval query of the 26GB UCSC 8 genome alignment in about 60 microseconds, compared with 10-30 seconds per query using MySQL.

2.4.3 Multiple Mappings: a Warning

Multi-genome alignments take traditional models of alignment to an entirely different scale, and inevitably many of the assumptions of standard row-column multiple sequence alignment are broken (e.g. no inversions; no cycles; etc.). One major issue that users should be aware of in UCSC multi-genome alignments is the possibility of multiple mappings, in which a given query sequence interval is mapped to two or more different regions of the alignment (and thus potentially to two or more different locations in a given target genome). Currently, UCSC multi-genome alignment are typically based on a single reference genome, to which all other genomes are aligned. While a given region of the reference genome might be guaranteed to have a unique mapping in the UCSC multi-genome alignment, other genomes do not appear to have any such guarantee: a region in any of those genome can have multiple mappings. This is problematic for several reasons:

If you encounter multiple mappings, you can always iterate over them one by one, and perform your own computations for each one. However, to avoid them altogether, you can restrict your queries to the reference genome for this specific alignment (UCSC offers different versions of each alignment set, each based on a different reference genome).

2.4.4 NLMSA

Top-level object representing an entire multiple sequence alignment, stored using a set of disk-based nested list interval databases. The alignment is stored as an interval representation of a linearized partial order (LPO), using nested list databases. This has several elements:

This functionality is encapsulated in the NLMSA class, which has a number of methods and attributes.

Construction Methods:

NLMSA( pathstem='', mode='r', seqDict=None, mafFiles=None, maxOpenFiles=1024, maxlen=None, nPad=1000000, maxint=41666666)
Constructor for the class. pathstem specifies a path and filename prefix for the NLMSA files (since multiple files are used to store one NLMSA, it will automatically add a number of suffixes automatically to open the necessary set of files for the NLMSA). mode is either ``r'' to open an existing NLMSA (from the pathstem disk files), or ``w'' to create a new one (which will be saved to the pathstem disk files). seqDict specifies a dictionary which maps sequence names to actual sequence objects representing those sequences. If seqDict is None, the constructor will look for a file pathstem.seqDict that is a prefixUnionDict header file for opening all the sequence database files for you automatically.

mafFiles can be used to specify a list of filenames storing a multiple sequence alignment in the UCSC MAF format. maxOpenFiles limits the open file descriptors the NLMSA will use. Since each sequence has a separate nested list database file, a large multi-genome alignment (with each genome containing 20 different chromosomes, say) can rapidly open a large number of file descriptors. Note: NLMSA only opens a given sequence's nested list database when one of your queries actually requires access to that sequence; it then keeps that file descriptor open to make subsequent queries to it fast. If the number of open file descriptors would exceed maxOpenFiles, it will close other open database files, which may slow down query performance (due to having to open and close databases repeatedly to process queries). maxlen specifies the maximum coordinate value for an LPO database. Its default value is 2GB, to prevent int overflow. Using a smaller value can be useful, to 1) limit the size of the LPO in memory during initial construction, and 2) to limit the size of LPO database files on disk (if for example, your file system does not support files above some maximum size). During initial construction of the NLMSA (from MAF files or user-specified interval alignments), the algorithm performs a one-pass sort of the LPO intervals. Thus, this set of intervals is briefly held in RAM for this sort. If you have insufficient RAM, the construction step may raise a MemoryError. If so, you can avoid this problem by using a smaller maxlen value. The maxint option provides another way of limiting the size of LPO databases. It specifies the maximum number of intervals to store per database. Since each interval takes 24 bytes, the default setting limits each LPO to a total size of 1 GB. Note that the current NLMSA construction algorithm requires loading each database index into memory as one-time operation during construction. If your NLMSA build fails due to running out of memory, simply reduce this value. The nPad option sets the maximum number of LPO coordinate systems (specifically, the offset for the start of real sequence IDs in the NLMSA sequence index). You are unlikely to need to change this default value.

__setitem__( i1, s1)
store an alignment interval; specifically, align the sequence interval s1 to the LPO coordinate interval i1. This provides a trivial user interface for you to build any desired alignment. If the sequence containing interval s1 is not already in the NLMSA, it will be added for you automatically (i.e. creating the necessary indexing, nested list database files, etc.). In this case, the sequence must supply a unique string identifier, which will be used on subsequent attempts to open the NLMSA database, to match the individual sequence nested-list databases against corresponding sequence objects (using seqDict, see above).

build( )
to construct the final nested list databases, after all the desired alignment intervals have been saved (using either the mafFiles constructor argument, or using setitem above). This method simply calls the build() method on all the constituent NLMSASequence objects in this alignment.

Alignment Usage Methods:

__getitem__( s1)
get the alignment slice for the sequence interval s1, i.e. get an NLMSASlice object representing the set of intervals aligned to s1. You can also use a regular Python slice object using integer indices ie. nlmsa[1:45], in which case, it gets the NLMSA slice corresponding to that region of the LPO coordinate system.

2.4.5 NLMSASlice

A temporary object created on-the-fly to represent (an interface to provide information about) the portion of the alignment associated with a specific sequence interval. Specifically, it acts like a dictionary whose keys are sequence intervals that are aligned to this region, and whose values are Seq2SeqEdge objects providing detailed information about the alignment of the target interval (key) to the source interval (the sequence interval used to create the NLMSASlice in the first place). You can use this dictionary interface in several ways:

__iter__( )
iterates over all sequence intervals that have a 1:1 mapping (i.e. a block of alignment containing no indels) to all or part of the source interval.

keys( maxgap=0, maxinsert=0, mininsert= 0, filterSeqs=None, mergeAll=False, ivalMethod=None, sourceOnly=False, indelCut=False, seqGroups=None, minAligned=1, pMinAligned=0., seqMethod=None, **kwargs)
Provides a more general interface than iter(), with two types of group-by capabilities, ``group-by'' operations on the alignment intervals contained within this slice (``horizontal'' grouping), and on the sets of sequences aligned to this slice (``vertical'' grouping).

1. ``group-by'' operations on the alignment intervals contained within this slice. It allows the user to supply various parameters for controlling when alignment intervals will be merged or split in the results that it returns. mergeAll forces it to combine intervals of a given sequence irrespective of the size of gaps or inserts separating them. maxgap sets the maximum gap size for merging two adjacent intervals. If the target sequence for the two alignment intervals has a gap longer than maxgap letters between the two alignment intervals, they will be returned as separate intervals; otherwise they will be merged as a single alignment region. maxinsert sets the maximum length of insert in the target sequence that allows to adjacent intervals to be merged as a single alignment region in the results. mininsert is specifically for handling alignments that may have small ``cycles'' (due to slight inconsistencies in the reported alignment intervals, for example, if a portion of sequence can align at both the end of one interval or at the beginning of another, and the intervals are actually added to the NLMSA that way, then the start of the second interval will actually be before the stop of the first interval; this corresponds to a negative insert value). A mininsert value of zero (the default), prevents any such interval pairs from being merged. Giving a negative mininsert value will allow interval pairs whose insert value is greater than or equal to this value, to be merged. filterSeqs, if not None, should be a dict of sequences used to filter the group-by analysis; i.e. only alignment intervals containing these sequences are considered in the analysis. More specifically, filterSeqs can be used to mask the group-by analysis to a specific interval of a sequence, by having filterSeqs return only the intersection between the interval it is passed as a key, and the masking interval that it stores. If there is no overlap, it must raise KeyError. The sequence.SeqFilterDict class provides exactly this masking capability, i.e.

d=sequence.SeqFilterDict(someIntervals)
overlap=d[ival] # RETURNS INTERSECTION BETWEEN ival AND someIntervals, OR KeyError
ivalMethod, if not None, allows the user to provide a Python function that performs interval grouping. Specifically it is called as ivalMethod(l, ns,msaSlice=self, **kwargs), where l is the list of intervals for NLMSASequence ns within the current slice msaSlice; all other args are passed as a dict in kwargs.

2. merge groups of sequences using "vertical" group-by rules. seqGroups: a list of one or more lists of sequences to group. If None, the whole set of sequences will be treated as a single group. Each group will be analyzed separately, as follows: sourceOnly: output intervals will be reported giving only the corresponding interval on the source sequence; redundant output intervals (mapping to the same source interval) are culled. Has the effect of giving a single interval traversal of each group. indelCut: for sourceOnly mode, do not merge separate intervals that the groupByIntervals analysis separated due to an indel). minAligned: the minimum number of sequences that must be aligned to the source sequence for masking the output. Regions below this threshold are masked out; no intervals will be reported in these regions. pMinAligned: the minimum fraction of sequences (out of the total in the group) that must be aligned to the source sequence for masking the output. seqMethod: you may supply your own function for grouping. Called as seqMethod(bounds,seqs,**kwargs), where bounds is a sorted list of (ipos,isStart,i,ns,isIndel,(start,end,targetStart,targetEnd)) and seqs is a list of sequences in the group. Must return a list of (sourceIval,targetIval). See the docs.

iteritems( **kwargs)
same keys as iter, but for each provides the source interval to target interval mapping (Seq2SeqEdge). Uses same group-by arguments as keys().

edges( **kwargs)
same interval mappings as iteritems, but for each provides a tuple of three objects: the source interval, the corresponding target interval, and the Seq2SeqEdge providing detailed information about the alignment between the source and target intervals (such as percent identity, etc.). Uses same group-by arguments as keys().

__getitem__( s1)
treats s1 as a key (target sequence interval), and returns an Seq2SeqEdge object providing detailed information about the alignment between this target interval and the source interval.

__len__( )
returns the number of distinct sequences that are aligned to the source interval. Note: this is NOT necessarily equal to the number of items that will be returned by the above iterators, since a single target sequence might have multiple 1:1 intervals of alignment to the source interval, due to indels.

In addition to these standard dictionary methods, NLMSASlice provides several additional methods and attributes:

split( **kwargs)
this method provides a way to perform group-by operations on the slice; the output of split() is one or more NLMSASlice objects; if the group-by analysis results in no splitting of the current slice, then it is returned unchanged (i.e. the method just returns self). Uses same group-by arguments as keys(). For further details on group-by operations, see keys() above.

regions( **kwags)
performs the same group-by analysis as split(), but replaces the source interval by the corresponding interval in the LPO. The main practical consequence of this is that target sequence inserts are included in the resulting slice (because they are present in the LPO interval corresponding to the original source interval), whereas they were NOT included in the original slice (because they are not aligned to the source interval). The main place where this matters is in graph traversal of the slice's letters attribute: whereas the nodes and edges corresponding to these inserts are not considered to be part of the letters graph for the original slice, they are part of the LPO slice. Also, the ``source interval'' in any subsequent operations with the LPO slice will be LPO coordinates instead of subintervals of the original source sequence interval. Uses same group-by arguments as keys().

groupByIntervals( maxgap=0, maxinsert=0, mininsert= 0, filterSeqs=None, mergeAll=True, ivalMethod=None, **kwargs)
This method performs the interval grouping analysis for all the iterators described above. Users will not need to call it directly. Its arguments are described above (see keys()). It returns a dictionary whose keys are sequences aligned to this slice, and whose values are the list of intervals produced by the group-by analysis for the corresponding sequence. The values are tuples of the form (source_start, source_stop, target_start, target_stop), showing the mapping of a source sequence interval onto a target sequence interval. This dictionary is the primary input to the groupBySequences() method below.

groupBySequences( seqIntervals, sourceOnly=False, indelCut=False, seqGroups=None, minAligned=1, pMinAligned=0., seqMethod=None, **kwargs)
This method performs the sequence grouping analysis for all the iterators described above. seqIntervals must be a dictionary of sequences and their associated list of intervals (produced by groupByIntervals() above). It returns a list of output sequence intevals, which is either a list of source sequence intervals (sourceOnly mode), or a list of tuples of the form (source_interval, target_interval).

matchIntervals( seq=None)
this method returns the set of 1:1 match intervals for the target sequence seq (or all aligned sequences, if seq is None), as a dictionary whose keys are target sequence intervals, and whose values are the corresponding source sequence intervals to which they are aligned.

findSeqEnds( seq)
returns the largest possible interval of seq that is aligned to this slice, i.e. it merges all alignment intervals in this slice containing seq, and returns the merged sequence interval based on the minimum start value and maximum stop value found.

2.4.6 NLMSASliceLetters

represents the letters graph of a specific NLMSASlice. It is a graph whose nodes are the NLMSANode objects in this slice, and whose edges are sequence.LetterEdge objects. Note: currently the edge objects are just returned as None - please implement!

This graph has the following methods:

__iter__( )
generates all the nodes in the slice, in order from left to right.

items( )
also iteritems(). Generate the same set of nodes as above, as keys, but for each also returns a value representing its outgoing directed edges (see getitem, below).

__getitem__( node)
gets a dictionary indicating all the outgoing directed edges from node to subsequence nodes, whose keys are the target nodes, and whose edges are the sequence.LetterEdge objects representing each edge.

2.4.7 NLMSANode

A temporary object (created on-the-fly) representing a single letter ``column'' in the alignment. It acts like a container of the sequence letters aligned to the source sequence in this column. It has the following methods:

__iter__( )
generates all the individual sequence letters (as SeqPath intervals, presumably of length 1) that are aligned to the source sequence, in this column of the alignment.

edges( )
generates the same list of of target sequence letters as the iterator, but as a tuple of (target letter, source letter, edge). Currently, edge is just None.

__len__( )
returns the number of distinct sequences aligned to the source interval, in this column.

Other, internal methods that regular users are unlikely to need:

getSeqPos( seq)
returns the sequence interval of seq that is aligned to this column, or raises KeyError if it is not aligned here.

getEdgeSeqs( node2)
returns a dictionary of sequences that traverse the edge directly from this node to node2, i.e. if letter i of seq is aligned to this node, then letter letter i+1 is aligned to node2. The dictionary's keys are top-level sequence objects, and its value for each is the letter position index i as defined above.

nodeEdges( )
returns a dictionary of the outgoing edges from this node, whose keys are target nodes, and whose values are the corresponding edge objects (of type sequence.LetterEdge).

2.4.8 NLMSASequence

You are unlikely to need to manipulate NLMSASequence objects directly; they perform the back-end work for accessing the nested list disk storage of the alignment of the associated sequence.

However, one thing you should know is that for a sequence to be stored in a NLMSA, it needs to have a unique string identifier. NLMSASequence obtains a string identifier for the sequence in one of the following ways (in decreasing order of precedence): 1) the sequence ``object'' can itself just be a Python string, in which case that string is used as the identifier. 2) otherwise, the object should be a SeqPath instance. If it has a name attribute, that will be used as the identifier. 3) Otherwise, if it has a id attribute (which is present by default on sequence.Sequence objects), that will be used.


2.5 Seqdb Module

Pygr interface to sequence databases stored in FASTA, BLAST or relational databases.

2.5.1 Overview

The seqdb module provides a simple, consistent interface to sequence databases from a variety of different storage sources such as FASTA, BLAST and relational databases. Sequence databases are modeled (like other Pygr container classes) as dictionaries, whose keys are sequence IDs and whose values are sequence objects. Pygr sequence objects use the Python sequence protocol in all the ways you'd expect: a subinterval of a sequence object is just a Python slice (s[0:10]), which just returns a sequence object representing that interval; the reverse complement is just -s; the length of a sequence is just len(s); to obtain the actual string sequence of a sequence object is just str(s). Pygr sequence objects work intelligently with different types of back-end storage (e.g. relational databases or BLAST databases) to efficiently access just the parts of sequence that are requested, only when an actual sequence string is needed.

2.5.2 External Requirements

This module makes use of several external programs:

If you are lacking one or more of these requirements, you can still install Pygr and use all Pygr functionality that does not depend on the missing requirements. If you try to use a function for which a requirement is missing, Pygr will raise an appropriate exception (e.g. unable to run blastall).

2.5.3 BlastDB

Interface to an existing BLAST database or FASTA sequence file; in the latter case, it will automatically construct BLAST database files for you using the NCBI tool formatdb. Here's a simple example of opening a BLAST database and searching it for matches to a specific piece of sequence:

from pygr.sequence import *
db=BlastDB('sp') # OPEN SWISSPROT BLAST DB
s=Sequence(str(db['CYGB_HUMAN'][40:-40]),'boo')
m=db.blast(s) # DO BLAST SEARCH
myg=db['MYG_CHICK']
for i in m[s][myg]:
    print repr(i.srcPath),repr(i.destPath),i.blast_score,i.percent_id

Let's go through this example line by line:

Options for constructing a BlastDB:

BlastDB( filepath=None,skipSeqLenDict=False,ifile=None,idFilter=None, blastReady=False)
Open a sequence file as a ``database'' object, giving the user access to its sequences, easy searching via blast or megablast, etc. skipSeqLenDict prevents construction of a sequence length index file (which will be named ``filepath.seqlen'') and a fast sequence access file (which will be named ``filepath.pureseq''). Setting this option to True can be useful if you either wish to speed up initial opening of the BlastDB (note: construction of these indexes is only a one-time event) or avoid the extra disk space required by these indexes. Explanation: To facilitate the rapid creation of sequence objects (which requires the length of the sequence), it creates a sequence length index (as a Python shelve). This enables it to avoid actually loading the sequence string into memory each time a sequence object is created; instead it just looks up the sequence length. While this speeds up access to genomic sequence databases (where each sequence tends to be extremely long), this initial step may be slow for databases of short sequences. Setting skipSeqLenDict option to True, will prevent construction of this sequence length index. ifile lets you open the database directly from a file object rather than a filename. If you have a file object, you can pass it directly to BlastDB instead of a filepath. NB: the BlastDB() constructor will close ifile when it is done reading from the file object. idFilter allows you to provide a function for re-mapping the FASTA sequence identifiers read from the sequence file. This can be useful in the case of NCBI FASTA files, since NCBI often treats the sequence ID as a ``blob'' into which any number of database fields can be stuffed, rather than a true ID. blastReady option specifies whether BLAST index files should be automatically constructed (using formatdb). Note, if you attempt to run the blast() method, it will automatically create the index files for you if they are missing.

Useful methods:

iter( )
iterate over all IDs in the BLAST database.

len( )
returns number of sequences in the BLAST database.

blast( seq,al=None,blastpath='blastall',blastprog=None,expmax=0.001,maxseq=None)
run a BLAST search on sequence object seq. Maxseq will limit the number of returned hits to the best maxseq hits.

megablast( seq,al=None,blastpath='megablast',expmax=1e-20,maxseq=None,minIdentity =None,maskOpts='-U T -F m',rmOpts='')
first performs repeat masking on the sequence by converting repeats to lowercase, then runs megablast with command line options to prevent seeding new alignments within repeats, but allowing extension of alignments into repeats. minIdentity should be a number (maximum value, 100) indicating the minimum percent identity for hits to be returned.

__invert__( )
The invert operator (~, the ``tilde'' character) enables reverse-mapping of sequence objects to their string ID.
id=(~db)[seq] # GET IDENTIFIER FOR THIS SEQUENCE FROM ITS DATABASE

Useful attributes:

2.5.4 FileDBSequence

The default class for sequence objects returned from BlastDB. It provides efficient, fast access to sequence slices (subsequences). When the BlastDB is initially opened, Pygr constructs a length and offset index that enables FileDBSequence to seek() to the correct location for any substring of the sequence. This is a newly implemented class to be released in Pygr 0.4.

2.5.5 BlastSequence

This was previously the default class for sequence objects returned from BlastDB, but has been deprecated because we found that NCBI fastacmd was much too slow and consumed enormous amounts of memory. BlastSequence relies on fastacmd for ``fast'' access to individual sequence slices. The advantage is that it only requires BLAST database files (produced by Pygr using formatdb), whereas the new FileDBSequence requires a specially indexed sequence file (constructed by default by BlastDB), which may be a disadvantage if you are low on disk space.

BlastSequence has several optimizations for working with BLAST databases:

2.5.6 BlastSequenceCache

Implements a variant of BlastSequence designed to merge and cache requests for local intervals of sequence so that repeated accesses to these regions are bundled and cached for efficiency. You work with sequence objects of this type normally, using Python slicing to obtain subintervals, and str() to get the sequence string for a subinterval. But behind the scenes, it does two things:

2.5.7 SQLSequence

Implements a subclass inheriting from SQLRow and SequenceBase, to use a relational database table to obtain the actual sequence. There are three minor variants DNASQLSequence, RNASQLSequence, ProteinSQLSequence (so that the sequence does not have to analyze itself to determine what kind of sequence it is). Its constructor takes the same arguments as SQLRow(table, id), where table is the SQLTable object representing the table in which the sequence is stored, and id is the primary key of the row representing this sequence. However, normally this class is simply passed to the Table object itself so that it will use it to instantiate new row objects whenever they are requested via its dictionary interface. Here's a simple example:

class YiProteinSequence(ProteinSQLSequence): # CREATE A NEW SQL SEQUENCE CLASS
    def __len__(self): return self.protein_length  # USE LENGTH STORED IN DATABASE
protein=jun03[protein_seq_t] # protein IS OUR SQLTable OBJECT REPRESENTING PROTEIN SEQUENCE TABLE
protein.objclass(YiProteinSequence) # FORCE PROTEIN SEQ TABLE TO USE THIS TO INSTANTIATE ROW OBJECTS
pseq=protein['Hs.1162'] # GET PROTEIN SEQUENCE OBJECT FOR A SPECIFIC CLUSTER

Let's go through this line by line:

2.5.8 StoredPathMapping

A second major area in Pygr is representation and query of multiple sequence alignment databases in a way that is scalable to whole genomes. We have previously showed (in our work on Partial Order Alignment) that graphs provide both a compact and algorithmically powerful way to store alignments. Combining this with "interval alignment" makes it scalable and gives a simple interface. In Pygr, alignments are just another kind of graph, whose nodes are sequence intervals, and edges are alignment relations. This provides a general-purpose facility for working with sets of sequence intervals, sequence annotation databases, and multiple sequence alignments, all queryable via Pygr graph queries. We have implemented different container subclasses to work with these data in memory or to work transparently with data stored in relational databases. The consistency and simplicity of the Pygr framework makes it a good interface both to run external tools like BLAST, and to store or query the results in persistent storage like a MySQL database.

hg17=BlastDB('/data/ucsc/hg17') # GET CONTAINER FOR HUMAN GENOME DATABASE
bcl2m=hg17['chr22'][16544303:16588541] # GET INTERVAL WITH BCL2L13 GENE
al=hg17.megablast(mouse_bcl2,maxseq=1) # GET REPEAT-MASKED MEGABLAST ALIGNMENT, ONLY TOP HIT
al[bcl2m[1000:1100] ]+=mrna[210:310] #ADD ALIGNMENT OF A 100nt SEGMENT TO mrna SEGMENT
al.storeSQL('test.table',db_cursor) # STORE COMPLETE ALIGNMENT IN RELATIONAL DATABASE
for e in MAFStoredPathMapping(bcl2m,'ucsc_maf8',u).edges(): #GET ITS MULTIGENOMEALIGNMENTS
   print str(e.srcPath),str(e.destPath) # PRINT THE ACTUAL ALIGNED SEQUENCE INTERVALS

Note: We will be unifying all sequence alignment functionality under the NLMSA interface design sometime in the near future. Specifically, the PathMapping and related classes, while similar to NLMSA, will be replaced with interfaces that are identical to the NLMSA interface.

2.5.9 SliceDB

This class enables you to apply ``slicing information'' from one database to sequences from a second database. For example, you could have a database that lists genes as intervals (slices) on genomic sequences stored in a BlastDB database. The only requirements are:

Both databases should raise KeyError for bad identifiers. The current SliceDB implementation caches sequence objects so that subsequent calls for the same identifier will not require repeating the database queries to the two databases. To remove a sequence object from the cache, just use del db[id] as usual.

SliceDB inherits from the builtin Python dict class, so all standard methods can be used.

db=SliceDB(sliceDB,seqDB) # CREATE OUR DATABASE
gene=db[cluster_id] # USE IT TO GET A GENE SEQUENCE...

2.5.10 VirtualSeq

This class provides an empty sequence object that acts purely as a reference system. Automatically elongates if slice extends beyond current stop. This class avoids setting the stop attribute, taking advantage of SeqPath's mechanism for allowing a sequence to grow in length.
s=VirtualSeq('FOOG_HUMAN')
len(s) # ONLY 1 LETTER LONG BY DEFAULT
s1=s[100:215] # GET A SLICE OF THIS SEQUENCE
len(s) # NOW IT'S 215

The associated VirtualSeqDB class provides a ``sequence database'' that returns a VirtualSeq object for every identifier requested of it. It acts like a Python dictionary:

db=VirtualSeqDB()
s=db['FOOG_HUMAN'] # ASK FOR A SEQUENCE BY ITS IDENTIFIER
s1=s[100:215] # GET A SLICE OF THIS SEQUENCE
For a given identifier it always returns the same VirtualSeq object (i.e. the object returned from the first request for that identifier). In other words, if the identifier was previously requested, it returns the VirtualSeq for that identifier; if not, it creates a new one. This can be convenient when performing operations that just need a coordinate reference system, not actual sequence.

2.5.11 PrefixUnionDict

This class acts as a wrapper for a set of dictionaries, each of which is assigned a specific string ``prefix''. It provides a dictionary interface that accepts string keys of the form ``prefix.suffix'', and returns d['suffix'] where d is the dictionary associated with the corresponding prefix. This is useful for providing a unified ``database interface'' to a set of multiple databases.
hg17=BlastDB('/usr/tmp/ucsc_msa/hg17')
mm5=BlastDB('/usr/tmp/ucsc_msa/mm5')
... # LOAD A BUNCH OF OTHER GENOMES TOO...
genomes={'hg17':hg17,'mm5':mm5, 'rn3':rn3, 'canFam1':cf1, 'danRer1':dr1,
'fr1':fr1, 'galGal2':gg2, 'panTro1':pt1} # PREFIX DICTIONARY FOR THE UNION 
					 # OF ALL OUR GENOMES
genomeUnion=PrefixUnionDict(genomes)
ptChr7=genomeUnion['panTro1.chr7'] # GET CHIMP CHROMOSOME 7

if 'panTro1.chr5' in genomeUnion: # CHECK IF THIS ID IN OUR UNION
    pass # DO SOMETHING...

s= -(ptChr7[1000:2000]) # GET A BIT OF THIS SEQUENCE
if s in genomeUnion: # THIS IS HOW TO CHECK IF s DERIVED FROM OUR UNION
    pass # DO SOMETHING...

It provides a __contains__ method that tests whether a given sequence object is derived from the PrefixUnionDict (see example above). Here are some additional methods:

__init__( prefixDict=None,separator='.',filename=None,dbClass=BlastDB)
You can create a PrefixUnionDict either using a prefixDict (whose keys are string prefixes, and whose values are sequence databases), or using a previously created header file filename. Using the header file, the constructor will automatically open all the sequence databases for you. When opening from a header file, you can also specify a dbClass to be used for opening individual sequence databases listed in the header file; the default is BlastDB. The database class constructor must take a single argument, which is the filepath for opening the database. The separator character is used to form ``prefix.suffix'' identifiers.

__invert__( )
The invert operator (~, the ``tilde'' character) enables reverse-mapping of sequence objects to their string ID. This is the recommended way to get the ``fully qualified sequence ID'', i.e. with the appropriate prefix prepended.
id=(~db)[seq] # GET PROPERLY PREFIXED-IDENTIFIER FOR THIS SEQUENCE
For a given sequence object seq derived from the union (or a slice of a sequence from the union), return a string identifier in the form of ``foo.bar''.

getName( path)
This method is deprecated; instead use the __invert__ operator above.

writeHeaderFile( filename)
Save a header file for this union, to reopen later. It saves the separator character, and a list of prefixes and filepaths to the various sequence databases (which must have a filepath attribute). This header file can be used for later reopening the prefix-union in a single step.

2.5.12 Functions

The seqdb module also provides several convenience functions:

read_fasta( ifile, onlyReadOneLine=False)
a generator function that yields tuples of id,title,seq from ifile. The onlyReadOneLine option is useful if you only want to determine the sequence type (e.g. DNA vs protein) and the whole sequence might be extremely long (e.g. a genome!).

write_fasta( ofile, s, chunk=60, id=None)
writes the sequence s to the output file ofile, using chunk as the line width. id can provide an identifier to use instead of the default s.id.


2.6 sqlgraph Module

This module provides back-end database access.

2.6.1 SQLTable

Provides a dict-like interface to an SQL table. It accepts an identifier as a key, and returns a Python object representing the corresponding row in the database. Typically, these ``row'' objects have an id attribute that represents the primary key, and all column names in the SQL table can be used as attribute names on the row object.

This class is derived from the Python builtin dict class, so all standard methods can be used.

objclass( oclass)
Specify a object class to use for creating new ``row'' objects. oclass must accept a single argument, a tuple object representing a row in the database.

Otherwise, the default oclass for SQLTable is the TupleO class, which provides a named attribute interface to the tuple values representing the row.

load( oclass=None)
Load all data from the table, using oclass as the row object class if specified (otherwise use the oclass for this table). All rows are loaded from the database and saved as row objects in the Python dictionary of this class.

select( whereClause,params=None,oclass=None)
Generate the list of objects that satisfy the whereClause via a SQL SELECT query. This function is a generator, so you use it as an iterator. params is passed to the cursor execute statement to allow additional control over the query.

There are several variants of this class:

2.6.2 SQLTableNoCache

Provide on-the-fly access to rows in the database, but never cache results. Use this when memory constraints or other considerations (for example, if the data in the database may change during program execution, and you want to make sure your program is always working with the latest version of the data) make it undesirable to cache recently used row objects, as the standard SQLTable does.

2.6.3 SQLMultiTableNoCache

Drops the assumption of a one-to-one mapping between each key and a row object (i.e. removes the assertion that the key is unique, a ``primary key''), allowing multiple row objects to be returned for a given key. Therefore, the standard __getitem__ must act as a generator, returning an iterator for one or more row object.

2.6.4 TupleO

Default class for ``row objects'' returned by SQLTable. Provide attribute interface to a tuple. To subclass this, add an _attrcol attribute that maps attribute names to tuple index values (integers). Constructor takes a single tuple argument representing a row in the database.

2.6.5 SQLRow

Default class for row objects from NoCache variants of SQLTable. Provides transparent interface to a row in the database: attribute access will be mapped to SELECT of the appropriate column, but data is not cached on this object. Constructor takes two arguments: a database table object, and an identifier for this row. Actual data requests will be relayed by SQLRow to the database table object.


2.7 Coordinator Module

Framework for running subtasks distributed over many computers, in a pythonic way, using SSH for secure process invocation and XMLRPC for message passing. Also provides simple interface for queuing and managing any number of such "batch jobs".

2.7.1 Overview

The coordinator module provides a simple system for running a large collection of tasks on a set of cluster nodes. It assumes:

To see how to use this, let's look at an example script, mapclusters5.py:

from pygr.apps.leelabdb import *
from pygr import coordinator

def map_clusters(server,genome_rsrc='hg17',dbname='HUMAN_SPLICE_03',
                 result_table='GENOME_ALIGNMENT.hg17_cluster_JUN03_all',
                 rmOpts='',**kwargs):
    "map clusters one by one"
    # CONSTRUCT RESOURCE FOR US IF NEEDED
    genome=BlastDB(ifile=server.open_resource(genome_rsrc,'r'))
    # LOAD DB SCHEMA
    (clusters,exons,splices,genomic_seq,spliceGraph,alt5Graph,alt3Graph,mrna, \
     protein,clusterExons,clusterSplices)=getSpliceGraphFromDB(spliceCalcs[dbname])

    for cluster_id in server:
        g=genomic_seq[cluster_id]
        m=genome.megablast(g,maxseq=1,minIdentity=98,rmOpts=rmOpts) # MASK, BLAST, READ INTO m
        # SAVE ALIGNMENT m TO DATABASE TABLE test.mytable USING cursor
        createTableFromRepr(m.repr_dict(),result_table,clusters.cursor,
                            {'src_id':'varchar(12)','dest_id':'varchar(12)'})
        yield cluster_id # WE MUST FUNCTION AS GENERATOR

def serve_clusters(dbname='HUMAN_SPLICE_03',
                   source_table='HUMAN_SPLICE_03.genomic_cluster_JUN03',**kwargs):
    "serve up cluster_id one by one"
    cursor=getUserCursor(dbname)
    t=SQLTable(source_table,cursor)
    for id in t:
        yield id

if __name__=='__main__':
    coordinator.start_client_or_server(map_clusters,serve_clusters,['hg17'],__file__)

Let's analyze the script line by line:

2.7.2 Log and Error Information

Process logging and error information go to three different types of logs:

2.7.3 Coordinator

To start a job coordinator (which in turn will the run the whole job by starting Processors on cluster nodes using SSH):

python mapclusters5.py mm5_jan02 --errlog=/usr/tmp/leec/mm5_jan02.log \ 
  --dbname=MOUSE_SPLICE --source_table=genomic_cluster_jan02 \
  --genome_rsrc=mm5 --result_table=GENOME_ALIGNMENT.mm5_cluster_jan02_all \ 
  --rmOpts=-rodent \

Here we have told the Coordinator to name itself "mm5_jan02" in all its communications with the ResourceController. Since we gave no command-line flags, the Coordinator will assume that a ResourceController is already running on port 5000 of the current host. You must have an ssh-agent running BEFORE you start the Coordinator, since the Coordinator will attempt to spawn jobs using SSH. The Coordinator will exit with an error message if it is unable to connect to ssh-agent. A few notes:

2.7.4 ResourceController

Whereas you start a separate Coordinator for each set of jobs you want to run, you only need a single ResourceController running. To start the ResourceController, run:

python coordinator.py --rc=bi