Subsections


2 Module Documentation

The following subsections provide details about how to use specific modules of Pygr functionality.


2.1 Installation

Installing pygr is quite simple:
tar -xzvf pygr-0.3.tar.gz 
cd pygr
python setup.py install

Once the test framework has completed successfully, the setup script will install pygr into python's respective site-packages directory.

If you don't want to install pygr into your system-wide site-packages, replace the "python setup.py install" command with

python setup.py build
This will build pygr but not install it in site-packages.

Pygr contains several modules imported as follows:

from pygr import seqdb # IMPORT SEQUENCE DATABASE MODULE

If you did not install pygr in your system-wide site-packages, you must set your PYTHONPATH to the location of your pygr build. For example, if your top-level pygr source directory is PYGRDIR then you'd type something like:

setenv PYTHONPATH PYGRDIR/build/lib.linux-i686-2.3
where the last directory name depends on your specific architecture.


2.2 sequence Module

Base classes for representing sequences and sequence intervals.

2.2.1 Overview

Pygr provides one base class representing both sequences and sequence intervals (SeqPath), from which all sequence classes are derived (Sequence, SQLSequence, BlastSequence etc.). In this section we document both the features of the base class, and ways to extend or customize it by creating your own subclasses derived from SeqPath. The IntervalTransform class represents a coordinate system mapping from one interval of a sequence, onto another interval of the same or a different sequence.

2.2.2 SeqPath

This class provides the basic capabilities of a sliceable sequence or sequence interval, widely used in Pygr. It tries to provide core operations on sequences in a highly Pythonic way:

before( )
This method returns the entire sequence interval preceding this interval. For example, if exon is an interval of genomic sequence, then exon.before()[-2:] is its acceptor splice site (i.e. the 2 nt immediately before exon).

after( )
This method returns the entire sequence interval following this interval. For example, if exon is an interval of genomic sequence, then exon.after()[:2] is its donor splice site, (i.e. the 2 nt immediately after exon).

2.2.3 Sequence class

sequence.Sequence( s, id)
The Sequence class provides a SeqPath flavor that stores a sequence string s and identifier id for this sequence.

from pygr import sequence
seq = sequence.Sequence('GPTPCDLMETQ','FOOG_HUMAN')

update( s)
You can change the actual string sequence to a new string s using the update method:

seq.update('TKRRPLEDKMNEPS')

seqtype( )
returns DNA_SEQTYPE for DNA sequences, RNA_SEQTYPE for RNA, and PROTEIN_SEQTYPE for protein.

reverse_complement( s)
returns the reverse complement of the sequence string s.

2.2.4 Coordinate System

SeqPath follows Python slicing conventions (i.e. 0-based indexing, positive indexes count forward from start, negative indexes count backwards from the sequence end, and always s.start<s.stop).

Each SeqPath object has a number of attributes giving information about its ``location'':

2.2.5 Extending and Customizing

There are several methods and attributes you can override to extend or customize the behavior of your own SeqPath-derived classes. Typically you will derive either from the Sequence class, or in some cases from the SeqPath class.

strslice( start, stop, useCache=True)
called to get the string sequence of the interval (start, stop). You can provide your own strslice() method to customize how sequence is stored and accessed. For example, SQLSequence.strslice() gets the sequence via a SQL query, and BlastSequence.strslice() obtains it using the fastacmd -L start,stop UNIX shell command from the NCBI toolkit. The optional useCache argument controls whether your strslice method should attempt to get the sequence slice from its database cache (if any), or, if false, only directly from its back-end storage (in the usual way described above).

__len__( )
called to compute the length of the sequence. You can customize this to provide an efficient length method for your particular sequence storage. e.g. SQLSequence obtains it via a SQL query; BlastSequence obtains it from a precomputed length index. The default Sequence.__len__() method computes it from len(self.seq), assuming that the sequence can be accessed from the seq attribute.

__getitem__( slice_obj)
if you want to monitor or intercept slicing requests on your sequence, you can do so by providing your own getitem method. See seqdb.BlastSequenceCache class for an example. If the sequence object has a db attribute, and that database object it points to has an itemSliceClass attribute, SeqPath.__getitem__ will use that class to construct the subinterval object. Similarly, if the sequence object has an annot attribute, and that annotation object has a db attribute, again the itemSliceClass attribute of that database will be used as the class to construct the subinterval object. Otherwise it will use SeqPath itself as the class for constructing the subinterval object.

Note: this itemSliceClass behavior applies not only to sequence slices obtained via __getitem__, but also from all other methods that return sequence slices, such as the following list: before, after, __mul__, __neg__. __add__, __iadd__.

__mul__( other)
get the sequence interval intersection of self and other.

__neg__( )
get the sequence interval representing the opposite strand of self i.e. the slice whose string value is the reverse complement of the string value of self.

__add__( other)
get the sequence interval union of self and other, i.e. the smallest sequence interval that contains both of them.

__getattr__( attr)
if you subclass a SeqPath-derived class and supply a __getattr__ method for your subclass, it must call the parent class's __getattr__. This is essential for ``delayed evaluation'' of start and stop attributes, which are generated automatically by SeqPath's __getattr__. If your subclass inherits from more than one parent class, check whether both parents supply a __getattr__, in which case your subclass must supply a __getattr__ that explicitly calls both of them. Failing to do so will lead to strange bugs.

2.2.6 IntervalTransform

This class provides a mapping transform between the coordinate systems of a pair of intervals.

xform = IntervalTransform(srcPath,destPath)
d2 = xform(s2) # MAPS s2 FROM srcPath coords to destPath coord system
d3 = xform[s2] # CLIPS s2 TO NOT EXTEND OUTSIDE srcPath, THEN XFORMS
s3 = xform.reverse(d3) # MAP BACK TO srcPath COORD SYSTEM

2.2.7 Seq2SeqEdge

This class represents a segment of alignment between two sequences. It is a temporary object created in association with a MSASlice object (see Alignment Module below).

__init__( msaSlice, targetPath, sourcePath=None)
Create a Seq2SeqEdge for the targetPath, on the specified alignment slice. If sourcePath is None, it will be calculated automatically by calling the slice's methods.

__iter__( sourceOnly=True, **kwargs)
iterate over source intervals within this segment of alignment. kwargs will be passed on to the msaSlice's groupByIntervals and groupBySequences methods.

items( **kwargs)
same as __iter__, but gets tuples of (source_interval,target_interval).

pIdentity( mode=max,trapOverflow=True)
Compute the percent identity between the source and target sequence intervals in this segment of the alignment. mode controls the method used for determining the denominator based on the lengths of the two aligned sequence intervals. trapOverflow controls whether overflow (due to multiple mappings of the query sequence to different regions of the alignment) is trapped as an error. To turn off such error trapping, set trapOverflow=False.

pAligned( mode=max,trapOverflow=True)
Compute the percent alignment between the source and target sequence intervals in this segment of the alignment, i.e. the fraction of residues that are actually aligned as opposed to gaps / insertions, in the two intervals.

conservedSegment( pIdentityMin=.9,minAlignSize=1,mode=max)
Return the longest alignment interval (possibly including gaps) with a %identity fraction higher than pIdentityMin. If there is no such interval, or the longest such interval is shorter than minAlignSize, it returns None. The interval is returned as a tuple of integers (srcStart,srcEnd,destStart,destEnd).

Warning: if your query sequence has multiple mappings in the alignment (i.e. it is aligned to two or more different regions in the alignment), pIdentity() and pAligned() may return fractions larger than 1.0. This is because the query sequence may align to a given target sequence via more than one region in the alignment. If you encounter this problem, you can iterate through the individual mappings yourself (by calling the iter(), items() or edges() iterator methods for your alignment slice object), and calculating the percentage identity or alignment (via your own algorithm) individually for each specific mapping. For more background on this problem, see ``Multiple Mappings'', below.

Note that the presence of multiple mappings is not a Pygr bug, but simply reflects the alignment data loaded into Pygr. Seq2SeqEdge should be able to avoid this problem mostly, beginning with release 0.6. (It tries to screen out hits not consistent with the specific region-region mapping stored with this edge).

2.2.8 SeqFilterDict

This dict-like class provides a simple way for masking a set of sequences to specific intervals. It stores a specific interval for each sequence. Subsequent look-up using a sequence interval as a key will return the intersection between that interval and the stored interval for that sequence in the dictionary. If there is no overlap, it raises KeyError.

d = SeqFilterDict(seqIntervalList)
overlap = d[ival] # RETURNS INTERSECTION OF ival AND STORED IVAL, OR KeyError

You can pass a list of intervals to store to the class constructor (as shown above). You can also add a single interval using the syntax d[saveInterval]=saveInterval. (This syntax reflects the actual mapping that the dictionary will perform if later called with the same interval).

2.2.9 LetterEdge

This class represents an edge from origin -> target node.

__iter__( )
iterate over seqpos for sequences that traverse this edge.

iteritems( )
generate origin, target seqpos for sequences that traverse this edge.

__getitem__( seq)
return origin,target seqpos for sequence seq; raise KeyError if not in this edge

2.2.10 Functions

The sequence module also provides convenience functions:

guess_seqtype( s)
based on the letter composition of the string s, returns DNA_SEQTYPE for DNA sequences, RNA_SEQTYPE for RNA, and PROTEIN_SEQTYPE for protein.

absoluteSlice( seq, start, stop)
returns the sequence interval of top-level sequence object associated with seq, interpreting start and stop according to the Pygr convention: a pair of positive values represents an interval on the forward strand; a pair of negative values represents an interval on the reverse strand (see Coordinate System, above). Note: if seq is itself a subinterval, then the start,stop coordinates are interpreted relative to its parent sequence, i.e. seq.pathForward[start:stop].

relativeSlice( seq, start, stop)
returns the sequence interval of seq, interpreting start and stop according to the Pygr convention: a pair of positive values represents an interval on the forward strand; a pair of negative values represents an interval on the reverse strand (see Coordinate System, above). Note: if seq is itself a subinterval, then the start,stop coordinates are interpreted relative to seq itself, i.e. seq[start:stop].


2.3 Alignment Module

Pygr interface to sequence alignment, and scalable storage for multigenome alignments

2.3.1 Overview

Pygr provides a general model for interfacing with any kind of sequence alignment, and also a uniquely scalable storage system for working with huge multiple sequence alignments such as multigenome alignments. Specifically, it lets you work with an alignment both in the traditional Row-Column model (each row is a sequence, each column is a set of individual letters from different sequences, that are aligned; we will refer to this as the RC-MSA model), and also as a graph structure (known as a Partial Order Alignment, which we will refer to as the PO-MSA model). This supports ``traditional'' alignment analysis, as well as graph-algorithms, and even graph query of alignments.

This model has a few basic classes:

2.3.2 NestedList Storage

Pygr provides a highly scalable storage mechanism for working with multi-genome alignments. One fundamental challenge in working with very large alignments is the interval overlap query problem: to obtain a portion of an alignment (defined by some interval of interest) requires finding all interval elements in the ``alignment database'' that overlap the query interval. Since the intervals can be indexed by start (or end) position, one can typically find the first overlapping element in $O(\log N)$ time, where $N$ is the total number of intervals in the database. The problem is that since standard index structures cannot index both start and end, to obtain all intervals that overlap the query interval, one must scan forwards (or backwards) from that point. Furthermore, one cannot stop at the first non-overlapping interval; there might be an extremely long interval at the very beginning of the index, that extends to overlap the query interval. In this case, one would have to scan the entire database ($O(N)$ time) to guarantee that all overlapping intervals are found.

The nested list data structure solves this problem, by moving any interval in the database that is contained in another interval out of the top-level interval list, into the sublist of the parent interval. Based on this, one can prove that one can stop the scanning operation at the first non-overlapping interval (i.e. the overlapping intervals in any list form a single contiguous block). Overall, this reduces the query time to $O(\log N + n)$, where $n$ is the number of intervals in the database that actually overlap the query (i.e. results to return). Moreover, the nested list data structure can be implemented very well both in computer memory (RAM) or as indexed disk files. Pygr's disk-based cnestedlist database can complete a typical interval query of the 26GB UCSC 8 genome alignment in about 60 microseconds, compared with 10-30 seconds per query using MySQL.

2.3.3 Multiple Mappings: a Warning

Multi-genome alignments take traditional models of alignment to an entirely different scale, and inevitably many of the assumptions of standard row-column multiple sequence alignment are broken (e.g. no inversions; no cycles; etc.). One major issue that users should be aware of in UCSC multi-genome alignments is the possibility of multiple mappings, in which a given query sequence interval is mapped to two or more different regions of the alignment (and thus potentially to two or more different locations in a given target genome). Currently, UCSC multi-genome alignment are typically based on a single reference genome, to which all other genomes are aligned. While a given region of the reference genome might be guaranteed to have a unique mapping in the UCSC multi-genome alignment, other genomes do not appear to have any such guarantee: a region in any of those genome can have multiple mappings. This is problematic for several reasons:

If you encounter multiple mappings, you can always iterate over them one by one, and perform your own computations for each one. However, to avoid them altogether, you can restrict your queries to the reference genome for this specific alignment (UCSC offers different versions of each alignment set, each based on a different reference genome).

2.3.4 NLMSA

Top-level object representing an entire multiple sequence alignment, stored using a set of disk-based nested list interval databases. The alignment is stored as an interval representation of a linearized partial order (LPO), using nested list databases. This has several elements:

This functionality is encapsulated in the NLMSA class, which has a number of methods and attributes.

Construction Methods:

NLMSA( pathstem='', mode='r', seqDict=None, mafFiles=None, axtFiles=None, maxOpenFiles=1024, maxlen=None, nPad=1000000, maxint=41666666, trypath=None, bidirectional=True, pairwiseMode= -1, bidirectionalRule=nlmsa_utils.prune_self_mappings, maxLPOcoord=None)
Constructor for the class. pathstem specifies a path and filename prefix for the NLMSA files (since multiple files are used to store one NLMSA, it will automatically add a number of suffixes automatically to open the necessary set of files for the NLMSA). mode is either ``r'' to open an existing NLMSA (from the pathstem disk files); ``w'' to create a new one (which will be saved to the pathstem disk files); or ``memory'' to create a new in-memory NLMSA (i.e. stored in your computer's RAM instead of using files on your hard disk). Obviously, this limits you to the amount of RAM in your computer, but will make the NLMSA much, much faster.

seqDict specifies a dictionary which maps sequence names to actual sequence objects representing those sequences. If seqDict is None, the constructor will call nlmsa_utils.read_seq_dict() to try to obtain it from files associated with the NLMSA. It first looks for a file pathstem.seqDictP that is simply a pickle of the seqDict data. If this is not found, it next looks for a file pathstem.seqDict that is a prefixUnionDict header file for opening all the sequence database files for you automatically. This header file will itself specify a list of sequence database files; the trypath option, if provided, specifies a list of directories in which to look for these sequence database files.

The bidirectional option indicates whether you wish the NLMSA to save each input alignment relationship A:B in both possible directions (i.e. nlmsa[A] will yield B, and nlmsa[B] will yield A). In general, the bidirectional=True mode is most appropriate for true multiple sequence alignments, i.e. where it is guaranteed that for a given pair of sequences A,B each interval of A maps to a unique interval in B, which in turn maps back to the same interval of A (and only that interval in A). There are many possible scenarios where you might prefer bidirectional=False mode:

pairwiseMode=True indicates a PAIRWISE sequence alignment, in which the stored alignment relationships each consist of a pair of sequence intervals that are aligned. Note: this pairwise format can store the alignment of any number of sequences, but the key point is that the individual alignment relations are pairwise, sequence-to-sequence. The opposite model (pairwiseMode=False) indicates a true MULTIPLE sequence alignment, in which the stored alignment relationships each consist of an integer coordinate interval (the alignment's internal coordinate system, for technical reasons called the ``LPO'') and a sequence interval that is aligned to it. Under normal circumstances, you will not need to specify a value for the pairwiseMode option; the NLMSA will infer the correct setting automatically based on the input data. Note: the pairwise format (pairwiseMode=True) and multiple alignment format (pairwiseMode=False) cannot be mixed in a single NLMSA. It must be either one format or the other.

mafFiles can be used to specify a list of filenames containing a multiple sequence alignment in the UCSC MAF format, for saving as a new NLMSA (i.e. mode='w'). Note that this automatically sets pairwiseMode=False. After the MAF data are read, it will automatically call the build() method to construct the alignment index files.

axtFiles can be used to specify a list of filenames containing a set of pairwise alignments in UCSC axtNet format, for saving as a new NLMSA (i.e. mode='w'). Note that this automatically sets pairwiseMode=True. After the axtNet data are read, it will automatically call the build() method to construct the alignment index files.

bidirectionalRule allows the user to provide a function that has complete control over the desired bidirectional setting to use for each possible pair of sequence databases. Currently, this is only used for axtFiles reading; the default method (nlmsa_utils.prune_self_mappings) filters out duplicate mappings for a sequence database onto itself (since these are provided in both forward and reverse directions in the axtNet file), but stores mappings for one sequence database to another bidirectionally (since the axtNet files give such mappings in only one direction normally). To implement your own bidirectionalRule function, see nlmsa_utils.prune_self_mappings() as an example.

maxlen specifies the maximum coordinate value for a union or LPO coordinate system. Its default value is 2GB, to prevent int overflow. Using a smaller value can be useful, to 1) limit the size of the LPO in memory during initial construction, and 2) to limit the size of LPO database files on disk (if for example, your file system does not support files above some maximum size). During initial construction of the NLMSA (from MAF files or user-specified interval alignments), the algorithm performs a one-pass sort of the LPO intervals. Thus, this set of intervals is briefly held in RAM for this sort. If you have insufficient RAM, the construction step may raise a MemoryError. If so, you can avoid this problem by using a smaller maxlen value.

The maxint option provides another way of limiting the size of LPO databases. It specifies the maximum number of intervals to store per LPO database. Since each interval takes 24 bytes, the default setting limits each LPO to a total size of 1 GB. Note that the current NLMSA construction algorithm requires loading each database index into memory as one-time operation during construction. If your NLMSA build fails due to running out of memory, simply reduce this value.

The nPad option sets the maximum number of LPO coordinate systems (specifically, the offset for the start of real sequence IDs in the NLMSA sequence index). You are unlikely to need to change this default value.

maxOpenFiles limits the open file descriptors the NLMSA will use. This option is no longer of much importance. In versions prior to pygr 0.5, however, it was important because each sequence in the alignment had its own index file (in v.0.5 and later this problem is solved by unionization; for details see below). Since each sequence has a separate nested list database file, a large multi-genome alignment (with each genome containing 20 different chromosomes, say) can rapidly open a large number of file descriptors. Note: NLMSA only opens a given sequence's nested list database when one of your queries actually requires access to that sequence; it then keeps that file descriptor open to make subsequent queries to it fast. If the number of open file descriptors would exceed maxOpenFiles, it will close other open database files, which may slow down query performance (due to having to open and close databases repeatedly to process queries).

__iadd__( sequence)
As part of constructing an alignment, adds sequence to the alignment graph, so that you can subsequently save specific alignments of intervals of sequence, using code like nlmsa[s]+=s2, where s is an interval of sequence and s2 is some other sequence interval. If sequence had not been added to the alignment, this later operation will raise a KeyError.

addAnnotation( annotation)
adds an alignment relationship to annotation from its underlying sequence interval. Note: to use this, the NLMSA must have been created with the pairwiseMode=True option.

__getitem__( seqInterval)
prepare to store an alignment relationship for the sequence interval seqInterval, i.e. get a BuildMSASlice object representing seqInterval, to which you can then add other sequence intervals to align them. I.e. nlmsa[s1]+=s2 saves the alignment of intervals s1 and s2. You can also use a regular Python slice object using integer indices ie. nlmsa[1:45], in which case, it indicates that region of the LPO coordinate system. If the sequence containing interval s2 is not already in the NLMSA, it will be added for you automatically (i.e. creating the necessary indexing, nested list database files, etc.). In this case, the sequence must supply a unique string identifier, which will be used on subsequent attempts to open the NLMSA database, to match the individual sequence nested-list databases against corresponding sequence objects (using seqDict, see above).

build( buildInPlace=True,saveSeqDict=False,verbose=True)
to construct the final nested list databases, after all the desired alignment intervals have been saved (using the iadd/getitem above). This method simply calls the build() method on all the constituent NLMSASequence objects in this alignment. NOTE: you do not need to call build() if you provided a mafFiles constructor argument, since that automatically calls build().

buildInPlace=False forces it to use an older NLMSA construction method (higher memory usage, but more tested). The new in-place construction method (made the default in release 0.7) is described in the Alekseyenko & Lee 2007 paper published in Bioinformatics.

saveSeqDict=True forces it to write the NLMSA's seqDict (dictionary of sequences that are included in the alignment) to disk. This is unnecessary if you intend to store the NLMSA in pygr.Data, as pygr.Data will automatically save the NLMSA's seqDict as part of that process. However, if you plan on re-opening the NLMSA directly from disk, you should save the seqDict to disk by passing this option, or by directly calling the NLMSA's save_seq_dict() method.

verbose controls whether the method will print explanatory messages to stderr about the saveSeqDict=False mode. To suppress printing of these messages, use verbose=False.

save_seq_dict( )
Forces saving of the NLMSA's seqDict to a disk file named 'FILESTEM.seqDictP' (where FILESTEM is the base path to your NLMSA files). This is unnecessary if you intend to store the NLMSA in pygr.Data, as pygr.Data will automatically save the NLMSA's seqDict as part of that process. The seqDictP file format is a pygr.Data-aware pickle; that is, references to any pygr.Data resources will simply be saved by their pygr.Data IDs, and loaded in the usual pygr.Data way.

Alignment Usage Methods:

__getitem__( s1)
get the alignment slice for the sequence interval s1, i.e. get an NLMSASlice object representing the set of intervals aligned to s1. You can also use a regular Python slice object using integer indices ie. nlmsa[1:45], in which case, it gets the NLMSA slice corresponding to that region of the LPO coordinate system.

doSlice( s1)
If you subclass NLMSA and provide a doSlice method, the NLMSA will call your doSlice(seq) method to find alignment results for seq, instead of querying its stored alignment data. You can thus use this to provide an NLMSA interface around virtually any source of alignment information that you have. To see an example, see the xnestedlist.NLMSAClient class.

2.3.5 dump_textfile, textfile_to_binaries

These two functions enable you to dump a constructed NLMSA binary database to a platform-independent text format, and to restore an NLMSA binary database from this text format. This can be useful for

dump_textfile( pathstem,outfilename=None,verbose=True)
Dumps a text representation of an existing NLMSA binary database. pathstem must be the path to the NLMSA. For example if you have an NLMSA database index file /loaner/hg17_NLMSA/hg17_msa.idDict (and many other index files with different suffixes), then you would supply a pathstem value of /loaner/hg17_NLMSA/hg17_msa.

outfilename gives the path for the output text file into which the NLMSA database will be dumped. If None, it will default to pathstem with a .txt suffix added.

Setting verbose=False will prevent printing of warning messages to stderr (for details about possible warnings, see below).

Note: dump_textfile attempts to save information about the seqDict (or, alternatively, the PrefixUnionDict dictionary of multiple sequence databases), using their pygr.Data IDs if possible. Specifically, for a PrefixUnionDict (i.e. multiple sequence databases in one NLMSA), it saves a dictionary of the prefixes for each sequence database in the NLMSA, with its pygr.Data ID if it has one. Assigning a pygr.Data ID to each sequence database has the great advantage that the reconstruction method textfile_to_binaries() can simply request pygr.Data for these IDs on the destination machine, automatically. By contrast, if a sequence database has no pygr.Data ID, the user will have to supply that sequence database manually on the destination machine. In this case, dump_textfile will print a warning message to stderr explaining what the user must do. This provides yet another reason why it's a good idea to assign a pygr.Data ID to any sequence database that is a well-defined, commonly used public resource.

textfile_to_binaries( filename,seqDict=None,prefixDict=None)
Creates an NLMSA binary database from input text file filename. The NLMSA binary database will be created in the current directory, and will be given the same name as it originally had prior to being dumped to text. Since no build is required, this function does not require significant amounts of RAM memory.

Handling of sequence databases: textfile_to_binaries will attempt to obtain any needed sequence databases using their pygr.Data ID if assigned. If you obtain a PygrDataNotFoundError, this simply means that one of the pygr.Data IDs was not found in any of your pygr.Data resource databases. In this case, you must either add it to one of your resource databases, or add a resource database that does contain it to your PYGRDATAPATH, then re-run textfile_to_binaries.

On the other hand, if any of the needed sequence databases were NOT assigned a pygr.Data ID, then you will have to provide that sequence database(s) manually to the textfile_to_binaries() function, either via its seqDict argument (if the NLMSA contains only one sequence database), or via its prefixDict argument (if the NLMSA contains multiple sequence databases). If you do not do so, an appropriate error will be raised, explaining what you need to do. The prefixDict argument must be a dictionary whose keys match individual sequence database prefixes in the original NLMSA PrefixUnionDict, and whose associated values are the appropriate sequence database to use for each specified prefix. You only need to provide those sequence databases that textfile_to_binaries() is unable to obtain from pygr.Data. When in doubt, just run textfile_to_binaries() without the prefixDict argument, and it will raise an error message listing the prefixes that you need to provide.

2.3.6 xnestedlist.NLMSAServer, xnestedlist.NLMSAClient

These two classes, provided by the separate xnestedlist module, provide an XMLRPC client-server mechanism for querying NLMSA databases over a network.

NLMSAServer is constructed exactly the same as a normal NLMSA; it is a normal NLMSA with just two methods added for serving XMLRPC client requests. See the coordinator.XMLRPCServerBase reference documentation below for details about starting an XMLRPC server.

NLMSAClient provides a read-only client interface for querying data in a remote NLMSAServer. It takes two extra arguments for its constructor: url, the URL for the XMLRPC server; name, the name of the NLMSAServer server object in the XMLRPC server's dictionary. For example, to use an NLMSA stored on a remote XMLRPC server, assuming that myPrefixUnion stores a dictionary of all the sequence databases used by that NLMSA alignment, would just be:

from pygr import xnestedlist
nlmsa = xnestedlist.NLMSAClient(url='http://leelab.mbi.ucla.edu:5000',
                                name='ucsc17',seqDict=myPrefixUnion)

2.3.7 NLMSASlice

A temporary object created on-the-fly to represent (an interface to provide information about) the portion of the alignment associated with a specific sequence interval. This is the main class for querying information about alignments, and provides a number of useful methods for getting detailed information about alignment relationships.

In addition, the NLMSASlice is the basic unit of sequence caching control, by which you can ensure that pygr alignment analysis accesses sequence databases in the most efficient way. Here's how it works:

An NLMSASlice acts like a dictionary whose keys are sequence intervals that are aligned to this region, and whose values are Seq2SeqEdge objects providing detailed information about the alignment of the target interval (key) to the source interval (the sequence interval used to create the NLMSASlice in the first place). You can use this dictionary interface in several ways:

__iter__( )
iterates over all sequence intervals that have a 1:1 mapping (i.e. a block of alignment containing no indels) to all or part of the source interval.

keys( maxgap=0, maxinsert=0, mininsert= 0, filterSeqs=None, mergeMost=False, mergeAll=False, maxsize=500000000, minAlignSize=None, maxAlignSize=None, pIdentityMin=None, ivalMethod=None, sourceOnly=False, indelCut=False, seqGroups=None, minAligned=1, pMinAligned=0., seqMethod=None, **kwargs)
Provides a more general interface than iter(), with two types of group-by capabilities, ``group-by'' operations on the alignment intervals contained within this slice (``horizontal'' grouping), and on the sets of sequences aligned to this slice (``vertical'' grouping).

1. ``group-by'' operations on the alignment intervals contained within this slice. It allows the user to supply various parameters for controlling when alignment intervals will be merged or split in the results that it returns.

mergeAll forces it to combine intervals of a given sequence irrespective of the size of gaps or inserts separating them.

mergeMost forces it to combine intervals of a given sequence, within reason (but don't merge a whole chromosome if you get one interval from one end and one interval from the other end: maxgap=maxinsert=10000, mininsert=-10, maxsize=50000).

maxgap sets the maximum gap size for merging two adjacent intervals. If the target sequence for the two alignment intervals has a gap longer than maxgap letters between the two alignment intervals, they will be returned as separate intervals; otherwise they will be merged as a single alignment region.

maxinsert sets the maximum length of insert in the target sequence that allows to adjacent intervals to be merged as a single alignment region in the results.

mininsert is specifically for handling alignments that may have small ``cycles'' (due to slight inconsistencies in the reported alignment intervals, for example, if a portion of sequence can align at both the end of one interval or at the beginning of another, and the intervals are actually added to the NLMSA that way, then the start of the second interval will actually be before the stop of the first interval; this corresponds to a negative insert value). A mininsert value of zero (the default), prevents any such interval pairs from being merged. Giving a negative mininsert value will allow interval pairs whose insert value is greater than or equal to this value, to be merged.

maxsize: upper bound on maximum size for interval merging.

filterSeqs, if not None, should be a dict of sequences used to filter the group-by analysis; i.e. only alignment intervals containing these sequences are considered in the analysis. More specifically, filterSeqs can be used to mask the group-by analysis to a specific interval of a sequence, by having filterSeqs return only the intersection between the interval it is passed as a key, and the masking interval that it stores. If there is no overlap, it must raise KeyError. The sequence.SeqFilterDict class provides exactly this masking capability, i.e.

d = sequence.SeqFilterDict(someIntervals)
overlap = d[ival] # RETURNS INTERSECTION BETWEEN ival AND someIntervals, OR KeyError
minAlignSize if not None, sets a minimum size for filtering the resulting alignment regions. Regions smaller than the specified size will be culled from the output.

maxAlignSize if not None, sets a maximum size for filtering the resulting alignment regions. Regions larger than the specified size will be culled from the output.

pIdentityMin if not None, sets a minimum fractional sequence identity for filtering the resulting alignment regions. Regions with lower levels of identity will be clipped from the output. Specifically, within each region, the largest contiguous segment (possibly including indels, if permitted by maxgap and maxinsert) whose sequence identity is above the threshold will be returned (but only if it is larger than minAlignSize if set).

ivalMethod, if not None, allows the user to provide a Python function that performs interval grouping. Specifically it is called as ivalMethod(l, ns,msaSlice=self, **kwargs), where l is the list of intervals for NLMSASequence ns within the current slice msaSlice; all other args are passed as a dict in kwargs.

2. merge groups of sequences using "vertical" group-by rules. seqGroups: a list of one or more lists of sequences to group. If None, the whole set of sequences will be treated as a single group. Each group will be analyzed separately, as follows:

sourceOnly: output intervals will be reported giving only the corresponding interval on the source sequence; redundant output intervals (mapping to the same source interval) are culled. Has the effect of giving a single interval traversal of each group.

indelCut: for sourceOnly mode, do not merge separate intervals that the groupByIntervals analysis separated due to an indel).

minAligned: the minimum number of sequences that must be aligned to the source sequence for masking the output. Regions below this threshold are masked out; no intervals will be reported in these regions.

pMinAligned: the minimum fraction of sequences (out of the total in the group) that must be aligned to the source sequence for masking the output.

seqMethod: you may supply your own function for grouping. Called as seqMethod(bounds,seqs,**kwargs), where bounds is a sorted list of (ipos,isStart,i,ns,isIndel,(start,end,targetStart,targetEnd)) and seqs is a list of sequences in the group. Must return a list of (sourceIval,targetIval). See the docs.

iteritems( **kwargs)
same keys as iter, but for each provides the source interval to target interval mapping (Seq2SeqEdge). Uses same group-by arguments as keys().

edges( **kwargs)
same interval mappings as iteritems, but for each provides a tuple of three objects: the source interval, the corresponding target interval, and the Seq2SeqEdge providing detailed information about the alignment between the source and target intervals (such as percent identity, etc.). Uses same group-by arguments as keys().

__getitem__( s1)
treats s1 as a key (target sequence interval), and returns an Seq2SeqEdge object providing detailed information about the alignment between this target interval and the source interval.

__len__( )
returns the number of distinct sequences that are aligned to the source interval. Note: this is NOT necessarily equal to the number of items that will be returned by the above iterators, since a single target sequence might have multiple 1:1 intervals of alignment to the source interval, due to indels.

In addition to these standard dictionary methods, NLMSASlice provides several additional methods and attributes:

split( **kwargs)
this method provides a way to perform group-by operations on the slice; the output of split() is one or more NLMSASlice objects; if the group-by analysis results in no splitting of the current slice, then it is returned unchanged (i.e. the method just returns self). Uses same group-by arguments as keys(). For further details on group-by operations, see keys() above.

regions( **kwags)
performs the same group-by analysis as split(), but replaces the source interval by the corresponding interval in the LPO. The main practical consequence of this is that target sequence inserts are included in the resulting slice (because they are present in the LPO interval corresponding to the original source interval), whereas they were NOT included in the original slice (because they are not aligned to the source interval). The main place where this matters is in graph traversal of the slice's letters attribute: whereas the nodes and edges corresponding to these inserts are not considered to be part of the letters graph for the original slice, they are part of the LPO slice. Also, the ``source interval'' in any subsequent operations with the LPO slice will be LPO coordinates instead of subintervals of the original source sequence interval. Uses same group-by arguments as keys().

groupByIntervals( maxgap=0, maxinsert=0, mininsert= 0, filterSeqs=None, mergeMost=False, maxsize=500000000, mergeAll=True, ivalMethod=None, pIdentityMin=None, minAlignSize=None, maxAlignSize=None,**kwargs)
This method performs the interval grouping analysis for all the iterators described above. Users will not need to call it directly. Its arguments are described above (see keys()). It returns a dictionary whose keys are sequences aligned to this slice (represented by their integer nlmsa_id), and whose values are the list of intervals produced by the group-by analysis for the corresponding sequence. The values are tuples of the form (source_start, source_stop, target_start, target_stop), showing the mapping of a source sequence interval onto a target sequence interval. This dictionary is the primary input to the groupBySequences() method below.

filterIvalConservation( seqIntervals,pIdentityMin=None,filterFun=None,**kwargs)
This method is used by groupByIntervals() to filter the results using the specified filterFun filter function, which should either return None if the specified alignment region does not pass the filter, or return the filtered interval. For an example filter function, see conservationFilter, which is used by default in filterIvalConservation. seqIntervals must be passed in the same format as expected by groupBySequences; it is modified in place by filterIvalConservation, which always returns None.

conservationFilter( seq,m,pIdentityMin=None,minAlignSize=None,maxAlignSize=None,**kwargs)
Tests an alignment mapping m for the specified size and sequence identity criteria. Returns the (possibly clipped) interval m if the criteria are met, and None if the criteria are not met. m is expected to be a tuple of integers (srcStart,srcEnd,destStart,destEnd). seq must be the destination sequence object (sliceable by the destination interval coordinates). The conservation criteria and clipping are performed using Seq2SeqEdge.conservedSegment().

groupBySequences( seqIntervals, sourceOnly=False, indelCut=False, seqGroups=None, minAligned=1, pMinAligned=0., seqMethod=None, **kwargs)
This method performs the sequence grouping analysis for all the iterators described above. seqIntervals must be a dictionary of sequences and their associated list of intervals (produced by groupByIntervals() above). It returns a list of output sequence intevals, which is either a list of source sequence intervals (sourceOnly mode), or a list of tuples of the form (source_interval, target_interval).

matchIntervals( seq=None)
this method returns the set of 1:1 match intervals for the target sequence seq (or all aligned sequences, if seq is None), as a dictionary whose keys are target sequence intervals, and whose values are the corresponding source sequence intervals to which they are aligned.

findSeqEnds( seq)
returns the largest possible interval of seq that is aligned to this slice, i.e. it merges all alignment intervals in this slice containing seq, and returns the merged sequence interval based on the minimum start value and maximum stop value found.

2.3.8 NLMSASliceLetters

represents the letters graph of a specific NLMSASlice. It is a graph whose nodes are the NLMSANode objects in this slice, and whose edges are sequence.LetterEdge objects. Note: currently the edge objects are just returned as None - please implement!

This graph has the following methods:

__iter__( )
generates all the nodes in the slice, in order from left to right.

items( )
also iteritems(). Generate the same set of nodes as above, as keys, but for each also returns a value representing its outgoing directed edges (see getitem, below).

__getitem__( node)
gets a dictionary indicating all the outgoing directed edges from node to subsequence nodes, whose keys are the target nodes, and whose edges are the sequence.LetterEdge objects representing each edge.

2.3.9 NLMSANode

A temporary object (created on-the-fly) representing a single letter ``column'' in the alignment. It acts like a container of the sequence letters aligned to the source sequence in this column. It has the following methods:

__iter__( )
generates all the individual sequence letters (as SeqPath intervals, presumably of length 1) that are aligned to the source sequence, in this column of the alignment.

edges( )
generates the same list of of target sequence letters as the iterator, but as a tuple of (target letter, source letter, edge). Currently, edge is just None.

__len__( )
returns the number of distinct sequences aligned to the source interval, in this column.

Other, internal methods that regular users are unlikely to need:

getSeqPos( seq)
returns the sequence interval of seq that is aligned to this column, or raises KeyError if it is not aligned here.

getEdgeSeqs( node2)
returns a dictionary of sequences that traverse the edge directly from this node to node2, i.e. if letter i of seq is aligned to this node, then letter letter i+1 is aligned to node2. The dictionary's keys are top-level sequence objects, and its value for each is the letter position index i as defined above.

nodeEdges( )
returns a dictionary of the outgoing edges from this node, whose keys are target nodes, and whose values are the corresponding edge objects (of type sequence.LetterEdge).

2.3.10 NLMSASequence

You are unlikely to need to manipulate NLMSASequence objects directly; they perform the back-end work for accessing the nested list disk storage of the alignment of the associated sequence.

However, one thing you should know is that for a sequence to be stored in a NLMSA, it needs to have a unique string identifier. NLMSASequence obtains a string identifier for the sequence in one of the following ways (in decreasing order of precedence): 1) the sequence ``object'' can itself just be a Python string, in which case that string is used as the identifier. 2) otherwise, the object should be a SeqPath instance. If it has a name attribute, that will be used as the identifier. 3) Otherwise, if it has a id attribute (which is present by default on sequence.Sequence objects), that will be used.


2.4 Seqdb Module

Pygr interface to sequence databases stored in FASTA, BLAST or relational databases.

2.4.1 Overview

The seqdb module provides a simple, consistent interface to sequence databases from a variety of different storage sources such as FASTA, BLAST and relational databases. Sequence databases are modeled (like other Pygr container classes) as dictionaries, whose keys are sequence IDs and whose values are sequence objects. Pygr sequence objects use the Python sequence protocol in all the ways you'd expect: a subinterval of a sequence object is just a Python slice (s[0:10]), which just returns a sequence object representing that interval; the reverse complement is just -s; the length of a sequence is just len(s); to obtain the actual string sequence of a sequence object is just str(s). Pygr sequence objects work intelligently with different types of back-end storage (e.g. relational databases or BLAST databases) to efficiently access just the parts of sequence that are requested, only when an actual sequence string is needed.

2.4.2 External Requirements

This module makes use of several external programs:

If you are lacking one or more of these requirements, you can still install Pygr and use all Pygr functionality that does not depend on the missing requirements. If you try to use a function for which a requirement is missing, Pygr will raise an appropriate exception (e.g. unable to run blastall).

2.4.3 BlastDB

Interface to an existing BLAST database or FASTA sequence file; in the latter case, it will automatically construct BLAST database files for you using the NCBI tool formatdb. Here's a simple example of opening a BLAST database and searching it for matches to a specific piece of sequence:

from pygr.sequence import *
from pygr.seqdb import *
db = BlastDB('sp') # OPEN SWISSPROT BLAST DB
s = Sequence(str(db['CYGB_HUMAN'][40:-40]),'boo')
m = db.blast(s) # DO BLAST SEARCH
myg = db['MYG_CHICK']
for i in m[s][myg]:
    print repr(i.srcPath),repr(i.destPath),i.blast_score,i.percent_id

Let's go through this example line by line:

Options for constructing a BlastDB:

BlastDB( filepath=None,skipSeqLenDict=False,ifile=None,idFilter=None, blastReady=False,blastIndexPath=None,blastIndexDirs=None,**kwargs)
Open a sequence file as a ``database'' object, giving the user access to its sequences, easy searching via blast or megablast, etc. skipSeqLenDict prevents construction of a sequence length index file (which will be named ``filepath.seqlen'') and a fast sequence access file (which will be named ``filepath.pureseq''). Setting this option to True can be useful if you either wish to speed up initial opening of the BlastDB (note: construction of these indexes is only a one-time event) or avoid the extra disk space required by these indexes. Explanation: To facilitate the rapid creation of sequence objects (which requires the length of the sequence), it creates a sequence length index (as a Python shelve). This enables it to avoid actually loading the sequence string into memory each time a sequence object is created; instead it just looks up the sequence length. While this speeds up access to genomic sequence databases (where each sequence tends to be extremely long), this initial step may be slow for databases of short sequences. Setting skipSeqLenDict option to True, will prevent construction of this sequence length index. ifile lets you open the database directly from a file object rather than a filename. If you have a file object, you can pass it directly to BlastDB instead of a filepath. NB: the BlastDB() constructor will close ifile when it is done reading from the file object. idFilter allows you to provide a function for re-mapping the FASTA sequence identifiers read from the sequence file. This can be useful in the case of NCBI FASTA files, since NCBI often treats the sequence ID as a ``blob'' into which any number of database fields can be stuffed, rather than a true ID. blastReady option specifies whether BLAST index files should be automatically constructed (using formatdb). Note, if you attempt to run the blast() method, it will automatically create the index files for you if they are missing. blastIndexPath, if not None, specifies the path to the BLAST index files for this database. For example, if the BLAST index files are /some/path/foo.psd etc., then blastIndexPath='/some/path/foo'. blastIndexDirs, if not None, specifies a list of directories in which to search for and create BLAST index files. Entries in the list can be either a string, or a function that takes no parameters and returns a string path. A string value ``FILEPATH'' instructs it to use the filepath of the FASTA file associated with the BlastDB. The default value of this attribute on the BlastDB class is
['FILEPATH',os.getcwd,os.path.expanduser,pygr.classutil.default_tmp_path]
This corresponds to: self.filepath, current directory, the user's HOME directory, and the default temporary directory used by the Python function os.tempnam().

Useful methods:

iter( )
iterate over all IDs in the BLAST database.

len( )
returns number of sequences in the BLAST database.

blast( seq, al=None, blastpath='blastall', blastprog=None, expmax=0.001, maxseq=None, opts='', verbose=True)
run a BLAST search on sequence object seq. maxseq will limit the number of returned hits to the best maxseq hits. al if not None, must be an alignment object in which you want the results to be saved. Note: in this case, the blast function will not automatically call the alignment's build() method; you will have to do that yourself. blastpath gives the command to run BLAST. blastprog, if not None, should be a string giving the name of the BLAST program variant you wish to run, e.g. 'blastp' or 'blastn' etc. If None, this will be figured out automatically based on the sequence type of seq and of the sequences in this database. expmax should be a float value giving the largest ``expectation score'' you wish to allow homology to be reported for. opts allows you to specify arbitrary command line arguments to the BLAST program, for controlling its search parameters. verbose=False allows you to switch off printing of explanatory messages to stderr.

megablast( seq, al=None, blastpath='megablast', expmax=1e-20, maxseq=None, minIdentity=None, maskOpts='-U T -F m', rmPath='RepeatMasker', rmOpts='-xsmall', opts='', verbose=True)
first performs repeat masking on the sequence by converting repeats to lowercase, then runs megablast with command line options to prevent seeding new alignments within repeats, but allowing extension of alignments into repeats. In addition to the blast options (described above), minIdentity should be a number (maximum value, 100) indicating the minimum percent identity for hits to be returned. rmPath gives the command to use to run RepeatMasker. rmOpts allows you to give command line options to RepeatMasker. The default setting causes RepeatMasker to mark repetitive regions in the query in lowercase, which then works in concert with the maskOpts option, next. maskOpts gives command line options for controlling the megablast program's masking behavior. The default value prevents megablast from using repetitive sequence as a seed for starting a hit, but allows it to propagate a regular (non-repetitive hit) through a repetitive region.

__invert__( )
The invert operator (~, the ``tilde'' character) enables reverse-mapping of sequence objects to their string ID.
id = (~db)[seq] # GET IDENTIFIER FOR THIS SEQUENCE FROM ITS DATABASE

formatdb( filepath=None)
Forces the BlastDB to construct new BLAST index files, either at the location specified by filepath, if not None, or in the first directory in the blastIndexDirs list where the index files can be succesfully built. Index files are generated using the formatdb program provided by NCBI, which must be in your PATH for this method to work.

Useful attributes:

2.4.4 BlastDBXMLRPC

A subclass of BlastDB that adds a couple methods needed to serve the data to clients connecting over XMLRPC. For example, to make an XMLRPC server for a blast database, accessible on port 5020:
import coordinator
server = coordinator.XMLRPCServerBase(name,port=5020)
db = BlastDBXMLRPC('sp') # OPEN BlastDB AS USUAL, BUT WITH SUBCLASS
server['sp'] = db # ADD OUR DATABASE TO THE XMLRPC SERVER
server.serve_forever() # START SERVING XMLRPC REQUESTS, UNTIL KILLED.

2.4.5 XMLRPCSequenceDB

Class for a client interface that accesses a Blast database over XMLRPC (from the the BlastDBXMLRPC acting as the server).
__init__( url,name)
url must be the URL (including port number) for accessing the XMLRPC server; name must be the key of the BlastDBXMLRPC object in that server's dictionary (in the example above, it would be 'sp'). Thus to access the server above (assuming it is running on leelab port 5020):
db = XMLRPCSequenceDB('http://leelab:5020','sp')
hbb = db['HBB_HUMAN'] # GET A SEQUENCE OBJECT FROM THE DATABASE...
Currently, this class provides sequence access. You can work with sequences exactly as you would with a BlastDB, but cannot perform actual BLAST searches (i.e. the blast and megablast methods don't work over XMLRPC).

2.4.6 FileDBSequence

The default class for sequence objects returned from BlastDB. It provides efficient, fast access to sequence slices (subsequences). When the BlastDB is initially opened, Pygr constructs a length and offset index that enables FileDBSequence to seek() to the correct location for any substring of the sequence. New in Pygr 0.4.

2.4.7 BlastSequence

This was previously the default class for sequence objects returned from BlastDB, but has been deprecated because we found that NCBI fastacmd was much too slow and consumed enormous amounts of memory. BlastSequence relies on fastacmd for ``fast'' access to individual sequence slices. The advantage is that it only requires BLAST database files (produced by Pygr using formatdb), whereas the new FileDBSequence requires a specially indexed sequence file (constructed by default by BlastDB), which may be a disadvantage if you are low on disk space.

BlastSequence has several optimizations for working with BLAST databases:

2.4.8 AnnotationDB

This class provides a general interface for sequence annotation databases. This interface follows several principles:
__init__( sliceDB,seqDB,annotationType=None,itemClass=AnnotationSeq,itemSliceClass=AnnotationSlice,itemAttrDict=None,sliceAttrDict=dict(),filename=None,mode='r')
Constructs an annotation database using several arguments:

sliceDB, a database that takes an annotation ID as a key, and returns a slice information object with attributes that give the sequence ID and start/stop coordinates of the sequence interval representing the annotation, and any other information about the annotation. In general, any attribute on the slice information object, will also be accessible on the corresponding annotation object and slices derived from it.

You can give None as the sliceDB, in which case the AnnotationDB will create one for you, either using an in-memory dictionary, or by opening a Python shelve file if you provide the filename argument; see below.

seqDB, a sequence database that takes a sequence ID as a key, and returns a sequence object.

annotationType should be a string identifier for the type of annotation. This will be propagated to all annotation objects / slices derived from this annotation database.

itemClass: the class to use for constructing an annotation object to be returned from the AnnotationDB.__getitem__. You can extend the behavior of annotation objects by subclassing AnnotationSeq. If the AnnotationDB participates in important schema relations, pygr.Data may add properties to the itemClass that implement its schema relations to other database containers. (See the reference docs on pygr.Data below for details).

itemSliceClass: the class to use for slices of annotation objects returned from the AnnotationDB.__getitem__. You can extend the behavior of annotation objects by subclassing AnnotationSlice. If the AnnotationDB participates in important schema relations, pygr.Data may add properties to the itemSliceClass that implement its schema relations to other database containers. (See the reference docs on pygr.Data below for details).

sliceAttrDict, a dictionary providing the attribute name aliases for attributes on annotation objects to access attributes or tuple values in the sliceInfo objects. The minimal required attributes are the sequence ID, start and stop coordinates in each object returned from sliceDB. For example,

sliceAttrDict = dict(id='chromosome',start='gen_start',stop='gen_stop')
would make it use s.chromosome,s.gen_start,s.gen_stop as the ID and interval coordinates for each slice information object s. Note: the start,stop coordinates should follow the SeqPath sign convention, i.e. positive coordinates mean an interval on the positive strand, and negative coordinates mean an interval on the negative strand (i.e. the reverse complement of the positive strand. See the reference documentation on SeqPath above for details).

If the sliceAttrDict (or sliceInfo object directly) provides a orientation attribute, it will be used to be change positive intervals to negative intervals if the orientation attribute is negative. This gives the user an alternative method to represent orientation: give all coordinates in positive orientation (positive integer values), and give an orientation attribute that is a negative value if the interval should be reversed (to negative orientation).

If a sliceAttrDict value is an integer, then it will not be treated as an attribute name, but instead will be used as an index, treating the sliceInfo object as a tuple. This makes it possible to use a sliceDB whose items are tuples. Here's an example:

exon_db = AnnotationDB(exon_slices, db,
                       sliceAttrDict=dict(id=0, orientation=3, # GIVE ATTR INTERFACE TO 2PLE
                                          transcript_id=4, start=5, stop=6))
Additional tuples values beyond the required id,start,stop attributes may be used to provide additional informative attributes for the individual annotations.

filename, if not None, indicates a Python shelve file to store the sliceDB info. It will be opened according to the mode argument; see the Python shelve docs for details. Note: if you write data to an AnnotationDB stored using a shelve, you must call its close() method to ensure that all data is saved to the Python shelve file!

__getitem__( id)
Get the annotation object with primary key id. This annotation object is both a sequence interval (representing the region of sequence that it annotates, e.g. for an exon, the region of genomic sequence that constitutes that exon), and also an annotation (i.e. it may have additional attributes from the slice information object, that give useful information about this annotation).

Note: to save new annotations to the AnnotationDB, use either of the following two methods, instead of __setitem__, which is not permitted (because there would be no way of guaranteeing that the annotation object provided by the user could be stored persistently).

new_annotation( k,sliceInfo)
Use this method to save new annotations to an AnnotationDB, instead of using annoDB[k] = v, which is not permitted. Creates a new annotation with ID k, based on sliceInfo, which must provide a sequence ID, start, stop, either by attribute names or integer indices (as specified by the sliceAttrDict), and any addition attributes that we want to associate with this annotation. sliceInfo is saved in the AnnotationDB's sliceDB. Returns an annotation object associated with sliceInfo.

add_homology( seq,search='blast',id=None,idFormat='%s_%d',autoIncrement=False,maxAnnot=999999,maxLoss=None,sliceInfo=None,**kwargs)
Search for homology to seq in the sequence database self.seqDB using the named method specified by the search argument, and filtered using the NLMSASlice.keys() function, and store them as new annotations in the annotation database.

seq can be a string or sequence object or slice.

search can be a string, in which case it will be treated as an attribute name for a method on self.seqDB to run the homology search. Alternatively, search must be a function that runs the homology search. Either way, the search function must take a sequence object as its first argument, and optional keyword arguments for controlling its search parameters. Note: since both searching and filtering keyword arguments are passed as a single dictionary, the function should not die on unexpected keyword arguments. The function must return an alignment object (e.g. NLMSA).

id if not None, will be used as the annotation ID. Otherwise, the seq.id will be used as the annotation ID.

idFormat controls the generation of ID strings for cases where multiple hits pass the search and filter criteria. It simply appends an integer counter to the id.

autoIncrement=True forces it to generate its own integer IDs for each new annotation.

maxAnnot specifies the maximum numbers of hits that will be processed for seq. If the number of hits passing both search and filter criteria exceed this number, a ValueError will be raised.

maxLoss if not None, must be an integer indicating the maximum number of residues that can be missing from the alignment to seq to be acceptable as an annotation.

sliceInfo if not None, will be appended to the (id,start,stop) tuple that is saved for each annotation. This enables you to add annotation attributes, by giving a sliceAttrDict setting to your AnnotationDB constructor that defines these additional attributes. Note: add_homology() saves each annotation as a slice tuple to self.sliceDB, in the form: (id,start,stop)+sliceInfo.

You can (and should) specify many additional arguments for controlling the homology search, and results filtering. For the former, see the list of arguments for BlastDB.blast() and BlastDB.megablast(). For the latter, see the list of arguments for NLMSASlice.keys().

add_homology() returns a list of the annotation objects created as a result of the homology search.

close( )
You must call this method to ensure that any data added to the AnnotationDB will be written to its Python shelve file on disk. This method is irrelevant, but harmless, if you are instead using an in-memory dictionary as storage.

2.4.9 VirtualSeq

This class provides an empty sequence object that acts purely as a reference system. Automatically elongates if slice extends beyond current stop. This class avoids setting the stop attribute, taking advantage of SeqPath's mechanism for allowing a sequence to grow in length.
s = VirtualSeq('FOOG_HUMAN')
len(s) # ONLY 1 LETTER LONG BY DEFAULT
s1 = s[100:215] # GET A SLICE OF THIS SEQUENCE
len(s) # NOW IT'S 215

The associated VirtualSeqDB class provides a ``sequence database'' that returns a VirtualSeq object for every identifier requested of it. It acts like a Python dictionary:

db = VirtualSeqDB()
s = db['FOOG_HUMAN'] # ASK FOR A SEQUENCE BY ITS IDENTIFIER
s1 = s[100:215] # GET A SLICE OF THIS SEQUENCE
For a given identifier it always returns the same VirtualSeq object (i.e. the object returned from the first request for that identifier). In other words, if the identifier was previously requested, it returns the VirtualSeq for that identifier; if not, it creates a new one. This can be convenient when performing operations that just need a coordinate reference system, not actual sequence.

2.4.10 PrefixUnionDict

This class acts as a wrapper for a set of dictionaries, each of which is assigned a specific string ``prefix''. It provides a dictionary interface that accepts string keys of the form ``prefix.suffix'', and returns d['suffix'] where d is the dictionary associated with the corresponding prefix. This is useful for providing a unified ``database interface'' to a set of multiple databases.
hg17 = BlastDB('/usr/tmp/ucsc_msa/hg17')
mm5 = BlastDB('/usr/tmp/ucsc_msa/mm5')
... # LOAD A BUNCH OF OTHER GENOMES TOO...
genomes = {'hg17':hg17,'mm5':mm5, 'rn3':rn3, 'canFam1':cf1, 'danRer1':dr1,
'fr1':fr1, 'galGal2':gg2, 'panTro1':pt1} # PREFIX DICTIONARY FOR THE UNION 
					 # OF ALL OUR GENOMES
genomeUnion = PrefixUnionDict(genomes)
ptChr7 = genomeUnion['panTro1.chr7'] # GET CHIMP CHROMOSOME 7

if 'panTro1.chr5' in genomeUnion: # CHECK IF THIS ID IN OUR UNION
    pass # DO SOMETHING...

s = -(ptChr7[1000:2000]) # GET A BIT OF THIS SEQUENCE
if s in genomeUnion: # THIS IS HOW TO CHECK IF s DERIVED FROM OUR UNION
    pass # DO SOMETHING...

It provides a __contains__ method that tests whether a given sequence object is derived from the PrefixUnionDict (see example above). Here are some additional methods:

__init__( prefixDict=None,separator='.',filename=None,dbClass=BlastDB)
You can create a PrefixUnionDict either using a prefixDict (whose keys are string prefixes, and whose values are sequence databases), or using a previously created header file filename. Using the header file, the constructor will automatically open all the sequence databases for you. When opening from a header file, you can also specify a dbClass to be used for opening individual sequence databases listed in the header file; the default is BlastDB. The database class constructor must take a single argument, which is the filepath for opening the database. The separator character is used to form ``prefix.suffix'' identifiers.

__invert__( )
The invert operator (~, the ``tilde'' character) enables reverse-mapping of sequence objects to their string ID. This is the recommended way to get the ``fully qualified sequence ID'', i.e. with the appropriate prefix prepended.
id = (~db)[seq] # GET PROPERLY PREFIXED-IDENTIFIER FOR THIS SEQUENCE
For a given sequence object seq derived from the union (or a slice of a sequence from the union), return a string identifier in the form of ``foo.bar''.

getName( path)
This method is deprecated; instead use the __invert__ operator above.

writeHeaderFile( filename)
THIS METHOD IS DEPRECATED, because it is restricted to assuming that all sequence dictionaries it contains are of a single class. We recommend that you instead save it to pygr.Data, or pickle it directly using pygr.Data.dumps().

Save a header file for this union, to reopen later. It saves the separator character, and a list of prefixes and filepaths to the various sequence databases (which must have a filepath attribute). This header file can be used for later reopening the prefix-union in a single step.

newMemberDict( )
Returns a new member dictionary for testing membership in the distinct prefix groups. See PrefixUnionMemberDict.

cacheHint( owner,ivalDict)
Communicates a set of caching hints to the appropriate member databases. ivalDict must be a dictionary whose keys are sequence ID strings, and whose values are each a (start,stop) tuple for the associated covering interval coordinates to cache for each sequence. owner should be a python object whose existence controls the lifetime of these cache hints. When owner is garbage-collected by Python (due to its reference count going to zero), the member databases will clear these cache hints from their cache storage.

On PrefixUnionDict, this method simply passes along the cache hints to the appropriate member databases by calling their cacheHint method, without itself doing anything to cache the information.

2.4.11 PrefixUnionMemberDict

Implements membership testing on distinct prefix groups. Specifically, you can bind a given prefix to a value
d['prefix1'] = value
then test whether a given object k is a member of any of the prefix groups in the dictionary:
v = d[k] # raises KeyError if k not a member of 'prefix1' or other prefix group in d

__init__( puDict,default=None,attrMethod=lambda x:x.pathForward.db)
puDict must be a PrefixUnionDict, whose prefix groups constitute the allowed possible key groups for this membership dictionary. default provides a default value to apply to any key whose prefix has not been explicitly given a value in this dictionary. If no default is set, this dictionary will raise a KeyError for any key whose prefix has not been explicitly given a value in this dictionary. attrMethod specifies a method for obtaining the actual prefix container object from a given member key object. The default attrMethod treats the key as a sequence object and tries to determine what database container it is from.

possibleKeys( )
Returns an iterator for the key values (prefix strings) that are allowed for this dictionary, obtained from the bound PrefixUnionDict.

2.4.12 PrefixDictInverse

Provides the interface to the inverse mapping of the PrefixUnionDict.
__getitem__( k)
Returns the fully-qualified string ID for sequence object k. Properly handles both sequence annotation object and regular sequence objects.

2.4.13 PrefixDictInverseAdder

Adds the capability of automatically adding new sequence databases to the PrefixUnionDict, if needed. This is implemented by extending the standard __getitem__ method:
__getitem__( k)
Returns the fully-qualified string ID for sequence object k. Properly handles both sequence annotation object and regular sequence objects. If sequence object k is from a sequence database that is not in the PrefixUnionDict, it will be automatically added to the prefixUnion, if the prefixUnion has an addAll attribute set to True; if not, a KeyError is raised. This is used in the standard NLMSA write mode 'w' to allow users to add sequences to the alignment without having to previously add the sequence databases containing those sequences, to the prefixUnion for the NLMSA.

2.4.14 BlastSequenceCache

This class is deprecated; the FileDBSequence class and associated database container caching mechanisms provide a more powerful mechanism that is intended to replace BlastSequenceCache.

Implements a variant of BlastSequence designed to merge and cache requests for local intervals of sequence so that repeated accesses to these regions are bundled and cached for efficiency. You work with sequence objects of this type normally, using Python slicing to obtain subintervals, and str() to get the sequence string for a subinterval. But behind the scenes, it does two things:

2.4.15 SQLSequence

Implements a subclass inheriting from SQLRow and SequenceBase, to use a relational database table to obtain the actual sequence. There are three minor variants DNASQLSequence, RNASQLSequence, ProteinSQLSequence (so that the sequence does not have to analyze itself to determine what kind of sequence it is). Its constructor takes the same arguments as SQLRow(table, id), where table is the SQLTable object representing the table in which the sequence is stored, and id is the primary key of the row representing this sequence. However, normally this class is simply passed to the Table object itself so that it will use it to instantiate new row objects whenever they are requested via its dictionary interface. Here's a simple example:

class YiProteinSequence(ProteinSQLSequence): # CREATE A NEW SQL SEQUENCE CLASS
    def __len__(self): return self.protein_length  # USE LENGTH STORED IN DATABASE
protein = jun03[protein_seq_t] # protein IS OUR SQLTable OBJECT REPRESENTING PROTEIN SEQUENCE TABLE
protein.objclass(YiProteinSequence) # FORCE PROTEIN SEQ TABLE TO USE THIS TO INSTANTIATE ROW OBJECTS
pseq = protein['Hs.1162'] # GET PROTEIN SEQUENCE OBJECT FOR A SPECIFIC CLUSTER

Let's go through this line by line:

2.4.16 StoredPathMapping

Note: This class is deprecated; the NLMSA alignment database class provides a much more powerful interface that is intended to replace older mechanisms such as StoredPathMapping.

A second major area in Pygr is representation and query of multiple sequence alignment databases in a way that is scalable to whole genomes. We have previously showed (in our work on Partial Order Alignment) that graphs provide both a compact and algorithmically powerful way to store alignments. Combining this with "interval alignment" makes it scalable and gives a simple interface. In Pygr, alignments are just another kind of graph, whose nodes are sequence intervals, and edges are alignment relations. This provides a general-purpose facility for working with sets of sequence intervals, sequence annotation databases, and multiple sequence alignments, all queryable via Pygr graph queries. We have implemented different container subclasses to work with these data in memory or to work transparently with data stored in relational databases. The consistency and simplicity of the Pygr framework makes it a good interface both to run external tools like BLAST, and to store or query the results in persistent storage like a MySQL database.

hg17 = BlastDB('/data/ucsc/hg17') # GET CONTAINER FOR HUMAN GENOME DATABASE
bcl2m = hg17['chr22'][16544303:16588541] # GET INTERVAL WITH BCL2L13 GENE
al = hg17.megablast(mouse_bcl2,maxseq=1) # GET REPEAT-MASKED MEGABLAST ALIGNMENT, ONLY TOP HIT
al[bcl2m[1000:1100] ] += mrna[210:310] #ADD ALIGNMENT OF A 100nt SEGMENT TO mrna SEGMENT
al.storeSQL('test.table',db_cursor) # STORE COMPLETE ALIGNMENT IN RELATIONAL DATABASE
for e in MAFStoredPathMapping(bcl2m,'ucsc_maf8',u).edges(): #GET ITS MULTIGENOMEALIGNMENTS
   print str(e.srcPath),str(e.destPath) # PRINT THE ACTUAL ALIGNED SEQUENCE INTERVALS

Note: We will be unifying all sequence alignment functionality under the NLMSA interface design sometime in the near future. Specifically, the PathMapping and related classes, while similar to NLMSA, will be replaced with interfaces that are identical to the NLMSA interface.

2.4.17 SliceDB

For most applications, AnnotationDB is a better choice than this older class. This class enables you to apply ``slicing information'' from one database to sequences from a second database. For example, you could have a database that lists genes as intervals (slices) on genomic sequences stored in a BlastDB database. The only requirements are:

Both databases should raise KeyError for bad identifiers. The current SliceDB implementation caches sequence objects so that subsequent calls for the same identifier will not require repeating the database queries to the two databases. To remove a sequence object from the cache, just use del db[id] as usual.

SliceDB inherits from the builtin Python dict class, so all standard methods can be used.

db = SliceDB(sliceDB,seqDB) # CREATE OUR DATABASE
gene = db[cluster_id] # USE IT TO GET A GENE SEQUENCE...

2.4.18 Functions

The seqdb module also provides several convenience functions:

read_fasta( ifile)
a generator function that yields tuples of id,title,seq from ifile.

write_fasta( ofile, s, chunk=60, id=None)
writes the sequence s to the output file ofile, using chunk as the line width. id can provide an identifier to use instead of the default s.id.


2.5 pygr.Data Module

This module provides a simple but powerful interface for creating a ``data namespace'' in which users can access complex datasets by simply requesting the name chosen for a given dataset - much like Python's import mechanism enables users to access a specified code resource by name, without worrying about where it should be found or how to assemble its many parts. For an introduction, see the pygr.Data tutorial.

2.5.1 What kinds of data can be saved in pygr.Data?

There are a few basic principles you should be aware of:

2.5.2 pygr.Data is transactional

pygr.Data follows a transactional model: new resources added to pygr.Data are not saved to the resource database until you call pygr.Data.save(). This has several benefits:

2.5.3 pygr.Data Namespace Conventions

At this point, we're still just making this up as we go along. However, it is clearly advantageous to adopt some simple conventions that make it easy for people to use the same name for a given data resource, and to find what they're looking for. We are adopting the following conventions: Existing Area categories: You may obtain a directory of available resources available using the pygr.Data.dir() function:
>>> pygr.Data.dir('Bio.Seq.Swiss')
['Bio.Seq.Swissprot.sp42']
This returns the list of items beginning with the string you provided. Use its asDict=True argument to make it return a dictionary of matches with detailed information such as their docstring descriptions.

We suggest that you follow these conventions and extend them as needed. Please report new category names to us so we can add them to the list.

2.5.4 How does pygr.Data access resource databases?

The list of resource databases is read from the environment variable PYGRDATAPATH. If this variable is empty or missing, the default path for pygr.Data to search is the user's home directory ($HOME) and current directory, in that order. PYGRDATAPATH should be a comma separated list of ``resource path'' strings, which must be one of the following:

2.5.5 Convenience functions

save( layer=None)
Saves all pending pygr.Data additions to the resource database. If layer is not specified, each resource will be saved to the layer it was added to, or to the default layer if none was specified at the time of addition. If layer is not None, it forces all pending data to be saved specifically to that layer. You can call pygr.Data.save() multiple times with different layer values to make the same set of data (transaction) be saved to each of the specified resource databases.

rollback( )
Dumps all pending pygr.Data additions (since the last save() or rollback()) without adding them to the resource database.

list_pending( )
Returns a pair of two lists ([data],[schema]), where the first list shows newly added pygr.Data IDs that are currently pending, and the second list pygr.Data IDs that with newly added schema information pending.

addResource( id,obj,layer=None)
Add obj to pygr.Data as resource ID id, specifically within abstract resource layer if provided. Queues obj for addition to the resource database, and marks it with its _persistent_id attribute, whose value is just id. For a resource id 'A.Foo.Bar' this method is equivalent to the assignment statement
pygr.Data.A.Foo.Bar = obj
This method is provided mainly to enable writing code that automates saving of resources, e.g. via code like
for id,genome in nlmsa.seqDict.prefixDict.items(): # 1st SAVE THE GENOMES
    genome.__doc__ = 'draft genome sequence '+id
    addResource('Bio.Seq.Genome.'+id,genome)

deleteResource( id,layer=None)
Delete resource id from the resource database specified by layer if provided (or the default resource database otherwise). Also delete its associated schema information.

addSchema( name,schemaObj,layer=None)
Add a schema object for the pygr.Data resource indicated by the string passed as name, to the specified layer if provided (or the default resource database otherwise). For example:
addSchema('Bio.Genomics.ASAP2.hg17.geneExons',
          pygr.Data.OneToManyRelation(genes,exons,bindAttrs=('exons','gene')))
pygr.Data.save() # SAVE ALL PENDING DATA AND SCHEMA TO RESOURCE DATABASE
Note that schema information, like pending data, is not saved to the resource database until you call pygr.Data.save().

The pygr.Data module also provides a directory function for searching for resource names that begin with a given stem, either in all databases, or in a specific layer:

dir( prefix,layer=None,asDict=False)
get list or dict of resources beginning with the specified string. If the optional asDict argument is True, then they are returned as a dictionary whose keys are resource names, and whose values are their descriptions (taken from the resource object's __doc__ string). Otherwise they are returned as a list.

newServer( name,serverClasses=None,clientHost=None,withIndex=False, host=None, port=5000, excludeClasses=None, **kwargs)
Create and return a new XMLRPC server to serve all pygr.Data resources currently loaded in memory that are capable of XMLRPC client-server operation. The server name will be used for purposes of XMLRPC communication. The withIndex=True option will cause the server to also act as a pygr.Data resource database accessible via XMLRPC (i.e. add its URL to your PYGRDATAPATH environment variable, to make its resources accessible to any Python script). In this case, the server will add itself as new pygr.Data layer name, for any Python script that accesses its resource index.

serverClasses allows you to specify a list of tuples of classes that can be served via XMLRPC. Each tuple should consist of three values: (dbClass,clientClass,serverClass), where dbClass is a normal pygr class, clientClass is the class to use for the XMLRPC client version of this data, and serverClass is the class to use for the XMLRPC server of this data. If no value is provided to this option, the current default is

[(seqdb.BlastDB,seqdb.XMLRPCSequenceDB,seqdb.BlastDBXMLRPC),
 (cnestedlist.NLMSA,xnestedlist.NLMSAClient,xnestedlist.NLMSAServer)]
The clientHost option allows you to override the hostname that clients will be instructed to connect to. The default is simply the fully qualified hostname of your computer. But if, for example, you wished to access your server by port-forwarding localhost port 5000 to your server port via SSH, you could pass a clientHost='localhost' setting.

excludeClasses, if not None, should be a list of classes that should be excluded from the new server. If None, the default is [pygr.sqlgraph.SQLTableBase,pygr.sqlgraph.SQLGraphClustered], since such relational database resources are better accessed directly from the relational database server, rather than via the XMLRPC server as an intermediate step.

host, port arguments are passed to the XMLRPCServerBase constructor. For details see that section below.

Once you create a server using this method, you start it using its serve_forever() method. If the server does not provide its own index (i.e. withIndex=False), then you should first register it to your local resource database server (so that clients of that server will know about the new services your new server is providing), by calling its register() method.

ResourceDBMySQL( tablename,createLayer=LAYERNAME)
Create a resource database in a MySQL database table. tablename is the table to use in the database, in the format ``DBNAME.TABLENAME dbinfo'', where DBNAME is the name of the database in the MySQL server, and TABLENAME is the name of the table in that database that you wish to use to store the resource database. dbinfo is optional. If provided, it must be a whitespace separated list of arguments for connecting to the MySQL server, of the form host user passwd. You can provide one, two or three of these optional arguments. If no dbinfo is provided, host, port, user and password info are obtained from your .my.cnf config file as usual for the mysql client.

To create a new table in the MySQL database (automatically initializing its schema), instead of assuming that it already exists, you must provide the createLayer argument, which is saved as the layer name of the new resource database. If pygr.Data finds that it is unable to connect to a MySQL database table specified in your PYGRDATAPATH it will print a warning message, and ignore the offending database table. It will NOT silently create a database table for you in this case. The rationale is that whereas a misspelled directory name will result in an IOError (thus allowing pygr.Data to detect a bad directory name in PYGRDATAPATH), there would be no easy way for pygr.Data to tell whether you simply mistyped the name of an existing MySQL table, or whether you actually wanted to create a new MySQL table.

Example: create a new resource database, give it the layer name ``leelab'', and register it in our list of resource databases.

rdb = pygr.Data.ResourceDBMySQL('pygrdata.index',createLayer='leelab')
Note that you must provide the createLayer argument, in order to create a new resource database table. ResourceDBMySQL will not automatically create a new table without this argument, simply because the tablename you provided does not exist. In that case, it will raise an exception to alert you to the fact that either the correct table name was not given, or the table does not exist.

dumps( obj)
Provides a pygr.Data-aware pickling service; that is, if during pickling of obj any references are encountered to objects that pygr.Data IDs, it will simply save the ID. Returns a string pickle of obj. Use pygr.Data.loads() to restore an object pickled using this function.

loads( data,cursor=None)
Unpickles the string pickle contained in data in a pygr.Data-aware manner. I.e. any references in the pickle of the form ``PYGR_DATA_ID:'' will be retrieved by pygr.Data in the usual way.

data should have been generated by a previous call to pygr.Data.dumps().

cursor if not None, must be a Python DB API 2.0 compliant cursor object, that will be used to load any objects that require a database connection.

2.5.6 pygr.Data Layers

To provide an intuitive way to refer to different resource databases, pygr.Data associates ``layer names'' with them. For example, the layer name for the first resource database whose path is given relative to your home directory is my, and the first one whose path is given relative to current directory is here. Remote resource databases (XMLRPC; MySQL) each store their own layer name. For example, within the Lee lab, we keep a MySQL resource database whose layer name is ``leelab''.

2.5.7 pygr.Data Schema Concepts

Parallel to the pygr.Data namespace, pygr.Data maintains a schema namespace that records schema information for pygr.Data resources. Broadly speaking, schema is any relationship that holds true over a set of data in a given collection (e.g. in the human genome, ``genes have exons'', a one-to-many relation). In traditional (relational) databases, this schema information is usually represented by entity-relationship diagrams showing foreign-key relationships between tables. A pygr.Data resource is a collection of objects (referred to in these docs as a ``container'' or ``database''); thus in pygr, schema is a relation between pygr.Data resources, i.e. a relationship that holds true between the items of one pygr.Data resource and the items of another. For examples, items in a ``genes'' resource might each have a mapping to a subset of items in an ``exons'' resource. This is achieved in pygr.Data by adding the mapping object itself as a pygr.Data resource, and then specifying its schema to pygr.Data (in this example, its schema would be a one-to-many relation between the ``genes'' resource and the ``exons'' resource). Adding the mapping object as a pygr.Data resource, and adding its schema information, are two separate steps.
pygr.Data.Bio.Genomics.ASAP2.hg17.geneExons = geneToExons # SAVE MAPPING
pygr.Data.schema.Bio.Genomics.ASAP2.hg17.geneExons = \
  pygr.Data.OneToManyRelation(genes,exons,bindAttrs=('exons','gene'))
pygr.Data.save() # SAVE ALL PENDING DATA AND SCHEMA TO RESOURCE DATABASE
assuming that genes and exons are the pygr.Data resources that are being mapped. This would allow a user to obtain the mapping from pygr.Data and use it just as you'd expect, e.g. assuming that gene is an item from genes:
geneToExons = pygr.Data.Bio.Genomics.ASAP2.hg17.geneExons()
myexons = geneToExons[gene] # GET THE SET OF EXONS FOR THIS GENE
In practice, pygr.Data accomplishes this by automatically setting geneToExon's sourceDB and targetDB attributes to point to the genes and exons resources, respectively.

Since most users find it easier to remember object-oriented behavior (e.g. ``a gene has an exons attribute'', rather than ``there exists a mapping between gene objects and exon objects, called geneToExons''), pygr.Data provides an option to bind attributes of the mapped resource items. In the example above, we bound an exons attribute to each item of genes, which automatically performs this mapping, e.g. we can iterate over all exons in a given gene as easily as

for exon in gene.exons: # gene.exons IS EQUIVALENT TO geneToExons[gene]
  # DO SOMETHING...
Note: in this usage, the user does not even need to know about the existence of the geneToExons resource; pygr.Data will load it automatically when the user attempts to access the gene.exons attribute. It can do this because it knows the schema of the pygr.Data resources!

One additional aspect of pygr.Data schema relations goes a bit beyond ordinary mapping: a mapping between one object (source) and another (target) can have edge information that describes this specific relationship. For example, the connection between one exon and another in the alternative splicing of an mRNA isoform, is a splice. For alternative splicing analysis, it is actually crucial to have detailed information about the splice (e.g. what experimental evidence exists for that splice; what tissues it was observed, in what fraction of isoforms etc.) in addition to the exons. Therefore, pygr.Data allows us to save edge information also as part of the schema, e.g. for a splicegraph representing the set of all splices (edges) between pairs of exons (nodes), we can store the schema as follows:

pygr.Data.Bio.Genomics.ASAP2.hg17.splicegraph = splicegraph # ADD A NEW RESOURCE
pygr.Data.schema.Bio.Genomics.ASAP2.hg17.splicegraph = \
  pygr.Data.ManyToManyRelation(exons,exons,splices, # ADD ITS SCHEMA RELATIONS
                               bindAttrs=('next','previous','exons'))
pygr.Data.save() # SAVE ALL PENDING DATA AND SCHEMA TO RESOURCE DATABASE
This type of mapping (``edge'' relations between pairs of ``nodes'') is referred to in mathematics as a graph, and has very general utility for many applications. For further information on graphs in pygr, see the tutorial or the mapping module reference below.

What information does pygr.Data schema actually store? In practice, the primary information stored is attribute relations: i.e. for a specified resource ID, a specified attribute name should be added to the resource object (or to items obtained from it), which in turn maps to some specified target resource (or items of that resource).

Although users do not need to know how this information is saved, I will outline the methodology as a reference for developers who want to work directly with this internal data (skip this section otherwise).

2.5.8 Collection

Provides a dict-like container that can be directly saved as a container in pygr.Data. Ordinary dict instances cannot be conveniently saved as pygr.Data resources, because they do not allow attributes to be saved (which is required for storing pygr.Data information like _persistent_id and itemClass), and because older versions of Python have a bug that affects pickling of dicts with cyclic references (i.e. contents that refer to the container). pygr.Data.Collection provides a drop-in substitute that uses dict or a Python shelve as its internal storage, and provides a full dict-like interface externally. It takes several arguments:

saveDict, if not None, is the internal mapping to use as our storage.

filename: if provided, is a file path to a shelve (BerkeleyDB) file to store the data in. NOTE: if you add data to a Collection stored in such a file, you must call the Collection's close() method to ensure that all the data will be saved to the Python shelve. Otherwise, the Python shelve file might be left in an incomplete state. NOTE: opening a collection with the filename option will cause it to use the PicklableShelve or IntShelve class for the Collection.

mode=None is passed to the Python shelve.open() function to control whether filename is opened in read, write or create mode; see the Python shelve module documentation for details. If mode is None, it will first try to open the shelve in mode 'r' (read-only), but if the file is missing, will open it in mode 'c' (create).

writeback=True is passed to the Python shelve.open() function to control the saving of data to the shelve. See the Python shelve module documentation for details. The default writeback=True setting can consume large amounts of memory if you are writing a lot of data to the shelve. To avoid this problem, use writeback=False; note that this means updates to the shelve will only be saved when you explicitly set an item in the Collection (e.g. collection[k] = v; specifically, if v is a mutable object, subsequently changing the contents of v will not automatically update the shelve, whereas it would be with writeback=True).

dictClass: if provided, is the class to use for storage of the dict data.

For example,

ens_genes = pygr.Data.Collection(itemClass=Transcript) # DICTIONARY OF GENES
ens_genes[gene_id] = gene
pygr.Data generally needs to know the itemClass of items stored inside a resource, so that it can add shadow attributes (by adding properties, directly to the itemClass).

close( )
You must call this method to ensure that any data added to the Collection will be written to its Python shelve file on disk. This method is irrelevant, but harmless, if you are instead using an in-memory dictionary as storage.

2.5.9 Mapping

This class provides dict-like class suitable for persistent usages. It extracts ID values from keys and values passed to it, and saves these IDs into its internal dictionary instead of the actual objects. Thus, the external interface is objects, but the internal storage is ID values. This allows the mapping to be stored persistently (i.e. pickled) separately from the objects which it maps, because only IDs are stored in the Mapping.

You can use any object that obeys the Python mapping protocol (e.g. dict, or Python shelve) as the internal storage. Mapping behaves exactly like a standard Python dictionary, providing all the standard methods of the Mapping Protocol.

Mapping( sourceDB, targetDB, saveDict=None, IDAttr='id', targetIDAttr='id', itemAttr=None, multiValue=False, inverseAttr=None,filename=None,dictClass=None,mode=None)
Initializes a mapping between items of sourceDB and items of targetDB.

sourceDB: container whose items will serve as keys for this Mapping. i.e. sourceDB must be a dictionary that maps key ID values to key objects.

targetDB: container whose items will serve as values of this Mapping. i.e. targetDB must be a dictionary that maps value IDs to value objects.

saveDict, if not None, is the internal mapping to use as our storage. If None, attempts to open or create a suitable storage for you. See also the filename, dictClass and mode arguments. If none of these arguments are provided, a standard Python dictionary will be used.

IDAttr: attribute name to obtain an ID from a key object.

targetIDAttr: attribute name to obtain an ID from a value object.

itemAttr, if not None, the attribute to obtain target (value) ID from an internal storage value

multiValue: if True, treat each value as a list of values, i.e. this Mapping will serve as a one-to-many mapping from sourceDB to targetDB.

inverseAttr, if not None, attribute name to obtain a source ID from a value object.

filename: if not None, is a file path to a shelve (BerkeleyDB) file to store the data in.

NOTE: if you add data to a Mapping stored in such a disk file, you must call the Mapping's close() method to ensure that all the data will be saved to the Python shelve. Otherwise, the Python shelve file might be left in an incomplete state.

mode: if not None, specifies how the shelve file should be opened: 'r' (read-only), 'c' (create), 'w' (read/write). For more details see the Python Library shelve documentation.

dictClass: if not None, is the class to use for storage of the dict data.

close( )
You must call this method to ensure that any data added to the Mapping will be written to its Python shelve file on disk. This method is irrelevant, but harmless, if you are instead using an in-memory dictionary as storage.

Here's an example usage:

gene_exons = Mapping(ens_genes, exon_db, multiValue=True, inverseAttr='transcript_id')
for exon in exon_db:
    gene = ens_genes[exon.transcript_id]
    exons = gene_exons.get(gene, [])
    exons.append(exon)
    gene_exons[gene] = exons # SAVE EXPANDED EXON MAPPING LIST
# SAVE TO PYGR DATA, AND CREATE GENES -> EXONS SCHEMA RELATION
pygr.Data.Bio.Titus.Test1.GeneExons = gene_exons
pygr.Data.schema.Bio.Titus.Test1.GeneExons = \
     pygr.Data.OneToManyRelation(ens_genes,exon_db,bindAttrs=('exons','gene'))
pygr.Data.save() # SAVE ALL PENDING DATA AND SCHEMA TO RESOURCE DATABASE

2.5.10 ResourceFinder

The core functionality of the pygr.Data module is provided by the ResourceFinder class, an instance of which is created at the top-level of the module as pygr.Data.getResource. It provides methods for adding, deleting and controlling pygr.Data resources and schema.

getResource( id,layer=None,debug=None,*args,**kwargs)
Look up pygr.Data resource id, using the specified abstract resource layer if provided. Searches the resouce database(s) for id, constructs it from the saved resource rule (e.g. from a local resource database, by unpickling the object). Saves the object in its cache so that subsequent calls for the same resource ID will return the same object. Applies the stored pygr.Data schema rules to it using applySchema(). Marks the object with its _persistent_id attribute, whose value is just id. Passing the option debug=True will cause it to raise any exception that occurs during resource loading immediately, rather than continuing to search its resource database list. This is helpful for debugging purposes.

getResource.addResource( id,obj,layer=None)
Same as the top-level module function of the same name.

getResource.addSchema( name,schemaObj,layer=None)
Same as the top-level module function of the same name.

getResource.dir( prefix,layer=None,asDict=False)
Same as the top-level module function of the same name.

getResource.deleteResource( id,layer=None)
Same as the top-level module function of the same name.

getResource.dumps( obj)
Same as the top-level module function of the same name.

getResource.list_pending( )
Same as the top-level module function of the same name.

getResource.loads( data,cursor=None)
Same as the top-level module function of the same name.

getResource.newServer( name,serverClasses=None,clientHost=None,withIndex=False, host=None, port=5000, **kwargs)
Same as the top-level module function of the same name.

getResource.rollback( )
Same as the top-level module function of the same name.

getResource.save_pending( layer=None)
Same as the top-level module function pygr.Data.save().

The following methods are mainly for internal use, and are unlikely to be needed by users of pygr.Data. In general, you should not use them unless you have a very good reason to be working with the interal pygr.Data methods, and really know what you are doing!

update( )
Update getResource's list of resource databases, by parsing the environment variable PYGRDATAPATH and attempting to connect to the resource databases listed there. Does not return anything.

addLayer( layerName,rdb)
Add the resource database rdb to the current resource database list, as a named layer given by the string layerName. Over-writing an existing layer name is not allowed, for security reasons; the previous layer entry must first be deleted.

getLayer( layerName)
Get the specified resource database, by its layer name. If layerName is None, returns the default (first) resource database in its list.

resourceDBiter( )
Generates all the resource databases currently listed by getResource.

registerServer( locationKey,serviceDict)
Registers the set of resources specified by serviceDict to the first resource database index in PYGRDATAPATH that will accept them. serviceDict must be a dictionary whose keys are resource IDs and whose associated values are pickled resource objects (encoded as strings). locationKey should be a string name chosen to represent the ``location'' where the data are stored. This can be anything you wish, and is mainly used to let the user know where the data will come from. This might be used in future versions of pygr.Data to allow preferential screening of where to get data from (local disk is better than NFS mounted disk, which in turn might be preferable over remote XMLRPC data access).

findSchema( id)
Returns a dictionary for the schema (if any) found for the pygr.Data resource specified by id. The dictionary keys are attribute names (representing attributes of the specified resource or its contents that should have schema relations with other pygr.Data resources), and whose values are themselves dictionaries specifying the precise schema rules for constructing this specific attribute relation.

schemaAttr( id,attr)
Return the target data linked to by attribute attr of pygr.Data resource id, based on the stored pygr.Data schema. The target resource object will be obtained by pygr.Data.getResource as usual.

applySchema( id,obj)
Apply the pygr.Data schema for resource id to the actual data object representing it (obj), by decorating it (and / or its itemClass and itemSliceClass) with properties representing its schema attributes. These properties are implemented by adding descriptor attributes to the associated class, such as OneTimeDescriptor or ItemDescriptor.

saveResource( resID,obj,layer=None)
Raw interface to actually save a specific resource to the specified (or default) resource database. DO NOT use this internal interface unless you know what you are doing!

saveSchema( id,attr,bindingDict,layer=None)
Save a schema attribute relation for attribute attr of pygr.Data resource id, to the specified resource database layer (or the default, first resource database in the list, if no layer specified). bindingDict must be a dictionary specifying the rules for binding the attribute to a pygr.Data resource target; see below for details. DO NOT use this internal interface unless you know what you are doing!

delSchema( id,layer=None)
Delete schema bindings for all attributes of the resource id, in the specified resource database layer, as well as all schema relations on other resources that are targeted to resource id.

2.5.11 ResourceDBMySQL

Implements an interface to storage of a resource database in a MySQL database table.
__init__( tablename,finder=None,createLayer=None)
tablename is the table to use in the database, in the format ``DBNAME.TABLENAME dbinfo'', where DBNAME is the name of the database in the MySQL server, and TABLENAME is the name of the table in that database that you wish to use to store the resource database. dbinfo is optional. If provided, it must be a whitespace separated list of arguments for connecting to the MySQL server, of the form host user passwd. You can provide one, two or three of these optional arguments. If no dbinfo is provided, host, port, user and password info are obtained from your .my.cnf config file as usual for the mysql client.

finder, if specified gives the ResourceFinder instance in which the new resource DB should be registered. If None provided, defaults to pygr.Data.getResource.

createLayer, if specified forces it to create a new table in the MySQL database (instead of assuming that it already exists), and saves createLayer as the layer name of this resource database.

Example: create a new resource database, give it the layer name ``leelab'', and register it in our list of resource databases.

rdb = pygr.Data.ResourceDBMySQL('pygrdata.index',createLayer='leelab')
Note that you must provide the createLayer argument, in order to create a new resource database table. ResourceDBMySQL will not automatically create a new table without this argument, simply because the tablename you provided does not exist. In that case, it will raise an exception to alert you to the fact that either the correct table name was not given, or the table does not exist.

__getitem__( id)
Get resource id from this resource database, or KeyError if not found.

__delitem__( id)
Delete resource id from this resource database, or KeyError if not found.

__setitem__( id,obj)
Save resource id to this resource database, by pickling it with self.finder.dumps(obj).

registerServer( locationKey,serviceDict)
Saves the set of resources specified by serviceDict to the database. serviceDict must be a dictionary whose keys are resource IDs and whose associated values are pickled resource objects (encoded as strings). locationKey should be a string name chosen to represent the ``location'' where the data are stored. This can be anything you wish, and is mainly used to let the user know where the data will come from. This might be used in future versions of pygr.Data to allow preferential screening of where to get data from (local disk is better than NFS mounted disk, which in turn might be preferable over remote XMLRPC data access).

setschema( id,attr,ruleDict)
Save schema information for attribute attr on resource id by pickling the ruleDict.

delschema( id,attr)
Delete schema information for attribute attr on resource id.

getschema( id)
Get schema information for resource id, in the form of a dictionary whose keys are attribute names, and whose values are the associated schema ruleDict for each bound attribute.

2.5.12 ResourceDBShelve

Implements an interface to storage of a resource database in a Python shelve (i.e. BerkeleyDB file) stored on local disk. Provides the same interface as ResourceDBMySQL, except for no registerServer method. Note: any method call that would save information to the database temporarily re-opens the database file in write mode, saves the required information, and immediately closes and re-opens the databae in read-only mode. Thus, unless two clients try to save information to the same file at exactly the same time, successive writes by multiple clients will not interfere with each other.
__init__( dbpath,finder,mode='r')
dbpath is the path to the directory in which the shelve file is found (or should be created, if none present).

2.5.13 ResourceDBClient

Implements a client interface to storage of a resource database in an XMLRPC server. For security reasons, only provides the __getitem__, and registerServer methods.

2.5.14 ResourceDBServer

Implements a server interface for storage of a resource database in a standard Python dict, served to clients via an XMLRPC server (use coordinator.XMLRPCServerBase as the XMLRPC server to serve this object).

__init__( layerName,readOnly=True)
layerName is the layer name that this server will provide to pygr.Data clients. readOnly if True, makes the server reject any requests to add new database rules received via XMLRPC, i.e. only allows getName and getResource calls via XMLRPC. If False, also allows calls to registerServer and delResource.

2.5.15 ResourcePath

Used for providing the dynamically extensible pygr.Data namespace that provides the normal interface for users to access pygr.Data resources.
__init__( namepath,layerName=None)
namepath specifies the ID string to use for this resourcePath. layerName if specified, gives the layer name that should be used for finding this resource and any subattributes of it.

For example, Bio is added at the top-level of the pygr.Data module by the following code:

Bio = ResourcePath('Bio')

__getattr__( attr)
extends the resource path by one step, returning a ResourcePath object representing the requested attribute.

__setattr__( attr,obj)
saves obj as the specified resource ID, by calling getResource.addResource, with our layer name (if any).

__delattr__( attr)
deletes the specified resource ID, by calling getResource.deleteResource, with our layer name (if any).

__call__( *args,**kwargs)
Construct the specified resource ID, by calling getResource, with our layer name (if any), and the specified arguments (if any).

2.5.16 SchemaPath

Class for top-level object representing a schema namespace. e.g. in the pygr.Data module,
schema = SchemaPath() # CREATE ROOT OF THE schema NAMESPACE

2.5.17 ResourceLayer

Class for top-level object representing a pygr.Data layer. e.g. in the pygr.Data module,
here = ResourceLayer('here') # CREATE TOP-LEVEL INTERFACE TO here LAYER

2.5.18 ForeignKeyMap

Provides a mapping between two containers, assuming that items of the target container have a foreign key attribute that gives the ID of an item in the source container.
ForeignKeyMap( foreignKey,sourceDB=None,targetDB=None)
foreignKey must be a string attribute name for the foreign key on items of the targetDB. Furthermore, targetDB must provide a foreignKey method that takes two arguments: the foreignKey attribute name, and an identifier that will be used to search its items for those whose attribute matches this identifier. It must return an iterator or list of the matching items.

__getitem__( id)
get a list of items in targetDB whose attribute matches this id.

__invert__( )
get an interface to the reverse mapping, i.e. mapping object that takes an item of targetDB, and returns its corresponding item from sourceDB, based on the input item's foreign key attribute value.

For example, given a container of clusters, and a container of exons (that each have a cluster_id attribute), we create a mapping between them as follows:

m = ForeignKeyMap('cluster_id',clusters,exons)
for exon0 in m[cluster0]: # GET EXONS IN THIS CLUSTER
    do something...
cluster1 = (~m)[exon1]  # GET CLUSTER OBJECT FOR THIS EXON

2.5.19 ManyToManyRelation, OneToManyRelation, ManyToOneRelation, OneToOneRelation

Convenience class for constructing schema relations for a general graph mapping from a sourceDB to targetDB with edge info.
__init__( sourceDB,targetDB,edgeDB=None,bindAttrs=None)
sourceDB,targetDB, and edgeDB can be either a string resource ID, a ResourcePath object, or an actual pygr.Data resource (automatically marked with its ID as the _persistent_id attribute). bindAttrs, if provided, must give a list of string attribute names to be bound, in order, to items of sourceDB, targetDB, and edgeDB, in that order. A None value in this list simply means that no attribute binding will be made to the corresponding pygr.Data resource.
Note: this class simply records the information necessary for this schema relation. The information is not actually saved to the resource database until its saveSchema method is called by the SchemaPath object. In addition to saving attribute bindings given by bindAttrs, this will also create bindings on the mapping resource object itself (i.e. the resource whose schema is being set; see an example in the tutorial). Specifically, it will save bindings for its sourceDB,targetDB, and edgeDB attributes to the corresponding resources given by the sourceDB,targetDB, and edgeDB arguments.

OneToOneRelation, OneToManyRelation, ManyToOneRelation and ManyToManyRelation differ only in the uniqueness vs. multiplicity of the mapping indicated. E.g. ~m1[v] --> k vs. ~mMany[v] --> [k1,k2,...]

2.5.20 DirectRelation, ItemRelation, InverseRelation

Users are unlikely to have any reason to work directly with these internal interfaces. Instead, use ManyToManyRelation, OneToManyRelation, ManyToOneRelation, OneToOneRelation as these cover the normal schema relationships. You should only use internal interfaces like DirectRelation, ItemRelation, InverseRelation if you have a real need to do so, and really know what you are doing! This documentation is only provided for developers directly working on pygr internals.

DirectRelation is a convenience class for constructing a single schema attribute relation on a pygr.Data resource, linking it to another pygr.Data resource.

__init__( target)
target gives a reference to a pygr.Data resource, which will be the target of a bound schema attribute. target can be either a string resource ID, a ResourcePath object, or an actual pygr.Data resource (automatically marked with its ID as the _persistent_id attribute).

schemaDict( )
returns a basic ruleDict dictionary for saving this schema binding. Can be over-ridden by subclasses to customize schema binding behavior.

saveSchema( source,attr,layer=None,**ruleDict)
Saves a schema binding for attribute attr on pygr.Data resource source to the specified resource database layer (or to the default resource database if not specified). ruleDict if specified provides additional binding rules (which can add to or over-ride those returned by the schemaDict method). source can be either a string resource ID, a ResourcePath object, or an actual pygr.Data resource (automatically marked with its ID as the _persistent_id attribute).

ItemRelation provides a subclass of DirectRelation that binds to the items of resource source rather than to the source object itself.

InverseRelation provides a subclass of DirectRelation, that binds source and target as each other's inverse mappings. That is, it binds an inverseDB attribute to each resource that points to the other resource. When either resource is loaded, a special __invert__ method will be added, that simply loads and returns the resource pointed to by the inverseDB binding.

2.5.21 nonPortableClasses,SourceFileName

The variable pygr.Data.nonPortableClasses specifies a list of classes which have local data dependencies (e.g. requires reading a file that is on your local disk), and therefore cannot be transferred over XMLRPC to a remote client by simple pickling / unpickling. pygr.Data.newServer will automatically cull any data that has such dependencies from the list of resources it loads into the XMLRPC server it constructs, so that the server will not attempt to serve data that actually will not work on remote clients. You can add your own classes to this list if needed.

By default, the pygr.Data.nonPortableClasses list consists of simply a single class, pygr.Data.SourceFileName, which is a subclass of str that marks a string as representing a path to a file. It behaves just like a string, but allows pygr.Data to be smart about checking whether the required file actually exists and is readable before returning a resource to the user. If you save filenames on your own objects using this class, pygr.Data will therefore be able to handle them properly for many issues such as XMLRPC portability to remote clients. You do this simply as follows:

class Foo(object):
  def __init__(self,filename):
    self.filename = SourceFileName(str(filename)) # MARK THIS A BEING A FILE NAME
    ifile = file(self.filename) # OPEN THIS FILE NOW IF YOU WANT...


2.6 sqlgraph Module

This module provides back-end database access.

2.6.1 SQLTable

Provides a dict-like interface to an SQL table. It accepts an identifier as a key, and returns a Python object representing the corresponding row in the database. Typically, these ``row'' objects have an id attribute that represents the primary key, and all column names in the SQL table can be used as attribute names on the row object.

This class assumes that the database table has a primary key, which is used as the key value for the dictionary. For tables with no primary key see other variants below.

This class and its variants follow a simple rule for controlling how data is loaded into memory: if you simply iterate over IDs (i.e. for id in mytable) data is not pre-loaded into memory; each object will be fetched individually when you try to access it (e.g. obj=mytable[id]). By contrast, if you call the table's items method, it will load data for the entire table into memory, on the rational basis that making this call signals that you intend to work with each and every object in the database table. Other ``value'' iterators such as iteritems, values, and itervalues also load all the data.

__iter__( )
Iterate over all IDs (primary key values) in the table, without loading the entire table into memory.

items( )
return a list of all (id,obj) pairs representing all data in the table, after first loading the entire table into memory.

__getitem__( id)
get the object whose primary key is id, and cache it in our local dictionary (so that subsequent requests will return the same Python object, immediately, with no need to re-run an SQL query). For non-caching versions of SQLTable, see below.

load( oclass=None)
Load all data from the table, using oclass as the row object class if specified (otherwise use the oclass for this table). All rows are loaded from the database and saved as row objects in the Python dictionary of this class.

2.6.2 SQLTableBase

The base class for SQLTable and other variants below. This class is derived from the Python builtin dict class, so all standard methods of dict can be used.

__init__( name,cursor=None,itemClass=None,attrAlias=None,clusterKey=None)
Open a connection to the existing SQL table specified by name. You can supply a Python DB API cursor providing a connection to the database server. If cursor is None, it will attempt to connect to a MySQL server using authentication information either from your the name string (treated as a whitespace separated list in the form tablename host user passwd; at least tablename and host must be present), or from your .my.cnf configuration file in the usual MySQL way (in which case only tablename needs to be specified).

itemClass indicates the class that should be used for constructing item objects (representing individual rows in the database).

attrAlias, if provided, must be a dictionary whose keys are attribute names that should be bound to items from your database, and whose values are an SQL column name or SQL expression that should be used to obtain the value of the bound attribute.

clusterKey, if provided, is a caching hint for speeding up database access by ``clustering'' queries to load an entire block of rows that share the same value of the specified clusterKey column. This caching hint is only used by the Clustered SQLTable variants described in detail below.

objclass( itemClass)
Specify a object class to use for creating new ``row'' objects. itemClass must accept a single argument, a tuple object representing a row in the database.

Otherwise, the default oclass for SQLTable is the TupleO class, which provides a named attribute interface to the tuple values representing the row.

select( whereClause,params=None,oclass=None,selectCols='t1.*')
Generate the list of objects that satisfy the whereClause via a SQL SELECT query. This function is a generator, so you use it as an iterator. params is passed to the cursor execute statement to allow additional control over the query. selectCols allows you to control what subset of columns should actually be retrieved.

_attrSQL( attr)
Get a string expression for accessing attribute attr in SQL. This might either simply be an alias to the corresponding column name in the SQL table, or possibly an SQL expression that computes the desired value, executed on the database server.

There are several variants of this class:

2.6.3 SQLTableClustered

A subclass of SQLTable that groups its retrieval of data from the table (into its local dictionary, where it is cached), into ``clusters'' of rows that share the same value of a column specified by the clusterKey argument to the SQLTableBase constructor. For data that naturally subdivide into large clusters, this can speed up performance considerably. If the clustering closely mirrors how users are likely to access the data, this performance gain will have relatively little cost in terms of memory wasted on loading rows that the user will not need.

2.6.4 SQLTableNoCache

Provide on-the-fly access to rows in the database, but never cache results. Use this when memory constraints or other considerations (for example, if the data in the database may change during program execution, and you want to make sure your program is always working with the latest version of the data) make it undesirable to cache recently used row objects, as the standard SQLTable does. Instead it returns (by default) SQLRow objects that simply provide an interface to obtain desired data attributes via database SQL queries. Of course this reduces performance; every attribute access requires an SQL query. You can customize the class used for providing this interface by specifying a different itemClass to the constructor.

2.6.5 SQLMultiTableNoCache

Drops the assumption of a one-to-one mapping between each key and a row object (i.e. removes the assertion that the key is unique, a ``primary key''), allowing multiple row objects to be returned for a given key. Therefore, the standard __getitem__ must act as a generator, returning an iterator for one or more row object. You must set a _distinct_key attribute to inform it of which column to use as the key for searching the database; this defaults to ``id''.

2.6.6 SQLGraph

Provides a graph interface to data stored in a table in a relational database. It follows the standard pygr graph interface, i.e. it behaves like a dictionary whose keys are source nodes, and whose associated values are dictionaries whose keys are target nodes, and whose associated values are edges between a pair of nodes. This class is a subclass of SQLTableMultiNoCache. By default, it assumes that the column names for source, target and edge IDs are simply ``source_id'', ``target_id'', and ``edge_id'' respectively. To use different column names, simply provide an attrAlias dictionary to the constructor, e.g.
g = SQLGraph('YOURDB.YOURTABLE',attrAlias=dict(source_id='left_exon_form_id',
                                               target_id='right_exon_form_id',
                                               edge_id='splice_id'))
For good performance, the columns storing the source_id, target_id, and edge_id should each be indexed.

__init__( name,cursor=None,itemClass=None,attrAlias=None,sourceDB=None,targetDB=None,edgeDB=None,simpleKeys=False,unpack_edge=None,**kwargs)
name provides the name of the database table to use.

cursor, if provided, should be a Python DB API 2.0 compliant cursor for connecting to the database. If not provided, the constructor will attempt to connect automatically to the database using the MySQLdb module and your .my.cnf configuration file.

attrAlias, if provided, must be a dictionary that maps desired attribute names to actual column names in the SQL database.

simpleKeys, if True, indicates that the nodes and edge objects saved to the graph by the user should themselves be used as the internal representation to store in the SQL database table. This usually makes sense only for strings and integers, which can be directly stored as columns in a relational database, whereas complex Python objects generally cannot be. To use complex Python objects as nodes / edges for a SQLGraph, use the sourceDB,targetDB,edgeDB options below.

sourceDB, if provided, must be a database container (dictionary interface) whose keys are source node IDs, and whose values are the associated node objects. If no sourceDB is provided, that implies simpleKeys=True.

targetDB, if provided, must be a database container (dictionary interface) whose keys are target node IDs, and whose values are the associated node objects.

edgeDB, if provided, must be a database container (dictionary interface) whose keys are edge IDs, and whose values are the associated edge objects.

unpack_edge, if not None, must be a callable function that takes a ``packed'' edge value and returns the corresponding edge object.

__iadd__( node)
Add node to the graph, with no edges. node must be an item of sourceDB, if that option was provided.

__delitem__( node)
Delete node from the graph, and its edges. node must be a source node in the graph. __isub__ does exactly the same thing.

__contains__( id)
Test whether id exists as a source node in this graph.

__invert__( )
Return an SQLGraph instance representing the reverse directed graph (i.e. swap target nodes for source nodes).

2.6.7 SQLGraphClustered

Provides a read-only graph interface with improved performance based on using SQLTableClustered as the interface to the database table. This has several implications: 1. the table should have a primary key; 2. the table should have a clusterKey column that provides the value for clustering rows in the table. This class can offer much better performance than SQLGraph for several reasons: 1. it caches data so that subsequent requests for the same node or edge will be immediate, with no need to query the SQL database; 2. it employs clustering to group together data retrieval of many rows at a time sharing the same cluster key value, instead of one by one; 3. it provides a load method for loading the entire graph into cache (local dictionary); 4. use of the items method and other ``value iterator'' methods will automatically perform a load of the entire graph, so that only a single database query is used for the entire dataset, rather than a separate query for each row or cluster.

As for SQLTable, getting a list of node IDs using __iter__ or keys does not force an automatic load of the entire table into memory, but calling items or other ``value'' list / iterator methods will.

__init__( table,source_id='source_id',target_id='target_id',edge_id='edge_id',clusterKey=None,sourceDB=None,targetDB=None,edgeDB=None,simpleKeys=False,unpack_edge=None,**kwargs)
Similar to the SQLTableBase, but not exactly the same format. table can either be a string table name, or an actual SQLTableClustered object. You must provide a clusterKey value. The sourceDB,targetDB,edgeDB,simpleKeys,unpack_edges optional arguments have the same meanings as for SQLGraph (see above).

load( l=None)
Load all data from the table, and store in our local cache (a Python dictionary). If l is not None, it provides a list of tuples obtained via the select method that should be added to the cache, instead of loading the entire database table.

__contains__( id)
Test whether id exists as a source node in this graph.

__invert__( )
Return an SQLGraphClustered instance representing the reverse directed graph (i.e. swap target nodes for source nodes).

2.6.8 TupleO

Default class for ``row objects'' returned by SQLTable. Provide attribute interface to a tuple. To subclass this, add an _attrcol attribute that maps attribute names to tuple index values (integers). Constructor takes a single tuple argument representing a row in the database.

2.6.9 SQLRow

Default class for row objects from NoCache variants of SQLTable. Provides transparent interface to a row in the database: attribute access will be mapped to SELECT of the appropriate column, but data is not cached on this object. Constructor takes two arguments: a database table object, and an identifier for this row. Actual data requests will be relayed by SQLRow to the database table object.


2.7 mapping module: graphs and graph query

2.7.1 The Pygr graph model

The basic idea of Pygr is that all Python data can be viewed as a graph whose nodes are objects and whose edges are object relations (in Python, references from one object to another). This has a number of advantages.

1. All data in a Python program become a database that can be queried through simple but general graph query tools. In many cases the need to write new code for some task can be replaced by a database query.

2. Graph databases are more general and flexible in terms of what they can represent and query than relational databases, which is very important for complex bioinformatics data.

3. Indeed, in Pygr, a query is itself just a graph that can be stored and queried in a database, opening paths to automated query construction.

4. Pygr graphs are fully indexed, making queries about edge relationships (which are often unacceptably slow in relational databases) fast.

5. The interface can be very simple and pythonic: it's just a Mapping. In Python "everything is a dictionary", also known as "the Mapping protocol": a dictionary maps some set of inputs to some set of outputs. e.g. m[a]=b maps a onto b, as a unique relation. In Pygr, if we want to be able to map a node to multiple target nodes (i.e. allow it to have multiple edges), we simply add another layer of mapping: m[a][b]=edgeInfo (where edgeInfo is optional edge info.)

Examples of the Pygr syntax:

graph += node1 # ADD node1 TO graph
graph[node1] += node2 # ADD AN EDGE FROM node1 TO node2
graph[node1][node2] = edge_info # ADD AN EDGE WITH ASSOCIATED edge_info
# ADD SCHEMA BINDING WITH graph[node] BOUND AS node.attr
setschema(node,attr,graph) 
# SEARCH graph FOR SUBGRAPH {1->2; 1->3; 2->3}, 
# I.E. EXONSKIP, WHERE THE SPLICE FROM 2 -> 3 HAS ATTRIBUTE type 'U11/U12' 
for m in GraphQuery(graph,{1:{2:None,3:None},\
                   2:{3:dict(filter=lambda edge,**kwargs:edge.type=='U11/U12')},\
                   3:{}}):
    print m[1].id,m[2].id,m[1,2].id

Let's examine these examples one by one:

2.7.2 Directionality and Reverse Traversal

Note that dictGraph stores directed edges, that is, a->b does not imply b->a; those are two distinct edges that would have to be added separately if you want an edge going both directions. Moreover, the current implementation of dictGraph does not provide a mechanism for traveling an edge backwards. To do so with algorithmic efficiency requires storing each edge twice: once in a forward index and once in a reverse index. Since that doubles the memory requirements for storing a graph, the default dictGraph class does not do this. If you want such a "forward-backwards" graph, use the dictGraphFB subclass that stores both forwad and reverse indexes, and supports the inverse operator ($\sim$). $\sim$ graph gets the reverse mapping, e.g. ($\sim$ graph)[node2] corresponds to the set of nodes that have edges to node2. This area of the code hasn't been tested much yet.

2.7.3 Graph

This class provides a graph interface that can work with external storage typically, a BerkeleyDB file, based on storing node ID and edgeID values in the external storage instead of the python objects themselves.
__init__( saveDict=None,dictClass=dict,writeNow=False,filename=None,sourceDB=None,targetDB=None,edgeDB=None,intKeys=False,simpleKeys=False,unpack_edge=None,**kwargs)
filename, if provided, gives a path to a BerkeleyDB file to use as the storage for the graph. If the file does not exist, it will be created automatically. If the intKeys=True option is provided, this will be an IntShelve, which allows the use of int values as keys. Otherwise a regular Python shelve will be used (via the PicklableShelve class), which only allows string keys. Note that in this case you must call the Graph's close() method when you are done adding nodes / edges, to ensure that all the data is written to disk (unless you are using the writeNow=True option, see below).

The writeNow=True option makes all writing operations atomic; i.e. the shelve file is opened read-only, and any attempt to write a single edge will re-open in write mode, save the data, and immediately close it, then re-open it in read-only mode. This minimizes the probability that multiple processes simultaneously accessing the graph database will over-write each others' data. Note: if you leave this option False, and write data to the graph, you must call the close() method once you have finished writing data to the graph, as described below.

saveDict, if provided, must be a graph-style interface that stores the graph purely in terms of node ID and edge ID values. This could be an IntShelve, PicklableShelve or dict instance, for example. If None provided, the constructor will create storage for you using the dictClass class, passing on kwargs to its constructor.

simpleKeys, if True, indicates that the nodes and edge objects saved to the graph by the user should themselves be used as the internal representation to store in the graph database file. This usually makes sense only for strings and integers, which can be directly stored as keys in a BerkeleyDB (Python shelve), whereas complex Python objects generally cannot be. To use complex Python objects as nodes / edges for a Graph, use the sourceDB,targetDB,edgeDB options below.

sourceDB, if provided, must be a database container (dictionary interface) whose keys are source node IDs, and whose values are the associated node objects. If no sourceDB is provided, that implies simpleKey=True.

targetDB, if provided, must be a database container (dictionary interface) whose keys are target node IDs, and whose values are the associated node objects.

edgeDB, if provided, must be a database container (dictionary interface) whose keys are edge IDs, and whose values are the associated edge objects.

__iadd__( node)
Add node to the graph, with no edges. node must be an item of sourceDB.

__delitem__( node)
Delete node from the graph, and its edges. node must be a source node in the graph. __isub__ does exactly the same thing.

close( )
If you chose to use a Python shelve as the actual storage, you used the default setting of writeNow=False, and you wrote data to the graph, then you must call the Graph object's close() method to finalize writing to the disk of any data that may be pending, once you have finished writing data to the graph. Failure to do so may leave the shelve index file in an incomplete and corrupted state.

The object's edges attribute provides an interface to iterating over or querying its edge dictionary.

2.7.4 dictGraph

dictGraph is Pygr's in-memory graph class. For persistent graph storage and query (e.g. stored in a relational database table or BerkeleyDB file), see the Graph class above.

This class provides all the standard behaviors described above. The current reference implementation uses standard Python dict objects to store the graph. All the usual Mapping protocol methods can be used on dictGraph objects (top-level interface, in the examples above graph) and dictEdge objects (second-level interface; in the examples above graph[node]).

2.7.5 Collection

Provides a generic holder for a collection of data objects, that can be pickled, stored in pygr.Data etc.

2.7.6 IntShelve

Provides an interface to the Python shelve persistent dictionary storage, that can accept int values as keys.
__init__( filename=None,mode='r')
Open the specified shelve BerkeleyDB file, using the specified mode.

close( )
calls the shelve file's close method.
In other respects the IntShelve behaves like a regular shelve (dictionary interface).

2.7.7 GraphQuery

The GraphQuery class implements simple node-to-node matching, in which each new node-set is generated by an iterator associated with a specific node in the query graph. This iterator model is general: since indexes (mappings) support the iterator protocol, a given iterator may actually be an index lookup (or other clever search algorithm). The GraphQuery constructor takes two arguments: the default data graph being queried, and the query graph. The query graph is just a graph; its nodes can be any object that can be a graph node (i.e. any object that is indexible, e.g. by adding a __hash__() method). Its node objects will not be modified in any way by the GraphQuery. Its edges are expected to be dictionaries that can be checked for specific keyword arguments:

The attr - subqueries options are all implemented as extremely simple subclasses of GraphQuery. If you want to see just how easy it is to write new subclasses of GraphQuery functionality, look at the graphquery.py module (the entire graph query module is only 237 lines long).

Note: an easy way to pass keyword dictionaries (e.g. as edge information) is simply using the dict() constructor, e.g. dict(dataGraph=myGraph,filter=my_filter). I think this is a little more readable than 'dataGraph':myGraph, 'filter':my_filter.

Note on current behavior: currently, the GraphQuery iterator returns the same mapping object for each iteration (simply changing its contents). So to save these multiple values safely in a list comprehension we have to copy each one into a new dict object via dict(m).

2.7.8 What is GraphQuery actually doing?

A GraphQuery is basically an iterator that returns all possible mappings of the query graph onto the datagraph that match all of the nodes and edges of the query graph onto nodes and edges of the data graph. As an iterator, it does not instantiate a list of the matches, but simply returns the matches one by one. The current design is very simple. The GraphQuery constructor builds an "iterator stack" of GraphQueryIterators, each representing one node in the query graph; they are enumerated in order by a breadth-first-search of the query graph. The GraphQuery iterator processes the stack of GraphQueryIterators: any match simply pushes the stack to the next level; any match at the deepest level of the stack is a complete match (yield the queryMatch mapping); the end of any GraphQueryIterator simply pops the stack. One obvious idea for improving all this is to replace this "interpreter" with a "compiler" that compiles Python for loops that are equivalent to this stack, and run that... likely to be many fold faster.


2.8 coordinator Module

Framework for running subtasks distributed over many computers, in a pythonic way, using SSH for secure process invocation and XMLRPC for message passing. Also provides simple interface for queuing and managing any number of such "batch jobs".

I will first describe some classes for simple XMLRPC services, then proceed to the job control classes.

2.8.1 XMLRPCServerBase

Base class for creating an XMLRPC server to serve data from multiple objects. On the server-side, this object can be treated as a dictionary whose keys are object names, and whose associated values are the server objects that will serve functionality to XMLRPC clients. It provides an XMLRPC method methodCall that takes an object name, method name, and arguments, and if the call is permitted by its security rules, calls the designated method on that object.
__init__( name,host=None,port=5000,logRequests=False)
name is an arbitrary string identifier for the XMLRPC server.

host allows you to override the default hostname to use for this server (which defaults to the fully-qualified domain name of this computer). Setting it to 'localhost' will typically make the XMLRPC server only accessible to processes running on this computer.

port specifies the port number on which this server should run.

logRequests is passed on to SimpleXMLRPCServer as a flag determining whether it outputs verbose log information.

__setitem__( name,obj)
Save obj as the service called name in this XMLRPC server. obj must have an xmlrpc_methods dictionary whose keys are the names of its methods that XMLRPC clients are allowed to call.
__delitem__( name)
Delete the service called name in this XMLRPC server.
register( url=None,name='index',server=None)
Send information describing the services in this XMLRPC server, stored by the user on its registrationData attribute, to the resource database server, which can be specified in several ways. If url is not None, it will make an XMLRPC connection to the resource database server using url (as the URL for the XMLRPC server) and name (as the name of the server object that stores the resource database dictionary). Otherwise, if server is not None, it is assumed to be a resource database object (or XMLRPC connection to such a database) providing a registerServer method that takes two arguments, a locationKey and the registration data. Otherwise, it tries to connect to pygr.Data's default resource database by calling its getResource.registerServer method with the same arguments.
serve_forever( )
Start the XMLRPC server, after detaching it from stdin, stdout and stderr; this call will never exit.

This XMLRPC server provides several interface methods to XMLRPC clients contacting it:

objectList( )
Returns a dictionary of its server objects, whose keys are their names, and whose values are in turn dictionaries whose keys are their allowed method names.
objectInfo( objname)
Returns a dictionary whose keys are the allowed method names for the server object named objname.
methodCall( objname,methodname,args)
Calls the designated method on the named server object, with the provided args, and returns its result to the XMLRPC client.

Example server objects that can be added to a XMLRPCServerBase include seqdb.BlastDBXMLRPC, xnestedlist.NLMSAServer.

2.8.2 XMLRPCClient, get_connection

Client for accessing a XMLRPCServerBase server. Provides a dictionary interface whose keys are names of available server objects, and whose associated values are client objects that provide a transparent interface to the server objects (i.e. calling a method on the client object returns the value of the result of calling the same named method on the server object).
__init__( url)
Makes a connection to the XMLRPC server running on the specified url, typically consisting of both a host name and port number.
__getitem__( name)
Obtain a client object for the server object specified by name. It will be decorated with the set of methods on the server object that are allowed to be accessed by XMLRPC.

As a convenience, the coordinator module provides a function get_connection that provides an efficient connection to XMLRPC server objects. Specifically, it caches past requests, so that multiple requests for the same server object will re-use the same client object, and requests for different server objects on the same XMLRPC server will share the same XMLRPCClient connection. It is simply used as follows: get_connection(url,name), where url is the URL of the XMLRPC server, and name is the name of the server object you wish to access. For example:

myclient = coordinator.get_connection('http://leelab.mbi.ucla.edu:5000','ucsc17')

2.8.3 coordinator Module Functionality Overview

The coordinator module provides a simple system for running a large collection of tasks on a set of cluster nodes. It assumes:

To see how to use this, let's look at an example script, mapclusters5.py:

from pygr.apps.leelabdb import *
from pygr import coordinator

def map_clusters(server,genome_rsrc='hg17',dbname='HUMAN_SPLICE_03',
                 result_table='GENOME_ALIGNMENT.hg17_cluster_JUN03_all',
                 rmOpts='',**kwargs):
    "map clusters one by one"
    # CONSTRUCT RESOURCE FOR US IF NEEDED
    genome = BlastDB(ifile=server.open_resource(genome_rsrc,'r'))
    # LOAD DB SCHEMA
    (clusters,exons,splices,genomic_seq,spliceGraph,alt5Graph,alt3Graph,mrna, \
     protein,clusterExons,clusterSplices) = getSpliceGraphFromDB(spliceCalcs[dbname])

    for cluster_id in server:
        g = genomic_seq[cluster_id]
        m = genome.megablast(g,maxseq=1,minIdentity=98,rmOpts=rmOpts) # MASK, BLAST, READ INTO m
        # SAVE ALIGNMENT m TO DATABASE TABLE test.mytable USING cursor
        createTableFromRepr(m.repr_dict(),result_table,clusters.cursor,
                            {'src_id':'varchar(12)','dest_id':'varchar(12)'})
        yield cluster_id # WE MUST FUNCTION AS GENERATOR

def serve_clusters(dbname='HUMAN_SPLICE_03',
                   source_table='HUMAN_SPLICE_03.genomic_cluster_JUN03',**kwargs):
    "serve up cluster_id one by one"
    cursor = getUserCursor(dbname)
    t = SQLTable(source_table,cursor)
    for id in t:
        yield id

if __name__=='__main__':
    coordinator.start_client_or_server(map_clusters,serve_clusters,['hg17'],__file__)

Let's analyze the script line by line:

2.8.4 Log and Error Information

Process logging and error information go to three different types of logs:

2.8.5 Coordinator

To start a job coordinator (which in turn will the run the whole job by starting Processors on cluster nodes using SSH):

python mapclusters5.py mm5_jan02 --errlog=/usr/tmp/leec/mm5_jan02.log \ 
  --dbname=MOUSE_SPLICE --source_table=genomic_cluster_jan02 \
  --genome_rsrc=mm5 --result_table=GENOME_ALIGNMENT.mm5_cluster_jan02_all \ 
  --rmOpts=-rodent \

Here we have told the Coordinator to name itself "mm5_jan02" in all its communications with the ResourceController. Since we gave no command-line flags, the Coordinator will assume that a ResourceController is already running on port 5000 of the current host. You must have an ssh-agent running BEFORE you start the Coordinator, since the Coordinator will attempt to spawn jobs using SSH. The Coordinator will exit with an error message if it is unable to connect to ssh-agent. A few notes:

2.8.6 ResourceController

Whereas you start a separate Coordinator for each set of jobs you want to run, you only need a single ResourceController running. To start the ResourceController, run:

python coordinator.py --rc=bigcheese

This starts the ResourceController (running as a demon process in the background) and names it "bigcheese"; a name argument (given by the -rc flag) is REQUIRED. Since you didn't specify command-line flags, it will run on the default port 5000. It will use several files based on the name you gave it:

2.8.7 RCMonitor

The coordinator module also provides a convenience interface for interrogating and controlling jobs. In an interactive Python shell, import the coordinator module, and create an RCMonitor object::

from pygr import coordinator
m = coordinator.RCMonitor()

Since you did not specify any arguments, it will default to searching for the ResourceController on the current host, port 5000. You can specify a host and or port as additional arguments. It also loads an index of coordinators currently registered with this ResourceController, accessible on its coordinators attribute:

for name,c in m.coordinators.items():
  print name,len(c.client_report)

will print a list of the coordinators and how many Processors each is currently running. Each coordinator is represented by a CoordinatorMonitor object in this coordinators index.

Both RCMonitor and CoordinatorMonitor objects give you access to the XMLRPC methods of the ResourceController and Coordinators they represent. That is, running a method on the RCMonitor actually runs the identically-named method on the ResourceController. Some of the most useful ResourceController methods are:

Both RCMonitor and CoordinatorMonitor objects have a get_status() method that updates them with the latest information from their associated ResourceController or Coordinator.

Here are some typical monitor usages:

c = m.coordinators['mapclusters3'] # GET MY COORDINATOR
c.client_report.sort() # MAKE IT SORT CLIENTS BY HOSTNAME
c.client_report # PRINT THE SORTED LIST, SHOWING HOST, PID, #TASKS DONE
c.pending_report # PRINT LIST OF TASK IDS CURRENTLY RUNNING
c.nsuccess # PRINT TOTAL #TASKS DONE
c.nerrors  # PRINT TOTAL #TASKS FAILED
c.logfile # PRINT LIST OF ALL PROCESSOR LOGFILES

m.rules # PRINT THE CURRENT RULES DATABASE
m.resources # PRINT THE CURRENT RESOURCES DATABASE
m.setrule('hg17',
('/usr/tmp/ucsc_msa/hg17',
'gunzip -c /data/yxing/databases/ucsc_msa/human_assembly_HG17/*.fa.gz
>%s'))
m.get_status() # UPDATE OUR RC INFO
m.set_hostinfo('llc22','maxload',2.0) # ADD A NEW HOST TO OUR DATABASE
m.setload('llc1','maxload',0.0) # STOP USING llc1 FOR THE MOMENT
m.load_balance() # MAKE IT ALLOCATE ANY FREE CPUS NOW...
m.locks # SHOW LIST OF RESOURCES CURRENTLY LOCKED, UNDER CONSTRUCTION

2.8.8 Security

Internal communication between Processors, Coordinators and ResourceController is performed using XMLRPC and thus is not secure. However, since no authentication information or actual commands are transmitted by XMLRPC, and the coordinator module does not enable the processes that use it to do anything that they are not ALREADY capable of doing on their own (i.e. spawn ssh processes), the main security vulnerabilty is Denial Of Service (i.e. an attacker listening to the XMLRPC traffic could send messages causing Processors to shutdown, or Coordinators to be blocked from running any Processors). In other words the security philosophy of this module is to avoid compromising your security, by leaving the security of process invocation entirely to your existing security mechanisms (i.e. ssh and ssh-agent). Commands are only sent using SSH, not XMLRPC, and the XMLRPC components are designed to prevent known ways that an XMLRPC caller might be able to run a command on an XMLRPC server or client. (I blocked known security vulnerabilities in Python's SimpleXMLRPCServer module).

In the same spirit, the current implementation does not seek to block users from issuing commands that could let them "hog" resources, for the simple reason that in an SSH-enabled environment, they would be able to do so regardless of this module's policy. I.e. the user can simply not use this module, and spawn lots of processes directly using SSH. In the current implementation, every user can send directives to the ResourceController that affect resource allocation to other users' jobs. This means everybody has to "play nice", only giving their Coordinator(s) higher priority if it is really appropriate and agreed by other users. Unless a different process invocation mechanism (other than SSH by each user) were adopted, it doesn't really make sense to me to try to enforce a policy that is stricter than the policy of the underlying process invocation mechanism (i.e. SSH). Since every user can use SSH to spawn as many jobs as they want, without regard for sharing with others, making this module's policy "strict" doesn't really secure anything.

See About this document... for information on suggesting changes.