The following subsections provide details about how to use specific modules of Pygr functionality.
tar -xzvf pygr-0.3.tar.gz cd pygr python setup.py install
Once the test framework has completed successfully, the setup script will install pygr into python's respective site-packages directory.
If you don't want to install pygr into your system-wide site-packages, replace the "python setup.py install" command with
python setup.py build
Pygr contains several modules imported as follows:
from pygr import seqdb # IMPORT SEQUENCE DATABASE MODULE
If you did not install pygr in your system-wide site-packages, you must set your PYTHONPATH to the location of your pygr build. For example, if your top-level pygr source directory is PYGRDIR then you'd type something like:
setenv PYTHONPATH PYGRDIR/build/lib.linux-i686-2.3
Base classes for representing sequences and sequence intervals.
len(s),
and you iterate over the ``letters'' in it using for l in s:
(Note, the individual letters produced by this iterator
will themselves be SeqPath objects (by default, of length 1)).
And all the slicing
operations defined for Python Sequences also apply to
SeqPath (see below).
s[start:stop],
where start and stop are in the local coordinate system of s
(i.e. s[0] is the first letter of the interval represented by
s). Note that SeqPath
follows the Python slicing coordinate conventions of positive integers as
forward coordinates (i.e. counting from the interval start) and negative integers
as reverse coordinates (i.e. counting from the interval end).
str(s).
Note that in most cases
a SeqPath object does not itself store the sequence string associated with it
but obtains it from somewhere else when the user requests it.
-s yields the interval of the opposite strand that
is base-paired to interval s (i.e. this is not just the reverse-complement of s
in the string 'atgc'
for l in s: # GET EACH LETTER OF THE SEQUENCE
c=str(l)
edge = s[l1][l2] # GET EDGE INFORMATION FOR l1 --> l2
for l1,l2,edge in s.edges(): # GET ALL l1 --> l2 EDGES
do_something(l1,l2,e)
# DUMB GRAPH QUERY TO FIND 'AG' SUBSTRINGS IN SEQUENCE s
for d in GraphQuery(s,{0:{1:dict(filter=lambda fromNode,toNode:
str(fromNode)=='A' and str(toNode)=='G')},
1:{}})
l1,l2,edge = d[0],d[1],d[0,1]
For more information about edges, see the LetterEdge class.
| ) |
exon is an interval of genomic sequence, then
exon.before()[-2:] is its acceptor splice site (i.e. the 2 nt immediately
before exon).
| ) |
exon is an interval of genomic sequence, then
exon.after()[:2] is its donor splice site, (i.e. the 2 nt immediately
after exon).
| s, id) |
from pygr import sequence
seq = sequence.Sequence('GPTPCDLMETQ','FOOG_HUMAN')
| s) |
seq.update('TKRRPLEDKMNEPS')
| ) |
| s) |
Each SeqPath object has a number of attributes giving information about its ``location'':
| start, stop, useCache=True) |
fastacmd -L start,stop
UNIX shell command from the NCBI toolkit.
The optional useCache argument controls whether your strslice method
should attempt to get the sequence slice from its database cache (if any),
or, if false, only directly from its back-end storage (in the usual way
described above).
| ) |
len(self.seq), assuming that the sequence can be accessed
from the seq attribute.
| slice_obj) |
db attribute, and that database object
it points to has an itemSliceClass attribute, SeqPath.__getitem__
will use that class to construct the subinterval object. Similarly,
if the sequence object has an annot attribute, and that annotation
object has a db attribute, again the itemSliceClass attribute
of that database will be used as the class to construct the subinterval
object. Otherwise it will
use SeqPath itself as the class for constructing the subinterval object.
Note: this itemSliceClass behavior applies not only to
sequence slices obtained via __getitem__, but also from all other
methods that return sequence slices, such as the following list:
before, after, __mul__, __neg__.
__add__, __iadd__.
| other) |
| ) |
| other) |
| attr) |
xform = IntervalTransform(srcPath,destPath) d2 = xform(s2) # MAPS s2 FROM srcPath coords to destPath coord system d3 = xform[s2] # CLIPS s2 TO NOT EXTEND OUTSIDE srcPath, THEN XFORMS s3 = xform.reverse(d3) # MAP BACK TO srcPath COORD SYSTEM
| msaSlice, targetPath, sourcePath=None) |
| sourceOnly=True, **kwargs) |
| **kwargs) |
| mode=max,trapOverflow=True) |
| mode=max,trapOverflow=True) |
| pIdentityMin=.9,minAlignSize=1,mode=max) |
(srcStart,srcEnd,destStart,destEnd).
Warning: if your query sequence has multiple mappings in the alignment (i.e. it is aligned to two or more different regions in the alignment), pIdentity() and pAligned() may return fractions larger than 1.0. This is because the query sequence may align to a given target sequence via more than one region in the alignment. If you encounter this problem, you can iterate through the individual mappings yourself (by calling the iter(), items() or edges() iterator methods for your alignment slice object), and calculating the percentage identity or alignment (via your own algorithm) individually for each specific mapping. For more background on this problem, see ``Multiple Mappings'', below.
Note that the presence of multiple mappings is not a Pygr bug, but simply reflects the alignment data loaded into Pygr. Seq2SeqEdge should be able to avoid this problem mostly, beginning with release 0.6. (It tries to screen out hits not consistent with the specific region-region mapping stored with this edge).
KeyError.
d = SeqFilterDict(seqIntervalList) overlap = d[ival] # RETURNS INTERSECTION OF ival AND STORED IVAL, OR KeyError
You can pass a list of intervals to store to the class constructor (as
shown above). You can also add a single interval using the syntax
d[saveInterval]=saveInterval. (This syntax reflects the actual
mapping that the dictionary will perform if later called with the
same interval).
| ) |
| ) |
| seq) |
KeyError if not in this edge
| s) |
| seq, start, stop) |
seq.pathForward[start:stop].
| seq, start, stop) |
seq[start:stop].
Pygr interface to sequence alignment, and scalable storage for multigenome alignments
Pygr provides a general model for interfacing with any kind of sequence alignment, and also a uniquely scalable storage system for working with huge multiple sequence alignments such as multigenome alignments. Specifically, it lets you work with an alignment both in the traditional Row-Column model (each row is a sequence, each column is a set of individual letters from different sequences, that are aligned; we will refer to this as the RC-MSA model), and also as a graph structure (known as a Partial Order Alignment, which we will refer to as the PO-MSA model). This supports ``traditional'' alignment analysis, as well as graph-algorithms, and even graph query of alignments.
This model has a few basic classes:
for s2 in msa[s1]: # GET ALL INTERVALS s2 ALIGNED TO s1 IN msa
do_something(s1,s2)
In addition, its letters attribute acts as a graph interface to the Partial Order alignment (PO-MSA) representation of the alignment. I.e. it is a graph whose nodes each represent a set of individual letters from different sequences, that are aligned to each other, and whose edges connect pairs of nodes that are ``adjacent'' to each other in at least one sequence. Specifically, it acts as a dictionary whose keys are MSANode objects (see below), and whose edges are LetterEdge objects (see previous section).
for node in msa.letters: # GET ALL ALIGNMENT ``COLUMNS'' IN msa
for l in node: # GET ALL INDIVIDUAL SEQ LETTERS ALIGNED HERE
say_something(node,l)
myslice = msa[s1]: # GET SLICE ALIGNED TO s1 IN msa
for node in myslice.letters: # GET ALL ALIGNMENT ``COLUMNS'' FOR s1
for l1,l2,e in node.edges(): # GET INDIVIDUAL LETTERS ALIGNED TO l1 OF s1
whatever(l1,l2,e)
This class also has a regions method that generates all the alignment interval relationships in this slice according to ``grouping'' criteria such as maximum permissible gap length, etc. (i.e. any region of alignment containing no gaps larger than a specified size would be returned as a single region, whereas any gap larger than the specified size would split it into two separate regions). This provides a general interface for group-by operations in alignment query.
The nested list data structure solves this problem, by moving
any interval in the database that is contained in another interval
out of the top-level interval list, into the sublist of the
parent interval. Based on this, one can prove that one can stop
the scanning operation at the first non-overlapping interval (i.e.
the overlapping intervals in any list form a single contiguous block).
Overall, this reduces the query time to
, where
is
the number of intervals in the database that actually overlap the
query (i.e. results to return). Moreover, the nested list data structure
can be implemented very well both in computer memory (RAM) or as indexed
disk files. Pygr's disk-based cnestedlist database can complete
a typical interval query of the 26GB UCSC 8 genome alignment in
about 60 microseconds, compared with 10-30 seconds per query using
MySQL.
If you encounter multiple mappings, you can always iterate over them one by one, and perform your own computations for each one. However, to avoid them altogether, you can restrict your queries to the reference genome for this specific alignment (UCSC offers different versions of each alignment set, each based on a different reference genome).
This functionality is encapsulated in the NLMSA class, which has a number of methods and attributes.
Construction Methods:
| pathstem='', mode='r', seqDict=None, mafFiles=None, axtFiles=None, maxOpenFiles=1024, maxlen=None, nPad=1000000, maxint=41666666, trypath=None, bidirectional=True, pairwiseMode= -1, bidirectionalRule=nlmsa_utils.prune_self_mappings, maxLPOcoord=None) |
seqDict specifies a dictionary which maps sequence names to actual sequence
objects representing those sequences. If seqDict is None, the constructor
will call nlmsa_utils.read_seq_dict() to try to obtain it from files
associated with the NLMSA. It first looks for a file pathstem.seqDictP
that is simply a pickle of the seqDict data. If this is not found, it
next looks for a file pathstem.seqDict that is a prefixUnionDict
header file for opening all the sequence database files for you automatically.
This header file will itself specify a list of sequence database files; the
trypath option, if provided, specifies a list of directories in which to look for these
sequence database files.
The bidirectional option indicates whether you wish the NLMSA to save each input alignment relationship A:B in both possible directions (i.e. nlmsa[A] will yield B, and nlmsa[B] will yield A). In general, the bidirectional=True mode is most appropriate for true multiple sequence alignments, i.e. where it is guaranteed that for a given pair of sequences A,B each interval of A maps to a unique interval in B, which in turn maps back to the same interval of A (and only that interval in A). There are many possible scenarios where you might prefer bidirectional=False mode:
nlmsa is a mapping of the human genome sequence onto the mouse genomic
sequence, then nlmsa[s] should only yield a result if s is a human
genome sequence interval; a mouse genome sequence interval should raise a KeyError.
nlmsa[A] will return TWO alignments of B, which might be identical...
or might be significantly different). This is undesirable behavior. Instead,
use bidirectional=False so that nlmsa[A] will simply return the
alignments that were found when A was blasted against the database.
pairwiseMode=True indicates a PAIRWISE sequence alignment, in which the stored alignment relationships each consist of a pair of sequence intervals that are aligned. Note: this pairwise format can store the alignment of any number of sequences, but the key point is that the individual alignment relations are pairwise, sequence-to-sequence. The opposite model (pairwiseMode=False) indicates a true MULTIPLE sequence alignment, in which the stored alignment relationships each consist of an integer coordinate interval (the alignment's internal coordinate system, for technical reasons called the ``LPO'') and a sequence interval that is aligned to it. Under normal circumstances, you will not need to specify a value for the pairwiseMode option; the NLMSA will infer the correct setting automatically based on the input data. Note: the pairwise format (pairwiseMode=True) and multiple alignment format (pairwiseMode=False) cannot be mixed in a single NLMSA. It must be either one format or the other.
mafFiles can be used to specify a list of filenames containing a multiple sequence alignment in the UCSC MAF format, for saving as a new NLMSA (i.e. mode='w'). Note that this automatically sets pairwiseMode=False. After the MAF data are read, it will automatically call the build() method to construct the alignment index files.
axtFiles can be used to specify a list of filenames containing a set of pairwise alignments in UCSC axtNet format, for saving as a new NLMSA (i.e. mode='w'). Note that this automatically sets pairwiseMode=True. After the axtNet data are read, it will automatically call the build() method to construct the alignment index files.
bidirectionalRule allows the user to provide a function that has complete control over the desired bidirectional setting to use for each possible pair of sequence databases. Currently, this is only used for axtFiles reading; the default method (nlmsa_utils.prune_self_mappings) filters out duplicate mappings for a sequence database onto itself (since these are provided in both forward and reverse directions in the axtNet file), but stores mappings for one sequence database to another bidirectionally (since the axtNet files give such mappings in only one direction normally). To implement your own bidirectionalRule function, see nlmsa_utils.prune_self_mappings() as an example.
maxlen specifies the maximum coordinate value for a union or LPO coordinate system. Its default value is 2GB, to prevent int overflow. Using a smaller value can be useful, to 1) limit the size of the LPO in memory during initial construction, and 2) to limit the size of LPO database files on disk (if for example, your file system does not support files above some maximum size). During initial construction of the NLMSA (from MAF files or user-specified interval alignments), the algorithm performs a one-pass sort of the LPO intervals. Thus, this set of intervals is briefly held in RAM for this sort. If you have insufficient RAM, the construction step may raise a MemoryError. If so, you can avoid this problem by using a smaller maxlen value.
The maxint option provides another way of limiting the size of LPO databases. It specifies the maximum number of intervals to store per LPO database. Since each interval takes 24 bytes, the default setting limits each LPO to a total size of 1 GB. Note that the current NLMSA construction algorithm requires loading each database index into memory as one-time operation during construction. If your NLMSA build fails due to running out of memory, simply reduce this value.
The nPad option sets the maximum number of LPO coordinate systems (specifically, the offset for the start of real sequence IDs in the NLMSA sequence index). You are unlikely to need to change this default value.
maxOpenFiles limits the open file descriptors the NLMSA will use. This option is no longer of much importance. In versions prior to pygr 0.5, however, it was important because each sequence in the alignment had its own index file (in v.0.5 and later this problem is solved by unionization; for details see below). Since each sequence has a separate nested list database file, a large multi-genome alignment (with each genome containing 20 different chromosomes, say) can rapidly open a large number of file descriptors. Note: NLMSA only opens a given sequence's nested list database when one of your queries actually requires access to that sequence; it then keeps that file descriptor open to make subsequent queries to it fast. If the number of open file descriptors would exceed maxOpenFiles, it will close other open database files, which may slow down query performance (due to having to open and close databases repeatedly to process queries).
| sequence) |
nlmsa[s]+=s2, where s is
an interval of sequence and s2 is some other sequence interval.
If sequence had not been added to the alignment, this later operation
will raise a KeyError.
| annotation) |
| seqInterval) |
nlmsa[s1]+=s2
saves the alignment of intervals s1 and s2.
You can also use a regular Python slice object using integer indices
ie. nlmsa[1:45], in which case, it indicates that
region of the LPO coordinate system.
If the sequence containing
interval s2 is not already in the NLMSA, it will be added for you automatically
(i.e. creating the necessary indexing, nested list database files, etc.). In this
case, the sequence must supply a unique string identifier, which will be used
on subsequent attempts to open the NLMSA database, to match the individual sequence
nested-list databases against corresponding sequence objects (using seqDict,
see above).
| buildInPlace=True,saveSeqDict=False,verbose=True) |
buildInPlace=False forces it to use an older NLMSA construction method (higher memory usage, but more tested). The new in-place construction method (made the default in release 0.7) is described in the Alekseyenko & Lee 2007 paper published in Bioinformatics.
saveSeqDict=True forces it to write the NLMSA's seqDict (dictionary of sequences that are included in the alignment) to disk. This is unnecessary if you intend to store the NLMSA in pygr.Data, as pygr.Data will automatically save the NLMSA's seqDict as part of that process. However, if you plan on re-opening the NLMSA directly from disk, you should save the seqDict to disk by passing this option, or by directly calling the NLMSA's save_seq_dict() method.
verbose controls whether the method will print explanatory messages to stderr about the saveSeqDict=False mode. To suppress printing of these messages, use verbose=False.
| ) |
Alignment Usage Methods:
| s1) |
nlmsa[1:45], in which case, it gets the NLMSA slice corresponding to that
region of the LPO coordinate system.
| s1) |
seq,
instead of querying its stored alignment data. You can thus use this
to provide an NLMSA interface around virtually any source of alignment information
that you have. To see an example, see the xnestedlist.NLMSAClient class.
| pathstem,outfilename=None,verbose=True) |
/loaner/hg17_NLMSA/hg17_msa.idDict
(and many other index files with different suffixes),
then you would supply a pathstem value of /loaner/hg17_NLMSA/hg17_msa.
outfilename gives the path for the output text file into which the
NLMSA database will be dumped. If None, it will default to pathstem with a
.txt suffix added.
Setting verbose=False will prevent printing of warning messages to stderr (for details about possible warnings, see below).
Note: dump_textfile attempts to save information about the seqDict (or, alternatively, the PrefixUnionDict dictionary of multiple sequence databases), using their pygr.Data IDs if possible. Specifically, for a PrefixUnionDict (i.e. multiple sequence databases in one NLMSA), it saves a dictionary of the prefixes for each sequence database in the NLMSA, with its pygr.Data ID if it has one. Assigning a pygr.Data ID to each sequence database has the great advantage that the reconstruction method textfile_to_binaries() can simply request pygr.Data for these IDs on the destination machine, automatically. By contrast, if a sequence database has no pygr.Data ID, the user will have to supply that sequence database manually on the destination machine. In this case, dump_textfile will print a warning message to stderr explaining what the user must do. This provides yet another reason why it's a good idea to assign a pygr.Data ID to any sequence database that is a well-defined, commonly used public resource.
| filename,seqDict=None,prefixDict=None) |
Handling of sequence databases: textfile_to_binaries will attempt to obtain any needed sequence databases using their pygr.Data ID if assigned. If you obtain a PygrDataNotFoundError, this simply means that one of the pygr.Data IDs was not found in any of your pygr.Data resource databases. In this case, you must either add it to one of your resource databases, or add a resource database that does contain it to your PYGRDATAPATH, then re-run textfile_to_binaries.
On the other hand, if any of the needed sequence databases were NOT assigned a pygr.Data ID, then you will have to provide that sequence database(s) manually to the textfile_to_binaries() function, either via its seqDict argument (if the NLMSA contains only one sequence database), or via its prefixDict argument (if the NLMSA contains multiple sequence databases). If you do not do so, an appropriate error will be raised, explaining what you need to do. The prefixDict argument must be a dictionary whose keys match individual sequence database prefixes in the original NLMSA PrefixUnionDict, and whose associated values are the appropriate sequence database to use for each specified prefix. You only need to provide those sequence databases that textfile_to_binaries() is unable to obtain from pygr.Data. When in doubt, just run textfile_to_binaries() without the prefixDict argument, and it will raise an error message listing the prefixes that you need to provide.
NLMSAServer is constructed exactly the same as a normal NLMSA; it is a normal NLMSA with just two methods added for serving XMLRPC client requests. See the coordinator.XMLRPCServerBase reference documentation below for details about starting an XMLRPC server.
NLMSAClient provides a read-only client interface for querying
data in a remote NLMSAServer. It takes two extra arguments for
its constructor: url, the URL for the XMLRPC server; name,
the name of the NLMSAServer server object in the XMLRPC server's dictionary.
For example, to use an NLMSA stored on a remote XMLRPC server,
assuming that myPrefixUnion stores a dictionary of all the
sequence databases used by that NLMSA alignment, would just be:
from pygr import xnestedlist
nlmsa = xnestedlist.NLMSAClient(url='http://leelab.mbi.ucla.edu:5000',
name='ucsc17',seqDict=myPrefixUnion)
In addition, the NLMSASlice is the basic unit of sequence caching control, by which you can ensure that pygr alignment analysis accesses sequence databases in the most efficient way. Here's how it works:
cacheHint method for each
sequence database object containing the relevant sequences (if this method
exists; if it doesn't, this step is skipped). It passes the cacheHint method
the covering interval information for the aligned sequence, and a reference to
itself (the NLMSASlice object) as the owner of this cache hint.
An NLMSASlice acts like a dictionary whose keys are sequence intervals that are aligned to this region, and whose values are Seq2SeqEdge objects providing detailed information about the alignment of the target interval (key) to the source interval (the sequence interval used to create the NLMSASlice in the first place). You can use this dictionary interface in several ways:
| ) |
| maxgap=0, maxinsert=0, mininsert= 0, filterSeqs=None, mergeMost=False, mergeAll=False, maxsize=500000000, minAlignSize=None, maxAlignSize=None, pIdentityMin=None, ivalMethod=None, sourceOnly=False, indelCut=False, seqGroups=None, minAligned=1, pMinAligned=0., seqMethod=None, **kwargs) |
1. ``group-by'' operations on the alignment intervals contained within this slice. It allows the user to supply various parameters for controlling when alignment intervals will be merged or split in the results that it returns.
mergeAll forces it to combine intervals of a given sequence irrespective of the size of gaps or inserts separating them.
mergeMost forces it to combine intervals of a given sequence, within reason (but don't merge a whole chromosome if you get one interval from one end and one interval from the other end: maxgap=maxinsert=10000, mininsert=-10, maxsize=50000).
maxgap sets the maximum gap size for merging two adjacent intervals. If the target sequence for the two alignment intervals has a gap longer than maxgap letters between the two alignment intervals, they will be returned as separate intervals; otherwise they will be merged as a single alignment region.
maxinsert sets the maximum length of insert in the target sequence that allows to adjacent intervals to be merged as a single alignment region in the results.
mininsert is specifically for handling alignments that may have small ``cycles'' (due to slight inconsistencies in the reported alignment intervals, for example, if a portion of sequence can align at both the end of one interval or at the beginning of another, and the intervals are actually added to the NLMSA that way, then the start of the second interval will actually be before the stop of the first interval; this corresponds to a negative insert value). A mininsert value of zero (the default), prevents any such interval pairs from being merged. Giving a negative mininsert value will allow interval pairs whose insert value is greater than or equal to this value, to be merged.
maxsize: upper bound on maximum size for interval merging.
filterSeqs, if not None, should be a dict of sequences
used to filter the group-by analysis; i.e. only alignment intervals
containing these sequences are considered in the analysis. More
specifically, filterSeqs can be used to mask the group-by analysis
to a specific interval of a sequence, by having filterSeqs
return only the intersection between the interval it is passed as a key,
and the masking interval that it stores. If there is no overlap, it
must raise KeyError. The sequence.SeqFilterDict class
provides exactly this masking capability, i.e.
d = sequence.SeqFilterDict(someIntervals) overlap = d[ival] # RETURNS INTERSECTION BETWEEN ival AND someIntervals, OR KeyError
maxAlignSize if not None, sets a maximum size for filtering the resulting alignment regions. Regions larger than the specified size will be culled from the output.
pIdentityMin if not None, sets a minimum fractional sequence identity for filtering the resulting alignment regions. Regions with lower levels of identity will be clipped from the output. Specifically, within each region, the largest contiguous segment (possibly including indels, if permitted by maxgap and maxinsert) whose sequence identity is above the threshold will be returned (but only if it is larger than minAlignSize if set).
ivalMethod, if not None, allows the user to provide a Python function that performs interval grouping. Specifically it is called as ivalMethod(l, ns,msaSlice=self, **kwargs), where l is the list of intervals for NLMSASequence ns within the current slice msaSlice; all other args are passed as a dict in kwargs.
2. merge groups of sequences using "vertical" group-by rules. seqGroups: a list of one or more lists of sequences to group. If None, the whole set of sequences will be treated as a single group. Each group will be analyzed separately, as follows:
sourceOnly: output intervals will be reported giving only the corresponding interval on the source sequence; redundant output intervals (mapping to the same source interval) are culled. Has the effect of giving a single interval traversal of each group.
indelCut: for sourceOnly mode, do not merge separate intervals that the groupByIntervals analysis separated due to an indel).
minAligned: the minimum number of sequences that must be aligned to the source sequence for masking the output. Regions below this threshold are masked out; no intervals will be reported in these regions.
pMinAligned: the minimum fraction of sequences (out of the total in the group) that must be aligned to the source sequence for masking the output.
seqMethod: you may supply your own function for grouping. Called as seqMethod(bounds,seqs,**kwargs), where bounds is a sorted list of (ipos,isStart,i,ns,isIndel,(start,end,targetStart,targetEnd)) and seqs is a list of sequences in the group. Must return a list of (sourceIval,targetIval). See the docs.
| **kwargs) |
| **kwargs) |
| s1) |
| ) |
In addition to these standard dictionary methods, NLMSASlice provides several additional methods and attributes:
It also provides a graph interface to subset of the partial order alignment graph corresponding to this slice. For details, see NLMSASliceLetters, below.
| **kwargs) |
| **kwags) |
| maxgap=0, maxinsert=0, mininsert= 0, filterSeqs=None, mergeMost=False, maxsize=500000000, mergeAll=True, ivalMethod=None, pIdentityMin=None, minAlignSize=None, maxAlignSize=None,**kwargs) |
| seqIntervals,pIdentityMin=None,filterFun=None,**kwargs) |
| seq,m,pIdentityMin=None,minAlignSize=None,maxAlignSize=None,**kwargs) |
(srcStart,srcEnd,destStart,destEnd).
seq must be the destination sequence object (sliceable by the destination
interval coordinates). The conservation criteria and clipping are performed
using Seq2SeqEdge.conservedSegment().
| seqIntervals, sourceOnly=False, indelCut=False, seqGroups=None, minAligned=1, pMinAligned=0., seqMethod=None, **kwargs) |
| seq=None) |
| seq) |
This graph has the following methods:
| ) |
| ) |
| node) |
| ) |
| ) |
| ) |
Other, internal methods that regular users are unlikely to need:
| seq) |
KeyError if it is not
aligned here.
| node2) |
| ) |
However, one thing you should know is that for a sequence to be stored in a NLMSA, it needs to have a unique string identifier. NLMSASequence obtains a string identifier for the sequence in one of the following ways (in decreasing order of precedence): 1) the sequence ``object'' can itself just be a Python string, in which case that string is used as the identifier. 2) otherwise, the object should be a SeqPath instance. If it has a name attribute, that will be used as the identifier. 3) Otherwise, if it has a id attribute (which is present by default on sequence.Sequence objects), that will be used.
Pygr interface to sequence databases stored in FASTA, BLAST or relational databases.
The seqdb module provides a simple, consistent interface to sequence databases from a variety of different storage sources such as FASTA, BLAST and relational databases. Sequence databases are modeled (like other Pygr container classes) as dictionaries, whose keys are sequence IDs and whose values are sequence objects. Pygr sequence objects use the Python sequence protocol in all the ways you'd expect: a subinterval of a sequence object is just a Python slice (s[0:10]), which just returns a sequence object representing that interval; the reverse complement is just -s; the length of a sequence is just len(s); to obtain the actual string sequence of a sequence object is just str(s). Pygr sequence objects work intelligently with different types of back-end storage (e.g. relational databases or BLAST databases) to efficiently access just the parts of sequence that are requested, only when an actual sequence string is needed.
formatdb, fastacmd, blastall, and megablast.
RepeatMasker to mask out repetitive sequences from seeding alignments,
but to allow extension of alignments into masked regions.
If you are lacking one or more of these requirements, you can still install Pygr
and use all Pygr functionality that does not depend on the missing requirements.
If you try to use a function for which a requirement is missing, Pygr will raise
an appropriate exception (e.g. unable to run blastall).
from pygr.sequence import *
from pygr.seqdb import *
db = BlastDB('sp') # OPEN SWISSPROT BLAST DB
s = Sequence(str(db['CYGB_HUMAN'][40:-40]),'boo')
m = db.blast(s) # DO BLAST SEARCH
myg = db['MYG_CHICK']
for i in m[s][myg]:
print repr(i.srcPath),repr(i.destPath),i.blast_score,i.percent_id
Let's go through this example line by line:
KeyError). This edge information consists of a set of interval alignment relationships (described in detail below), which are printed out in this example.
Options for constructing a BlastDB:
| filepath=None,skipSeqLenDict=False,ifile=None,idFilter=None, blastReady=False,blastIndexPath=None,blastIndexDirs=None,**kwargs) |
blast or megablast, etc.
skipSeqLenDict prevents construction of a sequence length index file
(which will be named ``filepath.seqlen'') and a fast
sequence access file (which will be named ``filepath.pureseq'').
Setting this option to True can be useful if you either wish to
speed up initial opening of the BlastDB (note: construction of these indexes is
only a one-time event) or avoid the extra disk space required by these indexes.
Explanation: To facilitate the rapid creation of sequence objects (which requires the length of the sequence), it creates a sequence length index (as a Python shelve). This enables it to avoid actually loading the sequence string into memory each time a sequence object is created; instead it just looks up the sequence length. While this speeds up access to genomic sequence databases (where each sequence tends to be extremely long), this initial step may be slow for databases of short sequences. Setting skipSeqLenDict option to True, will prevent construction of this sequence length index.
ifile lets you open the database directly from a file object rather
than a filename. If you have a file object, you can pass it directly to BlastDB instead of a filepath. NB: the BlastDB() constructor will close ifile when it is done reading from the file object.
idFilter allows you to provide a function for re-mapping the FASTA sequence
identifiers read from the sequence file. This can be useful in the case of
NCBI FASTA files, since NCBI often treats the sequence ID as a ``blob'' into
which any number of database fields can be stuffed, rather than a true ID.
blastReady option specifies whether BLAST index files should be automatically
constructed (using formatdb). Note, if you attempt to run the blast()
method, it will automatically create the index files for you if they are missing.
blastIndexPath, if not None, specifies the path to the BLAST index
files for this database. For example, if the BLAST index files are
/some/path/foo.psd etc., then blastIndexPath='/some/path/foo'.
blastIndexDirs, if not None, specifies a list of directories in which
to search for and create BLAST index files. Entries in the list can be
either a string, or a function that takes no parameters and returns
a string path. A string value ``FILEPATH'' instructs it to use the
filepath of the FASTA file associated with the BlastDB.
The default value of this attribute on the BlastDB class is
['FILEPATH',os.getcwd,os.path.expanduser,pygr.classutil.default_tmp_path]
Useful methods:
| ) |
| ) |
| seq, al=None, blastpath='blastall', blastprog=None, expmax=0.001, maxseq=None, opts='', verbose=True) |
| seq, al=None, blastpath='megablast', expmax=1e-20, maxseq=None, minIdentity=None, maskOpts='-U T -F m', rmPath='RepeatMasker', rmOpts='-xsmall', opts='', verbose=True) |
| ) |
id = (~db)[seq] # GET IDENTIFIER FOR THIS SEQUENCE FROM ITS DATABASE
| filepath=None) |
formatdb program provided by NCBI, which must be in your
PATH for this method to work.
Useful attributes:
import coordinator
server = coordinator.XMLRPCServerBase(name,port=5020)
db = BlastDBXMLRPC('sp') # OPEN BlastDB AS USUAL, BUT WITH SUBCLASS
server['sp'] = db # ADD OUR DATABASE TO THE XMLRPC SERVER
server.serve_forever() # START SERVING XMLRPC REQUESTS, UNTIL KILLED.
| url,name) |
db = XMLRPCSequenceDB('http://leelab:5020','sp')
hbb = db['HBB_HUMAN'] # GET A SEQUENCE OBJECT FROM THE DATABASE...
seek()
to the correct location for any substring of the sequence. New in Pygr 0.4.
This was previously the default class for sequence objects returned from BlastDB,
but has been deprecated because we found that NCBI fastacmd was much too slow
and consumed enormous amounts of memory. BlastSequence relies on
fastacmd for ``fast'' access to individual sequence slices. The advantage is
that it only requires BLAST database files (produced by Pygr using formatdb),
whereas the new FileDBSequence requires a specially indexed sequence file
(constructed by default by BlastDB), which may be a disadvantage if you are low
on disk space.
BlastSequence has several optimizations for working with BLAST databases:
db, id and annotationType attributes.
e[:10] gets the slice representing
the first ten bases of the exon); such annotation slices can themselves
also be sliced. More generally, an annotation is itself a coordinate
system that can be sliced, negated (only for nucleotide sequence
annotations, to obtain the opposite strand), and have a length
(obtainable as usual via the builtin len() function).
nlmsa = cnestedlist.NLMSA('myAnnotDB','w', # STORE SEQ->ANNOT MAPPING AS AN ALIGNMENT
pairwiseMode=True,bidirectional=False)
for a in annoDB.itervalues(): # GET EACH ANNOTATION OBJ IN DATABASE
nlmsa.addAnnotation(a) # SAVE ALIGNMENT OF ITS SEQ INTERVAL TO THIS ANNOTATION
nlmsa.build() # CREATE FINAL INDEXES FOR THE ALIGNMENT DATABASE
s
as easily as
for a in nlmsa[s]: # FIND ANNOTATIONS THAT MAP TO s # DO SOMETHING...
e representing an exon annotation
might have attributes that link it
to its splice graph. for e2,splice in e.next.items() would iterate
through the list of exons it is connected to by a forward splice, etc.
| sliceDB,seqDB,annotationType=None,itemClass=AnnotationSeq,itemSliceClass=AnnotationSlice,itemAttrDict=None,sliceAttrDict=dict(),filename=None,mode='r') |
sliceDB, a database that takes an annotation ID as a key, and returns a slice information object with attributes that give the sequence ID and start/stop coordinates of the sequence interval representing the annotation, and any other information about the annotation. In general, any attribute on the slice information object, will also be accessible on the corresponding annotation object and slices derived from it.
You can give None as the sliceDB, in which case the
AnnotationDB will create one for you, either using an in-memory dictionary,
or by opening a Python shelve file if you provide the filename argument;
see below.
seqDB, a sequence database that takes a sequence ID as a key, and returns a sequence object.
annotationType should be a string identifier for the type of annotation. This will be propagated to all annotation objects / slices derived from this annotation database.
itemClass: the class to use for constructing an annotation object to be returned from the AnnotationDB.__getitem__. You can extend the behavior of annotation objects by subclassing AnnotationSeq. If the AnnotationDB participates in important schema relations, pygr.Data may add properties to the itemClass that implement its schema relations to other database containers. (See the reference docs on pygr.Data below for details).
itemSliceClass: the class to use for slices of annotation objects returned from the AnnotationDB.__getitem__. You can extend the behavior of annotation objects by subclassing AnnotationSlice. If the AnnotationDB participates in important schema relations, pygr.Data may add properties to the itemSliceClass that implement its schema relations to other database containers. (See the reference docs on pygr.Data below for details).
sliceAttrDict, a dictionary providing the attribute name aliases for attributes on annotation objects to access attributes or tuple values in the sliceInfo objects. The minimal required attributes are the sequence ID, start and stop coordinates in each object returned from sliceDB. For example,
sliceAttrDict = dict(id='chromosome',start='gen_start',stop='gen_stop')
s.chromosome,s.gen_start,s.gen_stop as the ID and interval
coordinates for each slice information object s. Note: the start,stop
coordinates should follow the SeqPath sign convention, i.e. positive
coordinates mean an interval on the positive strand, and negative coordinates
mean an interval on the negative strand (i.e. the reverse complement of
the positive strand. See the reference documentation on SeqPath above
for details).
If the sliceAttrDict (or sliceInfo object directly) provides a orientation attribute, it will be used to be change positive intervals to negative intervals if the orientation attribute is negative. This gives the user an alternative method to represent orientation: give all coordinates in positive orientation (positive integer values), and give an orientation attribute that is a negative value if the interval should be reversed (to negative orientation).
If a sliceAttrDict value is an integer, then it will not be treated as an attribute name, but instead will be used as an index, treating the sliceInfo object as a tuple. This makes it possible to use a sliceDB whose items are tuples. Here's an example:
exon_db = AnnotationDB(exon_slices, db,
sliceAttrDict=dict(id=0, orientation=3, # GIVE ATTR INTERFACE TO 2PLE
transcript_id=4, start=5, stop=6))
filename, if not None, indicates a Python shelve file to store the sliceDB info. It will be opened according to the mode argument; see the Python shelve docs for details. Note: if you write data to an AnnotationDB stored using a shelve, you must call its close() method to ensure that all data is saved to the Python shelve file!