2 Module Documentation
The following subsections provide details about how to use specific
modules of Pygr functionality.
2.1 Installation
Installing pygr is quite simple:
tar -xzvf pygr-0.3.tar.gz
cd pygr
python setup.py install
Once the test framework has completed successfully, the setup script
will install pygr into python's respective site-packages directory.
If you don't want to install pygr into your system-wide site-packages,
replace the "python setup.py install" command with
This will build pygr but not install it in site-packages.
Pygr contains several modules imported as follows:
from pygr import seqdb # IMPORT SEQUENCE DATABASE MODULE
If you did not install pygr in your system-wide site-packages, you
must set your PYTHONPATH to the location of your pygr build.
For example, if your top-level pygr source directory is PYGRDIR then
you'd type something like:
setenv PYTHONPATH PYGRDIR/build/lib.linux-i686-2.3
where the last directory name depends on your specific architecture.
2.2 sequence Module
Base classes for representing sequences and sequence intervals.
Pygr provides one base class representing both sequences and sequence intervals (SeqPath),
from which all sequence classes are derived (Sequence, SQLSequence, BlastSequence etc.).
In this section we document both the features of the base class, and ways to extend or
customize it by creating your own subclasses derived from SeqPath. The IntervalTransform
class represents a coordinate system mapping from one interval of a sequence, onto
another interval of the same or a different sequence.
This class provides the basic capabilities of a sliceable sequence or sequence interval,
widely used in Pygr. It tries to provide core operations on sequences in a highly
Pythonic way:
-
This method returns the entire sequence interval preceding this interval.
For example, if
exon is an interval of genomic sequence, then
exon.before()[-2:] is its acceptor splice site (i.e. the 2 nt immediately
before exon).
-
This method returns the entire sequence interval following this interval.
For example, if
exon is an interval of genomic sequence, then
exon.after()[:2] is its donor splice site, (i.e. the 2 nt immediately
after exon).
| sequence.Sequence( |
s, id) |
-
The Sequence class provides a SeqPath flavor that stores a sequence string
s and identifier id for this sequence.
from pygr import sequence
seq=sequence.Sequence('GPTPCDLMETQ','FOOG_HUMAN')
-
You can change the actual string sequence to a new string s
using the update method:
seq.update('TKRRPLEDKMNEPS')
-
returns DNA_SEQTYPE for DNA sequences,
RNA_SEQTYPE for RNA, and PROTEIN_SEQTYPE for protein.
-
returns the reverse complement of the sequence string s.
- id: the id attribute stores the sequence's identifier.
SeqPath follows Python slicing conventions (i.e. 0-based indexing, positive indexes
count forward from start, negative indexes count backwards from the sequence
end, and always s.start<s.stop).
Each SeqPath object has a number of attributes giving information about its
``location'':
- orientation: +1 if on the forward strand, or -1 if on the reverse strand.
- path: the top-level sequence object that this interval is part of, or self
if this object is its top-level (i.e. not a slice of a larger sequence). Note that
all forward intervals share the same path attribute, but reverse strand intervals
all have a path attribute that represents the entire reverse strand.
- pathForward: same as path, but always the forward strand sequence.
- start: start coordinate of the interval. NB: SeqPath stores coordinates
relative to the start of the forward strand. This is necessary for allowing
resizing of the top-level SeqPath; if coordinates were relative to the end of the
sequence, they would have to be recomputed every time the length of the sequence
changed. The main consequence of this is that coordinates for forward intervals
are always positive, whereas coordinates for reverse intervals are always
negative (i.e. following the Python convention
that negative coordinates count backwards
from the end, and the fact that the end of the reverse strand corresponds to
the start of the forward strand). NB2: if the SeqPath was originally created with
start=None, requesting its start attribute will force it to compute its start
coordinate, typically requiring a computation of the sequence length. In this
case, the start attribute will computed automatically by SeqPath.__getattr__().
- stop: end coordinate of the interval. The above comments for start
apply to stop. Note that for reverse intervals, a stop value of 0
means the end of the reverse strand (i.e. -1 is the last nucleotide of the
reverse strand, and 0 is one beyond the last nucleotide of the reverse strand).
- _abs_interval: a tuple giving the (start,stop) coordinates of the
interval on the forward strand corresponding to this interval (i.e. for a
forward interval, itself, or for a reverse interval, the interval that base-pairs
to it).
There are several methods and attributes you can override to extend or customize
the behavior of your own SeqPath-derived classes. Typically you will derive
either from the Sequence class, or in some cases from the SeqPath class.
| strslice( |
start, stop, useCache=True) |
-
called to get the string
sequence of the interval (start, stop). You can provide your own strslice()
method to customize how sequence is stored and accessed. For example,
SQLSequence.strslice() gets the sequence via a SQL query, and
BlastSequence.strslice() obtains it using the
fastacmd -L start,stop
UNIX shell command from the NCBI toolkit.
The optional useCache argument controls whether your strslice method
should attempt to get the sequence slice from its database cache (if any),
or, if false, only directly from its back-end storage (in the usual way
described above).
-
called to compute the length of the sequence. You can
customize this to provide an efficient length method for your particular
sequence storage. e.g. SQLSequence obtains it via a SQL query;
BlastSequence obtains it from a precomputed length index.
The default Sequence.__len__() method computes it from
len(self.seq), assuming that the sequence can be accessed
from the seq attribute.
-
if you want to monitor or intercept slicing
requests on your sequence, you can do so by providing your own getitem method.
See seqdb.BlastSequenceCache class for an example.
If the sequence object has a
db attribute, and that database object
it points to has an itemSliceClass attribute, SeqPath.__getitem__
will use that class to construct the subinterval object. Similarly,
if the sequence object has an annot attribute, and that annotation
object has a db attribute, again the itemSliceClass attribute
of that database will be used as the class to construct the subinterval
object. Otherwise it will
use SeqPath itself as the class for constructing the subinterval object.
Note: this itemSliceClass behavior applies not only to
sequence slices obtained via __getitem__, but also from all other
methods that return sequence slices, such as the following list:
before, after, __mul__, __neg__.
__add__, __iadd__.
-
get the sequence interval intersection of self and other.
-
get the sequence interval representing the opposite strand of self
i.e. the slice whose string value is the reverse complement of the string
value of self.
-
get the sequence interval union of self and other, i.e.
the smallest sequence interval that contains both of them.
-
if you subclass a SeqPath-derived class and supply a __getattr__
method for your subclass, it must call the parent class's
__getattr__. This is essential for ``delayed evaluation'' of
start and stop attributes, which are generated automatically
by SeqPath's __getattr__. If your subclass inherits from
more than one parent class, check whether both parents supply a
__getattr__, in which case your subclass must supply a
__getattr__ that explicitly calls both of them. Failing to do so
will lead to strange bugs.
- seq: the Sequence.strslice() method assumes that
the actual sequence is stored
on the seq attribute. You could customize this behavior by
making the seq attribute a property that is computed on the fly
by some method of your own.
This class provides a mapping transform between the coordinate
systems of a pair of intervals.
xform=IntervalTransform(srcPath,destPath)
d2=xform(s2) # MAPS s2 FROM srcPath coords to destPath coord system
d3=xform[s2] # CLIPS s2 TO NOT EXTEND OUTSIDE srcPath, THEN XFORMS
s3=xform.reverse(d3) # MAP BACK TO srcPath COORD SYSTEM
This class represents a segment of alignment between two sequences.
It is a temporary object created in association with a MSASlice
object (see Alignment Module below).
| __init__( |
msaSlice, targetPath, sourcePath=None) |
-
Create a Seq2SeqEdge for the targetPath, on the specified alignment
slice. If sourcePath is None, it will be calculated automatically
by calling the slice's methods.
| __iter__( |
sourceOnly=True, **kwargs) |
-
iterate over source intervals within this segment of alignment.
kwargs will be passed on to the msaSlice's
groupByIntervals and groupBySequences methods.
-
same as __iter__, but gets tuples of (source_interval,target_interval).
| pIdentity( |
mode=max,trapOverflow=True) |
-
Compute the percent identity between the source and target sequence
intervals in this segment of the alignment. mode controls
the method used for determining the denominator based on the lengths of
the two aligned sequence intervals. trapOverflow controls
whether overflow (due to multiple mappings of the query sequence to
different regions of the alignment) is trapped as an error.
To turn off such error trapping, set trapOverflow=False.
| pAligned( |
mode=max,trapOverflow=True) |
-
Compute the percent alignment between the source and target sequence
intervals in this segment of the alignment, i.e. the fraction of
residues that are actually aligned as opposed to gaps / insertions,
in the two intervals.
| conservedSegment( |
pIdentityMin=.9,minAlignSize=1,mode=max) |
-
Return the longest alignment interval (possibly including gaps) with
a %identity fraction higher than pIdentityMin. If there is no
such interval, or the longest such interval
is shorter than minAlignSize, it returns None. The interval
is returned as a tuple of integers
(srcStart,srcEnd,destStart,destEnd).
Warning: if your query sequence has multiple mappings in the alignment
(i.e. it is aligned to two or more different regions in the alignment),
pIdentity() and pAligned() may return fractions larger
than 1.0. This is because the query sequence may align to a given
target sequence via more than one region in the alignment. If you
encounter this problem, you can iterate through the individual mappings
yourself (by calling the iter(), items() or
edges() iterator methods for your alignment slice object),
and calculating the percentage identity or alignment (via your own algorithm)
individually for each specific mapping. For more
background on this problem, see ``Multiple Mappings'', below.
Note that the presence of multiple mappings is not a Pygr bug,
but simply reflects the alignment data loaded into Pygr. Seq2SeqEdge
should be able to avoid this problem mostly, beginning with release 0.6.
(It tries to screen out hits not consistent with the specific region-region
mapping stored with this edge).
This dict-like class provides a simple way for masking a set of sequences
to specific intervals. It stores a specific interval for each
sequence. Subsequent look-up using a sequence interval as a key will
return the intersection between that interval and the stored interval
for that sequence in the dictionary. If there is no overlap, it
raises KeyError.
d=SeqFilterDict(seqIntervalList)
overlap=d[ival] # RETURNS INTERSECTION OF ival AND STORED IVAL, OR KeyError
You can pass a list of intervals to store to the class constructor (as
shown above). You can also add a single interval using the syntax
d[saveInterval]=saveInterval. (This syntax reflects the actual
mapping that the dictionary will perform if later called with the
same interval).
This class represents an edge from origin -> target node.
-
iterate over seqpos for sequences that traverse this edge.
-
generate origin, target seqpos for sequences that traverse this edge.
-
return origin,target seqpos for sequence seq;
raise
KeyError if not in this edge
- seqs: returns its sequences that traverse this edge
The sequence module also provides convenience functions:
-
based on the letter composition of
the string s, returns DNA_SEQTYPE for DNA sequences,
RNA_SEQTYPE for RNA, and PROTEIN_SEQTYPE for protein.
| absoluteSlice( |
seq, start, stop) |
-
returns the sequence interval of top-level sequence object associated
with seq, interpreting start and stop according to
the Pygr convention: a pair of positive values represents an interval
on the forward strand; a pair of negative values represents an
interval on the reverse strand (see Coordinate System, above).
Note: if seq is itself a subinterval, then the start,stop
coordinates are interpreted relative to its parent sequence, i.e.
seq.pathForward[start:stop].
| relativeSlice( |
seq, start, stop) |
-
returns the sequence interval of seq, interpreting
start and stop according to
the Pygr convention: a pair of positive values represents an interval
on the forward strand; a pair of negative values represents an
interval on the reverse strand (see Coordinate System, above).
Note: if seq is itself a subinterval, then the start,stop
coordinates are interpreted relative to seq itself, i.e.
seq[start:stop].
2.3 Alignment Module
Pygr interface to sequence alignment, and scalable storage for multigenome alignments
Pygr provides a general model for interfacing with any kind of sequence alignment,
and also a uniquely scalable storage system for working with huge multiple sequence
alignments such as multigenome alignments. Specifically, it lets you work with
an alignment both in the traditional Row-Column model (each row is a sequence, each
column is a set of individual letters from different sequences, that are aligned;
we will refer to this as the RC-MSA model), and also
as a graph structure (known as a Partial Order Alignment, which we will refer to as
the PO-MSA model). This supports ``traditional'' alignment analysis, as well
as graph-algorithms, and even graph query of alignments.
This model has a few basic classes:
- MSA: this class represents an entire alignment. It acts as a graph whose
nodes are sequences (or sequence intervals) that are aligned, and whose edges
represent specific alignment relationships between specific pairs of sequences
(or intervals). Specifically, it acts as a dictionary whose keys are SeqPath
objects, and whose values are MSASlice objects (representing an alignment segment
associated with a specific SeqPath, see below for details). For example, to find
out what's aligned to some sequence interval s1:
for s2 in msa[s1]: # GET ALL INTERVALS s2 ALIGNED TO s1 IN msa
do_something(s1,s2)
In addition, its letters attribute acts as a graph interface
to the Partial Order alignment (PO-MSA) representation of the alignment. I.e.
it is a graph whose nodes each represent a set of individual letters from
different sequences, that are aligned to each other, and whose edges connect
pairs of nodes that are ``adjacent'' to each other in at least one sequence.
Specifically, it acts as a dictionary whose keys are MSANode objects (see below),
and whose edges are LetterEdge objects (see previous section).
for node in msa.letters: # GET ALL ALIGNMENT ``COLUMNS'' IN msa
for l in node: # GET ALL INDIVIDUAL SEQ LETTERS ALIGNED HERE
say_something(node,l)
- MSASlice: this class represents a segment of alignment associated with
a specific sequence interval (s1). It acts as dictionary whose keys are sequence
intervals s2 aligned to s1, and whose values are MSASeqEdge objects
that represent the alignment relationship between s1
s2. It also
has a letters attribute, that represents the subgraph of nodes
associated specifically with s1, and the edges that interconnect them.
myslice=msa[s1]: # GET SLICE ALIGNED TO s1 IN msa
for node in myslice.letters: # GET ALL ALIGNMENT ``COLUMNS'' FOR s1
for l1,l2,e in node.edges(): # GET INDIVIDUAL LETTERS ALIGNED TO l1 OF s1
whatever(l1,l2,e)
This class also has a regions method that generates all the alignment
interval relationships in this slice according to ``grouping'' criteria such
as maximum permissible gap length, etc. (i.e. any region of alignment containing
no gaps larger than a specified size would be returned as a single region,
whereas any gap larger than the specified size would split it into two separate
regions). This provides a general interface for group-by operations in alignment
query.
- MSASeqEdge: this class represents a relationship between a pair of
sequence intervals s1 and s2 (SeqPath objects). It provides a mapping between
subintervals of s1
s2. I.e. it acts as a dictionary
that accepts subintervals of s1 as keys, and maps them to aligned
subintervals of s2. It also
has a letters attribute, that represents the subgraph of nodes
associated specifically with them, and the edges that interconnect the nodes.
- MSANode: this class represents a specific ``column'' in the alignment
that aligns a set of individual letters from different sequences. This
corresponds to a node in the PO-MSA representation of the alignment.
It acts as a dictionary whose keys are sequence intervals (typically only
one letter long) aligned in this column, and whose values are MSASeqEdges
representing the alignment of that letter to the column (see above).
Pygr provides a highly scalable storage mechanism for working with
multi-genome alignments. One fundamental challenge in working with
very large alignments is the interval overlap query problem:
to obtain a portion of an alignment (defined by some interval of
interest) requires finding all interval elements in the ``alignment
database'' that overlap the query interval. Since the intervals
can be indexed by start (or end) position, one can typically find the
first overlapping element in
time, where
is the total
number of intervals in the database. The problem is that since
standard index structures cannot index both start and end,
to obtain all intervals that overlap the query interval, one must scan
forwards (or backwards) from that point. Furthermore, one cannot stop
at the first non-overlapping interval; there might be an extremely long
interval at the very beginning of the index, that extends to overlap
the query interval. In this case, one would have to scan the entire
database (
time) to guarantee that all overlapping intervals are
found.
The nested list data structure solves this problem, by moving
any interval in the database that is contained in another interval
out of the top-level interval list, into the sublist of the
parent interval. Based on this, one can prove that one can stop
the scanning operation at the first non-overlapping interval (i.e.
the overlapping intervals in any list form a single contiguous block).
Overall, this reduces the query time to
, where
is
the number of intervals in the database that actually overlap the
query (i.e. results to return). Moreover, the nested list data structure
can be implemented very well both in computer memory (RAM) or as indexed
disk files. Pygr's disk-based cnestedlist database can complete
a typical interval query of the 26GB UCSC 8 genome alignment in
about 60 microseconds, compared with 10-30 seconds per query using
MySQL.
Multi-genome alignments take traditional models of alignment to an
entirely different scale, and inevitably many of the assumptions of
standard row-column multiple sequence alignment are broken (e.g.
no inversions; no cycles; etc.). One major issue that users should be
aware of in UCSC multi-genome alignments is the possibility of
multiple mappings, in which a given query sequence interval is
mapped to two or more different regions of the alignment (and thus potentially
to two or more different locations in a given target genome). Currently,
UCSC multi-genome alignment are typically based on a single
reference genome, to which all other genomes are aligned. While
a given region of the reference genome might be guaranteed to have
a unique mapping in the UCSC multi-genome alignment, other genomes
do not appear to have any such guarantee: a region in any of those genome
can have multiple mappings. This is problematic for several reasons:
- It introduces ambiguity in the alignment: you don't know which of the
multiple hits is considered to be the ``right'' alignment; the UCSC alignment
file does not tell you.
- There is no scoring information to resolve this ambiguity. In a way,
this situation is even worse than the common situation we previously faced
in search for alignment mappings using BLAST, because (unlike BLAST) the
MAF alignment does not give a score that indicates which mapping is best.
(We haven't seen such scoring information; if it can be recovered for these
alignment files, we'd be love to know about that...).
- It can cause ``buggy'' results in calculations based on the alignment.
For example, Pygr's pIdentity() and pAligned() computations
can give values larger than one when a query region has multiple hits. This
is not, strictly speaking, a Pygr bug: the query region is mapped by the MAF
file to the same target region multiple times, resulting in multiple
overlaps.
If you encounter multiple mappings, you can always iterate over them one
by one, and perform your own computations for each one. However, to avoid them
altogether, you can restrict your queries to the reference genome for this specific
alignment (UCSC offers different versions of each alignment set, each based on
a different reference genome).
Top-level object representing an entire multiple sequence alignment,
stored using a set of disk-based nested list interval databases.
The alignment is stored as an interval representation of a
linearized partial order (LPO), using nested list
databases. This has several elements:
- PO-MSA: Conceptually, the alignment is represented as a partial order alignment
(PO-MSA), in which aligned sequence intervals are fused together as a single
``node'' in the alignment graph; two nodes are connected by an edge if and only
if they are adjacent in at least one of the sequences aligned to them
(i.e. if residue i of that sequence is in the first node, and
residue i+1 is in the second node, then there is a directed edge
from the first PO-MSA node to the second node).
- LPO: This alignment graph is partially ordered. Let's define an
ordering relation ``i<j'' to mean ``there exists a path
of directed edges from i to j''. For two
letters i and j in a sequence, i<j XOR j<i (i.e. all
nodes have an ordering relationship). By contrast, if two nodes in the LPO
represent insertions in different sequences, then NOT i<j AND NOT j<i.
Thus there can be some nodes in the LPO that have no ordering relationship
with respect to each other. It is still possible to map the PO-MSA onto
a linear coordinate system (i.e. to ``linearize'' the partial order): as long
as the graph contains no cycles, we can map the nodes i,j,k,... of the graph
onto a linear coordinate system x,y,z,... such that for any pair of
nodes i,j mapped to coordinates x<y, we assert NOT j<i. This is
called the linearized partial order (LPO). This maps the PO-MSA onto
a standard Row-Column MSA format, where the LPO coordinate (just an integer
sequence 0,1,2...) can be considered the index value of each alignment column.
- nested list: The actual alignment data are stored in the form of
(start,stop) pairs representing aligned intervals. Since this representation
uses intervals, not individual letters, it takes no more memory to store
an alignment of two 100 kb regions than it does to align two individual letters.
This is important for scalable storage (and query) of large multi-genome
alignments. (Each alignment interval takes 24 bytes: five int for
the (start,stop) pairs and target sequence ID, plus one int
for the sublist ID).
These interval databases are stored using nested lists. Specifically,
the alignment is stored as 1) a mapping of each aligned sequence interval
onto an LPO coordinate interval; 2) a reverse mapping of each LPO interval onto
all the sequence intervals that are aligned there. To find the alignment of
a sequence interval onto the other sequences in the alignment, that interval
is first mapped onto the LPO, and from there mapped back to intervals in the
other sequences. A nested list database is stored for each of these
mappings (i.e. for an alignment of N sequences, there will be N+1
nested list databases to store the MSA). Furthermore, if the size of the LPO
coordinate system (i.e. number of columns in its RC-MSA format)
grows larger than the range representable by int (typically
= 2 GB),
the LPO will have to be split into separate nested list databases of a size
smaller than the maximum range representable by int. This is necessary
for handling alignments of large genomes (e.g. the human genome is approximately 3 GB).
Pygr takes care of all this for you automatically. Note, as an entirely separate
issue, that Pygr's cnestedlist
module uses the long long data type for file offsets and
the fseeko() POSIX interface for large file support (i.e. 64-bit
file sizes), which is supported by current versions of Linux, Mac OS X, etc;
otherwise, check if your filesystem supports this.
This functionality is encapsulated in the NLMSA class, which has a number of methods
and attributes.
Construction Methods:
| NLMSA( |
pathstem='', mode='r', seqDict=None, mafFiles=None, maxOpenFiles=1024, maxlen=None, nPad=1000000, maxint=41666666,trypath=None,bidirectional=True,use_virtual_lpo= -1,maxLPOcoord=None) |
-
Constructor for the class. pathstem specifies a path and filename prefix for
the NLMSA files (since multiple files are used to store one NLMSA, it will automatically add a
number of suffixes automatically to open the necessary set of files for the NLMSA).
mode is either ``r'' to open an existing NLMSA (from the pathstem disk files);
``w'' to create a new one (which will be saved to the pathstem disk files);
or ``memory'' to create a new in-memory NLMSA (i.e. stored in your computer's RAM
instead of using files on your hard disk). Obviously, this limits you to
the amount of RAM in your computer, but will make the NLMSA much, much faster.
seqDict specifies a dictionary which maps sequence names to actual sequence
objects representing those sequences. If seqDict is None, the constructor
will look for a file pathstem.seqDict that is a prefixUnionDict
header file for opening all the sequence database files for you automatically.
This header file will itself specify a list of sequence database files; the
trypath option, if provided, specifies a list of directories in which to look for these
sequence database files.
The bidirectional option indicates whether you wish the NLMSA to
save each alignment relationship A:B in both possible directions
(i.e. nlmsa[A] will yield B, and nlmsa[B] will yield A). In general, the
bidirectional=True mode is most appropriate for true multiple sequence
alignments, i.e. where it is guaranteed that for a given pair of sequences A,B
each interval of A maps to a unique interval in B, which in turn maps back
to the same interval of A (and only that interval in A). There are
many possible scenarios where you might prefer bidirectional=False mode:
- When you WANT your alignment to have a specific directionality. For example,
if
nlmsa is a mapping of the human genome sequence onto the mouse genomic
sequence, then nlmsa[s] should only yield a result if s is a human
genome sequence interval; a mouse genome sequence interval should raise a KeyError.
- When the input alignment data may contain multiple, inconsistent alignments of
a given pair of sequences. For example, a BLAST all-vs-all will return TWO alignments
of A,B: one when A is blasted against the database (finding B), and another when
B is blasted against the database (finding A). These two alignments could be different!
In this case, a bidirectional=True alignment would return BOTH alignments
(i.e.
nlmsa[A] will return TWO alignments of B, which might be identical...
or might be significantly different). This is undesirable behavior. Instead,
use bidirectional=False so that nlmsa[A] will simply return the
alignments that were found when A was blasted against the database.
- In general, using bidirectional=True can yield multiple, potentially
inconsistent results when the input data are not a true multiple-sequence alignment
(e.g. BLAST alignment data is strictly pairwise, not a true multiple-sequence alignment).
use_virtual_lpo=True indicates a PAIRWISE sequence alignment, in which
the stored alignment relationships each consist of a pair of sequence intervals
that are aligned. Note: this pairwise format can store the alignment of any
number of sequences, but the key point is that the individual alignment relations
are pairwise, sequence-to-sequence. The opposite model (use_virtual_lpo=False)
indicates a true MULTIPLE sequence alignment, in which the stored alignment
relationships each consist of an integer coordinate interval (the alignment's internal
coordinate system, for technical reasons called the ``LPO'') and a sequence
interval that is aligned to it. Under normal circumstances, you will not need
to specify a value for the use_virtual_lpo option; the NLMSA will infer
the correct setting automatically based on the input data. Note: the pairwise format
(use_virtual_lpo=True) and multiple alignment format (use_virtual_lpo=False)
cannot be mixed in a single NLMSA. It must be either one format or the other.
mafFiles can be used to specify a list of
filenames storing a multiple sequence alignment in the UCSC MAF format.
maxOpenFiles limits the open file descriptors the NLMSA will use.
This option is no longer of much importance. In versions prior to pygr 0.5,
however, it was important because each sequence in the alignment had its
own index file (in v.0.5 and later this problem is solved by unionization;
for details see below). Since
each sequence has a separate nested list database file, a large multi-genome alignment
(with each genome containing 20 different chromosomes, say) can rapidly open a large
number of file descriptors. Note: NLMSA only opens a given sequence's nested list database
when one of your queries actually requires access to that sequence; it then
keeps that file descriptor open to make subsequent queries to it fast. If the number
of open file descriptors would exceed maxOpenFiles, it will close other open
database files, which may slow down query performance (due to having to open and close
databases repeatedly to process queries).
maxlen specifies the maximum coordinate
value for an LPO database. Its default value is 2GB, to prevent int overflow.
Using a smaller value can be useful, to 1) limit the size of the LPO in memory
during initial construction, and 2) to limit the size of LPO database files on disk
(if for example, your file system does not support files above some maximum size).
During initial construction of the NLMSA (from MAF files or user-specified interval
alignments), the algorithm performs a one-pass sort of the LPO intervals. Thus,
this set of intervals is briefly held in RAM for this sort. If you have insufficient
RAM, the construction step may raise a MemoryError. If so, you can avoid this problem
by using a smaller maxlen value.
The maxint option provides another way of limiting the size of LPO
databases. It specifies the maximum number of intervals to store per database.
Since each interval takes 24 bytes, the default setting limits each LPO to
a total size of 1 GB. Note that the current NLMSA construction algorithm
requires loading each database index into memory as one-time operation
during construction. If your NLMSA build fails due to running out of memory,
simply reduce this value.
The nPad option sets the maximum number of LPO coordinate systems
(specifically, the offset for the start of real sequence IDs in the NLMSA
sequence index). You are unlikely to need to change this default value.
-
As part of constructing an alignment, adds sequence to the alignment graph,
so that you can subsequently save specific alignments of intervals of
sequence, using code like
nlmsa[s]+=s2, where s is
an interval of sequence and s2 is some other sequence interval.
If sequence had not been added to the alignment, this later operation
will raise a KeyError.
| addAnnotation( |
annotation) |
-
adds an alignment relationship to annotation from its underlying
sequence interval. Note: to use this, the NLMSA must have been created with the
use_virtual_lpo=True option.
| __getitem__( |
seqInterval) |
-
prepare to store an alignment relationship for the sequence interval seqInterval,
i.e. get a BuildMSASlice object representing seqInterval, to which you can
then add other sequence intervals to align them. I.e.
nlmsa[s1]+=s2
saves the alignment of intervals s1 and s2.
You can also use a regular Python slice object using integer indices
ie. nlmsa[1:45], in which case, it indicates that
region of the LPO coordinate system.
If the sequence containing
interval s2 is not already in the NLMSA, it will be added for you automatically
(i.e. creating the necessary indexing, nested list database files, etc.). In this
case, the sequence must supply a unique string identifier, which will be used
on subsequent attempts to open the NLMSA database, to match the individual sequence
nested-list databases against corresponding sequence objects (using seqDict,
see above).
| build( |
buildInPlace=True) |
-
to construct the final nested list databases,
after all the desired alignment intervals have been saved (using the
iadd/getitem above). This method
simply calls the build() method on all the constituent NLMSASequence objects
in this alignment. NOTE: you do not need to call build() if
you provided a mafFiles constructor argument, since that automatically
calls build().
buildInPlace=False forces it to use an older NLMSA construction method
(higher memory usage, but more tested). The new in-place construction method
(made the default in release 0.7) is described in the Alekseyenko & Lee 2007
paper published in Bioinformatics.
Alignment Usage Methods:
-
get the alignment slice for the sequence interval s1,
i.e. get an NLMSASlice object representing the set of intervals aligned to s1.
You can also use a regular Python slice object using integer indices
ie.
nlmsa[1:45], in which case, it gets the NLMSA slice corresponding to that
region of the LPO coordinate system.
-
If you subclass NLMSA and provide a doSlice method, the NLMSA will
call your doSlice(seq) method to find alignment results for
seq,
instead of querying its stored alignment data. You can thus use this
to provide an NLMSA interface around virtually any source of alignment information
that you have. To see an example, see the xnestedlist.NLMSAClient class.
- seqs: This attribute provides a dictionary of the sequences in
the NLMSA, whose keys are top-level sequence objects, and whose values are
the associated NLMSASequence object for each sequence. Ordinarily you will have
no need to access the NLMSASequence object directly; only do so if you know what
you're doing (details below). This dictionary is of type NLMSASeqDict (see below).
These two functions enable you to dump a constructed NLMSA binary database
to a platform-independent text format, and to restore an NLMSA binary database
from this text format. This can be useful for
- speeding up the process of installing an NLMSA database on multiple
machines. Since the restore operation does not involve a build step, it
can be substantially faster than building the NLMSA separately on each machine.
- moving an NLMSA database from one machine to a machine with a different
binary architecture. Since the binary database format depends on platform-specific
details (e.g. big-endian vs. little-endian integer representation), it is not
compatible between different architectures.
- using an NLMSA database on a machine that has insufficient RAM memory
to perform the binary database build. You can build the NLMSA binary database
on another machine with sufficient RAM, dump it to text, then restore it on
the desired machine where you wish to be able to use it.
- using the text format to ``package'' an NLMSA database for distribution
on the Internet. Users need only to obtain a single file and run a single command
to restore the NLMSA database. Users only need sufficient disk space to hold
the NLMSA; they do not need large amounts of RAM (because they will not have to
perform a ``build'' step).
| dump_textfile( |
pathstem,outfilename=None) |
-
Dumps a text representation of an existing NLMSA binary database.
pathstem must be the path to the NLMSA. For
example if you have an NLMSA database index file
/loaner/hg17_NLMSA/hg17_msa.idDict
(and many other index files with different suffixes),
then you would supply a pathstem value of /loaner/hg17_NLMSA/hg17_msa.
outfilename gives the path for the output text file into which the
NLMSA database will be dumped. If None, it will default to pathstem with a
.txt suffix added.
| textfile_to_binaries( |
filename) |
-
Creates an NLMSA binary database from input text file filename.
The NLMSA binary database will be created in the current directory,
and will be given the same name as it originally had prior to being dumped to text.
Since no build is required, this function does not require significant amounts
of RAM memory.
These two classes, provided by the separate xnestedlist module,
provide an XMLRPC client-server mechanism for querying NLMSA databases
over a network.
NLMSAServer is constructed exactly the same as a normal NLMSA;
it is a normal NLMSA with just two methods added for serving XMLRPC client
requests. See the coordinator.XMLRPCServerBase reference
documentation below for details about starting an XMLRPC server.
NLMSAClient provides a read-only client interface for querying
data in a remote NLMSAServer. It takes two extra arguments for
its constructor: url, the URL for the XMLRPC server; name,
the name of the NLMSAServer server object in the XMLRPC server's dictionary.
For example, to use an NLMSA stored on a remote XMLRPC server,
assuming that myPrefixUnion stores a dictionary of all the
sequence databases used by that NLMSA alignment, would just be:
from pygr import xnestedlist
nlmsa=xnestedlist.NLMSAClient(url='http://leelab.mbi.ucla.edu:5000',
name='ucsc17',seqDict=myPrefixUnion)
A temporary object created on-the-fly to represent (an interface to provide
information about) the portion of the alignment associated with a specific
sequence interval. This is the main class for querying information about
alignments, and provides a number of useful methods for getting
detailed information about alignment relationships.
In addition, the NLMSASlice is the basic unit of sequence caching
control, by which you can ensure that pygr alignment analysis accesses
sequence databases in the most efficient way. Here's how it works:
- When you perform an NLMSA query by creating an NLMSASlice, it assembles
a list of covering intervals for all sequences in this part of the alignment
(i.e. for each sequence, the smallest interval that contains all of its
aligned intervals in this NLMSASlice).
- NLMSASlice then attempts to call the
cacheHint method for each
sequence database object containing the relevant sequences (if this method
exists; if it doesn't, this step is skipped). It passes the cacheHint method
the covering interval information for the aligned sequence, and a reference to
itself (the NLMSASlice object) as the owner of this cache hint.
- If any operation subsequently attempts to access the actual sequence
for any interval that is contained within this covering interval, the sequence
database will instead load the entire covering interval, which it stores in
its cache, associated with the specified owner. It then returns the
appropriate subinterval of sequence requested, as usual.
- Any subsequent requests for sequence strings that fall within this
covering interval will simply be obtained from this cache, instead of
retrieving the sequence from disk files.
- This cache information is retained until the owner (in this case,
the original NLMSASlice) is deleted (by Python garbage collecting). Thus, to
control sequence caching, all you have to do is hold on to the NLMSASlice as
long as you want to work with its associated sequence intervals. As soon as
you drop it, its associated cache information will also be automatically deleted,
freeing up memory.
An NLMSASlice acts like a dictionary whose keys are
sequence intervals that are aligned to this region, and whose values are
Seq2SeqEdge objects providing detailed information about the alignment of
the target interval (key) to the source interval (the sequence interval
used to create the NLMSASlice in the first place). You can use this
dictionary interface in several ways:
-
iterates over all sequence intervals that have
a 1:1 mapping (i.e. a block of alignment containing no indels) to
all or part of the source interval.
| keys( |
maxgap=0, maxinsert=0, mininsert= 0, filterSeqs=None, mergeMost=False, mergeAll=False, maxsize=500000000, minAlignSize=None, maxAlignSize=None, pIdentityMin=None, ivalMethod=None, sourceOnly=False, indelCut=False, seqGroups=None, minAligned=1, pMinAligned=0., seqMethod=None, **kwargs) |
-
Provides a more general interface than iter(), with two types of
group-by capabilities, ``group-by'' operations on the alignment intervals
contained within this slice (``horizontal'' grouping),
and on the sets of sequences aligned
to this slice (``vertical'' grouping).
1. ``group-by'' operations on the alignment intervals
contained within this slice. It allows the user to supply
various parameters for controlling when alignment intervals will be
merged or split in the results that it returns.
mergeAll
forces it to combine intervals of a given sequence irrespective
of the size of gaps or inserts separating them.
mergeMost
forces it to combine intervals of a given sequence, within reason
(but don't merge a whole chromosome if you get one interval from one end
and one interval from the other end:
maxgap=maxinsert=10000, mininsert=-10, maxsize=50000).
maxgap sets the
maximum gap size for merging two adjacent intervals. If the target sequence
for the two alignment intervals has a gap longer than maxgap
letters between the two alignment intervals, they will be returned as
separate intervals; otherwise they will be merged as a single alignment
region.
maxinsert sets the maximum length of insert in the target
sequence that allows to adjacent intervals to be merged as a single alignment
region in the results.
mininsert is specifically for handling
alignments that may have small ``cycles'' (due to slight inconsistencies
in the reported alignment intervals, for example, if a portion of sequence
can align at both the end of one interval or at the beginning of another, and
the intervals are actually added to the NLMSA that way, then the start
of the second interval will actually be before the stop of
the first interval; this corresponds to a negative insert value). A
mininsert value of zero (the default), prevents any such interval
pairs from being merged. Giving a negative mininsert value will allow
interval pairs whose insert value is greater than or equal to this value,
to be merged.
maxsize: upper bound on maximum size for interval merging.
filterSeqs, if not None, should be a dict of sequences
used to filter the group-by analysis; i.e. only alignment intervals
containing these sequences are considered in the analysis. More
specifically, filterSeqs can be used to mask the group-by analysis
to a specific interval of a sequence, by having filterSeqs
return only the intersection between the interval it is passed as a key,
and the masking interval that it stores. If there is no overlap, it
must raise KeyError. The sequence.SeqFilterDict class
provides exactly this masking capability, i.e.
d=sequence.SeqFilterDict(someIntervals)
overlap=d[ival] # RETURNS INTERSECTION BETWEEN ival AND someIntervals, OR KeyError
minAlignSize if not None, sets a minimum size for filtering the resulting
alignment regions. Regions smaller than the specified size will be culled
from the output.
maxAlignSize if not None, sets a maximum size for filtering the resulting
alignment regions. Regions larger than the specified size will be culled
from the output.
pIdentityMin if not None, sets a minimum fractional sequence identity
for filtering the resulting alignment regions. Regions with lower levels
of identity will be clipped from the output. Specifically, within each
region, the largest contiguous segment (possibly including indels, if
permitted by maxgap and maxinsert) whose sequence identity is above the
threshold will be returned (but only if it is larger than minAlignSize
if set).
ivalMethod,
if not None, allows the user to provide a Python function that performs
interval grouping. Specifically it is called as
ivalMethod(l, ns,msaSlice=self, **kwargs), where l is the
list of intervals for NLMSASequence ns within the current slice
msaSlice; all other args are passed as a dict in kwargs.
2. merge groups of sequences using "vertical" group-by rules.
seqGroups: a list of one or more lists of sequences to group.
If None, the whole set of sequences will be treated as a single group.
Each group will be analyzed separately, as follows:
sourceOnly: output intervals will be reported giving only
the corresponding interval on the source sequence; redundant
output intervals (mapping to the same source interval) are
culled. Has the effect of giving a single interval traversal
of each group.
indelCut: for sourceOnly mode, do not merge separate
intervals that the groupByIntervals analysis separated due to an indel).
minAligned: the minimum number of sequences that must be aligned to
the source sequence for masking the output. Regions below
this threshold are masked out; no intervals will be reported
in these regions.
pMinAligned: the minimum fraction of sequences (out of the
total in the group) that must be aligned to the source
sequence for masking the output.
seqMethod: you may supply your own function for grouping.
Called as seqMethod(bounds,seqs,**kwargs), where
bounds is a sorted list of
(ipos,isStart,i,ns,isIndel,(start,end,targetStart,targetEnd))
and seqs is a list of sequences in the group.
Must return a list of (sourceIval,targetIval). See the docs.
-
same keys as iter, but for each provides the source interval
to target interval mapping (Seq2SeqEdge).
Uses same group-by arguments as keys().
-
same interval mappings as iteritems, but for
each provides a tuple of three objects:
the source interval, the corresponding target interval,
and the Seq2SeqEdge providing detailed
information about the alignment between the source and target intervals
(such as percent identity, etc.).
Uses same group-by arguments as keys().
-
treats s1 as a key (target sequence
interval), and returns an Seq2SeqEdge object providing detailed
information about the alignment between this target interval
and the source interval.
-
returns the number of distinct sequences that
are aligned to the source interval. Note: this is NOT necessarily
equal to the number of items that will be returned by the above iterators,
since a single target sequence might have multiple 1:1 intervals of
alignment to the source interval, due to indels.
In addition to these standard dictionary methods, NLMSASlice provides
several additional methods and attributes:
-
this method provides a way to perform group-by operations on the slice;
the output of split() is one or more NLMSASlice objects; if the
group-by analysis results in no splitting of the current slice, then
it is returned unchanged (i.e. the method just returns self).
Uses same group-by arguments as keys().
For further details on group-by operations, see keys() above.
-
performs the same group-by analysis as split(), but replaces
the source interval by the corresponding interval in the LPO. The main
practical consequence of this is that target sequence inserts
are included in the resulting slice (because they are present in the LPO
interval corresponding to the original source interval), whereas they
were NOT included in the original slice (because they are not aligned
to the source interval). The main place where this matters is in graph
traversal of the slice's letters attribute: whereas the nodes
and edges corresponding to these inserts are not considered to be part
of the letters graph for the original slice, they are part of the
LPO slice. Also, the ``source interval'' in any subsequent operations
with the LPO slice will be LPO coordinates instead of subintervals of the
original source sequence interval.
Uses same group-by arguments as keys().
| groupByIntervals( |
maxgap=0, maxinsert=0, mininsert= 0, filterSeqs=None, mergeMost=False, maxsize=500000000, mergeAll=True, ivalMethod=None, pIdentityMin=None, minAlignSize=None, maxAlignSize=None,**kwargs) |
-
This method performs the interval grouping analysis for all the iterators
described above. Users will not need to call it directly. Its arguments
are described above (see keys()). It returns a dictionary
whose keys are sequences aligned to this slice (represented by their integer nlmsa_id),
and whose values are
the list of intervals produced by the group-by analysis for the corresponding
sequence. The values are tuples of the form
(source_start, source_stop, target_start, target_stop), showing the
mapping of a source sequence interval onto a target sequence interval.
This dictionary is the primary input to the groupBySequences()
method below.
| filterIvalConservation( |
seqIntervals,pIdentityMin=None,filterFun=None,**kwargs) |
-
This method is used by groupByIntervals() to filter the results
using the specified filterFun filter function, which should either
return None if the specified alignment region does not pass the filter,
or return the filtered interval. For an example
filter function, see conservationFilter, which is used by default
in filterIvalConservation. seqIntervals must be passed in
the same format as expected by groupBySequences; it is modified in
place by filterIvalConservation, which always returns None.
| conservationFilter( |
seq,m,pIdentityMin=None,minAlignSize=None,maxAlignSize=None,**kwargs) |
-
Tests an alignment mapping m for the specified size and sequence
identity criteria. Returns the (possibly clipped) interval m if
the criteria are met, and None if the criteria are not met. m
is expected to be a tuple of integers
(srcStart,srcEnd,destStart,destEnd).
seq must be the destination sequence object (sliceable by the destination
interval coordinates). The conservation criteria and clipping are performed
using Seq2SeqEdge.conservedSegment().
| groupBySequences( |
seqIntervals, sourceOnly=False, indelCut=False, seqGroups=None, minAligned=1, pMinAligned=0., seqMethod=None, **kwargs) |
-
This method performs the sequence grouping analysis for all the iterators
described above. seqIntervals must be a dictionary of sequences
and their associated list of intervals (produced by groupByIntervals()
above). It returns a list of output sequence intevals, which is either
a list of source sequence intervals (sourceOnly mode), or a list
of tuples of the form (source_interval, target_interval).
| matchIntervals( |
seq=None) |
-
this method returns the set of
1:1 match intervals for the target sequence seq (or all
aligned sequences, if seq is None), as a dictionary
whose keys are target sequence intervals, and whose values are
the corresponding source sequence intervals to which they are
aligned.
-
returns the largest possible interval of
seq that is aligned to this slice, i.e. it merges all
alignment intervals in this slice containing seq, and
returns the merged sequence interval based on the minimum start
value and maximum stop value found.
represents the letters graph of a specific NLMSASlice. It is
a graph whose nodes are the NLMSANode objects in this slice, and whose
edges are sequence.LetterEdge objects. Note: currently the edge objects
are just returned as None - please implement!
This graph has the following methods:
-
generates all the nodes in the slice, in order from left to right.
-
also iteritems(). Generate the same set of nodes as above,
as keys, but for each also returns a value representing its outgoing
directed edges (see getitem, below).
-
gets a dictionary indicating all the outgoing
directed edges from node to subsequence nodes, whose keys are
the target nodes, and whose edges are the
sequence.LetterEdge objects representing each edge.
A temporary object (created on-the-fly)
representing a single letter ``column'' in the alignment. It acts like
a container of the sequence letters aligned to the source sequence in
this column. It has the following methods:
-
generates all the individual sequence letters
(as SeqPath intervals, presumably of length 1) that are aligned to
the source sequence, in this column of the alignment.
-
generates the same list of of target sequence letters as
the iterator, but as a tuple of (target letter, source letter, edge).
Currently, edge is just None.
-
returns the number of distinct sequences aligned to
the source interval, in this column.
Other, internal methods that regular users are unlikely to need:
-
returns the sequence interval of seq
that is aligned to this column, or raises
KeyError if it is not
aligned here.
-
returns a dictionary of sequences
that traverse the edge directly from this node to node2,
i.e. if letter i of seq is aligned to this node, then
letter letter i+1 is aligned to node2. The
dictionary's keys are top-level sequence objects, and its
value for each is the letter position index i as defined above.
-
returns a dictionary of the outgoing edges
from this node, whose keys are target nodes, and whose values
are the corresponding edge objects (of type sequence.LetterEdge).
You are unlikely to need to manipulate NLMSASequence objects directly;
they perform the back-end work for accessing the nested list disk storage
of the alignment of the associated sequence.
However, one thing you should know is that for a sequence to be stored
in a NLMSA, it needs to have a unique string identifier.
NLMSASequence obtains a string identifier for the sequence in one of the following
ways (in decreasing order of precedence): 1) the sequence ``object'' can itself just
be a Python string, in which case that string is used as the identifier. 2) otherwise,
the object should be a SeqPath instance. If it has a name attribute, that will
be used as the identifier. 3) Otherwise, if it has a id attribute (which is present
by default on sequence.Sequence objects), that will be used.
2.4 Seqdb Module
Pygr interface to sequence databases stored in FASTA, BLAST or relational databases.
The seqdb module provides a simple, consistent interface to sequence databases from a variety of different storage sources such as FASTA, BLAST and relational databases. Sequence databases are modeled (like other Pygr container classes) as dictionaries, whose keys are sequence IDs and whose values are sequence objects. Pygr sequence objects use the Python sequence protocol in all the ways you'd expect: a subinterval of a sequence object is just a Python slice (s[0:10]), which just returns a sequence object representing that interval; the reverse complement is just -s; the length of a sequence is just len(s); to obtain the actual string sequence of a sequence object is just str(s). Pygr sequence objects work intelligently with different types of back-end storage (e.g. relational databases or BLAST databases) to efficiently access just the parts of sequence that are requested, only when an actual sequence string is needed.
This module makes use of several external programs:
- NCBI toolkit: The BLAST database functionality in this module
requires that the NCBI toolkit
be installed on your system. Specifically, some functions will call the command line
programs
formatdb, fastacmd, blastall, and megablast.
- RepeatMasker: the BlastDB.megablast() method calls the command line
program
RepeatMasker to mask out repetitive sequences from seeding alignments,
but to allow extension of alignments into masked regions.
- Python DB-API 2.0: the SQLTable class, and dependent classes such as
SQLSequence and StoredPathMapping, conform to the Python DB-API 2.0.
Typically you must supply a DB-API 2.0-compliant database cursor to the
SQLTable constructor. To do so, you must have some DB-API 2.0-compliant
module (such as MySQLdb) installed for connecting to a database server.
If you are lacking one or more of these requirements, you can still install Pygr
and use all Pygr functionality that does not depend on the missing requirements.
If you try to use a function for which a requirement is missing, Pygr will raise
an appropriate exception (e.g. unable to run blastall).
Interface to an existing BLAST database or FASTA sequence file; in the latter case, it will automatically construct BLAST database files for you using the NCBI tool formatdb. Here's a simple example of opening a BLAST database and searching it for matches to a specific piece of sequence:
from pygr.seqdb import *
db=BlastDB('sp') # OPEN SWISSPROT BLAST DB
s=NamedSequence(str(db['CYGB_HUMAN'][40:-40]),'boo')
m=db.blast(s) # DO BLAST SEARCH
myg=db['MYG_CHICK']
for i in m[s][myg]:
print repr(i.srcPath),repr(i.destPath),i.blast_score,i.percent_id
Let's go through this example line by line:
- construction of a BlastDB object: This looks for either a FASTA file with the path 'sp' or BLAST database formatted files based on this path (e.g. 'sp.psd' for protein sequences, or 'sp.nsd' for nucleotide sequences).
- db['CYGB_HUMAN'] obtains a sequence object representing the SwissProt sequence whose ID is CYGB_HUMAN. The slice operation [40:-40] behaves just like normal Python slicing: it obtains a sequence object representing the subinterval omitting the first 40 letters and last 40 letters of the sequence. The str() operation obtains the actual string representation of this subinterval.
- NamedSequence(letter_string, name) creates a new sequence object whose sequence is letter_string, and whose ID is name.
- Running the db.blast(s) method searches the BLAST database for homologies to s, using NCBI BLAST. It chooses reasonable parameters based upon the sequence types of the database and supplied query. However, you can specify extra parameter options if you wish. It returns a Pygr sequence mapping (multiple alignment) that represents a standard Pygr graph of alignment relationships between s and the homologies that were found.
- The expression m[s][myg] obtains the "edge information" for the graph relationship between the two sequence nodes s and myg. (if there was no edge in the m graph representing a relationship between these two sequences, this would produce a
KeyError). This edge information consists of a set of interval alignment relationships (described in detail below), which are printed out in this example.
Options for constructing a BlastDB:
| BlastDB( |
filepath=None,skipSeqLenDict=False,ifile=None,idFilter=None,
blastReady=False) |
-
Open a sequence file as a ``database'' object, giving the user access to its sequences,
easy searching via
blast or megablast, etc.
skipSeqLenDict prevents construction of a sequence length index file
(which will be named ``filepath.seqlen'') and a fast
sequence access file (which will be named ``filepath.pureseq'').
Setting this option to True can be useful if you either wish to
speed up initial opening of the BlastDB (note: construction of these indexes is
only a one-time event) or avoid the extra disk space required by these indexes.
Explanation: To facilitate the rapid creation of sequence objects (which requires the length of the sequence), it creates a sequence length index (as a Python shelve). This enables it to avoid actually loading the sequence string into memory each time a sequence object is created; instead it just looks up the sequence length. While this speeds up access to genomic sequence databases (where each sequence tends to be extremely long), this initial step may be slow for databases of short sequences. Setting skipSeqLenDict option to True, will prevent construction of this sequence length index.
ifile lets you open the database directly from a file object rather
than a filename. If you have a file object, you can pass it directly to BlastDB instead of a filepath. NB: the BlastDB() constructor will close ifile when it is done reading from the file object.
idFilter allows you to provide a function for re-mapping the FASTA sequence
identifiers read from the sequence file. This can be useful in the case of
NCBI FASTA files, since NCBI often treats the sequence ID as a ``blob'' into
which any number of database fields can be stuffed, rather than a true ID.
blastReady option specifies whether BLAST index files should be automatically
constructed (using formatdb). Note, if you attempt to run the blast()
method, it will automatically create the index files for you if they are missing.
Useful methods:
-
iterate over all IDs in the BLAST database.
-
returns number of sequences in the BLAST database.
| blast( |
seq,al=None,blastpath='blastall',blastprog=None,expmax=0.001,maxseq=None) |
-
run a BLAST search on sequence object seq. Maxseq will limit the number of returned hits to the best maxseq hits.
| megablast( |
seq,al=None,blastpath='megablast',expmax=1e-20,maxseq=None,minIdentity
=None,maskOpts='-U T -F m',rmOpts='') |
-
first performs repeat masking on the sequence by converting repeats to lowercase,
then runs megablast with command line options to prevent seeding new alignments
within repeats, but allowing extension
of alignments into repeats. minIdentity should be a number (maximum value, 100)
indicating the minimum percent identity for hits to be returned.
-
The invert operator (~, the ``tilde'' character)
enables reverse-mapping of sequence objects to their string ID.
id=(~db)[seq] # GET IDENTIFIER FOR THIS SEQUENCE FROM ITS DATABASE
Useful attributes:
- itemClass: the object class to use for instantiating new sequence objects from this BLAST database. You can set this to create customized sequence behaviors.
This is used by pygr.Data to propagate correct attribute schemas to
items / slices from database containers managed by it.
- itemSliceClass: the object class to use for instantiating new sequence slice objects (i.e. subintervals of sequences from this BLAST database). You can set this to create customized sequence behaviors.
This is used by pygr.Data to propagate correct attribute schemas to
items / slices from database containers managed by it.
A subclass of BlastDB that adds a couple methods needed to serve
the data to clients connecting over XMLRPC. For example, to make an XMLRPC
server for a blast database, accessible on port 5020:
import coordinator
server=coordinator.XMLRPCServerBase(name,port=5020)
db=BlastDBXMLRPC('sp') # OPEN BlastDB AS USUAL, BUT WITH SUBCLASS
server['sp']=db # ADD OUR DATABASE TO THE XMLRPC SERVER
server.serve_forever() # START SERVING XMLRPC REQUESTS, UNTIL KILLED.
Class for a client interface that accesses a Blast database over
XMLRPC (from the the BlastDBXMLRPC acting as the server).
-
url must be the URL (including port number) for accessing the
XMLRPC server; name must be the key of the BlastDBXMLRPC object
in that server's dictionary (in the example above, it would be 'sp').
Thus to access the server above (assuming it is running on leelab port 5020):
db=XMLRPCSequenceDB('http://leelab:5020','sp')
hbb=db['HBB_HUMAN'] # GET A SEQUENCE OBJECT FROM THE DATABASE...
Currently, this class provides sequence access. You can work with sequences
exactly as you would with a BlastDB, but cannot perform actual BLAST searches
(i.e. the blast and megablast methods don't work over XMLRPC).
The default class for sequence objects returned from BlastDB. It provides efficient,
fast access to sequence slices (subsequences). When the BlastDB is initially opened,
Pygr constructs a length and offset index that enables FileDBSequence to seek()
to the correct location for any substring of the sequence. New in Pygr 0.4.
This was previously the default class for sequence objects returned from BlastDB,
but has been deprecated because we found that NCBI fastacmd was much too slow
and consumed enormous amounts of memory. BlastSequence relies on
fastacmd for ``fast'' access to individual sequence slices. The advantage is
that it only requires BLAST database files (produced by Pygr using formatdb),
whereas the new FileDBSequence requires a specially indexed sequence file
(constructed by default by BlastDB), which may be a disadvantage if you are low
on disk space.
BlastSequence has several optimizations for working with BLAST databases:
- it uses the NCBI tool fastacmd to retrieve sequence efficiently from a BLAST database, when your program requests an actual string of sequence text. Moreover, for subintervals (slices) of the sequence, it uses fastacmd's -L option to request just the desired subinterval of the sequence, rather than the whole sequence. This makes it efficient for requesting specific intervals of large genomic contigs. Basically, just use Python slicing and str() methods on sequence objects, and subsequences will be obtained in an efficient manner.
- the len() method is implemented using the seqLenDict, a precalculated index of the sequence lengths. So again no sequence has to be read by Python.
This class provides a general interface for sequence annotation databases.
This interface follows several principles:
- An annotation object acts both like a sequence interval
(representing the region of sequence that it annotates), and as an object
with annotation attributes that provide further information or relations
for that annotation. For example, an object
e representing an exon annotation
would act both like the genomic sequence interval that constitutes that exon
(i.e. you can use any of the usual SeqPath methods to get information
about that sequence interval), but also might have attributes that link it
to its splice graph. E.g. str(e) would return the actual
nucleotide sequence of the exon; for e2,splice in e.next.items() would iterate
through the list of exons it is connected to by a forward splice, etc.
- An annotation object always has an annot attribute that
points to the complete annotation. Since an annotation object is a sequence interval,
it can itself be sliced (e.g.
e[:10] gets the slice representing
the first ten bases of the exon); such annotation slices also have
an annot attribute, which will simply point to the annotation
object representing the complete exon (i.e. just e itself).
To check whether an annotation e is the complete annotation
(as opposed to just a slice of part of the annotated region), use the
logical test e.annot is e.
- The mapping of an annotation object to the sequence region it
represents is trivial, i.e. the object itself behaves as that sequence interval.
The reverse mapping (for any region of sequence, find the annotation(s)
that map to that region) is best performed by creating an NLMSA alignment
object and saving the mapping as follows:
nlmsa=cnestedlist.NLMSA('myAnnotDB','w', # STORE SEQ->ANNOT MAPPING AS AN ALIGNMENT
use_virtual_lpo=True,bidirectional=False)
for a in annoDB.itervalues(): # GET EACH ANNOTATION OBJ IN DATABASE
nlmsa.addAnnotation(a) # SAVE ALIGNMENT OF ITS SEQ INTERVAL TO THIS ANNOTATION
nlmsa.build() # CREATE FINAL INDEXES FOR THE ALIGNMENT DATABASE
Later you can get the list of annotations in some sequence interval s
as easily as
for a in nlmsa[s]: # FIND ANNOTATIONS THAT MAP TO s
# DO SOMETHING...
| __init__( |
sliceDB,seqDB,itemClass=AnnotationSeq,itemSliceClass=AnnotationSlice,itemAttrDict=None,sliceAttrDict=dict()) |
-
Constructs an annotation database using several arguments:
sliceDB, a database that takes an annotation ID as a key, and returns
a slice information object with attributes that give the sequence ID and start/stop
coordinates of the sequence interval representing the annotation.
seqDB, a sequence database that takes a sequence ID as a key, and
returns a sequence object.
itemClass: the class to use for constructing an annotation object
to be returned from the AnnotationDB.__getitem__. You can extend the
behavior of annotation objects by subclassing AnnotationSeq.
If the AnnotationDB participates in important schema relations,
pygr.Data may add properties to the itemClass that implement
its schema relations to other database containers. (See the reference
docs on pygr.Data below for details).
itemSliceClass: the class to use for slices of annotation
objects returned from the AnnotationDB.__getitem__. You can extend the
behavior of annotation objects by subclassing AnnotationSlice.
If the AnnotationDB participates in important schema relations,
pygr.Data may add properties to the itemSliceClass that implement
its schema relations to other database containers. (See the reference
docs on pygr.Data below for details).
sliceAttrDict, a dictionary providing the attribute name aliases
for attributes on annotation objects to access attributes or tuple values
in the sliceInfo objects. The minimal required attributes are the
sequence ID, start and stop coordinates in each object returned from sliceDB.
For example,
sliceAttrDict=dict(id='chromosome',start='gen_start',stop='gen_stop')
would make it use s.chromosome,s.gen_start,s.gen_stop as the ID and interval
coordinates for each slice information object s. Note: the start,stop
coordinates should follow the SeqPath sign convention, i.e. positive
coordinates mean an interval on the positive strand, and negative coordinates
mean an interval on the negative strand (i.e. the reverse complement of
the positive strand. See the reference documentation on SeqPath above
for details).
If the sliceAttrDict (or sliceInfo object directly) provides a orientation
attribute, it will be used to be change positive intervals to negative intervals
if the orientation attribute is negative. This gives the user an alternative
method to represent orientation: give all coordinates in positive orientation
(positive integer values), and give an orientation attribute that
is a negative value if the interval should be reversed (to negative orientation).
If a sliceAttrDict value is an integer, then it will not be treated as an
attribute name, but instead will be used as an index, treating the sliceInfo
object as a tuple. This makes it possible to use a sliceDB whose
items are tuples. Here's an example:
exon_db = AnnotationDB(exon_slices, db,
sliceAttrDict=dict(id=0, orientation=3, # GIVE ATTR INTERFACE TO 2PLE
transcript_id=4, start=5, stop=6))
-
Get the annotation object with primary key id. This annotation object
is both a sequence interval (representing the region of sequence that it
annotates, e.g. for an exon, the region of genomic sequence that constitutes
that exon), and also an annotation (i.e. it may have additional attributes
from the slice information object, that give useful information about this
annotation).
-
Returns a sequence interval object representing this annotation region.
The only difference (versus the annotation object itself) is that the
sequence interval object will not have an annot attribute or
any other attributes that would mark it as an annotation. This is mainly
important for NLMSA alignments, which treat annotation objects differently
(as an annotation) from sequence objects.
-
Returns True if this is the complete annotation, False if just a fragment.
-
Returns True if on same strand as original annotation, False if on reverse strand
This class provides an empty sequence object that
acts purely as a reference system.
Automatically elongates if slice extends beyond current stop.
This class avoids setting the stop attribute, taking advantage
of SeqPath's mechanism for allowing a sequence to grow in length.
s=VirtualSeq('FOOG_HUMAN')
len(s) # ONLY 1 LETTER LONG BY DEFAULT
s1=s[100:215] # GET A SLICE OF THIS SEQUENCE
len(s) # NOW IT'S 215
The associated VirtualSeqDB class provides a ``sequence database''
that returns a VirtualSeq object for every identifier requested of
it. It acts like a Python dictionary:
db=VirtualSeqDB()
s=db['FOOG_HUMAN'] # ASK FOR A SEQUENCE BY ITS IDENTIFIER
s1=s[100:215] # GET A SLICE OF THIS SEQUENCE
For a given identifier it always returns the same VirtualSeq
object (i.e. the object returned from the first request for that identifier).
In other words, if the identifier was previously requested,
it returns the VirtualSeq for that identifier; if not, it
creates a new one.
This can be convenient when performing operations that just
need a coordinate reference system, not actual sequence.
This class acts as a wrapper for a set of dictionaries, each
of which is assigned a specific string ``prefix''. It provides
a dictionary interface that accepts string keys of the form
``prefix.suffix'', and returns d['suffix'] where d is
the dictionary associated with the corresponding prefix. This
is useful for providing a unified ``database interface'' to a
set of multiple databases.
hg17=BlastDB('/usr/tmp/ucsc_msa/hg17')
mm5=BlastDB('/usr/tmp/ucsc_msa/mm5')
... # LOAD A BUNCH OF OTHER GENOMES TOO...
genomes={'hg17':hg17,'mm5':mm5, 'rn3':rn3, 'canFam1':cf1, 'danRer1':dr1,
'fr1':fr1, 'galGal2':gg2, 'panTro1':pt1} # PREFIX DICTIONARY FOR THE UNION
# OF ALL OUR GENOMES
genomeUnion=PrefixUnionDict(genomes)
ptChr7=genomeUnion['panTro1.chr7'] # GET CHIMP CHROMOSOME 7
if 'panTro1.chr5' in genomeUnion: # CHECK IF THIS ID IN OUR UNION
pass # DO SOMETHING...
s= -(ptChr7[1000:2000]) # GET A BIT OF THIS SEQUENCE
if s in genomeUnion: # THIS IS HOW TO CHECK IF s DERIVED FROM OUR UNION
pass # DO SOMETHING...
It provides a __contains__ method that tests whether
a given sequence object is derived from the PrefixUnionDict
(see example above). Here are some additional methods:
| __init__( |
prefixDict=None,separator='.',filename=None,dbClass=BlastDB) |
-
You can create a PrefixUnionDict either using
a prefixDict (whose keys are string prefixes, and whose
values are sequence databases), or using a previously created
header file filename.
Using the header file, the constructor will
automatically open all the sequence databases for you.
When opening from a header file, you can also specify a
dbClass to be used for opening individual sequence databases
listed in the header file; the default is BlastDB.
The database class constructor must take a single argument,
which is the filepath for opening the database. The
separator character is used to form ``prefix.suffix''
identifiers.
-
The invert operator (~, the ``tilde'' character)
enables reverse-mapping of sequence objects to their string ID.
This is the recommended way to get the ``fully qualified sequence ID'', i.e. with
the appropriate prefix prepended.
id=(~db)[seq] # GET PROPERLY PREFIXED-IDENTIFIER FOR THIS SEQUENCE
For a given sequence object seq derived from the union
(or a slice of a sequence from the union), return a string identifier
in the form of ``foo.bar''.
-
This method is deprecated; instead use the __invert__ operator
above.
| writeHeaderFile( |
filename) |
-
Save a header file for this union, to reopen later.
It saves the separator character, and a list of prefixes
and filepaths to the various sequence databases (which
must have a filepath attribute). This header
file can be used for later reopening the prefix-union
in a single step.
-
Returns a new member dictionary for testing membership in
the distinct prefix groups. See PrefixUnionMemberDict.
| cacheHint( |
owner,ivalDict) |
-
Communicates a set of caching hints to the appropriate member
databases. ivalDict must be a dictionary whose keys are
sequence ID strings, and whose values are each a (start,stop) tuple
for the associated covering interval coordinates to
cache for each sequence. owner should be a python object
whose existence controls the lifetime of these cache hints.
When owner is garbage-collected by Python (due to its
reference count going to zero), the member databases will clear
these cache hints from their cache storage.
On PrefixUnionDict, this method simply passes along
the cache hints to the appropriate member databases by calling
their cacheHint method, without itself doing anything
to cache the information.
Implements membership testing on distinct prefix groups. Specifically,
you can bind a given prefix to a value
then test whether a given object k is a member of any of the
prefix groups in the dictionary:
v=d[k] # raises KeyError if k not a member of 'prefix1' or other prefix group in d
| __init__( |
puDict,default=None,attrMethod=lambda x:x.pathForward.db) |
-
puDict must be a PrefixUnionDict, whose prefix groups constitute the
allowed possible key groups for this membership dictionary. default
provides a default value to apply to any key whose prefix has not been explicitly
given a value in this dictionary. If no default is set, this dictionary
will raise a
KeyError for any key whose prefix has not been
explicitly given a value in this dictionary.
attrMethod specifies a method for obtaining
the actual prefix container object from a given member key object. The default
attrMethod treats the key as a sequence object and tries to determine what
database container it is from.
-
Returns an iterator for the key values (prefix strings) that are allowed for
this dictionary, obtained from the bound PrefixUnionDict.
Provides the interface to the inverse mapping of the PrefixUnionDict.
-
Returns the fully-qualified string ID for sequence object k.
Properly handles both sequence annotation object and regular sequence
objects.
Adds the capability of automatically adding new sequence databases to the
PrefixUnionDict, if needed. This is implemented by extending
the standard __getitem__ method:
-
Returns the fully-qualified string ID for sequence object k.
Properly handles both sequence annotation object and regular sequence
objects. If sequence object k is from a sequence database that
is not in the PrefixUnionDict, it will be automatically added
to the prefixUnion, if the prefixUnion has an addAll attribute
set to True; if not, a
KeyError is raised.
This is used in the standard NLMSA write mode 'w'
to allow users to add sequences to the alignment without having to
previously add the sequence databases containing those sequences,
to the prefixUnion for the NLMSA.
This class is deprecated; the FileDBSequence class and associated
database container caching mechanisms provide a more powerful mechanism
that is intended to replace BlastSequenceCache.
Implements a variant of BlastSequence designed to merge and cache requests for local intervals of sequence so that repeated accesses to these regions are bundled and cached for efficiency. You work with sequence objects of this type normally, using Python slicing to obtain subintervals, and str() to get the sequence string for a subinterval. But behind the scenes, it does two things:
- all slicing operations are recorded, in the form of a cache of superintervals. Overlapping or adjacent intervals are merged into a superinterval up to a maximum superinterval size (default 20000). It will automatically create as many superintervals as needed to cover the requested subinterval slices. Each superinterval is represented by an object of the FastacmdIntervalCache class.
- when the sequence string of a subinterval is requested, the cache actually retrieves (and caches) the entire superinterval containing that subinterval. Fastacmd only needs to be called once for this superinterval. Subsequent subinterval string requests that fall within this cached superinterval are simply returned directly from the cache, without calling fastacmd.
Implements a subclass inheriting from SQLRow and NamedSequenceBase, to use a relational database table to obtain the actual sequence. There are three minor variants DNASQLSequence, RNASQLSequence, ProteinSQLSequence (so that the sequence does not have to analyze itself to determine what kind of sequence it is). Its constructor takes the same arguments as SQLRow(table, id), where table is the SQLTable object representing the table in which the sequence is stored, and id is the primary key of the row representing this sequence. However, normally this class is simply pas