Pygr is open source software designed to make it easy to do powerful sequence and comparative genomics analyses, even with extremely large multi-genome alignments.
Sequences and alignments also can be modeled as graph structures in Pygr, providing the same consistent and simple framework for queries.
>>> from pygr.sequence import *
>>> s = Sequence('attatatgccactat','bobo') #create a sequence named bobo
>>> s # interpreter will print repr(s)
bobo[0:15]
>>> t = s[-8:] #python slice gives last 8 nt of s
>>> t # interpreter will print repr(t)
bobo[7:15]
>>> str(t) #string conversion just yields the sequence as a string
'gccactat'
>>> rc = -s #get the reverse complement
>>> str(rc[:5]) #its first five letters
'atagt'
Several points:
Many groups (e.g. David Haussler's group at UC Santa Cruz) have constructed alignments of multiple genomes. These alignments are extremely useful and interesting, but so large that it is cumbersome to work with the dataset using conventional methods. For example, for the 17-genome alignment you have to work simultaneously with the individual genome datasets for human, chimp, mouse, rat, dog, chicken, fugu and zebrafish etc., as well as the huge alignment itself. Pygr makes this quite easy. Here we illustrate an example of mapping a set of human exons, which has two splice sites
(ss1 and ss2) bracketing a single exon (exon).
We use the alignment database to map each of these splice sites onto all the aligned
genomes, and to print the percent-identity and percent-aligned for each genome,
as well as the two nucleotides consituting the splice site itself.
It also prints the conservation of the two exonic region (between ss1
and ss2:
import pygr.Data # FINDS DATA WHEREVER IT'S REGISTERED
msa = pygr.Data.Bio.Seq.MSA.ucsc17() # SANTA CRUZ 17-GENOME ALIGNMENT
exons = pygr.Data.Leelab.ASAP2.hg17.exons() # ASAP2 HUMAN EXONS
idDict = ~(msa.seqDict) # INVERSE: MAPS SEQ --> STRING IDENTIFIER
def printConservation(id,label,site):
for src,dest,edge in msa[site].edges(mergeMost=True):
print '%d\t%s\t%s\t%s\t%s\t%s\t%2.1f\t%2.1f' \
%(id,label,repr(src),src,idDict[dest],dest,
100*edge.pIdentity(),100*edge.pAligned())
for id,exon in exons.iteritems():
ival = exon.sequence # GET THE SEQUENCE INTERVAL FOR THIS EXON
ss1 = ival.before()[-2:] # GET THE 2 NT SPLICE SITES
ss2 = ival.after()[:2]
cacheHint = msa[ss1+ss2] #CACHE THE COVERING INTERVALS FROM ss1 TO ss2
printConservation(id,'ss1',ss1)
printConservation(id,'ss2',ss2)
printConservation(id,'exon',ival)
A few notes:
printConservation().
msa is the database; site is the interval query; and the
edges methods iterates over the results, returning a tuple for
each, consisting of a source sequence interval (i.e. an interval of
site), a destination sequence interval (i.e. an interval in
an aligned genome), and an edge object describing that alignment.
We are taking advantage of Pygr's group-by operator mergeMost,
which will cause multiple intervals in a given sequence to be merged
into a single interval that constitutes their ``union''. Thus,
for each aligned genome, the edges iterator will return a single
aligned interval. The alignment edge object provides some useful
conveniences, such as calculating the percent-identity between src
and dest automatically for you. pIdentity() computes
the fraction of identical residues; pAligned computes the
fraction of aligned residues (allowing you to see if there are
big gaps or insertions in the alignment of this interval). If we
had wanted to inspect the detailed alignment letter by letter, we
would just iterate over the letters attribute instead of
the edges method. (See the NLMSASlice documentation for
further information).
results = msa[site].edges(maxgap=1,maxinsert=1,
minAlignSize=14,pIdentityMin=0.9)
We can use this same idea to search for regions of ``deep conservation''. Here we search the UCSC alignment of 17 vertebrate genomes for regions of 90% identity or better that are at least 40 nt long, and then screen for a zone in which at least nine different genomes all share this level of alignment with the human query:
>>> ival = nlmsa.seqDict['hg17.chr1'][7000:8000] # 1 kb REGION OF HUMAN CHROMOSOME 1 >>> for x,y,e in nlmsa[ival].edges(minAligned=9,minAlignSize=40,pIdentityMin=0.9): ... print "%s\t%s\n%s\t%s\n" % (x,repr(x),y,(~(nlmsa.seqDict))[y]) ... GTGTTGAAGAGCAGCAAGGAGCTGAC chr1[7480:7506] GTGTTGAACAGCAGTAAAGAGCTGAC danRer3.chr18 GTGTTGAAGAGCAGCAAGGAGCTGAC chr1[7480:7506] GTGTTGAAGAGCAGCAAGGAGCTGAC dasNov1.scaffold_107966 GTGTTGAAGAGCAGCAAGGAGCTGAC chr1[7480:7506] GTGTTGAAAAGGAGCAAGGAGCTGAC xenTro1.scaffold_1073 GTGTTGAAGAGCAGCAAGGAGCTGAC chr1[7480:7506] GTGTTGAAGAGCAGCAGGGAGCTGAG galGal2.chr1 GTGTTGAAGAGCAGCAAGGAGCTGAC chr1[7480:7506] GTGTTGAAGAGCAGCAAGGAGCTGAC panTro1.chr1 GTGTTGAAGAGCAGCAAGGAGCTGAC chr1[7480:7506] GTGTTGAAGAGCAGCAGGGAGCTGAC bosTau2.chr5 GTGTTGAAGAGCAGCAAGGAGCTGAC chr1[7480:7506] GTGTTGAATAGCAGCAACGAGCTGAC canFam2.chr27 GTGTTGAAGAGCAGCAAGGAGCTGAC chr1[7480:7506] GTGTTGAAGAGCAGCAAGGAGCTGAG monDom2.scaffold_31 GTGTTGAAGAGCAGCAAGGAGCTGAC chr1[7480:7506] GTGTTGAAGAGCAGCAAGGAGCTGAC loxAfr1.scaffold_5603
src and dest print the first two nucleotides
of the site in human and in the aligned genome.
pureseq text format, or (before release 0.5) NCBI's fastacmd),
not the cnestedlist database. Basically, each genome is being queried
as a separate sequence database, represented in Pygr by the
BlastDB class. Pygr makes this complex set of multi-database
operations more or less transparent to the user.
For further information, see the BlastDB documentation.
exon) is an annotation object. To get the actual
sequence interval corresponding to this annotation, we simply request
the annotation object's sequence attribute.
exon.sequence must itself be a slice of a sequence in our alignment,
or the alignment query msa[site] will raise an KeyError informing
the user that the sequence site is not in the alignment.
ss1+ss2.
This obtains a single interval that covers both of the input intervals.
msa[site]), it infers that you are likely to look
at the sequence intervals that are contained within this slice of the alignment
(i.e. within the region aligned to site). It sets ``caching hints'' on the
associated sequence databases, recording for each aligned sequence
the covering interval coordinates (i.e. the smallest interval that fully contains
all portions of the sequence that are aligned to site). These caching hints
do not themselves trigger reading of sequence string data from the databases. Only
when user code actually requests sequence strings that fall within these covering
intervals, the sequence database object will load not the requested interval, but
the entire covering interval, which is then cached. Thereafter, all sequence
string requests that fall within the covering interval are simply immediately sliced
from the cached sequence string, completely avoiding any need to read from disk.
This greatly accelerates sequence analysis with very large multigenome alignments
and sequence databases.
In this case, to enforce the most efficient caching possible, we simply performed
a query that contains all three sites of interest (ss1, ss2, and exon). By performing
this query first, and holding onto the query result, we ensure that Pygr will
use the same cache for all three subsequent queries contained in it. As soon
as we release the reference to this query result (i.e. in the example above,
whenever the variable cacheHint is deleted or over-written with a new value,
freeing Python to garbage-collect the original query result), the associated
cache hint information will also be cleared.
(Actually, because of Pygr's caching / optimizations, considerably more is going on than indicated in this simplified sketch. But you get the idea: Pygr makes it relatively effortless to work with a variety of disparate (and large) resources in an integrated way.)
Here is some example output:
NEED TO UPDATE THESE RESULTS 1 Mm.99996 ss1 hg17 50.0 100.0 AG GG 1 Mm.99996 ss1 canFam1 50.0 100.0 AG GG 1 Mm.99996 ss1 panTro1 50.0 100.0 AG GG 1 Mm.99996 ss1 rn3 100.0 100.0 AG AG 1 Mm.99996 ss2 hg17 100.0 100.0 AG AG 1 Mm.99996 ss2 canFam1 100.0 100.0 AG AG 1 Mm.99996 ss2 panTro1 100.0 100.0 AG AG 1 Mm.99996 ss2 rn3 100.0 100.0 AG AG 1 Mm.99996 ss3 hg17 100.0 100.0 GT GT 1 Mm.99996 ss3 canFam1 100.0 100.0 GT GT 1 Mm.99996 ss3 panTro1 100.0 100.0 GT GT 1 Mm.99996 ss3 rn3 100.0 100.0 GT GT 1 Mm.99996 e1 hg17 78.9 100.0 AG GG 1 Mm.99996 e1 canFam1 84.2 100.0 AG GG 1 Mm.99996 e1 panTro1 77.6 100.0 AG GG 1 Mm.99996 e1 rn3 97.4 98.7 AG AG 1 Mm.99996 e2 hg17 91.6 99.1 CC CC 1 Mm.99996 e2 canFam1 88.8 99.1 CC CC 1 Mm.99996 e2 panTro1 91.6 99.1 CC CC 1 Mm.99996 e2 rn3 97.2 100.0 CC CC
Pygr provides a variety of "back-end" implementations of sequence objects, ranging from sequences stored in a relational database table, or a BLAST database, to sequences created by the user in Python (as above). All of these provide the same consistent interface, and in general try to be efficient. For example, Pygr sequence objects are just "placeholders" that record what sequence interval you're working with, but if the back-end is an external database, the sequence object itself does not store the sequence, and creating new sequence objects (e.g. taking slices of the object as above) will not require anything to be done on the actual sequence itself (such as copying a portion of it). Pygr only obtains sequence information when you actually ask for it (e.g. by taking the string value str(s) of a sequence object), and normally only obtains just the portion that you ask for (i.e. str(s[1000000:1000100]) only obtains 100nt of sequence, even if s is a 100 megabase sequence. By contrast str(s)[1000000:1000100] would force it to obtain the whole sequence from the database, then slice out just the 100 nt you selected).
Here's an example of working with sequences from a BLAST database:
NEED TO UPDATE THESE RESULTS
>>> from pygr.seqdb import *
>>> db = BlastDB('sp') # open BLAST database associated with FASTA file 'sp'
>>> s = db['CYGB_HUMAN'][90:150] # get a sequence by ID, and take a slice
>>> str(s)
'TVVENLHDPDKVSSVLALVGKAHALKHKVEPVYFKILSGVILEVVAEEFASDFPPETQRA'
>>> m = db.blast(s) # get alignment to all BLAST hits in db
>>> for src,dest,edge in m.edges(): # print out the alignment edges
... print src,repr(src),'\n',dest,repr(dest),e.pIdentity(),'\n'
...
TVVENLHDPDKVSSVLALVGKAHALKHKVEPVYFKILSGVILEVVAEEFASDFPP CYGB_HUMAN[90:145]
TLVENLRDADKLNTIFNQMGKSHALRHKVDPVYFKILAGVILEVLVEAFPQCFSP CYGB_BRARE[87:142] 72
TVVENLHDPDKVSSVLALVGKAHALKHKVEPVYFKILSGVILEVVAEEFASDFPPETQRA CYGB_HUMAN[90:150]
TVVENLHDPDKVSSVLALVGKAHALKHKVEPVYFKILSGVILEVVAEEFASDFPPETQRA CYGB_HUMAN[90:150]
120
TVVENLHDPDKVSSVLALVGKAHALKHKVEPVYFKILSGVILEVVAEEFASDFPPETQRA CYGB_HUMAN[90:150]
TVVENLHDPDKVSSVLALVGKAHALKHKVEPMYFKILSGVILEVIAEEFANDFPVETQKA CYGB_MOUSE[90:150]
112
...
This example introduces the use of a Pygr alignment object to store the mapping of s onto homologous sequences in db, obtained from BLAST. Here's what Pygr actually does:
src and dest are the two aligned sequence intervals, and edge provides a convenient interface to information about their relationship (e.g. %identity, etc.).
repr(src) to get a "string representation" of each sequence interval. When print calls str() on individualBlastSequence interval objects returned by the BLAST search, they invoke fseek() to obtain the specific sequence slice representing that interval.
Pygr provides a systematic solution to this problem: creating a consistent
namespace for data. A given resource is given a unique name that then becomes
its universal handle for accessing it, no matter where you are (just as Python's
import command provides a consistent name for accessing a given code
resource, regardless of where you are). For example, say we want to add the
hg17 (release 17 of the human genome sequence) as ``Bio.Seq.Genome.HUMAN.hg17''
(the choice of name is arbitrary, but it's best to choose a good convention and follow
it consistently):
from pygr import seqdb
import pygr.Data # MODULE PROVIDES ACCESS TO OUR DATA NAMESPACE
hg17 = seqdb.BlastDB('hg17')
hg17.__doc__ = 'human genome sequence draft 17' # REQUIRED!
pygr.Data.Bio.Seq.Genome.HUMAN.hg17 = hg17 # SAVE AS THIS NAME
pygr.Data.save() # SAVE ALL PENDING DATA TO THE RESOURCE DATABASE
pygr.Data.save() to
complete the transaction and save all pending data resources
(i.e. all those added since your last pygr.Data.save() or
pygr.Data.rollback()). In particular, if you have added
data to pygr.Data during a given Python interpreter session, you
should always call pygr.Data.save() or
pygr.Data.rollback() prior to exiting from that session.
In any subsequent Python session, we can now access it directly by its pygr.Data name:
import pygr.Data # MODULE PROVIDES ACCESS TO OUR DATA NAMESPACE hg17=pygr.Data.Bio.Seq.Genome.HUMAN.hg17() # FIND THE RESOURCE
hg17()) emphasizes that this acts like a Python
constructor: it constructs a Python object for us (in this case, the
desired seqdb.BlastDB object representing this genome database).
Note that we did not even have to know how to construct the hg17
object, e.g. what Python class to use (seqdb.BlastDB), or even to import
the necessary modules for constructing it. pygr.Data uses the
power of Python pickling to figure out automatically what to import.
pygr.Data looks at the environment variable PYGRDATAPATH to get a list
of local and remote resource databases in which to look up any resource name
that you try to load. For example, in the shell you might set:
setenv PYGRDATAPATH ~,.,/usr/local/pygr,mysql:PYGRDATA.index,http://leelab.mbi.ucla.edu:5000
$HOME/.pygr_data; ./.pygr_data; /usr/local/pygr/.pygr_data;
the MySQL table PYGRDATA.index (using your
MySQL .my.cnf file to determine the MySQL host and authentication);
and the XMLRPC server running on leelab.mbi.ucla.edu on port 5000.
pygr.Data is smart about figuring out data resource dependencies. For example, you could just save a 17-genome alignment in a single step as follows:
from pygr import cnestedlist
import pygr.Data # MODULE PROVIDES ACCESS TO OUR DATA NAMESPACE
nlmsa = cnestedlist.NLMSA('/loaner/ucsc17')
nlmsa.__doc__ = 'UCSC 17way multiz alignment, rooted on hg17'
pygr.Data.Bio.Seq.MSA.ucsc17 = nlmsa
pygr.Data.save() # SAVE ALL PENDING DATA TO THE RESOURCE DATABASE
However, it would be a lot smarter to give those databases pygr.Data resource names too. Let's do that:
from pygr import cnestedlist
import pygr.Data # MODULE PROVIDES ACCESS TO OUR DATA NAMESPACE
nlmsa = cnestedlist.NLMSA('/loaner/ucsc17')
for id,genome in nlmsa.seqDict.prefixDict.items(): # 1st SAVE THE GENOMES
genome.__doc__ = 'genome sequence '+id
pygr.Data.getResource.addResource('Bio.Seq.Genome.'+id,genome)
nlmsa.__doc__ = 'UCSC 17way multiz alignment, rooted on hg17'
pygr.Data.Bio.Seq.MSA.ucsc17 = nlmsa # NOW SAVE THE ALIGNMENT
pygr.Data.save() # SAVE ALL PENDING DATA TO THE RESOURCE DATABASE
This has several advantages. First, we can now access other genome databases using pygr.Data too:
import pygr.Data # MODULE PROVIDES ACCESS TO OUR DATA NAMESPACE mm7 = pygr.Data.Bio.Seq.Genome.mm7() # GET THE MOUSE GENOME
NOTE: Python pickling is not secure. In particular, you should not unpickle
data provided by someone else unless you trust the data not to contain
attempted security exploits. Because Python unpickling has access to import,
it has the potential to access system calls and execute malicious code on your
computer.
splicegraph.__doc__ = 'graph of exon:splice:exon relations in human genes'
pygr.Data.Bio.Genomics.ASAP2.hg17.splicegraph = splicegraph # ADD A NEW RESOURCE
pygr.Data.schema.Bio.Genomics.ASAP2.hg17.splicegraph = \
pygr.Data.ManyToManyRelation(exons,exons,splices, # ADD ITS SCHEMA RELATIONS
bindAttrs=('next','previous','exons'))
pygr.Data.save() # SAVE ALL PENDING DATA TO THE RESOURCE DATABASE
splicegraph is a graph whose nodes are exons, and whose
edges are splices connecting a pair of exons. Specifically,
splicegraph[exon1][exon2]=splice1 means splice1 is a
splice object (from the container splices) that connects
exon1 and exon2 (both from the container exons).
pygr.Data.Bio.Genomics.ASAP2.hg17.splicegraph,
the sourceDB, targetDB, edgeDB attributes on
the returned object will automatically be set to point to the
corresponding pygr.Data resources representing exons and splices
respectively. splicegraph does not need to do anything to
remember these relationships; pygr.Data.schema remembers and applies
this information for you automatically. Note that when you access
splicegraph, neither exons nor splices will be
actually loaded unless you do something that specifically tries to
read these data (e.g. for exon in splicegraph will read
exons but not splices).
exons we can use splicegraph to find its
splices via code like
for exon,splice in splicegraph[exon0].items(): do something...
exon
has a next attribute that gives its splices to subsequent exons
via code like
for exon,splice in exon0.next.items(): do something...
next,
previous on any exon item from the container exons,
according to the schema. I.e. exon.next will be equivalent to
splicegraph[exon]. Note that as long as the object exon0
came from the pygr.Data resource, the user would not have to do anything
to be able to use the next attribute. On the basis of the saved
schema information, pygr.Data will construct this attribute automatically,
and will automatically load the resources splicegraph and splices
if the user tries to actually use the next attribute.
~,.,http://biodb2.bioinformatics.ucla.edu:5000
>>> import pygr.Data
>>> pygr.Data.dir('Bio')
['Bio.MSA.UCSC.canFam2_multiz4way',...]
>>> msa = pygr.Data.Bio.MSA.UCSC.hg17_multiz17way()
>>> chr1 = msa.seqDict['hg17.chr1']
>>> ival = chr1[4000:4400]
>>> myslice = msa[ival]
>>> for s1,s2,e in myslice.edges():
... print '%s\n%s\n' %(s1,s2)
...
AAGGGCCA
AAGGGCCA
To setup your own XMLRPC client-server using pygr.Data, first create an XMLRPC server on a machine that has access to the data:
import pygr.Data
nlmsa = pygr.Data.Bio.Seq.MSA.ucsc17() # GET OUR NLMSA AND SEQ DBs
server = pygr.Data.getResource.newServer('nlmsa_server') # SERVE ALL LOADED DATA
server.register() # TELL PYGRDATA INDEX SERVER WHAT RESOURCES WE'RE SERVING
server.serve_forever() # START THE SERVICE...
This example code looks for a pygr.Data XMLRPC server in your PYGRDATAPATH, and registers our resources to that index. Now any machine that can access your servers can access the alignment as easily as:
import pygr.Data nlmsa = pygr.Data.Bio.Seq.MSA.ucsc17() # GET THE NLMSA AND SEQ DBs
pygr.Data.here.Bio.Seq.MSA.ucsc17=nlmsa # SAVE THE NLMSA AND SEQ DBs
here refers to the first entry in your
PYGRDATAPATH that starts with ``.'' (dot). For other layer names, see
the reference documentation. This might be useful for prototyping or
testing a new resource, without yet adding it to your long-term resource
database.
my is the first entry that begins with your home directory
(i.e. (tilde), ``/home/yourname'' or whatever your home directory is).
system is the first entry that
begins with an absolute path and is not within your home directory.
subdir is the first entry that
begins with a relative path (ie. does not fit any of the preceding
definitions).
del pygr.Data.leelab.Bio.Seq.MSA.ucsc17 # DELETE THIS RESOURCE RULE
pygr.Data provides classes that make it very easy for you to store your data in this way.
ens_genes = Collection(filename='genes.db',mode='c' # CREATE NEW DATABASE
itemClass=Transcript)
for gene_id,gene_data in geneList:
gene = Transcript(gene_id,gene_data,ens_genes)
ens_genes[gene_id] = gene # STORE IN OUR DATABASE
Mapping enables you to store a relationship between one collection and another collection in a way that is easily stored as a database. For example, assuming that ens_genes is a collection of genes, and exon_db is a collection of exons, we can store the mapping from a gene to its exons as follows:
gene_exons = Mapping(ens_genes, exon_db, multiValue=True,
inverseAttr='gene_id',filename='gene_exons.db',mode='c')
for exon in exon_db:
gene = ens_genes[exon.gene_id] # GET ITS GENE
exons = gene_exons.get(gene, []) # GET ITS LIST OF EXONS, OR AN EMPTY LIST
exons.append(exon) # ADD OUR EXON TO ITS LIST
gene_exons[gene] = exons # SAVE EXPANDED EXON MAPPING LIST
The Collection, Mapping and Graph classes provide general and flexible storage options for storing data and graphs. These classes can be accessed from the pygr.Data or pygr.mapping modules. For further details, see the pygr.mapping module documentation. The SQLTable and SQLGraph classes in the pygr.sqlgraph module provide analogous interfaces for storing data and graphs in an SQL database server (such as MySQL).
Here's an example of creating an SQLGraph representing the splices connecting pairs of exons, using data stored in an existing database table:
splicegraph = sqlgraph.SQLGraphClustered('PYGRDB_JAN06.splicegraph_hg17',
source_id='left_exon_form_id',
target_id='right_exon_form_id',edge_id='splice_id',
sourceDB=exons,targetDB=exons,edgeDB=splices,
clusterKey='cluster_id')
pygr.Data.Bio.ASAP2.hg17.splicegraph = splicegraph
pygr.Data.schema.Bio.ASAP2.hg17.splicegraph = \
pygr.Data.ManyToManyRelation(exons,exons,splices,
bindAttrs=('next','previous','exons'))
pygr.Data.save() # SAVE ALL PENDING DATA TO THE RESOURCE DATABASE
Pygr multiple alignment objects can be treated as mappings of sequence intervals onto sequence intervals. Here is a very simple example, showing basic operations for constructing an alignment:
>>> from pygr import cnestedlist
>>> m2 = cnestedlist.NLMSA('myo',mode='memory') # an empty alignment, in-memory
>>> m2 += s # add sequence s to the alignment
>>> ival = s[100:160] # AN INTERVAL OF s
>>> m2[ival] += db['MYG_CHICK'][83:143] # add an edge mapping interval s -> an interval of MYG_CHICK
>>> m2[ival] += db['MYG_CANFA'][45:105] # add an additional edge
>>> m2.build() # done constructing the alignment. Initialize for query.
>>> for s2 in m2[ival[:10]]: # get aligned seqs for the first 10 letters of ival...
... print repr(s2)
...
MYG_CHICK[83:93]
MYG_CANFA[45:55]
from pygr import cnestedlist,seqdb
msa = cnestedlist.NLMSA('all_vs_all',mode='w',bidirectional=False) # ON-DISK
sp = seqdb.BlastDB('sp') # OPEN SWISSPROT DATABASE
for id,s in sp.iteritems(): # FOR EVERY SEQUENCE IN SWISSPROT
sp.blast(s,msa,expmax=1e-10) # GET STRONG HOMOLOGS, SAVE ALIGNMENT IN msa
msa.build(saveSeqDict=True) # BUILD & SAVE ALIGNMENT + SEQUENCE INDEXES
# msa READY TO QUERY NOW...
mode='w' flag, NLMSA will create a set of alignment
index files called 'all_vs_all'.
bidirectional=False: whenever you store an alignment relationship
msa with S would get TWO alignments
to T: one from the
msa.
What this highlights is that BLAST is not a true multiple sequence alignment
algorithm (among other things, it is not symmetric: you can get different
mappings in one direction vs. the other).
In general, bidirectional storage mainly makes sense for true multiple sequence alignments (which are guaranteed to be symmetric).
maf/:
import os
from pygr import cnestedlist,seqdb
genomes = {'hg17':'hg17','mm5':'mm5', 'rn3':'rn3', 'canFam1':'cf1',
'danRer1':'dr1', 'fr1':'fr1','galGal2':'gg2', 'panTro1':'pt1'}
for k,v in genomes.items(): # PREFIX DICTIONARY FOR UNION OF GENOMES
genomes[k] = seqdb.BlastDB(v) # USE v AS FILENAME FOR FASTA FILE
genomeUnion=seqdb.PrefixUnionDict(genomes) # CREATE UNION OF THESE DBs
# CREATE NLMSA DATABASE ucsc8 ON DISK, FROM MAF FILES IN maf/
msa = cnestedlist.NLMSA('ucsc8','w',genomeUnion,os.listdir('maf'))
msa.build(saveSeqDict=True) # BUILD & SAVE ALIGNMENT + SEQUENCE INDEXES
The only real work here is due to the fact that UCSC's MAF files use a prefix.suffix notation for identifying specific sequences, where prefix gives the name of the genome, and suffix gives the identifier of the sequence in that genome database. Here we use Pygr's PrefixUnionDict class to wrap the set of genome databases in a dict-like interface that accepts string keys of the form prefix.suffix and returns the right sequence object from the right genome database. As an added twist, the genome names in the MAF files match the filenames of the associated genome databases in most cases, but not all, so we have to create an initial dictionary giving the correct mapping. Actually building the NLMSA requires just one line, but actually a number of steps are happening behind the scenes:
fastacmd's horrible
memory requirements and slow speed, so we implemented fast sequence
indexing.
import pygr.Data
from pygr.apps.leelabdb import * # this accesses our databases
from pygr import coordinator # this provides parallelization support
def map_clusters(server,dbname='HUMAN_SPLICE_03',
result_table='GENOME_ALIGNMENT.hg17_cluster_JUN03_all',
rmOpts='',**kwargs):
"CLIENT FUNCTION: map clusters one by one"
# CONSTRUCT RESOURCE FOR US IF NEEDED
genome = pygr.Data.Bio.Seq.Genome.HUMAN.hg17()
# LOAD DB SCHEMA
(clusters,exons,splices,genomic_seq,spliceGraph,alt5Graph,alt3Graph,mrna,
protein, clusterExons,clusterSplices) = getSpliceGraphFromDB(spliceCalcs[dbname])
# NOW MAP CLUSTER SEQUENCES ONE BY ONE TO OUR NEW genome
for cluster_id in server:
g = genomic_seq[cluster_id] # GET THE OLD GENOMIC SEQUENCE FOR THIS CLUSTER
m = genome.megablast(g,maxseq=1,minIdentity=98,rmOpts=rmOpts) # MASK, BLAST, READ INTO m
# SAVE ALIGNMENT m TO DATABASE TABLE result_table USING cursor
createTableFromRepr(m.repr_dict(),result_table,clusters.cursor,
{'src_id':'varchar(12)','dest_id':'varchar(12)'})
yield cluster_id # WE MUST FUNCTION AS GENERATOR TO KEEP ERROR TRAPPING
# HAPPY
def serve_clusters(dbname='HUMAN_SPLICE_03',
source_table='HUMAN_SPLICE_03.genomic_cluster_JUN03',**kwargs):
"SERVER FUNCTION: serve up cluster_id one by one to as many clients as you want"
cursor = getUserCursor(dbname)
t = SQLTable(source_table,cursor)
for id in t:
yield id # HAND OUT ONE CLUSTER ID TO A CLIENT
if __name__=='__main__': # AUTOMATICALLY RUN EITHER THE CLIENT OR SERVER FUNCTION
coordinator.start_client_or_server(map_clusters,serve_clusters,[],__file__)
First, let's just focus on the map_clusters() function, which illustrates how the mapping of each gene is generated and saved. Let's examine the data piece by piece:
for alignedRegion in msa[myRegion]: # FIND ALIGNMENT IN OTHER GENOMES
for ival in alignedRegion.exons: # SEE IF THIS CONTAINS ANY ANNOTATED EXONS
if ival.orientation>0: # ENSURE ANNOTATION IS ON SAME STRAND AS alignedRegion
print 'exon\tID:%d\tSEQ:%s' % (ival.id,str(ival.sequence)) # PRINT ITS SEQUENCE
for exon2,splice in ival.next.items(): # LOOK AT ALTERNATIVE SPLICING OF THIS EXON
do something...
alignedRegion) to exon annotations. This mapping
is bound by pygr.Data.schema to the sequence object's exons attribute.
In a moment we will see how to construct such a mapping.
annotation[0] is the first position in the annotation;
annotation[-10:] is the last ten positions of the annotation etc.
Any subslice of the annotation gives its coordinates and orientation
relative to the original annotation. In the example above, we used this
to check whether the annotation slice ival (which is returned
in the same orientation as alignedRegion) is on the same strand
(orientation = 1) as the original exon annotation,
or the opposite strand (orientation = -1).
str,
for example.
pathForward attribute
points to its parent annotation
object. For example, in the case above, alignedRegion might not contain
the whole exon, in which case ival would be just the part of the exon
contained in alignedRegion, and ival.pathForward would be the complete
exon. Note that for nucleotide sequence annotations, orientation matters.
In this example, we restricted our analysis to exons that are on the same
strand as alignedRegion.
id gives its unique identifier (primary key) in that database,
and db points to the annotation database object itself.
ival is part of the exons annotation database,
it can apply schema information automatically to it. In this particular case,
it applies the splicegraph schema to it (see example from pygr.Data.schema
tutorial above), so we can find out what exons it splices to via its next
attribute.
sliceDB each consisting of a sequence ID,
start, and stop coordinates. We can easily construct an annotation database
from this:
from pygr import seqdb,cnestedlist
annoDB = seqdb.AnnotationDB(sliceDB,genome) # CREATE THE ANNOTATION DB
nlmsa = cnestedlist.NLMSA('exonAnnot','w', # STORE SEQ->ANNOT MAPPING AS AN ALIGNMENT
pairwiseMode=True,bidirectional=False)
for a in annoDB.itervalues(): # SAVE ALL ANNOTATION INTERVALS
nlmsa.addAnnotation(a) # ADD ALIGNMENT BETWEEN ival AND ann INTERVALS
nlmsa.build() # WRITE INDEXES FOR THE ALIGNMENT
annoDB.__doc__ = 'exon annotation on the human genome'
pygr.Data.Bio.Genomics.ASAP2.exons = annoDB # ADD AS A PYGR.DATA RESOURCE
nlmsa.__doc__ = 'map human genome regions to contained exons'
pygr.Data.Bio.Genomics.ASAP2.exonmap = nlmsa # NOW SAVE MAPPING AND SCHEMA
pygr.Data.schema.Bio.Genomics.ASAP2.exonmap = \
pygr.Data.ManyToManyRelation(genome,annoDB,bindAttrs=('exons'))
pygr.Data.save() # SAVE ALL PENDING DATA TO THE RESOURCE DATABASE
genome
has an exons attribute that automatically gives a list of exon
annotations contained within that sequence. I.e. as in the previous
example, you would simply access it via:
for exon in someRegion.exons: do something...
ManyToManyRelation indicates that nlmsa should
be interpreted as being a many-to-many relation from items of genome
to items of annoDB. It also creates an exons attribute on
items of genome that translates to g.exons=nlmsa[g].
genome was obtained
from pygr.Data (and thus a pygr.Data resource ID). If not, we would first
have to add it, just as we did for annoDB.
exons, to the
genome items) for the forward mapping (from genome to annoDB).
We did not even store the reverse mapping in nlmsa, because
it is completely trivial. (i.e. an annotation from annoDB is itself
the interval of genome that it maps to). This was set by
the bidirectional=False option to the NLMSA.
pairwiseMode option indicates that this is a pairwise
(sequence to sequence) alignment, not a true multiple sequence alignment
(which requires its own coordinate system, called an LPO; see the reference
docs on NLMSA for more information). This option could have been omitted;
pygr would have figured it out automatically from the fact that we saved
direct alignments of sequence interval pairs to nlmsa. NLMSA does not
permit mixing pairwise and true MSA alignment formats in a single NLMSA.
from pygr import seqdb,cnestedlist
annoDB = seqdb.AnnotationDB(None,genome,'exon', # CREATE THE ANNOTATION DB
filename='exonAnnot',mode='c') # STORE ON DISK
nlmsa = cnestedlist.NLMSA('exonMap','w', # STORE SEQ->ANNOT MAPPING AS AN ALIGNMENT
pairwiseMode=True,bidirectional=False)
for id,s in exonSeqs.items(): # SAVE ALL ANNOTATION INTERVALS
for ann in annoDB.add_homology(s,'megablast',id=id,maxseq=1,minIdentity=98,maxLoss=2):
nlmsa.addAnnotation(ann)
nlmsa.build() # WRITE INDEXES FOR THE ALIGNMENT
annoDB.close() # SAVE ALL OUR ANNOTATION DATA TO DISK
annoDB.__doc__ = 'exon annotation on the human genome'
pygr.Data.Bio.Genomics.ASAP2.exons = annoDB # ADD AS A PYGR.DATA RESOURCE
nlmsa.__doc__ = 'map human genome regions to contained exons'
pygr.Data.Bio.Genomics.ASAP2.exonmap = nlmsa # NOW SAVE MAPPING AND SCHEMA
pygr.Data.schema.Bio.Genomics.ASAP2.exonmap = \
pygr.Data.ManyToManyRelation(genome,annoDB,bindAttrs=('exons',))
pygr.Data.save() # SAVE ALL PENDING DATA TO THE RESOURCE DATABASE
exonSeqs is a dictionary of exon IDs and sequence
strings.
None as the sliceDB argument, we force the
AnnotationDB to create a new dictionary for us. By passing the filename
argument, we make it create a Python shelve disk file to store the dictionary.
genome. This requires that
our genome provide a method attribute matching our search name
('megablast'), which must return an alignment object. For a BlastDB
object we could use either its blast or megablast methods.
Since we have provided an id, it will be used as
the id for the annotation. The remaining arguments are passed to the
homology search and filtering functions; see the BlastDB and
NLMSASlice.keys documentation for full details of the options you
can use. These specific arguments indicate that only the top hit should
be processed (maxseq=1), that it must have at least 98% identity to the
query, and that no more than 2 nucleotides can be missing relative to the
original query. add_homology() returns a list of the resulting
annotation(s) for this search, which are added to the alignment as usual.
The following introductory examples show how to use Pygr for graph queries, sequence searching and alignment queries, annotation queries, and multigenome alignment queries.
SELECT * FROM exons t1, exons t2, exons t3, splices t4, splices t5, splices t6 WHERE t1.cluster_id=t4.cluster_id AND t1.gen_end=t4.gen_start AND t4.cluster_id=t2.cluster_id AND t4.gen_end=t2.gen_start AND t2.cluster_id=t5.cluster_id AND t2.gen_end=t5.gen_start AND t5.cluster_id=t3.cluster_id AND t5.gen_end=t3.gen_start AND t1.cluster_id=t6.cluster_id AND t1.gen_end=t6.gen_start AND t6.cluster_id=t3.cluster_id AND t6.gen_end=t3.gen_start;
Such a six-way JOIN is painfully slow in a relational database; in general such queries just aren't practical. More fundamentally, the relational schema is forced to represent the graph relation with combinations of foreign keys and other data that the user really should not have to remember. All the user should know is that there is a specific relation, e.g. from this exon, the "next" exon is X, and the relation joining them is splice Y.
In Pygr, writing the query is just a matter of writing down the graph (edges from 1 to 2, 1 to 3, and 2 to 3, but no special "edge information"):
queryGraph = {1:{2:None,3:None},2:{3:None},3:{}}
We can now execute the query using the GraphQuery class:
results = [dict(m) for m in GraphQuery(spliceGraph,queryGraph)]
This is more or less equivalent to writing a bunch of for-loops for iterating over the possible closures:
results = []
for e1 in spliceGraph: # FIND ALL EXONS
for e2 in spliceGraph[e1]: # NEXT EXON
for e3 in spliceGraph[e2]: # NEXT EXON
if e3 in spliceGraph[e1]: # MAKE SURE SPLICE FROM e1 -> e3
results.append({1:e1,2:e2,3:e3}) # OK, SAVE MATCH
It is often convenient to bind an object attribute to a graph, so that you can use either the graph syntax or a traditional object attribute and mean exactly the same thing. In the splice graph example, we bind the exon.next attribute to the spliceGraph, so the above for-loops can also be written:
results = []
for e1 in spliceGraph: # FIND ALL EXONS
for e2 in e1.next: # NEXT EXON
for e3 in e2.next: # NEXT EXON
if e3 in e1.next: # MAKE SURE SPLICE FROM e1 -> e3
results.append({1:e1,2:e2,3:e3}) # OK, SAVE MATCH
Another interesting query in the alternative splicing field is the so-called U12-adapter exon query (see slide 21 of the ISMB slides):
queryGraph = {0:{1:dict(dataGraph=alt5Graph),
2:dict(filter=lambda edge,**kwargs:edge.type=='U11/U12')},
1:{3:None},
2:{3:None},
3:{}}
Here we use edge information in the query graph to add a few constraints:
Note that the query graph "nodes" (in this example, the integers 0, 1, 2, 3) are quite arbitrary. We could have used strings, or other kinds of objects instead.
Now if we want to see the results right away, we use the mapping returned by GraphQuery to look at individual nodes and edges of the dataGraph that matched our query:
for m in GraphQuery(spliceGraph,queryGraph):
print m[1].id,m[0,2].id # PRINT EXON ID FOR EXON 1,
# SPLICE ID FOR SPLICE 0 -> 2
The match is returned by GraphQuery as a mapping from nodes and edges of the query graph to nodes and edges of the data graph. Edges are specified simply as tuples of the nodes you want to get the edge for (in this example 0,2). Constructing a Graph How was the spliceGraph created in the first place? Let's say we have an initial list of tuples giving connections between exon objects and splice objects, where each tuple consists of a pair of exons connected by a splice.
for exon1,exon2,splice in spliceConnections:
spliceGraph += exon1 # add exon1 as a node in the graph
spliceGraph += exon2 # if already a node in the graph, does nothing...
exon1.next[exon2] = splice # add an edge, with splice as the edgeinfo
The last operation makes use of the binding of exon.next to spliceGraph, and is equivalent to
spliceGraph[exon1][exon2] = splice
If we didn't want to save the edge information, we could use the simpler syntax
spliceGraph[exon1] += exon2 # equivalent to exon1.next+=exon2
This "short" form is equivalent to saving None as the edge information.
Pygr can do much more sophisticated analyses than this fairly easily. Just to give a taste of how to use these capabilities, we will illustrate one example of a standard Pygr model: querying Pygr data by drawing a ``query graph'' showing the connections we want to find, and running its GraphQuery() engine. Since Pygr alignments follow the same interface as any Pygr graph, we can query them using the standard GraphQuery class. Let's say we have a Python script load_alignments.py that loads two alignments:
>>> from load_alignments import * # load the alignments
>>> from pygr.graphquery import * # import the graph query code
# draw a graph using a dict. Note: edge 2->3 must come from swiss_features
>>> queryGraph = {1:{2:None},2:{3:dict(dataGraph=swiss_features)},3:{}}
# run the query and save the mappings
>>> l = [dict(d) for d in GraphQuery(mRNA_swiss,queryGraph)]
>>> len(l) # how many annotations mapped onto our mRNA sequences?
4703
We assumed that mRNA_swiss would be passed as the default dataGraph, and specified directly that edge 2->3 should be looked up in swiss_features. We then captured all the results from the GraphQuery iterator using a Python list comprehension. Note that since the iterator returns each result in the same container (mapping object), if we want to save all the individual results we have to copy each one to a new mapping (dict) object, as illustrated in this example. Storing Alignments in a Relational Database
>>> import MySQLdb # standard module for accessing MySQL, now get a cursor...
>>> rdb = MySQLdb.connect(db='HUMAN_SPLICE_03',read_default_file=os.environ['HOME'
]+'/.my.cnf')
>>> t = SQLTable('genomic_cluster_JUN03',rdb.cursor()) #interface to a table of
sequences
>>> from pygr.seqdb import * # pygr module for working with sequences from databases
>>> t.objclass(DNASQLSequence) #use this class as "row objects"
>>> s2 = t['Hs.1162'] # get a specific sequence object by ID
>>> str(s2[1000:1050]) # this will only get 50 nt of the genomic sequence from
MySQL
'acctgggtgatgaaataaatttttacgccaaatcccgatgacacacaatt'
(Note: in this example we used MySQLdb.connect()'s ability to read database server and user authentication information directly from the standard /.my.cnf file normally used by the MySQL client).
Additional examples of how to use pygr can be found in the tests/ directory within the pygr distribution package.
See About this document... for information on suggesting changes.