Subsections


1 Introductory Tutorial

Pygr is open source software designed to make it easy to do powerful sequence and comparative genomics analyses, even with extremely large multi-genome alignments.


1.1 Sequence / Alignment Tutorial

Sequences and alignments also can be modeled as graph structures in Pygr, providing the same consistent and simple framework for queries.

1.1.1 Sequence Objects

Pygr tries to provide a very "Pythonic" model for sequences. This python interpreter session illustrates some simple features:

>>> from pygr.sequence import *
>>> s = Sequence('attatatgccactat','bobo') #create a sequence named bobo
>>> s # interpreter will print repr(s)
bobo[0:15]
>>> t = s[-8:] #python slice gives last 8 nt of s
>>> t # interpreter will print repr(t)
bobo[7:15]
>>> str(t) #string conversion just yields the sequence as a string 
'gccactat'
>>> rc = -s #get the reverse complement
>>> str(rc[:5]) #its first five letters
'atagt'

Several points:

1.1.2 Comparative Genomics Query of Multigenome Alignments

Many groups (e.g. David Haussler's group at UC Santa Cruz) have constructed alignments of multiple genomes. These alignments are extremely useful and interesting, but so large that it is cumbersome to work with the dataset using conventional methods. For example, for the 17-genome alignment you have to work simultaneously with the individual genome datasets for human, chimp, mouse, rat, dog, chicken, fugu and zebrafish etc., as well as the huge alignment itself. Pygr makes this quite easy. Here we illustrate an example of mapping a set of human exons, which has two splice sites (ss1 and ss2) bracketing a single exon (exon). We use the alignment database to map each of these splice sites onto all the aligned genomes, and to print the percent-identity and percent-aligned for each genome, as well as the two nucleotides consituting the splice site itself. It also prints the conservation of the two exonic region (between ss1 and ss2:

import pygr.Data # FINDS DATA WHEREVER IT'S REGISTERED
msa = pygr.Data.Bio.Seq.MSA.ucsc17() # SANTA CRUZ 17-GENOME ALIGNMENT
exons = pygr.Data.Leelab.ASAP2.hg17.exons() # ASAP2 HUMAN EXONS
idDict = ~(msa.seqDict) # INVERSE: MAPS SEQ --> STRING IDENTIFIER
def printConservation(id,label,site):
    for src,dest,edge in msa[site].edges(mergeMost=True):
        print '%d\t%s\t%s\t%s\t%s\t%s\t%2.1f\t%2.1f' \
              %(id,label,repr(src),src,idDict[dest],dest,
                100*edge.pIdentity(),100*edge.pAligned())
for id,exon in exons.iteritems():
    ival = exon.sequence # GET THE SEQUENCE INTERVAL FOR THIS EXON
    ss1 = ival.before()[-2:] # GET THE 2 NT SPLICE SITES
    ss2 = ival.after()[:2]
    cacheHint = msa[ss1+ss2] #CACHE THE COVERING INTERVALS FROM ss1 TO ss2
    printConservation(id,'ss1',ss1)
    printConservation(id,'ss2',ss2)
    printConservation(id,'exon',ival)

A few notes:

(Actually, because of Pygr's caching / optimizations, considerably more is going on than indicated in this simplified sketch. But you get the idea: Pygr makes it relatively effortless to work with a variety of disparate (and large) resources in an integrated way.)

Here is some example output:

NEED TO UPDATE THESE RESULTS
1       Mm.99996        ss1     hg17    50.0    100.0   AG      GG
1       Mm.99996        ss1     canFam1 50.0    100.0   AG      GG
1       Mm.99996        ss1     panTro1 50.0    100.0   AG      GG
1       Mm.99996        ss1     rn3     100.0   100.0   AG      AG
1       Mm.99996        ss2     hg17    100.0   100.0   AG      AG
1       Mm.99996        ss2     canFam1 100.0   100.0   AG      AG
1       Mm.99996        ss2     panTro1 100.0   100.0   AG      AG
1       Mm.99996        ss2     rn3     100.0   100.0   AG      AG
1       Mm.99996        ss3     hg17    100.0   100.0   GT      GT
1       Mm.99996        ss3     canFam1 100.0   100.0   GT      GT
1       Mm.99996        ss3     panTro1 100.0   100.0   GT      GT
1       Mm.99996        ss3     rn3     100.0   100.0   GT      GT
1       Mm.99996        e1      hg17    78.9    100.0   AG      GG
1       Mm.99996        e1      canFam1 84.2    100.0   AG      GG
1       Mm.99996        e1      panTro1 77.6    100.0   AG      GG
1       Mm.99996        e1      rn3     97.4    98.7    AG      AG
1       Mm.99996        e2      hg17    91.6    99.1    CC      CC
1       Mm.99996        e2      canFam1 88.8    99.1    CC      CC
1       Mm.99996        e2      panTro1 91.6    99.1    CC      CC
1       Mm.99996        e2      rn3     97.2    100.0   CC      CC

1.1.3 Working with Sequences from Databases

Pygr provides a variety of "back-end" implementations of sequence objects, ranging from sequences stored in a relational database table, or a BLAST database, to sequences created by the user in Python (as above). All of these provide the same consistent interface, and in general try to be efficient. For example, Pygr sequence objects are just "placeholders" that record what sequence interval you're working with, but if the back-end is an external database, the sequence object itself does not store the sequence, and creating new sequence objects (e.g. taking slices of the object as above) will not require anything to be done on the actual sequence itself (such as copying a portion of it). Pygr only obtains sequence information when you actually ask for it (e.g. by taking the string value str(s) of a sequence object), and normally only obtains just the portion that you ask for (i.e. str(s[1000000:1000100]) only obtains 100nt of sequence, even if s is a 100 megabase sequence. By contrast str(s)[1000000:1000100] would force it to obtain the whole sequence from the database, then slice out just the 100 nt you selected).

Here's an example of working with sequences from a BLAST database:

NEED TO UPDATE THESE RESULTS
>>> from pygr.seqdb import *
>>> db = BlastDB('sp') # open BLAST database associated with FASTA file 'sp'
>>> s = db['CYGB_HUMAN'][90:150] # get a sequence by ID, and take a slice
>>> str(s)
'TVVENLHDPDKVSSVLALVGKAHALKHKVEPVYFKILSGVILEVVAEEFASDFPPETQRA'
>>> m = db.blast(s) # get alignment to all BLAST hits in db
>>> for src,dest,edge in m.edges(): # print out the alignment edges
...     print src,repr(src),'\n',dest,repr(dest),e.pIdentity(),'\n'
... 
TVVENLHDPDKVSSVLALVGKAHALKHKVEPVYFKILSGVILEVVAEEFASDFPP CYGB_HUMAN[90:145] 
TLVENLRDADKLNTIFNQMGKSHALRHKVDPVYFKILAGVILEVLVEAFPQCFSP CYGB_BRARE[87:142] 72

TVVENLHDPDKVSSVLALVGKAHALKHKVEPVYFKILSGVILEVVAEEFASDFPPETQRA CYGB_HUMAN[90:150] 
TVVENLHDPDKVSSVLALVGKAHALKHKVEPVYFKILSGVILEVVAEEFASDFPPETQRA CYGB_HUMAN[90:150]
 120

TVVENLHDPDKVSSVLALVGKAHALKHKVEPVYFKILSGVILEVVAEEFASDFPPETQRA CYGB_HUMAN[90:150] 
TVVENLHDPDKVSSVLALVGKAHALKHKVEPMYFKILSGVILEVIAEEFANDFPVETQKA CYGB_MOUSE[90:150]
112 
...

This example introduces the use of a Pygr alignment object to store the mapping of s onto homologous sequences in db, obtained from BLAST. Here's what Pygr actually does:

1.2 Simplifying the Challenges of Working with Complex Datasets

1.2.1 pygr.Data: a Namespace for Transparently Importing Data

One challenge in bioinformatics is the complexity of managing many diverse data resources. For example, running a large job on a heterogeneous cluster of computers is complicated by the fact that individual computers often can't access a given data resource in the same way (i.e. the file path may be different), and some machines may not have direct access at all to certain resources.

Pygr provides a systematic solution to this problem: creating a consistent namespace for data. A given resource is given a unique name that then becomes its universal handle for accessing it, no matter where you are (just as Python's import command provides a consistent name for accessing a given code resource, regardless of where you are). For example, say we want to add the hg17 (release 17 of the human genome sequence) as ``Bio.Seq.Genome.HUMAN.hg17'' (the choice of name is arbitrary, but it's best to choose a good convention and follow it consistently):

from pygr import seqdb
import pygr.Data # MODULE PROVIDES ACCESS TO OUR DATA NAMESPACE
hg17 = seqdb.BlastDB('hg17')
hg17.__doc__ = 'human genome sequence draft 17' # REQUIRED!
pygr.Data.Bio.Seq.Genome.HUMAN.hg17 = hg17 # SAVE AS THIS NAME
pygr.Data.save() # SAVE ALL PENDING DATA TO THE RESOURCE DATABASE
Note that you must call the function pygr.Data.save() to complete the transaction and save all pending data resources (i.e. all those added since your last pygr.Data.save() or pygr.Data.rollback()). In particular, if you have added data to pygr.Data during a given Python interpreter session, you should always call pygr.Data.save() or pygr.Data.rollback() prior to exiting from that session.

In any subsequent Python session, we can now access it directly by its pygr.Data name:

import pygr.Data # MODULE PROVIDES ACCESS TO OUR DATA NAMESPACE
hg17=pygr.Data.Bio.Seq.Genome.HUMAN.hg17() # FIND THE RESOURCE
The call syntax (hg17()) emphasizes that this acts like a Python constructor: it constructs a Python object for us (in this case, the desired seqdb.BlastDB object representing this genome database). Note that we did not even have to know how to construct the hg17 object, e.g. what Python class to use (seqdb.BlastDB), or even to import the necessary modules for constructing it. pygr.Data uses the power of Python pickling to figure out automatically what to import. pygr.Data looks at the environment variable PYGRDATAPATH to get a list of local and remote resource databases in which to look up any resource name that you try to load. For example, in the shell you might set:
setenv PYGRDATAPATH ~,.,/usr/local/pygr,mysql:PYGRDATA.index,http://leelab.mbi.ucla.edu:5000
This is a comma-separated string (since colon ':' appears inside URLs). In this case it tells pygr.Data to look for resource databases (in order): $HOME/.pygr_data; ./.pygr_data; /usr/local/pygr/.pygr_data; the MySQL table PYGRDATA.index (using your MySQL .my.cnf file to determine the MySQL host and authentication); and the XMLRPC server running on leelab.mbi.ucla.edu on port 5000.

pygr.Data is smart about figuring out data resource dependencies. For example, you could just save a 17-genome alignment in a single step as follows:

from pygr import cnestedlist
import pygr.Data # MODULE PROVIDES ACCESS TO OUR DATA NAMESPACE
nlmsa = cnestedlist.NLMSA('/loaner/ucsc17')
nlmsa.__doc__ = 'UCSC 17way multiz alignment, rooted on hg17'
pygr.Data.Bio.Seq.MSA.ucsc17 = nlmsa
pygr.Data.save() # SAVE ALL PENDING DATA TO THE RESOURCE DATABASE
This works, even though using this 17-genome alignment (behind the scenes) involves accessing 17 BlastDB sequence databases (one for each of the genomes in the alignment). Because the alignment object (NLMSA) references the 17 BlastDB databases, pygr.Data automatically saves information about how to access them too.

However, it would be a lot smarter to give those databases pygr.Data resource names too. Let's do that:

from pygr import cnestedlist
import pygr.Data # MODULE PROVIDES ACCESS TO OUR DATA NAMESPACE
nlmsa = cnestedlist.NLMSA('/loaner/ucsc17')
for id,genome in nlmsa.seqDict.prefixDict.items(): # 1st SAVE THE GENOMES
    genome.__doc__ = 'genome sequence '+id
    pygr.Data.getResource.addResource('Bio.Seq.Genome.'+id,genome)
nlmsa.__doc__ = 'UCSC 17way multiz alignment, rooted on hg17'
pygr.Data.Bio.Seq.MSA.ucsc17 = nlmsa # NOW SAVE THE ALIGNMENT
pygr.Data.save() # SAVE ALL PENDING DATA TO THE RESOURCE DATABASE

This has several advantages. First, we can now access other genome databases using pygr.Data too:

import pygr.Data # MODULE PROVIDES ACCESS TO OUR DATA NAMESPACE
mm7 = pygr.Data.Bio.Seq.Genome.mm7() # GET THE MOUSE GENOME
But more importantly, when we try to load the ucsc17 alignment on another machine, if the genome databases are not in the same directory as on our original machine, the first method above would fail, whereas in the second approach pygr.Data now will automatically scan all its resource databases to figure out how to load each of the genomes on that machine.

NOTE: Python pickling is not secure. In particular, you should not unpickle data provided by someone else unless you trust the data not to contain attempted security exploits. Because Python unpickling has access to import, it has the potential to access system calls and execute malicious code on your computer.

1.2.2 pygr.Data.schema: a Simple Framework For Managing Database Schemas

Schema refers to any relationship between two or more collections of data. It captures the structure of relationships that define these particular kinds of data. For example ``a genome has genes, and genes have exons'', or ``an exon is connected to another exon by a splice''. In pygr.Data we can store such schema information as easily as:
splicegraph.__doc__ = 'graph of exon:splice:exon relations in human genes'
pygr.Data.Bio.Genomics.ASAP2.hg17.splicegraph = splicegraph # ADD A NEW RESOURCE
pygr.Data.schema.Bio.Genomics.ASAP2.hg17.splicegraph = \
  pygr.Data.ManyToManyRelation(exons,exons,splices, # ADD ITS SCHEMA RELATIONS
                               bindAttrs=('next','previous','exons'))
pygr.Data.save() # SAVE ALL PENDING DATA TO THE RESOURCE DATABASE
This example assumes that

1.2.3 pygr.Data Sharing Over a Network via XMLRPC

Sometimes individual compute nodes may not have sufficient disk space to store all the data resources (for example, just the single ucsc17 17-genome alignment and associated genome databases takes about 200 GB). Yet it would be useful to run compute-intensive analyses on those machines accessing such data. pygr.Data makes that easy. The default setting of PYGRDATAPATH (if you do not set it yourself) is
~,.,http://biodb2.bioinformatics.ucla.edu:5000
respectively your HOME directory, current directory, and the XMLRPC server provided at UCLA as a service to pygr users. Thus you can simply import pygr.Data and start accessing data. Try this:
>>> import pygr.Data
>>> pygr.Data.dir('Bio')
['Bio.MSA.UCSC.canFam2_multiz4way',...]
>>> msa = pygr.Data.Bio.MSA.UCSC.hg17_multiz17way()
>>> chr1 = msa.seqDict['hg17.chr1']
>>> ival = chr1[4000:4400]
>>> myslice = msa[ival]
>>> for s1,s2,e in myslice.edges():
...     print '%s\n%s\n' %(s1,s2)
...
AAGGGCCA
AAGGGCCA
This provides a convenient way to begin trying out pygr and working with comparative genomics data, but clearly is not efficient for analysis of large amounts of data, which must be transmitted to you by the server via XMLRPC, since potentially many users must share the access to the biodb2.bioinformatics.ucla.edu server.

To setup your own XMLRPC client-server using pygr.Data, first create an XMLRPC server on a machine that has access to the data:

import pygr.Data
nlmsa = pygr.Data.Bio.Seq.MSA.ucsc17() # GET OUR NLMSA AND SEQ DBs
server = pygr.Data.getResource.newServer('nlmsa_server') # SERVE ALL LOADED DATA
server.register() # TELL PYGRDATA INDEX SERVER WHAT RESOURCES WE'RE SERVING
server.serve_forever() # START THE SERVICE...

This example code looks for a pygr.Data XMLRPC server in your PYGRDATAPATH, and registers our resources to that index. Now any machine that can access your servers can access the alignment as easily as:

import pygr.Data
nlmsa = pygr.Data.Bio.Seq.MSA.ucsc17() # GET THE NLMSA AND SEQ DBs
Alignment queries and sequence strings will be obtained via XMLRPC queries over the network. Note that if any of the sequence databases are available locally (on this machine), Pygr will automatically use that in preference to obtaining it over the network (based on your PYGRDATAPATH settings). However, if a particular resource is not available locally, Pygr will transparently get access to it from the server we created, using XMLRPC.

1.2.4 pygr.Data Layers

Based on your PYGRDATAPATH, pygr.Data provides a number of named layers that give abstract names for where you want to read or store your pygr.Data info. For example, if you wanted to store a resource specifically in the resource database in your current directory, you could type:
pygr.Data.here.Bio.Seq.MSA.ucsc17=nlmsa # SAVE THE NLMSA AND SEQ DBs

1.2.5 Collection, Mapping, Graph, SQLTable and SQLGraph classes

One of the main challenges in persistent storage (e.g. keeping a database on disk) of Python objects is how to store their inter-relations in an efficient and transparent way. For example, in a database application we want to be able to load just one object at a time (rather than being forced to load all the objects from the database into memory) even though each object may have references to many other objects (and we obviously want these references to work transparently for the user). The standard database answer is to associate a unique identifier (e.g. an integer) with each object in a specific collection, and to store references in the database in terms of these identifiers. This gives the database a flexible way to refer to objects (by their unique identifiers) that we have not yet actually loaded into memory.

pygr.Data provides classes that make it very easy for you to store your data in this way.

Here's a simple example of using a pygr Collection:
ens_genes = Collection(filename='genes.db',mode='c' # CREATE NEW DATABASE
                       itemClass=Transcript)
for gene_id,gene_data in geneList:
    gene = Transcript(gene_id,gene_data,ens_genes)
    ens_genes[gene_id] = gene # STORE IN OUR DATABASE

Mapping enables you to store a relationship between one collection and another collection in a way that is easily stored as a database. For example, assuming that ens_genes is a collection of genes, and exon_db is a collection of exons, we can store the mapping from a gene to its exons as follows:

gene_exons = Mapping(ens_genes, exon_db, multiValue=True, 
                     inverseAttr='gene_id',filename='gene_exons.db',mode='c')
for exon in exon_db:
    gene = ens_genes[exon.gene_id] # GET ITS GENE
    exons = gene_exons.get(gene, []) # GET ITS LIST OF EXONS, OR AN EMPTY LIST
    exons.append(exon) # ADD OUR EXON TO ITS LIST
    gene_exons[gene] = exons # SAVE EXPANDED EXON MAPPING LIST
The optional multiValue flag indicates that this is a one-to-many mapping (i.e. each gene maps to a list of exons. Again, we used the filename variable to make pygr store our mapping on disk using a Python shelve (BerkeleyDB file).

The Collection, Mapping and Graph classes provide general and flexible storage options for storing data and graphs. These classes can be accessed from the pygr.Data or pygr.mapping modules. For further details, see the pygr.mapping module documentation. The SQLTable and SQLGraph classes in the pygr.sqlgraph module provide analogous interfaces for storing data and graphs in an SQL database server (such as MySQL).

Here's an example of creating an SQLGraph representing the splices connecting pairs of exons, using data stored in an existing database table:

splicegraph = sqlgraph.SQLGraphClustered('PYGRDB_JAN06.splicegraph_hg17',
                                         source_id='left_exon_form_id',
                                         target_id='right_exon_form_id',edge_id='splice_id',
                                         sourceDB=exons,targetDB=exons,edgeDB=splices,
                                         clusterKey='cluster_id')
pygr.Data.Bio.ASAP2.hg17.splicegraph = splicegraph
pygr.Data.schema.Bio.ASAP2.hg17.splicegraph = \
    pygr.Data.ManyToManyRelation(exons,exons,splices,
                                 bindAttrs=('next','previous','exons'))
pygr.Data.save() # SAVE ALL PENDING DATA TO THE RESOURCE DATABASE
This variant of SQLGraph is optimized for typical usage patterns, by loading data in clusters (rather than each individual splice one by one). Since the key that we provided for the clustering ('cluster_id') is the gene identifier, this means that looking at any splice will have the effect of loading all splices for that gene. This makes sense, because only exons that are in the same gene can have splices to each other. This makes communication with the SQL server efficient, but only loads data that is likely to be used next by the user.

1.3 Storing Alignments

1.3.1 Alignment Basics

Pygr multiple alignment objects can be treated as mappings of sequence intervals onto sequence intervals. Here is a very simple example, showing basic operations for constructing an alignment:

>>> from pygr import cnestedlist
>>> m2 = cnestedlist.NLMSA('myo',mode='memory') # an empty alignment, in-memory
>>> m2 += s # add sequence s to the alignment
>>> ival = s[100:160] # AN INTERVAL OF s
>>> m2[ival] += db['MYG_CHICK'][83:143] # add an edge mapping interval s -> an interval of MYG_CHICK
>>> m2[ival] += db['MYG_CANFA'][45:105] # add an additional edge
>>> m2.build() # done constructing the alignment.  Initialize for query.
>>> for s2 in m2[ival[:10]]: # get aligned seqs for the first 10 letters of ival...
...     print repr(s2)
...
MYG_CHICK[83:93]
MYG_CANFA[45:55]
In this case we used in-memory storage of the alignment.

1.3.2 Storing an All-vs-All BLAST Alignment

However, for really large alignments (e.g. an all vs. all BLAST analysis) we may prefer to store the alignment on-disk. In pygr, all we have to do is change the mode flag to 'w' (implying write a file):
from pygr import cnestedlist,seqdb
msa = cnestedlist.NLMSA('all_vs_all',mode='w',bidirectional=False) # ON-DISK
sp = seqdb.BlastDB('sp') # OPEN SWISSPROT DATABASE
for id,s in sp.iteritems(): # FOR EVERY SEQUENCE IN SWISSPROT
    sp.blast(s,msa,expmax=1e-10) # GET STRONG HOMOLOGS, SAVE ALIGNMENT IN msa
msa.build(saveSeqDict=True) # BUILD & SAVE ALIGNMENT + SEQUENCE INDEXES
# msa READY TO QUERY NOW...
Again you can see how pygr makes it quite simple to do a large analysis and create a powerful resource (an all-vs-all alignment database). A couple of points deserve comment:

1.3.3 Building an Alignment Database from MAF files

It may be helpful to see how a large multi-genome alignment database is created in Pygr. It is quite straightforward. UCSC has defined a new file format for large multigenome alignment, called MAF. Pygr provides high-performance utilities for reading MAF alignment files and building a disk-based NLMSA alignment database. (These utilities are written in C for performance). Here's an example of building an alignment database from scratch using a set of MAF files stored in a directory called maf/:

import os
from pygr import cnestedlist,seqdb

genomes = {'hg17':'hg17','mm5':'mm5', 'rn3':'rn3', 'canFam1':'cf1', 
           'danRer1':'dr1', 'fr1':'fr1','galGal2':'gg2', 'panTro1':'pt1'}
for k,v in genomes.items(): # PREFIX DICTIONARY FOR UNION OF GENOMES
    genomes[k] = seqdb.BlastDB(v) # USE v AS FILENAME FOR FASTA FILE
genomeUnion=seqdb.PrefixUnionDict(genomes) # CREATE UNION OF THESE DBs
# CREATE NLMSA DATABASE ucsc8 ON DISK, FROM MAF FILES IN maf/
msa = cnestedlist.NLMSA('ucsc8','w',genomeUnion,os.listdir('maf'))
msa.build(saveSeqDict=True) # BUILD & SAVE ALIGNMENT + SEQUENCE INDEXES

The only real work here is due to the fact that UCSC's MAF files use a prefix.suffix notation for identifying specific sequences, where prefix gives the name of the genome, and suffix gives the identifier of the sequence in that genome database. Here we use Pygr's PrefixUnionDict class to wrap the set of genome databases in a dict-like interface that accepts string keys of the form prefix.suffix and returns the right sequence object from the right genome database. As an added twist, the genome names in the MAF files match the filenames of the associated genome databases in most cases, but not all, so we have to create an initial dictionary giving the correct mapping. Actually building the NLMSA requires just one line, but actually a number of steps are happening behind the scenes:

1.3.4 Example: Mapping an entire gene set onto a new genome version

To illustrate how Pygr can perform a big task with a little code, here is an example that maps a set of gene sequences onto a new version of the genome, using megablast to do the mapping, and a relational database to store the results. Moreover, since mapping 80,000 gene clusters takes a fair amount of time, the calculation is parallelized to run over a large number of compute nodes simultaneously:

import pygr.Data
from pygr.apps.leelabdb import * # this accesses our databases
from pygr import coordinator     # this provides parallelization support

def map_clusters(server,dbname='HUMAN_SPLICE_03',
                 result_table='GENOME_ALIGNMENT.hg17_cluster_JUN03_all',
                 rmOpts='',**kwargs):
    "CLIENT FUNCTION: map clusters one by one"
    # CONSTRUCT RESOURCE FOR US IF NEEDED
    genome = pygr.Data.Bio.Seq.Genome.HUMAN.hg17()
    # LOAD DB SCHEMA
    (clusters,exons,splices,genomic_seq,spliceGraph,alt5Graph,alt3Graph,mrna, 
    protein, clusterExons,clusterSplices) = getSpliceGraphFromDB(spliceCalcs[dbname])
    # NOW MAP CLUSTER SEQUENCES ONE BY ONE TO OUR NEW genome
    for cluster_id in server:
        g = genomic_seq[cluster_id] # GET THE OLD GENOMIC SEQUENCE FOR THIS CLUSTER
        m = genome.megablast(g,maxseq=1,minIdentity=98,rmOpts=rmOpts) # MASK, BLAST, READ INTO m
        # SAVE ALIGNMENT m TO DATABASE TABLE result_table USING cursor
        createTableFromRepr(m.repr_dict(),result_table,clusters.cursor,
                            {'src_id':'varchar(12)','dest_id':'varchar(12)'})
        yield cluster_id # WE MUST FUNCTION AS GENERATOR TO KEEP ERROR TRAPPING 
		         # HAPPY

def serve_clusters(dbname='HUMAN_SPLICE_03',
                   source_table='HUMAN_SPLICE_03.genomic_cluster_JUN03',**kwargs):
    "SERVER FUNCTION: serve up cluster_id one by one to as many clients as you want"
    cursor = getUserCursor(dbname)
    t = SQLTable(source_table,cursor)
    for id in t:
        yield id # HAND OUT ONE CLUSTER ID TO A CLIENT

if __name__=='__main__': # AUTOMATICALLY RUN EITHER THE CLIENT OR SERVER FUNCTION
    coordinator.start_client_or_server(map_clusters,serve_clusters,[],__file__)

First, let's just focus on the map_clusters() function, which illustrates how the mapping of each gene is generated and saved. Let's examine the data piece by piece:

This code will run in parallel over as many compute nodes as you have free, using Pygr's coordinator module. The parallelization model for this particular task is simple: a single iterator (server) dispensing task IDs to many clients.

1.4 Sequence Annotation Databases

1.4.1 What is a pygr sequence annotation?

Annotation - information bound to specific intervals of a genome or sequence - is an essential concept in bioinformatics. We would like to be able to store and query annotations naturally as part of working with multi-genome alignments, as a standard operation in comparative genomics. Pygr makes this easy:
for alignedRegion in msa[myRegion]: # FIND ALIGNMENT IN OTHER GENOMES
  for ival in alignedRegion.exons: # SEE IF THIS CONTAINS ANY ANNOTATED EXONS
    if ival.orientation>0: # ENSURE ANNOTATION IS ON SAME STRAND AS alignedRegion
      print 'exon\tID:%d\tSEQ:%s' % (ival.id,str(ival.sequence)) # PRINT ITS SEQUENCE
      for exon2,splice in ival.next.items(): # LOOK AT ALTERNATIVE SPLICING OF THIS EXON
        do something...

1.4.2 Constructing an Annotation Database

Suppose you had a set of annotations sliceDB each consisting of a sequence ID, start, and stop coordinates. We can easily construct an annotation database from this:
from pygr import seqdb,cnestedlist
annoDB = seqdb.AnnotationDB(sliceDB,genome) # CREATE THE ANNOTATION DB
nlmsa = cnestedlist.NLMSA('exonAnnot','w', # STORE SEQ->ANNOT MAPPING AS AN ALIGNMENT
                          pairwiseMode=True,bidirectional=False)
for a in annoDB.itervalues(): # SAVE ALL ANNOTATION INTERVALS
  nlmsa.addAnnotation(a) # ADD ALIGNMENT BETWEEN ival AND ann INTERVALS
nlmsa.build() # WRITE INDEXES FOR THE ALIGNMENT
annoDB.__doc__ = 'exon annotation on the human genome'
pygr.Data.Bio.Genomics.ASAP2.exons = annoDB # ADD AS A PYGR.DATA RESOURCE
nlmsa.__doc__ = 'map human genome regions to contained exons'
pygr.Data.Bio.Genomics.ASAP2.exonmap = nlmsa # NOW SAVE MAPPING AND SCHEMA
pygr.Data.schema.Bio.Genomics.ASAP2.exonmap = \
      pygr.Data.ManyToManyRelation(genome,annoDB,bindAttrs=('exons'))
pygr.Data.save() # SAVE ALL PENDING DATA TO THE RESOURCE DATABASE

1.4.3 Constructing an Annotation Mapping using Megablast

What if someone provided you with a set of ``exon annotations'' in the form of short sequences representing the exons, rather than actual genomic coordinates? Again, pygr makes this mapping extremely easy to save:
from pygr import seqdb,cnestedlist
annoDB = seqdb.AnnotationDB(None,genome,'exon', # CREATE THE ANNOTATION DB
                            filename='exonAnnot',mode='c') # STORE ON DISK
nlmsa = cnestedlist.NLMSA('exonMap','w', # STORE SEQ->ANNOT MAPPING AS AN ALIGNMENT
                          pairwiseMode=True,bidirectional=False)
for id,s in exonSeqs.items(): # SAVE ALL ANNOTATION INTERVALS
  for ann in annoDB.add_homology(s,'megablast',id=id,maxseq=1,minIdentity=98,maxLoss=2):
    nlmsa.addAnnotation(ann)
nlmsa.build() # WRITE INDEXES FOR THE ALIGNMENT
annoDB.close() # SAVE ALL OUR ANNOTATION DATA TO DISK
annoDB.__doc__ = 'exon annotation on the human genome'
pygr.Data.Bio.Genomics.ASAP2.exons = annoDB # ADD AS A PYGR.DATA RESOURCE
nlmsa.__doc__ = 'map human genome regions to contained exons'
pygr.Data.Bio.Genomics.ASAP2.exonmap = nlmsa # NOW SAVE MAPPING AND SCHEMA
pygr.Data.schema.Bio.Genomics.ASAP2.exonmap = \
      pygr.Data.ManyToManyRelation(genome,annoDB,bindAttrs=('exons',))
pygr.Data.save() # SAVE ALL PENDING DATA TO THE RESOURCE DATABASE

1.5 Using Pygr as a Graph Database

The real power of Pygr is that it provides a simple model for viewing all data as graph databases- in which all data are represented as nodes and connections between nodes (edges), and queries are formulated as a specific pattern of connections to find - in a very Pythonic style. To illustrate the simplicity and power of the graph database approach, Pygr has a strong emphasis on bioinformatics applications ranging from genome-wide analysis of alternative splicing patterns, to comparative genomics queries of multi-genome alignment data.

The following introductory examples show how to use Pygr for graph queries, sequence searching and alignment queries, annotation queries, and multigenome alignment queries.

1.5.1 Example: Simple graph query

Why would you want to use Pygr? Interesting data often consists of specific graph structures, and these relationships are much easier to describe as graphs than they are in SQL. For example, the simplest and most common form of alternative splicing is exon-skipping, where an exon is either skipped or included (see slide 15 of the ISMB slides for a picture). This can be defined immediately as a graph in which three nodes (exons 1, 2, 3) are joined by edges either as 1-2-3 or 1-3. Unfortunately, writing an SQL query for this simple pattern requires a 6-way JOIN (argh).

SELECT * FROM exons t1, exons t2, exons t3, splices t4, splices t5, splices t6 
WHERE t1.cluster_id=t4.cluster_id AND t1.gen_end=t4.gen_start 
  AND t4.cluster_id=t2.cluster_id AND t4.gen_end=t2.gen_start 
  AND t2.cluster_id=t5.cluster_id AND t2.gen_end=t5.gen_start 
  AND t5.cluster_id=t3.cluster_id AND t5.gen_end=t3.gen_start 
  AND t1.cluster_id=t6.cluster_id AND t1.gen_end=t6.gen_start 
  AND t6.cluster_id=t3.cluster_id AND t6.gen_end=t3.gen_start;

Such a six-way JOIN is painfully slow in a relational database; in general such queries just aren't practical. More fundamentally, the relational schema is forced to represent the graph relation with combinations of foreign keys and other data that the user really should not have to remember. All the user should know is that there is a specific relation, e.g. from this exon, the "next" exon is X, and the relation joining them is splice Y.

In Pygr, writing the query is just a matter of writing down the graph (edges from 1 to 2, 1 to 3, and 2 to 3, but no special "edge information"):

queryGraph = {1:{2:None,3:None},2:{3:None},3:{}}

We can now execute the query using the GraphQuery class:

results = [dict(m) for m in GraphQuery(spliceGraph,queryGraph)]

This is more or less equivalent to writing a bunch of for-loops for iterating over the possible closures:

results = []
for e1 in spliceGraph: # FIND ALL EXONS
    for e2 in spliceGraph[e1]: # NEXT EXON
        for e3 in spliceGraph[e2]: # NEXT EXON
            if e3 in spliceGraph[e1]: # MAKE SURE SPLICE FROM e1 -> e3
                results.append({1:e1,2:e2,3:e3}) # OK, SAVE MATCH

It is often convenient to bind an object attribute to a graph, so that you can use either the graph syntax or a traditional object attribute and mean exactly the same thing. In the splice graph example, we bind the exon.next attribute to the spliceGraph, so the above for-loops can also be written:

results = []
for e1 in spliceGraph: # FIND ALL EXONS
    for e2 in e1.next: # NEXT EXON
        for e3 in e2.next: # NEXT EXON
            if e3 in e1.next: # MAKE SURE SPLICE FROM e1 -> e3
                results.append({1:e1,2:e2,3:e3}) # OK, SAVE MATCH

Another interesting query in the alternative splicing field is the so-called U12-adapter exon query (see slide 21 of the ISMB slides):

queryGraph = {0:{1:dict(dataGraph=alt5Graph),
                 2:dict(filter=lambda edge,**kwargs:edge.type=='U11/U12')},
              1:{3:None},
              2:{3:None},
              3:{}}

Here we use edge information in the query graph to add a few constraints:

Note that the query graph "nodes" (in this example, the integers 0, 1, 2, 3) are quite arbitrary. We could have used strings, or other kinds of objects instead.

Now if we want to see the results right away, we use the mapping returned by GraphQuery to look at individual nodes and edges of the dataGraph that matched our query:

for m in GraphQuery(spliceGraph,queryGraph):
    print m[1].id,m[0,2].id # PRINT EXON ID FOR EXON 1,
                            # SPLICE ID FOR SPLICE 0 -> 2

The match is returned by GraphQuery as a mapping from nodes and edges of the query graph to nodes and edges of the data graph. Edges are specified simply as tuples of the nodes you want to get the edge for (in this example 0,2). Constructing a Graph How was the spliceGraph created in the first place? Let's say we have an initial list of tuples giving connections between exon objects and splice objects, where each tuple consists of a pair of exons connected by a splice.

for exon1,exon2,splice in spliceConnections: 
    spliceGraph += exon1 # add exon1 as a node in the graph
    spliceGraph += exon2 # if already a node in the graph, does nothing...
    exon1.next[exon2] = splice # add an edge, with splice as the edgeinfo

The last operation makes use of the binding of exon.next to spliceGraph, and is equivalent to

spliceGraph[exon1][exon2] = splice

If we didn't want to save the edge information, we could use the simpler syntax

spliceGraph[exon1] += exon2 # equivalent to exon1.next+=exon2

This "short" form is equivalent to saving None as the edge information.

1.5.2 Alignment Query as a Graph Database Query

Pygr can do much more sophisticated analyses than this fairly easily. Just to give a taste of how to use these capabilities, we will illustrate one example of a standard Pygr model: querying Pygr data by drawing a ``query graph'' showing the connections we want to find, and running its GraphQuery() engine. Since Pygr alignments follow the same interface as any Pygr graph, we can query them using the standard GraphQuery class. Let's say we have a Python script load_alignments.py that loads two alignments:

To find out how the known SwissProt annotations map on to our mRNA sequences requires a join, which can be formulated as a simple Pygr graph, consisting of a mapping of an mRNA sequence interval (node 1), onto a SwissProt sequence interval (node 2), onto a feature annotation (node 3):

>>> from load_alignments import * # load the alignments
>>> from pygr.graphquery import *      # import the graph query code
# draw a graph using a dict.  Note: edge 2->3 must come from swiss_features
>>> queryGraph = {1:{2:None},2:{3:dict(dataGraph=swiss_features)},3:{}} 
# run the query and save the mappings
>>> l = [dict(d) for d in GraphQuery(mRNA_swiss,queryGraph)] 
>>> len(l) # how many annotations mapped onto our mRNA sequences?
4703

We assumed that mRNA_swiss would be passed as the default dataGraph, and specified directly that edge 2->3 should be looked up in swiss_features. We then captured all the results from the GraphQuery iterator using a Python list comprehension. Note that since the iterator returns each result in the same container (mapping object), if we want to save all the individual results we have to copy each one to a new mapping (dict) object, as illustrated in this example. Storing Alignments in a Relational Database

1.5.3 Example: a MySQL Database OBSOLETE

Here's an example of working with sequences from a relational database:

>>> import MySQLdb # standard module for accessing MySQL, now get a cursor...
>>> rdb = MySQLdb.connect(db='HUMAN_SPLICE_03',read_default_file=os.environ['HOME'
]+'/.my.cnf')
>>> t = SQLTable('genomic_cluster_JUN03',rdb.cursor()) #interface to a table of
 sequences
>>> from pygr.seqdb import *   # pygr module for working with sequences from databases
>>> t.objclass(DNASQLSequence) #use this class as "row objects"
>>> s2 = t['Hs.1162'] # get a specific sequence object by ID
>>> str(s2[1000:1050]) # this will only get 50 nt of the genomic sequence from 
MySQL
'acctgggtgatgaaataaatttttacgccaaatcccgatgacacacaatt'

(Note: in this example we used MySQLdb.connect()'s ability to read database server and user authentication information directly from the standard /.my.cnf file normally used by the MySQL client).


1.5.4 More examples

Additional examples of how to use pygr can be found in the tests/ directory within the pygr distribution package.

See About this document... for information on suggesting changes.