Subsections


1 Introduction

Pygr is an open source software project used to develop graph database interfaces for the popular Python language, with a strong emphasis on bioinformatics applications ranging from genome-wide analysis of alternative splicing patterns, to comparative genomics queries of multi-genome alignment data.

The following introductory examples show how to use Pygr for graph queries, sequence searching and alignment queries, annotation queries, and multigenome alignment queries.

1.1 Example: Simple graph query

Why would you want to use Pygr? Interesting data often consists of specific graph structures, and these relationships are much easier to describe as graphs than they are in SQL. For example, the simplest and most common form of alternative splicing is exon-skipping, where an exon is either skipped or included (see slide 15 of the ISMB slides for a picture). This can be defined immediately as a graph in which three nodes (exons 1, 2, 3) are joined by edges either as 1-2-3 or 1-3. Unfortunately, writing an SQL query for this simple pattern requires a 6-way JOIN (argh).

SELECT * FROM exons t1, exons t2, exons t3, splices t4, splices t5, splices t6 
WHERE t1.cluster_id=t4.cluster_id AND t1.gen_end=t4.gen_start 
  AND t4.cluster_id=t2.cluster_id AND t4.gen_end=t2.gen_start 
  AND t2.cluster_id=t5.cluster_id AND t2.gen_end=t5.gen_start 
  AND t5.cluster_id=t3.cluster_id AND t5.gen_end=t3.gen_start 
  AND t1.cluster_id=t6.cluster_id AND t1.gen_end=t6.gen_start 
  AND t6.cluster_id=t3.cluster_id AND t6.gen_end=t3.gen_start;

Such a six-way JOIN is painfully slow in a relational database; in general such queries just aren't practical. More fundamentally, the relational schema is forced to represent the graph relation with combinations of foreign keys and other data that the user really should not have to remember. All the user should know is that there is a specific relation, e.g. from this exon, the "next" exon is X, and the relation joining them is splice Y.

In Pygr, writing the query is just a matter of writing down the graph (edges from 1 to 2, 1 to 3, and 2 to 3, but no special "edge information"):

queryGraph={1:{2:None,3:None},2:{3:None},3:{}}

We can now execute the query using the GraphQuery class:

results=[dict(m) for m in GraphQuery(spliceGraph,queryGraph)]

This is more or less equivalent to writing a bunch of for-loops for iterating over the possible closures:

results=[]
for e1 in spliceGraph: # FIND ALL EXONS
    for e2 in spliceGraph[e1]: # NEXT EXON
        for e3 in spliceGraph[e2]: # NEXT EXON
            if e3 in spliceGraph[e1]: # MAKE SURE SPLICE FROM e1 -> e3
                results.append({1:e1,2:e2,3:e3}) # OK, SAVE MATCH

It is often convenient to bind an object attribute to a graph, so that you can use either the graph syntax or a traditional object attribute and mean exactly the same thing. In the splice graph example, we bind the exon.next attribute to the spliceGraph, so the above for-loops can also be written:

results=[]
for e1 in spliceGraph: # FIND ALL EXONS
    for e2 in e1.next: # NEXT EXON
        for e3 in e2.next: # NEXT EXON
            if e3 in e1.next: # MAKE SURE SPLICE FROM e1 -> e3
                results.append({1:e1,2:e2,3:e3}) # OK, SAVE MATCH

Another interesting query in the alternative splicing field is the so-called U12-adapter exon query (see slide 21 of the ISMB slides):

queryGraph={0:{1:dict(dataGraph=alt5Graph),
               2:dict(filter=lambda edge,**kwargs:edge.type=='U11/U12')},
            1:{3:None},
            2:{3:None},
            3:{}}

Here we use edge information in the query graph to add a few constraints:

Note that the query graph "nodes" (in this example, the integers 0, 1, 2, 3) are quite arbitrary. We could have used strings, or other kinds of objects instead.

Now if we want to see the results right away, we use the mapping returned by GraphQuery to look at individual nodes and edges of the dataGraph that matched our query:

for m in GraphQuery(spliceGraph,queryGraph):
    print m[1].id,m[0,2].id # PRINT EXON ID FOR EXON 1,
                            # SPLICE ID FOR SPLICE 0 -> 2

The match is returned by GraphQuery as a mapping from nodes and edges of the query graph to nodes and edges of the data graph. Edges are specified simply as tuples of the nodes you want to get the edge for (in this example 0,2). Constructing a Graph How was the spliceGraph created in the first place? Let's say we have an initial list of tuples giving connections between exon objects and splice objects, where each tuple consists of a pair of exons connected by a splice.

for exon1,exon2,splice in spliceConnections: 
    spliceGraph+=exon1 # add exon1 as a node in the graph
    spliceGraph+=exon2 # if already a node in the graph, does nothing...
    exon1.next[exon2]=splice # add an edge, with splice as the edgeinfo

The last operation makes use of the binding of exon.next to spliceGraph, and is equivalent to

spliceGraph[exon1][exon2]=splice

If we didn't want to save the edge information, we could use the simpler syntax

spliceGraph[exon1]+=exon2 # equivalent to exon1.next+=exon2

This "short" form is equivalent to saving None as the edge information.


1.2 Sequence / Alignment Tutorial

Sequences and alignments also can be modeled as graph structures in Pygr, providing the same consistent and simple framework for queries.

1.2.1 Sequence Objects

Pygr tries to provide a very "Pythonic" model for sequences. This python interpreter session illustrates some simple features:

>>> from pygr.sequence import *
>>> s=Sequence('attatatgccactat','bobo') #create a sequence named bobo
>>> s # interpreter will print repr(s)
bobo[0:15]
>>> t=s[-8:] #python slice gives last 8 nt of s
>>> t # interpreter will print repr(t)
bobo[7:15]
>>> str(t) #string conversion just yields the sequence as a string 
'gccactat'
>>> rc= -s #get the reverse complement
>>> str(rc[:5]) #its first five letters
'atagt'

Several points:

1.2.2 Working with Sequences from Databases

Pygr provides a variety of "back-end" implementations of sequence objects, ranging from sequences stored in a relational database table, or a BLAST database, to sequences created by the user in Python (as above). All of these provide the same consistent interface, and in general try to be efficient. For example, Pygr sequence objects are just "placeholders" that record what sequence interval you're working with, but if the back-end is an external database, the sequence object itself does not store the sequence, and creating new sequence objects (e.g. taking slices of the object as above) will not require anything to be done on the actual sequence itself (such as copying a portion of it). Pygr only obtains sequence information when you actually ask for it (e.g. by taking the string value str(s) of a sequence object), and normally only obtains just the portion that you ask for (i.e. str(s[1000000:1000100]) only obtains 100nt of sequence, even if s is a 100 megabase sequence. By contrast str(s)[1000000:1000100] would force it to obtain the whole sequence from the database, then slice out just the 100 nt you selected).

Here's an example of working with sequences from a relational database:

>>> import MySQLdb # standard module for accessing MySQL, now get a cursor...
>>> rdb=MySQLdb.connect(db='HUMAN_SPLICE_03',read_default_file=os.environ['HOME'
]+'/.my.cnf')
>>> t=SQLTable('genomic_cluster_JUN03',rdb.cursor()) #interface to a table of
 sequences
>>> from pygr.seqdb import *   # pygr module for working with sequences from databases
>>> t.objclass(DNASQLSequence) #use this class as "row objects"
>>> s2=t['Hs.1162'] # get a specific sequence object by ID
>>> str(s2[1000:1050]) # this will only get 50 nt of the genomic sequence from 
MySQL
'acctgggtgatgaaataaatttttacgccaaatcccgatgacacacaatt'

(Note: in this example we used MySQLdb.connect()'s ability to read database server and user authentication information directly from the standard /.my.cnf file normally used by the MySQL client).

Here's an example of working with sequences from a BLAST database:

>>> from pygr.seqdb import *
>>> db=BlastDB('sp') # open BLAST database associated with FASTA file 'sp'
>>> s=db['CYGB_HUMAN'][90:150] # get a sequence by ID, and take a slice
>>> str(s)
'TVVENLHDPDKVSSVLALVGKAHALKHKVEPVYFKILSGVILEVVAEEFASDFPPETQRA'
>>> m=db.blast(s) # get alignment to all BLAST hits in db
>>> for e in m.edges(): # print out the alignment edges
...     print e.srcPath,repr(e.srcPath),'\n',e.destPath,repr(e.destPath),e.blast
_score,'\n'
... 
TVVENLHDPDKVSSVLALVGKAHALKHKVEPVYFKILSGVILEVVAEEFASDFPP CYGB_HUMAN[90:145] 
TLVENLRDADKLNTIFNQMGKSHALRHKVDPVYFKILAGVILEVLVEAFPQCFSP CYGB_BRARE[87:142] 72

TVVENLHDPDKVSSVLALVGKAHALKHKVEPVYFKILSGVILEVVAEEFASDFPPETQRA CYGB_HUMAN[90:150] 
TVVENLHDPDKVSSVLALVGKAHALKHKVEPVYFKILSGVILEVVAEEFASDFPPETQRA CYGB_HUMAN[90:150]
 120

TVVENLHDPDKVSSVLALVGKAHALKHKVEPVYFKILSGVILEVVAEEFASDFPPETQRA CYGB_HUMAN[90:150] 
TVVENLHDPDKVSSVLALVGKAHALKHKVEPMYFKILSGVILEVIAEEFANDFPVETQKA CYGB_MOUSE[90:150]
112 
...

This example introduces the use of a Pygr alignment object to store the mapping of s onto homologous sequences in db, obtained from BLAST. Here's what Pygr actually does:

1.2.3 Alignment Graph Basics

Pygr multiple alignment objects can be treated as mappings of sequence intervals onto sequence intervals. Here is a very simple example (continuing from the previous example), showing basic operations for constructing an alignment:

>>> m2=PathMapping() # create an empty alignment object
>>> m2[s]=db['MYG_CHICK'][83:143] # add an edge mapping interval s -> an interva
l of MYG_CHICK
>>> m2[s[:10] ] # get alignment for first 10 letters, and print representation
MYG_CHICK[83:93]
>>> m2[s]=db['MYG_CANFA'][45:105] # add an additional edge
>>> for s2 in m2[s[:10] ]: # get aligned seqs for the first 10 letters again...
...     print repr(s2)
...
MYG_CHICK[83:93]
MYG_CANFA[45:55]
In this case we created an alignment of sequence intervals without edge information. However, we can also store edge information with a given alignment edge, in the "usual" Pygr way:

>>> class Foo(object): pass # create a dummy class that we can add attributes to
...
>>> a=Foo() # create a dummy object
>>> a.x=7 # add an attribute
>>> m2=PathMapping() # create an empty alignment object
>>> m2+=s # add as node to the alignment graph
>>> m2[s][db['MYG_CHICK'][83:143] ]=a # add an edge mapping interval s -> an int
erval of MYG_CHICK
>>> for e in m2[s[:10] ].edges(): # get alignment edges for the first 10 letters
 again...
...     print repr(e.destPath),e.x # we can access edge attributes
...
MYG_CHICK[83:93] 7

1.2.4 Alignment Query

Since Pygr alignments follow the same interface as any Pygr graph, we can query them using the standard GraphQuery class. Let's say we have a Python script load_alignments.py that loads two alignments:

To find out how the known SwissProt annotations map on to our mRNA sequences requires a join, which can be formulated as a simple Pygr graph, consisting of a mapping of an mRNA sequence interval (node 1), onto a SwissProt sequence interval (node 2), onto a feature annotation (node 3):

>>> from load_alignments import * # load the alignments
>>> from pygr.graphquery import *      # import the graph query code
>>> queryGraph={1:{2:None},2:{3:dict(dataGraph=swiss_features)},3:{}} # 
edge 2->3 must come from swiss_features
>>> l=[dict(d) for d in GraphQuery(mRNA_swiss,queryGraph)] # run the query and 
save the mappings
>>> len(l) # how many annotations mapped onto our mRNA sequences?
4703

We assumed that mRNA_swiss would be passed as the default dataGraph, and specified directly that edge 2->3 should be looked up in swiss_features. We then captured all the results from the GraphQuery iterator using a Python list comprehension. Note that since the iterator returns each result in the same container (mapping object), if we want to save all the individual results we have to copy each one to a new mapping (dict) object, as illustrated in this example. Storing Alignments in a Relational Database

1.2.5 Example: Working with the UCSC Multigenome Alignment Data

David Haussler's group has constructed alignments of multiple genomes. These alignments are extremely useful and interesting, but so large that it is cumbersome to work with the dataset using conventional methods. For example, for the 8-genome alignment you have to work simultaneously with the individual genome datasets for human, chimp, mouse, rat, dog, chicken, fugu and zebrafish, as well as the huge alignment itself. Pygr makes this quite easy. Here we illustrate an example of mapping an alternative 3' exon, which has two alternative splice sites (start1 and start2) and a single terminal splice site (stop). We use the alignment database to map each of these splice sites onto all the aligned genomes, and to print the percent-identity and percent-aligned for each genome, as well as the two nucleotides consituting the splice site itself. To examine the conservation of the two exonic regions (between start1 and start2, and the adjacent region terminated by stop, we print the same information for each genome's alignment to these two regions as well. The code first opens the alignment database. The function (getAlt3Conservation) obtains sequence slice objects representing the various ``sites'' to be queried. The actual alignment database query is performed in printResults:

from pygr import cnestedlist
msa=cnestedlist.NLMSA('/usr/tmp/ucscDB/mafdb','r') # OPEN THE ALIGNMENT DB

def printResults(prefix,msa,site,altID='NULL',cluster_id='NULL',dbUnion=None):
    'get alignment of each genome to site, print %identity and %aligned'
    for src,dest,edge in msa[site].edges(mergeMost=True): # ALIGNMENT QUERY!
        print '%s\t%s\t%s\t%s\t%2.1f\t%2.1f\t%s\t%s' \
              %(altID,cluster_id,prefix,dbUnion.dicts[dest.pathForward.db],
                100.*edge.pIdentity(),100.*edge.pAligned(),src[:2],dest[:2])

def getAlt3Conservation(msa,gene,start1,start2,stop,**kwargs):
    'gene must be a slice of a sequence in our genome alignment msa'
    ss1=gene[start1-2:start1] # USE SPLICE SITE COORDINATES
    ss2=gene[start2-2:start2]
    ss3=gene[stop:stop+2]
    e1=ss1+ss2 # GET INTERVAL BETWEEN PAIR OF SPLICE SITES
    e2=gene[max(start1,start2):stop] # GET INTERVAL BETWEEN e1 AND stop
    zone=e1+ss3 # USE zone AS COVERING INTERVAL TO BUNDLE fastacmd REQUESTS
    cache=msa[zone].keys(mergeMost=True) # PYGR BUNDLES REQUESTS TO MINIMIZE TRAFFIC
    for prefix,site in [('ss1',ss1),('ss2',ss2),('ss3',ss3),('e1',e1),('e2',e2)]:
        printResults(prefix,msa,site,dbUnion=msa.seqDict,**kwargs)

# RUN A QUERY LIKE THIS...
# getAlt3Conservation(msa,some_gene,some_start,other_start,stop)

A few notes:

(Actually, because of Pygr's caching / optimizations, considerably more is going on than indicated in this simplified sketch. But you get the idea: Pygr makes it relatively effortless to work with a variety of disparate (and large) resources in an integrated way.)

Here is some example output:

1       Mm.99996        ss1     hg17    50.0    100.0   AG      GG
1       Mm.99996        ss1     canFam1 50.0    100.0   AG      GG
1       Mm.99996        ss1     panTro1 50.0    100.0   AG      GG
1       Mm.99996        ss1     rn3     100.0   100.0   AG      AG
1       Mm.99996        ss2     hg17    100.0   100.0   AG      AG
1       Mm.99996        ss2     canFam1 100.0   100.0   AG      AG
1       Mm.99996        ss2     panTro1 100.0   100.0   AG      AG
1       Mm.99996        ss2     rn3     100.0   100.0   AG      AG
1       Mm.99996        ss3     hg17    100.0   100.0   GT      GT
1       Mm.99996        ss3     canFam1 100.0   100.0   GT      GT
1       Mm.99996        ss3     panTro1 100.0   100.0   GT      GT
1       Mm.99996        ss3     rn3     100.0   100.0   GT      GT
1       Mm.99996        e1      hg17    78.9    100.0   AG      GG
1       Mm.99996        e1      canFam1 84.2    100.0   AG      GG
1       Mm.99996        e1      panTro1 77.6    100.0   AG      GG
1       Mm.99996        e1      rn3     97.4    98.7    AG      AG
1       Mm.99996        e2      hg17    91.6    99.1    CC      CC
1       Mm.99996        e2      canFam1 88.8    99.1    CC      CC
1       Mm.99996        e2      panTro1 91.6    99.1    CC      CC
1       Mm.99996        e2      rn3     97.2    100.0   CC      CC

1.2.6 Building an Alignment Database from MAF files

UCSC has defined a new file format for large multigenome alignment, called MAF. Pygr provides high-performance utilities for reading MAF alignment files and building a disk-based NLMSA alignment database. (These utilities are written in C for performance). Here's an example of building an alignment database from scratch using a set of MAF files stored in a directory called maf/:

import os
from pygr import cnestedlist,seqdb

genomes={'hg17':'hg17','mm5':'mm5', 'rn3':'rn3', 'canFam1':'cf1', 
         'danRer1':'dr1', 'fr1':'fr1','galGal2':'gg2', 'panTro1':'pt1'}
for k,v in genomes.items(): # PREFIX DICTIONARY FOR UNION OF GENOMES
    genomes[k]=seqdb.BlastDB(v) # USE v AS FILENAME FOR FASTA FILE
genomeUnion=seqdb.PrefixUnionDict(genomes) # CREATE UNION OF THESE DBs
# CREATE NLMSA DATABASE ucsc8 ON DISK, FROM MAF FILES IN maf/
msa=cnestedlist.NLMSA('ucsc8','w',genomeUnion,os.listdir('maf'))

The only real work here is due to the fact that UCSC's MAF files use a prefix.suffix notation for identifying specific sequences, where prefix gives the name of the genome, and suffix gives the identifier of the sequence in that genome database. Here we use Pygr's PrefixUnionDict class to wrap the set of genome databases in a dict-like interface that accepts string keys of the form prefix.suffix and returns the right sequence object from the right genome database. As an added twist, the genome names in the MAF files match the filenames of the associated genome databases in most cases, but not all, so we have to create an initial dictionary giving the correct mapping. Actually building the NLMSA requires just one line, but actually a number of steps are happening behind the scenes:

1.2.7 Persistent Alignment Storage

Pygr makes it easy to store alignments persistently in a relational database. To illustrate, here we run BLAST to generate an alignment of one sequence to manyother sequences, and store it to a MySQL table test.alignment:

from pygr.seqdb import *
db=BlastDB('sp') # open BLAST database associated with FASTA file 'sp'
m=db.blast(db['CYGB_HUMAN']) # get alignment to all BLAST hits in db
import MySQLdb # import the MySQL adapter
import os
# open a connection to the test database
cursor=MySQLdb.Connection(db='test',read_default_file=os.environ['HOME']+'/.my.cnf').cursor()
# save the alignment, creating table test.alignment if not exists
createTableFromRepr(m.repr_dict(),'test.alignment',cursor,
                    {'src_id':'varchar(12)','dest_id':'varchar(12)'})

The only line here that needs comment is Pygr's createTableFromRepr() function. It takes four arguments:

1. an iterator, each of whose return values is a dictionary mapping column names to value representations that can be stored in a relational table. Here we have passed m.repr_dict() as the iterator. It is a generator function that simply yields edge.repr_dict() on each of the edges stored in m. The edge.repr_dict() method in turn simply returns a dictionary mapping column names (like 'src_id') onto values that are storable in a relational table. 2. a table name: in this case, 'test.alignment' 3. a cursor for connecting to the database 4. an optional dictionary for customizing exactly how specific columns should be stored in the database. In this case, we specified varchar(12) for both src_id and dest_id (the identifiers of the source and destination (aligned) sequences)

This function constructs a schema and creates the table in the database if it does not already exist. It then saves all of the output of m.repr_dict() as rows in the database table. To see exactly what it generates, we can look at its output interactively:

>>> for e in m.repr_dict(): break
... 
>>> e
{'src_id': 'CYGB_HUMAN', 'blast_score': 344, 'dest_id': 'CYGB_HUMAN', 'src_end':
 190, 'src_ori': 1, 'percent_id': 90, 'dest_ori': 1, 'src_start': 0, 'dest_start' : 0, e_value': 94.301000000000002, 'dest_end': 190}

Here is the schema automatically created by createTableFromRepr():

mysql> desc alignment;
+-------------+-------------+------+-----+---------+-------+
| Field       | Type        | Null | Key | Default | Extra |
+-------------+-------------+------+-----+---------+-------+
| src_id      | varchar(12) |      | MUL |         |       |
| blast_score | int(11)     |      |     | 0       |       |
| dest_id     | varchar(12) |      | MUL |         |       |
| src_end     | int(11)     |      |     | 0       |       |
| src_ori     | int(11)     |      |     | 0       |       |
| percent_id  | int(11)     |      | MUL | 0       |       |
| dest_ori    | int(11)     |      |     | 0       |       |
| src_start   | int(11)     |      |     | 0       |       |
| dest_start  | int(11)     |      |     | 0       |       |
| e_value     | float       |      |     | 0       |       |
| dest_end    | int(11)     |      |     | 0       |       |
+-------------+-------------+------+-----+---------+-------+
11 rows in set (0.01 sec)

Thus the basic format of stored alignments is

We can now make use of the stored alignment in Pygr just as if it were a normal (in-memory) alignment object. The only difference is we use the StoredPathMapping class instead of the standard PathMapping class. Here is an example script that connects to the stored alignment, and prints out the alignment of one sequence:

import MySQLdb
from pygr.seqdb import *
cursor=MySQLdb.Connection(db='test',read_default_file=os.environ['HOME']+'/.my.cnf').cursor()
t=SQLTableMultiNoCache('test.alignment',cursor)
t._distinct_key='src_id' # use this column as the ID field to search
sp=BlastDB('sp') # OPEN SWISSPROT BLAST DB
m=StoredPathMapping(t,sp,sp) # tell it to search sp for both source and target 
sequence IDs
for e in m[sp['CYGB_HUMAN'] ].edges(): # SHOW ALL ALIGNMENTS FOR THIS SEQUENCE
    print repr(e.srcPath),repr(e.destPath),e.blast_score

StoredPathMapping expects three arguments:

1. an SQLTable object which provides the interface to the database table containing the alignment. t is just an interface to the MySQL table. Since each sequence can potentially have many rows (many alignment intervals against one or more destination sequences), we used the SQLTableMultiNoCache subclass: each key value (source sequence ID) can have multiple rows; and we chose not to have it cache rows that it retrieves (there would be no benefit from doing so, since StoredPathMapping itself performs result caching). Also note the t._distinct_key assignment: this informs the table object that when we try to access a particular key (e.g. t['CYGB_HUMAN']), the column to search for this key is 'src_id'. 2. a sequence dictionary object from which to obtain the source sequences. This dictionary must map a source sequence identifier to an actual sequence object representing that sequence. In this case, we just used our sp BLAST database as the source sequence dictionary. 3. a sequence dictionary object from which to obtain the destination sequences. This dictionary must map a destination sequence identifier to an actual sequence object representing that sequence. We just used the sp database again.

Thus we informed StoredPathMapping to treat t as a mapping of sp onto itself.

The StoredPathMapping object (m) acts as an interface to the relational database, which we can treat just like a regular PathMapping alignment object. The difference is that it looks up the alignment from the back-end database table when you request a mapping of a specific sequence. In other words, only when we invoked m[sp['CYGB_HUMAN'] ] did it obtain the alignment information for that source sequence from the database; once retrieved, this information is cached locally so that subsequent queries of the same source sequence will just use the local data without having to re-query the database. Example: Mapping an entire gene set onto a new genome version To illustrate how Pygr can perform a big task with a little code, here is an example that maps a set of gene sequences onto a new version of the genome, using megablast to do the mapping, and a relational database to store the results. Moreover, since mapping 80,000 gene clusters takes a fair amount of time, the calculation is parallelized to run over a large number of compute nodes simultaneously:

from pygr.apps.leelabdb import * # this accesses our databases
from pygr import coordinator     # this provides parallelization support

def map_clusters(server,genome_rsrc='hg17',dbname='HUMAN_SPLICE_03',
                 result_table='GENOME_ALIGNMENT.hg17_cluster_JUN03_all',
                 rmOpts='',**kwargs):
    "CLIENT FUNCTION: map clusters one by one"
    # CONSTRUCT RESOURCE FOR US IF NEEDED
    genome=BlastDB(ifile=server.open_resource(genome_rsrc,'r'))
    # LOAD DB SCHEMA
    (clusters,exons,splices,genomic_seq,spliceGraph,alt5Graph,alt3Graph,mrna, 
    protein, clusterExons,clusterSplices)=getSpliceGraphFromDB(spliceCalcs[dbname])
    # NOW MAP CLUSTER SEQUENCES ONE BY ONE TO OUR NEW genome
    for cluster_id in server:
        g=genomic_seq[cluster_id] # GET THE OLD GENOMIC SEQUENCE FOR THIS CLUSTER
        m=genome.megablast(g,maxseq=1,minIdentity=98,rmOpts=rmOpts) # MASK, BLAST, READ INTO m
        # SAVE ALIGNMENT m TO DATABASE TABLE result_table USING cursor
        createTableFromRepr(m.repr_dict(),result_table,clusters.cursor,
                            {'src_id':'varchar(12)','dest_id':'varchar(12)'})
        yield cluster_id # WE MUST FUNCTION AS GENERATOR TO KEEP ERROR TRAPPING 
		         # HAPPY

def serve_clusters(dbname='HUMAN_SPLICE_03',
                   source_table='HUMAN_SPLICE_03.genomic_cluster_JUN03',**kwargs):
    "SERVER FUNCTION: serve up cluster_id one by one to as many clients as you want"
    cursor=getUserCursor(dbname)
    t=SQLTable(source_table,cursor)
    for id in t:
        yield id # HAND OUT ONE CLUSTER ID TO A CLIENT

if __name__=='__main__': # AUTOMATICALLY RUN EITHER THE CLIENT OR SERVER FUNCTION
    coordinator.start_client_or_server(map_clusters,serve_clusters,['hg17'],__file__)

First, let's just focus on the map_clusters() function, which illustrates how the mapping of each gene is generated and saved. Let's examine the data piece by piece:

This code will run in parallel over as many compute nodes as you have free, using Pygr's coordinator module. The parallelization model for this particular task is simple: a single iterator (server) dispensing task IDs to many clients.


1.2.8 More examples

Additional examples of how to use pygr can be found in the tests/ directory within the pygr distribution package.

See About this document... for information on suggesting changes.