Pygr is an open source software project used to develop graph database interfaces for the popular Python language, with a strong emphasis on bioinformatics applications ranging from genome-wide analysis of alternative splicing patterns, to comparative genomics queries of multi-genome alignment data.
The following introductory examples show how to use Pygr for graph queries, sequence searching and alignment queries, annotation queries, and multigenome alignment queries.
SELECT * FROM exons t1, exons t2, exons t3, splices t4, splices t5, splices t6 WHERE t1.cluster_id=t4.cluster_id AND t1.gen_end=t4.gen_start AND t4.cluster_id=t2.cluster_id AND t4.gen_end=t2.gen_start AND t2.cluster_id=t5.cluster_id AND t2.gen_end=t5.gen_start AND t5.cluster_id=t3.cluster_id AND t5.gen_end=t3.gen_start AND t1.cluster_id=t6.cluster_id AND t1.gen_end=t6.gen_start AND t6.cluster_id=t3.cluster_id AND t6.gen_end=t3.gen_start;
Such a six-way JOIN is painfully slow in a relational database; in general such queries just aren't practical. More fundamentally, the relational schema is forced to represent the graph relation with combinations of foreign keys and other data that the user really should not have to remember. All the user should know is that there is a specific relation, e.g. from this exon, the "next" exon is X, and the relation joining them is splice Y.
In Pygr, writing the query is just a matter of writing down the graph (edges from 1 to 2, 1 to 3, and 2 to 3, but no special "edge information"):
queryGraph={1:{2:None,3:None},2:{3:None},3:{}}
We can now execute the query using the GraphQuery class:
results=[dict(m) for m in GraphQuery(spliceGraph,queryGraph)]
This is more or less equivalent to writing a bunch of for-loops for iterating over the possible closures:
results=[]
for e1 in spliceGraph: # FIND ALL EXONS
for e2 in spliceGraph[e1]: # NEXT EXON
for e3 in spliceGraph[e2]: # NEXT EXON
if e3 in spliceGraph[e1]: # MAKE SURE SPLICE FROM e1 -> e3
results.append({1:e1,2:e2,3:e3}) # OK, SAVE MATCH
It is often convenient to bind an object attribute to a graph, so that you can use either the graph syntax or a traditional object attribute and mean exactly the same thing. In the splice graph example, we bind the exon.next attribute to the spliceGraph, so the above for-loops can also be written:
results=[]
for e1 in spliceGraph: # FIND ALL EXONS
for e2 in e1.next: # NEXT EXON
for e3 in e2.next: # NEXT EXON
if e3 in e1.next: # MAKE SURE SPLICE FROM e1 -> e3
results.append({1:e1,2:e2,3:e3}) # OK, SAVE MATCH
Another interesting query in the alternative splicing field is the so-called U12-adapter exon query (see slide 21 of the ISMB slides):
queryGraph={0:{1:dict(dataGraph=alt5Graph),
2:dict(filter=lambda edge,**kwargs:edge.type=='U11/U12')},
1:{3:None},
2:{3:None},
3:{}}
Here we use edge information in the query graph to add a few constraints:
Note that the query graph "nodes" (in this example, the integers 0, 1, 2, 3) are quite arbitrary. We could have used strings, or other kinds of objects instead.
Now if we want to see the results right away, we use the mapping returned by GraphQuery to look at individual nodes and edges of the dataGraph that matched our query:
for m in GraphQuery(spliceGraph,queryGraph):
print m[1].id,m[0,2].id # PRINT EXON ID FOR EXON 1,
# SPLICE ID FOR SPLICE 0 -> 2
The match is returned by GraphQuery as a mapping from nodes and edges of the query graph to nodes and edges of the data graph. Edges are specified simply as tuples of the nodes you want to get the edge for (in this example 0,2). Constructing a Graph How was the spliceGraph created in the first place? Let's say we have an initial list of tuples giving connections between exon objects and splice objects, where each tuple consists of a pair of exons connected by a splice.
for exon1,exon2,splice in spliceConnections:
spliceGraph+=exon1 # add exon1 as a node in the graph
spliceGraph+=exon2 # if already a node in the graph, does nothing...
exon1.next[exon2]=splice # add an edge, with splice as the edgeinfo
The last operation makes use of the binding of exon.next to spliceGraph, and is equivalent to
spliceGraph[exon1][exon2]=splice
If we didn't want to save the edge information, we could use the simpler syntax
spliceGraph[exon1]+=exon2 # equivalent to exon1.next+=exon2
This "short" form is equivalent to saving None as the edge information.
Sequences and alignments also can be modeled as graph structures in Pygr, providing the same consistent and simple framework for queries.
>>> from pygr.sequence import *
>>> s=Sequence('attatatgccactat','bobo') #create a sequence named bobo
>>> s # interpreter will print repr(s)
bobo[0:15]
>>> t=s[-8:] #python slice gives last 8 nt of s
>>> t # interpreter will print repr(t)
bobo[7:15]
>>> str(t) #string conversion just yields the sequence as a string
'gccactat'
>>> rc= -s #get the reverse complement
>>> str(rc[:5]) #its first five letters
'atagt'
Several points:
Pygr provides a variety of "back-end" implementations of sequence objects, ranging from sequences stored in a relational database table, or a BLAST database, to sequences created by the user in Python (as above). All of these provide the same consistent interface, and in general try to be efficient. For example, Pygr sequence objects are just "placeholders" that record what sequence interval you're working with, but if the back-end is an external database, the sequence object itself does not store the sequence, and creating new sequence objects (e.g. taking slices of the object as above) will not require anything to be done on the actual sequence itself (such as copying a portion of it). Pygr only obtains sequence information when you actually ask for it (e.g. by taking the string value str(s) of a sequence object), and normally only obtains just the portion that you ask for (i.e. str(s[1000000:1000100]) only obtains 100nt of sequence, even if s is a 100 megabase sequence. By contrast str(s)[1000000:1000100] would force it to obtain the whole sequence from the database, then slice out just the 100 nt you selected).
Here's an example of working with sequences from a relational database:
>>> import MySQLdb # standard module for accessing MySQL, now get a cursor...
>>> rdb=MySQLdb.connect(db='HUMAN_SPLICE_03',read_default_file=os.environ['HOME'
]+'/.my.cnf')
>>> t=SQLTable('genomic_cluster_JUN03',rdb.cursor()) #interface to a table of
sequences
>>> from pygr.seqdb import * # pygr module for working with sequences from databases
>>> t.objclass(DNASQLSequence) #use this class as "row objects"
>>> s2=t['Hs.1162'] # get a specific sequence object by ID
>>> str(s2[1000:1050]) # this will only get 50 nt of the genomic sequence from
MySQL
'acctgggtgatgaaataaatttttacgccaaatcccgatgacacacaatt'
(Note: in this example we used MySQLdb.connect()'s ability to read database server and user authentication information directly from the standard /.my.cnf file normally used by the MySQL client).
Here's an example of working with sequences from a BLAST database:
>>> from pygr.seqdb import *
>>> db=BlastDB('sp') # open BLAST database associated with FASTA file 'sp'
>>> s=db['CYGB_HUMAN'][90:150] # get a sequence by ID, and take a slice
>>> str(s)
'TVVENLHDPDKVSSVLALVGKAHALKHKVEPVYFKILSGVILEVVAEEFASDFPPETQRA'
>>> m=db.blast(s) # get alignment to all BLAST hits in db
>>> for e in m.edges(): # print out the alignment edges
... print e.srcPath,repr(e.srcPath),'\n',e.destPath,repr(e.destPath),e.blast
_score,'\n'
...
TVVENLHDPDKVSSVLALVGKAHALKHKVEPVYFKILSGVILEVVAEEFASDFPP CYGB_HUMAN[90:145]
TLVENLRDADKLNTIFNQMGKSHALRHKVDPVYFKILAGVILEVLVEAFPQCFSP CYGB_BRARE[87:142] 72
TVVENLHDPDKVSSVLALVGKAHALKHKVEPVYFKILSGVILEVVAEEFASDFPPETQRA CYGB_HUMAN[90:150]
TVVENLHDPDKVSSVLALVGKAHALKHKVEPVYFKILSGVILEVVAEEFASDFPPETQRA CYGB_HUMAN[90:150]
120
TVVENLHDPDKVSSVLALVGKAHALKHKVEPVYFKILSGVILEVVAEEFASDFPPETQRA CYGB_HUMAN[90:150]
TVVENLHDPDKVSSVLALVGKAHALKHKVEPMYFKILSGVILEVIAEEFANDFPVETQKA CYGB_MOUSE[90:150]
112
...
This example introduces the use of a Pygr alignment object to store the mapping of s onto homologous sequences in db, obtained from BLAST. Here's what Pygr actually does:
Pygr multiple alignment objects can be treated as mappings of sequence intervals onto sequence intervals. Here is a very simple example (continuing from the previous example), showing basic operations for constructing an alignment:
>>> m2=PathMapping() # create an empty alignment object >>> m2[s]=db['MYG_CHICK'][83:143] # add an edge mapping interval s -> an interva l of MYG_CHICK >>> m2[s[:10] ] # get alignment for first 10 letters, and print representation MYG_CHICK[83:93] >>> m2[s]=db['MYG_CANFA'][45:105] # add an additional edge >>> for s2 in m2[s[:10] ]: # get aligned seqs for the first 10 letters again... ... print repr(s2) ... MYG_CHICK[83:93] MYG_CANFA[45:55]
>>> class Foo(object): pass # create a dummy class that we can add attributes to ... >>> a=Foo() # create a dummy object >>> a.x=7 # add an attribute >>> m2=PathMapping() # create an empty alignment object >>> m2+=s # add as node to the alignment graph >>> m2[s][db['MYG_CHICK'][83:143] ]=a # add an edge mapping interval s -> an int erval of MYG_CHICK >>> for e in m2[s[:10] ].edges(): # get alignment edges for the first 10 letters again... ... print repr(e.destPath),e.x # we can access edge attributes ... MYG_CHICK[83:93] 7
Since Pygr alignments follow the same interface as any Pygr graph, we can query them using the standard GraphQuery class. Let's say we have a Python script load_alignments.py that loads two alignments:
>>> from load_alignments import * # load the alignments
>>> from pygr.graphquery import * # import the graph query code
>>> queryGraph={1:{2:None},2:{3:dict(dataGraph=swiss_features)},3:{}} #
edge 2->3 must come from swiss_features
>>> l=[dict(d) for d in GraphQuery(mRNA_swiss,queryGraph)] # run the query and
save the mappings
>>> len(l) # how many annotations mapped onto our mRNA sequences?
4703
We assumed that mRNA_swiss would be passed as the default dataGraph, and specified directly that edge 2->3 should be looked up in swiss_features. We then captured all the results from the GraphQuery iterator using a Python list comprehension. Note that since the iterator returns each result in the same container (mapping object), if we want to save all the individual results we have to copy each one to a new mapping (dict) object, as illustrated in this example. Storing Alignments in a Relational Database
David Haussler's group has constructed alignments of multiple genomes. These alignments are extremely useful and interesting, but so large that it is cumbersome to work with the dataset using conventional methods. For example, for the 8-genome alignment you have to work simultaneously with the individual genome datasets for human, chimp, mouse, rat, dog, chicken, fugu and zebrafish, as well as the huge alignment itself. Pygr makes this quite easy. Here we illustrate an example of mapping an alternative 3' exon, which has two alternative splice sites
(start1 and start2) and a single terminal splice site (stop).
We use the alignment database to map each of these splice sites onto all the aligned
genomes, and to print the percent-identity and percent-aligned for each genome,
as well as the two nucleotides consituting the splice site itself.
To examine the conservation of the two exonic regions (between start1
and start2, and the adjacent region terminated by stop,
we print the same information for each genome's alignment to these two regions as well.
The code first opens the alignment database. The function (getAlt3Conservation)
obtains sequence slice objects representing the various ``sites'' to be queried.
The actual alignment database query is performed in printResults:
from pygr import cnestedlist
msa=cnestedlist.NLMSA('/usr/tmp/ucscDB/mafdb','r') # OPEN THE ALIGNMENT DB
def printResults(prefix,msa,site,altID='NULL',cluster_id='NULL',dbUnion=None):
'get alignment of each genome to site, print %identity and %aligned'
for src,dest,edge in msa[site].edges(mergeMost=True): # ALIGNMENT QUERY!
print '%s\t%s\t%s\t%s\t%2.1f\t%2.1f\t%s\t%s' \
%(altID,cluster_id,prefix,dbUnion.dicts[dest.pathForward.db],
100.*edge.pIdentity(),100.*edge.pAligned(),src[:2],dest[:2])
def getAlt3Conservation(msa,gene,start1,start2,stop,**kwargs):
'gene must be a slice of a sequence in our genome alignment msa'
ss1=gene[start1-2:start1] # USE SPLICE SITE COORDINATES
ss2=gene[start2-2:start2]
ss3=gene[stop:stop+2]
e1=ss1+ss2 # GET INTERVAL BETWEEN PAIR OF SPLICE SITES
e2=gene[max(start1,start2):stop] # GET INTERVAL BETWEEN e1 AND stop
zone=e1+ss3 # USE zone AS COVERING INTERVAL TO BUNDLE fastacmd REQUESTS
cache=msa[zone].keys(mergeMost=True) # PYGR BUNDLES REQUESTS TO MINIMIZE TRAFFIC
for prefix,site in [('ss1',ss1),('ss2',ss2),('ss3',ss3),('e1',e1),('e2',e2)]:
printResults(prefix,msa,site,dbUnion=msa.seqDict,**kwargs)
# RUN A QUERY LIKE THIS...
# getAlt3Conservation(msa,some_gene,some_start,other_start,stop)
A few notes:
printResults().
msa is the database; site is the interval query; and the
edges methods iterates over the results, returning a tuple for
each, consisting of a source sequence interval (i.e. an interval of
site), a destination sequence interval (i.e. an interval in
an aligned genome), and an edge object describing that alignment.
We are taking advantage of Pygr's group-by operator mergeMost,
which will cause multiple intervals in a given sequence to be merged
into a single interval that constitutes their ``union''. Thus,
for each aligned genome, the edges iterator will return a single
aligned interval. The alignment edge object provides some useful
conveniences, such as calculating the percent-identity between src
and dest automatically for you. pIdentity() computes
the fraction of identical residues; pAligned computes the
fraction of aligned residues (allowing you to see if there are
big gaps or insertions in the alignment of this interval). If we
had wanted to inspect the detailed alignment letter by letter, we
would just iterate over the letters attribute instead of
the edges method. (See the NLMSASlice documentation for
further information).
src[:2] and dest[:2] print the first two nucleotides
of the site in gene and in the aligned genome.
fastacmd),
not the cnestedlist database. Basically, each genome is being queried
as a separate BLAST formatted database, represented in Pygr by the
BlastDB class. Pygr makes this complex set of multi-database
operations more or less transparent to the user.
For further information, see the BlastDB documentation.
gene sequence to obtain the specific sites that we want to query.
Note: gene must itself be a slice of a sequence in our alignment,
or the alignment query msa[site] will raise an IndexError informing
the user that the sequence site is not in the alignment.
e1=ss1+ss2.
This obtains a single interval that contains both of the input intervals.
(Actually, because of Pygr's caching / optimizations, considerably more is going on than indicated in this simplified sketch. But you get the idea: Pygr makes it relatively effortless to work with a variety of disparate (and large) resources in an integrated way.)
Here is some example output:
1 Mm.99996 ss1 hg17 50.0 100.0 AG GG 1 Mm.99996 ss1 canFam1 50.0 100.0 AG GG 1 Mm.99996 ss1 panTro1 50.0 100.0 AG GG 1 Mm.99996 ss1 rn3 100.0 100.0 AG AG 1 Mm.99996 ss2 hg17 100.0 100.0 AG AG 1 Mm.99996 ss2 canFam1 100.0 100.0 AG AG 1 Mm.99996 ss2 panTro1 100.0 100.0 AG AG 1 Mm.99996 ss2 rn3 100.0 100.0 AG AG 1 Mm.99996 ss3 hg17 100.0 100.0 GT GT 1 Mm.99996 ss3 canFam1 100.0 100.0 GT GT 1 Mm.99996 ss3 panTro1 100.0 100.0 GT GT 1 Mm.99996 ss3 rn3 100.0 100.0 GT GT 1 Mm.99996 e1 hg17 78.9 100.0 AG GG 1 Mm.99996 e1 canFam1 84.2 100.0 AG GG 1 Mm.99996 e1 panTro1 77.6 100.0 AG GG 1 Mm.99996 e1 rn3 97.4 98.7 AG AG 1 Mm.99996 e2 hg17 91.6 99.1 CC CC 1 Mm.99996 e2 canFam1 88.8 99.1 CC CC 1 Mm.99996 e2 panTro1 91.6 99.1 CC CC 1 Mm.99996 e2 rn3 97.2 100.0 CC CC
maf/:
import os
from pygr import cnestedlist,seqdb
genomes={'hg17':'hg17','mm5':'mm5', 'rn3':'rn3', 'canFam1':'cf1',
'danRer1':'dr1', 'fr1':'fr1','galGal2':'gg2', 'panTro1':'pt1'}
for k,v in genomes.items(): # PREFIX DICTIONARY FOR UNION OF GENOMES
genomes[k]=seqdb.BlastDB(v) # USE v AS FILENAME FOR FASTA FILE
genomeUnion=seqdb.PrefixUnionDict(genomes) # CREATE UNION OF THESE DBs
# CREATE NLMSA DATABASE ucsc8 ON DISK, FROM MAF FILES IN maf/
msa=cnestedlist.NLMSA('ucsc8','w',genomeUnion,os.listdir('maf'))
The only real work here is due to the fact that UCSC's MAF files use a prefix.suffix notation for identifying specific sequences, where prefix gives the name of the genome, and suffix gives the identifier of the sequence in that genome database. Here we use Pygr's PrefixUnionDict class to wrap the set of genome databases in a dict-like interface that accepts string keys of the form prefix.suffix and returns the right sequence object from the right genome database. As an added twist, the genome names in the MAF files match the filenames of the associated genome databases in most cases, but not all, so we have to create an initial dictionary giving the correct mapping. Actually building the NLMSA requires just one line, but actually a number of steps are happening behind the scenes:
fastacmd's horrible
memory requirements and slow speed, so we implemented fast sequence
indexing.
from pygr.seqdb import *
db=BlastDB('sp') # open BLAST database associated with FASTA file 'sp'
m=db.blast(db['CYGB_HUMAN']) # get alignment to all BLAST hits in db
import MySQLdb # import the MySQL adapter
import os
# open a connection to the test database
cursor=MySQLdb.Connection(db='test',read_default_file=os.environ['HOME']+'/.my.cnf').cursor()
# save the alignment, creating table test.alignment if not exists
createTableFromRepr(m.repr_dict(),'test.alignment',cursor,
{'src_id':'varchar(12)','dest_id':'varchar(12)'})
The only line here that needs comment is Pygr's createTableFromRepr() function. It takes four arguments:
1. an iterator, each of whose return values is a dictionary mapping column names to value representations that can be stored in a relational table. Here we have passed m.repr_dict() as the iterator. It is a generator function that simply yields edge.repr_dict() on each of the edges stored in m. The edge.repr_dict() method in turn simply returns a dictionary mapping column names (like 'src_id') onto values that are storable in a relational table. 2. a table name: in this case, 'test.alignment' 3. a cursor for connecting to the database 4. an optional dictionary for customizing exactly how specific columns should be stored in the database. In this case, we specified varchar(12) for both src_id and dest_id (the identifiers of the source and destination (aligned) sequences)
This function constructs a schema and creates the table in the database if it does not already exist. It then saves all of the output of m.repr_dict() as rows in the database table. To see exactly what it generates, we can look at its output interactively:
>>> for e in m.repr_dict(): break
...
>>> e
{'src_id': 'CYGB_HUMAN', 'blast_score': 344, 'dest_id': 'CYGB_HUMAN', 'src_end':
190, 'src_ori': 1, 'percent_id': 90, 'dest_ori': 1, 'src_start': 0, 'dest_start' : 0, e_value': 94.301000000000002, 'dest_end': 190}
Here is the schema automatically created by createTableFromRepr():
mysql> desc alignment;
+-------------+-------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------------+-------------+------+-----+---------+-------+
| src_id | varchar(12) | | MUL | | |
| blast_score | int(11) | | | 0 | |
| dest_id | varchar(12) | | MUL | | |
| src_end | int(11) | | | 0 | |
| src_ori | int(11) | | | 0 | |
| percent_id | int(11) | | MUL | 0 | |
| dest_ori | int(11) | | | 0 | |
| src_start | int(11) | | | 0 | |
| dest_start | int(11) | | | 0 | |
| e_value | float | | | 0 | |
| dest_end | int(11) | | | 0 | |
+-------------+-------------+------+-----+---------+-------+
11 rows in set (0.01 sec)
Thus the basic format of stored alignments is
We can now make use of the stored alignment in Pygr just as if it were a normal (in-memory) alignment object. The only difference is we use the StoredPathMapping class instead of the standard PathMapping class. Here is an example script that connects to the stored alignment, and prints out the alignment of one sequence:
import MySQLdb
from pygr.seqdb import *
cursor=MySQLdb.Connection(db='test',read_default_file=os.environ['HOME']+'/.my.cnf').cursor()
t=SQLTableMultiNoCache('test.alignment',cursor)
t._distinct_key='src_id' # use this column as the ID field to search
sp=BlastDB('sp') # OPEN SWISSPROT BLAST DB
m=StoredPathMapping(t,sp,sp) # tell it to search sp for both source and target
sequence IDs
for e in m[sp['CYGB_HUMAN'] ].edges(): # SHOW ALL ALIGNMENTS FOR THIS SEQUENCE
print repr(e.srcPath),repr(e.destPath),e.blast_score
StoredPathMapping expects three arguments:
1. an SQLTable object which provides the interface to the database table containing the alignment. t is just an interface to the MySQL table. Since each sequence can potentially have many rows (many alignment intervals against one or more destination sequences), we used the SQLTableMultiNoCache subclass: each key value (source sequence ID) can have multiple rows; and we chose not to have it cache rows that it retrieves (there would be no benefit from doing so, since StoredPathMapping itself performs result caching). Also note the t._distinct_key assignment: this informs the table object that when we try to access a particular key (e.g. t['CYGB_HUMAN']), the column to search for this key is 'src_id'. 2. a sequence dictionary object from which to obtain the source sequences. This dictionary must map a source sequence identifier to an actual sequence object representing that sequence. In this case, we just used our sp BLAST database as the source sequence dictionary. 3. a sequence dictionary object from which to obtain the destination sequences. This dictionary must map a destination sequence identifier to an actual sequence object representing that sequence. We just used the sp database again.
Thus we informed StoredPathMapping to treat t as a mapping of sp onto itself.
The StoredPathMapping object (m) acts as an interface to the relational database, which we can treat just like a regular PathMapping alignment object. The difference is that it looks up the alignment from the back-end database table when you request a mapping of a specific sequence. In other words, only when we invoked m[sp['CYGB_HUMAN'] ] did it obtain the alignment information for that source sequence from the database; once retrieved, this information is cached locally so that subsequent queries of the same source sequence will just use the local data without having to re-query the database. Example: Mapping an entire gene set onto a new genome version To illustrate how Pygr can perform a big task with a little code, here is an example that maps a set of gene sequences onto a new version of the genome, using megablast to do the mapping, and a relational database to store the results. Moreover, since mapping 80,000 gene clusters takes a fair amount of time, the calculation is parallelized to run over a large number of compute nodes simultaneously:
from pygr.apps.leelabdb import * # this accesses our databases
from pygr import coordinator # this provides parallelization support
def map_clusters(server,genome_rsrc='hg17',dbname='HUMAN_SPLICE_03',
result_table='GENOME_ALIGNMENT.hg17_cluster_JUN03_all',
rmOpts='',**kwargs):
"CLIENT FUNCTION: map clusters one by one"
# CONSTRUCT RESOURCE FOR US IF NEEDED
genome=BlastDB(ifile=server.open_resource(genome_rsrc,'r'))
# LOAD DB SCHEMA
(clusters,exons,splices,genomic_seq,spliceGraph,alt5Graph,alt3Graph,mrna,
protein, clusterExons,clusterSplices)=getSpliceGraphFromDB(spliceCalcs[dbname])
# NOW MAP CLUSTER SEQUENCES ONE BY ONE TO OUR NEW genome
for cluster_id in server:
g=genomic_seq[cluster_id] # GET THE OLD GENOMIC SEQUENCE FOR THIS CLUSTER
m=genome.megablast(g,maxseq=1,minIdentity=98,rmOpts=rmOpts) # MASK, BLAST, READ INTO m
# SAVE ALIGNMENT m TO DATABASE TABLE result_table USING cursor
createTableFromRepr(m.repr_dict(),result_table,clusters.cursor,
{'src_id':'varchar(12)','dest_id':'varchar(12)'})
yield cluster_id # WE MUST FUNCTION AS GENERATOR TO KEEP ERROR TRAPPING
# HAPPY
def serve_clusters(dbname='HUMAN_SPLICE_03',
source_table='HUMAN_SPLICE_03.genomic_cluster_JUN03',**kwargs):
"SERVER FUNCTION: serve up cluster_id one by one to as many clients as you want"
cursor=getUserCursor(dbname)
t=SQLTable(source_table,cursor)
for id in t:
yield id # HAND OUT ONE CLUSTER ID TO A CLIENT
if __name__=='__main__': # AUTOMATICALLY RUN EITHER THE CLIENT OR SERVER FUNCTION
coordinator.start_client_or_server(map_clusters,serve_clusters,['hg17'],__file__)
First, let's just focus on the map_clusters() function, which illustrates how the mapping of each gene is generated and saved. Let's examine the data piece by piece:
Additional examples of how to use pygr can be found in the tests/ directory within the pygr distribution package.
See About this document... for information on suggesting changes.