Introduction
 

1. Overview of ASAP II Database
2. Detecting Orthologous Exons and Introns using MULTIZ multiple alignments
3. Splice Junctions from MULTIZ multiple alignments and Finding Lineage-Specific Genes
4. Human EST Library Classification and Tissue/Cancer/Normal Specific Alternative Splicing

5. Integration with pygr graph query module and NLMSA comparative genomics tool

 

1. Overview of ASAP II Database

 

We have updated our ASAP database to ASAP II, with a new interface and integrated comparative genomics features based on UCSC BLASTZ multiple alignments. ASAP II supports 9 vertebrate species, 4 insects, and nematodes, and provides extensive alternative splicing analysis together with the resulting splice variants. For human, newly added EST libraries were classified and merged into the previous tissue and cancer classification, and the lists of tissue-specific and cancer (vs. normal) specific alternatively spliced genes were recalculated and updated. We have created novel databases of orthologous exons and introns, and of their splice variants, based on multiple alignments among several species. These orthologous exon and intron databases can give more comprehensive homologous gene information than methods based on protein similarity. Furthermore, splice junction and exon identity among species can be a valuable resource for identifying species-specific genes. The ASAP II database can be easily integrated with pygr (unpublished; the Python Graph Database Framework for Bioinformatics) and its powerful features such as graph query and multi-genome alignment query. ASAP II is available at http://www.bioinformatics.ucla.edu/ASAP2.

Web Interface

ASAP II can be searched by several criteria, such as gene symbol, gene name, and ID (UniGene, GenBank, etc.). The web interface provides 7 kinds of views: (I) user query, UniGene annotation, orthologous genes and genome browsers; (II) genome alignment; (III) exons & orthologous exons; (IV) introns & orthologous introns; (V) alternative splicing; (VI) isoform and protein sequences; (VII) tissue & cancer vs. normal specificity. ASAP II shows genome alignments of isoforms, exons, and introns in a UCSC-like genome browser (see A in the following figure). Users can navigate among all the views by clicking the links of interest. Alternative and constitutive exons are highlighted in red and blue, respectively. All alternative splicing relationships with their supporting evidence (see B in the following figure), the types of alternative splicing patterns (see C and D in the following figure), and the inclusion rate for skipped exons (see E in the following figure) are listed in separate tables. Users can also search human data for tissue-specific and cancer-specific splice forms at the bottom of the gene summary page (see F in the following figure). We report p-values for tissue specificity as log-odds (LOD) scores, and highlight results with LOD >= 3 that are supported by at least 3 EST sequences. See the User's Guide for more comprehensive information.

Major output page for ASAP II. (A) Part of the ASAP II genome browser showing isoform, exon and intron alignments. Alternative and constitutive exons & introns are shown in red and blue, respectively. (B) List of all alternative splicing relationships. (C) Alternative donor and acceptor events. (D) Exon skipping events. (E) Exon inclusion rate for skipped exons. (F) Human EST library classification, with tissue and normal vs. cancer specificity. High-confidence tissue-specific and cancer-specific alternative splicing relationships (LOD >= 3, supported by at least 3 ESTs) are highlighted in red.

Statistics for ASAP II

| Organism | Genome Assembly* | UniGene Clusters (Total) | UniGene Clusters (Mapped) | Detected Splices | Detected Clusters | Isoforms | Alternative Splicing Relationships | Alternative Splicing Clusters | Alternatively spliced ** |
|---|---|---|---|---|---|---|---|---|---|
| Human | hg17 | 66488 | 47477 | 193023 | 22220 | 260198 | 89078 | 11717 | 53 % |
| Mouse | mm7 | 43104 | 32522 | 141284 | 16404 | 135465 | 33057 | 8711 | 53 % |
| Rat | rn3 | 41687 | 34003 | 82941 | 14195 | 53212 | 7210 | 3378 | 24 % |
| Western clawed frog | xenTro1 | 33132 | 24617 | 65633 | 10880 | 34293 | 4836 | 2349 | 22 % |
| Chicken | galGal2 | 30470 | 19708 | 51471 | 9671 | 26557 | 4244 | 2154 | 22 % |
| Cow | bosTau2 | 39432 | 28709 | 60813 | 11448 | 32401 | 6692 | 3008 | 26 % |
| Dog | canFam2 | 22930 | 16645 | 29290 | 6834 | 11424 | 1633 | 951 | 14 % |
| C. elegans | ce2 | 20621 | 15546 | 54395 | 12580 | 23393 | 1309 | 763 | 6 % |
| Ciona | ci2 | 15587 | 1373 | 5611 | 972 | 2161 | 150 | 98 | 10 % |
| Zebrafish | danRer3 | 32400 | 22297 | 67598 | 12136 | 27547 | 2611 | 1577 | 13 % |
| Fruit fly | dm2 | 16635 | 14568 | 37469 | 9683 | 26854 | 4850 | 1841 | 19 % |
| Fugu | fr1 | 2355 | 1980 | 3014 | 798 | 866 | 33 | 24 | 3 % |
| Yellow fever mosquito | AaegL 1 | 15182 | 10624 | 3594 | 1787 | 2529 | 120 | 87 | 5 % |
| Honeybee | apiMel2 | 5900 | 5027 | 6270 | 2548 | 2990 | 90 | 57 | 2 % |
| African malaria mosquito | anoGam1 | 15609 | 14173 | 17278 | 8013 | 15115 | 1070 | 605 | 8 % |

* Genome assembly sequences were downloaded from the UCSC genome browser, except for yellow fever mosquito, which was downloaded from the Ensembl genome browser.
** Alternatively spliced (%) = No. of alternatively spliced clusters / No. of spliced clusters

Alternative splicing was detected in 53 % of human and mouse multi-exon genes. Restricting the analysis to genes with at least one mRNA, 75 % of human and 60 % of mouse multi-exon genes were detected to contain alternative splicing (see the ASAP II website for details). Because of limited mRNA and EST coverage (Fugu and honeybee) and incomplete genome assemblies (Fugu, Ciona, and yellow fever mosquito), the number of mapped clusters (Ciona: 9 %, 1373 out of 15587) or of alternatively spliced clusters (24 for Fugu, 98 for Ciona, 57 for honeybee, and 87 for yellow fever mosquito) can be significantly lower than expected, so these data cannot be considered comprehensive. Between 19 % and 26 % of fruit fly, western clawed frog, chicken, rat, and cow multi-exon genes were detected to contain alternative splicing.
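As a quick check of the "Alternatively spliced" column, the ratio defined in the footnote can be recomputed from the counts in the table above. This is only an illustrative sketch; the dictionary layout and the function name are our own, and only five organisms are included.

```python
# (spliced clusters, alternatively spliced clusters), copied from the
# statistics table above. Dictionary layout and names are illustrative.
counts = {
    "Human": (22220, 11717),
    "Mouse": (16404, 8711),
    "Rat": (14195, 3378),
    "Dog": (6834, 951),
    "Fugu": (798, 24),
}

def alt_spliced_percent(spliced_clusters, alt_clusters):
    """Percentage of spliced clusters that show alternative splicing."""
    return 100.0 * alt_clusters / spliced_clusters

# Round to whole percentages, as the table does.
percents = {name: round(alt_spliced_percent(*pair))
            for name, pair in counts.items()}

for name, pct in percents.items():
    print("%-6s %d %%" % (name, pct))
```

The rounded values agree with the percentages listed in the table for these organisms.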

Click here to download full statistics for ASAP II

UniGene release date and Statistics for Alternative Splicing Events
 

| Organism | UniGene Build | Release Date | Exon Skipping | Alternative Donor | Alternative Acceptor | Mutually Exclusive | Total Alternatively Spliced Cluster |
|---|---|---|---|---|---|---|---|
| Human | #188 | 04 Dec 2005 | 7868 | 3307 | 2408 | 1365 | 11717 |
| Mouse | #151 | 02 Oct 2005 | 4147 | 1870 | 1277 | 321 | 8711 |
| Rat | #148 | 04 Oct 2005 | 1238 | 448 | 293 | 55 | 3378 |
| Western clawed frog | #28 | 13 Oct 2005 | 878 | 302 | 229 | 31 | 2349 |
| Chicken | #29 | 12 Dec 2005 | 753 | 254 | 199 | 30 | 2154 |
| Cow | #74 | 14 Dec 2005 | 167 | 483 | 279 | 45 | 3008 |
| Dog | #13 | 19 Aug 2005 | 328 | 75 | 38 | 16 | 951 |
| C. elegans | #25 | 25 Sep 2005 | 192 | 95 | 90 | 10 | 763 |
| Ciona | #18 | 02 Nov 2005 | 24 | 15 | 14 | 1 | 98 |
| Zebrafish | #89 | 05 Dec 2005 | 448 | 223 | 138 | 8 | 1577 |
| Fruit fly | #40 | 24 Nov 2005 | 409 | 263 | 109 | 35 | 1841 |
| Fugu | #3 | 30 Dec 2004 | 4 | 1 | 2 | 0 | 24 |
| Yellow fever mosquito | #1 | 04 Jan 2006 | 12 | 11 | 7 | 0 | 87 |
| Honeybee | #5 | 31 Jul 2004 | 11 | 10 | 6 | 1 | 57 |
| African malaria mosquito | #31 | 17 Sep 2005 | 68 | 50 | 36 | 11 | 605 |

All UniGene data were downloaded in January 2006. Because UniGene builds are updated at different intervals, the release dates may be earlier than January 2006. The current release of ASAP II is version JAN06. We will update the ASAP II database when the total number of available sequences increases significantly. The right side of the table shows statistics for alternative splicing events. Exon skipping is the most common type of event.

 

2. Detecting Orthologous Exons and Introns using MULTIZ multiple alignments

 

Comparative genomics is a major focus of the ASAP II database, which displays results from its new database of orthologous exons and introns. MULTIZ multiple alignments can be downloaded from the UCSC genome browser for human (hg17), mouse (mm7), chicken (galGal2), fruit fly (dm2), zebrafish (danRer3), and western clawed frog (xenTro1). Orthologous exons and introns were defined as those sharing at least one splice site in the multigenome alignments. This strategy can increase the chance of finding orthologous exons, because the exons can lie within well-conserved blocks of the multigenome alignments. Conventional methods based on protein similarity can identify orthologous exons only if protein sequences are available. Moreover, the multigenome alignment based method lets us interpret how alternatively spliced exons and introns evolved across distant species.

The figure above shows a segment of a multigenome alignment. The human exon is well conserved in mouse, rat, rabbit, dog, elephant and opossum, and the splice site consensus is canonical in all of them: AC/CT (GT...AG when reverse complemented). By comparing all splice sites within the ASAP II database, we constructed the orthologous exons and introns database.
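The "at least one shared splice site" rule can be pictured with a small sketch. It assumes that exon boundaries have already been projected onto a common alignment coordinate system. That projection step is not shown, and the record layout, function names and coordinates are hypothetical, not taken from the ASAP II code.

```python
from collections import namedtuple

# Exon boundaries already projected onto a common alignment coordinate
# system (hypothetical values; the projection itself is not shown here).
Exon = namedtuple("Exon", "species exon_id aln_start aln_end")

def shares_splice_site(a, b):
    """True if two exons share at least one boundary (splice site)."""
    return a.aln_start == b.aln_start or a.aln_end == b.aln_end

def orthologous_pairs(ref_exons, other_exons):
    """Pair each reference exon with exons that share a splice site."""
    return [(r.exon_id, o.exon_id)
            for r in ref_exons
            for o in other_exons
            if shares_splice_site(r, o)]

human = [Exon("hg17", "h1", 100, 250), Exon("hg17", "h2", 400, 520)]
mouse = [Exon("mm7", "m1", 100, 250),   # both splice sites shared with h1
         Exon("mm7", "m2", 380, 520),   # only the end is shared with h2
         Exon("mm7", "m3", 700, 800)]   # no shared splice site

pairs = orthologous_pairs(human, mouse)
print(pairs)
```

Requiring only one shared boundary lets exons with an alternative donor or acceptor site still be paired, which a stricter both-ends match would miss.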

Statistics for Orthologous Exons and Introns

| Multiple Alignments* | Exons with Orthologous Exons | Total Internal Exons | Introns with Orthologous Introns | Total Canonical Introns |
|---|---|---|---|---|
| hg17 referenced 17 species Multigenome Alignments | 85673 | 129981 | 100447 | 193024 |
| mm7 referenced 17 species Multigenome Alignments | 81296 | 105260 | 97371 | 141285 |
| galGal2 referenced 7 species Multigenome Alignments | 20471 | 36865 | 24973 | 51472 |
| danRer3 referenced 5 species Multigenome Alignments | 18977 | 50792 | 22367 | 67599 |
| xenTro1 referenced 5 species Multigenome Alignments | 23428 | 49679 | 26893 | 65634 |

* Only orthologous exons and introns that have exact matches at both canonical splice sites (U1/U2 and U11/U12) are counted.
* The list of species used in the multigenome alignments is available at the UCSC genome browser.

66 % (85673 out of 129981) of human and 77 % (81296 out of 105260) of mouse internal exons have at least one orthologous exon. 52 % (100447 out of 193024) of human and 69 % (97371 out of 141285) of mouse canonical introns have at least one orthologous intron. Most orthologous exons and introns came from human and mouse orthologs, because these genomes have more mRNA and EST sequences than the others. 56 %, 37 %, and 47 % of chicken (galGal2), zebrafish (danRer3), and western clawed frog (xenTro1) internal exons have at least one orthologous exon; the corresponding figures for orthologous introns are 49 %, 33 %, and 41 %. Because the set of genome assemblies used for the multigenome alignments differs from the set used in the ASAP II calculation for chicken, zebrafish and western clawed frog (see Table 1 for details), the numbers of orthologous exons and introns can be reduced.

 

3. Splice Junctions from MULTIZ multiple alignments and Finding Lineage-Specific Genes

 

Another major feature of ASAP II is a multigenome splice site database: a multiple alignment of the splice sites of introns, as shown in the figure below. In the example shown, one can easily see that this pair of splice sites appears to have evolved in an early mammalian ancestor, but not before. Researchers could, for example, identify "recently evolved splice sites" by selecting introns whose canonical splice site sequences (GT/AG) are conserved only within closely related species and not in distant species.
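The selection of "recently evolved splice sites" described above might be sketched as follows. The grouping of assemblies into "close" and "distant" sets, the intron names and the dinucleotide data are all assumptions made for illustration; they are not part of ASAP II.

```python
CANONICAL = ("GT", "AG")                 # donor and acceptor dinucleotides
CLOSE = {"panTro1", "rheMac2"}           # assumed closely related group
DISTANT = {"mm7", "rn3", "monDom2"}      # assumed distant group

def is_recent(sites_by_species, close=CLOSE, distant=DISTANT):
    """True if the canonical pair is conserved in every close species
    present in the data and in none of the distant species present."""
    close_hits = [sites_by_species[s] == CANONICAL
                  for s in close if s in sites_by_species]
    distant_hits = [sites_by_species[s] == CANONICAL
                    for s in distant if s in sites_by_species]
    return bool(close_hits) and all(close_hits) and not any(distant_hits)

# Hypothetical introns: assembly -> (donor, acceptor) seen in the alignment.
introns = {
    "intron_A": {"panTro1": ("GT", "AG"), "rheMac2": ("GT", "AG"),
                 "mm7": ("CT", "AG"), "rn3": ("AT", "AG"),
                 "monDom2": ("GA", "AG")},
    "intron_B": {"panTro1": ("GT", "AG"), "rheMac2": ("GT", "AG"),
                 "mm7": ("GT", "AG"), "rn3": ("GT", "AG"),
                 "monDom2": ("GT", "AG")},
}

recent = sorted(name for name, sites in introns.items() if is_recent(sites))
print(recent)
```

Where the line between "close" and "distant" is drawn depends on the question being asked, so the two sets are kept as parameters.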

The multiple alignment of splice sites is extracted using Pygr, and its phylogenetic tree is generated on the fly by the UCSC Phylogenetic Tree GifMaker.

 

4. Human EST Library Classification and Tissue/Cancer/Normal Specific Alternative Splicing

 

To update the lists of tissue-specific and cancer vs. normal specific genes for human, we downloaded EST library information from UniLib (ftp://ftp.ncbi.nih.gov/repository/UniLib/). 2895 new human EST libraries were classified and added to the existing 47 tissue categories and normal/tumor types. In total, 8828 human EST libraries were classified into 47 tissues and normal/tumor types. We used the same method as Xu et al. to calculate LOD values for tissue and normal vs. cancer specificity.

We found 1709 high-confidence (LOD >= 3) tissue-specific alternative splicing relationships from 960 genes (click here to download all high-confidence tissue-specific relationships), and 273 high-confidence (LOD >= 3) cancer-specific relationships from 198 genes (click here to download all high-confidence cancer-specific relationships). The largest categories of tissue-specific splice forms were identified from brain/nerve, testis, skin, muscle, and lymph (click here to download statistics for tissue-specific genes). Users can download all EST library classification and LOD calculation results from the ASAP II download page and mine them for their own experimental candidates.

After uploading the MySQL tables from the download page, all high-confidence alternative splicing relationships can be retrieved with the following SQL statements.

mysql> select * from LOD_Tissue_hg17 where LOD >= 3 and n_s1_tissue >= 3 order by cluster_id;
mysql> select * from LOD_Cancer_hg17 where LOD >= 3 and n_s1_tissue >= 3 and tissueId = 1 order by cluster_id;
mysql> select count(t1.tissueId), t1.tissueId, t2.tissueTerm from LOD_Tissue_hg17 as t1, EST_Library_Classification_Tissue_Term_hg17 as t2 where t1.LOD >= 3 and t1.n_s1_tissue >= 3 and t1.tissueId = t2.tissueId group by t1.tissueId, t2.tissueTerm;
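The same thresholds can be tried out from Python without a MySQL server. The sketch below uses an in-memory SQLite table as a stand-in for LOD_Tissue_hg17. Only the four columns named in the queries above are modelled (the real table presumably has more), and the rows and cluster IDs are made up for illustration.

```python
import sqlite3

# In-memory stand-in for LOD_Tissue_hg17; rows are invented examples.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE LOD_Tissue_hg17 "
    "(cluster_id TEXT, tissueId INTEGER, LOD REAL, n_s1_tissue INTEGER)"
)
rows = [
    ("Hs.1", 1, 4.2, 5),   # passes both thresholds
    ("Hs.2", 1, 3.5, 2),   # too few supporting ESTs
    ("Hs.3", 2, 2.1, 9),   # LOD too low
    ("Hs.4", 2, 3.0, 3),   # exactly at both thresholds
]
conn.executemany("INSERT INTO LOD_Tissue_hg17 VALUES (?, ?, ?, ?)", rows)

MIN_LOD, MIN_EST = 3, 3
hits = conn.execute(
    "SELECT cluster_id, tissueId, LOD FROM LOD_Tissue_hg17 "
    "WHERE LOD >= ? AND n_s1_tissue >= ? ORDER BY cluster_id",
    (MIN_LOD, MIN_EST),
).fetchall()
print(hits)
```

Passing the thresholds as query parameters makes it easy to explore stricter or looser cut-offs than the LOD >= 3 and 3-EST defaults used on the web pages.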

 

5. Integration with pygr graph query module and NLMSA comparative genomics tool

 

Pygr (the Python Graph Database Framework for Bioinformatics, http://www.bioinformatics.ucla.edu/pygr) has powerful features such as graph query and multigenome alignment query. The ASAP II database can be easily integrated with Pygr. The orthologous exons and introns database and the multigenome splice site database were constructed using Pygr.

1. How to upload ASAP II into your own database

After creating your own database, e.g. SPLICE_JAN06, you can upload the entire ASAP II database to your database server.

mysql> create database SPLICE_JAN06;

Download the MySQL files from the download page and uncompress them using the "gzip -d" command. You can then upload them using the "mysql" command.

$ mysql SPLICE_JAN06 < alt_splice_hg17.sql

You can check the uploaded database using SQL.

mysql> desc alt_splice_hg17;
+--------------+---------------------------+------+-----+---------+-------+
| Field        | Type                      | Null | Key | Default | Extra |
+--------------+---------------------------+------+-----+---------+-------+
| splice_id_1  | int(11)                   | NO   | PRI | 0       |       |
| n_est_obs_1  | int(11)                   | YES  |     | NULL    |       |
| n_mrna_obs_1 | int(11)                   | YES  |     | NULL    |       |
| gen_start_1  | int(11)                   | YES  |     | NULL    |       |
| gen_end_1    | int(11)                   | YES  |     | NULL    |       |
| splice_id_2  | int(11)                   | NO   | PRI | 0       |       |
| n_est_obs_2  | int(11)                   | YES  |     | NULL    |       |
| n_mrna_obs_2 | int(11)                   | YES  |     | NULL    |       |
| gen_start_2  | int(11)                   | YES  |     | NULL    |       |
| gen_end_2    | int(11)                   | YES  |     | NULL    |       |
| cluster_id   | varchar(12)               | NO   | MUL |         |       |
| is_novel     | enum('no','yes')          | YES  |     | NULL    |       |
| evidence     | enum('single','multiple') | YES  |     | NULL    |       |
+--------------+---------------------------+------+-----+---------+-------+
13 rows in set (0.00 sec)

To analyze the ASAP II database using Pygr, you have to upload all tables listed under ALTSPLICE and ISOFORM on the download page. The suffix of each table name is the genome assembly name, e.g. *_hg17.
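When many tables have to be loaded, the uncompress-and-upload steps above can be generated by a small helper. This sketch only builds the command strings and does not run them. The ".sql.gz" file naming is an assumption about how the downloads are packaged, and the function name is our own.

```python
def upload_commands(database, dump_files):
    """Return the shell commands that decompress and load each dump, in order.

    Assumes each download is named <table>.sql.gz (not confirmed by the
    download page); nothing is executed here.
    """
    commands = []
    for gz_name in dump_files:
        if not gz_name.endswith(".sql.gz"):
            raise ValueError("expected a .sql.gz dump: %r" % gz_name)
        sql_name = gz_name[:-len(".gz")]          # strip the .gz suffix
        commands.append("gzip -d %s" % gz_name)
        commands.append("mysql %s < %s" % (database, sql_name))
    return commands

cmds = upload_commands("SPLICE_JAN06", ["alt_splice_hg17.sql.gz"])
for cmd in cmds:
    print(cmd)
```

Printing the commands first gives a chance to review them before pasting them into a shell.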

2. How to install Pygr and Tutorials

Prerequisites: Python 2.2 or higher (Python 2.3 or higher recommended), Pyrex, a MySQL client, and the Python MySQLdb module

You can download Pygr at http://sourceforge.net/projects/pygr/. Pygr can be installed using the standard Python installation method. (The distribution package already contains the C files generated from the Pyrex *.pyx files, so you do not need Pyrex unless you want to change the *.pyx files.) A connection to the MySQL database is essential, so you need both a MySQL client and the MySQLdb Python module to work with the ASAP II database.

$ python setup.py install

Check whether Pygr is installed correctly.

$ python
Python 2.4.2 (#2, Sep 30 2005, 14:23:11)
[GCC 3.2.2 20030222 (Red Hat Linux 3.2.2-5)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import pygr
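The interactive check above can also be wrapped in a short script that reports on both required modules. This is a minimal sketch; the helper name is our own, and it only tests whether each import succeeds, not whether the installed versions are compatible.

```python
def can_import(module_name):
    """Return True if the named module imports without an ImportError."""
    try:
        __import__(module_name)
    except ImportError:
        return False
    return True

# pygr and MySQLdb are the two modules named in the prerequisites above.
for name in ("pygr", "MySQLdb"):
    print("%s: %s" % (name, "ok" if can_import(name) else "missing"))
```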

Pygr was presented in a software demo at ISMB2005 and ISMB2006. For Pygr 0.5, see this presentation for details.

3. Graph Query Examples

After you uncompress Pygr, you will find a number of test modules in the tests directory. The test modules are based on the ASAP or ASAP II database (the ASAP II versions are named *_jan06.py). For the Pygr graph query test, test_jan06.py has been tested successfully against the ASAP II database. You may need to change the database name and table names in lldb_jan06.py. For human, loading all alternative splicing relationships in the ASAP II database takes about 40 minutes and about 1 GB of memory. The results of test_jan06.py are in test_jan06.log; compare this log file with your own results.
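To give a feel for what a graph query over splicing data looks like, the sketch below finds exon skipping patterns in a tiny splice graph built from plain Python dictionaries. It does not use the Pygr API or the test modules; the exon and splice names are hypothetical.

```python
# Splice graph as a dict of dicts: splices[a][b] names the splice that
# joins exon a to exon b. All exon and splice names are hypothetical.
splices = {
    "e1": {"e2": "s12", "e3": "s13"},
    "e2": {"e3": "s23"},
    "e3": {"e4": "s34"},
    "e4": {},
}

def exon_skips(graph):
    """Find (e1, e2, e3) with e1->e2->e3 and e1->e3, so e2 can be skipped."""
    found = []
    for e1, targets in graph.items():
        for e2 in targets:
            for e3 in graph.get(e2, {}):
                if e3 in targets:        # a direct e1->e3 splice also exists
                    found.append((e1, e2, e3))
    return sorted(found)

skips = exon_skips(splices)
print(skips)
```

The pattern is a triangle in the graph: two consecutive splices that include the middle exon, plus one splice that bypasses it.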

4. Multigenome Alignment Query using Pygr NLMSA

The first step is to download the multigenome alignments from the UCSC genome browser. Many multigenome alignments are available, but you must download the same versions used by the ASAP II database (see the Statistics table above).

After downloading and uncompressing the files in your directory, you can generate the Pygr NLMSA using the createdb.py script. You may have to change the chromosome names in the script.

In the Pygr tests directory, you can find alt35_jan06.py. This test script extracts splice sites from multigenome alignments. Its output looks as follows.

1 Hs.99886 ss1 panTro1 100.0 100.0 GT GT
1 Hs.99886 ss1 monDom2 50.0 100.0 GT GA
1 Hs.99886 ss1 loxAfr1 100.0 100.0 GT GT
1 Hs.99886 ss1 dasNov1 100.0 100.0 GT GT
1 Hs.99886 ss1 rheMac2 100.0 100.0 GT GT
1 Hs.99886 ss1 echTel1 100.0 100.0 GT GT
1 Hs.99886 ss1 bosTau2 100.0 100.0 GT GT
1 Hs.99886 ss1 canFam2 100.0 100.0 GT GT
1 Hs.99886 ss1 rn3 50.0 100.0 GT AT
1 Hs.99886 ss1 mm7 50.0 100.0 GT CT
1 Hs.99886 ss2 panTro1 100.0 100.0 GT GT
1 Hs.99886 ss2 echTel1 100.0 100.0 GT GT
1 Hs.99886 ss2 loxAfr1 100.0 100.0 GT GT
1 Hs.99886 ss2 dasNov1 100.0 100.0 GT GT
1 Hs.99886 ss2 oryCun1 100.0 100.0 GT GT
1 Hs.99886 ss2 rheMac2 100.0 100.0 GT GT
1 Hs.99886 ss2 monDom2 100.0 100.0 GT GT
1 Hs.99886 ss2 bosTau2 100.0 100.0 GT GT
1 Hs.99886 ss2 canFam2 100.0 100.0 GT GT
1 Hs.99886 ss2 rn3 100.0 100.0 GT GT
1 Hs.99886 ss2 mm7 100.0 100.0 GT GT
1 Hs.99886 ss3 panTro1 100.0 100.0 AG AG
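Output of this form is easy to post-process. The sketch below lists, for each splice site, the assemblies whose aligned dinucleotide differs from the reference one. The column meanings are our reading of the sample output, not documented: field 2 is taken as the UniGene cluster, field 3 as the splice site label, field 4 as the assembly, and the last two fields as the reference and aligned dinucleotides. The remaining fields are ignored.

```python
from collections import defaultdict

# A few lines copied from the output above.
sample = """\
1 Hs.99886 ss1 panTro1 100.0 100.0 GT GT
1 Hs.99886 ss1 monDom2 50.0 100.0 GT GA
1 Hs.99886 ss1 rn3 50.0 100.0 GT AT
1 Hs.99886 ss1 mm7 50.0 100.0 GT CT
1 Hs.99886 ss2 mm7 100.0 100.0 GT GT
"""

def changed_sites(text):
    """Map (cluster, site) to the assemblies whose last two columns differ.

    The column interpretation is an assumption (see the note above).
    """
    changed = defaultdict(list)
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 8:
            continue                     # skip blank or unexpected lines
        _, cluster, site, assembly, _, _, ref_dinuc, aln_dinuc = fields
        if ref_dinuc != aln_dinuc:
            changed[(cluster, site)].append(assembly)
    return dict(changed)

result = changed_sites(sample)
print(result)
```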

The Python script alt35_jan06.py works only with the ASAP II database. For the general Pygr NLMSA multigenome alignment query feature, see the ISMB2006 presentation PDF for details.