ASAP Glossary



contig

Contigs are contiguous genomic sequences from GenBank. We use contigs to identify the consensus genomic sequence for a UniGene cluster of Expressed Sequences (ESTs). Note that several clusters can map to a single contig. Also, GenBank has changed its contig names since we performed our calculation, so BLAST searches might be necessary to identify current contig identifiers.



ug_id

A unique UniGene identifier for mRNA/EST sequences from GenBank. Also see gi_id, gb_id.



lib_id

UniGene identifier for cDNA library preparation and tissue source.



chromosome

Chromosome number (e.g. chromosome 21).



marker_id

Gene to which the radiation hybrid mapping position was assigned.



gene

Gene symbol for this UniGene cluster (e.g. TCN1, DPMK).



location_id

NCBI identifier for the radiation hybrid mapping position.



consensus_id

Consensus sequence identifier for this UniGene cluster. We derive the multiple sequence consensus by aligning the cluster's EST/mRNA sequences to each other. If there are several such consensus sequences (identified by bundle_ids), we pick the one supported by most EST/mRNA sequences and give it a consensus_id. The consensus is later used to probe contigs to find the genomic sequence.
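
The selection step described above can be sketched as follows. This is only an illustration, not the project's actual code: the function name, the dictionary input (bundle_id mapped to its number of supporting EST/mRNA sequences), and the handling of ties are all assumptions.

```python
from typing import Dict, Optional


def pick_consensus_bundle(support: Dict[str, int]) -> Optional[str]:
    """Return the bundle_id supported by the most EST/mRNA sequences.

    `support` maps a bundle_id to its count of supporting sequences
    (a hypothetical input format chosen for this sketch).
    """
    if not support:
        return None
    # max() keeps the first bundle seen among ties; how ties are really
    # resolved is not stated in the glossary, so this is an assumption.
    return max(support, key=support.get)
```

For example, with counts of 3, 7 and 2 for three bundles, the bundle with 7 supporting sequences would receive the consensus_id.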



bundle_id

Identifier for a multiple sequence alignment used in consensus generation.



clone_id

UniGene identifier for the subcloned vector.



trace_id

Unique identifier for the sequence trace (chromatogram data).



cluster_id

A UniGene identifier for a set of clustered Expressed Sequences (EST/mRNA). The cluster identifiers may change as UniGene regroups its ESTs and new clusters are created. A single cluster is supposed to correspond to either a single gene or at least a part of a gene.



splice_id

A unique identifier for splices detected by aligning all Expressed Sequences within a cluster to the genomic sequence. We detect splices as large gaps in Expressed Sequences produced by their alignment to the genomic sequence. A splice is defined by its starting and ending positions on the genomic sequence.



gb_id

Stable GenBank identifier for the sequence. Also see gi_id and ug_id.



gi_id

GenBank identifier for the current version of this sequence. Also see gb_id and ug_id.



n_est

Number of supporting EST observations for the splice_id.



n_mrna

Number of supporting mRNA observations for the splice_id.



n_est_obs

See n_est.



n_mrna_obs

See n_mrna.



gen_start

Genomic location of the exonic nucleotide 5' of this splice. We detect splices as large gaps in Expressed Sequences produced by their alignment to the genomic sequence. Here, gen_start accounts for the nucleotide on the genomic sequence preceding this gap. We use gen_start and gen_end to uniquely identify splices in a cluster_id.



gen_end

Genomic location of the exonic nucleotide 3' of this splice. We detect splices as large gaps in Expressed Sequences produced by their alignment to the genomic sequence. Here, gen_end accounts for the nucleotide on the genomic sequence following this gap. We use gen_end and gen_start to uniquely identify splices in a cluster_id.
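
As a rough sketch of how gen_start and gen_end fall out of an alignment, the following assumes each Expressed Sequence alignment is available as a sorted list of aligned blocks in 1-based, inclusive genomic coordinates. The block representation, the function name, and the gap threshold are assumptions; the glossary only says "large gaps" and gives no number.

```python
from typing import List, Tuple

# Hypothetical threshold for what counts as a "large" gap.
MIN_GAP = 30


def find_splices(blocks: List[Tuple[int, int]],
                 min_gap: int = MIN_GAP) -> List[Tuple[int, int]]:
    """Return (gen_start, gen_end) pairs for large gaps between blocks.

    `blocks` holds (start, end) genomic coordinates of aligned segments,
    sorted by position, 1-based and inclusive.
    """
    splices = []
    for (_, prev_end), (next_start, _) in zip(blocks, blocks[1:]):
        gap_length = next_start - prev_end - 1  # unaligned genomic bases
        if gap_length >= min_gap:
            # gen_start: last exonic nucleotide before the gap
            # gen_end: first exonic nucleotide after the gap
            splices.append((prev_end, next_start))
    return splices
```

With blocks (100, 200), (205, 300) and (1300, 1400), the 4-nucleotide gap is ignored and the 999-nucleotide gap yields the single splice (300, 1300).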



date_added

Date when the EST/mRNA sequence was added to a UniGene EST cluster.



rad_pos

Radiation hybrid mapping position.



order_in_location

To be defined.



splice_id_1

Meaning depends on context:

  1. Splice identifier alternative to splice_id_2. Also see splice_id.
  2. The tissue-specific splice form.


n_est_obs_1

Number of supporting EST observations for splice_id_1. Also see n_est.



n_mrna_obs_1

Number of supporting mRNA observations for splice_id_1. Also see n_mrna.



gen_start_1

Genomic position of the exonic nucleotide 5' of splice_id_1. Also see gen_start.



gen_end_1

Genomic position of the exonic nucleotide 3' of splice_id_1. Also see gen_end.



splice_id_2

Meaning depends on context:

  1. Splice identifier alternative to splice_id_1. Also see splice_id.
  2. The non-tissue-specific splice form.



n_est_obs_2

Number of supporting EST observations for splice_id_2. Also see n_est.



n_mrna_obs_2

Number of supporting mRNA observations for splice_id_2. Also see n_mrna.



gen_start_2

Genomic position of the exonic nucleotide 5' of splice_id_2. Also see gen_start.



gen_end_2

Genomic position of the exonic nucleotide 3' of splice_id_2. Also see gen_end.



length

Meaning depends on context. Possible options are:

  1. Number of nucleotides in an EST.
  2. Number of amino acids in a protein.



evidence

Indicates how much evidence we have for the alternative splicing event. An entry of multiple evidence indicates that both splices have at least two ESTs or at least one mRNA observation. All other alternative splices are said to have single evidence.
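
The rule above can be written out as a small sketch. The function names and the returned strings "multiple" and "single" are illustrative assumptions; only the thresholds (at least two ESTs, or at least one mRNA, for both splices) come from the definition.

```python
def splice_is_supported(n_est: int, n_mrna: int) -> bool:
    """True if one splice has at least two ESTs or at least one mRNA."""
    return n_est >= 2 or n_mrna >= 1


def evidence_level(n_est_1: int, n_mrna_1: int,
                   n_est_2: int, n_mrna_2: int) -> str:
    """Classify an alternative splicing event as multiple or single evidence."""
    both_supported = (splice_is_supported(n_est_1, n_mrna_1)
                      and splice_is_supported(n_est_2, n_mrna_2))
    return "multiple" if both_supported else "single"
```

An event whose first splice has two ESTs and whose second has one mRNA is "multiple"; if either splice has only a single EST and no mRNA, the event is "single".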



is_novel

Alternative splice event is novel if there are no mRNA sequences supporting it.



5_prime_site

Shows genomic sequence at the 5' exon/intron boundary. Intronic sequence is given in lower-case, and exonic sequence is given in upper-case letters.



3_prime_site

Shows genomic sequence at the 3' exon/intron boundary. Intronic sequence is given in lower-case, and exonic sequence is given in upper-case letters.
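
The upper-case/lower-case convention for 5_prime_site and 3_prime_site can be illustrated with a short sketch. It assumes 1-based gen_start and gen_end positions on the positive strand, and the number of flanking nucleotides shown (`flank`) is a made-up parameter; the glossary does not say how much sequence is displayed.

```python
def five_prime_site(genomic: str, gen_start: int, flank: int = 5) -> str:
    """Exonic bases up to gen_start in upper case, then intronic bases in lower case."""
    exon = genomic[max(0, gen_start - flank):gen_start].upper()
    intron = genomic[gen_start:gen_start + flank].lower()
    return exon + intron


def three_prime_site(genomic: str, gen_end: int, flank: int = 5) -> str:
    """Intronic bases before gen_end in lower case, then exonic bases in upper case."""
    first_exonic = gen_end - 1  # 0-based index of the nucleotide at gen_end
    intron = genomic[max(0, first_exonic - flank):first_exonic].lower()
    exon = genomic[first_exonic:first_exonic + flank].upper()
    return intron + exon
```

For a toy sequence made of exon "AAACAG", intron "GTAAGTCCCCCTTTTCAG" and exon "GATCC" (gen_start 6, gen_end 25), the two functions give "AACAGgtaag" and "ttcagGATCC".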



alternative_splice

Splice alternative to this one. We detect 5' alternative splices, 3' alternative splices, exon skips and mutually exclusive splices.



type

Meaning depends on context. Possible options are:

  1. Refers to the spliceosome (major or minor) responsible for the observed splicing event. Possible values are 'U1/U2' for major and 'U11/U12' for minor spliceosomes.
  2. Refers to the type of sequence (e.g. protein, EST, STS, mRNA).



donor

Information about the tissue donor.



tissue

Tissue where the EST is expressed.



Number_of_TS_splices

Number of tissue-specific splices in the tissue.



tissue_id

A unique identifier for the tissue category in our human tissue classification.



TS

Tissue specificity score. The higher the TS score, the higher the confidence.



n_s1

The number of ESTs that support splice form 1 within the tissue indicated by tissue_id.



n_s2

The number of ESTs that support splice form 2 within the tissue indicated by tissue_id.



n_s1_other

The number of ESTs that support splice form 1 within all tissues except the tissue indicated by tissue_id.



n_s2_other

The number of ESTs that support splice form 2 within all tissues except the tissue indicated by tissue_id.



r_tissue

The robustness score to measure the stability of the TS-value with and without an EST observation of splice_id_1 within the tissue indicated by tissue_id.



r_other

The robustness score to measure the stability of the TS value with and without an EST observation of splice_id_2 within all the other tissues except the tissue indicated by tissue_id.



confidence

The confidence level of the tissue specificity, measured by the TS value and robustness score. A higher confidence level indicates that the splice form is very likely to be preferentially found in the tissue indicated.



tissue_name

The tissue name that corresponds to the tissue_id.



title

Meaning depends on context. Possible options are:

  1. Title is UniGene's attempt to assign a common biological meaning to all EST/mRNA sequences clustered together. According to UniGene, there are several possible sources for the title, given here in order of preference: "(i) LocusLink title; (ii) name of product; (iii) defline of mRNA record; (iv) defline of genomic record; (v) ESTs, similar to something; (vi) ESTs." Note that the mRNA or genomic sequence responsible for title annotation is chosen arbitrarily from the set of sequences clustered together (refer to UniGene's FAQ for more).
  2. UniGene's comment for the EST/mRNA sequence considered.
  3. UniGene's comment for the cDNA library considered.



seq_start

Expressed Sequence (EST/mRNA) position of the exonic nucleotide 5' of the splice. Also see gen_start.



num_seq

The number of EST/mRNA sequences present in a given cluster. It is well known that Human UniGene clusters have vastly different sizes: up to 65% of all UniGene clusters contain fewer than 5 EST/mRNA sequences, while 0.3% of all clusters contain 1028 sequences or more (refer to UniGene's statistics page).



num_seq_cal

The number of EST/mRNA sequences selected for the splice calculation. See include_cal for the selection criteria.



percent_cal

The ratio of num_seq_cal to num_seq. This is an overall figure representing the percentage of a cluster's EST/mRNA sequences that align well to the identified genomic sequence.



status

A field for tracking the status of consensus generation and its alignment to the genomic sequence. Status COMPLETE indicates that we have successfully completed both of these steps.



contig_start

This number represents the contig position of the first nucleotide aligned to the consensus sequence. We use the contig start and end positions to obtain the genomic sequence for the UniGene Expressed Sequence Cluster.



contig_end

This number represents the contig position of the last nucleotide aligned to the consensus sequence. We use the contig start and end positions to obtain the genomic sequence for the UniGene Expressed Sequence Cluster.



orientation

Orientation indicates which strand of the contig aligns to the consensus. Two possible values are 5'-> 3' or 3'-> 5'. We denote the positive strand with 5'-> 3', and the negative strand with 3'-> 5'.
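
How contig_start, contig_end and orientation might combine to yield the genomic sequence for a cluster is sketched below. Several details are assumptions rather than documented behavior: 1-based inclusive coordinates with contig_start not greater than contig_end, the function name, and reverse-complementing the region when the orientation is 3'-> 5'.

```python
# Translation table for complementing DNA, preserving letter case.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def genomic_region(contig_seq: str, contig_start: int,
                   contig_end: int, orientation: str) -> str:
    """Extract the aligned contig region, oriented like the consensus."""
    region = contig_seq[contig_start - 1:contig_end]
    if orientation == "3'-> 5'":
        # Negative strand: take the reverse complement (an assumption).
        region = region.translate(_COMPLEMENT)[::-1]
    return region
```

On the toy contig "TTACGGAT", positions 3 to 6 give "ACGG" for 5'-> 3' and "CCGT" for 3'-> 5'.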



ori

See orientation.



seq_length

Length of the EST/mRNA sequence aligned to the genomic sequence.



head

The length of insertion at the beginning (head) of the mRNA/EST relative to our genomic sequence. One possible reason for the insertion is vector sequence contamination.



middle

The largest insertion relative to the genomic sequence in the inner (middle) region of EST/mRNA. All middle insertions larger than 6 nucleotides disqualify the EST/mRNA sequence from our splicing calculation.



tail

The length of insertion at the end (tail) of the mRNA/EST relative to our genomic sequence. Many of the tail insertions are due to the poly-A tails.
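
One way to picture head, middle and tail together is the sketch below. It assumes the alignment is summarized as sorted blocks of aligned positions in EST/mRNA coordinates (1-based, inclusive), so that any unaligned stretch of the Expressed Sequence counts as an insertion relative to the genomic sequence. That representation and the function name are hypothetical.

```python
from typing import List, Tuple


def insertion_lengths(est_blocks: List[Tuple[int, int]],
                      seq_length: int) -> Tuple[int, int, int]:
    """Return (head, middle, tail) insertion lengths for one EST/mRNA."""
    if not est_blocks:
        raise ValueError("at least one aligned block is required")
    head = est_blocks[0][0] - 1            # unaligned bases before the first block
    tail = seq_length - est_blocks[-1][1]  # unaligned bases after the last block
    # Largest unaligned stretch between consecutive aligned blocks.
    middle = max(
        (nxt[0] - prev[1] - 1 for prev, nxt in zip(est_blocks, est_blocks[1:])),
        default=0,
    )
    return head, middle, tail
```

For a 420-nucleotide sequence aligned in blocks (11, 100), (101, 250) and (259, 400), this gives head 10, middle 8 and tail 20.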



percent_align

The percent of nucleotides in the EST/mRNA aligned (match or mismatch) to the genomic sequence. If less than 70% of a consensus' nucleotides are aligned with the contig, we do not consider the EST/mRNA sequence in the splice calculation.



include_cal

Indicates whether to include the EST/mRNA sequence in the splicing calculation. We do this to filter out the EST/mRNA sequences that might have been misclustered by UniGene. We consider that an alignment is "good" if middle is less than six nucleotides, and if percent_align is more than 70%.
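
The filter, and the percent_cal figure derived from it, can be sketched as below. The function names and the tuple input are assumptions, and percent_cal is expressed here on a 0 to 100 scale, which is a guess. The comparison follows the wording of this entry (middle less than six); the middle entry instead says insertions larger than 6 disqualify a sequence, so the treatment of exactly six nucleotides is ambiguous.

```python
from typing import Iterable, Tuple

MAX_MIDDLE = 6             # nucleotides; "good" alignments stay below this
MIN_PERCENT_ALIGN = 70.0   # percent; "good" alignments exceed this


def include_cal(middle: int, percent_align: float) -> bool:
    """True if an EST/mRNA alignment qualifies for the splice calculation."""
    return middle < MAX_MIDDLE and percent_align > MIN_PERCENT_ALIGN


def percent_cal(alignments: Iterable[Tuple[int, float]]) -> float:
    """Percentage of a cluster's sequences that pass include_cal.

    `alignments` yields (middle, percent_align) pairs, one per sequence.
    """
    rows = list(alignments)
    if not rows:
        return 0.0
    num_seq_cal = sum(1 for middle, pct in rows if include_cal(middle, pct))
    return 100.0 * num_seq_cal / len(rows)
```

With four sequences of which one has a 12-nucleotide middle insertion and one aligns at only 55%, two remain and percent_cal is 50.0.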



real_site

Indicates if this splice has consensus GT/AG splice sites.
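
A minimal sketch of such a check, assuming 1-based gen_start and gen_end on the positive strand of the genomic sequence, so that the intron runs from gen_start + 1 to gen_end - 1. The function name is hypothetical, and strand handling is not covered.

```python
def has_consensus_sites(genomic: str, gen_start: int, gen_end: int) -> bool:
    """True if the intron between gen_start and gen_end begins GT and ends AG."""
    # Slice covers 1-based positions gen_start + 1 .. gen_end - 1.
    intron = genomic[gen_start:gen_end - 1].upper()
    return (len(intron) >= 4
            and intron.startswith("GT")
            and intron.endswith("AG"))
```

For the toy sequence "AAACAG" + "GTAAGTCCCCCTTTTCAG" + "GATCC" with gen_start 6 and gen_end 25, the check returns True.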



seq

Meaning depends on context. Possible options are:

  1. The genomic sequence corresponding to a particular UniGene cluster. Also see contig and consensus_id.
  2. The EST/mRNA sequence. Also see ug_id.



percent_ins

Percent of EST/mRNA aligned to the genomic sequence.



shared

To be defined...



shared2

To be defined...



n_shared

To be defined...



intron_id percent_exonic ver_gen_start ver_gen_end

These are deprecated fields and should not be considered.