GeneSieve

A Probe Selection Tool for cDNA Microarrays

  About Genesieve  


Genesieve is a clone selection tool. It makes choosing ESTs for populating microarrays many times faster and simpler than manual selection. In Genesieve, the ESTs available for each organism have been selected and clustered into contigs. The ESTs and contigs have then been compared, using BLAST, against the protein sequences of the model organism, Arabidopsis thaliana. Quality scores and other scores have been precomputed and made available, so selecting ESTs is straightforward once the selection criteria are known. The steps involved in the selection process are:

  • Obtain protein and gene sequences for the model organism (Arabidopsis thaliana).
  • Obtain EST sequences and corresponding base quality scores for organism/species of interest.
  • Cluster ESTs into contigs and singletons using PHRAP or any other similar clustering algorithm.
  • Calculate homology between these contigs and Arabidopsis thaliana proteins using BLAST or any other similar tool (Genesieve uses BLAST). Select contigs showing homology to proteins of interest.
  • Choose an EST from each of the selected contigs based on length, proximity to the 3' or 5' end, protein homology, or cross-hybridization properties.




  • GeneSieve System Architecture


    Contigs are assembled using PHRAP
    PHRAP is a widely used program for assembling shotgun DNA/EST sequences. It uses the entire read, both sequence and quality data, trims off any near-homopolymer runs at the ends of reads, and uses a combination of user-supplied and internally computed data quality information to improve accuracy. It uses pairwise matches to identify confirmed parts of reads and uses these to compute revised quality values; it then computes LLR (log-likelihood ratio) scores for each match, based on the qualities of discrepant and matching bases, and iterates these two steps. Among several overlapping matches, it finds the one with the highest LLR score, and hence the best alignment, for each matching pair of reads in a given region. Contig layouts are then constructed from consistent pairwise matches in decreasing score order (a greedy algorithm). Contig sequences are constructed as a mosaic of the highest-quality parts of reads.
    Taken from www.genome.ou.edu/phrap.html.
    PHRAP also provides information about the assembly, such as quality values for contigs and their constituent ESTs, and the start and end positions of EST-contig alignments. It also handles large datasets, up to a total size of 1 Gb.

    Selection of Contigs
    All contigs are compared against the Arabidopsis thaliana protein database using BLASTX. All protein homologues with P < 10E-06 are retained. Of these, the homologue with the highest bit score is selected and its annotation is assigned to the contig. The protein coverage for each contig is then calculated as follows:

    Protein Coverage = Length of protein aligned with the contig / Length of protein

    When more than one contig shows similarity to the protein of interest, contigs are selected based on their Protein Coverage scores.
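The annotation and coverage steps above can be sketched as follows. This is a minimal illustration, not Genesieve's actual code: the `Hit` record, its field names, and the function names are hypothetical, and the hits are assumed to have been parsed from BLASTX output already. The default cutoff reads the "10E-06" in the text literally as E-notation, which is an assumption, so it is left as a parameter.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One contig-to-protein BLASTX hit (hypothetical record layout)."""
    contig: str
    protein: str
    p_value: float
    bit_score: float
    aligned_len: int   # protein residues covered by the alignment
    protein_len: int   # full length of the protein


# Assumption: "10E-06" taken literally as E-notation; adjust if 1e-06 was meant.
P_CUTOFF = float("10E-06")


def best_hit(hits, cutoff=P_CUTOFF):
    """Return the significant hit with the highest bit score, or None."""
    significant = [h for h in hits if h.p_value < cutoff]
    return max(significant, key=lambda h: h.bit_score, default=None)


def protein_coverage(hit):
    """Length of protein aligned with the contig / length of protein."""
    return hit.aligned_len / hit.protein_len


def rank_contigs_for_protein(hits_by_contig, protein, cutoff=P_CUTOFF):
    """List (contig, coverage) for contigs annotated with `protein`, best first."""
    ranked = []
    for contig, hits in hits_by_contig.items():
        top = best_hit(hits, cutoff)
        if top is not None and top.protein == protein:
            ranked.append((contig, protein_coverage(top)))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)
```

Calling `rank_contigs_for_protein(hits_by_contig, "P1")` on a dictionary that maps contig names to lists of `Hit` objects returns the candidate contigs for protein `P1`, ordered by decreasing Protein Coverage.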

    Selection of ESTs
    The Quality Score (Q) for each EST is calculated as follows:

    Q = PH - CH + RL

    where,

  • PH = Protein Homology Score
      = EST to protein score / Contig to protein score
      (If first hit for EST and contig are the same protein, else PH=0)
      The typical range for PH is [0,1], but in some cases PH can be greater than 1.
  • CH = Cross Hybridization
      = EST to second contig score / EST to first contig score
      (First contig should be the one the EST belongs to, else CH=1)
      Range for CH is [0,1]
  • RL = Relative Length
      = Length of EST aligned to the contig / Length of Contig
      Range for RL is [0,1]

    (The Quality Score is calculated only for ESTs that belong to contigs showing similarity to one of the proteins.)
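The Quality Score arithmetic can be sketched in a few lines. This is an illustrative sketch only: the function and parameter names are hypothetical, and CH is taken as an already computed value rather than derived from contig scores here.

```python
def protein_homology(est_score, contig_score, same_first_hit):
    """PH = EST-to-protein score / contig-to-protein score.

    PH is 0 when the first hits of the EST and the contig are different proteins.
    """
    if not same_first_hit or contig_score == 0:
        return 0.0
    return est_score / contig_score


def relative_length(aligned_len, contig_len):
    """RL = length of EST aligned to the contig / length of contig."""
    return aligned_len / contig_len


def quality_score(ph, ch, rl):
    """Q = PH - CH + RL, with CH supplied as a precomputed value."""
    return ph - ch + rl


# Example: an EST scoring 90 against a protein that its contig hits with
# score 120, aligned over 400 bases of an 800-base contig, with CH = 0.25.
ph = protein_homology(90, 120, same_first_hit=True)
rl = relative_length(400, 800)
q = quality_score(ph, 0.25, rl)
```

When PH, CH, and RL all fall in [0,1], Q lies between -1 and 2; a higher Q means stronger protein homology, less cross-hybridization, and longer coverage of the contig.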

    Selection Methods
    These factors and the Quality Scores are calculated for all ESTs and then the ESTs are chosen by one of the following selection methods:

  • Maximum Length: Among all ESTs in a contig, select the one with the maximum length.

  • 5' Proximity: Find all ESTs which align to the 5' end of the contig. If there is more than one such EST, select the one with the maximum length.

  • 3' Proximity: Find all ESTs which align to the 3' end of the contig. If there is more than one such EST, select the one with the maximum length.

  • Maximum PH: Among all ESTs in a contig, select the one with the maximum Protein Homology Score.

  • Minimum CH: Among all ESTs in a contig, select the one with the minimum Cross Hybridization Score.

  • Maximum RL: Among all ESTs in a contig, select the one with the maximum Relative Length.

  • Maximum Q: Among all ESTs in a contig, select the one with the maximum Quality Score.


    The selection methods were compared based on their results, and the Maximum Q (maximum Quality Score) method was found to work best.
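The selection methods listed above can be sketched as a single dispatch function. This is a hedged illustration rather than Genesieve's implementation: the `Est` record, the method names, and the `tolerance` parameter are hypothetical. It assumes contig coordinates run 5' to 3' starting at 1, and that "aligns to the 5' (or 3') end" means the alignment reaches the first (or last) base of the contig, within `tolerance` bases.

```python
from dataclasses import dataclass


@dataclass
class Est:
    """Per-EST values within one contig (hypothetical record layout)."""
    name: str
    start: int    # first contig position covered by the EST alignment
    end: int      # last contig position covered by the EST alignment
    length: int   # EST length in bases
    ph: float     # Protein Homology score
    ch: float     # Cross Hybridization score
    rl: float     # Relative Length


def select_est(ests, method, contig_len, tolerance=0):
    """Pick one EST from a contig using the named selection method."""
    if method == "five_prime":
        pool = [e for e in ests if e.start <= 1 + tolerance]
        return max(pool, key=lambda e: e.length, default=None)
    if method == "three_prime":
        pool = [e for e in ests if e.end >= contig_len - tolerance]
        return max(pool, key=lambda e: e.length, default=None)
    keys = {
        "max_length": lambda e: e.length,
        "max_ph": lambda e: e.ph,
        "min_ch": lambda e: -e.ch,           # smallest CH wins
        "max_rl": lambda e: e.rl,
        "max_q": lambda e: e.ph - e.ch + e.rl,
    }
    return max(ests, key=keys[method], default=None)


ests = [
    Est("a", 1, 600, 600, ph=0.9, ch=0.4, rl=0.6),
    Est("b", 1, 400, 400, ph=1.0, ch=0.1, rl=0.4),
    Est("c", 300, 1000, 701, ph=0.5, ch=0.2, rl=0.7),
]
chosen = select_est(ests, "max_q", contig_len=1000)
```

In this toy contig, Maximum Length and Maximum RL favour the long EST `c`, while Maximum Q picks `b`, whose strong protein homology and low cross-hybridization outweigh its shorter length.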

    Currently, Quality Analysis and related data are available for the following organisms:
  • Pine
  • Potato
  • Tomato
  • Barley
  • Maize
