Bio::ToolBox::parser::ucsc - Parser for UCSC genePred, refFlat, etc formats


  use Bio::ToolBox::parser::ucsc;
  ### A simple transcript parser
  my $ucsc = Bio::ToolBox::parser::ucsc->new('file.genePred');
  ### A full fledged gene parser
  my $ucsc = Bio::ToolBox::parser::ucsc->new(
        file      => 'ensGene.genePred',
        do_gene   => 1,
        do_cds    => 1,
        do_utr    => 1,
        ensname   => 'ensemblToGene.txt',
        enssrc    => 'ensemblSource.txt',
  );
  ### Retrieve one transcript line at a time
  my $transcript = $ucsc->next_feature;
  ### Retrieve one assembled gene at a time
  my $gene = $ucsc->next_top_feature;
  ### Retrieve array of all assembled genes
  my @genes = $ucsc->top_features;
  # Each gene or transcript is a SeqFeatureI compatible object
  printf "gene %s is located at %s:%s-%s\n", 
    $gene->display_name, $gene->seq_id, 
    $gene->start, $gene->end;
  # Multiple transcripts can be assembled into a gene
  foreach my $transcript ($gene->get_SeqFeatures) {
    # each transcript has exons
    foreach my $exon ($transcript->get_SeqFeatures) {
      printf "exon is %sbp long\n", $exon->length;
    }
  }
  # Features can be printed in GFF3 format
  print STDOUT $gene->gff_string(1); 
   # the 1 indicates to recurse through all subfeatures


This is a parser for converting UCSC-style gene prediction flat file formats into BioPerl-style Bio::SeqFeatureI compliant objects, complete with nested objects representing transcripts, exons, CDS, UTRs, start- and stop-codons. Full control is available on what to parse, e.g. exons on, CDS and codons off. Additional gene information can be added by supplying additional tables of information, such as common gene names and descriptions, available from the UCSC repository.

Table formats supported

Supported files are tab-delimited text files obtained from UCSC. Formats are identified by the number of columns rather than by file extensions, column name headers, or other metadata. Therefore, only unmodified tables should be used to ensure correct parsing. Some errors are reported for incorrect lines. Unadulterated files can safely be downloaded from the UCSC download site. Files obtained from the UCSC Table Browser can also be used, with caution. Files may be gzip compressed.

File formats supported include the following.

  • Gene Prediction (genePred), 10 columns

  • Gene Prediction with RefSeq gene Name (refFlat), 11 columns

  • Extended Gene Prediction (genePredExt), 15 columns

  • Extended Gene Prediction with bin (genePredExt), 16 columns

  • knownGene table, 12 columns
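
Because formats are told apart by column count alone, the idea can be sketched in a few lines of core Perl. This is only an illustration under stated assumptions: the lookup table and the guess_format() helper are hypothetical names and are not part of this module's API, and the sample line is made-up data.

```perl
use strict;
use warnings;

# Hypothetical lookup table built from the column counts listed above.
my %FORMAT_BY_COLUMNS = (
    10 => 'genePred',
    11 => 'refFlat',
    12 => 'knownGene',
    15 => 'genePredExt',
    16 => 'genePredExt with bin',
);

# guess_format() is an illustrative helper, not a method of the parser.
sub guess_format {
    my $line = shift;
    chomp $line;
    # A limit of -1 keeps trailing empty fields, so the count is not understated.
    my @fields = split /\t/, $line, -1;
    return $FORMAT_BY_COLUMNS{ scalar @fields };    # undef when unrecognized
}

# A made-up 10-column genePred line.
my $line = join "\t", 'NM_000001', 'chr1', '+', 100, 900, 200, 800, 2,
    '100,500,', '300,900,';
printf "format: %s\n", guess_format($line) // 'unknown';
```

A table with an added or removed column would map to the wrong format or to none at all, which is why edited tables are risky.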

Supplemental information

The UCSC gene prediction tables include essential information, but not details such as common gene names, descriptions, or protein accession IDs. This additional information can be associated with the genes or transcripts during parsing if the appropriate tables are supplied. These tables can be obtained from the UCSC download site.

Supported tables include the following.

  • refSeqStatus, for refGene, knownGene, and xenoRefGene tables

  • refSeqSummary, for refGene, knownGene, and xenoRefGene tables

  • ensemblToGeneName, for ensGene tables

  • ensemblSource, for ensGene tables

  • kgXref, for knownGene tables


For an implementation of this module to generate GFF3 formatted files from UCSC data sources, see the Bio::ToolBox script


Initialize the parser object


Create a UCSC table parser object. Pass a single value (a table file name) to open a table and parse its objects. Alternatively, pass an array of key => value pairs to control how the table is parsed. Options include the following.


Provide a file name for a UCSC gene prediction table. The file may be gzip compressed.


Pass a string to be added as the source tag value of the SeqFeature objects. The default value is 'UCSC'. If the file name has a recognizable name, such as 'refGene' or 'ensGene', it will be used instead.


Pass a boolean (1 or 0) value to combine multiple transcripts with the same gene name under a single gene object. Default is true.

do_exon


Pass a boolean (1 or 0) value to parse certain subfeatures, including exon, CDS, five_prime_UTR, three_prime_UTR, stop_codon, and start_codon features. Default is false.


Pass a boolean (1 or 0) value to assign names to subfeatures, including exons, CDSs, UTRs, and start and stop codons. Default is false.


Pass a boolean (1 or 0) value to recycle shared subfeatures (exons and UTRs) between multiple transcripts of the same gene. This results in reduced memory usage, and smaller exported GFF3 files. Default is true.


Pass the appropriate file name for additional information.


Pass the name of a Bio::SeqFeatureI compliant class that will be used to create the SeqFeature objects. The default is to use Bio::ToolBox::SeqFeature.

Modify the parser object

These methods set or retrieve parameters, and load supplemental files and new tables.


These methods retrieve or set parameters to the parsing engine, same as the options to the new method.


Set or retrieve the file handle of the current table. This module uses IO::Handle objects. Be careful manipulating file handles of open tables!


Pass the name of a new table to parse. Existing gene models loaded in memory, if any, are discarded. Counts are reset to 0. Supplemental tables are not discarded.

load_extra_data($file, $type)
        my $file = 'hg19_refSeqSummary.txt.gz';
        my $success = $ucsc->load_extra_data($file, 'summary');

Pass two values: the file name of the supplemental file and the type of supplemental data. The type can be one of the following.

  • refseqstatus or status

  • refseqsummary or summary

  • kgxref

  • ensembltogene or ensname

  • ensemblsource or enssrc

The number of transcripts with information loaded from the supplemental data file is returned.

Feature retrieval

The following methods parse the table lines into SeqFeature objects. It is best if methods are not mixed; unexpected results may occur.


This will read the next line of the table and parse it into a gene or transcript object. However, multiple transcripts from the same gene are not assembled together under the same gene object.


This method will return all top features (typically genes), with multiple transcripts of the same gene assembled under the same gene object. Transcripts are assembled together if they share the same gene name and the transcripts overlap. If transcripts share the same gene name but do not overlap, they are placed into separate gene objects with the same name but different primary_id tags. Calling this method will parse the entire table into memory (so that multiple transcripts may be assembled), but only one object is returned at a time. Call this method repeatedly using a while loop to get all features.
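
The grouping rule described above (same gene name and overlapping coordinates) can be sketched in core Perl as follows. This is a simplified illustration and not the module's implementation: assemble_genes() and the hash keys are hypothetical, strand is ignored, and the transcripts are made-up data.

```perl
use strict;
use warnings;

# assemble_genes() is a hypothetical helper illustrating the grouping rule only.
sub assemble_genes {
    my @transcripts = @_;

    # Bucket transcripts by gene name.
    my %by_name;
    push @{ $by_name{ $_->{gene} } }, $_ for @transcripts;

    my @genes;
    for my $name ( sort keys %by_name ) {
        my @sorted = sort {
            $a->{chr} cmp $b->{chr} or $a->{start} <=> $b->{start}
        } @{ $by_name{$name} };

        my $current;
        for my $t (@sorted) {
            if (    $current
                and $current->{chr} eq $t->{chr}
                and $t->{start} <= $current->{end} )
            {
                # Overlaps the gene being built: add it and extend the gene end.
                push @{ $current->{transcripts} }, $t->{id};
                $current->{end} = $t->{end} if $t->{end} > $current->{end};
            }
            else {
                # No overlap: start a separate gene record with the same name.
                $current = {
                    name        => $name,
                    chr         => $t->{chr},
                    start       => $t->{start},
                    end         => $t->{end},
                    transcripts => [ $t->{id} ],
                };
                push @genes, $current;
            }
        }
    }
    return @genes;
}

my @genes = assemble_genes(
    { id => 'tx1', gene => 'ABC1', chr => 'chr1', start => 100,  end => 900 },
    { id => 'tx2', gene => 'ABC1', chr => 'chr1', start => 500,  end => 1200 },
    { id => 'tx3', gene => 'ABC1', chr => 'chr1', start => 5000, end => 6000 },
);
printf "%s %s:%d-%d (%s)\n", @{$_}{qw(name chr start end)},
    join( ',', @{ $_->{transcripts} } ) for @genes;
```

Grouping needs every transcript of a name to be seen before a gene can be finalized, which is consistent with the whole table being parsed into memory first.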


This method is similar to "next_top_feature", but instead returns an array of all the top features.

Other methods

Additional methods for working with the parser object and the parsed SeqFeature objects.


Parses the table into memory. If a table wasn't provided using the "new" or "open_file" methods, then a filename can be passed to this method and it will automatically be opened for you.

        my $gene = $ucsc->find_gene(
                display_name => 'ABC1',
                primary_id   => 'gene000001',
        );

Pass a gene name, or an array of key => value pairs (name, display_name, ID, primary_id, and/or coordinate information), to find a gene already loaded into memory. This works reliably only when the entire table has been loaded into memory. Genes with a matching name are confirmed by a matching ID or overlapping coordinates, if available. Otherwise the first match is returned.


This method will return a hash of the number of genes and RNA types that have been parsed.


This method will return a comma-delimited list of the feature types or primary_tags found in the parsed file. Returns a generic list if a file has not been parsed.


A bare bones method that will convert a tab-delimited text line from a UCSC formatted gene table into a SeqFeature object for you. Don't expect alternate transcripts to be assembled into genes.


Returns an array or array reference of the names of the chromosomes or reference sequences present in the table.


Returns a hash reference to the chromosomes or reference sequences and their corresponding lengths. In this case, the length is inferred by the greatest gene end position.
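
Inferring a length from the greatest end position amounts to a running maximum per sequence, as in this core Perl sketch. The infer_lengths() helper and the [seq_id, start, end] triples are hypothetical and only illustrate the idea; they are not the module's code.

```perl
use strict;
use warnings;

# infer_lengths() is a hypothetical helper; each feature is [seq_id, start, end].
sub infer_lengths {
    my %length;
    for my $feature (@_) {
        my ( $seq_id, undef, $end ) = @$feature;
        # Keep the largest end coordinate seen so far for this sequence.
        $length{$seq_id} = $end
            if !exists $length{$seq_id} or $end > $length{$seq_id};
    }
    return \%length;
}

my $lengths = infer_lengths(
    [ 'chr1', 100,  900 ],
    [ 'chr1', 5000, 6000 ],
    [ 'chr2', 50,   400 ],
);
printf "%s\t%d\n", $_, $lengths->{$_} for sort keys %$lengths;
```

A value obtained this way is a lower bound on the true sequence length, since no gene need end at the last base of a chromosome.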


This is a private module that is responsible for building SeqFeature objects from UCSC table lines. It is not intended for general public use.


Bio::ToolBox::SeqFeature, Bio::ToolBox::parser::gff


 Timothy J. Parnell, PhD
 Dept of Oncological Sciences
 Huntsman Cancer Institute
 University of Utah
 Salt Lake City, UT, 84112

This package is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0.