NAME

Bio::ToolBox::db_helper::seqfasta

DESCRIPTION

This module provides support for Bio::DB::SeqFeature::Store and Bio::DB::Fasta BioPerl database adaptors. Other BioPerl-style databases may have limited support.

USAGE

The module requires BioPerl to be installed, which includes the database adaptors Bio::DB::SeqFeature::Store and Bio::DB::Fasta. The database storage backend, e.g., MySQL or SQLite, will have additional requirements.

In general, this module should not be used directly. Use the methods available in Bio::ToolBox::db_helper or Bio::ToolBox::Data.

All subroutines are exported by default.

Opening databases

open_store_db

Opens a Bio::DB::SeqFeature::Store database. The connection parameters are typically stored in a configuration file, .biotoolbox.cfg. Multiple database containers are supported, including MySQL, SQLite, and in-memory.

Pass the name of the database or database file.

open_fasta_db

Opens a Bio::DB::Fasta database. Either a single fasta file or a directory of fasta files may be provided. Pass the path name to the file or directory.
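The two openers can be wrapped as in the minimal sketch below. The wrapper name open_sequence_db and the file-extension rule are assumptions made for illustration; only open_store_db and open_fasta_db come from this module, and each is called with the single name or path argument described above.

```perl
use strict;
use warnings;

# Hypothetical wrapper, not part of the module: picks an opener from the path.
sub open_sequence_db {
    my $path = shift;

    # Assumption: fasta sources are directories or files ending in .fa/.fasta.
    my $is_fasta = ( -d $path or $path =~ /\.(?:fa|fasta)$/i );

    # A fasta path must exist on disk; a store name may be a MySQL database,
    # so no file test is applied in that case.
    return if $is_fasta and not -e $path;

    # Both openers need BioPerl, so give up quietly when it is missing.
    return unless eval { require Bio::ToolBox::db_helper::seqfasta; 1 };

    return $is_fasta
        ? Bio::ToolBox::db_helper::seqfasta::open_fasta_db($path)
        : Bio::ToolBox::db_helper::seqfasta::open_store_db($path);
}
```

A caller would typically check the result, for example by dying with a message when open_sequence_db returns nothing for the given name.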

Collecting scores

Scores from seqfeature objects stored in the database may be retrieved. The scores may be collected as is, or they may be associated with genomic positions (indexed scores). Scores may be restricted to one strand by specifying the desired strandedness. For example, to collect transcription data over a gene, pass the strandedness value 'sense'. If the strand of the region database object (representing the gene) matches the strand of the wig file data feature, then the data is collected.
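The strand test described above can be sketched as follows. The helper strand_matches is hypothetical and not part of this module, and its treatment of unstranded regions or features (strand 0) as always matching is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper illustrating the strandedness test.
# Strands use the BioPerl values -1, 0, or 1.
sub strand_matches {
    my ( $strandedness, $region_strand, $feature_strand ) = @_;

    # 'all' ignores strand entirely.
    return 1 if $strandedness eq 'all';

    # Assumption: unstranded regions or features are always kept.
    return 1 if $region_strand == 0 or $feature_strand == 0;

    if ( $strandedness eq 'sense' ) {
        return $region_strand == $feature_strand ? 1 : 0;
    }
    if ( $strandedness eq 'antisense' ) {
        return $region_strand != $feature_strand ? 1 : 0;
    }
    die "unknown strandedness '$strandedness'\n";
}
```

With this rule, a feature on the reverse strand is kept for a forward-strand gene only when 'antisense' or 'all' is requested.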

Legacy wig file support uses GFF SeqFeature databases to store the file paths of the binary wiggle (.wib) files. If the seqfeature objects returned from the database include the wigfile attribute, then these objects are forwarded on to the Bio::ToolBox::db_helper::wiggle adaptor for appropriate score collection.

collect_store_scores

This subroutine will collect only the score values from database features for the specified database region. The positional information of the scores is not retained, and the values may be further processed through some statistical method (mean, median, etc.).

The subroutine is passed a parameter array reference. See "Data Collection Parameters Reference" below for details.

The subroutine returns an array or array reference of the requested dataset values found within the region of interest.
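Because positions are discarded, the returned values are usually reduced to a single statistic. The sketch below shows one way to do that; the mean and median helpers and the sample values are hypothetical and stand in for whatever the caller does with the list returned by collect_store_scores.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Arithmetic mean of a list of scores; returns nothing for an empty list.
sub mean {
    my @values = @_;
    return unless @values;
    return sum(@values) / scalar(@values);
}

# Median of a list of scores; averages the two middle values when the
# list has an even number of elements.
sub median {
    my @sorted = sort { $a <=> $b } @_;
    return unless @sorted;
    my $mid = int( scalar(@sorted) / 2 );
    return scalar(@sorted) % 2
        ? $sorted[$mid]
        : ( $sorted[ $mid - 1 ] + $sorted[$mid] ) / 2;
}

# Hypothetical scores, standing in for values collected from a region.
my @scores = ( 2, 4, 4, 10 );
printf "mean %.2f, median %.2f\n", mean(@scores), median(@scores);
```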

collect_store_position_scores

This subroutine will collect the score values from features in the database for the specified region, keyed by position.

The subroutine is passed a parameter array reference. See "Data Collection Parameters Reference" below for details.

The subroutine returns a hash or hash reference of the requested dataset values found within the region of interest keyed by position. Note that only one value is returned per position, regardless of the number of dataset features passed.
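A position-keyed hash is often converted into an ordered list covering the whole region. The sketch below assumes the hash keys are 1-based genomic positions, as described above; the helper scores_by_offset and the sample data are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a position => score hash reference into a list
# ordered from $start to $stop, with undef where no score was recorded.
sub scores_by_offset {
    my ( $start, $stop, $pos2score ) = @_;
    my @ordered;
    for my $pos ( $start .. $stop ) {
        # Reading a missing key yields undef and does not create the key.
        push @ordered, $pos2score->{$pos};
    }
    return @ordered;
}

# Hypothetical result for a four-base region, 100 through 103.
my %pos2score = ( 101 => 1.5, 103 => 2 );
my @ordered = scores_by_offset( 100, 103, \%pos2score );
print join( ',', map { defined $_ ? $_ : '.' } @ordered ), "\n";
```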

Data Collection Parameters Reference

The data collection subroutines are passed an array reference of parameters. Rather than calling them directly, the recommended approach is to use the get_segment_score method in Bio::ToolBox::db_helper.

The parameters array reference includes these items:

1. chromosome
2. start coordinate
3. stop coordinate

Coordinates are in BioPerl-style 1-base system.

4. strand

Should be standard BioPerl representation: -1, 0, or 1.

5. strandedness

A scalar value representing the desired strandedness of the data to be collected. Acceptable values include "sense", "antisense", or "all". Only those scores which match the indicated strandedness are collected.

6. score method

Acceptable values include score, count, ncount, and pcount.

   * score returns the basepair coverage of alignments over the 
   region of interest
   
   * count returns the number of alignments that overlap the 
   search region. 
   
   * pcount, or precise count, returns the count of alignments 
   whose start and end fall within the region. 
   
   * ncount, or named count, returns an array of alignment read  
   names. Use this to avoid double-counting paired-end reads by 
   counting only unique names. Reads are taken if they overlap 
   the search region.

7. database object

This should be a Bio::DB::SeqFeature::Store database.

8. database type

Set the type or primary_tag of the dataset within the database to collect from. Additional dataset items may be added to the list when merging data.
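The positional layout above can be assembled as in the following sketch. The helper build_collection_params, its named arguments, and the validation it performs are assumptions for illustration and not part of this module. A second hypothetical helper shows how ncount read names might be reduced to a unique count.

```perl
use strict;
use warnings;

# Hypothetical helper: validate named arguments and return the positional
# array reference in the order documented above.
sub build_collection_params {
    my %arg = @_;
    my %ok_strandedness = map { $_ => 1 } qw(sense antisense all);
    my %ok_method       = map { $_ => 1 } qw(score count ncount pcount);

    die "bad strandedness '$arg{strandedness}'\n"
        unless $ok_strandedness{ $arg{strandedness} };
    die "bad score method '$arg{method}'\n"
        unless $ok_method{ $arg{method} };
    die "start must not exceed stop\n" if $arg{start} > $arg{stop};

    return [
        $arg{chromo},          # 1. chromosome
        $arg{start},           # 2. start coordinate, 1-based
        $arg{stop},            # 3. stop coordinate
        $arg{strand},          # 4. strand: -1, 0, or 1
        $arg{strandedness},    # 5. sense, antisense, or all
        $arg{method},          # 6. score method
        $arg{db},              # 7. Bio::DB::SeqFeature::Store object
        @{ $arg{types} },      # 8. one or more dataset types
    ];
}

# Hypothetical helper: count distinct read names returned by ncount so
# that both mates of a paired-end read are counted once.
sub unique_name_count {
    my %seen;
    $seen{$_}++ for @_;
    return scalar keys %seen;
}
```

In practice the db argument would be an opened database object and the resulting array reference would be handed to one of the collection subroutines.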

SEE ALSO

Bio::ToolBox::Data::Feature, Bio::ToolBox::db_helper, Bio::DB::SeqFeature::Store

AUTHOR

 Timothy J. Parnell, PhD
 Dept of Oncological Sciences
 Huntsman Cancer Institute
 University of Utah
 Salt Lake City, UT, 84112

This package is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0.