NAME

Bio::ToolBox::db_helper::bam

DESCRIPTION

This module provides support for binary bam alignment files to the Bio::ToolBox package through the samtools C library.

USAGE

The module requires Bio::DB::Sam to be installed, which in turn requires the legacy samtools C library, version 0.1.20 or earlier, to be installed.

In general, this module should not be used directly. Use the methods available in Bio::ToolBox::db_helper or Bio::ToolBox::Data.

All subroutines are exported by default.

Available subroutines

open_bam_db

This subroutine will open a Bam database connection. Pass either the local path to a Bam file (.bam extension) or the URL of a remote Bam file. A remote bam file must be indexed. A local bam file may be automatically indexed upon opening if the user has write permissions in the parent directory.

It will return the opened database object.

open_indexed_fasta

This will open an indexed fasta file using the Bio::DB::Sam::Fai module. It requires a .fa.fai index file, and one should be built automatically if it is not present. This provides a very fast interface to fetch genomic sequences, but no other support is provided. Pass the path to an uncompressed genomic fasta file (multiple sequences in one file are supported, but separate chromosome sequence files are not). The fasta index object is returned.

check_bam_index

This subroutine will check whether a bam index file is present and, if not, generate one. The Bio::DB::Sam module uses the samtools style index extension, .bam.bai, as opposed to the Picard style extension, .bai. Unfortunately, a .bai index cannot be used directly, so if one is present, the file is copied to a .bam.bai index.

This method is called automatically prior to opening a bam database.
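
The index lookup described above can be sketched in plain Perl. This is only an illustration of the file naming logic, not the module's actual implementation: the subroutine name ensure_samtools_index is hypothetical, and the step that would build a missing index is left as a comment because it depends on Bio::DB::Sam.

```perl
use strict;
use warnings;
use File::Copy qw(copy);

# Hypothetical helper: make sure a samtools style index (.bam.bai)
# sits next to the given bam file. Returns the index path, or undef
# when no index of either style is found.
sub ensure_samtools_index {
    my ($bam) = @_;
    my $samtools_index = "$bam.bai";    # file.bam.bai

    # samtools style index already present: nothing to do
    return $samtools_index if -e $samtools_index;

    # Picard style index: file.bai, the .bam extension is replaced
    (my $picard_index = $bam) =~ s/\.bam$/.bai/i;
    if ($picard_index ne $bam and -e $picard_index) {
        copy($picard_index, $samtools_index)
            or die "unable to copy $picard_index: $!";
        return $samtools_index;
    }

    # No index found. The real module would generate one at this point.
    return undef;
}
```

Calling ensure_samtools_index('sample.bam') when only sample.bai exists leaves a copy named sample.bam.bai beside the bam file.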

write_new_bam_file

This subroutine will open a new empty Bam file. Pass the name of the new file as the argument. It will return a low level Bio::DB::Bam object to which you can write a header followed by alignments. Be sure you know what to do before using this method!

collect_bam_scores

This subroutine will collect only the data values from a binary bam file for the specified database region. The positional information of the scores is not retained.

Collected data values may be restricted to strand by specifying the desired strandedness (sense, antisense, or all), depending on the method of data collection. Collecting scores, that is, the basepair coverage of alignments over the region of interest, does not currently support stranded data collection. However, enumerating alignments (count method) and collecting alignment lengths do support stranded data collection. Each alignment is checked to see whether its midpoint is within the search interval before it is counted or its length is collected.
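
The midpoint check mentioned above can be illustrated with a short sketch. The helper name midpoint_in_interval is hypothetical, and both the 1-based inclusive coordinates and the rounding down of the midpoint are assumptions; the module may compute the midpoint differently.

```perl
use strict;
use warnings;

# Hypothetical helper: true when the alignment midpoint falls inside
# the search interval (1-based, inclusive coordinates assumed).
sub midpoint_in_interval {
    my ($aln_start, $aln_end, $start, $stop) = @_;
    my $mid = int( ($aln_start + $aln_end) / 2 );    # rounded down
    return ($mid >= $start and $mid <= $stop) ? 1 : 0;
}

# An alignment spanning 990..1029 has midpoint 1009, so it is taken
# for the interval 1000..2000 even though it begins upstream of it.
print midpoint_in_interval(990, 1029, 1000, 2000), "\n";    # prints 1

# An alignment spanning 950..1020 has midpoint 985 and is skipped.
print midpoint_in_interval(950, 1020, 1000, 2000), "\n";    # prints 0
```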

As of version 1.30, paired-end bam files are properly handled with regard to strand: strand is determined by the orientation of the first mate. However, pairs are still counted as two alignments, not one. To avoid this, use the value type 'ncount' and count the number of unique alignment names. (Previous versions treated all paired-end alignments as single-end alignments, which severely limited their usefulness.)
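
Counting unique names is a small step once the names have been collected. The following sketch assumes the names are already in a plain Perl list; the helper name count_unique_names and the read names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: count the unique read names in a list, such as
# the names collected with the ncount method.
sub count_unique_names {
    my @names = @_;
    my %seen;
    $seen{$_}++ foreach @names;
    return scalar keys %seen;
}

# Both mates of a pair carry the same read name, so five alignments
# from two complete pairs and one lone mate collapse to three names.
my @example_names = qw(read1 read1 read2 read3 read3);
print count_unique_names(@example_names), "\n";    # prints 3
```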

The subroutine is passed a parameter array reference. See "Data Collection Parameters Reference" below for details.

The subroutine returns an array or array reference of the requested dataset values found within the region of interest.

collect_bam_position_scores

This subroutine will collect the score values from a binary bam file for the specified database region keyed by position.

The subroutine is passed a parameter array reference. See "Data Collection Parameters Reference" below for details.

The subroutine returns a hash or hash reference of the defined dataset values found within the region of interest keyed by position. The feature midpoint is used as the key position. When multiple features are found at the same position, a simple mean (for length data methods) or sum (for count methods) is returned. The ncount value type is not supported with positioned scores.
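
How values that share a position are combined can be sketched as follows. This is an illustration of the rule stated above (mean for lengths, sum for counts), not the module's code; the helper name combine_by_position and its argument layout are assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper: combine values that share a position key.
# $pairs is an array reference of [position, value] pairs and
# $combine is either 'sum' (count methods) or 'mean' (length methods).
sub combine_by_position {
    my ($pairs, $combine) = @_;
    my (%total, %n);
    foreach my $pair (@$pairs) {
        my ($pos, $value) = @$pair;
        $total{$pos} += $value;
        $n{$pos}++;
    }
    my %result;
    foreach my $pos (keys %total) {
        $result{$pos} = $combine eq 'mean'
            ? $total{$pos} / $n{$pos}
            : $total{$pos};
    }
    return \%result;
}

# Two alignments of length 36 and 50 share the midpoint 1500, so that
# position holds their mean, 43; position 1720 holds a single length.
my $lengths = combine_by_position(
    [ [1500, 36], [1500, 50], [1720, 36] ], 'mean' );
print "$_\t$lengths->{$_}\n" foreach sort { $a <=> $b } keys %$lengths;
```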

sum_total_bam_alignments

This subroutine will sum the total number of properly mapped alignments in a bam file. Pass the subroutine one to four arguments in the following order.

1. Bam file path or object

The name of the Bam file which should be counted. Alternatively, an opened Bio::DB::Sam object may also be given. Required.

2. Minimum mapping quality (integer)

Optionally pass the minimum mapping quality of the reads to be counted. The default is 0, where all alignments are counted. Maximum is 255. See the SAM specification for details.

3. Paired-end (boolean)

Optionally pass a boolean value (1 or 0) indicating whether the Bam file represents paired-end alignments. Only proper alignment pairs are counted. The default is to treat all alignments as single-end.

4. Number of forks (integer)

Optionally pass the number of parallel processes to execute when counting alignments. Walking through a Bam file is time consuming but can be easily parallelized. The module Parallel::ForkManager is required, and the default is a conservative two processes when it is installed.

The subroutine will return the number of alignments.

Data Collection Parameters Reference

The data collection subroutines are passed an array reference of parameters. The recommended way to collect data is through the get_segment_score method in Bio::ToolBox::db_helper rather than by calling these subroutines directly.

The parameters array reference includes these items:

1. chromosome

2. start coordinate

3. stop coordinate

4. strand

Should be standard BioPerl representation: -1, 0, or 1.

5. strandedness

A scalar value representing the desired strandedness of the data to be collected. Acceptable values include "sense", "antisense", or "all". Only those scores which match the indicated strandedness are collected.

6. score method

Acceptable values include score, count, pcount, and ncount.

   * score returns the basepair coverage of alignments over the 
   region of interest
   
   * count returns the number of alignments that overlap the 
   search region. 
   
   * pcount, or precise count, returns the count of alignments 
   whose start and end fall within the region. 
   
   * ncount, or named count, returns an array of alignment read  
   names. Use this to avoid double-counting paired-end reads by 
   counting only unique names. Reads are taken if they overlap 
   the search region.

7. Database object

Not used here.

8. Paths to bam files

Subsequent bam files may also be provided as additional list items. Opened Bam file objects are cached.
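
As an illustration, the parameter list above could be assembled and unpacked as in the sketch below. The chromosome name, coordinates, and file names are made-up values, and describe_params is a hypothetical helper used only to show the item order.

```perl
use strict;
use warnings;

# Parameters in the documented order; all values are examples only.
my $params = [
    'chr1',           # 1. chromosome
    1000,             # 2. start coordinate
    2000,             # 3. stop coordinate
    1,                # 4. strand of the region: -1, 0, or 1
    'sense',          # 5. strandedness: sense, antisense, or all
    'count',          # 6. score method: score, count, pcount, ncount
    undef,            # 7. database object, not used here
    'sample1.bam',    # 8. path to the first bam file
    'sample2.bam',    #    additional bam files follow
];

# Hypothetical helper: unpack the fixed items; whatever remains is
# the list of bam files.
sub describe_params {
    my ($p) = @_;
    my ($chromo, $start, $stop, $strand, $stranded, $method, $db,
        @bam_files) = @$p;
    return sprintf "%s:%d-%d strand %d, %s %s from %d bam file(s)",
        $chromo, $start, $stop, $strand, $stranded, $method,
        scalar @bam_files;
}

print describe_params($params), "\n";

# With the module installed and real files, the call would be:
#   my @scores = collect_bam_scores($params);
```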

SEE ALSO

Bio::ToolBox::Data::Feature, Bio::ToolBox::db_helper, Bio::DB::Sam

AUTHOR

 Timothy J. Parnell, PhD
 Howard Hughes Medical Institute
 Dept of Oncological Sciences
 Huntsman Cancer Institute
 University of Utah
 Salt Lake City, UT, 84112

This package is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0.