The "fb_shmiggle.pm" TopoView (AKA shmiggle) glyph was developed for fast, 3D-like display of RNA-seq data consisting of multiple individual subsets. The main goals were to make the presentation as compact as possible (a single, reasonably sized track) and to make coordinated behavior in the expression profiles of different subsets easy to detect visually.
It was found that log2 conversion dramatically changes the perception of expression profiles and highlights coordinated behavior of different subsets. The glyph and the data indexer/formatter were therefore modified on the assumption that the final data produced by the indexer/formatter will always be a log2 conversion of the original coverage, and so can be represented by short integers with values in the range of roughly 0-200.
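As a rough illustration of the log2 conversion described above, here is a minimal Python sketch. It is not the code used by the indexer/formatter: the function name `log2_scaled`, the `+ 1` offset (so that a coverage of 1 does not collapse to 0) and the `scale` multiplier are all assumptions made for this example.

```python
import math

def log2_scaled(coverage, scale=10):
    """Map a raw coverage count to a small non-negative integer.

    Assumptions (not taken from the indexer): log2(coverage + 1) is used
    so that coverage 1 stays visible, and `scale` is a hypothetical
    multiplier chosen so that typical values fall in roughly 0-200.
    """
    if coverage <= 0:
        return 0
    return int(round(scale * math.log2(coverage + 1)))

# Raw coverage spanning several orders of magnitude is squeezed
# into a range that fits comfortably in a short integer.
converted = [log2_scaled(c) for c in (0, 1, 1023, 2**20 - 1)]
```

With these assumed settings, coverage values up to about a million map into the 0-200 range, which is why a short integer is enough to store them.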
When the performance of the wiggle binary method was compared with
several possible alternatives (retrieval of several Kbp of data
profiles for several subsets of an RNA-seq experiment), one approach
remarkably outperformed the wiggle binary method, although it
requires several times more space for formatted data storage. This
optimal storage/retrieval method stores all experiment data (all
subsets of the experiment) in one text file. The structure of the
file is one of the simplest wiggle (coverage file) formats with some
positioning data added: two columns, no run-length specification, and
no omission of zero values. This is the only format the glyph is able
to handle (there are many reasons for that), so any modification of
the indexer/formatter _must_ produce an exact equivalent of that
format. In my experience, 90% of the debugging with new incoming data
was related to problems with that exact format conversion.
Example of the formatted data:
# subset=BS107_all_unique chromosome=2LHet
-200000 0
0 0
19955 1
19959 0
19967 2
19972 0
19977 2
20027 0
20031 2
20035 0
20043 1
20045 0
20049 1
20055 0
20062 2
20069 0
20073 2
20082 0
20097 3
20115 0
20125 3
20127 0
20134 3
20139 0
20140 3
20144 0
20145 3
20150 0
20157 3
20162 0
20172 3
20183 0
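Because most debugging effort goes into getting this format exactly right, a small standalone checker can be useful before the data reaches GBrowse. The following Python sketch is not part of the glyph or of "index_cov_files.pl". It assumes one position/value pair per line, a "# subset=... chromosome=..." header before each block, and strictly increasing positions within a block (as in the example above); the function name `parse_formatted` is hypothetical.

```python
def parse_formatted(lines):
    """Parse formatted coverage text.

    Returns {(subset, chromosome): [(position, value), ...]}.
    Raises ValueError on the first line that breaks the assumed format.
    """
    profiles = {}
    current = None
    for lineno, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Header line: collect key=value items such as subset=... chromosome=...
            fields = dict(item.split("=", 1) for item in line[1:].split() if "=" in item)
            if "subset" not in fields or "chromosome" not in fields:
                raise ValueError(f"line {lineno}: header lacks subset or chromosome")
            current = profiles.setdefault((fields["subset"], fields["chromosome"]), [])
            continue
        if current is None:
            raise ValueError(f"line {lineno}: data before any header line")
        parts = line.split()
        if len(parts) != 2:
            raise ValueError(f"line {lineno}: expected 2 columns, got {len(parts)}")
        pos, value = int(parts[0]), int(parts[1])  # int() raises ValueError on non-integers
        # Assumption: positions strictly increase within one subset/chromosome block.
        if current and pos <= current[-1][0]:
            raise ValueError(f"line {lineno}: position {pos} is not increasing")
        current.append((pos, value))
    return profiles

sample = """# subset=BS107_all_unique chromosome=2LHet
-200000 0
0 0
19955 1
19959 0
19967 2
""".splitlines()

profiles = parse_formatted(sample)
```

Running such a check on the formatted data file right after indexing is far cheaper than diagnosing a malformed file through a running GBrowse instance.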
The glyph is supplied with a data indexer/formatter, "index_cov_files.pl", which converts the original coverage (wiggle) files into the data structure used for fast retrieval. Run this script in a separate directory containing the original coverage files (gzipped files work too). When it finishes, the directory will contain two new files: data.cat and index.bdbhash. Both files are required for data retrieval by the glyph. The files can be moved freely between directories or even operating systems (Mac and PC included, I think). Check the content of the data file carefully if you want to avoid long debugging sessions at the level of a running GBrowse. The files are quite big, but in my experience they take about half the gzipped size of all the initial coverage files, which is quite acceptable.
Example of GBrowse conf file insert (shows actual FlyBase config sections for
Baylor and modENCODE RNA-seq tracks):
In the configuration, it is very important to set the 'datadir' variable (relative to the server DOCUMENT_ROOT) so that the glyph knows where to find the data and index files.
Setting 'subsetsorder' lets you display the expression profiles of subsets in a predefined order. If the setting is omitted, the glyph displays subsets in alphabetical order of their initial names.
Setting 'subsetsnames' lets you rename subsets (very important, as in most cases the workflow names of subsets are unsuitable for clear data display to end users). If the setting is omitted, the initial subset names are used for display.
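The effect of 'subsetsorder' and 'subsetsnames' can be sketched in a few lines of Python. This only illustrates the fallback rules described above and is not the glyph's Perl code. The function name `display_plan`, the subset name "BS40_all_unique" and the display labels are hypothetical, and the way the settings are written in the GBrowse conf file is not modeled here. What happens to subsets missing from 'subsetsorder' is not stated above; this sketch simply leaves them out.

```python
def display_plan(found_subsets, subsets_order=None, subsets_names=None):
    """Return [(subset, label), ...] in display order.

    found_subsets: subset names present in the data.
    subsets_order: optional explicit order (stands in for 'subsetsorder').
    subsets_names: optional {subset: label} mapping (stands in for 'subsetsnames').
    """
    if subsets_order:
        # Assumption: subsets not listed in the explicit order are not shown.
        ordered = [s for s in subsets_order if s in found_subsets]
    else:
        # Documented fallback: alphabetical order of the initial names.
        ordered = sorted(found_subsets)
    names = subsets_names or {}
    # Documented fallback: the initial name is used when no label is given.
    return [(s, names.get(s, s)) for s in ordered]

found = {"BS40_all_unique", "BS107_all_unique"}

default_plan = display_plan(found)
custom_plan = display_plan(
    found,
    subsets_order=["BS40_all_unique", "BS107_all_unique"],
    subsets_names={"BS40_all_unique": "Sample A"},
)
```

In the default case, "BS107_all_unique" sorts before "BS40_all_unique" because the comparison is alphabetical rather than numeric, which is one reason an explicit order is often preferable.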
For the glyph to be activated properly, you need to insert virtual contig-long features into every GFF file for which you have RNA-seq data; these features trigger the expression data display. To cover the whole range of the contig (chromosome arm), it is best to use the coordinates given in the 'sequence-region' definition at the top of the GFF file. Example feature lines for FlyBase data are shown below:
2LHet Baylor RNAseq_profile 1 368874 . + . Comment=This is a reference feature for RNAseq wiggle tracks
2L Baylor RNAseq_profile 1 23011544 . + . Comment=This is a reference feature for RNAseq wiggle tracks
2RHet Baylor RNAseq_profile 1 3288763 . + . Comment=This is a reference feature for RNAseq wiggle tracks
2R Baylor RNAseq_profile 1 21146708 . + . Comment=This is a reference feature for RNAseq wiggle tracks
3LHet Baylor RNAseq_profile 1 2555493 . + . Comment=This is a reference feature for RNAseq wiggle tracks
3L Baylor RNAseq_profile 1 24543557 . + . Comment=This is a reference feature for RNAseq wiggle tracks
3RHet Baylor RNAseq_profile 1 2517509 . + . Comment=This is a reference feature for RNAseq wiggle tracks
3R Baylor RNAseq_profile 1 27905053 . + . Comment=This is a reference feature for RNAseq wiggle tracks
4 Baylor RNAseq_profile 1 1351857 . + . Comment=This is a reference feature for RNAseq wiggle tracks
XHet Baylor RNAseq_profile 1 204113 . + . Comment=This is a reference feature for RNAseq wiggle tracks
X Baylor RNAseq_profile 1 22422827 . + . Comment=This is a reference feature for RNAseq wiggle tracks
YHet Baylor RNAseq_profile 1 347040 . + . Comment=This is a reference feature for RNAseq wiggle tracks
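Reference features like these can be derived from the 'sequence-region' lines instead of being typed by hand. The Python sketch below is one possible way to do that, not a tool shipped with the glyph. It assumes GFF3-style pragmas of the form "##sequence-region seqid start end" and writes tab-separated columns; the function name `reference_features` and its default arguments are chosen to match the FlyBase example above.

```python
def reference_features(gff_lines, source="Baylor", feature_type="RNAseq_profile",
                       comment="This is a reference feature for RNAseq wiggle tracks"):
    """Build one contig-long reference feature line per ##sequence-region pragma."""
    features = []
    for line in gff_lines:
        if not line.startswith("##sequence-region"):
            continue
        parts = line.split()
        if len(parts) != 4:
            continue  # skip pragmas that do not look like: ##sequence-region seqid start end
        _, seqid, start, end = parts
        columns = [seqid, source, feature_type, start, end, ".", "+", ".",
                   f"Comment={comment}"]
        features.append("\t".join(columns))
    return features

header = [
    "##gff-version 3",
    "##sequence-region 2LHet 1 368874",
    "##sequence-region 4 1 1351857",
]
lines = reference_features(header)
```

The resulting lines can then be appended to the corresponding GFF file, with `source` changed for other data sets.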
Questions about TopoView glyph should be directed to Victor Strelets (email@example.com).