Use hashes to estimate MLST
```
hashest-index.pl: indexes a fasta file
  Fasta file have deflines in the format of >locus_allele
  where locus is a string and allele is an int
  Usage: hashest-index.pl [options] .fasta [.gbk...]
  --k        kmer length [default: 16]
  --hashing  Hashing algorithm md5_hex, sha1_hex [default: md5_hex]
  --output   Output prefix for index files
  --version  print version and exit
  --help     This useful help menu

hashest-search.pl: reports an MLST profile for a genome assembly
  Usage: hashest-search.pl [options] .fasta [.gbk...] > out.tsv
  --db             Database from hashest-index.pl
  --numcpus        Number of threads to use [default: 1]
  --dump           Dump the database instead of analyzing anything
  --novel-alleles  (optional) A filename to write novel alleles into a fasta format.
                   Defline will be locusname_hashsum
  --putatives      Print a '?' instead of an int when a locus has been detected
                   but no exact allele was found
  --help           This useful help menu
```
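
The defline convention in the help text (`>locus_allele`, with a string locus and an integer allele) can be checked before indexing. The following is a minimal Python sketch, not part of hashest itself. The function name `parse_defline` is hypothetical, and splitting on the last underscore is an assumption made so that locus names may contain underscores.

```python
def parse_defline(defline: str) -> tuple[str, int]:
    """Split a '>locus_allele' defline into (locus, allele)."""
    if not defline.startswith(">"):
        raise ValueError(f"not a defline: {defline!r}")
    # Use only the first whitespace-delimited token as the sequence ID.
    tokens = defline[1:].split()
    if not tokens:
        raise ValueError(f"empty defline: {defline!r}")
    # Assumption: the allele is everything after the LAST underscore.
    locus, sep, allele = tokens[0].rpartition("_")
    if not sep or not locus or not allele.isdecimal():
        raise ValueError(f"expected >locus_allele, got {defline!r}")
    return locus, int(allele)


# 'abcZ_12' is only an illustrative ID.
print(parse_defline(">abcZ_12"))  # ('abcZ', 12)
```

Rejecting malformed deflines up front gives a clearer error than discovering a bad allele number after the index has been built.
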
- Step 1: get a fasta file or set of fasta files with alleles in the format of `>locus_allele`
- Step 2: run `hashest-index.pl` on the set of fasta file(s) to create a new index. The database is described in its own section below.
- Step 3: analyze an assembly against the new index with `hashest-search.pl`

`hashest-search.pl` writes its results as a TSV to stdout.
Columns are loci, rows are assemblies, and values are alleles.
Tildes (`~`) represent multiple allele matches and probably multiple copies/variations of a gene.
Question marks (`?`) indicate a match to a locus via a hash match, but no allele match was found.
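
As an illustration of reading this output downstream, here is a minimal Python sketch. It assumes the first row is a header of locus names and the first column holds the assembly name. The exact way multiple alleles are joined by `~` is also an assumption, and the example table is made up rather than real hashest output. The names `classify_call` and `read_profiles` are hypothetical.

```python
import csv
import io


def classify_call(value: str) -> str:
    """Label one allele cell as 'putative', 'multiple', or 'exact'."""
    if "?" in value:
        return "putative"   # locus matched by hash, no exact allele
    if "~" in value:
        return "multiple"   # more than one allele matched
    return "exact"


def read_profiles(handle) -> dict[str, dict[str, str]]:
    """Read a TSV with loci as columns and assemblies as rows."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    loci = header[1:]       # assumption: column 0 is the assembly name
    profiles = {}
    for row in reader:
        if not row:
            continue
        profiles[row[0]] = dict(zip(loci, row[1:]))
    return profiles


# Made-up example table, not real output.
example = "assembly\tlocusA\tlocusB\tlocusC\nsample1.fasta\t3\t1~2\t?\n"
profiles = read_profiles(io.StringIO(example))
for locus, value in profiles["sample1.fasta"].items():
    print(locus, value, classify_call(value))
```

To read a real run, pass an open file handle for `out.tsv` in place of the `io.StringIO` object.
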
Requires perl with threads and BioPerl
git clone email@example.com:lskatz/hashest.git
Inspired by Gustle
Uses native perl md5 hashing.
- Index the database
  - Hash the first k nucleotides of each allele in the database
  - Save the whole sequence of each allele too
  - Save to the index file
- Search the database
  - Hash a sliding window of length k along the genome assembly
  - Find the right locus: match hash to locus
  - Find the right allele of the locus: match sequence to alleles of locus
  - If multiple CPUs are given, multiple assemblies are analyzed at the same time, each single-threaded.
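
The index and search steps above can be sketched in Python as follows. This is an illustration of the idea, not the Perl implementation: the function names and the index layout are hypothetical, only the forward strand of a single contig is scanned, and if two loci share the same first k nucleotides the later one silently overwrites the earlier one.

```python
import hashlib


def md5_hex(text: str) -> str:
    """Hex MD5 digest of a nucleotide string."""
    return hashlib.md5(text.encode("ascii")).hexdigest()


def build_index(alleles: dict[str, dict[int, str]], k: int = 16) -> dict:
    """alleles maps locus -> {allele number -> full sequence}."""
    hash_to_locus = {}
    for locus, seqs in alleles.items():
        for seq in seqs.values():
            # Hash only the first k nucleotides of each allele.
            hash_to_locus[md5_hex(seq[:k])] = locus
    # Keep the whole sequences too, for the exact allele match later.
    return {"k": k, "hash_to_locus": hash_to_locus, "alleles": alleles}


def search(index: dict, assembly: str) -> dict[str, list[int]]:
    """Return locus -> list of exactly matching allele numbers."""
    k = index["k"]
    calls: dict[str, list[int]] = {}
    for start in range(len(assembly) - k + 1):
        # Find the locus: hash the k-length window and look it up.
        locus = index["hash_to_locus"].get(md5_hex(assembly[start:start + k]))
        if locus is None:
            continue
        hits = calls.setdefault(locus, [])
        # Find the allele: compare full sequences at this position.
        for number, seq in index["alleles"][locus].items():
            if assembly.startswith(seq, start) and number not in hits:
                hits.append(number)
    return calls


# Toy data with k=4 so the example stays readable.
toy_alleles = {
    "locusA": {1: "ACGTAAAA", 2: "ACGTCCCC"},
    "locusB": {1: "TTGACATG"},
}
toy_index = build_index(toy_alleles, k=4)
result = search(toy_index, "GGACGTCCCCGGTTGACAAA")
print(result)  # {'locusA': [2], 'locusB': []}
```

In the toy result, `locusB` has an empty list: its hash was found but no stored allele matched exactly, which corresponds to the `?` case in the output. A locus with more than one entry would correspond to the `~` case.
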
The database is stored as a Perl Storable object, similar to a Python pickle. The data structure has these keys:
- locusArray => [array of locus names]
- locus => associative array of
- allele => associative array of
- settings => information about the database. Stores
`md5_hex` in v0.2 and later).