Question: Set the max coverage wanted with Hisat2 ?
0
gravatar for Chvatil
13 months ago by
Chvatil50
Chvatil50 wrote:

Hello everyone, I'm actually using Hisat in order to map reads against one genome.

Here is the job file :

FILE=Genome.fa
READS1a=SRR9110374_1.fastq.gz
READS2a=SRR9110374_2.fastq.gz


###############################################################
#       SOFTWARES                                             #
##############################################################

# MAPPING HISAT2
HISAT2=/TOOLS/hisat2-2.1.0/


#bedtools
BEDTOOLS=/usr/bin/bedtools

###############################################################
#        MAP SHORT READS TO INFER COVERAGE DEPTH              
##############################################################

# copy on cluster

cd $OUT

# build index
$HISAT2/hisat2-build $FILE mapping_index
# mapping
$HISAT2/hisat2 -k 1 -q -x mapping_index -1 $READS1a -2 $READS2a -S mapping.sam > stats_mapping.txt

But the issue is here :

The R1 and R2 files are relatively huge and when I generate the .sam file it takes a lot of memory space ! I wondered then if Hisat2 was able to specify the threshold coverage to reach ? In other word can we define the number of read we want to map against the genome even if we do not take all the reads present in fastQ files ?

Tanks for your help

hisat2 mapping • 441 views
ADD COMMENTlink modified 13 months ago • written 13 months ago by Chvatil50
2
gravatar for Devon Ryan
13 months ago by
Devon Ryan98k
Freiburg, Germany
Devon Ryan98k wrote:

The -u option can be used to specify how many reads/pairs in total to align. Note however that this will NOT notably decrease memory usage, since the fastq files and the output aren't held in memory. Usually hisat2 requires < 30GB RAM, which is quite reasonable.

ADD COMMENTlink written 13 months ago by Devon Ryan98k

I was talking about the hard disk space sorry. So if I desire a coverage mapping of 10X I have to put -u 10 right ?

ADD REPLYlink written 13 months ago by Chvatil50

No, see my answer. I removed the accepted tag as the question is not solved yet, even though Devon Ryan 's answer is of course perfectly fine from the technical side towards how to limit hisat2 to process only a certain number of reads.

ADD REPLYlink modified 13 months ago • written 13 months ago by ATpoint44k
1
gravatar for ATpoint
13 months ago by
ATpoint44k
ATpoint44k wrote:

It is not that simple. You cannot assume that 10mio reads will give you 10x coverage. It depends on genome size and read length.

Lets calculate theoretical coverage with Lander/Waterman equation: (https://www.illumina.com/documents/products/technotes/technote_coverage_calculation.pdf)

C = LN / G where C stands for coverage, G is the haploid genome length, L is the read length and N is the number of reads.

  • the Apis mellifera genome has a length of about 200Mb, so G=200.000.000
  • the read length is 2x100bp (see here) so L = 200
  • the dataset has 147.505.611 reads so N= 147.505.611

Therfore assuming a perfect world with all reads fully usable and no duplicates: 200 * 147505611 / 200000000 = 147x coverage

So you have 14.7fold more reads than necessary, which is of course the theoretical value. Considering duplicates, low quality reads etc. you might downsample the total read number to like 20%.

Also, use Unix pipes and do not save alignments in SAM but in the much smaller BAM format. BAM is a binary version of SAM and there is no reason to use the much larger SAM format. 99.9% of downstream tools can read BAM. For this do:

$HISAT2/hisat2 -k 1 -q -x mapping_index -1 $READS1a -2 $READS2a | samtools view -o mapping.bam

The | sends the output of hisat2 to so-called stdout which is then processed by the tool behind | and in this case directly saved in the compressed BAM format. Given that BAM is much smaller than BAM you might even try to simply save the entire dataset instead of a subset.

I am also reasonably sure that hisat2 does not send alignment stats to stdout so you would need to capture stderr for stats_mapping.txt using 2>.

ADD COMMENTlink modified 13 months ago • written 13 months ago by ATpoint44k

@ATpoint thank you very much for all this information it was very helpfull!

ADD REPLYlink written 13 months ago by Chvatil50
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1586 users visited in the last hour
_