I'm trying to analyse some RNA-seq data which was handed to me with very little info, only fastq files.
I first trimmed Illumina adapters and then ran HISAT2 against the most recent GRCh38 primary assembly with:
hisat2-build Homo_sapiens.GRCh38.dna.primary_assembly.fa hg38
hisat2 -x hg38/hg38 -1 trimmed_${name1}_R1.fastq.gz -2 trimmed_${name1}_R2.fastq.gz -S ${name1}.sam
and I don't get a particularly great alignment rate (only 50% or so). When I look at the BAM files in IGV, there are only alignments on chr1, 10 and 11.
I've checked with idxstats:
samtools idxstats ${name1}.nodup.srt.bam | head -n 10
1 248956422 10142469 148
10 133797422 4686247 103
11 26070428 2051017 13
* 0 0 15262866
and it confirms that only chr1, 10 and 11 are there. Why is HISAT2 only aligning to chromosomes with 1 in the name? Did I do something incorrect with hisat2-build?
Thanks
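As a rough cross-check of the ~50% figure, the idxstats columns (sequence name, sequence length, mapped count, unmapped count) can be summed up. The sketch below is only an illustration: the function name idxstats_summary is made up here, and because the counts come from the index of a deduplicated BAM, they will not exactly match HISAT2's own alignment summary.

```shell
# Summarise `samtools idxstats` output read from stdin.
# Columns: sequence name, sequence length, mapped count, unmapped count.
# idxstats_summary is a hypothetical helper name, not part of samtools.
idxstats_summary() {
    awk '
        { mapped += $3; unmapped += $4 }      # totals over all rows, including "*"
        $1 != "*" && $3 > 0 { nseq++ }        # named sequences that received reads
        END {
            total = mapped + unmapped
            if (total > 0)
                printf "sequences_with_reads=%d mapped=%.0f unmapped=%.0f mapped_pct=%.1f\n",
                       nseq, mapped, unmapped, 100 * mapped / total
        }'
}
```

Usage would be: samtools idxstats ${name1}.nodup.srt.bam | idxstats_summary. With the four rows posted above this gives 16879733 mapped out of 32142863 in total, i.e. about 52.5%.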
What is the output of the following two commands?
grep '^>' Homo_sapiens.GRCh38.dna.primary_assembly.fa
samtools view -H your.bam
The odd name "nodup.srt" suggests further manipulation after HISAT2, so please show the respective code. This is 99.9% either an incomplete genome file or a code error somewhere after HISAT2.
You're both on the money.
grep '^>' Homo_sapiens.GRCh38.dna.primary_assembly.fa
gives me only 1, 10 and 11. Plus, simply checking the file size: the .fa is only 400 MB. This is solved, thanks both.
I got it from here: https://ftp.ensembl.org/pub/release-108/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz
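For anyone hitting the same symptom, a quick sanity check of the reference before running hisat2-build might look like the sketch below. It assumes Ensembl-style names without a "chr" prefix (1-22, X, Y, MT) and an uncompressed FASTA. The function names fasta_seq_names and missing_chroms are made up for illustration. Note that the idxstats output above already hinted at a truncated file: sequence 11 is reported as 26070428 bp, whereas chromosome 11 in GRCh38 is about 135 Mb.

```shell
# Print the sequence names (first word of each header line) of an
# uncompressed FASTA file. Hypothetical helper, not part of any tool.
fasta_seq_names() {
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$1"
}

# Print every expected primary chromosome that is absent from the FASTA.
# Assumption: Ensembl naming (no "chr" prefix); adjust the list for UCSC-style files.
missing_chroms() {
    expected="1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y MT"
    present=$(fasta_seq_names "$1")
    for c in $expected; do
        # -x: whole-line match, -F: fixed string, -q: quiet
        printf '%s\n' "$present" | grep -qxF "$c" || printf '%s\n' "$c"
    done
}
```

Running missing_chroms Homo_sapiens.GRCh38.dna.primary_assembly.fa should print nothing for a complete primary assembly. For the compressed download, gzip -t on the .fa.gz reports an error if the file was cut short, which is a cheap first check after an interrupted transfer. A complete human primary assembly is roughly 3 GB uncompressed, so a 400 MB .fa is a red flag on its own.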