Question

Using featureCounts on paired-end reads aligned with HISAT2 resulting in 0 assigned fragments

1

Entering edit mode

3.6 years ago

turo.1 ▴ 20

I am attempting to do an RNA-seq analysis with 3 SRA files, using HISAT2 to align and then featureCount to quantify transcripts. Organism is Arabidopsis thaliana. I'm making this post because I'm having a lot of trouble trying to quantify transcripts with featureCount, where all of the reads are unassigned and the program output contains 0 assigned reads. Here's what I've done so far:

First I downloaded the tair10 reference genome which I built into an index with:

hisat2-build -p 16 genome.fa genome

I entered my paired-end reads in with the command:

hisat2 -x <path_to_index> -1 trimmed_paired_SRR_1.fastq -2 trimmed_paired_SRR_2.fastq -U trimmed_unpaired_SRR_1.fastq,trimmed_unpaired_SRR_2.fastq -S hit_SRR

This resulted in SAM output that looked like this (feels important to include this though I'm not sure whether there's a better way to format it for comprehensibility):

more hit_SRR.sam

@HD VN:1.0  SO:unsorted 
@SQ SN:1    LN:30427671 
@SQ SN:2    LN:19698289 
@SQ SN:3    LN:23459830 
@SQ SN:4    LN:18585056 
@SQ SN:5    LN:26975502
@SQ SN:mitochondria LN:366924
@SQ SN:chloroplast  LN:154478
@PG ID:hisat2   PN:hisat2   VN:2.1.0    CL:"hisat2 -x <path_to_index> -1 trimmed_paired_SRR_1.fastq -2 trimmed_paired_SRR_2.fastq -U trimmed_unpaired_SRR_1.fastq,trimmed_unpaired_SRR_2.fastq -S hit_SRR"
SRR.1   83  1   28116855    60  17M163N133M =   28116833    -291    TTCAGTCCATGGAACTCCTCGTTTCCTCTCGATTTTGCCGTTAGAAGAAGACAT
GGGAAGAGCTTCATCCGCCGAGGCGTAACCGGCCGGGGTTTTGTTCATATCTTGTTCATCTCTATCTTCGCCGTCGATCTTTGGAGTCTCCGCCGT    FFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFF,FFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF::FFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF    AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:1
50  YS:i:0  YT:Z:CP XS:A:-  NH:i:1
SRR.1   163 1   28116833    60  39M163N111M =   28116855    291 GCAAGAACAGCTTGTGTTCTTCTTCAGTCCATGGAACTCCTCGTTTCCTCTCGA
TTTTGCCGTTAGAAGAAGACATGGGAAGAGCTTCATCCGCCGAGGCGTAACCGGCCGGGGTTTTGTTCATATCTTGTTCATCTCTATCTTCGCCGT    FFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFF,FFFFFFFFFFF:FFFFFF
FFFFFFFFFFFFFFFFFFFF:FFFFF,FFFFFFFFFFFFFFFFFFFFF:,FFFFFFFFFFF,FFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF    AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:1
50  YS:i:0  YT:Z:CP XS:A:-  NH:i:1
SRR.2   99  5   20218748    60  1S135M76N14M    =   20218806    167 TTATCTTAAAGCGAAATAGTATAGTGGAAAGCTTAGACTTTGGTACAGACGTTA
GCAGCCAGTTGGAGTTTGTGGATTTACAATACAATGAAATAACTGATTATAAACCATCAGCTAATAAAGTCCTCCAAGTAATATTGGCAAATAATC    FFFFFFFFFFFFFF:FFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF    AS:i:-1 XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:1
49  YS:i:0  YT:Z:CP XS:A:+  NH:i:1

Then I converted the output SAM to sorted BAM with

samtools view -S -b hit_SRR.sam > hit_SRR.bam

samtools sort hit_SRR.bam -o hit_SRR.sorted.bam

Then I tried to quantify my transcripts with featureCounts, which is where I'm having a lot of trouble.

First, I understand featureCounts requires a full annotation file for the reference genes in gtf input and I downloaded the Araport11 annotations gtf that is based on the TAIR10 genome that I used for my index. This file looks like this:

head araport11_genes.gtf 
Chr1    Araport11   five_prime_UTR  3631    3759    .   +   .   transcript_id "AT1G01010.1"; gene_id "AT1G01010";
Chr1    Araport11   exon    3631    3913    .   +   .   transcript_id "AT1G01010.1"; gene_id "AT1G01010";
Chr1    Araport11   transcript  3631    5899    .   +   .   transcript_id "AT1G01010.1"; gene_id "AT1G01010";
Chr1    Araport11   CDS 3760    3913    .   +   0   transcript_id "AT1G01010.1"; gene_id "AT1G01010";
Chr1    Araport11   exon    3996    4276    .   +   .   transcript_id "AT1G01010.1"; gene_id "AT1G01010";
Chr1    Araport11   CDS 3996    4276    .   +   2   transcript_id "AT1G01010.1"; gene_id "AT1G01010";
Chr1    Araport11   exon    4486    4605    .   +   .   transcript_id "AT1G01010.1"; gene_id "AT1G01010";
Chr1    Araport11   CDS 4486    4605    .   +   0   transcript_id "AT1G01010.1"; gene_id "AT1G01010";
Chr1    Araport11   exon    4706    5095    .   +   .   transcript_id "AT1G01010.1"; gene_id "AT1G01010";
Chr1    Araport11   CDS 4706    5095    .   +   0   transcript_id "AT1G01010.1"; gene_id "AT1G01010";

So, my usage of featureCounts is:

featureCounts -T 4 -p -a araport11_genes.gtf -o filename.featureCounts.txt hit_SRR1.sorted.bam hit_SRR2.sorted.bam hit_SRR3.sorted.bam

This yields the output:

more filename.featureCounts.txt.summary 
Status  hit_SRR1.sorted.bam hit_SRR2sorted.bam  hit_SRR3.sorted.bam
Assigned    0   0   0
Unassigned_Ambiguity    0   0   0
Unassigned_MultiMapping 3696122 1999717 1862896
Unassigned_NoFeatures   26964381    30128594    20699201
Unassigned_Unmapped 151924  155922  111851
Unassigned_MappingQuality   0   0   0
Unassigned_FragmentLength   0   0   0
Unassigned_Chimera  0   0   0
Unassigned_Secondary    0   0   0
Unassigned_Nonjunction  0   0   0
Unassigned_Duplicate    0   0   0

And the more extended output:

more <outputfile>

        ==========     _____ _    _ ____  _____  ______          _____  
        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
        ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
    v1.5.0-p2

//========================== featureCounts setting ===========================\\
||                                                                            ||
||             Input files : 3 BAM files                                      ||
||                           P hit_SRR1.sorted.bam                     ||
||                           P hit_SRR2.sorted.bam                     ||
||                           P hit_SRR3.sorted.bam                     ||
||                                                                            ||
||             Output file : filename.featureCounts-rfmt_options.txt       ||
||                 Summary : filename.featureCounts-rfmt_options.txt.s ... ||
||              Annotation : araport11_genes.gtf (GTF)                        ||
||                                                                            ||
||                 Threads : 4                                                ||
||                   Level : meta-feature level                               ||
||              Paired-end : yes                                              ||
||         Strand specific : no                                               ||
||      Multimapping reads : not counted                                      ||
|| Multi-overlapping reads : not counted                                      ||
||                                                                            ||
||          Chimeric reads : counted                                          ||
||        Both ends mapped : not required                                     ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file araport11_genes.gtf ...                               ||
||    Features : 323164                                                       ||
||    Meta-features : 38113                                                   ||
||    Chromosomes/contigs : 7                                                 ||
||                                                                            ||
|| Process BAM file hit_SRR1.sorted.bam...                             ||
||    Paired-end reads are included.                                          ||
||    Assign fragments (read pairs) to features...                            ||
||                                                                            ||
||    WARNING: reads from the same pair were found not adjacent to each       ||
||             other in the input (due to read sorting by location or         ||
||             reporting of multi-mapping read pairs).                        ||
||                                                                            ||
||    Read re-ordering is performed.                                          ||
||                                                                            ||
||    Both single-end and paired-end reads were found.                        ||
||    Total fragments : 30812427                                              ||
||    Successfully assigned fragments : 0 (0.0%)                              ||
||    Running time : 0.64 minutes                                             ||
||                                                                            ||
|| Process BAM file hit_SRR2.sorted.bam...                             ||
||    Paired-end reads are included.                                          ||
||    Assign fragments (read pairs) to features...                            ||
||                                                                            ||
||    WARNING: reads from the same pair were found not adjacent to each       ||
||             other in the input (due to read sorting by location or         ||
||             reporting of multi-mapping read pairs).                        ||
||                                                                            ||
||    Read re-ordering is performed.                                          ||
||                                                                            ||
||    Total fragments : 32284233                                              ||
||    Successfully assigned fragments : 0 (0.0%)                              ||
||    Running time : 0.65 minutes                                             ||
||                                                                            ||
|| Process BAM file hit_SRR3.sorted.bam...                             ||
||    Paired-end reads are included.                                          ||
||    Assign fragments (read pairs) to features...                            ||
||                                                                            ||
||    WARNING: reads from the same pair were found not adjacent to each       ||
||             other in the input (due to read sorting by location or         ||
||             reporting of multi-mapping read pairs).                        ||
||                                                                            ||
||    Read re-ordering is performed.                                          ||
||                                                                            ||
||    Total fragments : 22673948                                              ||
||    Successfully assigned fragments : 0 (0.0%)                              ||
||    Running time : 0.45 minutes                                             ||
||                                                                            ||
||                         Read assignment finished.                          ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

So, can anyone help me understand why I'm getting exactly 0 assigned reads?

I think it's a somewhat simple fix that I haven't realized yet, as it's not a problem of only some reads being assigned- the problem is that zero are being assigned. Yet some of them are mapping to multiple places and most are mapping to no features- how could that be happening? I don't see how that's possible given my SAM that is specifying fragment aligning locations and my reference annotation GTF that specifies the annotation location information?

It seems likely to me that I have:

Failed to provide any readers with sufficient info about some facet of this that I don't realize is important- please let me know what details might help anyone that can give me advice.
Overlooked some simple solution that's causing all of my problems (optimistic).

Finally, I have tried googling this problem. Relevant threads that didn't help me solve my problem include:

Thanks for any help or insights!

RNA-Seq alignment rna-seq gene Assembly • 7.5k views

ADD COMMENT • link 3.6 years ago by turo.1 ▴ 20

0

Entering edit mode

not sure this is the cause of the issue(s), but in the samtools sort step, make sure you order them on name rather than position.

ADD REPLY • link 3.6 years ago by lieven.sterck 15k

0

Entering edit mode

Hi Lieven.Sterck, I hadn't heard about this before. I googled the question and found this biostars thread, but I don't think I understand the downstream significance of sorting by chromosomal position vs read name. It seems like sometimes it's better one way and other times it's better the other. Am I right in supposing that for differential expression analysis, it's better to sort by read name because it retains read pairing, which is important for only counting a fragment once rather than twice if the reads are separated? Thanks for your comment

ADD REPLY • link 3.6 years ago by turo.1 ▴ 20

1

Entering edit mode

I think Brian's answer is pretty clear though.

in a nutshell: if the goal is to display reads on a chromosome fragment (eg in a genome browser or such) you want position sorting as it is important (or more efficient) that the browser can quickly access all reads for a certain position. If the goal is to check for correct pairing (eg. DEG analysis, counting the numbers of correct mapped reads, ... ) then you are better of with name sorting as this makes the whole process more efficient (same read names will be 'next to each other' in the sorted file).

You do not have the risk of counting them twice (at least not for most modern tools) but as said it's just more efficient, moreover some tools require it because they don't spend the time looking for the 'mate' further down the file

ADD REPLY • link 3.6 years ago by lieven.sterck 15k

score 2 · Answer 1 · 2020-09-30

2

Entering edit mode

3.6 years ago

swbarnes2 14k

From your sam file, you can see that your chromosome names from the file you aligned to are bare numbers. In your gft, the chromosomes have "Chr"

Your reference fastq and gtf must match for any of this to work.

It's the same problem identified in one of your links.

I confess I don't quite understand what the difference is between Araport and TAIR, but the TAIR link has both a genome and a gff; are you totally sure you shouldn't be using the gff from here?

https://www.arabidopsis.org/download/index-auto.jsp?dir=%2Fdownload_files%2FGenes%2FTAIR10_genome_release

ADD COMMENT • link 3.6 years ago by swbarnes2 14k

1

Entering edit mode

Hi swbarnes2, thanks for your answer. It led me to the correct solution that as you observed had been in front of me the whole time! In my defense (I'm going to edit my original post to reflect this) I had try renaming the GTF chromosome names but that didn't solve the problem. However, adjusting the header of my BAM files in the way suggested in my linked question in Devon Ryan's comment to rename the header with:

samtools view -H your_file.bam > header
...edit the header in your favourite text editor ...
samtools reheader header your_file.bam > fixed.bam

worked perfectly and solved my assignment problem on the first try.

Also, no worries in re the difference between Araport and TAIR. Just for the sake of this serving as a potential reference to someone that comes across this record I'll clarify: TAIR is a consortium that publishes the official Arabidopsis reference genome and Araport is a similar team that produces reference annotations derived from the TAIR genome. While TAIR also provides reference annotation files, as of this writing the most-recent Araport annotations are newer than the TAIR ones that are nearly a decade old.

Thanks again for your response

ADD REPLY • link 3.6 years ago by turo.1 ▴ 20

score 1 · Answer 2 · 2020-09-30

The output snippets from the bam file and the annotation file indicates both use a different notation for the chromosome names: the annotation uses chr1, chr2, etc; the reference genome uses 1, 2, etc.

I remember reading featureCounts is smart enough to detect when the annotation and reference genome differ in chromosome name convention and one uses 1, 2, and so on, and the other uses chr1, chr2, and so on. But I am not sure about this (I can't find where I've read this), thus as swbarnes2 pointed out, this may be the cause of your problem.

You can run featureCounts with the --verbose flag, it will provide additional output that will confirm if this is indeed the problem:

--verbose           Output verbose information for debugging, such as un-
                    matched chromosome/contig names.

If such is the case, you can fix the annotation with awk, or use the -A parameter for featureCounts:

-A <string>         Provide a chromosome name alias file to match chr names in
                    annotation with those in the reads. This should be a two-
                    column comma-delimited text file. Its first column should
                    include chr names in the annotation and its second column
                    should include chr names in the reads. Chr names are case
                    sensitive. No column header should be included in the file.

As the log output indicates (and the manual states), featureCounts tries hard to make things easy for the user, so:

1) featureCounts read counting defaults to unstranded - which you are specifying by omission, as you didn't use the -s parameter.

2) featureCounts detects you are mixing single and paired-end reads in the same bam file (I would not recommend this, in particular as it seems the single-end reads are just left-over from the trimming process, and should account for a very small percentage of the reads). In addition, you are using the -p flag, which makes paired reads be counted twice:

-p                  If specified, fragments (or templates) will be counted
                    instead of reads. This option is only applicable for
                    paired-end reads; single-end reads are always counted as
                    reads.

I can't think of a scenario where this would be useful.

3) featureCounts also detects the bam is not sorted by name, so it automatically sorts the reads by name before counting them. However, if the only analysis you are going to perform is differential expression, there is no need to sort the bam files by position.

edit: finally, it seems you are using an fairly old version of the Subread package, as Subread 2.0.1 is the current version .