[RNAseq, featureCounts] Can paired-end data be processed as single end?
0
0
Entering edit mode
7 weeks ago
Texx • 0

I'm doing an RNAseq analysis and encounter an error when trying to perform featureCounts using Subread tool.

The error is the following:

and this is the command and the parameters:

featureCounts -T 8 -p -a Salmonella.gff3 -t CDS,exon,rRNA,tmRNA,tRNA -g ID -o counts.txt 1.bam


however if I omit the parameter -p the data is processed, but I'm not sure if what I receive is correct (e.g. double counting of the reads?).

A bit of background:

• the sequencing data was paired-end
• the mapping of the sequenced reads to respective genome was performed using bowtie2; 1 output BAM file was produced (I suppose paired-end reads were merged into 1 file) from 2 input files containing paired end reads

So my question is whether is it correct to omit the -p parameter and process the data as single end reads? If not, what should I do?

paired-end RNAseq featureCounts • 796 views
1
Entering edit mode

You should find out why that happens. Please show the bowtie command, the error is probably in there.

0
Entering edit mode

Actually I now realized that even with the single-end processing it didn't work, there were 0 successfully assigned alignements.

|| Process BAM file 1.bam...
|| || Single-end reads are included.
|| || Total alignments : 43134536
|| || Successfully assigned alignments : 0 (0.0%)
|| || Running time : 0.22 minutes

Here's the bowtie command:

bowtie2 -p 8 -x sal_index/salmonella -U 1.fq.gz | samtools view -@ 8 -Sb -o 1.bam


1.fq.gz already contains merged paired-end reads and was obtained before via BBsplit (because of slight contamination of sequencing reads), i.e. decon_sal.fq.gz was renamed to 1.fq.gz and fed to bowtie2.

java -ea -Xmx14g -cp D:\RNAseq\Software\bbmap\current\ align2.BBSplitter in1=trimmed_1.fq.gz in2=trimmed_2.fq.gz ref_sal=salmonella.fa ref_ent=enterobacter.fa out_sal=decon_sal.fq.gz out_ent=decon_ent.fq.gz outu1=clean_1.fq outu2=clean_2.fq threads=8 ow=t fastareadlen=500 minhits=1 minratio=0.56 maxindel=16000 refstats=log.txt

0
Entering edit mode

And that's the counts.txt.summary from single-end processing

Status 1.bam Assigned 0 Unassigned_Unmapped 122521 Unassigned_Read_Type 0 Unassigned_Singleton 0 Unassigned_MappingQuality 0 Unassigned_Chimera 0 Unassigned_FragmentLength 0 Unassigned_Duplicate 0 Unassigned_MultiMapping 0 Unassigned_Secondary 0 Unassigned_NonSplit 0 Unassigned_NoFeatures 43012015 Unassigned_Overlapping_Length 0 Unassigned_Ambiguity 0

1
Entering edit mode

Assuming you are using a new version of featureCounts you should also add the second option below to make sure reads are counted as pairs.

-p                  Specify that input data contain paired-end reads. To
perform fragment counting (ie. counting read pairs), the
'--countReadPairs' parameter should also be specified in

is only applicable for paired-end reads.

0
Entering edit mode

I still receive the same error, please check my comment above, could it be that there's something wrong with the BAM file?

0
Entering edit mode

Are the chromosome names matching in your BAM file/GTF etc? That is generally the first thing to check.

You say that you decontaminated the reads but what was the contamination from? Bacterial genomes can be very similar so try to bbsplit the data may not work that well. Looking at the command above out_sal=decon_sal.fq.gz would actually contain reads that aligned to the Salmonella genome. clean* reads are actually reads that did not align to both Sal/Enterobacter genomes. I don't think that is what you wanted correct?

merged as in actually merged to create a long single read or interleaved to create a single file from R1/R2 reads?

How many reads did you lose in the process? What kind of alignment % did you get from your alignment to reference?

0
Entering edit mode

You say that you decontaminated the reads but what was the contamination from? Bacterial genomes can be very similar so try to bbsplit the data may not work that well. Looking at the command above out_sal=decon_sal.fq.gz would actually contain reads that aligned to the Salmonella genome. clean* reads are actually reads that did not align to both Sal/Enterobacter genomes. I don't think that is what you wanted correct?

The sample was composed mainly of Salmonella and slightly contaminated with Enterobacter; I wanted only Salmonella genome so I kept only the reads that aligned to Salmonella genome. That should be fine?

merged as in actually merged to create a long single read or interleaved to create a single file from R1/R2 reads?

Interleaved would be the correct term, at least that's what I intended. From what I understood this command produces interleaved decon_sal.fq.gz file from R1, R2 reads in1=trimmed_1.fq.gz in2=trimmed_2.fq.gz

java -ea -Xmx14g -cp D:\RNAseq\Software\bbmap\current\ align2.BBSplitter in1=trimmed_1.fq.gz in2=trimmed_2.fq.gz ref_sal=salmonella.fa ref_ent=enterobacter.fa out_sal=decon_sal.fq.gz out_ent=decon_ent.fq.gz outu1=clean_1.fq outu2=clean_2.fq threads=8 ow=t fastareadlen=500 minhits=1 minratio=0.56 maxindel=16000 refstats=log.txt


Here's the log from BBSplit, so I guess the majority of reads were preserved.

 #name  %unambiguousReads   unambiguousMB   %ambiguousReads ambiguousMB unambiguousReads    ambiguousReadsassignedReads assignedBases
sal   97.73615    4225.287835 1.83828 79.340809   42344244    796438  42344244    4225287835
ent   0.02091 0.903353    1.83828 79.340809   9058    796438  805496  80244162


Regarding the alignment to the reference I seem to have lost the stats file somewhere but it was very high (the reference Salmonella genome and the employed strain of Salmonella bacteria are the same).

I also have a feeling that the names might not be matching, I did this:

samtools view -H S16/1.bam


@HD VN:1.5 SO:unsorted GO:query @SQ SN:6d4d1407f5594ab6_1 LN:4869951 @SQ SN:6d4d1407f5594ab6_2 LN:93832 @PG ID:bowtie2 PN:bowtie2 VN:2.4.5
CL:"/mnt/d/RNAseq/Software/bowtie2-2.4.5-linux-x86_64/bowtie2-align-s --wrapper basic-0 -p 8 -x /mnt/d/RNAseq/Software/resources/sal_index/salmonella -S /mnt/d/RNAseq/Decontaminated/S16/S1.sam -U /mnt/d/RNAseq/Decontaminated/S16/1.fq.gz" @PG ID:samtools
PN:samtools PP:bowtie2 VN:1.15.1 CL:samtools view -@ 8 -Sb -o /mnt/d/RNAseq/Decontaminated/S16/1.bam /mnt/d/RNAseq/Decontaminated/S16/1.sam @PG ID:samtools.1
PN:samtools PP:samtools VN:1.15.1 CL:./samtools view -H /mnt/d/RNAseq/Decontaminated/S16/1.bam

and this is the start of the GFF3 file

# Input: /data/dnb06/galaxy_db/files/8/7/f/dataset_87fa0561-f6b9-4182-b712-638105f488ad.dat
##gff-version 3
##sequence-region unknown 1 4869951
# conversion-by bp_genbank2gff3.pl
# organism unknown
# Note Strain strain.
# working on contig:unknown, unknown, Strain strain.
1   GenBank contig  1   4869951 .   +   1   ID=unknown;Name=unknown;Note=Strain strain.,Annotated using prokka 1.14.0 from https://github.com/tseemann/prokka. ;comment1=Annotated using prokka 1.14.0 from https://github.com/tseemann/prokka. ;mol_type=genomic DNA;organism=unknown;strain=strain


But that still doesn't explain why featureCounts doesn't detect the bam file as paired-end, any idea about that?

0
Entering edit mode

If you did not tell bowtie2 that the reads were interleaved then they were simply treated as single end reads. So the BAM file that you currently have is incorrect.

You can either re-do the alignments with correct flag that ATPoint mentions (exists) or split the interleaved reads into two files by doing

reformat.sh -Xmx2g in=Sal.fq.gz out1=R1.fq.gz out2=R2.fq.gz.


You can then use these files as input -1 and -2 to bowtie.

You could have simply aligned the data using bbmap.sh instead of moving to bowtie (as long as you have samtools available BBMap will directly create BAM files).

bbmap.sh -Xmx4g in=Sal.fq.gz out=aligned.bam ref=salmonella_ref.fa maxindel=0 ambig=random


We still can't see what the chromosome name is. What does following show?

grep "^>" salmonella_ref.fa


The chromosome name you see there, is it the same in your GTF file and your BAM?

0
Entering edit mode

Alright, I'll try to fix the problem regarding the alignment of paired-end data.

Regarding the

grep "^>" salmonella_ref.fa

this is what I get:

>6d4d1407f5594ab6_1 assembly_id="6d4d1407f5594ab6" genome_id="814482244337467e" atcc_catalog_number="ATCC 14028" species="Salmonella enterica" contig_number="1" topology="circular"

>6d4d1407f5594ab6_2 assembly_id="6d4d1407f5594ab6" genome_id="814482244337467e" atcc_catalog_number="ATCC 14028" species="Salmonella enterica" contig_number="2" topology="circular"

0
Entering edit mode

Those ID's are pretty ugly and they are likely not matching your annotation file. They are likely to break something in this myriad chain of programs.

If you are going to redo the alignments then make sure your chromosome names are simplified and match what is in the annotation.

0
Entering edit mode

The IDs were indeed a problem, the processing worked after changing them. Thanks for all the help!

0
Entering edit mode

I do not know bbsplit and how it returns reads but in general, make sure that if you input a paired-end fastq set with two files (R1 and R2) that you get in return two files with the cleaned reads. I do not know what "merged" here means. If it means you have in a single file alternating R1-R2-R1-R2 etc (called interleaved) then you have to tell bowtie that it is interleaved. It has a flag to read that format, check its manual for it. Then as genomax says make sure the chromosome names in bam and gtf match.

0
Entering edit mode

I added another comment above in case it helps further. But you would then suggest that the problem with featureCounts is because bowtie2 didn't have "interleaved" specified but -U instead (unpaired reads)?

How to verify that the names in BAM and GFF3/GTF match?