Rseqc infer_experiment.py: 0 usable reads sampled and unknown data type
0
0
Entering edit mode
21 months ago
pubsurfted ▴ 40

Hello,

I was trying to use rseqc to check strandedness of my fastq files. It requires two inputs: an aligned SAM file and a BED file. To create the SAM file, I aligned my reads using bowtie2.

bowtie2 -q --phred33 --threads 2 -x index -1 SRR5655560_1.atria.fastq -2 SRR5655560_2.atria.fastq -S SRR5655560.sam

This is how the first few lines of the SAM file looks like :

@HD VN:1.0  SO:unsorted
@SQ SN:KRH76248 LN:1234
@SQ SN:KRH76252 LN:915
@SQ SN:KRH76250 LN:1578
@SQ SN:KRH76249 LN:1578
@SQ SN:KRH76247 LN:762
@SQ SN:KRH76251 LN:915
@SQ SN:KRH77790 LN:756
@SQ SN:KRH77789 LN:1433
@SQ SN:KRH76519 LN:2413

To create the BED file, I first downloaded the Glycine_max.Glycine_max_v2.1.55.gtf from Ensembl Plants.This is how the first few lines of the gtf looks like:

#!genome-build Glycine_max_v2.1
#!genome-version Glycine_max_v2.1
#!genome-date 2018-07
#!genome-build-accession GCA_000004515.4
#!genebuild-last-updated 2018-08
18  ena gene    7595306 7600287 .   -   .   gene_id "GLYMA_18G079000"; gene_source "ena"; gene_biotype "protein_coding";
18  ena transcript  7595306 7600287 .   -   .   gene_id "GLYMA_18G079000"; transcript_id "KRG98527"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; tag "Ensembl_canonical";
18  ena exon    7600155 7600287 .   -   .   gene_id "GLYMA_18G079000"; transcript_id "KRG98527"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "KRG98527-1"; tag "Ensembl_canonical";
18  ena CDS 7600155 7600287 .   -   0   gene_id "GLYMA_18G079000"; transcript_id "KRG98527"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; protein_id "KRG98527"; tag "Ensembl_canonical";
18  ena start_codon 7600285 7600287 .   -   0   gene_id "GLYMA_18G079000"; transcript_id "KRG98527"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; tag "Ensembl_canonical";

and converted it to BED using gtf2bed.py from https://github.com/signalbash/how_are_we_stranded_here/blob/master/how_are_we_stranded_here/gtf2bed.py

gtf2bed --gtf Glycine_max.Glycine_max_v2.1.55.gtf --bed output.bed

This is how the BED file looks like:

   18   7595305 7600287 KRG98527    0   -   7595305 7600287 0   6   846,344,141,156,219,133,    0,2159,2873,3601,4344,4849,
    18  7699667 7702882 KRG98546    0   +   7699667 7702882 0   2   1061,799,   0,2416,
    18  51277502    51280090    KRH00628    0   -   51277502    51280090    0   5   337,115,121,66,135, 0,670,1746,2034,2453,
    18  51277806    51280022    KRH00629    0   -   51277806    51280022    0   5   33,115,127,66,67,   0,366,1442,1730,2149,
    18  47409830    47414924    KRH00168    0   -   47409830    47414924    0   10  693,57,78,88,84,63,74,71,67,605,    0,1102,1328,1490,1712,1878,2092,3112,3388,4489,
    18  47409830    47414924    KRH00169    0   -   47409830    47414924    0   11  693,57,78,88,84,56,74,118,71,67,605,    0,1102,1328,1490,1712,1878,2092,2287,3112,3388,4489,
    18  7625583 7632431 KRG98532    0   +   7625583 7632431 0   6   1280,95,123,807,217,882,    0,3342,3538,4049,5666,5966,
    18  7625542 7632431 KRG98535    0   +   7625542 7632431 0   6   1321,95,123,807,217,882,    0,3383,3579,4090,5707,6007,
    18  7625542 7632431 KRG98533    0   +   7625542 7632431 0   6   1321,95,123,807,217,882,    0,3383,3579,4090,5707,6007,
    18  7625542 7632431 KRG98531    0   +   7625542 7632431 0   6   1321,95,123,807,217,882,    0,3383,3579,4090,5707,6007,

I've browsed several similar posts and noticed that my SAM and BED files do not look as "neat" in comparison. I'd like to know where I'm going wrong.

Best wishes!

edit: formatting

rseqc infer_experiment.py RNA-seq • 1.8k views
ADD COMMENT
1
Entering edit mode

Answers on this post suggest there the chromosome name in your SAM file and those in your GTF do not match. For example the GTF says '18' and the SAM my say 'chr18'. Might be worth checking :)

ADD REPLY
0
Entering edit mode

Thanks for replying! the following SAM file output:

@HD VN:1.0  SO:unsorted
@SQ SN:KRH76248 LN:1234
@SQ SN:KRH76252 LN:915
@SQ SN:KRH76250 LN:1578
@SQ SN:KRH76249 LN:1578
@SQ SN:KRH76247 LN:762
@SQ SN:KRH76251 LN:915
@SQ SN:KRH77790 LN:756
@SQ SN:KRH77789 LN:1433
@SQ SN:KRH76519 LN:2413

I have two questions: 1) where is the chromosome number here? the third columns? 2) should I sort my sam file before giving it to infer_experiemnt.py

ADD REPLY
1
Entering edit mode

KRH77790

(names after SN:) is the chromosome name (and LN is the length). You can't use a reference and a GTF that came from two different sources.

ADD REPLY
0
Entering edit mode

Thank you for the clarification. But what do you mean by reference and GTF from two different sources. I'm using the reference genome and gtf file for Glycine_max.Glycine_max_v2.1.

ADD REPLY
1
Entering edit mode

Did you check all SQ lines to see if you have an entry for I guess chromosome 18?

Chromosome name looks like 18 in your BED file but is there a @SQ SN:18 LN:NNNNN line in your alignment file?

BTW: Have you tried running infer_experiment.py with files you have. Perhaps there is nothing wrong with them. Do not go on how they "look".

ADD REPLY
0
Entering edit mode

GenoMax Hello, I rechecked the sources for ref genome and gtf. You were correct! You have no idea how thankful I am for people like you for taking the time to help a bioinfo noob. Again, thank you so much!

ADD REPLY
0
Entering edit mode

What turned out to be the issue ultimately?

ADD REPLY
0
Entering edit mode

My dumbass was using a transcripts cdna.fasta file instead of dna.top_level.fasta to build the bowtie2 index.

ADD REPLY
0
Entering edit mode

Hi pubsurfted,

I ran into the same issue using a reference transcriptome with HISAT2 from the Ensembl database. For others with similar experience, I will share details. For infer_experiment.py from the rseqc package, I passed .sam files generated from aligning (via HISAT2) my PE bulk RNA-sequencing reads to Ensembl's reference transcriptome (Homo_sapiens.GRCh38.cdna.all.fa) and a .bed file converted from Ensembl's gene annotation file (Homo_sapiens.GRCh38.112.gtf). When passing the .sam file and .bed file to infer_experiment.py, I receive this error:

Reading reference gene model Homo_sapiens.GRCh38.112.bed ... Done
Loading SAM/BAM file ... Finished Total 0 usable reads were sampled
Unknown Data Type

As HISAT2 is specifically designed for aligning to a genome, aligning to the transcriptome will lead to problematic .sam files (i.e., may not be recognized as a known data type by programs expecting .sam files). As GenoMax described, the second column has chromosome names. In the context of Ensembl and HISAT2, a .sam file made from aligning to a reference transcriptome and sorted by read names (via Samtools) has a header like this:

@HD VN:1.0  SO:queryname  SS:queryname:natural  
@SQ SN:ENST00000476239.1 LN:33
@SQ SN:ENST00000343417.1 LN:21
@SQ SN:ENST00000343411.1 LN:8
@SQ SN:ENST00000343411.1 LN:15
@SQ SN:ENST00000360411.1 LN:16

On the contrary, a sorted .sam file made from aligning to a reference genome appears like this:

@HD VN:1.0  SO:queryname  SS:queryname:natural 
@SQ SN:4 LN:334566891
@SQ SN:22 LN:236024378
@SQ SN:1 LN:82435658
@SQ SN:19 LN:83400351
@SQ SN:7 LN:16160345

Alignment to the Ensembl genome (Homo_sapiens.GRCh38.dna.primary_assembly.fa) via HISAT2 resolved the error, generating appropriate .sam files that were recognized by rseqc programs.

Hope this clarifies this important detail for others.

Maze

ADD REPLY

Login before adding your answer.

Traffic: 1179 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6