Question

Rseqc infer_experiment.py: 0 usable reads sampled and unknown data type

0

Entering edit mode

16 months ago

pubsurfted ▴ 40

Hello,

I was trying to use rseqc to check strandedness of my fastq files. It requires two inputs: an aligned SAM file and a BED file. To create the SAM file, I aligned my reads using bowtie2.

bowtie2 -q --phred33 --threads 2 -x index -1 SRR5655560_1.atria.fastq -2 SRR5655560_2.atria.fastq -S SRR5655560.sam

This is how the first few lines of the SAM file looks like :

@HD VN:1.0  SO:unsorted
@SQ SN:KRH76248 LN:1234
@SQ SN:KRH76252 LN:915
@SQ SN:KRH76250 LN:1578
@SQ SN:KRH76249 LN:1578
@SQ SN:KRH76247 LN:762
@SQ SN:KRH76251 LN:915
@SQ SN:KRH77790 LN:756
@SQ SN:KRH77789 LN:1433
@SQ SN:KRH76519 LN:2413

To create the BED file, I first downloaded the Glycine_max.Glycine_max_v2.1.55.gtf from Ensembl Plants.This is how the first few lines of the gtf looks like:

#!genome-build Glycine_max_v2.1
#!genome-version Glycine_max_v2.1
#!genome-date 2018-07
#!genome-build-accession GCA_000004515.4
#!genebuild-last-updated 2018-08
18  ena gene    7595306 7600287 .   -   .   gene_id "GLYMA_18G079000"; gene_source "ena"; gene_biotype "protein_coding";
18  ena transcript  7595306 7600287 .   -   .   gene_id "GLYMA_18G079000"; transcript_id "KRG98527"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; tag "Ensembl_canonical";
18  ena exon    7600155 7600287 .   -   .   gene_id "GLYMA_18G079000"; transcript_id "KRG98527"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "KRG98527-1"; tag "Ensembl_canonical";
18  ena CDS 7600155 7600287 .   -   0   gene_id "GLYMA_18G079000"; transcript_id "KRG98527"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; protein_id "KRG98527"; tag "Ensembl_canonical";
18  ena start_codon 7600285 7600287 .   -   0   gene_id "GLYMA_18G079000"; transcript_id "KRG98527"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; tag "Ensembl_canonical";

and converted it to BED using gtf2bed.py from https://github.com/signalbash/how_are_we_stranded_here/blob/master/how_are_we_stranded_here/gtf2bed.py

gtf2bed --gtf Glycine_max.Glycine_max_v2.1.55.gtf --bed output.bed

This is how the BED file looks like:

   18   7595305 7600287 KRG98527    0   -   7595305 7600287 0   6   846,344,141,156,219,133,    0,2159,2873,3601,4344,4849,
    18  7699667 7702882 KRG98546    0   +   7699667 7702882 0   2   1061,799,   0,2416,
    18  51277502    51280090    KRH00628    0   -   51277502    51280090    0   5   337,115,121,66,135, 0,670,1746,2034,2453,
    18  51277806    51280022    KRH00629    0   -   51277806    51280022    0   5   33,115,127,66,67,   0,366,1442,1730,2149,
    18  47409830    47414924    KRH00168    0   -   47409830    47414924    0   10  693,57,78,88,84,63,74,71,67,605,    0,1102,1328,1490,1712,1878,2092,3112,3388,4489,
    18  47409830    47414924    KRH00169    0   -   47409830    47414924    0   11  693,57,78,88,84,56,74,118,71,67,605,    0,1102,1328,1490,1712,1878,2092,2287,3112,3388,4489,
    18  7625583 7632431 KRG98532    0   +   7625583 7632431 0   6   1280,95,123,807,217,882,    0,3342,3538,4049,5666,5966,
    18  7625542 7632431 KRG98535    0   +   7625542 7632431 0   6   1321,95,123,807,217,882,    0,3383,3579,4090,5707,6007,
    18  7625542 7632431 KRG98533    0   +   7625542 7632431 0   6   1321,95,123,807,217,882,    0,3383,3579,4090,5707,6007,
    18  7625542 7632431 KRG98531    0   +   7625542 7632431 0   6   1321,95,123,807,217,882,    0,3383,3579,4090,5707,6007,

I've browsed several similar posts and noticed that my SAM and BED files do not look as "neat" in comparison. I'd like to know where I'm going wrong.

Best wishes!

edit: formatting

rseqc infer_experiment.py RNA-seq • 1.5k views

ADD COMMENT • link 16 months ago by pubsurfted ▴ 40

1

Entering edit mode

Answers on this post suggest there the chromosome name in your SAM file and those in your GTF do not match. For example the GTF says '18' and the SAM my say 'chr18'. Might be worth checking :)

ADD REPLY • link 16 months ago by Jack Tierney ▴ 350

0

Entering edit mode

Thanks for replying! the following SAM file output:

@HD VN:1.0  SO:unsorted
@SQ SN:KRH76248 LN:1234
@SQ SN:KRH76252 LN:915
@SQ SN:KRH76250 LN:1578
@SQ SN:KRH76249 LN:1578
@SQ SN:KRH76247 LN:762
@SQ SN:KRH76251 LN:915
@SQ SN:KRH77790 LN:756
@SQ SN:KRH77789 LN:1433
@SQ SN:KRH76519 LN:2413

I have two questions: 1) where is the chromosome number here? the third columns? 2) should I sort my sam file before giving it to infer_experiemnt.py

ADD REPLY • link 16 months ago by pubsurfted ▴ 40

1

Entering edit mode

KRH77790

(names after SN:) is the chromosome name (and LN is the length). You can't use a reference and a GTF that came from two different sources.

ADD REPLY • link 16 months ago by GenoMax 141k

0

Entering edit mode

Thank you for the clarification. But what do you mean by reference and GTF from two different sources. I'm using the reference genome and gtf file for Glycine_max.Glycine_max_v2.1.

ADD REPLY • link 16 months ago by pubsurfted ▴ 40

1

Entering edit mode

Did you check all SQ lines to see if you have an entry for I guess chromosome 18?

Chromosome name looks like 18 in your BED file but is there a @SQ SN:18 LN:NNNNN line in your alignment file?

BTW: Have you tried running infer_experiment.py with files you have. Perhaps there is nothing wrong with them. Do not go on how they "look".

ADD REPLY • link 16 months ago by GenoMax 141k

0

Entering edit mode

GenoMax Hello, I rechecked the sources for ref genome and gtf. You were correct! You have no idea how thankful I am for people like you for taking the time to help a bioinfo noob. Again, thank you so much!

ADD REPLY • link 16 months ago by pubsurfted ▴ 40

0

Entering edit mode

What turned out to be the issue ultimately?

ADD REPLY • link 16 months ago by GenoMax 141k

0

Entering edit mode

My dumbass was using a transcripts cdna.fasta file instead of dna.top_level.fasta to build the bowtie2 index.

ADD REPLY • link 16 months ago by pubsurfted ▴ 40