Question

bowtie2 mapped read coordinates exceed transcript length

0

Entering edit mode

7.4 years ago

boaty ▴ 220

Hi guys,

I want some figures of tss/tes distribution so i mapped my reads from rna-seq on transcripts (22k transcripts). The transcripts fasta is downloaded from gencode and for each gene i selected the longest transcript isoform. I used transcript header as ID, so i could keep some useful information for downstream analysis.

But my pb is, when i check the .bed file (converted by bedtools), i found some mapping coordinates exceed the length of transcript, the transcript ENST00000623083.3 has 1687 bps and reads were mapped on position 2546-2621 :

ENST00000623083.3|ENSG00000279457.3|-|-|FO538757.1-202|FO538757.1|**1687**|protein_coding|  **2546  2621**  NB500892:13:HKCM7BGXX:1:12103:14291:17185   1   - 2546  2621    255,0,0 1   75  0
ENST00000623083.3|ENSG00000279457.3|-|-|FO538757.1-202|FO538757.1|**1687**|protein_coding|  **2550  2625**  NB500892:13:HKCM7BGXX:1:11204:2849:19931    1   - 2550  2625    255,0,0 1   75  0
ENST00000623083.3|ENSG00000279457.3|-|-|FO538757.1-202|FO538757.1|**1687**|protein_coding|  **2564  2639**  NB500892:13:HKCM7BGXX:2:13210:21760:16153   1   - 2564  2639    255,0,0 1   75  0

How could this happen? I am totally lost

rna-seq is single-end and i used default parameter for alignment.

Thank you in advance

sequencing bowtie2 RNA-Seq transcript • 1.4k views

ADD COMMENT • link updated 7.4 years ago by GouthamAtla 12k • written 7.4 years ago by boaty ▴ 220

0

Entering edit mode

DId you align in --local mode? I guess not, if you use default parameters, but worth asking.

If not, perhaps you could check, for all the reads that map on that transcript into your SAM file, the rightmost mapping position (field 4+read_length). If the numbers exceed the transcript length also there, then it's a mapping problem. If they don't, it's a conversion problem.

To do so, a quick way would be:

samtools view yoursamfile.sam | grep ENST00000623083.3 | cut -f 4 | awk '{print $1+<insert_read_length_here>}' | sort -nr | awk '{if ($1 > <insert_transcript_length_here>) {print $0}}' | wc -l

(remove the "<>" signs when you insert values)

If the output number is 0, no read mapped out of bounds and the problem is your conversion. If the output number is > 0, then there are some mapping problems.

ADD REPLY • link 7.4 years ago by Matteo Schiavinato ★ 3.6k

0

Entering edit mode

Thank you Macspider, I check my data, as you said it's some mapping problems. but I re-mapped read by using --local, problems remain the same.

ADD REPLY • link 7.4 years ago by boaty ▴ 220

0

Entering edit mode

Ok. Then I have a question: are you sure you are mapping your reads against the mRNA sequences, and not aganist the gene region sequence that codes for that mRNA? In the latter case, you will have a longer sequence because introns.

ADD REPLY • link 7.4 years ago by Matteo Schiavinato ★ 3.6k

0

Entering edit mode

you are right, problem is from my reference transcript. The transcript i got was incorrect, some header was missing

thank you again Macspider

ADD REPLY • link 7.4 years ago by boaty ▴ 220