Why are some mapped reads not mapped completely?
2.2 years ago
utsafar ▴ 80

In my work, I generated contigs using Trinity, extracted ORFs (minimum length 1200 bp) using get-orf, and then mapped RNA-Seq reads to these ORFs. All reads are 101 bp long, but when I viewed the mapping results in Integrated Genome Browser (IGB), the mapped length of many reads was below 101 bp (mostly between 30 and 70 bp). Can someone please explain why only parts of some reads are mapped to the contigs?

RNA-Seq mapping • 1.2k views

UTR ?

Since you only extracted the ORFs, you omit the UTRs (transcribed but not translated into protein, and thus not part of the ORF), so reads that extend into a UTR can only align partially. Sometimes this can also happen within the ORF, as the assembly is a 'consensus' of all reads and might deviate from the original reads.

(a minimum length of 1200 bp for an ORF is quite large, tbh)

I am aware of UTRs and omitted them deliberately. Also, in my species the median protein length is 800 aa (2400 bp), so I think looking for ORFs of 1200 bp and above is fair. I would expect reads to map completely at least in the middle of my ORFs. But nested reads (if I am using this word correctly) are everywhere!

Totally deviating from the original question, but a median protein length of 800 AA ????? Seriously ??

That's more than double the median size for all eukaryotes and nearly 3 times that of bacteria. What kind of freak species are we talking about here?

"But nested reads (if I use this word correctly) are everywhere!"

What do you mean by that ('nested reads')?

If there are quality issues with your data, some bases may be getting soft-clipped by the aligner. There is an option in IGV to show these bases; it can be found under "Preferences".
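
For reference, soft-clipping shows up as `S` operations in the CIGAR column of a SAM/BAM record. Below is a minimal Python sketch that reports how many bases are clipped at each end of a read and how many read bases take part in the alignment. The helper name `clip_summary` is made up for illustration, and it works on plain CIGAR strings rather than on a BAM file.

```python
import re

# One CIGAR operation: a length followed by an operation letter (SAM spec).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def clip_summary(cigar):
    """Return (left_soft_clip, aligned_read_bases, right_soft_clip).

    Hypothetical helper for illustration; expects a CIGAR string such as
    '30S71M', not the '*' placeholder of unmapped reads.
    """
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Hard clips (H) are not present in the stored read sequence, so skip them.
    ops = [(n, op) for n, op in ops if op != "H"]
    left = ops[0][0] if ops and ops[0][1] == "S" else 0
    right = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    # M, I, = and X consume read bases that are part of the alignment.
    aligned = sum(n for n, op in ops if op in "MI=X")
    return left, aligned, right

print(clip_summary("30S71M"))     # clipped at the left end only
print(clip_summary("25S50M26S"))  # clipped at both ends
```

A 101 bp read with CIGAR `30S71M` would be drawn as a 71 bp alignment in a browser unless the option to show soft-clipped bases is turned on.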

I used Trinity, get-orf and BWA-MEM on usegalaxy.eu with default options. The data quality is OK (FastQC). The aligner soft-clips more than 90 percent of reads. I am confused!

I suggested that as a possibility because you asked "why only some parts of reads are mapped to contigs?"

Since we can't see your data this is something you will have to check yourself.

I am ashamed to say I didn't know about soft-clipping. I just read about it and looked at the CIGAR column in my BAM file. Many reads are soft-clipped. What should I do now? Is this mapping trustworthy at all?

Another thing that could happen is that you have plenty of chimeric assembled transcripts in your dataset. This could also produce the observations you describe (reads aligning only partially). If the partial alignments cluster around a specific position, that likely points to where the chimera was formed.
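
One rough way to look for such clustering is to record the reference position where each soft clip meets the aligned part of the read and count how many reads share it. The sketch below assumes the alignments have already been pulled out as tuples of contig name, 1-based leftmost aligned position (the SAM POS field) and CIGAR string; the names `clip_breakpoints` and `min_reads` are hypothetical.

```python
import re
from collections import Counter

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def clip_breakpoints(alignments, min_reads=3):
    """Count soft-clip breakpoints per (contig, position).

    alignments: iterable of (contig, pos, cigar), with pos being the 1-based
    leftmost aligned reference position (SAM POS field).
    Returns only breakpoints supported by at least min_reads reads.
    """
    counts = Counter()
    for contig, pos, cigar in alignments:
        ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar) if op != "H"]
        if not ops:
            continue
        # Reference bases covered by the alignment (M, D, N, = and X).
        ref_len = sum(n for n, op in ops if op in "MDN=X")
        if ops[0][1] == "S":
            # Clip on the left side: breakpoint is the first aligned base.
            counts[(contig, pos)] += 1
        if len(ops) > 1 and ops[-1][1] == "S":
            # Clip on the right side: breakpoint is the last aligned base.
            counts[(contig, pos + ref_len - 1)] += 1
    return {bp: n for bp, n in counts.items() if n >= min_reads}

example = [
    ("orf1", 500, "30S71M"),
    ("orf1", 500, "40S61M"),
    ("orf1", 430, "70M31S"),
    ("orf1", 100, "101M"),
]
print(clip_breakpoints(example, min_reads=2))
```

Many reads clipped at the same coordinate would be consistent with a chimeric junction, whereas breakpoints scattered along the ORF would point to something else (adapters, for example).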

Adapters are another thing that often gets soft-clipped. During sequencing, the insert is sometimes shorter than the number of sequencing cycles, so you end up sequencing the insert plus part of the sequencing adapter. This is easy to check in FastQC, like in the image below.

If the soft-clipping is caused by adapters, you could trim them if you need clean sequences; otherwise it is OK to leave them that way.
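
To see whether the clipped bases look like adapter, one could pull out the soft-clipped 3' end of each read and compare it with the adapter sequence. This is only a sketch: it assumes the common Illumina adapter prefix `AGATCGGAAGAGC` (your library may use a different one), uppercase sequences as stored in the SAM SEQ field, and no hard clips. The helper name `clipped_tail_is_adapter` and the `min_match` parameter are hypothetical.

```python
import re

ADAPTER = "AGATCGGAAGAGC"  # assumption: common Illumina adapter prefix
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def clipped_tail_is_adapter(seq, cigar, is_reverse=False, min_match=8):
    """True if the soft-clipped 3' end of the original read starts with adapter.

    seq and cigar are taken as stored in the SAM record; is_reverse is the
    reverse-strand flag of the alignment.
    """
    if is_reverse:
        # SEQ is stored reverse-complemented for reverse-strand alignments,
        # so restore the read orientation; the leading clip is then the 3' end.
        seq = revcomp(seq)
        m = re.match(r"(\d+)S", cigar)
    else:
        # Forward strand: the trailing clip is the 3' end of the read.
        m = re.search(r"(\d+)S$", cigar)
    if not m:
        return False
    tail = seq[-int(m.group(1)):]
    n = min(len(tail), len(ADAPTER))
    return n >= min_match and tail[:n] == ADAPTER[:n]
```

If most clipped tails start with the adapter, trimming is the likely fix; if they do not, adapters are probably not the explanation for the clipping.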

One hopes that this was done, especially prior to a de novo assembly, but it could indeed be a thing to check.

You could check why the reads are soft-clipped. If they don't map at that location, that is one thing, but if they are soft-clipped because of poor quality, that is another. Visualize them in IGV (by turning on the relevant option) and check.

I checked in IGB. The soft-clipped parts of the reads are not mapped because their sequence differs from the reference. In about 30 percent of reads, the first 20 to 40 bp and/or the last 20 to 40 bp, with different sequences, are soft-clipped. What should I do now?

You are somewhat reasoning in a circle here. Yes, they are soft-clipped because their sequence deviates, and because the sequence deviates, they are soft-clipped.

Did you check the FastQC plots for those datasets again, to get an idea of what is in those first/last 20-40 bp?

"In about 30 percent of reads, first 20 to 40 bp and/or last 20 to 40, with different sequences, are soft-clipped. What I have to do now?"

In case you did not scan/trim your reads prior to doing the assembly it would be best to start over. If any extraneous sequence went into the assembly process (adapters etc) then the assembly is going to be incorrect.

Let me ask another question. Using FreeBayes, I want to obtain the coverage of both the reference variant and the alternative variant at each variant position of each ORF (and do some more things with these coverage numbers). For every ORF, I checked the coverage mean, standard deviation and coefficient of variation (CV; the ratio of standard deviation to mean). In many of my ORFs, the CV is more than 0.5. So, are these coverage numbers trustworthy at all?
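
To make the metric concrete, here is a minimal sketch of the per-ORF calculation described above, assuming the per-base depths of one ORF are already available as a list of numbers. The function name `coverage_cv` and the example depths are made up for illustration.

```python
from statistics import mean, pstdev

def coverage_cv(depths):
    """Return (mean, standard deviation, CV) for a list of per-base depths."""
    mu = mean(depths)
    sigma = pstdev(depths)  # population SD; statistics.stdev gives the sample SD
    # CV is undefined when the mean depth is zero.
    cv = sigma / mu if mu > 0 else float("nan")
    return mu, sigma, cv

mu, sigma, cv = coverage_cv([10, 20, 30, 40])
print(f"mean={mu:.1f} sd={sigma:.2f} cv={cv:.2f}")
```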

In a sentence, what is the best way to obtain the depth of coverage at each position of each contig?

Thank you so much

"best way to obtain depth of coverage of each position of each contig?"

Use mosdepth (LINK).
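
mosdepth writes per-base depth as a compressed BED file (`<prefix>.per-base.bed.gz`), in which each line is a run of consecutive positions sharing the same depth. Below is a sketch of expanding such lines into one depth list per contig. The function name and the inline example lines are made up; a real file could be read with `gzip.open(path, "rt")` and passed in the same way.

```python
from collections import defaultdict

def per_base_depths(bed_lines):
    """Expand per-base BED lines into {contig: [depth, depth, ...]}.

    Each line: contig <tab> start (0-based) <tab> end (exclusive) <tab> depth.
    Assumes the intervals of a contig are contiguous and sorted.
    """
    depths = defaultdict(list)
    for line in bed_lines:
        if not line.strip():
            continue  # skip empty lines
        contig, start, end, depth = line.rstrip("\n").split("\t")[:4]
        # Repeat the depth once for every position in the interval.
        depths[contig].extend([int(depth)] * (int(end) - int(start)))
    return dict(depths)

example = ["orf1\t0\t3\t0\n", "orf1\t3\t5\t12\n", "orf2\t0\t2\t7\n"]
print(per_base_depths(example))
```

The resulting lists can go straight into whatever per-ORF statistics you want to compute (mean, standard deviation, CV).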