I have some problem with my degradome data, the company provided me two files, raw reads and mappable reads (trimmed reads without the adapter), this is the fastQC report of mappable reads for overrepresented sequences:
Sequence Count Percentage Possible Source AGTTCTACAGTCCGACGATCAGTTCTACAGTCCGACGATCAGTTCTACAGT 8897 0.29136760032893066 Illumina DpnII expression Sequencing Primer (95% over 21bp) CAGAGTTCTACAGTCCGACGATCCAGAGTTCTACAGTCCGACGATCCAGAG 5043 0.16515306378091463 Illumina DpnII expression Sequencing Primer (100% over 23bp) GCGACCCCAGGTCAGGCGGGACCACCCGCTGAGTTTAAGCATATCAATAAG 4503 0.14746861911668818 No Hit GAGTTCTACAGTCCGACGATCGAGTTCTACAGTCCGACGATCGAGTTCTAC 4241 0.1388883885573783 Illumina Small RNA Adapter 1 (95% over 22bp) AGAGTTCTACAGTCCGACGATCAGAGTTCTACAGTCCGACGATCAGAGTTC 3681 0.1205489644611435 Illumina Small RNA Adapter 1 (96% over 26bp) CAGAGTTCTACAGTCCGACGATCCAGAGTTCTACAGTCCAACGATGGAATT 3234 0.10591017415575608 Illumina DpnII expression Sequencing Primer (100% over 23bp)
But the main issue is for sequence length, so the raw reads length is 51 bp, that mean RNA seq + adapter length = 51 bp. but in clean reads (mappable.fq.gz= trimmed reads without adapter ) again about 20 percent of sequence ( 500 000 ) has the SAME length which is 51 bp ( same length before trimming), how it could be possible?
Furthermore, in mappable files about 50 percent of reads could NOT match to target even with 2 allowed mismatch. The length of sequences is 10 to 51 bp
I think these data are not clean reads and there are some adapter remnants in sequences, is it right?
If you are worried about remnants of adapters/extraneous sequences then pass your data through a scan/trim program. I recommend
bbduk.shfrom BBMap suite.
That said, grab a sample of those reads that don't match the target and blast them at NCBI. Should immediately give you an idea of what you are looking at.
For standard libraries that just means those read did not have any illumina adapters. In general, you should not have any adapter sequence present as long as the size of insert is longer than the length of sequencing.
Edit: I am not directly familiar with
degradomeand if it uses some specific modification then the above observation will not apply.
I rechecked one of the clean reads with bowtie with this command
These are some of the unaligned reads with -v 2 (most of unaligned seqs have 51 nucleotide lenght)
I checked some of these and it looks like they are mapping to the grape genome. So unless you are not working with grape I am not sure why they did not align in first place.
actually the target genome is grape but the issue is low mapping rate , I rechecked for all 3 degradome libs and the average mapping rate with 2 mismatch is just 50 percent, I think the problem is this 51b reads because in unaligned fragment most of sequences have 51 base length , for instance in one these clean libs, there is about 1 M reads with 51 length but 800 000 of them could not match to genome with -v 2 !! strange point is a lot of these reads could not match with any organism in blast. it could be some kind of artifact ? I use same index for degradom and small RNA libs but the small RNA mapping rate is 80% (-v 1)
It makes more sense now.
bowtie v1is appropriate for small RNA because it does un-gapped alignments. If you are interested in allowing for splicing (or at least split-mapping) you may need to try a different aligner (that can also do gapped alignments, bowtie2, STAR, bbmap etc).
Those reads that don't match anything in blast could be some kind of experimental artifact/contamination.
these degradome reads is kind of small RNA (length distribution is between 10 to 51 bp)
Are you saying that you can't/don't want to allow gaps in alignments?
I just followed bowtie2 guide line " It is particularly good at aligning reads of about 50 up to 100s or 1,000s" if we allowed gap alignment for short reads , it not produce some misalignment?
bowtie v.1 and v.2 are different programs. You used bowtie v.1 for the stat above and that is good for small RNA.
You can try bowtie v.2 independently (you will need to recreate the indexes if I recall right) and see what you get. If you have reads that capture a splice (e.g. two parts going to two exons) then you need gapped alignments.
I lunched bowtie2 but no significant difference in results :
I would have expected bowtie2 to show more alignments. Did you use default parameters or change any? Are you willing to try a different aligner?
I used bowtie v.2 with default, but I think its due to contamination because I blast these unaligned reads in NBCI and some of these reads could not match with any organisms but the other could be matched but just in partial part for instance 30 bp of 51 bp aligned with the grape genome. I got a bit wired answer from the company, "it's expected that longer, degraded RNA fragments are generally harder to map to the genome than smaller fragments"
but the size of these degradome reads is 10 to 51 base, NOT so long, approximately as same as small RNA,
Scan/trim your original data set using the instructions I linked for
bbduk.shin thread above. Once that is done use
bbmap.shaligner with following options specific for small RNA alignments to see if you can get better mapping
should I use default adapter list with bbduk .sh?
If the experiment used one of the standard illumina/commercial kits then yes.
If you have paired-end data then trim both reads together. Modify
bbmap.shmapping command to include
in1= in2=to account for the two files.
I lunched the bbduk.sh with this flags
lots of reads already removed, what do you think ?
If those reads had predominantly adapter then that is what we wanted.
Is there any expected size range for degradome reads (e.g. miRNA expected to be 21-25 bp)? If so you could also enforce
minlen=to drop reads that go below that length.
Now try the
bbmap.shmapping with options noted above.
worse result , I think it's due to perfectmode flag but with bowtie v.2 and bowtie v.1 mapping rate increased about 30 percent
bowtie 2 mapping rate :
what is your opinion ?
Can you take out the perfect match mode from bbmap? That will allow split mapping though so keep that in mind.
Did you re-do
bowtie v.1 and v.2mapping with
bbduk.shtrimmed data? If not then we can't compare the results the way you posted them above. You would want to re-align the trimmed data again.
BTW: How many samples do you have? Is this happening with just one or all samples? What is the next step in the analysis?
obviously I used same re-trimmed data for bowtie v.1 and 2 and mapping with bbduk.sh. I have 3 samples but this problem (low mapped rate )is common for all of them, now we confirmed low rate mapped is due to some artifact and adapter contamination, right ?
Would you mind re-doing the bbmap mapping removing the perfect match restriction?
We have addressed the adapter contamination but not explained the low mapping as yet. I have a feeling that you do have some artifacts (or chimeric reads) based on the blast I remember looking at for a small number of reads I had tried. There is definitely plant DNA in your reads that do not map based on blast.
without perfect mod flag 85% mapped but with great error rate
Good. This is in line with what I expected to see. I would not worry about the error rate since that is not quite what it seems to mean. 57% unambig alignments is better than bowtie.
Can you load the alignments in IGV and check to see how the three aligners are doing visually? Do you see many more split alignments with bbmap and bowtie v.2 compared to bowtie v.1 and bbmap with perfect matches?
OK, I'II try to do that, but I have a problem with sending the post just 5 posts per 6hr, as you know I elaborate a lot on your nice website :)
That limit will increase as you participate more on the site over time. Just edit the last post and add more information to it if you are not able to create a new one.
here is the IGV visualization, upper case is bbmap out put.
In that small snapshot the alignments look reasonable. There are gapped alignments with
I am not sure what your downstream analysis pipeline looks like but you should now proceed with that (use both bbmap and bowtie mapped data in parallel) and see what you get.
as you know the issue was contaminated clean reads which provided by company and due to low mapping rate I tried to re-trim this clean reads, right now we confirm that we have a lot of artifact and adapter remanent, right ? (because company told me we are pretty sure there is not any kind of contamination in clean read libs !!!) This is another snapshot , it seems with bowtie2 we have better results in this case https://ibb.co/buVuex
Your data seems to have adapter contamination but that is not completely unusual. Depending how well the libraries turned out and/or the insert size you have. This is a characteristics of your sample/libraries (unless the company made the libraries and did not do that part well, even then they could have been ultimately limited by quality of your samples).
This is the company answer "Our lab is confident the mapping rate issue does not relate to any adaptor contamination, but with regard to this phenomenon you're observing, it's expected that longer, degraded RNA fragments are generally harder to map to the genome than smaller fragments." but after re-trimming just according to public available primer and adapter seq, mapping rate improve to over 70% BTW, Thanks in advance for your comments and help
I found some problem with bbmap , in bbmap output there are plenty amount of seqs with N which is mapped to genome but in bowtie2 output we don't have this problem. could you please guide me about that ? https://ibb.co/hsBte7 https://ibb.co/nsL8CS Thanks
can I ask your opinion about bowtie output ?