Question

smallRNA: low mapping percentage, N at the beginning of the reads, and kmers

6.7 years ago

noeD ▴ 130

Hello!

I am working with small RNA data. I analyzed the FASTQ file with FastQC and saw that the Illumina Small RNA 3' adapter was present; in fact, my sequence length distribution was centered on 51. I therefore used cutadapt to remove that adapter, and my sequence length distribution changed: https://drive.google.com/file/d/0B4m6-7p8GFwIa3B5VGRmT3FjUDA/view?usp=sharing

After that, I aligned my reads against the reference genome (hg38) with bowtie, using default parameters, to see how it performed. I obtained a very low percentage of mapped reads (0.30%).

I checked my FASTQ file again with FastQC and saw that there were several kmers at the end of the reads. Is that normal?

I have uploaded all the images from FastQC at this link: https://drive.google.com/open?id=0B4m6-7p8GFwIQmppNjNQVm5BRVU Are there other adapters that I should trim? At the beginning of the reads I saw that in some cases there is an N; should I trim them?

Here is an extract of my FASTQ file:

@HISEQ2500:231:C9L77ACXX:1:2316:21153:100286 1:N:0:NTAGCT
AAGCCGCCAGTTGAAGAACTGT
+
<7<B00<<0<BFBFFIIIIIII
@HISEQ2500:231:C9L77ACXX:1:2316:21183:100346 1:N:0:NTAGCT
CTCCAGGCCGAGGAC
+
<B<<<0<<BB<0<BB
@HISEQ2500:231:C9L77ACXX:1:1101:1376:1894 1:N:0:CTAGCT
NAGCTTATCAGACTGATGTTGA
+
#00BBFFFFFFFFFFFIIIIBF
@HISEQ2500:231:C9L77ACXX:1:1101:1314:1913 1:N:0:CTAGCT
NGCTACATCTGGCTACTGGGTCT
+
#0<FFFFFFFFFFIIIIIIIIII

As you can see, the first read has no N at the beginning of the sequence, but there is one in the read's index. In the last read exactly the opposite happens: there is an N at the beginning of the sequence, but not in the index.

How should I fix this issue?

Thank you in advance

Best

smallRNA RNA-Seq alignment • 2.0k views

updated 6.7 years ago by Brian Bushnell 20k • written 6.7 years ago by noeD ▴ 130

Small RNA data analysis requires pre-processing the data in specific ways (depending on the kit used, etc.). You may want to try a dedicated pipeline (e.g. miRquant or miRDeep2) for this purpose.

6.7 years ago by GenoMax 141k

score 2 · Answer 1 · 2017-07-28

You should trim leading/trailing Ns; they never help alignment, and are particularly bad with Bowtie as it allows very few mismatches. You can do that by quality-trimming to a q-score of 2. On the other hand, if the exact starting position of the read is important, just discard the reads containing Ns. As far as adapter contamination goes... if you successfully trimmed using Illumina's Small RNA adapters as a reference, I don't see a point in trying other adapter sequences as well.
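To make the two options concrete, here is a minimal Python sketch of what each does to a single read. It strips N characters directly rather than quality-trimming, so it is only an illustration of the idea and not a replacement for a real trimmer; the function names and the `keep_start_position` switch are made up for this example.

```python
def trim_terminal_ns(seq, qual):
    """Strip runs of N from both ends of a read, keeping qualities in sync."""
    start = len(seq) - len(seq.lstrip("N"))
    end = len(seq.rstrip("N"))
    if start >= end:
        # The read consisted only of Ns (or was empty).
        return "", ""
    return seq[start:end], qual[start:end]


def clean_read(seq, qual, keep_start_position=False):
    """Return the cleaned (seq, qual), or None if the read should be dropped."""
    if keep_start_position:
        # Exact read start matters: discard any read containing an N.
        return None if "N" in seq else (seq, qual)
    seq, qual = trim_terminal_ns(seq, qual)
    return (seq, qual) if seq else None


# Third read from the extract in the question.
seq, qual = "NAGCTTATCAGACTGATGTTGA", "#00BBFFFFFFFFFFFIIIIBF"
trimmed = clean_read(seq, qual)
strict = clean_read(seq, qual, keep_start_position=True)
```

With the default setting the leading N and its quality character are removed; with `keep_start_position=True` the same read is dropped instead.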

You can also remove reads with Ns in the barcodes, or barcodes that do not exactly match the expected barcodes, during the demultiplexing process. I recommend this if you are multiplexing, to prevent crosstalk between libraries. It's also possible to remove them after the fact - BBDuk has the flags "barcodefilter" and "barcodes" for that purpose. If crosstalk is not a problem for the experiment, there's no reason to remove them.
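As a rough illustration of what exact-match barcode filtering means, the sketch below reads the index from the last colon-separated field of the header, as in the extract posted in the question. The expected barcode set is an assumption for this example (I'm guessing `CTAGCT` is the intended index), and the function names are hypothetical; this is not how BBDuk implements it.

```python
# Assumed set of barcodes used in the run; replace with the real ones.
EXPECTED_BARCODES = {"CTAGCT"}


def barcode_of(header):
    """Return the index sequence from a header like '@... 1:N:0:CTAGCT'."""
    return header.rsplit(":", 1)[1].strip()


def keep_read(header, expected=EXPECTED_BARCODES):
    """Keep a read only if its barcode exactly matches an expected barcode."""
    return barcode_of(header) in expected


headers = [
    "@HISEQ2500:231:C9L77ACXX:1:2316:21153:100286 1:N:0:NTAGCT",
    "@HISEQ2500:231:C9L77ACXX:1:1101:1376:1894 1:N:0:CTAGCT",
]
kept = [h for h in headers if keep_read(h)]
```

Here the read with `NTAGCT` in its header is rejected because of the N, and the one with `CTAGCT` is kept.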

As for the low alignment rate, it's hard to say what might cause that (could be that the library is mostly not human, for example). I'd suggest trying other aligners (bowtie2, bwa-aln, BBMap) to see if they improve things, and you might try BLASTing some of the longest unaligned reads to nt/RefSeq to see what they hit, though that's much more useful with longer reads.
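For the BLAST suggestion, a small helper like the following could pick out the longest unaligned reads and format them as FASTA for pasting into a BLAST search. It is only a sketch: the function name is hypothetical, and it assumes you already have the unaligned reads as (name, sequence) pairs.

```python
def longest_reads_to_fasta(reads, n=10):
    """Format the n longest (name, sequence) pairs as FASTA text."""
    top = sorted(reads, key=lambda r: len(r[1]), reverse=True)[:n]
    return "".join(f">{name}\n{seq}\n" for name, seq in top)


# Toy input; in practice these would be the reads the aligner failed to place.
unaligned = [
    ("read1", "AAGCCGCCAGTTGAAGAACTGT"),
    ("read2", "CTCCAGGCCGAGGAC"),
    ("read3", "NGCTACATCTGGCTACTGGGTCT"),
]
fasta = longest_reads_to_fasta(unaligned, n=2)
```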