reads processed: 30829794

Question

why 99% reads failed to align in mir-seq analysis?

0

Entering edit mode

3 months ago

mohammadhassanj ▴ 260

Hi, After trimming the reads as discussed in below

Is mir-seq reads quality good (Fastqc report) for DE analysis?

I used bowtie1 for alignment, but 99% of the reads were not aligned, what is the problem?

bowtie -t -x ../bowtieIndex/GCA_000001405.15_GRCh38_no_alt_analysis_set -p 15  --chunkmbs 512 -1 trimmed_SRR.R1.fastq.gz -2 trimmed_SRR.R2.fastq.gz -S SRR.sam

Time loading reference: 00:00:00 Time loading forward index: 00:00:00 Time loading mirror index: 00:00:00 Seeded quality full-index search: 00:16:27

reads processed: 30829794

reads with at least one alignment: 110845 (0.36%)

reads that failed to align: 30718949 (99.64%)

Reported 110845 paired-end alignments

the fastqc report after trim :

enter image description here

Fastqc bowtie1 miRNA-seq alignment differential-expression-analysis • 944 views

ADD COMMENT • link 3 months ago by mohammadhassanj ▴ 260

1

Entering edit mode

If you have real miRNA data then you should only have short reads left after trimming. Take a selection of trimmed reads that did not align and blast them to make sure they are from right genome.

You should also use only read 1 for alignment. It does not make sense to use paired data with reads expected to be shorter than 30 bp.

ADD REPLY • link 3 months ago by GenoMax 141k

0

Entering edit mode

Thank you so much after aligning with just one read (R1) the alignment finishes in 26 seconds! with the following log:

bowtie -t -x ../bowtieIndex/GCA_000001405.15_GRCh38_no_alt_analysis_set -p 15 --chunkmbs 512 trimmed_SRR.R1.fastq.gz --un unaligned.sam -S SRR.sam

Time loading forward index: 00:00:00                                                                                                                                             
Time loading mirror index: 00:00:00                                                                                                                                              
Seeded quality full-index search: 00:00:26                                                                                                                                       
reads processed: 30829794                                                                                                                                                      
reads with at least one alignment: 30136141 (97.75%)                                                                                                                           
reads that failed to align: 693653 (2.25%)                                                                                                                                     
Reported 30136141 alignments

Do you think this result is acceptable? if yes what is the reason? can you explain more

ADD REPLY • link 3 months ago by mohammadhassanj ▴ 260

0

Entering edit mode

If you have reads of 90-something bp that align to the genome then you don't have microRNAs but probably genomic DNA or other contamination. miRNAs are canonically 19-25bp long so it can't be miRNAs. So no, the results are suspicious.

ADD REPLY • link 3 months ago by ATpoint 82k

0

Entering edit mode

Hi this is the output of

samtools stats  SRR.sam

what do you think?

raw total sequences:    30829794    # excluding supplementary and secondary reads
SN  filtered sequences: 0
SN  sequences:  30829794
SN  is sorted:  0
SN  1st fragments:  30829794
SN  last fragments: 0
SN  reads mapped:   30136141
SN  reads mapped and paired:    0   # paired-end technology bit set + both mates mapped
SN  reads unmapped: 693653
SN  reads properly paired:  0   # proper-pair bit set
SN  reads paired:   0   # paired-end technology bit set
SN  reads duplicated:   0   # PCR or optical duplicate bit set
SN  reads MQ0:  0   # mapped and MQ=0
SN  reads QC failed:    0
SN  non-primary alignments: 0
SN  supplementary alignments:   0
SN  total length:   940179704   # ignores clipping
SN  total first fragment length:    940179704   # ignores clipping
SN  total last fragment length: 0   # ignores clipping
SN  bases mapped:   894663218   # ignores clipping
SN  bases mapped (cigar):   894663218   # more accurate
SN  bases trimmed:  0
SN  bases duplicated:   0
SN  mismatches: 3022515 # from NM fields
SN  error rate: 3.378383e-03    # mismatches / bases mapped (cigar)
SN  average length: 30
SN  average first fragment length:  30
SN  average last fragment length:   0
SN  maximum length: 150
SN  maximum first fragment length:  150
SN  maximum last fragment length:   0
SN  average quality:    36.5
SN  insert size average:    0.0
SN  insert size standard deviation: 0.0
SN  inward oriented pairs:  0
SN  outward oriented pairs: 0
SN  pairs with other orientation:   0
SN  pairs on different chromosomes: 0
SN  percentage of properly paired reads (%):    0.0

ADD REPLY • link 3 months ago by mohammadhassanj ▴ 260

0

Entering edit mode

As noted by ATPoint miRNA are going to be small. So in theory you need to only keep those reads that have the specific miRNA library adapter (rest are not usable reads). This adapter will need to be trimmed followed by alignment.

Looks like you are working with public SRA data. What accession number is the data from?

ADD REPLY • link 3 months ago by GenoMax 141k

0

Entering edit mode

Yes, The accession number for the mentioned data is SRR22399501

ADD REPLY • link 3 months ago by mohammadhassanj ▴ 260

0

Entering edit mode

Based on the kit mentioned it looks like AGATCGGAAGAGCACACGTCT is the smRNA 3'-adapter. Can you see that in your reads? Only keep those reads that contain this adapter if you do. Trim the reads to remove this adapter and everything to 3'-end of that from Read 1 and then align.

ADD REPLY • link 3 months ago by GenoMax 141k

0

Entering edit mode

Thanks if I understand correctly first, as you suggest I use this command

bbduk.sh in=SRR_1.fastq.gz in2=SRR_2.fastq.gz  literal="AGATCGGAAGAGCACACGTCT" qin=33 out=SRR_1.fastq.gz out2=SRR_2.fastq.gz

Almost more than 99% of the reads were recovered

after that, I use cutadapt to trim the adapter from the reads

cutadapt -q 30 -m 15 -a AGATCGGAAGAGCACACGTCT -o trimmed_SRR1.fastq.gz -p trimmed_SRR2.fastq.gz SRR_1.fastq.gz SRR_2.fastq.gz

the fastqc result of the above commands was same below command

fastp -c --detect_adapter_for_pe -i SRR1.fastq.gz -I id_2.fastq.gz -o rimmed_SRR1.fastq.gz -O trimmed_SRR2.fastq.gz

finally that the featueCount command log on this data was with just 22% assigned alignment

Assigned        6832830
Unassigned_Unmapped     471571
Unassigned_Read_Type    0
Unassigned_Singleton    0
Unassigned_MappingQuality       0
Unassigned_Chimera      0
Unassigned_FragmentLength       0
Unassigned_Duplicate    0
Unassigned_MultiMapping 0
Unassigned_Secondary    0
Unassigned_NonSplit     0
Unassigned_NoFeatures   23259152
Unassigned_Overlapping_Length   0
Unassigned_Ambiguity    266241

 Features : 3218                                                         ||
||    Meta-features : 2653                                                    ||
||    Chromosomes/contigs : 112                                               ||
||                                                                            ||
|| Process BAM file SRR1.bam...                                                 ||
||    Single-end reads are included.                                          ||
||    Total alignments : 30955380                                             ||
||    Successfully assigned alignments : 6901421 (22.3%)                      ||
||    Running time : 0.12 minutes

ADD REPLY • link 3 months ago by mohammadhassanj ▴ 260

2

Entering edit mode

That appears reasonable. featureCounts will not count multi-mapped reads so you will need to take that in consideration when dealing with miRNA. You could try using a miRNA workflow.

ADD REPLY • link 3 months ago by GenoMax 141k

0

Entering edit mode

Does it make sense that the low assign percentage (featureCounts) is due to the large amount of overrepresented sequences in the fastqc report?

ADD REPLY • link 3 months ago by mohammadhassanj ▴ 260

0

Entering edit mode

This article in a bit old now (2016), but I think it provides a nice review of the challenges inherent to miRNAs alignment and provides a benchmark of different aligners and pipelines. I hope this will be useful:

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4931105/

(Ziemann M, Kaspi A, El-Osta A. Evaluation of microRNA alignment techniques. RNA. 2016 Aug;22(8):1120-38. doi: 10.1261/rna.055509.115. Epub 2016 Jun 9. PMID: 27284164; PMCID: PMC4931105.)

ADD REPLY • link 3 months ago by Corentin ▴ 600

0

Entering edit mode

I found it interesting that contrary to this article's suggestion, many miRNA pipelines are written using bowtie1 as an aligner after paper publication!!

ADD REPLY • link 3 months ago by mohammadhassanj ▴ 260