Question: Extract overrepresented sequences from fastq or fastqc
11 months ago by
michele.tebaldi.9230 wrote:

Hi everybody. I'm doing single-cell RNA-seq analysis. My starting point is FASTQ files with paired-end reads from the Nextera XT platform. Read length is 75 bp.

I trimmed my reads for both quality and adapter content with Trim Galore, then ran FastQC and MultiQC to check that everything is OK.

I found that several samples (about 60 of 350) have overrepresented sequences to varying extents, though in most cases not by much.

I'd like to BLAST the overrepresented sequences to see what they are.

Is there a simple way to get them?

I searched the internet but couldn't find anything...

modified 11 months ago by Ido Tamir5.0k • written 11 months ago by michele.tebaldi.9230

Hi, welcome to Biostars! Which platform was used? 10x, SMART-seq? What was the read length? Trimmed for which adapter or quality? As for the overrepresented sequences: which ones, and to what extent? Please post some details. There are several constant regions in scRNA-seq fragments depending on the platform, so overrepresentation can be normal and expected.

modified 11 months ago • written 11 months ago by ATpoint29k

Sorry, I'm new here. I'll edit my question to add more details. You're right, I was too confident.

written 11 months ago by michele.tebaldi.9230

Don't worry :)

written 11 months ago by ATpoint29k

michele.tebaldi.92 : When you edit your original post it would help to have some visual information. This would help with that part: How to add images to a Biostars post

written 11 months ago by genomax78k

Maybe sequence clustering software such as CD-HIT could help you, michele.tebaldi.92.

written 11 months ago by cpad011212k
11 months ago by
JC9.5k
Mexico
JC9.5k wrote:

The FastQC report will show you the overrepresented sequences. If you want the full read sequences, you can extract them from the FASTQ file, for example:

grep "ACTACTCATCAACTTGAC" reads.fastq | head -10 > ten_overrepresented_sequences.txt

If the FASTQ file is compressed, you can use zgrep:

zgrep "ACTACTCATCAACTTGAC" reads.fastq.gz | head -10 > ten_overrepresented_sequences.txt
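
The grep commands above return only the matching sequence lines. If you would rather have the reads in FASTA form (for pasting into BLAST, say), something like the following might work. It is only a sketch: the function name reads_with_motif_to_fasta is made up for illustration, and it assumes an uncompressed FASTQ stream on stdin with strict four-line records.

```shell
# reads_with_motif_to_fasta MOTIF [MAX]
# Hypothetical helper: prints the first MAX reads (default 10) whose
# sequence line contains MOTIF, as FASTA records.
# Assumes strict 4-line FASTQ records arriving on stdin.
reads_with_motif_to_fasta() {
    motif=$1
    max=${2:-10}
    awk -v motif="$motif" -v max="$max" '
        NR % 4 == 1 { header = substr($0, 2) }   # read name without the leading "@"
        NR % 4 == 2 && index($0, motif) > 0 {    # sequence line containing the motif
            print ">" header
            print $0
            if (++found >= max) exit
        }
    '
}

# Example calls (file names are placeholders):
#   reads_with_motif_to_fasta ACTACTCATCAACTTGAC 10 < reads.fastq > hits.fa
#   gzip -cd reads.fastq.gz | reads_with_motif_to_fasta ACTACTCATCAACTTGAC 10 > hits.fa
```

Matching by line position rather than by pattern alone avoids picking up quality lines that happen to look like sequence, and keeps the read name attached to each hit.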

written 11 months ago by JC9.5k
11 months ago by
Ido Tamir5.0k
Austria
Ido Tamir5.0k wrote:

There are many FastQC report parsers written in different languages, e.g. fastqcr for R:

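# `fastqc` holds the path to a FastQC zip, unzipped folder, or fastqc_data.txt
library("magrittr")

# n() is namespaced via dplyr::row_number() because dplyr itself is not attached
fastqcr::qc_read(fastqc)$overrepresented_sequences %>%
     dplyr::mutate(name = paste0(">", dplyr::row_number(), "-", Count),
                   fa = paste(name, Sequence, sep = "\n")) %>%
     dplyr::pull(fa) %>%
     readr::write_lines("overrepresented.fa")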

If you want to write your own parser, the data are simply in the fastqc_data.txt file, between the >>Overrepresented sequences and >>END_MODULE lines:

>>Overrepresented sequences     warn
#Sequence       Count   Percentage      Possible Source
GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGC      1107975 0.48152514100356447     TruSeq Adapter, Index 1 (100% over 50bp)
GATCGGAAGAGCACACGTCTGAACTCCAGTCACTGACCAATCTCGTATGC      994874  0.4323715274539409      TruSeq Adapter, Index 4 (100% over 50bp)
GATCGGAAGAGCACACGTCTGAACTCCAGTCACTTAGGCATCTCGTATGC      780609  0.33925211200040745     TruSeq Adapter, Index 3 (100% over 50bp)
>>END_MODULE
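
For anyone who prefers to stay in the shell, the module layout shown above can be parsed with awk. This is a rough sketch rather than a tested tool: overrep_to_fasta is a made-up name, and it assumes the fields are tab-separated, as FastQC writes them. Records are named ">index-count" to mirror the R snippet.

```shell
# overrep_to_fasta
# Hypothetical helper: reads the contents of a fastqc_data.txt file on stdin
# and prints one FASTA record per overrepresented sequence.
# Assumes tab-separated columns: Sequence, Count, Percentage, Possible Source.
overrep_to_fasta() {
    awk -F '\t' '
        /^>>Overrepresented sequences/ { in_module = 1; next }  # module start
        /^>>END_MODULE/                { in_module = 0 }        # any module end
        in_module && !/^#/ {                                    # skip the "#Sequence" header row
            n++
            print ">" n "-" $2
            print $1
        }
    '
}

# Example calls (file names are placeholders; the zip layout with a
# <name>_fastqc/fastqc_data.txt entry is an assumption about FastQC output):
#   overrep_to_fasta < sample_fastqc/fastqc_data.txt > sample.overrepresented.fa
#   unzip -p sample_fastqc.zip '*/fastqc_data.txt' | overrep_to_fasta > sample.overrepresented.fa
```

With a few hundred samples, the same call could be wrapped in a for loop over the *_fastqc.zip files, writing one FASTA file per sample.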
written 11 months ago by Ido Tamir5.0k