Question: Filter out specific reads from FASTQ files
4.4 years ago by
European Union
Paul1.3k wrote:

Dear all,

I have paired-end RNA-seq data (Illumina) from a parasite and I would like to do de novo assembly with Trinity. I have a reference genome for my host organism, so I can map my data to the host and remove the contaminating reads from the FASTQ files.

My plan is:

1. Map my paired-end FASTQ files to the host reference genome with bwa/bowtie/novoalign

2. Remove the hits from the FASTQ files (cleaning out contamination)

3. Use Trinity on the remaining reads for de novo transcript assembly

My question is:

Can I use DNA aligners (bwa etc.) to align the raw FASTQ files to the host DNA and then remove the contaminants from the FASTQ files? I ask because my data are from an RNA-seq project, NOT DNA.

How can I remove the sequences that align to the host DNA from the raw FASTQ files (the cleaning process)?

If you have any other advice on how to prepare data for the Trinity pipeline, I would appreciate it.

Thank you so much for any comment and sharing your experience.

ADD COMMENTlink modified 4.4 years ago by Peter5.8k • written 4.4 years ago by Paul1.3k
4.4 years ago by
Devon Ryan89k
Freiburg, Germany
Devon Ryan89k wrote:

If you have RNA-seq data, you'd be better off sticking with an aligner intended for spliced alignments (e.g. STAR). Most of these have an option to place unmapped reads/pairs in new FASTQ file(s), which you could then feed to Trinity or any other assembler (i.e., step #2 will be done for you). I don't have any advice on good assemblers; hopefully others will chime in with feedback there.
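A minimal sketch of what this could look like with STAR, written as a small Python helper that only builds the command line (it does not run anything). It assumes a host genome index has already been generated; the index directory, read file names and output prefix are hypothetical placeholders. The `--outReadsUnmapped Fastx` option and the `Unmapped.out.mate1`/`Unmapped.out.mate2` output names are as described in the STAR manual, but check them against the version you have installed.

```python
import shlex


def star_unmapped_command(genome_dir, read1, read2, out_prefix, threads=4):
    """Build a STAR command that maps a read pair to the host genome and
    also writes the unmapped reads to separate files."""
    cmd = [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,          # pre-built host index (assumed to exist)
        "--readFilesIn", read1, read2,
        "--outFileNamePrefix", out_prefix,
        "--outReadsUnmapped", "Fastx",      # ask STAR to keep unmapped reads
    ]
    if read1.endswith(".gz"):
        # Compressed input has to be decompressed on the fly.
        cmd += ["--readFilesCommand", "zcat"]
    return cmd


def unmapped_fastq_names(out_prefix):
    """Names STAR is expected to use for the unmapped mates."""
    return (out_prefix + "Unmapped.out.mate1",
            out_prefix + "Unmapped.out.mate2")


# Hypothetical file names, shown only to illustrate the call.
example = star_unmapped_command("host_index", "R1.fastq.gz", "R2.fastq.gz", "clean_")
print(shlex.join(example))
```

The printed command can be pasted into a shell or passed to `subprocess.run`; the two files returned by `unmapped_fastq_names` would then be the candidate input for the assembler.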

ADD COMMENTlink written 4.4 years ago by Devon Ryan89k

Thank you Devon, I will try STAR, maybe TopHat, and I'll see how it works.

ADD REPLYlink written 4.4 years ago by Paul1.3k
4.4 years ago by
Manvendra Singh2.0k
Berlin, Germany
Manvendra Singh2.0k wrote:

I agree with Devon

I would do it as follows:

1. Map the FASTQ files with tophat2.

2. Convert the unmapped.bam file to FASTQ (bamToFastq) and remap with tophat2, this time providing the junctions you got from the first run via the -j option (if you have replicates, merge the junctions).

3. The unmapped.bam from this second run can then be converted to FASTQ.

I think this is the FASTQ containing the reads you are looking for.
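For the "merge the junctions" part when there are replicates, one possible sketch is below. It assumes each replicate's junctions have already been converted to TopHat's raw junction format (tab-delimited: chromosome, left position, right position, strand), for example with the bed_to_juncs script shipped with TopHat. The function name and file names are hypothetical.

```python
def merge_juncs(paths, out_path):
    """Write the sorted, de-duplicated union of several raw junction files.

    Each input line is expected to hold: chrom, left, right, strand.
    Returns the number of unique junctions written.
    """
    junctions = set()
    for path in paths:
        with open(path) as handle:
            for line in handle:
                fields = line.split()
                # Skip blank lines, comments and anything too short to parse.
                if len(fields) < 4 or line.startswith("#"):
                    continue
                chrom, left, right, strand = fields[:4]
                junctions.add((chrom, int(left), int(right), strand))

    with open(out_path, "w") as out:
        for chrom, left, right, strand in sorted(junctions):
            out.write(f"{chrom}\t{left}\t{right}\t{strand}\n")
    return len(junctions)
```

A call such as `merge_juncs(["rep1.juncs", "rep2.juncs"], "merged.juncs")` would give one file that can be passed to the second tophat2 run with -j.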

ADD COMMENTlink written 4.4 years ago by Manvendra Singh2.0k

Thank you Manu for your comment. Why do you recommend mapping twice? I would appreciate a deeper explanation.

ADD REPLYlink written 4.4 years ago by Paul1.3k

In the second round of mapping, you provide all the junctions found in your RNA-seq data.

I have noticed that >5% of the previously unaligned reads get aligned to the genome by doing so.

Now you have more robust sets of mapped and unmapped reads, which you can follow up on.
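As one way to follow up, here is a minimal sketch for cleaning the original paired FASTQ files by read name instead of relying on the aligner's unmapped output. It assumes you already have a set of names of reads that mapped to the host (for example collected from the first column of `samtools view -F 4` on the host alignment); the function names and the choice to drop a pair when its name is in that set are illustrative assumptions, not part of TopHat or Trinity.

```python
import gzip
from itertools import islice, zip_longest


def open_text(path, mode="rt"):
    """Open plain or gzip-compressed text files transparently."""
    if str(path).endswith(".gz"):
        return gzip.open(path, mode)
    return open(path, mode)


def read_fastq(handle):
    """Yield (header, sequence, plus, quality) tuples from 4-line FASTQ."""
    while True:
        record = list(islice(handle, 4))
        if not record:
            return
        if len(record) < 4:
            raise ValueError("truncated FASTQ record")
        yield tuple(line.rstrip("\n") for line in record)


def read_name(header):
    """Turn '@name/1' or '@name 1:N:0:...' into 'name'."""
    name = header[1:].split()[0]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def filter_pairs(in1, in2, out1, out2, host_names):
    """Copy read pairs whose name is NOT in host_names; return (kept, removed)."""
    kept = removed = 0
    with open_text(in1) as f1, open_text(in2) as f2, \
            open_text(out1, "wt") as o1, open_text(out2, "wt") as o2:
        for rec1, rec2 in zip_longest(read_fastq(f1), read_fastq(f2)):
            if rec1 is None or rec2 is None:
                raise ValueError("R1 and R2 contain different numbers of reads")
            name = read_name(rec1[0])
            if name != read_name(rec2[0]):
                raise ValueError(f"mate names differ: {rec1[0]} vs {rec2[0]}")
            if name in host_names:
                removed += 1
                continue
            o1.write("\n".join(rec1) + "\n")
            o2.write("\n".join(rec2) + "\n")
            kept += 1
    return kept, removed
```

Because both mates are dropped together, the two output files stay in sync, which paired-end assemblers generally expect.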

ADD REPLYlink written 4.4 years ago by Manvendra Singh2.0k