Filter out specific reads from FASTQ files
9.4 years ago
Paul ★ 1.5k

Dear all,

I have paired-end RNA-seq data (Illumina) from a parasite, and I would like to do de novo assembly with Trinity. I have a reference genome for the host organism, so I can map my data to the host and remove the contaminating reads from the FASTQ files.

My plan is:

  1. Map my paired-end FASTQ files to the host reference genome with bwa/bowtie/novoalign
  2. Remove the hits from the FASTQ files (cleaning out contamination)
  3. Use Trinity on the remaining reads for de novo transcript assembly

My question is:

Can I use DNA aligners (bwa etc.) to align the raw FASTQ files to the host DNA and then remove the contaminants from the FASTQ files? I ask because my data come from an RNA-seq project, NOT DNA.

How can I remove the sequences that align to the host DNA from the raw FASTQ files (the cleaning step)?

Or, if you have any other advice on how to prepare data for the Trinity pipeline, I would appreciate it.

Thank you so much for any comments and for sharing your experience.

De-Novo FASTQ filtering Illumina RNA-Seq • 6.5k views
9.4 years ago

If you have RNA-seq data, you'd be better off sticking with an aligner intended for spliced alignments (e.g. STAR). Most of these have an option to write unmapped reads/pairs to new FASTQ file(s), which you could then feed to Trinity or any other assembler (i.e., step #2 will be done for you). I don't have any advice on good assemblers; hopefully others will chime in with feedback there.
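As a rough illustration of the "unmapped reads to a new FASTQ" idea, here is a minimal Python sketch that only builds a STAR command line. It assumes a host index has already been generated; the function names, the `host_index` directory, the sample file names and the thread count are all made up for the example. The `--outReadsUnmapped Fastx` option is the one that asks STAR to write unmapped reads to `Unmapped.out.mate1` / `Unmapped.out.mate2` next to the other outputs; check the manual of your STAR version before relying on the details.

```python
import shlex


def star_host_filter_cmd(genome_dir, fastq_1, fastq_2, out_prefix, threads=8):
    """Build (but do not run) a STAR command that aligns read pairs to the
    host genome and writes unmapped reads to separate FASTQ files.

    All argument values are placeholders chosen by the caller.
    """
    cmd = [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", str(genome_dir),
        "--readFilesIn", str(fastq_1), str(fastq_2),
        "--outFileNamePrefix", str(out_prefix),
        "--outSAMtype", "BAM", "Unsorted",
        # Ask STAR to keep the reads it could not map to the host.
        "--outReadsUnmapped", "Fastx",
    ]
    # Gzipped input needs a decompression command.
    if str(fastq_1).endswith(".gz"):
        cmd += ["--readFilesCommand", "zcat"]
    return cmd


def unmapped_outputs(out_prefix):
    """File names STAR is expected to use for the unmapped mates."""
    return (f"{out_prefix}Unmapped.out.mate1", f"{out_prefix}Unmapped.out.mate2")


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    cmd = star_host_filter_cmd("host_index", "parasite_1.fq.gz",
                               "parasite_2.fq.gz", "host_filter/sampleA_")
    print(shlex.join(cmd))
    print(unmapped_outputs("host_filter/sampleA_"))
```

The command can be passed to `subprocess.run(cmd, check=True)` once the paths are real. The two unmapped files are plain FASTQ despite the `mate1`/`mate2` names, so they may need renaming before going into Trinity.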


Thank you Devon, I will try STAR, maybe TopHat, and I'll see how it works.

9.4 years ago
Manvendra Singh ★ 2.2k

I agree with Devon.

I would do it in the following way:

  1. Map the FASTQ files with tophat2.
  2. Convert the unmapped.bam file to FASTQ (bamToFastq) and remap with tophat2, this time providing the junctions you got from the first run with the -j option (if you have replicates, merge their junctions first).
  3. Convert the unmapped.bam from this second run to FASTQ.

I think this is the FASTQ containing the reads you are looking for.
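The "merge the junctions" part for replicates can be sketched in Python as below. This is only an illustration and assumes each replicate's junctions are already in a tab-delimited, four-column form (chromosome, left position, right position, strand), which as far as I know is what the -j option expects; TopHat's junctions.bed output would need converting to that form first. The function and file names are invented for the example.

```python
def merge_junctions(junction_files):
    """Return the sorted union of junctions from several replicate files.

    Each input line is assumed to be: chrom <tab> left <tab> right <tab> strand.
    Blank lines and lines starting with '#' are skipped.
    """
    merged = set()
    for path in junction_files:
        with open(path) as handle:
            for line in handle:
                line = line.strip()
                if not line or line.startswith("#"):
                    continue
                chrom, left, right, strand = line.split("\t")[:4]
                # Store positions as integers so sorting is numeric.
                merged.add((chrom, int(left), int(right), strand))
    return sorted(merged)


def write_junctions(junctions, out_path):
    """Write junction tuples back out in the same four-column layout."""
    with open(out_path, "w") as out:
        for chrom, left, right, strand in junctions:
            out.write(f"{chrom}\t{left}\t{right}\t{strand}\n")
```

For example, `write_junctions(merge_junctions(["rep1.juncs", "rep2.juncs"]), "merged.juncs")` (hypothetical file names) would give a single file to pass to the second tophat2 run.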


Thank you Manu for your comment. Why do you recommend mapping twice? I would appreciate a deeper explanation.


In the second round of mapping, you provide all the junctions found in your RNA-seq data.

I have noticed that >5% of the previously unaligned reads get aligned to the genome by doing so.

You then have more robust sets of mapped and unmapped reads, which you can follow up on.
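To check how much the second round helps on your own data, one option is to compare the number of unmapped reads after each round. The sketch below is just that, a sketch: it assumes standard four-line FASTQ records (optionally gzipped), and the function names are invented for the example.

```python
import gzip


def count_fastq_reads(path):
    """Count records in a FASTQ file, assuming four lines per record."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4:
        raise ValueError(f"{path}: line count {n_lines} is not a multiple of 4")
    return n_lines // 4


def rescued_fraction(unmapped_round1, unmapped_round2):
    """Fraction of round-1 unmapped reads that were aligned in round 2."""
    before = count_fastq_reads(unmapped_round1)
    after = count_fastq_reads(unmapped_round2)
    if before == 0:
        return 0.0
    return (before - after) / before
```

Calling `rescued_fraction` on the FASTQ files converted from the first and second unmapped.bam gives the share of reads picked up by the junction-aware remapping.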
