How To Look For Known Fusion In Fastq File
10.8 years ago
Angel ▴ 220


I have an internal dataset for the NCI-H660 cell line with 8M mapped pairs (HiSeq, 50 bp paired-end) and an external dataset with 4M mapped pairs (50 bp paired-end, generated on a GAII).

Questions:

  1. I observe the TMPRSS2-ERG fusion in the external dataset but not in the internal HiSeq data. What could be the reasons? I ran TopHat2 in fusion mode with the same parameters for both datasets.

  2. How can I investigate the FASTQ file to see whether this fusion is present? The sequence of the TMPRSS2-ERG fusion is as mentioned here:

  3. Does this mean we need to generate more data internally to find the same fusion? I use the following thresholds, which are the minimum possible:

tophat-fusion-post -p $np --skip-read-dist --num-fusion-reads 1 --num-fusion-pairs 1 --num-fusion-both 2 $index

Any help will be greatly appreciated!! Thanks.

fusion fastq • 3.4k views
10.8 years ago

Use grep to search your fastq for a specific sequence.

Something like

grep -A 2 -B 1 GGAATAACCTGCCGCG myfastq.fastq > junctions.fastq

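-A 2 means "print the 2 lines after each line that matches the sequence". -B 1 means "print the one line before each matching line". Together they give you the full 4 lines of each FASTQ entry. Be aware that grep also prints a "--" separator line between non-adjacent groups of matches, so strip those lines before treating junctions.fastq as valid FASTQ (GNU grep can suppress them with --no-group-separator). If you don't need the full records, you can omit both options. Check the reverse complement of the sequence too (CGCGGCAGGTTATTCC), since a read can come from either strand.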
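As a rough sketch of the same search done in both orientations at once, the shell functions below pull whole FASTQ records whose sequence line contains the junction or its reverse complement. The function names (revcomp, extract_junction_reads) are made up for illustration, and the sketch assumes plain 4-line FASTQ records (no wrapped sequences) and that the `rev` utility is installed.

```shell
# Reverse-complement a DNA string (assumes the `rev` utility is available).
revcomp() {
    printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

# Hypothetical helper: read FASTQ on stdin and print every 4-line record
# whose sequence line contains the junction in either orientation.
# Assumes each record is exactly 4 lines.
extract_junction_reads() {
    jn_fwd=$1
    jn_rc=$(revcomp "$jn_fwd")
    awk -v f="$jn_fwd" -v r="$jn_rc" '
        { rec[NR % 4] = $0 }          # buffer the current record line
        NR % 4 == 0 {                 # 4th line: record is complete
            if (index(rec[2], f) || index(rec[2], r)) {
                print rec[1]; print rec[2]; print rec[3]; print rec[0]
            }
        }'
}

# Usage (file names are placeholders):
#   extract_junction_reads GGAATAACCTGCCGCG < myfastq.fastq > junctions.fastq
#   gzip -dc myfastq.fastq.gz | extract_junction_reads GGAATAACCTGCCGCG > junctions.fastq
```

Because only the sequence line of each record is tested and records are printed whole, the output contains no "--" separators and quality strings cannot produce accidental matches.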

If your fastq is gzipped, use zgrep instead of grep. If you have a .bam file, search it like this:

samtools view mybam.bam | grep GGAATAACCTGCCGCG - > junctions.sam

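samtools view reads the .bam, converts it to plain-text SAM, and feeds that one line at a time to grep, which writes only the lines that contain your sequence to junctions.sam. Reads aligned to the reverse strand are stored reverse-complemented in the BAM, so search for the reverse complement of the junction here as well.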
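A slightly stricter variant of the BAM search, sketched below, matches only the SEQ field (column 10 of each SAM line) and checks both orientations in one pass. The function names are hypothetical, and the sketch again assumes `rev` is installed; it reads SAM text on stdin so it can sit at the end of a `samtools view` pipe.

```shell
# Reverse-complement a DNA string (assumes the `rev` utility is available).
revcomp() {
    printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

# Hypothetical helper: read SAM text on stdin and print alignment lines
# whose SEQ field (column 10) contains the junction in either orientation.
# Header lines (starting with @) are skipped.
filter_sam_by_junction() {
    jn_fwd=$1
    jn_rc=$(revcomp "$jn_fwd")
    awk -F '\t' -v f="$jn_fwd" -v r="$jn_rc" \
        '!/^@/ && (index($10, f) || index($10, r))'
}

# Usage (file names are placeholders):
#   samtools view mybam.bam | filter_sam_by_junction GGAATAACCTGCCGCG > junctions.sam
```

Restricting the match to column 10 means read names and optional tags cannot match by accident. Piping the result through `wc -l` gives a quick count of supporting reads to compare between the two datasets.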

