Question

BAM -> FASTQ Conversion of CCLE data for STAR-Fusion. Filtering Steps?

4.6 years ago

denis.k ▴ 20

Hey everyone,

I'm pretty new to RNA sequencing and was wondering if anyone could help me out. I am trying to run a variety of SV callers (STAR-Fusion, etc.) on data from the CCLE (https://portal.gdc.cancer.gov/legacy-archive).

Most SV callers require .fastq files, but all the data I have downloaded is in BAM format. Here are some more details:

First, the BAM files are coordinate-sorted. After realizing that they need to be sorted by name for the paired fastq files to be created correctly, I sorted all files by name.

I am using Samtools 1.9.

samtools sort -n -o outfile_sorted.bam infile.bam

Then:

samtools fastq -1 outfile_sorted_1.fastq.gz -2 outfile_sorted_2.fastq.gz outfile_sorted.bam

Is this process sufficient to feed the .fastq reads into the SV caller? I figured that if I filtered out any non-primary reads, the reads corresponding to fusions would also be filtered out. I'm seeing a LOT of duplicated sequences in my QC reports, but I figured that wasn't a problem. I just want to make sure that I'm not keeping a bunch of artifacts in my .fastq files and potentially making my whole project useless.
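For reference, here is a minimal sketch of one sanity check that could be run on the two output files: confirming that both list the same read names in the same order, which is what a paired-end tool expects. The function names `read_names` and `check_pairs` are hypothetical helpers, not part of samtools, and the sketch assumes gzip-compressed FASTQ with four lines per record and read names that may carry a trailing /1 or /2.

```shell
# Hypothetical helper: print the read name of every record in a gzipped FASTQ.
# Assumes 4 lines per record; drops the leading "@", anything after the first
# whitespace, and a trailing /1 or /2 mate suffix.
read_names() {
    gzip -dc "$1" | awk 'NR % 4 == 1 {
        name = $1
        sub(/^@/, "", name)
        sub(/\/[12]$/, "", name)
        print name
    }'
}

# Hypothetical helper: print "paired" if both files list the same read names
# in the same order, otherwise print "mismatch".
check_pairs() {
    tmp1=$(mktemp) || return 2
    tmp2=$(mktemp) || { rm -f "$tmp1"; return 2; }
    read_names "$1" > "$tmp1"
    read_names "$2" > "$tmp2"
    if cmp -s "$tmp1" "$tmp2"; then
        result="paired"
    else
        result="mismatch"
    fi
    rm -f "$tmp1" "$tmp2"
    echo "$result"
}
```

With the file names above, this would be called as `check_pairs outfile_sorted_1.fastq.gz outfile_sorted_2.fastq.gz`. It only checks mate order and says nothing about duplicates or artifacts.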

sequence gene RNA • 1.2k views
