Question: Extract Fastq From Large Bam File
1
gravatar for michealsmith
6.6 years ago by
michealsmith740
michealsmith740 wrote:

I have some large bam files (~100GB), from which I would like to extract fastq sequences. Of course the most straightforward way is to directly apply softwares like bam2fastq. However, in order to speed up, I tried to split the large bam files (sorted according to coordinate but not read name) chromosome by chromosome, and would extract based on each bam file.

Then I realized, this will miss those read paires which map to different chromosomes. So I'm just confused any optimized way to split bam files first and then extract for each of bam, so that to improve speed? thx

Edit: Is there any bam split tools, which can evenly split bam into, say 10 parts?

To better explain my question: If I split my bam file by chromosome, say extract chr1 alignments from test.bam, which is named: test_1.bam

$ /share/bin/samtools-0.1.16/samtools view test_1.bam |wc -l
131168

Then if I tried to extract fastq from test_1.bam

$ bam2fastq -o test_%#_sequence.txt test_1.bam -f 
This looks like paired data from lane 1.
Output will be in test_1_1_sequence.txt and test_1_2_sequence.txt
131168 sequences in the BAM file
131168 sequences exported
WARNING: 3348 reads could not be matched to a mate and were not exported

Obviously, 3348 reads are NOT extracted, this is because test_1.bam only include any read mapping to chr1; and for those unmapped reads, it's highly likely that only one end maps to chr1. And bam2fastq will only extract paired-end reads, exluding those orphans.

So I would say, splitting BAM based on name/# of reads seem the only way out. I may need to sort bam according to read name.

fastq • 4.8k views
ADD COMMENTlink modified 6.6 years ago • written 6.6 years ago by michealsmith740
1
gravatar for swbarnes2
6.6 years ago by
swbarnes24.5k
United States
swbarnes24.5k wrote:

Here's something I found on seqanswers about spliting a bam file by number of entries. If you sort your .bam by read name, which Picard can do, then you can get the fastq with pairs next to each other.

http://seqanswers.com/forums/showthread.php?t=16615

ADD COMMENTlink written 6.6 years ago by swbarnes24.5k

but I know, thanks. But how can I quickly split my bam already sorted by read name?

ADD REPLYlink written 6.6 years ago by michealsmith740
0
gravatar for Sukhdeep Singh
6.6 years ago by
Sukhdeep Singh9.5k
Netherlands
Sukhdeep Singh9.5k wrote:

I think better would be to split it according to chromosomes. Check this tool, its a python script. It indexes, sorts and splits your bamfile in to chromosomes.

Also, check this question, they talk here about splitting bam files. After splitting, them you can use bam2fastq, isn't it!!!!

Cheers

ADD COMMENTlink modified 6.6 years ago • written 6.6 years ago by Sukhdeep Singh9.5k

could you please edit the first link - seem to be wrong

ADD REPLYlink written 6.6 years ago by Istvan Albert ♦♦ 78k

Done. Thanks!!!

ADD REPLYlink written 6.6 years ago by Sukhdeep Singh9.5k

But if split according to chromosome, those read pairs mapped to two different chromosomes will be distributed into different files. Then how can I reconstruct the paired-end reads?

ADD REPLYlink written 6.6 years ago by michealsmith740

Hi Gerry, Sorry I have no idea, how this will happen. According to Illumina Paired end note, the paired end is reading both the forward and reverse template strands of each cluster during one paired-end read. So, why they will map to different chromosomes, unless they are non-unique (MAPQ <1). Also, If you say they will be in different files, why you care, wont you be merging the fastq files back. Check this ques, its about identifying the mate pairs in different files.

ADD REPLYlink written 6.6 years ago by Sukhdeep Singh9.5k
0
gravatar for michealsmith
6.6 years ago by
michealsmith740
michealsmith740 wrote:

Hi Sukhdeep, look at below:

HWI-ST195_0162:8:26:12608:92688#NANCGN  145 18  85  23  10M1I89M    5   64409   0   CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTTAACCCTAACCCTTAACCCTAACCCTAACCCTAACC    ########################BAE?:AEEEBE@A?C;=>CDD@B>ABC;<7;>F?D<DBC??@>BDD;BEFGGBFGEGHHHHHGIGFIEGGGFGEHH    XT:A:U  NM:i:1  SM:i:23 AM:i:0  X0:i:1  X1:i:2  XM:i:0  XO:i:1  XG:i:1  MD:Z:99

For example, for the read-pair above, one end map to chr18, while the other map to chr5 on the reference genome. And this is not unusual to see, due to either translocation or segmental duplication.

I'll better explain my problems by editting the post soon.

ADD COMMENTlink written 6.6 years ago by michealsmith740
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1344 users visited in the last hour