Split paired end fastq per lane from a bam file
3
1
Entering edit mode
3.9 years ago
MAPK ★ 2.1k

I have a bam file that I want to split into paired end fastq files per lane. Can somone please suggest me how it can be done?

fastq bam • 2.6k views
ADD COMMENT
1
Entering edit mode

You are going to need to write your own code to do this. You could either convert BAM into fastq and then sort or sort them as you convert.

ADD REPLY
1
Entering edit mode

The lane-splitting is the custom part, for the rest just run samtools collate piped into samtools fastq. I would probably run samtools fastq to get one fastq (not splitted by R1/R2, so an interleaved file) and then scan the headers for the lane information, writing the first entry of each pair (so the one that comes first) to lane_1.fastq.gz and the second per pair to lane_2.fastq.gz.

ADD REPLY
0
Entering edit mode

@ATpoint Do I need to sort the reads first ? Should I just do samtools collate input.bam -O | samtools fastq -0?

ADD REPLY
1
Entering edit mode

Yes. You should collate or name-sort reads before conversion to fastq.

ADD REPLY
1
Entering edit mode

You should group by read name, it does not matter whether you sort or collate, the effective result is the same. Reason is that some aligners expect randomly-ordered fastq files due to the way they infer the most likely TLEN (afaik).

ADD REPLY
0
Entering edit mode

@ATpoint Not sure why I am getting this:

samtools collate -uO LP6005117-DNA_G04.bam  | samtools fastq
No input file specified.
Usage: samtools fastq [options...] <in.bam>
ADD REPLY
1
Entering edit mode

samtools collate -uO LP6005117-DNA_G04.bam | samtools fastq -

ADD REPLY
0
Entering edit mode

@ATpoint Now that I have interleaved.fastq file, how do I split them into individual lane fastqs?

@HS2000-1015_160:5:1114:14809:86636/2
GTTTGGAAAACAGAAGGGATTCGTCTGGAAAGGGCGGGAAACAAAGGACTGGAAGGGAGTGATGGAAAAAAACATGATGAAACTGATCTGGAAAGAGCCA
+
@B@FFFFFHGHHHIIJJJHIJJJJIIJJJIJJJIJJIHFDDEDDBDDDDDBDDDDDDDD2?CDDDDDCCDDDDDDDDCCD@CDDDCDDEDDCDDDDC9<A
@HS2000-1015_160:5:2114:14081:69256/1
CACCCTGATCCAAGGTGGACCTCATATGCTATCCACCAATTCTTTCATTAATCTTTATCTGAGTTCCTATTACATATTTCAAAGAAACTGTCAAACAACT
+
CCCFFFFFHHHHHJJCGIIHIIJJJIJJJIIJJJJIJJ;FHIJIJJIGIGDIIJJJJJIJJIIFGIJIEIJJIJJIIIJJJHHFGHFFFFDFDEDEEDDD
@HS2000-1015_160:7:1108:13370:100570/1
GTGGGTCTTTGATCTTACAGGACAAATCCTTCAGAAAGATCAGGAAATCAAGGACCTAAAACAGAAGATAGCCGAAGTCATGGCCGTCATGCCCAGCATA
+
CCCFFFFFHHHFHGGIJIJDHEIIIGEHGGII>CFII<BG<DGCADDHBD>>BC?(8CGCCHIEC;@9@D47?BAD4,.(.6.;(3=8<A>@:<@AB9AA
@HS2000-1015_160:7:1108:13370:100570/2
CTTGACTGCCAGAGACGCTCCTTTGCAATGCCTTCCGGTAACCAAATTTTTGGGCACAACACACAGCTGGCCTTCATTTCTTCAGGGGCTGGTAAACAGA
+
@@@ADFFFHHHFD=EF@:GHIIFHH<ECHGF@DDBB:6@D60?F=888)8='--(=5@EAE5?'(..((.;?@>>A>3;@####################
@HS2000-1015_160:6:1306:14563:31825/1
TTTTAAACTTATGAACATCCAACAGCAACTAATGGTAAGTCAGAAAGCTTTTATTTTTAAGTATATGCATGTGTATGTGTGTGTTTTGTATGTGTGTGTC
+
CCCFFFFFHHGHHJEDHJIJJJJIIJJJJJJGGIICFHIDGHH@HGIIIGGHFHHG>GGDHEBFIJJJHGI@FCDEEEHHGHHEHHFFFEFFFEEEEEDD
@HS2000-1015_160:7:2308:13476:86835/2
ATGGTCCTAAGTAACGTGAGGCCCAGAGGGAGGAGACCTTCAGAGGAGGCCAGGATGGGGGTGATGGGGAACTTTTCCTGAAGCTGGAGTAGGGAAAGAT
+
@@CFDFFFHFHDHHGIGJGIJIJJEG@FHGCFGBF=DFGFHGG>GGGGHIHCHFCEHHFFD'8<ACDDDD38CCDDDDCDDACDCDC?B:>@C?>A?A>>
@HS2000-1015_160:5:2306:10070:71746/1
AGAAGACGAGCACAGTGAGGAAGGAGATGAGCAGGCAGGGGATGATGAGGTTGATGGTGTAGAACAAGGGCAGGCGCCGGATGTACAGCGAGTATGTGAT
+
CCCFFFFFHHHHHIICGGHHHAFGEFGIJJJIIJIIGIIGG;FGHEHHGI=CHCEHGHE?DFFDFECEDDDDDDDDDBDDBD@ACDAACDBD5<CDC>@>
@HS2000-1015_160:5:2306:10070:71746/2
GAACCTCAAGGACTATTGGGAGAGCGGCGAGTGGGCCATCATCAAAGCCCCAGGCTACAAACACGACATCAAGTACAACTGCTGCGAGGAGATCTACCCC
+
@CCFFFFDHGHFHIJJJJJJGGGIIJJIGHI@FHGIIGHHEFGHHFFFFFBCDDDDDCDDDDDDDD;@BDCCDACDD@>ACCDDDDBDB<BA?C@CC@BD
@HS2000-1015_160:6:2116:4077:79041/1
TTGGGAGGGACATATCTAGGGTGGACCAAGCGCTGGCCAGGACTCACCTGGGCTGAAGAGGTGAGCCCAGAGGAGCTGGTGGTCTTGAAGCAGCTCCGCC
+
CCCFFFFFHHHHHIJIIIJFGCHIIIJJJIJJIJIJJJJJJJJJJJJJJJGHHHFFDDFC=ACEDDDB=?BD@ABDBDDCCBDCCCCCCCCCDDCDDDD?
@HS2000-1015_160:6:2116:4077:79041/2
GGTCCCCGCCTACGCCCACTGGGTTGGTGCACCTGGTGGTGGTGGCCGCCAAGAAGCTGGTGAACCGCCTCCAAGTGGCTCCCAAGACGCAGCTGGATGA
+
CCCFFFFFHHFHHJHJJJJJJJJGGHJHIGAGIIJIFHJ;@F;CHHFHFDDDDDCDDCDD9CCCDDBDDBBDDCDACDD8@BD3>?BCDBDDDACCDC@>
@HS2000-1015_160:5:1110:16577:55703/2
AGAAGCTGAAGTACGGTGTAGGAATGTTTCTACCACGCATTTTCCCTGCACCCTCTCACATCATTCATGGGCCATTTGTAGACAGATGGGAAAGAGATGT
+
CC@FFFFFHGDDHGIG:C<FBHEFHIGGGIIJIIIJIFJJJJIIHIJIHHIIDHIIIJJJJJJJIHHGEHHGEFFFCEDECDDEE>@CDDCCDBCDCCDC
@HS2000-1015_160:5:2113:11446:94436/1
AATTGCTTGAACCCGGGAGGCAGAGGTTGCAGTGAGCCAAGATCGCACCACTGCAGTCCAGCCTGGGAAACCAGAGCAAAACTTGGTCTCAAAAAAAAAA
+
CCCFFFFFHHHHGHGFGFHJIBF1DC<@;)9D*099?F;8=(BFF5(;'.=CEHFBDC7.>;?C@BBB35>(<?<<A>AC?B>4+(+>A:(:@(2)5@B#
@HS2000-1015_160:5:2113:11446:94436/2
CGTCAGGGCCAACCCCGCCCCACCCTGACCCTACCTGGCACCCCTCACCTGTGGCCTGCCAGCACAGCCTCGCCCCTGCTGGCCAATGTGTCCCCCGTCA
+
?@@DA@DDFHH?DHI)<@@FHDBGGCHCBDH;DFA<)6.=7D;@CBCHD)).7@=>;?==AABC95<(5(5309@D########################
@HS2000-1015_160:5:1216:14609:76938/2
TCAAAAGGCCTCTGGAGAGATTTCTGGGTGTGGGAGGTGCTATTTTGGGGCTTGGCCAGTCATATGGAGATAAGCCTACAAGGTTGGGGACCTGGCAGAT
+
CCCFFFFFFHHGHJJHHCCAFHIGHHJJ:E?FHHHDG?FGGIIJJIJIIJIJJIJHGCEECC@CBEFECEC@C@@CCDDDDDD>CCBDDD@??CBDC72:
@HS2000-1015_160:6:2209:18284:44195/1
GCAGCCAGCGGACGTCCAGGAACCGGGATGCCTCCAGCAGTGAGGCGGTCAGCCTGCAGCATGGGATGGCTGTGGATCTTTGGGGCAGCCCTGGGGCAGT
+
CCCFFFFFHHHHHIJHIIJJJJJJIJJEHIJJGIJJJJGHCHEHGFDDDDDCDDDDDDCDDCDDDCCBDCDDBCDDDDDDDDDBBDBDDDDDBDDDDBDC
@HS2000-1015_160:6:2209:18284:44195/2
TAAAATGTCACAAAGCTGGAAACTCTTCCCTATCACAAACCAAAACTTAAAAGGACGTTACCTGGCTGGGTCTAAACTCCACATAACTCGCTTGCAGTTG
+
CCCFFFFEHHHGHJIIIJJIJJHIIJEHJJHIJJJIIJJIJIJJIJIIHJJIJGGHGHGIIHHIIIIHFH@DFFFDEEEECDDDCDDDDBDDBBDCDACC
ADD REPLY
0
Entering edit mode

You will need to read these files two records at a time, look at the read header for record 1

@HS2000-1015_160:6:2209:18284:44195/1

Decide what lane file to put it in based on the number in second field ( when separated by :).

Someone can provide an awk based solution but let me see if I can give you a bbmap based one.

ADD REPLY
3
Entering edit mode
3.9 years ago
GenoMax 145k

Using the example data you have above it can be demultiplexed into lane specific files using demuxbyname.sh from BBMap suite as follows. (use .gz extension to keep files zipped). Input here is interleaved (as in your example). If someone reaches this solution in future and have their inputs in two separate files then specify them with in1=R1 and in2=R2.

$ demuxbyname.sh -Xmx10g in=file.fq delimiter=: column=2 out=%_#.fq

Example output

$ more 5_1.fq 
@HS2000-1015_160:5:2306:10070:71746/1
AGAAGACGAGCACAGTGAGGAAGGAGATGAGCAGGCAGGGGATGATGAGGTTGATGGTGTAGAACAAGGGCAGGCGCCGGATGTACAGCGAGTATGTGAT
+
CCCFFFFFHHHHHIICGGHHHAFGEFGIJJJIIJIIGIIGG;FGHEHHGI=CHCEHGHE?DFFDFECEDDDDDDDDDBDDBD@ACDAACDBD5<CDC>@>
@HS2000-1015_160:5:2113:11446:94436/1
AATTGCTTGAACCCGGGAGGCAGAGGTTGCAGTGAGCCAAGATCGCACCACTGCAGTCCAGCCTGGGAAACCAGAGCAAAACTTGGTCTCAAAAAAAAAA
+
CCCFFFFFHHHHGHGFGFHJIBF1DC<@;)9D*099?F;8=(BFF5(;'.=CEHFBDC7.>;?C@BBB35>(<?<<A>AC?B>4+(+>A:(:@(2)5@B#

$ more 5_2.fq 
@HS2000-1015_160:5:2306:10070:71746/2
GAACCTCAAGGACTATTGGGAGAGCGGCGAGTGGGCCATCATCAAAGCCCCAGGCTACAAACACGACATCAAGTACAACTGCTGCGAGGAGATCTACCCC
+
@CCFFFFDHGHFHIJJJJJJGGGIIJJIGHI@FHGIIGHHEFGHHFFFFFBCDDDDDCDDDDDDDD;@BDCCDACDD@>ACCDDDDBDB<BA?C@CC@BD
@HS2000-1015_160:5:2113:11446:94436/2
CGTCAGGGCCAACCCCGCCCCACCCTGACCCTACCTGGCACCCCTCACCTGTGGCCTGCCAGCACAGCCTCGCCCCTGCTGGCCAATGTGTCCCCCGTCA
+
?@@DA@DDFHH?DHI)<@@FHDBGGCHCBDH;DFA<)6.=7D;@CBCHD)).7@=>;?==AABC95<(5(5309@D########################
ADD COMMENT
1
Entering edit mode

As usual bbmap suite does it all :)

ADD REPLY
0
Entering edit mode

I ended up using this demuxbyname.sh. I found it very fast- takes only 15 minutes to split 200G fastq. But before doing this, I did: samtools collate -uO ${INBAM} | samtools fastq - -@ ${THREADS} -N -1 ${FQ_OUT1} -2 ${FQ_OUT2} -0 /dev/null -s /dev/null -n . It would be nice if they had an option to keep both column1 and column 2 as base name of splitted fastq. Anyway, thank you for your help.

ADD REPLY
1
Entering edit mode
3.9 years ago
GenoMax 145k

This could also be done starting with BAM directly as follows (not tested, MAPK if you can test would be great). Make sure BAM is name sorted (or could be sorted on fly before reformat.sh step).

reformat.sh -Xmx20g in=your.BAM out=stdout.fq primaryonly=t | demuxbyname.sh -Xmx10g in=stdin.fq delimiter=: column=2 out=%_#.fq
ADD COMMENT
0
Entering edit mode
3.8 years ago
MAPK ★ 2.1k

If someone wants awk solution, I also have written this. This also creates RG files.

#!/bin/bash
SM=SAMPLE
DNA=BARCODE
PR=PROJECT
FULLSM="SM^DNA^PR"
### splitfq1.sh
set -x
echo "Splitting lanes from FASTQ: ${FQ_OUT1}"
awk -v sm="$SM" -v dna="$DNA" -v pr="$PR" -v fullsm="$FULLSM" -v out_dir="${OUT_DIR}" -v tab="\\\\t" '
BEGIN{FS = ":"}
/^@HS2000/{
  arr=$1
  sub(/^@+/,"",arr)
  lane=$2
  close(outputFile)
  outputFile=out_dir"/"fullsm"."arr"."lane".r1.fastq"
  outputFile2=out_dir"/"fullsm"."arr"."lane".rgfile"
  print "@RG" tab "ID:"arr"."lane tab "DS:"fullsm tab "LB:"dna tab "PL:illumina" tab "PU:"arr"."lane tab "SM:"sm > (outputFile2)
  close(outputFile2)
}
{
  print >> (outputFile)
}' ${FQ_OUT1}

ls $OUT_DIR/*.rgfile | sort -V | xargs cat > "$OUT_DIR/${FULLSM}.RGFILES.txt"



## splitfq2.sh

!/bin/bash
set -x

# FQ2
echo "Splitting lanes from FASTQ: ${FQ_OUT2}"
awk -v sm="$SM" -v dna="$DNA" -v pr="$PR" -v fullsm="$FULLSM" -v out_dir="${OUT_DIR}" -v tab="\\\\t" '
BEGIN{FS = ":"}
/^@HS2000/{
  arr=$1
  sub(/^@+/,"",arr)
  lane=$2
  close(outputFile)
  outputFile=out_dir"/"fullsm"."arr"."lane".r2.fastq"
}
{
  print >> (outputFile)
}' ${FQ_OUT2}
ADD COMMENT

Login before adding your answer.

Traffic: 1545 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6