Question: Split paired end fastq per lane from a bam file
0
gravatar for MAPK
3 months ago by
MAPK1.7k
MAPK1.7k wrote:

I have a bam file that I want to split into paired end fastq files per lane. Can somone please suggest me how it can be done?

bam fastq • 233 views
ADD COMMENTlink modified 3 months ago • written 3 months ago by MAPK1.7k
1

You are going to need to write your own code to do this. You could either convert BAM into fastq and then sort or sort them as you convert.

ADD REPLYlink written 3 months ago by GenoMax96k
1

The lane-splitting is the custom part, for the rest just run samtools collate piped into samtools fastq. I would probably run samtools fastq to get one fastq (not splitted by R1/R2, so an interleaved file) and then scan the headers for the lane information, writing the first entry of each pair (so the one that comes first) to lane_1.fastq.gz and the second per pair to lane_2.fastq.gz.

ADD REPLYlink modified 3 months ago • written 3 months ago by ATpoint46k

@ATpoint Do I need to sort the reads first ? Should I just do samtools collate input.bam -O | samtools fastq -0?

ADD REPLYlink modified 3 months ago • written 3 months ago by MAPK1.7k
1

Yes. You should collate or name-sort reads before conversion to fastq.

ADD REPLYlink modified 3 months ago • written 3 months ago by GenoMax96k
1

You should group by read name, it does not matter whether you sort or collate, the effective result is the same. Reason is that some aligners expect randomly-ordered fastq files due to the way they infer the most likely TLEN (afaik).

ADD REPLYlink modified 3 months ago • written 3 months ago by ATpoint46k

@ATpoint Not sure why I am getting this:

samtools collate -uO LP6005117-DNA_G04.bam  | samtools fastq
No input file specified.
Usage: samtools fastq [options...] <in.bam>
ADD REPLYlink modified 3 months ago by ATpoint46k • written 3 months ago by MAPK1.7k
1

samtools collate -uO LP6005117-DNA_G04.bam | samtools fastq -

ADD REPLYlink written 3 months ago by ATpoint46k

@ATpoint Now that I have interleaved.fastq file, how do I split them into individual lane fastqs?

@HS2000-1015_160:5:1114:14809:86636/2
GTTTGGAAAACAGAAGGGATTCGTCTGGAAAGGGCGGGAAACAAAGGACTGGAAGGGAGTGATGGAAAAAAACATGATGAAACTGATCTGGAAAGAGCCA
+
@B@FFFFFHGHHHIIJJJHIJJJJIIJJJIJJJIJJIHFDDEDDBDDDDDBDDDDDDDD2?CDDDDDCCDDDDDDDDCCD@CDDDCDDEDDCDDDDC9<A
@HS2000-1015_160:5:2114:14081:69256/1
CACCCTGATCCAAGGTGGACCTCATATGCTATCCACCAATTCTTTCATTAATCTTTATCTGAGTTCCTATTACATATTTCAAAGAAACTGTCAAACAACT
+
CCCFFFFFHHHHHJJCGIIHIIJJJIJJJIIJJJJIJJ;FHIJIJJIGIGDIIJJJJJIJJIIFGIJIEIJJIJJIIIJJJHHFGHFFFFDFDEDEEDDD
@HS2000-1015_160:7:1108:13370:100570/1
GTGGGTCTTTGATCTTACAGGACAAATCCTTCAGAAAGATCAGGAAATCAAGGACCTAAAACAGAAGATAGCCGAAGTCATGGCCGTCATGCCCAGCATA
+
CCCFFFFFHHHFHGGIJIJDHEIIIGEHGGII>CFII<BG<DGCADDHBD>>BC?(8CGCCHIEC;@9@D47?BAD4,.(.6.;(3=8<A>@:<@AB9AA
@HS2000-1015_160:7:1108:13370:100570/2
CTTGACTGCCAGAGACGCTCCTTTGCAATGCCTTCCGGTAACCAAATTTTTGGGCACAACACACAGCTGGCCTTCATTTCTTCAGGGGCTGGTAAACAGA
+
@@@ADFFFHHHFD=EF@:GHIIFHH<ECHGF@DDBB:6@D60?F=888)8='--(=5@EAE5?'(..((.;?@>>A>3;@####################
@HS2000-1015_160:6:1306:14563:31825/1
TTTTAAACTTATGAACATCCAACAGCAACTAATGGTAAGTCAGAAAGCTTTTATTTTTAAGTATATGCATGTGTATGTGTGTGTTTTGTATGTGTGTGTC
+
CCCFFFFFHHGHHJEDHJIJJJJIIJJJJJJGGIICFHIDGHH@HGIIIGGHFHHG>GGDHEBFIJJJHGI@FCDEEEHHGHHEHHFFFEFFFEEEEEDD
@HS2000-1015_160:7:2308:13476:86835/2
ATGGTCCTAAGTAACGTGAGGCCCAGAGGGAGGAGACCTTCAGAGGAGGCCAGGATGGGGGTGATGGGGAACTTTTCCTGAAGCTGGAGTAGGGAAAGAT
+
@@CFDFFFHFHDHHGIGJGIJIJJEG@FHGCFGBF=DFGFHGG>GGGGHIHCHFCEHHFFD'8<ACDDDD38CCDDDDCDDACDCDC?B:>@C?>A?A>>
@HS2000-1015_160:5:2306:10070:71746/1
AGAAGACGAGCACAGTGAGGAAGGAGATGAGCAGGCAGGGGATGATGAGGTTGATGGTGTAGAACAAGGGCAGGCGCCGGATGTACAGCGAGTATGTGAT
+
CCCFFFFFHHHHHIICGGHHHAFGEFGIJJJIIJIIGIIGG;FGHEHHGI=CHCEHGHE?DFFDFECEDDDDDDDDDBDDBD@ACDAACDBD5<CDC>@>
@HS2000-1015_160:5:2306:10070:71746/2
GAACCTCAAGGACTATTGGGAGAGCGGCGAGTGGGCCATCATCAAAGCCCCAGGCTACAAACACGACATCAAGTACAACTGCTGCGAGGAGATCTACCCC
+
@CCFFFFDHGHFHIJJJJJJGGGIIJJIGHI@FHGIIGHHEFGHHFFFFFBCDDDDDCDDDDDDDD;@BDCCDACDD@>ACCDDDDBDB<BA?C@CC@BD
@HS2000-1015_160:6:2116:4077:79041/1
TTGGGAGGGACATATCTAGGGTGGACCAAGCGCTGGCCAGGACTCACCTGGGCTGAAGAGGTGAGCCCAGAGGAGCTGGTGGTCTTGAAGCAGCTCCGCC
+
CCCFFFFFHHHHHIJIIIJFGCHIIIJJJIJJIJIJJJJJJJJJJJJJJJGHHHFFDDFC=ACEDDDB=?BD@ABDBDDCCBDCCCCCCCCCDDCDDDD?
@HS2000-1015_160:6:2116:4077:79041/2
GGTCCCCGCCTACGCCCACTGGGTTGGTGCACCTGGTGGTGGTGGCCGCCAAGAAGCTGGTGAACCGCCTCCAAGTGGCTCCCAAGACGCAGCTGGATGA
+
CCCFFFFFHHFHHJHJJJJJJJJGGHJHIGAGIIJIFHJ;@F;CHHFHFDDDDDCDDCDD9CCCDDBDDBBDDCDACDD8@BD3>?BCDBDDDACCDC@>
@HS2000-1015_160:5:1110:16577:55703/2
AGAAGCTGAAGTACGGTGTAGGAATGTTTCTACCACGCATTTTCCCTGCACCCTCTCACATCATTCATGGGCCATTTGTAGACAGATGGGAAAGAGATGT
+
CC@FFFFFHGDDHGIG:C<FBHEFHIGGGIIJIIIJIFJJJJIIHIJIHHIIDHIIIJJJJJJJIHHGEHHGEFFFCEDECDDEE>@CDDCCDBCDCCDC
@HS2000-1015_160:5:2113:11446:94436/1
AATTGCTTGAACCCGGGAGGCAGAGGTTGCAGTGAGCCAAGATCGCACCACTGCAGTCCAGCCTGGGAAACCAGAGCAAAACTTGGTCTCAAAAAAAAAA
+
CCCFFFFFHHHHGHGFGFHJIBF1DC<@;)9D*099?F;8=(BFF5(;'.=CEHFBDC7.>;?C@BBB35>(<?<<A>AC?B>4+(+>A:(:@(2)5@B#
@HS2000-1015_160:5:2113:11446:94436/2
CGTCAGGGCCAACCCCGCCCCACCCTGACCCTACCTGGCACCCCTCACCTGTGGCCTGCCAGCACAGCCTCGCCCCTGCTGGCCAATGTGTCCCCCGTCA
+
?@@DA@DDFHH?DHI)<@@FHDBGGCHCBDH;DFA<)6.=7D;@CBCHD)).7@=>;?==AABC95<(5(5309@D########################
@HS2000-1015_160:5:1216:14609:76938/2
TCAAAAGGCCTCTGGAGAGATTTCTGGGTGTGGGAGGTGCTATTTTGGGGCTTGGCCAGTCATATGGAGATAAGCCTACAAGGTTGGGGACCTGGCAGAT
+
CCCFFFFFFHHGHJJHHCCAFHIGHHJJ:E?FHHHDG?FGGIIJJIJIIJIJJIJHGCEECC@CBEFECEC@C@@CCDDDDDD>CCBDDD@??CBDC72:
@HS2000-1015_160:6:2209:18284:44195/1
GCAGCCAGCGGACGTCCAGGAACCGGGATGCCTCCAGCAGTGAGGCGGTCAGCCTGCAGCATGGGATGGCTGTGGATCTTTGGGGCAGCCCTGGGGCAGT
+
CCCFFFFFHHHHHIJHIIJJJJJJIJJEHIJJGIJJJJGHCHEHGFDDDDDCDDDDDDCDDCDDDCCBDCDDBCDDDDDDDDDBBDBDDDDDBDDDDBDC
@HS2000-1015_160:6:2209:18284:44195/2
TAAAATGTCACAAAGCTGGAAACTCTTCCCTATCACAAACCAAAACTTAAAAGGACGTTACCTGGCTGGGTCTAAACTCCACATAACTCGCTTGCAGTTG
+
CCCFFFFEHHHGHJIIIJJIJJHIIJEHJJHIJJJIIJJIJIJJIJIIHJJIJGGHGHGIIHHIIIIHFH@DFFFDEEEECDDDCDDDDBDDBBDCDACC
ADD REPLYlink modified 3 months ago • written 3 months ago by MAPK1.7k

You will need to read these files two records at a time, look at the read header for record 1

@HS2000-1015_160:6:2209:18284:44195/1

Decide what lane file to put it in based on the number in second field ( when separated by :).

Someone can provide an awk based solution but let me see if I can give you a bbmap based one.

ADD REPLYlink written 3 months ago by GenoMax96k
3
gravatar for GenoMax
3 months ago by
GenoMax96k
United States
GenoMax96k wrote:

Using the example data you have above it can be demultiplexed into lane specific files using demuxbyname.sh from BBMap suite as follows. (use .gz extension to keep files zipped). Input here is interleaved (as in your example). If someone reaches this solution in future and have their inputs in two separate files then specify them with in1=R1 and in2=R2.

$ demuxbyname.sh -Xmx10g in=file.fq delimiter=: column=2 out=%_#.fq

Example output

$ more 5_1.fq 
@HS2000-1015_160:5:2306:10070:71746/1
AGAAGACGAGCACAGTGAGGAAGGAGATGAGCAGGCAGGGGATGATGAGGTTGATGGTGTAGAACAAGGGCAGGCGCCGGATGTACAGCGAGTATGTGAT
+
CCCFFFFFHHHHHIICGGHHHAFGEFGIJJJIIJIIGIIGG;FGHEHHGI=CHCEHGHE?DFFDFECEDDDDDDDDDBDDBD@ACDAACDBD5<CDC>@>
@HS2000-1015_160:5:2113:11446:94436/1
AATTGCTTGAACCCGGGAGGCAGAGGTTGCAGTGAGCCAAGATCGCACCACTGCAGTCCAGCCTGGGAAACCAGAGCAAAACTTGGTCTCAAAAAAAAAA
+
CCCFFFFFHHHHGHGFGFHJIBF1DC<@;)9D*099?F;8=(BFF5(;'.=CEHFBDC7.>;?C@BBB35>(<?<<A>AC?B>4+(+>A:(:@(2)5@B#

$ more 5_2.fq 
@HS2000-1015_160:5:2306:10070:71746/2
GAACCTCAAGGACTATTGGGAGAGCGGCGAGTGGGCCATCATCAAAGCCCCAGGCTACAAACACGACATCAAGTACAACTGCTGCGAGGAGATCTACCCC
+
@CCFFFFDHGHFHIJJJJJJGGGIIJJIGHI@FHGIIGHHEFGHHFFFFFBCDDDDDCDDDDDDDD;@BDCCDACDD@>ACCDDDDBDB<BA?C@CC@BD
@HS2000-1015_160:5:2113:11446:94436/2
CGTCAGGGCCAACCCCGCCCCACCCTGACCCTACCTGGCACCCCTCACCTGTGGCCTGCCAGCACAGCCTCGCCCCTGCTGGCCAATGTGTCCCCCGTCA
+
?@@DA@DDFHH?DHI)<@@FHDBGGCHCBDH;DFA<)6.=7D;@CBCHD)).7@=>;?==AABC95<(5(5309@D########################
ADD COMMENTlink modified 3 months ago • written 3 months ago by GenoMax96k
1

As usual bbmap suite does it all :)

ADD REPLYlink written 3 months ago by ATpoint46k

I ended up using this demuxbyname.sh. I found it very fast- takes only 15 minutes to split 200G fastq. But before doing this, I did: samtools collate -uO ${INBAM} | samtools fastq - -@ ${THREADS} -N -1 ${FQ_OUT1} -2 ${FQ_OUT2} -0 /dev/null -s /dev/null -n . It would be nice if they had an option to keep both column1 and column 2 as base name of splitted fastq. Anyway, thank you for your help.

ADD REPLYlink modified 3 months ago • written 3 months ago by MAPK1.7k
1
gravatar for GenoMax
3 months ago by
GenoMax96k
United States
GenoMax96k wrote:

This could also be done starting with BAM directly as follows (not tested, MAPK if you can test would be great). Make sure BAM is name sorted (or could be sorted on fly before reformat.sh step).

reformat.sh -Xmx20g in=your.BAM out=stdout.fq primaryonly=t | demuxbyname.sh -Xmx10g in=stdin.fq delimiter=: column=2 out=%_#.fq
ADD COMMENTlink modified 3 months ago • written 3 months ago by GenoMax96k
0
gravatar for MAPK
3 months ago by
MAPK1.7k
MAPK1.7k wrote:

If someone wants awk solution, I also have written this. This also creates RG files.

#!/bin/bash
SM=SAMPLE
DNA=BARCODE
PR=PROJECT
FULLSM="SM^DNA^PR"
### splitfq1.sh
set -x
echo "Splitting lanes from FASTQ: ${FQ_OUT1}"
awk -v sm="$SM" -v dna="$DNA" -v pr="$PR" -v fullsm="$FULLSM" -v out_dir="${OUT_DIR}" -v tab="\\\\t" '
BEGIN{FS = ":"}
/^@HS2000/{
  arr=$1
  sub(/^@+/,"",arr)
  lane=$2
  close(outputFile)
  outputFile=out_dir"/"fullsm"."arr"."lane".r1.fastq"
  outputFile2=out_dir"/"fullsm"."arr"."lane".rgfile"
  print "@RG" tab "ID:"arr"."lane tab "DS:"fullsm tab "LB:"dna tab "PL:illumina" tab "PU:"arr"."lane tab "SM:"sm > (outputFile2)
  close(outputFile2)
}
{
  print >> (outputFile)
}' ${FQ_OUT1}

ls $OUT_DIR/*.rgfile | sort -V | xargs cat > "$OUT_DIR/${FULLSM}.RGFILES.txt"



## splitfq2.sh

!/bin/bash
set -x

# FQ2
echo "Splitting lanes from FASTQ: ${FQ_OUT2}"
awk -v sm="$SM" -v dna="$DNA" -v pr="$PR" -v fullsm="$FULLSM" -v out_dir="${OUT_DIR}" -v tab="\\\\t" '
BEGIN{FS = ":"}
/^@HS2000/{
  arr=$1
  sub(/^@+/,"",arr)
  lane=$2
  close(outputFile)
  outputFile=out_dir"/"fullsm"."arr"."lane".r2.fastq"
}
{
  print >> (outputFile)
}' ${FQ_OUT2}
ADD COMMENTlink modified 3 months ago • written 3 months ago by MAPK1.7k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 970 users visited in the last hour
_