Question

How to convert mapping bam file to fastq without loseing the mapping information

0

Entering edit mode

3.1 years ago

btx288 • 0

Hi all,

I want to create my RNA mapping data into a library for further analysis. Now I have bowtie2 mapping data, which is in bam files, I now use bedtools to extract fastq mapping reads from those bam mapping files. (bedtools bamtofastq function) But it seems like the fastq data produced by bedtools would only contain the original sequence name, but not which , for example , chromosomes it mapped to, and where it is. Below is my bam files, I want to have the the mapped chromosomes and location information in the name of the fastq files, what I could do? Is there a simple way to do it? I have tried to do the replace with 'sed' command, but not very successful so far. Below I attached my bam files info and my attempt to replace names with sed command

sed files
```
while read p; do echo "$p"; sed -i -e $p mapped.fastq; done < replacename.txt
```
replacename.txt in each line, is in 's/original_sequence_name/replaced_mapping_information/g'

my bam files that need to convert to fastq

SRR1761528.1829103 16  CL1Contig15_2_sc_0.515159_l_155 337 1   50M *   0   0   TTGGAGTGTATTGGGTGCGTTCGTGGCAAAAAATCACTTCGTGATTCGCG  BFHFEECHEIIIHDIIIIHDIIIIIHDHIHIIHEG;GFHFHHDD<FF@@@  AS:i:-5 XS:i:-5 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:32C17  YT:Z:UU
SRR1761528.942684  16  CL34Contig23_2_sc_0.402716_l_155    34  0   12M1D38M    *   0   0   TTTGGACATATTGAATTTACTGGGTGCGTTCGTGGCAAAAACTCACTTCG  JJJJJJJJJJJJIJJJJJIJJJJJJJJIJJJJJJJJJHGHFHEFDFFCCB  AS:i:-26    XS:i:-26    XN:i:0  XM:i:3  XO:i:1  XG:i:1  NM:i:4  MD:Z:12^G2G1G2T30   YT:Z:UU

For example, I want the CL1Conitg15_2_sc_0.515159_l_155 to be included in the fastq sequence name, to form like SRR1761528.1829103_CL1Conitg15_2_sc_0.515159_l_155.

If anyone has a good idea?

bedtools bamtofastq samtools • 1.9k views

ADD COMMENT • link updated 3.1 years ago by GenoMax 145k • written 3.1 years ago by btx288 • 0

score 0 · Answer 1 · 2021-08-13

0

Entering edit mode

3.1 years ago

GenoMax 145k

Try:

$ cat test.file
SRR1761528.1829103 16  CL1Contig15_2_sc_0.515159_l_155 337 1   50M *   0   0   TTGGAGTGTATTGGGTGCGTTCGTGGCAAAAAATCACTTCGTGATTCGCG  BFHFEECHEIIIHDIIIIHDIIIIIHDHIHIIHEG;GFHFHHDD<FF@@@  AS:i:-5 XS:i:-5 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:32C17  YT:Z:UU
SRR1761528.942684  16  CL34Contig23_2_sc_0.402716_l_155    34  0   12M1D38M    *   0   0   TTTGGACATATTGAATTTACTGGGTGCGTTCGTGGCAAAAACTCACTTCG  JJJJJJJJJJJJIJJJJJIJJJJJJJJIJJJJJJJJJHGHFHEFDFFCCB  AS:i:-26    XS:i:-26    XN:i:0  XM:i:3  XO:i:1  XG:i:1  NM:i:4  MD:Z:12^G2G1G2T30   YT:Z:UU

$ cat test.file | awk -F " " '{print "@"$1"_"$3"\n"$10"\n""+""\n"$11}'
@SRR1761528.1829103_CL1Contig15_2_sc_0.515159_l_155
TTGGAGTGTATTGGGTGCGTTCGTGGCAAAAAATCACTTCGTGATTCGCG
+
BFHFEECHEIIIHDIIIIHDIIIIIHDHIHIIHEG;GFHFHHDD<FF@@@
@SRR1761528.942684_CL34Contig23_2_sc_0.402716_l_155
TTTGGACATATTGAATTTACTGGGTGCGTTCGTGGCAAAAACTCACTTCG
+
JJJJJJJJJJJJIJJJJJIJJJJJJJJIJJJJJJJJJHGHFHEFDFFCCB

You will have to change awk split field to "\t" (currently it is a space) with a proper SAM file. I think the example above is not properly formatted.

ADD COMMENT • link 3.1 years ago by GenoMax 145k

0

Entering edit mode

You should probably run this through a samtools view with appropriate filtering flags before to make sure all non-primary alignments are filtered out. Otherwise you might create kind of "duplicate" entries in the fastq file.

ADD REPLY • link 3.0 years ago by ATpoint 84k

0

Entering edit mode

Thanks for your suggestion. I have filtered out the mappings with flags first, then extract the reads to fastq. The code I used in the end is

$ samtools view test.bam | awk -F " " '{print "@"$1"_"$3"\n"$10"\n""+""\n"$11}' > output.fastq

Thanks very much!

ADD REPLY • link 3.0 years ago by btx288 • 0

0

Entering edit mode

Thank you very much! This is so helpful. Due to my data is in bam file. So I run it through samtools view and it works like magic.

The code is $ samtools view test.bam | awk -F " " '{print "@"$1"_"$3"\n"$10"\n""+""\n"$11}' > output.fastq

ADD REPLY • link 3.0 years ago by btx288 • 0

score 0 · Answer 2 · 2021-08-15

Just for fun, I just added this feature to Genozip (released as 12.0.30):

genozip myfile.bam
genocat myfile.bam.genozip --fastq=all -o myfile.fq

This will do what you need.

See here (last example): https://genozip.com/sam2fq.html

Paper: https://www.researchgate.net/publication/349347156_Genozip_-_A_Universal_Extensible_Genomic_Data_Compressor