How to convert mapping bam file to fastq without loseing the mapping information
2
0
Entering edit mode
3.2 years ago
btx288 • 0

Hi all,

I want to create my RNA mapping data into a library for further analysis. Now I have bowtie2 mapping data, which is in bam files, I now use bedtools to extract fastq mapping reads from those bam mapping files. (bedtools bamtofastq function) But it seems like the fastq data produced by bedtools would only contain the original sequence name, but not which , for example , chromosomes it mapped to, and where it is. Below is my bam files, I want to have the the mapped chromosomes and location information in the name of the fastq files, what I could do? Is there a simple way to do it? I have tried to do the replace with 'sed' command, but not very successful so far. Below I attached my bam files info and my attempt to replace names with sed command

  1. sed files

    while read p; do echo "$p"; sed -i -e $p mapped.fastq; done < replacename.txt
    

    replacename.txt in each line, is in 's/original_sequence_name/replaced_mapping_information/g'

  2. my bam files that need to convert to fastq

    SRR1761528.1829103 16  CL1Contig15_2_sc_0.515159_l_155 337 1   50M *   0   0   TTGGAGTGTATTGGGTGCGTTCGTGGCAAAAAATCACTTCGTGATTCGCG  BFHFEECHEIIIHDIIIIHDIIIIIHDHIHIIHEG;GFHFHHDD<FF@@@  AS:i:-5 XS:i:-5 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:32C17  YT:Z:UU
    SRR1761528.942684  16  CL34Contig23_2_sc_0.402716_l_155    34  0   12M1D38M    *   0   0   TTTGGACATATTGAATTTACTGGGTGCGTTCGTGGCAAAAACTCACTTCG  JJJJJJJJJJJJIJJJJJIJJJJJJJJIJJJJJJJJJHGHFHEFDFFCCB  AS:i:-26    XS:i:-26    XN:i:0  XM:i:3  XO:i:1  XG:i:1  NM:i:4  MD:Z:12^G2G1G2T30   YT:Z:UU
    

For example, I want the CL1Conitg15_2_sc_0.515159_l_155 to be included in the fastq sequence name, to form like SRR1761528.1829103_CL1Conitg15_2_sc_0.515159_l_155.

If anyone has a good idea?

bedtools bamtofastq samtools • 2.0k views
ADD COMMENT
0
Entering edit mode
3.2 years ago
GenoMax 146k

Try:

$ cat test.file
SRR1761528.1829103 16  CL1Contig15_2_sc_0.515159_l_155 337 1   50M *   0   0   TTGGAGTGTATTGGGTGCGTTCGTGGCAAAAAATCACTTCGTGATTCGCG  BFHFEECHEIIIHDIIIIHDIIIIIHDHIHIIHEG;GFHFHHDD<FF@@@  AS:i:-5 XS:i:-5 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:32C17  YT:Z:UU
SRR1761528.942684  16  CL34Contig23_2_sc_0.402716_l_155    34  0   12M1D38M    *   0   0   TTTGGACATATTGAATTTACTGGGTGCGTTCGTGGCAAAAACTCACTTCG  JJJJJJJJJJJJIJJJJJIJJJJJJJJIJJJJJJJJJHGHFHEFDFFCCB  AS:i:-26    XS:i:-26    XN:i:0  XM:i:3  XO:i:1  XG:i:1  NM:i:4  MD:Z:12^G2G1G2T30   YT:Z:UU

$ cat test.file | awk -F " " '{print "@"$1"_"$3"\n"$10"\n""+""\n"$11}'
@SRR1761528.1829103_CL1Contig15_2_sc_0.515159_l_155
TTGGAGTGTATTGGGTGCGTTCGTGGCAAAAAATCACTTCGTGATTCGCG
+
BFHFEECHEIIIHDIIIIHDIIIIIHDHIHIIHEG;GFHFHHDD<FF@@@
@SRR1761528.942684_CL34Contig23_2_sc_0.402716_l_155
TTTGGACATATTGAATTTACTGGGTGCGTTCGTGGCAAAAACTCACTTCG
+
JJJJJJJJJJJJIJJJJJIJJJJJJJJIJJJJJJJJJHGHFHEFDFFCCB

You will have to change awk split field to "\t" (currently it is a space) with a proper SAM file. I think the example above is not properly formatted.

ADD COMMENT
0
Entering edit mode

You should probably run this through a samtools view with appropriate filtering flags before to make sure all non-primary alignments are filtered out. Otherwise you might create kind of "duplicate" entries in the fastq file.

ADD REPLY
0
Entering edit mode

Thanks for your suggestion. I have filtered out the mappings with flags first, then extract the reads to fastq. The code I used in the end is

$ samtools view test.bam | awk -F " " '{print "@"$1"_"$3"\n"$10"\n""+""\n"$11}' > output.fastq

Thanks very much!

ADD REPLY
0
Entering edit mode

Thank you very much! This is so helpful. Due to my data is in bam file. So I run it through samtools view and it works like magic.

The code is $ samtools view test.bam | awk -F " " '{print "@"$1"_"$3"\n"$10"\n""+""\n"$11}' > output.fastq

ADD REPLY
0
Entering edit mode
3.2 years ago
Divon ▴ 230

Just for fun, I just added this feature to Genozip (released as 12.0.30):

genozip myfile.bam
genocat myfile.bam.genozip --fastq=all -o myfile.fq

This will do what you need.

See here (last example): https://genozip.com/sam2fq.html

Paper: https://www.researchgate.net/publication/349347156_Genozip_-_A_Universal_Extensible_Genomic_Data_Compressor

ADD COMMENT

Login before adding your answer.

Traffic: 1794 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6