Question

How to extract unmapped Paired end reads from BWA output sam file

2

Entering edit mode

7.4 years ago

Bioinfonext ▴ 460

I run BWA for mapping pair end raw reads to CDS transcripts. Now can suggest me how to get unmapped Right and Left reads in two separate file using samtool or sambanba cammand from the output sam file.

bwa sambamba • 19k views

ADD COMMENT • link updated 7.3 years ago by lakhujanivijay 5.8k • written 7.4 years ago by Bioinfonext ▴ 460

0

Entering edit mode

This is not a tool post, it's a question. When you have questions, please mark them as questions.

ADD REPLY • link 7.4 years ago by Brian Bushnell 20k

0

Entering edit mode

By right and left, do you mean R1 and R2? Plus and minus? Please be very specific about what you wish to accomplish. Also, the reason why is helpful; I'm not sure that what you are trying to do is useful.

ADD REPLY • link 7.4 years ago by Brian Bushnell 20k

0

Entering edit mode

Thanks, These are CDS transcripts from Draft genome. After getting unmaaped reads I want to de novo assembled them so that for raw read mapping I can use CDS transcript plus unique de novo assembled transcript as a reference.

Right means R1 and left means R2 reads.

ADD REPLY • link 7.4 years ago by Bioinfonext ▴ 460

1

Entering edit mode

What you want to do does not make sense to me. I suggest you de-novo assemble all reads, or else just use the transcriptome. Assembling from reads whose mates are mapped, but that are not mapped, is a bad idea. You'll basically get a bunch of junk.

ADD REPLY • link 7.4 years ago by Brian Bushnell 20k

0

Entering edit mode

Hi Brian, If any of those R1and R2 shows similarity with CDS, I will not collect them for de novo assembly.

I want to do this because I found that a large no. of genes transcripts were not present in reported CDS.

But I am not sure I can collect those R1 and R2 reads from bam file.

ADD REPLY • link 7.4 years ago by Bioinfonext ▴ 460

1

Entering edit mode

In that case you should assemble only pairs in which both R1 and R2 are not mapped. You can gather them like this, with the BBMap package:

bbmap.sh in1=r1.fq in2=r2.fq outu=unmapped.fq ref=ref.fa

ADD REPLY • link 7.4 years ago by Brian Bushnell 20k

score 6 · Answer 1 · 2016-12-13

6

Entering edit mode

7.4 years ago

lakhujanivijay 5.8k

Dear Yogesh,

You should be clear with your requirements. However, very likely you need the following (using samtools):

To get the unmapped reads from a bam file:

samtools view -f 4 file.bam > unmapped.sam

To get the output in bam:

samtools view -b -f 4 file.bam > unmapped.bam

ADD COMMENT • link 7.4 years ago by lakhujanivijay 5.8k

0

Entering edit mode

Thanks, After getting unmapped.bam, how can I separate paired end reads in two separate files.

ADD REPLY • link 7.4 years ago by Bioinfonext ▴ 460

2

Entering edit mode

using bedtools

bedtools bamtofastq [OPTIONS] -i <BAM> -fq <FASTQ>

These simple tasks have been already discussed on biostars. A simple way to search is googling in the following manner

convert bam to fastq :biostars.org

ADD REPLY • link 7.4 years ago by lakhujanivijay 5.8k

0

Entering edit mode

Can you please also suggest me how to count no. of mapped reads to each transcripts from bam file. When I am using using:samtool idstats it showing error: /home/yog/software/samtools-1.3.1/samtools idxstats 216_5W_Ca1.bam >readcount [bam_idxstats] fail to load the index.

Thanks

ADD REPLY • link 7.3 years ago by Bioinfonext ▴ 460

2

Entering edit mode

From your questions, you seem to be newbie for this kind of stuff. For the benefit of you, I am not going to answer this question. I strongly recommend you to first understand the "very basics" of the "very basic" tools for analysing sequencing data.

Hence, please go through this page thoroughly. Try to understand the whys and whats of basic steps like sorting and indexing a bam file.

Don't get me wrong, but your error points to a key step before you can proceed for idxstats. So, first go through the help link above and then post the problem if you encounter any.

ADD REPLY • link 7.3 years ago by lakhujanivijay 5.8k

1

Entering edit mode

Thanks I am trying to understand samtools flags.

But when I used to convert bam to into files R1 and R2, it show warning on screen as below I posted: /home/yog/software/bedtools2/bin/bamToFastq -i Ca2._resorted -fq Ca2_read1.fastq -fq2 Ca2_read2.fastq

but it able to complete to write R1 and R2 reads in separate files: is it ok?

WARNING: Query PC168224:127:C98A1ANXX:1:2316:21235:23135 is marked as paired, but its mate does not occur next to it in your BAM file. Skipping. WARNING: Query PC168224:127:C98A1ANXX:1:2316:21236:13229 is marked as paired, but its mate does not occur next to it in your BAM file. Skipping. *WARNING: Query PC168224:127:C98A1ANXX:1:2316:21237:66490 is marked as paired, but its mate does not occur next to it in your BAM file. Skipping.

ADD REPLY • link 7.3 years ago by Bioinfonext ▴ 460

1

Entering edit mode

Looks like your bam file is not sorted by read names. Hence, do the following:

Sort your bam file by read names:

samtools sort -n file.bam file.sorted.bam
Run bamtofastq on your new bam file

ADD REPLY • link 7.3 years ago by lakhujanivijay 5.8k

0

Entering edit mode

showing same after sorting. I was previously also sorted first and than run the bamtofastq. It can be If any of those R1and R2 shows similarity with CDS. How Can I filter this bam file to extract only pair end read?

ADD REPLY • link 7.3 years ago by Bioinfonext ▴ 460

1

Entering edit mode

What about indexing the sorted bam ? Did you do it?

/home/yog/software/samtools-1.3.1/samtools idxstats 216_5W_Ca1.bam >readcount [bam_idxstats] fail to load the index.

Above error is related to indexing

ADD REPLY • link 7.3 years ago by lakhujanivijay 5.8k

0

Entering edit mode

If I Indexing on sorted file it shows error:

/home/yog/software/samtools-1.3.1/samtools index unmap_216_5W_Ca2.resorted.bam [E::hts_idx_push] NO_COOR reads not in a single block at the end 9170 -1 samtools index: "unmap_216_5W_Ca2.resorted.bam" is corrupted or unsorted

ADD REPLY • link 7.3 years ago by Bioinfonext ▴ 460

0

Entering edit mode

sorry .......

I am repeating this step: by mistake for read counting I used unmapped file.

ADD REPLY • link 7.3 years ago by Bioinfonext ▴ 460

0

Entering edit mode

Getting error when doing indexing of sorted bam file:

[root@psgl 216_raw_reads]# /home/yog/software/samtools-1.3.1/samtools sort -n 216_5W_Ca2.bam -O BAM -o 216_5W_Ca2.resorted.bam [bam_sort_core] merging from 14 files... [root@psgl 216_raw_reads]# /home/yog/software/samtools-1.3.1/samtools index 216_5W_Ca2.resorted.bam [E::hts_idx_push] NO_COOR reads not in a single block at the end 22591 -1 samtools index: "216_5W_Ca2.resorted.bam" is corrupted or unsorted

ADD REPLY • link 7.3 years ago by Bioinfonext ▴ 460

0

Entering edit mode

If I do sorting of bam fine by not using -n flag than its worting:

For sorting: /home/yog/software/samtools-1.3.1/samtools sort 216_5W_Ca1.bam -O BAM -o 216_5W_Ca1.sort.bam

For indexing: /home/yog/software/samtools-1.3.1/samtools index 216_5W_Ca1.sort.bam

For read count: /home/yog/software/samtools-1.3.1/samtools idxstats 216_5W_Ca1.sort.bam > readcount[bam_idxstats]

But read count I am getting like this:

Rs025080 1341 239 27 Rs035250 621 0 0 Rs035280 408 0 0 Rs035290 318 0 0 Rs035300 456 87 0 Rs035320 1827 498 105 Rs035330 1119 18 0 Rs035340 924 317 56 Rs000040 321 0 0 Rs000940 948 0 0

Why there is three count column for each transcript.

ADD REPLY • link 7.3 years ago by Bioinfonext ▴ 460

1

Entering edit mode

Column 1: reference sequence name

Column 2: sequence length

Column 3: no. of mapped reads

Column 4: no. of unmapped reads.

The final line is the unmapped reads i.e. those reads which mapped nowhere (* instead of a reference sequence name, zero length).

ADD REPLY • link 7.3 years ago by lakhujanivijay 5.8k

0

Entering edit mode

Please also suggest about the error:I extracted unmapped reads in bam file.

But when I used to convert bam to into files R1 and R2, it show warning on screen as below I posted:

For sorting: /home/yog/software/samtools-1.3.1/samtools sort -n unmap_216_5W_Ca1.bam -o unmap_216_5W_Ca1._resorted

For Fastq:

/home/yog/software/bedtools2/bin/bamToFastq -i Ca2._resorted -fq Ca2_read1.fastq -fq2 Ca2_read2.fastq

but it able to complete to write R1 and R2 reads in separate files: Here also for sorting I should use cammand without -n flag .is it ok?

WARNING: Query PC168224:127:C98A1ANXX:1:2316:21235:23135 is marked as paired, but its mate does not occur next to it in your BAM file. Skipping. WARNING: Query PC168224:127:C98A1ANXX:1:2316:21236:13229 is marked as paired, but its mate does not occur next to it in your BAM file. Skipping. *WARNING: Query PC168224:127:C98A1ANXX:1:2316:21237:66490 is marked as paired, but its mate does not occur next to it in your BAM file. Skipping.

ADD REPLY • link 7.3 years ago by Bioinfonext ▴ 460

0

Entering edit mode

If I do sorting of bam fine by not using -n flag than its worting:

For sorting: /home/yog/software/samtools-1.3.1/samtools sort 216_5W_Ca1.bam -O BAM -o 216_5W_Ca1.sort.bam

For indexing: /home/yog/software/samtools-1.3.1/samtools index 216_5W_Ca1.sort.bam

For read count: /home/yog/software/samtools-1.3.1/samtools idxstats 216_5W_Ca1.sort.bam > readcount[bam_idxstats]

But read count I am getting like this:

Rs025080 1341 239 27 Rs035250 621 0 0 Rs035280 408 0 0 Rs035290 31 0 0 Rs035300 456 87 0

Why there is three count column for each transcript.

ADD REPLY • link 7.3 years ago by Bioinfonext ▴ 460