Question: How to extract unmapped Paired end reads from BWA output sam file
1
gravatar for Bioinfonext
2.4 years ago by
Bioinfonext140
Korea
Bioinfonext140 wrote:

I run BWA for mapping pair end raw reads to CDS transcripts. Now can suggest me how to get unmapped Right and Left reads in two separate file using samtool or sambanba cammand from the output sam file.

sambamba bwa • 4.5k views
ADD COMMENTlink modified 2.3 years ago by Vijay Lakhujani4.0k • written 2.4 years ago by Bioinfonext140

This is not a tool post, it's a question. When you have questions, please mark them as questions.

ADD REPLYlink written 2.4 years ago by Brian Bushnell16k

By right and left, do you mean R1 and R2? Plus and minus? Please be very specific about what you wish to accomplish. Also, the reason why is helpful; I'm not sure that what you are trying to do is useful.

ADD REPLYlink written 2.4 years ago by Brian Bushnell16k

Thanks, These are CDS transcripts from Draft genome. After getting unmaaped reads I want to de novo assembled them so that for raw read mapping I can use CDS transcript plus unique de novo assembled transcript as a reference.

Right means R1 and left means R2 reads.

ADD REPLYlink written 2.4 years ago by Bioinfonext140
1

What you want to do does not make sense to me. I suggest you de-novo assemble all reads, or else just use the transcriptome. Assembling from reads whose mates are mapped, but that are not mapped, is a bad idea. You'll basically get a bunch of junk.

ADD REPLYlink written 2.4 years ago by Brian Bushnell16k

Hi Brian, If any of those R1and R2 shows similarity with CDS, I will not collect them for de novo assembly.

I want to do this because I found that a large no. of genes transcripts were not present in reported CDS.

But I am not sure I can collect those R1 and R2 reads from bam file.

ADD REPLYlink written 2.4 years ago by Bioinfonext140
1

In that case you should assemble only pairs in which both R1 and R2 are not mapped. You can gather them like this, with the BBMap package:

bbmap.sh in1=r1.fq in2=r2.fq outu=unmapped.fq ref=ref.fa

ADD REPLYlink written 2.4 years ago by Brian Bushnell16k
5
gravatar for Vijay Lakhujani
2.4 years ago by
Vijay Lakhujani4.0k
India
Vijay Lakhujani4.0k wrote:

Dear Yogesh,

You should be clear with your requirements. However, very likely you need the following (using samtools):

To get the unmapped reads from a bam file:

samtools view -f 4 file.bam > unmapped.sam

To get the output in bam:

samtools view -b -f 4 file.bam > unmapped.bam
ADD COMMENTlink written 2.4 years ago by Vijay Lakhujani4.0k

Thanks, After getting unmapped.bam, how can I separate paired end reads in two separate files.

ADD REPLYlink written 2.4 years ago by Bioinfonext140
2

using bedtools

bedtools bamtofastq [OPTIONS] -i <BAM> -fq <FASTQ>

These simple tasks have been already discussed on biostars. A simple way to search is googling in the following manner

convert bam to fastq :biostars.org

ADD REPLYlink written 2.4 years ago by Vijay Lakhujani4.0k

Can you please also suggest me how to count no. of mapped reads to each transcripts from bam file. When I am using using:samtool idstats it showing error: /home/yog/software/samtools-1.3.1/samtools idxstats 216_5W_Ca1.bam >readcount [bam_idxstats] fail to load the index.

Thanks

ADD REPLYlink written 2.4 years ago by Bioinfonext140
1

From your questions, you seem to be newbie for this kind of stuff. For the benefit of you, I am not going to answer this question. I strongly recommend you to first understand the "very basics" of the "very basic" tools for analysing sequencing data.

Hence, please go through this page thoroughly. Try to understand the whys and whats of basic steps like sorting and indexing a bam file.

Don't get me wrong, but your error points to a key step before you can proceed for idxstats. So, first go through the help link above and then post the problem if you encounter any.

ADD REPLYlink written 2.4 years ago by Vijay Lakhujani4.0k

Thanks I am trying to understand samtools flags.

But when I used to convert bam to into files R1 and R2, it show warning on screen as below I posted: /home/yog/software/bedtools2/bin/bamToFastq -i Ca2._resorted -fq Ca2_read1.fastq -fq2 Ca2_read2.fastq

but it able to complete to write R1 and R2 reads in separate files: is it ok?

WARNING: Query PC168224:127:C98A1ANXX:1:2316:21235:23135 is marked as paired, but its mate does not occur next to it in your BAM file. Skipping. WARNING: Query PC168224:127:C98A1ANXX:1:2316:21236:13229 is marked as paired, but its mate does not occur next to it in your BAM file. Skipping. *WARNING: Query PC168224:127:C98A1ANXX:1:2316:21237:66490 is marked as paired, but its mate does not occur next to it in your BAM file. Skipping.

ADD REPLYlink written 2.4 years ago by Bioinfonext140
1

Looks like your bam file is not sorted by read names. Hence, do the following:

  1. Sort your bam file by read names:

    samtools sort -n file.bam file.sorted.bam

  2. Run bamtofastq on your new bam file

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by Vijay Lakhujani4.0k

showing same after sorting. I was previously also sorted first and than run the bamtofastq. It can be If any of those R1and R2 shows similarity with CDS. How Can I filter this bam file to extract only pair end read?

ADD REPLYlink written 2.4 years ago by Bioinfonext140
1

What about indexing the sorted bam ? Did you do it?

/home/yog/software/samtools-1.3.1/samtools idxstats 216_5W_Ca1.bam >readcount [bam_idxstats] fail to load the index.

Above error is related to indexing

ADD REPLYlink written 2.4 years ago by Vijay Lakhujani4.0k

If I Indexing on sorted file it shows error:

/home/yog/software/samtools-1.3.1/samtools index unmap_216_5W_Ca2.resorted.bam [E::hts_idx_push] NO_COOR reads not in a single block at the end 9170 -1 samtools index: "unmap_216_5W_Ca2.resorted.bam" is corrupted or unsorted

ADD REPLYlink written 2.4 years ago by Bioinfonext140

sorry .......

I am repeating this step: by mistake for read counting I used unmapped file.

ADD REPLYlink written 2.4 years ago by Bioinfonext140

Getting error when doing indexing of sorted bam file:

[root@psgl 216_raw_reads]# /home/yog/software/samtools-1.3.1/samtools sort -n 216_5W_Ca2.bam -O BAM -o 216_5W_Ca2.resorted.bam [bam_sort_core] merging from 14 files... [root@psgl 216_raw_reads]# /home/yog/software/samtools-1.3.1/samtools index 216_5W_Ca2.resorted.bam [E::hts_idx_push] NO_COOR reads not in a single block at the end 22591 -1 samtools index: "216_5W_Ca2.resorted.bam" is corrupted or unsorted

ADD REPLYlink written 2.4 years ago by Bioinfonext140

If I do sorting of bam fine by not using -n flag than its worting:

For sorting: /home/yog/software/samtools-1.3.1/samtools sort 216_5W_Ca1.bam -O BAM -o 216_5W_Ca1.sort.bam

For indexing: /home/yog/software/samtools-1.3.1/samtools index 216_5W_Ca1.sort.bam

For read count: /home/yog/software/samtools-1.3.1/samtools idxstats 216_5W_Ca1.sort.bam > readcount[bam_idxstats]

But read count I am getting like this:

Rs025080 1341 239 27 Rs035250 621 0 0 Rs035280 408 0 0 Rs035290 318 0 0 Rs035300 456 87 0 Rs035320 1827 498 105 Rs035330 1119 18 0 Rs035340 924 317 56 Rs000040 321 0 0 Rs000940 948 0 0

Why there is three count column for each transcript.

ADD REPLYlink written 2.4 years ago by Bioinfonext140
1
Column 1: reference sequence name

Column 2: sequence length

Column 3: no. of mapped reads

Column 4: no. of unmapped reads.

The final line is the unmapped reads i.e. those reads which mapped nowhere (* instead of a reference sequence name, zero length).

ADD REPLYlink written 2.3 years ago by Vijay Lakhujani4.0k

Please also suggest about the error:I extracted unmapped reads in bam file.

But when I used to convert bam to into files R1 and R2, it show warning on screen as below I posted:

For sorting: /home/yog/software/samtools-1.3.1/samtools sort -n unmap_216_5W_Ca1.bam -o unmap_216_5W_Ca1._resorted

For Fastq:

/home/yog/software/bedtools2/bin/bamToFastq -i Ca2._resorted -fq Ca2_read1.fastq -fq2 Ca2_read2.fastq

but it able to complete to write R1 and R2 reads in separate files: Here also for sorting I should use cammand without -n flag .is it ok?

WARNING: Query PC168224:127:C98A1ANXX:1:2316:21235:23135 is marked as paired, but its mate does not occur next to it in your BAM file. Skipping. WARNING: Query PC168224:127:C98A1ANXX:1:2316:21236:13229 is marked as paired, but its mate does not occur next to it in your BAM file. Skipping. *WARNING: Query PC168224:127:C98A1ANXX:1:2316:21237:66490 is marked as paired, but its mate does not occur next to it in your BAM file. Skipping.

ADD REPLYlink written 2.3 years ago by Bioinfonext140

If I do sorting of bam fine by not using -n flag than its worting:

For sorting: /home/yog/software/samtools-1.3.1/samtools sort 216_5W_Ca1.bam -O BAM -o 216_5W_Ca1.sort.bam

For indexing: /home/yog/software/samtools-1.3.1/samtools index 216_5W_Ca1.sort.bam

For read count: /home/yog/software/samtools-1.3.1/samtools idxstats 216_5W_Ca1.sort.bam > readcount[bam_idxstats]

But read count I am getting like this:

Rs025080 1341 239 27 Rs035250 621 0 0 Rs035280 408 0 0 Rs035290 31 0 0 Rs035300 456 87 0

Why there is three count column for each transcript.

ADD REPLYlink written 2.3 years ago by Bioinfonext140
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1611 users visited in the last hour