Question

Subset reads from RNAseq .bam file

0

Entering edit mode

5.7 years ago

fabio.demartino2 • 0

Hello! I am pretty new to the bioinformatic field and I am dealing with some RNAseq data.

I am trying to run the protocol described in Callari et al. BMC Genomics 2018 in order to discriminate mouse and human reads into my PDX-derived samples.

In order to do so, as suggested in the paper and following some of their commands, I merged the human and mouse genome and run Tophat on that. I finally got a .bam file that is now containing reads aligned to both the human and the mouse genome.

In order to proceed with the remaining RNAseq analysis, I want to subselect only those reads that matched only to either the human's or to the mouse's genome.

The problems are that when I run (the following code was taken from the original code reported in the paper):

samtools view -h path/to/ICRG_aligned.bam
| grep -v $'\t''m.chr'
| grep -v 'SN:m.chr'
| samtools view -b > path/to/genome_human_only.bam

1) It returns me a .bam file in the standard-output.txt, even though I specify the path and the name of my desidered output file

2) This output file is even a bit larger than my original .bam file, even though I would expect it to be smaller in size since I should have gotten rid of all the lines containing the "m.chr" string.

Can anybody help me to figure out if there is a problem with this code?

Note: the chromosome's names of my fastq and gtf files were modified so that the human's ones are reported as "chr" whereas the mouse's as "m.chr".

Thank you in advance!

RNA-Seq UNIX • 1.3k views

ADD COMMENT • link 5.7 years ago by fabio.demartino2 • 0

1

Entering edit mode

I can certainly tell you that this command should work. For example this is what I tried at my end

samtools view -h sample_sorted.bam| grep -v SN:NW_004587740.1 | samtools view -b | samtools view -h | head -43465 > out.bam

Now my out.bam is a nice humble bam file which I checked like this

samtools quickcheck out.bam && echo 'all ok' || echo 'fail!'

output

all ok

The only problem I see is here $'\t''m.chr'. What is that $'\t' doing there?

PS: I am using samtools 1.4.1

ADD REPLY • link 5.7 years ago by lakhujanivijay 5.8k

0

Entering edit mode

The only problem I see is here $'\t''m.chr'. What is that $'\t' doing there?

The $'\t' greps for the tab. You could also use this command line:

grep -v -P '\tm.chr'

fin swimmer

ADD REPLY • link 5.7 years ago by finswimmer 16k

0

Entering edit mode

Hi Vijay, thank you very much for your nice and exhaustive reply, your suggestion let my code working now. Indeed, it was a problem related to the usage of an older version of samtools, now I updated it and it was fine. Anyway, I converted the resulting .bam file to .sam so that I could check that my lines containing the string "m.chr" were actually been filtered out and unfortunately I still see some. Would you have an idea of why this is happening and how I could get rid of them? Thank you again.

ADD REPLY • link 5.7 years ago by fabio.demartino2 • 0

0

Entering edit mode

Hello fabio.demartino2,

after updating have you also tried the original code? With the code Vijay Lakhujani provided there should not be left "some" reads but all reads. Why? Let's go through the code:

samtools view -h path/to/ICRG_aligned.bam

Converts bam to sam and print also the header.

grep -v $'\t''m.chr'

Skip all lines which contain a tab followed by m.chr. This is in your read data the case. In sam (and bam) each read have on line and consists of several columns delimited by tab. One of these columns is about the sequence id aka chromosome.

grep -v 'SN:m.chr'

Skip line which matches the pattern SN:m.chr. This is in your header the case where each chromosome is listed.

So the first grep is needed to remove the reads which are located on your mouse genome and the second one to remove the header information about it.

samtools view -b > path/to/genome_human_only.bam

Convert sam back to bam.

fin swimmer

ADD REPLY • link 5.7 years ago by finswimmer 16k

0

Entering edit mode

Hi Fabio

It is to be noted that what I provided above is NOT a final solution to your problem but my way of saying that what the code is doing should work after the changes suggested by @finswimmer. Executing the code what I gave will not give you what you want.

ADD REPLY • link 5.7 years ago by lakhujanivijay 5.8k

0

Entering edit mode

Hello,

1) It returns me a .bam file in the standard-output.txt, even though I specify the path and the name of my desidered output file

you mean a file with the name standard-output.txt is created? Then I would say you are not showing us your complete command/script.

fin swimmer

ADD REPLY • link 5.7 years ago by finswimmer 16k

0

Entering edit mode

Hi Fin, Yes, it creates a .bam file with the name standard-output.txt that is not where I would like my output file to be (I specify in my script the path of my wanted output .bam file), even though this is my whole command.

Thank you very much for your kind and prompt reply.

ADD REPLY • link 5.7 years ago by fabio.demartino2 • 0