Question: Extract uniquely mapped reads from one species
0
gravatar for PSB
5 months ago by
PSB0
European Union
PSB0 wrote:

Hi,

Using STAR and a reference fasta file with both the genomes of mouse and plasmodium I've mapped my fastq files (with reads from both mouse and plasmodium). I used a GFF annotation file too in the alignment. Now I have BAM files with reads mapped uniquely to mouse or plasmodium genome and multimapped to both mouse and plasmodium.

I'm interested on extracting the uniquely mapped reads of mouse, the uniquely mapped reads from plasmodium and the multimapped reads.

Since the uniquely mapped reads in the BAM file could be from mouse or plasmodium, I was wondering how can I extract the uniquely mapped reads from each specie. My first idea is to use bamtools/bedtools with the GFF file but I'm unsure about how to do this. Any ideas?

Thanks in advance and thanks for reading.

ADD COMMENTlink modified 5 months ago by goodez420 • written 5 months ago by PSB0
1

I suggest that you should try bbsplit.sh as noted in this thread (A: Tool to separate human and mouse ran seq reads ). It will make this process much simpler.

ADD REPLYlink written 5 months ago by genomax59k
1

You can filter your original fastq files using BBSplit.sh

ADD REPLYlink written 5 months ago by Antonio R. Franco3.9k

@Antonio: I am moving this to a comment since this is not an answer for the original question but an alternate way of achieving the end result OP wants.

ADD REPLYlink written 5 months ago by genomax59k

Many thanks genomax and Antonio,

I've came across with this tool which I found very useful. However, my end goal is to see hoy many reads across samples are from mouse and how many are from plasmodium from the total number of reads. That being so (maybe I'm wrong) but I thought filtering the fastq files before will be easier but won't allow me to see "the big picture" about the distribution of reads per species in my samples. Can I make this comparison with the fastq files filtered? Should I align afterwards with the same reference file and have two different bam files per sample (one with unique mouse reads and other with unique plasmodium reads)?

Again, many thanks for your help.

ADD REPLYlink modified 5 months ago • written 5 months ago by PSB0
1

However, my end goal is to see hoy many reads across samples are from mouse and how many are from plasmodium from the total number of reads.

You are not filtering the reads but actually binning them into pools by using bbsplit.sh (by aligning them to the references provided). This allows you to find uniquely mapping reads for each genome and get then in individual files. The program will also bin reads that multi-map across species in separate (ambiguous data) files. While it is possible to get bam files directly from bbsplit.sh it is better to do the split and then do alignments on split data files with bbmap.sh to avoid issues visualizing the data with a genome viewer like IGV.

ADD REPLYlink modified 5 months ago • written 5 months ago by genomax59k
2
gravatar for goodez
5 months ago by
goodez420
United States
goodez420 wrote:

I did this recently but I aligned with bowtie2. Please note I provide code but you would have to modify it slightly for your case.

To create the genome index, I edited the chromosome names in each genome fasta file. So instead of chr1, chr2, chr3... I made it hg_chr1, hg_chr2, hg_chr3. And likewise for the other genome fasta.

e.g.

zcat Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz | \
sed -E "s/>(.+)( dna:.+)/>chr\1_hg38\2/" > hg38.tmp

zcat Mus_musculus.GRCm38.dna.primary_assembly.fa.gz | \
sed -E "s/>(.+)( dna:.+)/>chr\1_mm10\2/" > mm10.tmp

cat hg38.tmp | cat - mm10.tmp > hg38_mm10.fa

rm hg38.tmp
rm mm10.tmp

Then I combined both genome files with the unique chromosome IDs and used that as the index.

bowtie2-build --threads 8 -f ./hg38_mm10.fa hg_mm

After alignment, you can easily distinguish which genome each read is aligned to based on the chromosome name (hg_chr1 vs mm_chr1). I used the XS:i flag in the BAM file to only pull out uniquely mapped reads. It may sound like a lot of work but it was pretty simple to execute.

Here was the more complicated part of separating the BAM into each species with only unique reads.

for file in *.sam
do
        # filter out multimappers (reads with 'XS:i' inform us of multimapping)
        cat $file | grep -v 'XS:i' > tmp.unique.sam
        # filter for high quality reads and convert to bam (below)
        samtools view -b -@ 6 -q 35 *unique.sam > tmp.filt.unique.bam
        rm tmp.unique.sam
        # split into separate human and mouse sam files
        samtools view tmp.filt.unique.bam | grep -E $'\tchr.+hg38' > hg38.noheader.sam
        samtools view -H tmp.filt.unique.bam | grep -v 'mm10' | cat - hg38.noheader.sam | sed 's/_hg38//' | \
        samtools view -b -@ 6 > ./hg38/"${file:0:-4}".hg38.bam
        rm hg38.noheader.sam
        ##
        samtools view tmp.filt.unique.bam | grep -E $'\tchr.+mm10' > mm10.noheader.sam
        samtools view -H tmp.filt.unique.bam | grep -v 'hg38' | cat - mm10.noheader.sam | sed 's/_mm10//' | \
        samtools view -b -@ 6 > ./mm10/"${file:0:-4}".mm10.bam
        rm mm10.noheader.sam
        ##
        rm tmp.filt.unique.bam
done
ADD COMMENTlink modified 5 months ago • written 5 months ago by goodez420

Most of the ideas in this post are already in the original question. Unless you add actual code or more details this would be moved to a comment instead. If you have code snippets/workflow available you can host that via a gist on GitHub and include the link here.

ADD REPLYlink modified 5 months ago • written 5 months ago by genomax59k

It's unclear in the question how they did the alignment. If the chromosome names are unique as I'm suggesting, then it's simple to pull out reads from either organism with grep. I suggested using XS:i to only take uniquely mapped reads, which I don't see mentioned above. Move my post wherever you'd like though.

ADD REPLYlink written 5 months ago by goodez420

Thanks for adding the example code.

Give the bbsplit.sh method suggested above a try as a simpler alternative in future :-)

ADD REPLYlink modified 5 months ago • written 5 months ago by genomax59k

Hi,

Thanks for your answer. That's an interesting approach and it'll be very helpful if you can provide explanation/code about how you did it.

Many thanks.

ADD REPLYlink written 5 months ago by PSB0

Sure I will edit my answer to include some code

ADD REPLYlink written 5 months ago by goodez420

Hi goodez,

Thank you for the answer and the code. I do have unique chromosome names for each genome, so the file looks like

7001289F:143:CBY2FANXX:1:1107:2105:1964 99      chr11   96938063       255     1S124M  =       96938130        192     NGACCCAGGCTGAAGTGGGGGATCAACACAGGACAAGGTTGCCTCGGGAGGAGCCTTTGTTAGTGACAGAGGCTGAAGCACACAGCCACAAAGTCAAAACAAGGCAAGTGTGTCCAGGGTGATAA
#<<@AF;EGEGGGGGGG>FGGGGGEGGGGGEGGGGGGGGGFGGGFGGGGGGDGGGGGGGGGGGGEGGF>GEGGGBGGGGGE>DBGGGBGBGGG0;0F@DGGGE=@FGGGE@@DBGEGGG.@D6EG NH:i:1  HI:i:1  AS:i:247        nM:i:0

My idea us to use chr*, 255 (MAPQ) and NH:i:1 to extract the unique mapped reads of mouse, and the same but changing the chr for the appropriate name to extract the unique mapped reads of plasmodium. I was thinking to use grep or samtools. Do you think this is a good approach? To filter for both MAPQ and NH:i:x?

Thanks for your time and help.

ADD REPLYlink modified 5 months ago • written 5 months ago by PSB0

Yes that sounds good to me

ADD REPLYlink written 5 months ago by goodez420
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1489 users visited in the last hour