Question: Extracting chimeric reads from mapping
0
gravatar for Anushka
2.2 years ago by
Anushka20
France
Anushka20 wrote:

Hello, I am struggling to processing and analyse bam files (from bwa alignment), to extracting the chimeric read alignment. I am aligning human cell line RNA-seq data (paired end) to virus, aimed to find the viral integration sites in the genome. For that, after reading a bit here from following links like: Confirming Viruses detected from NGS data

RNA Seq analysis of virally infected cells

For find out the chimeric reads (partially mapping to one and partially to other genome),I have concatenated the two genomes (human + viral) and prepared the index using bwa index

bwa index -a bwtsw human_virus.fa

and then performed the mapping using bwa-mem and piping the output samtools of alignment to

bwa mem -R '@RG\tID:ID\tSM:SM\tPL:Illumina' index/human_virus.fa input_pair_1.fastq input_pair_2.fastq | samtools view -bS > input.bam

I have sorted and indexed the resulting input.bam (input_sorted.bam and input.bai)

Now for the post alignment processing, I am trying samtools view -f 0x04 (for unmapped reads) and flagstat to find other statistics.

My question question is how can I extract the numbers of reads that are partially mapped to either of the two genome, how many reads mapped to viral genome and which to human. Conceptually, I might not be understanding this problem, that is why I cab not implement proper samtool view options.

Plus I have approx 50 samples how could I evaluate how many reads have mapped to the viral genome in each sample.

I would really appreciate your inputs.

bwa rna-seq samtools alignment • 2.5k views
ADD COMMENTlink modified 2.2 years ago by Pierre Lindenbaum124k • written 2.2 years ago by Anushka20
1
gravatar for Pierre Lindenbaum
2.2 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum124k wrote:

My question question is how can I extract the numbers of reads that are partially mapped to either of the two genome, how many reads mapped to viral genome and which to human. Conceptually, I might not be understanding this problem, that is why I cab not implement proper samtool view options.

using samjdk: http://lindenb.github.io/jvarkit/SamJdk.html .

For paired end we're searching for reads mapping one organism, and the mate mapping another. We also look for split reads.

java -jar dist/samjdk.jar -e '
    final Set<String> virus_chrom = new HashSet<>(Arrays.asList("viral_seg1", "viral_seg2"));
    if(record.getReadUnmappedFlag()) return false;
    if(record.getReadPairedFlag() && !record.getMateUnmappedFlag())
        {
        if( virus_chrom.contains(record.getReferenceName()) != 
            virus_chrom.contains(record.getMateReferenceName())
            )
            {
            return true;
            }
        }
    for(final SAMRecord other: SAMUtils.getOtherCanonicalAlignments(record))
        {
        if( virus_chrom.contains(record.getReferenceName()) != 
            virus_chrom.contains(other.getReferenceName())
            )
            {
            return true;
            }
        }
    return false;' input.bam
ADD COMMENTlink modified 2.2 years ago • written 2.2 years ago by Pierre Lindenbaum124k

Thanks Pierre for your help . I am trying to understand the output of SamJdk, basically it has given me all he SQ tag with reference sequence name and corresponding length.

Additionally is it possible to retrieve the chimeric reads using samtools FLAG options. As in the manual states definition of a chimeric alignment is:

Typically, one of the linear alignments in a chimeric alignment is considered the “representative” alignment, and the others are called “supplementary” and are distinguished by the supplementary alignment flag. All the SAM records in a chimeric alignment have the same QNAME and the same values for 0x40 and 0x80 flags

ADD REPLYlink written 2.2 years ago by Anushka20
1

Additionally is it possible to retrieve the chimeric reads using samtools FLAG options

that's the purpose of SAMUtils.getOtherCanonicalAlignments . But you need some code to get the difference between virus/host.

ADD REPLYlink written 2.2 years ago by Pierre Lindenbaum124k

yes that make sense. Thank you so much for your help

ADD REPLYlink written 2.2 years ago by Anushka20

Hi Pierre, thank you for the answer, but I still don't get the idea. I mapped the paired-end reads to the host genome to get a bam file, but I don't know how to input the virus genome (or mapped result) to the expression? Thank you.

ADD REPLYlink written 6 months ago by junghaowang0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1739 users visited in the last hour