1
0
Entering edit mode
4.8 years ago
Anushka ▴ 20

Hello, I am struggling to processing and analyse bam files (from bwa alignment), to extracting the chimeric read alignment. I am aligning human cell line RNA-seq data (paired end) to virus, aimed to find the viral integration sites in the genome. For that, after reading a bit here from following links like: Confirming Viruses detected from NGS data

RNA Seq analysis of virally infected cells

For find out the chimeric reads (partially mapping to one and partially to other genome),I have concatenated the two genomes (human + viral) and prepared the index using bwa index

bwa index -a bwtsw human_virus.fa

and then performed the mapping using bwa-mem and piping the output samtools of alignment to

bwa mem -R '@RG\tID:ID\tSM:SM\tPL:Illumina' index/human_virus.fa input_pair_1.fastq input_pair_2.fastq | samtools view -bS > input.bam

I have sorted and indexed the resulting input.bam (input_sorted.bam and input.bai)

Now for the post alignment processing, I am trying samtools view -f 0x04 (for unmapped reads) and flagstat to find other statistics.

My question question is how can I extract the numbers of reads that are partially mapped to either of the two genome, how many reads mapped to viral genome and which to human. Conceptually, I might not be understanding this problem, that is why I cab not implement proper samtool view options.

Plus I have approx 50 samples how could I evaluate how many reads have mapped to the viral genome in each sample.

I would really appreciate your inputs.

RNA-Seq alignment samtools bwa • 5.5k views
1
Entering edit mode
4.8 years ago

My question question is how can I extract the numbers of reads that are partially mapped to either of the two genome, how many reads mapped to viral genome and which to human. Conceptually, I might not be understanding this problem, that is why I cab not implement proper samtool view options.

using samjdk: http://lindenb.github.io/jvarkit/SamJdk.html .

For paired end we're searching for reads mapping one organism, and the mate mapping another. We also look for split reads.

java -jar dist/samjdk.jar -e '
final Set<String> virus_chrom = new HashSet<>(Arrays.asList("viral_seg1", "viral_seg2"));
{
if( virus_chrom.contains(record.getReferenceName()) !=
virus_chrom.contains(record.getMateReferenceName())
)
{
return true;
}
}
for(final SAMRecord other: SAMUtils.getOtherCanonicalAlignments(record))
{
if( virus_chrom.contains(record.getReferenceName()) !=
virus_chrom.contains(other.getReferenceName())
)
{
return true;
}
}
return false;' input.bam

0
Entering edit mode

Thanks Pierre for your help . I am trying to understand the output of SamJdk, basically it has given me all he SQ tag with reference sequence name and corresponding length.

Additionally is it possible to retrieve the chimeric reads using samtools FLAG options. As in the manual states definition of a chimeric alignment is:

Typically, one of the linear alignments in a chimeric alignment is considered the “representative” alignment, and the others are called “supplementary” and are distinguished by the supplementary alignment flag. All the SAM records in a chimeric alignment have the same QNAME and the same values for 0x40 and 0x80 flags

1
Entering edit mode

Additionally is it possible to retrieve the chimeric reads using samtools FLAG options

that's the purpose of SAMUtils.getOtherCanonicalAlignments . But you need some code to get the difference between virus/host.

0
Entering edit mode

yes that make sense. Thank you so much for your help

0
Entering edit mode

Hi Pierre, thank you for the answer, but I still don't get the idea. I mapped the paired-end reads to the host genome to get a bam file, but I don't know how to input the virus genome (or mapped result) to the expression? Thank you.