RNASeq with mixed tissues
3
0
Entering edit mode
5.1 years ago
ddzhangzz ▴ 90

I got some RNASeq fastq data from a customer, and he told me the samples were mainly from human cell lines but may have some contamination with mouse cells. My question is whether I should align those sequences against both human genome reference and mouse genome reference or just humna's. Any suggestions?

RNA-Seq • 1.8k views
3
Entering edit mode
5.1 years ago
1. First align all the samples to the human genome.
2. Then align the un-mapped reads to the mouse genome.

If you get a large portion of (un-mapped) reads mapping to mouse, then it's very likely the sample was contaminated.

0
Entering edit mode

Thanks @Lando Ringel. One problem could be that (maybe very likely) a sequence was actually from mouse but it can be mapped to both human and mouse.

0
Entering edit mode

That is true, but many of the mouse reads will remain un-mapped, you can use BLAST (or SNAP) to look at the unmapped reads more closely (i.e. determine which organism they belong to).

Do you plan on trying to using the contaminated samples? I personally would advise against that.

0
Entering edit mode

In this setting wouldn't t make more sense to align against a conjoined human/mouse reference, or to separately align to both human and mouse and select the species origin of the reads based on the quality of alignment in sp1 vs sp2

3
Entering edit mode
5.1 years ago

First subset the files (seqtk) and then use fastq_screen to get an idea what the contamination rate is. I've found it useful to only pay close attention to the "single alignment in a single organism" (or whatever that's called) category, since the others are more an indicator of sequence complexity. I happen to do this with all sequencing runs produced at our institute, since it immediately allows us to flag problematic samples (anything over 0.5% off-species unique alignment is a problem).

Ideally you won't have much contamination and if you do you can just exclude the sample. If you can't exclude the sample, then you'll need to simultaneously align to both genomes (get one from Ensembl and the other from UCSC, so the chromosome names differ, and then concatenate them). Align against the concatenated genome and then extract only the human reads with some meaningful MAPQ threshold. One can get more elegant with this, but that should suffice 99.9% of the time.

1
Entering edit mode
5.1 years ago
GenoMax 101k

BBSplit from BBMap has been designed to address this kind of a situation for binning reads (to best extent they can be assigned by alignment). It is a one step process.

0
Entering edit mode

yes, but after I use BBsplit, I obtain the fastq file and then I can to remap this with STAR but the count? FeatureCount doesn't work well with bbsplit. So my question is: After BBsplit What can I use to map and to calculate the count? Thanks

0
Entering edit mode

FeatureCount doesn't work well with bbsplit.

There should be no direct relation with the splitting. If you reads are not aligning to exons then you can have issues with counting (assuming you are using a reference/annotation where chromosome ID's match).

0
Entering edit mode

When I use featureCount I get reads % that map to exons that are too low, while STAR percentages are greater than 80%, Assuming the same annotation file. WHY?

0
Entering edit mode

Can you describe the processing you are doing step wise for both BBTools and STAR?

As I said before there is no technical reason why this should not work with reads that have been bbsplit.

0
Entering edit mode
bbsplit.sh in1=reads1.fq in2=reads2.fq ref=human.fa,mouse.fa ambiguous2=toss basename=out_%.fq refstats=Statistics_%.txt


then I used directly STAR:

STAR --runMode alignReads --runThreadN 12 --genomeDir Genomes/STARversion2.70f/STARversion2.70f_INDEX_HG38_GenCode_v29/ --sjdbGTFfile Genomes/GenCode_HG38_release_29/GTF/gencode.v29.chr_patch_hapl_scaff.annotation.gtf --readFilesIn 2_Esperimento/6_BBSPlit/1_CTRL/out_HG38.fq --outSAMtype BAM SortedByCoordinate --outFileNamePrefix 1_CTRL/


then, I used FeatureCount:

featureCounts -F GTF -T 12 -s 2 -g gene_name -a Genomes/GenCode_HG38_release_29/GTF/gencode.v29.chr_patch_hapl_scaff.annotation.gtf -o GeneName_Count.txt 1_CTRL/Aligned.sortedByCoord.out.bam


In this case I obtained only the 11.3% of successfully assigned alignments. WHY?

Can you help me?

0
Entering edit mode

Is this library "reverse stranded" for sure? What happens if you try -s 0 or -s 1 with featureCounts? Does the assignment % go up significantly?

Just to be redundant, let me make sure you have tried three different things:

1. Use bbsplit.sh to split the reads into two pools.
2. Use bbmap.sh to map the reads back to human genome (what is the alignment % here?)
3. Use featureCounts to count.

Other workflow

1. Use bbsplit.sh to split the reads into two pools.
2. Use STAR to map to genome and count at the same time?

Alternative that you have tried:

1. Use bbmap.sh to map the reads to human genome.
2. Use featureCounts to count.