Question

RNASeq with mixed tissues

0

8.1 years ago

ddzhangzz ▴ 90

I received some RNA-Seq FASTQ data from a customer, who told me the samples are mainly from human cell lines but may be contaminated with mouse cells. Should I align these reads against both the human and mouse reference genomes, or only the human one? Any suggestions?

RNA-Seq • 3.2k views

ADD COMMENT • link updated 8.1 years ago by Manuel Landesfeind ★ 1.4k • written 8.1 years ago by ddzhangzz ▴ 90

3

8.1 years ago

Devon Ryan 104k

First subset the files (seqtk) and then use fastq_screen to get an idea what the contamination rate is. I've found it useful to only pay close attention to the "single alignment in a single organism" (or whatever that's called) category, since the others are more an indicator of sequence complexity. I happen to do this with all sequencing runs produced at our institute, since it immediately allows us to flag problematic samples (anything over 0.5% off-species unique alignment is a problem).
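As a rough illustration of the flagging rule in the paragraph above, here is a minimal Python sketch. It assumes you have already extracted, per sample, the number of screened reads and the number that aligned uniquely to the off-species genome only. The sample names and counts are invented for illustration, and the function names are hypothetical rather than part of any tool.

```python
# Hypothetical per-sample counts, e.g. copied from a contamination-screen report.
# Each value is (screened_reads, reads with a single alignment in the off-species genome only).

OFF_SPECIES_LIMIT_PCT = 0.5  # threshold quoted in the answer above


def off_species_pct(screened_reads, off_species_unique_hits):
    """Percentage of screened reads that align uniquely to the unexpected species."""
    if screened_reads <= 0:
        raise ValueError("screened_reads must be positive")
    return 100.0 * off_species_unique_hits / screened_reads


def flag_samples(samples, limit_pct=OFF_SPECIES_LIMIT_PCT):
    """Return the names of samples whose off-species unique rate exceeds the limit."""
    flagged = []
    for name, (screened, off_unique) in samples.items():
        if off_species_pct(screened, off_unique) > limit_pct:
            flagged.append(name)
    return flagged


# Invented example numbers, not real data.
samples = {
    "sample_A": (100_000, 120),
    "sample_B": (100_000, 2_300),
}
print(flag_samples(samples))
```

Only the unique, single-organism category is used here, following the reasoning above that the other categories mostly reflect sequence complexity.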

Ideally you won't have much contamination and if you do you can just exclude the sample. If you can't exclude the sample, then you'll need to simultaneously align to both genomes (get one from Ensembl and the other from UCSC, so the chromosome names differ, and then concatenate them). Align against the concatenated genome and then extract only the human reads with some meaningful MAPQ threshold. One can get more elegant with this, but that should suffice 99.9% of the time.
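The "extract only the human reads with some MAPQ threshold" step can be sketched in plain Python on SAM text. This is only an illustration under assumptions: human contigs are taken to start with `chr` (UCSC-style) while mouse contigs do not (Ensembl-style), and the cutoff of 10 is an arbitrary placeholder, since aligners scale MAPQ differently. In practice you would more likely do this with a BAM-aware tool, and you would want to treat mates consistently for paired-end data.

```python
def is_human_hit(sam_line, human_prefix="chr", min_mapq=10):
    """True if a SAM alignment line maps to a human contig with MAPQ >= min_mapq."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    rname = fields[2]
    mapq = int(fields[4])
    if flag & 0x4:  # 0x4 means the read is unmapped
        return False
    return rname.startswith(human_prefix) and mapq >= min_mapq


def keep_human(sam_lines, human_prefix="chr", min_mapq=10):
    """Yield human @SQ/other header lines plus alignments that pass the filter."""
    for line in sam_lines:
        if line.startswith("@"):
            # Drop @SQ lines for non-human contigs; keep every other header line.
            if line.startswith("@SQ") and f"SN:{human_prefix}" not in line:
                continue
            yield line
        elif is_human_hit(line, human_prefix, min_mapq):
            yield line


# Toy SAM records (invented): contig lengths and positions are placeholders.
example = [
    "@SQ\tSN:chr1\tLN:1000",                       # human contig
    "@SQ\tSN:1\tLN:1000",                          # mouse contig
    "r1\t0\tchr1\t100\t60\t50M\t*\t0\t0\t*\t*",    # human, confident
    "r2\t0\t1\t200\t60\t50M\t*\t0\t0\t*\t*",       # mouse
    "r3\t0\tchr1\t300\t0\t50M\t*\t0\t0\t*\t*",     # human, but MAPQ 0
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",            # unmapped
]
kept = list(keep_human(example))
print([line.split("\t")[0] for line in kept])
```

Reads that align equally well to both species end up with a low MAPQ in the concatenated reference, which is why the threshold removes them.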

ADD COMMENT • link 8.1 years ago by Devon Ryan 104k

1

8.1 years ago

GenoMax 141k

BBSplit from BBMap was designed for this kind of situation: it bins reads by species (to the best extent they can be assigned by alignment). It is a one-step process.

ADD COMMENT • link 8.1 years ago by GenoMax 141k

0

Yes, but after I use BBSplit I get the FASTQ files, and I can remap them with STAR, but what about the counts? FeatureCount doesn't work well with bbsplit. So my question is: after BBSplit, what can I use to map the reads and compute the counts? Thanks.

ADD REPLY • link 4.9 years ago by GiV17 ▴ 50

0

FeatureCount doesn't work well with bbsplit.

There should be no direct relation with the splitting. If your reads are not aligning to exons, then you can have issues with counting (assuming you are using a reference and annotation whose chromosome IDs match).

ADD REPLY • link 4.9 years ago by GenoMax 141k

0

When I use featureCounts, the percentage of reads assigned to exons is too low, while the percentages reported by STAR are greater than 80%, using the same annotation file. Why?

ADD REPLY • link 4.9 years ago by GiV17 ▴ 50

0

Can you describe the processing you are doing, step by step, for both BBTools and STAR?

As I said before there is no technical reason why this should not work with reads that have been bbsplit.

ADD REPLY • link 4.9 years ago by GenoMax 141k

0

bbsplit.sh in1=reads1.fq in2=reads2.fq ref=human.fa,mouse.fa ambiguous2=toss basename=out_%.fq refstats=Statistics_%.txt

Then I ran STAR directly:

STAR --runMode alignReads --runThreadN 12 --genomeDir Genomes/STARversion2.70f/STARversion2.70f_INDEX_HG38_GenCode_v29/ --sjdbGTFfile Genomes/GenCode_HG38_release_29/GTF/gencode.v29.chr_patch_hapl_scaff.annotation.gtf --readFilesIn 2_Esperimento/6_BBSPlit/1_CTRL/out_HG38.fq --outSAMtype BAM SortedByCoordinate --outFileNamePrefix 1_CTRL/

Then I ran featureCounts:

featureCounts -F GTF -T 12 -s 2 -g gene_name -a Genomes/GenCode_HG38_release_29/GTF/gencode.v29.chr_patch_hapl_scaff.annotation.gtf -o GeneName_Count.txt 1_CTRL/Aligned.sortedByCoord.out.bam

In this case only 11.3% of alignments were successfully assigned. Why?

Can you help me?

ADD REPLY • link updated 4.9 years ago by GenoMax 141k • written 4.9 years ago by GiV17 ▴ 50

0

Is this library "reverse stranded" for sure? What happens if you try -s 0 or -s 1 with featureCounts? Does the assignment % go up significantly?

Just to be redundant, let me make sure you have tried three different things:

1. Use bbsplit.sh to split the reads into two pools.
2. Use bbmap.sh to map the reads back to the human genome (what is the alignment % here?)
3. Use featureCounts to count.

Other workflow:

1. Use bbsplit.sh to split the reads into two pools.
2. Use STAR to map to the genome and count at the same time?

Alternative that you have tried:

1. Use bbmap.sh to map the reads to the human genome.
2. Use featureCounts to count.
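To compare the strandedness settings suggested at the top of this reply, one option is to run featureCounts once per `-s` value and compare the `.summary` file each run writes next to its output. The sketch below assumes the usual tab-separated summary layout (a `Status` header row followed by one count per status) and reads only the first sample column. The counts are invented toy numbers, not a prediction for this data set, and the function names are hypothetical.

```python
def assigned_fraction(summary_text):
    """Fraction of alignments in the first sample column that were assigned."""
    counts = {}
    for line in summary_text.strip().splitlines()[1:]:  # skip the "Status" header row
        status, value = line.split("\t")[:2]
        counts[status] = int(value)
    total = sum(counts.values())
    return counts.get("Assigned", 0) / total if total else 0.0


def best_strand_setting(summaries):
    """Return (setting, fraction) for the -s value with the highest assigned fraction."""
    fractions = {s: assigned_fraction(text) for s, text in summaries.items()}
    best = max(fractions, key=fractions.get)
    return best, fractions[best]


# Invented toy summaries, keyed by the -s value used for that run.
summaries = {
    "0": "Status\tsample.bam\nAssigned\t80\nUnassigned_NoFeatures\t20\n",
    "1": "Status\tsample.bam\nAssigned\t90\nUnassigned_NoFeatures\t10\n",
    "2": "Status\tsample.bam\nAssigned\t10\nUnassigned_NoFeatures\t90\n",
}
print(best_strand_setting(summaries))
```

If one stranded setting assigns far more alignments than the other, the library orientation was probably mis-specified. If all three are low, the breakdown of the `Unassigned_*` rows is the next thing to look at.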

ADD REPLY • link 4.9 years ago by GenoMax 141k

score 3 · Accepted Answer · 2016-03-31

3

8.1 years ago

informatics bot ▴ 760

First align all the samples to the human genome.
Then align the un-mapped reads to the mouse genome.

If you get a large portion of (un-mapped) reads mapping to mouse, then it's very likely the sample was contaminated.

ADD COMMENT • link 8.1 years ago by informatics bot ▴ 760

0

Thanks @Lando Ringel. One problem (maybe a very likely one) is that a read that actually came from mouse may map to both human and mouse.

ADD REPLY • link 8.1 years ago by ddzhangzz ▴ 90

0

That is true, but many of the mouse reads will remain unmapped. You can use BLAST (or SNAP) to look at the unmapped reads more closely (i.e. determine which organism they belong to).

Do you plan on trying to use the contaminated samples? I personally would advise against that.

ADD REPLY • link 8.1 years ago by informatics bot ▴ 760

0

In this setting, wouldn't it make more sense to align against a conjoined human/mouse reference, or to align separately to both human and mouse and assign each read's species of origin based on the quality of its alignment in sp1 vs. sp2?
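The second option (align to each genome separately, then pick the species per read) could look roughly like the sketch below. It assumes you have already reduced each alignment run to a mapping from read name to best alignment score. The score values and the margin of 5 are invented placeholders, and `assign_species` is a hypothetical helper, not part of any existing tool.

```python
def assign_species(human_scores, mouse_scores, min_margin=5):
    """Assign each read to 'human', 'mouse' or 'ambiguous' by comparing alignment scores.

    human_scores / mouse_scores map read name -> best alignment score in that genome.
    A read missing from one dict did not align to that genome.
    """
    calls = {}
    for read in sorted(set(human_scores) | set(mouse_scores)):
        h = human_scores.get(read)
        m = mouse_scores.get(read)
        if m is None:
            calls[read] = "human"       # aligned to human only
        elif h is None:
            calls[read] = "mouse"       # aligned to mouse only
        elif h - m >= min_margin:
            calls[read] = "human"       # clearly better in human
        elif m - h >= min_margin:
            calls[read] = "mouse"       # clearly better in mouse
        else:
            calls[read] = "ambiguous"   # too close to call
    return calls


# Invented toy scores, not real data.
human = {"r1": 98, "r2": 60, "r3": 95}
mouse = {"r2": 97, "r3": 94, "r4": 90}
print(assign_species(human, mouse))
```

Keeping an explicit "ambiguous" class addresses the concern raised earlier in this thread about reads from conserved regions that map to both species.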

ADD REPLY • link 8.1 years ago by russhh 5.7k