1
0
Entering edit mode
18 months ago
pkallurkar • 0

Hello,

patch-seq RNA-Seq STAR FastQC • 884 views
0
Entering edit mode

Some undoubtedly are but they are not all likely to be duplicates.

SMRT-seq is the full length RNAseq kit correct? Have you taken some of the unmapped reads and blasted them at NCBI to see what they are. Specifically to rule out problem with contamination of some sort.

0
Entering edit mode

I am very new to this. I tried blasting only the unmapped reads. I keep getting Server error on NCBI. So I tried the command line version of BLAST which generated a ~274GB text file. I don't know how to visualize this file or even open this file as my Mac keeps crashing whenever I try. Any ideas?

0
Entering edit mode

When doing diagnostic testing you should take a few reads (< 10) and then use the web blast interface to do the search. Convert the fastq sequence to fasta format when you paste them into blast web search.

0
Entering edit mode

I tried for the top 10 sequences of the unmapped reads for my first Sample. The percent identity hit for Mus Musculus is >99.13% e-value <3e-50. The rest are with rats and other members of the rodent family plus primates. This is very confusing for me now, if the e-value is so low why is STAR assigning it as unmapped?

0
Entering edit mode

Are you sure you are selecting unmapped reads? Did you use --outReadsUnmapped Fastx option to get reads that are unmapped?

Another possibility. If these reads are multi-mapping (I think STAR allows max of 10 locations by default) STAR may be marking them as unmapped.

0
Entering edit mode

Yes, I did use --outReadsUnmapped Fastx to get the unmapped reads. The amount of multi-mapped reads is relatively low (4.62 ± 1.46 %).

0
Entering edit mode

This is puzzling. You have very high % of unmapped reads but the reads when spot checked at NCBI are showing up as aligning to correct genome. Did you run STAR with more or less default options? If you did not then could you do that and see if the % unmapped number goes down?

0
Entering edit mode

I mostly used the default options: For indexing, I did set the --sjdbOverhang 149 as I have 2*150bp reads. And set the genomeSAsparseD to 2 to save computing power. The code line used:  STAR --runThreadN 4 --runMode genomeGenerate --genomeDir {path to genome folder} --genomeFastaFiles Mus_musculus.GRCm38.dna.primary_assembly.fa --sjdbGTFfile Mus_musculus.GRCm38.102.gtf --sjdbOverhang 149 genomeSAsparseD 2 

For alignment, I used default settings:  STAR --runThreadN 4 --runMode alignReads --readFilesCommand gunzip -c --genomeDir {path_to_Dir} --outFileNamePrefix Sample1_R1 --readFilesIn {read_1}, {read_2} --outSAMtype BAM SortedByCoordinate --outReadsUnmapped Fastx 

Does the above code look alright or what parameters should I play around with to improve my results?

0
Entering edit mode

I am not a regular STAR user but those are more or less default options for alignments. I think that

--outFilterMultimapNmax
default: 10
int: maximum number of loci the read is allowed to map to. Alignments (all of them) will be output only if the read maps to no more loci than this value.


is causing these reads to end up in "unmapped files". Even if you are not explicitly changing this parameter it is getting applied when a read maps to more than 10 locations.

Does your data have rRNA contamination? Is it from FFPE type samples where you have degraded short RNA's?

0
Entering edit mode

No, we don't have FFPE type samples. We extracted samples from fresh tissue. I'm mapping the samples to rRNA FASTA from NCBI to check for rRNA contamination. I just did Qualimap on the sample and see a relatively high percentage of introns in some of the samples which also points to rRNA contamination.

1
Entering edit mode

Let us know how that goes. rRNA contamination seems like the best bet for now.

0
Entering edit mode

I'm still running mapping for rRNA fasta. But I noticed something in one of the unmapped reads that there is a sequence, primarily comprising of A's (with N) occurring multiple times. I have attached the image with these sequences in the red box. I think it is crap and I don't get any hits on BLAST. Any idea what does this mean and why this is appearing:

**>MG01HX01:1039:HCTMGCCX2:8:1101:1255:54717/1 AAANAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAAAACAAACAAAAAAAATTAAAAGAAA

MG01HX01:1039:HCTMGCCX2:8:1101:1255:54717/2 NTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGTTTTGTTTTTTTTTGTTTTTT** MG01HX01:1039:HCTMGCCX2:8:1101:1255:57213/1 TGTNGTTCTGTGTAGTTCATATAAAATGGAAAGGGCAAAGACAAGATTCAGATTAAATATTAAATATTATTCTGGAAGCTGGCTAGACTCTCAGAGGGTCAGGAACACAGAATTAGTTTGGAACTTTTGAGAGGAGATCGGAAGAGCACAC MG01HX01:1039:HCTMGCCX2:8:1101:1255:57213/2 NCTCTCAAAAGTTCCAAACTAATTCTGTGTTCCTGACCCTCTGAGAGTCTAGCCAGCTTCCAGAATAATATTTAATATTTAATCTGAATCTTGTCTTTGCCCTTTCCATTTTATATGAACTACACAGAACTACAAGATCGGAAGAGCGTCG >MG01HX01:1039:HCTMGCCX2:8:1101:1255:61784/1 AAANAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA MG01HX01:1039:HCTMGCCX2:8:1101:1255:61784/2 NTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT

**>MG01HX01:1039:HCTMGCCX2:8:1101:1255:71735/1 AAANAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

MG01HX01:1039:HCTMGCCX2:8:1101:1255:71735/2 NTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGTTTTTTTTTTTGTTTTGTTTTTTTTTTTT** MG01HX01:1039:HCTMGCCX2:8:1101:1255:72332/1 AAANAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAAAAAACAACACCAAAACAAAGAATAAAAAAAACAAAAACTAAAAAGACAAACAGAGATAC MG01HX01:1039:HCTMGCCX2:8:1101:1255:72332/2 NTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCTTTTTTTTTTTTTTTTTTTTTTTTTCTTTTGTTTTTGTGTTTTCTCTTTTCTTTTC MG01HX01:1039:HCTMGCCX2:8:1101:1255:73387/1 CCANTGCTTAAAAAAAAAAAAAAAAAAAAAAAAAACTCTGCGATGAAACCAAAAATTAAAAAAAAAAAAAAAAAAAAAAAAAACTTCGATGATAAAACTCCTATAAAAAAAAAAAAAAAGCTTAAAAAAAAAAAAAAAAAAAAAAAAAAAA MG01HX01:1039:HCTMGCCX2:8:1101:1255:73387/2 NTTTTTTTTTTTTTTTATCAGTTGTATCAACGCTTTGTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCTCTTTTTTTTGCACTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCTTTTTTTTTCTTTGTGTGTT >MG01HX01:1039:HCTMGCCX2:8:1101:1265:31599/1 AAANAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAAAAAAAAAAAAAAAAAA MG01HX01:1039:HCTMGCCX2:8:1101:1265:31599/2 NTCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT

1
Entering edit mode

Those reads are not useful. Perhaps they are a side effect of SMRTseq. So your STAR result may be ok after all. STAR was able to align whatever reads it could and the rest are .. you know.

0
Entering edit mode

That's a bummer. Thanks for the reply. Is the data still usable though?

0
Entering edit mode

Some samples have very high (70% unmapped) reads so not sure how that is going to affect your downstream analysis. You may have to try that out and see what happens. Your power to detect changes may be limited by amount of data that aligned.

What genome is the data from? How many million reads have aligned for the sample where 70% reads are unmapped.

You could use bbduk.sh and scan/remove these polyA/polyT containing reads. My hunch is that remaining reads would likely equate to what STAR was able to align.

0
Entering edit mode

The genome is the mouse genome (mm10). The sample with ~70% unmapped reads has ~21.5M uniquely aligned reads (number of raw reads >70M), so definitely, it is not the best sample.

I will try to scan and remove the polyA/polyT containing reads as you suggested. Hopefully, it helps in making a call on which sample is usable and which is not.

1
Entering edit mode

If you have ~22 M mapped reads in that worst sample then all data should be OK. That is good news. I would say instead of spending time on the unmapped reads go forward with your analysis.

0
Entering edit mode

You just made my day. Thanks a ton.

1
Entering edit mode
18 months ago

unmapped reads have not been mapped on the REFerence genome

Duplicates can arise during sample preparation e.g. library construction using PCR. See also EstimateLibraryComplexity for additional notes on PCR duplication artifacts. Duplicate reads can also result from a single amplification cluster, incorrectly detected as multiple clusters by the optical sensor of the sequencing instrument. These duplication artifacts are referred to as optical duplicates.

PCR-associated duplication artifacts can result from: inadequate amounts of starting material (genomic DNA, cDNA, etc.), losses during cleanups, and size selection issues. Duplicate reads can also arise from optical duplicates resulting from sequencing-machine optical sensor artifacts.

0
Entering edit mode

0
Entering edit mode

Yes.

0
Entering edit mode

Thank you, really appreciate the prompt replies.