Hi all,
My major is data science and I know little about RNA-seq. However, I'm analyzing RNA-sequencing data for a lab and I found that the CER_read_count for a CTL sample is much lower than other samples. Then, I traced back to bam file, picked a gene whose CER_read_count is very low, extracted all records based on gene coordinate, and I found that most of the sequence of the second part in a pair seems very weird, and the number of records with flag 99, 147, 83 or163 is small, like 6.
SRR6761461.31668309 81 chr10 14878616 50 51M = 14877998 -669 AGAAAGCCAGGCTCCAATTATTAACACTTTAAGGACTGGGCGCTGTGGCTC GGGGGFGBGGGGGGGGGFGGCF1>F@FGGDEGGGFC<GGGEGF@GGCCCCC AS:i:0 XN:i:0 XM:i:0XO:i:0 XG:i:0 NM:i:0 MD:Z:51 YT:Z:UU NH:i:1
SRR6761461.1483110 337 chr10 14878664 0 5M1D46M chr15 84489107 0 CTCACCCTGTAATCCCAGCACTTTGGGTGGCTGAGGCGGGCCGATCACCTG :FE1BF:F@11GDGGGGFFBGF:0C0E0F@=1GFC/<GFE/DGGGFA0:AA AS:i:-11 XN:i:0XM:i:1 XO:i:1 XG:i:1 NM:i:2 MD:Z:5^G22A23 YT:Z:UU NH:i:8 CC:Z:= CP:i:105635355
HI:i:1
And then I took a look the the fastq file, the sequence of second is also abnormal:
@SRR6761461.1 1 length=47
TGTGATTCTCATCCACAGGCACCAGAAGAAGGGGACTTTCTAAGCAC
+SRR6761461.1 1 length=47
<:<AGGGGGGGGGGGGEGGEFGGGGGDGGGEGDGGGGGGGGGGGGGF
Does that mean the data quality is very low and it might be unusable?
Thank you.
Please use the formatting bar (especially the
code
option) to present your post better. You can use backticks for inline code (`text` becomestext
), or select a chunk of text and use the highlighted button to format it as a code block. I've done it for you this time.Got it. Will do next time. Thank you.
What is
CER_read_count
?That's the output of IRTools, a tool to extract exon read and intron read. CER_read_count represents exon read count.
A quality string with lots of G's is high quality, not low quality. What does the region look like in IGV?
I'm still figuring out how to insert img here. Will upload IGV screen shot later.
Can you please also take a look at this record:
The second read in this pair contains lots characters beside ACGT. Is this one low quality? Thank you.
How to add images to a Biostars post
That is not the second read. It is quality scores of read in that SAM record. Have you checked SAM file format (Page 6)?
Oh, I see. That's the problem. Thank you for your help.
That read and the one above with the flag of 81 are very close to being 'proper'. What aligner and command line did you use? Maybe your aligner is being too stringent about 'proper' pairs.
Hi,
This is how HSPA12 looks like in IGV after zoom in a little bit. you can see that reads in intron area are much more than that in exon area which should not happen.