sequence in bam file looks weird
0
0
Entering edit mode
4.3 years ago

Hi all,

My major is data science and I know little about RNA-seq. However, I'm analyzing RNA-sequencing data for a lab and I found that the CER_read_count for a CTL sample is much lower than other samples. Then, I traced back to bam file, picked a gene whose CER_read_count is very low, extracted all records based on gene coordinate, and I found that most of the sequence of the second part in a pair seems very weird, and the number of records with flag 99, 147, 83 or163 is small, like 6.

SRR6761461.31668309 81  chr10   14878616    50  51M =   14877998    -669    AGAAAGCCAGGCTCCAATTATTAACACTTTAAGGACTGGGCGCTGTGGCTC GGGGGFGBGGGGGGGGGFGGCF1>F@FGGDEGGGFC<GGGEGF@GGCCCCC AS:i:0  XN:i:0  XM:i:0XO:i:0    XG:i:0  NM:i:0  MD:Z:51 YT:Z:UU NH:i:1
SRR6761461.1483110  337 chr10   14878664    0   5M1D46M chr15   84489107    0   CTCACCCTGTAATCCCAGCACTTTGGGTGGCTGAGGCGGGCCGATCACCTG :FE1BF:F@11GDGGGGFFBGF:0C0E0F@=1GFC/<GFE/DGGGFA0:AA AS:i:-11    XN:i:0XM:i:1    XO:i:1  XG:i:1  NM:i:2  MD:Z:5^G22A23   YT:Z:UU NH:i:8  CC:Z:=  CP:i:105635355

HI:i:1

And then I took a look the the fastq file, the sequence of second is also abnormal:

@SRR6761461.1 1 length=47
TGTGATTCTCATCCACAGGCACCAGAAGAAGGGGACTTTCTAAGCAC
+SRR6761461.1 1 length=47
<:<AGGGGGGGGGGGGEGGEFGGGGGDGGGEGDGGGGGGGGGGGGGF

Does that mean the data quality is very low and it might be unusable?

Thank you.

samtools sequence • 1.0k views
ADD COMMENT
0
Entering edit mode

Please use the formatting bar (especially the code option) to present your post better. You can use backticks for inline code (`text` becomes text), or select a chunk of text and use the highlighted button to format it as a code block. I've done it for you this time.
code_formatting

ADD REPLY
0
Entering edit mode

Got it. Will do next time. Thank you.

ADD REPLY
0
Entering edit mode

What is CER_read_count?

ADD REPLY
0
Entering edit mode

That's the output of IRTools, a tool to extract exon read and intron read. CER_read_count represents exon read count.

ADD REPLY
0
Entering edit mode

A quality string with lots of G's is high quality, not low quality. What does the region look like in IGV?

ADD REPLY
0
Entering edit mode

I'm still figuring out how to insert img here. Will upload IGV screen shot later.

Can you please also take a look at this record:

SRR6761461.11957243 161 chr10   14878790    50  51M =   14879032    293 GCGTGGTGGTGGTGCCTTTAGTACCAGCTACTTGGGAGGCTGAGGCAGGAG 33:0</>BDGGB/FG@:FB>@11=DGGG1CE1=1E0EF/=0=:FG0/=/EF AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:51 YT:Z:UU NH:i:1

The second read in this pair contains lots characters beside ACGT. Is this one low quality? Thank you.

ADD REPLY
2
Entering edit mode

How to add images to a Biostars post

The second read in this pair contains lots characters beside ACGT

That is not the second read. It is quality scores of read in that SAM record. Have you checked SAM file format (Page 6)?

ADD REPLY
0
Entering edit mode

Oh, I see. That's the problem. Thank you for your help.

ADD REPLY
0
Entering edit mode

That read and the one above with the flag of 81 are very close to being 'proper'. What aligner and command line did you use? Maybe your aligner is being too stringent about 'proper' pairs.

ADD REPLY
0
Entering edit mode

Hi,

This is how HSPA12 looks like in IGV after zoom in a little bit. you can see that reads in intron area are much more than that in exon area which should not happen. enter image description here

ADD REPLY

Login before adding your answer.

Traffic: 2705 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6