Question: About The 10Th Column Of The Output Sequence From Samtools View Command
0
gravatar for Zhshqzyc
6.2 years ago by
Zhshqzyc420
Zhshqzyc420 wrote:

Hello, I used samtools to get the data for chromosome 22. The command is:

samtools view input.bam 'chr22'  >raw_chr22.txt

Then I extracted the 3,4 and 10th columns from it. The partial data for demo purpose like

chr22 14430092 NNNNCNAGCNGAGCGNNTCTGGGNACCTCGAAGGCAGACATG
chr22 14430092 NNNNNNNNNNNTNNNNNNNNANNGNNTNNNNNNNNNNNNNNN
chr22 14430092 CCTCGCGGGACTGGTATGGGGACGGTCATGCAATCTGGACAA

The data has three columns, chromosome, position and sequence. My question is that they all have the same position but different sequence. So what is the real sequence in this specific position.

Thanks

sequence samtools • 2.6k views
ADD COMMENTlink written 6.2 years ago by Zhshqzyc420
1
gravatar for brentp
6.2 years ago by
brentp22k
Salt Lake City, UT
brentp22k wrote:

It might help if you post the output of all the columns. What aligner did you use. You can check the actual sequence within something like:

samtools faidx $REF chr22:14430092-14430142

replacing $REF with your reference fasta file. When I try this with hg19, it gives:

NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

So maybe whatever aligner you're using is mapping to a region of N's? Again, it would be useful if you showed all the columns, then we could see the flag.

Update:

Based on the file you posted, those all have mapping quality 0, so you should disregard those alignments unless you have reason not to. Also, one of them has 1113 in the bit flag. If you go here, you can see that it's an optical duplicate with an unpaired mate--further reason not to consider it a valid alignment.

ADD COMMENTlink modified 6.2 years ago • written 6.2 years ago by brentp22k

The file(test_chr22.txt) is here: https://skydrive.live.com/?cid=03a8bb9eaeeff1db#!/?cid=03a8bb9eaeeff1db&sc=documents&uc=1&id=3A8BB9EAEEFF1DB%21234

Thanks.

ADD REPLYlink written 6.2 years ago by Zhshqzyc420

out of curiosity, if you down-vote, could you also let me know why. I'm happy to be enlightened.

ADD REPLYlink written 6.2 years ago by brentp22k

Sorry, it was a mistakenly hit. I supposed to hit up-vote. But my brain is not working well today. I will correct it. So can you get the real sequence within the region? Any language is fine.

ADD REPLYlink written 6.2 years ago by Zhshqzyc420

zhs... my answer already shows how to get the real sequence with samtools faidx.

ADD REPLYlink written 6.2 years ago by brentp22k

But it is from gh18/hg19 reference, my bam file is from a patient. Can I get the sequence from the patient based on my output file? Maybe it is a wrong question because my ignorance of genome area?

ADD REPLYlink written 6.2 years ago by Zhshqzyc420

In that case, ask the patient if they have the reference file that they used to do the alignment.

ADD REPLYlink written 6.2 years ago by brentp22k

I still don't understand it. I want to write C# code to extract the sequence by "FLAG" and start position from the output file. Suppose a region is given, can I use substring method in c#/java from 1113 14430092 CCTCGCGGGACTGGTATGGGGACGGTCATGCAATCTGGACAA Such as string s = CCTCGCGGGACTGGTATGGGGACGGTCATGCAATCTGGACAA; if FLAG ==some condition Then sequence = s.Substring(start,end);

ADD REPLYlink written 6.2 years ago by Zhshqzyc420
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 818 users visited in the last hour