Question

chrY.fa parsing problem

0

Entering edit mode

7.4 years ago

michael • 0

The .fai is straightforward...

chrY 57772954 6 50 51

And the header is straightforward...

chrY ctaaccctaaccctaaccctaaccctaaccctaaccctCTGaaagtggac

Do I have it right that the first value listed is position 1 and that all values proceed sequentially from there?

According to ybrowse.org, chrY:1..1 is Y-SNP CTS3047, T>C. If that's correct, why does this sequence start with C?

I've searched the .fa for long, unique strings, found the start position, then calculated an offset. But it's good only just on either side of the string. When I search for something else using the same offset, the result is not correct. I've tried this 3 or 4 times.

What am I missing?

BTW, I'm positive I posted this earlier but I don't see it anywhere. My apologies if this turns out to be a dupe.

-Michael

SNP parsing fasta • 1.6k views

ADD COMMENT • link 7.4 years ago by michael • 0

0

Entering edit mode

Well, we don't really agree on numbering, so dependent on which source you use the first nucleotide might be the 0 or 1. Welcome to bioinformatics.

ADD REPLY • link 7.4 years ago by WouterDeCoster 47k

0

Entering edit mode

Thanks for your response, Wouter.

However, there is no position 0 on the Y chromosome. It's true that the next one is a T, but I've had to use offsets in the thousands to properly locate strings.

For example, here's a clean line from a SAM, no insertions, deletions, mismatches, softclips, etc. It starts at position 2649803 AAAAAAAAAAAAAAAAAAAAAAAAGTTAATCATTTATGTCGTTACATAAATGGGCACTTACACATAGACGTATAGCTCAGAAGGTATATAAGCTCTAT AAAACTTTATC

Remember, I've eliminated the header in chrY.fa and pulled the entire sequence into one line. Position and byte should line up. But if I go to 2649803 I get a T, not A. If I take into account the Y2K problem, as I call it, position 2649804 does give me an A but 2649805 gives me a C. Conversely, 2649802 returns a C. So, clearly, I'm not in the range.

Now, if I search for the sting in chrY.fa using a simple text editor, I find that it starts at byte 2709803. That's an offset of 60000 ! Then I subtract 1 for the Y2K issue and I find it. But if I look for position 1, I get N, not a C and all other searches fail.

That's at least the 4th time I've done this. With each large string I try, I end up with a different offset and most other searches, but that one, fails.

Any help would be appreciated.

Thanks, Michael

ADD REPLY • link 7.4 years ago by michael • 0

0

Entering edit mode

I was a little too brief earlier, so I'll try to suggest some causes of your discrepancy.

I don't believe that we know the sequences at the telomere and centromere of the Y chromosome. More exactly, we don't know those from any chromosome because those are variable and very repetitive. So chrY:1 is definitely a 'N' nucleotide. You can confirm this for yourself using the UCSC genome browser. Also have a look at the centromere: long stretche of N without unknown exact length and definitely unknown sequence.

You can also use the blat tool from UCSC to check where your sequence can be found. For example, AAAAAAAAAAAAAAAAAAAAAAAAGTTAATCATTTATGTCGTTACATAAATGGGCACTTACACATAGACGTATAGCTCAGAAGGTATATAAGCTCTAT maps on the Y chromosome from 2649803-2649900 (100% exact match). If you go to the genome browser, you'll indeed see the string of A's where you expect them.

I'm not sure where you got that chrY.fa, but something seems off there.

ADD REPLY • link 7.4 years ago by WouterDeCoster 47k

0

Entering edit mode

I'm pretty sure I got it off UCSC, but I've downloaded a new one. It's different by a long shot, but I'm still having problems. In looking at positions 2649800-2649808 (both sides of the target) I'm getting the sequence aatccactt. If I search my revised .fa I find this same string at byte 2781762. So I've determined an offset of 131959-1, and that works -- to a point. I have good results using random selections up to 8000012, but it's failing by 9000004. Also, the last position in my test SAM is 59033981, but the .fa goes only to 57227415. So, I'm still missing an important component.

The time stamp on chrY.fa.gz is 23-Jan-2014.

I guess I'm wrong in thinking that things should be this unclear. :-)

ADD REPLY • link 7.4 years ago by michael • 0

0

Entering edit mode

Yeah, by the time I get to 9000004 the offset has increased from 131959 to 162391. :(

-Michael

ADD REPLY • link 7.4 years ago by michael • 0

0

Entering edit mode

I can't find aatccactt in the genome browser at chrY:2649800-2649808 for both hg38 or hg19. I'm not sure which genome build you are looking at (but that makes quite a difference!). If it's 23/1/2014, probably it's hg38 (which is the latest, so that's good). [In addition, that string is quite short so you can find it at many many spots in the genome.]

ADD REPLY • link 7.4 years ago by WouterDeCoster 47k

0

Entering edit mode

I worked on this all day yesterday and made a lot of progress. However, as is almost always the case, the easiest answer is generally the best. FTDNA uses build 37.3 ! I'll look today for the earlier chrY.fa version.

I'm coming at this bass-ackwards, starting with genetic genealogy and working back. I would have started with the existing tools, but haven't been able to find good tutorials, and least those that don't assume the user has a background in the field. But, as I said, I love parsing things anyway. Still, I need to take a few good classes. But not now. At 66 I'm working on my MA in English (emphasis in creative nonfiction). Still, I have plenty of time ahead of me!

Thanks for sticking with me on this! But I'll likely be back. :)

-Michael

ADD REPLY • link 7.4 years ago by michael • 0

0

Entering edit mode

I'm not sure what you are trying to achieve, but good luck either way :-)

ADD REPLY • link 7.4 years ago by WouterDeCoster 47k

0

Entering edit mode

I just wanted to write a perl subroutine that will extract the value at a given position. As those who like the word would say, "That's trivial." Well it would be if one had the correct files! With the routine, I can batch calls and easily verify results. Again, thanks for sticking with me.

I tried pasting the code in but it doesn't display properly,

-Michael

ADD REPLY • link 7.4 years ago by michael • 0

0

Entering edit mode

I see I didn't reply properly to this. The problem isn't so much in determining where on the chromosome a strand is located--the SAM file tells me that--but to verify a potential SNP. The BAMs / SAMs I've looked at a loaded with false mismatches, denoted by an X.

I know that there are tools out there that will do that, but I've successfully written several scripts that parse SAMs and successfully extracts all known SNPs. I'd now like to take the resultant data set and compare it to the hg19 Y for new SNP discovery.

Thanks, Michael

ADD REPLY • link 7.4 years ago by michael • 0

0

Entering edit mode

And now I read you are looking at hg19? Sure you aren't mixing up genome builds?

ADD REPLY • link 7.4 years ago by WouterDeCoster 47k