Question: why reads align to other coordinates?
0
gravatar for star
8 weeks ago by
star130
Netherlands
star130 wrote:

I have downloaded a specified region as Reference Genome from UCSC (because only this region is important for me) and aligned my reads against it using Bowtie2.

But I do not know why in my .bam file there are other regions with start position of 150864, 539750 and ... .

The Ref Genome (specified coordinate) that downloaded from UCSC.

hg19_dna range=chrX:146949598-147161084 5'pad=0 3'pad=0 strand=+ repeatMasking=none

2 first line of bam file (output of bowtie2)

D00823:89:HMHHMBCX2:1:2111:4641:60710   99  hg19_dna    150864  40  48M =   150946  130 CCTGCCTCAGCCTCCTGAGTAGCTGGGACTATAGGTGCCCGCCACCAT    DDHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    AS:i:-12    XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:2  MD:Z:31C15C0    YS:i:-5 YT:Z:CP
D00823:89:HMHHMBCX2:2:1116:19382:32083  99  hg19_dna    539750  42  48M =   540015  313 CCTATGGTAGCTCTGGCCAGCGGAACTTCCTGCATCCATGGAGCAATG    DDIIHIHIIHIIIIFEHIIIIIIIHIIIIIIHHHIIIHIIIIGIIII?    AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:48 YS:i:0  YT:Z:CP
ADD COMMENTlink written 8 weeks ago by star130
2

You've downloaded the chrX reference from position 146949598 to 147161084. This is your reference sequence. Bowtie2 does not know that you cropped the reference sequence, the reference start at 1 for the software. If you want the position on the chrX you have to take the full chrX sequence or to add 146949598 to every hit

ADD REPLYlink modified 8 weeks ago • written 8 weeks ago by Bastien Hervé3.7k

Thanks @ Bastien for your reply.

you mean the first position in my data is 146949598 + 150864 = 147100462

ADD REPLYlink written 8 weeks ago by star130
2
D00823:89:HMHHMBCX2:1:2111:4641:60710   99  hg19_dna    147100462  40  48M =   147100544  130 CCTGCCTCAGCCTCCTGAGTAGCTGGGACTATAGGTGCCCGCCACCAT    DDHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII    AS:i:-12    XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:2  MD:Z:31C15C0    YS:i:-5 YT:Z:CP
D00823:89:HMHHMBCX2:2:1116:19382:32083  99  hg19_dna    147489348  42  48M =   147489613  313 CCTATGGTAGCTCTGGCCAGCGGAACTTCCTGCATCCATGGAGCAATG    DDIIHIHIIHIIIIFEHIIIIIIIHIIIIIIHHHIIIHIIIIGIIII?    AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:48 YS:i:0  YT:Z:CP
ADD REPLYlink modified 8 weeks ago • written 8 weeks ago by Bastien Hervé3.7k

but why my last start position when I add 146949598 to it is bigger than my Ref Genome range?

my last start position is 2390138 + 146949598 = 149339736

D00823:89:HMHHMBCX2:1:1201:1895:19391   147 hg19_dna    2390138 42  48M =   2389844 -342    CTGTTTTGTCGACACTGAAAACCTGTGTATGAAGTCACCGTCATCAAA    HIIIIHIIIIIIGIHIIIGIIIIIGIIIIIIIIIIIHIIIHIIIIIDD    AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:48 YS:i:0  YT:Z:CP
ADD REPLYlink written 8 weeks ago by star130

Could you please give me the length of your reference sequence ?

ADD REPLYlink written 8 weeks ago by Bastien Hervé3.7k

it is 4231 lines. (if I did right: cat ref.fa | wc -l)

ADD REPLYlink modified 8 weeks ago • written 8 weeks ago by star130
1

I was expecting the number of nucleotide in your reference sequence... Anyway, let's assume you have a fasta format with 60 character per line, minus the header 4230 * 60 = 253 800 bases

You've downloaded a sequence that is 211 486 bases long

How did you end up with a hit around 2 000 000 ??

Could you please try to download the full chrX and restart your alignment, this will avoid the misunderstanding

ADD REPLYlink modified 8 weeks ago • written 8 weeks ago by Bastien Hervé3.7k

Thanks. I have done it and the head and tail of mt output are as below:

Head:

D00823:89:HMHHMBCX2:1:2114:9942:3900    147 X   2479520 42  48M =   2479476 -92 TGACAGGTGTGAGCCACCATGACTGGCCTGCAATGACACATTTGTATT    IIIIIIIIIIHIIIIIIIIIIIIIIIIIFFGIIIIIHIIGIIIIIIDD    AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:48 YS:i:-6 YT:Z:CP
D00823:89:HMHHMBCX2:2:2111:10601:89000  163 X   3042754 42  48M =   3042949 243 TTTCTTTCCTTAACCTTAGGGTCCGTTTTAGTTGCTAAAGGGGCATTT    DDIIIIIIIIIIIIIIIHIIIFH1<FHIHHIIIHIIIIIIIIHHHHHI    AS:i:-3 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:23T24  YS:i:0  YT:Z:CP

tail:

D00823:89:HMHHMBCX2:2:2105:10080:21276  147 X   155118974   42  48M =   155118837   -185    CTGCTCTTAAGTCTGTTGTTAAGTTTCATGTTATAGTGCCTCGTTAGA    IIIIIIIIIIIHHIIIIIIIHIIIIIHIIIIIIIIIHIHHHHIIIIDD    AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:48 YS:i:0  YT:Z:CP
D00823:89:HMHHMBCX2:1:2104:1703:56241   99  X   155125479   42  48M =   155125618   187 CCACATAGCAGGAGGTGAGTGGCCTATGAGCATTACCACCTGAGTTCT    CDHHIIIIIGHIIIIHIIIIIGIIGIIIIIIIHHHIGIIHHIGHIIIH    AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:48 YS:i:0  YT:Z:CP

Now, would you please let me know how can I crop my specific regions?

ADD REPLYlink modified 8 weeks ago • written 8 weeks ago by star130
1
samtools view input.bam "ChrX:3000000-3500000" > output.bam
ADD REPLYlink modified 8 weeks ago • written 8 weeks ago by Bastien Hervé3.7k

Thanks.

i run it but I got errore :

samtools view uniqu_fmr.bam "ChrX:146949598-147161084" > output.bam
[main_samview] random alignment retrieval only works for indexed BAM or CRAM files.

Then I run it: samtools index uniqu_fmr.bam

but I got this error:

samtools index: "uniqu_fmr.bam" is in a format that cannot be usefully indexed
ADD REPLYlink written 8 weeks ago by star130
4
gravatar for Bastien Hervé
8 weeks ago by
Bastien Hervé3.7k
Limoges, CBRS, France
Bastien Hervé3.7k wrote:

If you have a sam file

samtools sort uniqu_fmr.sam > uniqu_fmr.sorted.bam
samtools index uniqu_fmr.sorted.bam
samtools view uniqu_fmr.sorted.bam "ChrX:146949598-147161084" > uniqu_fmr_ChrX_146949598_147161084.bam

If you have a bam file change the first line to :

samtools sort -o uniqu_fmr.sorted.bam uniqu_fmr.bam
ADD COMMENTlink modified 8 weeks ago • written 8 weeks ago by Bastien Hervé3.7k

Thanks @ Bastein, it works!

ADD REPLYlink written 8 weeks ago by star130

Please follow-up on your threads. If an answer was helpful, upvote it, if it solved the issue, mark it as accepted.

enter image description here

ADD REPLYlink written 7 weeks ago by ATpoint14k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1219 users visited in the last hour