Removing a chromosome from a fasta file
6.1 years ago
t86dan ▴ 30

Hello,

I have a FASTA file for the human genome GRCh37 reference assembly, but for some reason chromosome 20 is repeated and I would like to remove the duplicate. I know it's repeated because when I use grep to look for '>', it shows the following list of all the chromosomes (chr20 appears twice):

chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr20 chr21 chr22 chrX chrY chrM

Any help on removing the repeated chromosome 20 would be appreciated.

Thanks in advance!

fasta chromosome sequence • 2.8k views

I would also double check that the sequence is the same.
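
One way to run that check is sketched below. It pulls out the first and second record with a given name, strips line breaks so that different line wrapping does not matter, and compares checksums. The helper names `nth_record_seq` and `same_sequence` are made up for this example, and it assumes the name is the first word of the header line (for example `>chr20`). A matching checksum is strong evidence rather than proof, and soft-masked (lowercase) bases would count as a difference.

```shell
# Hypothetical helpers, not part of any existing tool.

# Print the sequence of the K-th record named NAME as one unbroken line.
nth_record_seq() {  # usage: nth_record_seq FILE.fa NAME K
  awk -v target=">$2" -v k="$3" '
    /^>/ {
      keep = 0                          # stop copying at every new header
      if ($1 == target) {               # header whose first word matches
        count++
        keep = (count == k)             # copy only the K-th match
      }
      next                              # never print header lines
    }
    keep { print }
  ' "$1" | tr -d '\n'
}

# Report whether the first two records named NAME carry the same sequence.
same_sequence() {  # usage: same_sequence FILE.fa NAME
  first=$(nth_record_seq "$1" "$2" 1 | cksum)
  second=$(nth_record_seq "$1" "$2" 2 | cksum)
  if [ "$first" = "$second" ]; then
    echo identical
  else
    echo different
  fi
}
```

Running `same_sequence your_fasta_file.fa chr20` should then print `identical` if both copies hold the same bases, in which case dropping either copy is safe.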

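This keeps the first record for each header name (the first word after '>') and skips any later record with the same name:

awk 'BEGIN {write=1;} /^>/ { if ($1 in seen) {write=0; next;} else {seen[$1]=1; write=1;}} {if (write==1) print $0;}' your_fasta_file.fa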
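
For reuse, the same idea can be wrapped in a function together with a quick check of the result. This is only a sketch: `dedup_fasta` and `list_duplicate_names` are hypothetical names, and it relies on the common awk idiom `!seen[key]++`, which is true only the first time a key is met. Like the one-liner above, it treats two records as duplicates when the first word of their headers matches, without comparing the sequences.

```shell
# Hypothetical helpers, not part of any existing tool.

# Keep only the first record for each sequence name.
dedup_fasta() {  # usage: dedup_fasta INPUT.fa > OUTPUT.fa
  # On a header line, keep is 1 if the name is new and 0 otherwise;
  # the bare pattern "keep" then prints every line of kept records.
  awk '/^>/ { keep = !seen[$1]++ } keep' "$1"
}

# Print each sequence name that occurs more than once.
# No output means there are no duplicated names.
list_duplicate_names() {  # usage: list_duplicate_names FILE.fa
  awk '/^>/ { print substr($1, 2) }' "$1" | sort | uniq -d
}
```

For example, `list_duplicate_names your_fasta_file.fa` should print `chr20` for the file described in the question, and after `dedup_fasta your_fasta_file.fa > deduplicated.fa` the same check on `deduplicated.fa` should print nothing. Writing to a new file rather than overwriting the input keeps the original available for comparison.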

Is there a reason why this happened? Duplicated sequences are understandable, but a whole chromosome duplicated? Are you working with cancer data?
