Question: How to generate a new FASTA from an assembly-assembly mapping ?
0
gravatar for maxime.policarpo
7 months ago by
France, Paris
maxime.policarpo20 wrote:

Hi everyone,

As you may know, the NCBI offer a "assembly-assembly remapping service" which basically map contigs of one genome (let's call it Genome 1) to the assembly of another genome (lets call it GenomeRef) My purpose is to generate a new file (FASTA) which consists on all the sequences of my mapped contigs (in the right order depending on the mapping) and replace the GenomeRef positions which have not been mapped by N .

I downloaded the GFF file provided on the NCBI website :

Example of a line :

Contig1 RefSeq match 49000 49020 . + . ID=xxxxxxxxxxxx;Target=GenomeRefContig5 6467734 6467754 +;best_on_query=1;best_on_query_same_unit=1;best_on_subject=1;best_on_subject_same_unit=1;gap_count=0;genomic_to_genomic=1;num_ident=21;num_mismatch=0;pct_coverage=0.000282069;pct_coverage_hiqual=0.000282069;pct_ident_quantized=98;pct_identity_gap=100;pct_identity_gapopen_only=100;pct_identity_ungap=100;reciprocity=3;same_unit_reciprocity=3

The problems i am facing while doing it are :

  • The GFF file is not sorted depending on the GenomeRef sequences but on the Genome 1 contigs

  • I think i would be able to use samtools faidx and a for loop to retrieve all my contigs sequences from the Genome 1 fasta file but how to replace gaps between those contigs by N depending on the gap length of the GenomeRef ?

[There will be a lot of gaps since the Genome 1 is a very poorly sequenced genome while the GenomeRef is a greatly sequenced genome (PacBio)]

I hope it is clear,

Thanks a lot !

ADD COMMENTlink modified 6 months ago by tdmurphy110 • written 7 months ago by maxime.policarpo20

I have already been able to do ->

Remove the extra informations i don't want :

cut -f1,4,5,9 -d$'\t'  file.gff | sed 's/ID.*Target=//g' | sed 's/-.*\|+.*//g'  > simplified.gff

Sort the simplified file depending on the Genome ref scaffold and remove headers :

sort -k4 simplified_gff > final.gff

sed 's/#!gff//g' final_gff.gff | sed 's/##gff//g' | sed 's/#!processor NCBI annotwriter//g' >  MyGFF.gff

sed -i '/^$/d'  MyGFF.gff

Then i seperated every informations in separated files :

while read line ; do echo "$line" | cut -f1 ; done < MyGFF.gff >> line1                       #Name of Genome 1 contig

while read line ; do echo "$line" | cut -f2 ; done < MyGFF.gff >> line2                       #Start pos of genome 1 contig

while read line ; do echo "$line" | cut -f3 ; done < MyGFF.gff >> line3                       #End pos of genome 1 contig

while read line ; do echo "$line" | cut -f4 ; done < MyGFF.gff >> ligneINTER

while read line ; do echo "$line" | cut -f1 -d " " ; done < ligneINTER >> line4               #Name of GenomeRef scaffold

while read line ; do echo "$line" | cut -f2 -d " " ; done < ligneINTER >> line5               #Start pos of GenomeRef scaffold

while read line ; do echo "$line" | cut -f3 -d " " ; done < ligneINTER >> line6               #End pos of GenomeRef scaffold

Then for each line, i can make a samtools to catch the sequence of my Genome1 and add it to a file that have the GenomeRef scaffold as name :

####wc -l file.gff = X
####END = X

for i in $(seq 1 $END) ; do samtools faidx Genome1.fasta `sed "${i}q;d" line1`:`sed "${i}q;d" line2`-`sed "${i}q;d" line3` >> `sed "${i}q;d" line4`.fasta ;done

But i really don't see how i can add the N's ....

ADD REPLYlink modified 7 months ago by genomax55k • written 7 months ago by maxime.policarpo20
1
gravatar for tdmurphy
6 months ago by
tdmurphy110
tdmurphy110 wrote:

On the NCBI Remap FTP site, the alignment files for each assembly pair are provided twice (assm1/assm2 and assm2/assm1). The alignments themselves are flip-flopped for query and target, which would take care of most of your sorting problem.

ADD COMMENTlink written 6 months ago by tdmurphy110
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 565 users visited in the last hour