Question: How to split a multi fasta file into individual chromosomes
1
gravatar for sunnykevin97
8 days ago by
sunnykevin9710
sunnykevin9710 wrote:

Hello, I had a multi line fasta file with contigs for each chromosome. How to split the fasta file in to individual chromosomes ?

here it looks ---

  head test.fa
>chr1:3054106-3054192
ATTCAGCTGAGAAAAATGCAGACCATAAAAAAGCAGCAGGCACTGCTCGAAACTAGTAGCAATGTAGACGAGATGATAGTACTTAA
>chr1:3054204-3054296
TGAAGTATCTGAAGACTTGACGACATGGGAAGACCCACTACTAAATGAAGGAAGTCTGAGGAAAAACAAATCTTCAGCACGCTGGAGAAAGT
>chr1:3054311-3054381
ATGAAAACATAGATGCTGTGTCCTGGGAAAAAAGGAGGAAAATAACAAAGCTGCCAAAATATCTCGTGAG
>chr1:3054409-3054509
AGAACTAACTGTACTGAGAAAAGAAAATTCCGCTTTAAAAGCTGAACTGTTTTTGCTAAAATTAAAGTTTGGTTTAATTAGCTCCATAGCATATGCTCAG

>chrX:119973608-119973699
AAGCATCACTCCGAGAATAAGGGCCTAGACAAAGTGGTGGAGACTCAAGCCCAAGTGGATGAACTGAAAGGAATCATGGTCAGAAACATAG
>chrX:119980524-119980592
ATCTTGTAGCTCAGCGTGGAGAAAGATTGGAATTATTGATAGACAAAACAGAAAACCTCGTGGATTCA
>chrX:119989360-119989453
TCTGTCACCTTCAAAACCACCAGCAGAAACCTTGCCCGAGCCATGTGTATGAAGAATCTCAAGCTCACTATTATCATCATCATTGTATCAATT
>chrX:119993756-119993825
GTGTTCATCTATATCATTGTCTCTCCTGCCTGTGGTGGATTTACATGGCCAAGCTGTATAAAGAAATGA
sequence alignment • 148 views
ADD COMMENTlink modified 8 days ago by finswimmer8.9k • written 8 days ago by sunnykevin9710

Split the multiple sequences file into a separate files and plenty of other solutions.

I would personally recommend faSplit from Jim Kent's (UCSC) utils. Unlikely that there is a more performant solution.

Edit: On a closer read you want to split into chunks at chromosome level. Take a look at thread above to see if you can modify one of the answers. We will look into this some more.

ADD REPLYlink modified 8 days ago • written 8 days ago by genomax60k

What have you tried? This is something that could be done in biopython.

ADD REPLYlink written 8 days ago by WouterDeCoster35k

I don't have much programming knowledge, quite new to NGS data analysis.

I tried it

./faSplit byname Arf.fasta /home/

It splits each contig in to a separate file. (I don't want this) can It group all sequences starting with chr1, chr2... in to separate individual chromosome files ??

suggestions please!

ADD REPLYlink modified 8 days ago • written 8 days ago by sunnykevin9710
3

I don't have much programming knowledge

None of us was born with programming skills, this would be a fine practice exercise. The first script you write is going to take a long time and will be painful. It will be easier for every script you write afterward.

ADD REPLYlink written 8 days ago by WouterDeCoster35k

Please take a look at this code I wrote, it is not that far from what you want to achieve. Try to catch how it works and change some lines

A: fastq parsing inefficiencies with python

ADD REPLYlink written 8 days ago by Bastien Hervé2.8k

sunnykevin97 : You can cheat a bit.

faSplit byname test.fa ./

Then cat the files per chromosome together. cat chr1:*.fa > chr1_final.fa through cat chrMT:*.fa > chrMT_final.fa

ADD REPLYlink modified 8 days ago • written 8 days ago by genomax60k

Exactly I did the same this.

ADD REPLYlink written 8 days ago by sunnykevin9710

As a general hint, related to Genomax's other comment. If you're already achieved splitting all of the sequences in to individual files, you can put them back together again using cat

For instance if you have:

$ ls chr*
chr1:3054106-3054192.fasta  chr1:3054311-3054381.fasta  chrX:119973608-119973699.fasta  chrX:119989360-119989453.fasta
chr1:3054204-3054296.fasta  chr1:3054409-3054509.fasta  chrX:119980524-119980592.fasta  chrX:119993756-119993825.fasta

You can just do cat chr1:*.fasta > chr1.fasta, cat chrX:*.fasta > chrX.fasta and so on. If you have lots to do, you can work out how to incorporate this in to a loop, or parallel process.

ADD REPLYlink written 8 days ago by jrj.healey9.8k
5
gravatar for jrj.healey
8 days ago by
jrj.healey9.8k
United Kingdom
jrj.healey9.8k wrote:

Tweaking my answer from Genomax's linked thread:

#!/bin/bash

while read line ; do
  if [ ${line:0:1} == ">" ] ; then
    filename=$(echo "$line" | cut -d ":" -f1 | tr -d ">")
    touch ./"$filename".fasta
    echo "$line" >> ./"${filename}".fasta
  else
    echo "$line" >> ./"${filename}".fasta
  fi
done < $1

Input:

$ cat test.fa

>chr1:3054106-3054192
ATTCAGCTGAGAAAAATGCAGACCATAAAAAAGCAGCAGGCACTGCTCGAAACTAGTAGCAATGTAGACGAGATGATAGTACTTAA
>chr1:3054204-3054296
TGAAGTATCTGAAGACTTGACGACATGGGAAGACCCACTACTAAATGAAGGAAGTCTGAGGAAAAACAAATCTTCAGCACGCTGGAGAAAGT
>chr1:3054311-3054381
ATGAAAACATAGATGCTGTGTCCTGGGAAAAAAGGAGGAAAATAACAAAGCTGCCAAAATATCTCGTGAG
>chr1:3054409-3054509
AGAACTAACTGTACTGAGAAAAGAAAATTCCGCTTTAAAAGCTGAACTGTTTTTGCTAAAATTAAAGTTTGGTTTAATTAGCTCCATAGCATATGCTCAG
>chrX:119973608-119973699
AAGCATCACTCCGAGAATAAGGGCCTAGACAAAGTGGTGGAGACTCAAGCCCAAGTGGATGAACTGAAAGGAATCATGGTCAGAAACATAG
>chrX:119980524-119980592
ATCTTGTAGCTCAGCGTGGAGAAAGATTGGAATTATTGATAGACAAAACAGAAAACCTCGTGGATTCA
>chrX:119989360-119989453
TCTGTCACCTTCAAAACCACCAGCAGAAACCTTGCCCGAGCCATGTGTATGAAGAATCTCAAGCTCACTATTATCATCATCATTGTATCAATT
>chrX:119993756-119993825
GTGTTCATCTATATCATTGTCTCTCCTGCCTGTGGTGGATTTACATGGCCAAGCTGTATAAAGAAATGA

This creates a new file in the current dir for every sequence which has >chr# at the start:

$ bash splitfasta.sh test.fa

$ ls chr*

chr1.fasta  chrX.fasta

Then,

$ cat chr1.fasta yields:

>chr1:3054106-3054192
ATTCAGCTGAGAAAAATGCAGACCATAAAAAAGCAGCAGGCACTGCTCGAAACTAGTAGCAATGTAGACGAGATGATAGTACTTAA
>chr1:3054204-3054296
TGAAGTATCTGAAGACTTGACGACATGGGAAGACCCACTACTAAATGAAGGAAGTCTGAGGAAAAACAAATCTTCAGCACGCTGGAGAAAGT
>chr1:3054311-3054381
ATGAAAACATAGATGCTGTGTCCTGGGAAAAAAGGAGGAAAATAACAAAGCTGCCAAAATATCTCGTGAG
>chr1:3054409-3054509
AGAACTAACTGTACTGAGAAAAGAAAATTCCGCTTTAAAAGCTGAACTGTTTTTGCTAAAATTAAAGTTTGGTTTAATTAGCTCCATAGCATATGCTCAG

and $ cat chrX.fasta yields:

>chrX:119973608-119973699
AAGCATCACTCCGAGAATAAGGGCCTAGACAAAGTGGTGGAGACTCAAGCCCAAGTGGATGAACTGAAAGGAATCATGGTCAGAAACATAG
>chrX:119980524-119980592
ATCTTGTAGCTCAGCGTGGAGAAAGATTGGAATTATTGATAGACAAAACAGAAAACCTCGTGGATTCA
>chrX:119989360-119989453
TCTGTCACCTTCAAAACCACCAGCAGAAACCTTGCCCGAGCCATGTGTATGAAGAATCTCAAGCTCACTATTATCATCATCATTGTATCAATT
>chrX:119993756-119993825
GTGTTCATCTATATCATTGTCTCTCCTGCCTGTGGTGGATTTACATGGCCAAGCTGTATAAAGAAATGA

Disclaimer:

This approach requires linearised sequences (it looks like you already have this, so no need to worry, but incase people in future try to use this code). Linearisation can be achieved with the following code:

# Usage: bash scriptname.sh wrapped.fasta > linear.fasta
while read line; do
    if [ "${line:0:1}" == ">" ]; then
        echo -e "\n"$line;
    else
        echo $line | tr -d '\n';
    fi;
done < $1 | tail -n+2

Or as a one-liner (replace infile.fa and outfile.fa to suit)

cat infile.fasta | while read line ; do if [ "${line:0:1}" == ">" ]; then echo -e "\n"$line ; else  echo $line | tr -d '\n' ; fi ; done | tail -n+2 > outfile.fa
ADD COMMENTlink modified 8 days ago • written 8 days ago by jrj.healey9.8k
0
gravatar for finswimmer
8 days ago by
finswimmer8.9k
Germany
finswimmer8.9k wrote:

With awk:

$ awk '$0 ~ "^>" { match($1, /^>([^:]+)/, id); filename=id[1]} {print >> filename".fa"}' input.fa

fin swimmer

ADD COMMENTlink written 8 days ago by finswimmer8.9k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1318 users visited in the last hour