How to get length of selected contigs within a fasta file generated by transcriptome assembly?
5
3
Entering edit mode
8.8 years ago
seta ★ 1.9k

Dear all,

I would like to have length of some selected contigs within a fasta file resulted from de novo transcriptome assembly. I have names of selected contigs in the txt file like here, could you please help me out to get it?

>contig1
>contig2
>contig3

Thanks

Assembly alignment RNA-Seq sequencing • 14k views
ADD COMMENT
0
Entering edit mode

One of the easiest ways for this is to use samtools faidx contigs.fasta. The .fai file has the necessary information

If you need to extract prior then, remove the ">" and -

xargs samtools faidx ref.fasta < list >contigs.fasta
ADD REPLY
5
Entering edit mode
8.8 years ago
Sej Modha 5.3k

Calculate length of each contig using following script and save the output of this script to an output file.

#!/usr/bin/perl -w

use strict;
#contig_length_fasta.pl

my $largest = 0;
my $contig = '';

if (@ARGV != 1) {
    print "contig_length_fastq fasta\n\n";
    exit;
}

my $filenameA = $ARGV[0];

open (IN, "$filenameA") or die "oops!\n";

my $read_name = '';
my $read_seq = '';

while (<IN>) {
    if (/^>(\S+)/) {
        $read_name = $1;
        $read_seq = "";

        while (<IN>) {
            if (/^>(\S+)/) {
                print "$read_name\t" . length($read_seq) . "\n";
                $read_name = $1;
                $read_seq = "";
            }
            else {
                chomp;
                $read_seq .= $_;
            }
        }
    }
}

close(IN);

print "$read_name\t" . length($read_seq) . "\n";
contig_length_fasta.pl input_file.fa > contig_length_out

Use simple grep to get length of the contigs of interest.

grep -f my_contig_list contig_length_out
ADD COMMENT
3
Entering edit mode
8.8 years ago

Use pyfaidx:

$ (sudo) pip install pyfaidx
$ xargs faidx --transform chromsizes contigs.fasta < contigs.txt
ADD COMMENT
2
Entering edit mode
8.8 years ago
Lesley Sitter ▴ 600

Hi,

I just adapted a short python script I have for these types of things. Is this something you want? It produces a file with the contigs, a tab and then the length of the sequences.

It takes two user arguments, the first is the fasta file, the second the name of a output file.

So just start it like this on a linux command line (just store the script as script.py or something)

$ python script.py fasta.fa seq_length.txt
#!usr/bin/env python

import sys 

file_open = open(sys.argv[1],'r').readlines() 
file_out = open(sys.argv[2],'w') 

switch = 0 
for lines in file_open:
    if '>' in lines:
        if switch == 0:
            file_out.write(lines.strip() + "\t")
            seq_length = 0
            switch = 1
        elif switch == 1:
            file_out.write(str(seq_length) + "\n" + lines.strip() + "\t")
            seq_length = 0
    else:
        seq_length+=len(lines.strip())
file_out.write(str(seq_length))
ADD COMMENT
0
Entering edit mode

Many thanks for all your comments. I try them

ADD REPLY
0
Entering edit mode
8.8 years ago

Have you looked at this?

ADD COMMENT
0
Entering edit mode
8.8 years ago
5heikki 11k

If there are no linebreaks in sequences then, e.g.

awk '/^>contig1/{getline; print length($0)}' file.fasta
ADD COMMENT

Login before adding your answer.

Traffic: 1471 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6