Question: Calculate length of sequences from fasta file
1
gravatar for KVC_bioinfo
2.5 years ago by
KVC_bioinfo440
Boston
KVC_bioinfo440 wrote:

Hello all,

I want to calculate the length of sequences from a huge FASTA file. I have read names of which I need to calculate the length from FASTA file. I know how to do it all reads using the following command.

awk '/^>/ {if (seqlen) print seqlen;print;seqlen=0;next} {seqlen+=length($0)}END{print seqlen}' file.fasta

However, I am not sure, how to do it for a specific number of reads. Could someone help me here?

fasta • 5.8k views
ADD COMMENTlink modified 22 months ago by Biostar ♦♦ 20 • written 2.5 years ago by KVC_bioinfo440

it's not clear what is a "read" in this context.

ADD REPLYlink written 2.5 years ago by Pierre Lindenbaum128k

Read: meaning sequence name

>read1

atgctatcat

>read2

atgctactagtcga

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by KVC_bioinfo440

However, I am not sure, how to do it for a specific number of reads

just use head ?

ADD REPLYlink written 2.5 years ago by Pierre Lindenbaum128k

The simple way is like ./seqkit fx2tab -l file.fasta > file_seq_with_length.txt

And further you can extract particular column using awk e.g. awk '{print $1, $3}' file_seq_with_length.txt > extrcted_length.txt Usually, your first column will contain Ids and the last column will contain sequence length.

Hoping it may help someone.... :)

ADD REPLYlink written 13 months ago by archana.bioinfo87180

OP wants % GC content in addition to length @ archana.bioinfo87. Probably following command with seqkit would work:

$ seqkit fx2tab -lg test.fa | grep -f ids.txt

test.fa being input fasta and ids.txt being text file with ids without ">".

ADD REPLYlink modified 13 months ago • written 13 months ago by cpad011213k
2
gravatar for Joe
2.5 years ago by
Joe16k
United Kingdom
Joe16k wrote:

I would just subset the sequences you need by name using any one of dozens of tools that have been posted on here before, in to a different file, and then just calculate all the lengths in the new file.

ADD COMMENTlink written 2.5 years ago by Joe16k
1

Thanks. It made my job easy

ADD REPLYlink written 2.5 years ago by KVC_bioinfo440

Be sure to select something as an answer if it helped solve your problem.

ADD REPLYlink written 2.5 years ago by Joe16k

sure. Thank you. will do that

ADD REPLYlink written 2.5 years ago by KVC_bioinfo440
3
gravatar for Istvan Albert
2.5 years ago by
Istvan Albert ♦♦ 84k
University Park, USA
Istvan Albert ♦♦ 84k wrote:

Use bioawk, it treats a FASTA/FASTQ record a single awk record, that way you can count the sequences while computing their cumulative length

https://github.com/lh3/bioawk

then counting the cumulative size of the first ten sequences would be

cat reads.fq | bioawk -c fastx 'NR < 11 { size += length($seq) } END { print size }'
ADD COMMENTlink modified 2.5 years ago • written 2.5 years ago by Istvan Albert ♦♦ 84k
3
gravatar for cpad0112
2.5 years ago by
cpad011213k
India
cpad011213k wrote:

output:

$ seqtk subseq test.fa ids.txt | infoseq stdin -only -name -length -pgc
Display basic information about sequences
Name           Length %GC    
read2          14     42.86  
read3          6      50.00

Input : - huge fasta file: test.fa

$ cat test.fa 
>read1
atgctatcat
>read2
atgctactagtcga
>read3
atgctg
>read4
atgc

Sequences of interest (ids.txt)

$ cat ids.txt 
read3
read2

BTw, what do you mean by "specific number of reads" ? EMBOSS tools doesn't recognize n esp when calculating gc content. However it does fairly good job in counting characters.

ADD COMMENTlink modified 2.5 years ago • written 2.5 years ago by cpad011213k
2
gravatar for Kevin Blighe
23 months ago by
Kevin Blighe59k
Kevin Blighe59k wrote:

Another solution (just thought to post it since I did it now):

awk -F "" '/^>/{ if(NR!=1) {print len}; printf $0"\t"; len=0}; \
  !/^>/{len+=NF}' \
  gencode.v28.transcripts.fa |\
  head -35

>ENST00000456328.2|ENSG00000223972.5|...|RP11-34P13.1-002|DDX11L1|1657|processed_transcript|    1657
>ENST00000450305.2|ENSG00000223972.5|...|RP11-34P13.1-001|DDX11L1|632|transcribed_unprocessed_pseudogene|   632
>ENST00000488147.1|ENSG00000227232.5|...RP11-34P13.2-001|WASH7P|1351|unprocessed_pseudogene|    1351
>ENST00000619216.1|ENSG00000278267.1|...|MIR6859-1-201|MIR6859-1|68|miRNA|  68
>ENST00000473358.1|ENSG00000243485.5|...|RP11-34P13.3-001|RP11-34P13.3|712|lincRNA| 712
>ENST00000469289.1|ENSG00000243485.5|...|RP11-34P13.3-002|RP11-34P13.3|535|lincRNA| 535
>ENST00000607096.1|ENSG00000284332.1|...|MIR1302-2-201|MIR1302-2|138|miRNA| 138
>ENST00000417324.1|ENSG00000237613.2|...|RP11-34P13.4-001|FAM138A|1187|lincRNA| 1187
>ENST00000461467.1|ENSG00000237613.2|...|RP11-34P13.4-002|FAM138A|590|lincRNA|  590
>ENST00000606857.1|ENSG00000268020.3|...|RP11-34P13.17-001|OR4G4P|840|unprocessed_pseudogene|   840
>ENST00000642116.1|ENSG00000240361.2|...|RP11-34P13.6-002|OR4G11P|1414|processed_transcript|    1414
>ENST00000492842.2|ENSG00000240361.2|...|RP11-34P13.6-001|OR4G11P|939|transcribed_unprocessed_pseudogene|   939
>ENST00000641515.2|ENSG00000186092.6|...|RP11-34P13.5-001|OR4F5|2618|protein_coding|    2618
>ENST00000335137.4|ENSG00000186092.6|...|OR4F5-201|OR4F5|1054|protein_coding|   1054
>ENST00000466430.5|ENSG00000238009.6|...|RP11-34P13.7-001|RP11-34P13.7|2748|lincRNA|    2748
>ENST00000477740.5|ENSG00000238009.6|...|RP11-34P13.7-003|RP11-34P13.7|491|lincRNA| 491
>ENST00000471248.1|ENSG00000238009.6|...|RP11-34P13.7-002|RP11-34P13.7|629|lincRNA| 629
>ENST00000610542.1|ENSG00000238009.6|...|AL627309.1-205|RP11-34P13.7|723|lincRNA|   723
>ENST00000453576.2|ENSG00000238009.6|...|RP11-34P13.7-004|RP11-34P13.7|336|lincRNA| 336
>ENST00000495576.1|ENSG00000239945.1|...|RP11-34P13.8-001|RP11-34P13.8|1319|lincRNA|    1319
>ENST00000442987.3|ENSG00000233750.3|...|RP11-34P13.10-001|RP11-34P13.10|3812|processed_pseudogene| 3812
>ENST00000494149.2|ENSG00000268903.1|...|RP11-34P13.15-001|RP11-34P13.15|755|processed_pseudogene|  755
>ENST00000595919.1|ENSG00000269981.1|...|RP11-34P13.16-001|RP11-34P13.16|284|processed_pseudogene|  284
ADD COMMENTlink modified 13 months ago • written 23 months ago by Kevin Blighe59k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1691 users visited in the last hour