Question: Calculate length of sequences from fasta file
1
gravatar for KVC_bioinfo
18 months ago by
KVC_bioinfo380
Boston
KVC_bioinfo380 wrote:

Hello all,

I want to calculate the length of sequences from a huge FASTA file. I have read names of which I need to calculate the length from FASTA file. I know how to do it all reads using the following command.

awk '/^>/ {if (seqlen) print seqlen;print;seqlen=0;next} {seqlen+=length($0)}END{print seqlen}' file.fasta

However, I am not sure, how to do it for a specific number of reads. Could someone help me here?

fasta • 2.7k views
ADD COMMENTlink modified 10 months ago by Biostar ♦♦ 20 • written 18 months ago by KVC_bioinfo380

it's not clear what is a "read" in this context.

ADD REPLYlink written 18 months ago by Pierre Lindenbaum120k

Read: meaning sequence name

>read1

atgctatcat

>read2

atgctactagtcga

ADD REPLYlink modified 18 months ago • written 18 months ago by KVC_bioinfo380

However, I am not sure, how to do it for a specific number of reads

just use head ?

ADD REPLYlink written 18 months ago by Pierre Lindenbaum120k

The simple way is like ./seqkit fx2tab -l file.fasta > file_seq_with_length.txt

And further you can extract particular column using awk e.g. awk '{print $1, $3}' file_seq_with_length.txt > extrcted_length.txt Usually, your first column will contain Ids and the last column will contain sequence length.

Hoping it may help someone.... :)

ADD REPLYlink written 5 weeks ago by archana.bioinfo87120

OP wants % GC content in addition to length @ archana.bioinfo87. Probably following command with seqkit would work:

$ seqkit fx2tab -lg test.fa | grep -f ids.txt

test.fa being input fasta and ids.txt being text file with ids without ">".

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by cpad011211k
2
gravatar for jrj.healey
18 months ago by
jrj.healey12k
United Kingdom
jrj.healey12k wrote:

I would just subset the sequences you need by name using any one of dozens of tools that have been posted on here before, in to a different file, and then just calculate all the lengths in the new file.

ADD COMMENTlink written 18 months ago by jrj.healey12k
1

Thanks. It made my job easy

ADD REPLYlink written 18 months ago by KVC_bioinfo380

Be sure to select something as an answer if it helped solve your problem.

ADD REPLYlink written 18 months ago by jrj.healey12k

sure. Thank you. will do that

ADD REPLYlink written 18 months ago by KVC_bioinfo380
3
gravatar for Istvan Albert
18 months ago by
Istvan Albert ♦♦ 80k
University Park, USA
Istvan Albert ♦♦ 80k wrote:

Use bioawk, it treats a FASTA/FASTQ record a single awk record, that way you can count the sequences while computing their cumulative length

https://github.com/lh3/bioawk

then counting the cumulative size of the first ten sequences would be

cat reads.fq | bioawk -c fastx 'NR < 11 { size += length($seq) } END { print size }'
ADD COMMENTlink modified 18 months ago • written 18 months ago by Istvan Albert ♦♦ 80k
3
gravatar for cpad0112
18 months ago by
cpad011211k
India
cpad011211k wrote:

output:

$ seqtk subseq test.fa ids.txt | infoseq stdin -only -name -length -pgc
Display basic information about sequences
Name           Length %GC    
read2          14     42.86  
read3          6      50.00

Input : - huge fasta file: test.fa

$ cat test.fa 
>read1
atgctatcat
>read2
atgctactagtcga
>read3
atgctg
>read4
atgc

Sequences of interest (ids.txt)

$ cat ids.txt 
read3
read2

BTw, what do you mean by "specific number of reads" ? EMBOSS tools doesn't recognize n esp when calculating gc content. However it does fairly good job in counting characters.

ADD COMMENTlink modified 18 months ago • written 18 months ago by cpad011211k
2
gravatar for Kevin Blighe
11 months ago by
Kevin Blighe42k
Kevin Blighe42k wrote:

Another solution (just thought to post it since I did it now):

awk -F "" '/^>/{ if(NR!=1) {print len}; printf $0"\t"; len=0}; \
  !/^>/{len+=NF}' \
  gencode.v28.transcripts.fa |\
  head -35

>ENST00000456328.2|ENSG00000223972.5|...|RP11-34P13.1-002|DDX11L1|1657|processed_transcript|    1657
>ENST00000450305.2|ENSG00000223972.5|...|RP11-34P13.1-001|DDX11L1|632|transcribed_unprocessed_pseudogene|   632
>ENST00000488147.1|ENSG00000227232.5|...RP11-34P13.2-001|WASH7P|1351|unprocessed_pseudogene|    1351
>ENST00000619216.1|ENSG00000278267.1|...|MIR6859-1-201|MIR6859-1|68|miRNA|  68
>ENST00000473358.1|ENSG00000243485.5|...|RP11-34P13.3-001|RP11-34P13.3|712|lincRNA| 712
>ENST00000469289.1|ENSG00000243485.5|...|RP11-34P13.3-002|RP11-34P13.3|535|lincRNA| 535
>ENST00000607096.1|ENSG00000284332.1|...|MIR1302-2-201|MIR1302-2|138|miRNA| 138
>ENST00000417324.1|ENSG00000237613.2|...|RP11-34P13.4-001|FAM138A|1187|lincRNA| 1187
>ENST00000461467.1|ENSG00000237613.2|...|RP11-34P13.4-002|FAM138A|590|lincRNA|  590
>ENST00000606857.1|ENSG00000268020.3|...|RP11-34P13.17-001|OR4G4P|840|unprocessed_pseudogene|   840
>ENST00000642116.1|ENSG00000240361.2|...|RP11-34P13.6-002|OR4G11P|1414|processed_transcript|    1414
>ENST00000492842.2|ENSG00000240361.2|...|RP11-34P13.6-001|OR4G11P|939|transcribed_unprocessed_pseudogene|   939
>ENST00000641515.2|ENSG00000186092.6|...|RP11-34P13.5-001|OR4F5|2618|protein_coding|    2618
>ENST00000335137.4|ENSG00000186092.6|...|OR4F5-201|OR4F5|1054|protein_coding|   1054
>ENST00000466430.5|ENSG00000238009.6|...|RP11-34P13.7-001|RP11-34P13.7|2748|lincRNA|    2748
>ENST00000477740.5|ENSG00000238009.6|...|RP11-34P13.7-003|RP11-34P13.7|491|lincRNA| 491
>ENST00000471248.1|ENSG00000238009.6|...|RP11-34P13.7-002|RP11-34P13.7|629|lincRNA| 629
>ENST00000610542.1|ENSG00000238009.6|...|AL627309.1-205|RP11-34P13.7|723|lincRNA|   723
>ENST00000453576.2|ENSG00000238009.6|...|RP11-34P13.7-004|RP11-34P13.7|336|lincRNA| 336
>ENST00000495576.1|ENSG00000239945.1|...|RP11-34P13.8-001|RP11-34P13.8|1319|lincRNA|    1319
>ENST00000442987.3|ENSG00000233750.3|...|RP11-34P13.10-001|RP11-34P13.10|3812|processed_pseudogene| 3812
>ENST00000494149.2|ENSG00000268903.1|...|RP11-34P13.15-001|RP11-34P13.15|755|processed_pseudogene|  755
>ENST00000595919.1|ENSG00000269981.1|...|RP11-34P13.16-001|RP11-34P13.16|284|processed_pseudogene|  284
ADD COMMENTlink modified 5 weeks ago • written 11 months ago by Kevin Blighe42k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 613 users visited in the last hour