Question: Change FASTA file header to include the number of times a read appears
0
3.2 years ago by
Gabe Anderson10 wrote:

Hello,

I have a data set that looks like this:

>JAMESBROWN_1_FC20423AAXX_7_1_82_883
GTTAGAGGTTCGAAG
>JAMESBROWN_1_FC20423AAXX_7_1_198_886
GGCTCAGTGGTCTAGTGGTATGATTCTCGCTT
>JAMESBROWN_1_FC20423AAXX_7_1_115_888
GGGGGTGTAGGGTGGGGTTGG
>JAMESBROWN_1_FC20423AAXX_7_1_99_894
GTTCGTATCCCACTTCTGACACCA
>JAMESBROWN_1_FC20423AAXX_7_1_226_900
GCAAACTGTGCGTCATCGTGT

And I'd like to edit it to look like this:

>cel1_count=3
TGCCTTGTCTGTCCTAAAAATC
>cel2_count=9
GTTAAGTGGGAAACGATGT
>cel3_count=7
CCGACCTTGAAATACCAC
>cel4_count=7
TAGAAATCCACTATGCTTTGG
>cel5_count=5
CGCGGGTGAGCAGCCTGGTAGCTCGTC

The count in each header line should give the number of times that sequence occurs in the data set. Kindly assist. Thanks!

2
3.2 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum120k wrote:
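# Assumes each record is exactly two lines (header, then sequence).
# paste joins each pair, cut keeps the sequence, sort | uniq -c counts duplicates,
# and awk writes the new header (>celN_count=K) followed by the sequence.
cat in.fa | paste - - | cut -f 2 | LC_ALL=C sort |\
 uniq -c | sed 's/^[ ]*//' |\
 awk '{printf(">cel%d_count=%s\n%s\n",NR,$1,$2);}'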
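The one-liner above relies on every record being exactly two lines. If the sequences are wrapped over several lines, or some reads are empty (for example after adapter trimming), something like the following might be more robust. This is only a sketch: the function name `count_reads` and the `cel` prefix are illustrative, and it assumes plain FASTA on standard input.

```shell
#!/bin/sh
# Sketch only: count_reads and the "cel" prefix are illustrative names.
# Reads FASTA on stdin, tolerates wrapped sequence lines, skips empty
# sequences, and writes one ">celN_count=K" record per distinct sequence.
count_reads() {
  awk '
    # New header: flush the previous sequence if it was not empty.
    /^>/ { if (have && seq != "") print seq; seq = ""; have = 1; next }
    # Sequence line: strip whitespace and append to the current record.
    have { gsub(/[ \t\r]/, ""); seq = seq $0 }
    END  { if (have && seq != "") print seq }
  ' |
    LC_ALL=C sort |
    uniq -c |
    awk '{ printf(">cel%d_count=%s\n%s\n", NR, $1, $2) }'
}

# Usage: count_reads < in.fa > counted.fa
```

As with the original command, the output comes out in sorted sequence order, not in the order of the input file.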

Thanks for your input. For some reason, the reads are also altered instead of the header line only. A lot of bases are replaced by A. Here is what my result looked like:

>dme1_count=329515

>dme2_count=534
A
>dme3_count=15
AA
>dme4_count=4
AAA
>dme5_count=1719
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>dme6_count=1
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC
>dme7_count=1
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAN
>dme8_count=2
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAT
>dme9_count=3
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACA
written 3.2 years ago by Gabe Anderson
2

The command is not doing anything to your sequences other than counting them. I guess that, because they are sorted, the sequences starting with "A" appear first in your output. The original order is not maintained.

written 3.2 years ago by geek_y
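If keeping the order of first appearance matters, the counting could be done in a single awk pass instead of going through sort. A minimal sketch, assuming two-line records (header, then sequence); the function name `count_reads_in_order` and the `cel` prefix are illustrative only:

```shell
#!/bin/sh
# Sketch only: count_reads_in_order is an illustrative name.
# Assumes each record is exactly two lines (header, then sequence).
count_reads_in_order() {
  awk '
    NR % 2 == 0 {                            # even lines hold the sequences
      if (!($0 in count)) order[++n] = $0    # remember first appearance
      count[$0]++
    }
    END {
      for (i = 1; i <= n; i++)
        printf(">cel%d_count=%d\n%s\n", i, count[order[i]], order[i])
    }
  '
}

# Usage: count_reads_in_order < in.fa > counted.fa
```

This keeps every distinct sequence in memory, so for very large read sets the sort-based pipeline may still be the safer choice.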

Thank you, Pierre and Goutham! It's clear now.

written 3.2 years ago by Gabe Anderson
0
3.2 years ago by
Japan
Charles Plessy2.7k wrote:

The command fastx_collapser from the FASTX-Toolkit will produce what you want, except for the sequence name, which will look like x-y, where x is the position of the sequence in the output file, and y is the number of times it occurred in the input file.

For example:

$ cat in.fa
>JAMESBROWN_1_FC20423AAXX_7_1_82_883
GTTAGAGGTTCGAAG
>JAMESBROWN_1_FC20423AAXX_7_1_198_886
GGCTCAGTGGTCTAGTGGTATGATTCTCGCTT
>JAMESBROWN_1_FC20423AAXX_7_1_82_883
GTTAGAGGTTCGAAG
>JAMESBROWN_1_FC20423AAXX_7_1_198_886
GGCTCAGTGGTCTAGTGGTATGATTCTCGCTT
>JAMESBROWN_1_FC20423AAXX_7_1_115_888
GGGGGTGTAGGGTGGGGTTGG
>JAMESBROWN_1_FC20423AAXX_7_1_99_894
GTTCGTATCCCACTTCTGACACCA
>JAMESBROWN_1_FC20423AAXX_7_1_226_900
GCAAACTGTGCGTCATCGTGT

$ fastx_collapser < in.fa
>1-2
GTTAGAGGTTCGAAG
>2-2
GGCTCAGTGGTCTAGTGGTATGATTCTCGCTT
>3-1
GGGGGTGTAGGGTGGGGTTGG
>4-1
GTTCGTATCCCACTTCTGACACCA
>5-1
GCAAACTGTGCGTCATCGTGT
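To get from the x-y names to the header format asked for in the question, the output could be piped through sed. A minimal sketch, assuming the headers look exactly like the `>x-y` lines shown above; the function name `rename_collapsed` and the `cel` prefix are illustrative:

```shell
#!/bin/sh
# Sketch only: rename_collapsed and the "cel" prefix are illustrative.
# Rewrites ">x-y" header lines to ">celx_count=y"; other lines pass through.
rename_collapsed() {
  sed -E 's/^>([0-9]+)-([0-9]+)$/>cel\1_count=\2/'
}

# Usage: fastx_collapser < in.fa | rename_collapsed > counted.fa
```

Sequence lines never start with `>`, so they are left untouched.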