Question

What is the fastest way to add 'Ns' to variable length sequences in a .fasta such that they have the same length

0

Entering edit mode

7.2 years ago

Longshotx ▴ 70

Say I have a fasta containing 3 sequences...

>J00138:90:HG7YJBBXX:5:1101:19593:1719_2
CTTTACGCCCAGTAAT
>J00138:90:HG7YJBBXX:5:1101:20334:2299_1
TAAGGCGAAGGCTTCTTTC
>J00138:90:HG7YJBBXX:5:1101:20334:2299_2
GCTTTAATCTACATTTATAGACTGATTAATAACC
>J00138:90:HG7YJBBXX:5:1101:20730:2316_1
TAACTAGTTGCAAGGGTTTT

I want my sequence data to look like this:

 >J00138:90:HG7YJBBXX:5:1101:19593:1719_2
CTTTACGCCCAGTAATNNNNNNNNNNNNNN
>J00138:90:HG7YJBBXX:5:1101:20334:2299_1
TAAGGCGAAGGCTTCTTTCNNNNNNNNNNN
>J00138:90:HG7YJBBXX:5:1101:20334:2299_2
GCTTTAATCTACATTTATAGACTGATTAATAACC
>J00138:90:HG7YJBBXX:5:1101:20730:2316_1
TAACTAGTTGCAAGGGTTTTNNNNNNNNNNN

Are there any programs or scripts that could accomplish this in a reasonable timeframe? I have thousands of sequences. Thanks!

Someone mentioned something like n = 150 ; seq = seq + ("N" * n) ; seq = seq[:n] but I have no idea how I would put this into a python script. Thanks for all your help!

sequencing fasta python biopython • 3.3k views

ADD COMMENT • link updated 7.2 years ago by Petr Ponomarenko ★ 2.8k • written 7.2 years ago by Longshotx ▴ 70

1

Entering edit mode

That doesn't look like a fasta file to me. Can you correctly post your input data?
Should be rather easy to write a few lines of python to get this done.

ADD REPLY • link 7.2 years ago by WouterDeCoster 47k

0

Entering edit mode

>J00138:90:HG7YJBBXX:5:1101:19593:1719_2
CTTTACGCCCAGTAATTCCGATTAACGCTTGCACCCTCTGTATTACCGCGGCTGCTGGCACAGAGTTAGCCGGTGCTTATTCTGTCGGTAACGTCAAAACACTAACGTATTAGGTTAATGCCCTTCCTCCCA
>J00138:90:HG7YJBBXX:5:1101:20334:2299_1
TAAGGCGAAGGCTTCTTTCTAACTAAACACTGACGCTGAGGTACGAAAGTATGGGGAGCAAAACGGATTAGAGACCCGTGTAGTCCATACCGTAAACGATGAGTGCCTTAGATTGGTTATTAATCAGTCTAT
>J00138:90:HG7YJBBXX:5:1101:20334:2299_2
GCTTTAATCTACATTTATAGACTGATTAATAACCAATCTAAGGCACTCATCGTTTACGGTATGGACTACACGGGTCTCTAATCCGTTTTGCTCCCCATACTTTCGTACCTCAGCGTCAGTGTT
>J00138:90:HG7YJBBXX:5:1101:20730:2316_1
TAACTAGTTGCAAGGGTTTTGTTCGTTATGCGGACTTAACCTTACATCTCACGACACGAACTGAAGACAGCCATGCAACGCCTGTGTAGCATTCAAGGAGTGGTAAGATGTTGTGCGGA

Thanks! I saw a similar thread about padding fastq files but I couldn't figure out how to translate the script for fasta files. Essentially all I want to do is homogenize all the reads within a fasta file so they are all the same length by padding anything less than 150 with Ns.

ADD REPLY • link 7.2 years ago by Longshotx ▴ 70

0

Entering edit mode

This doesn't sound like a good idea to me. Why do you want to pad the reads with Ns?

ADD REPLY • link 7.2 years ago by Brian Bushnell 20k

0

Entering edit mode

I am trying to use a script called REAGO to find 16S fragments from metagenomic data and it requires the length of all sequences to be the same for paired end reads. Most of my sequencing distribution is skewed between 100 to 150 bp. But very few read pairs after adapter trimming are 150 bp...usually the read 2s are slightly shorter. So I want to pad the sequences so I don't end up having to throw data away.

ADD REPLY • link 7.2 years ago by Longshotx ▴ 70

1

Entering edit mode

You might be using the wrong for the job. Good tools don't have strange and arbitrary requirements to mutate your data before they work.

Also, it makes absolutely no sense for read 2 to be shorter than read 1 after adapter-trimming. They should be the same length, since the adapter sequence will start at the same position in each read. Also, the majority of reads should have zero adapter sequence, for 150 bp reads - if that's not the case, it sounds like library prep screwed up, unfortunately.

It is likely that you are doing adapter-trimming incorrectly. I recommend this approach; some other methods will unnecessarily trim actual genomic sequence from the end of reads. There is no justification for adapter-trimming paired reads to different lengths - if you are using software that does that, it is flawed, or you are using it incorrectly. When the reads are properly trimmed, R1 and R2 will be the same length. Perhaps you are conflating adapter-trimming with quality-trimming? R2 almost always has lower quality than R1. Unlike adapter-trimming, quality-trimming is not strictly necessary, and I don't recommend it in any situation that is sensitive to read lengths.

ADD REPLY • link 7.2 years ago by Brian Bushnell 20k

0

Entering edit mode

Just like Brian I also wonder why you would want to do this. For the sake of the script, do you a priori know that everything should have a length of 150nt? Or should the script first find the longest read, then add nucleotides to all those which are shorter?

ADD REPLY • link 7.2 years ago by WouterDeCoster 47k

0

Entering edit mode

These are HiSeq paired end reads. They won't be any bigger than 150bp. Everything is <=150bp.

ADD REPLY • link 7.2 years ago by Longshotx ▴ 70

0

Entering edit mode

Since you said you are learning python, what have you tried? For working with sequences, have a look at SeqIO from Biopython, for example in the very good tutorial and cookbook.

You want to loop over te fasta records, and add to every seq object N * (150 - len(seq))

ADD REPLY • link 7.2 years ago by WouterDeCoster 47k

0

Entering edit mode

Cross-posted:

http://seqanswers.com/forums/showthread.php?p=204602

Please link other sites when you cross-post, so people don't waste their time answering a question that has already been answered.

ADD REPLY • link 7.2 years ago by Brian Bushnell 20k

0

Entering edit mode

I apologize Brian. I'll make sure I do that next time!

ADD REPLY • link 7.2 years ago by Longshotx ▴ 70

score 2 · Answer 1 · 2017-02-24

2

Entering edit mode

7.2 years ago

Petr Ponomarenko ★ 2.8k

Here is a quick awk one-liner. It is not efficient or super quick to process, but the point is that you can learn awk in a month or faster and write this one-liner in a few minutes. It awk can do very complicated things with multiple text files at once. Together with cat, grep, sort and uniq this is a very cool tool for bioinformatics specialist.

awk '$1~">"{print $0}$1!~">"{tmp="";for(i=1;i<150-length($0)+1;i++){tmp=tmp"N"};print $0""tmp}' your_file.fasta > output.fasta

this is is when you have a sequence in one line for each read and it is going to add Ns only up to the 150nt total length exactly to the end. You may find awk very efficient with such little day to day tasks as it was designed for strings' processing.

ADD COMMENT • link 7.2 years ago by Petr Ponomarenko ★ 2.8k

1

Entering edit mode

and from my experience of teaching and giving workshops on different topics in bioinformatics students with no programming experience are ok with learning basic Unix programs and commands like grep and sort, while learning to write a simple program in R, ruby, python or java is much much harder, especially for biologists. AWK is in a unique position of between right between this two extremes. That way awk is powerful for lots of tasks, helps learn logic, loops and get familiar with programming in general and at the same time it is not frustrating and most people can begin using it in an hour or two of lecture+practise. I wholeheartedly recommend you to at least look at my awk statement and try to understand and imagine what it is doing. Play with it.

ADD REPLY • link 7.2 years ago by Petr Ponomarenko ★ 2.8k

0

Entering edit mode

I appreciate your advice. I'm actually in the process of learning python, but this would be a nice tool to learn relatively quickly.

How long do you think this would take on a fasta file containing say 100,000 sequences?

ADD REPLY • link 7.2 years ago by Longshotx ▴ 70

2

Entering edit mode

$ awk '$1~">"{name=$0}$1!~">"{for(i=1;i<=25000;i++){print name; print $0}}' your_file.fasta > your_100k.fasta
$ wc -l your_100k.fasta 

200000 your_100k.fasta

$ time awk '$1~">"{print $0}$1!~">"{tmp="";for(i=1;i<150-length($0)+1;i++){tmp=tmp"N"};print $0""tmp}' your_100k.fasta > output_100k.fasta 

real    0m10.050s
user    0m9.414s
sys 0m0.243s

This is on an old MackBook Air

ADD REPLY • link 7.2 years ago by Petr Ponomarenko ★ 2.8k