Question: How To Discard Sequences Longer Than N Nucleotides ?
1
gravatar for 2011101101
5.7 years ago by
201110110190
201110110190 wrote:

Hi Everyone,I have some small RNAs data.The length of reads are between 18 to 36. Now,I want to discard sequences longer than 27 nucleotides. How can I do?
There is an example. I have a file named mss.fasta.The contents of the following is the file:

>hb3_563308_x1 
GGGATCGACGGTTCGGTCGGTGC 
>hb8_30401_x1
TCTTTGCACGCCCGTTGCGCCCGCGGCCC
>hb7_268329_x1
GAAGGAGAAATCGTGTCTGAACCA
>hb6_975200_x1
TGATGCTTGCTGTGCCTGCACAGATT
>hb1_2324034_x16
TGTTTTCGGATACTGCCA
>hb2_216320_x1
TGCTAACTAGCTATGCGGAGCCATCCCTCCGCATCT

Now I need the retain the sequence length short than 28.This result of example is that.

>hb3_563308_x1
GGGATCGACGGTTCGGTCGGTGC
>hb7_268329_x1
GAAGGAGAAATCGTGTCTGAACCA
>hb6_975200_x1
TGATGCTTGCTGTGCCTGCACAGATT
>hb1_2324034_x16
TGTTTTCGGATACTGCCA

How can I do ? Thank you very much !

data small rna • 4.5k views
ADD COMMENTlink modified 5.7 years ago by yangzhenzhen198830 • written 5.7 years ago by 201110110190

Assuming reads are fastq format?

ADD REPLYlink written 5.7 years ago by Neilfws48k

clearly not, after edit :)

ADD REPLYlink written 5.7 years ago by Neilfws48k
10
gravatar for Pierre Lindenbaum
5.7 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum113k wrote:

Using awk for a fastq file:

gunzip -c file.fastq.gz| \
awk '{y= i++ % 4 ; L[y]=$0; if(y==3 && length(L[1])<=27) {printf("%s\n%s\n%s\n%s\n",L[0],L[1],L[2],L[3]);}}'

update: from your example ( a fasta input):

cat input.fa| \
awk '{y= i++ % 2 ; L[y]=$0; if(y==1 && length(L[1])<=27) {printf("%s\n%s\n",L[0],L[1]);}}'
ADD COMMENTlink modified 5.7 years ago • written 5.7 years ago by Pierre Lindenbaum113k

Hi

Thanks for sharing this awk one liner. Could you please explain it abit in detail, I am leaning awk one liners. This command is really awesome!!

ADD REPLYlink written 5.7 years ago by Rahul Sharma560
2

'i' is the current number of lines,'y' is the index of the line for the fastq record and is just the modulo of i/4. 'y' can be 0,1,2 or 3. we put all the lines '$0' in an associative array 'L' with 'y' as the key. if y==3, we have reached the end of the record: we print the content 'L' if needed.

ADD REPLYlink written 5.7 years ago by Pierre Lindenbaum113k

excellent job!! thnks

ADD REPLYlink written 5.7 years ago by Rahul Sharma560

I enter the following command.

cat mss.fasta | awk '{y= i++ % 4 ; L[y]=$0; if(y==3 && length(L[1])<=27) {printf("%s\n%s\n%s\n%s\n",L[0],L[1],L[2],L[3]);}}'

,I can't get the result I needed.

ADD REPLYlink modified 5.7 years ago • written 5.7 years ago by 201110110190

because for your example, it's not a FASTQ input but a FASTA, change 4 to 2.

ADD REPLYlink written 5.7 years ago by Pierre Lindenbaum113k

Thank you very much ! I get that.

ADD REPLYlink written 5.7 years ago by 201110110190

I have another question .If I want to retain the length between 30 to 33 ,what can I do?

cat input.fa | awk '{y= i++ % 2 ; L[y]=$0; if(y==1 && length(L[1])>=30 && length(L[1])<=33) {printf("%s\n%s\n",L[0],L[1]);}}'
ADD REPLYlink modified 5.7 years ago • written 5.7 years ago by 201110110190
1

Best to post as new question rather than ask in the comments. In this case though, it would probably be closed for being too similar to the original question. I think you can come up with a solution yourself, based on what people have contributed here.

ADD REPLYlink written 5.7 years ago by Neilfws48k

This is great.  I had a need for this function and I wrapped the awk command here into a bash script (Linux).  Two scripts, actually.  One for fastq, one for fasta.  Fastq script handles singe read, paired ends, and keeps 1 or 2 indexes in phase with your filtered output.  To help you decide what to filter on, I also wrote two accompanying scripts to generate length histograms of your fastq or fasta file(s).  Clone my github repo (https://github.com/alk224/akutils) and add the cloned directory to your path to gain functionality.

The commands you would use are: filter_fastq_by_length.sh, filter_fasta_by_length.sh, fastq_length_histogram.sh, and fasta_length_histogram.sh.

They work great with miseq data, but I might need to change the grep strings to work with hiseq data.  Shoot me a note here or github if you run into any issues.

ADD REPLYlink written 3.1 years ago by alk22410
6
gravatar for Vince Buffalo
5.7 years ago by
Vince Buffalo460
Davis, CA
Vince Buffalo460 wrote:

Or, use bioawk, which deals with gzipped files and FASTQ natively. I recommend it more than rolling your own FASTQ parser each time.

For example:

bioawk -cfastx '{if (length($seq) <= 100) {print "@"$name"\n"$seq"\n+\n"$qual}} ' file.fq.gz

Note: this only uses FASTQ names (before the first space); it will discard everything after.

ADD COMMENTlink written 5.7 years ago by Vince Buffalo460
1

The example above is in fasta format, not fastq.

ADD REPLYlink written 5.7 years ago by Madelaine Gogol5.0k

Intentional; showing readers how to do it in FASTQ allows them to extend it to FASTA. Anyone that can't adapt this to FASTA probably shouldn't be copying and pasting commands from the internet into their shells.

ADD REPLYlink written 5.7 years ago by Vince Buffalo460

Okay, I just thought you misread it...

ADD REPLYlink written 5.7 years ago by Madelaine Gogol5.0k

I apologize if this is a really basic awk/bioawk question, but if I use this command, do I need to provide a new output name for the file that contains only the <= 100 bp reads, e.g. by adding a "> newfile.fq.gz", or, does this command overwrite the file.fq.gz with only the appropriately sized reads? Thanks!

ADD REPLYlink written 2.9 years ago by aeberardi0
3
gravatar for Daniel
5.7 years ago by
Daniel3.6k
Cardiff University
Daniel3.6k wrote:

In true "there's more than one way to do it" form, I present perl:

#!/usr/bin/perl

$head;
while (<>){
        if($_ =~ />/){
                $head = $_;
        }elsif(length($_) < 27){
                print "$head" . "$_";
        }
}
ADD COMMENTlink written 5.7 years ago by Daniel3.6k

I dont really dig oneliners so not sure how to convert it but I'm sure it's pretty easy.

ADD REPLYlink written 5.7 years ago by Daniel3.6k
2
gravatar for AGS
5.7 years ago by
AGS230
Brooklyn, ny
AGS230 wrote:

faFilter from the Genome Browser utilities

faFilter -maxSize=27
ADD COMMENTlink written 5.7 years ago by AGS230
1
gravatar for 14134125465346445
5.7 years ago by
United Kingdom
141341254653464453.4k wrote:

Do you want to discard them altogether or trim them to 27bp?

git clone https://github.com/lh3/seqtk.git    
cd seqtk/    
make    
./seqtk trimfq -l 27 file.fq > file.trim.fq
ADD COMMENTlink written 5.7 years ago by 141341254653464453.4k

Thank you .I need to discard them altogether nor trim them to 27bp.

ADD REPLYlink written 5.7 years ago by 201110110190
1
gravatar for Martin A Hansen
5.7 years ago by
Martin A Hansen3.0k
Denmark
Martin A Hansen3.0k wrote:

Using Biopieces www.biopieces.org):

read_fasta -i in.fna | grab -e 'SEQ_LEN > 30' | write_fasta -o out.fna -x

ADD COMMENTlink written 5.7 years ago by Martin A Hansen3.0k
0
gravatar for yangzhenzhen1988
5.7 years ago by
yangzhenzhen198830 wrote:
>hb3_563308_x1 
GGGATCGACGGTTCGGTCGGTGC 
>hb8_30401_x1
TCTTTGCACGCCCGTTGCGCCCGCGGCCC
>hb7_268329_x1
GAAGGAGAAATCGTGTCTGAACCA
>hb6_975200_x1
TGATGCTTGCTGTGCCTGCACAGATT
>hb1_2324034_x16
TGTTTTCGGATACTGCCA
>hb2_216320_x1
TGCTAACTAGCTATGCGGAGCCATCCCTCCGCATCT

#put the above read fasta text into a file named text.txt
#use the following perl script, name it test.pl

#!/usr/bin/perl -w
use strict;
my $id ="";
my %seq;
open IN,"<test.txt";
while(<IN>) {
  chomp;
  if(/^>/) {
    $id = $_;
}
else {
    $seq{$id} = $_;
    }
}
close IN;

foreach my $key (keys %seq) {
    if(length($seq{$key}) <=27) {
        print "$key\n$seq{$key}\n";
    }
}

#run this script by typing in terminal: perl test.pl
ADD COMMENTlink modified 5.7 years ago by Neilfws48k • written 5.7 years ago by yangzhenzhen198830
2

This hurts my eyes ;oP: a Perl three liner.

ADD REPLYlink written 5.7 years ago by Martin A Hansen3.0k

It hurt my eyes because the formatting was really bad. Please, make an effort to format correctly (e.g. indent lines of code with 4 spaces). Answers are auto-previewed as you type, so you can see how it will look.

ADD REPLYlink written 5.7 years ago by Neilfws48k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1071 users visited in the last hour