Question: Filtering Whole Genome Sequence Data
Fid_o wrote (10 months ago):

I have whole genome sequence data (FASTA files) for Salmonella bacteria. The files average 4.7 MB, but some are much larger (around 7 MB) and others much smaller (around 500 KB). The oversized files likely contain unnecessary sequence data, and the undersized ones likely cover only part of the organism's genome; either would skew my data.

I would like to keep only the files in the 3.0-6.0 MB range. I am running on a Linux server. Is there a way to do this?

Regards.

Addendum:

I would like to do core and accessory gene analysis on these bacterial genome sequences. I will be running Roary after annotation. Alternatively, given what I want to do, what quality checks can I run?

Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote (10 months ago):

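# Byte units (c) are exact; with the M suffix, find rounds sizes up to whole MiB, so -size -6M would skip files between 5 and 6 MiB.
mkdir -p keep && find dir1/dir2 -type f -name "*.fasta" -size +2999999c -size -6000001c -exec mv -v -t keep '{}' ';'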
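
A dry run may be worth doing before anything is moved. The sketch below only lists the files that fall inside the size range and moves nothing. The function name list_in_range and its default cutoffs (3,000,000 and 6,000,000 bytes, taken from the question) are assumptions for illustration. Sizes are given in bytes because find rounds file sizes up to whole units when a suffix such as M is used.

```shell
# List *.fasta files whose size in bytes lies within [min_bytes, max_bytes].
# $1: directory to search; $2/$3: inclusive bounds (assumed defaults).
list_in_range() {
  dir=$1
  min_bytes=${2:-3000000}
  max_bytes=${3:-6000000}
  # -size +Nc means "more than N bytes", -size -Nc means "fewer than N bytes",
  # so the bounds are shifted by one to make the range inclusive.
  find "$dir" -type f -name '*.fasta' \
    -size +"$((min_bytes - 1))"c -size -"$((max_bytes + 1))"c -print
}
```

Calling list_in_range dir1/dir2 prints one path per line; once the list looks right, the same size tests can be used with -exec mv.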

Ah, OK. Not surprising that there is a built-in function for size filtering. #TIL +1

Reply written 10 months ago by ATpoint

ATpoint wrote (10 months ago):

In my experience, file size is a poor indicator of anything. Can you say exactly what you want to do and why? Maybe there is a better alternative to reach your goal. Please give some details.

If you type stat --printf="%s" your.file you get the size of a file in bytes, which you can then compare against your cutoffs. Say all your files end in .fa: the loop below moves files within the cutoffs to an accept folder and files outside the range to a reject folder:

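# Create both target folders; -p avoids an error if they already exist
mkdir -p accept reject
# Loop over the glob directly instead of parsing ls output
for p in *.fa
  do
    # Skip the literal pattern when no .fa files are present
    [ -e "$p" ] || continue
    # File size in bytes, read once per file
    size=$(stat --printf="%s" "$p")
    if (( size >= 3000000 && size <= 6000000 ))
      then mv "$p" accept/
      else mv "$p" reject/
    fi
  done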

Greetings @ATpoint!

Thanks for that reply.

I want to do core and accessory gene analysis using Roary, followed by phylogenetics. I want to remove all sequences that are too small or too large because, from what I was told, a file that is too small would mean only part of the genome was sequenced, and one that is too large would probably mean other things, such as plasmids, were sequenced as well.

Reply written 10 months ago by Fid_o

In that case I suggest counting base pairs instead of file size. Something like this: "Sequence length from Fasta". I am not at all a genome assembly expert, so I can only suggest a solution from the technical side; I have no idea whether this makes sense in your context.
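
Counting bases directly could look like the sketch below. It assumes plain (uncompressed) FASTA files ending in .fasta, treats every non-header character as a base, and reuses the 3.0-6.0 Mb cutoffs from the question as defaults. The function names count_bases and sort_by_length are made up for illustration.

```shell
# Print the total number of bases in a FASTA file.
# Header lines (starting with ">") are skipped; spaces, tabs and
# carriage returns inside sequence lines are ignored.
count_bases() {
  awk '!/^>/ { gsub(/[ \t\r]/, ""); n += length($0) } END { print n + 0 }' "$1"
}

# Move every *.fasta file in a directory into keep/ or reject/ below it.
# $1: directory; $2/$3: inclusive minimum/maximum base count (assumed defaults).
sort_by_length() {
  dir=$1
  min_bp=${2:-3000000}
  max_bp=${3:-6000000}
  mkdir -p "$dir/keep" "$dir/reject"
  for f in "$dir"/*.fasta; do
    [ -e "$f" ] || continue   # the glob matched nothing
    n=$(count_bases "$f")
    if [ "$n" -ge "$min_bp" ] && [ "$n" -le "$max_bp" ]; then
      mv "$f" "$dir/keep/"
    else
      mv "$f" "$dir/reject/"
    fi
  done
}
```

Usage would be sort_by_length dir1/dir2, or sort_by_length dir1/dir2 4000000 5500000 for tighter cutoffs. The count covers every record in a file, so plasmid contigs still add to the total.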

Reply written 10 months ago by ATpoint