Adding spaces to Phylip distance output files

Sometimes Phylip's distance programs (e.g. protdist version 3.66) produce output in which two adjacent distances run together:

  3.509929  3.766076296.642222 33.870491  6.012086  6.570648  6.716925
  4.990623  3.861747  3.861747  3.964430  3.964430822.377955  3.868161
  3.637750267.453401 30.466508  4.428072  4.854979 34.665454  6.859330
  5.273613  6.466854  3.548963  3.586986  6.230058126.479800 31.998087

when what was meant was

  3.509929  3.766076 296.642222 33.870491  6.012086  6.570648  6.716925
  4.990623  3.861747  3.861747  3.964430  3.964430 822.377955  3.868161
  3.637750 267.453401 30.466508  4.428072  4.854979 34.665454  6.859330
  5.273613  6.466854  3.548963  3.586986  6.230058 126.479800 31.998087

This occurs when the second of the two distances has three or more digits before its decimal point. The problem is scheduled to be fixed in the next Phylip release.

I've written a tiny Unix/Linux (and Windows via Cygwin) script that uses the sed tool to fix this problem. Phylip writes its distances so that there are always six digits to the right of the decimal point. The script simply looks for a decimal point and six digits immediately followed by three more digits, and inserts a space between the decimal-six-digit group and the three-digit group. Because the space goes directly after the six decimal digits, this also works if the distance is >= 1000 and thus has four or more digits before its decimal point.

The corrected lines above were produced by this script. The sed syntax is

sed -e 's/\(\.[0-9]\{6\}\)\([0-9]\{3\}\)/\1 \2/g' < input > output
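To make the pieces of that expression easier to follow, here is a minimal sketch that wraps the same sed command in a shell function and annotates each part. The function name fix_phylip_spacing is made up for illustration; it is not the physed.sh script itself.

```shell
#!/bin/sh
# Hypothetical wrapper around the sed command above (not the actual
# physed.sh): reads distances on stdin, writes repaired text to stdout.
fix_phylip_spacing() {
    # \(\.[0-9]\{6\}\)  group 1: a decimal point plus exactly six digits
    # \([0-9]\{3\}\)    group 2: three more digits with no space in between
    # \1 \2             write both groups back, separated by one space
    # g                 repeat for every fused pair on the line
    sed -e 's/\(\.[0-9]\{6\}\)\([0-9]\{3\}\)/\1 \2/g'
}

# Part of one fused line from the example above
printf '%s\n' '  3.509929  3.766076296.642222 33.870491' | fix_phylip_spacing
```

Properly separated distances are left alone, since a decimal point and six digits followed by a space never match the pattern.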

physed.sh
A Bourne-shell script that takes input from stdin and writes to stdout. If someone would like to contribute a more sophisticated version of this script that can handle command-line arguments, or perhaps a pure Windows equivalent, I'd be happy to post it here.
physed_check.sh
A Bourne-shell script that takes a single filename as an argument and checks the file for format violations that the physed.sh script would fix. It stops after finding 10 lines that violate the number format; if the file has fewer than 10 such lines, or none at all, the script checks the entire file. This is especially useful for a quick check of a large outfile, to determine whether or not the physed.sh script is needed.
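
For readers who just want the idea behind such a check, the following is a rough sketch of one way it could be done with awk. It is an assumption about the approach, not the contents of physed_check.sh, and the function name check_phylip_spacing is hypothetical.

```shell
#!/bin/sh
# Hypothetical sketch (not the actual physed_check.sh): report up to 10
# lines of the file named in $1 that contain fused distances.
check_phylip_spacing() {
    # The pattern is a decimal point followed by nine digits: six decimal
    # digits plus the first three digits of the next distance. The digit
    # class is written out nine times so that awk versions without
    # interval expressions ({n}) also accept it.
    awk '/\.[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]/ {
             print NR ": " $0          # line number, then the offending line
             if (++hits >= 10) exit    # stop early on large files
         }' "$1"
}
```

Called as check_phylip_spacing outfile, it prints nothing when the file needs no repair.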

Cheers,

Doug Scofield
Indiana University Department of Biology


December 27, 2006. Copyright (c) D. G. Scofield