Where People Run: Redux

I recently came across this article that provided some beautiful visualizations showing where people run using RunKeeper data.

At the end of the article a link is provided to an R script so you can plot your own runs.  Awesome!  Except… some things were a bit off.

After downloading my RunKeeper data from here I ran the script.  I was presented with a PDF containing a bunch of lines and no map.

RunKeeperScriptFromFlowingData

That’s not entirely useful, so I modified the script to download a map automatically and plot the routes on the map.

The other issue that I found was that only the first segment of your run was imported using the script, so if you routinely pause RunKeeper during your run (or have it do that automatically when you come to a stop) then you will have issues.  So I’ve also modified the script to read all segments from each run instead of just the first.

Here’s the final result, where you’ll notice a nice map and many more tracks present.

RunKeeperScriptAfter

The script is provided for you here.  Just make sure you modify the LEFT, BOTTOM, RIGHT, and TOP variables to bound your area of interest.

map-routes

How to archive your Pidgin chat log files.

I am somewhat of a hoarder when it comes to my chat logs.  I routinely find them useful to go back to: “What was that website Bob showed me, again?”  “What was the grocery list my wife gave me?”  As such, I have around 23,000 logged chats.  Now while this may not seem like a problem, I use Dropbox to sync my logs to multiple machines, so that no matter where I am I have access to them.  It takes a long time for programs to index 23,000 small files, so I wanted to come up with a way to archive the logs.

Unfortunately Pidgin doesn’t make this easy, as the log files follow a folder naming convention like this:

logs/protocol/username/buddy name/2010-02-23…html

There’s no way to easily pick out, and keep the folder structure of your log files, so I wrote a Perl script to accomplish the task.

You’ll need Perl installed on your machine.  I like to use ActivePerl, which you can grab here.

Extract the package to a directory of your choosing.  Next, run the script with the following usage:

perl    ArchiveLogs.pl    path_to_your_logs_dir    year_you_want_to_archive    output_directory

For example, on Windows your Pidgin log files are located in the %AppData% directory.  Here’s what the command would look like on my machine:

perl    ArchiveLogs.pl    C:\Users\Brian\AppData\Roaming\.purple\logs    2010    C:\Users\Brian\LogBackups

This will copy all conversations from all of your IM accounts into the LogBackups/YEAR folder.  It will then automatically zip those files into an archive.

But what about the old log files?  The script has an option to delete your old files.  Use this ONLY if you have already archived your logs!

The syntax of the command is the same, but you replace the output directory with the word “DELETE”.  For example, if I wanted to delete my old 2010 conversations:

perl    ArchiveLogs.pl    C:\Users\Brian\AppData\Roaming\.purple\logs    2010    DELETE

That’s all there is to it!  However, as usual, USE THIS SCRIPT AT YOUR OWN RISK.  I make no guarantees that it’s bug free. 😉

Get the script here: ArchiveLogs.

Brian

How are these not already MATLAB functions?

EDIT: I have since been informed that you can concatenate multiple arrays together just by putting them inside braces []. Thus, you can use MATLAB’s built-in min and max functions by doing the following:

% assuming A and B are two arrays…
C = min([A B]);

You learn something new every day when you’re a novice in some language!

Nevertheless, I will keep my two functions implemented. The downside to using the above method is that it results in a full-on copy of the two vectors into a single vector. The problem with doing this is that:
1) It is slow!
2) It could result in a memory overflow if the arrays are large enough.

I have also updated my functions to handle matrices as well. Enjoy.

=============================================================================

I was at work today and needed to loop through a large amount of data in MATLAB. I was doing a diff between two arrays and needed to loop between it’s min and max values. Unless I suck at using Google, I couldn’t find any built-in MATLAB functions that return the smallest min of multiple arrays or the largest max of multiple arrays.

Here they are for your pleasure:

smallestMin

% SMALLESTMIN(X) The smallest component of all vector and/or matrix inputs.
%   SMALLESTMIN(X), where X is any number of vectors and/or matrices, is 
%   the smallest component present in any of these structures.  
%   SMALLESTMIN(X) is consistent with MATLAB's built-in MIN(X) for vectors
%   only!! SMALLESTMIN(X) for a matrix returns the equivalent of 
%   MIN(MIN(X)).  SMALLESTMIN accepts mixed inputs of vectors and matrices.
%   A matrix is defined as having exactly 2 dimensions.
%
%   VALID INPUTS: row and column vectors and 2-D matrices.  These can
%                 contain characters, complex numbers, and numerics.
%                 Logicals, doubles, singles, int8, uint8, int16, int32,
%                 and uint32 are also valid.
%   INVALID INPUTS: cell arrays, 64-bit signed and unsigned values,
%   structures, function handles, user classes, and java classes.
%
%   These lists may not be all-inclusive.
%
%   Ex 1: If A = [1 2 3 4] and B = [-1   then smallestMin(A,B) returns -1.
%                                    2
%                                    3
%                                    4]
%
%   Ex 2: If A = [1 2   and B = [5 6 7 8] then smallestMin(A,B) returns 1.
%                 3 4]
%
%   Author:  Brian A. Schrameck
%   Version: v0.1
%   Date:    07/15/2009
%   Email:   <a href="mailto:schrameckbrian+smallestMin@gmail.com">schrameckbrian+smallestMin@gmail.com</a>
%
%   See also min, max, isnumeric, and largestMax (available on the MATLAB
%   File Exchange).

function smallestVal = smallestMin (varargin)

smallestVal = 0;

if ( length(varargin) < 1 )
    error('At least 1 argument must be present!')
end

for i=1:length(varargin)
    % This will catch at least most of the bad stuff thrown at it.  It will
    % not catch uint64/int64, or function handles; probably some other
    % stuff too.
    if ( (~isvector(varargin{i}) && ndims(varargin{i}) ~= 2 ) || ...
            iscell(varargin{i}) )
        error('Argument %d is not a vector or 2D matrix!', i)
    end
    % We will error out here if we didn't catch something earlier, as min
    % will die.
    curMin = min(min(varargin{i}));
    if ( i == 1 )
        % First time through is always the min value.
        smallestVal = curMin;
    elseif ( curMin < smallestVal )
        % Compare against current smallest value.
        smallestVal = curMin;
    end
end
end

largestMax

% LARGESTMAX(X) The largest component of all vector and/or matrix inputs.
%   LARGESTMAX(X), where X is any number of vectors and/or matrices, is 
%   the largest component present in any of these structures.  
%   LARGESTMAX(X) is consistent with MATLAB's built-in MAX(X) for vectors
%   only!! LARGESTMAX(X) for a matrix returns the equivalent of 
%   MAX(MAX(X)).  LARGESTMAX accepts mixed inputs of vectors and matrices.
%   A matrix is defined as having exactly 2 dimensions.
%
%   VALID INPUTS: row and column vectors and 2-D matrices.  These can
%                 contain characters, complex numbers, and numerics.
%                 Logicals, doubles, singles, int8, uint8, int16, int32,
%                 and uint32 are also valid.
%   INVALID INPUTS: cell arrays, 64-bit signed and unsigned values,
%   structures, function handles, user classes, and java classes.
%
%   These lists may not be all-inclusive.
%
%   Ex 1: If A = [1 2 3 4] and B = [-1   then largestMax(A,B) returns 5.
%                                    2
%                                    3
%                                    5]
%
%   Ex 2: If A = [1 2   and B = [5 6 7 8] then largestMax(A,B) returns 8.
%                 3 4]
%
%   Author:  Brian A. Schrameck
%   Version: v0.1
%   Date:    07/15/2009
%   Email:   <a href="mailto:schrameckbrian+largestMax@gmail.com">schrameckbrian+largestMax@gmail.com</a>
%
%   See also max, min, isnumeric, and smallestMin (available on the MATLAB
%   File Exchange).

function largestVal = largestMax (varargin)

largestVal = 0;

if ( length(varargin) < 1 )
    error('At least 1 argument must be present!')
end

for i=1:length(varargin)
    % This will catch at least most of the bad stuff thrown at it.  It will
    % not catch uint64/int64, or function handles; probably some other
    % stuff too.
    if ( (~isvector(varargin{i}) && ndims(varargin{i}) ~= 2 ) || ...
            iscell(varargin{i}) )
        error('Argument %d is not a vector or 2D matrix!', i)
    end
    % We will error out here if we didn't catch something earlier, as max
    % will die.
    curMax = max(max(varargin{i}));
    if ( i == 1 )
        % First time through is always the max value.
        largestVal = curMax;
    elseif ( curMax > largestVal )
        % Compare against current largest value.
        largestVal = curMax;
    end
end
end

Aww, how cute.

Look, my first C++ project! Just look at the lines that are 196 characters long and improper indentation. Isn’t it adorable?

//<><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>\\
// Brian Schrameck																\\
// COMP 115																		\\
// Started:  September 15, 2004													\\
// Last Modified:  September 26, 2004											\\
//																				\\
// SchrameckProject1.cpp:  A program that takes a user-input dollar amount		\\
// and maximizes the amount of candy able to be bought with it.  It will		\\
// buy as much of the favorite candy as possible, and then buy as much			\\
// of the next favorite, etc., until no more candy is able to be bought.		\\
// It then displays the candy bought, the amount, and the price, as well		\\
// as the total price and the leftover money in a table, with decimal points	\\
// lined-up appropriately.														\\
//																				\\
// I have abided by the Wheaton Honor Code in this work.						\\
// - Brian A. Schrameck															\\
//<><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>\\

#include <iomanip>
#include <iostream>
using namespace std;

// Declare the price of the candies as constants in CENTS.

const int TURTLE = 225; 	// Ton o' Chocolate Turtles
const int CUPS = 90; 		// Nutty Cups
const int TRUFFLE = 55; 	// Death by Chocolate Truffles
const int TWISTER = 7; 		// Twisters

int main()
{
double cash; 					// How much money the person has to start out with.
int cents; 					    // How much money the person has in CENTS.
int turtles; 					// How many Ton o' Chocolate Turtles the person will be able to buy.
int cashafterturtle;			// How much money the person has in cents after buying the turtles.
int cups;						// How many Nutty Cups the person will be able to buy.
int cashaftercups;				// How much money the person has in cents after buying the cups.
int truffles;					// How many Death by Chocolate Truffles the person will be able to buy.
int cashaftertruffles;			// How much money the person has in cents after buying the truffles.
int twisters;					// How many Twisters the person will be able to buy.
int cashaftertwisters;			// How much money the person has in cents after buying the twisters.
float turtlescost;				// How much money the turtles cost total in dollars.
float cupscost;					// How much money the cups cost total in dollars.
float trufflescost;				// How much money the truffles cost total in dollars.
float twisterscost;				// How much money the twisters cost total in dollars.
float total; 					// How much money is spent total.
float change;					// How much money is leftover.

// Ask the user how much money they have to spend.

cout << "How much money do you have to spend on the candy?" << endl;
cin >> cash;

// If the user provides 0 as their cash amount, they will be asked to go get some money.

while  (cash == 0)
{
cout << "You can't buy something for nothing!  Get some money, silly!" << endl;
cout << "How much money do you have to spend, now?" << endl;
cin >> cash;
}

// Convert the amount of money from dollars into cents.

cents = 100.0 * cash;

// Find out how many of each candy you will be able to buy, then find out how much change you have leftover after each.

turtles = cents / TURTLE;
cashafterturtle = cents % TURTLE;

cups = cashafterturtle / CUPS;
cashaftercups = cashafterturtle % CUPS;

truffles = cashaftercups / TRUFFLE;
cashaftertruffles = cashaftercups % TRUFFLE;

twisters = cashaftercups / TWISTER;
cashaftertwisters = cashaftercups % TWISTER;

// Find out how much the total amount of each batch of candy cost.

turtlescost = (TURTLE / 100.0) * turtles;
cupscost = (CUPS / 100.0) * cups;
trufflescost = (TRUFFLE / 100.0) * truffles;
twisterscost = (TWISTER / 100.0) * twisters;

// Find out the total cost.

total = turtlescost + cupscost + trufflescost + twisterscost;

// Find out how much money will be leftover.

change = cash - total;

// Display the "receipt".

cout << endl;
cout << "^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^" << endl;
cout << "|                                                   |" << endl;
cout << "|                Brian's Candy Shop                 |" << endl;
cout << "|                   Your Receipt                    |" << endl;
cout << "|                                                   |" << endl;
cout << "|                                                   |" << endl;
cout << "| Candy                        Amount       Price   |" << endl;
cout << "| ------------------------------------------------- |" << endl;
cout << "| " << setprecision (2) << fixed << "Ton o' Chocolate Turtles" << setw (6) << "| " << setw (4) << turtles << " |" << setw (5) << "$" << setw (6) << turtlescost << "   |" << endl;
cout << "| " << "Nutty Cups" << setw (20) << "| " << setw (4) << cups << " |" << setw (5) << "$" << setw (6) << cupscost << "   |" << endl;
cout << "| " << "Death by Chocolate Truffles" << setw (3) << "| " << setw (4) << truffles << " |" << setw (5) << "$" << setw (6) << trufflescost << "   |" << endl;
cout << "| " << "Twisters" << setw (22) << "| " << setw (4) << twisters << " |" << setw (5) << "$" << setw (6) << twisterscost << "   |" << endl;
cout << "| ------------------------------------------------- |" << endl;
cout << "| TOTAL:" << setw (35) << "$" << setw (6) << total << "   |" << endl;
cout << "| ------------------------------------------------- |" << endl;
cout << "| Leftover:" << setw (32) << "$" << setw (6) << change << "   |" << endl;
cout << "|                                                   |" << endl;
cout << "|                                                   |" << endl;
cout << "|                    Thank you!                     |" << endl;
cout << "|                Please come again!                 |" << endl;
cout << "|                                                   |" << endl;
cout << "|                                                   |" << endl;
cout << "^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^" << endl;

return 0;
}

Sample output with an input of $50.00

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|                                                   |
|                Brian's Candy Shop                 |
|                   Your Receipt                    |
|                                                   |
|                                                   |
| Candy                        Amount       Price   |
| ------------------------------------------------- |
| Ton o' Chocolate Turtles    |   22 |    $ 49.50   |
| Nutty Cups                  |    0 |    $  0.00   |
| Death by Chocolate Truffles |    0 |    $  0.00   |
| Twisters                    |    7 |    $  0.49   |
| ------------------------------------------------- |
| TOTAL:                                  $ 49.99   |
| ------------------------------------------------- |
| Leftover:                               $  0.01   |
|                                                   |
|                                                   |
|                    Thank you!                     |
|                Please come again!                 |
|                                                   |
|                                                   |
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A crash course in Perl and DNA.

In my Senior year of undergrad I took a class called “DNA.”  It was a cross-listed course; Computer Science and Biology.  I thought it might be interesting, and it involved coding up some stuff in Perl.  The course was basically a crash-course in both Perl and DNA. As such, some of the starting code (such as reading from a file) was given to us. Lucky!

For our final project we had to come up with an application summing up the knowledge we learned in class.  I decided to do the following:

Summary

In DNA there are four possible nucleotides that can be used: T, C, A, and G.  Codons are formed by joining three of these nucleotides together.  This means that there are sixty-four possible codons (4x4x4).  The purpose of these codons is to provide a mechanism for the organism to produce amino acids.  There are 20 different amino acids.  This means that there will be some codons that, while having different nucleotide structures, allow for the production of the same amino acids.

Different organisms may show different preference, or a bias, towards different codons to encode the same amino acid.  The reasons for this happening are still vague, but much research is being done to get a clue as to why certain codons are preferred over others.

In bacteria, for example, different variables have been found to contribute to certain codon usage.  Some of these include GC-composition, the likelihood of a specific strand to mutate, and RNA stability. (Curr Issues Mol Biol., Oct 3 2001)  It has also been found that codon bias may be used to improve translation efficiency. (Yizhar Lavner, Daniel Kotlar, 2004)

Proposed Solution

The software being proposed will be able to compare different biases toward codons in any set of organisms.  It will be able to find GC-rich regions, as well as output information to an Excel format file for analysis.  By reading in a DNA sequence, breaking it up into codons, and using a hash table to keep track of the counts of each different codon, a statistical analysis can be performed on the bias towards one codon or another to encode the same amino acid.  A visual analysis can be performed using Microsoft Excel’s chart capabilities.

The program will start by getting the input.  This is done by using glob to grab a bunch of files stored in a local folder.  The file names will be stored in an array.  The program will then enter a while loop to go through each file in the array and perform work on it.

#!/usr/bin/perl

use strict;
use warnings;

# needed for MacOS X if using BBedit or TexWrangler (see comments below)
use File::Spec;

#-------------------------------------------------------------------------------
# CONSTANTS
#-------------------------------------------------------------------------------

my $MAX_CODONS_OUTPUT = 64;               # top-N most frequent codons
my $dataDirectory = "inputGeneDirectory"; # relative path of directory holding input files
my $outputFileName = "outputResults.xls"; # set-up OUTPUT file (tab-delimited for Excel)
my $TRUE = 1;                             # bool value for true
my $FALSE = 0;                            # bool value for false
my $CODON_LENGTH = 3;                     # size of codon to use in current run

#-------------------------------------------------------------------------------
# VARIABLES/DATA STRUCTURES
#-------------------------------------------------------------------------------

# ARRAYS
my @fileList;      # all files in the directory ($dataDirectory) of data files
my @allCodons;     # an array to hold each codon
my @aminoAcids;    # a list of frequency counts for each aminoAcid and codons
my @sorted;        # a list of the counts sorted highest to lowest

# HASH MAPS
my %codonCounts;        # a hash table of frequency counts for each codon
my %codonMap;           # declare a hash table to hold the Amino Acid abbreviations
                        # corresponding to each possible triple of DNA (codon).
                        # Amino Acids are abgreviated with the ONE LETTER codes.
                        # (see the table at the bottom of this program for AA codes)

# VARIABLES
my $pathname;           # the path to the files
my $filename;           # the file name being worked on
my $nextFile;           # the next file in the list of files
my $sequence;           # the DNA sequence being read in
my $sequenceLength;     # the length of the DNA sequence
my $numCodons = 0;      # the number of codons found
my $nextCodon;          # scalar to hold each codon, one at a time
my $numSortedCodons = 0;# the number of sorted codons, useful if less than N
my $numToPrint;         # the final number of top-N codons to print
my $num = 0;            # counter variable, i would not work for some reason
my $rank = 0;           # the rank of the current codon
my $currentCount = 0;   # the number of times the current codon was found
my $proportion;         # the proportion of this codon compared to all codons
my $GCRich;             # the GC-rich region that was found
my $lengthGC;           # the length of this GC-rich region
my $firstGCRich = $TRUE;# is this the first GC-rich region found for this file?
my $sumGCLengths = 0;   # the sum of all of the GC-rich region lengths
my $percentGCs = 0;     # the percentage of the file that is rich in GC
my $nextAcid;           # the next amino acid in the array

#===============================================================================
#                      START PROGRAM
#===============================================================================

%codonMap = getAminoAcidLookUpTable(); # populate the codon hash map->amino acids

printTitle();             # print the title of the report (to the console)

#============================
# SET OUTPUT INFORMATION

#set the path name for the directory
setPathname ();

# set the output filename
$outputFileName = $pathname.$outputFileName;

# create the output file (tab-delimited for Excel)
open (OUTPUT, ">", $outputFileName)
    or die "Cannot create the file: $outputFileName: $!";
#============================
# SET INPUT INFORMATION

# use filename with .fna to get all FASTA files in the directory
$filename = $pathname.$dataDirectory."/*.fna";

# grab a "glob" of files, each filename is put into elements of the array @fileList
@fileList = glob($filename);

The first step when entering the while loop is to get the DNA from the file.  The input expected will be a FASTA format file.  The sequence must also begin with a start codon and end with a stop codon.

#========================================================
# FOREACH file in the directory of input data files

print "========================================================\n\n";

foreach $nextFile (@fileList)
{
    %codonCounts = ();       # clear the hash table

    # print the current filename to the screen (console, stdout)
    print "Next file is: $nextFile \n";

    # print filename to Excel (output) file
    print OUTPUT "$nextFile\n";

    # skip Linux files with names that start with ".";
    if ($nextFile =~ m/^\./)
    {
        next;
    }

    # get the DNA sequence and its length.
    $sequence = readInDNA($nextFile);
    $sequenceLength = length ( $sequence );

    print "Length of the DNA sequence: $sequenceLength \n\n";

    # get the last codon from the sequence
    my $lastCodon = substr ( $sequence, $sequenceLength - 3, 3 );

    # check if the sequence starts with an ATG.
    if ( substr ( $sequence, 0, $CODON_LENGTH ) ne "ATG" )
    {
        print "Gene must start with ATG.\n";
        print OUTPUT "Gene must start with ATG. Results may not be valid!\n\n";
    }

    # check if the sequence ends with a STOP.
    if ( $lastCodon ne "TAG" && $lastCodon ne "TAA" && $lastCodon ne "TGA" )
    {
        print "Gene must end with stop codon.\n";
        print OUTPUT "Gene must end with stop codon. Results may not be valid!\n\n";
    } 

The next step is to break the DNA sequence given into codons.  Since we know that the sequence is in the correct reading frame, no changes in the reading frame will need to occur.  For each codon a count will be increased.  This will be implemented using a hash table.  So if the sequence is CCACCA, the hash table will have one entry, CCA, with a count of 2.

# break the sequence into codons, storing each codon in a slot in an array
    @allCodons = breakIntoCodons($sequence );
    $numCodons = scalar ( @allCodons );    # set the number of codons found

    print "Number of codons: $numCodons \n";
    print OUTPUT "Number of codons: $numCodons \n";

    # go through the array of codons...
    foreach $nextCodon (@allCodons)
    {
        $codonCounts { $nextCodon }++;    #increase the count for this codon
    }

After reading all of the codons in the sequence, all of the unique codons found, along with their counts, will be output to an Excel format .xls file.  This is the first of two Excel files being used for this project.  Next, the hash table will be analyzed to see which types of amino acids are present.  These will be matched up to the codons to get a count of how many amino acids were formed using a specific codon.  The results of this will be output to the same Excel format file as above.  This will include all of the amino acids, sorted by the most biased codons that formed that amino acid.

# sort the array of codons by their count
    @sorted =  sort { $codonCounts{$b} <=> $codonCounts{$a} or $a cmp $b } keys(%codonCounts);

    # get the length of how many unique codons were found
    $numSortedCodons = scalar ( @sorted );

    # get the amino acid that corresponds to each unique codon
    for ( $num = 0; $num < $numSortedCodons; $num++ )
    {
        $aminoAcids [$num] = translate ( $sorted[$num] );
    }

The next part is to find GC-rich regions in the sequence.  The point of doing this is to find if there is a certain bias only during GC-rich regions.  The same procedure will be applied as above, but a new file will be created just for the GC-rich regions.

#===============================================================================
#		start search for GC-rich regions
#===============================================================================

    print "Searching for GC-rich regions greater than length 4...\n";

    # Search for GC-rich sequences of 4G/Cs or more
    while ( $sequence =~ m/([GC]{4,})/g )
    {
        $GCRich = $1;                       # the region
        $lengthGC = length ( $GCRich );     # length of region
        $sumGCLengths += $lengthGC;	        # total length of GC regions
        $firstGCRich = $FALSE;              # no long first region found
    }

    # if this is the first GC-rich region in the file...
    if ( $firstGCRich == $TRUE )
    {
        print "No GC-rich regions were located. \n";
        print "========================================================\n\n";
    }
    else
    {
        print "Found some! \n";
        print "========================================================\n\n";
    }

    # find the GC percentage
    $percentGCs = ( $sumGCLengths / $sequenceLength ) * 100;

    $sumGCLengths = 0;      # start over the GC-length
    $firstGCRich = $TRUE;   # set the next GC-rich region found as the first

The next part is to loop back to the beginning and move-on to the next file.  Each file will be separated by a few spaces.

The last step of the program deals with the output to the screen.   The program will output the number of codons found, each file name, and the relevant number of amino acids found.

#===============================================================================
#			start outputting results
#===============================================================================

    printf OUTPUT "GC%%: \t%5.1f \n\n", $percentGCs;
    printf OUTPUT "Rank\tCodon\tAmino Acid\tFrequency\tProportion\n";

    # if there are less than N "top-N" codons found
    if ( $MAX_CODONS_OUTPUT > $numSortedCodons )
    {
        $numToPrint = $numSortedCodons;	# set the number to output
    }

    # otherwise...
    else
    {
        $numToPrint = $MAX_CODONS_OUTPUT; # set to constant N, defined at top
    }

    # print the top N codons: rank, codon, occurences, and proportion
    for ( $num = 0; $num < $numToPrint; $num++ )
    {
        $rank++;     					                # set this codon's rank
        $nextCodon = $sorted[$num];			            # get the codon
        $currentCount = $codonCounts { $nextCodon };    # get the count
        $proportion = $currentCount / $numCodons;       # calculate proportion
        $nextAcid = $aminoAcids[$num];                  # get the next A.A.

        # don't output start or end codons, they are unnecessary
        if ( $nextAcid ne "..." && $nextAcid ne "Met" )
        {
            # now print...
            printf OUTPUT "$rank\t$nextCodon\t$nextAcid\t$currentCount\t%5.3f\n", $proportion;
        }
        else
        {
            # if it was a start or end codon, decrease rank by 1
            $rank--;
        }
    }

    $rank = 0;		        # start the ranks over
    $percentGCs = 0;        # start over the %GCs

    # make some space in the Excel otuput between tables
    print OUTPUT "\n\n\n";
}

# w00t, done!
close ( OUTPUT );
print "\n\nDone.\n";

#----- end of Main Program -----------------------------------------------------

The rest of the file contains various subroutines to read in the DNA and break it apart into codons, as well as set up the codon map.

#-----------\
# readInDNA \
#-------------------------------------------------------------------------------
# Subroutine to open a FASTA formatted file of DNA sequence, remove
# newline characters at the end of each line, discard any blank lines
# discard lines containing a comment(#) symbol, removes any digits,
# removes any extra whitespace characters, converts nucleotides
# to uppercase, and return as one big string.
#
# WRITTEN BY MARK LEBLANC/BETSY DYER
#
# 1 argument:  name of file holding the DNA
#
# RETURNS:  file of DNA as one string
#
# A sample of how you might "call" this subroutine from your program
# to 'get' the DNA from a Fasta formatted file for a gene sequence
#
#                  $sequence = readInDNA("humanXM_012345.fna");

sub readInDNA
{
    my $filename = shift;
    my $line;
    my $firstline;
    my $seq;

        # try to open the file; if no file by that name, print a warning and exit
        open (FNAFILE, "<", $filename)
            or die "Cannot open the input file: $filename: $!";

        if ( !($firstline = <FNAFILE>) )
        {
               print "Can NOT read the header line from the file called: $filename \n";
               exit();
         }

    $seq = "";

      # WHILE not EOF, grab next line
    while ($line = <FNAFILE>)
    {
        chomp $line;

            # discard blank line
            if ($line =~ m/^\s*$/)
            {
                    next;
                }

            # discard comment line (any line with a comment(#) symbol)
            if ($line =~ m/^\s*#/)
            {
                    next;
                }

                $line =~ tr/0123456789//d;    # remove all digits
        $line =~ tr/     \n\r//d;        # remove any extra whitespace
        $line = uc($line);        # convert everything to upper-case

            $seq = $seq . $line;        # concatenate new line onto the entire text so far

    } # end WHILE not EOF

    close(FNAFILE);

    return $seq;
}
#----- end of readInDNA --------------------------------------------------------


#-----------------\
# breakIntoCodons \
#-------------------------------------------------------------------------------
# Subroutine to break the text passed to it into codons of length $CODON_LENGTH.
#
# Partially WRITTEN BY MARK LEBLANC/BETSY DYER
#
# RETURNS: an array of codons.

sub breakIntoCodons
{
    my $text = shift;        # the sequence being searched
       my $DNAlen = length($text);     # the length of the sequence
    my @codons;              # array of codons to fill and return when done
    my $i=0;                # points to start of the next codon each time
    my $j=0;                # $codon[$j] is the next codon
        my $end = $DNAlen-($CODON_LENGTH);    # don't fall off the end of the sequence
        my $tempMotif;            # the current codon being worked on

        while ( $i < $end )
       {
               # get the next codon
               $tempMotif = substr($text, $i, $CODON_LENGTH);

                # if there is some other letter than ACGT...
              if ( $tempMotif =~ m/[^ACGT]/i )
               {
                    # skip to next valid codon
                 $i += $CODON_LENGTH;
               }

                # otherwise, add this codon to valid codons
               else
               {
                 $codons[$j] = $tempMotif; # add to array of codons
                $i += $CODON_LENGTH;     # move to next codon
                 $j++;                    # move to next box in array
               }
    }

        return @codons;
}
#----- end of breakIntoCodons --------------------------------------------------


#-------------\
# setPathname \
#-------------------------------------------------------------------------------
# Subroutine to set the pathname.  Useful to keep the main program a little
# more readable.  Expects a global variable of $pathname.
#
# WRITTEN BY MARK LEBLANC/BETSY DYER
#
# RETURNS: nothing.

sub setPathname
{
    #-----------------------------------------------------------------------
    # establish the absolute pathname to the directory of input files
    #-----------------------------------------------------------------------
    # MacOS USERS TAKE NOTE
    # determine the absolute pathname to the current directory
    # next two lines need to be uncommented for MacOS X (e.g., BBedit or TexWrangler)
    # my @path       = File::Spec->splitdir( File::Spec->rel2abs($0) );
    # $pathname      = File::Spec->catfile( @path[0 .. $#path - 1], "");
    #-----------------------------------------------------------------------
    # uncomment if using
    # WINDOWS or Linux (or MacOS X on command-line)
    $pathname   = "";
    #-----------------------------------------------------------------------
}
#----- end of setPathname ------------------------------------------------------


#------------------------\
# getAminoAcidLookUpTable \
#-----------------------------------------------------------------------
# Construct hash that maps codons to amino acids by reading the table
# from DATA lines at the end of the program, which have the form:
# Residue Codon1 Codon2 ...
#
# no arguments
#
# WRITTEN BY MARK LEBLANC/BETSY DYER
#
# RETURNS: hash table of codon and their one-letter amino acid abbreviations

sub getAminoAcidLookUpTable
{
  my %codonMap;

  while (my $in = <DATA> )    # assigns next line of DATA to $in; fails if none
  {
    chomp($in);                   # remove line feed at end of $in
    my @codons = split " ",$in;
    my $residue = shift @codons;  # remove first item from @codons and assign
    foreach my $nnn (@codons) {
        $codonMap{$nnn} = $residue;
    }
  }

  return %codonMap;

}
#----- end of getAminoAcidLookUpTable ------------------------------------------


#-----------\
# translate  \
#-----------------------------------------------------------------------
# translates (presumably pre-transcribed) DNA strings to proteins
#
# 1 argument: the DNA string to translate
#
# WRITTEN BY MARK LEBLANC/BETSY DYER
#
# RETURNS: the protein string. (1-letter Amino Acid abbreviations)

sub translate
{
    my $dna = shift(@_);  # the DNA string to be translated.

    my $pro = "";
    while ( $dna =~ s/(...)// ) {
        $pro = $pro . $codonMap{$1};
    }
    return $pro;
}
#----- end of translate --------------------------------------------------------

#define the mapping from codons to amino acids...
__END__
Ala GCT GCC GCA GCG
Arg CGT CGC CGA CGG AGA AGG
Asn AAT AAC
Asp GAT GAC
Cys TGT TGC
Gln CAA CAG
Glu GAA GAG
Gly GGT GGC GGA GGG
His CAT CAC
Ile ATT ATC ATA
Lei TTA TTG CTT CTC CTA CTG
Lys AAA AAG
Met ATG
Phe TTT TTC
Pro CCT CCC CCA CCG
Ser TCT TCC TCA TCG AGT AGC
Thr ACT ACC ACA ACG
Trp TGG
Tye TAT TAC
Val GTT GTC GTA GTG
... TAA TAG TGA

Results

The results proved that there is a definite bias towards some codons to create certain amino acids, namely asparagine, aspartic acid, and lysine. GC-richness appeared to have no effect on codon bias. Any answers as to why there is a codon bias at all are still being researched.

For the full project, including raw notes and code, download this zip file: codonBiasFinal
(Note: this project is kind of thrown together, meaning standard coding practices may not have been followed, such as breaking up parts of the program into subroutines, etc. If I remember correctly, it was a requirement of the professors that all work be done in the provided subroutines, with no new subs created, but I could be wrong. Either way, looking at the code now it’s atrocious.)