How To Align Partitions on an Upgraded iPod Hard Drive in Windows

If you’ve upgraded your iPod using a newer, larger drive, there’s a chance that the new drive uses “Advanced Format” sectors.  Drives that are use Advanced Format have sectors larger than 512 bytes; generally 4K bytes.

If you upgraded your iPod using one of these drives, it will work, but performance will be degraded significantly.  This is because iTunes partitions the drive inside of your iPod with no consideration about whether it is an Advanced Format drive or not.  iTunes uses two partitions when formatting:

  1. The first partition is the “system” hidden partition.  This is where the iPod stores its operating system.
  2. The second partition is free space formatted as FAT32.  This is the partition we are concerned about.

I upgraded my 5G iPod Video 60GB to a 120GB drive using a Samsung HS12YHA, but I was really disappointed when syncing took forever and write speeds were less than 1MB/s.  This occurs when the logical sectors on the disk are unaligned with the physical sectors.  The iPod uses 2K sectors.  Here’s a diagram to show what I mean:

diagram that shows the different between aligned and unaligned sectors

If you take a look at the unaligned picture you can see that when the OS writes to disk (using logical sectors, as usual) it’s possible that the logical sector spans across two physical sectors.   Because a sector is the smallest unit that a disk can address, when a sector is written to the entire sector needs to be read, the change made, and then the sector written back to disk.  As such, if you change a logical sector that spans across multiple physical sectors, then both physical sectors need to be read entirely, changed, and written to disk again causing the drive to do a lot of extra work.

What we want to achieve is a drive that has its logical sectors aligned with its physical sectors.  To do this, the start of the data partition on the iPod needs to start at a physical sector.  Here’s how to do it on Windows.

WARNING: YOU WILL LOSE ALL DATA ON YOUR IPOD WHEN DOING THIS.  ALSO, BE EXTREMELY CAREFUL THAT YOU DO NOT EDIT THE WRONG DISK.  I am not responsible if something goes wrong.

  1. Restore your iPod using iTunes.
  2. Download Symantec’s Partition Table Editor, PTEDIT32, from this link.
  3. Open PTEDIT and select your iPod from the drop down menu.  You can do this by matching the disk size with your iPod side.  You may have to start PTEDIT as an administrator by right-clicking it and choosing “Run as administrator”.
  4. You now need to figure out which sector partition 2 should start on.  DO NOT alter partition 1!  To determine which sector to use, increase the number in “Sectors Before” for partition 2 until it is divisible by 8.  Keep track of how many sectors you add!  For instance, if your “Sectors Before” field reads 224910 (as mine did) you would increase this to 224912 because 8 divides evenly into it.
  5. Add the same number of sectors to the Starting Sector field for partition 2.  My disk started at sector 1, so I increased this to 3.
  6. Subtract the same number of sectors from the Sectors field for partition 2.  My partition had 234216737 sectors, so I made this 234216735.
    DO NOT alter any other fields!
    Here’s what my final partition table looked like:
  7. Save your changes.  Exit PTEDIT and “safely remove” your iPod.
  8. The iPod should reboot, but it’s probably going nuts trying to understand what happened to the partitions, as you’ve destroyed the file system.  You’ll need to boot the iPod into Disk Mode to continue.
  9. Connect the iPod to the computer.
  10. Download fat32format and extract it somewhere convenient.
  11. Open a command prompt and navigate to the spot where you extracted fat32format.
  12. Make SURE you type this command correctly.  Type:

    fat32format IPOD_DRIVE_LETTER_HERE:

  13. This will quickly format the iPod’s data partition.  When it’s done, continue.
  14. Disconnect the iPod and reconnect it to the computer.
  15. You should now be able to use your iPod at full speed!
As always, if you have any questions, post them here and I’ll do my best to help.

How to archive your Pidgin chat log files.

I am somewhat of a hoarder when it comes to my chat logs.  I routinely find them useful to go back to: “What was that website Bob showed me, again?”  ”What was the grocery list my wife gave me?”  As such, I have around 23,000 logged chats.  Now while this may not seem like a problem, I use Dropbox to sync my logs to multiple machines, so that no matter where I am I have access to them.  It takes a long time for programs to index 23,000 small files, so I wanted to come up with a way to archive the logs.

Unfortunately Pidgin doesn’t make this easy, as the log files follow a folder naming convention like this:

logs/protocol/username/buddy name/2010-02-23…html

There’s no way to easily pick out, and keep the folder structure of your log files, so I wrote a Perl script to accomplish the task.

You’ll need Perl installed on your machine.  I like to use ActivePerl, which you can grab here.

Extract the package to a directory of your choosing.  Next, run the script with the following usage:

perl    ArchiveLogs.pl    path_to_your_logs_dir    year_you_want_to_archive    output_directory

For example, on Windows your Pidgin log files are located in the %AppData% directory.  Here’s what the command would look like on my machine:

perl    ArchiveLogs.pl    C:\Users\Brian\AppData\Roaming\.purple\logs    2010    C:\Users\Brian\LogBackups

This will copy all conversations from all of your IM accounts into the LogBackups/YEAR folder.  It will then automatically zip those files into an archive.

But what about the old log files?  The script has an option to delete your old files.  Use this ONLY if you have already archived your logs!

The syntax of the command is the same, but you replace the output directory with the word “DELETE”.  For example, if I wanted to delete my old 2010 conversations:

perl    ArchiveLogs.pl    C:\Users\Brian\AppData\Roaming\.purple\logs    2010    DELETE

That’s all there is to it!  However, as usual, USE THIS SCRIPT AT YOUR OWN RISK.  I make no guarantees that it’s bug free. ;)

Get the script here: ArchiveLogs.

Brian

New (and awesome) Programming Jargon

I stumbled across this post the other day on globalnerdy.com.

There are some real gems in here, and I’m sure those of you that have developed software before can relate to at least a few of them. Here are some of my favorites:


Yoda Condidions

The act of using:

if (constant == variable)

instead of:

if (variable == constant)

It’s like saying “If blue is the sky”.


Bugfoot

A bug that isn’t reproducible and has been sighted by only one person.


Hindenbug

A catastrophic data-destroying bug. Oh, the humanity!

Sync your iTunes library and settings across multiple computers on Windows Vista/7.

You will need Windows Vista or Windows 7, the newest version of iTunes, and a Dropbox account for this tutorial. A Dropbox referral link is provided in this article which allows you and I to both get more storage space! This is also doable on Mac OS X, but I do not have access to a machine for screenshots and testing. It is also doable with Windows XP, but the commands at the CLI are a bit different. I’ll update this later with that information.

Note that both computers must have the SAME user name.

REMEMBER: When working with your personal data it is IMPERITIVE that you back everything up before you start working. Anything can happen, and I cannot be held accountable for data loss.
Let’s just take the situation of 2 PCs, a laptop and a desktop. You want your desktop to function as a media server with all of your music for your laptop to access over the network. Let’s also assume that you have a separate hard drive to make things easier.

Let’s assume again that your media is stored on a separate drive than your iTunes library. I use my M: drive. We need to have this path mapped on the laptop to access the music across the network. Map the drive to the same drive letter on both PCs. There are many how-to’s on the net for reference here, so it won’t be covered.

1. Next, download Dropbox from here:
Dropbox

2. Install Dropbox using the instructions provided on their website. Once it is done installing and you have logged in and gone through the tutorial, come back here.

Welcome back.

Now, we are going to use Dropbox to sync our iTunes library AND settings between multiple computers. I like to have exactly the same settings and library on both computers.

3. The first step is to move our library files (not the music files) into the cloud! Go to your Music folder (C:\Users\USERNAME\Music) and CUT the iTunes folder.
(note, it is shown as a shortcut in the screenshot, it will not be on your computer)

1iTunesDropbox

4. Now navigate to your new Dropbox folder, by default at “C:\Users\USERNAME\My Dropbox” and create a folder called “iTunes”

2iTunesDropbox

5. Now go to that folder and paste your iTunes folder that you just cut from your Music folder.

3iTunesDropbox

6. Now time for some command prompt fun! Go to the start menu and type “cmd” then right click cmd.exe and click “Run as administrator”

4iTunesDropbox

7. Click “Yes” and you will see this:

5iTunesDropbox

8. Now type:

cd C:\Users\USERNAME\Music\
mklink /d iTunes “..\My Dropbox\iTunes\iTunes”

You will see:

6iTunesDropbox

Now, when you double-click the link you just created, it will bring you to your iTunes music folder. Remember you moved it to the Dropbox folder. But look at the address bar!

7iTunesDropbox

Windows looks at this as an actual folder, but every file here is actually one of the files in your Dropbox iTunes folder! Any file created, deleted, or modified here has the same done to it in the Dropbox folder.

9. Now go to your laptop and install Dropbox on this computer.

10. (Assuming iTunes is already installed) Go to your music folder on this computer (C:\Users\USERNAME\Music) and delete the iTunes directory.

11. Follow steps 6 – 8 again on this computer.

12. Go back to your desktop and go to the start menu. Type “%appdata%” and hit enter.

8iTunesDropbox

13. Navigate to “Apple Computer” and CUT the iTunes folder from inside of it.

14. Go to your Dropbox and make a new folder titled “AppData” and paste the iTunes folder into it.

15. Start up the command prompt again (step 6). This time, use the following commands:

cd “C:\Users\USERNAME\AppData\Roaming\Apple Computer”
mklink /d iTunes “..\..\..\My Dropbox\iTunes\AppData”

16. Go to your laptop and follow steps 12 – 15.

Now it’s a waiting game. Wait until the Dropbox tray icon has stopped being a blue “recycling” or “refresh” icon and has become a green “check” icon. At this point, fire-up iTunes. You should be ready to go.
This is doable without Dropbox, but the benefit is that all of the library files are stored locally on the hard drive. This means that access to changing song tags and browsing the library in general is MUCH faster than over the network. Dropbox does a binary diff between files and only uploads the bits that have changed, so when uploading a change to your library it is much smaller than the entire file.

Corrupted email and Outlook.

I had a customer the other day that was complaining about an error message that was coming up with her Microsoft Outlook. We had just added her AOL account to Outlook using IMAP, and she was getting the following error message:

Your IMAP server wants to alert you the following: Can’t read attachment for message [messagenumber]

I did a search on this message and couldn’t find much, but did find a few other similar messages. As it turns out, looks like some of her emails or attachments were corrupted on AOL’s servers, and when Outlook was syncing them it notified her. The reason she never noticed it before was because when you browse your mail on AOL, it only downloads the email that you click on, not the entire message (obviously).

Luckily, out of her thousands of emails, only 17 returned the error.

Brian

How are these not already MATLAB functions?

EDIT: I have since been informed that you can concatenate multiple arrays together just by putting them inside braces []. Thus, you can use MATLAB’s built-in min and max functions by doing the following:

% assuming A and B are two arrays…
C = min([A B]);

You learn something new every day when you’re a novice in some language!

Nevertheless, I will keep my two functions implemented. The downside to using the above method is that it results in a full-on copy of the two vectors into a single vector. The problem with doing this is that:
1) It is slow!
2) It could result in a memory overflow if the arrays are large enough.

I have also updated my functions to handle matrices as well. Enjoy.

=============================================================================

I was at work today and needed to loop through a large amount of data in MATLAB. I was doing a diff between two arrays and needed to loop between it’s min and max values. Unless I suck at using Google, I couldn’t find any built-in MATLAB functions that return the smallest min of multiple arrays or the largest max of multiple arrays.

Here they are for your pleasure:

smallestMin

% SMALLESTMIN(X) The smallest component of all vector and/or matrix inputs.
%   SMALLESTMIN(X), where X is any number of vectors and/or matrices, is
%   the smallest component present in any of these structures.
%   SMALLESTMIN(X) is consistent with MATLAB's built-in MIN(X) for vectors
%   only!! SMALLESTMIN(X) for a matrix returns the equivalent of
%   MIN(MIN(X)).  SMALLESTMIN accepts mixed inputs of vectors and matrices.
%   A matrix is defined as having exactly 2 dimensions.
%
%   VALID INPUTS: row and column vectors and 2-D matrices.  These can
%                 contain characters, complex numbers, and numerics.
%                 Logicals, doubles, singles, int8, uint8, int16, int32,
%                 and uint32 are also valid.
%   INVALID INPUTS: cell arrays, 64-bit signed and unsigned values,
%   structures, function handles, user classes, and java classes.
%
%   These lists may not be all-inclusive.
%
%   Ex 1: If A = [1 2 3 4] and B = [-1   then smallestMin(A,B) returns -1.
%                                    2
%                                    3
%                                    4]
%
%   Ex 2: If A = [1 2   and B = [5 6 7 8] then smallestMin(A,B) returns 1.
%                 3 4]
%
%   Author:  Brian A. Schrameck
%   Version: v0.1
%   Date:    07/15/2009
%   Email:   <a href="mailto:schrameckbrian+smallestMin@gmail.com">schrameckbrian+smallestMin@gmail.com</a>
%
%   See also min, max, isnumeric, and largestMax (available on the MATLAB
%   File Exchange).

function smallestVal = smallestMin (varargin)

smallestVal = 0;

if ( length(varargin) < 1 )
    error('At least 1 argument must be present!')
end

for i=1:length(varargin)
    % This will catch at least most of the bad stuff thrown at it.  It will
    % not catch uint64/int64, or function handles; probably some other
    % stuff too.
    if ( (~isvector(varargin{i}) && ndims(varargin{i}) ~= 2 ) || ...
            iscell(varargin{i}) )
        error('Argument %d is not a vector or 2D matrix!', i)
    end
    % We will error out here if we didn't catch something earlier, as min
    % will die.
    curMin = min(min(varargin{i}));
    if ( i == 1 )
        % First time through is always the min value.
        smallestVal = curMin;
    elseif ( curMin < smallestVal )
        % Compare against current smallest value.
        smallestVal = curMin;
    end
end
end

largestMax

% LARGESTMAX(X) The largest component of all vector and/or matrix inputs.
%   LARGESTMAX(X), where X is any number of vectors and/or matrices, is
%   the largest component present in any of these structures.
%   LARGESTMAX(X) is consistent with MATLAB's built-in MAX(X) for vectors
%   only!! LARGESTMAX(X) for a matrix returns the equivalent of
%   MAX(MAX(X)).  LARGESTMAX accepts mixed inputs of vectors and matrices.
%   A matrix is defined as having exactly 2 dimensions.
%
%   VALID INPUTS: row and column vectors and 2-D matrices.  These can
%                 contain characters, complex numbers, and numerics.
%                 Logicals, doubles, singles, int8, uint8, int16, int32,
%                 and uint32 are also valid.
%   INVALID INPUTS: cell arrays, 64-bit signed and unsigned values,
%   structures, function handles, user classes, and java classes.
%
%   These lists may not be all-inclusive.
%
%   Ex 1: If A = [1 2 3 4] and B = [-1   then largestMax(A,B) returns 5.
%                                    2
%                                    3
%                                    5]
%
%   Ex 2: If A = [1 2   and B = [5 6 7 8] then largestMax(A,B) returns 8.
%                 3 4]
%
%   Author:  Brian A. Schrameck
%   Version: v0.1
%   Date:    07/15/2009
%   Email:   <a href="mailto:schrameckbrian+largestMax@gmail.com">schrameckbrian+largestMax@gmail.com</a>
%
%   See also max, min, isnumeric, and smallestMin (available on the MATLAB
%   File Exchange).

function largestVal = largestMax (varargin)

largestVal = 0;

if ( length(varargin) < 1 )
    error('At least 1 argument must be present!')
end

for i=1:length(varargin)
    % This will catch at least most of the bad stuff thrown at it.  It will
    % not catch uint64/int64, or function handles; probably some other
    % stuff too.
    if ( (~isvector(varargin{i}) && ndims(varargin{i}) ~= 2 ) || ...
            iscell(varargin{i}) )
        error('Argument %d is not a vector or 2D matrix!', i)
    end
    % We will error out here if we didn't catch something earlier, as max
    % will die.
    curMax = max(max(varargin{i}));
    if ( i == 1 )
        % First time through is always the max value.
        largestVal = curMax;
    elseif ( curMax > largestVal )
        % Compare against current largest value.
        largestVal = curMax;
    end
end
end

Aww, how cute.

Look, my first C++ project! Just look at the lines that are 196 characters long and improper indentation. Isn’t it adorable?

//<><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>\\
// Brian Schrameck																\\
// COMP 115																		\\
// Started:  September 15, 2004													\\
// Last Modified:  September 26, 2004											\\
//																				\\
// SchrameckProject1.cpp:  A program that takes a user-input dollar amount		\\
// and maximizes the amount of candy able to be bought with it.  It will		\\
// buy as much of the favorite candy as possible, and then buy as much			\\
// of the next favorite, etc., until no more candy is able to be bought.		\\
// It then displays the candy bought, the amount, and the price, as well		\\
// as the total price and the leftover money in a table, with decimal points	\\
// lined-up appropriately.														\\
//																				\\
// I have abided by the Wheaton Honor Code in this work.						\\
// - Brian A. Schrameck															\\
//<><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>\\

#include <iomanip>
#include <iostream>
using namespace std;

// Declare the price of the candies as constants in CENTS.

const int TURTLE = 225; 	// Ton o' Chocolate Turtles
const int CUPS = 90; 		// Nutty Cups
const int TRUFFLE = 55; 	// Death by Chocolate Truffles
const int TWISTER = 7; 		// Twisters

int main()
{
double cash; 					// How much money the person has to start out with.
int cents; 					    // How much money the person has in CENTS.
int turtles; 					// How many Ton o' Chocolate Turtles the person will be able to buy.
int cashafterturtle;			// How much money the person has in cents after buying the turtles.
int cups;						// How many Nutty Cups the person will be able to buy.
int cashaftercups;				// How much money the person has in cents after buying the cups.
int truffles;					// How many Death by Chocolate Truffles the person will be able to buy.
int cashaftertruffles;			// How much money the person has in cents after buying the truffles.
int twisters;					// How many Twisters the person will be able to buy.
int cashaftertwisters;			// How much money the person has in cents after buying the twisters.
float turtlescost;				// How much money the turtles cost total in dollars.
float cupscost;					// How much money the cups cost total in dollars.
float trufflescost;				// How much money the truffles cost total in dollars.
float twisterscost;				// How much money the twisters cost total in dollars.
float total; 					// How much money is spent total.
float change;					// How much money is leftover.

// Ask the user how much money they have to spend.

cout << "How much money do you have to spend on the candy?" << endl;
cin >> cash;

// If the user provides 0 as their cash amount, they will be asked to go get some money.

while  (cash == 0)
{
cout << "You can't buy something for nothing!  Get some money, silly!" << endl;
cout << "How much money do you have to spend, now?" << endl;
cin >> cash;
}

// Convert the amount of money from dollars into cents.

cents = 100.0 * cash;

// Find out how many of each candy you will be able to buy, then find out how much change you have leftover after each.

turtles = cents / TURTLE;
cashafterturtle = cents % TURTLE;

cups = cashafterturtle / CUPS;
cashaftercups = cashafterturtle % CUPS;

truffles = cashaftercups / TRUFFLE;
cashaftertruffles = cashaftercups % TRUFFLE;

twisters = cashaftercups / TWISTER;
cashaftertwisters = cashaftercups % TWISTER;

// Find out how much the total amount of each batch of candy cost.

turtlescost = (TURTLE / 100.0) * turtles;
cupscost = (CUPS / 100.0) * cups;
trufflescost = (TRUFFLE / 100.0) * truffles;
twisterscost = (TWISTER / 100.0) * twisters;

// Find out the total cost.

total = turtlescost + cupscost + trufflescost + twisterscost;

// Find out how much money will be leftover.

change = cash - total;

// Display the "receipt".

cout << endl;
cout << "^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^" << endl;
cout << "|                                                   |" << endl;
cout << "|                Brian's Candy Shop                 |" << endl;
cout << "|                   Your Receipt                    |" << endl;
cout << "|                                                   |" << endl;
cout << "|                                                   |" << endl;
cout << "| Candy                        Amount       Price   |" << endl;
cout << "| ------------------------------------------------- |" << endl;
cout << "| " << setprecision (2) << fixed << "Ton o' Chocolate Turtles" << setw (6) << "| " << setw (4) << turtles << " |" << setw (5) << "$" << setw (6) << turtlescost << "   |" << endl;
cout << "| " << "Nutty Cups" << setw (20) << "| " << setw (4) << cups << " |" << setw (5) << "$" << setw (6) << cupscost << "   |" << endl;
cout << "| " << "Death by Chocolate Truffles" << setw (3) << "| " << setw (4) << truffles << " |" << setw (5) << "$" << setw (6) << trufflescost << "   |" << endl;
cout << "| " << "Twisters" << setw (22) << "| " << setw (4) << twisters << " |" << setw (5) << "$" << setw (6) << twisterscost << "   |" << endl;
cout << "| ------------------------------------------------- |" << endl;
cout << "| TOTAL:" << setw (35) << "$" << setw (6) << total << "   |" << endl;
cout << "| ------------------------------------------------- |" << endl;
cout << "| Leftover:" << setw (32) << "$" << setw (6) << change << "   |" << endl;
cout << "|                                                   |" << endl;
cout << "|                                                   |" << endl;
cout << "|                    Thank you!                     |" << endl;
cout << "|                Please come again!                 |" << endl;
cout << "|                                                   |" << endl;
cout << "|                                                   |" << endl;
cout << "^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^" << endl;

return 0;
}

Sample output with an input of $50.00

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|                                                   |
|                Brian's Candy Shop                 |
|                   Your Receipt                    |
|                                                   |
|                                                   |
| Candy                        Amount       Price   |
| ------------------------------------------------- |
| Ton o' Chocolate Turtles    |   22 |    $ 49.50   |
| Nutty Cups                  |    0 |    $  0.00   |
| Death by Chocolate Truffles |    0 |    $  0.00   |
| Twisters                    |    7 |    $  0.49   |
| ------------------------------------------------- |
| TOTAL:                                  $ 49.99   |
| ------------------------------------------------- |
| Leftover:                               $  0.01   |
|                                                   |
|                                                   |
|                    Thank you!                     |
|                Please come again!                 |
|                                                   |
|                                                   |
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A crash course in Perl and DNA.

In my Senior year of undergrad I took a class called “DNA.”  It was a cross-listed course; Computer Science and Biology.  I thought it might be interesting, and it involved coding up some stuff in Perl.  The course was basically a crash-course in both Perl and DNA. As such, some of the starting code (such as reading from a file) was given to us. Lucky!

For our final project we had to come up with an application summing up the knowledge we learned in class.  I decided to do the following:

Summary

In DNA there are four possible nucleotides that can be used: T, C, A, and G.  Codons are formed by joining three of these nucleotides together.  This means that there are sixty-four possible codons (4x4x4).  The purpose of these codons is to provide a mechanism for the organism to produce amino acids.  There are 20 different amino acids.  This means that there will be some codons that, while having different nucleotide structures, allow for the production of the same amino acids.

Different organisms may show different preference, or a bias, towards different codons to encode the same amino acid.  The reasons for this happening are still vague, but much research is being done to get a clue as to why certain codons are preferred over others.

In bacteria, for example, different variables have been found to contribute to certain codon usage.  Some of these include GC-composition, the likelihood of a specific strand to mutate, and RNA stability. (Curr Issues Mol Biol., Oct 3 2001)  It has also been found that codon bias may be used to improve translation efficiency. (Yizhar Lavner, Daniel Kotlar, 2004)

Proposed Solution

The software being proposed will be able to compare different biases toward codons in any set of organisms.  It will be able to find GC-rich regions, as well as output information to an Excel format file for analysis.  By reading in a DNA sequence, breaking it up into codons, and using a hash table to keep track of the counts of each different codon, a statistical analysis can be performed on the bias towards one codon or another to encode the same amino acid.  A visual analysis can be performed using Microsoft Excel’s chart capabilities.

The program will start by getting the input.  This is done by using glob to grab a bunch of files stored in a local folder.  The file names will be stored in an array.  The program will then enter a while loop to go through each file in the array and perform work on it.

#!/usr/bin/perl

use strict;
use warnings;

# needed for MacOS X if using BBedit or TexWrangler (see comments below)
use File::Spec;

#-------------------------------------------------------------------------------
# CONSTANTS
#-------------------------------------------------------------------------------

my $MAX_CODONS_OUTPUT = 64;               # top-N most frequent codons
my $dataDirectory = "inputGeneDirectory"; # relative path of directory holding input files
my $outputFileName = "outputResults.xls"; # set-up OUTPUT file (tab-delimited for Excel)
my $TRUE = 1;                             # bool value for true
my $FALSE = 0;                            # bool value for false
my $CODON_LENGTH = 3;                     # size of codon to use in current run

#-------------------------------------------------------------------------------
# VARIABLES/DATA STRUCTURES
#-------------------------------------------------------------------------------

# ARRAYS
my @fileList;      # all files in the directory ($dataDirectory) of data files
my @allCodons;     # an array to hold each codon
my @aminoAcids;    # a list of frequency counts for each aminoAcid and codons
my @sorted;        # a list of the counts sorted highest to lowest

# HASH MAPS
my %codonCounts;        # a hash table of frequency counts for each codon
my %codonMap;           # declare a hash table to hold the Amino Acid abbreviations
                        # corresponding to each possible triple of DNA (codon).
                        # Amino Acids are abgreviated with the ONE LETTER codes.
                        # (see the table at the bottom of this program for AA codes)

# VARIABLES
my $pathname;           # the path to the files
my $filename;           # the file name being worked on
my $nextFile;           # the next file in the list of files
my $sequence;           # the DNA sequence being read in
my $sequenceLength;     # the length of the DNA sequence
my $numCodons = 0;      # the number of codons found
my $nextCodon;          # scalar to hold each codon, one at a time
my $numSortedCodons = 0;# the number of sorted codons, useful if less than N
my $numToPrint;         # the final number of top-N codons to print
my $num = 0;            # counter variable, i would not work for some reason
my $rank = 0;           # the rank of the current codon
my $currentCount = 0;   # the number of times the current codon was found
my $proportion;         # the proportion of this codon compared to all codons
my $GCRich;             # the GC-rich region that was found
my $lengthGC;           # the length of this GC-rich region
my $firstGCRich = $TRUE;# is this the first GC-rich region found for this file?
my $sumGCLengths = 0;   # the sum of all of the GC-rich region lengths
my $percentGCs = 0;     # the percentage of the file that is rich in GC
my $nextAcid;           # the next amino acid in the array

#===============================================================================
#                      START PROGRAM
#===============================================================================

%codonMap = getAminoAcidLookUpTable(); # populate the codon hash map->amino acids

printTitle();             # print the title of the report (to the console)

#============================
# SET OUTPUT INFORMATION

#set the path name for the directory
setPathname ();

# set the output filename
$outputFileName = $pathname.$outputFileName;

# create the output file (tab-delimited for Excel)
open (OUTPUT, ">", $outputFileName)
    or die "Cannot create the file: $outputFileName: $!";
#============================
# SET INPUT INFORMATION

# use filename with .fna to get all FASTA files in the directory
$filename = $pathname.$dataDirectory."/*.fna";

# grab a "glob" of files, each filename is put into elements of the array @fileList
@fileList = glob($filename);

The first step when entering the while loop is to get the DNA from the file.  The input expected will be a FASTA format file.  The sequence must also begin with a start codon and end with a stop codon.

#========================================================
# FOREACH file in the directory of input data files

print "========================================================\n\n";

foreach $nextFile (@fileList)
{
    %codonCounts = ();       # clear the hash table

    # print the current filename to the screen (console, stdout)
    print "Next file is: $nextFile \n";

    # print filename to Excel (output) file
    print OUTPUT "$nextFile\n";

    # skip Linux files with names that start with ".";
    if ($nextFile =~ m/^\./)
    {
        next;
    }

    # get the DNA sequence and its length.
    $sequence = readInDNA($nextFile);
    $sequenceLength = length ( $sequence );

    print "Length of the DNA sequence: $sequenceLength \n\n";

    # get the last codon from the sequence
    my $lastCodon = substr ( $sequence, $sequenceLength - 3, 3 );

    # check if the sequence starts with an ATG.
    if ( substr ( $sequence, 0, $CODON_LENGTH ) ne "ATG" )
    {
        print "Gene must start with ATG.\n";
        print OUTPUT "Gene must start with ATG. Results may not be valid!\n\n";
    }

    # check if the sequence ends with a STOP.
    if ( $lastCodon ne "TAG" && $lastCodon ne "TAA" && $lastCodon ne "TGA" )
    {
        print "Gene must end with stop codon.\n";
        print OUTPUT "Gene must end with stop codon. Results may not be valid!\n\n";
    } 

The next step is to break the DNA sequence given into codons.  Since we know that the sequence is in the correct reading frame, no changes in the reading frame will need to occur.  For each codon a count will be increased.  This will be implemented using a hash table.  So if the sequence is CCACCA, the hash table will have one entry, CCA, with a count of 2.

# break the sequence into codons, storing each codon in a slot in an array
    @allCodons = breakIntoCodons($sequence );
    $numCodons = scalar ( @allCodons );    # set the number of codons found

    print "Number of codons: $numCodons \n";
    print OUTPUT "Number of codons: $numCodons \n";

    # go through the array of codons...
    foreach $nextCodon (@allCodons)
    {
        $codonCounts { $nextCodon }++;    #increase the count for this codon
    }

After reading all of the codons in the sequence, all of the unique codons found, along with their counts, will be output to an Excel format .xls file.  This is the first of two Excel files being used for this project.  Next, the hash table will be analyzed to see which types of amino acids are present.  These will be matched up to the codons to get a count of how many amino acids were formed using a specific codon.  The results of this will be output to the same Excel format file as above.  This will include all of the amino acids, sorted by the most biased codons that formed that amino acid.

# sort the array of codons by their count
    @sorted =  sort { $codonCounts{$b} <=> $codonCounts{$a} or $a cmp $b } keys(%codonCounts);

    # get the length of how many unique codons were found
    $numSortedCodons = scalar ( @sorted );

    # get the amino acid that corresponds to each unique codon
    for ( $num = 0; $num < $numSortedCodons; $num++ )
    {
        $aminoAcids [$num] = translate ( $sorted[$num] );
    }

The next part is to find GC-rich regions in the sequence.  The point of doing this is to find if there is a certain bias only during GC-rich regions.  The same procedure will be applied as above, but a new file will be created just for the GC-rich regions.

#===============================================================================
#		start search for GC-rich regions
#===============================================================================

    print "Searching for GC-rich regions greater than length 4...\n";

    # Search for GC-rich sequences of 4G/Cs or more
    while ( $sequence =~ m/([GC]{4,})/g )
    {
        $GCRich = $1;                       # the region
        $lengthGC = length ( $GCRich );     # length of region
        $sumGCLengths += $lengthGC;	        # total length of GC regions
        $firstGCRich = $FALSE;              # no long first region found
    }

    # if this is the first GC-rich region in the file...
    if ( $firstGCRich == $TRUE )
    {
        print "No GC-rich regions were located. \n";
        print "========================================================\n\n";
    }
    else
    {
        print "Found some! \n";
        print "========================================================\n\n";
    }

    # find the GC percentage
    $percentGCs = ( $sumGCLengths / $sequenceLength ) * 100;

    $sumGCLengths = 0;      # start over the GC-length
    $firstGCRich = $TRUE;   # set the next GC-rich region found as the first

The next part is to loop back to the beginning and move-on to the next file.  Each file will be separated by a few spaces.

The last step of the program deals with the output to the screen.   The program will output the number of codons found, each file name, and the relevant number of amino acids found.

#===============================================================================
#			start outputting results
#===============================================================================

    printf OUTPUT "GC%%: \t%5.1f \n\n", $percentGCs;
    printf OUTPUT "Rank\tCodon\tAmino Acid\tFrequency\tProportion\n";

    # if there are less than N "top-N" codons found
    if ( $MAX_CODONS_OUTPUT > $numSortedCodons )
    {
        $numToPrint = $numSortedCodons;	# set the number to output
    }

    # otherwise...
    else
    {
        $numToPrint = $MAX_CODONS_OUTPUT; # set to constant N, defined at top
    }

    # print the top N codons: rank, codon, occurences, and proportion
    for ( $num = 0; $num < $numToPrint; $num++ )
    {
        $rank++;     					                # set this codon's rank
        $nextCodon = $sorted[$num];			            # get the codon
        $currentCount = $codonCounts { $nextCodon };    # get the count
        $proportion = $currentCount / $numCodons;       # calculate proportion
        $nextAcid = $aminoAcids[$num];                  # get the next A.A.

        # don't output start or end codons, they are unnecessary
        if ( $nextAcid ne "..." && $nextAcid ne "Met" )
        {
            # now print...
            printf OUTPUT "$rank\t$nextCodon\t$nextAcid\t$currentCount\t%5.3f\n", $proportion;
        }
        else
        {
            # if it was a start or end codon, decrease rank by 1
            $rank--;
        }
    }

    $rank = 0;		        # start the ranks over
    $percentGCs = 0;        # start over the %GCs

    # make some space in the Excel otuput between tables
    print OUTPUT "\n\n\n";
}

# w00t, done!
close ( OUTPUT );
print "\n\nDone.\n";

#----- end of Main Program -----------------------------------------------------

The rest of the file contains various subroutines to read in the DNA and break it apart into codons, as well as set up the codon map.

#-----------\
# readInDNA \
#-------------------------------------------------------------------------------
# Subroutine to open a FASTA formatted file of DNA sequence, remove
# newline characters at the end of each line, discard any blank lines
# discard lines containing a comment(#) symbol, removes any digits,
# removes any extra whitespace characters, converts nucleotides
# to uppercase, and return as one big string.
#
# WRITTEN BY MARK LEBLANC/BETSY DYER
#
# 1 argument:  name of file holding the DNA
#
# RETURNS:  file of DNA as one string
#
# A sample of how you might "call" this subroutine from your program
# to 'get' the DNA from a Fasta formatted file for a gene sequence
#
#                  $sequence = readInDNA("humanXM_012345.fna");

sub readInDNA
{
    my $filename = shift;
    my $line;
    my $firstline;
    my $seq;

        # try to open the file; if no file by that name, print a warning and exit
        open (FNAFILE, "<", $filename)
            or die "Cannot open the input file: $filename: $!";

        if ( !($firstline = <FNAFILE>) )
        {
               print "Can NOT read the header line from the file called: $filename \n";
               exit();
         }

    $seq = "";

      # WHILE not EOF, grab next line
    while ($line = <FNAFILE>)
    {
        chomp $line;

            # discard blank line
            if ($line =~ m/^\s*$/)
            {
                    next;
                }

            # discard comment line (any line with a comment(#) symbol)
            if ($line =~ m/^\s*#/)
            {
                    next;
                }

                $line =~ tr/0123456789//d;    # remove all digits
        $line =~ tr/     \n\r//d;        # remove any extra whitespace
        $line = uc($line);        # convert everything to upper-case

            $seq = $seq . $line;        # concatenate new line onto the entire text so far

    } # end WHILE not EOF

    close(FNAFILE);

    return $seq;
}
#----- end of readInDNA --------------------------------------------------------

#-----------------\
# breakIntoCodons \
#-------------------------------------------------------------------------------
# Subroutine to break the text passed to it into codons of length $CODON_LENGTH.
#
# Partially WRITTEN BY MARK LEBLANC/BETSY DYER
#
# RETURNS: an array of codons.

sub breakIntoCodons
{
    my $text = shift;        # the sequence being searched
       my $DNAlen = length($text);     # the length of the sequence
    my @codons;              # array of codons to fill and return when done
    my $i=0;                # points to start of the next codon each time
    my $j=0;                # $codon[$j] is the next codon
        my $end = $DNAlen-($CODON_LENGTH);    # don't fall off the end of the sequence
        my $tempMotif;            # the current codon being worked on

        while ( $i < $end )
       {
               # get the next codon
               $tempMotif = substr($text, $i, $CODON_LENGTH);

                # if there is some other letter than ACGT...
              if ( $tempMotif =~ m/[^ACGT]/i )
               {
                    # skip to next valid codon
                 $i += $CODON_LENGTH;
               }

                # otherwise, add this codon to valid codons
               else
               {
                 $codons[$j] = $tempMotif; # add to array of codons
                $i += $CODON_LENGTH;     # move to next codon
                 $j++;                    # move to next box in array
               }
    }

        return @codons;
}
#----- end of breakIntoCodons --------------------------------------------------

#-------------\
# setPathname \
#-------------------------------------------------------------------------------
# Subroutine to set the pathname.  Useful to keep the main program a little
# more readable.  Expects a global variable of $pathname.
#
# WRITTEN BY MARK LEBLANC/BETSY DYER
#
# RETURNS: nothing.

sub setPathname
{
    #-----------------------------------------------------------------------
    # establish the absolute pathname to the directory of input files
    #-----------------------------------------------------------------------
    # MacOS USERS TAKE NOTE
    # determine the absolute pathname to the current directory
    # next two lines need to be uncommented for MacOS X (e.g., BBedit or TexWrangler)
    # my @path       = File::Spec->splitdir( File::Spec->rel2abs($0) );
    # $pathname      = File::Spec->catfile( @path[0 .. $#path - 1], "");
    #-----------------------------------------------------------------------
    # uncomment if using
    # WINDOWS or Linux (or MacOS X on command-line)
    $pathname   = "";
    #-----------------------------------------------------------------------
}
#----- end of setPathname ------------------------------------------------------

#------------------------\
# getAminoAcidLookUpTable \
#-----------------------------------------------------------------------
# Construct hash that maps codons to amino acids by reading the table
# from DATA lines at the end of the program, which have the form:
# Residue Codon1 Codon2 ...
#
# no arguments
#
# WRITTEN BY MARK LEBLANC/BETSY DYER
#
# RETURNS: hash table of codon and their one-letter amino acid abbreviations

sub getAminoAcidLookUpTable
{
  my %codonMap;

  while (my $in = <DATA> )    # assigns next line of DATA to $in; fails if none
  {
    chomp($in);                   # remove line feed at end of $in
    my @codons = split " ",$in;
    my $residue = shift @codons;  # remove first item from @codons and assign
    foreach my $nnn (@codons) {
        $codonMap{$nnn} = $residue;
    }
  }

  return %codonMap;

}
#----- end of getAminoAcidLookUpTable ------------------------------------------

#-----------\
# translate  \
#-----------------------------------------------------------------------
# translates (presumably pre-transcribed) DNA strings to proteins
#
# 1 argument: the DNA string to translate
#
# WRITTEN BY MARK LEBLANC/BETSY DYER
#
# RETURNS: the protein string. (1-letter Amino Acid abbreviations)

sub translate
{
    my $dna = shift(@_);  # the DNA string to be translated.

    my $pro = "";
    while ( $dna =~ s/(...)// ) {
        $pro = $pro . $codonMap{$1};
    }
    return $pro;
}
#----- end of translate --------------------------------------------------------

#define the mapping from codons to amino acids...
__END__
Ala GCT GCC GCA GCG
Arg CGT CGC CGA CGG AGA AGG
Asn AAT AAC
Asp GAT GAC
Cys TGT TGC
Gln CAA CAG
Glu GAA GAG
Gly GGT GGC GGA GGG
His CAT CAC
Ile ATT ATC ATA
Lei TTA TTG CTT CTC CTA CTG
Lys AAA AAG
Met ATG
Phe TTT TTC
Pro CCT CCC CCA CCG
Ser TCT TCC TCA TCG AGT AGC
Thr ACT ACC ACA ACG
Trp TGG
Tye TAT TAC
Val GTT GTC GTA GTG
... TAA TAG TGA

Results

The results proved that there is a definite bias towards some codons to create certain amino acids, namely asparagine, aspartic acid, and lysine. GC-richness appeared to have no effect on codon bias. Any answers as to why there is a codon bias at all are still being researched.

For the full project, including raw notes and code, download this zip file: codonBiasFinal
(Note: this project is kind of thrown together, meaning standard coding practices may not have been followed, such as breaking up parts of the program into subroutines, etc. If I remember correctly, it was a requirement of the professors that all work be done in the provided subroutines, with no new subs created, but I could be wrong. Either way, looking at the code now it’s atrocious.)