How to Use Parquet with MapReduce

Parquet is a great file format for use with higher-level tools like Impala, Hive, Pig, and Spark.  But what if you want to use it in MapReduce?  Cloudera provides an easy-to-follow example of how to do this, and it is a perfect guide for basic usage of the Parquet MapReduce API.  As an enhancement, more speed can be gained by using a different object model.

The Parquet SimpleGroup toString() method, which is what is utilized in MapReduce when using the default Parquet object model, is extremely slow.  I had a client recently with a job that was taking over an hour to run with only 1.9 GB of data (18 GB uncompressed) because they were following the sample code.

The solution to this problem is to use a different in-memory object format.  There are two pieces to using Parquet: the object format and the storage format.  Parquet provides its famous binary columnar storage format and excellent compression.  It also provides an object model in the form of the “example” Group class.  Other object models exist, though, including Avro, Google Protocol Buffers, Thrift, Hive, and Pig.  You still get the benefits of Parquet’s efficient storage mechanism, but you get the added benefit of a more robust and versatile in-memory object model to manipulate data after you’ve loaded it.

After moving the object model for this particular client to Avro, the job duration dropped to under 7 minutes.  Check out how to use Parquet with an Avro object model instead on my GitHub and let me know if you have any questions.

How To Align Partitions on an Upgraded iPod Hard Drive in Windows

If you’ve upgraded your iPod using a newer, larger drive, there’s a chance that the new drive uses “Advanced Format” sectors.  Drives that use Advanced Format have physical sectors larger than 512 bytes, generally 4K bytes.

If you upgraded your iPod using one of these drives, it will work, but performance will be degraded significantly.  This is because iTunes partitions the drive inside of your iPod with no consideration about whether it is an Advanced Format drive or not.  iTunes uses two partitions when formatting:

  1. The first partition is the “system” hidden partition.  This is where the iPod stores its operating system.
  2. The second partition is free space formatted as FAT32.  This is the partition we are concerned about.

I upgraded my 5G iPod Video 60GB to a 120GB drive using a Samsung HS12YHA, but I was really disappointed when syncing took forever and write speeds were less than 1MB/s.  This occurs when the logical sectors on the disk are unaligned with the physical sectors.  The iPod uses 2K sectors.  Here’s a diagram to show what I mean:

[Diagram: the difference between aligned and unaligned sectors]

If you take a look at the unaligned picture, you can see that when the OS writes to disk (using logical sectors, as usual), a logical sector may span two physical sectors.  Because a sector is the smallest unit that a disk can address, writing to part of a physical sector means the entire sector has to be read, the change made, and the sector written back to disk.  As such, if you change a logical sector that spans multiple physical sectors, then both physical sectors need to be read entirely, changed, and written to disk again, causing the drive to do a lot of extra work.
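To make that cost concrete, here is a minimal Python sketch that lists which physical sectors a single write touches.  The function name, the byte offsets, and the 4096-byte physical sector size are assumptions chosen for illustration; substitute the values for your own drive.

```python
def physical_sectors_touched(offset_bytes, length_bytes, physical_sector_bytes=4096):
    """Return the indices of the physical sectors covered by one write."""
    if length_bytes <= 0:
        raise ValueError("length_bytes must be positive")
    first = offset_bytes // physical_sector_bytes
    last = (offset_bytes + length_bytes - 1) // physical_sector_bytes
    return list(range(first, last + 1))


# Aligned: a 4096-byte write that starts exactly on a physical sector boundary.
aligned = physical_sectors_touched(offset_bytes=8192, length_bytes=4096)

# Unaligned: the same write shifted by 1024 bytes straddles two physical sectors.
unaligned = physical_sectors_touched(offset_bytes=8192 + 1024, length_bytes=4096)

print(aligned)    # [2]
print(unaligned)  # [2, 3]
```

The unaligned write touches twice as many physical sectors for the same amount of data, which is where the extra read-modify-write work comes from.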

What we want to achieve is a drive that has its logical sectors aligned with its physical sectors.  To do this, the start of the data partition on the iPod needs to start at a physical sector.  Here’s how to do it on Windows.

WARNING: YOU WILL LOSE ALL DATA ON YOUR IPOD WHEN DOING THIS.  ALSO, BE EXTREMELY CAREFUL THAT YOU DO NOT EDIT THE WRONG DISK.  I am not responsible if something goes wrong.

  1. Restore your iPod using iTunes.
  2. Download Symantec’s Partition Table Editor, PTEDIT32, from this link.
  3. Open PTEDIT and select your iPod from the drop-down menu.  You can identify it by matching the disk size with your iPod's size.  You may have to start PTEDIT as an administrator by right-clicking it and choosing “Run as administrator”.
  4. You now need to figure out which sector partition 2 should start on.  DO NOT alter partition 1!  To determine which sector to use, increase the number in “Sectors Before” for partition 2 until it is divisible by 8.  Keep track of how many sectors you add!  For instance, if your “Sectors Before” field reads 224910 (as mine did) you would increase this to 224912 because 8 divides evenly into it.
  5. Add the same number of sectors to the Starting Sector field for partition 2.  My disk started at sector 1, so I increased this to 3.
  6. Subtract the same number of sectors from the Sectors field for partition 2.  My partition had 234216737 sectors, so I made this 234216735.
    DO NOT alter any other fields!
    Here’s what my final partition table looked like:
  7. Save your changes.  Exit PTEDIT and “safely remove” your iPod.
  8. The iPod should reboot, but it’s probably going nuts trying to understand what happened to the partitions, as you’ve destroyed the file system.  You’ll need to boot the iPod into Disk Mode to continue.
  9. Connect the iPod to the computer.
  10. Download fat32format and extract it somewhere convenient.
  11. Open a command prompt and navigate to the spot where you extracted fat32format.
  12. Make SURE you type this command correctly.  Type:

    fat32format IPOD_DRIVE_LETTER_HERE:

  13. This will quickly format the iPod’s data partition.  When it’s done, continue.
  14. Disconnect the iPod and reconnect it to the computer.
  15. You should now be able to use your iPod at full speed!
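
The arithmetic in steps 4 through 6 can be sketched in a few lines of Python.  The function and field names below are made up for illustration, and the code only reproduces the numbers; it does not read or write a partition table, so you still enter the results in PTEDIT by hand.

```python
def align_partition(sectors_before, starting_sector, sectors, multiple=8):
    """Move a partition start up to the next multiple and shrink it to match."""
    # Number of sectors to add; 0 if the start is already aligned.
    shift = (-sectors_before) % multiple
    return {
        "shift": shift,
        "sectors_before": sectors_before + shift,    # step 4
        "starting_sector": starting_sector + shift,  # step 5
        "sectors": sectors - shift,                  # step 6
    }


# Values taken from the walkthrough above.
result = align_partition(sectors_before=224910, starting_sector=1, sectors=234216737)
print(result)
# {'shift': 2, 'sectors_before': 224912, 'starting_sector': 3, 'sectors': 234216735}
```
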
As always, if you have any questions, post them here and I’ll do my best to help.

New (and awesome) Programming Jargon

I stumbled across this post the other day on globalnerdy.com.

There are some real gems in here, and I’m sure those of you that have developed software before can relate to at least a few of them. Here are some of my favorites:


Yoda Conditions

The act of using:

if (constant == variable)

instead of:

if (variable == constant)

It’s like saying “If blue is the sky”.


Bugfoot

A bug that isn’t reproducible and has been sighted by only one person.


Hindenbug

A catastrophic data-destroying bug. Oh, the humanity!

Sync your iTunes library and settings across multiple computers on Windows Vista/7.

You will need Windows Vista or Windows 7, the newest version of iTunes, and a Dropbox account for this tutorial. A Dropbox referral link is provided in this article, which allows both you and me to get more storage space! This is also doable on Mac OS X, but I do not have access to a machine for screenshots and testing. It is also doable with Windows XP, but the commands at the CLI are a bit different. I’ll update this later with that information.

Note that both computers must have the SAME user name.

REMEMBER: When working with your personal data it is IMPERATIVE that you back everything up before you start working. Anything can happen, and I cannot be held accountable for data loss.

Let’s take the situation of 2 PCs, a laptop and a desktop. You want your desktop to function as a media server with all of your music for your laptop to access over the network. Let’s also assume that you have a separate hard drive to make things easier.

Let’s assume again that your media is stored on a different drive than your iTunes library. I use my M: drive. We need to have this path mapped on the laptop to access the music across the network. Map the drive to the same drive letter on both PCs. There are many how-to’s on the net for reference here, so it won’t be covered.

1. Next, download Dropbox from here:
Dropbox

2. Install Dropbox using the instructions provided on their website. Once it is done installing and you have logged in and gone through the tutorial, come back here.

Welcome back.

Now, we are going to use Dropbox to sync our iTunes library AND settings between multiple computers. I like to have exactly the same settings and library on both computers.

3. The first step is to move our library files (not the music files) into the cloud! Go to your Music folder (C:\Users\USERNAME\Music) and CUT the iTunes folder.
(Note: it is shown as a shortcut in the screenshot; it will not be a shortcut on your computer.)

[Screenshot: 1iTunesDropbox]

4. Now navigate to your new Dropbox folder (by default “C:\Users\USERNAME\My Dropbox”) and create a folder called “iTunes”.

[Screenshot: 2iTunesDropbox]

5. Now go to that folder and paste your iTunes folder that you just cut from your Music folder.

[Screenshot: 3iTunesDropbox]

6. Now time for some command prompt fun! Go to the start menu and type “cmd” then right click cmd.exe and click “Run as administrator”

[Screenshot: 4iTunesDropbox]

7. Click “Yes” and you will see this:

[Screenshot: 5iTunesDropbox]

8. Now type:

cd C:\Users\USERNAME\Music\
mklink /d iTunes "..\My Dropbox\iTunes\iTunes"

You will see:

[Screenshot: 6iTunesDropbox]

Now, when you double-click the link you just created, it will bring you to your iTunes music folder. Remember you moved it to the Dropbox folder. But look at the address bar!

[Screenshot: 7iTunesDropbox]

Windows looks at this as an actual folder, but every file here is actually one of the files in your Dropbox iTunes folder! Any file created, deleted, or modified here has the same done to it in the Dropbox folder.
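
If you want to see this behavior for yourself without touching your real library, here is a small Python sketch that builds a throwaway folder and a directory link to it.  The folder names are stand-ins I made up, and os.symlink is used in place of mklink /d; on Windows it may need the same administrator rights.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    real_dir = os.path.join(root, "Dropbox", "iTunes")  # stand-in for the Dropbox copy
    link_dir = os.path.join(root, "Music", "iTunes")    # stand-in for the link location
    os.makedirs(real_dir)
    os.makedirs(os.path.dirname(link_dir))

    # Similar in spirit to `mklink /d iTunes <target>`.
    os.symlink(real_dir, link_dir, target_is_directory=True)

    # Write a file through the link...
    with open(os.path.join(link_dir, "library.txt"), "w") as f:
        f.write("hello")

    # ...and it shows up in the real folder.
    files_in_real_dir = os.listdir(real_dir)
    link_is_symlink = os.path.islink(link_dir)

print(files_in_real_dir)  # ['library.txt']
print(link_is_symlink)    # True
```
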

9. Now go to your laptop and install Dropbox on this computer.

10. (Assuming iTunes is already installed) Go to your music folder on this computer (C:\Users\USERNAME\Music) and delete the iTunes directory.

11. Follow steps 6 – 8 again on this computer.

12. Go back to your desktop and go to the start menu. Type “%appdata%” and hit enter.

[Screenshot: 8iTunesDropbox]

13. Navigate to “Apple Computer” and CUT the iTunes folder from inside of it.

14. Go to your Dropbox and make a new folder titled “AppData” and paste the iTunes folder into it.

15. Start up the command prompt again (step 6). This time, use the following commands:

cd "C:\Users\USERNAME\AppData\Roaming\Apple Computer"
mklink /d iTunes "..\..\..\My Dropbox\iTunes\AppData"

16. Go to your laptop and follow steps 12 – 15.
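
A relative target given to mklink is resolved against the folder that contains the link, so it is worth checking where each one actually lands before you run the command.  Here is a minimal Python sketch of that check; the helper name is made up, and it uses the standard ntpath module so it applies Windows path rules on any operating system.

```python
import ntpath  # Windows path rules, usable on any OS


def resolve_link_target(link_parent, relative_target):
    """Resolve a relative mklink target against the folder containing the link."""
    return ntpath.normpath(ntpath.join(link_parent, relative_target))


# The two targets used in steps 8 and 15.
music_target = resolve_link_target(
    r"C:\Users\USERNAME\Music",
    r"..\My Dropbox\iTunes\iTunes",
)
appdata_target = resolve_link_target(
    r"C:\Users\USERNAME\AppData\Roaming\Apple Computer",
    r"..\..\..\My Dropbox\iTunes\AppData",
)

print(music_target)    # C:\Users\USERNAME\My Dropbox\iTunes\iTunes
print(appdata_target)  # C:\Users\USERNAME\My Dropbox\iTunes\AppData
```

If a resolved path is not the folder you actually moved your files into, adjust the target before running mklink.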

Now it’s a waiting game. Wait until the Dropbox tray icon has stopped being a blue “recycling” or “refresh” icon and has become a green “check” icon. At this point, fire up iTunes. You should be ready to go.

This is doable without Dropbox, but the benefit of using Dropbox is that all of the library files are stored locally on each computer's hard drive. This means that changing song tags and browsing the library in general are MUCH faster than over the network. Dropbox does a binary diff between files and only uploads the bits that have changed, so when uploading a change to your library the upload is much smaller than the entire file.
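
To give a feel for why a diff-based upload is so much smaller, here is a toy Python sketch that splits two versions of a file into fixed-size blocks and reports which blocks changed.  This is only an illustration of the general idea, not Dropbox's actual algorithm; the function name and the tiny 4-byte block size are my own assumptions.

```python
import hashlib


def changed_blocks(old, new, block_size=4):
    """Return the indices of fixed-size blocks in `new` that differ from `old`."""
    def digests(data):
        return [
            hashlib.sha256(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)
        ]

    old_digests, new_digests = digests(old), digests(new)
    return [
        i for i, digest in enumerate(new_digests)
        if i >= len(old_digests) or digest != old_digests[i]
    ]


old_library = b"AAAABBBBCCCCDDDD"
new_library = b"AAAABBBBCxCCDDDD"  # one byte changed in the third block

print(changed_blocks(old_library, new_library))  # [2]
```

Only one of the four blocks would need to be sent in this example, even though the file itself is four blocks long.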

Corrupted email and Outlook.

I had a customer the other day that was complaining about an error message that was coming up with her Microsoft Outlook. We had just added her AOL account to Outlook using IMAP, and she was getting the following error message:

Your IMAP server wants to alert you the following: Can’t read attachment for message [messagenumber]

I did a search on this message and couldn’t find much, but did find a few other similar messages. As it turns out, it looks like some of her emails or attachments were corrupted on AOL’s servers, and when Outlook was syncing them it notified her. The reason she never noticed it before was that when you browse your mail on AOL, it only downloads the email that you click on, not the entire mailbox (obviously).

Luckily, out of her thousands of emails, only 17 returned the error.

Brian