So I'm moving out of my current apartment shortly, ahead of giving a talk at Hadoop World NYC and going on a two-week vacation overseas. I'm going to be putting my stuff in storage while I'm away, and this includes my hard drives with my photos and music on them. It occurred to me: what if they get stolen, or the drives get damaged in transit?
It was time to make a backup. Long overdue, really.
There were several options. The easiest option probably would have been to go and pick up another 500GB external drive, copy the existing one that has the goodies on it, and leave that with a friend. The cheap option would perhaps be to back up onto DVDs.
But I've been messing around with Amazon's Simple Storage Service (S3) an awful lot lately, so it occurred to me that I could just back up to the cloud like I've been doing at work. So I wrote it in Ruby, making use of the aws-s3 gem, which makes dealing with S3 almost trivial.
I did these steps on Ubuntu; adjust appropriately for whatever OS you use.
First things first - an Amazon account
If you're going to do things this way, you'll need an account on Amazon Web Services (AWS). Note that this stuff isn't free - you will get charged $0.15/GB per month for storage. However, keeping it running is someone else's problem, not yours. If you've ever shopped at Amazon you can just use the same account you shop with to sign into AWS.
Once you're signed up and signed in, click on the Your Account menu near the top, and select Security Credentials. You'll see a section on this page called Access Credentials, and as part of that, Your Access Keys.
This contains the two pieces of information you need to be able to programmatically connect your machine to S3 - your "Access Key ID" and "Secret Access Key". These are pretty much like a username/password combination that provides access to the account identified by your email address. (You could have more than one, or even multiple accounts for one email address since Amazon accounts are really identified by a unique "Canonical ID", but I digress). Obviously, don't share these with anyone. Since you signed up with a credit card, you don't want anyone else storing stuff in S3 on your account and having you pay for it. So be careful.
Take the Access Key ID and Secret Access Key values and write them into a text file in your home directory called .awssecret. Two lines: the Access Key ID on the first line and the Secret Access Key on the second. Now you're ready to get the software on.
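Since that file holds the keys to your credit card, it's worth making it readable only by you. Here's a minimal sketch of writing it from Ruby with tight permissions. The write_aws_secret helper is my own hypothetical name (not part of any gem), and the key values are placeholders:

```ruby
require 'tmpdir'

# Hypothetical helper: writes the two-line credentials file
# (Access Key ID first, Secret Access Key second) readable only by its owner.
def write_aws_secret(path, access_key_id, secret_access_key)
  File.open(path, 'w', 0600) do |f|
    f.puts access_key_id
    f.puts secret_access_key
  end
  # The mode passed to File.open only applies to newly created files,
  # so tighten the permissions explicitly in case the file already existed.
  File.chmod(0600, path)
  path
end

# Demo against a throwaway directory with placeholder values.
# For real use, pass "#{ENV['HOME']}/.awssecret" and your own keys.
Dir.mktmpdir do |tmp|
  path = write_aws_secret(File.join(tmp, '.awssecret'), 'EXAMPLE_KEY_ID', 'EXAMPLE_SECRET')
  puts "wrote #{File.readlines(path).size} lines, mode #{format('%o', File.stat(path).mode & 0777)}"
end
```

Doing it by hand with a text editor and chmod 600 works just as well.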
Required packages
You'll need to have ruby, gem, and libopenssl-ruby installed. These might be called something different on your system, but on Ubuntu, you just run this:
# sudo apt-get install ruby rubygems libopenssl-ruby
Then you can install the aws-s3 gem trivially.
# sudo gem i aws-s3
Experimenting in the Ruby REPL
Probably the easiest way to try anything out in Ruby is by using the interactive Ruby interpreter, irb. This is pretty much like a Ruby shell. Don't forget to fire it up with the -rubygems flag so that you can use the gem libraries.
irb -rubygems
At that point, you can use these commands to read your credentials from the file you saved earlier and get connected to S3. Using SSL is recommended.
require 'aws/s3'
include AWS::S3
creds = File.open("#{ENV['HOME']}/.awssecret") { |f| {
    :access_key_id => f.readline.chomp,
    :secret_access_key => f.readline.chomp,
    :use_ssl => true
  }
}
Base.establish_connection! creds
You can then start issuing commands to see what's around. The first one below creates a bucket, and the second gets a listing of the buckets in your S3 account. Buckets are used to group the objects that you store in S3.
Bucket.create 'backup.media.0001'
Bucket.list
=> [#<AWS::S3::Bucket ... {"name"=>"backup.media.0001", "creation_date"=>Fri Sep 25 03:27:55 UTC 2009}>]
Doing the backup
Since I was backing up photos, I decided to non-recursively tar up all the photos in each directory, uncompressed (JPG is already insanely compressed, don't squeeze rocks!), one tar file per directory. I captured the output of find to see how many directories I was dealing with.
irb> (dirs = `find /media/HD-HCIU2/photos -type d`.split("\n")) && nil
=> nil
irb> dirs.size
=> 2859
Since a bucket listing only returns up to 1000 objects per request (the bucket itself can hold more, but you have to page through the listing to see them), I planned to create 3 buckets to hold the tar files for these 2859 directories. Here's the first one.
irb> bucketName="backup.media.0001"
irb> Bucket.create(bucketName)
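Splitting the directory list into groups of at most 1000 and giving each group a numbered bucket name could be sketched like this. The plan_buckets helper is hypothetical (my own name, not part of aws-s3), the naming scheme just extends the backup.media.0001 name used above, and the directory names in the example are made up:

```ruby
# Hypothetical helper: pairs each group of up to `per_bucket` directories
# with a numbered bucket name such as "backup.media.0001".
def plan_buckets(dirs, per_bucket = 1000, prefix = 'backup.media')
  plan = {}
  dirs.each_slice(per_bucket).each_with_index do |group, i|
    plan[format('%s.%04d', prefix, i + 1)] = group
  end
  plan
end

# Made-up directory names standing in for the output of find.
example_dirs = (1..2859).map { |n| "/photos/dir#{n}" }
plan = plan_buckets(example_dirs)
plan.each { |name, group| puts "#{name}: #{group.size} directories" }
```

Each key of the resulting hash is a bucket name you would still need to create, and each value is the list of directories destined for it.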
OK, ready to go. What I needed next was a function which, given a directory name and a bucket name, would tar up the contents of that directory and upload it to the bucket.
I decided to name the tar files after an MD5 hash of the full path to avoid any complications from odd characters. After some experimentation and adjustments, this function did the trick. It needs Digest included for MD5. (It's a little verbose, logging what it's doing to stdout.)
require 'digest/md5' # for MD5
include Digest
def uploadDirFiles(dir,bucketName)
  dirkey=MD5::hexdigest dir # archive name is the MD5 of the full directory path
  afilename="#{dirkey}.tar"
  archive="/tmp/#{afilename}"
  puts "chdir to #{dir}"
  Dir.chdir dir
  files = Dir.glob("*.*") # only entries with a dot in their name
  puts "#{files.size} in #{dir}"
  if (files.size>0) then
    qfiles = files.map{|f| "\"#{f}\""}.join " " # wrap filenames in quotes, join with spaces
    puts "creating archive #{archive}"
    cmd="tar --create --verbose --no-recursion --file #{archive} #{qfiles}"
    print `#{cmd}`
    puts "uploading archive #{archive} - #{File.size(archive)} bytes long"
    S3Object.store(afilename, open(archive), bucketName)
  end
end
Notice that this handles one directory. If something goes wrong, the store operation should raise an exception. We don't catch it here, but handle it in a higher level function which records successes and failures for all the directories, cleans up the tar files, and returns the failed and successful directories:
def upload(dirs,bucketName)
  failures=[]
  successes=[]
  dirs.each do |dir|
    begin
      uploadDirFiles(dir,bucketName)
      successes<<dir
    rescue => ex
      puts "#{dir} upload failed: #{ex}"
      failures<<dir
    end
    # clean up the temporary tar file whether or not the upload succeeded
    sum=MD5::hexdigest dir
    archive="/tmp/#{sum}.tar"
    File.unlink(archive) if File.exist?(archive)
  end
  [failures,successes]
end
Then we just call that with the first 10 directories whose names we collected earlier, as a test.
(failz,winz)=upload(dirs.first(10),bucketName)
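One thing the hashed archive names cost you is readability: you can't get a directory name back out of an MD5 hash, so when it comes time to restore you'll want a record of which archive belongs to which directory. Here's a rough sketch of such a manifest. Both helpers are hypothetical additions of mine, and the tab-separated file format is just an assumption:

```ruby
require 'digest/md5'

# Hypothetical helper: maps each archive name (MD5 of the directory path,
# the same naming used for the tar files above) back to its directory.
def build_manifest(dirs)
  manifest = {}
  dirs.each { |dir| manifest["#{Digest::MD5.hexdigest(dir)}.tar"] = dir }
  manifest
end

# Hypothetical helper: writes one "archive<TAB>directory" line per entry.
def write_manifest(manifest, path)
  File.open(path, 'w') do |f|
    manifest.sort.each { |archive, dir| f.puts "#{archive}\t#{dir}" }
  end
end

# Made-up directory names for illustration.
build_manifest(['/photos/2009/nyc', '/photos/2008/xmas']).each do |archive, dir|
  puts "#{archive} <- #{dir}"
end
```

The manifest file itself could be kept locally, uploaded alongside the archives, or both.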
Results
It works quite well, and the uploader returns a pair of arrays indicating successes and failures on a per-directory basis. You can run the failure array through the uploader again to retry, whittling the failure list down until you're done. This is probably better suited for turning into a script rather than running in the irb shell.
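The retry idea could look something like this if you did turn it into a script. This is only a sketch: upload_with_retries is a hypothetical helper, the limit of 3 attempts is an arbitrary choice, and the block stands in for the uploader, which is assumed to return a [failures, successes] pair as described above:

```ruby
# Hypothetical helper: keeps re-running the failed directories until none are
# left or max_attempts is reached. The block receives the directories still
# pending and must return a [failures, successes] pair.
def upload_with_retries(dirs, max_attempts = 3)
  pending = dirs
  done = []
  attempts = 0
  while !pending.empty? && attempts < max_attempts
    attempts += 1
    failures, successes = yield(pending)
    done.concat(successes)
    pending = failures
  end
  [pending, done]
end

# Intended use with the uploader from this post (not run here):
#   failz, winz = upload_with_retries(dirs) { |batch| upload(batch, bucketName) }
```

Whatever is left in the first array after the last attempt still needs attention by hand.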
So far I've just tested with uploading the first 10 directories of my music collection and did not encounter any failures. In an hour of testing I probably uploaded a CD's worth of data. Which isn't very fast, but this isn't the script's fault.
Maybe don't try this at home
The main problem is that running this over your average household ADSL completely sucks - your mileage may vary here, but my upload speed appears to be capped at about 1.5MBit/sec. If all goes well, 1GB would take about 1.5 hours to upload. Which makes for a slow, slow backup of 300GB (it would take over 18 days!).
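The back-of-the-envelope arithmetic behind those numbers can be sketched as follows. It assumes decimal units (1 GB = 8000 megabits), a line that actually sustains its capped rate, and no protocol overhead, so treat the results as a lower bound. The upload_hours name is just mine:

```ruby
# Estimate upload time in hours for a given amount of data and line speed.
# Assumes 1 GB = 8000 megabits and ignores protocol overhead and stalls.
def upload_hours(gigabytes, mbit_per_sec)
  megabits = gigabytes * 8000.0
  megabits / mbit_per_sec / 3600.0
end

puts format('1 GB at 1.5 MBit/sec:   %.1f hours', upload_hours(1, 1.5))
puts format('300 GB at 1.5 MBit/sec: %.1f days', upload_hours(300, 1.5) / 24)
```

Plug in your own upload speed to see whether a cloud backup is realistic for you.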
For home data I think I'll just have to back up my media drive to another media drive and store it somewhere else.
So this technique is better for backing up to the cloud from your office with its big fat data pipe, right?