Sanjuro recently commented "Actually I find these insights into the management of a fairly big website very interesting. So keep'em coming!" I'm going to pre-apologize right now if this entry bores you to tears; just imagine the despair of the electrons who were fired onto you with futility! 

Tonight, I wrapped up a back-end project for Tabulas that's been ongoing for a couple of weeks now (dates back to early December). What I've successfully done is to place the burden of serving Tabulas images onto this server, instead of Amazon S3. Geniuses like PeteE or fdn are probably thinking to themselves: "Oh man, this would take me all of twenty minutes." Unfortunately, I was not blessed with smarts, but instead dashing good looks, so this took me a couple of weeks instead.

Back in the day before Amazon S3, everything was hosted in a datacenter on several servers I rented from EV1Servers. This included both the database and the flat files. If you remember the old PDS (personal data servers), they were aces.tabulas.com, lca.tabulas.com, and jbiel.tabulas.com - your account was tied to one of these servers.

This worked fine, but there was the issue of the servers one day going up in flames, and me losing all the data (I don't like RAID, and I only had the time/skills to back up the database and the raw files). There were also issues of scaling I never addressed (for example, what would happen when aces.tabulas.com ran out of disk space?).

Then Amazon S3 came along and I crapped my proverbial pants (and quite possibly my literal pants).

When an image is uploaded to Tabulas, I store 4 versions of the same file: a thumbnail size (small), a web size (medium), a large size (large), and the raw image. In the early days, I didn't expose the raw version, but kept it archived on a separate server. I didn't have the skills to keep file systems backed up, so I figured if a PDS server ever got toasted, I could use the raw versions to regenerate the different sizes.

So obviously, the first thing I did when S3 came out was to move the raw files to an "original" bucket (ACL: private).

And then a couple of months later, I created the new bucket images.tabulas.co and started hosting images on S3 directly. I also exposed the "raw" format publicly, which pretty much deprecated the usefulness of the "original" bucket. And all this was working fine until a couple of weeks ago.

While S3 simplifies the maintenance of the server, it is still not very cost-effective. The bandwidth/data storage costs are much higher than if you ran it yourself - but for a guy like me who is more interested in cutting cruddy code than maintaining servers, that added cost is fine. Well, to a certain point. When my S3 costs started spiraling toward $300/month, I decided it'd be worth cutting some code.

So I created i2.tabulas.com, which was routed through my servers. Without the math getting complicated, i2.tabulas.com gives me "free" storage, with bandwidth costs of $0.0485/GB per month. S3 costs $0.17/GB per month. That makes i2 roughly 3.5 times cheaper, even excluding the storage costs.
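For anyone who wants to check the arithmetic, here is a minimal sketch in Python (not the PHP that Tabulas runs on). Dividing the two quoted rates gives a ratio of about 3.5. The 1,000 GB traffic figure is made up purely to make the numbers concrete; only the two per-GB rates come from the post.

```python
# Per-GB bandwidth rates quoted in the post (USD).
I2_RATE = 0.0485
S3_RATE = 0.17


def monthly_bandwidth_cost(gb_transferred: float, rate_per_gb: float) -> float:
    """Bandwidth cost in USD for one month of transfer at a flat per-GB rate."""
    return gb_transferred * rate_per_gb


# How many times more expensive S3 bandwidth is than i2 bandwidth.
ratio = S3_RATE / I2_RATE

# Hypothetical monthly traffic, chosen only for illustration.
example_gb = 1000
s3_cost = monthly_bandwidth_cost(example_gb, S3_RATE)
i2_cost = monthly_bandwidth_cost(example_gb, I2_RATE)

print(f"S3: ${s3_cost:.2f}, i2: ${i2_cost:.2f}, ratio: {ratio:.2f}x")
# prints "S3: $170.00, i2: $48.50, ratio: 3.51x"
```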

So when a user requested a picture from Tabulas, it got routed to i2.tabulas.com; i2 would then ask, "Hmm, do I have a local copy of this file?" If so, it would simply serve that image out (using PHP's fpassthru) to the end-user. If it didn't, it would retrieve it from Amazon S3 once (and store it on the local server), then serve up the image.
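The "do I have a local copy?" logic is a read-through cache. Below is a rough sketch of the idea in Python rather than the PHP/fpassthru code actually running on i2. The names `serve_image`, `fetch_from_origin`, and `fake_s3` are invented for illustration, and the sketch assumes image keys have already been validated (no `..` path segments).

```python
import tempfile
from pathlib import Path
from typing import Callable


def serve_image(key: str, cache_dir: Path,
                fetch_from_origin: Callable[[str], bytes]) -> bytes:
    """Return the bytes for `key`, going to the origin only on a cache miss."""
    local_path = cache_dir / key
    if local_path.is_file():
        # Cache hit: serve the local copy, no origin request needed.
        return local_path.read_bytes()
    # Cache miss: fetch once from the origin, keep a copy, then serve it.
    data = fetch_from_origin(key)
    local_path.parent.mkdir(parents=True, exist_ok=True)
    local_path.write_bytes(data)
    return data


# Demo with a stand-in for S3 that records how often it is asked.
origin_requests = []


def fake_s3(key: str) -> bytes:
    origin_requests.append(key)
    return b"fake image bytes"


with tempfile.TemporaryDirectory() as tmp:
    first = serve_image("albums/1/cat.jpg", Path(tmp), fake_s3)   # miss
    second = serve_image("albums/1/cat.jpg", Path(tmp), fake_s3)  # hit

print(len(origin_requests))  # the origin was contacted only once
```

Each image costs one S3 transfer at most; every later request is served from local disk at the cheaper bandwidth rate.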

I waited for people to complain about things not working, but there weren't any complaints. So I took it to the next level.

One problem was that people were still using images.tabulas.co when referencing images - so even though I was telling Tabulas users the subdomain was i2.tabulas.com, anyone who had embedded Tabulas images on other sites would constantly hit S3.

My goal was to have images.tabulas.co's DNS no longer point to S3, but to Tabulas.

But there was one caveat. When I serve up images from i2, I send HTTP headers that tell your browser the file size, as well as the filetype. I was missing this information for most images (I told you, I used to be quite lazy).

So I had to write a script to go through about half a million pictures on Amazon S3 and retrieve the file metadata. (Conceivably, I could have used the local copy I get from Amazon to work this out, but I've had bad experiences with local MIME type detection.) When I ran this script, I noticed that roughly 6,000 images on Tabulas had data records but no files.
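The shape of such a backfill script is roughly the following. This is a Python sketch under stated assumptions, not the script that was actually run: `head_object` stands in for whatever call asks S3 for an object's content type and size (it returns `None` when the object is missing), and `fake_bucket` is made-up demo data.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Metadata = Tuple[str, int]  # (content type, size in bytes)


def backfill_metadata(
    keys: Iterable[str],
    head_object: Callable[[str], Optional[Metadata]],
) -> Tuple[Dict[str, Metadata], List[str]]:
    """Collect metadata for every key and list the keys that have no file."""
    found: Dict[str, Metadata] = {}
    missing: List[str] = []
    for key in keys:
        meta = head_object(key)
        if meta is None:
            missing.append(key)  # data record exists, file does not
        else:
            found[key] = meta
    return found, missing


def response_headers(meta: Metadata) -> Dict[str, str]:
    """Build the two headers the image server sends from stored metadata."""
    content_type, size = meta
    return {"Content-Type": content_type, "Content-Length": str(size)}


# Demo: a dict plays the role of the S3 bucket.
fake_bucket = {"a.jpg": ("image/jpeg", 48213), "b.png": ("image/png", 1024)}
found, missing = backfill_metadata(["a.jpg", "b.png", "gone.gif"], fake_bucket.get)

print(missing)
print(response_headers(found["a.jpg"]))
```

Asking the store for the metadata instead of sniffing the downloaded bytes sidesteps the local MIME type detection problem mentioned above.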

Being in "spring cleaning" mode for the Tabulas database, I decided to fix these images by writing a script that would (1) go to the "original" backup bucket and retrieve the file, (2) regenerate the different image sizes, and (3) update the data records accordingly. Using this, I fixed roughly 5,000 of those images. The last 1,000 I just deleted from the database (hell, the images don't work, why would people want them in their gallery?).
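Those three steps can be sketched like this, again in Python rather than the site's PHP. Every function name here is hypothetical, and the sketch assumes the records that ended up deleted were the ones whose raw file could not be found in the backup bucket, which the post does not state outright.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def repair_missing_images(
    missing_keys: Iterable[str],
    fetch_original: Callable[[str], Optional[bytes]],
    regenerate_sizes: Callable[[bytes], Dict[str, bytes]],
    update_record: Callable[[str, Dict[str, bytes]], object],
    delete_record: Callable[[str], object],
) -> Tuple[List[str], List[str]]:
    """Repair what can be repaired; drop records with no recoverable file."""
    repaired: List[str] = []
    deleted: List[str] = []
    for key in missing_keys:
        raw = fetch_original(key)        # (1) look in the backup bucket
        if raw is None:
            delete_record(key)           # nothing to rebuild from
            deleted.append(key)
            continue
        sizes = regenerate_sizes(raw)    # (2) rebuild the resized versions
        update_record(key, sizes)        # (3) point the record at them
        repaired.append(key)
    return repaired, deleted


# Demo with in-memory stand-ins for the bucket and the database table.
originals = {"a.jpg": b"raw-a"}
records: Dict[str, Optional[Dict[str, bytes]]] = {"a.jpg": None, "b.jpg": None}


def fake_regenerate(raw: bytes) -> Dict[str, bytes]:
    # A real version would resize the image; this only tags the bytes.
    return {size: raw + b"-" + size.encode() for size in ("small", "medium", "large")}


repaired, deleted = repair_missing_images(
    ["a.jpg", "b.jpg"],
    fetch_original=originals.get,
    regenerate_sizes=fake_regenerate,
    update_record=records.__setitem__,
    delete_record=records.pop,
)
print(repaired, deleted)
```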

While doing this, I realized how useless the "original" Amazon S3 bucket was - so I started running a script to delete that whole bucket (I think it weighs in at around 60GB or so - that'll save me a whopping $100/year, but it's a bucket I absolutely don't use, so it'll be good to do that).

Once I verified that all images stored in the Tabulas DB have the appropriate metadata, I flipped the switch by removing the CNAME DNS record that pointed images.tabulas.co to Amazon S3 - now even images.tabulas.co points to the primary Tabulas server!

Of course, after this was done, I also decided to clean up any entries which had embedded images over the past couple of months - I wrote a script to go through all entries posted in the past three months and change every reference to i2.tabulas.com into images.tabulas.co. (Although I plan on maintaining the i2.tabulas.com subdomain indefinitely, ensuring all data inside Tabulas references images.tabulas.co lets me cut down on some code.)
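A host rewrite like this is easy to get subtly wrong, so here is a cautious Python sketch of one way to do it (the real script was presumably PHP, and `rewrite_image_hosts` is an invented name). It only touches the host when it directly follows `://` and is followed by something that ends a hostname, so lookalike domains are left alone.

```python
import re

# Match "i2.tabulas.com" only as a complete URL host: right after "://"
# and right before a path, port, query, fragment, quote, space, ">" or
# the end of the text.
I2_HOST = re.compile(
    r"""(?<=://)i2\.tabulas\.com(?=[/:?#"'\s>]|$)""",
    re.IGNORECASE,
)


def rewrite_image_hosts(entry_html: str) -> str:
    """Point every i2.tabulas.com URL in an entry at images.tabulas.co."""
    return I2_HOST.sub("images.tabulas.co", entry_html)


entry = (
    '<img src="http://i2.tabulas.com/albums/1/cat.jpg"> '
    "and http://i2.tabulas.com.example.org/x"
)
print(rewrite_image_hosts(entry))
```

In the demo, the first URL is rewritten and the second (a different domain that merely starts with the same text) is not.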

Doesn't it seem weird that there's so much work being done just to maintain the status quo? All that work, and its success was judged by how little had changed.

But it was all worth it - I got to remove an unused Amazon S3 bucket, I cleaned up the Tabulas DB, and I made the data inside the images table of the database more consistent. And not only that, I added a feature whose absence had long bugged me: privacy controls on the images themselves.

In the past, you could set privacy controls on albums, but they would not be enforced on the images themselves. For example, if you got an image URL, you could easily just share it with somebody. The false sense of security = not cool.

Facebook still does this; I have an album that's set to "Friends Only", yet conceivably this link to an image in that album works. (I'm guessing the problem is compounded by Akamai - doing auth on a CDN is hard.)

Anyways, I finally got this implemented in Tabulas with just a few lines of code (G will probably snicker due to my usage of $wg, but I don't care!):

// do a privacy check if the album isn't public
if ($Image->getAlbum()->getStatus() != STATUS_PUBLIC) {

    // define the site user
    $wgSiteUser = User::fromId($wgTitle->getPath(0) /* userid */);

    // do the privacy check
    if (!$Image->getAlbum()->canView()) {
        http_status(403);
        exit();
    }
}

If you are logged into Tabulas and are a Tabulas friend, you can see this picture. Try logging out and hard-refreshing. Can you see it now? NOPE! Burn.

Of course, the one use case this breaks: users who uploaded background images for their Tabulas in their gallery and "Private"-ed the album to "hide" it. Maybe I'll add an "archive" or "hide" album feature instead.

I'm pretty sick of working on images, but there is one last thing I'd like to add: EXIF image parsing. There must be a wealth of knowledge there already.

So yeah, that was what I did yesterday and today. Fun!

Currently listening to: The Old 97s - Timebomb
Posted by roy on January 4, 2009 at 03:05 AM in Web Development, Tabulas | 1 Comments

sanjuro (guest)

Comment posted on January 4th, 2009 at 02:13 PM
So now I bear the responsibility whenever you bore your readers with technical lingo. It's making me proud... somewhat. :D