There was a new challenge ahead. Moving over 10 million images from a single Linux machine to the Cloud.
It was quite a daunting task to move so many files from a running production server to the cloud. Before I start digging into the technical part, I would first like to explain why it was necessary.
A single Linux server for media storage is, of course, bad practice. It’s a single point of failure, plus we needed to add disk space ourselves as the data set kept growing. We also needed to keep the old data, because it was still used on the website.
Before the start of the project, there was about 1.5 TB of images stored on this server, mostly Roomstyler 3D images. The thumbnails for these images were created on the fly and cached on the same machine, though on a separate disk.
The main problem, next to uploading all those files to the Cloud, was generating thumbnails for all these images. S3 does not support on-the-fly thumbnailing, and we didn’t want an extra server running only for generating thumbs. We decided we needed to upload all the images to S3 and generate thumbnails for them up front.
For most images we needed thumbnails in two different sizes, which meant we would be storing over 30 million files in the Cloud.
Creating the worker
First I had to decide how to upload all these files to the Cloud. I looked online for help but didn’t find any good results, so I decided to go my own way. I knew it would take some time, so I would need a background service to process all the images. Since I’m familiar with Ruby, I chose Rails + Sidekiq.
I set up a new server on AWS and mounted the media disk over NFS. I installed Rails and Sidekiq and wrote a simple worker that uploads an image to S3, given the path to a file on the media disk. Because the workers were accessing the disk a lot, I limited Sidekiq to four simultaneous processes. I tried different numbers, but four was just about right to not run out of memory.
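A minimal sketch of such a worker. The class name, the `media` bucket, and the flat key scheme are assumptions, not the original code; in the real app the class would `include Sidekiq::Worker` and the client would be `Aws::S3::Client.new` from the aws-sdk-s3 gem. Here the client is injected so the sketch stays self-contained:

```ruby
# Hedged sketch of the upload worker — names and key scheme are assumptions.
class ImageUploadWorker
  # In the real app: include Sidekiq::Worker, with Sidekiq limited to four
  # processes so the NFS-mounted disk isn't overwhelmed.
  BUCKET = "media"

  # The media disk was a single flat directory, so the filename can double
  # as the S3 object key.
  def self.key_for(path)
    File.basename(path)
  end

  # `client` is injected for testability; in production it would be
  # Aws::S3::Client.new (aws-sdk-s3 gem).
  def perform(path, client)
    File.open(path, "rb") do |file|
      client.put_object(bucket: BUCKET, key: self.class.key_for(path), body: file)
    end
  end
end
```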
Thumbnailing JPEGs fast, faster, fastest
As most images were JPEGs, I needed a tool to thumbnail JPEGs really quickly. I found a library called EPEG, an insanely fast JPEG thumbnailing library built on libjpeg, the widely used JPEG codec library. I also found a Ruby binding for it, which I could use in the worker. By scaling images down while decoding, it can thumbnail JPEGs in mere milliseconds.
I also found out all the old images were uncompressed: recompressing them with only a 10 percent quality reduction shrank them considerably, by 30 to 60 percent. I used the image_optimizer gem for this, which wraps the jpegoptim Linux tool to compress images fast and painlessly.
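Under the hood that boils down to a jpegoptim invocation. A rough, hedged equivalent of what the gem does for JPEGs (the 90 percent default here is an assumption):

```ruby
# Build the jpegoptim command line: cap quality at `quality` percent and
# strip all metadata. Assumes the jpegoptim tool is installed on the host.
def jpegoptim_args(path, quality: 90)
  ["jpegoptim", "--max=#{quality}", "--strip-all", path]
end

def optimize_jpeg(path, quality: 90)
  # system with an argument list bypasses the shell, so odd filenames are safe
  system(*jpegoptim_args(path, quality: quality))
end
```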
There were also a couple of thousand PNGs and GIFs that needed to be thumbnailed and moved. I used the good old, but slow, MiniMagick library for those cases.
Generating the big list
Retrieving a list of all the files on the disk was not as easy as I thought. Listing 10 million files from a single directory in Linux isn’t trivial. Plus, sending all these paths to the background processor would be a heavy task, which could take a long time and a lot of memory.
I found a way to list the files and put them into one big file. It took about 30 minutes to generate.
# Disable sorting; list one filename per line
ls -f1 &> files.list
Now that I had one big file, I could process each line with a Ruby script and insert each path into the background job queue.
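The original script isn’t preserved in this post; a stdlib-only reconstruction might look like the sketch below. The `/mnt/media` mount point is an assumption, and in real use the block would enqueue a Sidekiq job, e.g. `ImageUploadWorker.perform_async(path)`:

```ruby
MEDIA_ROOT = "/mnt/media" # hypothetical NFS mount point

# Stream the listing line by line — File.foreach never loads the whole
# 10-million-line file into memory. Yields one absolute path per image.
def each_image_path(list_file)
  File.foreach(list_file) do |line|
    name = line.chomp
    # `ls -f` disables sorting but includes the `.` and `..` entries,
    # so skip anything starting with a dot
    next if name.empty? || name.start_with?(".")
    yield File.join(MEDIA_ROOT, name)
  end
end

# Usage (hypothetical worker from earlier in the post):
# each_image_path("files.list") { |path| ImageUploadWorker.perform_async(path) }
```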
This took a lot of time, but after three hours it was done. I started up the background manager so it could begin processing. It worked! Until I encountered a corrupt file, which caused all the worker processes to crash.
I needed to change something in the script and clear the queues, which meant reimporting all 10 million paths. I decided I needed more manageable data and split up the big file into chunks of 500,000 lines each, resulting in about 20 split files.
# Split into files of 500,000 lines each
split -l 500000 files.list
Importing them into the background job queue was now much easier, and even after fixing a couple more errors that occurred, the smaller data sets were much easier to work with.
Don’t forget the headers
After the whole process was complete, I found out the files were missing headers used by our preview renderer: I needed to set the ‘Content-Type’ header on all of them. S3 doesn’t support doing this in batches. I found a Python script by Dailymuse that I altered to make it work for us. It took about 14 hours to run through all the files and set the content type.
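The underlying trick: since S3 can’t edit metadata in place, each object is copied onto itself with a REPLACE metadata directive. A hedged Ruby equivalent of that per-object fix (in production `client` would be `Aws::S3::Client.new` from the aws-sdk-s3 gem; here it’s injected so the sketch stays self-contained):

```ruby
# Map file extensions to content types; anything unknown falls back to a
# generic binary type.
CONTENT_TYPES = {
  ".jpg"  => "image/jpeg",
  ".jpeg" => "image/jpeg",
  ".png"  => "image/png",
  ".gif"  => "image/gif",
}.freeze

# S3 only allows replacing metadata via a self-copy: copy_object with
# metadata_directive: "REPLACE" rewrites the object's Content-Type.
def fix_content_type(client, bucket, key)
  type = CONTENT_TYPES.fetch(File.extname(key).downcase, "application/octet-stream")
  client.copy_object(
    bucket: bucket,
    key: key,
    copy_source: "#{bucket}/#{key}",
    metadata_directive: "REPLACE",
    content_type: type
  )
  type
end
```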
All in Time
It took me about a week to process all the images on the media disk. Sometimes the disk would slow down during peak periods on the website, or when accessing a deteriorated part of the disk. Here’s a breakdown of the time for the processes:
- 10 ms for generating a single thumbnail
- 100 ms for uploading the original image and its thumbnails
- 20 minutes for importing the file paths into the background queue
- 6 hours for processing a single split file of about 500,000 image paths
- 14 hours for adding the content-type header to the images
- 7 days for processing all the images
Lessons learned
- Be sure to set the correct headers on your files, it will save you a lot of time!
- Split large files up and process them in batches
- Sidekiq and Ruby are well suited for processing a large number of small tasks
- Never underestimate the power of a good library. EPEG, yeah!