Ruby's CSV library has a fairly dated API, and its processing of CSV files, returning arrays of arrays, feels 'very close to the metal'. The output is not easy to work with, especially if you want to create database records from it. Another shortcoming is that Ruby's CSV library has no good support for huge CSV files: there is no built-in 'chunking', and no easy way to hand the CSV content off for parallel processing (e.g. with Resque or Sidekiq).
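For comparison, here is a minimal sketch of what the standard library hands you (file path and contents are hypothetical):

require 'csv'
rows = CSV.read('/tmp/some.csv')
# => [["first_name", "last_name"], ["John", "Doe"], ["Jane", "Doe"]]
# arrays of arrays: you still have to match each value to its header yourself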
As the existing CSV libraries didn't fit my needs, I wrote my own CSV processing, specifically for use with Ruby on Rails ORMs like Mongoid, MongoMapper, or ActiveRecord. With those ORMs you can simply pass a hash of attribute/value pairs to the create() method. The lower-level Mongo driver and Moped also accept arrays of such hashes, so you can create a larger number of records quickly with just one call.
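That hash-based interface is what I wanted to feed directly; a minimal sketch, assuming a Mongoid model MyModel with first_name and last_name fields:

MyModel.create( {:first_name => 'John', :last_name => 'Doe'} )  # one record from one hash
MyModel.collection.insert( [ {:first_name => 'John'}, {:first_name => 'Jane'} ] )  # many records with one call, via the low-level driver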
My requirements were:
- return each CSV row as a Ruby hash, keyed by the CSV headers, so it can be passed straight to create()
- allow renaming of keys and dropping of unwanted columns via a key mapping
- handle non-standard column and row separators, as well as comment lines
- support 'chunking' of huge files, so batches of rows can be handed to Resque or Sidekiq workers
To achieve this, I created the Ruby Gem smarter_csv, which provides a method for smarter processing of CSV files:
require "smarter_csv"
filename = '/tmp/some.csv'
n = SmarterCSV.process(filename, {:key_mapping => {:unwanted_column => nil, :old_column_name => :new_name}}) do |array|
  # we're passing in a block to process each resulting hash / row (the block receives an array of hashes);
  # when chunking is not enabled, each array contains exactly one hash
  MyModel.create( array.first )
end
# => returns the number of chunks processed
require "smarter_csv"
filename = '/tmp/some.csv'
n = SmarterCSV.process(filename, {:chunk_size => 100, :key_mapping => {:unwanted_column => nil, :old_column_name => :new_name}}) do |chunk|
  # we're passing in a block to process each resulting chunk (an array of up to 100 hashes):
  MyModel.collection.insert( chunk )  # using the low-level Mongo driver to create up to 100 records with one call
end
# => returns the number of chunks processed
require "smarter_csv"
filename = '/tmp/strange_db_dump'  # a file with CTRL-A as col_separator, and CTRL-B\n as record_separator (hello iTunes)
n = SmarterCSV.process(filename, {:col_sep => "\cA", :row_sep => "\cB\n", :comment_regexp => /^#/,
                                  :chunk_size => 5, :key_mapping => {:export_date => nil, :name => :genre}}) do |chunk|
  # we're passing in a block to process each resulting chunk (an array of hashes):
  Resque.enqueue( MyResqueWorkerClass, chunk )  # pass chunks of CSV data to Resque workers for parallel processing
end
# => returns the number of chunks processed
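Since Sidekiq works just as well for this, here is a sketch of the equivalent hand-off, assuming a hypothetical worker class MyCsvWorker:

require "smarter_csv"
require "sidekiq"

class MyCsvWorker
  include Sidekiq::Worker
  def perform(chunk)
    # note: Sidekiq serializes job arguments to JSON, so the hashes arrive with string keys
    chunk.each { |attributes| MyModel.create( attributes ) }
  end
end

n = SmarterCSV.process('/tmp/some.csv', {:chunk_size => 100}) do |chunk|
  MyCsvWorker.perform_async( chunk )  # each Sidekiq job processes one chunk of up to 100 hashes
end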
Now available as a Ruby Gem: smarter_csv, and also available as source code on GitHub.
To install, run gem install smarter_csv (or add it to your Gemfile), then require 'smarter_csv' in your code and call SmarterCSV.process().
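For example (the file path is hypothetical):

gem install smarter_csv

Then, in your code:

require 'smarter_csv'
data = SmarterCSV.process('/tmp/some.csv')  # without a block, returns the array of hashes directly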
Below is the original Gist, which is at the core of the 'smarter_csv' Gem, as well as one of the StackOverflow questions that prompted it.
I hope you'll find this useful :)