Ruby's CSV library has a fairly dated API, and its processing of CSV files, returning arrays of arrays, feels 'very close to the metal'. The output is not easy to work with, especially if you want to create database records from it. Another shortcoming is that Ruby's CSV library has no good support for huge CSV files: there is no built-in 'chunking', and no easy way to hand the CSV content off for parallel processing (e.g. with Resque or Sidekiq).
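For comparison, here is a minimal sketch of what the standard library hands you (file path and contents are hypothetical):

require 'csv'
rows = CSV.read('/tmp/some.csv')
# => [["first_name", "last_name"], ["John", "Doe"], ["Jane", "Doe"]]
# arrays of arrays: you still have to match each value to its header yourself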
As the existing CSV libraries didn't fit my needs, I wrote my own CSV processing, specifically for use with Ruby on Rails ORMs like Mongoid, MongoMapper, or ActiveRecord. With those ORMs you can simply pass a hash of attribute/value pairs to the create() method. The lower-level Mongo driver and Moped also accept arrays of such hashes, so you can create a larger number of records quickly with just one call.
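That hash-based interface is what I wanted to feed directly; a minimal sketch, assuming a Mongoid model MyModel with first_name and last_name fields:

MyModel.create( {:first_name => 'John', :last_name => 'Doe'} )  # one record from one hash
MyModel.collection.insert( [ {:first_name => 'John'}, {:first_name => 'Jane'} ] )  # many records with one call, via the low-level driver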
My requirements were:
- return each CSV row as a Ruby hash, keyed by the CSV headers, so it can be passed straight to create()
- allow renaming of keys and dropping of unwanted columns via a key mapping
- handle non-standard column and row separators, as well as comment lines
- support 'chunking' of huge files, so batches of rows can be handed to Resque or Sidekiq workers
To achieve this, I created the Ruby Gem smarter_csv, which provides a method for smarter processing of CSV files:
require "smarter_csv"
filename = '/tmp/some.csv'
n = SmarterCSV.process(filename, {:key_mapping => {:unwanted_column => nil, :old_column_name => :new_name}}) do |array|
  # we're passing in a block to process each resulting hash / row (the block receives an array of hashes);
  # when chunking is not enabled, each array contains exactly one hash
  MyModel.create( array.first )
end
# => returns the number of chunks processed
require "smarter_csv"
filename = '/tmp/some.csv'
n = SmarterCSV.process(filename, {:chunk_size => 100, :key_mapping => {:unwanted_column => nil, :old_column_name => :new_name}}) do |chunk|
  # we're passing in a block to process each resulting chunk (an array of up to 100 hashes):
  MyModel.collection.insert( chunk )  # using the low-level Mongo driver to create up to 100 records with one call
end
# => returns the number of chunks processed
require "smarter_csv"
filename = '/tmp/strange_db_dump'  # a file with CTRL-A as col_separator, and CTRL-B\n as record_separator (hello iTunes)
n = SmarterCSV.process(filename, {:col_sep => "\cA", :row_sep => "\cB\n", :comment_regexp => /^#/,
                                  :chunk_size => 5, :key_mapping => {:export_date => nil, :name => :genre}}) do |chunk|
  # we're passing in a block to process each resulting chunk (an array of hashes):
  Resque.enqueue( MyResqueWorkerClass, chunk )  # pass chunks of CSV data to Resque workers for parallel processing
end
# => returns the number of chunks processed
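Since Sidekiq works just as well for this, here is a sketch of the equivalent hand-off, assuming a hypothetical worker class MyCsvWorker:

require "smarter_csv"
require "sidekiq"

class MyCsvWorker
  include Sidekiq::Worker
  def perform(chunk)
    # note: Sidekiq serializes job arguments to JSON, so the hashes arrive with string keys
    chunk.each { |attributes| MyModel.create( attributes ) }
  end
end

n = SmarterCSV.process('/tmp/some.csv', {:chunk_size => 100}) do |chunk|
  MyCsvWorker.perform_async( chunk )  # each Sidekiq job processes one chunk of up to 100 hashes
end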
Now available as a Ruby Gem: smarter_csv, and also available as source code on GitHub.
To install, run gem install smarter_csv (or add it to your Gemfile), then require 'smarter_csv' in your code and call SmarterCSV.process().
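For example (the file path is hypothetical):

gem install smarter_csv

Then, in your code:

require 'smarter_csv'
data = SmarterCSV.process('/tmp/some.csv')  # without a block, returns the array of hashes directly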
Below is the original Gist, which is at the core of the 'smarter_csv' Gem, as well as one of the StackOverflow questions that prompted it.
I hope you'll find this useful :)