Oct 06 2016

Processing CSV Files with Perl and Bash

Olivier, a friend of mine, had to parse a CSV file and took the opportunity to benchmark the performance of several programming languages.
 
The file contains server names and disk sizes, which he needs to add up in a hash table in order to get the total disk space for each server. On his blog he assumes that Perl, Python and Golang are much faster than Bash. He is definitely right, but how much faster?
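
The input might look something like this (hypothetical contents, just to illustrate the format; the real file isn't shown):

server01,120
server02,80
server01,250
server03,500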
 
The following (slightly modified) Perl script processed 600k lines in less than a second. Not bad, considering Perl is an interpreted language.

#!/usr/bin/perl
use strict;
use warnings;

my $file = 'sample.csv';
my %data;
open(my $fh, '<', $file) or die "Can't read file '$file' [$!]\n";
while ( my $line = <$fh> ) {
    chomp $line;
    # Split on the comma and accumulate the disk size per server.
    my ($server, $value) = split(/,/, $line);
    $data{$server} += $value;
}
close($fh);
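
For reference, one quick way to reproduce this kind of benchmark is to generate a synthetic file and time the script. The row count, the server naming scheme and the script name below are assumptions, not the original test data:

# Hypothetical setup: ~600k rows spread across 100 fake servers.
seq 1 600000 | awk '{ printf "server%02d,%d\n", $1 % 100, $1 % 500 }' > sample.csv
time perl sum_disks.pl   # assumed file name for the Perl script above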

 
Now, here’s similar code in Bash:

#!/bin/bash
file=sample.csv
declare -A data
while read -r line; do
  # Each iteration forks a subshell and an awk process just to split one line.
  values=($(echo "$line" | awk -F, '{print $1" "$2}'))
  (( data[${values[0]}] += ${values[1]} ))
done < "$file"

The file was processed in over 19 minutes, or in other words, around 1,200 times slower than Perl!
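
Why so slow? The per-line cost is dominated by process creation, not by the arithmetic. A rough way to see this (with a made-up input line) is to time the inner pipeline on its own:

# Time 1,000 runs of just the command substitution used above.
# Forking the subshell plus awk dwarfs the cost of the addition itself.
time for i in {1..1000}; do
  values=($(echo "server01,42" | awk -F, '{print $1" "$2}'))
done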
 
Let's see if we can improve the script's performance.
The read command man page states something of interest:
"The characters in IFS are used to split the line into words".
Setting the comma as the separator lets read split each line directly into an array, saving the hassle of parsing every line with awk and storing the result in a temporary variable.
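
Here is a quick illustration of the mechanism, with a made-up input line:

# read splits on IFS; here the comma separates server name and size.
IFS=',' read -r -a parts <<< "server01,42"
echo "${parts[0]}"   # prints: server01
echo "${parts[1]}"   # prints: 42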

#!/bin/bash
IFS=','
file=sample.csv
declare -A data
# read now splits each line on the comma itself; no extra processes are forked.
while read -r -a line; do
  (( data[${line[0]}] += ${line[1]} ))
done < "$file"
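
Neither version prints its results; to inspect the totals, one could append a loop like this sketch at the end of the script:

# Print the accumulated total for each server (order is unspecified).
for server in "${!data[@]}"; do
  echo "$server,${data[$server]}"
done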

This new version runs in... 17 seconds! That is 17 times slower than Perl, but about 70 times faster than the original version.
 
No doubt Perl and Python are much faster than the shell-family languages, but one needs to pay attention to the small details when performance matters.

