When parsing large text files, particularly CSV files, web developers often need to know the number of lines contained in the file up front. Knowing the CSV line count is useful for a number of reasons, for example, displaying a progress meter while parsing the large file, or estimating how long the parsing process might take. There are two ways to do this when using PHP for web development. Unfortunately, they both require using more memory than is necessary — just to get a line count. The good news is that the line count can be calculated very quickly using the PHP exec() function and some simple Linux utilities. But first, let’s take a look at the PHP methods…
Method #1: Parsing the entire CSV file at once using PHP
This first method uses the file() function to read the contents of the file into an array, from which the count() method can be used to derive the line count:
$linecount = count(file('filename.csv'));
Reading an entire file into memory just to get a line count? While it may not affect your code when parsing smaller files, larger CSV files will most definitely chew up memory and cause the maximum memory limit (designated in php.ini) to stop the script. And yet almost every forum out there suggests this solution when people ask how to calculate the number of lines in a CSV file. Sometimes, simple code is not so simple.
Method #2: Parsing the CSV file line-by-line using PHP
There is another method for calculating line count that requires less memory. Using the PHP’s filesystem functions — fopen(), fgets(), and fclose() to loop through the file, reading one line into memory at a time, and incrementing the line count as we do so:
$fh = fopen('filename.csv','rb') or die("ERROR OPENING DATA");
while (fgets($this->fh) !== false) $linecount++;
While this method is certainly faster than the previous, it does still require reading the entire line into memory just to count it. For CSV files with a small number of columns, this won’t matter. Parsing larger files with a greater number of columns will end up using memory that doesn’t necessarily need to be used, and will take up processing time doing so.
Method #3: Using Linux utilities to calculate the number of lines
Perhaps one of the most powerful, and most dangerous, functions available in PHP is the exec() function. This function allows us to drop down into the Linux environment running underneath PHP and issue it command line directives. Using the exec() function gives us access to the wc linux utility to print the number of newlines, words, and bytes in files.
When run on the command line, the output looks like this:
[jreeve@pelago ~]$ wc -l filename.csv
To run this in PHP and parse the output, we would do this:
list($linecount,$foo) = split(" ", exec('wc -l ' . escapeshellarg('filename.csv')));
There is only one problem with this approach. CSV files generated by different spreadsheet programs and OSes use different characters to represent a new line. While Linux uses the newline character, represented as ‘\n’, or decimal 12 (LF), Windows uses carriage-return newline pairs, represented as ‘\r\n’ or decimal 15,12 (CR+LF). To further complicate the matter, Mac OS, up until version 9, used carriage-return characters, represented as ‘\r’ or decimal 15 (CR). The wc utility has a difficult time counting the number of lines in CSV files generated using older versions of the Mac OS, and possibly other OSes and software, that use a sole carriage-return (CR) to denote each new line.
To remedy this, we use wc in combination with perl:
[jreeve@pelago ~]$ perl -pe 's/\r\n|\n|\r/\n/g' filename.csv | wc -l
By running the file through perl on the command line we are able to substitute variations of the newline character that we may encounter. It should be noted here that using the tr linux utility, which is often suggested, will not work because it can only substitute one character, and we need to substitute a two-character set.
To replicate this in PHP, we would do:
$linecount = exec('perl -pe \'s/\r\n|\n|\r/\n/g\' ' . escapeshellarg('filename.csv') . ' | wc -l');
Now we’ve got a fast and efficient method for getting around PHP and down into the Linux OS to do some of the heavy lifting for us. It makes sense. If there is a better tool for the job, use that tool, especially if that tool is default to the native environment we are accustomed to working in. Fortunately for us web developers, PHP and Linux go together like Ponch and Jon, making development challenges like this both efficient and fun.
Update: I neglected to mention in the article that when faced with this web development challenge we did end up using method #2 in our web-based application, Intervals. We were able to do it without too much of a performance hit. However, if performance did become an issue, I wouldn’t hesitate to implement method #3.
Photo credit: Eric Fischer