Calculate the number of lines in a CSV file using PHP and Linux… without loading the file into application memory

| June 21st, 2011 | , , , , , ,

Calculate the number of lines in a CSV file using PHP and Linux

When parsing large text files, particularly CSV files, web developers often need to know the number of lines contained in the file up front. Knowing the CSV line count is useful for a number of reasons, for example, displaying a progress meter while parsing the large file, or estimating how long the parsing process might take. There are two ways to do this when using PHP for web development. Unfortunately, they both require using more memory than is necessary — just to get a line count. The good news is that the line count can be calculated very quickly using the PHP exec() function and some simple Linux utilities. But first, let’s take a look at the PHP methods…

Method #1: Parsing the entire CSV file at once using PHP

This first method uses the file() function to read the contents of the file into an array, from which the count() method can be used to derive the line count:
$linecount = count(file('filename.csv'));
Reading an entire file into memory just to get a line count? While it may not affect your code when parsing smaller files, larger CSV files will most definitely chew up memory and cause the maximum memory limit (designated in php.ini) to stop the script. And yet almost every forum out there suggests this solution when people ask how to calculate the number of lines in a CSV file. Sometimes, simple code is not so simple.

Method #2: Parsing the CSV file line-by-line using PHP

There is another method for calculating line count that requires less memory. Using the PHP’s filesystem functions — fopen(), fgets(), and fclose() to loop through the file, reading one line into memory at a time, and incrementing the line count as we do so:
$fh = fopen('filename.csv','rb') or die("ERROR OPENING DATA");
while (fgets($this->fh) !== false) $linecount++;
fclose($fh);

While this method is certainly faster than the previous, it does still require reading the entire line into memory just to count it. For CSV files with a small number of columns, this won’t matter. Parsing larger files with a greater number of columns will end up using memory that doesn’t necessarily need to be used, and will take up processing time doing so.

Method #3: Using Linux utilities to calculate the number of lines

Perhaps one of the most powerful, and most dangerous, functions available in PHP is the exec() function. This function allows us to drop down into the Linux environment running underneath PHP and issue it command line directives. Using the exec() function gives us access to the wc linux utility to print the number of newlines, words, and bytes in files.

When run on the command line, the output looks like this:
[jreeve@pelago ~]$ wc -l filename.csv
371 filename.csv

To run this in PHP and parse the output, we would do this:
list($linecount,$foo) = split(" ", exec('wc -l ' . escapeshellarg('filename.csv')));

There is only one problem with this approach. CSV files generated by different spreadsheet programs and OSes use different characters to represent a new line. While Linux uses the newline character, represented as ‘\n’, or decimal 12 (LF), Windows uses carriage-return newline pairs, represented as ‘\r\n’ or decimal 15,12 (CR+LF). To further complicate the matter, Mac OS, up until version 9, used carriage-return characters, represented as ‘\r’ or decimal 15 (CR). The wc utility has a difficult time counting the number of lines in CSV files generated using older versions of the Mac OS, and possibly other OSes and software, that use a sole carriage-return (CR) to denote each new line.

To remedy this, we use wc in combination with perl:
[jreeve@pelago ~]$ perl -pe 's/\r\n|\n|\r/\n/g' filename.csv | wc -l
371

By running the file through perl on the command line we are able to substitute variations of the newline character that we may encounter. It should be noted here that using the tr linux utility, which is often suggested, will not work because it can only substitute one character, and we need to substitute a two-character set.

To replicate this in PHP, we would do:
$linecount = exec('perl -pe \'s/\r\n|\n|\r/\n/g\' ' . escapeshellarg('filename.csv') . ' | wc -l');

Now we’ve got a fast and efficient method for getting around PHP and down into the Linux OS to do some of the heavy lifting for us. It makes sense. If there is a better tool for the job, use that tool, especially if that tool is default to the native environment we are accustomed to working in. Fortunately for us web developers, PHP and Linux go together like Ponch and Jon, making development challenges like this both efficient and fun.

Update: I neglected to mention in the article that when faced with this web development challenge we did end up using method #2 in our web-based application, Intervals. We were able to do it without too much of a performance hit. However, if performance did become an issue, I wouldn’t hesitate to implement method #3.

Photo credit: Eric Fischer

Intervals blog updates in your inbox!

Lear

5 Responses to “Calculate the number of lines in a CSV file using PHP and Linux… without loading the file into application memory”

  1. Daniel Lyons says:

    I wouldn’t recommend getting into the habit of accessing the shell, particularly Unix utilities, from PHP. Running things from the shell via PHP introduces two new problems you don’t need to have: constructing secure, syntactically correct command lines, and managing implicit dependencies between your web app and your executing environment.

    In principle, I agree that you should use the right tool for the job, and in principle, Unix utilities are often the right tool for the job in Unix, but version #2 strikes me as the right balance. You may be accessing the entire file, but at least you never have more than one line in memory at a time. Version #2 is also portable across OSes; you can run that in PHP on IIS on Windows, for example.

    You also have to balance the overhead of creating a new process, and the number of processes you have running on your web host. File descriptors are a fixed quantity under most Unixes, as are process identifiers. It’s easy to accidentally use more of these resources than you think you are. There’s also a CPU overhead to launching a new process that I expect would defeat the performance improvement in the case of many small files.

    By using a pipeline, you’re making an implicit assumption about the type of shell context PHP is going to give you. It’s probably a safe assumption, and it’s probably safe to assume you have Perl and wc installed, but I would hesitate to recommend this practice for the general case. What if you’re assuming that a GNU version of the utility is underneath? You won’t know it doesn’t work until you test on your new BSD box. What if the system administrator upgrades the system and the flags you’re using have changed, or the utility is uninstalled because there don’t appear to be any users? To rely on this practice, you would have to work closely with your system administrator to ensure that your assumptions are safe and stable. If you’re running your own box, it’s less of an issue, but are you going to remember this dependency when you’re upgrading the app or moving it to another machine?

    By far the biggest concern for me is that a small mistake in escaping can quickly turn into handing an attacker a shell on your machine. Escape semantics for most Unix shells are very complex. In recommending this practice, I would definitely ensure that *no* user-supplied data winds up on the command line. In this case, you are safe if you’re dealing with the file name PHP gives you from (say) a file upload. Attackers are very persistent, and if you make a bad assumption about some piece of text not having quotes or backslashes, you could be handing keys to your server over to the anonymous web.

    In short, I would hesitate to do this, unless I knew that A) the CSV files are huge, B) there are few users, and C) users exert no control over the command line, and D) you can make strong guarantees about your app’s portability.

  2. jw says:

    There’s some discussion about this at Reddit: see http://www.reddit.com/r/programming/comments/i5h11

    Specifically, why would you execute that regular expression using Perl, and not in PHP directly? Doing this in PHP, you’d save the overhead of starting another process (Perl), only for the small task of replacing \r\n|\r with \n. Wouldn’t it be much faster and easier to just do this in PHP, thereby eliminating the dependency on Perl?

  3. John Reeve says:

    @Daniel, those are all very good points and should certainly be taken into consideration by any web developer. That’s why I alluded to the exec() function as a dangerous function. It needs to be used with extreme caution.

    @jw, as with method #2, this can be done entirely in PHP without using a regex at all. If the file were read in to a string and then regex was run on that, it would consume a lot of memory. While it is easier to do this in PHP, there are scenarios where it may not be faster. This is where perl and wc become helpful.

  4. Asaph says:

    CSV data can legitimately contain newlines and carriage returns within quoted values, not just a s line terminators, which means that all of the above mentioned methods will give you false record counts on CSV files containing such values. The _only_ way to accurately count the number of records in a CSV file is to parse it with a CSV parser. In PHP, this can be done with the help of the fgetcsv() function.

  5. Jeremy Freedman says:

    If you want to use the UNIX subshell in PHP just use intval(system(“wc -l fimename.csv”)) and the intval will stop once it hits a non numeric character or space

Leave a Reply

The Intervals Blog
A collection of useful tips, tales and opinions based on decades of collective experience designing and developing web sites and web-based applications.

What is Intervals?

Intervals is online time, task and project management software built by and for web designers, developers and creatives.
Learn more…

Contributor Profile
John Reeve

John is a co-founder, web designer and developer at Pelago. His blog posts are inspired by everyday encounters with designers, developers, creatives and small businesses in general. John is an avid reader and road cyclist.
» More about John Reeve
» Archived posts by John Reeve

Contributor Profile
Michael Payne

Michael is a co-founder and product architect at Pelago. His contributions stem from experiences managing the development process behind web sites and web-based applications such as Intervals. Michael drives a 1990 Volkswagen Carat with a rebuilt 2.4 liter engine from GoWesty.
» More about Michael Payne
» Archived posts by Michael Payne

help.myintervals.com
Videos, tips & tricks