Sunday, August 6, 2023

How to Batch Import Huge CSV Fast using PHP (Million Records in Seconds)

I know, I know! How can you put PHP and Fast in the same line? I would say, you are stuck in the past. You should try PHP 7 and experience it yourself. There will be a lot of criticism for this article. Kindly read through.

LOAD DATA INFILE is the best option to import a huge CSV file. In this article we are going to see how we are going to read a CSV file and INSERT records via PHP script. This can be used in such a special case.

In general PHP will not have such use cases. But you never know, when you will come across such a situation, a client requirement can come any time. When you are a “freelance programmer” (not playing a consultant role), you will be thrust upon with all sorts of crazy requirements.

Huge CSV Import PHP

I have written an article about all things CSV in PHP. You should go through that to learn about handling CSV files in PHP.

I am well aware about the micro-benchmarking and its pitfalls. So, I didn’t go in that direction for this articles. That’s why I have given in the title as just seconds. It’s an approximation. This is not for people who are going to see in micro, nano seconds.

If you want to insert a million records in few seconds, using PHP script programmatically, then this will definitely help. The use case is rare and so the solution. You may have to do some tweaks here and there in this sample code to fit for your case.

<?php

function file_get_contents_chunked($link, $file, $chunk_size, $queryValuePrefix, $callback)
{
    try {
        $handle = fopen($file, "r");
        $i = 0;
        while (! feof($handle)) {
            call_user_func_array($callback, array(
                fread($handle, $chunk_size),
                &$handle,
                $i,
                &$queryValuePrefix,
                $link
            ));
            $i ++;
        }
        fclose($handle);
    } catch (Exception $e) {
        trigger_error("file_get_contents_chunked::" . $e->getMessage(), E_USER_NOTICE);
        return false;
    }

    return true;
}
$link = mysqli_connect("localhost", "root", "pass", "huge-csv");
$success = file_get_contents_chunked($link, "sample-dataset.csv", 2048, '', function ($chunk, &$handle, $iteration, &$queryValuePrefix, $link) {
    $TABLENAME = 'tbl_lead';
    $chunk = $queryValuePrefix . $chunk;

    // split the chunk of string by newline. Not using PHP's EOF
    // as it may not work for content stored on external sources
    $lineArray = preg_split("/\r\n|\n|\r/", $chunk);
    $query = 'INSERT INTO ' . $TABLENAME . '(id, name, email) VALUES ';
    $numberOfRecords = count($lineArray);
    for ($i = 0; $i < $numberOfRecords - 2; $i ++) {
        // split single CSV row to columns
        $colArray = explode(',', $lineArray[$i]);
        $query = $query . '(' . $colArray[0] . ',"' . $colArray[1] . '","' . $colArray[2] . '"),';
    }
    // last row without a comma
    $colArray = explode(',', $lineArray[$i]);
    $query = $query . '(' . $colArray[0] . ',"' . $colArray[1] . '","' . $colArray[2] . '")';
    $i = $i + 1;

    // storing the last truncated record and this will become the
    // prefix in the next run
    $queryValuePrefix = $lineArray[$i];
    mysqli_query($link, $query) or die(mysqli_error($link));

    /*
     * {$handle} is passed in case you want to seek to different parts of the file
     * {$iteration} is the section of the file that has been read so
     * ($i * 4096) is your current offset within the file.
     */
});

if (! $success) {
    // It Failed
}

Two main things to note are,

  1. Read the file in chunks (batch).
  2. Insert multiple records in a single insert statement.

The above is key in speeding up the overall process. Reading line by line and iterating through a loop will slowdown the process. So everything boils down to reading chunks (batches) and multi insert. Then a third point worth mentioning is use PHP native functions wherever possible.

I have used a Regex to replace newline in the CSV file. If you got a better option please suggest via comments section below.

Now let me walkthrough the code. 

PHP’s fread allows to read in chunks of strings. Try experimenting with different chunk (batch) sizes. There is no particular right size. There are lots of variables, your server configuration, hardware, MySQL setup and lot more.

I have used a sample CSV file which I generated myself. I will detail the process below. You can use your own sample or real data and pass that as a parameter.

file_get_contents_chunked does the processing of CSV file and this has a callback function as the last argument. This callback takes care of parsing the record by the delimiter (comma in my example) and creating the multi-insert query and doing the actual insert.

You may have to modify the query part to suit your database table format. The overall structure of the script takes care of parsing the CSV in chunks (batches) and callbacks. Thanks to RobertPitt

One thing worth mentioning is, instead of reading line by line I have used the chunk as it is to improve the speed. All through the Internet users have suggested fgetcsv. But I have gone ahead with chunk (batch) read of fread. 

The fread when reading as chunk (batch) it will obviously have part a part of a CSV row truncated. I have stored that truncated last record in each chunk in $queryValuePrefix variable. It is retained by declaring as a reference via callback. 

Sample CSV dataset

I wrote a tiny PHP script to generate the required CSV dataset. There are a lot of sources like Government census record, now popular Covid data, weather data and lot more.

At this time of Artificial Intelligence era, where Data Mining is more popular than Twinkle Twinkle Little Star, getting a sample huge CSV file is just a click away.

But still for flexibility, I wrote the PHP script myself. 

<?php
$file = fopen('s-sample-dataset.csv', 'w');

for ($i = 1; $i <= 1000000; $i ++) {
    echo $i;
    $line[] = $i;
    $line[] = uniqid();
    $line[] = uniqid();
    fputcsv($file, $line);
    $line = null;
}
fclose($file);

This is a ruthless way of generating a sample data :-) At this juncture Faker is  worth mentioning. It is a good PHP library for generating nice fake data. It will look very close to real.

Last, for the nerds. This script ran and imported one million (1000000) records in 9 seconds. My machine is a MacBookPro with Catalina, 2.3 GHz i5, 16 GB RAM and PHP 7.4. This is just for an idea and please do not make much out of this.

Pour in all your view points and suggestions for improvement in the below comments. Happy parsing all your CSVs, chunk it out!

Comments to “How to Batch Import Huge CSV Fast using PHP (Million Records in Seconds)”

 

This article is taken from below website.

Credit goes to: Nikunj Jadav

SOURCE: https://phppot.com/php/how-to-batch-import-huge-csv-fast-using-php-million-records-in-seconds/

No comments:

Post a Comment

Please Comment Here!

How to backup and download Database using PHP

< ?php $mysqlUserName = 'databaseusername' ; $mysqlPassword = 'databasepassword' ; $mysqlHostNa...