Loops are so common in programming that we often don’t give them a second thought, but a buggy loop has potentially infinite destructive power as it mindlessly repeats its task however many times it’s asked.
Back in the good or bad old days, depending on your perspective, the for
loop with its manual index management was often the only game in town, with the occasional while
loop thrown in for good measure. It was our responsibility to make sure those loops actually terminated by using explicit control conditions. These days, though, with functional programming principles in wide use, we enjoy a rich collection of functions that iterate through arrays, maps, sets, etc., making infinite loops rare.
With power comes responsibility, though. Can you spot what’s wrong with the following piece of code that filters application log entries coming from a database?
const entries = await getLogEntries(startDate, endDate, logLevel);
const results = entries.filter(entry => entry?.service === serviceName);
If you’ve guessed potential memory or performance issues, then you’re absolutely right. If getLogEntries
returns millions of log entries, the application can run out of memory before we get to the loop. Even if memory isn’t exhausted, iterating through a massive array with the filter
method can take a lot of time. This could block JavaScript’s own event loop and make the application unresponsive during the filtering process.
The first solution that comes to mind is to make the database do the filtering, but if we are unable to do that, we can try switching to chunked processing, which should look similar to the code below, assuming that getLogEntries
always returns an array to simplify error handling:
const results = [];
const chunkSize = 10000;
let offset = 0;

while (true) {
  // Fetch one chunk at a time instead of loading every entry into memory at once.
  const entries = await getLogEntries(startDate, endDate, logLevel, offset, chunkSize);
  results.push(...entries.filter(entry => entry?.service === serviceName));

  // A chunk shorter than chunkSize means we have reached the end of the result set.
  if (entries.length < chunkSize) {
    break;
  }
  offset += entries.length;
}
Unfortunately, our code is no longer as neat as it once was, but at least it can now process large numbers of log entries without issues. Or can it? If the results
array becomes too large, we still risk memory exhaustion. In cases like these, it might help to impose hard limits on the number of results returned and either return an error or state that only the first X results were returned. Reasonable limits also offer some level of protection against malicious users.
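As a rough illustration, such a cap could be added to the chunked loop along these lines; the maxResults value, the truncated flag, and the exact return shape are assumptions made for this sketch:
// A sketch of a hard cap on results; maxResults and truncated are assumptions,
// not part of the original example.
const maxResults = 50000;
const results = [];
const chunkSize = 10000;
let offset = 0;
let truncated = false;

while (true) {
  const entries = await getLogEntries(startDate, endDate, logLevel, offset, chunkSize);
  results.push(...entries.filter(entry => entry?.service === serviceName));

  if (results.length >= maxResults) {
    results.length = maxResults; // keep only the first maxResults matches
    truncated = true;            // tell the caller the list is incomplete
    break;
  }
  if (entries.length < chunkSize) {
    break;
  }
  offset += entries.length;
}
// `results` and `truncated` can now be returned to the caller, e.g. as {results, truncated}.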
Performance-wise, the above code can easily make hundreds of separate requests sequentially, which can increase the total processing time quite a bit, especially when those requests are made over a network connection.
Fortunately, JavaScript has Promise.all
to run requests in parallel, but because it offers no way to limit how many requests are in flight at once, it's all too easy to accidentally DoS your database server. Using a library such as Bluebird or p-map that lets you set a concurrency limit is a lot safer. In the following example, which uses Bluebird's Promise.map, we assume we can get the total number of log entries reasonably quickly, allowing us to fetch the chunks in parallel with a concurrency limit of 10:
const Promise = require('bluebird');

const chunkSize = 10000;
const totalEntries = await getTotalLogEntries(startDate, endDate, logLevel);
const pageCount = Math.ceil(totalEntries / chunkSize);
const allPages = Array.from({length: pageCount}, (_, i) => i);

const fetchPage = async page => {
  const entries = await getLogEntries(startDate, endDate, logLevel, page * chunkSize, chunkSize);
  return entries.filter(entry => entry?.service === serviceName);
};

// Bluebird's Promise.map runs at most 10 fetches at any one time.
const chunkedResults = await Promise.map(allPages, fetchPage, {concurrency: 10});
const results = chunkedResults.flat();
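If you'd rather avoid Bluebird, p-map offers the same concurrency control; a roughly equivalent sketch (recent p-map versions are ESM-only, hence the import) would be:
import pMap from 'p-map';

// Same fetchPage as above, with at most 10 chunks in flight at any time.
const chunkedResults = await pMap(allPages, fetchPage, {concurrency: 10});
const results = chunkedResults.flat();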
Up until now, we’ve seen how to process a large number of records without running out of memory and with reasonable performance, but there is one area where we must be extra careful: making destructive changes like database updates inside a loop, which can be difficult to revert if something goes wrong.
When batch database updates involve business logic, it's a good idea to provide dry-run functionality, not only for easier automated testing but also for manually verifying the results of an update before running it for real.
The following example demonstrates synchronizing employee data coming from an external source, such as a company’s Active Directory system, with the user data stored in an application’s database. If a user isn’t found in the database, a new record is created; if a user’s information has changed, such as the branch they belong to, their record is updated; and finally, if a user is no longer working at the company, their record is marked as disabled.
// Assumes both `users` (from our database) and `externalUsers` (from the
// external source) are Maps keyed by user id.
const usersToInsert = [];
const usersToUpdate = [];
const usersToDisable = [];

externalUsers.forEach(externalUser => {
  const existingUser = users.get(externalUser.id);
  if (existingUser) {
    const changes = getChanges(existingUser, externalUser);
    if (changes) {
      usersToUpdate.push({id: existingUser.id, changes});
    }
  } else {
    usersToInsert.push(externalUser);
  }
});

users.forEach(user => {
  if (!externalUsers.has(user.id) && user.status === 'active') {
    usersToDisable.push(user.id);
  }
});
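The getChanges helper isn't shown here; one possible shape for it, returning only the fields that differ, might look like this (the exact list of synchronized fields is an assumption for this sketch):
// Returns an object containing only the changed fields, or null when nothing differs.
// The field list is an assumption made for this sketch.
const SYNCED_FIELDS = ['name', 'email', 'branch'];

const getChanges = (existingUser, externalUser) => {
  const changes = {};
  for (const field of SYNCED_FIELDS) {
    if (existingUser[field] !== externalUser[field]) {
      changes[field] = externalUser[field];
    }
  }
  return Object.keys(changes).length > 0 ? changes : null;
};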
By storing the necessary data for database operations in their respective arrays, we effectively separate our intent from execution. This dry-run approach not only makes testing and verification easier but, perhaps even more importantly, also permits auditing past changes, assuming those changes have been logged.
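Putting the planned operations into effect could then be gated behind a dry-run flag, roughly as sketched below; insertUsers, updateUsers, and disableUsers are hypothetical helpers that would perform the actual database writes:
const applyChanges = async ({usersToInsert, usersToUpdate, usersToDisable}, {dryRun = true} = {}) => {
  // Log the planned operations in every case so they can be verified and audited.
  console.log(`Planned changes: ${usersToInsert.length} inserts, ` +
    `${usersToUpdate.length} updates, ${usersToDisable.length} disables`);

  if (dryRun) {
    return; // dry run: report what would happen without changing anything
  }

  // Hypothetical helpers performing the actual database writes.
  await insertUsers(usersToInsert);
  await updateUsers(usersToUpdate);
  await disableUsers(usersToDisable);
};

// Verify first, then run for real:
await applyChanges({usersToInsert, usersToUpdate, usersToDisable}, {dryRun: true});
await applyChanges({usersToInsert, usersToUpdate, usersToDisable}, {dryRun: false});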
Summary:
- While modern loop constructs abstract away manual index management, performance and memory issues can still arise when processing large datasets.
- Offloading work to the database is often the preferred solution: when feasible, leveraging the database's optimization capabilities is generally more efficient than filtering in application code.
- Chunked processing is well-suited for handling large datasets. It allows for processing data in manageable portions, mitigating memory and performance issues.
- Parallel execution can improve performance. However, uncontrolled parallelism can easily overwhelm downstream systems such as the database.
- Destructive operations in loops require extra caution: Separating intent from execution allows for easier testing and verification.