When building full-text search for uploaded documents, we needed to extract text page-by-page from PDFs so we could index each page as a separate chunk. The naive approach worked but was painfully slow. Here's how a single Unix insight cut it down to one process spawn.
The problem: N+1 process spawns
pdftotext, which ships with Poppler (and with Xpdf before it), is the usual command-line tool for extracting text from PDFs. It supports -f (first page) and -l (last page) flags, so extracting a single page looks like this:
pdftotext -f 3 -l 3 document.pdf -
The natural implementation for paged extraction is to call this in a loop:
public function getPagedTextFromFile(string $path): Collection
{
    $pagedText = collect();
    $pageCount = $this->pdfInfoService->getPageCountFromFile($path);
    for ($i = 1; $i <= $pageCount; $i++) {
        $pagedText->put($i, $this->getRawTextFromFile($path, $i));
    }
    return $pagedText;
}
This is clean and obvious. It is also, for any document with real content, expensive:
- 1 pdfinfo call to get the page count
- N pdftotext calls, one per page
Each call is a separate process spawn: the OS forks, loads the binary, opens the PDF, seeks to the requested page, extracts text, and exits. For a 50-page contract, that's 51 process spawns. On a server handling concurrent uploads, those 51 spawns happen in sequence, blocking the queue worker the entire time.
The insight: pdftotext already separates pages
Run pdftotext without page flags and pipe the output through cat -A (GNU coreutils), which makes control characters visible:
pdftotext document.pdf - | cat -A | grep -F '^L'
You'll see form feed characters (\f, \x0C, ASCII 12), which cat -A renders as ^L, separating each page. This is long-standing default behaviour, and the man page documents it under the -nopgbrk option, which switches the form feeds off.
That means the full multi-page text is already structured. We don't need N calls. We need one call and a string split.
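To make "one call and a string split" concrete, here is a minimal sketch that works on a hard-coded string instead of real pdftotext output. The splitPages helper and the sample text are made up for illustration; only the \f separator comes from pdftotext itself.

```php
<?php
declare(strict_types=1);

/**
 * Split raw pdftotext output into pages keyed by 1-based page number.
 * Hypothetical helper for illustration; not part of the original service.
 *
 * @return array<int, string>
 */
function splitPages(string $rawOutput): array
{
    // Drop trailing form feeds so the last page does not yield an empty element.
    $chunks = explode("\f", rtrim($rawOutput, "\f"));

    $pages = [];
    foreach ($chunks as $index => $text) {
        // explode() is 0-based, page numbers are 1-based.
        $pages[$index + 1] = trim($text);
    }

    return $pages;
}

// Simulated output for a three-page document (assumed sample, not real data).
$sample = "First page\n\fSecond page\n\fThird page\n\f";
$pages = splitPages($sample);

print_r($pages);
```

The page number falls out of the array index, which is why no separate pdfinfo call is needed to know how many pages there are.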
The solution
public function getPagedTextFromFile(string $path): Collection
{
    $pagedText = collect();
    try {
        if ($this->mimeTypeForPath($path) !== 'application/pdf') {
            return $pagedText;
        }
        $result = $this->runExternalProcess(
            [config('pdftools.pdftotext'), $path, '-'],
            180
        );
        if ($result->getStdErr() !== '') {
            throw new Exception($result->getStdErr());
        }
        // pdftotext separates pages with \f; rtrim strips any optional trailing \f
        $pages = explode("\f", rtrim($result->getStdOut(), "\f"));
        foreach ($pages as $i => $pageText) {
            $pageText = trim($pageText, " \t\n\r\0\x0B");
            if ($pageText !== '') {
                $lines = $this->splitInLines($pageText);
                if (!empty($lines) && preg_match(self::DOCUSIGN_HEADER_PATTERN, $lines[0])) {
                    array_shift($lines);
                    $pageText = trim(implode("\n", $lines), " \t\n\r\0\x0B");
                }
            }
            $pagedText->put($i + 1, $pageText !== '' ? $pageText : null);
        }
    } catch (Exception $e) {
        Log::error("pdftotext | {$e->getMessage()}");
    }
    return $pagedText;
}
Before: 51 process spawns for a 50-page PDF.
After: 1 process spawn, regardless of page count.
For a 100-page document the old approach invoked pdftotext 100 times (plus one pdfinfo call), each time reopening the file and re-reading its structure just to reach a single page. The new approach opens it once and returns everything.
Edge cases worth knowing
Trailing form feed. Some versions of pdftotext append a \f after the last page. Splitting "page1\fpage2\f" naively gives ["page1", "page2", ""] — an extra empty element. The rtrim($output, "\f") before splitting removes it cleanly.
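A quick way to convince yourself of the split behaviour is to count elements on tiny inputs. The sketch below uses invented strings and assumes a pdftotext build that terminates every page with \f. It also shows one caveat: rtrim removes every trailing \f, so blank pages at the very end of a document are dropped along with the phantom element. Stripping at most one \f (the preg_replace variant, which is our suggestion and not what the method above does) would keep them.

```php
<?php
declare(strict_types=1);

$output = "page1\fpage2\f";

// Naive split: the trailing \f produces a phantom empty third element.
$naive = explode("\f", $output);

// Stripping trailing form feeds first gives exactly one element per page.
$trimmed = explode("\f", rtrim($output, "\f"));

// Caveat: rtrim removes *every* trailing \f, so blank pages at the very end
// of a document disappear as well. Removing at most one \f keeps them.
$withBlankTail = "page1\f\f\f"; // page 1 followed by two blank pages
$viaRtrim = explode("\f", rtrim($withBlankTail, "\f"));
$viaSingleStrip = explode("\f", preg_replace('/\f\z/', '', $withBlankTail));

var_dump(count($naive), count($trimmed), count($viaRtrim), count($viaSingleStrip));
```

For search indexing the difference is harmless, because trailing blank pages would be stored as null and skipped anyway. It only matters if something downstream relies on the collection size matching the real page count.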
Blank pages. A blank page produces an empty string after trimming. Storing null for it preserves correct page numbering for subsequent pages (page 5 stays page 5, even if pages 3 and 4 are blank). The downstream indexing code skips nulls, so blank pages don't pollute the search index.
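The numbering argument is easier to see with data. This is a small sketch with invented page contents: numberPages is a hypothetical helper, and the array_filter call stands in for whatever the downstream indexer really does.

```php
<?php
declare(strict_types=1);

/**
 * Map raw page chunks to 1-based page numbers, storing null for blank pages.
 * Hypothetical helper mirroring the behaviour described above.
 *
 * @param string[] $chunks
 * @return array<int, string|null>
 */
function numberPages(array $chunks): array
{
    $pages = [];
    foreach ($chunks as $index => $text) {
        $text = trim($text);
        $pages[$index + 1] = $text !== '' ? $text : null;
    }

    return $pages;
}

// Pages 3 and 4 are blank (whitespace only, or completely empty).
$pages = numberPages(["Intro", "Terms", " \n", "", "Signatures"]);

// What a downstream indexer might do: skip nulls but keep the original keys.
$indexable = array_filter($pages, fn (?string $text): bool => $text !== null);

print_r(array_keys($indexable));
```

Because array_filter preserves keys, "Signatures" is still indexed as page 5 even though only three pages carry text.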
Page-level headers. We strip DocuSign envelope headers (DocuSign Envelope ID: XXXXXXXX-...) from the beginning of any page that has one. Since the single-call output is split by page before this check, the per-page stripping logic is identical to the per-call approach — just applied after the split instead of inside each getRawTextFromFile call.
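The per-page stripping can be sketched as a standalone function. The real DOCUSIGN_HEADER_PATTERN is not shown in this post, so the regex below is an assumption, as are the function name and the sample envelope ID.

```php
<?php
declare(strict_types=1);

// Assumed pattern: the real DOCUSIGN_HEADER_PATTERN is not shown in the post.
const HEADER_PATTERN = '/^DocuSign Envelope ID:\s*[0-9A-F-]+\s*$/i';

/**
 * Remove a leading DocuSign envelope header line from one page, if present.
 * Hypothetical helper; the original does this inline after the split.
 */
function stripEnvelopeHeader(string $pageText): string
{
    $lines = preg_split('/\R/', trim($pageText));

    if ($lines !== false && $lines !== [] && preg_match(HEADER_PATTERN, $lines[0]) === 1) {
        // Only the first line is checked, so matching text further down stays.
        array_shift($lines);

        return trim(implode("\n", $lines));
    }

    return trim($pageText);
}

// Invented sample page with a made-up envelope ID.
$page = "DocuSign Envelope ID: 1A2B3C4D-0000-4E5F-8A9B-123456789ABC\nMaster Services Agreement\nPage 1";

echo stripEnvelopeHeader($page), "\n";
```

Keeping the check on the first line only is deliberate: a contract that quotes an envelope ID in its body text should not lose that line.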
Non-PDF files. The MIME type check at the top of the method returns early for anything that isn't application/pdf. This replaces the earlier dependency on pdfinfo to get the page count — for non-PDFs that check would have returned 0 and short-circuited the loop, but the MIME check is simpler and removes the pdfinfo dependency from this path entirely.
The broader pattern
This is an instance of a general optimisation: if a tool is designed to process a whole file, don't call it once per chunk. The tool already knows how to walk the file efficiently; let it.
The same principle applies to:
- ffprobe for video metadata: one call for all streams, not one call per stream
- exiftool: batch mode processes a directory in one pass rather than per-file invocations
- Database queries: SELECT with IN (...) instead of N individual selects
- Meilisearch document uploads: one addDocuments call with a batch, not one call per document
In each case the per-item call pattern feels natural and is easy to reason about. But the overhead of repeatedly invoking a tool that was designed for whole-file processing accumulates fast once documents are large or queues are busy.
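The database case is the easiest of these to sketch without any external tooling. The helper below builds a single parameterised query for a batch of ids. The function name, table, and columns are hypothetical, and this is only one way to do it.

```php
<?php
declare(strict_types=1);

/**
 * Build one parameterised SELECT ... IN (...) for a batch of ids.
 * Table and column names are hypothetical.
 *
 * @param int[] $ids
 * @return array{0: string, 1: int[]} SQL string and the parameters to bind
 */
function buildBatchSelect(array $ids): array
{
    if ($ids === []) {
        // An empty IN () list is a syntax error in most databases.
        throw new InvalidArgumentException('At least one id is required.');
    }

    // De-duplicate and reindex so placeholders and parameters line up.
    $ids = array_values(array_unique($ids));
    $placeholders = implode(', ', array_fill(0, count($ids), '?'));

    return ["SELECT id, title FROM documents WHERE id IN ($placeholders)", $ids];
}

[$sql, $params] = buildBatchSelect([3, 7, 7, 12]);

echo $sql, "\n";
```

With PDO, the pair would be used as $pdo->prepare($sql) followed by execute($params): one round trip instead of one per id.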
The fix, when it exists, is usually as simple as this one: read the man page, find the output format, split a string.