Why copying from PDF goes wrong
1. The illusion of text: why we see something that isn’t really “written”
When we open a PDF on screen, everything looks normal: text arranged in paragraphs, with headings, columns, quotes… even underlines and footnotes. What we see looks like an editable document, as if it were Word or Google Docs. But it isn't.
PDF was not designed to be a source of text. It was designed to preserve the visual appearance of a document, not its logical structure. Internally, there is no clear notion of paragraph or reading order. What you see as a continuous block of text may actually be fragmented into dozens of drawing instructions scattered across the file, each with its own font, size, and XY position.
Copying from a PDF, therefore, is not copying text: it’s reconstructing an illusion. That illusion often breaks the moment we paste the copied content into another environment. Unexpected line breaks appear, words get split, strange characters show up. Not because there are “errors,” but because there was never a true textual structure to begin with.
A PDF is like a postcard: you can read what’s printed, but you can’t peel the letters off the cardboard.
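This fragment-level structure is easy to see for yourself. Here is a minimal sketch using PyMuPDF (the fitz module; example.pdf is a placeholder for any PDF you have at hand):

```python
import fitz  # PyMuPDF: pip install pymupdf

doc = fitz.open("example.pdf")
page = doc[0]

# Each entry is one positioned fragment:
# (x0, y0, x1, y1, text, block_no, line_no, word_no)
for x0, y0, x1, y1, word, *_ in page.get_text("words")[:10]:
    print(f"({x0:6.1f}, {y0:6.1f})  {word}")
```

Even a simple paragraph typically dissolves into dozens of these coordinate-stamped pieces.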
2. When the “error” is the footprint of PDF architecture
The distortions that appear when copying from or exporting a PDF (unwanted line breaks, strange symbols, jumbled sentences) are not random glitches. They are the result of three overlapping internal layers that make up every PDF document:
| Internal layer | What it controls | Extraction symptom |
|---|---|---|
| A. Visual fragmentation | How text is distributed on the page (coordinates, hyphens, indentation) | Split words, broken paragraphs |
| B. Encoding and glyph mapping | Which Unicode (or non-Unicode) codes are assigned to each character | Invisible spaces, "garbage characters", PUA fonts |
| C. Logical layout | The reading order of blocks (table rows, columns, footnotes) | Mixed columns, disordered tables, reversed RTL text |
Each layer solves a different problem for the page designer; together, they explain why the text we see doesn’t match the text we copy.
We’ll now take a brief look at each of them.
A. Visual fragmentation induced by layout
In a PDF, “paragraphs” do not exist in a logical sense: what actually exists are dozens (or hundreds) of text fragments positioned using X-Y coordinates, each drawn like a small sticker on the page canvas. This purely graphical architecture causes five characteristic symptoms when we try to extract the content:
- Line breaks turn into paragraph breaks
The extractor inserts a "paragraph end" character whenever the Y coordinate changes beyond a certain threshold: perfect for screen viewing, disastrous for text flow.
- Actual paragraphs merged into a single block
The opposite can also happen: if successive blocks sit too close together, the algorithm assumes they form one continuous line and merges what were originally separate paragraphs.
- Line-ending hyphens preserved as part of the word
The original layout inserts a visible hyphen or a soft hyphen to split the word. When copied, the hyphen travels with the text or remains hidden as U+00AD, breaking searches and matches.
- Indents and bullets cause false carriage returns
Indented lists are often composed of two separate objects: the marker (•, -, 1.) and the line of text. The extractor inserts a break before or after the marker when it cannot group them.
- Ligatures misdecoded in the text flow
Although the "fi" or "fl" ligature is a single glyph, in the PDF it is placed like just another independent sticker. Some extractors fail to map it back to "f" + "i" and simply drop it, cutting words apart.
Together, these five phenomena form the "visual fragmentation" layer: the text looks continuous, but its internal representation is fragmented by design decisions (columns, hyphens, bullets, line spacing). What appears when copying or exporting is not a viewer bug but the logical result of this purely graphical model.
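Some of this fragmentation can be repaired after extraction with simple heuristics. A minimal cleanup sketch in Python; the rules are illustrative, not a general solution, and will occasionally join hyphens that genuinely belong to a word:

```python
import re

def clean_extracted(text: str) -> str:
    # Drop soft hyphens (U+00AD) left over from line-end hyphenation
    text = text.replace("\u00ad", "")
    # Join words split by a visible line-end hyphen ("frag-\nment" -> "fragment");
    # naive: also removes legitimate hyphens that fall at a line break
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse single line breaks into spaces, keep blank lines as paragraph breaks
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    return text
```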
B. Ambiguous encoding and internal text manipulation
A PDF can look flawless on screen and yet carry deeply altered text encoding underneath. The reason is that the format allows each glyph to be mapped to any Unicode value (or none at all), allows multiple overlapping text layers, and allows fragments of text to be replaced by bare vector shapes. This gives rise to the following phenomena, just as common and just as corrosive for any attempt at reliable extraction:
- Merged words due to missing spaces
Some PDF generators omit the space character entirely and rely on X-Y positioning to "draw" the separation. When copied, the words run together with no way to tell where one ends and the next begins.
- Unreadable text or "garbage characters"
When an embedded font uses a non-standard internal map, each glyph is assigned an arbitrary code. Copying may yield a string of random symbols (✓, ®, ¤) instead of the original text.
- Invisible characters that break search functionality
Soft hyphens, zero-width spaces, or non-breaking spaces are inserted for layout control. Once on the clipboard, they split words or block matches without the user realizing.
- Fonts with custom encoding (PUA) that override Unicode mapping
Instead of mapping "a" to U+0061, a PDF may assign it to a codepoint in the Unicode Private Use Area. The viewer displays the correct shape, but copying retrieves a strange code no search tool can read.
- Text converted into vectors or images
To guarantee visual fidelity (or to prevent copying), the creator may trace each letter as an outline or rasterize the whole paragraph. To an extractor, there is no longer any recoverable text.
- Duplicate layers after unnecessary OCR
When OCR is applied to a PDF that already contains digital text, some programs add a second "ghost" layer. The result is duplicated or interleaved phrases on export.
- Deliberate obfuscation to prevent copying
Some paid documents use JavaScript, substituted glyphs, or a scrambled logical order to make copied text meaningless: a homemade but effective "anti-bot" strategy.
Together, these mechanisms form the ambiguous encoding layer: the text exists, but its representation is distorted, duplicated, or wrapped in forms a conventional extractor can’t parse. Raw copying here is like translating a language that follows neither Unicode rules nor human reading logic.
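Many of these encoding problems are invisible on screen but easy to detect programmatically. A small audit sketch over already-extracted text:

```python
SUSPECTS = {
    "\u00ad": "soft hyphen",
    "\u200b": "zero-width space",
    "\u00a0": "non-breaking space",
    "\ufb01": "fi ligature",
    "\ufb02": "fl ligature",
}

def audit(text: str) -> None:
    for ch, name in SUSPECTS.items():
        count = text.count(ch)
        if count:
            print(f"{name} (U+{ord(ch):04X}): {count}")
    # Private Use Area codepoints usually betray a custom font encoding
    pua = sum(1 for c in text if 0xE000 <= ord(c) <= 0xF8FF)
    if pua:
        print(f"PUA characters: {pua} (custom glyph mapping likely)")
```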
C. Non-linear layout and incorrect content reordering
In a PDF, the graphical position dictates the reading flow. When the layout is complex —tables, columns, headers, footnotes— the extractor must infer what comes first, and often gets it wrong. These are the most common issues:
- Tables read by columns instead of rows
The extractor concatenates cells down each column (column-major) instead of across each row, so the output reads "Row 1 Col 1, Row 2 Col 1, Row 1 Col 2, Row 2 Col 2…". Horizontal record order is lost and the table becomes incoherent.
- Fragments mixed by adjacent column overlap
In two-column layouts, if the gutter is narrow, the last line of the left column "jumps" into the right column (or vice versa), producing sentences stitched together from different columns.
- Headers, footers, and notes interrupting the main body
Peripheral elements (page numbers, running headers, footnote markers) are inserted wherever their X-Y coordinates place them, disrupting the main flow of text.
- Reversed order in bidirectional (RTL) writing systems
In Arabic or Hebrew, some extractors mishandle the Bidi algorithm: sequences are read left-to-right and ligatures are broken, altering the intended meaning.
- Vertical spacing misread as cell or column breaks
A blank area used for visual spacing is mistaken for a semantic divider; the following text is treated as a new column and reordered incorrectly.
- Paragraphs scrambled by mismanaged page breaks
If the last line of one page and the first line of the next form a single sentence, conversion may reverse or duplicate them.
- Numbered headings mistaken for list items
"1 Introduction" and "2 Methodology" are interpreted as list items; the extractor groups the content beneath them, altering the document's hierarchy.
Together, these issues form the non-linear layout layer: the text looks continuous, but the blocks that compose it have been repositioned or merged upon extraction, making subsequent analysis or search much harder.
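The root of the problem is that most extractors fall back on a geometric heuristic. A naive sketch with PyMuPDF makes the failure mode obvious: sorting fragments top-to-bottom, then left-to-right works for a single column, but interleaves lines as soon as two columns share the same vertical band:

```python
import fitz  # PyMuPDF

def naive_reading_order(path: str, page_no: int = 0) -> str:
    page = fitz.open(path)[page_no]
    words = page.get_text("words")  # (x0, y0, x1, y1, text, block, line, word)
    # Geometric order: top-to-bottom, then left-to-right.
    # In a two-column layout this stitches together lines
    # from both columns at the same height.
    words.sort(key=lambda w: (round(w[1]), w[0]))
    return " ".join(w[4] for w in words)
```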
3. Practical guide: which tool to use for each issue
Not all errors are fixed the same way. Some tools specialize in cleaning hyphens and ligatures; others focus on recovering tables, rebuilding logical structure, or extracting annotations only. The key is to identify what type of distortion the PDF contains —visual fragmentation, faulty encoding, or disordered layout— and apply the most suitable tool for that case.
The following practical guide offers a quick reference based on the symptom detected.
Are tables being mixed up or read incorrectly?
👉 Tabula, Camelot, or PDFTables allow you to extract tables in a clean format (CSV, Excel), avoiding disordered rows or mixed-up columns. Camelot offers more technical control for complex PDFs.
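A minimal Camelot sketch (report.pdf is a placeholder; the "lattice" flavor suits tables with ruled lines, "stream" suits whitespace-separated ones):

```python
import camelot

tables = camelot.read_pdf("report.pdf", pages="1", flavor="lattice")
print(tables[0].parsing_report)       # accuracy / whitespace metrics
tables[0].to_csv("report_table.csv")  # rows stay rows, columns stay columns
```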
Text with visual errors when copying (hyphens, ligatures, strange symbols)?
👉 PDFlib TET and Apryse PDF2Text detect end-of-line hyphens, typographic ligatures, or invisible characters and fix them automatically. Ideal for bulk cleanup and faithful results.
Text cannot be selected because it's an image?
👉 ABBYY FineReader applies high-precision OCR and rebuilds the content as editable text while preserving the original layout. It's also useful if the PDF contains duplicated text due to overlapping layers.
Only need to extract comments or highlighted text?
👉 Foxit Reader and Sumnotes allow you to export only what is highlighted or annotated, without copying the rest of the content. Very useful when the main text is corrupted or disordered.
Want to insert PDF comments directly into the source document?
👉 InsertMyComments.com goes one step further: instead of exporting comments, it inserts them automatically into the source document (Word, Markdown…), respecting both grammar and original location. Ideal for authors, editors, or proofreaders working with long or technical documents.
Looking for a complete—but not magical—solution?
👉 Adobe Acrobat Pro tries to reconstruct reading flow and preserve formatting, but doesn't always handle complex columns or hidden hyphens well. Suitable for moderate cases with good internal structure.
Just want raw text, fast and without extras?
👉 pdftotext (Poppler) extracts plain content in a scriptable and fast way. Perfect for batch processing, though it ignores notes, tables, and structure.
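For scripted pipelines it is often called from code. A sketch assuming pdftotext is on the PATH and a file named input.pdf exists:

```python
import subprocess

# "-layout" preserves the visual column layout; omit it to let
# pdftotext rebuild a single reading flow. "-" writes to stdout.
result = subprocess.run(
    ["pdftotext", "-layout", "input.pdf", "-"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)
```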
Need to clean extracted text from hidden hyphens or strange characters?
👉 qpdf + iconv scripts help eliminate invisible characters and normalize encoding. Useful as an intermediate step in technical workflows.
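The same normalization can be done in pure Python when iconv is not available; a sketch:

```python
import unicodedata

def normalize(text: str) -> str:
    # NFKC folds ligature glyphs such as U+FB01 ("fi") back into "f" + "i"
    text = unicodedata.normalize("NFKC", text)
    # Replace non-breaking spaces, delete soft hyphens and zero-width spaces
    text = text.replace("\u00a0", " ")
    return text.translate({0x00AD: None, 0x200B: None})
```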
4. Conclusion
Errors when copying text from a PDF are not random glitches but predictable consequences of how the format is built. Understanding the internal layers that cause them (visual fragmentation, ambiguous encoding, non-linear layout) not only helps diagnose symptoms more accurately, it also lets you choose the right tool for each situation. There is no universal solution, but there are effective combinations for each type of distortion. This text is an initial guide to that technical, often invisible, yet fundamental territory in any document-based workflow.