Despite the unequivocal superiority of TeX, one convenience of Word (or other word processors such as OpenOffice/LibreOffice) is that text can easily be exported either formatted or unformatted. For example, copying from Word and pasting with Ctrl+V yields the formatted text by default, while pasting with Ctrl+Shift+V yields the unformatted text. For the TeX system, things seem more complicated. Text copied from a PDF usually carries extra formatting artifacts, say, hyphens at line ends, or ligatures. Some PDF viewers, e.g. Adobe Reader, offer "Copy" vs "Copy With Formatting", but the result largely depends on the behavior of the PDF software. There are alternatives like the TeX package dvi-text, which converts DVI to plain text, but this still adds to the complexity.
Let's think about this use case. I write something down and make a PDF for my archive. Because I want it formatted, I typeset it in TeX instead of plain text. Later I need to fill out some forms online, for which I would prefer to copy and paste the unformatted text from the PDF. What does the TeX community recommend for such a task? Thanks in advance.
Update 1. This is a comparison between TeXLive 2017 (pdfTeX v1.40.18) and 2025 (pdfTeX v1.40.27), following @Mico's suggestion. The test TeX file is
\documentclass[11pt,a4paper]{article}
\usepackage{lipsum}
\begin{document}
\lipsum[1]
\end{document}
Below is what the PDFs produced by 2017 vs 2025 look like:
- A. PDF by TeXLive 2017 (image)
- B. PDF by TeXLive 2025 (image)
With the same PDF reader (Acrobat v2025.001.20474), selecting the text, right-clicking "Copy", and then pasting with Ctrl+V into Word (365, up-to-date version) produces the following snapshots:
- C. TeXLive 2017 pasted into Word; highlighted are the hyphenated words (image)
- D. TeXLive 2025 pasted into Word; highlighted are the words that are supposed to be hyphenated (image)
Note that the page layout is set to landscape to avoid word wrap, so that each line in C and D is truly an individual line.
The result bears out @Mico's observation: with the PDF from TeXLive 2025, the extra hyphens are recognized and stripped automatically. But this worked only in Adobe, not in TeXworks or SumatraPDF, as far as I tested on Windows 10 (v2009). And even Adobe cannot handle the extra line endings, another annoyance that I share with @cfr.
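For comparison, one blunt source-side workaround would be to switch hyphenation off entirely, so that there are no line-end hyphens for any viewer to strip. The following is only a sketch and not something suggested in the comments: it assumes the hyphenat package with its none option is acceptable for the document, and it adds \sloppy (also my assumption) to loosen interword spacing in exchange.

```latex
% Variant of the test file above.
% Assumption: hyphenation is switched off at the source, so no line-end
% hyphens are written into the PDF in the first place.
\documentclass[11pt,a4paper]{article}
\usepackage{lipsum}
\usepackage[none]{hyphenat} % hyphenat package, option "none": no automatic hyphenation
\begin{document}
\sloppy % tolerate looser interword spacing to compensate for the missing hyphenation
\lipsum[1]
\end{document}
```

This trades typographic quality for cleaner copy&paste, and it does nothing about the extra line endings.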
Update 2. There have been several similar questions, in addition to the one shared by @Marijn (thanks). Some were asked 10 years ago...
- https://stackoverflow.com/questions/66602858/how-do-word-breaking-hyphens-work-with-copy-paste-in-latex-pdfs
- Copy from PDF without line breaks at end of each line
- https://forum.pdf-xchange.com/viewtopic.php?t=34108
- https://github.com/typst/typst/issues/5625
The last one is particularly interesting. It suggests that the problem is not specific to TeX but also affects other PDF producers. Looks like a tough one.