Text originating from MS Word yields problematic results when exported to HTML (excessive <span> elements)
I’ve stumbled upon a bug in text that originated in MS Word, which affects the HTML structure when the text is exported from InDesign. I originally encountered it with Hebrew text in heavily style-driven paragraphs, but I have also re-created it in a completely “clean” style environment:
(1) Create a new file in MS Word. Type some text into it (such as “Lorem Ipsum”).
(2) Copy the text and Paste into InDesign.
(3) Reset the Paragraph Style; then apply a Character style to the text (it could be a Character style with no attributes).
(4) Paste the same text into Notepad, then copy it from Notepad and paste it into InDesign as a second paragraph (you could just as well re-type it in InDesign; the idea is to have identical text created solely in InDesign rather than pasted from Word, so that it carries none of the hidden data Word may have added).
(5) Export to HTML.
Even though it has been reset, the paragraph pasted into InDesign from Word seems to make the HTML export restate the <span> for the Character Style partway through the text, although all of the text should have the same attributes and the same character style. That is, the </span> closing one run is immediately followed by a new <span> opening the same character style.
The next paragraph, containing the same text copied and pasted through a “cleanup mediator” (in the form of Notepad, for example), does not cause the same behaviour.
I could not find any difference in the attributes associated with either paragraph from within InDesign.
This makes the resulting structure of text that originated in Word much more cumbersome. In addition, at least when dealing with RTL text (such as Hebrew), it may cause alignment problems, especially in floating text elements (positioned through CSS).
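Until the export itself is fixed, the redundant spans could in principle be merged in a post-processing pass over the exported HTML. The following is only a minimal sketch under stated assumptions: the function name `merge_adjacent_spans` and the class names `Basic` and `MyStyle` are hypothetical (the actual markup InDesign emits is not reproduced here), and the repeated opening tags are assumed to be character-for-character identical and to follow the closing tag with nothing in between. It is a regex-based pass, not a full HTML parser.

```python
import re

# Splits the markup into <span ...> / </span> tags and everything in between.
SPAN_TAG = re.compile(r'(</?span\b[^>]*>)')


def merge_adjacent_spans(html: str) -> str:
    """Drop every '</span><span ...>' pair that closes and immediately
    reopens a span with an identical opening tag."""
    tokens = [t for t in SPAN_TAG.split(html) if t]
    out = []
    open_spans = []  # opening tags of the spans that are currently open
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok.startswith('</span'):
            nxt = tokens[i + 1] if i + 1 < len(tokens) else None
            if open_spans and nxt == open_spans[-1]:
                # The span is closed and reopened unchanged: skip both tags
                # so the original span simply stays open.
                i += 2
                continue
            if open_spans:
                open_spans.pop()
            out.append(tok)
        elif tok.startswith('<span'):
            open_spans.append(tok)
            out.append(tok)
        else:
            out.append(tok)
        i += 1
    return ''.join(out)


if __name__ == '__main__':
    # Hypothetical markup imitating the behaviour described in this report.
    sample = ('<p class="Basic"><span class="MyStyle">Lorem </span>'
              '<span class="MyStyle">ipsum </span>'
              '<span class="MyStyle">dolor</span></p>')
    print(merge_adjacent_spans(sample))
```

Run on the hypothetical sample, this collapses the three runs into a single `<span class="MyStyle">Lorem ipsum dolor</span>`. Spans with different classes, or spans separated by any text or whitespace, are left untouched, so the pass only removes the exact closing-then-reopening pattern described above.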
InDesign version: 16.0.1 x64
-
Ofer Sheinberg commented
Note: Copying the problematic text from InDesign (or Word) and re-pasting it using “Paste without Formatting” still yields the same problematic results. Only copying it through a “cleanup mediator” as described above gets rid of whatever hidden attributes the text seems to be infested with.