
Finding bad Unicode inside an XML file – parse and validate XML


A quick note on one resource for validating XML (UTF-8) online. Paste your XML into the source field and hit validate. Larger strings of text may take a few minutes to process. If the validator doesn’t process the text at all, cut it down into smaller chunks.

http://www.validome.org/xml/validate/

Example of an error message produced by the validator:
Character reference “&#xD83D” is an invalid XML character
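If pasting the file into an online form is not practical, a similar check can be run locally. The sketch below is not the validator linked above. It assumes Python's standard `xml.etree.ElementTree` parser, and `first_xml_error` is a hypothetical helper name. It only checks that the text is well-formed and reports where the parser gave up.

```python
import xml.etree.ElementTree as ET


def first_xml_error(xml_text):
    """Return (line, column, message) for the first parse error, or None.

    Hypothetical helper: it checks well-formedness only, not validity
    against a schema or DTD.
    """
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # ParseError.position is a (line, column) tuple from the parser.
        line, column = err.position
        return line, column, str(err)
    return None


if __name__ == "__main__":
    sample = '<a>\n<b name="OutlookEmoji-&#xD83D;&#xDE0A;.png"/>\n</a>'
    print(first_xml_error(sample))
```

The reported line number gives a place to start searching in a large file, much like the character reference named in the validator's message.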

I stumbled on the error above while trying to figure out why a report was failing. The report parses an XML file describing each Exchange user’s Inbox and tells you how many messages each user has, the size of the mailbox, and so on. It worked for every user except one, whose mailbox contained an email with bad Unicode in it. Once the validator told me what to look for, I searched the XML for the bad character reference &#xD83D. I found it about halfway through the file. Here is that text, along with some related elements:

<emailmarker id="EF000000198262C0AA6611CD9BC800AA002FC45A0600FC1D00000100000000001C650100000010356A95">
<attachment name="OutlookEmoji-&#xD83D;&#xDE0A;.png" type="1" extension="png" size="734" compressedsize="734" mimetype="image/png" link="">
</attachment>
</emailmarker>
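Searching for one known reference works, but the same file could contain others. The sketch below assumes the XML is available as a Python string. It lists every numeric character reference that falls in the UTF-16 surrogate range (U+D800 to U+DFFF), which XML does not allow, and shows how a high/low pair combines into the code point it stands for. The function names are illustrative, not taken from the report program.

```python
import re

# Matches a hexadecimal or decimal numeric character reference,
# e.g. &#xD83D; or &#55357;
CHAR_REF = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def find_surrogate_refs(xml_text):
    """Yield (line_number, reference_text, code_point) for surrogate references."""
    for line_number, line in enumerate(xml_text.splitlines(), start=1):
        for match in CHAR_REF.finditer(line):
            hex_digits, dec_digits = match.groups()
            code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
            if 0xD800 <= code_point <= 0xDFFF:
                yield line_number, match.group(0), code_point


def combine_surrogates(high, low):
    """Combine a UTF-16 high/low surrogate pair into a single code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


if __name__ == "__main__":
    sample = '<attachment name="OutlookEmoji-&#xD83D;&#xDE0A;.png"></attachment>'
    for hit in find_surrogate_refs(sample):
        print(hit)
    print(hex(combine_surrogates(0xD83D, 0xDE0A)))
```

For the attachment above, the pair D83D and DE0A combines to U+1F60A, a smiling-face emoji.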

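At this point I knew that I was dealing with an image attachment, apparently an emoji, whose file name was the problem. The two references &#xD83D; and &#xDE0A; are the UTF-16 surrogate pair for U+1F60A (a smiling-face emoji), and XML does not allow surrogates as character references. I set out to find that emoji!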

First I had to take note of the last time the parsing program ran successfully by looking at the emailed report. That report indicated that the “Newest Message” processed was dated 18-Oct-2015, and subsequent reports showed that “Newest Message” didn’t update until 15-Nov-2015. That left me looking through a month’s worth of emails for one with an emoji in it, which I was able to find without much trouble.

Delete the email and your problem goes away.
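If deleting the message is not an option, one possible alternative is to repair the XML before it reaches the parser. This is only a sketch, not what I did here. It assumes the bad references always appear as adjacent hexadecimal high/low surrogate pairs, as in the attachment name above, and `repair_surrogate_pairs` is a hypothetical helper name.

```python
import re

# Assumption: a hex high-surrogate reference (D800-DBFF) immediately
# followed by a hex low-surrogate reference (DC00-DFFF).
SURROGATE_PAIR = re.compile(
    r"&#[xX]([dD][89abAB][0-9A-Fa-f]{2});&#[xX]([dD][c-fC-F][0-9A-Fa-f]{2});"
)


def repair_surrogate_pairs(xml_text):
    """Replace each surrogate-pair reference with one reference to the real code point."""

    def to_single_ref(match):
        high = int(match.group(1), 16)
        low = int(match.group(2), 16)
        code_point = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
        return "&#x{:X};".format(code_point)

    return SURROGATE_PAIR.sub(to_single_ref, xml_text)


if __name__ == "__main__":
    broken = '<attachment name="OutlookEmoji-&#xD83D;&#xDE0A;.png"></attachment>'
    print(repair_surrogate_pairs(broken))
```

Lone surrogates and decimal references are left untouched by this sketch, so the output should still be run through a validator afterwards.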
