I have an XML exporter which creates feeds from my database, and I have an escape method so that the XML-sensitive characters in my data do not conflict with the XML markup.
The method looks like this:
def escape(m_str):
    m_str = m_str.replace("&", "&amp;")
    m_str = m_str.replace("\n", "<br />")
    m_str = m_str.replace("<", "&lt;")
    m_str = m_str.replace(">", "&gt;")
    m_str = m_str.replace("\"", "&quot;")
    return m_str
I'm using the lxml library for this script and I have the following issue:
One of the descriptions contains a \x03 character (don't ask me why we have this character in a description, but we have it).
For more visual people, here is a sample of the problematic description:
to_be_escaped
> 'g\x03nebst G'
[(x,ord(x)) for x in to_be_escaped]
> <class 'list'>: [('g', 103), ('\x03', 3), ('n', 110), ('e', 101), ('b', 98), ('s', 115), ('t', 116), (' ', 32), ('G', 71)]
You can see that what looks like the first space is not a space at all but an End of Text character (\x03, decimal 3), while the second one is a normal space (decimal 32).
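Characters like this are easy to miss when printed. A minimal sketch for locating them, assuming the Unicode category "Cc" (control) is a good enough filter; the helper name find_control_chars is hypothetical and not part of the original script:

```python
import unicodedata


def find_control_chars(text):
    """Return (index, code point) pairs for every control character in text."""
    return [
        (i, ord(ch))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cc"  # "Cc" is the Unicode control category
    ]


sample = "g\x03nebst G"
print(find_control_chars(sample))  # [(1, 3)]
```

Note that "Cc" also covers tab, line feed and carriage return, which are legal in XML, so this is a diagnostic aid rather than a validity check.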
The problem is that lxml reacts pretty badly to it. Here is the code:
description = et.fromstring("<volltext>%s</volltext>" % cls.escape(job.description))
which fails (with this character) with:
PCDATA invalid Char value 3, line 1
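The error is not really specific to lxml: XML 1.0 allows only tab, line feed and carriage return from the control range below 0x20, so a conforming parser is expected to refuse U+0003. As a rough check, here is a sketch using the standard library's xml.etree.ElementTree (a swapped-in parser, not lxml, so the message differs from the one above):

```python
import xml.etree.ElementTree as ET

# Same kind of payload as in the question, with the control character inline.
payload = "<volltext>g\x03nebst G</volltext>"

try:
    ET.fromstring(payload)
    rejected = False
except ET.ParseError as exc:
    rejected = True
    print("stdlib parser also rejects it:", exc)
```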
My questions are:
Of course, I could just extend my escape method to solve the problem, but what guarantees that it will not happen with another character?
Where can I find a list of the "forbidden" characters in lxml?
Has someone else dealt with this kind of issue and has an appropriate escape method for it (as the built-in one doesn't do better than mine)?
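For comparison, here is a quick check with the standard library's xml.sax.saxutils.escape. This is an assumption, since the question does not say which built-in helper was meant. It only rewrites markup characters and passes \x03 through unchanged:

```python
from xml.sax.saxutils import escape as sax_escape

raw = 'g\x03nebst <b> & "G"'

# The second argument adds extra replacements; quotes are not escaped by default.
escaped = sax_escape(raw, {'"': "&quot;"})
print(repr(escaped))
```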
I found the beginning of an answer there (all credit to the author for the very clear explanation).
The issue is basically that the parser's default settings are too strict for this input: we need to specify explicitly that the source is encoded as UTF-8 and, with recover=True, ask lxml to try to keep parsing past broken content instead of raising.
We can do it by changing the following line:
et.fromstring("<volltext>%s</volltext>" % cls.escape(job.description))
to:
et.fromstring("<volltext>%s</volltext>" % cls.escape(job.description), parser=et.XMLParser(encoding='utf-8', recover=True))
which makes the parsing much more resilient and robust.
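If relying on the parser's recovery is not desirable, another option is to remove the characters that XML 1.0 forbids before building the document. The sketch below is one possible approach, not lxml's own mechanism. The character class is taken from the "Char" production of the XML 1.0 specification, and the names _INVALID_XML_CHARS and strip_invalid_xml_chars are hypothetical:

```python
import re

# Anything NOT matched by the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    "[^\t\n\r\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text, replacement=""):
    """Replace every character that XML 1.0 does not allow in a document."""
    return _INVALID_XML_CHARS.sub(replacement, text)


cleaned = strip_invalid_xml_chars("g\x03nebst G")
print(repr(cleaned))  # 'gnebst G'
```

In the exporter this would presumably be applied to job.description before or after the escape call, since the two steps touch disjoint sets of characters. Passing replacement=" " keeps word boundaries where a control character was used as a separator.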