python - XML Escaping character \x03 - Stack Overflow

I have an XML exporter that creates feeds from my database, and I have an escape method so that the XML-sensitive characters in my data do not conflict with the XML markup.

The method looks like this:

def escape(m_str):
    m_str = m_str.replace("&", "&amp;")
    m_str = m_str.replace("\n", "<br />")
    m_str = m_str.replace("<", "&lt;")
    m_str = m_str.replace(">", "&gt;")
    m_str = m_str.replace("\"", "&quot;")
    return m_str
I'm using the lxml library for this script and I have the following issue:
One of the descriptions contains a \x03 character (don't ask me why we have this character in a description, but we do).
For more visual people, here is a sample of the problematic description:
to_be_escaped
> 'gnebst G'
[(x,ord(x)) for x in to_be_escaped]
> <class 'list'>: [('g', 103), ('\x03', 3), ('n', 110), ('e', 101), ('b', 98), ('s', 115), ('t', 116), (' ', 32), ('G', 71)]
You can see that the first "space" is not really a space but an End of Text character (ETX, dec. 3), while the second is a normal space (dec. 32).
The problem is that lxml reacts pretty badly to it. Here is the code:
description = et.fromstring("<volltext>%s</volltext>" % cls.escape(job.description))
which outputs (with this character):
  PCDATA invalid Char value 3, line 1
My questions are:
1. Of course, I could just extend my escape method to solve the problem, but what guarantees that it will not happen with another character?
2. Where can I find a list of the "forbidden" characters in lxml?
3. Has someone else dealt with this kind of issue and found an appropriate escape method for it (as the built-in one doesn't do better than mine)?
I found the beginning of an answer there (all credit to the author for the very clear explanation).
The issue is basically that the default parser settings are too strict for this input, so we need to state explicitly that the source is encoded as UTF-8 and let the parser recover from characters it rejects.
We can do that by changing the following line:
et.fromstring("<volltext>%s</volltext>" % cls.escape(job.description))
to:
et.fromstring("<volltext>%s</volltext>" % cls.escape(job.description), parser=XMLParser(encoding='utf-8', recover=True))
(with XMLParser imported from lxml.etree), which makes the parsing much more resilient and robust.
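On the question of which characters are "forbidden": the restriction comes from XML 1.0 itself rather than from lxml. The Char production of the specification only allows #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD and #x10000-#x10FFFF, so \x03 and most other control characters can never appear in a well-formed document. As an alternative to relying on recover=True, here is a minimal sketch that strips those characters before escaping. The helper names strip_invalid_xml_chars and safe_escape are hypothetical and not part of the original exporter, and unlike the escape method in the question this sketch does not turn newlines into <br />.

```python
import re
from xml.sax.saxutils import escape as xml_escape

# Matches every character that is NOT allowed by the XML 1.0 "Char" production:
# #x9 | #xA | #xD | #x20-#xD7FF | #xE000-#xFFFD | #x10000-#x10FFFF
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text):
    """Remove characters that XML 1.0 does not allow (hypothetical helper)."""
    return _INVALID_XML_CHARS.sub("", text)


def safe_escape(text):
    """Drop invalid characters, then escape &, <, > and double quotes."""
    cleaned = strip_invalid_xml_chars(text)
    # xml.sax.saxutils.escape handles &, < and >; the extra mapping adds quotes.
    return xml_escape(cleaned, {'"': "&quot;"})
```

With this in place, safe_escape("g\x03nebst G") yields a string without the \x03 character, so a strict parser no longer rejects it. Whether dropping the character silently is acceptable, or whether it should be replaced by a space or logged, depends on the data.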