How do I check the next character of a string from a file in Python?

Basically, I need to replace a character (&) in a file with &#38;. I have managed to do this by checking whether '&' exists in the file. However, I just realized that the same XML file also contains &#39;, which is a single quote. Hence, I need to figure out a way to check whether the ampersand, &, has a '#' after it. Only if it does not will I replace the character. Any ideas on tackling this problem?

3 Answers

  • 2 months ago

    Nobody seems to want to tell you that Python has a regular expression module in its standard library.  If you're going to do this all from scratch,  that's the way to go.

    Are you sure that &#nn; decimal codes are the only HTML character entities used?  The normal way to represent an ampersand in HTML is &amp; and XML uses the same syntax.  An XML file needs to encode any < as &lt; (or &#60; or &#x3C;) and any > as &gt; (or &#62; or &#x3E;).  A quote character used in a tag attribute may also need to be encoded as a &blah; entity.

    A ' single quote (aka "apostrophe") is &apos; or &#39; or &#x27;.

    A " double quote (aka "quote") is &quot; or &#34; or &#x22;.

    The full-tilt expression to match an & that's NOT part of an HTML or XML character entity is easier to see if you look at what *is* a character entity.  It's basically an '&' and a ';' pair of characters with one of three things between them: 1 or more letters, a '#' followed by 1 or more decimal digits, or a '#x' followed by 1 or more hexadecimal digits.

    The regex to match for search and replace is:

    pat = r'&(?!(([a-zA-Z]+)|(#[0-9]+)|(#[xX][0-9a-fA-F]+));)'

    The leading & is the only thing matched.  The (?!...) group that follows is a "negative lookahead", matching what came before only if the ... stuff inside does NOT match.  Inside that group is ((...)|(...)|(...)) matching one of the three options above (all letters, # then digits or #x and hex digits) and a final ; to match at the end of the lookahead.

    Use that in a Python program by:

        import re # you need this at the top

        ....

        new_text = re.sub(pat, '&#38;', old_text)
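
    As a quick check of the pattern above, here is a minimal self-contained sketch. The sample string is made up for illustration; only the pattern and the re.sub call come from the answer.

```python
import re

# Pattern from the answer: an '&' that is NOT followed by a complete
# entity body (letters, #digits, or #x + hex digits) and a closing ';'.
pat = r'&(?!(([a-zA-Z]+)|(#[0-9]+)|(#[xX][0-9a-fA-F]+));)'

# Hypothetical sample text mixing bare ampersands with existing entities.
old_text = "Tom & Jerry &#39;95&#39; &amp; R&D &lt;b&gt; &#x27;"

# Only the bare ampersands ("Tom & Jerry" and "R&D") are rewritten.
new_text = re.sub(pat, '&#38;', old_text)
print(new_text)
# Tom &#38; Jerry &#39;95&#39; &amp; R&#38;D &lt;b&gt; &#x27;
```

    Running the substitution a second time on new_text should change nothing, since every & is then part of an entity.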

    If you are quite sure that &#38; is the *only* context where you don't want to replace the ampersand with an entity, then the pattern is much simpler:

    pat = r'&(?!#38;)'

    Use it the same way as above.
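
    One caution with the simpler pattern, sketched below on a made-up string: it protects only an existing &#38;, so other numeric entities such as &#39; would get their ampersand rewritten. The second pattern, r'&(?!#)', is not from the answer; it is an assumed variant that follows the question's wording (skip any & that has a '#' after it).

```python
import re

# Hypothetical sample text: a bare '&', an existing '&#38;', and two '&#39;'.
text = "A & B &#38; C &#39;D&#39;"

# The answer's simpler pattern: leave '&' alone only when it starts '&#38;'.
only_38 = re.sub(r'&(?!#38;)', '&#38;', text)
print(only_38)
# A &#38; B &#38; C &#38;#39;D&#38;#39;

# Assumed variant: leave '&' alone whenever the next character is '#'.
not_hash = re.sub(r'&(?!#)', '&#38;', text)
print(not_hash)
# A &#38; B &#38; C &#39;D&#39;
```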
  • EddieJ
    Lv 7
    2 months ago

    One approach works if you have good reason to believe that some unusual string will NOT be in the file.

    For example "[QQ]".  You could replace all "&#" with that string.  Then do the replace that you want, and then replace "[QQ]" with "&#".

    You can check for "[QQ]" first, and if that's found, use "[ZZ]" instead, or whatever you think would be unlikely.  
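
    A rough sketch of that placeholder idea in Python might look like this. The function name and the list of candidate placeholders are made up for illustration, and the sketch protects only "&#" sequences, so named entities such as &amp; would still be rewritten.

```python
def escape_bare_ampersands(text, candidates=("[QQ]", "[ZZ]")):
    """Replace '&' with '&#38;' except where it begins '&#'.

    Hypothetical helper: 'candidates' are placeholder strings assumed
    to be unlikely in the file; the first one not present is used.
    """
    for placeholder in candidates:
        if placeholder not in text:
            break
    else:
        # Every candidate already occurs in the text, so none is safe.
        raise ValueError("no unused placeholder available")

    protected = text.replace("&#", placeholder)    # hide existing '&#'
    escaped = protected.replace("&", "&#38;")      # escape what is left
    return escaped.replace(placeholder, "&#")      # restore the hidden '&#'


print(escape_bare_ampersands("A & B &#39;C&#39;"))
# A &#38; B &#39;C&#39;
```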
  • 2 months ago

    I wouldn't use Python in the first place, to be honest with you. I'd just use sed or vim with a regex, one that picks only an & with no # following it.

    However, I would like to ask: in theory, shouldn't your file be either decoded (i.e. it's &, ', etc.) or encoded (i.e. it's &#38;, &#39;, etc.), and not a jumble of both?

    If it isn't a jumble of both, you could just map all &#38;, etc. to their appropriate values and not worry about running into a lone & because no lone & should exist, & would always appear as &#38;.

    Also surprised you didn't just google for a standard decode() function of some existing library. This feels so common a problem that the standard library would solve it.
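
    For what it's worth, Python's standard library does have functions along those lines. The sketch below assumes you are working on a fragment of text content rather than a whole XML document (decoding a whole file would also decode &lt; and &gt; that must stay escaped). It uses html.unescape to decode and xml.sax.saxutils.escape to re-encode; the sample string is made up.

```python
import html
from xml.sax.saxutils import escape

# Hypothetical text content with a jumble of bare and encoded characters.
mixed = "Tom & Jerry &#39;95&#39; &#38; friends"

# Decode every character reference to its plain character.
decoded = html.unescape(mixed)
print(decoded)
# Tom & Jerry '95' & friends

# Re-encode consistently; escape() handles &, < and > by default.
reencoded = escape(decoded)
print(reencoded)
# Tom &amp; Jerry '95' &amp; friends
```

    Note that html.unescape follows HTML rules, which accept more entity names than XML predefines, and that it needs Python 3.4 or later.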

    • husoski
      Lv 7
      2 months ago

      Interesting idea, but sed won't work.  Its RE syntax doesn't have negative lookahead groups.  Don't know (or much care) about vim.