How do I check the next character of a string from a file in Python?
Basically, I need to replace a character (&) in a file with &amp;. I managed to do this by checking whether '&' exists in the file. However, I just realized that the same XML file also contains &#39;, which is a single quote. Hence, I need a way to check whether the ampersand has a '#' after it, and replace it only if it does not. Any ideas on tackling this problem?
- husoski, Lv 7, 2 months ago
Nobody seems to want to tell you that Python has a regular expression module in its standard library. If you're going to do this all from scratch, that's the way to go.
Are you sure that &#nn; decimal codes are the only HTML character entities used? The normal way to represent an ampersand in HTML is &amp; and XML uses the same syntax. An XML file needs to encode any < as &lt; (or &#60; or &#x3C;) and any > as &gt; (or &#62; or &#x3E;). A quote character used in a tag attribute may also need to be encoded as a &blah; entity.
A ' single quote (aka "apostrophe") is &apos; or &#39; or &#x27;.
A " double quote (aka "quote") is &quot; or &#34; or &#x22;.
The full-tilt expression to match an & that's NOT part of an HTML or XML character entity is easier to see if you look at what *is* a character entity. It's basically an '&' and a ';' pair of characters with one of the following between them: 1 or more letters, a '#' followed by 1 or more decimal digits, or a '#x' followed by 1 or more hexadecimal digits.
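To make that description concrete, here is a small sketch that matches things shaped like a character entity. The names entity_pat and find_entities are made up for illustration, and the pattern only checks the shape described above, not whether the entity name is actually defined.

```python
import re

# One '&', then letters OR '#'+decimal digits OR '#x'+hex digits, then ';'.
entity_pat = re.compile(r'&(?:[a-zA-Z]+|#[0-9]+|#[xX][0-9a-fA-F]+);')

def find_entities(text):
    """Return every substring of text that is shaped like a character entity."""
    return entity_pat.findall(text)

print(find_entities("a &amp; b &#39; c &#x27; d & e"))
# ['&amp;', '&#39;', '&#x27;']
```

The lone & near the end is not reported, which is exactly the case the search-and-replace has to catch.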
The regex to match for search and replace is:
pat = r'&(?!(([a-zA-Z]+)|(#[0-9]+)|(#[xX][0-9a-fA-F]+));)'
The leading & is the only thing matched. The (?!...) group that follows is a "negative lookahead", matching what came before only if the ... stuff inside does NOT match. Inside that group is ((...)|(...)|(...)) matching one of the three options above (all letters, # then digits or #x and hex digits) and a final ; to match at the end of the lookahead.
Use that in a Python program by:
import re # you need this at the top
new_text = re.sub(pat, '&amp;', old_text)
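Put together, a minimal runnable sketch might look like the following. The function name escape_bare_ampersands and the sample string are assumptions for illustration; for a real file you would read its contents into a string first and write the result back out.

```python
import re

# Pattern from above: an '&' that is NOT followed by a name, a decimal
# reference, or a hex reference ending in ';'.
pat = r'&(?!(([a-zA-Z]+)|(#[0-9]+)|(#[xX][0-9a-fA-F]+));)'

def escape_bare_ampersands(old_text):
    """Replace only the ampersands that do not start a character entity."""
    return re.sub(pat, '&amp;', old_text)

sample = "Tom & Jerry&#39;s &lt;b&gt; &amp; R&D"
print(escape_bare_ampersands(sample))
# Tom &amp; Jerry&#39;s &lt;b&gt; &amp; R&amp;D
```

Because existing entities are left alone, running the function a second time on its own output should change nothing.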
If you are quite sure that &#38; is the *only* context where you don't want to replace the ampersand with an entity, then the pattern is much simpler:
pat = r'&(?!#38;)'
Use it the same way as above.
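As a sketch of the simpler variant, the helper below builds the same kind of pattern for whichever decimal code you want to protect (38 as in the pattern above, or 39 for the single quote from the question). The name escape_except_decimal_ref is hypothetical.

```python
import re

def escape_except_decimal_ref(text, code):
    """Escape every '&' except the one starting '&#<code>;'."""
    # For code=38 this builds the same pattern as above: &(?!#38;)
    pat = r'&(?!#' + str(int(code)) + r';)'
    return re.sub(pat, '&amp;', text)

print(escape_except_decimal_ref("a & b &#38; c", 38))   # a &amp; b &#38; c
print(escape_except_decimal_ref("it&#39;s A&B", 39))    # it&#39;s A&amp;B
```

Note the trade-off: this version does not recognize any other entity, so text that already contains &amp; would come out as &amp;amp;.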
- EddieJ, Lv 7, 2 months ago
One approach to consider works if you have good reason to believe some unusual string will NOT be in the file.
For example "[QQ]". You could replace all "&#" with that string. Then do the replace that you want, and then replace "[QQ]" with "&#".
You can check for "[QQ]" first, and if that's found, use "[ZZ]" instead, or whatever you think would be unlikely.
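Those steps can be sketched with plain str.replace, no regex needed. The function name and the default list of candidate markers are assumptions; any marker works as long as it does not occur in the text and does not itself contain an ampersand.

```python
def escape_with_placeholder(text, candidates=("[QQ]", "[ZZ]")):
    """Escape '&' while protecting every '&#' behind a temporary marker."""
    # Pick the first candidate marker that does not already occur in the text.
    marker = next((m for m in candidates if m not in text), None)
    if marker is None:
        raise ValueError("every candidate marker already occurs in the text")
    protected = text.replace("&#", marker)       # hide the numeric references
    escaped = protected.replace("&", "&amp;")    # escape what is left
    return escaped.replace(marker, "&#")         # put the references back

print(escape_with_placeholder("it&#39;s A & B"))   # it&#39;s A &amp; B
print(escape_with_placeholder("[QQ] & &#39;"))     # [QQ] &amp; &#39;
```

Like the simple regex, this only protects numeric references, so an existing &amp; or &lt; would still get double-escaped.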
- Mr.Persona, Lv 5, 2 months ago
I wouldn't use Python in the first place, to be honest with you. I'd just use sed or vim with a regex, the regex being what ensures you only pick an & that has no # following it.
However, I would like to ask: in theory, your file should be either decoded (i.e. it's &, ', etc.) or encoded (i.e. it's &amp;, &#39;, etc.), not a jumble of both.
If it isn't a jumble of both, you could just map all &amp;, etc. to their appropriate values and not worry about running into a lone &, because no lone & should exist; & would always appear as &amp;.
Also surprised you didn't just google for a standard decode() function of some existing library. This feels so common a problem that the standard library would solve it.
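For what it's worth, one way to do this with the standard library is sketched below, assuming html.unescape (Python 3.4+) is an acceptable stand-in for the "decode" function and xml.sax.saxutils.escape for re-encoding. The name normalize_entities is made up. This is meant for text content only: run on a whole XML document it would escape the tags too.

```python
import html
from xml.sax.saxutils import escape

def normalize_entities(text):
    """Decode every entity, then re-escape so each '&' is encoded exactly once."""
    decoded = html.unescape(text)   # '&amp;' -> '&', '&#39;' -> "'", '&lt;' -> '<'
    return escape(decoded)          # '&' -> '&amp;', '<' -> '&lt;', '>' -> '&gt;'

print(normalize_entities("Tom & Jerry&#39;s &amp; R&D"))
# Tom &amp; Jerry's &amp; R&amp;D
```

Note that &#39; comes back as a plain ' because escape() only re-encodes &, < and > by default; extra characters can be passed through its optional entities argument.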