urlopen() returns a response object, and the data you read from it is a bytes object, because urlopen() cannot determine the encoding of the byte stream. You have to determine the encoding first (dynamically, or by using a fixed one if you know it) and then decode the data.
Example for UTF-8:
htmltext = s.read().decode('utf-8')
The encoding is often specified in the charset parameter of the Content-Type header. This can be accessed by s.info().get_content_charset() or s.headers.get_content_charset(). So you might check this first and use it if it is not None.
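A minimal sketch of that check, assuming UTF-8 is an acceptable fallback when the server declares no charset. The helper name read_text and the fallback parameter are made up for illustration; they are not part of urllib.

```python
from urllib.request import urlopen


def read_text(response, fallback="utf-8"):
    """Decode a response body with its declared charset, else a fallback.

    `fallback` is an assumption: pick whatever encoding fits your data.
    """
    # get_content_charset() returns None when the Content-Type header
    # carries no charset parameter, so `or` switches to the fallback.
    charset = response.headers.get_content_charset() or fallback
    return response.read().decode(charset)


# Possible usage (the URL is a placeholder):
#
#     with urlopen("https://example.com/") as s:
#         htmltext = read_text(s)
```

Note that decode() still raises UnicodeDecodeError if the bytes do not match the chosen encoding, and LookupError if the declared charset name is unknown to Python, so you may want to handle those cases for untrusted pages.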