It appears that when a URL given as a unicode string is requested with urllib2.urlopen, geturl() returns unicode. However, the URL that we request in the BibleGateway importer gets redirected. In this case the redirect uses a UTF-8 URL and (I'm guessing) urllib2 takes this as a UTF-8 encoded URL, so geturl() returns a UTF-8 encoded byte string instead of unicode.

bzr-revno: 2180
Fixes: https://launchpad.net/bugs/1251437
Philip Ridout 2013-12-21 00:14:21 +02:00 committed by Raoul Snyman
commit 6d86abaa37
1 changed file with 5 additions and 1 deletion

@@ -464,7 +464,11 @@ def get_web_page(url, header=None, update_openlp=False):
     log.debug(u'Downloading URL = %s' % url)
     try:
         page = urllib2.urlopen(req)
-        log.debug(u'Downloaded URL = %s' % page.geturl())
+        downloaded_url = page.geturl()
+        # Sometimes we get redirected, in this case page.geturl is encoded in utf-8
+        if not isinstance(downloaded_url, unicode):
+            downloaded_url = downloaded_url.decode('utf-8')
+        log.debug(u'Downloaded URL = %s' % downloaded_url)
     except urllib2.URLError:
         log.exception(u'The web page could not be downloaded')
     if not page:
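
The decode-if-needed check in this change can be tried in isolation, without a network request. The following is only a minimal sketch: the helper name to_text and the example.com URL are made up for illustration and are not part of OpenLP or urllib2. A small shim is assumed so that the same code also runs on Python 3, where the unicode type is called str.

```python
# Pick the text type: unicode on Python 2 (as in the commit), str on Python 3.
try:
    text_type = unicode
except NameError:
    text_type = str


def to_text(value, encoding='utf-8'):
    """Hypothetical helper: return value as a text (unicode) string.

    Text input is returned unchanged; byte strings are decoded, which
    mirrors the isinstance() check added in the diff.
    """
    if isinstance(value, text_type):
        return value
    return value.decode(encoding)


# Made-up URLs: the same address as UTF-8 bytes and as a text string.
redirected = b'http://example.com/caf\xc3\xa9'
original = u'http://example.com/caf\xe9'

print(to_text(redirected) == original)  # prints True
print(to_text(original) is original)    # prints True
```

Decoding before logging matters on Python 2 because formatting a non-ASCII byte string into a unicode format string triggers an implicit ASCII decode, which raises UnicodeDecodeError.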