Encoding problems in web scraping: UnicodeEncodeError: 'gbk' codec can't encode character ****: illegal multibyte sequence
"UnicodeEncodeError: 'gbk' codec can't encode character ****: illegal multibyte sequence"
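I ran into the problem above while practicing writing a Python script to scrape web pages. It was frustrating, because the browser's page info showed that the page's encoding really was gbk. This is how I handled it: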
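    import urllib2  # Python 2 standard library

    def getResult(url):
        txtfile = open("cnbeta.txt", "w")
        # Fetch the raw bytes, then transcode from GBK to UTF-8.
        # The default errors='strict' raises on any invalid byte sequence.
        html = urllib2.urlopen(url).read()
        html = html.decode('gbk').encode('utf-8')
        #print html
        analysisPage(html, txtfile)
        txtfile.close()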
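Later I even tried replacing gbk with gb2312 and other GB-family encodings, but every one of them raised an error, and none of the material I looked up helped. I was about to try a completely different way of scraping when, while browsing around, I stumbled on a fix that works: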
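    import urllib2  # Python 2 standard library

    def getResult(url):
        txtfile = open("cnbeta.txt", "w")
        html = urllib2.urlopen(url).read()
        # 'ignore' skips byte sequences that are not valid GBK
        # instead of raising an exception.
        html = html.decode('gbk', 'ignore').encode('utf-8')
        #print html
        analysisPage(html, txtfile)
        txtfile.close()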
Only a tiny change to the code was needed. As for the cause:
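The code as a whole is fine; the trouble comes from illegal characters, especially around full-width spaces. What looks like a full-width space can arrive as different byte sequences, for example \xa3\xa0, which renders like one but is not a valid GBK sequence (the real full-width space is \xa1\xa1), so the conversion fails. And once a single such sequence appears, the whole file fails to convert.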
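To see which bytes are actually at fault before deciding to ignore them, something like the following sketch may help. It is written for Python 3 (the scripts above are Python 2), the helper name find_invalid_gbk and the sample bytes are made up for illustration, and it relies only on the start and end attributes that UnicodeDecodeError carries:

```python
def find_invalid_gbk(data, encoding="gbk"):
    """Return (offset, bad_bytes) pairs for each undecodable span in data.

    Hypothetical helper, not part of the original script.
    """
    bad = []
    pos = 0
    while pos < len(data):
        try:
            # Strict decoding of the remainder; succeeds if nothing is left to report.
            data[pos:].decode(encoding)
            break
        except UnicodeDecodeError as exc:
            # exc.start / exc.end are offsets relative to the slice we decoded.
            start = pos + exc.start
            end = pos + exc.end
            bad.append((start, data[start:end]))
            pos = end  # resume right after the offending bytes
    return bad


# Assumed sample: valid ASCII and GBK text with one stray byte (0xff) in the middle.
sample = b"ab" + b"\xff" + b"cd" + "中文".encode("gbk")
for offset, raw in find_invalid_gbk(sample):
    print(offset, raw)
```

Restarting the decode after every error is quadratic in the worst case, which should be acceptable for inspecting a single page but is not meant for bulk processing.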
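This is where it helps to remember the signature of decode: decode([encoding], [errors='strict']). The default, strict, raises an exception on an illegal byte sequence; ignore skips the offending bytes and outputs the rest; replace substitutes the replacement character U+FFFD for them. When encoding rather than decoding, replace writes ? instead, and xmlcharrefreplace, which applies only to encoding, writes an XML character reference.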
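The difference between the error handlers is easy to check on a small input. The sketch below is Python 3 rather than the Python 2 used above; the helper names try_decode and try_encode and the sample values are assumptions chosen for illustration, while the handler names themselves are the standard ones:

```python
def try_decode(raw, errors):
    """Decode GBK bytes with the given error handler; report failures as text."""
    try:
        return raw.decode("gbk", errors)
    except UnicodeDecodeError as exc:
        return "error: " + exc.reason


def try_encode(text, errors):
    """Encode text to GBK with the given error handler; report failures as text."""
    try:
        return text.encode("gbk", errors)
    except UnicodeEncodeError as exc:
        return "error: " + exc.reason


RAW = b"ab\xffcd"   # assumed sample: 0xff is not a valid GBK lead byte
TEXT = "a\xa0b"     # assumed sample: U+00A0 (no-break space) has no GBK encoding

for handler in ("strict", "ignore", "replace"):
    print("decode", handler, repr(try_decode(RAW, handler)))

for handler in ("strict", "ignore", "replace", "xmlcharrefreplace"):
    print("encode", handler, repr(try_encode(TEXT, handler)))
```

Note that ignore discards the bad bytes without any trace, so replace may be the safer choice when it matters where the damage was.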
(Full reference: http://www.cnblogs.com/baiyuyang/archive/2011/10/29/2228667.html)
September 2, 2022, 20:56