
Python: encoding problem when writing a file from multiple processes, but not from multiple threads

Views: 110  Date: 2022-06-28 13:29:55

Problem description

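I scrape data with multiple processes and write it to a file. The run reports no errors, but the file is garbled when I open it.

[Image from the original post]

When I rewrite it with multiple threads the problem goes away and everything works. Here is the code that writes the data to the file: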

def Get_urls(start_page, end_page):
    print ' run task {} ({})'.format(start_page, os.getpid())
    url_text = codecs.open('url.txt', 'a', 'utf-8')
    for i in range(start_page, end_page + 1):
        pageurl = baseurl1 + str(i) + baseurl2 + searchword
        response = requests.get(pageurl, headers=header)
        soup = BeautifulSoup(response.content, 'html.parser')
        a_list = soup.find_all('a')
        for a in a_list:
            if a.text != '' and 'wssd_content.jsp?bookid' in a['href']:
                text = a.text.strip()
                url = baseurl + str(a['href'])
                url_text.write(text + '\t' + url + '\n')
    url_text.close()

The multi-process version uses a process pool:

def Multiple_processes_test():
    t1 = time.time()
    print 'parent process {} '.format(os.getpid())
    page_ranges_list = [(1, 3), (4, 6), (7, 9)]
    pool = multiprocessing.Pool(processes=3)
    for page_range in page_ranges_list:
        pool.apply_async(func=Get_urls, args=(page_range[0], page_range[1]))
    pool.close()
    pool.join()
    t2 = time.time()
    print 'Elapsed time:', t2 - t1

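Answers

Answer 1:

The image already says it: the file was loaded with the wrong encoding, which means that what your processes wrote was not valid UTF-8.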
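
One way to check what Answer 1 describes is to read the file back as raw bytes and try to decode it as UTF-8. The sketch below is only an illustration: `is_valid_utf8` is a hypothetical helper name, and `url.txt` is the file name from the question. A multi-byte character that has been cut in half, which could happen if chunks written by different processes get interleaved, makes the decode fail.

```python
# -*- coding: utf-8 -*-
import io
import os


def is_valid_utf8(path):
    """Return True if the whole file decodes as UTF-8, False otherwise."""
    with io.open(path, 'rb') as f:
        data = f.read()
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True


if __name__ == '__main__':
    path = 'url.txt'  # file name taken from the question
    if os.path.exists(path):
        print(is_valid_utf8(path))
```

If this prints False, the bytes on disk are damaged and no editor encoding setting will display them correctly.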

Answer 2:

Add this as the first line of the file:

#coding: utf-8

Answer 3:

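Having several workers open the same file is quite dangerous and very likely to go wrong. That the multithreaded version does not fail is most likely thanks to the GIL; separate processes share no lock, so errors occur easily.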

url_text = codecs.open('url.txt', 'a', 'utf-8')
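
If the workers really do have to append to the same file, one possible approach is to share a `multiprocessing.Lock` with the pool and hold it for the whole open, write, close sequence. This is a minimal sketch rather than the asker's code: `init_worker`, `append_lines`, `worker` and `run` are hypothetical names, and the generated rows stand in for the scraping loop. It is written for Python 3.

```python
# -*- coding: utf-8 -*-
import io
import multiprocessing
import os
import tempfile

_lock = None  # set in every worker process by init_worker()


def init_worker(lock):
    """Pool initializer: keep the shared lock in a module-level global."""
    global _lock
    _lock = lock


def append_lines(path, lines):
    """Append text lines to path while holding the shared lock."""
    with _lock:
        # Open, write and close inside the lock so the data is flushed
        # to disk before another process is allowed to write.
        with io.open(path, 'a', encoding='utf-8') as f:
            for line in lines:
                f.write(line)


def worker(path, start_page, end_page):
    # Hypothetical stand-in for the scraping loop in the question;
    # u'\u66f8' is a multi-byte character in UTF-8.
    lines = [u'\u66f8{0}\t/{0}/wssd_content.jsp?bookid\n'.format(i)
             for i in range(start_page, end_page + 1)]
    append_lines(path, lines)


def run(path):
    lock = multiprocessing.Lock()
    pool = multiprocessing.Pool(processes=3,
                                initializer=init_worker,
                                initargs=(lock,))
    results = [pool.apply_async(worker, (path, a, b))
               for a, b in [(1, 3), (4, 6), (7, 9)]]
    pool.close()
    pool.join()
    for r in results:
        r.get()  # re-raises any exception that happened in a worker


if __name__ == '__main__':
    out = os.path.join(tempfile.mkdtemp(), 'url.txt')
    run(out)
    with io.open(out, encoding='utf-8') as f:
        print(len(f.readlines()))
```

The lock is handed over through the pool initializer because a lock cannot be sent as an ordinary `apply_async` argument. Calling `get()` on each result also surfaces exceptions that `apply_async` would otherwise swallow silently.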

I suggest switching to a producer-consumer pattern!

For example:

# -*- coding: utf-8 -*-
import time
import os
import codecs
import multiprocessing
import requests
from bs4 import BeautifulSoup

baseurl = ''
baseurl1 = ''
baseurl2 = ''
pageurl = ''
searchword = ''
header = {}


def fake(url, **kwargs):
    # Stand-in for requests.get so the example runs without network access
    class Response(object):
        pass
    o = Response()
    o.content = '<a href="/{}/wssd_content.jsp?bookid">foo</a>'.format(url)
    return o

requests.get = fake


def Get_urls(start_page, end_page, queue):
    # Producer: puts each line on the queue instead of writing to the file
    print('run task {} ({})'.format(start_page, os.getpid()))
    try:
        for i in range(start_page, end_page + 1):
            pageurl = baseurl1 + str(i) + baseurl2 + searchword
            response = requests.get(pageurl, headers=header)
            soup = BeautifulSoup(response.content, 'html.parser')
            a_list = soup.find_all('a')
            for a in a_list:
                if a.text != '' and 'wssd_content.jsp?bookid' in a['href']:
                    text = a.text.strip()
                    url = baseurl + str(a['href'])
                    queue.put(text + '\t' + url + '\n')
    except Exception as e:
        import traceback
        traceback.print_exc()


def write_file(queue):
    # Consumer: the only process that opens and writes url.txt
    print('start write file')
    url_text = codecs.open('url.txt', 'a', 'utf-8')
    while True:
        line = queue.get()
        if line is None:  # sentinel: all producers have finished
            break
        print('write {}'.format(line))
        url_text.write(line)
    url_text.close()


def Multiple_processes_test():
    t1 = time.time()
    manager = multiprocessing.Manager()
    queue = manager.Queue()
    print('parent process {} '.format(os.getpid()))
    page_ranges_list = [(1, 3), (4, 6), (7, 9)]
    consumer = multiprocessing.Process(target=write_file, args=(queue,))
    consumer.start()
    pool = multiprocessing.Pool(processes=3)
    results = []
    for page_range in page_ranges_list:
        result = pool.apply_async(func=Get_urls,
                                  args=(page_range[0], page_range[1], queue))
        results.append(result)
    pool.close()
    pool.join()
    queue.put(None)  # tell the consumer to stop
    consumer.join()
    t2 = time.time()
    print('Elapsed time: {}'.format(t2 - t1))


if __name__ == '__main__':
    Multiple_processes_test()

Result:

foo	/4/wssd_content.jsp?bookid
foo	/5/wssd_content.jsp?bookid
foo	/6/wssd_content.jsp?bookid
foo	/1/wssd_content.jsp?bookid
foo	/2/wssd_content.jsp?bookid
foo	/3/wssd_content.jsp?bookid
foo	/7/wssd_content.jsp?bookid
foo	/8/wssd_content.jsp?bookid
foo	/9/wssd_content.jsp?bookid
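
A simpler variant of the same idea, when the results fit in memory, is to have the workers return their rows and let the parent process do all the writing. The sketch below assumes Python 3; `collect_rows`, `write_rows` and `main` are hypothetical names, and the generated rows stand in for the real scraping.

```python
# -*- coding: utf-8 -*-
import io
import multiprocessing
import os
import tempfile


def collect_rows(page_range):
    """Hypothetical stand-in for Get_urls: return rows instead of writing."""
    start_page, end_page = page_range
    return [u'foo\t/{0}/wssd_content.jsp?bookid\n'.format(i)
            for i in range(start_page, end_page + 1)]


def write_rows(path, chunks):
    """Append every row of every chunk; only the caller touches the file."""
    with io.open(path, 'a', encoding='utf-8') as f:
        for rows in chunks:
            f.writelines(rows)


def main(path):
    page_ranges_list = [(1, 3), (4, 6), (7, 9)]
    pool = multiprocessing.Pool(processes=3)
    try:
        # map() returns results in input order and re-raises
        # any exception that happened in a worker.
        chunks = pool.map(collect_rows, page_ranges_list)
    finally:
        pool.close()
        pool.join()
    write_rows(path, chunks)


if __name__ == '__main__':
    out = os.path.join(tempfile.mkdtemp(), 'url.txt')
    main(out)
    with io.open(out, encoding='utf-8') as f:
        print(f.read())
```

Compared with the queue version, this needs no consumer process or sentinel, and the rows come out in page order. The trade-off is that nothing is written until every worker has finished.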
