Scraping the Excel VBA reference with Python requests and selenium: a walkthrough

Date: 2022-07-14 11:53:09

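Purpose: my office network is isolated from the internet, and the bundled Office installation has no local help, which makes writing VBA programs inconvenient. (I later found that Office 2007 ships with local help, so much of this effort was unnecessary.) The idea was to use a crawler to copy the official reference for offline use.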

Target site: https://docs.microsoft.com/zh-cn/office/vba/api/overview/

Tools used:

python3.7 with the requests and selenium libraries

Front end: jquery and jstree (for conveniently building menus with unlimited nesting levels)

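Design:

1. Analyze the target page. It splits into two parts: navigation on the left, content on the right.

2. Use selenium to walk the navigation tree depth-first, collect every node and its link, and store them in the jstree data format.

# The navigation hierarchy is <ul> <li> <a>... <span>....

3. Use requests to visit every link and fetch the main body of each page.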
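The jstree data format mentioned in step 2 is a list of nested dictionaries. Below is a minimal sketch of the shape the crawler builds, using the same keys as the code in this article (`text`, `children`, `a_attr`). The helper names `leaf` and `branch` and the sample entries are made up for illustration and are not taken from the real navigation tree.

```python
import json


def leaf(text, href, target="showframe"):
    """A clickable node: jstree copies a_attr onto the rendered <a> element."""
    return {"text": text, "a_attr": {"href": href, "target": target}}


def branch(text, children):
    """A submenu node: jstree renders 'children' as a nested level."""
    return {"text": text, "children": children}


# Hypothetical sample data, only meant to show the nesting.
menu = [
    branch("Excel", [
        leaf("Overview", "vba/api/overview/excel.html"),
        branch("Object model", [
            leaf("Range object", "vba/api/excel.range(object).html"),
        ]),
    ])
]

menu_json = json.dumps(menu, ensure_ascii=False, indent=2)
print(menu_json)
```

Nodes with `children` become expandable submenus, and nodes with `a_attr` become links that open in the frame named by `target`.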

Implementation:

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

# `wait` is a module-level WebDriverWait bound to the browser, and `lhref` /
# `lerror` are module-level lists; the article does not show the setup of `wait`.
#
# parent       the parent node (a WebElement)
# wait_text    the XPath of the parent node
# level, limit only there to make testing easier
#
def GetMenuDick_jstree(parent, level, wait_text, limit=2):
    if level >= limit:
        return []
    parent.click()  # expand the node
    l = []
    num = 1
    # Waiting for the <ul> to appear is enough
    new_wati_text = wait_text + '/following-sibling::ul'
    try:
        wait.until(EC.presence_of_element_located((By.XPATH, new_wati_text)))
        # Find all <a> and <span> children (a <span> is a submenu)
        childs = parent.find_elements(
            By.XPATH,
            'following-sibling::ul/li/span | following-sibling::ul/li/a')
        for i in childs:
            k = {}
            if i.get_attribute('role') is None:
                k['text'] = i.text
                # A submenu: recurse depth-first
                k['children'] = GetMenuDick_jstree(
                    i, level + 1,
                    new_wati_text + '/li[' + str(num) + ']/span', limit)
            else:
                # The page URLs have no .html suffix, so add one. Strip the
                # site prefix to get a relative path.
                url_text = str(i.get_attribute('href')).replace(
                    'https://docs.microsoft.com/zh-cn/office/', '', 1) + '.html'
                k['text'] = i.text
                k['a_attr'] = {'href': url_text, 'target': 'showframe'}
                lhref.append(str(i.get_attribute('href')))
            num = num + 1
            l.append(k)
        parent.click()  # collapse the node again
    except Exception as e:
        print('error message:', str(e), 'error parent:', parent.text,
              ' new_wati_text:', new_wati_text, 'num:', str(num))
        lerror.append(parent.text)
    return l

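import json
from selenium.webdriver.common.by import By

# `browser` (a selenium WebDriver) and `start_url` are created elsewhere;
# the article does not show that setup.
# data holds the menu; lhref collects the addresses to visit later.
# Find the first Excel node and start from there.
data = []
lhref = []
lerror = []
browser.get(start_url)
browser.set_page_load_timeout(10)  # page-load timeout
xpath_text = "//li[contains(@class,'tree')]/span[text()='Excel'][1]"
cl = browser.find_element(By.XPATH, xpath_text)
k = {'text': 'Excel'}
k['children'] = GetMenuDick_jstree(cl, 1, xpath_text, 20)
data.append(k)
# Write the JSON data. The published listing reads 'templetedata.json'; the
# path separator was most likely lost, so it is restored here.
with open(r'templete\data.json', 'w', encoding='utf-8') as f:
    json.dump(data, f)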

At this point we have every menu entry under Excel VBA together with its URL. Next we need the page bodies.

Approach:

1. Loop over all URLs.

2. Derive the file name from each URL.

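import os

#
# Derive the file name from the page address and create the folders it needs.
#
def create_file(url):
    t = 'https://docs.microsoft.com/zh-cn/office/'
    # Strip the site prefix, then create folders along the remaining path
    url = url.replace(t, '', 1)
    lname = url.split('/')
    # Make sure the first folder exists
    path = lname[0]
    if not os.path.isdir(path):
        os.mkdir(path)
    for l in lname[1:-1]:
        # The path separator was lost in the published listing;
        # os.path.join restores it.
        path = os.path.join(path, str(l))
        if not os.path.isdir(path):
            os.mkdir(path)
    if len(lname) > 1:
        path = os.path.join(path, lname[-1] + '.html')
    return path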
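As a variant of the folder loop above, the same mapping can be written with `os.makedirs(..., exist_ok=True)`, which creates all missing parent folders in one call. This is a sketch rather than the article's code: the function name `prepare_output_path`, the `root` parameter, and the sample URL are assumptions made for the example, which writes into a temporary directory.

```python
import os
import tempfile

PREFIX = "https://docs.microsoft.com/zh-cn/office/"


def prepare_output_path(url, root, prefix=PREFIX):
    """Map a page URL to <root>/<relative path>.html and create parent folders."""
    relative = url.replace(prefix, "", 1).strip("/")
    path = os.path.join(root, *relative.split("/")) + ".html"
    # Creates every missing folder on the way; no error if they already exist.
    os.makedirs(os.path.dirname(path), exist_ok=True)
    return path


with tempfile.TemporaryDirectory() as root:
    sample_url = PREFIX + "vba/api/excel.range"  # hypothetical sample URL
    out = prepare_output_path(sample_url, root)
    parent_exists = os.path.isdir(os.path.dirname(out))
    file_name = os.path.basename(out)
    print(file_name, parent_exists)
```

The relative part of the result lines up with the `href` stored in the menu (site prefix removed, `.html` appended), so menu links and saved files stay in step.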

3. Request each URL and save the main body of the page.

import requests
from lxml import etree

# requests mode
# `headers` (HTTP request headers) and `html_head` (the HTML prefix written
# before each page body) are defined elsewhere; the article does not show them.
# Loop over every URL; record failures so they can be run again later.
had_lhref = []
error_lhref = []
num = 1
for url in lhref:
    try:
        had_lhref.append(url)
        path = create_file(url)
        # Set a request timeout and the HTTP headers
        resp = requests.get(url, timeout=5, headers=headers)
        resp.encoding = 'utf-8'
        html = etree.HTML(resp.text)
        c = html.xpath("//main[@id='main']")
        # tostring returns the element's full HTML as bytes, so decode it
        content = html_head + etree.tostring(c[0], method='html').decode('utf-8')
        with open(path, 'w', encoding='utf-8') as f:
            f.write(content)
    except Exception as e:
        print('error message:', str(e), 'error url:', url)
        error_lhref.append(url)
    if num % 10 == 0:
        print('done:', str(num) + '/' + str(len(lhref)),
              'error num:' + str(len(error_lhref)))
        # time.sleep(1)  # pause briefly to avoid anti-crawler blocking
    num = num + 1
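The loop above records failed URLs so they can be run again later, but the article does not show that second pass. One possible way to do it is sketched below. The names `retry_failed`, `fetch`, and `max_rounds` are assumptions for this example, and `flaky_fetch` is a stand-in for the real download so the sketch runs without network access.

```python
def retry_failed(urls, fetch, max_rounds=3):
    """Call fetch(url) for every URL, retrying failures for up to max_rounds passes.

    Returns the URLs that still fail after the last pass.
    """
    pending = list(urls)
    for _ in range(max_rounds):
        failed = []
        for url in pending:
            try:
                fetch(url)
            except Exception as e:
                print('error message:', e, 'error url:', url)
                failed.append(url)
        if not failed:
            return []
        pending = failed
    return pending


# Stand-in for the real download: fails the first time it sees each URL.
seen = set()


def flaky_fetch(url):
    if url not in seen:
        seen.add(url)
        raise TimeoutError("simulated timeout")


remaining = retry_failed(["a", "b"], flaky_fetch)
print(remaining)
```

In the crawler, `fetch` would wrap the body of the `try` block above, and `error_lhref` would be passed in as `urls`.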

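Now that both the menu data and the page content are available, we need to build our own home page. This uses jstree and two HTML files, index.html and menu.html.

index.html: uses a frameset layout, so the panes stay relatively isolated from each other.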

<!DOCTYPE html>
<html>
<head>
  <meta charset='UTF-8'>
  <meta name='viewport' content='width=device-width, initial-scale=1, maximum-scale=1, user-scalable=no'>
  <title>Reference documentation</title>
  <script src='http://www.lshqa.cn/bcjs/js/jquery.min.js'></script>
</head>
<frameset rows='93%,7%'>
  <frameset cols='20%,80%' frameborder='yes' framespacing='1'>
    <frame src='http://www.lshqa.cn/bcjs/menu.html' name='menuframe'/>
    <frame name='showframe' />
  </frameset>
  <frameset frameborder='no' framespacing='1'>
    <frame src='http://www.lshqa.cn/bcjs/a.html' />
  </frameset>
</frameset>
</html>

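menu.html:

1. It includes data.json through a script tag so the menu can be loaded offline; reading the JSON with an ajax get fails with a cross-origin error.

2. jstree suppresses the default navigation of <a> elements, so the jump has to be done by listening for the 'changed.jstree' event.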

<!DOCTYPE html>
<html lang='en'>
<head>
  <meta charset='UTF-8'>
  <title>Title</title>
  <script src='http://www.lshqa.cn/bcjs/js/jquery.min.js'></script>
  <link rel='stylesheet' href='http://www.lshqa.cn/bcjs/themes/default/style.min.css' />
  <script src='http://www.lshqa.cn/bcjs/js/jstree.min.js'></script>
  <script type='text/javascript' src='http://www.lshqa.cn/bcjs/data.json'></script>
</head>
<body>
  <div>
    <form id='s'>
      <!-- id='q' is required by the search handler below -->
      <input type='search' id='q' />
      <button type='submit'>Search</button>
    </form>
    <div id='container'></div>
    <script>
      $(function () {
        $('#container').jstree({
          'plugins': ['search', 'changed'],
          'core': {
            'data': data
          }
        });
      });
      $('#container').on('changed.jstree', function (e, data) {
        //console.log(data.changed.selected.length); // newly selected
        //console.log(data.changed.deselected); // newly deselected
        if (data.changed.selected.length > 0) {
          // The selection changed: get the URL
          var url = data.node.a_attr.href;
          if (url != '#') {
            // Load the page in the frame named by the node's target
            parent[data.node.a_attr.target].location.href = url;
          }
        }
      });
      $('#s').submit(function (e) {
        e.preventDefault();
        $('#container').jstree(true).search($('#q').val());
      });
    </script>
  </div>
</body>
</html>
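The page passes a global variable named `data` to jstree, so a file loaded through a script tag has to assign the menu to that variable; bare JSON, as written by `json.dump`, would not define it. The article does not show how the file was adapted. One way to produce such a file directly from Python is sketched below; the function name `write_js_data` and the sample menu are assumptions made for this example.

```python
import json
import os
import tempfile


def write_js_data(data, path, var_name="data"):
    """Write the menu as a JavaScript assignment that a <script src=...> tag can load."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("var " + var_name + " = ")
        json.dump(data, f, ensure_ascii=False)
        f.write(";\n")


sample = [{"text": "Excel", "children": []}]  # hypothetical sample menu
with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "data.json")
    write_js_data(sample, target)
    with open(target, encoding="utf-8") as f:
        written = f.read()
print(written)
```

Loading the menu this way avoids the cross-origin restriction that blocks ajax requests to local files, at the cost that the file is no longer plain JSON.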

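That gives the final result: a local web version of the Excel VBA reference. As it turns out, some Office versions ship with a local VBA reference of their own, so this was partly wasted effort.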
