Using the docx2txt library in Python

Date: 2022-06-26 10:19:46

GitHub repository for docx2txt

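docx2txt is a Python library for extracting text and images from docx files.

Its code is adapted from python-docx. It can also extract text from headers, footers and hyperlinks, and it can now extract images as well.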

Installation

    pip install docx2txt

Usage

1. From the command line

    # extract text
    docx2txt file.docx
    # extract text and images
    docx2txt -i /tmp/img_dir file.docx

2. From Python

    import docx2txt

    # extract text
    text = docx2txt.process("file.docx")
    # extract text and write images to /tmp/img_dir
    text = docx2txt.process("file.docx", "/tmp/img_dir")
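For intuition about what a tool of this kind has to do, the sketch below uses only the Python standard library. It relies on the fact that a .docx file is a ZIP archive whose main body is stored in word/document.xml. The helper names (build_minimal_docx, extract_text) and the sample XML are made up for illustration, the archive it builds is not a complete document that Word would open, and this is not docx2txt's actual implementation.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used inside word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hand-written sample body: two paragraphs, the first split over two runs.
SAMPLE_DOCUMENT_XML = (
    '<w:document xmlns:w="%s"><w:body>'
    '<w:p><w:r><w:t>Hello</w:t></w:r>'
    '<w:r><w:t xml:space="preserve"> world</w:t></w:r></w:p>'
    '<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>'
    '</w:body></w:document>' % W_NS
)


def build_minimal_docx():
    """Return the bytes of a ZIP holding only word/document.xml (hypothetical helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", SAMPLE_DOCUMENT_XML)
    return buffer.getvalue()


def extract_text(docx_bytes):
    """Join the text of every w:t element, one output line per w:p paragraph."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    lines = []
    for paragraph in root.iter("{%s}p" % W_NS):
        pieces = [t.text or "" for t in paragraph.iter("{%s}t" % W_NS)]
        lines.append("".join(pieces))
    return "\n".join(lines)


if __name__ == "__main__":
    print(extract_text(build_minimal_docx()))
```

Headers and footers live in separate XML parts of the same archive, which is why a library has to read more than document.xml to cover them.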

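Supplement: extracting the table of contents and text-box text from Word with python-docx

Problem

Extract the table of contents, and the text inside text boxes, from a Word document using python-docx.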

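Solution

I could not find a method in the docx library that directly recognizes the table of contents or the text inside text boxes, so I fell back on a "clumsy" approach: the docx library parses a Word document into XML, so the table of contents and text-box text can be located by walking that XML. In detail:

Iterate over every element of the document. The table of contents has the tag "sdt"; once it is found, all the text under it is the table-of-contents text. A text box has the tag "textbox"; once it is found, drill further down to the elements whose tag is 'r' and extract their text to get the text-box content.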

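    import docx

    file_path = 'test.docx'  # placeholder: path to the Word document

    # Extract the Word table of contents
    file = docx.Document(file_path)
    children = file.element.body.iter()
    child_iters = []
    for child in children:
        # Identify the table of contents by its tag
        if child.tag.endswith('main}sdt'):
            for ci in child.iter():
                if ci.text and ci.text.strip():
                    child_iters.append(ci)
    catalog = [ci.text for ci in child_iters]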

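    import docx

    file_path = 'test.docx'  # placeholder: path to the Word document

    # Extract the text inside Word text boxes
    file = docx.Document(file_path)
    children = file.element.body.iter()
    child_iters = []
    for child in children:
        # Identify text boxes by their tag
        if child.tag.endswith('textbox'):
            for ci in child.iter():
                # Keep only the run elements inside the text box
                if ci.tag.endswith('main}r'):
                    child_iters.append(ci)
    textbox = [ci.text for ci in child_iters]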
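The tag-matching idea can be tried without a real Word file. The sketch below parses a small hand-written WordprocessingML fragment with the standard library's xml.etree.ElementTree instead of python-docx, and applies the same endswith checks. The fragment is a simplification of what Word writes, the texts_under name is hypothetical, and the text is read from the w:t elements because plain ElementTree elements only carry text there.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
V = "urn:schemas-microsoft-com:vml"

# Simplified sample: a table of contents (sdt), a normal paragraph,
# and a text box nested inside a pict element.
XML = (
    '<w:body xmlns:w="{w}" xmlns:v="{v}">'
    '<w:sdt><w:sdtContent>'
    '<w:p><w:r><w:t>Chapter 1</w:t></w:r></w:p>'
    '<w:p><w:r><w:t>Chapter 2</w:t></w:r></w:p>'
    '</w:sdtContent></w:sdt>'
    '<w:p><w:r><w:t>Body text</w:t></w:r></w:p>'
    '<w:p><w:r><w:pict><v:textbox><w:txbxContent>'
    '<w:p><w:r><w:t>Inside the box</w:t></w:r></w:p>'
    '</w:txbxContent></v:textbox></w:pict></w:r></w:p>'
    '</w:body>'
).format(w=W, v=V)


def texts_under(root, container_suffix):
    """Collect non-empty w:t texts below every element whose tag ends with container_suffix."""
    found = []
    for element in root.iter():
        if element.tag.endswith(container_suffix):
            for node in element.iter():
                if node.tag.endswith("main}t") and node.text and node.text.strip():
                    found.append(node.text)
    return found


root = ET.fromstring(XML)
catalog = texts_under(root, "main}sdt")   # table-of-contents entries
textbox = texts_under(root, "textbox")    # text-box content

print(catalog)
print(textbox)
```

Matching on 'main}sdt' rather than just 'sdt' keeps sibling tags such as sdtContent and sdtPr from being treated as containers, since their names do not end with that suffix.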

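For the text-box tag, I first tried AlternateContent, but later found that it missed some text boxes. Next I found pict, which covered essentially all the text boxes in my tests. Finally I listed every tag in the Word document and found that textbox looked more reliable; it also covered all the test text boxes, so I settled on that tag.

After the text was extracted, a new requirement came up: much of it comes out as fragments (phrases or single words) rather than sentences, so it has to be reassembled into paragraphs: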

    import docx

    file_path = 'test.docx'  # placeholder: path to the Word document
    file = docx.Document(file_path)
    children = file.element.body.iter()
    child_iters = []
    tags = []
    for child in children:
        # Identify text-box containers by their tag
        if child.tag.endswith(('AlternateContent', 'textbox')):
            for ci in child.iter():
                tags.append(ci.tag)  # record every tag seen, for inspection
                if ci.tag.endswith(('main}r', 'main}pPr')):
                    child_iters.append(ci)

    # Merge run texts into paragraphs; each pPr starts a new paragraph
    text = ['']
    for ci in child_iters:
        if ci.tag.endswith('main}pPr'):
            text.append('')
        else:
            text[-1] += ci.text
            ci.text = ''
    trans_text = ['***' + t + '***' for t in text]
    print(trans_text)

    # Write each paragraph's text back into the first run of that paragraph
    i, k = 0, 0
    for ci in child_iters:
        if ci.tag.endswith('main}pPr'):
            i += 1
            k = 0
        elif k == 0:
            ci.text = trans_text[i]
            k = 1
    file.save('E:/***/test.docx')

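The pPr tag is treated as the paragraph-break marker. Each extracted paragraph is wrapped in "***" and then written back into the document.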
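The regrouping step can be looked at in isolation. The sketch below works on a plain list of (tag, text) pairs instead of real XML elements, so the sample data and the group_runs name are assumptions made for the example. It starts a new paragraph each time a pPr tag is seen, and keeps the leading empty entry for runs that come before the first marker. One caveat: a paragraph in WordprocessingML is not required to contain pPr, so this marker may miss paragraph boundaries in other documents.

```python
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def group_runs(items):
    """Merge run texts into paragraphs, starting a new one at each pPr marker."""
    paragraphs = [""]
    for tag, text in items:
        if tag.endswith("main}pPr"):
            paragraphs.append("")
        else:
            paragraphs[-1] += text or ""
    return paragraphs


# Hypothetical sequence as it might come out of a text box:
# pPr markers interleaved with fragmentary runs.
items = [
    (W + "pPr", None),
    (W + "r", "Quarterly "),
    (W + "r", "report"),
    (W + "pPr", None),
    (W + "r", "Draft"),
]

paragraphs = group_runs(items)
marked = ["***" + p + "***" for p in paragraphs if p]

print(paragraphs)
print(marked)
```

Here empty paragraphs are filtered out before wrapping, which is a small departure from the code above, where every entry is wrapped.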

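Note: I also found that the AlternateContent tag has to be included. Without it, the text inside the text boxes can still be extracted, but after the text is changed and the file is saved, Word does not display the changed text.

The above is my personal experience, offered as a reference. If anything is wrong or incomplete, corrections are welcome.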
