Dev log: checking Word documents for duplicated content with Python

Posted 2023-9-21 13:13:28

Last edited by 星云游 on 2023-9-22 00:50

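Steps
1. Walk through all the files
2. Open the Word files in Python and compare their contents two at a time
3. Compute the similarity ratio and store it in a matrix
3.1 Generate a file that marks the duplicated text
4. Export to Excel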
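
Steps 3 and 4 can be sketched with the standard library alone. The names `similarity_matrix` and `matrix_to_csv`, the sample texts, and the use of CSV (which Excel opens directly) instead of a real .xlsx export are assumptions for illustration, not the author's method.

```python
import csv
import difflib
import io


def similarity_matrix(texts):
    """Return (names, matrix) where matrix[i][j] is the ratio for names[i] vs names[j]."""
    names = sorted(texts)
    n = len(names)
    matrix = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            ratio = difflib.SequenceMatcher(None, texts[names[i]], texts[names[j]]).ratio()
            # Simplification: ratio() can depend on argument order,
            # but the value is mirrored here so the matrix is symmetric.
            matrix[i][j] = matrix[j][i] = ratio
    return names, matrix


def matrix_to_csv(names, matrix):
    """Render the matrix as CSV text with file names as row and column headers."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow([""] + names)
    for name, row in zip(names, matrix):
        writer.writerow([name] + [f"{value:.3f}" for value in row])
    return buffer.getvalue()


# Hypothetical sample data standing in for the text extracted from Word files.
sample = {"a.docx": "abcdef", "b.docx": "abcxef", "c.docx": "zzzzzz"}
names, matrix = similarity_matrix(sample)
print(matrix_to_csv(names, matrix))
```

Each file is compared with every other file once, so the work grows with the square of the number of files.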

Final code

from docx import Document
import os
import difflib

path = os.path.join(".", "files")
text = dict()
for parent, dirnames, filenames in os.walk(path):
    for filename in filenames:
        # Open the existing Word file as a Document object.
        # os.path.join(parent, ...) also handles files in subfolders.
        document = Document(os.path.join(parent, filename))
        text[filename] = ""
        for table in document.tables:
            # Read the table rows
            for row in table.rows:
                # Read the cells of each row
                for cell in row.cells:
                    # Append the cell text
                    text[filename] = text[filename] + cell.text
        # Strip newlines, spaces, non-breaking spaces and the U+2716 mark
        text[filename] = text[filename].replace("\n", "")
        text[filename] = text[filename].replace(" ", "")
        text[filename] = text[filename].replace('\xa0', '')
        text[filename] = text[filename].replace('\u2716', '')
for key1 in text.keys():
    for key2 in text.keys():
        if key1 < key2:  # compare each pair only once
            similarity = difflib.SequenceMatcher(None, text[key1], text[key2]).ratio()
            print(key1 + key2)
            print(similarity)
            if similarity > 0.8:
                print("Similarity is too high. If you copied, please revise your work!")
                differ = difflib.HtmlDiff()
                html = differ.make_file(text[key1], text[key2])
                # make_file declares charset utf-8 by default, so write the file
                # as UTF-8 instead of patching the declared charset to gbk.
                with open(key1 + key2 + '.html', 'w', encoding='utf-8') as fid:
                    fid.write(html)
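
`HtmlDiff.make_file` expects two lists of lines. Passing a plain string, as the script above does, makes every character its own row in the report. One possible alternative is to cut the text into fixed-width chunks first. The helpers `chunk` and `html_report` and the width of 40 characters are assumptions for illustration.

```python
import difflib


def chunk(text, width=40):
    """Split a string into pieces of at most `width` characters."""
    return [text[i:i + width] for i in range(0, len(text), width)]


def html_report(name1, text1, name2, text2, width=40):
    """Build a side-by-side HTML diff with the file names as column titles."""
    differ = difflib.HtmlDiff()
    return differ.make_file(chunk(text1, width), chunk(text2, width),
                            fromdesc=name1, todesc=name2)


# Hypothetical sample strings in place of the extracted document text.
report = html_report("a.docx", "x" * 100, "b.docx", "x" * 60 + "y" * 40)
print(len(report))
```

The returned string can be written with `open(..., 'w', encoding='utf-8')`. Fixed-width chunks drift out of alignment after an insertion, so splitting on sentence punctuation may give a more readable report.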

Original poster | Posted 2023-9-21 13:17:19

Last edited by 星云游 on 2023-9-21 13:42

Working with Word files in Python

pip install python-docx

Reading an existing document

from docx import Document
document = Document("word.docx")  # Open an existing Word file as a Document object

Reading the document content

from docx import Document
document = Document("word.docx")  # Open an existing Word file as a Document object
all_paragraphs = document.paragraphs
print(type(all_paragraphs))
for paragraph in all_paragraphs:
    # print(paragraph.style.name)  # Print the style name of each paragraph
    # Print the text of each paragraph
    print(paragraph.text)
# Loop over the runs inside each paragraph.
# A run is a stretch of text that shares the same character formatting.
for paragraph in all_paragraphs:
    for run in paragraph.runs:
        print(run.text)  # Print the run text

Viewing the source XML

print(paragraph._p.xml)
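
The XML printed this way comes from `word/document.xml` inside the .docx file, which is an ordinary zip archive. As a rough sketch, the same text can be pulled out with the standard library only. The helper name `docx_text` is an assumption, and the tiny in-memory archive below is not a complete Word file, just enough to exercise the parser.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_text(file_obj):
    """Concatenate the text of every w:t element in word/document.xml."""
    with zipfile.ZipFile(file_obj) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    return "".join(node.text or "" for node in root.iter(W_NS + "t"))


# Build a stripped-down archive in memory, for demonstration only.
xml = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:body><w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>Word</w:t></w:r></w:p></w:body>"
    "</w:document>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", xml)
buffer.seek(0)
print(docx_text(buffer))
```

Because `root.iter` visits every descendant, this picks up text inside table cells as well as in body paragraphs.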

Original poster | Posted 2023-9-21 13:42:49

Reading tables

tables = document.tables
for table in tables:
    # Read the table rows
    for row in table.rows:
        # Read the cells of each row
        for cell in row.cells:
            # Print the cell text
            print(cell.text)
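
Cell text usually carries layout characters that would distort a comparison. The final script in the first post removes them with four chained `replace` calls; a single `str.translate` pass is one possible alternative. The names `normalize` and `_DROP` are assumptions for illustration.

```python
# Characters the final script removes: newline, space, non-breaking space, U+2716.
_DROP = str.maketrans("", "", "\n \xa0\u2716")


def normalize(text):
    """Delete layout characters so they do not affect the similarity ratio."""
    return text.translate(_DROP)


# Hypothetical cell content.
print(normalize("Name:\xa0 Zhang San\n\u2716"))
```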

Original poster | Posted 2023-9-21 13:50:06

Text similarity check

import difflib
text1 = 'Python is a programming language'
text2 = 'Python is a programming language.'
# Compute the similarity of the two texts
similarity = difflib.SequenceMatcher(None, text1, text2).ratio()
# Print the similarity
print(similarity)
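
`ratio()` is documented as 2.0*M/T, where M is the number of matched characters and T is the combined length of both strings. For the two strings above, M = 32 and T = 32 + 33 = 65, so the ratio is 64/65, roughly 0.985. The sketch below recomputes it by hand from the matching blocks; the variable names `matched`, `total` and `manual_ratio` are my own.

```python
import difflib

text1 = 'Python is a programming language'
text2 = 'Python is a programming language.'

matcher = difflib.SequenceMatcher(None, text1, text2)
# Sum the sizes of all matching blocks (the final block is a zero-length sentinel).
matched = sum(block.size for block in matcher.get_matching_blocks())
total = len(text1) + len(text2)
manual_ratio = 2 * matched / total

print(matched, total)
print(manual_ratio, matcher.ratio())
```

Since the score depends on total length, a short copied passage inside two long documents produces a low ratio, which matters when choosing a threshold such as 0.8.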

Posted 2024-1-14 02:18:24

Check for duplication and write the results to CSV

from docx import Document
import os
import difflib
import csv

path = os.path.join(".", "files")
text = dict()
for parent, dirnames, filenames in os.walk(path):
    for filename in filenames:
        # os.path.join(parent, ...) also handles files in subfolders
        full_path = os.path.join(parent, filename)
        document = Document(full_path)
        print(full_path)
        text[filename] = ""
        for table in document.tables:
            # Read the table rows
            for row in table.rows:
                # Read the cells of each row
                for cell in row.cells:
                    # Append the cell text
                    text[filename] = text[filename] + cell.text
        # Strip newlines, spaces, non-breaking spaces and the U+2716 mark
        text[filename] = text[filename].replace("\n", "")
        text[filename] = text[filename].replace(" ", "")
        text[filename] = text[filename].replace('\xa0', '')
        text[filename] = text[filename].replace('\u2716', '')
# The with block closes the file; newline='' lets the csv module control line endings
with open('output.csv', mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)  # create the writer once, not once per pair
    for key1 in text.keys():
        print(key1 + "...")
        for key2 in text.keys():
            if key1 < key2:  # compare each pair only once
                similarity = difflib.SequenceMatcher(None, text[key1], text[key2]).ratio()
                writer.writerow([key1, key2, similarity])
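
Once the pairs are in the CSV file, the suspicious ones still have to be picked out. A minimal sketch, assuming the same three-column layout (file 1, file 2, similarity) and reusing the 0.8 threshold from the first post; the function name `flag_pairs` and the sample rows are hypothetical.

```python
import csv
import io


def flag_pairs(csv_file, threshold=0.8):
    """Return (file1, file2, similarity) rows above threshold, highest first."""
    rows = [(a, b, float(score)) for a, b, score in csv.reader(csv_file)]
    flagged = [row for row in rows if row[2] > threshold]
    return sorted(flagged, key=lambda row: row[2], reverse=True)


# Hypothetical rows in the same layout the script writes to output.csv.
sample = io.StringIO("a.docx,b.docx,0.93\na.docx,c.docx,0.41\nb.docx,c.docx,0.85\n")
for file1, file2, score in flag_pairs(sample):
    print(file1, file2, score)
```

With the real file, pass `open('output.csv', newline='', encoding='utf-8')` instead of the `StringIO` object.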
