文章目录
- 概念介绍
- 形状对象
- 读取图片
- 自定义图形
概念介绍
从概念上来讲,word文档分为两层,一个文本层,一个绘画层;
- 文本层,从上到下,从左到右,流式排版,本页填满则开启新页面;
- 图层,绘画的对象也称作shapes,可以放在任意的位置。有时也被称为浮动的shapes;
- 图片,可以出现在文本层、图层,在文本层时称为inline shapes(内联的图形),被当做一个大的文本字符,行高相应增加以适应当前图片,图片会放在一行,并且有合适的宽度;
- 通常,图片单独放在一个段落中,当然它的前面、后面也可以有文本。
- 写入文本时,python-docx仅支持内联图片(inline pictures),通过
add_picture()
方法添加文档末尾的一个段落中,也可以在该段落内图片的两边添加文本。
形状对象
- InlineShapes 对象
- 类 docx.shape.InlineShapes(body_elm: CT_Body, parent: StoryPart)
- InlineShape对象的序列,可迭代,可索引访问;
- Document实例调用inline_shapes 获取所有的图片;
- InlineShape 对象
- 类 docx.shape.InlineShape(inline: CT_Inline)
- 是inline图形对象的容器,代理底层xml <wp:inline>元素;通过
_inline
属性获取底层xml元素(<CT_Inline ‘wp:inline’ at 0x16f8a7c4e50>) - width/height (读写)属性,是一个Length对象,显示的宽度、高度;
- type, 只读属性,类型; docx.enum.shape.WD_INLINE_SHAPE
>>> inline_shape.height # Length对象
914400
>>> inline_shape.height.inches # 英寸
1.0
读取图片
读取word中粘贴的图片,插入的图片等;
# __author__ = "laufing"
# 抽取 docx 中的图片
from typing import List
from docx import Document
from docx.enum.shape import WD_INLINE_SHAPE_TYPE
from docx.image.image import Image
from docx.oxml import CT_Inline, CT_GraphicalObject, CT_GraphicalObjectData, CT_Picture, CT_BlipFillProperties, CT_Blip
def extract_pics(file_path: str) -> None:
doc = Document(file_path)
for idx, shape in enumerate(doc.inline_shapes):
# 获取图片的rid
rid = shape._inline.graphic.graphicData.pic.blipFill.blip.embed
# 图片的名称
name = shape._inline.docPr.name
# 根据rid获取图片对象
image_obj = doc.part.rels.get(rid).target_part.image
print("image:", image_obj)
print("file_name:", image_obj.filename)
print("扩展名:", image_obj.ext)
print("图片二进制数据:", image_obj.blob)
print("图片二进制数据hash:", image_obj.sha1)
print("内容类型:", image_obj.content_type)
print("像素宽高{}x{}:".format(image_obj.px_width, image_obj.px_height))
print("分辨率{} x {}".format(image_obj.horz_dpi, image_obj.vert_dpi))
# 解析图片信息
width = shape.width, # Lendth对象 可以继续调用.inches/.pt/.cm
height = shape.height
print("\n")
if __name__ == '__main__':
word_path = r"C:\Users\lenovo\Desktop\cc\lauf_chapter_old.docx"
extract_pics(word_path)
输出:
自定义图形
- 底层xml
word/excel 本身是一个压缩文件,将word改名为.zip, 即可解压,在document.xml中可以查看到如下内容:
<mc:AlternateContent>
<mc:Choice Requires="wps">
<w:drawing>
<wp:anchor distT="0" distB="0" distL="114300" distR="114300" simplePos="0" relativeHeight="251659264" behindDoc="0" locked="0" layoutInCell="1" allowOverlap="1" wp14:anchorId="7EAC963D" wp14:editId="53638C19">
<wp:simplePos x="0" y="0"/>
<wp:positionH relativeFrom="column">
<wp:posOffset>612267</wp:posOffset>
</wp:positionH>
<wp:positionV relativeFrom="paragraph">
<wp:posOffset>117933</wp:posOffset>
</wp:positionV>
<wp:extent cx="2128723" cy="658368"/>
<wp:effectExtent l="0" t="0" r="24130" b="27940"/>
<wp:wrapNone/>
<wp:docPr id="1491957454" name="矩形 1"/>
<wp:cNvGraphicFramePr/>
<a:graphic xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">
<a:graphicData uri="http://schemas.microsoft.com/office/word/2010/wordprocessingShape">
<wps:wsp>
<wps:cNvSpPr/>
<wps:spPr>
<a:xfrm>
<a:off x="0" y="0"/>
<a:ext cx="2128723" cy="658368"/>
</a:xfrm>
<a:prstGeom prst="rect">
<a:avLst/>
</a:prstGeom>
</wps:spPr>
<wps:style>
<a:lnRef idx="2">
<a:schemeClr val="accent1">
<a:shade val="15000"/>
</a:schemeClr>
</a:lnRef>
<a:fillRef idx="1">
<a:schemeClr val="accent1"/>
</a:fillRef>
<a:effectRef idx="0">
<a:schemeClr val="accent1"/>
</a:effectRef>
<a:fontRef idx="minor">
<a:schemeClr val="lt1"/>
</a:fontRef>
</wps:style>
<wps:txbx>
<w:txbxContent>
<w:p w14:paraId="29343860" w14:textId="0DB31DD1" w:rsidR="00BD204B" w:rsidRDefault="00BD204B" w:rsidP="00BD204B">
<w:pPr>
<w:jc w:val="center"/>
<w:rPr>
<w:rFonts w:hint="eastAsia"/>
</w:rPr>
</w:pPr>
<w:r>
<w:rPr>
<w:rFonts w:hint="eastAsia"/>
</w:rPr>
<w:t>这是自己添加的图形</w:t>
</w:r>
</w:p>
</w:txbxContent>
</wps:txbx>
<wps:bodyPr rot="0" spcFirstLastPara="0" vertOverflow="overflow" horzOverflow="overflow" vert="horz" wrap="square" lIns="91440" tIns="45720" rIns="91440" bIns="45720" numCol="1" spcCol="0" rtlCol="0" fromWordArt="0" anchor="ctr" anchorCtr="0" forceAA="0" compatLnSpc="1">
<a:prstTxWarp prst="textNoShape">
<a:avLst/>
</a:prstTxWarp>
<a:noAutofit/>
</wps:bodyPr>
</wps:wsp>
</a:graphicData>
</a:graphic>
</wp:anchor>
</w:drawing>
</mc:Choice>
<mc:Fallback>
<w:pict>
<v:rect w14:anchorId="7EAC963D" id="矩形 1" o:spid="_x0000_s1026" style="position:absolute;left:0;text-align:left;margin-left:48.2pt;margin-top:9.3pt;width:167.6pt;height:51.85pt;z-index:251659264;visibility:visible;mso-wrap-style:square;mso-wrap-distance-left:9pt;mso-wrap-distance-top:0;mso-wrap-distance-right:9pt;mso-wrap-distance-bottom:0;mso-position-horizontal:absolute;mso-position-horizontal-relative:text;mso-position-vertical:absolute;mso-position-vertical-relative:text;v-text-anchor:middle" o:gfxdata="UEsDBBQABgAIAAAAIQC2gziS/gAAAOEBAAATAAAAW0NvbnRlbnRfVHlwZXNdLnhtbJSRQU7DMBBF 90jcwfIWJU67QAgl6YK0S0CoHGBkTxKLZGx5TGhvj5O2G0SRWNoz/78nu9wcxkFMGNg6quQqL6RA 0s5Y6ir5vt9lD1JwBDIwOMJKHpHlpr69KfdHjyxSmriSfYz+USnWPY7AufNIadK6MEJMx9ApD/oD OlTrorhX2lFEilmcO2RdNtjC5xDF9pCuTyYBB5bi6bQ4syoJ3g9WQ0ymaiLzg5KdCXlKLjvcW893 SUOqXwnz5DrgnHtJTxOsQfEKIT7DmDSUCaxw7Rqn8787ZsmRM9e2VmPeBN4uqYvTtW7jvijg9N/y JsXecLq0q+WD6m8AAAD//wMAUEsDBBQABgAIAAAAIQA4/SH/1gAAAJQBAAALAAAAX3JlbHMvLnJl bHOkkMFqwzAMhu+DvYPRfXGawxijTi+j0GvpHsDYimMaW0Yy2fr2M4PBMnrbUb/Q94l/f/hMi1qR JVI2sOt6UJgd+ZiDgffL8ekFlFSbvV0oo4EbChzGx4f9GRdb25HMsYhqlCwG5lrLq9biZkxWOiqY 22YiTra2kYMu1l1tQD30/bPm3wwYN0x18gb45AdQl1tp5j/sFB2T0FQ7R0nTNEV3j6o9feQzro1i OWA14Fm+Q8a1a8+Bvu/d/dMb2JY5uiPbhG/ktn4cqGU/er3pcvwCAAD//wMAUEsDBBQABgAIAAAA IQC4Xd63ZAIAAB4FAAAOAAAAZHJzL2Uyb0RvYy54bWysVE1v2zAMvQ/YfxB0X22nX1lQpwhadBhQ tMXaoWdFlmoDsqhRSuzs14+SHadoix2GXWRKJB+p50ddXPatYVuFvgFb8uIo50xZCVVjX0r+8+nm y5wzH4SthAGrSr5Tnl8uP3+66NxCzaAGUylkBGL9onMlr0Nwiyzzslat8EfglCWnBmxFoC2+ZBWK jtBbk83y/CzrACuHIJX3dHo9OPky4WutZLjX2qvATMmpt5BWTOs6rtnyQixeULi6kWMb4h+6aEVj qegEdS2CYBts3kG1jUTwoMORhDYDrRup0h3oNkX+5jaPtXAq3YXI8W6iyf8/WHm3fXQPSDR0zi88 mfEWvcY2fqk/1ieydhNZqg9M0uGsmM3PZ8ecSfKdnc6Pz+aRzeyQ7dCHbwpaFo2SI/2MxJHY3vow hO5DKO9QP1lhZ1RswdgfSrOmihVTdpKGujLItoJ+qpBS2VAMrlpUajguTvM8/V3qZ8pI3SXAiKwb YybsESDK7j320OsYH1NVUtaUnP+tsSF5ykiVwYYpuW0s4EcAhm41Vh7i9yQN1ESWQr/uKSSaa6h2 D8gQBol7J28aov1W+PAgkDRN6qc5Dfe0aANdyWG0OKsBf390HuNJauTlrKMZKbn/tRGoODPfLYnw a3FyEocqbU5Oz2e0wdee9WuP3bRXQH+soBfByWTG+GD2pkZon2mcV7EquYSVVLvkMuB+cxWG2aUH QarVKoXRIDkRbu2jkxE8Ehxl9dQ/C3Sj9gKp9g728yQWbyQ4xMZMC6tNAN0kfR54HamnIUwaGh+M OOWv9ynq8Kwt/wAAAP//AwBQSwMEFAAGAAgAAAAhABcwhdzeAAAACQEAAA8AAABkcnMvZG93bnJl di54bWxMj0FPwzAMhe9I/IfISNxYunZUXWk6TQgOu7Ex7Zw1XluROFWTbYVfjzmxm/3e0/PnajU5 Ky44ht6TgvksAYHUeNNTq2D/+f5UgAhRk9HWEyr4xgCr+v6u0qXxV9riZRdbwSUUSq2gi3EopQxN h06HmR+Q2Dv50enI69hKM+orlzsr0yTJpdM98YVOD/jaYfO1OzsFP5uTTD7CW7Ffb5bPWb+1h4O2 Sj0+TOsXEBGn+B+GP3xGh5qZjv5MJgirYJkvOMl6kYNgf5HNeTiykKYZyLqStx/UvwAAAP//AwBQ SwECLQAUAAYACAAAACEAtoM4kv4AAADhAQAAEwAAAAAAAAAAAAAAAAAAAAAAW0NvbnRlbnRfVHlw ZXNdLnhtbFBLAQItABQABgAIAAAAIQA4/SH/1gAAAJQBAAALAAAAAAAAAAAAAAAAAC8BAABfcmVs cy8ucmVsc1BLAQItABQABgAIAAAAIQC4Xd63ZAIAAB4FAAAOAAAAAAAAAAAAAAAAAC4CAABkcnMv ZTJvRG9jLnhtbFBLAQItABQABgAIAAAAIQAXMIXc3gAAAAkBAAAPAAAAAAAAAAAAAAAAAL4EAABk cnMvZG93bnJldi54bWxQSwUGAAAAAAQABADzAAAAyQUAAAAA " fillcolor="#4472c4 [3204]" strokecolor="#09101d [484]" strokeweight="1pt">
<v:textbox>
<w:txbxContent>
<w:p w14:paraId="29343860" w14:textId="0DB31DD1" w:rsidR="00BD204B" w:rsidRDefault="00BD204B" w:rsidP="00BD204B">
<w:pPr>
<w:jc w:val="center"/>
<w:rPr>
<w:rFonts w:hint="eastAsia"/>
</w:rPr>
</w:pPr>
<w:r>
<w:rPr>
<w:rFonts w:hint="eastAsia"/>
</w:rPr>
<w:t>这是自己添加的图形</w:t>
</w:r>
</w:p>
</w:txbxContent>
</v:textbox>
</v:rect>
</w:pict>
</mc:Fallback>
</mc:AlternateContent>
- 代码解析
def extract_shapes(file_path: str) -> None:
""" 抽取图形 """
doc = Document(file_path)
# 在段落中抽取图形
for idx, para in enumerate(doc.element.body):
for run in para:
# 找到run
if isinstance(run, CT_R):
for inner_run in run:
tag_name = inner_run.tag.split("}")[1]
if tag_name == "AlternateContent": # lxml.etree._Element对象
fallback = inner_run[1]
pict = fallback[0]
shape = pict[0] # 内部图形
print("图形数据:", shape.items())
textbox = list(shape) # shape内部是否有文本,需判断
if textbox:
textbox = textbox[0]
# 图形内的段落文本
para_list = list(textbox[0])
print("图形内部文本:", [para.text for para in para_list])