I'm trying to do automatic document annotation.
In my Python code I have added a function set_paragraph_bg_color to apply the annotation colors, a function process_docx to annotate the header and paragraphs, and one more function, annotate_images, for images only.
I'll add more to annotate in detail later, but I wanted to get this code working first. The problem is that the image (which should be annotated in light blue) isn't getting annotated, and the header (which should be annotated in yellow) isn't getting annotated either.
In the input Word document, I added 1 header using Word's Insert Header option, 2 paragraphs, and 1 image taken from the web.
In the output Word document, only the paragraphs are getting annotated.
Here's a screenshot of the input docx.
And here's a screenshot of the output docx.
annotate.py:
<pre>from docx import Document
from docx.shared import RGBColor
from docx.oxml import OxmlElement
from docx.oxml.ns import qn


def set_paragraph_bg_color(paragraph, color_hex):
    shading_elm = OxmlElement('w:shd')
    shading_elm.set(qn('w:val'), 'clear')
    shading_elm.set(qn('w:color'), 'auto')
    shading_elm.set(qn('w:fill'), color_hex)
    paragraph._element.get_or_add_pPr().append(shading_elm)


def annotate_images(doc):
    for rel in doc.part.rels.values():
        if "image" in rel.target_ref:
            image = rel.target_part
            for paragraph in doc.paragraphs:
                for elem in paragraph._element.iter():
                    if elem.tag.endswith('}inline'):
                        if elem.attrib.get(qn('r:embed')) == rel.rel_id:
                            set_paragraph_bg_color(paragraph, 'ADD8E6')
                            break


def process_docx(input_path, output_path):
    doc = Document(input_path)
    for paragraph in doc.paragraphs:
        if paragraph.style.name.startswith('Heading'):
            set_paragraph_bg_color(paragraph, 'FFFF00')
        elif paragraph.text.strip():
            set_paragraph_bg_color(paragraph, 'D3D3D3')
    for table in doc.tables:
        for row in table.rows:
            for cell in row.cells:
                for paragraph in cell.paragraphs:
                    if paragraph.text.strip():
                        set_paragraph_bg_color(paragraph, 'FFFFFF')
    annotate_images(doc)
    doc.save(output_path)


input_path = 'C:\\Users\\Me\\Desktop\\Document annotate tool\\New_Doc.docx'
output_path = 'annotated_document.docx'
process_docx(input_path, output_path)
print(f"Annotated Document saved to {output_path}")</pre>
What I have tried:
I've tried to fix the annotate_images function by hashing image.blob and comparing that hash with a hash of each paragraph's text. This didn't work either when generating the output docx.
<pre>import hashlib


def annotate_images(doc):
    for rel in doc.part.rels.values():
        if "image" in rel.target_ref:
            image = rel.target_part
            image_hash = hashlib.sha256(image.blob).hexdigest()
            for paragraph in doc.paragraphs:
                paragraph_text_hash = hashlib.sha256(paragraph.text.encode('utf-8')).hexdigest()
                if image_hash == paragraph_text_hash:
                    annotate_paragraph(paragraph, 'ADD8E6')</pre>
For the header part I'm really confused.