I'm trying to do automatic document annotation

In my Python code I have three functions: set_paragraph_bg_color, which applies the annotation colors; process_docx, which annotates headers and paragraphs; and annotate_images, which handles images only. I plan to add more detailed annotation later, but I wanted to get this code working first. The problem is that the image (which should be annotated in light blue) isn't getting annotated, and the header (which should be annotated in yellow) isn't either.

In the input Word document, I added one header using Word's insert header option, two paragraphs, and one image taken from the web.

In the output Word document, only the paragraphs are getting annotated.

Here's a screenshot of the input docx.

And here's a screenshot of the output docx.

annotate.py :

Python
from docx import Document
from docx.shared import RGBColor
from docx.oxml import OxmlElement
from docx.oxml.ns import qn


def set_paragraph_bg_color(paragraph, color_hex):
    shading_elm = OxmlElement('w:shd')
    shading_elm.set(qn('w:val'), 'clear')
    shading_elm.set(qn('w:color'), 'auto')
    shading_elm.set(qn('w:fill'), color_hex)
    paragraph._element.get_or_add_pPr().append(shading_elm)

def annotate_images(doc):
    for rel in doc.part.rels.values():
        if "image" in rel.target_ref:
            image = rel.target_part
            for paragraph in doc.paragraphs:
                for elem in paragraph._element.iter():
                    if elem.tag.endswith('}inline'):
                        if elem.attrib.get(qn('r:embed')) == rel.rel_id:    
                            set_paragraph_bg_color(paragraph, 'ADD8E6')
                            break
                        
def process_docx(input_path, output_path):
    doc = Document(input_path)

    for paragraph in doc.paragraphs:
        if paragraph.style.name.startswith('Heading'):
            set_paragraph_bg_color(paragraph, 'FFFF00') # Yellow
        elif paragraph.text.strip(): #Ensures non-empty paragraphs
            set_paragraph_bg_color(paragraph, 'D3D3D3') # Light grey

    for table in doc.tables:
        for row in table.rows:
            for cell in row.cells:
                for paragraph in cell.paragraphs:
                    if paragraph.text.strip():
                        set_paragraph_bg_color(paragraph, 'FFFFFF') #White

    annotate_images(doc)
    
    doc.save(output_path)

input_path = 'C:\\Users\\Me\\Desktop\\Document annotate tool\\New_Doc.docx'
output_path = 'annotated_document.docx'

process_docx(input_path, output_path)

print(f"Annotated Document saved to {output_path}")



What I have tried:

I tried to fix the annotate_images function by hashing image.blob and comparing that hash with a hash of each paragraph's text. The output docx still came out without the image annotated.

Python
import hashlib

def annotate_images(doc):
    for rel in doc.part.rels.values():
        if "image" in rel.target_ref:
            image = rel.target_part
            image_hash = hashlib.sha256(image.blob).hexdigest()
            for paragraph in doc.paragraphs:
                paragraph_text_hash = hashlib.sha256(paragraph.text.encode('utf-8')).hexdigest()
                if image_hash == paragraph_text_hash:
                    annotate_paragraph(paragraph, 'ADD8E6')
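
One likely reason the image check in annotate_images never matches: in WordprocessingML, the r:embed attribute normally sits on the a:blip element nested inside the drawing, not on the wp:inline element itself. The sketch below illustrates this with the standard library only. The helper name embedded_image_rel_ids and the sample XML are assumptions made for illustration; the XML is hand-written and simplified, not taken from the actual document.

```python
import xml.etree.ElementTree as ET

# Standard OOXML namespace URIs.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
WP_NS = "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"


def embedded_image_rel_ids(paragraph_element):
    """Collect the r:embed relationship ids of all a:blip elements in a paragraph."""
    rel_ids = []
    for elem in paragraph_element.iter():
        if elem.tag == f"{{{A_NS}}}blip":
            rel_id = elem.get(f"{{{R_NS}}}embed")
            if rel_id:
                rel_ids.append(rel_id)
    return rel_ids


# Hypothetical, simplified paragraph XML (real files nest a:blip more deeply).
SAMPLE_XML = f"""
<w:p xmlns:w="{W_NS}" xmlns:wp="{WP_NS}" xmlns:a="{A_NS}" xmlns:r="{R_NS}">
  <w:r><w:drawing><wp:inline>
    <a:graphic><a:graphicData><a:blip r:embed="rId4"/></a:graphicData></a:graphic>
  </wp:inline></w:drawing></w:r>
</w:p>
""".strip()

paragraph = ET.fromstring(SAMPLE_XML)
inline = paragraph.find(f".//{{{WP_NS}}}inline")
print(inline.get(f"{{{R_NS}}}embed"))      # None: wp:inline carries no r:embed
print(embedded_image_rel_ids(paragraph))   # ['rId4']
```

With python-docx, paragraph._element supports the same iter(), tag, and get() calls, so the inner test in annotate_images could plausibly be replaced by checking whether the relationship id appears in embedded_image_rel_ids(paragraph._element). Two points to verify against the installed version: python-docx relationship objects appear to expose the id as rId rather than rel_id, and floating images use wp:anchor instead of wp:inline, which searching for a:blip covers in both cases.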


As for the header, I'm really confused about why it isn't being annotated.
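
A possible explanation for the header, offered as a sketch rather than a confirmed fix: in python-docx, doc.paragraphs covers only the document body, while page headers are reached through each section's header object. A header added with Word's insert header option also typically does not use a "Heading" style, so the startswith('Heading') check would not match it even if it were visited. The helper below, annotate_headers, is a hypothetical name; its shade parameter is an assumption meant to receive set_paragraph_bg_color from annotate.py.

```python
def annotate_headers(doc, shade, color_hex="FFFF00"):
    """Shade every non-empty paragraph in each section's page header.

    doc   -- a python-docx Document (anything exposing .sections works)
    shade -- callable(paragraph, color_hex), e.g. set_paragraph_bg_color
    Returns the number of header paragraphs that were shaded.
    """
    shaded = 0
    for section in doc.sections:
        header = section.header
        # A linked header reuses the previous section's definition,
        # so skip it to avoid shading the same paragraphs twice.
        if header.is_linked_to_previous:
            continue
        for paragraph in header.paragraphs:
            if paragraph.text.strip():
                shade(paragraph, color_hex)
                shaded += 1
    return shaded
```

In process_docx this could be called as annotate_headers(doc, set_paragraph_bg_color) before doc.save(output_path). Text placed inside a table in the header would need a similar loop over header.tables.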

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)
