Header and image not getting annotated

Question

I'm trying to do automatic document annotation.

In my Python code I have added a function set_paragraph_bg_color to apply the annotation colors, a function process_docx to annotate the header and paragraphs, and one more function annotate_images for images only. I'll add more later to annotate in detail, but I wanted to get this code working first. The problem is that images (which should be annotated in light blue) aren't getting annotated, and the header (which should be annotated in yellow) isn't getting annotated either.

In the input Word document, I added one header using the Insert Header option, two paragraphs, and one image taken from the web.

In the output Word document, only the paragraphs are getting annotated.

Here's a screenshot of the input docx.

And a screenshot of the output docx.

annotate.py :

Python

from docx import Document
from docx.shared import RGBColor
from docx.oxml import OxmlElement
from docx.oxml.ns import qn


def set_paragraph_bg_color(paragraph, color_hex):
    shading_elm = OxmlElement('w:shd')
    shading_elm.set(qn('w:val'), 'clear')
    shading_elm.set(qn('w:color'), 'auto')
    shading_elm.set(qn('w:fill'), color_hex)
    paragraph._element.get_or_add_pPr().append(shading_elm)

def annotate_images(doc):
    for rel in doc.part.rels.values():
        if "image" in rel.target_ref:
            image = rel.target_part
            for paragraph in doc.paragraphs:
                for elem in paragraph._element.iter():
                    if elem.tag.endswith('}inline'):
                        if elem.attrib.get(qn('r:embed')) == rel.rel_id:    
                            set_paragraph_bg_color(paragraph, 'ADD8E6')
                            break
                        
def process_docx(input_path, output_path):
    doc = Document(input_path)

    for paragraph in doc.paragraphs:
        if paragraph.style.name.startswith('Heading'):
            set_paragraph_bg_color(paragraph, 'FFFF00') # Yellow
        elif paragraph.text.strip(): #Ensures non-empty paragraphs
            set_paragraph_bg_color(paragraph, 'D3D3D3') # Light grey

    for table in doc.tables:
        for row in table.rows:
            for cell in row.cells:
                for paragraph in cell.paragraphs:
                    if paragraph.text.strip():
                        set_paragraph_bg_color(paragraph, 'FFFFFF') #White

    annotate_images(doc)
    
    doc.save(output_path)

input_path = 'C:\\Users\\Me\\Desktop\\Document annotate tool\\New_Doc.docx'
output_path = 'annotated_document.docx'

process_docx(input_path, output_path)

print(f"Annotated Document saved to {output_path}")

What I have tried:

I tried to fix the annotate_images function by hashing image.blob and comparing it with a hash of each paragraph's text. The image was still not annotated in the generated output docx.

Python

import hashlib

def annotate_images(doc):
    for rel in doc.part.rels.values():
        if "image" in rel.target_ref:
            image = rel.target_part
            image_hash = hashlib.sha256(image.blob).hexdigest()
            for paragraph in doc.paragraphs:
                paragraph_text_hash = hashlib.sha256(paragraph.text.encode('utf-8')).hexdigest()
                if image_hash == paragraph_text_hash:
                    annotate_paragraph(paragraph, 'ADD8E6')
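
A possible reason neither version of annotate_images matches anything: in WordprocessingML the r:embed attribute sits on the a:blip element nested inside the drawing, not on wp:inline itself, so reading it from the inline element returns None. A hash of the image bytes also cannot equal a hash of paragraph text. Below is a minimal sketch of a check that looks for a:blip instead. The helper name embedded_image_rids and the sample XML are hypothetical, written only for illustration, and the block uses the standard library alone so it runs without a .docx file. It assumes the image is embedded; a picture inserted as a link to a web address would carry r:link rather than r:embed.

```python
import xml.etree.ElementTree as ET

# Namespace URIs used by WordprocessingML drawings.
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
BLIP_TAG = f"{{{A_NS}}}blip"      # the element that references the image part
EMBED_ATTR = f"{{{R_NS}}}embed"   # the attribute holding the relationship id


def embedded_image_rids(paragraph_element):
    """Return relationship ids of images embedded anywhere in a paragraph element.

    Hypothetical helper: accepts any element exposing .iter(), which includes
    the lxml element python-docx stores as paragraph._element.
    """
    rids = []
    for elem in paragraph_element.iter():
        if elem.tag == BLIP_TAG:
            rid = elem.get(EMBED_ATTR)
            if rid is not None:
                rids.append(rid)
    return rids


# Hand-written sample paragraph (an assumption, not taken from the real document).
SAMPLE_PARAGRAPH = """
<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
     xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
     xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main"
     xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture"
     xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
  <w:r><w:drawing><wp:inline><a:graphic><a:graphicData>
    <pic:pic><pic:blipFill><a:blip r:embed="rId5"/></pic:blipFill></pic:pic>
  </a:graphicData></a:graphic></wp:inline></w:drawing></w:r>
</w:p>
"""

sample_element = ET.fromstring(SAMPLE_PARAGRAPH)
found_rids = embedded_image_rids(sample_element)
print(found_rids)
```

In the script above, this could be tried inside the loop over doc.paragraphs, calling set_paragraph_bg_color(paragraph, 'ADD8E6') whenever embedded_image_rids(paragraph._element) returns a non-empty list. Because it searches for a:blip rather than wp:inline, it should also pick up floating pictures stored under wp:anchor.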

For the header part I'm really confused.
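
A likely explanation for the header: text added through Word's Insert Header option is stored in a separate header part, and doc.paragraphs only returns body paragraphs. The check style.name.startswith('Heading') matches heading styles such as "Heading 1", which is a different thing from a page header. In python-docx the header paragraphs are reached through doc.sections and section.header.paragraphs. The sketch below shows that traversal. The function name iter_header_paragraphs is hypothetical, and the stand-in objects built with SimpleNamespace exist only so the block runs without python-docx or a real file; with a real Document the same attributes (sections, header, is_linked_to_previous, paragraphs) are used.

```python
from types import SimpleNamespace


def iter_header_paragraphs(doc):
    """Yield paragraphs from every section header that has its own definition.

    Hypothetical helper: relies only on the python-docx attributes
    doc.sections, section.header, header.is_linked_to_previous and
    header.paragraphs.
    """
    for section in doc.sections:
        header = section.header
        # A linked header inherits its content from the previous section,
        # so it has no paragraphs of its own to annotate.
        if header.is_linked_to_previous:
            continue
        yield from header.paragraphs


# Stand-in for a Document with two sections (assumed data, for illustration only).
fake_doc = SimpleNamespace(
    sections=[
        SimpleNamespace(
            header=SimpleNamespace(
                is_linked_to_previous=False,
                paragraphs=["Report title"],
            )
        ),
        SimpleNamespace(
            header=SimpleNamespace(
                is_linked_to_previous=True,
                paragraphs=[],
            )
        ),
    ]
)

header_paragraphs = list(iter_header_paragraphs(fake_doc))
print(header_paragraphs)
```

In process_docx this could be used as a separate loop, for example calling set_paragraph_bg_color(paragraph, 'FFFF00') for each paragraph yielded by iter_header_paragraphs(doc) whose text is not empty.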

Posted 16-Jun-24 4:29am

Sagardeep Das


This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)