Filter out invisible text rendered with Tr(3) #1230

Closed
svaaraniemi opened this issue Dec 3, 2024 · 5 comments
Labels
feature-request All feature requests receive this label initially, can be upgraded to "enhancement"

Comments

@svaaraniemi

Some PDFs have text rendered with the Invisible Text operator Tr(3). Note that the stroke_color of such text is usually the same as the color of normal text. Can such text be filtered out from the text extract?

The attached PDF is one page extracted from a Texas regulation (public domain) that has examples of such text. For example, pdfplumber's page.extract_text_lines() method extracts the 6th line like so:

"text": "Sec. 545.001. AA DEFINITIONS. AA In this chapter:",

Here the two instances of 'AA' are invisible text.

Chapter_545-p1.pdf

@svaaraniemi added the feature-request label Dec 3, 2024
@jsvine
Owner

jsvine commented Dec 9, 2024

Hi @svaaraniemi, and great question. It seems that doing so would require pdfminer.six to pass the self.textstate.render value:

    def do_Tr(self, render: PDFStackT) -> None:
        """Set the text rendering mode"""
        self.textstate.render = cast(int, render)

... to LTChar objects when they're created. Unfortunately, the library doesn't currently seem to do that.
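
To make the idea concrete, here is a minimal sketch in plain Python of what carrying the render value along to each piece of shown text could look like. It is a toy scanner, not pdfminer.six code: it understands only unescaped literal strings, numbers, names, and the Tr and Tj operators, and it ignores q/Q state saving, TJ arrays, and fonts. The function name scan_text, the record layout, and the sample content stream are all made up for illustration.

```python
import re

# A literal string "(...)" without nested parentheses or escapes,
# or any other run of non-space, non-parenthesis characters.
TOKEN = re.compile(r"\((?P<string>[^()\\]*)\)|(?P<other>[^\s()]+)")

INVISIBLE = 3  # text rendering mode 3: neither fill nor stroke


def scan_text(content: str) -> list:
    """Return one record per Tj operand, tagged with the active render mode."""
    render = 0      # the default text rendering mode is 0 (fill)
    operands = []   # operands collected since the last operator
    shown = []
    for match in TOKEN.finditer(content):
        if match.group("string") is not None:
            operands.append(match.group("string"))
            continue
        token = match.group("other")
        if token == "Tr":
            # Same job as do_Tr: remember the mode in the text state.
            render = int(operands[-1])
        elif token == "Tj":
            # The step the thread says is missing: attach the mode to the text.
            shown.append({"text": operands[-1], "render": render})
        elif re.fullmatch(r"[+-]?\d+(\.\d+)?", token) or token.startswith("/"):
            operands.append(token)  # a number or a name is an operand
            continue
        operands.clear()            # any operator consumes its operands
    return shown


# Hypothetical content stream modelled on the line quoted in this issue.
SAMPLE = (
    "BT /F1 12 Tf (Sec. 545.001. ) Tj 3 Tr (AA) Tj 0 Tr (DEFINITIONS. ) Tj "
    "3 Tr (AA) Tj 0 Tr (In this chapter:) Tj ET"
)

records = scan_text(SAMPLE)
visible = "".join(r["text"] for r in records if r["render"] != INVISIBLE)
print(visible)
```

Each record carries the mode that was active when its text was shown, so a later filter can drop mode 3 without having to look at colors.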

@svaaraniemi
Author

Thanks for confirming what I suspected. Perhaps this way of placing hidden text in a PDF is not common enough, and it never caught the pdfminer.six team's attention.

@jsvine
Owner

jsvine commented Dec 16, 2024

Thanks regardless, @svaaraniemi. Wondering if @dhdaines has any thoughts on this re. PLAYA?

@dhdaines
Contributor

dhdaines commented Dec 16, 2024

> Thanks regardless, @svaaraniemi. Wondering if @dhdaines has any thoughts on this re. PLAYA?

Hi! The rendering mode is parsed by pdfminer.six, but it isn't accessible from the layout analyser, which is what pdfplumber uses to create its objects. To get around this, you have to override a couple of methods: render_string (which gets passed a textstate argument that contains the rendering mode) and also render_char. We could do this in pdfplumber.page.PDFPageAggregatorWithMarkedContent, which already overrides render_char, but it could be a bit fragile.
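
The override approach described above can be sketched as a mixin. This is a pattern illustration only: StandInDevice and its items list are hypothetical stand-ins so the example runs without pdfminer.six, and the real render_string and render_char take more arguments and store characters differently, so the details would need checking against the installed version. The only names taken from this thread are render_string, render_char, and textstate.render.

```python
from types import SimpleNamespace


class RenderModeMixin:
    """Remember the mode seen in render_string, then tag chars in render_char."""

    _render_mode = 0  # default rendering mode until a string is rendered

    def render_string(self, textstate, *args, **kwargs):
        # textstate.render is the value that do_Tr stores.
        self._render_mode = textstate.render
        return super().render_string(textstate, *args, **kwargs)

    def render_char(self, *args, **kwargs):
        adv = super().render_char(*args, **kwargs)
        # Assumption: the base class has just appended the new char object.
        self.items[-1].render = self._render_mode
        return adv


class StandInDevice:
    """Hypothetical stand-in for a layout device; NOT a pdfminer.six class."""

    def __init__(self):
        self.items = []

    def render_string(self, textstate, seq):
        for ch in seq:
            self.render_char(ch)

    def render_char(self, ch):
        self.items.append(SimpleNamespace(text=ch))
        return 1.0  # pretend advance width


class TaggingDevice(RenderModeMixin, StandInDevice):
    pass


device = TaggingDevice()
device.render_string(SimpleNamespace(render=0), "Sec. ")
device.render_string(SimpleNamespace(render=3), "AA")
device.render_string(SimpleNamespace(render=0), "DEF")

visible = "".join(c.text for c in device.items if c.render != 3)
print(visible)
```

The sketch depends on render_string being called before the render_char calls it triggers, and on the base class storing the new char where the mixin expects it. That coupling is the kind of thing that makes this approach fragile.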

This is why people say that implementation inheritance is considered harmful ;-)

In the PLAYA branch it would be simple: just add obj["render"] = textstate.render_mode to process_object...

@dhdaines
Contributor

I should mention while I'm here that there are some other common ways of hiding text that aren't supported by pdfminer.six (or PLAYA) either:

  1. A technique I affectionately refer to as "whiteout": set the stroking and non-stroking colors to DeviceGray(1.0) :-)
  2. Setting the clipping path to, well, anything that excludes the text you want to hide. Sometimes this can be as simple as setting it to an empty path, but it can also be any arbitrary path (often a rectangle of the margins of the page). I made an attempt to support this in pdfminer/pdfminer.six#1026 ("Very approximate support for hiding text using clipping path"), but it is kind of a hack.
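
For what it's worth, both techniques can be roughly approximated after extraction with simple heuristics. The sketch below works on plain dicts shaped like pdfplumber char objects (x0, x1, top, bottom, stroking_color, non_stroking_color). The helper names are invented here, the guess that a color's length identifies its color space is an assumption, and the clip rectangle is a hypothetical input that you would have to supply yourself, since the clipping path isn't exposed.

```python
def is_white(color) -> bool:
    """True for white given as gray 1, RGB (1, 1, 1) or CMYK (0, 0, 0, 0).

    Assumption: 1 component means gray, 3 means RGB, 4 means CMYK.
    """
    if color is None:
        return False
    values = tuple(color) if isinstance(color, (list, tuple)) else (color,)
    if len(values) in (1, 3):
        return all(v == 1 for v in values)
    if len(values) == 4:
        return all(v == 0 for v in values)
    return False


def is_whiteout(char: dict) -> bool:
    """Both the stroking and the non-stroking color are white."""
    return is_white(char.get("non_stroking_color")) and is_white(
        char.get("stroking_color")
    )


def outside_clip(char: dict, clip) -> bool:
    """True when the char's box does not overlap the clip rectangle at all.

    clip is a hypothetical (x0, top, x1, bottom) rectangle.
    """
    x0, top, x1, bottom = clip
    return (
        char["x1"] <= x0
        or char["x0"] >= x1
        or char["bottom"] <= top
        or char["top"] >= bottom
    )


# Made-up sample data: one normal char, one "whiteout" char, one off-page char.
black = {"non_stroking_color": (0,), "stroking_color": (0,)}
white = {"non_stroking_color": (1,), "stroking_color": (1,)}
chars = [
    {"text": "A", "x0": 10, "x1": 20, "top": 10, "bottom": 20, **black},
    {"text": "B", "x0": 20, "x1": 30, "top": 10, "bottom": 20, **white},
    {"text": "C", "x0": 500, "x1": 510, "top": 10, "bottom": 20, **black},
]
clip = (0, 0, 100, 100)

kept = [
    c["text"] for c in chars if not is_whiteout(c) and not outside_clip(c, clip)
]
print(kept)
```

Neither check is reliable on its own: white text drawn over a dark background is perfectly visible, and a single rectangle cannot represent an arbitrary clipping path.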
