diff --git a/README.md b/README.md
index 6ff878fc..e5098b27 100644
--- a/README.md
+++ b/README.md
@@ -324,7 +324,7 @@ If you're using `pdfplumber` on a Debian-based system and encounter a `PolicyErr
 |--------|-------------|
 |`.extract_text(x_tolerance=3, y_tolerance=3, layout=False, x_density=7.25, y_density=13, **kwargs)`| Collates all of the page's character objects into a single string.|
 |`.extract_text_simple(x_tolerance=3, y_tolerance=3)`| A slightly faster but less flexible version of `.extract_text(...)`, using a simpler logic.|
-|`.extract_words(x_tolerance=3, y_tolerance=3, keep_blank_chars=False, use_text_flow=False, horizontal_ltr=True, vertical_ttb=True, extra_attrs=[], split_at_punctuation=False)`| Returns a list of all word-looking things and their bounding boxes. Words are considered to be sequences of characters where (for "upright" characters) the difference between the `x1` of one character and the `x0` of the next is less than or equal to `x_tolerance` *and* where the `doctop` of one character and the `doctop` of the next is less than or equal to `y_tolerance`. A similar approach is taken for non-upright characters, but instead measuring the vertical, rather than horizontal, distances between them. The parameters `horizontal_ltr` and `vertical_ttb` indicate whether the words should be read from left-to-right (for horizontal words) / top-to-bottom (for vertical words). Changing `keep_blank_chars` to `True` will mean that blank characters are treated as part of a word, not as a space between words. Changing `use_text_flow` to `True` will use the PDF's underlying flow of characters as a guide for ordering and segmenting the words, rather than presorting the characters by x/y position. (This mimics how dragging a cursor highlights text in a PDF; as with that, the order does not always appear to be logical.) Passing a list of `extra_attrs` (e.g., `["fontname", "size"]` will restrict each words to characters that share exactly the same value for each of those [attributes](#char-properties), and the resulting word dicts will indicate those attributes. Setting `split_at_punctuation` to `True` will enforce breaking tokens at punctuations specified by `string.punctuation`; or you can specify the list of separating punctuation by pass a string, e.g., split_at_punctuation='!"&\'()*+,.:;<=>?@[\]^\`\{\|\}~'. |
+|`.extract_words(x_tolerance=3, y_tolerance=3, keep_blank_chars=False, use_text_flow=False, horizontal_ltr=True, vertical_ttb=True, extra_attrs=[], split_at_punctuation=False, expand_ligatures=True)`| Returns a list of all word-looking things and their bounding boxes. Words are considered to be sequences of characters where (for "upright" characters) the difference between the `x1` of one character and the `x0` of the next is less than or equal to `x_tolerance` *and* where the `doctop` of one character and the `doctop` of the next is less than or equal to `y_tolerance`. A similar approach is taken for non-upright characters, but instead measuring the vertical, rather than horizontal, distances between them. The parameters `horizontal_ltr` and `vertical_ttb` indicate whether the words should be read from left-to-right (for horizontal words) / top-to-bottom (for vertical words). Changing `keep_blank_chars` to `True` will mean that blank characters are treated as part of a word, not as a space between words. Changing `use_text_flow` to `True` will use the PDF's underlying flow of characters as a guide for ordering and segmenting the words, rather than presorting the characters by x/y position. (This mimics how dragging a cursor highlights text in a PDF; as with that, the order does not always appear to be logical.) Passing a list of `extra_attrs` (e.g., `["fontname", "size"]`) will restrict each word to characters that share exactly the same value for each of those [attributes](#char-properties), and the resulting word dicts will indicate those attributes. Setting `split_at_punctuation` to `True` will enforce breaking tokens at punctuation specified by `string.punctuation`; or you can specify the list of separating punctuation by passing a string, e.g., split_at_punctuation='!"&\'()*+,.:;<=>?@[\]^\`\{\|\}~'. Unless you set `expand_ligatures=False`, ligatures such as `ﬁ` will be expanded into their constituent letters (e.g., `fi`).|
 |`.extract_text_lines(layout=False, strip=True, return_chars=True, **kwargs)`|*Experimental feature* that returns a list of dictionaries representing the lines of text on the page. The `strip` parameter works analogously to Python's `str.strip()` method, and returns `text` attributes without their surrounding whitespace. (Only relevant when `layout = True`.) Setting `return_chars` to `False` will exclude the individual character objects from the returned text-line dicts. The remaining `**kwargs` are those you would pass to `.extract_text(layout=True, ...)`.|
 |`.search(pattern, regex=True, case=True, main_group=0, return_groups=True, return_chars=True, layout=False, **kwargs)`|*Experimental feature* that allows you to search a page's text, returning a list of all instances that match the query. For each instance, the response dictionary object contains the matching text, any regex group matches, the bounding box coordinates, and the char objects themselves. `pattern` can be a compiled regular expression, an uncompiled regular expression, or a non-regex string. If `regex` is `False`, the pattern is treated as a non-regex string. If `case` is `False`, the search is performed in a case-insensitive manner. Setting `main_group` restricts the results to a specific regex group within the `pattern` (default of `0` means the entire match). Setting `return_groups` and/or `return_chars` to `False` will exclude the lists of the matched regex groups and/or characters from being added (as `"groups"` and `"chars"` to the return dicts). The `layout` parameter operates as it does for `.extract_text(...)`. The remaining `**kwargs` are those you would pass to `.extract_text(layout=True, ...)`. __Note__: Zero-width and all-whitespace matches are discarded, because they (generally) have no explicit position on the page. |
 |`.dedupe_chars(tolerance=1)`| Returns a version of the page with duplicate chars — those sharing the same text, fontname, size, and positioning (within `tolerance` x/y) as other characters — removed. (See [Issue #71](https://github.com/jsvine/pdfplumber/issues/71) to understand the motivation.)|
diff --git a/pdfplumber/utils/text.py b/pdfplumber/utils/text.py
index 2c436da0..d928c155 100644
--- a/pdfplumber/utils/text.py
+++ b/pdfplumber/utils/text.py
@@ -15,6 +15,16 @@
 DEFAULT_X_DENSITY = 7.25
 DEFAULT_Y_DENSITY = 13
 
+LIGATURES = {
+    "ﬀ": "ff",
+    "ﬃ": "ffi",
+    "ﬄ": "ffl",
+    "ﬁ": "fi",
+    "ﬂ": "fl",
+    "ﬆ": "st",
+    "ſt": "st",
+}
+
 
 class TextMap:
     """
@@ -136,6 +146,7 @@ def to_textmap(
         y_shift: T_num = 0,
         y_tolerance: T_num = DEFAULT_Y_TOLERANCE,
         presorted: bool = False,
+        expand_ligatures: bool = True,
     ) -> TextMap:
         """
         Given a list of (word, chars) tuples (i.e., a WordMap), return a list of
@@ -177,6 +188,8 @@
         if not len(self.tuples):
             return TextMap(_textmap)
 
+        expansions = LIGATURES if expand_ligatures else {}
+
         if layout:
             if layout_width_chars:
                 if layout_width:
@@ -236,10 +249,13 @@
             x_dist = (word["x0"] - x_shift) / x_density if layout else 0
             num_spaces_prepend = max(min(1, line_len), round(x_dist) - line_len)
             _textmap += [(" ", None)] * num_spaces_prepend
+            line_len += num_spaces_prepend
+
             for c in chars:
-                for letter in c["text"]:
+                letters = expansions.get(c["text"], c["text"])
+                for letter in letters:
                     _textmap.append((letter, c))
-            line_len += num_spaces_prepend + len(word["text"])
+                    line_len += 1
 
         # Append spaces at end of line
         if layout:
@@ -271,6 +287,7 @@
         vertical_ttb: bool = True,  # Should vertical words be read top-to-bottom?
         extra_attrs: Optional[List[str]] = None,
         split_at_punctuation: Union[bool, str] = False,
+        expand_ligatures: bool = True,
     ):
         self.x_tolerance = x_tolerance
         self.y_tolerance = y_tolerance
@@ -287,6 +304,8 @@
             else (split_at_punctuation or "")
         )
 
+        self.expansions = LIGATURES if expand_ligatures else {}
+
     def merge_chars(self, ordered_chars: T_obj_list) -> T_obj:
         x0, top, x1, bottom = objects_to_bbox(ordered_chars)
         doctop_adj = ordered_chars[0]["doctop"] - ordered_chars[0]["top"]
@@ -295,7 +314,9 @@
         direction = 1 if (self.horizontal_ltr if upright else self.vertical_ttb) else -1
 
         word = {
-            "text": "".join(map(itemgetter("text"), ordered_chars)),
+            "text": "".join(
+                self.expansions.get(c["text"], c["text"]) for c in ordered_chars
+            ),
             "x0": x0,
             "x1": x1,
             "top": top,
diff --git a/tests/pdfs/issue-598-example.pdf b/tests/pdfs/issue-598-example.pdf
new file mode 100644
index 00000000..3762c961
Binary files /dev/null and b/tests/pdfs/issue-598-example.pdf differ
diff --git a/tests/test_issues.py b/tests/test_issues.py
index c6719635..cfbaedda 100644
--- a/tests/test_issues.py
+++ b/tests/test_issues.py
@@ -224,6 +224,24 @@ def test_issue_463(self):
         annots = pdf.annots
         annots[0]["contents"] == "日本語"
 
+    def test_issue_598(self):
+        """
+        Ligatures should be translated by default.
+        """
+        path = os.path.join(HERE, "pdfs/issue-598-example.pdf")
+        with pdfplumber.open(path) as pdf:
+            page = pdf.pages[0]
+            a = page.extract_text()
+            assert "fiction" in a
+            assert "ﬁction" not in a
+
+            b = page.extract_text(expand_ligatures=False)
+            assert "ﬁction" in b
+            assert "fiction" not in b
+
+            assert page.extract_words()[53]["text"] == "fiction"
+            assert page.extract_words(expand_ligatures=False)[53]["text"] == "ﬁction"
+
     def test_issue_683(self):
         """
         Page.search ValueError: min() arg is an empty sequence
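The mechanism this diff adds is just a per-character dictionary lookup: when a char's `text` is a single-code-point ligature, it is replaced by its constituent letters. A minimal standalone sketch of that lookup (the `LIGATURES` table and the `expansions.get(...)` idiom come from the diff; `merge_text` is a hypothetical, simplified stand-in for `WordExtractor.merge_chars`):

```python
# Maps ligature code points (e.g., U+FB01 LATIN SMALL LIGATURE FI) to their
# constituent ASCII letters, mirroring the LIGATURES table added in the diff.
LIGATURES = {
    "ﬀ": "ff",
    "ﬃ": "ffi",
    "ﬄ": "ffl",
    "ﬁ": "fi",
    "ﬂ": "fl",
    "ﬆ": "st",
    "ſt": "st",
}


def merge_text(chars, expand_ligatures=True):
    """Join a sequence of char dicts into one word string, optionally
    expanding ligatures, as WordExtractor.merge_chars does in the diff."""
    expansions = LIGATURES if expand_ligatures else {}
    # Chars not in the table fall through unchanged via dict.get's default.
    return "".join(expansions.get(c["text"], c["text"]) for c in chars)


chars = [{"text": t} for t in ("ﬁ", "c", "t", "i", "o", "n")]
print(merge_text(chars))                          # -> fiction
print(merge_text(chars, expand_ligatures=False))  # -> ﬁction
```

Using an empty `{}` when `expand_ligatures=False` keeps the per-character path branch-free: the lookup simply never matches.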
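In `to_textmap`, expansion must also preserve the mapping from each output letter back to its source char object, which is why the diff appends one `(letter, char)` tuple per expanded letter and now bumps `line_len` once per letter rather than once per word. A sketch of that bookkeeping, using a hypothetical `expand_chars` helper and an abbreviated ligature table:

```python
LIGATURES = {"ﬁ": "fi", "ﬂ": "fl"}  # abbreviated for the sketch


def expand_chars(chars, expand_ligatures=True):
    """Flatten char dicts into (letter, source_char) pairs, so a single
    ligature char may yield several letters (as in the to_textmap change)."""
    expansions = LIGATURES if expand_ligatures else {}
    textmap = []
    for c in chars:
        for letter in expansions.get(c["text"], c["text"]):
            # Every expanded letter keeps a reference to its original char.
            textmap.append((letter, c))
    return textmap


pairs = expand_chars([{"text": "ﬂ"}, {"text": "y"}])
print("".join(letter for letter, _ in pairs))  # -> fly
# Both "f" and "l" point back at the same source ligature char:
print(pairs[0][1] is pairs[1][1])  # -> True
```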
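The README row above defines word segmentation in terms of tolerances: for "upright" characters, a new word starts when the gap between one char's `x1` and the next char's `x0` exceeds `x_tolerance`. A deliberately simplified sketch of just that horizontal rule (hypothetical helper; the real `WordExtractor` also handles `y_tolerance`, non-upright text, reading direction, and punctuation splitting):

```python
def group_words(chars, x_tolerance=3):
    """Group chars into words: a gap between consecutive chars larger than
    x_tolerance starts a new word (simplified from the README's rule)."""
    words, current = [], []
    for c in chars:
        if current and (c["x0"] - current[-1]["x1"]) > x_tolerance:
            words.append("".join(ch["text"] for ch in current))
            current = []
        current.append(c)
    if current:
        words.append("".join(ch["text"] for ch in current))
    return words


chars = [
    {"text": "h", "x0": 0, "x1": 5},
    {"text": "i", "x0": 6, "x1": 9},    # gap of 1 <= 3: same word
    {"text": "y", "x0": 20, "x1": 25},  # gap of 11 > 3: new word
    {"text": "o", "x0": 26, "x1": 30},
]
print(group_words(chars))  # -> ['hi', 'yo']
```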