Enhance `#scan_integer` to check for valid character following it #119

gettalong · 2024-11-28T22:57:51Z

Recently the new method #scan_integer was introduced (see #113) to optimize scanning integer values.

The current implementation works regardless of what follows the integer, i.e. scanning 123, 123 something, 123,something, 123.32 and 123something all work and would return 123.

However, in - I suspect - many cases an integer may only be a valid integer if it is (not) followed by certain characters. One example is the input 123d which leads to an error when interpreted as Ruby code.

My use case is PDF syntax. There a token is an integer only when it is followed by a whitespace (ASCII decimal 0, 9, 10, 12, 13 and 32) or a delimiter (( ) < > [ ] / %) character (otherwise it is a generic token). To handle this the implementation using #scan_integer looks like this:

    # Parses the number (integer or real) at the current position.
    #
    # See: PDF2.0 s7.3.3
    def parse_number
      prepare_string_scanner(20)
      pos = self.pos
      if (tmp = @ss.scan_integer)
        if @ss.eos? || @ss.match?(WHITESPACE_OR_DELIMITER_RE)
          # Handle object references, see PDF2.0 s7.3.10
          prepare_string_scanner(10)
          if @ss.scan(REFERENCE_RE)
            tmp = if tmp > 0
                    Reference.new(tmp, @ss[1].to_i)
                  else
                    maybe_raise("Invalid indirect object reference (#{tmp},#{@ss[1].to_i})")
                    nil
                  end
          end
          return tmp
        else
          self.pos = pos
        end
      end

      val = scan_until(WHITESPACE_OR_DELIMITER_RE) || @ss.scan(/.*/)
      if val.match?(/\A[+-]?(?:\d+\.\d*|\.\d+)\z/)
        val << '0' if val.getbyte(-1) == 46 # dot '.'
        Float(val)
      else
        TOKEN_CACHE[val] # val is keyword
      end
    end
  end

As you can see we

need to store the current scan position,
check if scanning an integer works at the current position,
scan the content after the integer to verify that it is indeed an integer and work with it, or
if the previous step didn't work, reset the scan position.

This could be simplified to just a call of #scan_integer if this method would optionally check the contents after it. Something like #scan_integer(separator: SEPARATOR_PATTERN) or maybe #scan_integer(separator_chars: STRING) (where STRING contains separator characters, similar to whole String#tr works).

Would it make sense to include such functionality?

The text was updated successfully, but these errors were encountered:

kou · 2024-12-12T02:16:14Z

I understand the motivation but I feel that we can find better API for this...

gettalong · 2024-12-12T22:26:16Z

I'm not sure about the argument name but I think it does make sense to extend the functionality of #scan_integer and not introduce another method like #scan_delimited_integer.

kou · 2024-12-14T02:06:39Z

Can we process a generic token and then an integer?

token = scanner.scan(/[\da-zA-Z]?[a-zA-Z]/)
return token if token
integer = scanner.scan_integer
return integer if integer

gettalong · 2024-12-15T11:59:21Z

It is the other way around: The thing at pos might be a number (e.g. 1234) but if that number is directly followed eg. by a letter (e.g. 1234a) it is a token/keyword. So the next step in your example would be to scan for a generic token/keyword again (that's the last else path in my original example).

kou · 2024-12-16T07:09:41Z

Can we reconstruct the processing order? I feel that parse_number processes a token too.

loop do
  token = scanner.scan(/[\da-zA-Z]?[a-zA-Z]/)
  if token
    process_token(token)
    next
  end
  number = scanner.scan_integer
  if number
    process_number
    next
  end

gettalong · 2024-12-16T08:12:52Z

For HexaPDF, tokenization starts at #next_token (https://github.com/gettalong/hexapdf/blob/master/lib/hexapdf/tokenizer.rb#L118). There whitespace at point is ignored/parsed and then the sub-processing routine is chosen based on the byte at point. For the bytes 0-9 - + . the #parse_number is invoked and its result returned. So if it turns out that the token at point is not a number, the last step in #parse_number is to return a generic token (i.e. a PDF keyword).

So yes, #parse_number might also return a generic token (keyword). It is not intended as a standalone routine.

With the currently implemented #scan_integer method I can already omit line https://github.com/gettalong/hexapdf/blob/master/lib/hexapdf/tokenizer.rb#L286 and try to parse the token at point as integer. The benefit of #scan_integer in this case is to avoid creating a string when not necessary, i.e. the second line in your last example.

gettalong mentioned this issue Nov 28, 2024

Provide an efficient way to parse Integers (and Floats)? #113

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enhance `#scan_integer` to check for valid character following it #119

Enhance `#scan_integer` to check for valid character following it #119

gettalong commented Nov 28, 2024

kou commented Dec 12, 2024

gettalong commented Dec 12, 2024

kou commented Dec 14, 2024

gettalong commented Dec 15, 2024

kou commented Dec 16, 2024

gettalong commented Dec 16, 2024

Enhance #scan_integer to check for valid character following it #119

Enhance #scan_integer to check for valid character following it #119

Comments

gettalong commented Nov 28, 2024

kou commented Dec 12, 2024

gettalong commented Dec 12, 2024

kou commented Dec 14, 2024

gettalong commented Dec 15, 2024

kou commented Dec 16, 2024

gettalong commented Dec 16, 2024

Enhance `#scan_integer` to check for valid character following it #119

Enhance `#scan_integer` to check for valid character following it #119