Parsing of escaped characters in quoted strings and backslashes in unquoted strings #335

Open
OndrejSpanel opened this issue Jul 31, 2024 · 9 comments
@OndrejSpanel

In the following code, the escape sequences are not interpreted. The first value is parsed as a double backslash followed by b and the second as a backslash followed by b, instead of a backslash followed by b and a backspace:

import org.virtuslab.yaml.*

val yaml = """
regexBoundary: "\\b"
backspace: "\b"
regexBoundaryUnquoted: \b
"""

case class Example(
    regexBoundary: String,
    backspace: String,
    regexBoundaryUnquoted: String,
) derives YamlCodec

val example = yaml.as[Example].toOption.get

println(example.regexBoundary)
println(example.backspace)
println(example.regexBoundaryUnquoted)

See also https://scastie.scala-lang.org/OndrejSpanel/jfavH3u2Sq29Gs5nQHmXKQ/92

The output is:

\\b
\b
\\b

None of this is correct. There should be no double backslashes and there should be a backspace, not \b in the second line.

Note: even the unquoted string is parsed wrong. The single backslash is converted to a double backslash in the case class value.
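For reference, the expected values can be written out as ordinary Scala string literals. This is only a sketch of what the three fields should contain according to the YAML escaping rules; it does not call scala-yaml, and the object name `ExpectedValues` and the helper `codePoints` are made up for illustration.

```scala
object ExpectedValues {
  // YAML "\\b" (double-quoted): the escaped backslash yields one backslash, then a literal b.
  val regexBoundary: String = "\\b"

  // YAML "\b" (double-quoted): the escape yields the backspace character U+0008.
  val backspace: String = "\b"

  // YAML \b (plain scalar): no escape processing, so a backslash followed by b.
  val regexBoundaryUnquoted: String = "\\b"

  // Render a string as code points so that control characters become visible.
  def codePoints(s: String): String =
    s.map(c => f"U+${c.toInt}%04X").mkString(" ")
}
```

With these definitions, `ExpectedValues.codePoints(ExpectedValues.regexBoundary)` gives `U+005C U+0062` and `ExpectedValues.codePoints(ExpectedValues.backspace)` gives `U+0008`.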

@lbialy lbialy self-assigned this Jul 31, 2024
@lbialy lbialy added the bug Something isn't working label Jul 31, 2024
@lbialy
Contributor

lbialy commented Jul 31, 2024

First of all, there's some interference from triplequote here:

scala> "\\b"
val res0: String = \b

scala> """\\b"""
val res1: String = \\b

scala> val yaml = """
     | regexBoundary: "\\b"
     | backspace: "\b"
     | regexBoundaryUnquoted: \b
     | """
val yaml: String = "
regexBoundary: "\\b"
backspace: "\b"
regexBoundaryUnquoted: \b
"

This can be fixed with:

scala> val yaml = s"""
     | regexBoundary: "${"\\b"}"
     | backspace: "${"\b"}"
     | regexBoundaryUnquoted: ${"\b"}
     | """
val yaml: String = "
regexBoundary: "\b"
backspace: "
regexBoundaryUnquoted:
"

When we try to parse that we get an error:

scala> val example = yaml.as[Example]
val example: Either[org.virtuslab.yaml.YamlError, Example] = Left(org.virtuslab.yaml.ConstructError: Could't construct java.lang.String from null (tag:yaml.org,2002:null)
regexBoundaryUnquoted:
                       ^
)

This is probably a mistake, as there is a character (the \b) as the value of the regexBoundaryUnquoted mapping.
If we drop the unquoted field we get:

scala> case class Example(
     |     regexBoundary: String,
     |     backspace: String,
     | ) derives YamlCodec
// defined case class Example


scala> val example = yaml.as[Example].right.get
val example: Example = Example(\b)

scala> example.regexBoundary
val res0: String = \b

scala> example.backspace
val res1: String =

scala> YamlEncoder.escapeSpecialCharacters(example.backspace)
val res2: String = \u0008

Which is what you'd expect, I guess. I think there are issues around escaping for Scalar nodes due to this code in ScalarStyle.scala:

sealed abstract class ScalarStyle(indicator: Char)
object ScalarStyle {
  case object Plain        extends ScalarStyle(' ')
  case object DoubleQuoted extends ScalarStyle('"')
  case object SingleQuoted extends ScalarStyle('\'')
  case object Folded       extends ScalarStyle('>')
  case object Literal      extends ScalarStyle('|')

  def escapeSpecialCharacter(scalar: String, scalarStyle: ScalarStyle): String =
    scalarStyle match {
      case ScalarStyle.DoubleQuoted => scalar
      case ScalarStyle.SingleQuoted => scalar
      case ScalarStyle.Literal      => scalar
      case _ =>
        scalar.flatMap { char =>
          char match {
            case '\\'  => "\\\\"
            case '\n'  => "\\n"
            case other => other.toString
          }
        }
    }
}

That code only applies a few escapes, and only to unquoted strings. They do make sense, to be honest, but I'm not sure they are 100% correct, as they were here before I started maintaining the lib.
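A standalone reduction of that function makes its effect on a single backslash easy to check. This is a sketch, not the library code: the object name `PlainEscapeDemo` and the two-case `Style` type are invented here, and only the character mapping is copied from the snippet above. Whether this function is what actually produces the doubled backslash during parsing is not confirmed.

```scala
object PlainEscapeDemo {
  // Hypothetical stand-in for ScalarStyle, reduced to two styles.
  sealed trait Style
  case object Plain        extends Style
  case object DoubleQuoted extends Style

  // Same character mapping as the quoted snippet: quoted scalars pass through,
  // plain scalars get backslashes doubled and newlines written as \n.
  def escape(scalar: String, style: Style): String =
    style match {
      case DoubleQuoted => scalar
      case Plain =>
        scalar.flatMap { char =>
          char match {
            case '\\'  => "\\\\"
            case '\n'  => "\\n"
            case other => other.toString
          }
        }
    }
}
```

For a plain scalar containing a backslash followed by b, `PlainEscapeDemo.escape` returns two backslashes followed by b, which matches the value reported for regexBoundaryUnquoted.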

@OndrejSpanel
Author

interference from triplequote

Triple quotes prevent backslashes from being used as escapes; they are treated as literal characters instead. This is expected, as triple quotes define raw string literals.

@lbialy
Contributor

lbialy commented Jul 31, 2024

yeah, but it's not a problem with yaml parser, you get what you see

@lbialy
Contributor

lbialy commented Jul 31, 2024

I think \b does get borked in the escaping of unquoted strings:

scala> val yaml = s"""
     | regexBoundary: "${"\\b"}"
     | backspace: "${"\b"}"
     | regexBoundaryUnquoted: some${"\b"}text${"\b"}
     | """
val yaml: String = "
regexBoundary: "\b"
backspace: "
regexBoundaryUnquoted: somtext
"

Notice "somtext" here: the terminal renders the backspace control character correctly, so it erases the preceding "e".

scala> YamlEncoder.escapeSpecialCharacters(yaml)
val res6: String = "
regexBoundary: "\b"
backspace: "\u0008"
regexBoundaryUnquoted: some\u0008text\u0008
"

scala> val example = yaml.as[Example].right.get
val example: Example = Example(\b,somtext)

scala> YamlEncoder.escapeSpecialCharacters(example.regexBoundaryUnquoted)
val res8: String = some\u0008text

Notice the missing \u0008 after "text" here: the trailing backspace got trimmed. I have to go through the spec on parsing to understand what the correct (or rather: spec-compliant) behavior is here.
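One possible explanation for the lost trailing character, offered as an assumption and not checked against the library source, is a call to `String.trim` somewhere in the scalar handling. On the JVM, `trim` removes every leading and trailing character up to and including U+0020, which covers the backspace U+0008. The sketch below shows that behaviour next to a hypothetical helper, `stripYamlWhitespace`, that removes only space and tab.

```scala
object TrimDemo {
  // The plain scalar value from the transcript: "some", backspace, "text", backspace.
  val raw: String = "some\btext\b"

  // java.lang.String.trim drops leading/trailing chars <= U+0020,
  // so the trailing backspace disappears.
  val trimmed: String = raw.trim

  private def isYamlSpace(c: Char): Boolean = c == ' ' || c == '\t'

  // Hypothetical alternative: strip only space and tab at both ends.
  def stripYamlWhitespace(s: String): String =
    s.dropWhile(isYamlSpace).reverse.dropWhile(isYamlSpace).reverse
}
```

Here `TrimDemo.trimmed` has 9 characters while `TrimDemo.raw` has 10, and `TrimDemo.stripYamlWhitespace(TrimDemo.raw)` leaves all 10 in place.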

@OndrejSpanel
Author

OndrejSpanel commented Jul 31, 2024

yeah, but it's not a problem with yaml parser, you get what you see

I am not sure I understand. Compare this with Circe behaviour in https://scastie.scala-lang.org/OndrejSpanel/cz1HDKc7RoaOgEnrg6YcQA/5.

Instead of triple quotes I could use an input from a file. When there is a backslash in the quoted string, it should be processed as an escape by the YAML parser. When it is in an unquoted input, it should be processed as a backslash character.

What I see instead is that the backslash is kept as a literal backslash character in a quoted string and turned into two backslashes in an unquoted string.

@OndrejSpanel
Author

OndrejSpanel commented Jul 31, 2024

Note: it is not my intention to have backspace characters present in my input. I want the \b escape sequence to be present there, which is exactly what triple quotes allow me to do; you can imagine you are reading the input from a file instead. The code using interpolation places a backspace character into the input, which is not what I am interested in, and I have no idea how such a thing should be handled by a parser.

@OndrejSpanel
Author

From the spec: https://yaml.org/spec/1.2.2/#57-escaped-characters

Note that escape sequences are only interpreted in double-quoted scalars. In all other scalar styles, the “\” character has no special meaning and non-printable characters are not available.
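The rule quoted above can be sketched as a small decoder that receives the raw text of a scalar together with its style. This is an illustration under stated assumptions, not scala-yaml code: `ScalarDecodeSketch`, `Style`, `decode` and `unescape` are hypothetical names, only the single-character escapes from section 5.7 are covered (the hexadecimal forms are left out), and line folding inside double-quoted scalars is ignored.

```scala
object ScalarDecodeSketch {
  sealed trait Style
  case object Plain        extends Style
  case object DoubleQuoted extends Style

  // Single-character escapes from section 5.7 of the YAML 1.2.2 spec (subset).
  private val simpleEscapes: Map[Char, Char] = Map(
    '0' -> '\u0000', 'a' -> '\u0007', 'b' -> '\b', 't' -> '\t', 'n' -> '\n',
    'v' -> '\u000B', 'f' -> '\f', 'r' -> '\r', 'e' -> '\u001B',
    ' ' -> ' ', '"' -> '"', '/' -> '/', '\\' -> '\\'
  )

  // `raw` is the scalar text as written in the document, without the quotes.
  def decode(raw: String, style: Style): Either[String, String] =
    style match {
      case Plain        => Right(raw) // the backslash has no special meaning here
      case DoubleQuoted => unescape(raw, 0, new StringBuilder)
    }

  @scala.annotation.tailrec
  private def unescape(raw: String, i: Int, out: StringBuilder): Either[String, String] =
    if (i >= raw.length) Right(out.toString)
    else if (raw.charAt(i) != '\\') unescape(raw, i + 1, out.append(raw.charAt(i)))
    else if (i + 1 >= raw.length) Left("dangling backslash at end of scalar")
    else
      simpleEscapes.get(raw.charAt(i + 1)) match {
        case Some(decoded) => unescape(raw, i + 2, out.append(decoded))
        case None          => Left(s"unsupported escape at index $i")
      }
}
```

Under this sketch, the double-quoted text of two backslashes and a b decodes to one backslash and a b, the double-quoted text of a backslash and a b decodes to a single backspace, and the same text as a plain scalar is returned unchanged.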

@lbialy
Contributor

lbialy commented Jul 31, 2024

Ahhh, I misunderstood your intent. Ok, I get it now.

@OndrejSpanel
Author

Another example which is related, but perhaps simpler: at the moment I cannot find a way to represent a backslash character in my input. Using \ in unquoted strings results in a double backslash. Using a double backslash in quoted strings results in a crash or strange behaviour:

Check:

import org.virtuslab.yaml.*

val yaml = """value: \"""
case class Example(value: String) derives YamlCodec
yaml.as[Example].toOption.get

Or even worse:

import org.virtuslab.yaml.*

val yaml = """
quoted: "\\"
unquoted: \
"""

case class Example(quoted: String) derives YamlCodec

yaml.as[Example].toOption.get

Which results in the strange:

Example(" unquoted: \ )
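The result above looks as if the quote following the double backslash was not recognised as the end of the scalar. That is only a guess about the cause. As a sketch of the scanning rule that avoids it, the hypothetical helper below (`QuoteScanSketch.closingQuote`, not part of scala-yaml) consumes each backslash together with the character it escapes, so an escaped backslash cannot hide the closing quote.

```scala
object QuoteScanSketch {
  // Returns the index of the quote that closes the double-quoted scalar
  // opened at index `open`, or None if the input ends first.
  def closingQuote(input: String, open: Int): Option[Int] = {
    @scala.annotation.tailrec
    def loop(i: Int): Option[Int] =
      if (i >= input.length) None
      else
        input.charAt(i) match {
          case '"'  => Some(i)
          case '\\' => loop(i + 2) // skip the backslash and the character it escapes
          case _    => loop(i + 1)
        }
    loop(open + 1)
  }
}
```

For the line containing the quoted double backslash, the opening quote is at index 8 and `closingQuote` returns `Some(11)`, so the scalar ends before the line break and the following key is scanned separately.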
