Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The import parser has false positives and false negatives #79

Open
pygy opened this issue Apr 24, 2020 · 6 comments
Open

The import parser has false positives and false negatives #79

pygy opened this issue Apr 24, 2020 · 6 comments
Labels

Comments

@pygy
Copy link
Contributor

pygy commented Apr 24, 2020

As mentioned earlier, here are a few bugs I found:

import {
  //**/
  foo
} from "./foo.js"

Gives

Error resolving module specifier: ./foo.js ... at .js:4:7

Live here

If you point me to the parser (I suppose it is RegExp-based) I can upgrade it.

@porsager
Copy link
Owner

Ah, I see..

Yeah, they're here:

const isModuleRegex = /(^\s*|[});\n]\s*)(import\s*\(?(['"]|(\*[\s\S]+as[\s\S]+)?(?!type)([^"'()\n;]+)[\s\S]*from[\s\S]*['"]|\{)|export\s\s*(['"]|(\*[\s\S]+as[\s\S]+)?(?!type)([^"'()\n;]+)[\s\S]*from[\s\S]*['"]|\{|default|function|class|var|const|let|async[\s\S]+function|async[\s\S]+\())/
, topoSortRegex = new RegExp('import\\s*[{}$\\w*,\\s]*\\s*(?: from |)[\'"]\\.?\\/(.*\\.?[a-z]*)[\'"]', 'g')
, staticImportRegex = new RegExp('(import\\s*[{}$\\w*,\\s]*\\s*(?: from |)[\'"])([\\w@][\\w@/.-]*)([\'"])', 'g')
, dynamicImportRegex = new RegExp('(import\\([\'"])([\\w@][\\w@/.-]*)([\'"]\\))', 'g')

@pygy
Copy link
Contributor Author

pygy commented Apr 24, 2020

Ok, I'm on it... Why are you using the RegExp constructor rather than literals?

@pygy
Copy link
Contributor Author

pygy commented Apr 25, 2020

Well, if I follow the JS Spec with a hack for identifiers, I end up with this for just detecting imports:
/import(?:(?:\s|\/\/[^\n]*\n|\/\*[^]*?\*\/)*(?:"(?:\\[^]|.)*"|'(?:\\[^]|.)*'|`(?:\\[^]|[^])*`)|(?:(?:(?:\s|\/\/[^\n]*\n|\/\*[^]*?\*\/)*)+[^,{}\/\s]+(?:(?:\s|\/\/[^\n]*\n|\/\*[^]*?\*\/)*,(?:\s|\/\/[^\n]*\n|\/\*[^]*?\*\/)*(?:\*(?:\s|\/\/[^\n]*\n|\/\*[^]*?\*\/)*as(?:(?:\s|\/\/[^\n]*\n|\/\*[^]*?\*\/)*)+[^,{}\/\s]+(?:(?:\s|\/\/[^\n]*\n|\/\*[^]*?\*\/)*)+|(?:\{(?:\s|\/\/[^\n]*\n|\/\*[^]*?\*\/)*\}|\{(?:\s|\/\/[^\n]*\n|\/\*[^]*?\*\/)*(?:(?:[^,{}\/\s]+(?:(?:\s|\/\/[^\n]*\n|\/\*[^]*?\*\/)*)+as(?:(?:\s|\/\/[^\n]*\n|\/\*[^]*?\*\/)*)+[^,{}\/\s]+|[^,{}\/\s]+)(?:\s|\/\/[^\n]*\n|\/\*[^]*?\*\/)*,(?:\s|\/\/[^\n]*\n|\/\*[^]*?\*\/)*)*(?:[^,{}\/\s]+(?:(?:\s|\/\/[^\n]*\n|\/\*[^]*?\*\/)*)+as(?:(?:\s|\/\/[^\n]*\n|\/\*[^]*?\*\/)*)+[^,{}\/\s]+|[^,{}\/\s]+)(?:\s|\/\/[^\n]*\n|\/\*[^]*?\*\/)*(?:,(?:\s|\/\/[^\n]*\n|\/\*[^]*?\*\/)*)?\})(?:\s|\/\/[^\n]*\n|\/\*[^]*?\*\/)*)|(?:(?:\s|\/\/[^\n]*\n|\/\*[^]*?\*\/)*)+)|\*(?:\s|\/\/[^\n]*\n|\/\*[^]*?\*\/)*as(?:(?:\s|\/\/[^\n]*\n|\/\*[^]*?\*\/)*)+[^,{}\/\s]+(?:(?:\s|\/\/[^\n]*\n|\/\*[^]*?\*\/)*)+|(?:\{(?:\s|\/\/[^\n]*\n|\/\*[^]*?\*\/)*\}|\{(?:\s|\/\/[^\n]*\n|\/\*[^]*?\*\/)*(?:(?:[^,{}\/\s]+(?:(?:\s|\/\/[^\n]*\n|\/\*[^]*?\*\/)*)+as(?:(?:\s|\/\/[^\n]*\n|\/\*[^]*?\*\/)*)+[^,{}\/\s]+|[^,{}\/\s]+)(?:\s|\/\/[^\n]*\n|\/\*[^]*?\*\/)*,(?:\s|\/\/[^\n]*\n|\/\*[^]*?\*\/)*)*(?:[^,{}\/\s]+(?:(?:\s|\/\/[^\n]*\n|\/\*[^]*?\*\/)*)+as(?:(?:\s|\/\/[^\n]*\n|\/\*[^]*?\*\/)*)+[^,{}\/\s]+|[^,{}\/\s]+)(?:\s|\/\/[^\n]*\n|\/\*[^]*?\*\/)*(?:,(?:\s|\/\/[^\n]*\n|\/\*[^]*?\*\/)*)?\})(?:\s|\/\/[^\n]*\n|\/\*[^]*?\*\/)*)from(?:\s|\/\/[^\n]*\n|\/\*[^]*?\*\/)*(?:"(?:\\[^]|.)*"|'(?:\\[^]|.)*'|`(?:\\[^]|[^])*`))/

... which is probably a bit much... Also, I'd need to triple check and make sure it doesn't backtrack pathologically (it probably does). There's probably a smarter way to do it...

@porsager porsager added the bug label Mar 29, 2022
@pygy pygy changed the title Comments thwart the import parser The import parser has false positives and false negatives Feb 4, 2024
@pygy
Copy link
Contributor Author

pygy commented Feb 4, 2024

So, I'm about to implement something similar for the online version of bunchmark, and while working on an improved version of the above, I found a case where Flems modifies code it shouldn't and adds https://unpkg.com/...?module where it shouldn't.

Here is a reduced example.

My WIP parser had variables with import in the name, and the string 'from' which triggered the bug, I've replaced the problematic names with umport for now.

The new parser should be ReDOS resistant, but I still have to test it.

I end up with these ginormous RegExps:

const searcher = /\bimport(?!\p{ID_Continue}|\\u(?:[0-9A-Fa-f]{4}|\{[0-9A-Fa-f]{1,6}\}))(?:\s|\/\/.*[\n\r]|\/\*(?=((?:(?!\*\/)[^])*))\1\*\/)*(?:"(?:\\.|[^"])*"|'(?:\\.|[^'])*'|(?:\{(?:\s|\/\/.*[\n\r]|\/\*(?=((?:(?!\*\/)[^])*))\2\*\/)*(?:(?:(?=((?:\p{ID_Start}|\\u(?:[0-9A-Fa-f]{4}|\{[0-9A-Fa-f]{1,6}\}))(?:\p{ID_Continue}|\\u(?:[0-9A-Fa-f]{4}|\{[0-9A-Fa-f]{1,6}\}))*))\3(?:\s|\/\/.*[\n\r]|\/\*(?=((?:(?!\*\/)[^])*))\4\*\/)*as(?!\p{ID_Continue}|\\u(?:[0-9A-Fa-f]{4}|\{[0-9A-Fa-f]{1,6}\}))(?:\s|\/\/.*[\n\r]|\/\*(?=((?:(?!\*\/)[^])*))\5\*\/)*(?=((?:\p{ID_Start}|\\u(?:[0-9A-Fa-f]{4}|\{[0-9A-Fa-f]{1,6}\}))(?:\p{ID_Continue}|\\u(?:[0-9A-Fa-f]{4}|\{[0-9A-Fa-f]{1,6}\}))*))\6|(?=((?:\p{ID_Start}|\\u(?:[0-9A-Fa-f]{4}|\{[0-9A-Fa-f]{1,6}\}))(?:\p{ID_Continue}|\\u(?:[0-9A-Fa-f]{4}|\{[0-9A-Fa-f]{1,6}\}))*))\7)(?:(?:\s|\/\/.*[\n\r]|\/\*(?=((?:(?!\*\/)[^])*))\8\*\/)*,(?:\s|\/\/.*[\n\r]|\/\*(?=((?:(?!\*\/)[^])*))\9\*\/)*(?:(?=((?:\p{ID_Start}|\\u(?:[0-9A-Fa-f]{4}|\{[0-9A-Fa-f]{1,6}\}))(?:\p{ID_Continue}|\\u(?:[0-9A-Fa-f]{4}|\{[0-9A-Fa-f]{1,6}\}))*))\10(?:\s|\/\/.*[\n\r]|\/\*(?=((?:(?!\*\/)[^])*))\11\*\/)*as(?!\p{ID_Continue}|\\u(?:[0-9A-Fa-f]{4}|\{[0-9A-Fa-f]{1,6}\}))(?:\s|\/\/.*[\n\r]|\/\*(?=((?:(?!\*\/)[^])*))\12\*\/)*(?=((?:\p{ID_Start}|\\u(?:[0-9A-Fa-f]{4}|\{[0-9A-Fa-f]{1,6}\}))(?:\p{ID_Continue}|\\u(?:[0-9A-Fa-f]{4}|\{[0-9A-Fa-f]{1,6}\}))*))\13|(?=((?:\p{ID_Start}|\\u(?:[0-9A-Fa-f]{4}|\{[0-9A-Fa-f]{1,6}\}))(?:\p{ID_Continue}|\\u(?:[0-9A-Fa-f]{4}|\{[0-9A-Fa-f]{1,6}\}))*))\14))*)?(?:(?:\s|\/\/.*[\n\r]|\/\*(?=((?:(?!\*\/)[^])*))\15\*\/)*,)?(?:\s|\/\/.*[\n\r]|\/\*(?=((?:(?!\*\/)[^])*))\16\*\/)*\}|\*(?:\s|\/\/.*[\n\r]|\/\*(?=((?:(?!\*\/)[^])*))\17\*\/)*as(?!\p{ID_Continue}|\\u(?:[0-9A-Fa-f]{4}|\{[0-9A-Fa-f]{1,6}\}))(?:\s|\/\/.*[\n\r]|\/\*(?=((?:(?!\*\/)[^])*))\18\*\/)*(?=((?:\p{ID_Start}|\\u(?:[0-9A-Fa-f]{4}|\{[0-9A-Fa-f]{1,6}\}))(?:\p{ID_Continue}|\\u(?:[0-9A-Fa-f]{4}|\{[0-9A-Fa-f]{1,6}\}))*))\19|(?=((?:\p{ID_Start}|\\u(?:[0-9A-Fa-f]{4}|\{[0-9A-Fa-f]{1,6}\}))(?:\p{ID_Continue}|\\u(?:[0-9A-Fa-f]{4}|\{[0-9A-Fa-f]{1,6}\}))*))\20(?:(?:\s|\/\/.*[\n\r]|\/\*(?=((?:(?!\*\/)[^])*))\21\*\/)*,(?:\s|\/\/.*[\n\r]|\/\*(?=((?:(?!\*\/)[^])*))\22\*\/)*(?:\{(?:\s|\/\/.*[\n\r]|\/\*(?=((?:(?!\*\/)[^])*))\23\*\/)*(?:(?:(?=((?:\p{ID_Start}|\\u(?:[0-9A-Fa-f]{4}|\{[0-9A-Fa-f]{1,6}\}))(?:\p{ID_Continue}|\\u(?:[0-9A-Fa-f]{4}|\{[0-9A-Fa-f]{1,6}\}))*))\24(?:\s|\/\/.*[\n\r]|\/\*(?=((?:(?!\*\/)[^])*))\25\*\/)*as(?!\p{ID_Continue}|\\u(?:[0-9A-Fa-f]{4}|\{[0-9A-Fa-f]{1,6}\}))(?:\s|\/\/.*[\n\r]|\/\*(?=((?:(?!\*\/)[^])*))\26\*\/)*(?=((?:\p{ID_Start}|\\u(?:[0-9A-Fa-f]{4}|\{[0-9A-Fa-f]{1,6}\}))(?:\p{ID_Continue}|\\u(?:[0-9A-Fa-f]{4}|\{[0-9A-Fa-f]{1,6}\}))*))\27|(?=((?:\p{ID_Start}|\\u(?:[0-9A-Fa-f]{4}|\{[0-9A-Fa-f]{1,6}\}))(?:\p{ID_Continue}|\\u(?:[0-9A-Fa-f]{4}|\{[0-9A-Fa-f]{1,6}\}))*))\28)(?:(?:\s|\/\/.*[\n\r]|\/\*(?=((?:(?!\*\/)[^])*))\29\*\/)*,(?:\s|\/\/.*[\n\r]|\/\*(?=((?:(?!\*\/)[^])*))\30\*\/)*(?:(?=((?:\p{ID_Start}|\\u(?:[0-9A-Fa-f]{4}|\{[0-9A-Fa-f]{1,6}\}))(?:\p{ID_Continue}|\\u(?:[0-9A-Fa-f]{4}|\{[0-9A-Fa-f]{1,6}\}))*))\31(?:\s|\/\/.*[\n\r]|\/\*(?=((?:(?!\*\/)[^])*))\32\*\/)*as(?!\p{ID_Continue}|\\u(?:[0-9A-Fa-f]{4}|\{[0-9A-Fa-f]{1,6}\}))(?:\s|\/\/.*[\n\r]|\/\*(?=((?:(?!\*\/)[^])*))\33\*\/)*(?=((?:\p{ID_Start}|\\u(?:[0-9A-Fa-f]{4}|\{[0-9A-Fa-f]{1,6}\}))(?:\p{ID_Continue}|\\u(?:[0-9A-Fa-f]{4}|\{[0-9A-Fa-f]{1,6}\}))*))\34|(?=((?:\p{ID_Start}|\\u(?:[0-9A-Fa-f]{4}|\{[0-9A-Fa-f]{1,6}\}))(?:\p{ID_Continue}|\\u(?:[0-9A-Fa-f]{4}|\{[0-9A-Fa-f]{1,6}\}))*))\35))*)?(?:(?:\s|\/\/.*[\n\r]|\/\*(?=((?:(?!\*\/)[^])*))\36\*\/)*,)?(?:\s|\/\/.*[\n\r]|\/\*(?=((?:(?!\*\/)[^])*))\37\*\/)*\}|\*(?:\s|\/\/.*[\n\r]|\/\*(?=((?:(?!\*\/)[^])*))\38\*\/)*as(?!\p{ID_Continue}|\\u(?:[0-9A-Fa-f]{4}|\{[0-9A-Fa-f]{1,6}\}))(?:\s|\/\/.*[\n\r]|\/\*(?=((?:(?!\*\/)[^])*))\39\*\/)*(?=((?:\p{ID_Start}|\\u(?:[0-9A-Fa-f]{4}|\{[0-9A-Fa-f]{1,6}\}))(?:\p{ID_Continue}|\\u(?:[0-9A-Fa-f]{4}|\{[0-9A-Fa-f]{1,6}\}))*))\40))?)(?:\s|\/\/.*[\n\r]|\/\*(?=((?:(?!\*\/)[^])*))\41\*\/)*from(?:\s|\/\/.*[\n\r]|\/\*(?=((?:(?!\*\/)[^])*))\42\*\/)*(?<moduleSpecifier>"(?:\\.|[^"])*"|'(?:\\.|[^'])*'))|\/\/.*[\n\r]|\/\*(?=((?:(?!\*\/)[^])*))\44\*\/|"(?:\\.|[^"])*"|'(?:\\.|[^'])*'|`(?=((?:\\[^]|(?!`|\$\{)[^])*))\45`|`(?=((?:\\[^]|(?!`|\$\{)[^])*))\46\$\{(?<openTemplate>)/dgu

... and then this const searcher2 = /\/\/.*[\n\r]|\/\*(?=((?:(?!\*\/)[^])*))\1\*\/|"(?:\\.|[^"])*"|'(?:\\.|[^'])*'|`(?=((?:\\[^]|(?!`|\$\{)[^])*))\2`|`(?=((?:\\[^]|(?!`|\$\{)[^])*))\3\$\{(?<openTemplate>)|\}(?=((?:\\[^]|(?!`|\$\{)[^])*))\5\$\{|\}(?=((?:\\[^]|(?!`|\$\{)[^])*))\6`(?<closeTemplate>)/dgu for skipping template literals with embedded expressions.

Edit: adding support for dynamic imports as well.

@pygy
Copy link
Contributor Author

pygy commented Feb 4, 2024

The package is live as [email protected].

https://github.com/pygy/unpkgify (it has tests and seems quite robust).

@pygy
Copy link
Contributor Author

pygy commented Feb 4, 2024

My approach still has regexp-related corner cases (you need a full JS parser to tell them apart from successive divisions).

If a RegExp contains a pattern that looks like a bit of template string, or even simply a brace, it can throw my parser off. I think I'm going to use Babel for this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants