Pattern matches does not work for google search results #74

Open
brandonbrown5 opened this issue May 16, 2023 · 1 comment

Comments

@brandonbrown5

A Google search result URL raises an Invalid URI error. It appears that the regular expression used by the parser does not recognize it as a valid URL, even though a browser can navigate to it.

URL: https://www.google.com/search?q=capt.%20jacks%20family%20buffet&rlz=1C2CHBF_enUS902US902&sxsrf=APwXEdehG3ObQHEcqZT0clDT-XUDJ2iaXg:1681756568453&source=hp&ei=jpE9ZIioNOvGkPIPzP2ayAE&iflsig=AOEireoAAAAAZD2fnm-EI4rFn06RvhHNRndJIcwCmIRY&oq=capt.+jack&gs_lcp=Cgdnd3Mtd2l6EAEYADIFCAAQgAQyCgguEIAEENQCEAoyBwgAEIAEEAoyCwguEIAEEMcBEK8BMgUIABCABDIFCAAQgAQyBQgAEIAEMgcIABCABBAKMgoILhCABBDUAhAKMggIABCKBRCGAzoHCCMQ6gIQJzoECCMQJzoICAAQigUQkQI6CAgAEIAEELEDOhEILhCABBCxAxCDARDHARDRAzoOCC4QgAQQsQMQxwEQ0QM6DgguEIoFEMcBENEDEJECOg4ILhCABBDJAxDHARCvAToFCC4QgAQ6DgguEIoFEMcBEK8BEJECOgsIABCKBRCxAxCRAjoOCC4QgAQQsQMQgwEQ1AI6CwguEK8BEMcBEIAEOg0ILhCABBDHARCvARAKOgcILhCABBAKOggILhCABBDUAlC3DliGK2DeN2gBcAB4AIABnwGIAaIJkgEDMi44mAEAoAEBsAEK&sclient=gws-wiz&tbs=lf:1,lf_ui:4&tbm=lcl&rflfq=1&num=10&rldimm=425615111808136386&lqi=ChljYXB0LiBqYWNrcyBmYW1pbHkgYnVmZmV0IgOIAQFI2Oj5sryugIAIWioQABABEAIQAxgAGAEYAhgDIhhjYXB0IGphY2tzIGZhbWlseSBidWZmZXSSARFidWZmZXRfcmVzdGF1cmFudJoBI0NoWkRTVWhOTUc5blMwVkpRMEZuU1VOUGNUUTJXa0pSRUFFqgEjEAEyHxABIhtkYOxvAEEUmUj2WhSyHC6JH-F_P7crMXEaKS_gAQA&ved=2ahUKEwjS1_G2x7H-AhUxtTEKHclvB_cQvS56BAgWEAE&sa=X&rlst=f#rlfi=hd:;si:425615111808136386,l,ChljYXB0LiBqYWNrcyBmYW1pbHkgYnVmZmV0IgOIAQFI2Oj5sryugIAIWioQABABEAIQAxgAGAEYAhgDIhhjYXB0IGphY2tzIGZhbWlseSBidWZmZXSSARFidWZmZXRfcmVzdGF1cmFudJoBI0NoWkRTVWhOTUc5blMwVkpRMEZuU1VOUGNUUTJXa0pSRUFFqgEjEAEyHxABIhtkYOxvAEEUmUj2WhSyHC6JH-F_P7crMXEaKS_gAQA;mv:[[30.1955067,-85.7794086],[30.161907099999993,-85.8386264]];tbs:lrf:!1m4!1u3!2m2!3m1!1e1!2m1!1e3!3sIAE,lf:1,lf_ui:4

URI.parse(url) raises the following error: lib/uri/rfc3986_parser.rb:66:in `split'. I believe this is caused by the regular expression not matching this URL.
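A minimal reproduction sketch, assuming a shortened, hypothetical URL on example.com that keeps the same shape as the reported one (unescaped '[' and ']' after the '#'). The helper name parse_outcome is made up for illustration. With the RFC 3986 parser named in the error above, the expected outcome is :invalid, but the script only reports what it observes, since the result depends on the installed version of the uri library.

```ruby
require "uri"

# Hypothetical shortened URL: same shape as the reported one, with "[" and "]"
# left unescaped in the fragment (the part after "#").
URL_WITH_BRACKETS = "https://www.example.com/search?q=test#mv:[[30.1,-85.7],[30.2,-85.8]]"

# Returns [:ok, uri] when parsing succeeds and [:invalid, message] when
# URI.parse raises URI::InvalidURIError.
def parse_outcome(url)
  [:ok, URI.parse(url)]
rescue URI::InvalidURIError => e
  [:invalid, e.message]
end

status, detail = parse_outcome(URL_WITH_BRACKETS)
puts "#{status}: #{detail}"
```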

@duerst
Member

duerst commented May 17, 2023

The regular expression does not match the URI because the relevant grammar (in RFC 3986) does not allow '[' or ']' in the fragment part (the part after the '#'). See https://www.rfc-editor.org/rfc/rfc3986#appendix-A, and look for 'fragment' and 'gen-delims'. The '[' and ']' characters are in gen-delims, and the only gen-delims characters that fragment allows are ':', '@', '/' and '?'. As the filename where the error message originates makes clear, it's a parser for RFC 3986 URIs, so it had better follow that spec. That means that we can close this issue, because the Regexp matches the spec.
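To see which characters trip the grammar, here is a rough sketch that checks a fragment against the 'fragment' rule from RFC 3986 Appendix A (unreserved, sub-delims, ':', '@', '/', '?', plus percent-encoded triplets). The constant and method names are hypothetical and the character class is transcribed by hand from the RFC. It is not the regular expression used in lib/uri/rfc3986_parser.rb.

```ruby
# Characters RFC 3986 (Appendix A) allows unescaped in a fragment:
# unreserved / sub-delims / ":" / "@" / "/" / "?".
FRAGMENT_ALLOWED = %r{[A-Za-z0-9\-._~!$&'()*+,;=:@/?]}

# A percent-encoded triplet such as "%5B" is also allowed.
PCT_ENCODED = /%\h\h/

# Returns the distinct characters in the fragment of +url+ that the RFC 3986
# grammar does not allow, in order of first appearance.
def disallowed_fragment_chars(url)
  fragment = url.split("#", 2)[1]
  return [] if fragment.nil?

  fragment.gsub(PCT_ENCODED, "").chars.reject { |c| c.match?(FRAGMENT_ALLOWED) }.uniq
end

# Hypothetical shortened URL with the same kind of fragment as the report.
p disallowed_fragment_chars("https://www.example.com/search?q=test#mv:[[30.1,-85.7],[30.2,-85.8]]")
```

For the sample URL only the two bracket characters are reported. The ':' and ',' in the same fragment pass the check.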

The grammar in RFC 2396 (https://www.rfc-editor.org/rfc/rfc2396) is more lenient, and is available in lib/uri/rfc2396_parser.rb, so you may want to try it.
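A short sketch of that suggestion, assuming the same kind of hypothetical shortened URL as in the report (example.com is a stand-in). URI::RFC2396_Parser is the class defined in lib/uri/rfc2396_parser.rb. Whether its leniency is acceptable for a given application is a separate question.

```ruby
require "uri"

# Hypothetical shortened URL with unescaped "[" and "]" in the fragment.
url = "https://www.example.com/search?q=test#mv:[[30.1,-85.7],[30.2,-85.8]]"

# Parse with the older, more lenient RFC 2396 grammar instead of URI.parse.
parser = URI::RFC2396_Parser.new
uri = parser.parse(url)

puts uri.host
puts uri.query
puts uri.fragment
```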

[In Thunderbird, where I saw your message first, the URI is colored up to just before the first ':' in the fragment, and when I click on it, only the part before that ':' is sent to the browser, but both RFC 3986 and RFC 2396 allow ':' in fragments, so this behavior is difficult to explain.]
