Pattern matches does not work for google search results #74

brandonbrown5 · 2023-05-16T14:34:49Z

A Google search result URL raises an Invalid URI error. It appears that the regular expression here does not recognize this as a valid URL; however, you can navigate to it in a browser.

URL: https://www.google.com/search?q=capt.%20jacks%20family%20buffet&rlz=1C2CHBF_enUS902US902&sxsrf=APwXEdehG3ObQHEcqZT0clDT-XUDJ2iaXg:1681756568453&source=hp&ei=jpE9ZIioNOvGkPIPzP2ayAE&iflsig=AOEireoAAAAAZD2fnm-EI4rFn06RvhHNRndJIcwCmIRY&oq=capt.+jack&gs_lcp=Cgdnd3Mtd2l6EAEYADIFCAAQgAQyCgguEIAEENQCEAoyBwgAEIAEEAoyCwguEIAEEMcBEK8BMgUIABCABDIFCAAQgAQyBQgAEIAEMgcIABCABBAKMgoILhCABBDUAhAKMggIABCKBRCGAzoHCCMQ6gIQJzoECCMQJzoICAAQigUQkQI6CAgAEIAEELEDOhEILhCABBCxAxCDARDHARDRAzoOCC4QgAQQsQMQxwEQ0QM6DgguEIoFEMcBENEDEJECOg4ILhCABBDJAxDHARCvAToFCC4QgAQ6DgguEIoFEMcBEK8BEJECOgsIABCKBRCxAxCRAjoOCC4QgAQQsQMQgwEQ1AI6CwguEK8BEMcBEIAEOg0ILhCABBDHARCvARAKOgcILhCABBAKOggILhCABBDUAlC3DliGK2DeN2gBcAB4AIABnwGIAaIJkgEDMi44mAEAoAEBsAEK&sclient=gws-wiz&tbs=lf:1,lf_ui:4&tbm=lcl&rflfq=1&num=10&rldimm=425615111808136386&lqi=ChljYXB0LiBqYWNrcyBmYW1pbHkgYnVmZmV0IgOIAQFI2Oj5sryugIAIWioQABABEAIQAxgAGAEYAhgDIhhjYXB0IGphY2tzIGZhbWlseSBidWZmZXSSARFidWZmZXRfcmVzdGF1cmFudJoBI0NoWkRTVWhOTUc5blMwVkpRMEZuU1VOUGNUUTJXa0pSRUFFqgEjEAEyHxABIhtkYOxvAEEUmUj2WhSyHC6JH-F_P7crMXEaKS_gAQA&ved=2ahUKEwjS1_G2x7H-AhUxtTEKHclvB_cQvS56BAgWEAE&sa=X&rlst=f#rlfi=hd:;si:425615111808136386,l,ChljYXB0LiBqYWNrcyBmYW1pbHkgYnVmZmV0IgOIAQFI2Oj5sryugIAIWioQABABEAIQAxgAGAEYAhgDIhhjYXB0IGphY2tzIGZhbWlseSBidWZmZXSSARFidWZmZXRfcmVzdGF1cmFudJoBI0NoWkRTVWhOTUc5blMwVkpRMEZuU1VOUGNUUTJXa0pSRUFFqgEjEAEyHxABIhtkYOxvAEEUmUj2WhSyHC6JH-F_P7crMXEaKS_gAQA;mv:[[30.1955067,-85.7794086],[30.161907099999993,-85.8386264]];tbs:lrf:!1m4!1u3!2m2!3m1!1e1!2m1!1e3!3sIAE,lf:1,lf_ui:4

URI.parse(url) raises the following error: lib/uri/rfc3986_parser.rb:66:in `split'. I believe this is caused by the regular expression not matching this URL.


duerst · 2023-05-17T05:44:16Z

And the reason that the regular expression does not match the URI is that the relevant grammar (in RFC 3986) does not allow '[' or ']' in the fragment part (the part after the '#'). See https://www.rfc-editor.org/rfc/rfc3986#appendix-A, and in particular look for 'fragment' and 'gen-delims'. The '[' and ']' characters are in gen-delims, but fragment does not allow gen-delims as a whole (of those, only ':', '@', '/', and '?' are permitted). As the filename where the error message originates makes clear, it's a parser for RFC 3986 URIs, so it had better follow that spec. That means that we can close this issue, because the Regexp matches the spec.
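
A minimal Ruby sketch of the point above. The URL is a shortened, hypothetical stand-in for the reported one, with brackets only in the fragment. It shows that URI.parse rejects raw '[' and ']' after the '#', and accepts them once percent-encoded. The percent-encoding step is an illustration of what the grammar permits, not a fix proposed in this thread.

```ruby
require "uri"

# Hypothetical stand-in for the reported URL: brackets appear only in the fragment.
raw = "https://www.google.com/search?q=x#mv:[[1,2]]"

# URI.parse uses the RFC 3986 parser, whose fragment rule excludes '[' and ']'.
rejected =
  begin
    URI.parse(raw)
    false
  rescue URI::InvalidURIError
    true
  end

# Percent-encode only the brackets in the fragment, leaving the rest untouched.
base, fragment = raw.split("#", 2)
encoded = base + "#" + fragment.gsub("[", "%5B").gsub("]", "%5D")
parsed = URI.parse(encoded)

puts rejected
puts parsed.fragment
```

Whether the receiving site treats the encoded fragment the same as the raw one is a separate question that this sketch does not address.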

The grammar in RFC 2396 (https://www.rfc-editor.org/rfc/rfc2396) is more lenient, and is available in lib/uri/rfc2396_parser.rb, so you may want to try it.
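
A minimal sketch of that suggestion, assuming the URI::RFC2396_Parser class that ships with Ruby's uri library. The URL is again a shortened, hypothetical stand-in for the reported one, not the full URL from the issue.

```ruby
require "uri"

# Hypothetical stand-in for the reported URL: brackets appear only in the fragment.
raw = "https://www.google.com/search?q=x#mv:[[1,2]]"

# The RFC 2396 parser is instantiated explicitly; URI.parse does not use it.
lenient = URI::RFC2396_Parser.new
uri = lenient.parse(raw)

puts uri.host
puts uri.fragment
```

The trade-off is that the resulting URI object follows the older, obsoleted grammar, so other inputs that RFC 3986 would reject may also be accepted.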

[In Thunderbird, where I saw your message first, the URI is colored up to just before the first ':' in the fragment, and when I click on it, only the part before that ':' is sent to the browser, but both RFC 3986 and RFC 2396 allow ':' in fragments, so this behavior is difficult to explain.]
