-
Notifications
You must be signed in to change notification settings - Fork 318
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Tumblr import #548
base: master
Are you sure you want to change the base?
Fix Tumblr import #548
Conversation
jekyll_url = if Jekyll.const_defined? :Post | ||
Jekyll::Post.new(site, site.source, "", "tumblr/#{post[:name]}").url | ||
else | ||
Jekyll::Document.new(site.in_source_dir(relative_path), :site => site, :collection => site.posts).url | ||
"/" + Date.parse(post[:date]).to_s.tr("-", "/") + "/" + post[:slug] + ".html" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe these two changes output the files in different places than the current code does. Perhaps we should check this with a test to ensure a proper Jekyll site is output. I believe the goal was to write all the posts into the _posts directory so that they can be paginated and have all the goodies that come with documents in collections.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When run with the flag --rewrite_urls true
, the Tumblr importer outputs files in two different places:
_posts
for the full text blog articlespost
for the redirects. These are simple files that do not contain any text of the blog posts, but instead only contain an HTML meta tag which causes the browser to redirect to the new, canonical blog post URL. The purpose of this is to avoid breaking URLs in the style that Tumblr used to use, prior to the blog being migrated to Jekyll. Tumblr-style URLs look likehttps://blog.domain.example/post/1234567890/blog-post-title-slug
. So, these need to be redirected and this is how it is done.
When run without the flag --rewrite_urls true
, only the former file location is used, and redirects are not generated.
The changes in my PR only apply to the latter file location. The former location still continues to generate as normal, and is not affected by my changes (that is handled by different code in a different method). It is true that the latter code does not output to the standard _posts
location, but as you can see, there is a reason for this.
My PR does not change this overall behavior, it is the same as before; it was already being written to both _posts
and post
. However, the existing code generates incorrect HTML, as I wrote earlier, and my PR fixes that. My PR does not change the directory structure or file locations.
My apologies for not writing a unit test for this. I ran into some difficulty with the development dependencies, because I don’t have MySQL, PostgreSQL, etc. installed (they are not necessary for the Tumblr importer) and I didn’t have time to look into that. As a result I was unable to get the unit tests running locally.
Let me know what you think about my above explanation, and if you would prefer, I can try to look more into getting some test coverage for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From my read of the code, tumblr_url
is the redirect page, and jekyll_url
(the line commented on) is the "new" post URL from _posts
. In the diff I see, both are modified, one to url_with_slug
(redirect page), and one stops using Jekyll::Document. The thing I'm considering here is the "hard-coded" new jekyll_url
you have here. I think it's possible to import into a Jekyll site path that has a _config.yaml
file which could alter the document url, right? Perhaps it's an edge case, but it's possible.
In any event, if it's best to follow this path, then we shouldn't check for Jekyll::Post
and instead use the same jekyll_url
for both the old versions of Jekyll (with Post) and the new versions of Jekyll (with Document).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry about the test dependencies. One thing I'd try is to comment out the dev dependencies you don't need, and to run bundle exec ruby test/test_tumblr_importer.rb
directly to run just the one file. That way, mysql
& friends won't be require
'd. These tests aren't particularly well-kept, so apologies for that! If you're open to it, you might also try a GitHub Codespace so you don't need to modify your own dev environment. We don't have a devcontainer with all the dependencies setup at this time, but hopefully a sudo apt-get install mysql-client
etc could do the trick. I think each gem's repo has installation instructions for Linux (with dependencies) if you go this route. Ruby with C extensions is definitely not the developer's paradise!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From my read of the code, tumblr_url is the redirect page, and jekyll_url (the line commented on) is the "new" post URL from _posts.
I think that's right. In this case, jekyll_url
is being used to generate the content of the HTML meta tag which redirects the user from the old URL to the new URL.
In the diff I see, both are modified, one to url_with_slug (redirect page), and one stops using Jekyll::Document.
tumblr_url
contains the same value as it did before; the change is a result of the fact that I renamed :slug
to :url_with_slug
, which also contains the same value as it did before. :slug
was a confusing name for it because it actually already contained a full URL, hence why I renamed it. Furthermore, I also needed a hash key with just the slug (not the entire URL), and the only logical name for that was :slug
, which was already in use... (see lines 175-176 on the right side of the diff to see what I mean)
The thing I'm considering here is the "hard-coded" new jekyll_url you have here. I think it's possible to import into a Jekyll site path that has a _config.yaml file which could alter the document url, right? Perhaps it's an edge case, but it's possible.
Very good point. I was worried there might be something like that. Do you have any suggestion for how to do it properly? I'm afraid I'm not familiar enough with this API.
In any event, if it's best to follow this path, then we shouldn't check for Jekyll::Post and instead use the same jekyll_url for both the old versions of Jekyll (with Post) and the new versions of Jekyll (with Document).
I was hesitant to change that part of the branch because I didn't have an old version of Jekyll to test with. But I can do this if that's what you recommend.
I was trying to run the Tumblr importer under Ruby 3 and the latest Jekyll, and I ran into these two problems, when using the option
--rewrite_urls true
:There was a call to
URI.encode()
which was removed in Ruby 3. This causes an uncaught exception. This PR fixes this by replacing it withURI::Parser.new.escape()
.The redirect URLs generated were incorrect, which caused a 404 when a user tries to follow one of the old Tumblr-style links.
(Neither problem occurs when not using that option. I needed that option, though.)
It was generating URLs like this:
/2024/08/22/2019-04-03-how-i-handle-errors-in-ios-apps.html
(Today's date is incorrectly used as the directory path, and then the actual original date of the post is incorrectly inserted into the slug)It should have been generating URLs like this:
/2019/04/03/how-i-handle-errors-in-ios-apps.html
(Correctly matches where the imported post ends up afterjekyll build
)The second issue was somewhat trickier to fix. I'm not familiar with the
Jekyll::Document
API. I couldn't figure out why it returns the wrong result. Instead, I resorted to just assembling the string manually (which required passing a little more data torewrite_urls_and_redirects()
than was being done previously).I'm open to feedback for how to improve this further. Thanks for taking a look!