Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should open .puz with iso-8859-1 encoding interpretation #12

Open
piyo opened this issue Jan 28, 2021 · 3 comments
Open

Should open .puz with iso-8859-1 encoding interpretation #12

piyo opened this issue Jan 28, 2021 · 3 comments

Comments

@piyo
Copy link

piyo commented Jan 28, 2021

Thanks for this enjoyable Emacs package.

I notice that puz files are not opened with the correct encoding. At best (1) some hints and copyright symbol will be garbled. At worst (2) the puz file's width cannot be read correctly and crossword-summary will abort.

Recipe for duplicating (1):

  1. On Debian 10, Emacs 27.1 self-compiled.
  2. Start emacs with export LANG=ja_JP.UTF-8 ; export LANGUAGE=ja_JP:ja ; emacs -nw
  3. Download Matt Jones's crossword for 2021-01-21 using crossword-download.
  4. Open crossword-summary.
  5. Open the puzzle file by pressing Enter.
  6. Notice the copyright symbol is not correct. Also the hint for Across 24 does not show the é character, but \xe9.

Recipe for duplicating (2):
Cannot re-verify at the moment. It was on an Ubuntu 16.04 with Emacs 27.1 self-compiled.

Possible Cause:
puz files are opened with an encoding which is user-config dependent.

Possible solution:
Locally set the coding-system-for-read before using insert-file-contents.
Define a package default variable that holds the desired coding for known puzzle sources.

(defvar crossword-puzzle-file-coding 'iso-8859-1)

;; in crossword--start-game-puz
(let ((coding-system-for-read crossword-puzzle-file-coding))
  (insert-file-contents (setq crossword-file puz-file)))

;; in crossword--summary-data-puz
(let ((coding-system-for-read crossword-puzzle-file-coding))
  (insert-file-contents file))

Workaround 1:
Define the default coding for read for "*.puz" files.

(with-eval-after-load 'crossword
  (add-to-list 'file-coding-system-alist '("\\.puz\\'" . iso-8859-1)))

Version:
crossword.el was version crossword-20210126.1409 from MELPA or
Package-Commit: a0bbd80.

Other notes:
Once puz files are decoded correctly, puz-emacs files will be saved with the correct string data.

Reference:
According to https://code.google.com/archive/p/puz/wikis/FileFormat.wiki the (de facto?) encoding of puzzle files is iso-8859-1.

All strings are encoded in ISO-8859-1 and end with a NUL.

@Boruch-Baum
Copy link
Owner

Boruch-Baum commented Jan 28, 2021 via email

Boruch-Baum added a commit that referenced this issue Jan 29, 2021
+ Addresses github issue #12, in which the copyright symbol was
  improperly decoded when the locale was:
  + LANG=ja_JP.UTF-8
  + LANGUAGE=ja_JP:ja

+ May call for re-evaluation of commit 0f39144 "Decode strings to
  coding system 'prefer-utf-8-dos'"

+ Expect more of this to crop up as we get feedback from more users
  with non-English environments and puzzles...
@piyo
Copy link
Author

piyo commented Jan 29, 2021

I see the commit you added for this issue. You may want to git-amend (or magit-reword) it after this reply, sorry.

At worst (2) the puz file's width cannot be read correctly and crossword-summary will abort.

I have the exact steps to repro this problem (starting from emacs -Q), and it has nothing to do with LC_ALL and LANGUAGE environment variables. However your commit should essentially fix the problem. Here is the code:

;; Steps to reproduce:

;; Command line to start Emacs:
;; env LANG=en_US.UTF-8 LANGUAGE=en_US:en emacs -Q -nw

;; run each command one at a time, i.e C-x C-e (eval-last-sexp)
(progn
  (setq crossword-save-path (expand-file-name (substitute-in-file-name "$HOME/.cache/emacs/var/crossword/")))
  ;; Previously, the puzzle 2020-12-31 from Matt Jones was downloaded via (crossword-download).
  ;; crossword-save-path already contains "jz201231.puz" and nothing else
  (setq dbg-crossword-load-path (substitute-in-file-name "${HOME}/.local/share/emacs/27/site-lisp/elpa/crossword-20210126.1409"))
  (setq load-path (append load-path (list dbg-crossword-load-path)))
  (load (expand-file-name "crossword-autoloads.el" dbg-crossword-load-path))

  ;; user setting:
  (set-coding-system-priority
   'utf-8
   'cp932 ;; this encoding wins for "jz201231.puz"
   'euc-jp
   'iso-2022-jp
   'iso-8859-1 ;; however, *.puz needs this, but is not evaluated
   )
  (toggle-debug-on-error)
  ;; Expect: shows 1 entry.
  ;; Actual: shows no entries. error dialog.
  (crossword-summary) ;; bug repro'd here.

  ;; Additional debug
  ;; (dired crossword-save-path)
  ;; (dired dbg-crossword-load-path)

  ;; Expect: summary forms about jz201231.puz.
  ;; Actual: (nil)
  (find-file-other-window (expand-file-name "puz-emacs.data" crossword-save-path))

  (find-file-other-window (expand-file-name "jz201231.puz" crossword-save-path))

  ;; Expect: contains "© 2020 Matt Jones"
  ;; Actual: contains "ゥ 2020 Matt Jones"

  ;; Expect: 15
  ;; Actual: 80
  (with-current-buffer "jz201231.puz"
    (char-after 45))

  ;; Expect: iso-latin-1-unix (iso-8859-1-unix).
  ;; Actual: japanese-cp932-unix.
  (with-current-buffer "jz201231.puz"
    (require 'mule-diag)
    (if (local-variable-p 'buffer-file-coding-system)
	    (print-coding-system-briefly buffer-file-coding-system)
	  (princ "Not set locally, use the default.\n")))
  )


;; Steps for expected results

;; command line to start emacs:
;; env LANG=en_US.UTF-8 LANGUAGE=en_US:en emacs -Q -nw

(progn
  (setq crossword-save-path (expand-file-name (substitute-in-file-name "$HOME/.cache/emacs/var/crossword/")))
  (setq dbg-crossword-load-path (substitute-in-file-name "${HOME}/.local/share/emacs/27/site-lisp/elpa/crossword-20210126.1409"))
  (setq load-path (append load-path (list dbg-crossword-load-path)))
  (load (expand-file-name "crossword-autoloads.el" dbg-crossword-load-path))
  ;; do not (set-coding-system-priority ...)
  ;; or:
  (set-coding-system-priority
     'utf-8
     'iso-8859-1 ;; this encoding wins for "jz201231.puz"
     'cp932	 ;; this encoding is never evaluated
     'euc-jp
     'iso-2022-jp
     )
  (toggle-debug-on-error)
  (crossword-summary) ;; no problem, shows 1 entry.
  )

What version of the package are you using?

crossword.el was version crossword-20210126.1409 from MELPA or Package-Commit: a0bbd80.

I do see what may be a Japanese character in place of the copyright symbol, so that's a bug; however, I do get the correct é character in clue 24 across.

This is strange, I either get both wrong, or with the workaround, both correct. Please see the repro steps above including the part where I open the puz file.

I would like to push the commit, crediting you. How do you want to listed in the Changelog file (ie. name, email)?

Thank you for the credit in that commit, that's fine.

My differences are to use a defcustom instead of defvar to be more user-friendly if it needs to be changed on-the-fly, and to set the locally-scoped definition in the pre-existing 'let's.

Looks good for now. I would rather reduce the effect of coding-system-for-read but then you probably don't need to read other files in the same function, right? Also I would just make the functionality of reading a puz file into a temporary buffer into a separate function so that people can defadvice just that part.

I expect this to get more complex as I get more feedback from people using the package for non-English language puzzles. It may be that I eventually need to do something like add an element to 'crossword-download-puz-alist' for encodings and then upon reads compare file names to the file-specs in that list in order to get that file's encoding.

I was going to speculate but nah. YAGNI, until you do. I suspect supporting other single-byte encodings will be easier than multibyte encodings, though.

Reference: According to [1]https://code.google.com/archive/p/puz/wikis/FileFormat.wiki the (de facto?) encoding of puzzle files is iso-8859-1.
I don't think I ever saw that document before. Thank you. It seems a much more thorough analysis than anything I've read before.

It's a link in a link in your README.md.
->
[2] http://fileformats.archiveteam.org/wiki/PUZ_%28crossword_puzzles%29
->
https://code.google.com/archive/p/puz/wikis/FileFormat.wiki

OFF-TOPIC: Are you seeing screen-flicker when using the crossword
package with emacs 27?

I am using a terminal emulator, so no. I.e. Windows 10 -> Msys mintty.exe -> Debian 10 -> screen -> Emacs 27.1 -> crossword.el. Glorious wide vision,195 by 60 with HiDPI fonts. ;-P

Another off-topic question: Do you have non-English crossword puzzle files? I'd like to get a source of them in order to further test / develop the package.

No.
I was curious and I searched in the language I can use, Japanese. The first couple of hits just use hiragana and katakana, but of course the hints use the full Japanese language (kanji). Like http://cross-word.info/ (I lost a couple of minutes just playing with this! heh) If I could play these with the excellent interface your package provides I would be happy. But alas it would need multibyte support in this modern era.

@Boruch-Baum
Copy link
Owner

Boruch-Baum commented Feb 3, 2021 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants