-
Notifications
You must be signed in to change notification settings - Fork 26
File encoding bug in Atom 1.3.3 when switching between UTF-8 and 'Western (Mac Roman)' #28
Comments
I believe that this is a side-effect of #7 not existing and therefore this may be a duplicate of #7. In essence, Atom doesn't convert between encodings, the encoding-selector only tells Atom how to interpret the character codes that already exist. Since the encoding for the characters in step 3 result in invalid UTF-8 sequences (the hex values for them are
The reproduction steps can be simplified to:
Since the encoding-selector currently is only intended to inform Atom of the encoding being employed by the file and not to convert between encodings, it seems as if this is the best possible behavior. If the encoding-selector is given the ability to convert between encodings as recommended in #7, the behavior should end up more to your liking @afischli. Please let me know if I've misunderstood things. |
I fear you may have complicated things unnecessarily. While I fully concur that Atom does and should not convert between encodings, some of the statements you make, need some amendments.
On the other hand for the debugging of this bug, of course, there the hex package may come very handy to learn why and when exactly Atom produces this bug and converts bytes wrongly it should leave untouched. My suspicion with no further testing is that it makes the mistake at the very beginning, i.e. uses UTF-8 or something else in the middle of "Western (Mac Roman)" editing by using hex 8A for Umlaut 'ä' (lower case a) instead of the correct hex E4 etc. Or something else is going wrong here. |
You make some good points. And if in step 6 (your repro steps) you didn't save the file, I might agree with you that this is a bug. But when you tell Atom to save the file with the encoding UTF-8, it should save a valid UTF-8 file. The only way to do that is to convert the invalid bytes to the U+FFFD character. The unique thing about UTF-8 is that some byte sequences are invalid. Single-byte encodings (such as Mac Roman) do not have this quality. This is why you see the change happen in the byte sequence one way and not the other. Atom can only work with the information you give it. When you tell Atom to save a file as UTF-8 that has invalid characters in it, it will perform a conversion as expected under Unicode so that the file is a valid UTF-8 file. |
It does use |
I made some further tests by repeating my sequence of steps 1. to 7. and noticed that changing the encoding does not alter the display. This is wrong, see my previous arguments 4) and 5). The new test steps 1. to 7. without savings show that obviously Atom makes a mistake by showing on the display always the same characters regardless of current encoding and only starts "converting" internally to actually encode the bytes when it saves the file. Then the current encoding matters. In my view this is wrong and Atom should encode upon entering some text by storing the proper byte sequence in the buffer, so that saving or not saving does not matter. As soon as you change the encoding, the display should change accordingly. Meaning, if I enter some Umlaute while "Western (Mac Roman)" encoding is active, say "äöü" should immediately change its display as I change to UTF-8 encoding. In this example the displayed string should change immediately to "���" when switching encoding to UTF-8 and back to "äöü" when switching encoding back to "Western (Mac Roman)". Thus to simplify the reproduction of the bug these steps can do it:
This is the bug. Correct would be the display of "���", which switches back to "äöü" when going back to "Western (Mac Roman)" encoding. File saving should not matter at all, which of course does also produce incorrect behavior, perhaps another bug in the sense that internally different parts of Atom's program code are involved. But in terms of handling non-converting encoding correctly throughout, it is the same bug. Atom does not do what it should. It should never convert and encode everything always in the current encoding scheme. |
I am not aware that Unicode requires byte streams, say a large file of bytes, to be encoded throughout in a consistent manner. AFAIK Unicode is merely an internal standard, which universally assigns ordinal numbers to characters. Encoding of Unicode is an entirely different story and can follow different schemes, e.g. UTF-8 as well as UTF-16 are both Unicode encoding schemes. But they do that differently. For details read perhaps my article on encoding «Encoding or "if what you see is not what you want"» (Evernote requires me to know your e-mail for that). Why and on what basis should Atom suddenly start converting inconsistently encoded byte sequences, and try to enforce the current encoding on everything entered so far? This would create quite a mess. Say you enter some text in ISO-Latin, then switch to "Western (Mac Roman)", then UTF-16 and now save this to a disk file. Why should UTF-16 be the correct encoding for all? It isn't and it can't. For each portion first of all the encoding should be used that was in use when the text was entered. Any encoding that overlaps with that text fully could also be used of course. But it is easy to find an example with only 3 characters where above sequence is wrong, regardless which of the 3 encoding schemes you choose, one character is always in the way and conversion fails. Of course Atom could detect such encoding inconsistencies and warn the user about the fact or refuse to save (of course not a good idea) or whatever. But the problem remains that many encoding schemes overlap only partly, misleading the user to switch encoding while typing within the overlap. In other words, only humans can fix such mixed encodings, Atom can't do that for you. Making these arguments I would say a good solution would probably be to try to "test convert" the current file to the new encoding as the user wants to switch, say in a temporary storage only, and then warn the user if the conversion fails (iconv -f -t ). That would also be a good moment when Atom could recommend that he/she should not change the encoding, since previously entered text can no more be properly displayed in the new encoding, so that she/he can still abort the switch. Then a user would still have a choice to either force the switch at the risk that some text, say some Umlaute, would have become "unreadable" until fixed by the author by properly converting to the new encoding or give up on the encoding switch. Of course, in case all previously entered text can also be encoded in the new encoding, i.e. all text is within overlapping domains, e.g. ASCII only can fully be encoded without change in UTF-8, then the switch would not warn the Atom user and all is anyway fine, changing encodings do not matter. It seems to me the problems in Atom as I have pointed them out are real and should be addressed in the manner as I described. Your "justification" that Atom has to convert seems to me not to be justified, because Atom can never do it right. **Summary:
Of course a third optional measure would be to offer conversion facilities from within Atom, but I believe this is a feature request and has little to do with the bug(s) I discuss here. Moreover, I guess most Atom users can always go to a shell and use 'iconv' to make a conversion. |
P.S.: Please remember, 'ä' is Unicode code point U+00E4 (LATIN SMALL LETTER A WITH DIAERESIS) and its UTF8 encoding requires two bytes, i.e. hex C3 A4. The only alternative encoding also using Unicode of 'ä' I know of is the 3 byte encoding using two Unicode code points, i.e. U+0061 (LATIN SMALL LETTER A) combined with U+0308 (COMBINING DIARESESIS). So if I'm not mistaken hex 8A can't be UTF-8 encoding for 'ä', since that would need to be at least 2 bytes. It is rather some 1 byte encoding scheme, probably whatever the browser is currently using, e.g. Western (ISO Latin 1) or whatever. Can't say at the moment. |
Having given this a bit further thought I have come to a somewhat modified view of this problem, I would call this a conceptual bug, which clearly needs fixing. First I suggest another procedure to reproduce the bug and demonstrate its effect using a file ('testUmlaute.zip' attached) consisting of 4 bytes, i.e. string "äöü" in hex 'C3 A4 9A FC' meaning first 2 bytes represent Unicode Code point U+0061 in UTF-8 encoding, 3rd byte represents Unicode Code point U+00F6 in 'Western (Mac Roman)' encoding and 4th byte Unicode Code point U+00FC in 'Western (Windows 1252)' encoding (cf. e.g. http://www.utf8-chartable.de). With this file reproduce the bug and demonstrate its effect by these steps:
Atom 1.3.3, OS X 10.9.5 (Mavericks) The procedure given above is much cleaner, since it is independent of any entering of text and thus narrows the scope of where Atom changes the content considerably. Note, actually only steps 1) and 5) are needed to reproduce the bug, but to demonstrate clearly the effect of the bug the other steps are listed too. This means, up to and including step 4) all is as it should be. Step 5) loses the information and as of step 6) content is altered and displays become accordingly different. The content of the file has obviously changed, albeit its content was not edited in any way. In my view this is not acceptable. Investigating the content of the newly saved file 'testUmlaute-savebug' is: hex 'C3 A4 EF BF BD EF BF BD'. The original information is lost. The edit buffer is treated differently from what is saved to a file. I call this therefore a conceptual bug. Comment: Then, while I have some sympathy for not losing the users text while switching to an encoding that can not represent all text, i.e. keeping all entered text intact in the edit buffer, the user must be warned at least that upon saving the file, the editing will be converted and irrecoverably corrupted. Better would be to warn the user right away also when she attempts to switch. Then a 2nd time if there are still '�' characters present in the UTF-8 text at any moment before saving the file. The same warnings need of course to be issued for any other encodings, not only for UTF-8, that cannot fully encode what was originally entered. |
Perhaps with above contribution I best reopen the issue? |
From @afischli on January 23, 2016 12:22
Repeat the bug:
Bug: Originally with 'Western (Mac Roman)' encoding entered string 'äöüÄÖÜ' should now be restored as originally entered (consistent with behavior of Atom in steps 6..8). However, the string changes to unreadable wrong sequence 'ÔøΩÔøΩÔøΩÔøΩÔøΩÔøΩ'. Perhaps this is a bug, which is related to Atom not understanding the difference between iconv options 'MACROMAN' and 'UTF-MAC' encoding.
I found out later, if you repeat step 4) and 7) (file saved and left unchanged since step 6) you regain the originally entered string. The bug appears only every 2nd time of switching. Since this is quite confusing I believe this is a bug, which should be fixed.
System: Atom 1.3.3, OS X 10.9.5 (Mavericks)
Copied from original issue: atom/settings-view#728
The text was updated successfully, but these errors were encountered: