This repository has been archived by the owner on Dec 15, 2022. It is now read-only.

gb2312 / GB18030 encoding detection conflict #65

Open
1 task done
wesinator opened this issue Oct 30, 2018 · 4 comments

Comments

@wesinator

Prerequisites

Description

A text file whose encoding GitHub detects as GB18030 is auto-detected as gb2312 in Atom.

Steps to Reproduce

https://github.com/malice-plugins/yara/blob/17a4fc946febe8b002e285f591bcb21b92a99e9e/rules/userdb_panda.yar

  1. Edit this file on GitHub
  2. Open in Atom
  3. Select "Auto Detect" encoding

Expected behavior: Atom detects the encoding of the file as GB18030.

Actual behavior: Atom auto-detects the encoding as gb2312 and shows 'undefined encoding'.

iconv fails to convert from GB2312, but works with GB18030:

iconv -f GB2312 -t UTF-8 userdb_panda.yar
iconv: illegal input sequence at position 29230

Works: iconv -f GB18030 -t UTF-8 userdb_panda.yar
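
The same check can be reproduced without iconv. The following is a minimal Python sketch, assuming that a strict decode is an acceptable stand-in for the iconv test above; `first_strict_codec` is a hypothetical helper written for illustration, not part of Atom or jschardet, and the sample bytes are made up rather than taken from userdb_panda.yar.

```python
from typing import Optional, Sequence


def first_strict_codec(
    data: bytes, candidates: Sequence[str] = ("gb2312", "gb18030")
) -> Optional[str]:
    """Return the first candidate codec that decodes `data` without errors."""
    for name in candidates:
        try:
            # Strict mode raises on any byte sequence the codec does not define.
            data.decode(name, errors="strict")
        except UnicodeDecodeError:
            continue
        return name
    return None


# Illustrative input: U+3400 (CJK Extension A) is not part of GB2312,
# so the gb2312 codec rejects these bytes while gb18030 accepts them.
sample = "中文㐀".encode("gb18030")
print(first_strict_codec(sample))
```

Because GB18030 is a superset of GB2312, a file that fails a strict GB2312 decode but passes a GB18030 one is consistent with the iconv results reported here.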

Reproduces how often: Always

Versions

apm 2.1.2
atom 1.32.0 x64
Ubuntu 18.04

Additional Information

@rsese

rsese commented Nov 1, 2018

Thanks for the report! I can reproduce as described on macOS 10.12.6 with Atom 1.34.0-nightly5.

@rsese rsese added the triaged label Nov 1, 2018
@maxbrunsfeld
Contributor

maxbrunsfeld commented Nov 13, 2018

Here is the library that we're using for encoding detection: https://github.com/aadsm/jschardet.

@wesinator
Author

@maxbrunsfeld Do I need to open a new issue there?

@maxbrunsfeld
Contributor

@wesinator Yeah, that's what I would recommend.
