Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DISCUSSION: Add column(s) to vocabularies table to identify vocabularies that came from a specific database #112

Open
marc-outins opened this issue Jun 6, 2018 · 8 comments

Comments

@marc-outins
Copy link

Since we are going to adding database specific vocabularies I think we should store what database they came from in the vocabularies table.

@aguynamedryan
Copy link
Contributor

So, say we have two SEER databases, ALL and CLL, which share some of the same vocabularies, what do we put in this column?

@marc-outins
Copy link
Author

SEER since ALL, CLL, etc. is just a cut of the whole SEER Medicare database. One question is do we distinguish between SEER and SEER Medicare. Currently vocabs that come from the SEER medicare data I'm giving ids of SEER_ .

@aguynamedryan
Copy link
Contributor

So is this column storing the data vendor's name? Or the dataset name?

@marc-outins
Copy link
Author

whatever we want to call it, definitely didn't mean we would have different vocabularies for each version of a database if said vocabulary is the same across databases. But it would be nice to know the source of the vocabulary.

@markdanese
Copy link
Contributor

markdanese commented Jun 6, 2018

I think we need to simply name the vocabulary properly and have a description or source field. SEER is confusing because some things come from SEER, some are NAACCR, some are adapted from NAACCR, and some are adapted from other sources (e.g., AJCC). In other words, just because a vocabulary is in SEER doesn't mean it comes from SEER. I think this would be better called "source", which may, or may not, relate to the database precisely. I think source could be more of a description to say something like "NAACCR grade adapted by SEER version 2".

@markdanese
Copy link
Contributor

A related example would be something like CMS place of service codes which may be in Medicare or other databases. But the source description is "CMS place of service". In other words, I think we should say what it is, and where it comes from in a way to identify it clearly.

@marc-outins
Copy link
Author

In the CMS place of service situation the vocabularies table would contain a vocabulary for CMS place of service and the "Source" column would be CMS and we are doing that with any vocab use in SEER that is defined somewhere else. I'm talking about vocabularies that are defined by the organization who cut the data. So for SEER Medicare there is a variable marst1-10 which contain marital status as defined by SEER. So I create a vocab called SEER_MARST.

@markdanese
Copy link
Contributor

markdanese commented Jun 6, 2018

In that case, I would call the source SEER (or NCI, but I prefer SEER since NCI might have more than 1 version). Having the vocab name include the source, when relevant, is a good reminder. Although I wonder if we should be more precise about this.

Consider a variable for Sex, which will be in every database and defined differently in many, but not all. Do we name them differently for each datasource? In other words do we give them more generic names like "Sex_M_F" and "Sex_0_1" and "Sex_Male_Female"? Or do we call them all "Sex" and then use a source field to distinguish them? Or do we need a "type" field to categorize them?

I don't think there is a perfect answer here. I think we need something that can be clearly implemented and searched. So, I guess I am leaning toward something like "Vocabulary Name" (SEER_MARST or CMS_SEX), Source (SEER or CMS), Type (Marital Status or Sex), and "Description" (SEER marital status variable or CMS sex variable).

Obviously in these examples, there is redundancy, but in other situations it will be helpful. I am thinking of Vocabulary = "SEER_grade", Source = "SEER", Type = "Cancer grade", and Description = "NAACCR grade adapted by SEER".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants