DISCUSSION: Add column(s) to vocabularies table to identify vocabularies that came from a specific database #112
Comments
So, say we have two SEER databases, ALL and CLL, which share some of the same vocabularies. What do we put in this column?
SEER, since ALL, CLL, etc. are just cuts of the whole SEER Medicare database. One question is whether we distinguish between SEER and SEER Medicare. Currently, vocabs that come from the SEER Medicare data I'm giving ids of SEER_ .
So is this column storing the data vendor's name? Or the dataset name?
Whatever we want to call it, I definitely didn't mean we would have different vocabularies for each version of a database if said vocabulary is the same across databases. But it would be nice to know the source of the vocabulary.
I think we simply need to name the vocabulary properly and have a description or source field. SEER is confusing because some things come from SEER, some are NAACCR, some are adapted from NAACCR, and some are adapted from other sources (e.g., AJCC). In other words, just because a vocabulary is in SEER doesn't mean it comes from SEER. I think this column would be better called "source", which may or may not relate precisely to the database. Source could be more of a description, something like "NAACCR grade adapted by SEER version 2".
A related example would be something like CMS place of service codes, which may be in Medicare or other databases. But the source description is "CMS place of service". In other words, I think we should say what it is and where it comes from, in a way that identifies it clearly.
In the CMS place of service situation, the vocabularies table would contain a vocabulary for CMS place of service and the "Source" column would be CMS. We are doing that with any vocab used in SEER that is defined somewhere else. I'm talking about vocabularies that are defined by the organization that cut the data. So for SEER Medicare there are variables marst1-10, which contain marital status as defined by SEER, so I create a vocab called SEER_MARST.
In that case, I would call the source SEER (or NCI, but I prefer SEER since NCI might have more than one version). Having the vocab name include the source, when relevant, is a good reminder, although I wonder if we should be more precise about this. Consider a variable for Sex, which will be in every database and defined differently in many, but not all. Do we name them differently for each data source? In other words, do we give them more generic names like "Sex_M_F", "Sex_0_1", and "Sex_Male_Female"? Or do we call them all "Sex" and then use a source field to distinguish them? Or do we need a "type" field to categorize them? I don't think there is a perfect answer here. I think we need something that can be clearly implemented and searched. So I guess I am leaning toward something like "Vocabulary Name" (SEER_MARST or CMS_SEX), "Source" (SEER or CMS), "Type" (Marital Status or Sex), and "Description" (SEER marital status variable or CMS sex variable). Obviously there is redundancy in these examples, but in other situations it will be helpful. I am thinking of Vocabulary = "SEER_grade", Source = "SEER", Type = "Cancer grade", and Description = "NAACCR grade adapted by SEER".
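To make the four-field proposal above concrete, here is a minimal sketch of what such a table might look like, using an in-memory SQLite database. The column names (`name`, `source`, `type`, `description`) and the helper functions `build_vocabularies` and `names_by` are assumptions made for illustration only; the thread does not settle on an actual schema. The three rows are the examples given in the comment above.

```python
import sqlite3


def build_vocabularies(conn):
    """Create a hypothetical vocabularies table and load the thread's examples."""
    conn.execute(
        """
        CREATE TABLE vocabularies (
            name        TEXT PRIMARY KEY,  -- e.g. SEER_MARST (assumed column name)
            source      TEXT NOT NULL,     -- who defined the vocabulary, e.g. SEER
            type        TEXT NOT NULL,     -- category, e.g. Marital Status
            description TEXT               -- free-text provenance
        )
        """
    )
    rows = [
        ("SEER_MARST", "SEER", "Marital Status", "SEER marital status variable"),
        ("CMS_SEX", "CMS", "Sex", "CMS sex variable"),
        ("SEER_grade", "SEER", "Cancer grade", "NAACCR grade adapted by SEER"),
    ]
    conn.executemany("INSERT INTO vocabularies VALUES (?, ?, ?, ?)", rows)
    return conn


def names_by(conn, column, value):
    """Return vocabulary names whose source or type equals the given value."""
    if column not in {"source", "type"}:
        raise ValueError("column must be 'source' or 'type'")
    cur = conn.execute(
        f"SELECT name FROM vocabularies WHERE {column} = ? ORDER BY name",
        (value,),
    )
    return [row[0] for row in cur]


conn = build_vocabularies(sqlite3.connect(":memory:"))
print(names_by(conn, "source", "SEER"))
print(names_by(conn, "type", "Sex"))
```

With this layout, searching by source or by type is a single-column filter, which is the "clearly implemented and searched" property asked for above. The redundancy between name and source is visible in the sample rows.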
Since we are going to be adding database-specific vocabularies, I think we should store which database they came from in the vocabularies table.