Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Copy Data Between Users Across Dev & Prod #711

Merged
merged 20 commits into from
Sep 26, 2022
Merged

Copy Data Between Users Across Dev & Prod #711

merged 20 commits into from
Sep 26, 2022

Conversation

noschiff
Copy link
Member

@noschiff noschiff commented Sep 2, 2022

Summary

This PR creates a TypeScript script for the command line to copy a user's data from one user to another, optionally across dev and prod Firebases. Firebase makes it very difficult to copy data between documents in their web interface since there's no way to easily download documents, and it's nearly impossible to upload data using the web interface. The only practical way to move data on Firebase is to use a program with a service account that can get the data for a document and set the contents of another document.

To further investigate #672, we need to make changes to the data for a user with a broken account on prod. However, we cannot and should not change data on prod, so this script is needed to transfer the data to dev where we can safely experiment with it. Only the TPM should use this with prod. However, this script will also allow us to make copies of documents on dev for developers to experiment with.

Test Plan

Run the script from the command line with various arguments in preview mode and actually copy between documents on dev (e.g. copy from your account on dev to another google account that you have). You don't actually have to copy to another user if you want; the "user" only refers to the name of a document, so you can name your documents what you want. However, you'll need an email that corresponds to the document name to actually use the data. I recommend only having the dev service account in your repo at first until you have thoroughly read through the code and are confident that it will behave as it is supposed to once a prod service account credentials are in the repo.

Again, it is very important to read through the code because a prod service account grants the ability to mess with our data on prod if the script has an error that I didn't notice.

Example script: preview a copy from a dummy account to your data on dev:

npm run ts-node -- scripts/copy-user-data.ts -f dev/dummyaccount -t dev/[email protected] -o "log.json"

Example script: copy from a dummy account to your data on dev:

npm run ts-node -- scripts/copy-user-data.ts -f dev/dummyaccount -t dev/[email protected] -o "log.json" --execute

@noschiff noschiff requested a review from a team as a code owner September 2, 2022 04:47
@dti-github-bot
Copy link
Member

dti-github-bot commented Sep 2, 2022

[diff-counting] Significant lines: 123.

@github-actions
Copy link
Contributor

github-actions bot commented Sep 2, 2022

Visit the preview URL for this PR (updated for commit fad5d3c):

https://cornelldti-courseplan-dev--pr711-copy-user-data-on0f6uti.web.app

(expires Wed, 26 Oct 2022 02:49:00 GMT)

🔥 via Firebase Hosting GitHub Action 🌎

@zachary-kent
Copy link
Collaborator

If we're going to be parsing command line arguments, I think it would be better to use a CLI library. That way, we can have named arguments for the source and destination, as well as switches for boolean arguments.

src/admin-copy-user-data.ts Outdated Show resolved Hide resolved
src/admin-copy-user-data.ts Outdated Show resolved Hide resolved
src/admin-copy-user-data.ts Outdated Show resolved Hide resolved
@noschiff
Copy link
Member Author

noschiff commented Sep 2, 2022

@zachary0kent

If we're going to be parsing command line arguments, I think it would be better to use a CLI library. That way, we can have named arguments for the source and destination, as well as switches for boolean arguments.

That sounds good! I've only ever manually handled command line arguments, let alone done anything with the command line in TypeScript. What do you suggest?

@zachary-kent
Copy link
Collaborator

@zachary0kent

If we're going to be parsing command line arguments, I think it would be better to use a CLI library. That way, we can have named arguments for the source and destination, as well as switches for boolean arguments.

That sounds good! I've only ever manually handled command line arguments, let alone done anything with the command line in TypeScript. What do you suggest?

I think the minimist package would be good here. yargs is fun and pirate-themed, but also likely a little overkill.

@noschiff
Copy link
Member Author

noschiff commented Sep 4, 2022

@zachary0kent

If we're going to be parsing command line arguments, I think it would be better to use a CLI library. That way, we can have named arguments for the source and destination, as well as switches for boolean arguments.

That sounds good! I've only ever manually handled command line arguments, let alone done anything with the command line in TypeScript. What do you suggest?

I think the minimist package would be good here. yargs is fun and pirate-themed, but also likely a little overkill.

Would that work fine with how we run typescript using npm run ts-node which then calls ts-node -T -P tsconfig.node.json?

@zachary-kent
Copy link
Collaborator

@zachary0kent

If we're going to be parsing command line arguments, I think it would be better to use a CLI library. That way, we can have named arguments for the source and destination, as well as switches for boolean arguments.

That sounds good! I've only ever manually handled command line arguments, let alone done anything with the command line in TypeScript. What do you suggest?

I think the minimist package would be good here. yargs is fun and pirate-themed, but also likely a little overkill.

Would that work fine with how we run typescript using npm run ts-node which then calls ts-node -T -P tsconfig.node.json?

Yeah I think it should be fine.

@handotdev
Copy link
Member

Wow PR #711. Go team!

Just wanted to say hi this is Han I was a former PM at CoursePlan

Copy link
Collaborator

@zachary-kent zachary-kent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minimist can be a dev dependency. Also, I think fromUser and toUser should be separate arguments, instead of being bundled into the argument for the source/destination environment.

@noschiff
Copy link
Member Author

noschiff commented Sep 5, 2022

minimist can be a dev dependency. Also, I think fromUser and toUser should be separate arguments, instead of being bundled into the argument for the source/destination environment.

I think it's nice to have them in one argument because it refers to a pseudo path to the user's data. Also, the command gets very long with 5 separate arguments.

@zachary-kent
Copy link
Collaborator

minimist can be a dev dependency. Also, I think fromUser and toUser should be separate arguments, instead of being bundled into the argument for the source/destination environment.

I think it's nice to have them in one argument because it refers to a pseudo path to the user's data. Also, the command gets very long with 5 separate arguments.

That's true--I think what I mainly take issue with is how easy it is to enter arguments in an incorrect format. How would you feel about having the from and to environment arguments being switches instead (i.e. --fromDev or --fromDev=true)? I also think we should print an error if any arguments (like the user) are invalid.

Copy link
Collaborator

@benjamin-shen benjamin-shen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I appreciate the idea of copying data and I think it's a good idea! Some questions:

  • Did you test this script successfully?
  • What happens if one of the users doesn't exist?
  • Can you put some example CLI commands for running this script in the PR description?

How would you feel about having the from and to environment arguments being switches instead (i.e. --fromDev or --fromDev=true)?

I like this approach better, except it should be for dev by default and there should be a switch to run it in prod. Similarly, for execute=true, it should be false by default (our previous scripts used a --dry-run switch but I think it should be a dry run by default).

Some broader thoughts:

  • We should keep in mind that dev and prod could have different schemas. Maybe we should keep a log of database copies between environments just in case something breaks
  • There is currently no way to roll back changes, eg. if the script fails halfway through (db atomicity). Should there be? Maybe we want a way to back up data 😱
  • In theory it's possibly for two people to run the same or unrelated scripts at the same time, which is an issue for scripts that affect multiple collections (db isolation). Maybe we should post in slack whenever we run a script that writes to database

src/admin-copy-user-data.ts Outdated Show resolved Hide resolved
src/admin-copy-user-data.ts Outdated Show resolved Hide resolved
src/admin-copy-user-data.ts Outdated Show resolved Hide resolved
src/admin-copy-user-data.ts Outdated Show resolved Hide resolved
}
}
}
return copied;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How much data does this look like for an average user? Is it small enough that it's readable in the console?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, way too much to read in the console. But I wanted the user to be able to see what they will be copying over first. Maybe I should write it to a file? Firestore doesn't exactly use JSONs, but I could probably make something work?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's way too much to read in the console, then we probably shouldn't print everything. Maybe we can write it to a file or multiple files (one per collection) -- if you go this route, make sure to add the output to the gitignore. If possible, we should still have some console outputs (eg. alerts for starting and ending copying a collection) that'll help ensure correctness.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup it writes to a file now based on the -o argument.

src/admin-copy-user-data.ts Outdated Show resolved Hide resolved
@noschiff
Copy link
Member Author

noschiff commented Sep 9, 2022

@benjamin-shen I'll read more later. Thanks for the review Ben!!!

There is currently no way to roll back changes, eg. if the script fails halfway through (db atomicity). Should there be? Maybe we want a way to back up data 😱

Yes, this is dangerous. I could catch all the errors, store the data I'm overwriting, and then write that data back to the doc. It's an interesting problem because there's many things that could fail… even just writing the data back to the doc could cause issues.

We should keep in mind that dev and prod could have different schemas. Maybe we should keep a log of database copies between environments just in case something breaks

I was thinking that maybe we could run through the data we're replacing to "understand" if our new data will match the schema, but honestly that wouldn't be too useful especially if we're trying to copy to dev to understand a broken doc. Writing to prod is a lot more dangerous and it would be good to have an invariant checker to ensure that the data we write is okay, but honestly we should never write to prod from this. So, I think we can maybe get away with blindly copying data to dev because there's no harm in breaking our own dev users. Can you explain what you're thinking with the log? Are you talking about a way to revert back?

@benjamin-shen
Copy link
Collaborator

Yes, this is dangerous. I could catch all the errors, store the data I'm overwriting, and then write that data back to the doc. It's an interesting problem because there's many things that could fail… even just writing the data back to the doc could cause issues.

We can probably just accept that we can't guarantee atomicity and rerun the script if it failed midway.

Writing to prod is a lot more dangerous and it would be good to have an invariant checker to ensure that the data we write is okay, but honestly we should never write to prod from this. So, I think we can maybe get away with blindly copying data to dev because there's no harm in breaking our own dev users.

This is a good point. I agree

How would you feel about having the from and to environment arguments being switches instead (i.e. --fromDev or --fromDev=true)?

I like this approach better, except it should be for dev by default and there should be a switch to run it in prod. Similarly, for execute=true, it should be false by default (our previous scripts used a --dry-run switch but I think it should be a dry run by default).

Could you incorporate this? Let me know if you need some clarification

@benjamin-shen
Copy link
Collaborator

benjamin-shen commented Sep 26, 2022

@noschiff looking good! Can you

  1. change the example command in the script documentation to not use your email addresses and to not include --execute
  2. pretty print the output json so it's not on a single line

@noschiff noschiff merged commit 4d2991b into master Sep 26, 2022
@noschiff noschiff deleted the copy-user-data branch September 26, 2022 03:49
noschiff added a commit that referenced this pull request Oct 26, 2022
* write ts script to copy user data

* rewrite to use command line arguments

* update credential input to modular SDK

* move core logic into a function

* clean up code

* add option to preview changes

* validate arguments outside of function

* narrow types with union

* rename function arguments

* rename variable

* put arguments into options object

* refactor command line arguments to use minimist

* add minimist types

* make minimist dev dependency

* update package-lock.json

* move to scripts directory

* rename variables

* improve output logging for user

* pretty-print output

* update example in comments
@noschiff noschiff mentioned this pull request Oct 27, 2022
43 tasks
@noschiff noschiff added the util Utility and tools for development, like scripts or workflow automation label Nov 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
util Utility and tools for development, like scripts or workflow automation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants