remove garbage words and ensure some very common words in every game
dylandhall committed Jun 27, 2024
1 parent 5f82c4d commit 426e843
Showing 7 changed files with 9,096 additions and 280 deletions.
80 changes: 80 additions & 0 deletions asset_generation/README.md
@@ -57,4 +57,84 @@ var finalUncommonWords = uncommonWordsSet
await File.WriteAllLinesAsync(@"C:\code\wordgame\assets\common-long-words.txt", finalCommonWords);
await File.WriteAllLinesAsync(@"C:\code\wordgame\assets\uncommon-long-words.txt", finalUncommonWords);
```

Later on I found that the Google 10,000 most common words list actually includes a lot of proper nouns and porn site names, so I generated a list of the words that would be removed, which I could then use as the starting point for a whitelist:

```
// Assumes implicit usings (System.IO, System.Linq, System.Collections.Generic) are enabled.
using System.Text.RegularExpressions;

// Matches any character that is not a "word" character. Note that \w also
// accepts digits and underscores, not just letters.
// Declared first because the filters below use it.
var nonAlpha = new Regex(@"[^\w]");

var veryCommon = @"C:\code\wordgame\asset_generation\google-10000-english.txt";
var allOtherWords = new string[] {
    @"C:\code\wordgame\asset_generation\popular.txt",
    @"C:\code\wordgame\asset_generation\english2.txt",
    @"C:\code\wordgame\asset_generation\usa.txt",
    @"C:\code\wordgame\asset_generation\english3.txt",
    @"C:\code\wordgame\asset_generation\engmix.txt",
    @"C:\code\wordgame\asset_generation\ukenglish.txt",
    @"C:\code\wordgame\asset_generation\usa2.txt"
};

// Google words longer than three characters that contain only word characters.
var allLines = await File.ReadAllLinesAsync(veryCommon);
var googleWords = allLines
    .Where(w => w.Length > 3 && !nonAlpha.IsMatch(w))
    .Select(w => w.ToLower())
    .OrderBy(w => w)
    .ToList();

// Union of every other dictionary, used as the reference for "real" words.
var commonWordsSet = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
foreach (var list in allOtherWords)
{
    commonWordsSet.UnionWith(await File.ReadAllLinesAsync(list));
}

var validOtherWords = commonWordsSet
    .Where(w => w.Length > 3 && !nonAlpha.IsMatch(w))
    .Select(w => w.ToLower())
    .OrderBy(w => w)
    .ToList();

// Google words that appear in none of the other dictionaries.
var invalidGoogleWords = googleWords.Except(validOtherWords, StringComparer.OrdinalIgnoreCase).ToList();

// Both files start out identical; the whitelist copy is pruned by hand afterwards.
await File.WriteAllLinesAsync(@"C:\code\wordgame\asset_generation\whitelisted-google-words.txt", invalidGoogleWords);
await File.WriteAllLinesAsync(@"C:\code\wordgame\asset_generation\invalid-google-words.txt", invalidGoogleWords);
```
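
To see what the length and `nonAlpha` filter actually lets through, here is a minimal sketch run against a handful of made-up sample words (the samples are assumptions for illustration, not entries taken from the real lists). It shows that `\w` also accepts digits, so something like "mp3s" survives the filter even though it isn't purely alphabetic:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Same pattern as in the script above.
var nonAlpha = new Regex(@"[^\w]");

// Hypothetical sample input, chosen to exercise each rule.
var samples = new[] { "it", "inbox", "e-mail", "don't", "mp3s", "Wikipedia" };

var kept = samples
    .Where(w => w.Length > 3 && !nonAlpha.IsMatch(w)) // drops short words and anything with punctuation
    .Select(w => w.ToLower())
    .OrderBy(w => w)
    .ToList();

Console.WriteLine(string.Join(", ", kept)); // prints "inbox, mp3s, wikipedia"
```

If only letters should pass, a stricter pattern such as `[^a-zA-Z]` would be one option, though that is a change from what the script above does.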

Then I went through the whitelist file and deleted every word I didn't want to keep, leaving only the ones I'd be annoyed to see rejected even though they aren't strictly "real words".

The main one that convinced me I needed to do this was "inbox" - I know, it's supposed to be hyphenated, but in 2024 it's such a common term that it just seems wrong not to give credit for it. The same goes for words like "screensaver".
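
As a rough illustration of how the whitelist protects words like these, the sketch below subtracts a hand-pruned whitelist from the invalid list. The word lists here are made-up samples, not the contents of the real files:

```csharp
using System;
using System.Linq;

// Hypothetical sample data standing in for invalid-google-words.txt
// and the hand-pruned whitelisted-google-words.txt.
var invalidGoogleWords = new[] { "inbox", "screensaver", "verizon", "wikipedia" };
var whitelisted = new[] { "Inbox", "screensaver" };

// Anything flagged as invalid but not whitelisted is what gets removed.
// The comparison ignores case, so "Inbox" protects "inbox".
var toRemove = invalidGoogleWords
    .Except(whitelisted, StringComparer.OrdinalIgnoreCase)
    .ToArray();

Console.WriteLine(string.Join(", ", toRemove)); // prints "verizon, wikipedia"
```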

Then I used those lists to remove the garbage from the main common word files and to generate an additional very common words file, which is used to ensure that a randomly generated game includes a certain number of familiar words:

```
// Assumes implicit usings (System.IO, System.Linq) are enabled.
using System.Text.RegularExpressions;

var commonWordsFilename = @"C:\code\wordgame\assets\common-long-words.txt";
var existingCommonWordFile = await File.ReadAllLinesAsync(commonWordsFilename);
var google10KFile = await File.ReadAllLinesAsync(@"C:\code\wordgame\asset_generation\google-10000-english.txt");
var whitelistedFile = await File.ReadAllLinesAsync(@"C:\code\wordgame\asset_generation\whitelisted-google-words.txt");
var toRemoveFile = await File.ReadAllLinesAsync(@"C:\code\wordgame\asset_generation\invalid-google-words.txt");

// Words that survived the manual whitelist pruning must not be removed.
toRemoveFile = toRemoveFile.Except(whitelistedFile, StringComparer.OrdinalIgnoreCase).ToArray();

// Matches any non-word character (\w also accepts digits and underscores).
var nonAlpha = new Regex(@"[^\w]");

// Existing common words with the garbage stripped out.
var updatedCommonWords = existingCommonWordFile
    .Except(toRemoveFile, StringComparer.OrdinalIgnoreCase)
    .OrderBy(w => w)
    .ToList();

// Cleaned Google list: long enough, word characters only, no garbage.
var googleWords = google10KFile
    .Where(w => w.Length > 3 && !nonAlpha.IsMatch(w))
    .Select(w => w.ToLower())
    .Except(toRemoveFile, StringComparer.OrdinalIgnoreCase)
    .OrderBy(w => w)
    .ToList();

await File.WriteAllLinesAsync(@"C:\code\wordgame\assets\very-common-long-words.txt", googleWords);
await File.WriteAllLinesAsync(commonWordsFilename, updatedCommonWords);
```
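
How the game consumes `very-common-long-words.txt` isn't shown here, so the following is only a sketch of one way the "certain number of familiar words" guarantee could work. `PickWords`, its parameters, and the sample words are all hypothetical names made up for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: picks `total` words, with at least `minFamiliar`
// of them drawn from the very common list.
static List<string> PickWords(IReadOnlyList<string> veryCommon, IReadOnlyList<string> others,
                              int total, int minFamiliar, Random rng)
{
    // Take the guaranteed familiar words first (random order, then the first few).
    var familiar = veryCommon.OrderBy(_ => rng.Next()).Take(minFamiliar).ToList();

    // Fill the remaining slots from the combined pool, skipping words already chosen.
    var rest = veryCommon.Concat(others)
        .Distinct(StringComparer.OrdinalIgnoreCase)
        .Except(familiar, StringComparer.OrdinalIgnoreCase)
        .OrderBy(_ => rng.Next())
        .Take(total - familiar.Count);

    return familiar.Concat(rest).ToList();
}

// Made-up sample lists; the real ones would be read from the asset files.
var veryCommon = new[] { "about", "house", "water", "people" };
var others = new[] { "zephyr", "quixotic", "about" };

var words = PickWords(veryCommon, others, total: 5, minFamiliar: 2, rng: new Random(42));
Console.WriteLine(words.Count); // prints 5
```

Picking the familiar words first and then topping up from the combined pool keeps the guarantee simple: the minimum always holds, and the rest of the selection can still include more very common words by chance.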