Skip to content

sehmilo/Customer-Segmentation-in-the-US

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Data Science Project

Table of Contents

  1. Overview
  2. About
  3. Dataframe Structure
  4. Data Dictionary
  5. Acknowledgement

Overview

In this project, I used data obtained from 2019 Survey of Consumer Finances data from the US Federal Reserve.

This project is an example of unsupervised learning, specifically CLUSTERING. It can be used in commercial contexts for marketing or customer segmentation or in sociological contexts to study social stratification.

Objectives include:

  • Compare characteristics across subgroups using a side-by-side bar chart.
  • Build a k-means clustering model.
  • Conduct feature selection for clustering based on variance.
  • Reduce high-dimensional data using principal component analysis (PCA).

About

The Survey of Consumer Finances (SCF) is normally a triennial cross-sectional survey of U.S. families. The survey data include information on families’ balance sheets, pensions, income, and demographic characteristics. Information is also included from related surveys of pension providers and the earlier such surveys conducted by the Federal Reserve Board. No other study for the country collects comparable information. Data from the SCF are widely used, from analysis at the Federal Reserve and other branches of government to scholarly work at the major economic research centers. The survey has contained a panel element over two periods.

  • Respondents to the 1983 survey were re-interviewed in 1986 and 1989.
  • Respondents to the 2007 survey were re-interviewed in 2009.

The study is sponsored by the Federal Reserve Board in cooperation with the Department of the Treasury. Since 1992, data have been collected by the NORC at the University of Chicago.

To ensure the representativeness of the study, respondents are selected randomly using procedures described in the technical working papers on this web site.

A strong attempt is made to select families from all economic strata. Participation in the study is strictly voluntary. However, because only about 6,500 families were interviewed in the most recent study, every family selected is very important to the results. To maintain the scientific validity of the study, interviewers are not allowed to substitute respondents for families that do not participate. Thus, if a family declines to participate, it means that families like theirs may not be represented clearly in national discussions.

The confidentiality of the information provided in the study is of the highest importance to NORC and the Federal Reserve. Strenuous efforts are made to protect the privacy of participants, and in the history of the survey, there has never been a leak. The names of the participants in the survey are known only to NORC, which has more than 50 years of successful experience in collecting confidential information.

Dataframe Structure

alt

Data Dictionary

Feature Description
ACTBUS Total value of actively managed business(es), 2019 dollars
AGE Age of reference person
ASSET Total value of assets held by household, 2019 dollars
BUS Total value of business(es) in which the household has either an active or nonactive interest, 2019 dollars
DEBT Total value of debt held by household, 2019 dollars
EQUITY Total value of financial assets held by household that are invested in stock, 2019 dollars
FIN Total value of financial assets held by household, 2019 dollars
HBUS Have active or non-actively managed business(es)
HOUSES Total value of primary residence of household, 2019 dollars
INCCAT Income percentile groups
INCOME Total amount of income of household, 2019 dollars
KGBUS Unrealized capital gains or losses on businesses, 2019 dollars
KGTOTAL Total unrealized capital gains or losses for the household, 2019 dollars
NETWORTH Total net worth of household, 2019 dollars
NHNFIN total non-financial assets excluding principal residences, 2019 dollars
NFIN Total value of non-financial assets held by household, 2019 dollars

The above do not represent the entirety of the features in the dataset but rather commonly used features for this particular project.For details on full glossary and documentation visit the Survey Documentation and Analysis codebook as processed and compiled by the University of California, Berkeley

Acknowledgement

This Dataset is created from 2019 Survey of Consumer Finances through WorldQuant University Applied Data Science Lab project assessment.

Special thanks to Professor Nicholas Cifuentes-Goodbody for this project 🙏.

If you're interested in learning more about this dataset, or clustering in general, some projects that served as inspiration for this project are:

  • Hennig, C., & Liao, T. F. (2013). "How to find an appropriate clustering for mixed-type variables with application to socio-economic stratification." Journal of the Royal Statistical Society: Series C (Applied Statistics), 62(3), 309–369. https://doi.org/10.1111/j.1467-9876.2012.01066.
  • Tatsat, H., Puri, S., & Lookabaugh, B. (2020). Machine learning and data science blueprints for finance: From building trading strategies to robo-advisors using Python. O’Reilly. https://github.com/tatsath/fin-ml.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published