How to manage NA when ordering data frames

Let's read the citiesNA.csv file:

> data <- read.csv("citiesNA.csv")
> data
      cities countries areakm2 populationk
1   Shanghai     China    2643       21766
2    Beijing     China    1368       21500
3        NYC       USA Unknown        8406
4         LA       USA    1302        3884
5     London        UK    1737     Unknown
6 Manchester        UK     116         255
> class(data)
[1] "data.frame"

In this case, the areakm2 and populationk columns contain the text "Unknown", which we want to treat as NA. Below is a function that orders the data frame by a given column and keeps only complete rows, that is, rows with no NA values. This means that NYC and London are excluded from the result.

## When strings are read as factors (stringsAsFactors = TRUE, the default before R 4.0.0),
## subsetting the data frame by column returns a factor:
## > class(data[,2])
## [1] "factor"
## data[,2] prints as a factor, which for our purposes is harder to handle:
## [1] China China USA   USA   UK    UK   
## Levels: China UK USA
## One way to extract a character vector from a factor is to subset its levels by the factor itself.
## levels(data[,2]) returns a character vector of the levels:
## [1] "China" "UK"    "USA"
## levels(data[,2])[data[,2]] returns a character vector with the content of data[,2]:
## [1] "China" "China" "USA"   "USA"   "UK"    "UK"  
## as.character(data[,2]) gives the same result and also works when the column is already
## character (the read.csv default since R 4.0.0), so the function below uses it.
## suppressWarnings() silences R's warnings. When we coerce a character vector that mixes numbers
## and text to numeric, the text becomes NA and R warns about it ("NAs introduced by coercion").
## This is the case for columns 3 and 4, where "Unknown" becomes NA.
## complete.cases() returns a logical vector that is TRUE for the rows without any NA. Subsetting
## the data frame by this vector keeps only the complete cases.
sort_by_column_NA <- function(data,column){
    for (i in 3:4){
        data[,i] <- suppressWarnings(as.numeric(as.character(data[,i])))
    }
    orderdata <- data[order(data[,column]),]
    orderdata <- orderdata[complete.cases(orderdata),] 
    return(orderdata)
}
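
To see why the conversion goes through the text labels instead of calling as.numeric() on the factor directly, here is a small sketch. The vector `area` and the data frame `df` are made-up stand-ins for the areakm2 column, not data read from the file.

```r
# Hypothetical stand-in for the areakm2 column when it is read as a factor
area <- factor(c("2643", "1368", "Unknown", "1302"))

# as.numeric() on a factor returns the internal level codes, not the values
codes <- as.numeric(area)
print(codes)    # 3 2 4 1

# Going through the labels recovers the numbers; "Unknown" becomes NA
values <- suppressWarnings(as.numeric(levels(area)[area]))
same   <- suppressWarnings(as.numeric(as.character(area)))
print(values)   # 2643 1368 NA 1302

# complete.cases() gives one TRUE/FALSE per row; FALSE marks a row with an NA
df <- data.frame(city = c("Shanghai", "Beijing", "NYC", "LA"), area = values)
keep <- complete.cases(df)
print(keep)     # TRUE TRUE FALSE TRUE
```

Both `levels(area)[area]` and `as.character(area)` produce the same character vector here; the second form also accepts a column that is already character.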

Examples:

> sort_by_column_NA(data,3)
      cities countries areakm2 populationk
6 Manchester        UK     116         255
4         LA       USA    1302        3884
2    Beijing     China    1368       21500
1   Shanghai     China    2643       21766
> sort_by_column_NA(data,1)
      cities countries areakm2 populationk
2    Beijing     China    1368       21500
4         LA       USA    1302        3884
6 Manchester        UK     116         255
1   Shanghai     China    2643       21766
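
As an alternative sketch, read.csv() can mark "Unknown" as NA at import time through its na.strings argument, and order() can keep or drop NA values through na.last. The inline text below assumes a comma-separated layout and reproduces the table above so that the example runs without the file; the names csv_text, cities and the by_area_* variables are made up for this illustration.

```r
# Inline copy of the table above (assumed comma-separated), so citiesNA.csv is not needed
csv_text <- "cities,countries,areakm2,populationk
Shanghai,China,2643,21766
Beijing,China,1368,21500
NYC,USA,Unknown,8406
LA,USA,1302,3884
London,UK,1737,Unknown
Manchester,UK,116,255"

# na.strings turns "Unknown" into NA, so both columns are read as numbers
cities <- read.csv(text = csv_text, na.strings = "Unknown")

# na.last = TRUE (the default) puts rows with NA in the sort column at the end
by_area_keep <- cities[order(cities$areakm2, na.last = TRUE), ]

# na.last = NA drops rows whose sort column is NA (London stays: its area is known)
by_area_drop <- cities[order(cities$areakm2, na.last = NA), ]

# Dropping every incomplete row matches the rows of sort_by_column_NA(data, 3)
by_area_complete <- by_area_keep[complete.cases(by_area_keep), ]
print(by_area_complete)
```

Note that na.last only looks at the column being sorted, while complete.cases() checks every column, which is why London survives the second ordering but not the third.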