Instructions

Overview: For each question, show your R code that you used to answer each question in the provided chunks. When a written response is required, be sure to answer the entire question in complete sentences outside the code chunks. When figures are required, be sure to follow all requirements to receive full credit. Point values are assigned for every part of this analysis.

Helpful: Make sure you knit the document as you go through the assignment. Check all your results in the created HTML file.

Submission: Submit via an electronic document on Canvas. Must be submitted as an HTML file generated in RStudio.

Introduction

Universities are typically opaque, bureaucratic institutions. To be transparent to tax payers, many public schools, such as the University of North Carolina, openly report salary information. In this assignment, we will analyze this information to answer pivotal questions that have endured over the course of time. The most recent salary data for UNC-Chapel Hill faculty and staff has already been downloaded in CSV format and titled “UNCCH_Salary.csv”.

To answer all the questions, you will need the R package tidyverse to make figures and utilize dplyr functions.

Data Information

Make sure the CSV data file is contained in the folder of your RMarkdown file. First, we start by using the read_csv function from the readr package found within the tidyverse. The code below executes this process by creating a tibble in your R environment named “salary”.

salary=read_csv("UNCCH_Salary.csv")

Now, we will explore the information that is contained in this dataset. The code below provides the names of the variables contained in the dataset.

names(salary)
## [1] "institution" "last_name"   "first_name"  "middle_init" "age"        
## [6] "hiredate"    "position"    "totalsal"    "dept"

Next, we will examine the type of data contains in these different variables.

str(salary,give.attr=F)
## spc_tbl_ [14,269 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ institution: chr [1:14269] "UNC-CH" "UNC-CH" "UNC-CH" "UNC-CH" ...
##  $ last_name  : chr [1:14269] "ABAD FERNANDEZ" "ABAJAS" "ABARBANELL" "ABBENANTE" ...
##  $ first_name : chr [1:14269] "MARIA" "YASMINA" "Jeffery" "DOMINIC" ...
##  $ middle_init: chr [1:14269] NA "L" "S" "A" ...
##  $ age        : num [1:14269] 38 41 67 41 61 55 46 53 74 56 ...
##  $ hiredate   : chr [1:14269] "1-Feb-14" "1-Jul-12" "1-Jan-99" "16-Jun-14" ...
##  $ position   : chr [1:14269] "Assistant Professor" "Associate Professor" "Associate Professor" "IT Technical Support/Paraprofessional" ...
##  $ totalsal   : num [1:14269] 94974 179000 205690 68271 128042 ...
##  $ dept       : chr [1:14269] "Microbiology and Immunology" "Peds-Hematology/Oncology" "Kenan-Flagler Bus Sch" "Kenan-Flagler Bus Sch" ...

You will notice that the variable “hiredate” is recorded as a character. The following code will first modify the original dataset to change this to a date variable with the format %d-%b-%y. Then, we will remove the hyphens to create a numeric variable as yyyymmdd. Finally, in the spirit of tidyverse, we will convert this data frame to a tibble.

salary$hiredate=as.Date(salary$hiredate, format="%d-%b-%y")
salary$hiredate=as.numeric(gsub("-","",salary$hiredate))
salary=as_tibble(salary)

Now, we will use head() to view of first five rows and the modifications made to the original data. The rest of the assignment will extend off this modified dataset named salary which by now should be in your global environment.

head(salary,5)
## # A tibble: 5 × 9
##   institution last_name  first_name middle_init   age hiredate position totalsal
##   <chr>       <chr>      <chr>      <chr>       <dbl>    <dbl> <chr>       <dbl>
## 1 UNC-CH      ABAD FERN… MARIA      <NA>           38 20140201 Assista…    94974
## 2 UNC-CH      ABAJAS     YASMINA    L              41 20120701 Associa…   179000
## 3 UNC-CH      ABARBANELL Jeffery    S              67 19990101 Associa…   205690
## 4 UNC-CH      ABBENANTE  DOMINIC    A              41 20140616 IT Tech…    68271
## 5 UNC-CH      ABELS      KIMBERLY   T              61 19950801 Academi…   128042
## # ℹ 1 more variable: dept <chr>

Assignment

Part 1: Reducing the Data to a Smaller Set of Interest

Q1 (2 Points)

Create a new dataset named salary2 that only contains the following variables:

  • “last_name”

  • “first_name”

  • “middle_init”

  • “dept”

  • “position”

  • “hiredate”

  • “totalsal”

Then, use the names() function to display the variable names of salary2.

#

Q2 (2 Points)

Now, we modify salary2. Rename the variables “middle_init”, “dept”,“position”,“totalsal” to “MiddleInit”, “Department”,“Job”, and “Salary”, respectively. Do this for a new dataset called salary3 and use names() to display the variable names of salary3.

#

Q3 (4 Points)

Now, we modify salary3.

  • Create a new variable “Name” that paste the “last_name” and “first_name” together with “,” as separation.
  • Create a new variable called “HireYear” that only contains the first four digits of the variable “hiredate” in a new dataset named salary4. Hint: Use the concept seen in the conversion of flight times to minutes since midnight. Use the function str() to ensure that your new variable “HireYear” reports the year of the date that the employee was hired.
#

Q4 (2 points)

Now, we modify salary4. Create a new variable called “YrsEmployed” which reports the number of years the employee has worked at UNC. Create a new dataset named salary5 and again use str() to display the variables in salary5. (Use 2024 to create YrsEmployed)

#

Q5 (4 points)

Now, we modify salary5 to create our final dataset named salary.final. Use the pipe %>% to make the following changes:

  • Drop the variables “last_name”, “first_name”, “hiredate” and “HireYear”.

  • Sort the observations first by “Department” and then by “YrsEmployed”.

  • Rearrange the variables so that “Name”, “YrsEmployed” and “Salary” are the first three variables in the dataset, in that order, without removing any of the other variables.

After you have used the %>% to make these changes, use the function head() to display the first 10 rows of salary.final.

#

Part 2: Answering Questions Based on All Data

Q6 (2 Points)

What is the average salary of employees in the School of Law?

Code (1 Point):

#

Answer (1 Point): (Place Answer Here Using Complete Sentences)

Q7 (3 Points)

How many employees in Family Medicine have worked between 5 and 8 years (inclusive) and have a middle initial ‘A’?

Code (2 Points):

#

Answer (1 Point): (Place Answer Here Using Complete Sentences)

Q8 (2 Points)

What is the mean salary of employees from the Linguistics department who are professors, associate professors, or assistant professors?

Code (2 Points):

#

Answer (1 Point): (Place Answer Here Using Complete Sentences)

Part 3: Answering Questions Based on Summarized Data

Q9 (4 Points)

Based off the data in salary.final, create a grouped summary based off combinations of “Department” and “YrsEmployed”. Call the new tibble deptyear_summary. Your summarized tibble, deptyear_summary, should report all of the following statistics with corresponding variable names in the following order.

  • “n” = number of employees for each combination

  • “mean” = average salary for each combination

  • “sd” = standard deviation of salary for each combination.

  • “min” = minimum salary for each combination.

  • “max” = maximum salary for each combination

In the process, make sure you use .groups='drop' to release the grouping so future work is no longer group specific. Following the creation of deptyear_summary, prove that your code worked by using head() to view the first 10 rows.

#

Q10 (4 Points)

Using the summarized data in deptyear_summary, use the dplyr functions to identify the 3 departments that award the lowest average salary for employees who have been employed for 9 years. The output should only show the 3 departments along with the corresponding years employeed, which should all be 9, and the four summarizing statistics created.

Furthermore, explain why the standard deviations for two of the departments in your list have salary standard deviations of “NA”. What does this mean and how did it occur?

Code (2 Points):

#

Answer (2 Points): (Place Answer Here Using Complete Sentences)

Q11 (4 points)

Create a scatter plot using geom_point() along with fitted lines using geom_smooth with the argument method="lm" showing the linear relationship between average salary and the years employeed. For this plot, use the summarized data in deptyear_summary. Following the plot, please explain what this plot suggests about the relationship between the salary a UNC employee makes and how many years that employee has served. Make reference to the figure and use descriptive adjectives (i.e. “strong”, “weak”, etc.) and terms (i.e. “positive”, “negative”, etc.) that are appropriate for discussing linear relationships.

Code and Figure (2 Points):

#

Answer (2 Points): (Place Answer Here Using Complete Sentences)

Q12 (6 Points)

The purpose of summarizing the data was to analyze the previously discussed linear relationship by group. In deptyear_summary, there are 754 unique departments represented. You can verify this by using length(unique(deptyear_summary$Department)). In this part, I want you to select 5 academic departments, not previously discussed, and in one figure, display the scatter plots and fitted regression lines representing the relationship between average salary and years employed in 5 different colors. Then, in complete sentences, I want you to state what departments you chose and explain the differences and/or similarities between the groups regarding the previously mentioned relationship. Compare departments on the starting salary and the rate of increase in salary based on the fitted lines.

Code and Figure: (3 Points):

#

Answer (3 Points): (Place Answer Here Using Complete Sentences)