Overview: For each question, show your R code that you used to answer each question in the provided chunks. When a written response is required, be sure to answer the entire question in complete sentences outside the code chunks. When figures are required, be sure to follow all requirements to receive full credit. Point values are assigned for every part of this analysis.
Helpful: Make sure you knit the document as you go through the assignment. Check all your results in the created HTML file.
Submission: Submit via an electronic document on Canvas. Must be submitted as an HTML file generated in RStudio.
Universities are typically opaque, bureaucratic institutions. To be transparent to tax payers, many public schools, such as the University of North Carolina, openly report salary information. In this assignment, we will analyze this information to answer pivotal questions that have endured over the course of time. The most recent salary data for UNC-Chapel Hill faculty and staff has already been downloaded in CSV format and titled “UNCCH_Salary.csv”.
To answer all the questions, you will need the R package
tidyverse
to make figures and utilize dplyr
functions.
Make sure the CSV data file is contained in the folder of your
RMarkdown file. First, we start by using the read_csv
function from the readr
package found within the tidyverse.
The code below executes this process by creating a tibble in your R
environment named “salary”.
salary=read_csv("UNCCH_Salary.csv")
Now, we will explore the information that is contained in this dataset. The code below provides the names of the variables contained in the dataset.
names(salary)
## [1] "institution" "last_name" "first_name" "middle_init" "age"
## [6] "hiredate" "position" "totalsal" "dept"
Next, we will examine the type of data contains in these different variables.
str(salary,give.attr=F)
## spc_tbl_ [14,269 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ institution: chr [1:14269] "UNC-CH" "UNC-CH" "UNC-CH" "UNC-CH" ...
## $ last_name : chr [1:14269] "ABAD FERNANDEZ" "ABAJAS" "ABARBANELL" "ABBENANTE" ...
## $ first_name : chr [1:14269] "MARIA" "YASMINA" "Jeffery" "DOMINIC" ...
## $ middle_init: chr [1:14269] NA "L" "S" "A" ...
## $ age : num [1:14269] 38 41 67 41 61 55 46 53 74 56 ...
## $ hiredate : chr [1:14269] "1-Feb-14" "1-Jul-12" "1-Jan-99" "16-Jun-14" ...
## $ position : chr [1:14269] "Assistant Professor" "Associate Professor" "Associate Professor" "IT Technical Support/Paraprofessional" ...
## $ totalsal : num [1:14269] 94974 179000 205690 68271 128042 ...
## $ dept : chr [1:14269] "Microbiology and Immunology" "Peds-Hematology/Oncology" "Kenan-Flagler Bus Sch" "Kenan-Flagler Bus Sch" ...
You will notice that the variable “hiredate” is recorded as a character. The following code will first modify the original dataset to change this to a date variable with the format %d-%b-%y. Then, we will remove the hyphens to create a numeric variable as yyyymmdd. Finally, in the spirit of tidyverse, we will convert this data frame to a tibble.
salary$hiredate=as.Date(salary$hiredate, format="%d-%b-%y")
salary$hiredate=as.numeric(gsub("-","",salary$hiredate))
salary=as_tibble(salary)
Now, we will use head()
to view of first five rows and
the modifications made to the original data. The rest of the assignment
will extend off this modified dataset named salary
which by
now should be in your global environment.
head(salary,5)
## # A tibble: 5 × 9
## institution last_name first_name middle_init age hiredate position totalsal
## <chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl>
## 1 UNC-CH ABAD FERN… MARIA <NA> 38 20140201 Assista… 94974
## 2 UNC-CH ABAJAS YASMINA L 41 20120701 Associa… 179000
## 3 UNC-CH ABARBANELL Jeffery S 67 19990101 Associa… 205690
## 4 UNC-CH ABBENANTE DOMINIC A 41 20140616 IT Tech… 68271
## 5 UNC-CH ABELS KIMBERLY T 61 19950801 Academi… 128042
## # ℹ 1 more variable: dept <chr>
Create a new dataset named salary2
that only contains
the following variables:
“last_name”
“first_name”
“middle_init”
“dept”
“position”
“hiredate”
“totalsal”
Then, use the names()
function to display the variable
names of salary2
.
#
Now, we modify salary2
. Rename the variables
“middle_init”, “dept”,“position”,“totalsal” to “MiddleInit”,
“Department”,“Job”, and “Salary”, respectively. Do this for a new
dataset called salary3
and use names()
to
display the variable names of salary3
.
#
Now, we modify salary3
.
salary4
. Hint: Use the concept seen in the conversion
of flight times to minutes since midnight. Use the function
str()
to ensure that your new variable “HireYear” reports
the year of the date that the employee was hired.#
Now, we modify salary4
. Create a new variable called
“YrsEmployed” which reports the number of years the employee has worked
at UNC. Create a new dataset named salary5
and again use
str()
to display the variables in salary5
.
(Use 2024 to create YrsEmployed
)
#
Now, we modify salary5
to create our final dataset named
salary.final
. Use the pipe %>%
to make the
following changes:
Drop the variables “last_name”, “first_name”, “hiredate” and “HireYear”.
Sort the observations first by “Department” and then by “YrsEmployed”.
Rearrange the variables so that “Name”, “YrsEmployed” and “Salary” are the first three variables in the dataset, in that order, without removing any of the other variables.
After you have used the %>%
to make these changes,
use the function head()
to display the first 10 rows of
salary.final
.
#
What is the average salary of employees in the School of Law?
Code (1 Point):
#
Answer (1 Point): (Place Answer Here Using Complete Sentences)
How many employees in Family Medicine have worked between 5 and 8 years (inclusive) and have a middle initial ‘A’?
Code (2 Points):
#
Answer (1 Point): (Place Answer Here Using Complete Sentences)
What is the mean salary of employees from the Linguistics department who are professors, associate professors, or assistant professors?
Code (2 Points):
#
Answer (1 Point): (Place Answer Here Using Complete Sentences)
Based off the data in salary.final
, create a grouped
summary based off combinations of “Department” and “YrsEmployed”. Call
the new tibble deptyear_summary
. Your summarized tibble,
deptyear_summary
, should report all of the following
statistics with corresponding variable names in the following order.
“n” = number of employees for each combination
“mean” = average salary for each combination
“sd” = standard deviation of salary for each combination.
“min” = minimum salary for each combination.
“max” = maximum salary for each combination
In the process, make sure you use .groups='drop'
to
release the grouping so future work is no longer group specific.
Following the creation of deptyear_summary
, prove that your
code worked by using head()
to view the first 10 rows.
#
Using the summarized data in deptyear_summary
, use the
dplyr
functions to identify the 3 departments that award
the lowest average salary for employees who have been employed for 9
years. The output should only show the 3 departments along with the
corresponding years employeed, which should all be 9, and the four
summarizing statistics created.
Furthermore, explain why the standard deviations for two of the departments in your list have salary standard deviations of “NA”. What does this mean and how did it occur?
Code (2 Points):
#
Answer (2 Points): (Place Answer Here Using Complete Sentences)
Create a scatter plot using geom_point()
along with
fitted lines using geom_smooth
with the argument
method="lm"
showing the linear relationship between average
salary and the years employeed. For this plot, use the summarized data
in deptyear_summary
. Following the plot, please explain
what this plot suggests about the relationship between the salary a UNC
employee makes and how many years that employee has served. Make
reference to the figure and use descriptive adjectives (i.e. “strong”,
“weak”, etc.) and terms (i.e. “positive”, “negative”, etc.) that are
appropriate for discussing linear relationships.
Code and Figure (2 Points):
#
Answer (2 Points): (Place Answer Here Using Complete Sentences)
The purpose of summarizing the data was to analyze the previously
discussed linear relationship by group. In
deptyear_summary
, there are 754 unique departments
represented. You can verify this by using
length(unique(deptyear_summary$Department))
. In this part,
I want you to select 5 academic departments, not previously discussed,
and in one figure, display the scatter plots and fitted regression lines
representing the relationship between average salary and years employed
in 5 different colors. Then, in complete sentences, I want you to state
what departments you chose and explain the differences and/or
similarities between the groups regarding the previously mentioned
relationship. Compare departments on the starting salary and the rate of
increase in salary based on the fitted lines.
Code and Figure: (3 Points):
#
Answer (3 Points): (Place Answer Here Using Complete Sentences)