Overstretched Employees
Given the schema shown below, write a solution to identify if there are any employees that are on multiple projects happening at the same time.
employees projects +---------------+---------+ +---------------+---------+ | id | int |<----+ +->| id | int | | first_name | varchar | | | | title | varchar | | last_name | varchar | | | | start_date | date | | salary | int | | | | end_date | date | | department_id | int |--+ | | | budget | int | +---------------+---------+ | | | +---------------+---------+ | | | departments | | | employees_projects +---------------+---------+ | | | +---------------+---------+ | id | int |<-+ | +--| project_id | int | | name | varchar | +-----| employee_id | int | +---------------+---------+ +---------------+---------+
Your function should return a result in the following format:
department_name | employee_id | first_name | last_name | project_at_risk ----------------+-------------+------------+-----------+---------------- varchar | int | varchar | varchar | varchar (title)
The project at risk is the project that starts later among the two (or more projects) that an employee is working on at that time.
We begin by merging the employees_projects DataFrame with the projects DataFrame.
Next, perform a self-join on the resulting DataFrame from the previous step. This self-join is done to compare each project of an employee with all other projects of the same employee. Essentially, this creates a combination of every project with every other project for each employee, allowing us to analyze potential overlaps.
With the self-joined DataFrame, apply a filtering condition to retain only the rows where the projects overlap. This involves checking if the start date of one project falls within the duration of another project and vice versa, ensuring that the projects are not the same.
For each pair of overlapping projects identified in the previous step, determine which project starts later. This later-starting project is considered the “project at risk.” Use the apply function to automate this comparison and selection process.
Pythondef find_overstretched_employees(departments: pd.DataFrame,
employees: pd.DataFrame,
projects: pd.DataFrame,
employees_projects: pd.DataFrame) -> pd.DataFrame:
# Merging employees_projects with projects to get project details for each employee
emp_proj_details = employees_projects.merge(projects, left_on='project_id', right_on='id') \
.drop(columns='id')
# Self-joining to find overlapping projects for each employee
emp_proj_details = emp_proj_details.merge(emp_proj_details, on='employee_id', suffixes=('_1', '_2'))
# Filtering to keep only overlapping projects
overlap_condition = (emp_proj_details['start_date_1'] < emp_proj_details['end_date_2']) & \
(emp_proj_details['start_date_2'] < emp_proj_details['end_date_1']) & \
(emp_proj_details['project_id_1'] != emp_proj_details['project_id_2'])
overlapping_projects = emp_proj_details[overlap_condition].copy()
# Identifying the later-starting project among the overlapping ones
overlapping_projects['later_project'] = overlapping_projects.apply(
lambda x: x['title_2'] if x['start_date_1'] < x['start_date_2'] else x['title_1'], axis=1
)
# Merging with employees and departments to get required details
result_df = overlapping_projects[['employee_id', 'later_project']] \
.drop_duplicates() \
.merge(employees, left_on='employee_id', right_on='id') \
.merge(departments, left_on='department_id', right_on='id') \
[['name', 'employee_id', 'first_name', 'last_name', 'later_project']]
result_df.columns = ['department_name', 'employee_id', 'first_name', 'last_name', 'project_at_risk']
return result_df
Can you make the above solution more efficient e.g. can we reduce the operations required by optimizing the self-join step?
# debug your code below departments = pd.DataFrame({ 'id': [1, 2, 3, 4, 5], 'name': ['Reporting', 'Engineering', 'Marketing', 'Biz Dev', 'Silly Walks'] }) employees = pd.DataFrame({ 'id': [1, 2, 3, 4, 5, 6], 'first_name': ['John', 'Ava', 'Cailin', 'Mike', 'Ian', 'John'], 'last_name': ['Smith', 'Muffinson', 'Ninson', 'Peterson', 'Peterson', 'Mills'], 'salary': [20000, 10000, 30000, 20000, 80000, 50000], 'department_id': [1, 5, 2, 2, 2, 3] }) projects = pd.DataFrame({ 'id': [1, 2, 3], 'title': ['Build a cool site', 'Update TPS Reports', 'Design 3 New Products'], 'start_date': ['2023-10-28', '2024-07-20', '2024-05-11'], 'end_date': ['2024-01-26', '2024-10-28', '2024-08-19'], 'budget': [5000, 400, 100] }) employees_projects = pd.DataFrame({ 'project_id': [2, 3, 3, 1, 1, 1, 3, 2], 'employee_id': [1, 1, 2, 2, 3, 4, 5, 5] }) result = find_overstretched_employees(departments, employees, projects, employees_projects) print(result.to_string(index=False))