diff --git a/Predictions/Human Resources - Full.ipynb b/Predictions/Human Resources - Full.ipynb deleted file mode 100644 index 93ba1b6..0000000 --- a/Predictions/Human Resources - Full.ipynb +++ /dev/null @@ -1,2888 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "a081d1b8", - "metadata": {}, - "source": [ - "# Table of Contents: \n", - "* [Introduction](#intro)\n", - "* [Read The Data](#readdata)\n", - "* [Feature Engineering Part I - Handling Missing Values](#misvalues)\n", - "* [Feature Engineering Part II - Dictionary](#dictio)\n", - "* [Feature Engineering Part III - Reformatting](#reform)\n", - "* [Answering The Questions](#ans)\n", - " * [Is there any relationship between who a person works for and their performance score?](#question1)\n", - " * [What is the overall diversity profile of the organization?](#question2)\n", - " * [Can we predict who is going to terminate and who isn't? What level of accuracy can we achieve on this?](#question3)\n", - " * [Are there areas of the company where pay is not equitable?](#question4)\n", - " * [What are our best recruiting sources if we want to ensure a diverse organization?](#question5)" - ] - }, - { - "cell_type": "markdown", - "id": "05deb58c", - "metadata": {}, - "source": [ - "# Introduction \n", - "Hi Everyone! Today I'll demonstrate my workflow for analyzing a **Kaggle** dataset called `Human Resources`. Read more about it [here.](https://www.kaggle.com/datasets/rhuebner/human-resources-data-set)\n", - "\n", - "The table of contents outlines what we'll cover from start to finish, including the questions this notebook answers." - ] - }, - { - "cell_type": "markdown", - "id": "bd677831", - "metadata": {}, - "source": [ - "# Read The Data \n", - "We begin by importing the necessary libraries and loading the data. You can load the data from a local file, download it from Kaggle, or use the link below, which points to a copy of the CSV in my GitHub repo." 
- ] - }, - { - "cell_type": "code", - "execution_count": 1, - "id": "ea2728d9", - "metadata": {}, - "outputs": [], - "source": [ - "import numpy as np\n", - "import pandas as pd" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "id": "b31f6f9b", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " Employee_Name EmpID MarriedID MaritalStatusID GenderID \\\n", - "0 Adinolfi, Wilson K 10026 0 0 1 \n", - "1 Ait Sidi, Karthikeyan 10084 1 1 1 \n", - "2 Akinkuolie, Sarah 10196 1 1 0 \n", - "3 Alagbe,Trina 10088 1 1 0 \n", - "4 Anderson, Carol 10069 0 2 0 \n", - "\n", - " EmpStatusID DeptID PerfScoreID FromDiversityJobFairID Salary ... \\\n", - "0 1 5 4 0 62506 ... \n", - "1 5 3 3 0 104437 ... \n", - "2 5 5 3 0 64955 ... \n", - "3 1 5 3 0 64991 ... \n", - "4 5 5 3 0 50825 ... \n", - "\n", - " ManagerName ManagerID RecruitmentSource PerformanceScore \\\n", - "0 Michael Albert 22.0 LinkedIn Exceeds \n", - "1 Simon Roup 4.0 Indeed Fully Meets \n", - "2 Kissy Sullivan 20.0 LinkedIn Fully Meets \n", - "3 Elijiah Gray 16.0 Indeed Fully Meets \n", - "4 Webster Butler 39.0 Google Search Fully Meets \n", - "\n", - " EngagementSurvey EmpSatisfaction SpecialProjectsCount \\\n", - "0 4.60 5 0 \n", - "1 4.96 3 6 \n", - "2 3.02 3 0 \n", - "3 4.84 5 0 \n", - "4 5.00 4 0 \n", - "\n", - " LastPerformanceReview_Date DaysLateLast30 Absences \n", - "0 1/17/2019 0 1 \n", - "1 2/24/2016 0 17 \n", - "2 5/15/2012 0 3 \n", - "3 1/3/2019 0 15 \n", - "4 2/1/2016 0 2 \n", - "\n", - "[5 rows x 36 columns]" - ] - }, - "execution_count": 2, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "data_original = pd.read_csv(\"https://raw.githubusercontent.com/youronlydimwit/Data_ScienceUse_Cases/main/Predictions/Data/HRDataset_v14.csv\")\n", - "data_original.head()" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "id": "c9a4cb20", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "RangeIndex: 311 entries, 0 to 310\n", - "Data columns (total 36 columns):\n", - " # Column Non-Null Count Dtype \n", - "--- ------ -------------- ----- \n", - " 0 Employee_Name 311 non-null object \n", - " 1 EmpID 311 non-null int64 \n", - " 2 MarriedID 311 non-null int64 \n", - " 3 MaritalStatusID 311 
non-null int64 \n", - " 4 GenderID 311 non-null int64 \n", - " 5 EmpStatusID 311 non-null int64 \n", - " 6 DeptID 311 non-null int64 \n", - " 7 PerfScoreID 311 non-null int64 \n", - " 8 FromDiversityJobFairID 311 non-null int64 \n", - " 9 Salary 311 non-null int64 \n", - " 10 Termd 311 non-null int64 \n", - " 11 PositionID 311 non-null int64 \n", - " 12 Position 311 non-null object \n", - " 13 State 311 non-null object \n", - " 14 Zip 311 non-null int64 \n", - " 15 DOB 311 non-null object \n", - " 16 Sex 311 non-null object \n", - " 17 MaritalDesc 311 non-null object \n", - " 18 CitizenDesc 311 non-null object \n", - " 19 HispanicLatino 311 non-null object \n", - " 20 RaceDesc 311 non-null object \n", - " 21 DateofHire 311 non-null object \n", - " 22 DateofTermination 104 non-null object \n", - " 23 TermReason 311 non-null object \n", - " 24 EmploymentStatus 311 non-null object \n", - " 25 Department 311 non-null object \n", - " 26 ManagerName 311 non-null object \n", - " 27 ManagerID 303 non-null float64\n", - " 28 RecruitmentSource 311 non-null object \n", - " 29 PerformanceScore 311 non-null object \n", - " 30 EngagementSurvey 311 non-null float64\n", - " 31 EmpSatisfaction 311 non-null int64 \n", - " 32 SpecialProjectsCount 311 non-null int64 \n", - " 33 LastPerformanceReview_Date 311 non-null object \n", - " 34 DaysLateLast30 311 non-null int64 \n", - " 35 Absences 311 non-null int64 \n", - "dtypes: float64(2), int64(16), object(18)\n", - "memory usage: 87.6+ KB\n" - ] - } - ], - "source": [ - "data_original.info()" - ] - }, - { - "cell_type": "markdown", - "id": "79569d4f", - "metadata": {}, - "source": [ - "# Feature Engineering Part I - Handling Missing Values \n", - "From the `data_original.info()` output above, we can see that, out of 311 rows, the columns `ManagerID` (303 non-null) and `DateofTermination` (104 non-null) have missing values. 
We will analyze and handle each in turn.\n", - "\n", - "[Back To Top](#top)\n", - "\n", - "## Missing Values in ManagerID" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "id": "2a4d839d", - "metadata": {}, - "outputs": [], - "source": [ - "# Work on a copy so the original dataframe stays untouched\n", - "df = data_original.copy()" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "id": "f1237188", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "
" - ], - "text/plain": [ - " Employee_Name EmpID MarriedID MaritalStatusID GenderID \\\n", - "19 Becker, Scott 10277 0 0 1 \n", - "30 Buccheri, Joseph 10184 0 0 1 \n", - "44 Chang, Donovan E 10154 0 0 1 \n", - "\n", - " EmpStatusID DeptID PerfScoreID FromDiversityJobFairID Salary ... \\\n", - "19 3 5 3 0 53250 ... \n", - "30 1 5 3 0 65288 ... \n", - "44 1 5 3 0 60380 ... \n", - "\n", - " ManagerName ManagerID RecruitmentSource PerformanceScore \\\n", - "19 Webster Butler NaN LinkedIn Fully Meets \n", - "30 Webster Butler NaN Google Search Fully Meets \n", - "44 Webster Butler NaN LinkedIn Fully Meets \n", - "\n", - " EngagementSurvey EmpSatisfaction SpecialProjectsCount \\\n", - "19 4.20 4 0 \n", - "30 3.19 3 0 \n", - "44 3.80 5 0 \n", - "\n", - " LastPerformanceReview_Date DaysLateLast30 Absences \n", - "19 1/11/2019 0 13 \n", - "30 2/1/2019 0 9 \n", - "44 1/14/2019 0 4 \n", - "\n", - "[3 rows x 36 columns]" - ] - }, - "execution_count": 5, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Preview the rows where ManagerID is Null / NaN.\n", - "df[df['ManagerID'].isna()].head(3)" - ] - }, - { - "cell_type": "markdown", - "id": "f8bcd9b7", - "metadata": {}, - "source": [ - "The rows with a missing `ManagerID` all show `Webster Butler` as the `ManagerName`. This raises the question of whether `Webster Butler` appears only in rows where `ManagerID` is NaN. To check, we can display the unique `ManagerID` values for rows where `ManagerName` is `Webster Butler`." 
- ] - }, - { - "cell_type": "code", - "execution_count": 6, - "id": "77a5f506", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "array([39., nan])" - ] - }, - "execution_count": 6, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Show All ManagerID where ManagerName is Webster Butler\n", - "df[df['ManagerName'] == 'Webster Butler']['ManagerID'].unique()" - ] - }, - { - "cell_type": "markdown", - "id": "6eea7613", - "metadata": {}, - "source": [ - "So `Webster Butler` maps to two values: the ID 39 and NaN. Next we run the reverse check: which `ManagerName` values appear in rows where `ManagerID` is NaN?" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "id": "45a2bb08", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "array(['Webster Butler'], dtype=object)" - ] - }, - "execution_count": 7, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Show all ManagerName where ManagerID is NaN / Null\n", - "df[df['ManagerID'].isna()]['ManagerName'].unique()" - ] - }, - { - "cell_type": "markdown", - "id": "3ab9f5be", - "metadata": {}, - "source": [ - "And there it is. The NaN / Null values occur only where `ManagerName` is `Webster Butler`. We can therefore fill them with 39, the ID that `Webster Butler` already has in the other rows." 
- ] - }, - { - "cell_type": "code", - "execution_count": 8, - "id": "d0d1dc07", - "metadata": {}, - "outputs": [], - "source": [ - "# Fill every NaN / Null with 39 (assign back instead of using inplace=True on a column selection)\n", - "df['ManagerID'] = df['ManagerID'].fillna(39)" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "id": "51a3e162", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "array([39.])" - ] - }, - "execution_count": 9, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Optional check: confirm Webster Butler no longer has NaN / Null values\n", - "df[df['ManagerName'] == 'Webster Butler']['ManagerID'].unique()" - ] - }, - { - "cell_type": "markdown", - "id": "429b84a4", - "metadata": {}, - "source": [ - "## Missing Values in DateofTermination\n", - "As you would expect, most employees are still with the company, so only those who have left (104 of 311 rows) have a `DateofTermination` value.\n", - "\n", - "There are many ways to handle this. In this workflow, I decided to look at the `TermReason` column instead, because it records why each employee was terminated or resigned and has no NaN / Null values." 
- ] - }, - { - "cell_type": "code", - "execution_count": 10, - "id": "18fd4a35", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "array(['N/A-StillEmployed', 'career change', 'hours', 'return to school',\n", - " 'Another position', 'unhappy', 'attendance', 'performance',\n", - " 'Learned that he is a gangster', 'retiring',\n", - " 'relocation out of area', 'more money', 'military',\n", - " 'no-call, no-show', 'Fatal attraction',\n", - " 'maternity leave - did not return', 'medical issues',\n", - " 'gross misconduct'], dtype=object)" - ] - }, - "execution_count": 10, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Show unique values in TermReason\n", - "df['TermReason'].unique()" - ] - }, - { - "cell_type": "markdown", - "id": "cc0cde74", - "metadata": {}, - "source": [ - "As you can see, the value `N/A-StillEmployed` marks people who are still working for the company. Since `TermReason` already carries this information, we deliberately drop the `DateofTermination` column." - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "id": "8abe8339", - "metadata": {}, - "outputs": [], - "source": [ - "# Drop the DateofTermination column\n", - "df = df.drop(['DateofTermination'], axis=1)" - ] - }, - { - "cell_type": "markdown", - "id": "54a648a1", - "metadata": {}, - "source": [ - "# Feature Engineering Part II - Dictionary \n", - "In this part, I store column information in dictionaries as metadata. Because the values will be encoded later, I want a record of which text value each encoded value stood for.\n", - "\n", - "[Back To Top](#top)\n", - "\n", - "Some columns appear to provide encoded values already, such as `GenderID` for `Sex` and `DeptID` for `Department`. However, we need to check these pairs before relying on them." 
- ] - }, - { - "cell_type": "code", - "execution_count": 12, - "id": "b58355c5", - "metadata": {}, - "outputs": [], - "source": [ - "# Column Mapping\n", - "def column_mapping(dataframe, column1, column2):\n", - " column_mapping = {}\n", - "\n", - " for unique_value in dataframe[column1].unique():\n", - " unique_values_column2 = dataframe.loc[dataframe[column1] == unique_value, column2].unique()\n", - " column_mapping[unique_value] = unique_values_column2\n", - "\n", - " return column_mapping" - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "id": "9cb6b520", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "{'Adinolfi, Wilson K': array([10026], dtype=int64),\n", - " 'Ait Sidi, Karthikeyan ': array([10084], dtype=int64),\n", - " 'Akinkuolie, Sarah': array([10196], dtype=int64),\n", - " 'Alagbe,Trina': array([10088], dtype=int64),\n", - " 'Anderson, Carol ': array([10069], dtype=int64),\n", - " 'Anderson, Linda ': array([10002], dtype=int64),\n", - " 'Andreola, Colby': array([10194], dtype=int64),\n", - " 'Athwal, Sam': array([10062], dtype=int64),\n", - " 'Bachiochi, Linda': array([10114], dtype=int64),\n", - " 'Bacong, Alejandro ': array([10250], dtype=int64),\n", - " 'Baczenski, Rachael ': array([10252], dtype=int64),\n", - " 'Barbara, Thomas': array([10242], dtype=int64),\n", - " 'Barbossa, Hector': array([10012], dtype=int64),\n", - " 'Barone, Francesco A': array([10265], dtype=int64),\n", - " 'Barton, Nader': array([10066], dtype=int64),\n", - " 'Bates, Norman': array([10061], dtype=int64),\n", - " 'Beak, Kimberly ': array([10023], dtype=int64),\n", - " 'Beatrice, Courtney ': array([10055], dtype=int64),\n", - " 'Becker, Renee': array([10245], dtype=int64),\n", - " 'Becker, Scott': array([10277], dtype=int64),\n", - " 'Bernstein, Sean': array([10046], dtype=int64),\n", - " 'Biden, Lowan M': array([10226], dtype=int64),\n", - " 'Billis, Helen': array([10003], dtype=int64),\n", - " 'Blount, Dianna': array([10294], 
dtype=int64),\n", - " 'Bondwell, Betsy': array([10267], dtype=int64),\n", - " 'Booth, Frank': array([10199], dtype=int64),\n", - " 'Boutwell, Bonalyn': array([10081], dtype=int64),\n", - " 'Bozzi, Charles': array([10175], dtype=int64),\n", - " 'Brill, Donna': array([10177], dtype=int64),\n", - " 'Brown, Mia': array([10238], dtype=int64),\n", - " 'Buccheri, Joseph ': array([10184], dtype=int64),\n", - " 'Bugali, Josephine ': array([10203], dtype=int64),\n", - " 'Bunbury, Jessica': array([10188], dtype=int64),\n", - " 'Burke, Joelle': array([10107], dtype=int64),\n", - " 'Burkett, Benjamin ': array([10181], dtype=int64),\n", - " 'Cady, Max ': array([10150], dtype=int64),\n", - " 'Candie, Calvin': array([10001], dtype=int64),\n", - " 'Carabbio, Judith': array([10085], dtype=int64),\n", - " 'Carey, Michael ': array([10115], dtype=int64),\n", - " 'Carr, Claudia N': array([10082], dtype=int64),\n", - " 'Carter, Michelle ': array([10040], dtype=int64),\n", - " 'Chace, Beatrice ': array([10067], dtype=int64),\n", - " 'Champaigne, Brian': array([10108], dtype=int64),\n", - " 'Chan, Lin': array([10210], dtype=int64),\n", - " 'Chang, Donovan E': array([10154], dtype=int64),\n", - " 'Chigurh, Anton': array([10200], dtype=int64),\n", - " 'Chivukula, Enola': array([10240], dtype=int64),\n", - " 'Cierpiszewski, Caroline ': array([10168], dtype=int64),\n", - " 'Clayton, Rick': array([10220], dtype=int64),\n", - " 'Cloninger, Jennifer': array([10275], dtype=int64),\n", - " 'Close, Phil': array([10269], dtype=int64),\n", - " 'Clukey, Elijian': array([10029], dtype=int64),\n", - " 'Cockel, James': array([10261], dtype=int64),\n", - " 'Cole, Spencer': array([10292], dtype=int64),\n", - " 'Corleone, Michael': array([10282], dtype=int64),\n", - " 'Corleone, Vito': array([10019], dtype=int64),\n", - " 'Cornett, Lisa ': array([10094], dtype=int64),\n", - " 'Costello, Frank': array([10193], dtype=int64),\n", - " 'Crimmings, Jean': array([10132], dtype=int64),\n", - " 'Cross, Noah': 
array([10083], dtype=int64),\n", - " 'Daneault, Lynn': array([10099], dtype=int64),\n", - " 'Daniele, Ann ': array([10212], dtype=int64),\n", - " \"Darson, Jene'ya \": array([10056], dtype=int64),\n", - " 'Davis, Daniel': array([10143], dtype=int64),\n", - " 'Dee, Randy': array([10311], dtype=int64),\n", - " 'DeGweck, James': array([10070], dtype=int64),\n", - " 'Del Bosque, Keyla': array([10155], dtype=int64),\n", - " 'Delarge, Alex': array([10306], dtype=int64),\n", - " 'Demita, Carla': array([10100], dtype=int64),\n", - " 'Desimone, Carl ': array([10310], dtype=int64),\n", - " 'DeVito, Tommy': array([10197], dtype=int64),\n", - " 'Dickinson, Geoff ': array([10276], dtype=int64),\n", - " 'Dietrich, Jenna ': array([10304], dtype=int64),\n", - " 'DiNocco, Lily ': array([10284], dtype=int64),\n", - " 'Dobrin, Denisa S': array([10207], dtype=int64),\n", - " 'Dolan, Linda': array([10133], dtype=int64),\n", - " 'Dougall, Eric': array([10028], dtype=int64),\n", - " 'Driver, Elle': array([10006], dtype=int64),\n", - " 'Dunn, Amy ': array([10105], dtype=int64),\n", - " 'Dunne, Amy': array([10211], dtype=int64),\n", - " 'Eaton, Marianne': array([10064], dtype=int64),\n", - " 'Engdahl, Jean': array([10247], dtype=int64),\n", - " 'England, Rex': array([10235], dtype=int64),\n", - " 'Erilus, Angela': array([10299], dtype=int64),\n", - " 'Estremera, Miguel': array([10280], dtype=int64),\n", - " 'Evensen, April': array([10296], dtype=int64),\n", - " 'Exantus, Susan': array([10290], dtype=int64),\n", - " 'Faller, Megan ': array([10263], dtype=int64),\n", - " 'Fancett, Nicole': array([10136], dtype=int64),\n", - " 'Ferguson, Susan': array([10189], dtype=int64),\n", - " 'Fernandes, Nilson ': array([10308], dtype=int64),\n", - " 'Fett, Boba': array([10309], dtype=int64),\n", - " 'Fidelia, Libby': array([10049], dtype=int64),\n", - " 'Fitzpatrick, Michael J': array([10093], dtype=int64),\n", - " 'Foreman, Tanya': array([10163], dtype=int64),\n", - " 'Forrest, Alex': array([10305], 
dtype=int64),\n", - " 'Foss, Jason': array([10015], dtype=int64),\n", - " 'Foster-Baker, Amy': array([10080], dtype=int64),\n", - " 'Fraval, Maruk ': array([10258], dtype=int64),\n", - " 'Galia, Lisa': array([10273], dtype=int64),\n", - " 'Garcia, Raul': array([10111], dtype=int64),\n", - " 'Gaul, Barbara': array([10257], dtype=int64),\n", - " 'Gentry, Mildred': array([10159], dtype=int64),\n", - " 'Gerke, Melisa': array([10122], dtype=int64),\n", - " 'Gill, Whitney ': array([10142], dtype=int64),\n", - " 'Gilles, Alex': array([10283], dtype=int64),\n", - " 'Girifalco, Evelyn': array([10018], dtype=int64),\n", - " 'Givens, Myriam': array([10255], dtype=int64),\n", - " 'Goble, Taisha': array([10246], dtype=int64),\n", - " 'Goeth, Amon': array([10228], dtype=int64),\n", - " 'Gold, Shenice ': array([10243], dtype=int64),\n", - " 'Gonzalez, Cayo': array([10031], dtype=int64),\n", - " 'Gonzalez, Juan': array([10300], dtype=int64),\n", - " 'Gonzalez, Maria': array([10101], dtype=int64),\n", - " 'Good, Susan': array([10237], dtype=int64),\n", - " 'Gordon, David': array([10051], dtype=int64),\n", - " 'Gosciminski, Phylicia ': array([10218], dtype=int64),\n", - " 'Goyal, Roxana': array([10256], dtype=int64),\n", - " 'Gray, Elijiah ': array([10098], dtype=int64),\n", - " 'Gross, Paula': array([10059], dtype=int64),\n", - " 'Gruber, Hans': array([10234], dtype=int64),\n", - " 'Guilianno, Mike': array([10109], dtype=int64),\n", - " 'Handschiegl, Joanne': array([10125], dtype=int64),\n", - " 'Hankard, Earnest': array([10074], dtype=int64),\n", - " 'Harrington, Christie ': array([10097], dtype=int64),\n", - " 'Harrison, Kara': array([10007], dtype=int64),\n", - " 'Heitzman, Anthony': array([10129], dtype=int64),\n", - " 'Hendrickson, Trina': array([10075], dtype=int64),\n", - " 'Hitchcock, Alfred': array([10167], dtype=int64),\n", - " 'Homberger, Adrienne J': array([10195], dtype=int64),\n", - " 'Horton, Jayne': array([10112], dtype=int64),\n", - " 'Houlihan, Debra': 
array([10272], dtype=int64),\n", - " 'Howard, Estelle': array([10182], dtype=int64),\n", - " 'Hudson, Jane': array([10248], dtype=int64),\n", - " 'Hunts, Julissa': array([10201], dtype=int64),\n", - " 'Hutter, Rosalie': array([10214], dtype=int64),\n", - " 'Huynh, Ming': array([10160], dtype=int64),\n", - " 'Immediato, Walter': array([10289], dtype=int64),\n", - " 'Ivey, Rose ': array([10139], dtype=int64),\n", - " 'Jackson, Maryellen': array([10227], dtype=int64),\n", - " 'Jacobi, Hannah ': array([10236], dtype=int64),\n", - " 'Jeannite, Tayana': array([10009], dtype=int64),\n", - " 'Jhaveri, Sneha ': array([10060], dtype=int64),\n", - " 'Johnson, George': array([10034], dtype=int64),\n", - " 'Johnson, Noelle ': array([10156], dtype=int64),\n", - " 'Johnston, Yen': array([10036], dtype=int64),\n", - " 'Jung, Judy ': array([10138], dtype=int64),\n", - " 'Kampew, Donysha': array([10244], dtype=int64),\n", - " 'Keatts, Kramer ': array([10192], dtype=int64),\n", - " 'Khemmich, Bartholemew': array([10231], dtype=int64),\n", - " 'King, Janet': array([10089], dtype=int64),\n", - " 'Kinsella, Kathleen ': array([10166], dtype=int64),\n", - " 'Kirill, Alexandra ': array([10170], dtype=int64),\n", - " 'Knapp, Bradley J': array([10208], dtype=int64),\n", - " 'Kretschmer, John': array([10176], dtype=int64),\n", - " 'Kreuger, Freddy': array([10165], dtype=int64),\n", - " 'Lajiri, Jyoti': array([10113], dtype=int64),\n", - " 'Landa, Hans': array([10092], dtype=int64),\n", - " 'Langford, Lindsey': array([10106], dtype=int64),\n", - " 'Langton, Enrico': array([10052], dtype=int64),\n", - " 'LaRotonda, William ': array([10038], dtype=int64),\n", - " 'Latif, Mohammed': array([10249], dtype=int64),\n", - " 'Le, Binh': array([10232], dtype=int64),\n", - " 'Leach, Dallas': array([10087], dtype=int64),\n", - " 'LeBlanc, Brandon R': array([10134], dtype=int64),\n", - " 'Lecter, Hannibal': array([10251], dtype=int64),\n", - " 'Leruth, Giovanni': array([10103], dtype=int64),\n", - " 
'Liebig, Ketsia': array([10017], dtype=int64),\n", - " 'Linares, Marilyn ': array([10186], dtype=int64),\n", - " 'Linden, Mathew': array([10137], dtype=int64),\n", - " 'Lindsay, Leonara ': array([10008], dtype=int64),\n", - " 'Lundy, Susan': array([10096], dtype=int64),\n", - " 'Lunquist, Lisa': array([10035], dtype=int64),\n", - " 'Lydon, Allison': array([10057], dtype=int64),\n", - " 'Lynch, Lindsay': array([10004], dtype=int64),\n", - " 'MacLennan, Samuel': array([10191], dtype=int64),\n", - " 'Mahoney, Lauren ': array([10219], dtype=int64),\n", - " 'Manchester, Robyn': array([10077], dtype=int64),\n", - " 'Mancuso, Karen': array([10073], dtype=int64),\n", - " 'Mangal, Debbie': array([10279], dtype=int64),\n", - " 'Martin, Sandra': array([10110], dtype=int64),\n", - " 'Maurice, Shana': array([10053], dtype=int64),\n", - " \"Carthy, B'rigit\": array([10076], dtype=int64),\n", - " 'Mckenna, Sandy': array([10145], dtype=int64),\n", - " 'McKinzie, Jac': array([10202], dtype=int64),\n", - " 'Meads, Elizabeth': array([10128], dtype=int64),\n", - " 'Medeiros, Jennifer': array([10068], dtype=int64),\n", - " 'Miller, Brannon': array([10116], dtype=int64),\n", - " 'Miller, Ned': array([10298], dtype=int64),\n", - " 'Monkfish, Erasumus': array([10213], dtype=int64),\n", - " 'Monroe, Peter': array([10288], dtype=int64),\n", - " 'Monterro, Luisa': array([10025], dtype=int64),\n", - " 'Moran, Patrick': array([10223], dtype=int64),\n", - " 'Morway, Tanya': array([10151], dtype=int64),\n", - " 'Motlagh, Dawn': array([10254], dtype=int64),\n", - " 'Moumanil, Maliki ': array([10120], dtype=int64),\n", - " 'Myers, Michael': array([10216], dtype=int64),\n", - " 'Navathe, Kurt': array([10079], dtype=int64),\n", - " 'Ndzi, Colombui': array([10215], dtype=int64),\n", - " 'Ndzi, Horia': array([10185], dtype=int64),\n", - " 'Newman, Richard ': array([10063], dtype=int64),\n", - " 'Ngodup, Shari ': array([10037], dtype=int64),\n", - " 'Nguyen, Dheepa': array([10042], dtype=int64),\n", - 
" 'Nguyen, Lei-Ming': array([10206], dtype=int64),\n", - " 'Nowlan, Kristie': array([10104], dtype=int64),\n", - " \"O'hare, Lynn\": array([10303], dtype=int64),\n", - " 'Oliver, Brooke ': array([10078], dtype=int64),\n", - " 'Onque, Jasmine': array([10121], dtype=int64),\n", - " 'Osturnka, Adeel': array([10021], dtype=int64),\n", - " 'Owad, Clinton': array([10281], dtype=int64),\n", - " 'Ozark, Travis': array([10041], dtype=int64),\n", - " 'Panjwani, Nina': array([10148], dtype=int64),\n", - " 'Patronick, Lucas': array([10005], dtype=int64),\n", - " 'Pearson, Randall': array([10259], dtype=int64),\n", - " 'Smith, Martin': array([10286], dtype=int64),\n", - " 'Pelletier, Ermine': array([10297], dtype=int64),\n", - " 'Perry, Shakira': array([10171], dtype=int64),\n", - " 'Peters, Lauren': array([10032], dtype=int64),\n", - " 'Peterson, Ebonee ': array([10130], dtype=int64),\n", - " 'Petingill, Shana ': array([10217], dtype=int64),\n", - " 'Petrowsky, Thelma': array([10016], dtype=int64),\n", - " 'Pham, Hong': array([10050], dtype=int64),\n", - " 'Pitt, Brad ': array([10164], dtype=int64),\n", - " 'Potts, Xana': array([10124], dtype=int64),\n", - " 'Power, Morissa': array([10187], dtype=int64),\n", - " 'Punjabhi, Louis ': array([10225], dtype=int64),\n", - " 'Purinton, Janine': array([10262], dtype=int64),\n", - " 'Quinn, Sean': array([10131], dtype=int64),\n", - " 'Rachael, Maggie': array([10239], dtype=int64),\n", - " 'Rarrick, Quinn': array([10152], dtype=int64),\n", - " 'Ren, Kylo': array([10140], dtype=int64),\n", - " 'Rhoads, Thomas': array([10058], dtype=int64),\n", - " 'Rivera, Haley ': array([10011], dtype=int64),\n", - " 'Roberson, May': array([10230], dtype=int64),\n", - " 'Robertson, Peter': array([10224], dtype=int64),\n", - " 'Robinson, Alain ': array([10047], dtype=int64),\n", - " 'Robinson, Cherly': array([10285], dtype=int64),\n", - " 'Robinson, Elias': array([10020], dtype=int64),\n", - " 'Roby, Lori ': array([10162], dtype=int64),\n", - " 
'Roehrich, Bianca': array([10149], dtype=int64),\n", - " 'Roper, Katie': array([10086], dtype=int64),\n", - " 'Rose, Ashley ': array([10054], dtype=int64),\n", - " 'Rossetti, Bruno': array([10065], dtype=int64),\n", - " 'Roup,Simon': array([10198], dtype=int64),\n", - " 'Ruiz, Ricardo': array([10222], dtype=int64),\n", - " 'Saada, Adell': array([10126], dtype=int64),\n", - " 'Saar-Beckles, Melinda': array([10295], dtype=int64),\n", - " 'Sadki, Nore ': array([10260], dtype=int64),\n", - " 'Sahoo, Adil': array([10233], dtype=int64),\n", - " 'Salter, Jason': array([10229], dtype=int64),\n", - " 'Sander, Kamrin': array([10169], dtype=int64),\n", - " 'Sewkumar, Nori': array([10071], dtype=int64),\n", - " 'Shepard, Anita ': array([10179], dtype=int64),\n", - " 'Shields, Seffi': array([10091], dtype=int64),\n", - " 'Simard, Kramer': array([10178], dtype=int64),\n", - " 'Singh, Nan ': array([10039], dtype=int64),\n", - " 'Sloan, Constance': array([10095], dtype=int64),\n", - " 'Smith, Joe': array([10027], dtype=int64),\n", - " 'Smith, John': array([10291], dtype=int64),\n", - " 'Smith, Leigh Ann': array([10153], dtype=int64),\n", - " 'Smith, Sade': array([10157], dtype=int64),\n", - " 'Soto, Julia ': array([10119], dtype=int64),\n", - " 'Soze, Keyser': array([10180], dtype=int64),\n", - " 'Sparks, Taylor ': array([10302], dtype=int64),\n", - " 'Spirea, Kelley': array([10090], dtype=int64),\n", - " 'Squatrito, Kristen': array([10030], dtype=int64),\n", - " 'Stanford,Barbara M': array([10278], dtype=int64),\n", - " 'Stansfield, Norman': array([10307], dtype=int64),\n", - " 'Steans, Tyrone ': array([10147], dtype=int64),\n", - " 'Stoica, Rick': array([10266], dtype=int64),\n", - " 'Strong, Caitrin': array([10241], dtype=int64),\n", - " 'Sullivan, Kissy ': array([10158], dtype=int64),\n", - " 'Sullivan, Timothy': array([10117], dtype=int64),\n", - " 'Sutwell, Barbara': array([10209], dtype=int64),\n", - " 'Szabo, Andrew': array([10024], dtype=int64),\n", - " 'Tannen, Biff': 
array([10173], dtype=int64),\n", - " 'Tavares, Desiree ': array([10221], dtype=int64),\n", - " 'Tejeda, Lenora ': array([10146], dtype=int64),\n", - " 'Terry, Sharlene ': array([10161], dtype=int64),\n", - " 'Theamstern, Sophia': array([10141], dtype=int64),\n", - " 'Thibaud, Kenneth': array([10268], dtype=int64),\n", - " 'Tippett, Jeanette': array([10123], dtype=int64),\n", - " 'Torrence, Jack': array([10013], dtype=int64),\n", - " 'Trang, Mei': array([10287], dtype=int64),\n", - " 'Tredinnick, Neville ': array([10044], dtype=int64),\n", - " 'True, Edward': array([10102], dtype=int64),\n", - " 'Trzeciak, Cybil': array([10270], dtype=int64),\n", - " 'Turpin, Jumil': array([10045], dtype=int64),\n", - " 'Valentin,Jackie': array([10205], dtype=int64),\n", - " 'Veera, Abdellah ': array([10014], dtype=int64),\n", - " 'Vega, Vincent': array([10144], dtype=int64),\n", - " 'Villanueva, Noah': array([10253], dtype=int64),\n", - " 'Voldemort, Lord': array([10118], dtype=int64),\n", - " 'Volk, Colleen': array([10022], dtype=int64),\n", - " 'Von Massenbach, Anna': array([10183], dtype=int64),\n", - " 'Walker, Roger': array([10190], dtype=int64),\n", - " 'Wallace, Courtney E': array([10274], dtype=int64),\n", - " 'Wallace, Theresa': array([10293], dtype=int64),\n", - " 'Wang, Charlie': array([10172], dtype=int64),\n", - " 'Warfield, Sarah': array([10127], dtype=int64),\n", - " 'Whittier, Scott': array([10072], dtype=int64),\n", - " 'Wilber, Barry': array([10048], dtype=int64),\n", - " 'Wilkes, Annie': array([10204], dtype=int64),\n", - " 'Williams, Jacquelyn ': array([10264], dtype=int64),\n", - " 'Winthrop, Jordan ': array([10033], dtype=int64),\n", - " 'Wolk, Hang T': array([10174], dtype=int64),\n", - " 'Woodson, Jason': array([10135], dtype=int64),\n", - " 'Ybarra, Catherine ': array([10301], dtype=int64),\n", - " 'Zamora, Jennifer': array([10010], dtype=int64),\n", - " 'Zhou, Julia': array([10043], dtype=int64),\n", - " 'Zima, Colleen': array([10271], dtype=int64)}" - ] - 
}, - "execution_count": 13, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "column_mapping(df, 'Employee_Name', 'EmpID')" - ] - }, - { - "cell_type": "markdown", - "id": "766aed19", - "metadata": {}, - "source": [ - "Based on the results above, there are no apparent duplicate values of `EmpID`: each `Employee_Name` maps to exactly one `EmpID`." - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "id": "3a876612", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "{'Production Technician I': array([19], dtype=int64),\n", - " 'Sr. DBA': array([27], dtype=int64),\n", - " 'Production Technician II': array([20], dtype=int64),\n", - " 'Software Engineer': array([24, 23], dtype=int64),\n", - " 'IT Support': array([14], dtype=int64),\n", - " 'Data Analyst': array([9], dtype=int64),\n", - " 'Database Administrator': array([8], dtype=int64),\n", - " 'Enterprise Architect': array([30], dtype=int64),\n", - " 'Sr. Accountant': array([26], dtype=int64),\n", - " 'Production Manager': array([18, 17], dtype=int64),\n", - " 'Accountant I': array([1], dtype=int64),\n", - " 'Area Sales Manager': array([3], dtype=int64),\n", - " 'Software Engineering Manager': array([25], dtype=int64),\n", - " 'BI Director': array([5], dtype=int64),\n", - " 'Director of Operations': array([10], dtype=int64),\n", - " 'Sr. 
Network Engineer': array([28], dtype=int64),\n", - " 'Sales Manager': array([21], dtype=int64),\n", - " 'BI Developer': array([4], dtype=int64),\n", - " 'IT Manager - Support': array([13], dtype=int64),\n", - " 'Network Engineer': array([15], dtype=int64),\n", - " 'IT Director': array([12], dtype=int64),\n", - " 'Director of Sales': array([11], dtype=int64),\n", - " 'Administrative Assistant': array([2], dtype=int64),\n", - " 'President & CEO': array([16], dtype=int64),\n", - " 'Senior BI Developer': array([22], dtype=int64),\n", - " 'Shared Services Manager': array([23], dtype=int64),\n", - " 'IT Manager - Infra': array([13], dtype=int64),\n", - " 'Principal Data Architect': array([29], dtype=int64),\n", - " 'Data Architect': array([7], dtype=int64),\n", - " 'IT Manager - DB': array([13], dtype=int64),\n", - " 'Data Analyst ': array([9], dtype=int64),\n", - " 'CIO': array([6], dtype=int64)}" - ] - }, - "execution_count": 14, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Position Info in dict\n", - "column_mapping(df, 'Position', 'PositionID')" - ] - }, - { - "cell_type": "markdown", - "id": "aa5eb56f", - "metadata": {}, - "source": [ - "It appears that some positions have multiple IDs, such as `Production Manager` and `Software Engineer`, and some IDs are shared by several positions (ID 13 covers three `IT Manager` roles). Knowing this, I will encode the `Position` column into numerical values later and drop the `PositionID` column." 
- ] - }, - { - "cell_type": "code", - "execution_count": 15, - "id": "8aa4eef5", - "metadata": {}, - "outputs": [], - "source": [ - "# Drop PositionID Column\n", - "df = df.drop(['PositionID'], axis=1)" - ] - }, - { - "cell_type": "code", - "execution_count": 16, - "id": "3fc78b08", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "{'Michael Albert': array([22., 30.]),\n", - " 'Simon Roup': array([4.]),\n", - " 'Kissy Sullivan': array([20.]),\n", - " 'Elijiah Gray': array([16.]),\n", - " 'Webster Butler': array([39.]),\n", - " 'Amy Dunn': array([11.]),\n", - " 'Alex Sweetwater': array([10.]),\n", - " 'Ketsia Liebig': array([19.]),\n", - " 'Brannon Miller': array([12.]),\n", - " 'Peter Monroe': array([7.]),\n", - " 'David Stanley': array([14.]),\n", - " 'Kelley Spirea': array([18.]),\n", - " 'Brandon R. LeBlanc': array([3., 1.]),\n", - " 'Janet King': array([2.]),\n", - " 'John Smith': array([17.]),\n", - " 'Jennifer Zamora': array([5.]),\n", - " 'Lynn Daneault': array([21.]),\n", - " 'Eric Dougall': array([6.]),\n", - " 'Debra Houlihan': array([15.]),\n", - " 'Brian Champaigne': array([13.]),\n", - " 'Board of Directors': array([9.])}" - ] - }, - "execution_count": 16, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Manager Info in dict\n", - "column_mapping(df, 'ManagerName', 'ManagerID')" - ] - }, - { - "cell_type": "markdown", - "id": "c4a390e4", - "metadata": {}, - "source": [ - "Again, some managers have multiple IDs (`Michael Albert` and `Brandon R. LeBlanc`). Assuming this is a **data quality issue** rather than a **hierarchical structure problem**, where a manager may hold different roles or responsibilities with distinct IDs, we will keep `ManagerName` and drop `ManagerID`." 
- ] - }, - { - "cell_type": "code", - "execution_count": 17, - "id": "41613c39", - "metadata": {}, - "outputs": [], - "source": [ - "# Drop ManagerID column\n", - "df = df.drop(['ManagerID'], axis=1)" - ] - }, - { - "cell_type": "code", - "execution_count": 18, - "id": "1c3d6edf", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "{'M ': array([1], dtype=int64), 'F': array([0], dtype=int64)}" - ] - }, - "execution_count": 18, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Gender Info in dict\n", - "column_mapping(df, 'Sex', 'GenderID')" - ] - }, - { - "cell_type": "markdown", - "id": "05d5aa6e", - "metadata": {}, - "source": [ - "`Sex` and `GenderID` map one-to-one, so to avoid further encoding, the `Sex` column will be dropped." - ] - }, - { - "cell_type": "code", - "execution_count": 19, - "id": "a95ce45b", - "metadata": {}, - "outputs": [], - "source": [ - "# Drop Sex column\n", - "df = df.drop(['Sex'], axis=1)" - ] - }, - { - "cell_type": "code", - "execution_count": 20, - "id": "57b07e7f", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "{'Single': array([0], dtype=int64),\n", - " 'Married': array([1], dtype=int64),\n", - " 'Divorced': array([2], dtype=int64),\n", - " 'Widowed': array([4], dtype=int64),\n", - " 'Separated': array([3], dtype=int64)}" - ] - }, - "execution_count": 20, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Marital Info in dict\n", - "column_mapping(df, 'MaritalDesc', 'MaritalStatusID')" - ] - }, - { - "cell_type": "markdown", - "id": "acfa90f7", - "metadata": {}, - "source": [ - "Another one-to-one mapping. Again, to avoid further encoding, the `MaritalDesc` column will be dropped." 
- ] - }, - { - "cell_type": "code", - "execution_count": 21, - "id": "b2ba844b", - "metadata": {}, - "outputs": [], - "source": [ - "# Drop MaritalDesc Column\n", - "df = df.drop(['MaritalDesc'], axis=1)" - ] - }, - { - "cell_type": "code", - "execution_count": 22, - "id": "deabc5b2", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "{'Active': array([1, 3, 2], dtype=int64),\n", - " 'Voluntarily Terminated': array([5], dtype=int64),\n", - " 'Terminated for Cause': array([4, 1], dtype=int64)}" - ] - }, - "execution_count": 22, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Employee Status Info in dict\n", - "column_mapping(df, 'EmploymentStatus', 'EmpStatusID')" - ] - }, - { - "cell_type": "markdown", - "id": "f90a4603", - "metadata": {}, - "source": [ - "`EmpStatusID` clearly does not map consistently to `EmploymentStatus` (for example, ID 1 appears under both `Active` and `Terminated for Cause`). For this reason, we will drop the `EmpStatusID` column." - ] - }, - { - "cell_type": "code", - "execution_count": 23, - "id": "dd09d9ec", - "metadata": {}, - "outputs": [], - "source": [ - "# Drop EmpStatusID column\n", - "df = df.drop(['EmpStatusID'], axis=1)" - ] - }, - { - "cell_type": "code", - "execution_count": 24, - "id": "67689005", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "{'Exceeds': array([4], dtype=int64),\n", - " 'Fully Meets': array([3, 1], dtype=int64),\n", - " 'Needs Improvement': array([2], dtype=int64),\n", - " 'PIP': array([1, 3], dtype=int64)}" - ] - }, - "execution_count": 24, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Performance Score Info in dict\n", - "column_mapping(df, 'PerformanceScore', 'PerfScoreID')" - ] - }, - { - "cell_type": "markdown", - "id": "ac103dd5", - "metadata": {}, - "source": [ - "Again, `PerfScoreID` does not map consistently to `PerformanceScore` (IDs 1 and 3 each appear under both `Fully Meets` and `PIP`). We will drop the `PerfScoreID` column." 
- ] - }, - { - "cell_type": "code", - "execution_count": 25, - "id": "a275b331", - "metadata": {}, - "outputs": [], - "source": [ - "# Drop PerfScoreID column\n", - "df = df.drop(['PerfScoreID'], axis=1)" - ] - }, - { - "cell_type": "code", - "execution_count": 26, - "id": "4d1e7172", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "{'Production ': array([5, 6], dtype=int64),\n", - " 'IT/IS': array([3], dtype=int64),\n", - " 'Software Engineering': array([4, 1], dtype=int64),\n", - " 'Admin Offices': array([1], dtype=int64),\n", - " 'Sales': array([6], dtype=int64),\n", - " 'Executive Office': array([2], dtype=int64)}" - ] - }, - "execution_count": 26, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Department Info in dict\n", - "column_mapping(df, 'Department', 'DeptID')" - ] - }, - { - "cell_type": "markdown", - "id": "afa2a74a", - "metadata": {}, - "source": [ - "Just as before, `DeptID` does not map cleanly to `Department`, so we will drop the `DeptID` column. Although there is an improperly formatted value (`Production` with trailing whitespace), we can ignore it, as the column will be encoded later." 
- ] - }, - { - "cell_type": "code", - "execution_count": 27, - "id": "680be881", - "metadata": {}, - "outputs": [], - "source": [ - "# Drop DeptID column\n", - "df = df.drop(['DeptID'], axis=1)" - ] - }, - { - "cell_type": "code", - "execution_count": 28, - "id": "2edf7984", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "RangeIndex: 311 entries, 0 to 310\n", - "Data columns (total 28 columns):\n", - " # Column Non-Null Count Dtype \n", - "--- ------ -------------- ----- \n", - " 0 Employee_Name 311 non-null object \n", - " 1 EmpID 311 non-null int64 \n", - " 2 MarriedID 311 non-null int64 \n", - " 3 MaritalStatusID 311 non-null int64 \n", - " 4 GenderID 311 non-null int64 \n", - " 5 FromDiversityJobFairID 311 non-null int64 \n", - " 6 Salary 311 non-null int64 \n", - " 7 Termd 311 non-null int64 \n", - " 8 Position 311 non-null object \n", - " 9 State 311 non-null object \n", - " 10 Zip 311 non-null int64 \n", - " 11 DOB 311 non-null object \n", - " 12 CitizenDesc 311 non-null object \n", - " 13 HispanicLatino 311 non-null object \n", - " 14 RaceDesc 311 non-null object \n", - " 15 DateofHire 311 non-null object \n", - " 16 TermReason 311 non-null object \n", - " 17 EmploymentStatus 311 non-null object \n", - " 18 Department 311 non-null object \n", - " 19 ManagerName 311 non-null object \n", - " 20 RecruitmentSource 311 non-null object \n", - " 21 PerformanceScore 311 non-null object \n", - " 22 EngagementSurvey 311 non-null float64\n", - " 23 EmpSatisfaction 311 non-null int64 \n", - " 24 SpecialProjectsCount 311 non-null int64 \n", - " 25 LastPerformanceReview_Date 311 non-null object \n", - " 26 DaysLateLast30 311 non-null int64 \n", - " 27 Absences 311 non-null int64 \n", - "dtypes: float64(1), int64(12), object(15)\n", - "memory usage: 68.2+ KB\n" - ] - } - ], - "source": [ - "# Now to see the remaining columns\n", - "df.info()" - ] - }, - { - "cell_type": "code", - "execution_count": 29, - "id": 
"bdd0c9d8", - "metadata": {}, - "outputs": [], - "source": [ - "# Additional column pruning\n", - "df = df.drop(['LastPerformanceReview_Date', 'Zip', 'HispanicLatino', 'MarriedID'], axis=1)" - ] - }, - { - "cell_type": "markdown", - "id": "f44537ef", - "metadata": {}, - "source": [ - "# Feature Engineering Part III - Reformatting \n", - "Firstly, we will handle the date columns that are listed as object. Then once it's done, we can start to encode categorical columns using `LabelEncoder`.\n", - "\n", - "[Back To Top](#top)" - ] - }, - { - "cell_type": "code", - "execution_count": 47, - "id": "fb3e1d3b", - "metadata": {}, - "outputs": [], - "source": [ - "from sklearn.preprocessing import LabelEncoder" - ] - }, - { - "cell_type": "code", - "execution_count": 30, - "id": "6d5592f1", - "metadata": {}, - "outputs": [], - "source": [ - "# Change data type into datetime64\n", - "df['DOB'] = pd.to_datetime(df['DOB'])\n", - "df['DateofHire'] = pd.to_datetime(df['DateofHire'])" - ] - }, - { - "cell_type": "code", - "execution_count": 31, - "id": "14a6af68", - "metadata": {}, - "outputs": [], - "source": [ - "# Extract Year, Month, and Day information for 'DOB'\n", - "df['DOB_Year'] = df['DOB'].dt.year\n", - "df['DOB_Month'] = df['DOB'].dt.month\n", - "df['DOB_Day'] = df['DOB'].dt.day\n", - "\n", - "# Extract Year, Month, and Day information for 'DateofHire'\n", - "df['DateofHire_Year'] = df['DateofHire'].dt.year\n", - "df['DateofHire_Month'] = df['DateofHire'].dt.month\n", - "df['DateofHire_Day'] = df['DateofHire'].dt.day" - ] - }, - { - "cell_type": "code", - "execution_count": 32, - "id": "fbf38f59", - "metadata": {}, - "outputs": [], - "source": [ - "# Drop the original columns\n", - "df = df.drop(['DOB', 'DateofHire'], axis=1)" - ] - }, - { - "cell_type": "markdown", - "id": "5b8e98e7", - "metadata": {}, - "source": [ - "## Encoding" - ] - }, - { - "cell_type": "code", - "execution_count": 96, - "id": "445c535d", - "metadata": {}, - "outputs": [], - "source": [ - "# 
Encode Columns Function (or singular)\n", - "# This code starts the encoding from 1 not 0, and stores the result as int64.\n", - "# It also saves a {code: original label} mapping per column in a dictionary, for later lookup.\n", - "label_encoders = {}\n", - "\n", - "def encode_columns(df, column_names):\n", - "    # Work on a copy so the caller's DataFrame is not modified in place\n", - "    df = df.copy()\n", - "\n", - "    for column_name in column_names:\n", - "        # Check if the column exists in the DataFrame\n", - "        if column_name not in df.columns:\n", - "            print(f\"Column '{column_name}' not found in the DataFrame.\")\n", - "            continue\n", - "\n", - "        # Initialize a label encoder for the column\n", - "        label_encoder = LabelEncoder()\n", - "\n", - "        # Fit and transform the column with label encoding\n", - "        encoded_column = label_encoder.fit_transform(df[column_name]) + 1  # so that it starts from 1\n", - "\n", - "        # Convert the new column to int64\n", - "        df[column_name + '_E'] = encoded_column.astype('int64')\n", - "\n", - "        # Store the code -> label mapping (not the encoder itself) for later use\n", - "        mapping_dict = dict(zip(encoded_column, df[column_name]))\n", - "        sorted_mapping_dict = dict(sorted(mapping_dict.items()))  # Sort the dictionary by keys\n", - "        label_encoders[column_name] = sorted_mapping_dict\n", - "\n", - "        # Drop the original column\n", - "        df = df.drop([column_name], axis=1)\n", - "\n", - "    return df" - ] - }, - { - "cell_type": "code", - "execution_count": 98, - "id": "ae3ee6f6", - "metadata": {}, - "outputs": [], - "source": [ - "# Run the function for the selected columns\n", - "encoded_df = encode_columns(df, ['State', 'Position', 'CitizenDesc', 'RaceDesc',\n", - "                                 'TermReason', 'EmploymentStatus', 'Department',\n", - "                                 'ManagerName', 'RecruitmentSource', 'PerformanceScore'])" - ] - }, - { - "cell_type": "code", - "execution_count": 86, - "id": "d4f01607", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " Employee_Name EmpID MaritalStatusID GenderID \\\n", - "0 Adinolfi, Wilson K 10026 0 1 \n", - "1 Ait Sidi, Karthikeyan 10084 1 1 \n", - "2 Akinkuolie, Sarah 10196 1 0 \n", - "\n", - " FromDiversityJobFairID Salary Termd EngagementSurvey EmpSatisfaction \\\n", - "0 0 62506 0 4 5 \n", - "1 0 104437 1 4 3 \n", - "2 0 64955 1 3 3 \n", - "\n", - " SpecialProjectsCount ... State_E Position_E CitizenDesc_E RaceDesc_E \\\n", - "0 0 ... 11 23 3 6 \n", - "1 6 ... 11 31 3 6 \n", - "2 0 ... 11 24 3 6 \n", - "\n", - " TermReason_E EmploymentStatus_E Department_E ManagerName_E \\\n", - "0 4 1 4 18 \n", - "1 6 3 3 20 \n", - "2 8 3 4 16 \n", - "\n", - " RecruitmentSource_E PerformanceScore_E \n", - "0 6 1 \n", - "1 5 2 \n", - "2 6 2 \n", - "\n", - "[3 rows x 28 columns]" - ] - }, - "execution_count": 86, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "encoded_df.head(3)" - ] - }, - { - "cell_type": "code", - "execution_count": 87, - "id": "808f2723", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "RangeIndex: 311 entries, 0 to 310\n", - "Data columns (total 28 columns):\n", - " # Column Non-Null Count Dtype \n", - "--- ------ -------------- ----- \n", - " 0 Employee_Name 311 non-null object\n", - " 1 EmpID 311 non-null int64 \n", - " 2 MaritalStatusID 311 non-null int64 \n", - " 3 GenderID 311 non-null int64 \n", - " 4 FromDiversityJobFairID 311 non-null int64 \n", - " 5 Salary 311 non-null int64 \n", - " 6 Termd 311 non-null int64 \n", - " 7 EngagementSurvey 311 non-null int64 \n", - " 8 EmpSatisfaction 311 non-null int64 \n", - " 9 SpecialProjectsCount 311 non-null int64 \n", - " 10 DaysLateLast30 311 non-null int64 \n", - " 11 Absences 311 non-null int64 \n", - " 12 DOB_Year 311 non-null int64 \n", - " 13 DOB_Month 311 non-null int64 \n", - " 14 DOB_Day 311 non-null int64 \n", - " 15 DateofHire_Year 311 non-null int64 \n", - " 16 DateofHire_Month 311 non-null 
int64 \n", - " 17 DateofHire_Day 311 non-null int64 \n", - " 18 State_E 311 non-null int64 \n", - " 19 Position_E 311 non-null int64 \n", - " 20 CitizenDesc_E 311 non-null int64 \n", - " 21 RaceDesc_E 311 non-null int64 \n", - " 22 TermReason_E 311 non-null int64 \n", - " 23 EmploymentStatus_E 311 non-null int64 \n", - " 24 Department_E 311 non-null int64 \n", - " 25 ManagerName_E 311 non-null int64 \n", - " 26 RecruitmentSource_E 311 non-null int64 \n", - " 27 PerformanceScore_E 311 non-null int64 \n", - "dtypes: int64(27), object(1)\n", - "memory usage: 68.2+ KB\n" - ] - } - ], - "source": [ - "encoded_df.info()" - ] - }, - { - "cell_type": "code", - "execution_count": 37, - "id": "22f83be5", - "metadata": {}, - "outputs": [], - "source": [ - "# Optional, Saving the df\n", - "# encoded_df.to_csv('HRDataset_v14_Formatted.csv', index=False)" - ] - }, - { - "cell_type": "markdown", - "id": "7fc06848", - "metadata": {}, - "source": [ - "And that concludes our EDA! I am keeping the `Employee_Name` column for later use. The end result has no NaN values, and all data types are `int64` except `Employee_Name`.\n", - "\n", - "Right now, there are 3 dataframes that we can go forward with:\n", - "- `data_original`: the original dataframe.\n", - "- `df`: the dataframe that still contains the textual values, but whose columns have already been pruned.\n", - "- `encoded_df`: essentially the same as `df`, but with the categorical values already encoded, ready to be fed into ML models." - ] - }, - { - "cell_type": "markdown", - "id": "1c86088b", - "metadata": {}, - "source": [ - "# Answering The Questions \n", - "After going through the EDA, we will now explore the questions and, where possible, provide reasoning.\n", - "\n", - "[Back To Top](#top)" - ] - }, - { - "cell_type": "markdown", - "id": "412d0e1d", - "metadata": {}, - "source": [ - "## Is there any relationship between who a person works for and their performance score? 
\n", - "For this question, we will need mainly three columns from `encoded_df`:\n", - "- `ManagerName_E`\n", - "- `PerformanceScore_E`\n", - "- `EmpID`\n", - "\n", - "By having a dictionary previously made called `label_encoders`, we can use the **keys** contained inside it to show the textual values before the data were encoded.\n", - "\n", - "[Back To Top](#top)" - ] - }, - { - "cell_type": "code", - "execution_count": 104, - "id": "31e81f09", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "ManagerName_E\n", - "Debra Houlihan 2.333333\n", - "John Smith 2.285714\n", - "Lynn Daneault 2.153846\n", - "Peter Monroe 2.142857\n", - "Michael Albert 2.136364\n", - "Amy Dunn 2.095238\n", - "Brannon Miller 2.090909\n", - "Kissy Sullivan 2.045455\n", - "Webster Butler 2.000000\n", - "Board of Directors 2.000000\n", - "Brandon R. LeBlanc 2.000000\n", - "Brian Champaigne 2.000000\n", - "David Stanley 2.000000\n", - "Elijiah Gray 2.000000\n", - "Ketsia Liebig 1.952381\n", - "Kelley Spirea 1.909091\n", - "Janet King 1.894737\n", - "Alex Sweetwater 1.888889\n", - "Simon Roup 1.882353\n", - "Jennifer Zamora 1.857143\n", - "Eric Dougall 1.750000\n", - "Name: PerformanceScore_E, dtype: float64" - ] - }, - "execution_count": 104, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# The main code\n", - "# encoded_df.groupby('ManagerName_E')['PerformanceScore_E'].mean().sort_values(ascending=False)\n", - "\n", - "# We add information from previously made Dicts to make it a better contextual result\n", - "encoded_df.groupby('ManagerName_E')['PerformanceScore_E'].mean().sort_values(ascending=False).rename(index=label_encoders['ManagerName'])" - ] - }, - { - "cell_type": "code", - "execution_count": 105, - "id": "b456a397", - "metadata": {}, - "outputs": [], - "source": [ - "import seaborn as sns\n", - "import matplotlib.pyplot as plt" - ] - }, - { - "cell_type": "code", - "execution_count": 111, - "id": "65e9e8de", - "metadata": {}, - 
"outputs": [ - { - "data": { - "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYMAAAENCAYAAADt3gm6AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/YYfK9AAAACXBIWXMAAAsTAAALEwEAmpwYAAAdYUlEQVR4nO3debgcZZn+8e9NEghbEiRBtoSwRJiwBDAkbA6bIkEWUdQwAhIdIsoqosjIMgPoAM7IICAQhlV2BJkEEAYYGEVBCAiBsAz8cCGyo4Yfi2DgmT/e90wqTZ+cqj6n+qST+3NdfZ3anqrndHfVU/VWdZUiAjMzW7It1d8JmJlZ/3MxMDMzFwMzM3MxMDMzXAzMzAwY2N8JtGL48OExevTo/k7DzKyjPPDAA69ExIhm4zqyGIwePZqZM2f2dxpmZh1F0u+6G+dmIjMzczEwMzMXAzMzw8XAzMxwMTAzM1wMzMwMFwMzM8PFwMzMcDEwMzM69BfI1vnOvu83lWMOnrB2DZmYGfjIwMzMcDEwMzNcDMzMDJ8zMOtIPudifc1HBmZm5mJgZmYuBmZmhouBmZnhYmBmZrgYmJkZLgZmZoaLgZmZ4WJgZma4GJiZGS4GZmaG701kZh3I92bqey4GZrbEqVpMGgtJb+MXRW4mMjMzFwMzM3MxMDMzlvBzBp3c7ucTaGbWl3xkYGZmLgZmZuZiYGZmLOHnDMysf/ic16Kn1iMDSSMl3SnpcUmzJR3eZBpJ+oGkpyXNkrR5nTmZmdn71X1kMA/4ekQ8KGlF4AFJt0XEY4VpJgFj8msicE7+a2ZmbVLrkUFEPB8RD+bu/w88DqzRMNmewKWR3AsMk7RanXmZmdmC2nbOQNJoYDPgVw2j1gCeLfTPycOeb4ifCkwFGDVqVG15VtHJ9zfpbZttf7f59vfyzfpTHduOtlxNJGkF4DrgiIh4rXF0k5B434CIaRExPiLGjxgxoo40zcyWWLUXA0mDSIXg8oi4vskkc4CRhf41gefqzsvMzOar+2oiARcAj0fE97uZbDqwf76qaEtgbkQ83820ZmZWg7rPGWwD7Ac8IumhPOwfgFEAEXEucDOwK/A08CYwpeaczMysQa3FICLupvk5geI0ARxcZx5m1rd8An/x49tRmJmZi4GZmfneRGZLJDfzWCMfGZiZWWcfGXTyk8qss3nP2hY3HV0MzMw60aK4I+tmIjMz85GBWX9wM5MtanxkYGZmLgZmZuZiYGZmuBiYmRkuBmZmhouBmZnhYmBmZpQoBpK+Wej+TMO479aRlJmZtVeZH51NBk7L3ccA1xbG7UJ6cplZR/GPvswWVKaZSN10N+s3M7MOVKYYRDfdzfrNzKwDlWkmGifpNdJRwLK5m9w/uLbMzMysbXosBhExoMyMJK0UEX/qfUpmZtZufXlp6R19OC8zM2ujviwGPplsZtah+rIY+GSymVmH8i+QzczMzURmZlaxGEjaVtKU3D1CUvEnmTv1aWZmZtY2pYuBpBOAo0m3pAAYBFzWNT4i/ti3qZmZWbtUOTLYC9gDeAMgIp4DVqwjKTMza68qxeCdiAjyVUOSlq8nJTMza7cqxeAaSecBwyQdCNwOnF9PWmZm1k5l7k2EJAFXAxsArwHrA8dHxG015mZmZm1SqhhEREi6ISI+DLgAmJktZqo0E90raYvaMjEzs35T6sgg2wE4SNJvSVcUiXTQsEkdiZmZWftUKQaTasvCzMz6Velmooj4HTAM2D2/huVh3ZJ0oaSXJD3azfjtJc2V9FB+HV8hdzMz6yNVfoF8OHA5sEp+XSbp0B7CLgZ26WGan0fEpvl1Ytl8zMys71RpJvoSMDEi3gCQdCpwD3BmdwER8TNJo3uVoZmZ1a7K1UQC3i30v0vf3Kl0K0kPS/qppA27Xbg0VdJMSTNffvnlPlismZl1qXJkcBHwK0k/yf2fBC7o5fI
fBNaKiNcl7QrcAIxpNmFETAOmAYwfP94P0jEz60NVTiB/H5gC/BH4EzAlIv6tNwuPiNci4vXcfTMwSNLw3szTzMyqK31kIGlLYHZEPJj7V5Q0MSJ+1erCJa0KvJh/4TyBVJxebXV+ZmbWmirNROcAmxf632gybAGSrgS2B4ZLmgOcQHoOAhFxLrA38BVJ84C3gMn5zqhmZtZGVYqBihvqiHhP0kLjI2KfHsafBZxVIQczM6tBlauJnpF0mKRB+XU48ExdiZmZWftUKQYHAVsDf8ivicDUOpIyM7P2Kt1MFBEvAZNrzMXMzPpJj0cGkg6UNCZ3K99vaK6kWZK6PXlsZmado0wz0eHAb3P3PsA4YB3gSOCMetIyM7N2KlMM5kXEX3P3bsClEfFqRNwOLF9famZm1i5lisF7klaTNBjYCbi9MG7ZetIyM7N2KnMC+XhgJjAAmB4RswEkbYcvLTUzWyz0WAwi4kZJawErRsSfCqNmAp+rLTMzM2ubUr8ziIh5wNuSjpN0fh68OulWE2Zm1uGq/OjsIuBtYKvcPwc4uc8zMjOztqtSDNaNiNOAvwJExFv0zcNtzMysn1UpBu9IWhYIAEnrko4UzMysw1W5a+kJwC3ASEmXA9sAB9SRlJmZtVeVexPdJulBYEtS89DhEfFKbZmZmVnblG4mkrQX6dfIN0XEjcA8SZ+sLTMzM2ubKucMToiIuV09EfFnUtORmZl1uCrFoNm0Vc45mJnZIqpKMZgp6fuS1pW0jqTTgQfqSszMzNqnSjE4FHgHuBq4FvgLcHAdSZmZWXtVuZroDeBbNeZiZmb9pHQxkPQh4ChgdDEuInbs+7TMzKydqpwAvhY4F/h34N160jEzs/5QpRjMi4hzasvEzMz6TZUTyDMkfTU/9ewDXa/aMjMzs7apcmTwhfz3G4VhAazTd+mYmVl/qHI10dp1JmJmZv2n0i+IJW0EjAUGdw2LiEv7OikzM2uvKpeWnkB6zOVY4GZgEnA34GJgZtbhqpxA3hvYCXghIqYA44BlasnKzMzaqkoxeCsi3iPdunoI8BI+eWxmtliocs5gpqRhwPmkG9S9DtxXR1JmZtZeVa4m+mruPFfSLcCQiJhVT1pmZtZOVa8m2oTCvYkkrRcR19eQl5mZtVGVq4kuBDYBZgPv5cEBuBiYmXW4KkcGW0bE2CozzwVkN+CliNioyXgBZwC7Am8CB0TEg1WWYWZmvVflaqJ7JFUqBsDFwC4LGT8JGJNfUwHfCM/MrB9UOTK4hFQQXgDeBgRERGzSXUBE/EzS6IXMc0/g0ogI4F5JwyStFhHPV8jLzMx6qUoxuBDYD3iE+ecMemsN4NlC/5w87H3FQNJU0tEDo0aN6qPFm5kZVCsGv4+I6X28fDUZFs0mjIhpwDSA8ePHN53GzMxaU6UYPCHpCmAGqZkIgF5eWjoHGFnoXxN4rhfzMzOzFlQpBsuSisDOhWG9vbR0OnCIpKuAicBcny8wM2u/UsVA0gDglYj4Ro8TLxh3JelOp8MlzQFOAAYBRMS5pLuf7go8Tbq0dEqV+ZuZWd8oVQwi4l1Jm1edeUTs08P4AA6uOl8zM+tbVZqJHpI0HbgWeKNroG9HYWbW+aoUgw8ArwI7Fob5dhRmZouBKnctdXu+mdliqvTtKCStKeknkl6S9KKk6yStWWdyZmbWHlXuTXQR6VLQ1Um/Ep6Rh5mZWYerUgxGRMRFETEvvy4GRtSUl5mZtVGVYvCKpH0lDcivfUknlM3MrMNVKQZfBD4LvEC6kdzeeZiZmXW4Hq8mknRqRBwNTIyIPdqQk5mZtVmZI4NdJQ0Cjqk7GTMz6x9lfmdwC/AKsLyk18gPtWH+w22G1JifmZm1QY9HBhHxjYgYCtwUEUMiYsXi3zbkaGZmNSt1AjnftXT5mnMxM7N+UqoYRMS7wJuShtacj5mZ9YMqN6r7C/CIpNtY8K6lh/V5VmZm1lZVisFN+WVmZouZKnctvUTSssCoiHi
yxpzMzKzNqty1dHfgIdKlpkjaND/sxszMOlyV21H8IzAB+DNARDwErN3nGZmZWdtVKQbzImJuw7Doy2TMzKx/VDmB/KikvwMGSBoDHAb8sp60zMysnaocGRwKbAi8DVwBzAWOqCEnMzNrszJ3LR0MHASsBzwCbBUR8+pOzMzM2qfMkcElwHhSIZgE/EutGZmZWduVOWcwNiI2BpB0AXBfvSmZmVm7lTky+GtXh5uHzMwWT2WODMbl5xhAeobBssXnGvg21mZmna/HYhARA9qRiJmZ9Z8ql5aamdliysXAzMxcDMzMzMXAzMxwMTAzM1wMzMwMFwMzM6MNxUDSLpKelPS0pG81Gb+9pLmSHsqv4+vOyczMFlTleQaVSRoAnA18DJgD3C9pekQ81jDpzyNitzpzMTOz7tV9ZDABeDoinomId4CrgD1rXqaZmVVUdzFYA3i20D8nD2u0laSHJf1U0obNZiRpqqSZkma+/PLLdeRqZrbEqrsYqMmwxucmPwisFRHjgDOBG5rNKCKmRcT4iBg/YsSIvs3SzGwJV3cxmAOMLPSvCTxXnCAiXouI13P3zcAgScNrzsvMzArqLgb3A2MkrS1paWAyML04gaRVJSl3T8g5vVpzXmZmVlDr1UQRMU/SIcCtwADgwoiYLemgPP5cYG/gK5LmAW8BkyOisSnJzMxqVGsxgP9r+rm5Ydi5he6zgLPqzsPMzLrnXyCbmZmLgZmZuRiYmRkuBmZmhouBmZnhYmBmZrgYmJkZLgZmZoaLgZmZ4WJgZma4GJiZGS4GZmaGi4GZmeFiYGZmuBiYmRkuBmZmhouBmZnhYmBmZrgYmJkZLgZmZoaLgZmZ4WJgZma4GJiZGS4GZmaGi4GZmeFiYGZmuBiYmRkuBmZmhouBmZnhYmBmZrgYmJkZLgZmZoaLgZmZ4WJgZma4GJiZGS4GZmZGG4qBpF0kPSnpaUnfajJekn6Qx8+StHndOZmZ2YJqLQaSBgBnA5OAscA+ksY2TDYJGJNfU4Fz6szJzMzer+4jgwnA0xHxTES8A1wF7NkwzZ7ApZHcCwyTtFrNeZmZWYEior6ZS3sDu0TE3+f+/YCJEXFIYZobgVMi4u7cfwdwdETMbJjXVNKRA8D6wJMLWfRw4JVepO54x/dXfCfn7vhFP36tiBjRbMTAXiy0DDUZ1lh9ykxDREwDppVaqDQzIsaXmdbxjl+U4js5d8d3dnzdzURzgJGF/jWB51qYxszMalR3MbgfGCNpbUlLA5OB6Q3TTAf2z1cVbQnMjYjna87LzMwKam0mioh5kg4BbgUGABdGxGxJB+Xx5wI3A7sCTwNvAlP6YNGlmpMc7/hFML6Tc3d8B8fXegLZzMw6g3+BbGZmLgZmZuZisMiR1OxS23Ysd/lexq/aX7nb4qW33yN/D1uzWBWDfPuLVmPXkzRe0jItxm8oaTtJK7cQu23+QR4REa18mSXtLunwqnE5dk/gVEmrtBj/ceAnLHiJcJX4LSXtl/8u3UL8mPzZDejNd6Bhnv26QenEDaKkZXsZvyqkdaDF+DG9iW+YV1vff0kjJS3dtVMmqdK2uU8+74jo+BfwoUL3gBbidwNmAXcCVxbnVzJ+Uo6/AbgJWLVk3FLACsBs4DHgoOK4CsvfGXgI+FgL//t2wBOtxDYs+7fAGS3E75Hfu0uAHwNjKsZ/EngYuA44A/gqsHwLeUzM78UWhWGqED+klfevEL85sC0wocX4rYBdevE5TgL260X+Hwe+AQzuxfKvAdZrMf5jwMvAF1uM3xE4EDiwxfgJwDbA+KrfH+ATwKPAefk9WD8PL7UNyPFHAiv06jvYm+BF4ZU35G8CVxSGlS4IwNZ5Y7hZ7v8h6RLYsvHbA//TtRKT9pA/WvF/+CbwdeBS4GsVY7cGXiwsfyiwFrBcyfgjgaNy9+p5pZoIDC0R+1HSJcEbAoOA/wT+tkLuK5MuO94o918IfAZYpcxGJcf/FBib+79I+m3LscCKFfK
YBDxFuizvBuCCwrgeV2jgU6SCNLHsCtzkO/zr/PlfA3y5YvyuefmnkXZG9qiY/2DS733eAvZsIf9JefnbNxlXZvkTgd8DOzYZ1+P7SSqCD+X37x/KLrch/0eBo4C7gH0q5v+J/P9/N+dwXpl40t0XRgKP5O3IB/N24DlgwzL/P7AF8EZeD6fSi4LQUtCi8gKWB27Jb8LFwGWFcaUKAmljekChf0TeICxTMv5vgB1y96r5g7yBVOX3LvllOhL4N2An0pHJ94F/zl+Wnr4M65N+xb0naeN4J+m3G9eUWT5wGPOLwS/z8n8EXAas1EPsx4Gtc/cw4EzgK11f9BL/91DgZznPIcAzwAzgCuBketjDz/E/L25ESEcXZxRX6B7mMYB0A8X9cv8Q4G7gx4VpFrZCj87T35bnM77M/16I34x0ZDQu938GOL1C/ObATGCr3H8y6WhrlTL5F6Y5MOf/G+ALeViZDfHYHDM196+cv5Mbl10+sC/wndy9Omnjun9hfLd5kDaivwY+TFp3X6DC0RFpG3Ir8IncfwiwDyX38IHlSDskO+X+UcBLlNyhzN+/acAaXcshrZN/oEQLBbAD6eh8c9K6fzCFglDmM/y/actOuKi+8pdnBdINmn5MoSBU+DCGFLrXzF+uEXnYyhXm9W3g2Nw9Bbi6az49xK0LfCt3f510pHN2heWOI21I5+SVeinSXvKVwAd6iN2IdNO/q4Apedg6wLnAx0suf6n8d5e8Mm5cIfe9gQeAe4Hj8rAdScV9XIn4g0jFaz/gO6Qi9uWyK2Oex9E0NJGQisx5JWJHAdvl7uNJe9jjgYEN0zXdoJB2RorNg+sB95H2GMtsxCcAW+buD5B2Rmbk9+TMEvGD8t89SYXow6SjpFNJRXWhO1V5+h8Cf58//9vz9/62MsvP89iedKv7kcCDwCmkAndVidhdSTe/7Oo/hHSEObTkspfP37VPAJuSmjuvJu0YXVcy/hry0W0e9j1S0++/LiRuPdJe/cp5ed9sGP/NnNfgZt+DHL8JaYdo5TxsIvBf+T1YIQ9btvR6UHbCTnjlN/Y6ckEgVcsNKsQPJBWWO3L/50nPVyj9hjbM72Zg8xLTrQ5cRNqQP5U3KjOo0FxA2kM7uGHYLcCmJWJ3J+3dnVgYdj6wbwv/84nAMZQ4qinErJRXoN0Kw66j0NyxkNih+XO6iMIeNXAjC2nHZ8HzTPuSmglGFYZ17VyMLRE/tNB9XP7stsj9TQtjQ3zXjscA0p7mDObvoDQ9h9IQP4C0A3Aw8/fq1yTtKW7fU3zuXxu4MncfBbzDQnZIGpa/DXA68P9Ixbmr+eN24CMl4seRiui3gSMLw+8BDusmfv2G/q4dkgl5XmsVh/ew/COAa0lF+LTC8Pvo5gizIf4fSTtinyG1CJxF2qE6HxjWJLbrHOV/52n3IBWhYwrTjKabnZFC/F3A5Sx4FLYlqSBMJhWFH9GwY9LtZ1pmok565ZX4ItJ5gKeANVuYx8WkZpoHuluZm8Soof/TOb7syeQTSe2mu+f+HYCRvXgfupb/wRLTDgT2Jx1dfCm/ZgLrtrjcu6l4Ip/UbnsR6ZB3D9Ie4ugK8UsVuvcn7dk1bWZi/nmmqwrDTgKeZcGCcBWFvc4m8VcWhi1d6D6O1NR1Sl5pVymx/K6N2VKknYghpKOd6TQ01zVbfh6+TEP/BeRmvG7ii+fZVgJ+AHyWdDHDscCrwOdKvn8TgL2arEdblnz/DsrfvzPJxZW0dzylZPzAhv97xkK+K83yXy5/bz5aGHYasPdC4q8uDDs8v2enMv9o6z+A1RpiG89RTiM17a1OWv+PJe31H0BaBxs/+27PcTK/mWkk8Dzp+7xJ6XWoygrbKS/ga1Rsruh6M4GlSXs4v6filS15HsuQNqazKRw6logbCXy40F/5RGThf/hiXqE3rBi7Oekk2L9Wfe8a5nMNFTbkOWYYqa30v0ltuONaXHbX/97dHnnjeabiBuUk0onAL5P2Uh8
\n", - "text/plain": [ - "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], - "source": [ - "# Create the bar plot\n", - "sns.barplot(x='ManagerName_E', y='PerformanceScore_E', data=encoded_df, ci = 0, color = 'skyblue')\n", - "\n", - "# Rotate x-axis labels for better readability\n", - "plt.xticks(rotation=45)\n", - "\n", - "# Remove lines on top of each bar\n", - "ax = plt.gca()\n", - "\n", - "# Show the plot\n", - "plt.show()" - ] - }, - { - "cell_type": "markdown", - "id": "02466434", - "metadata": {}, - "source": [ - "Though there are highs and lows, visually, there are no apparent disparity between each Managers to performance score." - ] - }, - { - "cell_type": "code", - "execution_count": 114, - "id": "f3c0b2de", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
ManagerName_EPerformanceScore_EEmpID
ManagerName_E1.0000000.0029570.047830
PerformanceScore_E0.0029571.0000000.690614
EmpID0.0478300.6906141.000000
\n", - "
" - ], - "text/plain": [ - " ManagerName_E PerformanceScore_E EmpID\n", - "ManagerName_E 1.000000 0.002957 0.047830\n", - "PerformanceScore_E 0.002957 1.000000 0.690614\n", - "EmpID 0.047830 0.690614 1.000000" - ] - }, - "execution_count": 114, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Run correlation matrix\n", - "encoded_df[['ManagerName_E', 'PerformanceScore_E', 'EmpID']].corr()" - ] - }, - { - "cell_type": "markdown", - "id": "feed7c4d", - "metadata": {}, - "source": [ - "After running the correlation matrix, we can conclude that Managers play little role in Performance Score, and Employee is a bigger factor that contributes to performance score." - ] - }, - { - "cell_type": "markdown", - "id": "e2d6084d", - "metadata": {}, - "source": [ - "## What is the overall diversity profile of the organization?\n", - "\n", - "From the question, I assume that the geographical diversity is requested, and we should display the columns significant to show any geographical properties of the people working for the organization.\n", - "\n", - "[Back To Top](#top)" - ] - }, - { - "cell_type": "code", - "execution_count": 121, - "id": "7f66f810", - "metadata": {}, - "outputs": [ - { - "data": { - "image/png": "\n", - "text/plain": [ - "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], - "source": [ - "# Plotting different columns\n", - "def plot_bar(data, columns):\n", - " num_columns = len(columns)\n", - "\n", - " # Set up subplots based on the number of columns\n", - " fig, axes = plt.subplots(1, num_columns, figsize=(num_columns * 7, 6))\n", - "\n", - " # Iterate through columns and create bar plots with CI\n", - " for i, col in enumerate(columns):\n", - " ax = axes[i] if num_columns > 1 else axes # Handle single-column case\n", - "\n", - " sns.barplot(x=data.index, y=col, data=data, ci=None, ax=ax) # ci=None to disable confidence intervals\n", - " ax.set_title(f'Bar Plot for {col}')\n", - " ax.set_xlabel('Count')\n", - " ax.set_ylabel('Index')\n", - "\n", - " plt.tight_layout()\n", - " plt.show()\n", - " \n", - "plot_bar(df, ['CitizenDesc', 'State', 'RaceDesc'])" - ] - }, - { - "cell_type": "markdown", - "id": "fcb71ce0", - "metadata": {}, - "source": [ - "## Can we predict who is going to terminate and who isn't? What level of accuracy can we achieve on this? \n", - "Predicting who will be terminated or not requires numerical data, as we have encoded before. 
We will use `encoded_df`, the already encoded version of `df`, and pass it to the ML models.\n", - "[Back To Top](#top)" - ] - }, - { - "cell_type": "code", - "execution_count": 124, - "id": "a9166e89", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "{0: array(['N/A-StillEmployed'], dtype=object),\n", - " 1: array(['career change', 'hours', 'return to school', 'Another position',\n", - " 'unhappy', 'attendance', 'performance',\n", - " 'Learned that he is a gangster', 'retiring',\n", - " 'relocation out of area', 'more money', 'military',\n", - " 'no-call, no-show', 'Fatal attraction',\n", - " 'maternity leave - did not return', 'medical issues',\n", - " 'gross misconduct'], dtype=object)}" - ] - }, - "execution_count": 124, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Check that Termd and TermReason are consistent with one another\n", - "column_mapping(df, 'Termd', 'TermReason')" - ] - }, - { - "cell_type": "code", - "execution_count": 125, - "id": "9984ed6f", - "metadata": {}, - "outputs": [], - "source": [ - "# Importing sklearn libraries\n", - "from sklearn.model_selection import train_test_split\n", - "from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score" - ] - }, - { - "cell_type": "code", - "execution_count": 126, - "id": "537a1057", - "metadata": {}, - "outputs": [], - "source": [ - "# These are the models to be used to test the data. 
Feel free to adjust\n", - "from sklearn.ensemble import RandomForestClassifier\n", - "from sklearn.svm import SVC\n", - "from sklearn.neighbors import KNeighborsClassifier\n", - "from sklearn.linear_model import LogisticRegression\n", - "from sklearn.tree import DecisionTreeClassifier\n", - "from sklearn.naive_bayes import GaussianNB\n", - "from sklearn.ensemble import AdaBoostClassifier\n", - "from sklearn.ensemble import GradientBoostingClassifier" - ] - }, - { - "cell_type": "code", - "execution_count": 127, - "id": "639634f8", - "metadata": {}, - "outputs": [], - "source": [ - "# Split the data into features (X) and labels (y)\n", - "# NOTE: 'Termd' is not in the drop list below, so the target column is still part of X (label leakage)\n", - "X = encoded_df.drop(columns=['Employee_Name', 'EmpID', 'TermReason_E', 'EmploymentStatus_E'])\n", - "y = encoded_df['Termd']\n", - "\n", - "# Split the data into training and testing sets\n", - "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n", - "\n", - "# Define a dictionary to store results\n", - "results = {'Model': [], 'F1_score': [], 'Accuracy': [], 'Precision': [], 'Recall': []}" - ] - }, - { - "cell_type": "code", - "execution_count": 128, - "id": "57fb2722", - "metadata": {}, - "outputs": [], - "source": [ - "# Pass the models loaded in here. 
Again, adjust as necessary\n", - "models = {\n", - "    'Random Forest': RandomForestClassifier(),\n", - "    'Support Vector Machine': SVC(),\n", - "    'K-Nearest Neighbors': KNeighborsClassifier(),\n", - "    'Logistic Regression': LogisticRegression(),\n", - "    'Decision Tree': DecisionTreeClassifier(),\n", - "    'Naive Bayes': GaussianNB(),\n", - "    'AdaBoost': AdaBoostClassifier(),\n", - "    'Gradient Boosting': GradientBoostingClassifier()\n", - "}" - ] - }, - { - "cell_type": "code", - "execution_count": 129, - "id": "4d917064", - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "C:\\Users\\sang.yogi\\Anaconda3\\lib\\site-packages\\sklearn\\metrics\\_classification.py:1318: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.\n", - " _warn_prf(average, modifier, msg_start, len(result))\n" - ] - } - ], - "source": [ - "# Run the models one by one in a loop\n", - "for model_name, model in models.items():\n", - "    # Train the model\n", - "    model.fit(X_train, y_train)\n", - "\n", - "    # Make predictions\n", - "    y_pred = model.predict(X_test)\n", - "\n", - "    # Evaluate the model\n", - "    f1 = f1_score(y_test, y_pred)\n", - "    accuracy = accuracy_score(y_test, y_pred)\n", - "    precision = precision_score(y_test, y_pred)\n", - "    recall = recall_score(y_test, y_pred)\n", - "\n", - "    # Store results in the dictionary\n", - "    results['Model'].append(model_name)\n", - "    results['F1_score'].append(f1)\n", - "    results['Accuracy'].append(accuracy)\n", - "    results['Precision'].append(precision)\n", - "    results['Recall'].append(recall)\n", - "\n", - "# Create a DataFrame from the results dictionary\n", - "results_df = pd.DataFrame(results)" - ] - }, - { - "cell_type": "code", - "execution_count": 130, - "id": "78e97b44", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
ModelF1_scoreAccuracyPrecisionRecall
0Random Forest1.0000001.0000001.0000001.000000
1Support Vector Machine0.0000000.6507940.0000000.000000
2K-Nearest Neighbors0.3000000.5555560.3333330.272727
3Logistic Regression0.2222220.6666670.6000000.136364
4Decision Tree1.0000001.0000001.0000001.000000
5Naive Bayes0.6938780.7619050.6296300.772727
6AdaBoost1.0000001.0000001.0000001.000000
7Gradient Boosting1.0000001.0000001.0000001.000000
\n", - "
" - ], - "text/plain": [ - " Model F1_score Accuracy Precision Recall\n", - "0 Random Forest 1.000000 1.000000 1.000000 1.000000\n", - "1 Support Vector Machine 0.000000 0.650794 0.000000 0.000000\n", - "2 K-Nearest Neighbors 0.300000 0.555556 0.333333 0.272727\n", - "3 Logistic Regression 0.222222 0.666667 0.600000 0.136364\n", - "4 Decision Tree 1.000000 1.000000 1.000000 1.000000\n", - "5 Naive Bayes 0.693878 0.761905 0.629630 0.772727\n", - "6 AdaBoost 1.000000 1.000000 1.000000 1.000000\n", - "7 Gradient Boosting 1.000000 1.000000 1.000000 1.000000" - ] - }, - "execution_count": 130, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Print out results\n", - "results_df" - ] - }, - { - "cell_type": "markdown", - "id": "07e39640", - "metadata": {}, - "source": [ - "With examination of the numerical results, it appears there might be overfitting or issues with the outcomes. This suspicion is highlighted by the majority showing \"very accurate\" results, contrasted with a few models such as `K-Nearest`, `Logistic Regression`, and `Naive Bayes` exhibiting poor accuracy. Considering the relatively small size of the dataset (311 rows), we will tentatively accept these findings for the time being." - ] - }, - { - "cell_type": "markdown", - "id": "1cad2a37", - "metadata": {}, - "source": [ - "## Are there areas of the company where pay is not equitable? 
\n", - "\n", - "For this question, I decided to use 4 columns from `encoded_df` (numerical data):\n", - "- `Position_E`\n", - "- `Department_E`\n", - "- `GenderID`\n", - "- `RaceDesc_E`\n", - "\n", - "[Back To Top](#top)\n", - "\n", - "These columns is deemed represent _\"areas of the company\"_, and will be analyzed against the column `Salary` to see any disparities from visual observation.\n", - "### Visual Observation of Disparity" - ] - }, - { - "cell_type": "code", - "execution_count": 148, - "id": "0bae43a9", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "Department_E\n", - "Executive Office 250,000\n", - "IT/IS 97,065\n", - "Software Engineering 94,989\n", - "Admin Offices 71,792\n", - "Sales 69,061\n", - "Production 59,954\n", - "Name: Salary, dtype: object" - ] - }, - "execution_count": 148, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Group Department with Salary, then use previously made Dictionary to map it into contextual view (also format is changed)\n", - "encoded_df.groupby('Department_E')['Salary'].mean().sort_values(ascending=False).rename(index=label_encoders['Department']).map('{:,.0f}'.format)" - ] - }, - { - "cell_type": "code", - "execution_count": 149, - "id": "a9f79bfa", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "Position_E\n", - "President & CEO 250,000\n", - "CIO 220,450\n", - "Director of Sales 180,000\n", - "IT Director 178,000\n", - "Director of Operations 170,500\n", - "IT Manager - Infra 157,000\n", - "Data Architect 150,290\n", - "IT Manager - DB 144,960\n", - "IT Manager - Support 138,888\n", - "Principal Data Architect 120,000\n", - "BI Director 110,929\n", - "Database Administrator 108,500\n", - "Enterprise Architect 103,613\n", - "Sr. Accountant 102,859\n", - "Sr. DBA 102,234\n", - "Software Engineer 96,719\n", - "BI Developer 95,465\n", - "Sr. 
Network Engineer 93,071\n", - "Shared Services Manager 93,046\n", - "Data Analyst 89,933\n", - "Data Analyst 88,527\n", - "Senior BI Developer 84,803\n", - "Software Engineering Manager 77,692\n", - "Production Manager 75,294\n", - "Sales Manager 69,240\n", - "Area Sales Manager 64,933\n", - "Production Technician II 64,892\n", - "IT Support 63,684\n", - "Accountant I 63,508\n", - "Network Engineer 61,605\n", - "Production Technician I 55,524\n", - "Administrative Assistant 52,280\n", - "Name: Salary, dtype: object" - ] - }, - "execution_count": 149, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Group Position with Salary, then use previously made Dictionary to map it into contextual view (also format is changed)\n", - "encoded_df.groupby('Position_E')['Salary'].mean().sort_values(ascending=False).rename(index=label_encoders['Position']).map('{:,.0f}'.format)" - ] - }, - { - "cell_type": "code", - "execution_count": 150, - "id": "61dab711", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "GenderID\n", - "1 70,629\n", - "0 67,787\n", - "Name: Salary, dtype: object" - ] - }, - "execution_count": 150, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Group Gender with Salary (GenderID is used as-is, so no Dictionary mapping is applied; only the format is changed)\n", - "encoded_df.groupby('GenderID')['Salary'].mean().sort_values(ascending=False).map('{:,.0f}'.format)" - ] - }, - { - "cell_type": "code", - "execution_count": 151, - "id": "fa82a7c0", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "RaceDesc_E\n", - "Hispanic 83,667\n", - "Black or African American 74,431\n", - "Asian 68,521\n", - "White 67,288\n", - "American Indian or Alaska Native 65,806\n", - "Two or more races 59,998\n", - "Name: Salary, dtype: object" - ] - }, - "execution_count": 151, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Group Race Description with 
Salary, then use previously made Dictionary to map it into contextual view (also format is changed)\n", - "encoded_df.groupby('RaceDesc_E')['Salary'].mean().sort_values(ascending=False).rename(index=label_encoders['RaceDesc']).map('{:,.0f}'.format)" - ] - }, - { - "cell_type": "code", - "execution_count": 152, - "id": "f3cd423b", - "metadata": { - "scrolled": false - }, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
RaceDesc_EGenderIDPosition_EDepartment_ESalary
RaceDesc_E1.0000000.0311010.053618-0.000252-0.089597
GenderID0.0311011.000000-0.0938120.0022710.056097
Position_E0.053618-0.0938121.0000000.096064-0.184032
Department_E-0.0002520.0022710.0960641.000000-0.198331
Salary-0.0895970.056097-0.184032-0.1983311.000000
\n", - "
" - ], - "text/plain": [ - " RaceDesc_E GenderID Position_E Department_E Salary\n", - "RaceDesc_E 1.000000 0.031101 0.053618 -0.000252 -0.089597\n", - "GenderID 0.031101 1.000000 -0.093812 0.002271 0.056097\n", - "Position_E 0.053618 -0.093812 1.000000 0.096064 -0.184032\n", - "Department_E -0.000252 0.002271 0.096064 1.000000 -0.198331\n", - "Salary -0.089597 0.056097 -0.184032 -0.198331 1.000000" - ] - }, - "execution_count": 152, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Try and run correlation matrix\n", - "encoded_df[['RaceDesc_E', 'GenderID', 'Position_E', 'Department_E', 'Salary']].corr()" - ] - }, - { - "cell_type": "markdown", - "id": "d5995919", - "metadata": {}, - "source": [ - "Based on findings from above, there are no apparent disparities of Salary in different groups of each columns. And also there are weak correlation in the matrix above, indicating that they could be playing a small role in `Salary` numbers.\n", - "\n", - "Then, I decided to run ANOVA test for those four columns against `Salary`." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "beb80b93", - "metadata": {}, - "outputs": [], - "source": [ - "from scipy.stats import f_oneway" - ] - }, - { - "cell_type": "code", - "execution_count": 163, - "id": "85fed2c0", - "metadata": {}, - "outputs": [], - "source": [ - "# Making a function to run ANOVA with singular or multiple columns\n", - "def test_anova(df, group_columns, value_column):\n", - "    \"\"\"\n", - "    Parameters:\n", - "    - df: Pandas DataFrame\n", - "    - group_columns: Column names to be grouped\n", - "    - value_column: Column containing salary information (or other numerical column)\n", - "    \"\"\"\n", - "    for gro in group_columns:\n", - "        # Group by group_column and extract salary groups\n", - "        groups = [group[value_column] for name, group in df.groupby(gro)]\n", - "        # Perform one-way ANOVA\n", - "        f_statistic, p_value = f_oneway(*groups)\n", - "        print(f\"Group Column: {gro} with {value_column}\")\n", - "        print(f\"F-statistic: {f_statistic}\\nP-value: {p_value}\")\n", - "        # Print hyphen for separation between outputs\n", - "        print(\"-\" * 30)\n" - ] - }, - { - "cell_type": "code", - "execution_count": 164, - "id": "0d0acd5d", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Group Column: RaceDesc_E with Salary\n", - "F-statistic: 1.2863499291564826\n", - "P-value: 0.2695646450406796\n", - "------------------------------\n", - "Group Column: GenderID with Salary\n", - "F-statistic: 0.9754391883261777\n", - "P-value: 0.3241001178974803\n", - "------------------------------\n", - "Group Column: Position_E with Salary\n", - "F-statistic: 153.84548177486272\n", - "P-value: 1.3432601515294733e-156\n", - "------------------------------\n", - "Group Column: Department_E with Salary\n", - "F-statistic: 59.34834401921235\n", - "P-value: 4.966770445882221e-43\n", - "------------------------------\n" - ] - } - ], - "source": [ - "# Run function for selected columns, against Salary\n", 
- "test_anova(encoded_df, ['RaceDesc_E', 'GenderID', 'Position_E', 'Department_E'], 'Salary')" - ] - }, - { - "cell_type": "markdown", - "id": "60aeac44", - "metadata": {}, - "source": [ - "Assuming that the significance level is 0.05, we can have **Null hypothesis (H0)** and **Alternative Hypothesis (H1)** as follows:\n", - "- `H0`: There are no significant differences in `Salary` among different groups in `[selected column(s)]`.\n", - "- `H1`: There are significant differences in `Salary` among different groups in `[selected column(s)]`.\n", - "\n", - "Based on ANOVA test above, we can conclude that:\n", - "- `RaceDesc`: With a p-value of 0.26 and a significance level of 0.05, fail to reject the null hypothesis. There is not enough evidence to suggest that there are significant differences in salary among different racial groups.\n", - "- `Gender`: Similarly, with a p-value of 0.32, we fail to reject the null hypothesis for gender. There is not enough evidence to suggest that there are significant differences in salary based on gender.\n", - "- `Position`: The p-value of 1.34 is unusually high and might indicate a potential issue. P-values should typically be between 0 and 1. This result may suggest a problem with the statistical analysis or data.\n", - "- `Department`: With a p-value of 4.96, again, there seems to be an issue. Similar to the position, this result is not within the standard range of 0 to 1 for p-values.\n", - "\n", - "To my interpretation, the absurdly high p-values could be coming from factors such as:\n", - "- As the number of groups increases, the likelihood of finding a significant result by chance (Type I error) also increases. This is known as the problem of multiple comparisons.\n", - "- With a larger number of groups, we need a larger sample size to achieve the same level of statistical power (ability to detect a true effect if it exists). 
In this case, we only have 311 rows of data.\n", - "- Having many groups can make it challenging to interpret the overall pattern of differences, especially if there is no clear hypothesis about which specific groups are expected to differ." - ] - }, - { - "cell_type": "markdown", - "id": "7cccbeac", - "metadata": {}, - "source": [ - "## What are our best recruiting sources if we want to ensure a diverse organization \n", - "\n", - "[Back To Top](#top)\n", - "\n", - "For this question, I decided to use information from these columns from `df` (textual data):\n", - "- `RecruitmentSource`\n", - "- `FromDiversityJobFairID`\n", - "- `GenderID`\n", - "- `RaceDesc`\n", - "\n", - "Assuming that being 'diverse' is a cultural standpoint, these columns will be analyzed to see recruitment profiles. (Feel free to change it)" - ] - }, - { - "cell_type": "code", - "execution_count": 166, - "id": "a0d2e060", - "metadata": { - "scrolled": true - }, - "outputs": [ - { - "data": { - "text/plain": [ - "Indeed 87\n", - "LinkedIn 76\n", - "Google Search 49\n", - "Employee Referral 31\n", - "Diversity Job Fair 29\n", - "CareerBuilder 23\n", - "Website 13\n", - "Other 2\n", - "On-line Web application 1\n", - "Name: RecruitmentSource, dtype: int64" - ] - }, - "execution_count": 166, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df['RecruitmentSource'].value_counts()" - ] - }, - { - "cell_type": "code", - "execution_count": 169, - "id": "1b4a55a4", - "metadata": {}, - "outputs": [ - { - "data": { - "image/png": "\n", - "text/plain": [ - "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - }, - { - "data": { - "image/png": "\n", - "text/plain": [ - "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], - "source": [ - "# Plot the distribution of Gender by Recruitment Source\n", - "plt.figure(figsize=(12, 6))\n", - "sns.countplot(x='RecruitmentSource', hue='GenderID', data=df)\n", - "plt.xticks(rotation=45, ha='right')\n", - "plt.title('Distribution of Gender by Recruitment Source')\n", - "plt.xlabel('Recruitment Source')\n", - "plt.ylabel('Count')\n", - "plt.legend(title='Gender', bbox_to_anchor=(1.05, 1), loc='upper left')\n", - "plt.show()\n", - "\n", - "# Plot the distribution of Race by Recruitment Source\n", - "plt.figure(figsize=(12, 6))\n", - "sns.countplot(x='RecruitmentSource', hue='RaceDesc', data=df)\n", - "plt.xticks(rotation=45, ha='right')\n", - "plt.title('Distribution of Race by Recruitment Source')\n", - "plt.xlabel('Recruitment Source')\n", - "plt.ylabel('Count')\n", - "plt.legend(title='Race', bbox_to_anchor=(1.05, 1), loc='upper left')\n", - "plt.show()" - ] - }, - { - "cell_type": "markdown", - "id": "f120c42d", - "metadata": {}, - "source": [ - "From the two graphs we can already see that certain Recruitment Sources have contributed to more diverse organization, notably `LinkedIn`, `Indeed`, and `Google Search`. And based on `GenderID`, we know that `LinkedIn` and `Indeed` is the highest sources of gaining new people." - ] - }, - { - "cell_type": "markdown", - "id": "4680ecdf", - "metadata": {}, - "source": [ - "Thank you for taking the time to view this notebook! Hope this inspires you." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "61d25079", - "metadata": {}, - "outputs": [], - "source": [] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.9.12" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -}