Data Science help request?

Final project

Lending Club is a peer-to-peer lending company that connects borrowers with investors through an online platform. It serves people who need personal loans ranging from $1,000 to $40,000. Borrowers receive the full amount of their loan minus an origination fee that is paid to the company. Investors purchase notes secured by personal loans and pay Lending Club a service fee. Lending Club provides data on all loans originated through its platform during specified periods.

For this project, data on loans granted through Lending Club from 2007 to 2011 were used. Each loan includes information on whether it was ultimately repaid (Fully Paid or Charged Off in the loan_status column). Your task is to build a classification model that uses this data to predict whether a prospective borrower will repay the loan. The data set includes a file describing all variables and the "FICO Score ranged.pdf" file, which explains in detail the meaning of one of the columns.

The individual stages of analysis required to complete the project and their scoring are presented below:

Data Processing (70 points) – as an experienced Data Scientist, you probably know the individual steps that need to be performed at this stage, so we will not detail them here.

EDA, i.e. extensive data exploration (100 points) – describe the conclusions drawn from each graph and support your hypotheses with statistical tests such as the t-test or the chi-square test. Additionally, answer the following questions:

How does a FICO score relate to a borrower's likelihood of repaying a loan?

How does credit age relate to the probability of default, and is this risk independent of or related to the FICO score?

How does home mortgage status relate to the likelihood of default?

How is annual income related to the probability of default?

How is employment history related to the likelihood of default?

How is the size of the loan requested related to the probability of default?
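As a rough illustration of the statistical tests mentioned above, here is a minimal sketch in Python. It runs on synthetic data, not the Lending Club file. The column names `fico` and `home_ownership` are assumptions made for the example; only `loan_status` and its two values come from the task description.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
n = 1000

# Synthetic stand-in for the real data set. The columns "fico" and
# "home_ownership" are assumed names, not taken from the task description.
df = pd.DataFrame({
    "loan_status": rng.choice(["Fully Paid", "Charged Off"], size=n, p=[0.85, 0.15]),
    "home_ownership": rng.choice(["RENT", "MORTGAGE", "OWN"], size=n),
    "fico": rng.normal(700, 35, size=n),
})

# Welch's t-test: does the mean FICO score differ between the two outcomes?
paid = df.loc[df["loan_status"] == "Fully Paid", "fico"]
charged = df.loc[df["loan_status"] == "Charged Off", "fico"]
t_stat, t_p = stats.ttest_ind(paid, charged, equal_var=False)

# Chi-square test of independence: is home ownership related to the outcome?
table = pd.crosstab(df["home_ownership"], df["loan_status"])
chi2, chi_p, dof, expected = stats.chi2_contingency(table)

print(f"t-test: t={t_stat:.3f}, p={t_p:.3f}")
print(f"chi-square: chi2={chi2:.3f}, dof={dof}, p={chi_p:.3f}")
```

Because the synthetic columns are generated independently, the p-values here carry no meaning. On the real data, a numeric variable would go through the t-test branch and a categorical one through the chi-square branch.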

Feature Engineering – create 20 new variables (60 points)
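To make the feature engineering step more concrete, the sketch below derives four ratio and date-based variables from a tiny hand-made table. All column names (`loan_amnt`, `annual_inc`, `installment`, `fico_range_low`, `fico_range_high`, `earliest_cr_line`, `issue_d`) are assumptions for illustration and should be checked against the variable description file.

```python
import pandas as pd

# Tiny illustrative table; column names and values are assumptions.
df = pd.DataFrame({
    "loan_amnt": [5000, 12000, 20000],
    "annual_inc": [40000.0, 60000.0, 100000.0],
    "installment": [160.0, 400.0, 650.0],
    "fico_range_low": [660, 700, 740],
    "fico_range_high": [664, 704, 744],
    "earliest_cr_line": pd.to_datetime(["2000-01-01", "1995-06-01", "2005-03-01"]),
    "issue_d": pd.to_datetime(["2010-01-01", "2010-06-01", "2011-03-01"]),
})

features = df.assign(
    # Loan size relative to yearly income.
    loan_to_income=df["loan_amnt"] / df["annual_inc"],
    # Monthly payment relative to monthly income.
    installment_to_income=df["installment"] / (df["annual_inc"] / 12),
    # Midpoint of the reported FICO range.
    fico_mean=(df["fico_range_low"] + df["fico_range_high"]) / 2,
    # Credit age in years at the time the loan was issued.
    credit_age_years=(df["issue_d"] - df["earliest_cr_line"]).dt.days / 365.25,
)

print(features[["loan_to_income", "installment_to_income", "fico_mean", "credit_age_years"]])
```

The same pattern (ratios, midpoints, date differences) can be extended toward the 20 variables the task asks for.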

Modeling (150 points)

Cluster the data using at least three different methods and check whether any borrower segments emerge. Use appropriate methods to determine the optimal number of clusters. (40 points)
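One possible way to approach the clustering task is sketched below: three methods (k-means, agglomerative clustering, and a Gaussian mixture) are compared by silhouette score over a range of cluster counts. The data is synthetic (`make_blobs`), and the silhouette score is just one of several criteria that could be used, so this is an assumption-laden illustration, not a prescribed solution.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled borrower features.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)
X = StandardScaler().fit_transform(X)


def cluster_labels(method, k, data):
    """Return cluster labels for one of three clustering methods."""
    if method == "kmeans":
        return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)
    if method == "agglomerative":
        return AgglomerativeClustering(n_clusters=k).fit_predict(data)
    return GaussianMixture(n_components=k, random_state=0).fit_predict(data)


# Silhouette score for every (method, k) pair; higher means tighter,
# better separated clusters.
scores = {
    (method, k): silhouette_score(X, cluster_labels(method, k, X))
    for method in ("kmeans", "agglomerative", "gmm")
    for k in range(2, 7)
}

best_method, best_k = max(scores, key=scores.get)
print(f"best by silhouette: {best_method} with k={best_k}")
```

The elbow method on k-means inertia or BIC for the Gaussian mixture would be reasonable alternatives for choosing the number of clusters.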

Train 5 different models, using a different algorithm for each, and then compare their performance, using the AUROC score as the model quality metric. (50 points)
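A minimal sketch of such a comparison, assuming scikit-learn and a synthetic imbalanced data set in place of the real loans, could look like this. The five algorithms chosen here are only examples.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the processed loan data.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5,
    weights=[0.85, 0.15], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Five different algorithms; scaling is added where the method is scale-sensitive.
models = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=15)),
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

auroc = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # AUROC needs scores or probabilities, not hard class labels.
    auroc[name] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

for name, score in sorted(auroc.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```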

Apply the previously used methods to data compressed with PCA and compare the results (AUROC score) with those of the models trained in the previous section. (20 points)
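The PCA comparison could be sketched as follows, again on synthetic data. The choice of logistic regression and of keeping 95% of the variance are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the processed loan data.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5,
    weights=[0.85, 0.15], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Baseline: scaling followed by logistic regression.
baseline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Same model, but on principal components explaining 95% of the variance.
with_pca = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.95)),
    ("clf", LogisticRegression(max_iter=1000)),
])

results = {}
for name, model in {"baseline": baseline, "pca": with_pca}.items():
    model.fit(X_train, y_train)
    results[name] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

n_components = with_pca.named_steps["pca"].n_components_
print(f"components kept: {n_components}")
print(results)
```

Fitting PCA inside the pipeline keeps the projection from seeing the test set, which makes the two AUROC values comparable.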

Build a final model with an AUROC score of at least 0.80 (80%). Remember to select important variables, cross-validate, and tune the model parameters; also consider balancing the classes. (40 points)
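The ingredients listed for the final model (variable selection, cross-validation, parameter tuning, class balancing) can be combined in a single scikit-learn pipeline. The sketch below is one possible arrangement on synthetic data. The estimator, the parameter grid, and the use of `class_weight="balanced"` are assumptions, and nothing here says whether the 0.80 threshold is reachable on the real loans.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the processed loan data.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5,
    weights=[0.85, 0.15], random_state=0,
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    # Variable selection by univariate F-test; k is tuned below.
    ("select", SelectKBest(score_func=f_classif)),
    # class_weight="balanced" reweights the minority class instead of resampling.
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
])

param_grid = {
    "select__k": [5, 10, 20],
    "clf__C": [0.1, 1.0, 10.0],
}

# Stratified folds keep the class ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, scoring="roc_auc", cv=cv)
search.fit(X, y)

print("best parameters:", search.best_params_)
print(f"cross-validated AUROC: {search.best_score_:.3f}")
```

Keeping selection and scaling inside the pipeline means both are refit on each training fold, so the cross-validated AUROC is not inflated by leakage.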

There are 380 points up for grabs in total. A minimum of 300 points is required to pass the project.

Good luck!

Ph.D. dissertation question?

Way of #Python libraries import?

Python experts and specialists?

Online editor?

Are there any examples (PDF) of a feasibility study on pike-factory construction (pike Esox lucius spawning grounds restoration)?

SQL task help request?

Conspect help request?

Scope of a Ph.D. dissertation?

Asking questions about appropriate books?

How to learn more about SPSS and its Application?

Handling Missing Data and Building a Predictive Model with Incomplete Information?

Has anyone applied Python in the field of textile engineering for data analysis, automation, or smart textiles?

Request Python code?

Why does everyone use vs code?

Hello Everyone ! I'm looking for a good journal to publish my manuscript with low publication cost?

Is Galaxy.org good to use for research for analyzing data and for publication?

Do experts have journals in the field of artificial intelligence and big data that are not indexed by SCI or EI?

How can I do multivariate time series forecasting using MLP, ANFIS and LSTM?

What possible strategies can be used to analyze data under a sequential explanatory mixed-methods approach?