Embarking on a journey towards a career in data science and analytics can be both exhilarating and daunting. As you prepare for your interview at Citi (Citibank), one of the world’s leading financial institutions, it’s crucial to equip yourself with the knowledge and insights that will set you apart from the competition. In this blog, we’ll delve into the intricacies of data science and analytics interview questions and answers specifically tailored for Citi, offering you a comprehensive guide to navigate through the interview process with confidence.
Technical Interview Questions
Question: Explain regression models and their applications.
Answer: Regression models are statistical methods used to understand the relationship between a dependent variable and one or more independent variables. They are most often applied to predict continuous outcomes, such as sales forecasts, stock prices, or housing prices. Common types include linear regression for linear relationships, polynomial regression for non-linear patterns, and logistic regression, which despite its name models the probability of a binary outcome and is used for classification. These models help in making predictions and understanding the impact of each variable on the target outcome.
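To make the idea concrete, here is a minimal sketch of simple linear regression fitted by ordinary least squares in plain Python. The function name and the spend/sales numbers are made up for illustration; in practice you would more likely use a library such as scikit-learn or statsmodels.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (intercept, slope) that minimize squared error for y ~ intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); intercept follows from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data generated exactly from y = 2x + 1 (spend vs. sales).
spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

intercept, slope = fit_simple_linear_regression(spend, sales)
forecast = intercept + slope * 5.0  # predicted sales for a spend of 5.0
```

The slope is the part interviewers usually ask you to interpret: it is the expected change in the outcome for a one-unit change in the predictor.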
Question: How do you join tables in SQL?
Answer: To join tables in SQL, you use the JOIN clause to combine rows from two or more tables based on a related column between them. Here are common types of joins:
INNER JOIN: Returns rows when there is a match in both tables based on the join condition.
SELECT * FROM table1 INNER JOIN table2 ON table1.column = table2.column;
LEFT JOIN (or LEFT OUTER JOIN): Returns all rows from the left table and the matched rows from the right table. Where there is no match, the right table's columns are NULL.
SELECT * FROM table1 LEFT JOIN table2 ON table1.column = table2.column;
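If you want to try both joins yourself, the sketch below runs them against an in-memory SQLite database using Python's built-in sqlite3 module. The customers and accounts tables and their contents are hypothetical, chosen so that one customer has no account.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE accounts (id INTEGER PRIMARY KEY, customer_id INTEGER, balance REAL);
INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ben'), (3, 'Chen');
INSERT INTO accounts VALUES (10, 1, 250.0), (11, 1, 40.0), (12, 2, 900.0);
""")

# INNER JOIN: only customers that have at least one account.
inner_rows = conn.execute("""
    SELECT c.name, a.balance
    FROM customers AS c
    INNER JOIN accounts AS a ON c.id = a.customer_id
    ORDER BY c.id, a.id
""").fetchall()

# LEFT JOIN: every customer; balance is NULL (None in Python) when no account matches.
left_rows = conn.execute("""
    SELECT c.name, a.balance
    FROM customers AS c
    LEFT JOIN accounts AS a ON c.id = a.customer_id
    ORDER BY c.id, a.id
""").fetchall()

conn.close()
```

Here the inner join drops Chen, who has no account, while the left join keeps Chen with a missing balance.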
Question: Explain PCA.
Answer: PCA (Principal Component Analysis) is a dimensionality reduction technique used in data science and machine learning. It aims to reduce the dimensionality of a dataset by finding a new set of orthogonal (uncorrelated) variables called principal components. These components are ordered by the amount of variance they explain in the original data, with the first component explaining the most variance.
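As a rough illustration, PCA can be sketched with NumPy by centering the data and taking a singular value decomposition. The pca helper and the synthetic three-column dataset below are assumptions made for this example, not a reference implementation.

```python
import numpy as np


def pca(X, n_components):
    """Project X onto its top principal components using SVD of the centered data."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing singular value.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    explained_variance = (S ** 2) / (X.shape[0] - 1)
    explained_ratio = explained_variance[:n_components] / explained_variance.sum()
    scores = X_centered @ components.T
    return scores, components, explained_ratio


# Synthetic data: column 2 is almost a multiple of column 1, column 3 is small noise.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([
    t,
    2 * t + 0.1 * rng.normal(size=200),
    0.1 * rng.normal(size=200),
])

scores, components, ratio = pca(X, n_components=2)
```

Because two of the three columns move together, the first component should capture most of the variance, which is the situation where PCA is most useful.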
Question: What is R-squared?
Answer: R-squared (R²) is a statistical measure that represents the proportion of variance in the dependent variable that is explained by the independent variables in a regression model. It is a measure of the goodness of fit of the model to the data, indicating how well the model fits the observed data points.
Question: How is R-squared different from Adjusted R-squared?
Answer: R-squared (R²):
R² measures the proportion of variance in the dependent variable that is explained by the independent variables in the model.
For a least squares model with an intercept, evaluated on its training data, it ranges from 0 to 1, where 0 indicates that the model does not explain any variance, and 1 indicates a perfect fit.
R² never decreases as more independent variables are added to the model, even if they do not meaningfully improve the model’s predictive power.
Adjusted R-squared:
Adjusted R² is a modified version of R² that adjusts for the number of predictors (independent variables) in the model.
It penalizes the addition of unnecessary variables that do not improve the model significantly.
Adjusted R² takes into account the model’s degrees of freedom and provides a more accurate measure of the model’s goodness of fit, especially for models with multiple predictors.
It can be negative if the model is worse than a simple average.
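The two measures are easy to compute by hand. The sketch below uses the standard formulas R² = 1 - SS_res / SS_tot and Adjusted R² = 1 - (1 - R²)(n - 1) / (n - p - 1), where n is the number of observations and p the number of predictors. The function names and the sample values are illustrative only.

```python
def r_squared(y_true, y_pred):
    """Proportion of variance in y_true explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n_samples, n_predictors):
    """Penalize R² for the number of predictors used."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)


# Hypothetical observed values and model predictions.
y_true = [3.0, 5.0, 7.0, 9.0, 11.0]
y_pred = [2.8, 5.3, 6.9, 9.4, 10.6]

r2 = r_squared(y_true, y_pred)
adj_one_predictor = adjusted_r_squared(r2, n_samples=5, n_predictors=1)
adj_three_predictors = adjusted_r_squared(r2, n_samples=5, n_predictors=3)

# A weak fit with many predictors can push the adjusted value below zero.
adj_weak_fit = adjusted_r_squared(0.1, n_samples=5, n_predictors=3)
```

With the same R², claiming three predictors instead of one lowers the adjusted value, which is exactly the penalty described above.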
Question: Why is a transformer better than an LSTM?
Answer:
Attention Mechanism:
Transformers utilize self-attention mechanisms that allow them to capture global dependencies in sequences more effectively than LSTMs. This is especially beneficial for tasks requiring long-range context understanding.
Scalability:
Transformers can scale to handle larger datasets and more complex tasks by simply increasing model size. This scalability is advantageous for tasks with vast amounts of data, such as language modeling or large-scale translation.
Ease of Training:
Transformers are often easier to train than LSTMs, especially on long sequences. Because they process all positions in parallel rather than step by step, they avoid backpropagating through many recurrent time steps, which makes them less prone to vanishing gradients and better at retaining long-term dependencies.
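The self-attention idea behind these points can be sketched in a few lines of NumPy. This is a single-head, unmasked scaled dot-product attention with random weights, written only to show that every position attends to every other position in one matrix operation; the shapes and names are assumptions for the example, and a real transformer adds multiple heads, positional information, and learned parameters.

```python
import numpy as np


def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over a sequence X (seq_len x d_model)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # Similarity of every position with every other position, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Numerically stable softmax over each row.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights


rng = np.random.default_rng(0)
seq_len, d_model = 5, 8
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

output, weights = self_attention(X, W_q, W_k, W_v)
```

Note that there is no loop over time steps: the first and last tokens interact directly through the weights matrix, whereas an LSTM would have to carry that information through every intermediate step.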
Question: What are variable reducing techniques?
Answer:
Linear Discriminant Analysis (LDA):
- Finds linear combinations to separate classes.
- Maximizes class separation while reducing dimensionality.
- Useful for classification tasks to enhance performance.
Feature Selection:
- Selects relevant features based on statistical measures.
- Includes filter, wrapper, and embedded methods.
- Reduces variables while improving model performance.
Autoencoder:
- Neural network for unsupervised learning.
- Learns compressed representation of input data.
- Reduces dimensionality while preserving patterns.
Question: Explain the cost function and loss functions.
Answer:
Cost Function:
The cost function evaluates the overall performance of a machine learning model by quantifying the disparity between predicted and actual values across the entire dataset. It serves as a benchmark to assess how well the model is learning from the training data. In optimization, the goal is to minimize the cost function, adjusting model parameters to improve predictive accuracy.
Loss Functions:
Loss functions are specific metrics tailored to different machine learning tasks, measuring the model’s prediction error for individual data points. They play a crucial role in model training, guiding the optimization process to minimize the average loss across all training examples. Different tasks require different loss functions, such as Mean Squared Error (MSE) for regression and Cross-Entropy for classification, to effectively capture the model’s performance characteristics.
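The distinction is easy to see in code: a loss function scores one prediction, and the cost is the average of those losses over the dataset. The sketch below uses squared error and binary cross-entropy; the helper names and the small sample lists are invented for illustration.

```python
import math


def squared_error(y_true, y_pred):
    """Loss for a single regression prediction."""
    return (y_true - y_pred) ** 2


def binary_cross_entropy(y_true, p, eps=1e-12):
    """Loss for a single predicted probability p of the positive class."""
    p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


def cost(loss_fn, ys, preds):
    """Cost = average per-example loss over the whole dataset."""
    return sum(loss_fn(y, p) for y, p in zip(ys, preds)) / len(ys)


mse = cost(squared_error, [3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
log_loss = cost(binary_cross_entropy, [1, 0, 1], [0.9, 0.2, 0.6])
```

Here mse is the mean of the three squared errors 1, 0, and 4, and log_loss is the mean cross-entropy of the three classification predictions.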
Question: What are the different clustering algorithms?
Answer:
- K-Means Clustering
- Hierarchical Clustering
- DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
- Mean Shift Clustering
- Gaussian Mixture Models (GMM)
- Agglomerative Clustering
- Affinity Propagation
- Spectral Clustering
Question: Explain L1 L2 regularization.
Answer: L1 regularization, the penalty used in Lasso regression, adds the sum of the absolute values of the coefficients to the loss function. It encourages sparsity in the model, effectively performing feature selection by driving some coefficients to exactly zero. This makes L1 regularization well suited to creating simpler and more interpretable models, reducing overfitting by limiting the model’s complexity.
On the other hand, L2 regularization, the penalty used in Ridge regression, adds the sum of the squared coefficients to the loss function. It controls overfitting by shrinking coefficients toward zero and spreading weight more evenly across correlated predictors, without driving them exactly to zero. L2 regularization helps create more stable models, particularly when predictors are correlated, improving the model’s generalization ability and robustness against noise in the data.
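A small sketch can show why L1 produces exact zeros while L2 only shrinks. It assumes the simplified textbook case of orthonormal features, where the Lasso solution for the objective ½‖y − Xw‖² + λ‖w‖₁ is a soft-threshold of each unpenalized coefficient and the Ridge solution for ½‖y − Xw‖² + (λ/2)‖w‖² divides each coefficient by (1 + λ). The function names and coefficient values are hypothetical.

```python
def l1_penalty(coefs, lam):
    """Lasso penalty: lambda times the sum of absolute coefficients."""
    return lam * sum(abs(w) for w in coefs)


def l2_penalty(coefs, lam):
    """Ridge penalty: lambda times the sum of squared coefficients."""
    return lam * sum(w * w for w in coefs)


def lasso_shrink(w, lam):
    """Soft-thresholding: Lasso solution per coefficient, assuming orthonormal features."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0


def ridge_shrink(w, lam):
    """Ridge solution per coefficient under the same orthonormal assumption."""
    return w / (1 + lam)


# Hypothetical unpenalized coefficients: one strong feature and two weak ones.
ols_coefs = [3.0, -0.4, 0.2]
lam = 0.5

lasso_coefs = [lasso_shrink(w, lam) for w in ols_coefs]
ridge_coefs = [ridge_shrink(w, lam) for w in ols_coefs]
```

In this example the Lasso sets the two weak coefficients to exactly zero and keeps the strong one, while Ridge keeps all three but makes each smaller.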
Question: What are variable-reducing techniques?
Answer:
- Principal Component Analysis (PCA)
- Feature Selection
- Linear Discriminant Analysis (LDA)
- Autoencoder
- t-distributed Stochastic Neighbor Embedding (t-SNE)
Question: How do you check for multicollinearity in logistic regression?
Answer:
Correlation Matrix:
- Examine correlations among independent variables.
- High correlations (absolute value above roughly 0.7) suggest multicollinearity.
Variance Inflation Factor (VIF):
- Calculate VIF for each variable.
- VIF values above 10 are a common rule of thumb for problematic multicollinearity; some practitioners use a stricter cutoff of 5.
Tolerance:
- Low tolerance values (< 0.1) indicate multicollinearity.
- Tolerance is the reciprocal of VIF (1/VIF).
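VIF depends only on the predictors, so the check is the same for logistic and linear regression: regress each predictor on the others and compute 1 / (1 − R²). The NumPy sketch below does this on synthetic data where one column is nearly a copy of another; the function name and data are assumptions for illustration (statsmodels also ships a ready-made VIF function).

```python
import numpy as np


def variance_inflation_factors(X):
    """VIF for each column of X: 1 / (1 - R²) from regressing it on the other columns."""
    n, p = X.shape
    vifs = []
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])  # include an intercept
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r2 = 1 - residuals.var() / target.var()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)


# Synthetic predictors: x2 is almost identical to x1, x3 is independent.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + 0.1 * rng.normal(size=500)
x3 = rng.normal(size=500)

vifs = variance_inflation_factors(np.column_stack([x1, x2, x3]))
tolerance = 1 / vifs
```

The first two columns should show very large VIFs (and tiny tolerances), while the independent third column should stay close to 1.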
Question: Difference between bagging and boosting.
Answer:
Bagging (Bootstrap Aggregating):
- Trains multiple base models independently on bootstrapped subsets.
- Models are trained in parallel.
- Reduces variance by averaging predictions from diverse models.
Boosting:
- Trains a series of weak learners sequentially, correcting errors of predecessors.
- Models are trained sequentially, focusing on misclassified instances.
- Improves accuracy by giving more weight to difficult instances in the dataset.
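The contrast between independent and sequential training can be sketched in plain Python with one-split regression "stumps" as the weak learner. Note the swap: the boosting half below is gradient boosting on residuals for squared error, not the instance-reweighting (AdaBoost-style) variant described in the bullets. All function names and the toy data are made up for this example.

```python
import random


def fit_stump(xs, ys):
    """Best single-threshold regressor: returns (threshold, left_mean, right_mean)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lm, rm)
    if best is None:  # every x identical: fall back to predicting the mean
        mean_y = sum(ys) / len(ys)
        return (xs[0], mean_y, mean_y)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def bagging(xs, ys, n_models, rng):
    """Fit each stump independently on a bootstrap sample, then average predictions."""
    stumps = []
    for _ in range(n_models):
        idx = [rng.randrange(len(xs)) for _ in xs]
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(predict_stump(s, x) for s in stumps) / len(stumps)


def boosting(xs, ys, n_rounds, learning_rate=0.5):
    """Fit stumps one after another, each on the residuals left by its predecessors."""
    base = sum(ys) / len(ys)
    residuals = [y - base for y in ys]
    stumps = []
    for _ in range(n_rounds):
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        residuals = [r - learning_rate * predict_stump(stump, x)
                     for x, r in zip(xs, residuals)]
    return lambda x: base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy data: y = x² on 20 evenly spaced points.
xs = [i / 2 for i in range(20)]
ys = [x * x for x in xs]

bagged = bagging(xs, ys, n_models=25, rng=random.Random(0))
boosted_1 = boosting(xs, ys, n_rounds=1)
boosted_20 = boosting(xs, ys, n_rounds=20)
```

The bagging loop could run in parallel because no stump depends on another, whereas each boosting round needs the residuals from the previous one.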
Question: Explain the logistic regression process.
Answer:
Model Building:
- Logistic regression models the probability of a binary outcome.
- Estimates coefficients to describe the relationship between input variables and the log-odds of the outcome.
Training:
- Iteratively adjusts coefficients using maximum likelihood estimation.
- Minimizes the logistic loss function to find the best-fitting model.
Prediction:
- Predicts the probability of the outcome using the logistic function.
- Thresholds the probabilities to classify observations into the binary outcome categories.
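The three stages above can be sketched in plain Python for a single input variable. Gradient descent on the average logistic loss is used here as one simple way to approximate the maximum likelihood fit (libraries typically use faster solvers). The function names, learning rate, and the small dataset are assumptions for illustration.

```python
import math


def sigmoid(z):
    """Logistic function: maps log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, n_iter=20000):
    """Estimate weight and bias by gradient descent on the average logistic loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_iter):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def predict_proba(w, b, x):
    return sigmoid(w * x + b)


def predict_label(w, b, x, threshold=0.5):
    return 1 if predict_proba(w, b, x) >= threshold else 0


# Hypothetical data: larger x makes the positive class more likely, with some overlap.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = fit_logistic(xs, ys)
p_low = predict_proba(w, b, 1)
p_high = predict_proba(w, b, 8)
```

The fitted w is the change in log-odds per unit of x, and the 0.5 threshold in predict_label is a default that is often tuned to the business cost of false positives and false negatives.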
Question: Explain Gini coefficient.
Answer:
The Gini coefficient measures the inequality among values in a dataset.
It ranges from 0 to 1, where 0 indicates perfect equality and 1 indicates perfect inequality.
Application:
It is commonly used in economics to measure income distribution.
In machine learning, the related but distinct Gini impurity is used in decision tree algorithms to evaluate the purity of a node.
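Since the two measures are easy to confuse in an interview, here is a minimal sketch of both. The Gini coefficient is computed from the mean absolute difference between all pairs of values, and Gini impurity from class proportions in a node. The function names and sample inputs are illustrative.

```python
from collections import Counter


def gini_coefficient(values):
    """Inequality of non-negative values: 0 means everyone has the same amount."""
    n = len(values)
    total = sum(values)
    abs_diff_sum = sum(abs(a - b) for a in values for b in values)
    return abs_diff_sum / (2 * n * total)


def gini_impurity(labels):
    """Chance of mislabeling a random item if labeled by the node's class proportions."""
    n = len(labels)
    counts = Counter(labels)
    return 1 - sum((c / n) ** 2 for c in counts.values())


equal_incomes = gini_coefficient([10, 10, 10, 10])
one_person_has_all = gini_coefficient([0, 0, 0, 100])

pure_node = gini_impurity(["default", "default", "default"])
mixed_node = gini_impurity(["default", "default", "paid", "paid"])
```

With only four values, the most unequal case gives 0.75 rather than 1; the coefficient approaches 1 only as the population grows.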
Question: Difference between CHAID and CART.
Answer:
Splitting Approach:
- CHAID uses chi-squared tests for categorical data.
- CART uses impurity measures (Gini or MSE) for both numerical and categorical data.
Tree Structure:
- CHAID creates multiway trees, while CART creates binary trees.
Variable Types:
- CHAID favors categorical variables.
- CART handles both numerical and categorical variables.
Interpretation:
- CHAID may be easier to interpret with clear branches.
- CART may require more effort for interpretation, especially with deeper trees.
Question: How do you check for outliers in a variable?
Answer: Visual Inspection:
Use box plots, histograms, or scatter plots to spot data points far from the main cluster.
Descriptive Statistics:
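Compare each value with the mean or median. A common rule flags values more than 1.5 * IQR below the first quartile (Q1) or above the third quartile (Q3).
Z-Score or Modified Z-Score:
Calculate Z-scores for each data point and flag those whose absolute value exceeds a threshold (e.g., 2 or 3). The modified Z-score uses the median and the median absolute deviation instead of the mean and standard deviation, so it is less distorted by the outliers themselves.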
Box Plot Method:
Check for points outside the “whiskers” of a box plot, usually 1.5 * IQR from quartiles.
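The IQR and Z-score checks can be sketched with Python's standard statistics module. The function names and the sample values (with one obviously extreme point) are made up, and the quartile method is an assumption: different tools compute quartiles slightly differently, so boundaries can shift a little.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Flag values more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


def z_score_outliers(values, threshold=3.0):
    """Flag values whose absolute Z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) / sd > threshold]


# Hypothetical measurements with one extreme value.
values = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 95]

flagged_by_iqr = iqr_outliers(values)
flagged_by_z = z_score_outliers(values)
```

Both methods flag 95 here, but only just in the Z-score case: the extreme value inflates the mean and standard deviation it is compared against, which is why the IQR rule or a modified Z-score is often preferred on small samples.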
Technical Interview Topics
- Guesstimate and SQL questions
- Statistics
- Machine Learning
- Deep Learning
- Programming in Python
- NLP and searching algorithms.
- Questions (gradient boosting
Behavioral Interview Questions
Question: Why would you like to work here?
Question: ML questions based on the projects I have worked on.
Question: Why did you apply for this position?
Question: Can you talk a bit about your projects and the challenges you faced while working on them?
Conclusion
As you conclude this journey through data science and analytics interview questions at Citi (Citibank), one thing becomes clear: preparation is the key to success. Armed with a deeper understanding of core concepts such as regression, regularization, dimensionality reduction, and decision trees, you are well-equipped to showcase your expertise and problem-solving skills in the interview room.
Remember, beyond the technical knowledge lies the ability to communicate your ideas effectively and demonstrate a keen understanding of real-world applications. Use this blog as a springboard to dive into further exploration, refine your skills, and stay updated with the latest trends in the ever-evolving landscape of data science.