As one of the leading pharmaceutical companies in the world, AstraZeneca offers exciting opportunities for data scientists and analytics professionals. If you’re gearing up for an interview at AstraZeneca in the field of data science or analytics, it’s crucial to be well-prepared for the rigorous selection process. To help you ace your interview, let’s delve into some common questions and their answers that you might encounter.
Technical Interview Questions
Question: What machine learning models do you know?
Answer:
- Linear Regression: Used for predicting continuous values based on input features.
- Logistic Regression: Suitable for binary classification tasks, predicting probabilities of outcomes.
- Decision Trees: Non-linear models that make decisions based on feature splits.
- Random Forest: Ensemble method combining multiple decision trees for improved performance.
- Support Vector Machines (SVM): Effective for both classification and regression tasks, finding optimal decision boundaries.
- K-Nearest Neighbors (KNN): Instance-based method using similarity of data points for prediction.
- Naive Bayes: Probability-based method often used in text classification and spam detection.
- Neural Networks: Deep learning models with layers of interconnected neurons, powerful for complex tasks like image recognition and natural language processing.
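For illustration, here is a minimal scikit-learn sketch that fits several of the models above on a synthetic dataset; the dataset shape and hyperparameters are arbitrary choices for the example, not a recommended setup.
# Sketch: fit several common classifiers on a synthetic dataset and compare accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "SVM (RBF kernel)": SVC(kernel="rbf"),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
    "Naive Bayes": GaussianNB(),
}

for name, model in models.items():
    model.fit(X_train, y_train)                           # train on the training split
    print(f"{name}: {model.score(X_test, y_test):.3f}")   # accuracy on the held-out split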
Question: Do you know SVM?
Answer: Support Vector Machines (SVM) is a supervised learning algorithm that aims to find the optimal hyperplane separating data points into different classes. It maximizes the margin, i.e., the distance between the hyperplane and the nearest data points (the support vectors). SVM can handle both linear and non-linear separation using kernel functions such as linear, polynomial, RBF, and sigmoid. Its effectiveness lies in its ability to generalize well and handle high-dimensional data.
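As a quick illustration, a minimal scikit-learn sketch comparing the kernels mentioned above via cross-validation; the dataset and pipeline are arbitrary example choices.
# Sketch: compare SVM kernels with 5-fold cross-validation on a toy dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = make_pipeline(StandardScaler(), SVC(kernel=kernel))  # feature scaling helps SVMs
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{kernel}: mean accuracy {scores.mean():.3f}")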
Question: Do you know Random Forest?
Answer: Random Forest is an ensemble learning algorithm that constructs multiple decision trees during training. It improves accuracy and reduces overfitting by averaging the predictions of these trees for regression tasks or selecting the mode prediction for classification tasks. Random Forest also handles missing values well and provides insights into feature importance.
Question: How do you perform feature selection?
Answer: Feature selection involves identifying the most relevant features from a dataset to improve model performance and reduce complexity. Techniques include filter methods (based on statistical measures like correlation), wrapper methods (using performance metrics of the model), and embedded methods (incorporated into the model training process). The chosen method depends on the dataset size, computational resources, and desired model interpretability.
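A minimal scikit-learn sketch of all three families; the choice of estimators and the number of selected features are illustrative only.
# Sketch: filter, wrapper, and embedded feature selection on one dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif, RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Filter: rank features by a univariate statistic (ANOVA F-score).
filter_sel = SelectKBest(score_func=f_classif, k=10).fit(X, y)

# Wrapper: recursively eliminate features based on the model itself.
wrapper_sel = RFE(LogisticRegression(max_iter=5000), n_features_to_select=10).fit(X, y)

# Embedded: selection happens inside training via an L1 penalty.
embedded_sel = SelectFromModel(
    LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
).fit(X, y)

print(filter_sel.get_support().sum(),
      wrapper_sel.get_support().sum(),
      embedded_sel.get_support().sum())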
Question: Explain CNN and RNN.
Answer: CNN (Convolutional Neural Network):
Purpose: For image processing and computer vision.
Architecture: Uses convolutional layers to extract features.
Key Features: Learns spatial patterns, used in image tasks.
RNN (Recurrent Neural Network):
Purpose: For sequence data and tasks with dependencies.
Architecture: Has recurrent connections for memory.
Key Features: Handles time series, used in language tasks.
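For concreteness, a minimal sketch of both architectures using PyTorch (one common deep learning framework); the layer sizes and input shapes are arbitrary assumptions chosen only to show the building blocks.
# Sketch: a tiny CNN for images and a tiny RNN (LSTM) for sequences.
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # learn local spatial patterns
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(16 * 16 * 16, num_classes)  # assumes 32x32 inputs

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

class TinyRNN(nn.Module):
    def __init__(self, vocab_size=1000, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 32)
        self.lstm = nn.LSTM(32, 64, batch_first=True)  # recurrent connections carry memory
        self.out = nn.Linear(64, num_classes)

    def forward(self, tokens):
        _, (h, _) = self.lstm(self.embed(tokens))  # h holds the final hidden state
        return self.out(h[-1])

print(TinyCNN()(torch.randn(4, 3, 32, 32)).shape)         # torch.Size([4, 10])
print(TinyRNN()(torch.randint(0, 1000, (4, 20))).shape)   # torch.Size([4, 2])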
Question: What is Regularization?
Answer: Regularization is a technique in machine learning to prevent overfitting by adding a penalty term to the loss function. It discourages overly complex models by penalizing large coefficients or weights. L1 regularization (Lasso) and L2 regularization (Ridge) are common methods used to control model complexity and improve generalization performance.
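A minimal scikit-learn sketch contrasting an unregularized fit with L2 (Ridge) and L1 (Lasso); the alpha values and synthetic data are arbitrary for the example.
# Sketch: Ridge (L2) shrinks coefficients; Lasso (L1) can drive some to exactly zero.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso

X, y = make_regression(n_samples=100, n_features=20, n_informative=5, noise=10, random_state=0)

for name, model in [("OLS", LinearRegression()),
                    ("Ridge", Ridge(alpha=10.0)),
                    ("Lasso", Lasso(alpha=1.0))]:
    model.fit(X, y)
    coefs = model.coef_
    print(f"{name}: max |coef| = {np.abs(coefs).max():.1f}, zero coefs = {(coefs == 0).sum()}")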
Question: Explain SVM.
Answer: Support Vector Machine (SVM) is a supervised learning algorithm for classification and regression tasks. It finds the optimal hyperplane that separates classes in feature space, maximizing the margin between classes. SVM can handle linear and non-linear separation using kernel tricks, making it effective in high-dimensional spaces for robust classification.
Question: What is the bias-variance trade-off?
Answer:
- Bias: Error from an overly simple model that oversimplifies the patterns in the data (underfitting).
- Variance: Error from an overly complex model that is sensitive to noise or random fluctuations in the training data (overfitting).
- Trade-off: Reducing bias by adding complexity tends to increase variance, and vice versa.
- Goal: Find the level of complexity that balances the two and minimizes total error, improving the model's performance on unseen data.
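A small sketch of the trade-off using polynomial degree as the complexity knob; the degrees and synthetic data are illustrative assumptions only.
# Sketch: a low degree underfits (high bias), a high degree overfits (high variance).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=100)

for degree in [1, 4, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    train_score = model.fit(X, y).score(X, y)              # fit quality on the training data
    cv_score = cross_val_score(model, X, y, cv=5).mean()   # estimate of generalization
    print(f"degree {degree}: train R2 = {train_score:.2f}, CV R2 = {cv_score:.2f}")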
Question: What is Batch normalization?
Answer: Batch normalization is a technique used in deep learning to normalize the inputs of each layer. It improves the training speed and stability of deep neural networks by reducing internal covariate shift. In essence, it standardizes the inputs to have zero mean and unit variance, helping the network learn more effectively by ensuring that each layer receives normalized inputs. This technique also acts as a form of regularization, reducing the need for other methods like dropout and improving the overall performance and convergence speed of the network.
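As a minimal illustration, a PyTorch sketch inserting batch normalization between layers; the layer dimensions and batch size are arbitrary assumptions.
# Sketch: BatchNorm1d normalizes each feature over the batch before the next layer.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(20, 64),
    nn.BatchNorm1d(64),  # zero-mean, unit-variance activations, then a learnable scale and shift
    nn.ReLU(),
    nn.Linear(64, 2),
)

x = torch.randn(32, 20)   # batch of 32 samples, 20 features
print(model(x).shape)     # torch.Size([32, 2])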
Question: What are PCA and PLS?
Answer:
PCA (Principal Component Analysis):
- Dimensionality reduction technique for high-dimensional data.
- Identifies orthogonal components capturing maximum variance.
- Used for data visualization, noise reduction, and feature extraction.
PLS (Partial Least Squares):
- Statistical method for modeling relationships in regression.
- Finds latent variables that explain the covariance between the predictors and the response.
- Useful for handling multicollinearity, noise, and high-dimensional data in predictive modeling.
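A brief scikit-learn sketch of both techniques; the two-component setting and dataset are arbitrary example choices.
# Sketch: PCA for unsupervised dimensionality reduction, PLS for supervised regression.
from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression

X, y = load_diabetes(return_X_y=True)

pca = PCA(n_components=2).fit(X)
print("PCA explained variance ratio:", pca.explained_variance_ratio_)

pls = PLSRegression(n_components=2).fit(X, y)
print("PLS R^2 on training data:", pls.score(X, y))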
Python and SQL Interview Questions
Question: What is Pandas in Python and why is it used?
Answer: Pandas is a library for data manipulation and analysis. It provides DataFrame structures to work with tabular data, making tasks like cleaning, transforming, and analyzing data easier.
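A tiny example of typical DataFrame operations; the column names and values are made up for illustration.
# Sketch: build a DataFrame, filter rows, and aggregate by group.
import pandas as pd

df = pd.DataFrame({
    "compound": ["A", "A", "B", "B"],
    "dose_mg": [10, 20, 10, 20],
    "response": [0.4, 0.9, 0.3, 0.7],
})

high_dose = df[df["dose_mg"] >= 20]                          # filtering rows
mean_response = df.groupby("compound")["response"].mean()    # aggregating by group
print(high_dose, mean_response, sep="\n")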
Question: Explain the use of NumPy in Python for data science.
Answer: NumPy is a fundamental library for numerical computations. It provides support for arrays and matrices, along with mathematical functions, crucial for tasks like linear algebra and statistical operations.
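A short example of vectorized array math; the values are arbitrary.
# Sketch: element-wise operations, basic statistics, and linear algebra with NumPy.
import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([10.0, 20.0])

print(a * 2)             # element-wise arithmetic
print(a.mean(axis=0))    # column means
print(a @ b)             # matrix-vector product
print(np.linalg.inv(a))  # matrix inverse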
Question: How do you handle missing values in a Pandas DataFrame?
Answer: Missing values can be handled using methods like fillna() to replace NaN values, dropna() to remove rows with NaNs, or interpolation to fill missing values based on nearby data points.
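A short example of the three approaches on a made-up column.
# Sketch: replace, drop, or interpolate missing values in a DataFrame.
import numpy as np
import pandas as pd

df = pd.DataFrame({"value": [1.0, np.nan, 3.0, np.nan, 5.0]})

print(df.fillna(df["value"].mean()))   # replace NaN with the column mean
print(df.dropna())                     # drop rows containing NaN
print(df.interpolate())                # fill NaN from neighboring points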
Question: Write a SQL query to find the total revenue from a Sales table.
Answer:
SELECT SUM(Revenue) AS TotalRevenue FROM Sales;
Question: Explain the difference between INNER JOIN and LEFT JOIN in SQL.
Answer: INNER JOIN returns rows when there is at least one match in both tables, while LEFT JOIN returns all rows from the left table and matching rows from the right table.
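A small example against hypothetical Customers and Orders tables.
-- INNER JOIN: only customers that have at least one order.
SELECT c.CustomerID, o.OrderID
FROM Customers c
INNER JOIN Orders o ON o.CustomerID = c.CustomerID;

-- LEFT JOIN: every customer; OrderID is NULL where no order exists.
SELECT c.CustomerID, o.OrderID
FROM Customers c
LEFT JOIN Orders o ON o.CustomerID = c.CustomerID;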
Question: How do you remove duplicate rows from a table in SQL?
Answer:
DELETE FROM table_name
WHERE id IN (
    SELECT id FROM (
        SELECT id,
               ROW_NUMBER() OVER (PARTITION BY column_name ORDER BY id) AS rnum
        FROM table_name
    ) t
    WHERE t.rnum > 1
);
Question: Write a SQL query to calculate the average salary from an Employee table.
Answer:
SELECT AVG(Salary) AS AverageSalary FROM Employee;
Statistics and ML Interview Questions
Question: What is the Central Limit Theorem (CLT)?
Answer: The CLT states that the sampling distribution of the sample mean approaches a normal distribution as the sample size increases, regardless of the shape of the population distribution (provided the population has finite variance).
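A quick NumPy simulation of the idea; the exponential population and sample sizes are arbitrary choices for the example.
# Sketch: means of larger samples from a skewed population concentrate around the true mean.
import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=2.0, size=100_000)   # skewed, non-normal population

for n in [2, 30, 200]:
    sample_means = rng.choice(population, size=(10_000, n)).mean(axis=1)
    # spread of the sample mean shrinks roughly as population std / sqrt(n)
    print(f"n={n}: mean={sample_means.mean():.2f}, std={sample_means.std():.3f}")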
Question: Explain the difference between Type I and Type II errors.
Answer: Type I error (false positive) occurs when a true null hypothesis is rejected. Type II error (false negative) occurs when a false null hypothesis is not rejected.
Question: What is the p-value in hypothesis testing?
Answer: The p-value is the probability of observing a test statistic at least as extreme as the one calculated from the sample data, assuming the null hypothesis is true.
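A small SciPy example of computing a p-value with a one-sample t-test; the data and the hypothesized mean are made up.
# Sketch: test whether the sample mean differs from a hypothesized value of 5.0.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=5.4, scale=1.0, size=40)

t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)
print(f"t = {t_stat:.2f}, p-value = {p_value:.4f}")  # a small p-value is evidence against the null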
Question: What are the different types of machine learning algorithms?
Answer:
- Supervised Learning (classification, regression)
- Unsupervised Learning (clustering, dimensionality reduction)
- Semi-supervised Learning
- Reinforcement Learning
Question: Explain the concept of overfitting in machine learning. How do you prevent it?
Answer: Overfitting occurs when a model learns noise or random fluctuations in the training data, leading to poor generalization on unseen data. To prevent overfitting, techniques like cross-validation, regularization, and early stopping can be used.
Question: What is cross-validation, and why is it important in machine learning?
Answer: Cross-validation is a technique used to assess the performance of a model by splitting the data into multiple subsets, training the model on some subsets, and evaluating it on others. It helps in estimating the model’s performance on unseen data and reduces the risk of overfitting.
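A minimal scikit-learn example of 5-fold cross-validation; the model and dataset are chosen only for illustration.
# Sketch: estimate generalization performance with 5-fold cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
print("fold accuracies:", scores.round(3), "mean:", scores.mean().round(3))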
Question: Explain the difference between supervised and unsupervised learning.
Answer:
- Supervised Learning: The model learns from labeled training data, making predictions or classifications based on input-output pairs.
- Unsupervised Learning: The model learns patterns and structures in unlabeled data, clustering similar data points or reducing the dimensionality of the data.
Conclusion
Navigating a data science or analytics interview at AstraZeneca requires a blend of solid technical knowledge, problem-solving prowess, and effective communication skills. By understanding the fundamental concepts of statistics and machine learning, being proficient in Python and SQL, and showcasing your practical experience through relevant projects, you can stand out as a strong candidate. Remember to stay updated with industry trends, demonstrate your ability to derive actionable insights from data, and present your findings with clarity and confidence. With thorough preparation and a strategic approach, you’ll be well-prepared to seize the exciting opportunities that await in the dynamic world of data science and analytics at AstraZeneca. Best wishes on your interview journey!