Basic Pandas Interview Questions and Answers

1. What is Pandas in Python?

  • Pandas is an open-source library for data manipulation and analysis in Python.
  • It provides two main data structures: Series (1-dimensional) and DataFrame (2-dimensional).
  • With Pandas, you can efficiently clean, merge, reshape, and analyze structured data loaded from a variety of sources, such as CSV files, Excel workbooks, and SQL databases.

2. What is a DataFrame in Pandas?

  • A DataFrame is a 2-dimensional labeled data structure with rows and columns, similar to a table in a database or an Excel spreadsheet.
  • It can hold data of different types (integers, strings, floats, etc.), and each column can be considered as a Series. DataFrames are essential for data analysis in Pandas.

3. How do you create a DataFrame in Pandas?

A DataFrame can be created in several ways:

  • From a dictionary of lists: import pandas as pd; data = {'Name': ['Alice', 'Bob'], 'Age': [24, 30]}; df = pd.DataFrame(data)
  • From a CSV file: df = pd.read_csv('file.csv')
  • From a NumPy array: import numpy as np; data = np.array([[1, 2], [3, 4]]); df = pd.DataFrame(data, columns=['A', 'B'])

4. What is the difference between a Series and a DataFrame?

  • A Series is a one-dimensional labeled array, while a DataFrame is a two-dimensional labeled data structure.
  • A DataFrame is essentially a collection of Series, where each column in the DataFrame is a Series. A Series holds data of a single type, while a DataFrame can hold multiple types of data across different columns.
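As a minimal sketch of the relationship described above (the column names and values are made up for illustration), selecting one column of a DataFrame gives back a Series:

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [24, 30]})

age = df["Age"]            # selecting a single column returns a Series
print(type(age).__name__)  # Series
print(age.ndim, df.ndim)   # 1 2
print(df.dtypes)           # each column carries its own dtype
```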

5. How do you handle missing data in Pandas?

Pandas provides several methods to handle missing data:

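  • Use isnull() and notnull() (also available as isna() and notna()) to detect missing values.
  • Use dropna() to remove rows (or, with axis=1, columns) that contain missing values.
  • Use fillna() to fill missing values with a specific value, or ffill() and bfill() to propagate the previous or next valid value. The method= argument of fillna() is deprecated in recent Pandas versions.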
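The methods above can be sketched as follows. The sample data is hypothetical, and filling with the column mean is just one possible strategy, not a recommendation:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with two missing values.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Cara"],
    "Age": [24.0, np.nan, 30.0],
    "City": ["Paris", "Rome", None],
})

missing_per_column = df.isnull().sum()           # count of missing values per column
complete_rows = df.dropna()                      # keep only rows with no missing values
filled_age = df["Age"].fillna(df["Age"].mean())  # replace NaN with the column mean
forward_filled = df["Age"].ffill()               # propagate the last valid value forward
```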

6. What is the use of the groupby() function in Pandas?

The groupby() function is used to split the data into groups based on some criteria (like a column value), apply a function (such as sum or mean) to each group, and combine the results into a new Series or DataFrame. It's useful for aggregation, transformation, and filtering tasks.

For example:

df.groupby('Category')['Value'].sum()
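To make the split-apply-combine idea concrete, here is a small sketch on made-up data. The 'Category' and 'Value' columns follow the example above; the values themselves are assumptions:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "Category": ["A", "B", "A", "B", "A"],
    "Value": [10, 20, 30, 40, 50],
})

# One aggregate per group: A -> 90, B -> 60
totals = df.groupby("Category")["Value"].sum()

# Several aggregates at once, one column per function.
summary = df.groupby("Category")["Value"].agg(["sum", "mean", "count"])
```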

7. How do you merge two DataFrames in Pandas?

To merge two DataFrames, the merge() function is used. You can specify columns to join on and the type of join (inner, outer, left, right). It works similarly to SQL joins.

Example: merged_df = pd.merge(df1, df2, on='A', how='inner')
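The effect of the how argument is easiest to see on two small frames. This is a sketch with invented data; only the key column 'A' comes from the example above:

```python
import pandas as pd

# Hypothetical frames that share the key column 'A'.
df1 = pd.DataFrame({"A": [1, 2, 3], "left_val": ["x", "y", "z"]})
df2 = pd.DataFrame({"A": [2, 3, 4], "right_val": ["p", "q", "r"]})

inner = pd.merge(df1, df2, on="A", how="inner")  # keys present in both: 2, 3
left = pd.merge(df1, df2, on="A", how="left")    # all keys from df1: 1, 2, 3
outer = pd.merge(df1, df2, on="A", how="outer")  # union of keys: 1, 2, 3, 4
```

Rows without a match on the other side get missing values in the columns that come from that side.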

8. What are the different ways to select columns from a DataFrame?

There are several ways to select columns in a DataFrame:

  • Select a single column by using the column name: df['column_name']
  • Select multiple columns by passing a list of column names: df[['col1', 'col2']]
  • Use the loc or iloc indexers to select rows and columns based on labels or integer positions, respectively.
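The three approaches can be compared side by side. This is a sketch on a hypothetical three-column frame:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"col1": [1, 2, 3], "col2": [4, 5, 6], "col3": [7, 8, 9]})

single = df["col1"]                  # one column -> Series
subset = df[["col1", "col2"]]        # list of names -> DataFrame
by_label = df.loc[:, "col1":"col2"]  # label slice, end label is included
by_position = df.iloc[:, 0:2]        # position slice, end position is excluded
```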

9. How do you concatenate two DataFrames in Pandas?

You can concatenate two DataFrames along rows or columns using the concat() function. To concatenate along rows (vertically), use axis=0. To concatenate along columns (horizontally), use axis=1.

Example:

pd.concat([df1, df2], axis=0)
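A short sketch of both directions, using invented frames. The ignore_index option and the add_suffix call are additions not mentioned above, shown because duplicate index labels and duplicate column names are common surprises:

```python
import pandas as pd

# Hypothetical frames with the same columns.
df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df2 = pd.DataFrame({"A": [5, 6], "B": [7, 8]})

stacked = pd.concat([df1, df2], axis=0)                         # index: 0, 1, 0, 1
renumbered = pd.concat([df1, df2], axis=0, ignore_index=True)   # index: 0, 1, 2, 3
side_by_side = pd.concat([df1, df2.add_suffix("_2")], axis=1)   # columns: A, B, A_2, B_2
```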

10. How do you sort a DataFrame by a column?

To sort a DataFrame by a column, use the sort_values() function. You can specify the column to sort by and whether the sorting should be in ascending or descending order.

Example:

df.sort_values(by='Age', ascending=False)

11. How do you reset the index of a DataFrame?

To reset the index of a DataFrame, use the reset_index() function. By default, it moves the current index into a new column and assigns a new sequential index; pass drop=True to discard the old index instead. Example:

df.reset_index(drop=True)
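The difference between the default behavior and drop=True can be sketched on a frame with a made-up string index:

```python
import pandas as pd

# Hypothetical frame with a non-default index.
df = pd.DataFrame({"Age": [24, 30, 41]}, index=["a", "b", "c"])

kept = df.reset_index()              # old index becomes a column named "index"
dropped = df.reset_index(drop=True)  # old index is discarded
```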

12. What is the purpose of the apply() function in Pandas?

The apply() function allows you to apply a custom function to each element of a Series, or to each row or column of a DataFrame (controlled by the axis argument). It can be used for operations like transformations, calculations, or aggregations.

Example:

df['column'].apply(lambda x: x * 2)

13. What is the pivot_table() function used for in Pandas?

The pivot_table() function is used to create a pivot table from a DataFrame. It allows you to aggregate data based on certain columns, similar to how pivot tables work in Excel.

Example:

df.pivot_table(values='Value', index='Category', aggfunc='sum')
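A small worked sketch follows. The 'Region' column and all values are invented for illustration; adding columns= spreads the aggregate across a second dimension:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "Category": ["A", "A", "B", "B"],
    "Region": ["East", "West", "East", "West"],
    "Value": [10, 20, 30, 40],
})

# One row per category: A -> 30, B -> 70
by_category = df.pivot_table(values="Value", index="Category", aggfunc="sum")

# Categories as rows, regions as columns.
by_both = df.pivot_table(values="Value", index="Category",
                         columns="Region", aggfunc="sum")
```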

14. How do you filter rows in a DataFrame based on conditions?

To filter rows in a DataFrame based on conditions, you can use boolean indexing.

For example:

df[df['Age'] > 30] keeps only the rows where the value in the 'Age' column is greater than 30.
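Conditions can also be combined. In this sketch on made-up data, note that each condition needs its own parentheses and that & and | are used instead of Python's and/or:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Cara", "Dan"],
    "Age": [24, 35, 41, 29],
    "City": ["Paris", "Rome", "Paris", "Oslo"],
})

over_30 = df[df["Age"] > 30]                                    # Bob, Cara
paris_over_30 = df[(df["Age"] > 30) & (df["City"] == "Paris")]  # Cara
rome_or_oslo = df[df["City"].isin(["Rome", "Oslo"])]            # Bob, Dan
same_with_query = df.query("Age > 30")                          # string form of the first filter
```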

15. What is the purpose of the duplicated() function in Pandas?

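The duplicated() function detects duplicate rows in a DataFrame. It returns a boolean Series where True indicates a duplicate row; by default, the first occurrence of each row is marked False and only the later repeats are marked True. You can use this to identify duplicates, and drop_duplicates() to remove them from your dataset.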

Example:

df[df.duplicated()]
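The keep argument controls which copies are flagged. A minimal sketch on invented data:

```python
import pandas as pd

# Hypothetical sample data: rows 0 and 2 are identical.
df = pd.DataFrame({"Name": ["Alice", "Bob", "Alice"], "Age": [24, 30, 24]})

flags = df.duplicated()                 # [False, False, True]: first occurrence is not flagged
all_copies = df.duplicated(keep=False)  # [True, False, True]: every copy is flagged
deduped = df.drop_duplicates()          # keeps rows 0 and 1
```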

16. How do you rename columns in a DataFrame?

To rename columns in a DataFrame, you can use the rename() function. You pass a dictionary mapping old column names to new ones.

Example:

df.rename(columns={'old_name': 'new_name'}, inplace=True)

17. What is the astype() function in Pandas used for?

The astype() function is used to convert the data type of a column or an entire DataFrame, for example from int to float or from string to category. For parsing strings into datetimes, pd.to_datetime() is usually the more flexible choice.

Example:

df['Age'] = df['Age'].astype(float)
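A brief sketch of both conversions on made-up data (the 'Joined' column is an assumption added for illustration):

```python
import pandas as pd

# Hypothetical sample data: dates stored as strings.
df = pd.DataFrame({"Age": [24, 30], "Joined": ["2024-01-15", "2024-03-01"]})

df["Age"] = df["Age"].astype(float)          # int64 -> float64
df["Joined"] = pd.to_datetime(df["Joined"])  # strings -> datetime64

print(df.dtypes)
```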

18. How do you extract specific rows or columns from a DataFrame?

You can extract rows and columns from a DataFrame using loc (label-based indexing) or iloc (integer position-based indexing).

Example:

df.loc[0, 'Name'] extracts the value in the 'Name' column for the row labeled 0 (the first row when the default index is used). df.iloc[0, 0] extracts the value in the first row and first column by position.
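The distinction matters as soon as the index is not the default 0, 1, 2, and so on. A sketch with an invented integer index:

```python
import pandas as pd

# Hypothetical frame whose index labels differ from the row positions.
df = pd.DataFrame({"Name": ["Alice", "Bob", "Cara"], "Age": [24, 30, 41]},
                  index=[10, 20, 30])

by_label = df.loc[10, "Name"]    # row labeled 10 -> "Alice"
by_position = df.iloc[0, 0]      # first row, first column -> "Alice"

label_slice = df.loc[10:20]      # labels 10 and 20: end label is included
position_slice = df.iloc[0:2]    # positions 0 and 1: end position is excluded
```

Here df.loc[0, 'Name'] would raise a KeyError, because no row carries the label 0.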

19. What is the crosstab() function in Pandas?

The crosstab() function is used to compute a cross-tabulation of two or more factors, similar to a contingency table. It returns a DataFrame that shows the frequency distribution of the input variables.

Example:

pd.crosstab(df['Category'], df['Region'])
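On a small invented dataset the resulting frequency table looks like this. The margins=True option, which adds row and column totals, is an addition not mentioned above:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "Category": ["A", "A", "B", "B", "A"],
    "Region": ["East", "West", "East", "East", "East"],
})

counts = pd.crosstab(df["Category"], df["Region"])
# Region    East  West
# Category
# A            2     1
# B            2     0

with_totals = pd.crosstab(df["Category"], df["Region"], margins=True)  # adds an "All" row and column
```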

20. How do you deal with categorical data in Pandas?

Categorical data can be converted into a category dtype in Pandas for efficient memory usage and performance. You can also use one-hot encoding for machine learning tasks.

Example:

df['Category'] = df['Category'].astype('category')
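The one-hot encoding mentioned above can be sketched with pd.get_dummies(). The category values here are made up, and get_dummies is one common option rather than the only way to encode:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"Category": ["low", "high", "low", "medium"]})

df["Category"] = df["Category"].astype("category")
codes = df["Category"].cat.codes  # integer code per row; categories are sorted: high, low, medium

# One indicator column per category value.
one_hot = pd.get_dummies(df["Category"], prefix="Category")
```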

21. What is the timedelta object in Pandas?

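The timedelta object (pd.Timedelta) represents a duration, that is, the difference between two dates or times. It is used for performing arithmetic operations on datetime objects, such as adding or subtracting time. Subtracting a timestamp from a datetime column yields a Series of timedeltas.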

Example:

df['date'] - pd.to_datetime('2025-01-01')
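A small sketch of timedelta arithmetic, assuming a 'date' column that already holds datetimes (the dates are invented):

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"date": pd.to_datetime(["2025-01-01", "2025-01-11", "2025-02-01"])})

elapsed = df["date"] - pd.to_datetime("2025-01-01")  # Series of timedeltas
days = elapsed.dt.days                               # whole days: 0, 10, 31
shifted = df["date"] + pd.Timedelta(days=7)          # every date moved one week forward
```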

22. How do you create a rolling window in Pandas?

You can create a rolling window in Pandas using the rolling() function. This is useful for applying statistical functions over a moving window, like calculating a moving average.

Example:

df['Value'].rolling(window=3).mean()
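On a short made-up series the window behavior is easy to follow. The first window - 1 results are NaN unless min_periods is lowered, which is an option not mentioned above:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"Value": [1, 2, 3, 4, 5]})

moving_avg = df["Value"].rolling(window=3).mean()
# NaN, NaN, 2.0, 3.0, 4.0

partial = df["Value"].rolling(window=3, min_periods=1).mean()
# 1.0, 1.5, 2.0, 3.0, 4.0
```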

23. What is the cut() function in Pandas?

The cut() function is used to segment and sort data values into discrete bins or categories. It's often used for binning continuous variables into intervals.

Example:

pd.cut(df['Age'], bins=[0, 18, 35, 50, 100], labels=['Teen', 'Adult', 'Middle-Aged', 'Senior'])
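Applied to a few invented ages, the same bins behave as sketched below. Intervals are closed on the right by default, so 18 falls into (0, 18] and 35 into (18, 35]:

```python
import pandas as pd

# Hypothetical sample data.
ages = pd.Series([5, 18, 19, 35, 60])

groups = pd.cut(ages, bins=[0, 18, 35, 50, 100],
                labels=["Teen", "Adult", "Middle-Aged", "Senior"])
# Teen, Teen, Adult, Adult, Senior
```

Values outside every bin, such as exactly 0 with these edges, come back as missing.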

24. How do you sample data from a DataFrame?

You can use the sample() function to randomly sample rows from a DataFrame. You can specify the number of samples or the fraction of the total dataset.

Example:

df.sample(n=5) or df.sample(frac=0.1)
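A sketch of both forms on a made-up 100-row frame. Passing random_state (the seed value 42 is arbitrary) makes the sample reproducible, which tends to help in tests and notebooks:

```python
import pandas as pd

# Hypothetical sample data: 100 rows.
df = pd.DataFrame({"Value": range(100)})

five = df.sample(n=5, random_state=42)       # exactly 5 rows
tenth = df.sample(frac=0.1, random_state=42) # 10% of the rows, here 10
```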

25. What are the advantages of using Pandas?

Pandas provides powerful data structures for handling and analyzing structured data. Its advantages include:

  • Fast, vectorized computation built on NumPy for datasets that fit in memory.
  • Built-in support for handling missing data.
  • Comprehensive and flexible data manipulation functions.
  • Easy integration with other libraries (like NumPy, Matplotlib, and Scikit-learn).

Copyrights © 2024 letsupdateskills All rights reserved