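A DataFrame can be created in several ways, including from a dictionary of lists, from a list of dictionaries, from a NumPy array, or by reading a file with a function such as read_csv().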
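As a minimal sketch, a few common construction routes might look like this. The column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# From a dictionary mapping column names to lists of values.
df_from_dict = pd.DataFrame({"Name": ["Ann", "Bob"], "Age": [28, 35]})

# From a list of records (one dictionary per row).
df_from_records = pd.DataFrame(
    [{"Name": "Ann", "Age": 28}, {"Name": "Bob", "Age": 35}]
)

# From a NumPy array, supplying the column labels explicitly.
df_from_array = pd.DataFrame(np.array([[1, 2], [3, 4]]), columns=["A", "B"])
```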
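Pandas provides several methods to handle missing data, including isna() to detect it, dropna() to remove it, fillna() to replace it, and interpolate() to estimate it from neighbouring values.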
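A short sketch of some commonly used missing-data methods, assuming a small made-up frame with NaN gaps:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [np.nan, 5.0, 6.0]})

missing_per_column = df.isna().sum()   # count of missing cells in each column
rows_without_gaps = df.dropna()        # drop rows containing any missing value
filled_with_zero = df.fillna(0)        # replace missing values with a constant
interpolated = df["A"].interpolate()   # linear interpolation between neighbours
```

Which of these is appropriate depends on the data; dropping rows is simplest but discards information.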
The groupby() function splits the data into groups based on some criteria (such as a column value), applies a function (such as sum or mean) to each group, and combines the results into a new Series or DataFrame. It's useful for aggregation, transformation, and filtering tasks.
For example:
df.groupby('Category')['Value'].sum()
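To make the split-apply-combine idea concrete, here is a small sketch on invented data. The 'Category' and 'Value' columns follow the example above; the values themselves are assumptions.

```python
import pandas as pd

df = pd.DataFrame({"Category": ["a", "b", "a", "b"], "Value": [1, 2, 3, 4]})

# Aggregation: one result per group.
totals = df.groupby("Category")["Value"].sum()

# Several aggregations at once give a DataFrame with one column per function.
summary = df.groupby("Category")["Value"].agg(["sum", "mean"])

# Transformation: a result aligned with the original rows.
group_mean = df.groupby("Category")["Value"].transform("mean")
```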
To merge two DataFrames, the merge() function is used. You can specify columns to join on and the type of join (inner, outer, left, right). It works similarly to SQL joins.
Example: merged_df = pd.merge(df1, df2, on='A', how='inner')
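The effect of the join type is easiest to see on two small frames that only partly overlap. The sketch below assumes a shared key column 'A' as in the example; the other column names are hypothetical.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 3], "left_val": ["x", "y", "z"]})
df2 = pd.DataFrame({"A": [2, 3, 4], "right_val": ["p", "q", "r"]})

inner = pd.merge(df1, df2, on="A", how="inner")  # keys present in both frames
left = pd.merge(df1, df2, on="A", how="left")    # every key from df1
outer = pd.merge(df1, df2, on="A", how="outer")  # every key from either frame
```

Rows without a match on the other side get NaN in the columns that come from that side.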
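There are several ways to select columns in a DataFrame, including bracket notation with a single name, bracket notation with a list of names, and the loc and iloc indexers.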
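A few common column-selection idioms, sketched on a made-up frame (the column names are illustrative only):

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Ann", "Bob"], "Age": [28, 35], "City": ["Oslo", "Rome"]}
)

age_series = df["Age"]               # single name returns a Series
subset = df[["Name", "Age"]]         # list of names returns a DataFrame
by_label = df.loc[:, "Name":"Age"]   # label slice, inclusive on both ends
by_position = df.iloc[:, [0, 2]]     # integer positions
numeric_only = df.select_dtypes(include="number")  # select by dtype
```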
You can concatenate two DataFrames along rows or columns using the concat() function. To concatenate along rows (vertically), use axis=0. To concatenate along columns (horizontally), use axis=1.
Example:
pd.concat([df1, df2], axis=0)
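A brief sketch of both directions, using two invented frames. The ignore_index option and the add_suffix() call are additions for tidiness, not part of the example above.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df2 = pd.DataFrame({"A": [5, 6], "B": [7, 8]})

# Vertical: rows of df2 are appended below df1; ignore_index renumbers rows.
stacked = pd.concat([df1, df2], axis=0, ignore_index=True)

# Horizontal: columns are placed side by side, aligned on the row index.
# add_suffix avoids duplicate column names in the result.
side_by_side = pd.concat([df1, df2.add_suffix("_2")], axis=1)
```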
To sort a DataFrame by a column, use the sort_values() function. You can specify the column to sort by and whether the sorting should be in ascending or descending order.
Example:
df.sort_values(by='Age', ascending=False)
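Sorting by more than one column works the same way with lists. This sketch assumes a small made-up frame with a tie in 'Age' to show the tie-breaker.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Cy", "Bob", "Ann"], "Age": [35, 28, 35]})

# Single column, descending.
by_age_desc = df.sort_values(by="Age", ascending=False)

# Age descending, then Name ascending to break ties.
by_age_then_name = df.sort_values(by=["Age", "Name"], ascending=[False, True])
```

sort_values() returns a new DataFrame and leaves the original order untouched.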
To reset the index of a DataFrame, use the reset_index() function. By default, it moves the current index into a new column and assigns a new sequential integer index; pass drop=True to discard the old index instead. Example:
df.reset_index(drop=True)
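The difference between the default behaviour and drop=True can be seen in a small sketch with an invented non-sequential index:

```python
import pandas as pd

df = pd.DataFrame({"Value": [10, 20, 30]}, index=[5, 7, 9])

# Default: the old index is kept as a column named "index".
kept = df.reset_index()

# drop=True: the old index is discarded.
dropped = df.reset_index(drop=True)
```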
The apply() function applies a custom function along an axis of a DataFrame (to each column by default, or to each row with axis=1) or to each element of a Series. It can be used for operations like transformations, calculations, or aggregations.
Example:
df['column'].apply(lambda x: x * 2)
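The three modes of apply() can be sketched as follows; the column names 'a' and 'b' are placeholders.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Series.apply: the function receives one element at a time.
doubled = df["a"].apply(lambda x: x * 2)

# DataFrame.apply with the default axis=0: the function receives each column.
col_range = df.apply(lambda col: col.max() - col.min())

# DataFrame.apply with axis=1: the function receives each row.
row_sum = df.apply(lambda row: row["a"] + row["b"], axis=1)
```

For simple arithmetic like this, vectorized expressions such as df["a"] * 2 are usually faster than apply().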
The pivot_table() function is used to create a pivot table in a DataFrame. It allows you to aggregate data based on certain columns, similar to how pivot tables work in Excel.
Example:
df.pivot_table(values='Value', index='Category', aggfunc='sum')
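A runnable sketch of the same call, plus a two-dimensional variant. The 'Region' column is an assumption added for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Category": ["a", "a", "b", "b"],
        "Region": ["N", "S", "N", "S"],
        "Value": [1, 2, 3, 4],
    }
)

# One row per Category, summing Value.
by_category = df.pivot_table(values="Value", index="Category", aggfunc="sum")

# Categories as rows and Regions as columns.
by_cat_region = df.pivot_table(
    values="Value", index="Category", columns="Region", aggfunc="sum"
)
```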
To filter rows in a DataFrame based on conditions, you can use boolean indexing.
For example:
df[df['Age'] > 30] keeps only the rows where the value in the 'Age' column is greater than 30.
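Conditions can be combined with & (and) and | (or), with each condition wrapped in parentheses. A minimal sketch on invented data:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Ann", "Bob", "Cy"],
        "Age": [28, 35, 41],
        "City": ["Oslo", "Rome", "Oslo"],
    }
)

over_30 = df[df["Age"] > 30]
over_30_in_oslo = df[(df["Age"] > 30) & (df["City"] == "Oslo")]
rome_or_young = df[(df["City"] == "Rome") | (df["Age"] < 30)]
```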
The duplicated() function detects duplicate rows in a DataFrame. It returns a boolean Series in which True marks a row that repeats an earlier one; by default the first occurrence is not marked. You can use this mask to inspect duplicates, or call drop_duplicates() to remove them.
Example:
df[df.duplicated()]
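A small sketch showing the mask, the removal step, and the keep=False option that marks every copy. The data is made up.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2], "B": ["x", "x", "y"]})

# True only for the second copy of the repeated row.
dup_mask = df.duplicated()

# Remove repeats, keeping the first occurrence of each row.
deduped = df.drop_duplicates()

# keep=False marks every member of a duplicate group, including the first.
all_copies = df[df.duplicated(keep=False)]
```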
To rename columns in a DataFrame, you can use the rename() function. You pass a dictionary mapping old column names to new ones.
Example:
df.rename(columns={'old_name': 'new_name'}, inplace=True)
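As an alternative to inplace=True, rename() can return a new DataFrame that you assign to a name. A minimal sketch, with a hypothetical second column 'other':

```python
import pandas as pd

df = pd.DataFrame({"old_name": [1, 2], "other": [3, 4]})

# Without inplace=True, the original frame is left unchanged.
renamed = df.rename(columns={"old_name": "new_name"})
```

Columns not mentioned in the dictionary keep their names.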
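The astype() function converts the data type of a single column or an entire DataFrame, for example from int to float or from string to a numeric type. It can also cast well-formed date strings to datetime64[ns], although pd.to_datetime() is usually the more flexible choice for parsing dates.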
Example:
df['Age'] = df['Age'].astype(float)
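A short sketch of typical conversions. The 'Score' and 'Joined' columns are assumptions added to show string-to-number and string-to-date cases.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Age": [28, 35],
        "Score": ["1.5", "2.5"],
        "Joined": ["2024-01-15", "2024-06-01"],
    }
)

df["Age"] = df["Age"].astype(float)          # int -> float
df["Score"] = df["Score"].astype(float)      # numeric strings -> float
df["Joined"] = pd.to_datetime(df["Joined"])  # date strings -> datetime
```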
You can extract rows and columns from a DataFrame using loc (label-based indexing) or iloc (integer position-based indexing).
Example:
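df.loc[0, 'Name'] extracts the value in the 'Name' column for the row whose index label is 0 (the first row when the DataFrame uses the default index). df.iloc[0, 0] extracts the value at the first row and first column by position.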
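The difference between labels and positions shows up as soon as the index is not 0, 1, 2, and so on. The sketch below assumes a made-up frame with index labels 10, 20, 30.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Ann", "Bob", "Cy"], "Age": [28, 35, 41]}, index=[10, 20, 30]
)

by_label = df.loc[10, "Name"]   # row labeled 10, column 'Name'
by_position = df.iloc[0, 0]     # first row, first column

label_slice = df.loc[10:20]     # label slices include the end label
position_slice = df.iloc[0:1]   # position slices exclude the end position
```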
The crosstab() function is used to compute a cross-tabulation of two or more factors, similar to a contingency table. It returns a DataFrame that shows the frequency distribution of the input variables.
Example:
pd.crosstab(df['Category'], df['Region'])
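A runnable sketch of the same call on invented data, with margins=True added to show row and column totals:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Category": ["a", "a", "b", "b", "b"],
        "Region": ["N", "S", "N", "N", "S"],
    }
)

# Frequency of each Category/Region combination.
counts = pd.crosstab(df["Category"], df["Region"])

# margins=True adds an "All" row and column holding the totals.
with_totals = pd.crosstab(df["Category"], df["Region"], margins=True)
```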
Categorical data can be converted into a category dtype in Pandas for efficient memory usage and performance. You can also use one-hot encoding for machine learning tasks.
Example:
df['Category'] = df['Category'].astype('category')
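Both ideas can be sketched together. The colour values are made up, and pd.get_dummies() is used here as one common way to one-hot encode.

```python
import pandas as pd

df = pd.DataFrame({"Category": ["red", "blue", "red", "green"]})

# Store the column as a categorical: each value becomes a small integer code.
df["Category"] = df["Category"].astype("category")
codes = df["Category"].cat.codes

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(df["Category"], prefix="Category")
```

The memory saving from the category dtype is largest when a column has many rows but few distinct values.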
A Timedelta (pd.Timedelta) represents a duration: the difference between two dates or times. Subtracting one datetime from another produces a Timedelta, and adding or subtracting a Timedelta shifts a datetime.
Example:
df['date'] - pd.to_datetime('2025-01-01')
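A minimal sketch of both directions of the arithmetic, assuming a 'date' column that already holds datetime values:

```python
import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime(["2025-01-11", "2025-02-01"])})

# datetime - datetime gives a Series of durations.
elapsed = df["date"] - pd.to_datetime("2025-01-01")
days = elapsed.dt.days

# datetime + Timedelta shifts each date forward.
one_week_later = df["date"] + pd.Timedelta(days=7)
```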
You can create a rolling window in Pandas using the rolling() function. This is useful for applying statistical functions over a moving window, like calculating a moving average.
Example:
df['Value'].rolling(window=3).mean()
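The sketch below runs the same moving average on invented data and shows the min_periods option, which controls what happens before the window is full.

```python
import pandas as pd

df = pd.DataFrame({"Value": [1, 2, 3, 4, 5]})

# The first two results are NaN because the window of 3 is not yet full.
moving_avg = df["Value"].rolling(window=3).mean()

# min_periods=1 averages whatever values are available so far.
partial = df["Value"].rolling(window=3, min_periods=1).mean()
```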
The cut() function is used to segment and sort data values into discrete bins or categories. It's often used for binning continuous variables into intervals.
Example:
pd.cut(df['Age'], bins=[0, 18, 35, 50, 100], labels=['Teen', 'Adult', 'Middle-Aged', 'Senior'])
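A runnable version of the example on made-up ages. By default each bin includes its right edge, so an age of exactly 18 falls into the first bin.

```python
import pandas as pd

df = pd.DataFrame({"Age": [10, 18, 19, 40, 70]})

# Bins are (0, 18], (18, 35], (35, 50], (50, 100] with the default right=True.
df["AgeGroup"] = pd.cut(
    df["Age"],
    bins=[0, 18, 35, 50, 100],
    labels=["Teen", "Adult", "Middle-Aged", "Senior"],
)
```

Values outside the outermost bin edges become NaN.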
You can use the sample() function to randomly sample rows from a DataFrame. You can specify the number of samples or the fraction of the total dataset.
Example:
df.sample(n=5) or df.sample(frac=0.1)
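A short sketch on an invented 100-row frame. The random_state argument is an addition here; it makes the sample reproducible.

```python
import pandas as pd

df = pd.DataFrame({"Value": range(100)})

five_rows = df.sample(n=5, random_state=42)         # fixed number of rows
ten_percent = df.sample(frac=0.1, random_state=42)  # fraction of all rows
```

Sampling is without replacement unless replace=True is passed.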
Pandas provides powerful data structures for handling and analyzing structured data. Its advantages include:
Copyrights © 2024 letsupdateskills All rights reserved