I Thought Pandas Were Cuddly Bears. I Was Wrong.

My first week wrestling Python’s data library, where the bears have teeth and the data bites back.

Key Takeaways:

Series

A Series is a one-dimensional data structure, similar to a list or a column in a spreadsheet. It holds data of the same type, with each element labeled by an index.
You can use pd.Series(list) to create a Series.
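Here is a minimal sketch of creating a Series, first with the default numeric index and then with custom labels. The prices and product labels are made-up values used only for illustration.

```python
import pandas as pd

# Build a Series from a plain Python list; pandas assigns a default 0..n-1 index.
prices = pd.Series([2.5, 1.8, 3.2])

# Optionally supply your own labels through the index parameter.
labeled_prices = pd.Series([2.5, 1.8, 3.2], index=["Coffee", "Tea", "Apples"])

print(prices)
print(labeled_prices["Tea"])  # look up a value by its label
```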

DataFrame

A DataFrame is a versatile two-dimensional structure in Pandas, similar to a table or spreadsheet, with rows and columns. It can hold data of different types, with each column functioning as a Series. Like a spreadsheet, a DataFrame includes both an index and column labels, making it ideal for handling large, structured datasets.

A common way to create a DataFrame object is to pass a dictionary to the pd.DataFrame() constructor, for example pd.DataFrame(dictionary). The dictionary keys become the column names and the lists become the column values.

    import pandas as pd

    # Creating a dictionary
    store_data = {
        'Product': ['Coffee', 'Tea', 'Apples', 'Banana', 'Beef', 'Orange Juice'],
        'Country of Origin': ['Brasil', 'India', 'Spain', 'Costa Rica', 'Argentina', 'Spain'],
    }

    store_df = pd.DataFrame(store_data)
    print(store_df)

Below is the outcome of the code:

            Product Country of Origin
    0        Coffee            Brasil
    1           Tea             India
    2        Apples             Spain
    3        Banana        Costa Rica
    4          Beef         Argentina
    5  Orange Juice             Spain

To add a new column, you can use the following syntax: store_df["Allergens"] = [True, False, False, False, False, True]. The new column is added at the end.

Alternatively, you can use the insert() method, which takes 3 parameters: 1) the zero-based position where you want to add the new column; 2) the name of the column; and 3) the values for the column, one per row. For example, store_df.insert(1, "Allergens", [True, False, False, False, False, True]).
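To see the two approaches side by side, here is a small sketch on a shortened version of the store data. The "In Stock" column is a hypothetical name I made up so that both techniques can run on the same DataFrame.

```python
import pandas as pd

store_df = pd.DataFrame({
    "Product": ["Coffee", "Tea", "Apples"],
    "Country of Origin": ["Brasil", "India", "Spain"],
})

# Plain assignment appends the new column after the existing ones.
store_df["Allergens"] = [True, False, False]

# insert(loc, column, value) places the column at zero-based position 1.
# It modifies store_df in place and returns None.
# "In Stock" is a hypothetical column used only for illustration.
store_df.insert(1, "In Stock", [True, True, False])

print(list(store_df.columns))
# ['Product', 'In Stock', 'Country of Origin', 'Allergens']
```

Note that insert() raises a ValueError if a column with that name already exists, whereas plain assignment silently overwrites it.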

To delete rows or columns, use drop(), whose main parameters are: 1) labels, the row or column label(s) to remove; 2) axis, where 0 (the default) removes rows and 1 removes columns; 3) inplace (optional, default False), which modifies the original DataFrame when set to True. By default, drop() returns a new DataFrame without modifying the original one.
Example: store_df = store_df.drop("Allergens", axis=1), or equivalently store_df = store_df.drop(columns=["Allergens"]). The columns keyword already says what to drop, so axis is not needed with it.
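A quick sketch of the drop() variants, assuming a shortened store DataFrame that already has an "Allergens" column:

```python
import pandas as pd

store_df = pd.DataFrame({
    "Product": ["Coffee", "Tea", "Apples"],
    "Allergens": [True, False, False],
})

# Two equivalent ways to drop a column; both return a new DataFrame.
without_allergens = store_df.drop(columns=["Allergens"])
same_result = store_df.drop("Allergens", axis=1)

# Dropping a row uses the row's index label (axis=0 is the default).
without_first_row = store_df.drop(0)

# The original is untouched because inplace defaults to False.
print(list(store_df.columns))
```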

To retrieve a set of columns as a new DataFrame, you can use the following syntax:
new_store_df = store_df[['Product','Allergens']] – Note the double square brackets.
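The brackets matter more than I expected. As a small sketch (on a shortened store DataFrame), double brackets give back a DataFrame while single brackets give back a Series:

```python
import pandas as pd

store_df = pd.DataFrame({
    "Product": ["Coffee", "Tea", "Apples"],
    "Country of Origin": ["Brasil", "India", "Spain"],
    "Allergens": [True, False, False],
})

# Double brackets: a list of column names goes in, a DataFrame comes out.
new_store_df = store_df[["Product", "Allergens"]]

# Single brackets: one column name goes in, a Series comes out.
product_series = store_df["Product"]

print(type(new_store_df).__name__)    # DataFrame
print(type(product_series).__name__)  # Series
```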

To retrieve a specific row by its position, use .iloc[]. Positions are zero-based, so .iloc[2] returns the third row (the row number minus 1).

* df.iloc[1, 2] will extract the data from the second row and third column. The general form is df.iloc[row_index, col_index].
* df.iloc[:, 3] will extract all rows from the fourth column.
* df.iloc[3, :] will extract all columns from the fourth row.
* df.iloc[:50, [1,4]] will extract the first 50 rows and only columns with index 1 and 4.
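The patterns above can be sketched on a tiny grid of numbers, where each cell holds row * 10 + column so it is easy to see what came back. The column names a to e are arbitrary, and I slice the first 2 rows instead of 50 to keep the example small.

```python
import pandas as pd

# A 4x5 grid; the value in each cell is row_position * 10 + column_position.
df = pd.DataFrame(
    [[r * 10 + c for c in range(5)] for r in range(4)],
    columns=list("abcde"),
)

cell = df.iloc[1, 2]         # second row, third column -> 12
fourth_col = df.iloc[:, 3]   # all rows of the fourth column (a Series)
fourth_row = df.iloc[3, :]   # all columns of the fourth row (a Series)
block = df.iloc[:2, [1, 4]]  # first two rows, columns at positions 1 and 4

print(cell)
print(block)
```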

To retrieve a specific row by its index label (for example, a string), use .loc[].
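A minimal sketch of label-based access, assuming the DataFrame was given string index labels. The labels "p1", "p2" and "p3" are hypothetical.

```python
import pandas as pd

store_df = pd.DataFrame(
    {
        "Product": ["Coffee", "Tea", "Apples"],
        "Country of Origin": ["Brasil", "India", "Spain"],
    },
    index=["p1", "p2", "p3"],  # hypothetical row labels
)

tea_row = store_df.loc["p2"]                          # one row, as a Series
tea_country = store_df.loc["p2", "Country of Origin"]  # one cell
first_two = store_df.loc["p1":"p2"]                   # label slices include both ends

print(tea_country)
```

Unlike iloc, a slice with loc includes the end label, which surprised me the first time.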

To retrieve the first rows of a DataFrame, use df.head(8); with no argument, it returns the first 5.
To retrieve the last rows of a DataFrame, use df.tail(8); with no argument, it returns the last 5.
To retrieve random rows of a DataFrame, use df.sample(8); with no argument, it returns a single random row.
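A short sketch of the three methods on a 20-row DataFrame. The column name "n" is arbitrary, and random_state is only there to make the random sample repeatable.

```python
import pandas as pd

df = pd.DataFrame({"n": range(20)})

first_five = df.head()       # default: first 5 rows
last_three = df.tail(3)      # last 3 rows
one_random = df.sample()     # default: 1 random row
four_random = df.sample(4, random_state=0)  # 4 random rows, reproducible

print(len(first_five), len(last_three), len(one_random), len(four_random))
```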

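The isin() method checks, element by element, whether each value is among the specified value(s). It returns a Boolean mask of the same shape, which you can then use to filter rows.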
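As a small sketch, here is isin() used on one column of a shortened store DataFrame to keep only the products from a given list of countries:

```python
import pandas as pd

store_df = pd.DataFrame({
    "Product": ["Coffee", "Tea", "Apples", "Banana"],
    "Country of Origin": ["Brasil", "India", "Spain", "Costa Rica"],
})

# True where the country is in the list, False otherwise.
mask = store_df["Country of Origin"].isin(["Spain", "India"])

# Use the Boolean mask to keep only the matching rows.
from_spain_or_india = store_df[mask]

print(from_spain_or_india)
```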

The between() method of a Series checks whether each value falls within a range. For example, on a numeric column named 'Engine_volume':

* .between(2, 3, inclusive = 'right') – matches rows where 'Engine_volume' > 2 and 'Engine_volume' <= 3.
* .between(2, 3, inclusive = 'left') – matches rows where 'Engine_volume' >= 2 and 'Engine_volume' < 3.
* .between(2, 3, inclusive = 'both') – matches rows where 'Engine_volume' >= 2 and 'Engine_volume' <= 3 (this is the default).
* .between(2, 3, inclusive = 'neither') – matches rows where 'Engine_volume' > 2 and 'Engine_volume' < 3.
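Here is a sketch of the four variants on a made-up 'Engine_volume' column. The values are invented for illustration, and the string form of inclusive assumes a reasonably recent pandas (1.3 or later).

```python
import pandas as pd

df = pd.DataFrame({"Engine_volume": [1.6, 2.0, 2.5, 3.0, 3.5]})
volume = df["Engine_volume"]

# between() returns a Boolean mask; index the DataFrame with it to filter rows.
right = df[volume.between(2, 3, inclusive="right")]      # 2 < v <= 3
left = df[volume.between(2, 3, inclusive="left")]        # 2 <= v < 3
both = df[volume.between(2, 3, inclusive="both")]        # 2 <= v <= 3
neither = df[volume.between(2, 3, inclusive="neither")]  # 2 < v < 3

print(list(both["Engine_volume"]))
```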

df.info() prints a summary of the DataFrame: the index range, the column names, the non-null count and dtype of each column, and the memory usage.

df.columns (without parentheses) returns the column names as an Index object; use list(df.columns) to get a plain list.

df.dtypes returns the type of each column.

isna() (or its alias isnull()) returns a Boolean DataFrame of the same shape, with True where a value is missing and False where a value is present.

df.dropna() returns a new DataFrame without the rows that contain at least one missing value.

df.fillna(value) replaces all missing values (None or NaN) in the DataFrame with the provided default value. You can also use a different default value per column: df['Product'] = df['Product'].fillna('Unknown-Product') and df['Country of Origin'] = df['Country of Origin'].fillna('Unknown-Country').
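The missing-value tools fit together nicely. Here is a minimal sketch on a tiny DataFrame with two holes in it; passing a dictionary to fillna() is an alternative way to set a per-column default in a single call.

```python
import pandas as pd

df = pd.DataFrame({
    "Product": ["Coffee", None, "Apples"],
    "Country of Origin": ["Brasil", "India", None],
})

missing_mask = df.isna()              # True where a value is missing
missing_per_column = df.isna().sum()  # number of missing values per column
complete_rows = df.dropna()           # only the rows with no missing value

# A dict maps each column name to its own default value.
filled = df.fillna({
    "Product": "Unknown-Product",
    "Country of Origin": "Unknown-Country",
})

print(missing_per_column)
print(filled)
```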

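df.mean() returns the mean value for each numeric column. If the DataFrame also contains text columns, pass numeric_only=True, since recent pandas versions raise an error otherwise.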
df.max() returns the highest value for each column.
df.min() returns the lowest value for each column.

df.count() counts the non-null cells in each column.
df.sum() calculates the sum of values for each column. It is mainly meaningful for numeric or Boolean columns; pass numeric_only=True to skip the others.
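A small sketch of these aggregations on a DataFrame that mixes text and numbers. The "Price" and "Stock" columns and their values are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Product": ["Coffee", "Tea", "Apples"],
    "Price": [4.0, 2.0, None],   # one missing price
    "Stock": [10, 0, 5],
})

means = df.mean(numeric_only=True)   # skips the text column
totals = df.sum(numeric_only=True)   # missing values are ignored
counts = df.count()                  # non-null cells per column
highest = df[["Price", "Stock"]].max()
lowest = df[["Price", "Stock"]].min()

print(means)
print(counts)
```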

df.isna().sum() calculates the number of missing values for each of the columns.

df.mode() identifies the most frequently occurring value in each column.

df['column'].unique() will return the unique values of a column as an array. unique() is a Series method, so select a column first.
df.nunique() will return the number of unique values in each column.
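To make the difference concrete, here is a sketch with a shortened store DataFrame in which Spain appears twice:

```python
import pandas as pd

df = pd.DataFrame({
    "Product": ["Coffee", "Tea", "Apples", "Orange Juice"],
    "Country of Origin": ["Brasil", "India", "Spain", "Spain"],
})

countries = df["Country of Origin"].unique()   # array of distinct values
unique_counts = df.nunique()                   # distinct count per column
most_common = df["Country of Origin"].mode()   # Series of the most frequent value(s)

print(list(countries))
print(unique_counts)
print(most_common.iloc[0])
```

mode() returns a Series rather than a single value because several values can be tied for most frequent.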

df.describe() provides summary statistics for the numeric columns of the DataFrame:
* count – The number of non-null values.
* mean – The average (mean) value.
* std – The standard deviation.
* min – The minimum value.
* max – The maximum value.
* 25% – The 25th percentile.
* 50% – The 50th percentile (the median).
* 75% – The 75th percentile.
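A quick sketch of reading individual statistics out of the describe() result, using a made-up "Price" column with five values:

```python
import pandas as pd

df = pd.DataFrame({"Price": [1.0, 2.0, 3.0, 4.0, 5.0]})

summary = df.describe()  # a DataFrame indexed by statistic name

# Pick out single statistics with .loc[statistic, column].
median_price = summary.loc["50%", "Price"]
mean_price = summary.loc["mean", "Price"]

print(summary)
print(median_price, mean_price)
```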

Read / Write CSV

Use pd.read_csv() to create a DataFrame from a CSV file. Commonly used parameters:
* filepath_or_buffer: path to the CSV file (string or URL)
* sep: delimiter (default is a comma ',')
* header: row number to use as the column headers (default is the first row)
* names: list of column names to use
* usecols: subset of columns to read
* index_col: column (or list of columns) to set as the DataFrame index

Use df.to_csv() (a DataFrame method, not a pd function) to create a CSV file from a DataFrame. Commonly used parameters:
* path_or_buf: file path or object where the CSV should be written
* sep: delimiter for separating values (default is a comma ',')
* columns: subset of columns to write (default is all columns)
* header: whether to include column names as the header (default is True)
* index: whether to write row indices to the file (default is True)
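Here is a minimal round-trip sketch. To keep it self-contained I read from and write to in-memory text buffers (io.StringIO) instead of real files; with a real file you would pass a path such as a file name instead. The semicolon-separated sample data is invented.

```python
import io
import pandas as pd

# Pretend this text is the content of a semicolon-separated CSV file.
csv_text = (
    "Product;Country of Origin;Price\n"
    "Coffee;Brasil;4.0\n"
    "Tea;India;2.0\n"
)

# Read only two of the three columns and use "Product" as the index.
df = pd.read_csv(
    io.StringIO(csv_text),
    sep=";",
    usecols=["Product", "Price"],
    index_col="Product",
)

# Write the result back out as comma-separated text, keeping the index.
buffer = io.StringIO()
df.to_csv(buffer, sep=",", index=True)

print(buffer.getvalue())
```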

Lambda function

This is one of the most exciting features of Python if you ask me: you can pass a small user-defined function directly to an existing function or method. A lambda function can take any number of arguments, but can only have one expression.

For example, when using iloc[] with a DataFrame, you can pass a lambda function inside the square brackets: variable = df.iloc[lambda x: x.index % 2 == 0]. This will only extract the rows with an even index.
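A small sketch of both ideas: a standalone lambda, and the same even-index filter applied to a five-row DataFrame with the default 0 to 4 index. The name add is just an illustration.

```python
import pandas as pd

# A lambda is an unnamed function limited to a single expression.
add = lambda a, b: a + b
print(add(2, 3))  # 5

df = pd.DataFrame({"Product": ["Coffee", "Tea", "Apples", "Banana", "Beef"]})

# iloc calls the lambda with the DataFrame itself (x) and uses the
# resulting Boolean array to pick rows: True where the index is even.
even_rows = df.iloc[lambda x: x.index % 2 == 0]

print(even_rows)
```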