Analyzing Tabular Data#
Overview
Questions:
How do I work with data presented in tables?
Objectives:
Use functions in
pandas
to read in tabular data.Access information in a data frame using column names and row numbers.
Most scientists work with a lot of numerical data. In this module we will focus on reading in and analyzing numerical data, visualizing the data, and working with arrays.
Reading in Tabular Data#
As we already discussed, there are many ways to read in data from files in Python. In our last module, we used the readlines() function to read in a complex output file. In theory, you could always use the readlines() function, and then use the data parsing tools we learned in the previous module to format the data as you needed. But sometimes there are other ways that make more sense, particularly if the data is formatted in a table.
A common table format is the CSV
file or comma separated values
.
This is exactly what it sounds like.
Data is presented in rows, with each value separated by a comma.
If you have data in a spreadsheet program that you need to import into a python code, you can save the data as a csvfile to read it in.
In this example, we have a CSV file that contains data from a molecular dynamics trajectory. We have a 20 ns simulation that used a 2 fs timestep. The data was saved to the trajectory file every 1000 steps, so our file has 10,000 timesteps. At each timestep, we are interested in the distance between particular atoms. These trajectories were generated with the AMBER molecular dynamics program and the distances were measured with the python program MDAnalysis. The table of atomic distances was saved as a CSV file called “distance_data_headers.csv”. This file was downloaded as part of your lesson materials. Open the file in a text editor and study it to determine its structure.
In analyzing tabular data, we often need to perform the same types of calculations (averaging, calculating the minimum or maximum of the data set), so we are once again going to use a Python library, this time a library that contains lots of functions to work with tables.
This library is called pandas
. When it is imported, it is usually shortened to pd
.
import pandas as pd
The function we will use is called read_csv
.
You can read more about this function in your Jupyter notebook by using
help(pd.read_csv)
Library Documentation
Most popular Python libraries have very good online documentation.
You can find the pandas documentation by googling “pandas docs”.
You will be able to find the same help message you get for read_csv
as well as tutorials and other types of documentation.
To get started, we will make a variable for our file path, then use that variable in the read_csv
function.
distance_file = "data/distance_data_headers.csv"
distances = pd.read_csv(distance_file)
The variable called distances
is a pandas dataframe
.
When we want to look at a dataframe, we should not print it.
We should instead put it as the last thing in a cell.
This is because pandas dataframes are rendered in a special way in the Jupyter notebook.
If you use the print
function with them, they will not look as nice.
distances
Frame | THR4_ATP | THR4_ASP | TYR6_ATP | TYR6_ASP | |
---|---|---|---|---|---|
0 | 1 | 8.9542 | 5.8024 | 11.5478 | 9.9557 |
1 | 2 | 8.6181 | 6.0942 | 13.9594 | 11.6945 |
2 | 3 | 9.0066 | 6.0637 | 13.0924 | 11.3043 |
3 | 4 | 9.2002 | 6.0227 | 14.5282 | 10.1763 |
4 | 5 | 9.1294 | 5.9365 | 13.5321 | 10.6279 |
... | ... | ... | ... | ... | ... |
9995 | 9996 | 8.5083 | 7.7587 | 9.1789 | 10.6715 |
9996 | 9997 | 8.9524 | 7.4681 | 9.5132 | 10.9945 |
9997 | 9998 | 8.6625 | 7.7306 | 9.5469 | 10.3063 |
9998 | 9999 | 9.2456 | 7.8886 | 9.8151 | 10.7564 |
9999 | 10000 | 8.8135 | 7.9170 | 9.9517 | 10.7848 |
10000 rows × 5 columns
This output gives us some information about what is in the dataframe. You can see that pandas made the first row in the file into headers. It also tells us that we have a table with 10,000 rows and 5 columns.
Accessing Data in the Data Frame#
One way to get information in a data frame is by using the headers, or the column names using squre brackets. The synatx for this is
dataframe["column_name"]
For example, to get the column “THR4_ATP”, we put the name in square brackets.
distances["THR4_ATP"]
0 8.9542
1 8.6181
2 9.0066
3 9.2002
4 9.1294
...
9995 8.5083
9996 8.9524
9997 8.6625
9998 9.2456
9999 8.8135
Name: THR4_ATP, Length: 10000, dtype: float64
If you want multiple columns, you use a list of column names in square brackets.
distances[["THR4_ATP", "TYR6_ATP"]]
THR4_ATP | TYR6_ATP | |
---|---|---|
0 | 8.9542 | 11.5478 |
1 | 8.6181 | 13.9594 |
2 | 9.0066 | 13.0924 |
3 | 9.2002 | 14.5282 |
4 | 9.1294 | 13.5321 |
... | ... | ... |
9995 | 8.5083 | 9.1789 |
9996 | 8.9524 | 9.5132 |
9997 | 8.6625 | 9.5469 |
9998 | 9.2456 | 9.8151 |
9999 | 8.8135 | 9.9517 |
10000 rows × 2 columns
If we want to get information in the dataframe using row and column numbers, we use the iloc
function.
The syntax for iloc
is
dataframe.iloc[row_number, column_number]
If you specify only a row number, you will get all the columns.
# This will get the first row, all of the columns.
distances.iloc[0]
Frame 1.0000
THR4_ATP 8.9542
THR4_ASP 5.8024
TYR6_ATP 11.5478
TYR6_ASP 9.9557
Name: 0, dtype: float64
# This will get the first row and the second column.
distances.iloc[0, 1]
8.9542
Check Your Understanding
How would you get the THR4_ASP
column?
How would you get the value in row index 10 of the THR4_ASP
column?
Solution
# Get the THR4_ASP column
distances["THR4_ASP"]
# Get row 1o of the THR4_ASP column.
distances["THR4_ASP"].iloc[10]
Slicing#
Similar to lists, you can slice pandas dataframes when you use iloc
.
This allows you to get a range of rows or columns.
To take a slice, you use a colon:
dataframe.iloc[row_start:row_end, column_start:column_end]
distances.iloc[0:10, :2]
Frame | THR4_ATP | |
---|---|---|
0 | 1 | 8.9542 |
1 | 2 | 8.6181 |
2 | 3 | 9.0066 |
3 | 4 | 9.2002 |
4 | 5 | 9.1294 |
5 | 6 | 9.0462 |
6 | 7 | 8.8657 |
7 | 8 | 9.3256 |
8 | 9 | 9.4184 |
9 | 10 | 9.0600 |
Analyzing Tabular Data#
A pandas dataframe has a number of functions built in that you can use to analyze your data.
The first we will look at is the mean
function. If we do dataframe.mean()
, we will get the average value of each column.
distances.mean()
Frame 5000.500000
THR4_ATP 10.876951
THR4_ASP 7.342345
TYR6_ATP 11.209791
TYR6_ASP 10.993443
dtype: float64
Similarly, we can get the standard deviation of each column by using the std
function.
distances.std()
Frame 2886.895680
THR4_ATP 2.676669
THR4_ASP 0.995629
TYR6_ATP 2.367468
TYR6_ASP 1.058657
dtype: float64
You can quickly get all descriptive statistics of all of the columns by using .describe
.
distances.describe()
Frame | THR4_ATP | THR4_ASP | TYR6_ATP | TYR6_ASP | |
---|---|---|---|---|---|
count | 10000.00000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 |
mean | 5000.50000 | 10.876951 | 7.342345 | 11.209791 | 10.993443 |
std | 2886.89568 | 2.676669 | 0.995629 | 2.367468 | 1.058657 |
min | 1.00000 | 5.741600 | 3.606100 | 5.707900 | 7.855700 |
25% | 2500.75000 | 8.740850 | 6.739650 | 9.560125 | 10.299925 |
50% | 5000.50000 | 10.775950 | 7.410700 | 10.894250 | 10.880550 |
75% | 7500.25000 | 12.892300 | 7.976450 | 12.921500 | 11.574075 |
max | 10000.00000 | 19.480000 | 10.816900 | 18.932300 | 15.006500 |
The output of these functions is a pandas dataframe, so we can save our results in a variable and access it the same we we did with our other dataframe.
statistics = distances.describe()
statistics
Frame | THR4_ATP | THR4_ASP | TYR6_ATP | TYR6_ASP | |
---|---|---|---|---|---|
count | 10000.00000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 |
mean | 5000.50000 | 10.876951 | 7.342345 | 11.209791 | 10.993443 |
std | 2886.89568 | 2.676669 | 0.995629 | 2.367468 | 1.058657 |
min | 1.00000 | 5.741600 | 3.606100 | 5.707900 | 7.855700 |
25% | 2500.75000 | 8.740850 | 6.739650 | 9.560125 | 10.299925 |
50% | 5000.50000 | 10.775950 | 7.410700 | 10.894250 | 10.880550 |
75% | 7500.25000 | 12.892300 | 7.976450 | 12.921500 | 11.574075 |
max | 10000.00000 | 19.480000 | 10.816900 | 18.932300 | 15.006500 |
# Get the THR4_ATP column
statistics["THR4_ATP"]
count 10000.000000
mean 10.876951
std 2.676669
min 5.741600
25% 8.740850
50% 10.775950
75% 12.892300
max 19.480000
Name: THR4_ATP, dtype: float64
Writing Files using Pandas#
Pandas has built-in functions to write our data to files.
We can save our results to a new CSV file using the function to_csv
.
# save a file called "analysis.csv"
statistics.to_csv("analysis.csv")
Mathematical Operations and Filtering#
Pandas dataframes will often allow you to avoid using for
loops.
For example, if we wanted to multiply a column by 10, we could do
distances["THR4_ASP"] * 10
0 58.024
1 60.942
2 60.637
3 60.227
4 59.365
...
9995 77.587
9996 74.681
9997 77.306
9998 78.886
9999 79.170
Name: THR4_ASP, Length: 10000, dtype: float64
This would also work for adding, subtracting, or multiplying. We can also easily save our calculated value as a new column. The syntax for this is
dataframe["new_column_name"] = calculation
Imagine we wanted to convert our distances to nanometers, we could do
distances["THR4_ASP(nm)"] = distances["THR4_ASP"] * 0.1
distances
Frame | THR4_ATP | THR4_ASP | TYR6_ATP | TYR6_ASP | THR4_ASP(nm) | |
---|---|---|---|---|---|---|
0 | 1 | 8.9542 | 5.8024 | 11.5478 | 9.9557 | 0.58024 |
1 | 2 | 8.6181 | 6.0942 | 13.9594 | 11.6945 | 0.60942 |
2 | 3 | 9.0066 | 6.0637 | 13.0924 | 11.3043 | 0.60637 |
3 | 4 | 9.2002 | 6.0227 | 14.5282 | 10.1763 | 0.60227 |
4 | 5 | 9.1294 | 5.9365 | 13.5321 | 10.6279 | 0.59365 |
... | ... | ... | ... | ... | ... | ... |
9995 | 9996 | 8.5083 | 7.7587 | 9.1789 | 10.6715 | 0.77587 |
9996 | 9997 | 8.9524 | 7.4681 | 9.5132 | 10.9945 | 0.74681 |
9997 | 9998 | 8.6625 | 7.7306 | 9.5469 | 10.3063 | 0.77306 |
9998 | 9999 | 9.2456 | 7.8886 | 9.8151 | 10.7564 | 0.78886 |
9999 | 10000 | 8.8135 | 7.9170 | 9.9517 | 10.7848 | 0.79170 |
10000 rows × 6 columns
Instead of writing an if
statement, we can do
distances["THR4_ATP"] > 9
0 False
1 False
2 True
3 True
4 True
...
9995 False
9996 False
9997 False
9998 True
9999 False
Name: THR4_ATP, Length: 10000, dtype: bool
We can then use that as an index to get the rows where the condition is True
distances[distances["THR4_ATP"] > 9]
Frame | THR4_ATP | THR4_ASP | TYR6_ATP | TYR6_ASP | THR4_ASP(nm) | |
---|---|---|---|---|---|---|
2 | 3 | 9.0066 | 6.0637 | 13.0924 | 11.3043 | 0.60637 |
3 | 4 | 9.2002 | 6.0227 | 14.5282 | 10.1763 | 0.60227 |
4 | 5 | 9.1294 | 5.9365 | 13.5321 | 10.6279 | 0.59365 |
5 | 6 | 9.0462 | 6.2553 | 13.9060 | 10.4292 | 0.62553 |
7 | 8 | 9.3256 | 6.2351 | 13.0721 | 10.4740 | 0.62351 |
... | ... | ... | ... | ... | ... | ... |
9990 | 9991 | 9.1535 | 8.3352 | 10.9256 | 11.3146 | 0.83352 |
9991 | 9992 | 9.4492 | 8.1078 | 11.0096 | 11.5627 | 0.81078 |
9992 | 9993 | 9.7893 | 7.5384 | 10.1918 | 11.0180 | 0.75384 |
9993 | 9994 | 9.4321 | 8.0167 | 9.5764 | 10.9514 | 0.80167 |
9998 | 9999 | 9.2456 | 7.8886 | 9.8151 | 10.7564 | 0.78886 |
7143 rows × 6 columns
Pandas and NumPy#
Another commonly used library in data science and scientifid programming is called NumPy.
Pandas dataframes are built on top of a data structure known as the NumPy Array..
In general, you should use pandas dataframe when working with data which is:
Two dimensional (rows and columns).
Labeled.
Mixed type.
Something for which you would like to be able to easily get statistics.
You should work with NumPy arrays when:
You have higher dimensional data (collection of two dimensional arrays).
You need to perform advanced mathematics like linear algebra.
You are using a library which requires NumPy arrays (scikitlearn).
To convert a dataframe to a numpy array, use the function .to_numpy()
.
distances_array = distances.to_numpy()
print(distances_array)
[[1.00000e+00 8.95420e+00 5.80240e+00 1.15478e+01 9.95570e+00 5.80240e-01]
[2.00000e+00 8.61810e+00 6.09420e+00 1.39594e+01 1.16945e+01 6.09420e-01]
[3.00000e+00 9.00660e+00 6.06370e+00 1.30924e+01 1.13043e+01 6.06370e-01]
...
[9.99800e+03 8.66250e+00 7.73060e+00 9.54690e+00 1.03063e+01 7.73060e-01]
[9.99900e+03 9.24560e+00 7.88860e+00 9.81510e+00 1.07564e+01 7.88860e-01]
[1.00000e+04 8.81350e+00 7.91700e+00 9.95170e+00 1.07848e+01 7.91700e-01]]
After you use the to_numpy()
function, you will gt a NumPy array.
A NumPy array does not have columnn names.
You can slice a NumPy array by using square brackets []
and the same slicing you used with
pandas dataframes and iloc
.
distances_array[0:10, 1:3]
array([[8.9542, 5.8024],
[8.6181, 6.0942],
[9.0066, 6.0637],
[9.2002, 6.0227],
[9.1294, 5.9365],
[9.0462, 6.2553],
[8.8657, 5.9186],
[9.3256, 6.2351],
[9.4184, 6.1993],
[9.06 , 6.0478]])
Your Turn#
Pandas Exercise
Read in file PubChemElements_all.csv
.
Calculate melting points in celsius and save in a new column. The columns with melting and boiling points use Kelvin.
Calculate column statistics and save them in a new dataframe.
Find elements where the boiling point is greater than the average boiling point.
Solution
This shows one potential solution.
periodic_data = pd.read_csv("data/PubChemElements_all.csv")
# 1 - Calculate melting points in celsius and save in a new column.
periodic_data["MeltingPointC"] = periodic_data["MeltingPoint"] - 273.15
# 2 - Calculate the column statistics and sav them in a new dataframe.
stats = periodic_data.describe()
# 3. Find elements where the boiling point is greater than the average.
periodic_data[ periodic_data["BoilingPoint"] > periodic_data["BoilingPoint"].mean() ]
Key Points
Use pandas to read and work with data that has rows and columns (two dimensional data).
Pandas dataframes let you use column names.
Use `dataframe[“column_name”] to access columns of information.
You can slice dataframes using indices using the
dataframe.iloc[]
syntax.dataframe.describe()
can be used to quickly get statistical information about columns in a dataframe.