Analyzing Tabular Data#

Overview

Questions:

  • How do I work with data presented in tables?

Objectives:

  • Use functions in pandas to read in tabular data.

  • Access information in a data frame using column names and row numbers.

Most scientists work with a lot of numerical data. In this module we will focus on reading in and analyzing numerical data, visualizing the data, and working with arrays.

Reading in Tabular Data#

As we already discussed, there are many ways to read in data from files in Python. In our last module, we used the readlines() function to read in a complex output file. In theory, you could always use the readlines() function, and then use the data parsing tools we learned in the previous module to format the data as you needed. But sometimes there are other ways that make more sense, particularly if the data is formatted in a table.

A common format for tabular data is the CSV file, or comma-separated values file. This is exactly what it sounds like: data is presented in rows, with each value separated by a comma. If you have data in a spreadsheet program that you need to import into a Python program, you can save the data as a CSV file to read it in.

In this example, we have a CSV file that contains data from a molecular dynamics trajectory. We have a 20 ns simulation that used a 2 fs timestep. The data was saved to the trajectory file every 1000 steps, so our file has 10,000 frames. At each frame, we are interested in the distance between particular atoms. These trajectories were generated with the AMBER molecular dynamics program, and the distances were measured with the Python library MDAnalysis. The table of atomic distances was saved as a CSV file called “distance_data_headers.csv”. This file was downloaded as part of your lesson materials. Open the file in a text editor and study it to determine its structure.
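
If you prefer to stay in Python, you can also peek at the first few lines of the file. A minimal sketch, using the file path that the rest of this lesson assumes:

# Print the first five lines of the file to see its structure.
with open("data/distance_data_headers.csv") as csv_file:
    for line in csv_file.readlines()[:5]:
        print(line.strip())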

In analyzing tabular data, we often need to perform the same types of calculations (averaging, calculating the minimum or maximum of the data set), so we are once again going to use a Python library, this time a library that contains lots of functions to work with tables.

This library is called pandas. When it is imported, it is usually shortened to pd.

import pandas as pd

The function we will use is called read_csv. You can read more about this function in your Jupyter notebook by using

help(pd.read_csv)

Library Documentation

Most popular Python libraries have very good online documentation. You can find the pandas documentation by googling “pandas docs”. You will be able to find the same help message you get for read_csv as well as tutorials and other types of documentation.

  1. Pandas Documentation

  2. read_csv documentation

To get started, we will make a variable for our file path, then use that variable in the read_csv function.

distance_file = "data/distance_data_headers.csv"

distances = pd.read_csv(distance_file)

The variable distances is a pandas dataframe. When we want to look at a dataframe, we should not print it; instead, we should put it as the last line of a cell. Pandas dataframes are rendered in a special way in the Jupyter notebook, and if you use the print function with them, they will not look as nice.

distances
       Frame  THR4_ATP  THR4_ASP  TYR6_ATP  TYR6_ASP
0          1    8.9542    5.8024   11.5478    9.9557
1          2    8.6181    6.0942   13.9594   11.6945
2          3    9.0066    6.0637   13.0924   11.3043
3          4    9.2002    6.0227   14.5282   10.1763
4          5    9.1294    5.9365   13.5321   10.6279
...      ...       ...       ...       ...       ...
9995    9996    8.5083    7.7587    9.1789   10.6715
9996    9997    8.9524    7.4681    9.5132   10.9945
9997    9998    8.6625    7.7306    9.5469   10.3063
9998    9999    9.2456    7.8886    9.8151   10.7564
9999   10000    8.8135    7.9170    9.9517   10.7848

10000 rows × 5 columns

This output gives us some information about what is in the dataframe. You can see that pandas made the first row in the file into headers. It also tells us that we have a table with 10,000 rows and 5 columns.
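
If a file does not have a header row, read_csv will treat the first line of data as the headers unless you tell it otherwise. A minimal sketch, assuming a hypothetical file no_headers.csv that contains the same data but no header row:

# header=None tells pandas there is no header row; names supplies the column names.
pd.read_csv("no_headers.csv", header=None, names=["Frame", "THR4_ATP", "THR4_ASP", "TYR6_ATP", "TYR6_ASP"])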

Accessing Data in the Data Frame#

One way to get information from a dataframe is by using the headers, or column names, with square brackets. The syntax for this is


dataframe["column_name"]

For example, to get the column “THR4_ATP”, we put the name in square brackets.

distances["THR4_ATP"]
0       8.9542
1       8.6181
2       9.0066
3       9.2002
4       9.1294
         ...  
9995    8.5083
9996    8.9524
9997    8.6625
9998    9.2456
9999    8.8135
Name: THR4_ATP, Length: 10000, dtype: float64

If you want multiple columns, you use a list of column names in square brackets.

distances[["THR4_ATP", "TYR6_ATP"]]
      THR4_ATP  TYR6_ATP
0       8.9542   11.5478
1       8.6181   13.9594
2       9.0066   13.0924
3       9.2002   14.5282
4       9.1294   13.5321
...        ...       ...
9995    8.5083    9.1789
9996    8.9524    9.5132
9997    8.6625    9.5469
9998    9.2456    9.8151
9999    8.8135    9.9517

10000 rows × 2 columns

If we want to get information in the dataframe using row and column numbers, we use the iloc function.

The syntax for iloc is

dataframe.iloc[row_number, column_number]

If you specify only a row number, you will get all the columns.

# This will get the first row, all of the columns.
distances.iloc[0]
Frame        1.0000
THR4_ATP     8.9542
THR4_ASP     5.8024
TYR6_ATP    11.5478
TYR6_ASP     9.9557
Name: 0, dtype: float64

# This will get the first row and the second column.
distances.iloc[0, 1]
8.9542

Check Your Understanding

How would you get the THR4_ASP column?

How would you get the value in row index 10 of the THR4_ASP column?
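
One possible solution, shown as a sketch (other approaches work just as well):

# Get the THR4_ASP column by name
distances["THR4_ASP"]

# Get the value in row index 10 of the THR4_ASP column (column number 2)
distances.iloc[10, 2]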

Slicing#

Similar to lists, you can slice pandas dataframes when you use iloc. This allows you to get a range of rows or columns. To take a slice, you use a colon:


dataframe.iloc[row_start:row_end, column_start:column_end]

For example, to get the first ten rows (rows 0 through 9, since the end of the slice is not included, just like with lists) and the first two columns:

distances.iloc[0:10, :2]
   Frame  THR4_ATP
0      1    8.9542
1      2    8.6181
2      3    9.0066
3      4    9.2002
4      5    9.1294
5      6    9.0462
6      7    8.8657
7      8    9.3256
8      9    9.4184
9     10    9.0600

Analyzing Tabular Data#

A pandas dataframe has a number of functions built in that you can use to analyze your data.

The first we will look at is the mean function. If we do dataframe.mean(), we will get the average value of each column.

distances.mean()
Frame       5000.500000
THR4_ATP      10.876951
THR4_ASP       7.342345
TYR6_ATP      11.209791
TYR6_ASP      10.993443
dtype: float64
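
The mean function also works on a single column. For example, to get the average of only the THR4_ATP distances, select the column first:

distances["THR4_ATP"].mean()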

Similarly, we can get the standard deviation of each column by using the std function.

distances.std()
Frame       2886.895680
THR4_ATP       2.676669
THR4_ASP       0.995629
TYR6_ATP       2.367468
TYR6_ASP       1.058657
dtype: float64

You can quickly get descriptive statistics for all of the columns by using the describe function.

distances.describe()
              Frame      THR4_ATP      THR4_ASP      TYR6_ATP      TYR6_ASP
count   10000.00000  10000.000000  10000.000000  10000.000000  10000.000000
mean     5000.50000     10.876951      7.342345     11.209791     10.993443
std      2886.89568      2.676669      0.995629      2.367468      1.058657
min         1.00000      5.741600      3.606100      5.707900      7.855700
25%      2500.75000      8.740850      6.739650      9.560125     10.299925
50%      5000.50000     10.775950      7.410700     10.894250     10.880550
75%      7500.25000     12.892300      7.976450     12.921500     11.574075
max     10000.00000     19.480000     10.816900     18.932300     15.006500

The output of describe is another pandas dataframe, so we can save our results in a variable and access it the same way we did with our other dataframe.

statistics = distances.describe()
statistics
              Frame      THR4_ATP      THR4_ASP      TYR6_ATP      TYR6_ASP
count   10000.00000  10000.000000  10000.000000  10000.000000  10000.000000
mean     5000.50000     10.876951      7.342345     11.209791     10.993443
std      2886.89568      2.676669      0.995629      2.367468      1.058657
min         1.00000      5.741600      3.606100      5.707900      7.855700
25%      2500.75000      8.740850      6.739650      9.560125     10.299925
50%      5000.50000     10.775950      7.410700     10.894250     10.880550
75%      7500.25000     12.892300      7.976450     12.921500     11.574075
max     10000.00000     19.480000     10.816900     18.932300     15.006500

# Get the THR4_ATP column
statistics["THR4_ATP"]
count    10000.000000
mean        10.876951
std          2.676669
min          5.741600
25%          8.740850
50%         10.775950
75%         12.892300
max         19.480000
Name: THR4_ATP, dtype: float64
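
Because statistics is a dataframe, you can also pull out a single number, such as the mean of THR4_ATP, by selecting the column and then the row label. A small sketch:

statistics["THR4_ATP"]["mean"]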

Writing Files using Pandas#

Pandas has built-in functions to write our data to files. We can save our results to a new CSV file using the function to_csv.

# save a file called "analysis.csv"
statistics.to_csv("analysis.csv")
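
If you want to check the file, you can read it back in with read_csv. Because to_csv also writes the row labels (count, mean, and so on) as the first column, passing index_col=0 keeps them as row labels instead of a data column. A sketch:

# Read the statistics back in, using the first column as row labels.
analysis = pd.read_csv("analysis.csv", index_col=0)
analysis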

Mathematical Operations and Filtering#

Pandas dataframes will often allow you to avoid using for loops. For example, if we wanted to multiply a column by 10, we could do

distances["THR4_ASP"] * 10
0       58.024
1       60.942
2       60.637
3       60.227
4       59.365
         ...  
9995    77.587
9996    74.681
9997    77.306
9998    78.886
9999    79.170
Name: THR4_ASP, Length: 10000, dtype: float64

This also works for adding, subtracting, or dividing. We can also easily save our calculated value as a new column. The syntax for this is

dataframe["new_column_name"] = calculation

Imagine we wanted to convert our distances to nanometers. We could do

distances["THR4_ASP(nm)"] = distances["THR4_ASP"] * 0.1
distances
       Frame  THR4_ATP  THR4_ASP  TYR6_ATP  TYR6_ASP  THR4_ASP(nm)
0          1    8.9542    5.8024   11.5478    9.9557       0.58024
1          2    8.6181    6.0942   13.9594   11.6945       0.60942
2          3    9.0066    6.0637   13.0924   11.3043       0.60637
3          4    9.2002    6.0227   14.5282   10.1763       0.60227
4          5    9.1294    5.9365   13.5321   10.6279       0.59365
...      ...       ...       ...       ...       ...           ...
9995    9996    8.5083    7.7587    9.1789   10.6715       0.77587
9996    9997    8.9524    7.4681    9.5132   10.9945       0.74681
9997    9998    8.6625    7.7306    9.5469   10.3063       0.77306
9998    9999    9.2456    7.8886    9.8151   10.7564       0.78886
9999   10000    8.8135    7.9170    9.9517   10.7848       0.79170

10000 rows × 6 columns

Instead of writing a for loop with an if statement to check each value, we can compare an entire column against a condition at once. For example, to check which rows have a THR4_ATP distance greater than 9, we can do

distances["THR4_ATP"] > 9
0       False
1       False
2        True
3        True
4        True
        ...  
9995    False
9996    False
9997    False
9998     True
9999    False
Name: THR4_ATP, Length: 10000, dtype: bool

We can then use that result as an index to get only the rows where the condition is True.

distances[distances["THR4_ATP"] > 9]
       Frame  THR4_ATP  THR4_ASP  TYR6_ATP  TYR6_ASP  THR4_ASP(nm)
2          3    9.0066    6.0637   13.0924   11.3043       0.60637
3          4    9.2002    6.0227   14.5282   10.1763       0.60227
4          5    9.1294    5.9365   13.5321   10.6279       0.59365
5          6    9.0462    6.2553   13.9060   10.4292       0.62553
7          8    9.3256    6.2351   13.0721   10.4740       0.62351
...      ...       ...       ...       ...       ...           ...
9990    9991    9.1535    8.3352   10.9256   11.3146       0.83352
9991    9992    9.4492    8.1078   11.0096   11.5627       0.81078
9992    9993    9.7893    7.5384   10.1918   11.0180       0.75384
9993    9994    9.4321    8.0167    9.5764   10.9514       0.80167
9998    9999    9.2456    7.8886    9.8151   10.7564       0.78886

7143 rows × 6 columns
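
You can combine more than one condition using & (and) or | (or), with each condition wrapped in its own parentheses. A sketch with arbitrary cutoff values:

# Rows where both distances are greater than 10
distances[(distances["THR4_ATP"] > 10) & (distances["TYR6_ATP"] > 10)]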

Pandas and NumPy#

Another commonly used library in data science and scientific programming is called NumPy.

Pandas dataframes are built on top of a data structure known as the NumPy array.

In general, you should use a pandas dataframe when working with data that is:

  • Two dimensional (rows and columns).

  • Labeled.

  • Mixed type.

  • Something for which you would like to be able to easily get statistics.

You should work with NumPy arrays when:

  • You have higher dimensional data (for example, a collection of two-dimensional arrays).

  • You need to perform advanced mathematics like linear algebra.

  • You are using a library that requires NumPy arrays (for example, scikit-learn).

To convert a dataframe to a NumPy array, use the to_numpy() function.

distances_array = distances.to_numpy()
print(distances_array)
[[1.00000e+00 8.95420e+00 5.80240e+00 1.15478e+01 9.95570e+00 5.80240e-01]
 [2.00000e+00 8.61810e+00 6.09420e+00 1.39594e+01 1.16945e+01 6.09420e-01]
 [3.00000e+00 9.00660e+00 6.06370e+00 1.30924e+01 1.13043e+01 6.06370e-01]
 ...
 [9.99800e+03 8.66250e+00 7.73060e+00 9.54690e+00 1.03063e+01 7.73060e-01]
 [9.99900e+03 9.24560e+00 7.88860e+00 9.81510e+00 1.07564e+01 7.88860e-01]
 [1.00000e+04 8.81350e+00 7.91700e+00 9.95170e+00 1.07848e+01 7.91700e-01]]

After you use the to_numpy() function, you will get a NumPy array. A NumPy array does not have column names. You can slice a NumPy array using square brackets [] with the same slicing syntax you used with pandas dataframes and iloc.

distances_array[0:10, 1:3]
array([[8.9542, 5.8024],
       [8.6181, 6.0942],
       [9.0066, 6.0637],
       [9.2002, 6.0227],
       [9.1294, 5.9365],
       [9.0462, 6.2553],
       [8.8657, 5.9186],
       [9.3256, 6.2351],
       [9.4184, 6.1993],
       [9.06  , 6.0478]])
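
Since the array has no column names, you refer to columns by number when doing calculations. For example, the average of the THR4_ATP values (column number 1) could be computed as:

distances_array[:, 1].mean()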

Your Turn#

Pandas Exercise

Read in the file PubChemElements_all.csv.

  1. Calculate the melting points in Celsius and save them in a new column. The melting and boiling point columns are in Kelvin.

  2. Calculate column statistics and save them in a new dataframe.

  3. Find elements where the boiling point is greater than the average boiling point.
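
A possible solution sketch is shown below. The column names MeltingPoint and BoilingPoint, and the file location, are assumptions; check the headers in your copy of PubChemElements_all.csv and adjust them if they differ.

# The file path and column names below are assumptions; adjust them to match your file.
elements = pd.read_csv("PubChemElements_all.csv")

# 1. Convert melting points from Kelvin to Celsius and save as a new column.
elements["MeltingPointC"] = elements["MeltingPoint"] - 273.15

# 2. Calculate column statistics and save them in a new dataframe.
element_statistics = elements.describe()

# 3. Find elements where the boiling point is greater than the average boiling point.
elements[elements["BoilingPoint"] > elements["BoilingPoint"].mean()]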

Key Points

  • Use pandas to read and work with data that has rows and columns (two dimensional data).

  • Pandas dataframes let you use column names.

  • Use dataframe["column_name"] to access columns of information.

  • You can slice dataframes by index using the dataframe.iloc[] syntax.

  • dataframe.describe() can be used to quickly get statistical information about columns in a dataframe.