Analyzing Tabular Data

Analyzing Tabular Data#

Overview

Questions:

How do I work with data presented in tables?

Objectives:

Use functions in pandas to read in tabular data.
Access information in a data frame using column names and row numbers.

Most scientists work with a lot of numerical data. In this module we will focus on reading in and analyzing numerical data, visualizing the data, and working with arrays.

Reading in Tabular Data#

As we already discussed, there are many ways to read in data from files in Python. In our last module, we used the readlines() function to read in a complex output file. In theory, you could always use the readlines() function, and then use the data parsing tools we learned in the previous module to format the data as you needed. But sometimes there are other ways that make more sense, particularly if the data is formatted in a table.

A common table format is the CSV file or comma separated values. This is exactly what it sounds like. Data is presented in rows, with each value separated by a comma. If you have data in a spreadsheet program that you need to import into a python code, you can save the data as a csvfile to read it in.

In this example, we have a CSV file that contains data from a molecular dynamics trajectory. We have a 20 ns simulation that used a 2 fs timestep. The data was saved to the trajectory file every 1000 steps, so our file has 10,000 timesteps. At each timestep, we are interested in the distance between particular atoms. These trajectories were generated with the AMBER molecular dynamics program and the distances were measured with the python program MDAnalysis. The table of atomic distances was saved as a CSV file called “distance_data_headers.csv”. This file was downloaded as part of your lesson materials. Open the file in a text editor and study it to determine its structure.

In analyzing tabular data, we often need to perform the same types of calculations (averaging, calculating the minimum or maximum of the data set), so we are once again going to use a Python library, this time a library that contains lots of functions to work with tables.

This library is called pandas. When it is imported, it is usually shortened to pd.

import pandas as pd

The function we will use is called read_csv. You can read more about this function in your Jupyter notebook by using

help(pd.read_csv)

Library Documentation

Most popular Python libraries have very good online documentation. You can find the pandas documentation by googling “pandas docs”. You will be able to find the same help message you get for read_csv as well as tutorials and other types of documentation.

To get started, we will make a variable for our file path, then use that variable in the read_csv function.

distance_file = "data/distance_data_headers.csv"

distances = pd.read_csv(distance_file)

The variable called distances is a pandas dataframe. When we want to look at a dataframe, we should not print it. We should instead put it as the last thing in a cell. This is because pandas dataframes are rendered in a special way in the Jupyter notebook. If you use the print function with them, they will not look as nice.

distances

	Frame	THR4_ATP	THR4_ASP	TYR6_ATP	TYR6_ASP
0	1	8.9542	5.8024	11.5478	9.9557
1	2	8.6181	6.0942	13.9594	11.6945
2	3	9.0066	6.0637	13.0924	11.3043
3	4	9.2002	6.0227	14.5282	10.1763
4	5	9.1294	5.9365	13.5321	10.6279
...	...	...	...	...	...
9995	9996	8.5083	7.7587	9.1789	10.6715
9996	9997	8.9524	7.4681	9.5132	10.9945
9997	9998	8.6625	7.7306	9.5469	10.3063
9998	9999	9.2456	7.8886	9.8151	10.7564
9999	10000	8.8135	7.9170	9.9517	10.7848

10000 rows × 5 columns

This output gives us some information about what is in the dataframe. You can see that pandas made the first row in the file into headers. It also tells us that we have a table with 10,000 rows and 5 columns.

Accessing Data in the Data Frame#

One way to get information in a data frame is by using the headers, or the column names using squre brackets. The synatx for this is

dataframe["column_name"]

For example, to get the column “THR4_ATP”, we put the name in square brackets.

distances["THR4_ATP"]

     8.9542
     8.6181
     9.0066
     9.2002
     9.1294
         ...  
  8.5083
  8.9524
  8.6625
  9.2456
  8.8135
Name: THR4_ATP, Length: 10000, dtype: float64

If you want multiple columns, you use a list of column names in square brackets.

distances[["THR4_ATP", "TYR6_ATP"]]

	THR4_ATP	TYR6_ATP
0	8.9542	11.5478
1	8.6181	13.9594
2	9.0066	13.0924
3	9.2002	14.5282
4	9.1294	13.5321
...	...	...
9995	8.5083	9.1789
9996	8.9524	9.5132
9997	8.6625	9.5469
9998	9.2456	9.8151
9999	8.8135	9.9517

10000 rows × 2 columns

If we want to get information in the dataframe using row and column numbers, we use the iloc function.

The syntax for iloc is

dataframe.iloc[row_number, column_number]

If you specify only a row number, you will get all the columns.

# This will get the first row, all of the columns.
distances.iloc[0]

Frame        1.0000
THR4_ATP     8.9542
THR4_ASP     5.8024
TYR6_ATP    11.5478
TYR6_ASP     9.9557
Name: 0, dtype: float64

# This will get the first row and the second column.
distances.iloc[0, 1]

8.9542

Check Your Understanding

How would you get the THR4_ASP column?

How would you get the value in row index 10 of the THR4_ASP column?

Solution

# Get the THR4_ASP column
distances["THR4_ASP"]

# Get row 1o of the THR4_ASP column.
distances["THR4_ASP"].iloc[10]

Slicing#

Similar to lists, you can slice pandas dataframes when you use iloc. This allows you to get a range of rows or columns. To take a slice, you use a colon:

dataframe.iloc[row_start:row_end, column_start:column_end]

distances.iloc[0:10, :2]

	Frame	THR4_ATP
0	1	8.9542
1	2	8.6181
2	3	9.0066
3	4	9.2002
4	5	9.1294
5	6	9.0462
6	7	8.8657
7	8	9.3256
8	9	9.4184
9	10	9.0600

Analyzing Tabular Data#

A pandas dataframe has a number of functions built in that you can use to analyze your data.

The first we will look at is the mean function. If we do dataframe.mean(), we will get the average value of each column.

distances.mean()

Frame       5000.500000
THR4_ATP      10.876951
THR4_ASP       7.342345
TYR6_ATP      11.209791
TYR6_ASP      10.993443
dtype: float64

Similarly, we can get the standard deviation of each column by using the std function.

distances.std()

Frame       2886.895680
THR4_ATP       2.676669
THR4_ASP       0.995629
TYR6_ATP       2.367468
TYR6_ASP       1.058657
dtype: float64

You can quickly get all descriptive statistics of all of the columns by using .describe.

distances.describe()

	Frame	THR4_ATP	THR4_ASP	TYR6_ATP	TYR6_ASP
count	10000.00000	10000.000000	10000.000000	10000.000000	10000.000000
mean	5000.50000	10.876951	7.342345	11.209791	10.993443
std	2886.89568	2.676669	0.995629	2.367468	1.058657
min	1.00000	5.741600	3.606100	5.707900	7.855700
25%	2500.75000	8.740850	6.739650	9.560125	10.299925
50%	5000.50000	10.775950	7.410700	10.894250	10.880550
75%	7500.25000	12.892300	7.976450	12.921500	11.574075
max	10000.00000	19.480000	10.816900	18.932300	15.006500

The output of these functions is a pandas dataframe, so we can save our results in a variable and access it the same we we did with our other dataframe.

statistics = distances.describe()

statistics

	Frame	THR4_ATP	THR4_ASP	TYR6_ATP	TYR6_ASP
count	10000.00000	10000.000000	10000.000000	10000.000000	10000.000000
mean	5000.50000	10.876951	7.342345	11.209791	10.993443
std	2886.89568	2.676669	0.995629	2.367468	1.058657
min	1.00000	5.741600	3.606100	5.707900	7.855700
25%	2500.75000	8.740850	6.739650	9.560125	10.299925
50%	5000.50000	10.775950	7.410700	10.894250	10.880550
75%	7500.25000	12.892300	7.976450	12.921500	11.574075
max	10000.00000	19.480000	10.816900	18.932300	15.006500

# Get the THR4_ATP column
statistics["THR4_ATP"]

count    10000.000000
mean        10.876951
std          2.676669
min          5.741600
25%          8.740850
50%         10.775950
75%         12.892300
max         19.480000
Name: THR4_ATP, dtype: float64

Writing Files using Pandas#

Pandas has built-in functions to write our data to files. We can save our results to a new CSV file using the function to_csv.

# save a file called "analysis.csv"
statistics.to_csv("analysis.csv")

Mathematical Operations and Filtering#

Pandas dataframes will often allow you to avoid using for loops. For example, if we wanted to multiply a column by 10, we could do

distances["THR4_ASP"] * 10

     58.024
     60.942
     60.637
     60.227
     59.365
         ...  
  77.587
  74.681
  77.306
  78.886
  79.170
Name: THR4_ASP, Length: 10000, dtype: float64

This would also work for adding, subtracting, or multiplying. We can also easily save our calculated value as a new column. The syntax for this is

dataframe["new_column_name"] = calculation

Imagine we wanted to convert our distances to nanometers, we could do

distances["THR4_ASP(nm)"] = distances["THR4_ASP"] * 0.1
distances

	Frame	THR4_ATP	THR4_ASP	TYR6_ATP	TYR6_ASP	THR4_ASP(nm)
0	1	8.9542	5.8024	11.5478	9.9557	0.58024
1	2	8.6181	6.0942	13.9594	11.6945	0.60942
2	3	9.0066	6.0637	13.0924	11.3043	0.60637
3	4	9.2002	6.0227	14.5282	10.1763	0.60227
4	5	9.1294	5.9365	13.5321	10.6279	0.59365
...	...	...	...	...	...	...
9995	9996	8.5083	7.7587	9.1789	10.6715	0.77587
9996	9997	8.9524	7.4681	9.5132	10.9945	0.74681
9997	9998	8.6625	7.7306	9.5469	10.3063	0.77306
9998	9999	9.2456	7.8886	9.8151	10.7564	0.78886
9999	10000	8.8135	7.9170	9.9517	10.7848	0.79170

10000 rows × 6 columns

Instead of writing an if statement, we can do

distances["THR4_ATP"] > 9

     False
     False
      True
      True
      True
        ...  
  False
  False
  False
   True
  False
Name: THR4_ATP, Length: 10000, dtype: bool

We can then use that as an index to get the rows where the condition is True

distances[distances["THR4_ATP"] > 9]

	Frame	THR4_ATP	THR4_ASP	TYR6_ATP	TYR6_ASP	THR4_ASP(nm)
2	3	9.0066	6.0637	13.0924	11.3043	0.60637
3	4	9.2002	6.0227	14.5282	10.1763	0.60227
4	5	9.1294	5.9365	13.5321	10.6279	0.59365
5	6	9.0462	6.2553	13.9060	10.4292	0.62553
7	8	9.3256	6.2351	13.0721	10.4740	0.62351
...	...	...	...	...	...	...
9990	9991	9.1535	8.3352	10.9256	11.3146	0.83352
9991	9992	9.4492	8.1078	11.0096	11.5627	0.81078
9992	9993	9.7893	7.5384	10.1918	11.0180	0.75384
9993	9994	9.4321	8.0167	9.5764	10.9514	0.80167
9998	9999	9.2456	7.8886	9.8151	10.7564	0.78886

7143 rows × 6 columns

Pandas and NumPy#

Another commonly used library in data science and scientifid programming is called NumPy.

Pandas dataframes are built on top of a data structure known as the NumPy Array..

In general, you should use pandas dataframe when working with data which is:

Two dimensional (rows and columns).
Labeled.
Mixed type.
Something for which you would like to be able to easily get statistics.

You should work with NumPy arrays when:

You have higher dimensional data (collection of two dimensional arrays).
You need to perform advanced mathematics like linear algebra.
You are using a library which requires NumPy arrays (scikitlearn).

To convert a dataframe to a numpy array, use the function .to_numpy().

distances_array = distances.to_numpy()
print(distances_array)

[[1.00000e+00 8.95420e+00 5.80240e+00 1.15478e+01 9.95570e+00 5.80240e-01]
 [2.00000e+00 8.61810e+00 6.09420e+00 1.39594e+01 1.16945e+01 6.09420e-01]
 [3.00000e+00 9.00660e+00 6.06370e+00 1.30924e+01 1.13043e+01 6.06370e-01]
 ...
 [9.99800e+03 8.66250e+00 7.73060e+00 9.54690e+00 1.03063e+01 7.73060e-01]
 [9.99900e+03 9.24560e+00 7.88860e+00 9.81510e+00 1.07564e+01 7.88860e-01]
 [1.00000e+04 8.81350e+00 7.91700e+00 9.95170e+00 1.07848e+01 7.91700e-01]]

After you use the to_numpy() function, you will gt a NumPy array. A NumPy array does not have columnn names. You can slice a NumPy array by using square brackets [] and the same slicing you used with pandas dataframes and iloc.

distances_array[0:10, 1:3]

array([[8.9542, 5.8024],
       [8.6181, 6.0942],
       [9.0066, 6.0637],
       [9.2002, 6.0227],
       [9.1294, 5.9365],
       [9.0462, 6.2553],
       [8.8657, 5.9186],
       [9.3256, 6.2351],
       [9.4184, 6.1993],
       [9.06  , 6.0478]])

Your Turn#

Pandas Exercise

Read in file PubChemElements_all.csv.

Calculate melting points in celsius and save in a new column. The columns with melting and boiling points use Kelvin.
Calculate column statistics and save them in a new dataframe.
Find elements where the boiling point is greater than the average boiling point.

Solution

This shows one potential solution.

periodic_data = pd.read_csv("data/PubChemElements_all.csv")

# 1 - Calculate melting points in celsius and save in a new column.
periodic_data["MeltingPointC"] = periodic_data["MeltingPoint"] - 273.15

# 2 - Calculate the column statistics and sav them in a new dataframe.
stats = periodic_data.describe()

# 3. Find elements where the boiling point is greater than the average.
periodic_data[ periodic_data["BoilingPoint"] > periodic_data["BoilingPoint"].mean() ]

Key Points

Use pandas to read and work with data that has rows and columns (two dimensional data).
Pandas dataframes let you use column names.
Use `dataframe[“column_name”] to access columns of information.
You can slice dataframes using indices using the dataframe.iloc[] syntax.
dataframe.describe() can be used to quickly get statistical information about columns in a dataframe.

Analyzing Tabular Data

Contents

Analyzing Tabular Data#

Reading in Tabular Data#

Accessing Data in the Data Frame#

Slicing#

Analyzing Tabular Data#

Writing Files using Pandas#

Mathematical Operations and Filtering#

Pandas and NumPy#

Your Turn#