File Parsing

Overview

Questions:

  • How can I use Python to read text files?

  • How do I sort through all the information in a text file and extract particular pieces of information?

Objectives:

  • Open a file and read in its contents line by line.

  • Search for a particular string in a file.

  • Manipulate strings and change data types.

  • Print to a new file.

Working with files

One of the most common tasks in research is analyzing data. Many computational chemistry programs output text files that include a large amount of information including text and data that you need to analyze. Often, you need to sort through the output file and identify particular pieces of information that are most important to you. In general, this is called file parsing.

The os library

For this section, we will be working with the file ethanol.out in the outfiles directory.

To work with file locations, we will use a Python library called os. In Python, a lot of functionality is provided through imported libraries. The os library has functions related to our operating system. We will use it to look at the files in a folder.

To use a library, we first have to import it. The syntax for this is

import library_name

We can import the os library.

import os

To use functions that are part of the os library, we do os.function_name. To see what folder your notebook is in, type:

print(os.getcwd())

You can see what other files are in the same folder by using the os.listdir function.

print(os.listdir())

Once you do this, you should see a folder called data. To see what is inside the data folder, pass its name to the os.listdir function.

print(os.listdir("data"))
['03_Prod.mdout', 'distance_data_headers.csv', 'outfiles', 'benzene.xyz', 'sapt.out', 'water.xyz', 'buckminsterfullerene.xyz', 'PubChemElements_all.csv', 'mdout']

The file we will be working with, ethanol.out, is in the outfiles folder. We will pull a piece of information (the energy) from this file.

We will open the file in the next step, but first, we have to tell Python where the file is. We will create a variable called ethanol_file that contains a string that tells Python where the file is. This string will have folder names and file names separated by forward slashes (/) and is called a “file path”. When deciding your file path, you can think about what you would tell someone to click in order to find the file.

ethanol_file = "data/outfiles/ethanol.out"
print(ethanol_file)
data/outfiles/ethanol.out

Notice that this doesn’t check that the file actually exists, so make sure you type it correctly!
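
If you want Python to confirm a path before you try to open it, the os library can help. The sketch below is a minimal illustration rather than part of the lesson files: it builds a path with os.path.join and checks it with os.path.exists. The temporary folder and the empty file are assumptions made only so the example runs anywhere; with the lesson data you would check ethanol_file directly.

```python
import os
import tempfile

# Build a throwaway folder layout so the example is self-contained.
# (Assumption: in the lesson you would use your real "data" folder instead.)
base = tempfile.mkdtemp()
os.makedirs(os.path.join(base, "data", "outfiles"))

# os.path.join inserts the correct separator for your operating system.
example_file = os.path.join(base, "data", "outfiles", "ethanol.out")

# So far the path is only a string; no file exists at that location yet.
exists_before = os.path.exists(example_file)

# Create an empty file at that path, then check again.
with open(example_file, "w") as f:
    f.write("")

exists_after = os.path.exists(example_file)

print(exists_before)
print(exists_after)
```

Checking with os.path.exists before opening a file lets you catch a mistyped path early, with a message of your own choosing.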

Reading a file

In Python, there are many ways to read in information from a text file. The best method depends on the type of data and the type of analysis you are performing. If a file mixes text and numbers in several different formats, the most generic approach is to read it with the read or readlines function. We will use the open function to open the file and readlines to pull information out of it. Before you can read a file, you have to open it using the file path we defined above. This creates a file object, or filehandle. The file we will analyze in this example is a Psi4 output file for an SCF/cc-pVDZ energy calculation on an ethanol molecule.

In Python, when we use the open function, the syntax is

with open(filename, open_mode) as variable:
    # read the file
    data = variable.readlines()

In the open function, we specify the file we want to open as the first argument (filename above), followed by the opening mode (open_mode above). The mode "r" specifies that we want to read the file. The with statement closes the file automatically when the indented block ends.

Next, we use the readlines function. This pulls all of the information from the file into a list. Each element in the list is a line in the file.

with open(ethanol_file, "r") as outfile:
    data = outfile.readlines()
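
To see what readlines returns without relying on the lesson data, here is a small self-contained sketch. The file name sample.out and its three lines are made up for illustration; the open and readlines calls mirror the ones used for ethanol.out.

```python
import os
import tempfile

# Write a small three-line file to a temporary location.
# (The file name and contents are invented for this example.)
folder = tempfile.mkdtemp()
sample_path = os.path.join(folder, "sample.out")
with open(sample_path, "w") as outfile:
    outfile.write("first line\nsecond line\nthird line\n")

# Read it back the same way the lesson reads ethanol.out.
with open(sample_path, "r") as infile:
    sample_data = infile.readlines()

# readlines returns a list of strings, one per line,
# and each string keeps its trailing newline character.
print(type(sample_data))
print(sample_data)
```

Note that the newline character at the end of each line is kept as part of the string.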

Check Your Understanding

Check that your file was read in correctly by determining how many lines are in the file.

Searching for a pattern in your file

The file we opened is an output file from a calculation of the energy (and a lot of other stuff!) of an ethanol molecule. As stated previously, the readlines() function put the file contents into a list where each element is a line of the file. As we learned in the previous lesson, a for loop can be used to execute the same code repeatedly, so we can use one to iterate through the elements of a list.

Let’s take a look at what’s in the file.

for line in data:
    print(line)

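This will print every line of the file. The output is double-spaced because each line already ends with a newline character and print adds another.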

If you look through the output, you will see that the critical line says “Final Energy”. We want to search through this file and find that line, and print only that line. We can do this using an if statement.

Returning to our file example, we loop over the lines and keep the one that contains “Final Energy”:

for line in data:
    if 'Final Energy' in line:
        energy_line = line
        print(energy_line)
  @DF-RHF Final Energy:  -154.09130176573018
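
The same search can be tried on a short list of strings, which makes it easier to see what the in test is doing. In this sketch the list sample_lines is a stand-in for data, and every line in it except the energy line is invented. It also collects the matches in a list and uses strip to remove the surrounding whitespace.

```python
# A few lines standing in for the contents of an output file.
# Only the "Final Energy" line comes from the lesson; the others are made up.
sample_lines = [
    "  some other output line\n",
    "  @DF-RHF Final Energy:  -154.09130176573018\n",
    "  another output line\n",
]

matches = []
for line in sample_lines:
    # "in" checks whether the text appears anywhere in the line.
    if "Final Energy" in line:
        # strip() removes leading and trailing whitespace, including "\n".
        matches.append(line.strip())

print(matches)

# The test is case-sensitive, so a lowercase pattern finds nothing here.
found_lowercase = any("final energy" in line for line in sample_lines)
print(found_lowercase)
```

Collecting matches in a list is useful when a pattern appears on more than one line of a file.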

Remember that readlines() saves each line of the file as a string, so energy_line is a string that contains the whole line. Since we are most interested in the energy, we need to split up the line so we can save just the number in its own variable. To do this, we use a new function called split. The split function takes a string and divides it into its components using a delimiter.

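The delimiter is specified as an argument to the function (put in the parentheses ()). If you do not specify a delimiter, the string is split wherever there is whitespace (spaces, tabs, or newlines), and repeated whitespace counts as a single separator. Let’s try this out.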

energy_line.split()
['@DF-RHF', 'Final', 'Energy:', '-154.09130176573018']

Or, we can use the colon (:) as the delimiter.

energy_line.split(':')
['  @DF-RHF Final Energy', '  -154.09130176573018\n']

When we use ‘:’ as the delimiter, a list with two elements is returned. It is split where a colon was found.
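
The difference between the delimiters is easiest to see side by side. The following sketch redefines energy_line as a literal string so it runs on its own; the names label and value are chosen here for illustration.

```python
energy_line = "  @DF-RHF Final Energy:  -154.09130176573018\n"

# No argument: split on any run of whitespace.
# Leading and trailing whitespace (including "\n") is discarded.
by_whitespace = energy_line.split()
print(by_whitespace)

# An explicit single space: every space is a split point,
# so consecutive spaces produce empty strings in the result.
by_single_space = energy_line.split(" ")
print(by_single_space)

# Split on the colon, then strip the leftover whitespace from each piece.
label, value = energy_line.split(":")
label = label.strip()
value = value.strip()
print(label)
print(value)
```

Calling split with no argument is usually the most convenient choice for output files, because the number of spaces between columns does not matter.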

We can save the output of this function to a variable as a new list. In the example below, we take the line we found in the for loop and split it up into its individual words.

words = energy_line.split()
print(words)
['@DF-RHF', 'Final', 'Energy:', '-154.09130176573018']

From this print statement, we now see that we have a list called words, where we have split energy_line. The energy is the fourth element of this list (index 3, because Python starts counting at zero), so we can now save it as a new variable.

energy = words[3]
print(energy)
-154.09130176573018

Python negative indexing

We also recognize that the energy is the last element of the list. In Python, we can count backwards from the end of the list by using negative numbers. Therefore, an alternative way to assign energy is

energy = words[-1]
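
As a quick check, this short sketch compares positive and negative indices on the same list. The list is typed in by hand here so the example runs on its own.

```python
words = ['@DF-RHF', 'Final', 'Energy:', '-154.09130176573018']

# -1 is the last element, -2 the one before it, and so on.
last = words[-1]
second_to_last = words[-2]

# For this four-element list, index 3 and index -1 are the same element.
same_element = words[3] == words[-1]

print(last)
print(second_to_last)
print(same_element)
```

An advantage of words[-1] is that it still picks out the last element if the line happens to contain a different number of words.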

If we now try to do a math operation on energy, we get an error message. Why do you think that is?

energy + 50
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[13], line 1
----> 1 energy + 50

TypeError: can only concatenate str (not "int") to str

Even though energy looks like a number to us, it is really a string, so we cannot add an integer to it. We need to change the data type of energy to a float. This is called casting.

energy = float(energy)

Now our math operation will work. If we thought ahead, we could have changed the data type when we assigned the variable originally.

energy = float(words[3])
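
Putting the steps together, the search, split, and cast can be wrapped in one small function. This is only a sketch under stated assumptions: find_energy is a hypothetical helper name, not something from the lesson or from Psi4, and it assumes the number is the last whitespace-separated item on the matching line.

```python
def find_energy(lines, pattern="Final Energy"):
    """Return the last item of the first line containing pattern, as a float.

    Returns None if no line matches or the last item is not a number.
    (Hypothetical helper written for this example.)
    """
    for line in lines:
        if pattern in line:
            try:
                # split() breaks the line into words; [-1] takes the last one.
                return float(line.split()[-1])
            except ValueError:
                # float() raises ValueError when the text is not a number.
                return None
    return None


# A stand-in for the list returned by readlines().
sample_lines = [
    "  some other output line\n",
    "  @DF-RHF Final Energy:  -154.09130176573018\n",
]

energy = find_energy(sample_lines)
print(energy)
print(type(energy))

# Because energy is now a float, arithmetic works.
shifted = energy + 50
print(shifted)
```

Returning None when nothing is found lets the calling code decide what to do, instead of failing partway through a loop.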

Your Turn

Exercise on File Parsing

The file 03_Prod.mdout is an output file from an Amber molecular dynamics simulation. Read in the file, and pull out all of the total energy values (Etot). Save the values in a list (don’t forget to cast them to floating point numbers!).

Key Points

  • Python has libraries which can be imported. Libraries let us access more functions.

  • You can import a library using import library_name.

  • You use functions from imported libraries by doing library_name.function_name.

  • To open a file, use with open(filename, open_mode) as variable.

  • Read the contents of a text file into a list of lines using readlines.

  • You can search through the lines in your file by using a for loop to go through each line and an if statement to look for a pattern.