Python Basics: Working with Text

A Colab version of this page is available here. When you open the link, go to File and then Save a copy in Drive. This way you can access the notebook, run the code in your browser, and work on the exercises.

In this notebook, we will go through basic Python operations concerning text as data. Like many programming languages, Python treats “text” as a data type called a string.

Strings in Python are sequences of characters. They can be created by enclosing characters in quotes. Python treats single quotes the same as double quotes.
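As a small illustration of the point about quotes, the sketch below compares the same text written both ways. The variable names are made up for this example.

```python
# Single and double quotes create the same kind of value.
single = 'Prague'
double = "Prague"
print(single == double)   # True

# Using one kind of quote lets the other kind appear inside the text.
quoted = "Let's go to Prague"
print(quoted)             # Let's go to Prague
```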

Let’s start by enclosing some text in quotes.

'Welcome to COST Action Training School in Prague??'
'Welcome to COST Action Training School in Prague??'

In this code cell, we have entered a text value: a sequence of different characters enclosed by quotes. Python has no clue what to do with that value, so it simply repeats the input in the output.

However, we usually assign this value to a variable. A variable is a name that refers to a value stored in memory, allowing us to use that value in other parts of the code.

In practice, once we assign a value to a variable, Python recognizes the type of data stored in the variable. This determines how we can manipulate the text stored in the variable.

To assign a value to a variable, we follow this recipe: first, we choose a name for our variable, then we use the = symbol, and finally, we type or paste in the value we want to assign.

example = 'Welcome to COST Action Training School in Prague??'

Functions

We can check if Python understood that we want to deal with strings in this notebook. We can check the type of any variable by using the function type().

Functions are reusable pieces of code that have their own names. They are often used to perform a single, specific task that is frequently repeated.

Functions take one or more variables or values as input and return or print some values as output.

The input variables or values are called arguments.
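Here is a minimal sketch of calling functions with arguments. It uses the built-in len() and print() functions; the variable names are chosen only for this illustration.

```python
city = 'Prague'           # a value stored in a variable

# len() takes one argument and returns a value we can store.
length = len(city)
print(length)             # 6

# print() accepts several arguments and displays them separated by spaces.
print(city, length)       # Prague 6
```

Note the difference between the two: len() returns a value that can be assigned to a variable, while print() only displays its arguments.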

To check if our text value is recognized as a string, we can use the following code:

type('Welcome to COST Action Training School in Prague??')
str

This is cumbersome even when pasting the value. It’s much easier for us to type the variable name as an argument.

Give it a try!

type()

Has anything changed?

The str output indicates that Python treats the argument as a string.

Strings are essentially sequences of textual characters. They can consist of letters of the alphabet, digits, special characters, or anything else that Unicode defines, such as currency symbols or mathematical operators. Unicode text is commonly stored using the UTF-8 encoding.


If you haven’t heard of Unicode before, it’s a system that matches a code recognizable to computers to visual representations of characters/symbols.

Here’s an example of using emojis in Python:

print("\U0001F1EA\U0001F1F8")
🇪🇸

The sequence of digits and letters after the backslash \ is a code that tells any system using Unicode which symbol to display.

This concept applies to standard characters as well, such as the letter ‘X’:

print("\u0058")
X

Play with the code and try to find less common letters!
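One way to explore these codes, sketched here with the built-in ord(), chr() and hex() functions, is to convert between a character and its number. The variable names are only for illustration.

```python
# ord() gives the Unicode code point of a single character as a number.
code = ord('X')
print(code)          # 88
print(hex(code))     # 0x58, the same digits as in "\u0058"

# chr() goes the other way, from a code point to a character.
letter = chr(0x58)
print(letter)        # X
```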


Indexing and Slicing

In programming, as well as in mathematics, sequences are almost always indexed. Indexing means that every element of the sequence, in our case, every character, has its position within the sequence.

Returning to our example variable, we can access the first character of the text stored in that variable by using the following code:

example[1]
'e'

Wait, what just happened?

One of the many quirky things about Python is that indexing starts from position 0.

It’s hard to explain why, but here’s the rationale: the position of an element tells you how far it is from the start of the sequence. Think of it as a measure of distance.

example[0]
'W'

This code cell provides the correct value. To display the third and fourth characters of the text stored in the example variable (at positions 2 and 3), you can try the following:

example[2]

example[3]
'c'

Why do we only see the fourth character?

In a notebook, a code cell displays only the value of its last expression, so typing several expressions one after another shows just the final one. To view all the values, you need to use the print() function.

print(example[2])

print(example[3])
l
c

Some argue that zero-based indexing also makes sense for a common operation called slicing. Slicing involves extracting a range of items from a sequence.

Think of it as cutting a section of a sequence with a pair of scissors.

If we want to extract only the first 7 characters of our example string, we would use the following code:

example[0:7]
'Welcome'

This can be translated as: starting from the beginning of the sequence, give me elements that are within a distance of less than 7 positions.

A more technical way to express this is: when slicing in Python, the start index is inclusive and the end index is exclusive. Try experimenting with different indices for slicing to understand what this means.
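The sketch below illustrates the inclusive start and exclusive end on a shorter string, so the positions are easy to count by hand. The text and variable names are made up for this example.

```python
text = 'Welcome to Prague'

# Positions 0 to 6 are returned; position 7 (the space) is excluded.
first_seven = text[0:7]
print(first_seven)        # Welcome

# Leaving out the start index means "from the beginning".
print(text[:7])           # Welcome

# Leaving out the end index means "up to the end of the string".
rest = text[8:]
print(rest)               # to Prague

# The length of a slice is the end index minus the start index.
print(len(text[3:10]))    # 7
```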

If you’re familiar with R programming, you will notice a big difference in indexing and slicing. It can be confusing, and overall, working with strings in Python is quite different from working with them in R.

The big difference is that in Python strings are immutable! Unlike in R, you cannot change a single element of a string.

It would be convenient if we could replace the question marks in our example text with exclamation marks.

Counting the position of these question marks from the beginning of the string can be annoying. To make it easier, we can count backwards using negative positions, which measure the distance from the end of the sequence.

example[-1]
'?'

It would be great if we could replace the character at this position with !.

example[-1] = '!'

Running this cell raises a TypeError, because strings do not support item assignment. We need to find a better solution for this.
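To see the immutability of strings without stopping the notebook, the attempt can be wrapped in try/except. This construct is not covered in this notebook and is used here only to catch the error; the names are illustrative.

```python
text = 'Prague??'

try:
    text[-1] = '!'        # attempt to change the last character in place
except TypeError as error:
    message = str(error)
    print('TypeError:', message)

# The string is unchanged.
print(text)               # Prague??
```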


Exercise 1

Replace the '??' at the end of the example string with '!!'.

The first part of the solution involves slicing, while the second part involves string concatenation. You can think of it as adding two strings together.
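As a quick illustration of concatenation on its own, here is a sketch using unrelated strings, so the exercise is left for you to solve.

```python
greeting = 'Good'
time_of_day = 'morning'

# The + operator joins strings end to end. Nothing is added in between,
# so the space has to be included explicitly.
combined = greeting + ' ' + time_of_day
print(combined)           # Good morning
```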

# Slice the example string so it doesn't contain `??` characters



# Store this string in example_sliced variable



# Make new variable solution which is addition of the sliced variable and '!!'

Methods

We can solve this problem in a different way by using a specialized tool built for this purpose called a method.

Methods are similar to functions, but they are not standalone, autonomous pieces of code. They are very specific pieces of reusable code that are tied to a specific object in Python. A method can access and modify the object it’s been tied to.

Methods are usually named with verbs, so think of them as different ways to perform a specific action with a variable.

In our case, we want to replace ‘??’ with ‘!!’, and it turns out that all strings have a method named just like that, designed to solve the problem we’re facing right now.

## fill in the blanks

example.replace('__', '__')
'Welcome to COST Action Training School in Prague??'

Here’s another useful method: upper. It transforms the string by converting all characters to uppercase.

example.upper
# What went wrong?
<function str.upper()>

Methods may or may not have arguments. If we think of them as functions, then the object they are being tied to is one of the arguments because the starting point of every method is the object they are being called from.

Think of everything preceding the . on the left side of the method name as the material the method is working with.

('alL' + ' ' + 'caPs').upper()
'ALL CAPS'
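The sketch below contrasts referring to a method with calling it, and shows the "object as an argument" view described above. The variable names are made up for this example.

```python
text = 'all caps'

# Without parentheses we only refer to the method; nothing is executed.
method = text.upper
print(callable(method))   # True

# With parentheses the method is called and returns a new string.
shouted = text.upper()
print(shouted)            # ALL CAPS

# The same call written as a function, with the string as its argument.
print(str.upper(text))    # ALL CAPS

# The original string is unchanged, because strings are immutable.
print(text)               # all caps
```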

Exercise 2

Now that we have a good understanding of replacing characters in strings, let’s modify the example text. Change “Prague” to “Salamanca” and make sure the sentence ends with exclamation points.

# your solution goes here

Loops

Another very useful string method is split. It’s particularly helpful when working with a large quantity of text and we want to split it into sentences or words.

We can split our example string into words using the following approach:

words = example.split(' ')

print(words)
['Welcome', 'to', 'COST', 'Action', 'Training', 'School', 'in', 'Prague??']
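The same method works with other separators. As a rough sketch, splitting on a period followed by a space gives approximate sentences; the sample paragraph is invented for this example, and real text usually needs more careful handling.

```python
paragraph = 'Strings are sequences. Lists are sequences too. Both can be indexed.'

# Splitting on '. ' (a period followed by a space) gives rough sentences.
sentences = paragraph.split('. ')
print(sentences)
# ['Strings are sequences', 'Lists are sequences too', 'Both can be indexed.']

# Called without an argument, split() separates on any run of whitespace.
tokens = 'Welcome   to\tPrague'.split()
print(tokens)             # ['Welcome', 'to', 'Prague']
```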

It is important to note that the split method returns a list of strings. Lists are another type of sequence in Python, and they are mutable, meaning we can change their elements as desired.

words[4] = 'Python'
words[5] = 'Workshop'

print(words)
['Welcome', 'to', 'COST', 'Action', 'Python', 'Workshop', 'in', 'Prague??']

We can access elements of a list in the same way we access elements of a string, by using indexing. Another way to access elements of a list is by using a loop.

Loops are used to repeat a block of code multiple times. They are very useful when we want to perform the same operation on multiple elements of a sequence.

There are two types of loops in Python: for and while loops. In this workshop, we will focus on for loops.

Let’s print every word in the words list on a separate line.

for word in words:
    print(word)
Welcome
to
COST
Action
Python
Workshop
in
Prague??

Here, word is a variable that will take on the value of each element in the words list during each iteration of the loop. It starts with the first element of the list and then moves to the next element in the subsequent iterations. Within the loop, we can use the word variable to access the current element of the list and perform the same operation on every element of the list.
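To make the role of the loop variable concrete, here is a small sketch with a different list. The list and variable names are invented for this illustration.

```python
cities = ['Prague', 'Salamanca', 'Vienna']

for city in cities:
    # city holds one element of the list on each pass through the loop
    print(city.upper())

# After the loop finishes, city still holds the last element.
print(city)               # Vienna

# A string is also a sequence, so a loop can visit its characters.
for letter in 'COST':
    print(letter)
```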

Here’s another example. Let’s print the number of characters in each word in the words list.

We can achieve this by using the len function, which returns the number of elements in a sequence. Since every word is a sequence of characters, we can utilize the len function to count the number of characters in each word.

For instance, let’s print the number of characters in the first word of the words list.

print(words[0])
 
print(len(words[0]))
Welcome
7

Exercise 3

Now, let’s write a loop that will print the number of characters in each word in the words list.

# your solution goes here, pay attention to the spaces/indentation

What do you think will happen if you execute len(words)? Give it a try!


If Statements

Now, let’s move on to a more challenging problem. We need to find all the characters in the example text and count how many times each character appears.

First, we will convert the text from a string into a set of characters. A set is similar to a list, but it can only contain unique elements.

characters = set(example)
print(characters)
{'W', 'n', 'a', 'A', 'c', 'h', 'i', 'u', 'P', 'S', 'e', '?', 'g', 'C', 'O', 'm', 'T', 'l', ' ', 't', 'r', 'o'}
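A shorter string makes the effect of set() easier to check by hand. The sketch below uses a made-up example word.

```python
letters = set('banana')

# Six characters go in, but only the unique ones are kept.
print(len(letters))       # 3
print('a' in letters)     # True

# Sets are unordered, so the printed order may differ from run to run.
print(letters)
```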

Now we’re going to use loops to count the number of times each character appears in the example text.

Within the loop, we’ll perform two tasks:

  1. Check if the character is a letter.
  2. If the character is a letter, count how many times it appears in the example text.

To check if a character is a letter, we can use the isalpha method. For example:

'X'.isalpha()
True
'##'.isalpha()
False

The isalpha() method returns True if all characters in the string are alphabetic and there is at least one character. Otherwise, it returns False. In Python, True and False are special values known as booleans.

Think of booleans as a way to answer a question: Is this character a letter? Yes or no?
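A few more cases, sketched below, show how isalpha() behaves with longer strings. The sample strings are chosen only for illustration.

```python
only_letters = 'Prague'.isalpha()
with_marks = 'Prague??'.isalpha()
with_space = 'Training School'.isalpha()
empty = ''.isalpha()

print(only_letters)       # True: every character is a letter
print(with_marks)         # False: '?' is not a letter
print(with_space)         # False: a space is not a letter
print(empty)              # False: there must be at least one character
```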

We use boolean values in if statements. If statements are used to execute a block of code only if a certain condition is met. In our example, we only want to count characters that are letters.

If statements are written in the following way:

if condition:
    block of code

Here is an example of how to use the isalpha method within an if statement:

for character in characters:
    if character.isalpha():
        print(character)
W
n
a
A
c
h
i
u
P
S
e
g
C
O
m
T
l
t
r
o

Let’s start counting!

We can use the string's count method to count the number of times each character appears in the example text. This method returns the number of occurrences of a substring in the string.

For example, let’s count the number of times the letter ‘a’ appears in the example text.

example.count('a')
2
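Because count works on substrings, it is not limited to single characters. The sketch below uses a made-up word to show this, along with the fact that counting is case-sensitive.

```python
word = 'banana'

single = word.count('a')
pair = word.count('an')
capital = word.count('A')

print(single)             # 3
print(pair)               # 2, occurrences are counted without overlap
print(capital)            # 0, because 'A' and 'a' are different characters
```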

Exercise 4

Print out the number of times each character appears in the example text.

# your solution goes here, pay attention to the spaces/indentation

Dictionaries

Printing out the number of times each character appears in the example text is not very useful. We can store this information in a more structured way. In Python, we often use dictionaries for this purpose.

Think of dictionaries as a way to connect two pieces of information. In our case, we want to connect each character with the number of times it appears in the example text. The dictionary provides a quick and intuitive way to look up the number of times each character appears in the text.

Dictionaries have the following structure:

{
    key1: value1,
    key2: value2,
    ...
}

We can create one for our example in the following way:

character_count = {} # empty dictionary

for character in characters:
    character_count[character] = example.count(character)

Let’s explore how simple it is to count the frequency of the letter c in the example text.

print(character_count['c'])
3

Unlike lists, the keys in dictionaries have semantic meaning. They are not just positions in a sequence; instead, they are labels that help us understand the data stored in the dictionary. Dictionaries are particularly useful for pairing qualitative and quantitative data, enabling efficient retrieval.
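Here is a small sketch of looking up and looping over a dictionary, built the same way as above but for a shorter, made-up word. The items() method and the two-variable loop are not covered elsewhere in this notebook and are shown only as one possible way to read the pairs back.

```python
word = 'banana'

letter_count = {}         # empty dictionary
for letter in set(word):
    letter_count[letter] = word.count(letter)

# Look up a value by its key.
print(letter_count['a'])  # 3

# items() gives each key together with its value.
for letter, count in letter_count.items():
    print(letter, count)

# The in operator checks whether a key exists.
print('z' in letter_count)   # False
```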

List Comprehension

Now comes the hard part. List comprehension is a powerful tool in Python, but it can be somewhat unintuitive at first.

List comprehension is a concise way to compress a loop into a single line of code. It’s a useful tool when you want to perform a simple operation on every element of a sequence and then aggregate the results into a new sequence.

For example, we can use it on the example text to create a list of all words that contain the letter o.

First, let’s see how we can check if a word contains the letter o. We can do this by using the in operator.

'o' in words[0]
True

Once again, we have a boolean value. This time, it indicates whether the word contains the letter o. Similar to the previous example, we can utilize this boolean value in an if statement.

Exercise 5

Write a loop that prints all words from the words list that contain the letter o.

# your solution goes here, pay attention to the spaces/indentation

There is an issue with upper and lower case letters. How can we solve this problem?


Now, let’s simplify this loop with list comprehension.

List comprehension is written in the following way:

[expression for element in sequence if condition]

It may sound abstract at first, but let’s break it down:

  • The expression refers to the operation that we want to perform on each element of the sequence in order to generate a corresponding element in the new sequence. In this case, the operation is simply keeping the word itself in the new list.
  • The element is a temporary variable that takes on the value of each element in the sequence during each iteration of the loop.
  • The sequence is the sequence on which we want to perform the operation. In this case, it is the words list.
  • The condition is the requirement that must be met for the operation to be executed. In this case, the condition is the presence of the letter o in the word.

[word for word in words if 'o' in word]
['Welcome', 'to', 'Action', 'Python', 'Workshop']
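To see how the pieces map onto an ordinary loop, the sketch below writes the same operation both ways, this time with an expression that changes each element. It uses a short made-up list and the list method append, which is not covered in this notebook.

```python
short_words = ['Welcome', 'to', 'Prague']

# Loop version: build the new list step by step.
lengths_loop = []
for word in short_words:
    if len(word) > 2:
        lengths_loop.append(len(word))

# List comprehension version: expression, element, sequence, condition.
lengths = [len(word) for word in short_words if len(word) > 2]

print(lengths_loop)       # [7, 6]
print(lengths)            # [7, 6]
```

Both versions produce the same list; the comprehension simply packs the loop, the condition, and the expression into one line.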

Exercise 6

Write a list comprehension that creates a list of all words from the words list that do not contain uppercase letters.

First, you need to find a way to check if a word contains an uppercase letter. Then, you can use this information in the list comprehension.

HINT: You can use the isupper method to check if a character is uppercase. Similarly, you can use the islower method to check if a character is lowercase.

# your solution goes here

Summary

Throughout this notebook, we have used a single sentence as an example to demonstrate basic text manipulation operations in Python. Additionally, we have covered fundamental concepts in Python programming, including variables, functions, methods, loops, if statements, dictionaries, and list comprehension.

It is worth noting that we have not relied on any external libraries to perform these operations. Python offers a wide range of built-in tools that can assist in manipulating text data.

In the next notebook, we will be working with large quantities of text data. To accomplish this, we will utilize the powerful pandas library, which provides efficient tools for data manipulation in Python. While doing so, we will extensively apply the concepts we have learned in this notebook, such as functions, methods, and if statements.