'Welcome to COST Action Training School in Prague??'
A Colab version of this page is available here. When you open the link, go to File and then Save a copy in Drive. This way you can access the notebook, run the code in your browser, and work on the exercises.
In this notebook, we will go through basic Python operations concerning text as data. Like many programming languages, Python treats “text” as a data type called a `string`.
Strings in Python are sequences of characters. They can be created by enclosing characters in quotes. Python treats single quotes the same as double quotes.
Let’s start by enclosing some text in quotes.
'Welcome to COST Action Training School in Prague??'
In this code cell, we have entered a text value: a sequence of different characters enclosed by quotes. Python has no clue what to do with that value, so it simply repeats the input in the output.
However, we usually assign this value to a variable. A variable is a name that refers to a value stored in memory, allowing us to use that value in other parts of the code.
In practice, once we assign a value to a variable, Python recognizes the type of data stored in the variable. This determines how we can manipulate the text stored in the variable.
To assign a value to a variable, we follow this recipe: first, we choose a name for our variable, then we use the `=` symbol, and finally, we type or paste in the value we want to assign.
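As a minimal sketch of this recipe, the cell below assigns our sentence to a variable named `example`, the name used in the rest of this notebook.

```python
# name = value: store the sentence under the name `example`
example = 'Welcome to COST Action Training School in Prague??'

# Display the stored value
print(example)
```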
We can check if Python understood that we want to deal with strings in this notebook. We can check the type of any variable by using the function `type()`.
Functions are reusable pieces of code that have their own names. They are often used to perform a single, specific task that is frequently repeated.
Functions take one or more variables or values as input and return or print some values as output.
The input variables or values are called arguments.
To check if our text value is recognized as a string, we can use the following code:
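A sketch of that check, passing the text value directly to `type()`. The name `value_type` is only an illustrative assumption.

```python
# Pass the text value itself as the argument of type()
value_type = type('Welcome to COST Action Training School in Prague??')

print(value_type)  # <class 'str'>
```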
This is cumbersome even when pasting the value. It’s much easier for us to type the variable name as an argument.
Give it a try!
Has anything changed?
The `str` output indicates that Python treats the argument as a string.
Strings are essentially sequences of textual characters. They can consist of letters of the alphabet, digits, special characters, or any other character defined in Unicode (such as currency symbols or mathematical operators).
If you haven’t heard of Unicode before, it’s a standard that assigns a numeric code, recognizable to computers, to each character or symbol.
Here’s an example of using emojis in Python:
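A minimal sketch, assuming the grinning-face emoji (code point U+1F600) as the example. Any other code point works the same way, and the name `emoji` is illustrative.

```python
# \U followed by eight hexadecimal digits is a Unicode escape sequence
emoji = '\U0001F600'

print(emoji)
```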
The sequence of letters and digits after the backslash `\` is a code that tells any system using Unicode which symbol to display.
This concept applies to standard characters as well, such as the letter ‘X’:
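For instance, the capital letter X has the code point U+0058. A short sketch, with `letter` as an illustrative name:

```python
# \u followed by four hexadecimal digits is the short form of a Unicode escape
letter = '\u0058'

print(letter)  # X
```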
Play with the code and try to find less common letters!
In programming, as well as in mathematics, sequences are almost always indexed. Indexing means that every element of the sequence, in our case, every character, has its position within the sequence.
Returning to our `example` variable, we can access the first character of the text stored in that variable by using the following code:
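A sketch of a natural first attempt, assuming we ask for position 1. The name `first_attempt` is illustrative.

```python
example = 'Welcome to COST Action Training School in Prague??'

# Ask for the character at position 1
first_attempt = example[1]
print(first_attempt)  # e
```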
Wait, what just happened?
One of the many quirky things about Python is that indexing starts from position 0.
It’s hard to explain why, but here’s the rationale: the position of an element tells you how far it is from the start of the sequence. Think of it as a measure of distance.
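In code, a distance of zero from the start gives the first character. A minimal sketch, with `first_char` as an illustrative name:

```python
example = 'Welcome to COST Action Training School in Prague??'

# Position 0 is zero steps away from the start of the string
first_char = example[0]
print(first_char)  # W
```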
This code cell provides the correct value. To display the third and fourth characters of the text stored in the `example` variable (at positions 2 and 3), you can try the following:
Why do we only see the fourth character?
In a notebook, a code cell that contains several bare expressions displays only the value of the last one. To view all the values, you need to use the `print()` function.
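A short sketch of the same request with `print()`, so that both characters are shown:

```python
example = 'Welcome to COST Action Training School in Prague??'

# Each print() call writes its own line of output
print(example[2])  # l
print(example[3])  # c
```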
Some argue that zero-based indexing also makes sense for a common operation called slicing. Slicing involves extracting a range of items from a sequence.
Think of it as cutting a section of a sequence with a pair of scissors.
If we want to extract only the first 7 characters of our `example` string, we would use the following code:
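A minimal sketch of that slice. The name `first_seven` is an illustrative assumption.

```python
example = 'Welcome to COST Action Training School in Prague??'

# Leaving out the start index means "from the beginning"
first_seven = example[:7]
print(first_seven)  # Welcome
```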
This can be translated as: starting from the beginning of the sequence, give me elements that are within a distance of less than 7 positions.
A more technical way to express this is: when slicing in Python, the start index is inclusive and the end index is exclusive. Try experimenting with different indices for slicing to understand what this means.
If you’re familiar with R programming, you will notice a big difference in indexing and slicing. It can be confusing and overall, working with strings in Python is quite different to R.
The big difference is that in Python strings are immutable! Unlike in R, you cannot change a single element of a string.
It would be convenient if we could replace the question marks in our example text with exclamation marks.
Counting the position of these question marks from the beginning of the string can be annoying. To make it easier, we can count backwards using negative positions, which measure the distance from the end of the sequence.
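A short sketch of negative indexing on our sentence. The names `last_char` and `second_to_last` are illustrative.

```python
example = 'Welcome to COST Action Training School in Prague??'

# -1 is the last character, -2 the one before it
last_char = example[-1]
second_to_last = example[-2]
print(last_char, second_to_last)  # ? ?
```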
It would be great if we could replace the character at this position with `!`.
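A sketch of what happens when we try to assign to a position in a string. The `try`/`except` wrapper and the name `assignment_failed` are additions here, so that the cell keeps running instead of stopping on the error.

```python
example = 'Welcome to COST Action Training School in Prague??'

try:
    # Strings are immutable, so item assignment raises a TypeError
    example[-1] = '!'
    assignment_failed = False
except TypeError as error:
    assignment_failed = True
    print('TypeError:', error)
```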
We need to find a better solution for this.
Replace the `'??'` at the end of the `example` string with `'!!'`.
The first part of the solution involves slicing, while the second part involves string concatenation. You can think of it as adding two strings together.
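One possible solution along these lines, as a sketch. The name `fixed` is illustrative.

```python
example = 'Welcome to COST Action Training School in Prague??'

# Slice off the last two characters, then concatenate the new ending
fixed = example[:-2] + '!!'
print(fixed)  # Welcome to COST Action Training School in Prague!!
```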
We can solve this problem in a different way by using a specialized tool built for this purpose called a method.
Methods are similar to functions, but they are not standalone, autonomous pieces of code. They are very specific pieces of reusable code that are tied to a specific object in Python. A method can access and modify the object it’s been tied to.
Methods are usually named with verbs, so think of them as different ways to perform a specific action with a variable.
In our case, we want to replace ‘??’ with ‘!!’, and it turns out that all strings have a method named `replace`, designed to solve the problem we’re facing right now.
'Welcome to COST Action Training School in Prague??'
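A minimal sketch of the `replace` method. Because strings are immutable, `replace` returns a new string, and `example` itself keeps its question marks unless we assign the result back to it. The name `replaced` is illustrative.

```python
example = 'Welcome to COST Action Training School in Prague??'

# replace(old, new) returns a new string with every 'old' swapped for 'new'
replaced = example.replace('??', '!!')

print(replaced)  # Welcome to COST Action Training School in Prague!!
print(example)   # Welcome to COST Action Training School in Prague??
```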
Here’s another useful method: upper. It transforms the string by converting all characters to uppercase.
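A short sketch of `upper` on our sentence, with `shouted` as an illustrative name:

```python
example = 'Welcome to COST Action Training School in Prague??'

# upper() takes no arguments; it works on the string before the dot
shouted = example.upper()
print(shouted)  # WELCOME TO COST ACTION TRAINING SCHOOL IN PRAGUE??
```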
Methods may or may not have arguments. If we think of them as functions, then the object they are being tied to is one of the arguments because the starting point of every method is the object they are being called from.
Think of everything preceding the `.` on the left side of the method name as the material the method is working with.
Now that we have a good understanding of replacing characters in strings, let’s modify the `example` text. Change “Prague” to “Salamanca” and make sure the sentence ends with exclamation points.
Another very useful string method is `split`. It’s particularly helpful when working with a large quantity of text and we want to split it into sentences or words.
We can split our `example` string into words using the following approach:
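A minimal sketch, assuming the result is stored in a variable named `words`, the name the rest of this notebook uses for the list.

```python
example = 'Welcome to COST Action Training School in Prague??'

# Without arguments, split() cuts the string at whitespace
words = example.split()
print(words)
```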
['Welcome', 'to', 'COST', 'Action', 'Training', 'School', 'in', 'Prague??']
It is important to note that the `split` method returns a list of strings. Lists are another type of sequence in Python, and they are mutable, meaning we can change their elements as desired.
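A sketch of changing list elements in place. The replacement words are chosen here for illustration.

```python
example = 'Welcome to COST Action Training School in Prague??'
words = example.split()

# Unlike strings, lists accept assignment to a position
words[4] = 'Python'
words[5] = 'Workshop'
print(words)
```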
['Welcome', 'to', 'COST', 'Action', 'Python', 'Workshop', 'in', 'Prague??']
We can access elements of a list in the same way we access elements of a string, by using indexing. Another way to access elements of a list is by using a loop.
Loops are used to repeat a block of code multiple times. They are very useful when we want to perform the same operation on multiple elements of a sequence.
There are two types of loops in Python: `for` and `while` loops. In this workshop, we will focus on `for` loops.
Let’s print every word in the `words` list on a separate line.
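A minimal sketch of such a loop, rebuilding `words` from the sentence so that the cell runs on its own:

```python
example = 'Welcome to COST Action Training School in Prague??'
words = example.split()

# The indented line runs once for every element of the list
for word in words:
    print(word)
```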
Here, `word` is a variable that will take on the value of each element in the `words` list during each iteration of the loop. It starts with the first element of the list and then moves to the next element in the subsequent iterations. Within the loop, we can use the `word` variable to access the current element of the list and perform the same operation on every element of the list.
Here’s another example. Let’s print the number of characters in each word in the `words` list.
We can achieve this by using the `len` function, which returns the number of elements in a sequence. Since every word is a sequence of characters, we can utilize the `len` function to count the number of characters in each word.
For instance, let’s print the number of characters in the first word of the `words` list.
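A short sketch, with `first_word_length` as an illustrative name:

```python
example = 'Welcome to COST Action Training School in Prague??'
words = example.split()

# words[0] is the string 'Welcome'; len() counts its characters
first_word_length = len(words[0])
print(first_word_length)  # 7
```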
Now, let’s write a loop that will print the number of characters in each word in the `words` list.
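One way to write that loop, as a sketch. Printing the word next to its length is an addition for readability.

```python
example = 'Welcome to COST Action Training School in Prague??'
words = example.split()

# Apply len() to the current word in every iteration
for word in words:
    print(word, len(word))
```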
What do you think will happen if you execute `len(words)`? Give it a try!
Now, let’s move on to a more challenging problem. We need to find all the characters in the `example` text and count how many times each character appears.
First, we will convert the text from a string into a set of characters. A set is similar to a list, but it can only contain unique elements.
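A minimal sketch of the conversion. The name `unique_chars` is an illustrative assumption, and because sets are unordered, the order of the printed characters may differ from run to run.

```python
example = 'Welcome to COST Action Training School in Prague??'

# set() keeps one copy of every distinct character
unique_chars = set(example)
print(unique_chars)
```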
{'W', 'n', 'a', 'A', 'c', 'h', 'i', 'u', 'P', 'S', 'e', '?', 'g', 'C', 'O', 'm', 'T', 'l', ' ', 't', 'r', 'o'}
Now we’re going to use loops to count the number of times each character appears in the `example` text.
Within the loop, we’ll perform two tasks: check whether the character is a letter, and count how many times it appears in the `example` text.
To check if a character is a letter, we can use the `isalpha` method. For example:
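A short sketch on two single-character strings. The names `letter_check` and `symbol_check` are illustrative.

```python
# A letter passes the check, a punctuation mark does not
letter_check = 'a'.isalpha()
symbol_check = '?'.isalpha()

print(letter_check)  # True
print(symbol_check)  # False
```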
The `isalpha()` method returns `True` if all characters in the string are alphabetic and there is at least one character. Otherwise, it returns `False`. In Python, `True` and `False` are special values known as booleans.
Think of booleans as a way to answer a question: Is this character a letter? Yes or no?
We use boolean values in if statements. If statements are used to execute a block of code only if a certain condition is met. In our example, we only want to count characters that are letters.
If statements are written in the following way:
if condition:
    block of code
Here is an example of how to use the `isalpha` method within an if statement:
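A sketch that prints only the letters among the unique characters of the sentence. The names `unique_chars` and `char` are illustrative, and the order of the output depends on the set.

```python
example = 'Welcome to COST Action Training School in Prague??'
unique_chars = set(example)

for char in unique_chars:
    # The print runs only when the condition is True
    if char.isalpha():
        print(char)
```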
W
n
a
A
c
h
i
u
P
S
e
g
C
O
m
T
l
t
r
o
Let’s start counting!
We can use the `count` method from the string to count the number of times each character appears in the `example` text. This method returns the number of occurrences of a substring in the string.
For example, let’s count the number of times the letter ‘a’ appears in the `example` text.
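A minimal sketch, with `a_count` as an illustrative name. Note that `count` is case-sensitive, so the capital ‘A’ in “Action” is not included.

```python
example = 'Welcome to COST Action Training School in Prague??'

# Counts lowercase 'a' only: one in "Training", one in "Prague"
a_count = example.count('a')
print(a_count)  # 2
```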
Print out the number of times each character appears in the `example` text.
Printing out the number of times each character appears in the `example` text is not very useful. We can store this information in a more structured way. In Python, we often use dictionaries for this purpose.
Think of dictionaries as a way to connect two pieces of information. In our case, we want to connect each character with the number of times it appears in the `example` text. The dictionary provides a quick and intuitive way to look up the number of times each character appears in the text.
Dictionaries have the following structure:
{
    key1: value1,
    key2: value2,
    ...
}
We can create the one from our example in the following way:
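One possible way to build it, as a sketch that combines the loop, the `isalpha` check, and the `count` method. The dictionary name `char_counts` is an assumption.

```python
example = 'Welcome to COST Action Training School in Prague??'

char_counts = {}  # start with an empty dictionary
for char in set(example):
    if char.isalpha():
        # key: the character, value: how often it appears in the text
        char_counts[char] = example.count(char)

print(char_counts)
```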
Let’s explore how simple it is to count the frequency of the letter `c` in the `example` text.
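A sketch of the lookup. The dictionary is rebuilt here under the assumed name `char_counts` so that the cell runs on its own.

```python
example = 'Welcome to COST Action Training School in Prague??'

char_counts = {}
for char in set(example):
    if char.isalpha():
        char_counts[char] = example.count(char)

# Look up a value by its key instead of by a position
c_count = char_counts['c']
print(c_count)  # 3
```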
Unlike lists, the keys in dictionaries have semantic meaning. They are not just positions in a sequence; instead, they are labels that help us understand the data stored in the dictionary. Dictionaries are particularly useful for pairing qualitative and quantitative data, enabling efficient retrieval.
List Comprehension
Now comes the hard part. List comprehension is a powerful tool in Python, but it can be somewhat unintuitive at first.
List comprehension is a concise way to compress a loop into a single line of code. It’s a useful tool when you want to perform a simple operation on every element of a sequence and then aggregate the results into a new sequence.
For example, we can use it on the example text to create a list of all words that contain the letter `o`.
First, let’s see how we can check if a word contains the letter `o`. We can do this by using the `in` operator.
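A short sketch of the `in` operator on two words from our sentence. The names `has_o` and `cost_has_o` are illustrative.

```python
# 'o' in word asks: does the string contain this substring?
has_o = 'o' in 'Welcome'
cost_has_o = 'o' in 'COST'  # the check is case-sensitive

print(has_o)       # True
print(cost_has_o)  # False
```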
Once again, we have a boolean value. This time, it indicates whether the word contains the letter `o`. Similar to the previous example, we can utilize this boolean value in an if statement.
Write a loop that prints all words from the `words` list that contain the letter `o`.
There is an issue with upper and lower case letters. How can we solve this problem?
Now, let’s simplify this loop with list comprehension.
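A minimal sketch of the comprehension, rebuilding `words` from the original sentence. The name `words_with_o` is an assumption.

```python
example = 'Welcome to COST Action Training School in Prague??'
words = example.split()

# Keep each word as it is, but only when it contains a lowercase 'o'
words_with_o = [word for word in words if 'o' in word]
print(words_with_o)  # ['Welcome', 'to', 'Action', 'School']
```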
List comprehension is written in the following way:
[expression for element in sequence if condition]
It may sound abstract at first, but let’s break it down:
`expression` refers to the operation that we want to perform on each element of the sequence in order to generate a corresponding element in the new sequence. In this case, the operation is simply keeping the word itself in the new list.
`element` is a temporary variable that takes on the value of each element in the sequence during each iteration of the loop.
`sequence` is the sequence on which we want to perform the operation. In this case, it is the `words` list.
`condition` is the requirement that must be met for the operation to be executed. In this case, the condition is the presence of the letter `o` in the word.
Write a list comprehension that creates a list of all words from the `words` list that do not contain uppercase letters.
First, you need to find a way to check if a word contains an uppercase letter. Then, you can use this information in the list comprehension.
HINT: You can use the `isupper` method to check if a character is uppercase. Similarly, you can use the `islower` method to check if a character is lowercase.
Throughout this notebook, we have used a single sentence as an example to demonstrate basic text manipulation operations in Python. Additionally, we have covered fundamental concepts in Python programming, including variables, functions, methods, loops, if statements, dictionaries, and list comprehension.
It is worth noting that we have not relied on any special methods or libraries to perform these operations. Python offers a wide range of built-in tools that can assist in manipulating text data.
In the next notebook, we will be working with large quantities of text data. To accomplish this, we will utilize the powerful `pandas` library, which provides efficient tools for data manipulation in Python. While doing so, we will extensively apply the concepts we have learned in this notebook, such as functions, methods, and if statements.