Monday, July 17, 2017

The Curious Case of Python String Slicing

While studying Python strings, we tried to understand string slicing through an example. The result looked very strange [ at least at first glance ] when we used slicing to create a copy of a string.

To better understand the scenario and the result, let's start with the basics: how assignment behaves for integers and lists, and how slicing behaves for lists.

>>> m = 10         # Integer m initialised to 10
>>> n = m           # Another integer n storing the same value as in m
>>> print m, n
10 10

>>> id(m), id(n)
( 140204825415168 , 140204825415168 ) # <-- Both are of the same id

A variable name in Python is a "tag" attached to an object in memory. In the above scenario, when we initialised variable m, Python attached the tag name m to the object at memory location [ 140204825415168 ] ( For this discussion, assume that id represents memory location, which is the case in CPython ).

When we created a new variable n and assigned m to it, Python simply attached a second tag name n to the same object, which is why both ids match. An assignment like n = m never copies the value; it only adds another name for the existing object.

>>> m += 5       # Increment m by 5
>>> print m, n   # m is modified, but n retains its original value
15  10

>>> id(m), id(n)
( 140204825415048 , 140204825415168 )  # <-- The id [ or address location ] of m has changed

When we modified m, Python did not change the integer 10 in place, since integers are immutable. Instead, m += 5 produced the integer object 15 and re-attached the tag m to it, so m and n now point to different memory locations [ illustrated by the different id values ]. This resulted in two different values being printed for m and n.
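The tagging behaviour described above can also be checked without reading raw id values, by using the is operator, which tests whether two names refer to the same object. The following is a minimal sketch written to run on Python 3 (the sessions in this post use the Python 2 print statement); the names same_before and same_after are only illustrative.

```python
# Names are tags on objects; `is` tests whether two tags share one object.
m = 10
n = m                      # n becomes a second tag on the object m refers to
same_before = n is m       # plain assignment never copies

m += 5                     # ints are immutable, so m is re-tagged to the object 15
same_after = n is m        # 15 and 10 are necessarily different objects

print(m)                   # 15
print(n)                   # 10
print(same_before)         # True
print(same_after)          # False
```

Comparing with is avoids depending on the actual id numbers, which differ from run to run.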

Now, let's do something similar with list variables.

>>> a = [ 1, 2, 3 ]      # List initialised to [ 1, 2, 3 ]
>>> b = a                  # Another list b storing the same value as in a
>>> print a, b
[1, 2, 3]   [1, 2, 3]

>>> id(a), id(b)
( 4342157256 , 4342157256 ) # <-- Both are of the same id

Assigning one list variable to another works exactly as it did for integers in the previous case. There is one object, with a single id value, that carries two tags corresponding to the two variables a and b.

>>> a.append(5)       # Append 5 to a
>>> print a, b            # Both a and b are modified !!!
[1, 2, 3, 5]   [1, 2, 3, 5]

>>> id(a), id(b)
(  4342157256 , 4342157256 ) # <-- Both are of the same id !!!

We now see a behaviour that looks strange at first: the changes made through list variable a are visible through list variable b as well !!!

The reason for this behaviour is the mutability of lists. In Python, lists are mutable, i.e., they can be modified in place. append changes the existing list object rather than creating a new one, and the names a and b are both aliases [ or references, in C++ terms ] for that one object. This is the standard, expected behaviour in Python for lists. Python developers need to be aware of this and write their code accordingly.
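To make the difference between modifying a list in place and re-tagging a name concrete, here is a small sketch written for Python 3. The names nums and alias are hypothetical and not part of the sessions in this post. It contrasts append, which changes the shared list, with nums = nums + [6], which builds a new list and moves only the tag nums.

```python
nums = [1, 2, 3]
alias = nums                       # a second name for the same list object

nums.append(5)                     # in-place change, visible through both names
shared_after_append = nums is alias

nums = nums + [6]                  # builds a new list and re-tags nums to it
shared_after_rebind = nums is alias

print(nums)                        # [1, 2, 3, 5, 6]
print(alias)                       # [1, 2, 3, 5]
print(shared_after_append)         # True
print(shared_after_rebind)         # False
```

The second half mirrors what happened with m += 5 for integers: once a name is re-tagged to a new object, the other name keeps the old one.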

If we want the changes made to one list [ a ] not to be visible through another list [ c ] that contains the same data, then we need to create a copy or replica of the original list a.

>>> a = [ 1, 2, 3 ]      # List initialised to [ 1, 2, 3 ]
>>> b = a                  # Another list b storing the same value as in a
>>> c = a[:]               # List c is a replica / copy of list a
>>> print a, b, c
[1, 2, 3]   [1, 2, 3]   [1, 2, 3]

>>> id(a), id(b), id(c)
( 4342157256 , 4342157256, 4342380304 ) # <-- Id of c is different from that of a and b

a[:] is the syntax used to create a replica or copy of list a. New memory is allocated for this replica, and the new variable c is attached as its tag. Now, modifications made to the list a, whether through variable name a or b, are not visible through variable c, since the two lists [ a and c ] reside in two separate memory locations. Note that the copy is shallow: the new list holds references to the same elements, which makes no difference here because integers are immutable.

>>> a.append(5)       # Append 5 to a
>>> print a, b, c        # Only a and b are modified !!!
[1, 2, 3, 5]   [1, 2, 3, 5]   [1, 2, 3]

>>> id(a), id(b), id(c)
( 4342157256 , 4342157256, 4342380304 ) # <-- Id of c is different from that of a and b

>>> b.append(6)       # Append 6 to b, which is the same list as a
>>> print a, b, c        # Only a and b are modified !!!
[1, 2, 3, 5, 6]   [1, 2, 3, 5, 6]   [1, 2, 3]

>>> id(a), id(b), id(c)
( 4342157256 , 4342157256, 4342380304 ) # <-- Id of c is different from that of a and b

This is one of the major reasons why, when a function receives a list as an input parameter, it is often a good idea to create a copy of the input list inside the function before making any modifications to it. This way, the caller of the function is assured that the list it passed in will not be modified by the function.
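The advice about copying a list parameter can be sketched as follows. The function with_total and the list scores are hypothetical names invented for this illustration; the code is written for Python 3.

```python
def with_total(values):
    """Return a new list: the input values followed by their sum."""
    result = values[:]              # work on a copy, not on the caller's list
    result.append(sum(result))
    return result

scores = [1, 2, 3]
extended = with_total(scores)

print(scores)                       # [1, 2, 3]  (caller's list is untouched)
print(extended)                     # [1, 2, 3, 6]
```

Without the values[:] copy, the append inside the function would also show up in scores, exactly as a.append(5) showed up in b earlier.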

Slicing is possible on strings as well. Let's perform the same operations as were done for list slicing on a string now.

>>> x = "Hello"        # String initialised to "Hello"
>>> y = x                  # Another string y storing the same value as in x
>>> z = x[:]               # String z is a replica / copy of string x
>>> print x, y, z
Hello   Hello    Hello

>>> id(x), id(y), id(z)
( 4342372992 , 4342372992,  4342372992 ) # <-- All three ids are the same !!!

Strings are immutable sequences of characters. So, there is no operation that can alter the contents of a string once it is stored in a specific memory location. Since nothing can ever change the original, a separate copy would serve no purpose, and we see that y and z are both treated as simple aliases [ references, in C++ terms ] or alias tags for x. This is one of the first differences we see in string slicing when compared to list slicing. The difference is driven mainly by the fact that lists are modifiable while strings are not.
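The immutability that makes this sharing safe can be seen directly: any attempt to change a character of a string in place is rejected with a TypeError. This is a minimal Python 3 sketch; the flag changed_in_place is only an illustrative name.

```python
x = "Hello"
z = x[:]                            # full slice of the string

try:
    x[0] = "J"                      # strings do not support item assignment
    changed_in_place = True
except TypeError:
    changed_in_place = False

print(changed_in_place)             # False
print(x)                            # Hello
print(z == x)                       # True: z holds the same characters as x
```

Because no such in-place change can ever succeed, it does not matter to the program whether z is a separate string or another tag on the same one.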

Now, let's append to a string as we did with lists and see the behaviour.

>>> x += "World"      # Append "World" to x
>>> print x, y, z         # Only x is modified !!!
HelloWorld   Hello    Hello

>>> id(x), id(y), id(z)
( 4342372896 , 4342372992,  4342372992 ) # <-- Only id of x is changed !!!

String x now lives at a new memory location !!! [ or at least has a new id in the above scenario ]. When we append a string like "World" to the existing string in x, Python internally creates a new string object for the result. It is this new object that is tagged with the variable name x. The earlier memory location [ containing "Hello" ] is no longer accessible through x, but it is still reachable through y and z.
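The same re-tagging can be checked in a way that does not depend on particular id numbers. In this Python 3 sketch, id_of_y_before and y_unmoved are illustrative names; the idea is that only the tag x moves, while y keeps pointing at the original "Hello" object.

```python
x = "Hello"
y = x
z = x[:]

id_of_y_before = id(y)

x += "World"                        # builds a new string and re-tags x to it

y_unmoved = (id(y) == id_of_y_before)

print(x)                            # HelloWorld
print(y)                            # Hello
print(z)                            # Hello
print(x is y)                       # False
print(y_unmoved)                    # True
```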

We would see the same behaviour even if we used a method like replace on x or y as shown below :

>>> y = y.replace("lo", "ipad")     # Replace "lo" with "ipad"
>>> print x, y, z                            # Only y is modified !!!
HelloWorld   Helipad    Hello

>>> id(x), id(y), id(z)
( 4342372896 , 4342373136,  4342372992 ) # <-- Only id of y is changed !!!

Moral of the story : Python [ at least in the interpreter used for these sessions ] treats the string copy syntax x[:] as an alias to the existing string x instead of building a new string. This is safe because strings are immutable, and it helps in managing memory more efficiently.

When we say that the string slice x[:] is a copy of string x, we are using the word "copy" with its English connotation; it does not mean that Python will allocate new memory and copy the contents of x into it.
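The contrast between the two kinds of slicing can be summed up in one small check. The helper full_slice_is_same_object is a hypothetical name made up for this sketch, written for Python 3. For a list, the full slice is a new object. For a string, getting the same object back is what the sessions above show, and is what CPython is expected to do, but it is best treated as an implementation detail rather than something to rely on.

```python
def full_slice_is_same_object(seq):
    """Report whether seq[:] hands back seq itself rather than a new object."""
    return seq[:] is seq

list_result = full_slice_is_same_object([1, 2, 3])
str_result = full_slice_is_same_object("Hello")

print(list_result)      # False: slicing a list builds a new list
print(str_result)       # expected True on CPython; an implementation detail
```

In practice, code should compare strings with == and not depend on whether two equal strings happen to be the same object.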


