Shortest Common Super-sequence in Dynamic Programming

Hi everyone!🙋‍♀️

Hope you’re staying positive and testing negative!😄

In order to understand the Shortest Common Supersequence(SCS) problem, we first need to learn about two things it is very closely related to:

  1. Dynamic Programming
  2. Longest Common Subsequence.

So lets start with the basics right away!

DYNAMIC PROGRAMMING

Dynamic programming is both a mathematical optimization method and a computer programming method. The method was developed by Richard Bellman in the 1950s and has found applications in numerous fields, from aerospace engineering to economics.

It refers to simplifying a complicated problem by breaking it down into simpler sub-problems in a recursive manner. Similar to the divide and conquer method, Dynamic Programming solves problems by combining the solutions of subproblems. It involves solving each subproblem just once and then saving its answer in a table, thereby avoiding re-computing the answer every time.

A comparison between Divide & Conquer Approach and Dynamic Programming.
A comparison between Divide & Conquer Approach and Dynamic Programming.

LONGEST COMMON SUBSEQUENCE

The longest common subsequence states that , given two sequences, we need to find the length of longest subsequence present in both of them. A subsequence is a sequence that appears in the same relative order, but is not necessarily contiguous. The naive solution for this problem is to generate all subsequences of both given sequences and find the longest matching subsequence.

For example:

Here, we compare both the given strings and try to find a pattern common in both of them. Thus, we see that BCBA and BDAB come out to be the longest common subsequences present in the given input strings and thereby we conclude that the length of the longest common subsequence here is 4.

Here’s a solved example of the LCS problem

Now that we have gained knowledge about the basics required to solve the SCS problem, lets jump right into it! :)

SHORTEST COMMON SUPERSEQUENCE

PROBLEM STATEMENT

Shortest Common Supersequence (SCS) problem states , as for a given set of strings , the task is to find the supersequence of every string that is minimum in length. The term supersequence means the sequence of symbols of a string and sequence of symbols of the computed strings are same (sequence need not to be adjacent/continuous).

In the SCS problem, two sequences X and Y are given, and the task is to find the shortest possible common supersequence of these sequences.

If both strings have all characters different, then result is sum of lengths of two given strings. If there are common characters, then we don’t want them multiple times(i.e. no overlapping) as the task is to minimize the length. Therefore, we first find the longest common subsequence, take one occurrence of this subsequence , and add the extra characters.

Now, lets try to solve an example to understand this concept thoroughly.

STEP-BY-STEP DEMONSTRATION OF SOLVING THE SCS PROBLEM

  • Question:

Given two strings S1 and S2, find the length of the smallest string which has both, S1and S2 as its sub-sequences.

  • Input Strings:

S1 = AGGTAB

S2= GXTXAYB

  • Demonstration of the solution:
  • Final Solution:

Thus, the Shortest Common Supersequence is AGXGTXAYB and the length of it is 9. This supersequence has all elements of the both the input strings although not necessarily in a continuous order.

ALGORITHM FOR THE SCS PROBLEM

Here is the Dynamic Programming approach:

  1. Compute the LCS using Dynamic Programming tabulation where each cell returns the length of the cell up to i characters of String 1 and j characters of String 2.
  2. We will start processing the table using the last cell till i>0 or j>0 .

i) Check if s1[i-1]==s2[j-1]. If equal, we must add this character to the result string only once

ii) If not equal, then find the maximum of t[i-1][j] and t[i][j-1] (this is how we calculate LCS length first) , start moving in the max direction after inserting the character to the result string. Moving in max direction means discarding that character of the string that has not contributed in LCS. But still inserting in the final string because it will contribute in the supersequence.
i.e.
if (t[i-1][j]>t[i][j-1])
{
res.push_back(s1[i-1]);
i — ;
}

iii) Compute till i>0 && j>0. If any of the string is left i.e if i>0 or j>0 then add its characters to the result. This means that we copied 1 complete string but other is still remaining.

3. We have got the required string but stored all the characters in reverse order. So, just reverse the result and you will get the final answer.

This is the basic idea behind the algorithm.

CODE IMPLEMENTATION FOR THE SCS PROBLEM

The code for the above mentioned algorithm can be written in various programming languages like C , Java and C++, but I prefer to write it in Python using the dynamic programming approach.

# A dynamic programming based Python3 program print 
# shortest supersequence of two strings

# returns shortest supersequence of X and Y
def printShortestSuperSeq(x, y):
m = len(x)
n = len(y)

# dp[i][j] contains length of shortest
# supersequence for X[0..i-1] and Y[0..j-1]
dp = [[0 for i in range(n + 1)]
for j in range(n + 1)]

# Fill table in bottom up manner
for i in range(m + 1):
for j in range(n + 1):

# Below steps follow recurrence relation
if i == 0:
dp[i][j] = j
elif j == 0:
dp[i][j] = i
elif x[i — 1] == y[j — 1]:
dp[i][j] = 1 + dp[i — 1][j — 1]
else:
dp[i][j] = 1 + min(dp[i — 1][j],
dp[i][j — 1])

# Following code is used to print
# shortest supersequence

# dp[m][n] stores the length of the
# shortest supersequence of X and Y
index = dp[m][n]

# string to store the shortest supersequence
string = “”

# Start from the bottom right corner and
# one by one push characters in output string
i = m
j = n
while i > 0 and j > 0:

# If current character in X and Y are same,
# then current character is part of
# shortest supersequence
if x[i — 1] == y[j — 1]:

# Put current character in result
string += x[i — 1]

# reduce values of i, j and index
i -= 1
j -= 1
index -= 1

# If current character in X and Y are different
elif dp[i — 1][j] > dp[i][j — 1]:

# Put current character of Y in result
string += y[j — 1]

# reduce values of j and index
j -= 1
index -= 1
else:

# Put current character of X in result
string += x[i — 1]

# reduce values of i and index
i -= 1
index -= 1

# If Y reaches its end, put remaining characters
# of X in the result string
while i > 0:
string += x[i — 1]
i -= 1
index -= 1

# If X reaches its end, put remaining characters
# of Y in the result string
while j > 0:
string += y[j — 1]
j -= 1
index -= 1

string = list(string)

# reverse the string and return it
string.reverse()
return ‘’.join(string)

# Driver Code
if __name__ == “__main__”:
x = “AGGTAB”
y = “GXTXAYB”

print(printShortestSuperSeq(x, y))

OUTPUT OF THE ABOVE CODE

TIME COMPLEXITY

· The Dynamic Programming approach takes O(m*n) time , where m, n are the length of the first and second string respectively.

SPACE COMPLEXITY

· The Dynamic Programming approach takes O(m*n) auxiliary space.

RELATION BETWEEN SCS AND LCS PROBLEMS

The SCS problem is closely related to LCS problem. Lets see how.

1) Find Longest Common Subsequence (lcs) of two given strings. For example, lcs of “geek” and “eke” is “ek”.

2) Insert non-lcs characters (in their original order in strings) to the lcs found above, and return the result. So “ek” becomes “geeke” which is shortest common supersequence.

Length of the shortest supersequence  = (Sum of lengths of given two strings) - (Length of LCS of two given strings)

APPLICATIONS OF THE SCS PROBLEM

The SCS problem has diversified applications in:

1)Data compression and different bioinformatics analysis.

2) Probe synthesis during microarray production.

3) AI planning.

4) Query optimization in the database.

* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *

REFERENCES