Any Algorithms for Data Engineering

Data engineering is a core part of the data science ecosystem. It covers the algorithms, tools, and techniques used to process, store, and manage large datasets. Algorithms matter here because they transform data and move it between formats. This article walks through some common algorithms used in data engineering: sorting, searching, and graph traversal.

Sorting Algorithms

Sorting algorithms arrange data in a particular order. They matter in data processing pipelines where a later step needs its input in a specific sequence. Common sorting algorithms include:

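  • Bubble sort: A simple algorithm that repeatedly steps through the list, compares adjacent elements, and swaps them if they are in the wrong order. It takes O(n²) time in the worst case, so it is mainly useful for small inputs or for teaching.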
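def bubbleSort(arr):
    """Sort arr in place in ascending order."""
    n = len(arr)

    for i in range(n):
        # After pass i, the last i elements are already in their final positions.
        for j in range(0, n - i - 1):
            # Swap adjacent elements that are out of order.
            if arr[j] > arr[j + 1]:
                arr[j], arr[j + 1] = arr[j + 1], arr[j]

arr = [64, 34, 25, 12, 22, 11, 90]

bubbleSort(arr)

print("Sorted array is:")
# Print all values on one line, separated by spaces.
print(*arr)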
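  • Quick sort: A divide-and-conquer algorithm that picks a pivot, partitions the list into elements less than or equal to the pivot and elements greater than it, and then recursively sorts each part. The version below uses the last element as the pivot.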
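def partition(arr, low, high):
    """Partition arr[low..high] around arr[high]; return the pivot's final index."""
    i = low - 1          # index of the last element known to be <= pivot
    pivot = arr[high]    # the last element is the pivot

    for j in range(low, high):
        if arr[j] <= pivot:
            # Grow the "<= pivot" region and move arr[j] into it.
            i = i + 1
            arr[i], arr[j] = arr[j], arr[i]

    # Place the pivot just after the "<= pivot" region.
    arr[i + 1], arr[high] = arr[high], arr[i + 1]
    return i + 1

def quickSort(arr, low, high):
    """Sort arr[low..high] in place."""
    # A range with fewer than two elements is already sorted.
    if low < high:
        pi = partition(arr, low, high)

        quickSort(arr, low, pi - 1)
        quickSort(arr, pi + 1, high)

arr = [10, 7, 8, 9, 1, 5]
n = len(arr)
quickSort(arr, 0, n - 1)
print("Sorted array is:")
# Print all values on one line, separated by spaces.
print(*arr)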
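
In pipeline code, hand-written sorts like the two above are rarely needed, because Python's built-in `sorted` already does the job and is stable (records with equal keys keep their original order). The following is a minimal sketch of ordering records before a later processing step. The `events` records and their `user` and `ts` fields are made up for illustration.

```python
from operator import itemgetter

# Hypothetical records; the field names are assumptions for this example.
events = [
    {"user": "b", "ts": 3},
    {"user": "a", "ts": 2},
    {"user": "b", "ts": 1},
    {"user": "a", "ts": 1},
]

# Sort by user first, then by timestamp within each user.
ordered = sorted(events, key=itemgetter("user", "ts"))

for event in ordered:
    print(event["user"], event["ts"])
```

Passing a `key` function keeps the comparison logic in one place, which tends to be easier to maintain than a custom sort routine.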

Search Algorithms

Search algorithms locate a specific item in a dataset. Two common ones are:

  • Linear search: Checks each element of a list in turn until the target is found or the list ends. It takes O(n) time and works on unsorted data.
def linear_search(arr, x):
    """Return the index of the first occurrence of x in arr, or -1 if absent."""
    for i, value in enumerate(arr):
        if value == x:
            return i

    return -1

arr = [1, 2, 3, 4, 5, 6, 7, 8, 9]
x = 5

result = linear_search(arr, x)

if result != -1:
    print(f"Element is present at index {result}")
else:
    print("Element is not present in array")
  • Binary search: Repeatedly halves the search interval of a sorted list until the target value is located or the interval is empty. It takes O(log n) time but requires the input to be sorted.
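def binary_search(arr, low, high, x):
    """Return the index of x in the sorted range arr[low..high], or -1 if absent."""
    if high >= low:
        mid = (high + low) // 2

        if arr[mid] == x:
            return mid

        # x is smaller than the middle element, so search the left half.
        elif arr[mid] > x:
            return binary_search(arr, low, mid - 1, x)

        # x is larger than the middle element, so search the right half.
        else:
            return binary_search(arr, mid + 1, high, x)

    else:
        # The interval is empty, so x is not in the list.
        return -1

arr = [2, 3, 4, 10, 40]
x = 10

result = binary_search(arr, 0, len(arr) - 1, x)

if result != -1:
    print(f"Element is present at index {result}")
else:
    print("Element is not present in array")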
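
The standard library's `bisect` module provides the same halving search without recursion. Here is a minimal sketch, assuming the input is already sorted. The `find_index` helper name is introduced here for illustration only.

```python
from bisect import bisect_left

def find_index(sorted_values, target):
    """Return the index of target in sorted_values, or -1 if it is absent."""
    # bisect_left returns the leftmost position where target could be
    # inserted while keeping the list sorted.
    pos = bisect_left(sorted_values, target)
    if pos < len(sorted_values) and sorted_values[pos] == target:
        return pos
    return -1

values = [2, 3, 4, 10, 40]
print(find_index(values, 10))
print(find_index(values, 5))
```

When the list contains duplicates, this returns the index of the first matching element, which the recursive version does not guarantee.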

Graph Algorithms

Graph algorithms solve problems on graphs, such as finding shortest paths or checking connectivity. Two common traversal algorithms are:

  • Breadth-first search (BFS): Visits nodes level by level from a start node: first the start node, then all of its neighbours, then their neighbours, and so on.
from collections import defaultdict, deque

class Graph:

    def __init__(self):
        # Adjacency list: node -> list of neighbours
        self.graph = defaultdict(list)

    def add_edge(self, u, v):
        # Directed edge from u to v
        self.graph[u].append(v)

    def bfs(self, s):
        # A set works for any hashable node label, including nodes
        # that have no outgoing edges.
        visited = {s}
        queue = deque([s])

        while queue:
            s = queue.popleft()  # O(1), unlike list.pop(0)
            print(s, end=" ")

            # .get avoids adding new keys to the defaultdict during traversal.
            for i in self.graph.get(s, []):
                if i not in visited:
                    queue.append(i)
                    visited.add(i)

g = Graph()
g.add_edge(0, 1)
g.add_edge(0, 2)
g.add_edge(1, 2)
g.add_edge(2, 0)
g.add_edge(2, 3)
g.add_edge(3, 3)

print("Following is Breadth First Traversal")
g.bfs(2)
  • Depth-first search (DFS): Follows one path as far as it can go from a start node, then backtracks to explore the remaining branches.
from collections import defaultdict

class Graph:

    def __init__(self):
        # Adjacency list: node -> list of neighbours
        self.graph = defaultdict(list)

    def add_edge(self, u, v):
        # Directed edge from u to v
        self.graph[u].append(v)

    def dfs_util(self, v, visited):
        # Mark the current node as visited and print it.
        visited.add(v)
        print(v, end=' ')

        # Recurse into each neighbour that has not been visited yet.
        for neighbour in self.graph[v]:
            if neighbour not in visited:
                self.dfs_util(neighbour, visited)

    def dfs(self, v):
        # The visited set is shared across the recursive calls.
        visited = set()
        self.dfs_util(v, visited)

g = Graph()
g.add_edge(0, 1)
g.add_edge(0, 2)
g.add_edge(1, 2)
g.add_edge(2, 0)
g.add_edge(2, 3)
g.add_edge(3, 3)

print("Following is Depth First Traversal")
g.dfs(2)
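
Because BFS visits nodes in order of their distance from the start, it can also answer the shortest-path question mentioned at the top of this section, as long as the graph is unweighted. The following is a sketch under that assumption, using the same edges as the examples above. The `shortest_path` function and the plain-dict `edges` representation are illustrative choices, not part of the `Graph` class.

```python
from collections import deque

def shortest_path(edges, start, goal):
    """Return the nodes on a shortest path from start to goal, or None."""
    # parent doubles as the visited set and records how each node was reached.
    parent = {start: None}
    queue = deque([start])

    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the parent links back to the start, then reverse.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for neighbour in edges.get(node, []):
            if neighbour not in parent:
                parent[neighbour] = node
                queue.append(neighbour)

    # The queue emptied without reaching goal, so no path exists.
    return None

# Same directed edges as in the traversal examples.
edges = {0: [1, 2], 1: [2], 2: [0, 3], 3: [3]}
print(shortest_path(edges, 1, 3))
```

Weighted graphs need a different algorithm, since the fewest edges is no longer guaranteed to be the cheapest route.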

Conclusion

Data engineering relies on algorithms to manipulate, process, and store data. The sorting, search, and graph algorithms covered in this article are only a few of the many in use. Knowing how they work and what they cost helps a data engineer build efficient, scalable data processing solutions.

Category: Algorithms