Building a Go Concurrent Text Indexer

Table of Contents

  1. Introduction
  2. Prerequisites
  3. Overview
  4. Setup
  5. Implementation
     - Step 1: Reading Files
     - Step 2: Text Indexing
     - Step 3: Concurrent Processing
  6. Testing
  7. Conclusion

Introduction

In this tutorial, we will build a concurrent text indexer in Go (Golang). A text indexer is a program that reads text files and creates an index, allowing efficient searching for specific words or phrases within those files. We will leverage the power of Go’s concurrency to process multiple files simultaneously, improving indexing speed.

By the end of this tutorial, you will have a solid understanding of how to build a concurrent text indexer in Go and gain experience with techniques such as file I/O, goroutines, channels, and synchronization.

Prerequisites

To follow along with this tutorial, you should have a basic knowledge of the Go programming language's syntax and concepts. Familiarity with concurrent programming and goroutines is helpful but not required. Additionally, ensure that Go is installed on your system.

Overview

  1. We will start by setting up a new Go project and organizing our code.
  2. Next, we will implement functionality to read text files from a specified directory.
  3. We will then build the core text indexing logic using data structures and algorithms.
  4. Finally, we will introduce concurrency to process multiple files simultaneously and optimize the indexing speed.

Setup

Before getting started, let’s set up the project directory structure and create a new Go module. Open a terminal and execute the following commands:

mkdir go-text-indexer
cd go-text-indexer
go mod init github.com/your-username/go-text-indexer

Implementation

Step 1: Reading Files

Create a new file named file_reader.go and add the following code:

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// readFiles returns the paths of all regular files in directory.
func readFiles(directory string) ([]string, error) {
	var fileNames []string

	// os.ReadDir replaces the deprecated ioutil.ReadDir (Go 1.16+).
	files, err := os.ReadDir(directory)
	if err != nil {
		return nil, err
	}

	for _, file := range files {
		if !file.IsDir() {
			filePath := filepath.Join(directory, file.Name())
			fileNames = append(fileNames, filePath)
		}
	}

	return fileNames, nil
}

func main() {
	files, err := readFiles("/path/to/directory")
	if err != nil {
		fmt.Printf("Error: %v\n", err)
		return
	}

	for _, file := range files {
		fmt.Println(file)
	}
}

In the readFiles function, we read the directory listing, skip any subdirectories, and append each remaining file's full path to the fileNames slice. The function returns the slice of file paths.

In the main function, we call readFiles with a directory path. It prints the file paths to the console for testing purposes.
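As a possible refinement (not part of the tutorial's code), readFiles could be restricted to a particular file type so binary files never reach the indexer. Here is a minimal sketch; the readTextFiles name and the extension parameter are assumptions introduced for illustration:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// readTextFiles is a hypothetical variant of readFiles that keeps only
// files whose name ends with the given extension (e.g. ".txt").
func readTextFiles(directory, ext string) ([]string, error) {
	entries, err := os.ReadDir(directory)
	if err != nil {
		return nil, err
	}

	var fileNames []string
	for _, entry := range entries {
		// Skip subdirectories and files with a different extension.
		if !entry.IsDir() && filepath.Ext(entry.Name()) == ext {
			fileNames = append(fileNames, filepath.Join(directory, entry.Name()))
		}
	}
	return fileNames, nil
}

func main() {
	files, err := readTextFiles(".", ".txt")
	if err != nil {
		fmt.Printf("Error: %v\n", err)
		return
	}
	for _, file := range files {
		fmt.Println(file)
	}
}
```

Filtering by extension is a heuristic, not a guarantee of text content, but it is usually enough for a tutorial-scale indexer.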

Step 2: Text Indexing

Create a new file named text_indexer.go and add the following code. Note that all files in this project belong to package main, which may contain only one main function: delete (or comment out) the temporary main in file_reader.go before building this step.

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

type Index map[string][]string

func indexText(filePath string) Index {
	index := make(Index)
	file, err := os.Open(filePath)
	if err != nil {
		fmt.Printf("Error opening file: %v\n", err)
		return nil
	}
	defer file.Close()

	scanner := bufio.NewScanner(file)
	for scanner.Scan() {
		// strings.Fields splits on any run of whitespace and drops
		// empty strings, unlike a plain strings.Split on " ".
		for _, word := range strings.Fields(scanner.Text()) {
			word = strings.ToLower(word)

			// Record the file at most once per word, even if the word
			// occurs many times in this file.
			files := index[word]
			if len(files) == 0 || files[len(files)-1] != filePath {
				index[word] = append(files, filePath)
			}
		}
	}

	if err := scanner.Err(); err != nil {
		fmt.Printf("Error reading file: %v\n", err)
		return nil
	}

	return index
}

func main() {
	index := indexText("/path/to/file.txt")
	for word, files := range index {
		fmt.Printf("%s: %v\n", word, files)
	}
}

In the indexText function, we open the file specified by filePath and use a scanner to read it line by line. For each line, we split it into words and add them to the Index map. The keys in the map are the words (converted to lowercase), and the values are lists of file paths where the word occurs.

In the main function, we call indexText with a file path. It prints each word and the associated file paths where it occurs.
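To see how such an index would be queried, here is a minimal lookup sketch. The search helper is not part of the tutorial's code; it simply applies the same lowercasing rule that indexText uses when building the index:

```go
package main

import (
	"fmt"
	"strings"
)

type Index map[string][]string

// search returns the files associated with a word. The query is
// lowercased first, mirroring how indexText stores its keys.
func search(index Index, word string) []string {
	return index[strings.ToLower(word)]
}

func main() {
	// A tiny hand-built index standing in for the output of indexText.
	index := Index{
		"hello": {"a.txt", "b.txt"},
		"world": {"a.txt"},
	}
	fmt.Println(search(index, "Hello")) // case-insensitive lookup
	fmt.Println(search(index, "world"))
}
```

Because lookups are plain map accesses, queries run in constant time regardless of how many files were indexed.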

Step 3: Concurrent Processing

Create a new file named concurrent_indexer.go and add the following code (as before, remove any earlier test main functions so that this is the only main in the package):

package main

import (
	"fmt"
	"sync"
)

func concurrentIndexer(directory string) Index {
	fileNames, err := readFiles(directory)
	if err != nil {
		fmt.Printf("Error reading files: %v\n", err)
		return nil
	}

	var wg sync.WaitGroup
	var indexLock sync.Mutex
	index := make(Index)

	for _, fileName := range fileNames {
		wg.Add(1)
		go func(filePath string) {
			defer wg.Done()

			fileIndex := indexText(filePath)

			indexLock.Lock()
			for word, files := range fileIndex {
				index[word] = append(index[word], files...)
			}
			indexLock.Unlock()
		}(fileName)
	}

	wg.Wait()

	return index
}

func main() {
	index := concurrentIndexer("/path/to/directory")
	for word, files := range index {
		fmt.Printf("%s: %v\n", word, files)
	}
}

In the concurrentIndexer function, we first call readFiles to get a list of file names in the specified directory. Then, we create a sync.WaitGroup to wait for all goroutines to finish indexing.

Inside the goroutine, we call indexText to index each file separately. To avoid race conditions while updating the shared index, we use a sync.Mutex named indexLock to synchronize access to the index map.

After all goroutines have completed, we return the indexed result.
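The introduction mentioned channels, and they offer an alternative to the mutex: each goroutine sends its partial index over a channel, and a single collector merges them, so only one goroutine ever writes to the shared map. Here is a self-contained sketch of that pattern; the mergeIndexes function and its in-memory partial indexes are illustrative stand-ins for the per-file results of indexText:

```go
package main

import (
	"fmt"
	"sync"
)

type Index map[string][]string

// mergeIndexes merges partial indexes using a channel instead of a
// mutex: workers send results, and only the collector loop below
// writes to the merged map, so no lock is needed.
func mergeIndexes(partials []Index) Index {
	results := make(chan Index)
	var wg sync.WaitGroup

	for _, p := range partials {
		wg.Add(1)
		go func(part Index) {
			defer wg.Done()
			results <- part // in the tutorial, this would be indexText(filePath)
		}(p)
	}

	// Close the channel once every worker has sent its result, which
	// ends the collector's range loop below.
	go func() {
		wg.Wait()
		close(results)
	}()

	merged := make(Index)
	for part := range results {
		for word, files := range part {
			merged[word] = append(merged[word], files...)
		}
	}
	return merged
}

func main() {
	partials := []Index{
		{"go": {"a.txt"}},
		{"go": {"b.txt"}, "fun": {"b.txt"}},
	}
	merged := mergeIndexes(partials)
	fmt.Println(len(merged["go"]), len(merged["fun"])) // → 2 1
}
```

Both approaches are correct; the mutex version is shorter, while the channel version confines all writes to one goroutine, which can be easier to reason about as the merge logic grows.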

Testing

To test the concurrent text indexer, replace /path/to/directory in the main function of concurrent_indexer.go with the path to a real directory containing text files:

func main() {
	index := concurrentIndexer("/path/to/directory")
	for word, files := range index {
		fmt.Printf("%s: %v\n", word, files)
	}
}

Save the file and run the program from the module root:

go run .

It should print each indexed word along with the files in which it appears.

Conclusion

Congratulations! You have successfully built a Go concurrent text indexer. In this tutorial, we covered the steps required to read files from a directory, index the text, and process files concurrently using goroutines and synchronization primitives.

By leveraging Go’s concurrency features, you can improve the indexing speed significantly. Feel free to explore further improvements such as error handling, progress monitoring, or integrating with a search functionality.

Remember to experiment with different file types and sizes to test the scalability and performance of your text indexer. Happy coding!