Scaling My Partner’s Poetry (Part 2)

My goal today was to help Kaitlin see which poems she has already posted to Instagram as images. To do that, I downloaded all of her uploaded images along with the associated URLs. Next, I ran the images through an image-to-text (OCR) tool called Tesseract. I then compared each extracted text with the existing poem files to find a match. Once I knew which Instagram posts matched which poems, I added the Instagram URL to the front matter created in Part 1.

I was not super concerned with writing perfect code, and I’m certain that improvements could be made to any code below. These scripts were only run once, so they just needed to work.

Step 1: Downloading Her Instagram Posts

I used a Python program called Instalooter. This program can download all images and videos associated with an Instagram user. Once I installed the program on my Ubuntu laptop, I ran the following command in my terminal:

{{< cmd >}} instalooter user kaitquinnpoetry -d {{< /cmd >}}

The -d flag dumps each post’s metadata into a .json file alongside the downloaded images.
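As a rough sketch of how that metadata can be used: the later steps read the shortcode field from each JSON file. Assuming the usual Instagram permalink pattern of https://www.instagram.com/p/<shortcode>/, the shortcode is enough to rebuild a post's URL. The sample JSON and the post_url helper below are hypothetical and only illustrate the idea.

```ruby
require 'json'

# Build the public permalink for a post from its parsed metadata.
# Assumes the permalink pattern https://www.instagram.com/p/<shortcode>/.
def post_url(metadata)
  "https://www.instagram.com/p/#{metadata.fetch('shortcode')}/"
end

# Hypothetical, heavily trimmed stand-in for one dumped metadata file.
sample_json = '{"shortcode": "ABC123xyz"}'

puts post_url(JSON.parse(sample_json))
```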

Step 2: Use Tesseract To Extract Text From Instagram Posts

The following Ruby code iterates through each Instagram post’s .json file and extracts text from the matching images using the RTesseract gem (a wrapper around the real Tesseract). The JSON file location, the post’s shortcode (the identifier in its Instagram URL), and the extracted text are written to a .csv file.

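# frozen_string_literal: true

require 'bundler/inline'
require 'csv'
require 'json'

gemfile do
  source 'https://rubygems.org'
  gem 'rtesseract'
end

# One Instagram post: its shortcode, the path of its JSON file, and the OCR text.
class Gram
  attr_accessor :shortcode, :location, :text

  def initialize
    @text = String.new # unfrozen, so OCR text can be appended with <<
  end
end

# Metadata files written by instalooter's -d flag.
def gram_posts
  Dir.glob(File.join(Dir.home, 'Code', 'poetry', 'insta', '*.json'))
end

# ASSUMPTION: this helper was not shown in the original script. It assumes the
# images are .jpg files sitting in the same folder as the JSON files.
def gram_images
  Dir.glob(File.join(Dir.home, 'Code', 'poetry', 'insta', '*.jpg'))
end

@posts = []
gram_posts.sort.reverse.each do |gram_post|
  @posts << Gram.new.tap do |g|
    gram_json = JSON.parse(File.read(gram_post))
    g.location = gram_post
    g.shortcode = gram_json['shortcode']
    # Match images to a post by the first nine characters of the file name.
    selected_images = gram_images.select { |i| File.basename(i)[0..8] == File.basename(gram_post)[0..8] }
    g.text << selected_images.map { |image| RTesseract.new(image).to_s }.join("\n")
    puts g.text
  end
end

CSV.open('OCR_output.csv', 'w') do |csv|
  @posts.each do |post|
    csv << [post.location, post.shortcode, post.text]
  end
end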
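The next script reads this CSV by column name (slug and text), so the file needs a header row. One possible way to avoid adding it by hand is to let the CSV library write it. The sketch below uses an in-memory string so it can run anywhere; the rows_to_csv helper and the sample row are hypothetical.

```ruby
require 'csv'

# Serialize rows to CSV text, writing the header row first.
def rows_to_csv(rows, headers)
  CSV.generate(write_headers: true, headers: headers) do |csv|
    rows.each { |row| csv << row }
  end
end

# Hypothetical sample row: JSON path, shortcode, OCR text (with a line break).
rows = [['insta/post.json', 'ABC123xyz', "first line\nsecond line"]]
csv_text = rows_to_csv(rows, %w[location slug text])

# Reading it back with headers: true yields rows keyed by column name.
CSV.parse(csv_text, headers: true).each do |row|
  puts "#{row['slug']}: #{row['text'].lines.count} line(s) of text"
end
```

The same write_headers and headers options can be passed to CSV.open when writing to a file instead of a string.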

Step 3: Match Poem To Instagram Post

I now had the original poem files in Markdown and the text extracted from the Instagram posts. The next step was to compare the two and find matches, which gets me closer to the goal of putting the Instagram URL into each poem’s Markdown front matter.

For this, I found a gem called similar_text and wrote a very inefficient script to compare the texts.

If the similarity is greater than 35%, I write the “match found” information to a .csv file. I arrived at 35% through trial and error: at that threshold, few false positives slipped through.

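require 'csv'
require 'similar_text' # gem install similar_text; adds String#similar

# ASSUMPTION: this helper was not shown in the original script; the folder
# below is a placeholder for wherever the poem Markdown files live.
def poem_files
  Dir.glob(File.join(Dir.home, 'Code', 'poetry', 'poems', '*.md'))
end

@ocr_data = []
# At this point I manually added headers to the OCR_output.csv file
# (the code below expects at least "slug" and "text" columns)
CSV.foreach('OCR_output.csv', headers: true) { |row| @ocr_data << row.to_hash }
@matches = []
poem_files.each do |poem_file|
  poem_text = File.read(poem_file)
  # Compare every poem against every OCR result (slow, but it only runs once).
  @ocr_data.each do |ocrtext|
    # to_s guards against posts where OCR produced no text at all.
    similar_percent = poem_text.similar(ocrtext['text'].to_s)
    next unless similar_percent > 35

    @matches << [poem_file, ocrtext['slug'], similar_percent]
  end
end

CSV.open('matches.csv', 'w') do |csv|
  @matches.each { |match| csv << match }
end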

The matches.csv file now contains the poem file location, the Instagram post’s slug (its shortcode), and the similarity percentage.
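OCR output tends to differ from the source text in case, punctuation, and line breaks, so normalizing both sides before comparing might make the threshold less touchy, and keeping only the best-scoring post per poem might trim false positives. The sketch below illustrates both ideas. To stay dependency-free it uses a simple word-overlap (Jaccard) percentage as a stand-in, which is not the algorithm the similar_text gem uses, so the 35 threshold would need re-tuning. All helper names and sample data are hypothetical.

```ruby
require 'set'

# Lowercase, drop punctuation, and collapse whitespace so OCR noise matters less.
# Note: this simple pattern keeps ASCII letters and digits only.
def normalize(text)
  text.downcase.gsub(/[^a-z0-9\s]/, ' ').split.join(' ')
end

# Stand-in similarity: percentage of shared unique words (Jaccard index).
# This is NOT the similar_text algorithm, only a dependency-free placeholder.
def word_overlap_percent(a, b)
  words_a = normalize(a).split.to_set
  words_b = normalize(b).split.to_set
  union = words_a | words_b
  return 0.0 if union.empty?

  (words_a & words_b).size * 100.0 / union.size
end

# Return the [slug, percent] pair with the highest score above the threshold, or nil.
def best_match(poem_text, ocr_rows, threshold: 35)
  scored = ocr_rows.map { |row| [row['slug'], word_overlap_percent(poem_text, row['text'].to_s)] }
  best = scored.max_by { |_slug, percent| percent }
  best if best && best[1] > threshold
end

# Hypothetical sample data shaped like rows from OCR_output.csv.
ocr_rows = [
  { 'slug' => 'AAA', 'text' => "the MOON, is a quiet\nlantern" },
  { 'slug' => 'BBB', 'text' => 'coffee and rain on monday' }
]
p best_match('The moon is a quiet lantern.', ocr_rows)
```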

Step 4: Add Front Matter To Poem (Markdown File)

Now that I know which Instagram images match which poem (Markdown file), I can add the following front matter:

...[other front matter above]...
instagram_url: <theURL>
---

For this, I used the PadUtils gem, which has handy functions for inserting lines at specific places in a text file. The following code inserts the instagram_url line into each poem’s front matter:

# frozen_string_literal: true

require 'bundler/inline'
require 'csv'

gemfile do
  source 'https://rubygems.org'
  gem 'pad_utils'
end

@matches = []
# At this point I manually added headers to the matches.csv file
# (the code below expects "poem_location" and "slug" columns)
CSV.foreach('matches.csv', headers: true) { |row| @matches << row.to_hash }

@matches.each do |match|
  poem = match['poem_location']
  slug = match['slug']
  puts "Slug: #{slug} - Poem: #{poem}"
  # Insert the new line just before the last '---' in the poem file.
  PadUtils.insert_before_last(original: poem, tag: '---', text: "\ninstagram_url: #{slug}\n")
end
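For comparison, here is a minimal standard-library sketch of the same idea, working on a string instead of a file. It assumes the front matter opens on the first line with --- and closes at the next --- line, so it targets the front matter's closing delimiter instead of the last --- in the file. The add_front_matter_field helper and the sample poem are hypothetical.

```ruby
# Insert "key: value" on its own line just before the closing '---' of the front matter.
# Returns the text unchanged when it does not start with a front matter block.
def add_front_matter_field(markdown, key, value)
  lines = markdown.lines
  return markdown unless lines.first&.strip == '---'

  # The closing delimiter is the first '---' line after the opening one.
  closing = lines[1..].index { |line| line.strip == '---' }
  return markdown if closing.nil?

  # closing is relative to lines[1..], so add 1 to get the index in lines.
  lines.insert(closing + 1, "#{key}: #{value}\n")
  lines.join
end

# Hypothetical poem file contents.
poem = "---\ntitle: Moon\n---\nThe moon is a quiet lantern.\n"
puts add_front_matter_field(poem, 'instagram_url', 'ABC123xyz')
```

Anchoring on the front matter's own closing delimiter means a poem that happens to contain a --- line in its body would not receive the new field in the wrong place.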

Remind me, what was the point?

Kaitlin now knows which poems have been posted to Instagram and has a link to where they are posted. In the event she wants to publish those poems elsewhere, she can quickly get to that Instagram post to make it private. There may be other undiscovered benefits to having that information more handy.
