Sentiment analysis with a SQLite database (…and TF-IDF)#
Today I’m just starting Chapter 8: Sentiment Analysis in the book. But this time I’m not just busy scribbling down notes or copying code verbatim into my Python REPL. I decided to experiment with SQLite databases! (as you can probably infer from the title) Hopefully, I will be able to practice both my ML and my DB skills 😅.
Consider this piece of code from the book, which is used to extract data from the IMDb dataset:
basepath = 'aclImdb'
labels = {'pos': 1, 'neg': 0}
pbar = pyprind.ProgBar(50000, stream=sys.stdout)
df = pd.DataFrame()
for s in ('test', 'train'):
for l in ('pos', 'neg'):
path = os.path.join(basepath, s, l)
for file in sorted(os.listdir(path)):
with open(os.path.join(path, file),
'r', encoding='utf-8') as infile:
txt = infile.read()
df = df.append([[txt, labels[l]]],
ignore_index=True)
pbar.update()
df.columns = ['review', 'sentiment']
(pos and neg refer to sentiment labels, all text files under pos are positive reviews and vice versa)
Instead of directly storing the data in a Pandas DataFrame, I attempted to store it in a SQLite Database
(by using the sqlite3
package in the stdlib):
1import sqlite3
2from pathlib import Path
3
4con = sqlite3.connect("IMDb.sqlite3")
5con.execute("""
6CREATE TABLE IF NOT EXISTS IMDb (
7 id INTEGER PRIMARY KEY,
8 review TEXT,
9 sentiment INTEGER
10)
11""")
12
13def get_count():
14 return con.execute("SELECT COUNT(*) FROM IMDb").fetchone()[0]
15
16if get_count() == 0:
17 def reviews():
18 for file in Path("aclImdb").glob("**/pos/*.txt"):
19 with open(file, encoding="utf-8") as f:
20 yield (f.read(), 1)
21
22 for file in Path("aclImdb").glob("**/neg/*.txt"):
23 with open(file, encoding="utf-8") as f:
24 yield (f.read(), 0)
25
26 cur = con.cursor()
27 with con:
28 cur.executemany(
29 "INSERT INTO IMDb (review, sentiment) VALUES (?, ?)",
30 reviews()
31 )
32
33 print(get_count())
First I created a connection with sqlite3.connect()
. Then I created the table and inserted the
reviews into the table with basic CRUD operations.
Note
Highlights in the code above:
CREATE TABLE IF NOT EXISTS
: only create the table if it does not existreviews()
is a generator function (yield
), so that I can use it in combination withcon.executemany()
with con
: Connection objects as context managers (autoCOMMIT
orROLLBACK
)VALUES (?, ?)
: SQL interpolation, prevents SQL injection
And finally, I need to read the reviews into a DataFrame, so as to preprocess the data. This time I chose polars. Polars provides a function that utilizes connectorx
to load data from a variety of SQL databases. For me, I only need to provide the SQL SELECT
statement and the connection url:
import polars as pl
df = pl.read_database(
"SELECT review, sentiment FROM IMDb",
"sqlite://IMDb.sqlite3",
)
That’s it! I succeeded in storing and retrieving data from a SQLite database.
Now it’s time for some less hands-on stuff.
TF-IDF#
I’ve always heard people discussing TF-IDF, but not until today did I truly understand what it refers to and why it’s useful in NLP.
TF-IDF, which stands for term frequency-inverse document frequency, is used for highighting important words (but not common ones such as “is” or “a”) in a document. It is defined as follows:
where \(t\) is the term and \(d\) is the document.
\(\mathrm{tf}(t, d)\), of course is the number of occurances of a term in a document. \(\mathrm{idf}(t, d)\), on the other hand is more complicated.
It’s defined as:
where \(n_d\) is the total number of documents and \(\mathrm{df}(d, t)\) refers to the document frequency (how many documents contain the term \(t\)).
However, in scikit-learn, it is instead defined as:
(which helps to prevent the singularity of \(\log x\) at \(x = 0\))
This way, when words such as “is” or “a” appear in every document, \(\mathrm{df}(d,t)\) will be very large, and thus \(\mathrm{tfidf}(t, d)\) will be small.
That’s all I have to share today.