Conversation
@chrished, can you have a look at this? After that, I will run it. My idea is to use
src/dataprep/pipeline.sh
```shell
# ### Calculate reduced-dimension paper concepts
python -m $sript_path.link.fit_svd_model \
    --start 1980 \
    --end 2020 \
```
can we include 2021 (and 2022)?
chrished left a comment
looks reasonable, would only increase the upper year limit
I added the model checks. What's left to do now:
We discussed that, after aggregating at the entity level (department-year or author-year), we should appropriately normalize the embedding vectors. But I'm no longer certain about this, since the embedding values can also be negative, in which case a simple normalization does not make sense. -> We need to think more about this.
Normalization is not necessary at that level: cosine similarity already normalizes by the length of the vectors and only measures the difference in angle. The discussion was relevant when considering dimension reduction at the author/department level, since there the length of the vector mattered.
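A quick check of that point (a toy sketch, not the repo's code): cosine similarity is invariant to rescaling a vector, even when embedding values are negative, so normalizing the aggregated vectors would not change the similarities.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
v = rng.normal(size=(1, 16))  # embedding values may be negative
w = rng.normal(size=(1, 16))

s1 = cosine_similarity(v, w)
s2 = cosine_similarity(5.0 * v, w)  # rescaling one vector leaves the angle unchanged
assert np.allclose(s1, s2)
```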
The prediction on all papers is running; I'll commit and then update the similarity part when it's done.
Force-pushed from 381470b to 30e3b0d
Predicted vectors are way too large: 1024 columns × 260 million papers takes 2 TB of storage. We can instead apply the SVD model on the fly when calculating the similarities.
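A minimal sketch of the on-the-fly idea, assuming the fitted model is a scikit-learn `TruncatedSVD` (the actual model class, dimensions, and names in the repo may differ; everything here is hypothetical toy data): rather than materializing the reduced vectors for all papers, transform each batch with the fitted model just before computing similarities.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(1)
papers_full = rng.uniform(size=(500, 1024))  # stand-in for raw 1024-dim paper vectors
query_full = rng.uniform(size=(10, 1024))    # stand-in for the entities to compare

svd = TruncatedSVD(n_components=16, random_state=1).fit(papers_full)

def batched_similarity(query, corpus, svd, batch_size=100):
    """Reduce each corpus batch on the fly with the fitted SVD, then compare."""
    q_red = svd.transform(query)
    sims = []
    for start in range(0, corpus.shape[0], batch_size):
        batch = svd.transform(corpus[start:start + batch_size])
        sims.append(cosine_similarity(q_red, batch))
    return np.hstack(sims)

sims = batched_similarity(query_full, papers_full, svd)
print(sims.shape)  # (10, 500)
```

Only one batch of reduced vectors lives in memory at a time, so nothing near the 2 TB of predicted vectors ever needs to be stored.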
…imilarity implemented only
…tions.py. Adjust to load all fields including level 0 in fit_svd
…, fix index accessibility issues; remaining: missing FieldOfStudyId in topics_collaborators_affiliations df
@f-hafner
I'll have a look!
@chrished, here is an example for similarities by array:

```python
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import pandas as pd

rng = np.random.default_rng(23535)

# generate data
n_students = 100
n_researchers = 1000
emb_students = rng.uniform(size=(n_students, 16))
emb_researchers = rng.uniform(size=(n_researchers, 16))

data = {}
data["student_id"] = 30 + np.arange(n_students)
for idx in range(emb_students.shape[1]):
    data[f"emb_{idx}"] = emb_students[:, idx]
d_students = pd.DataFrame(data)

data = {}
data["researcher_id"] = 500 + np.arange(n_researchers)
for idx in range(emb_researchers.shape[1]):
    data[f"emb_{idx}"] = emb_researchers[:, idx]
d_researchers = pd.DataFrame(data)

# similarity: one row per student, one column per researcher
d_researchers = d_researchers.set_index("researcher_id")
d_students = d_students.set_index("student_id")
similarities = cosine_similarity(d_students, d_researchers)

# convert to dataframe
a = pd.DataFrame(similarities)
a.columns = d_researchers.index
a["student_id"] = d_students.index
a = a.set_index("student_id")

# reshape to long: (student_id, researcher_id, sim)
b = a.stack()
b = b.reset_index()
b = b.rename(columns={0: "sim"})
b.head()
```

For students' own similarity, could you use this as well? Otherwise it looks fine to me, but I did not look through the whole code.
Combining the functions is a possibility, but I am not sure it is worth it right now. Open issues:

```sql
SELECT *
FROM graduates_similarity_to_self AS gss
LEFT JOIN graduates_similarity_to_self_svd AS gss_svd
    ON gss.AuthorId = gss_svd.AuthorId AND gss.max_level = gss_svd.max_level
WHERE gss_svd.AuthorId IS NULL AND gss.max_level = 2
LIMIT 10;
```
```
AuthorId|similarity|max_level|AuthorId|similarity|max_level
31354139|0.148778879513861|2|||
32263467|0.234119255922107|2|||
43860561|0.186600285900166|2|||
96708831|0.442487846995853|2|||
118646155|0.0|2|||
137389383|0.220691473367387|2|||
207844304|0.138700745291053|2|||
221862811|0.211684941713149|2|||
261048268|0.192376176839773|2|||
319527065|0.437573593393941|2|||
```
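The same anti-join check can also be sketched in pandas, which may be handy for debugging outside the database (toy data below, hypothetical; the real tables live in SQL):

```python
import pandas as pd

# tiny stand-ins for the two similarity tables
gss = pd.DataFrame({
    "AuthorId": [31354139, 32263467, 1],
    "similarity": [0.1488, 0.2341, 0.5],
    "max_level": [2, 2, 2],
})
gss_svd = pd.DataFrame({
    "AuthorId": [1],
    "similarity": [0.4],
    "max_level": [2],
})

# LEFT JOIN with an indicator column, then keep rows with no SVD match
merged = gss.merge(gss_svd, on=["AuthorId", "max_level"],
                   how="left", indicator=True, suffixes=("", "_svd"))
missing = merged[(merged["_merge"] == "left_only") & (merged["max_level"] == 2)]
print(missing["AuthorId"].tolist())  # [31354139, 32263467]
```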
work in progress