䷉Table: Scientific NLP and Reasoning Systems Readings

Send your thoughts via twitter or mail

A sporadically updated table of papers I come across.  

Similar: Hacking on Scientific Text at Scale and Readings in Comp. Linguistics

Shoutout to my friend Ric da Silva for creating ivywrite.io (and to myself for convincing Ric to make it), which can compile messy Roam Research pages into something publishable like this



MedType: Improving Medical Entity Linking with Semantic Type Prediction

Predicting a semantic type for each entity mention first lets you cut the candidate labels by 10x, which improves [[entity linking]]. Reported increase of 6-10 AUC points for [[entity disambiguation]]

UMLS has 127 types. They grouped them into 24 semantic groups (see Table 2)

The authors create the WikiMed (650k mentions normalized to UMLS) and PubMedDS [[dataset]]s and pre-train MedType on them before fine-tuning on medical data
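A minimal sketch of the type-based pruning idea, assuming the semantic type has already been predicted for a mention. The `Candidate` class, the `prune_candidates` helper, the concept IDs and the type names are all made up for illustration and are not MedType's actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    concept_id: str      # hypothetical knowledge-base identifier
    name: str
    semantic_type: str   # coarse semantic group of the concept


def prune_candidates(candidates, predicted_types):
    """Keep only candidates whose semantic type was predicted for the mention."""
    kept = [c for c in candidates if c.semantic_type in predicted_types]
    # Fall back to the full list if the type filter removes everything.
    return kept or list(candidates)


# Toy candidates for the ambiguous mention "cold" (made-up IDs and groups).
candidates = [
    Candidate("C0001", "Common cold", "Disorders"),
    Candidate("C0002", "Cold temperature", "Phenomena"),
    Candidate("C0003", "Cold sensation", "Physiology"),
]

pruned = prune_candidates(candidates, predicted_types={"Disorders"})
print([c.name for c in pruned])
```

The entity linker then only has to disambiguate among the pruned candidates, which is where the reduction in candidate labels comes from.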


High-Precision Extraction of Emerging Concepts from Scientific Literature

Heuristic algorithm. Intuition: new terms (e.g. Self-Taught Hashing, gradient penalty) are introduced by a single paper that later gets cited a lot. Problem before this paper: extracted phrases were often too general or too narrow.

AllenAI's ForeCite achieves a double digit precision improvement on extracting new concepts from papers
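The intuition can be sketched as a simple ratio: take the earliest paper that uses a phrase and check how many of the later papers using that phrase cite it. This is my own simplification, not ForeCite's actual scoring function; the `origin_citation_ratio` helper and the toy corpus are hypothetical.

```python
def origin_citation_ratio(phrase, papers):
    """Return (origin paper id, share of later phrase users that cite the origin).

    papers: list of dicts with keys 'id', 'year', 'text', 'cites' (set of ids).
    """
    using = [p for p in papers if phrase.lower() in p["text"].lower()]
    if len(using) < 2:
        return None, 0.0
    # Assumption: the earliest paper containing the phrase introduced it.
    origin = min(using, key=lambda p: p["year"])
    later = [p for p in using if p["id"] != origin["id"]]
    citing = sum(1 for p in later if origin["id"] in p["cites"])
    return origin["id"], citing / len(later)


# Made-up toy corpus.
papers = [
    {"id": "A", "year": 2010, "text": "We propose self-taught hashing for fast search.", "cites": set()},
    {"id": "B", "year": 2012, "text": "We extend self-taught hashing with deep features.", "cites": {"A"}},
    {"id": "C", "year": 2013, "text": "Compared with self-taught hashing, our neural network is faster.", "cites": {"A"}},
    {"id": "D", "year": 2011, "text": "A neural network for parsing.", "cites": set()},
    {"id": "E", "year": 2014, "text": "Another neural network model.", "cites": {"B"}},
]

print(origin_citation_ratio("self-taught hashing", papers))  # specific term, one clear origin
print(origin_citation_ratio("neural network", papers))       # generic term, no clear origin
```

A phrase that is too general gets a low ratio because its users do not converge on one origin paper.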

ExpBERT: Representation Engineering with Natural Language Explanations

Suppose we want to specify the inductive bias that married couples typically go on honeymoons for the task of extracting pairs of spouses from text. In this paper, **we allow model developers to specify these types of inductive biases as natural language explanations.**

Across three relation extraction tasks, our method, ExpBERT, matches a BERT baseline **but with 3--20x less labeled data** and improves on the baseline by 3--10 F1 points with the same amount of labeled data


Learning from Explanations with Neural Execution Tree

In this paper, we propose a novel Neural Execution Tree (NExT) framework to augment training data for text classification using NL explanations.

After transforming NL explanations into executable logical forms by [[semantic parsing]], NExT generalizes different types of actions specified by the logical forms for labeling data instances, which substantially increases the coverage of each NL explanation.


Machine learning in chemoinformatics and drug discovery

Converting a compound structure into chemical information applicable for machine learning tasks requires multilayer computational processing from chemical graph retrieval, descriptor generation, fingerprint construction to similarity analysis
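The last step of that pipeline, similarity analysis, is commonly done with the Tanimoto coefficient over fingerprints. A minimal sketch, assuming fingerprints are given as sets of on-bit indices; the bit indices below are made up and do not come from real molecules.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two fingerprints.

    Each fingerprint is a set of indices of the bits that are switched on.
    """
    if not fp_a and not fp_b:
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


# Toy fingerprints (hypothetical on-bit indices).
fp_1 = {1, 4, 7, 9}
fp_2 = {1, 4, 8, 9, 12}

print(tanimoto(fp_1, fp_2))  # 3 shared bits / 6 distinct bits = 0.5
```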


Dependency Forest

A dependency forest encodes all valid dependency trees of a sentence into a 3D space so that syntax parsing is differentiable

This method allows us to merge a parser into a relation extraction model so that the parser can be jointly updated based on end-task loss.


TL;DR: More efficient text encoders emerge from an attempt (failed?) to do "[[GAN]]'s for Text".

Going binary improves sample efficiency: instead of reconstructing the [MASK] token, the discriminator only has to predict, for every token, whether it was swapped for a sample from a generator (painting -> [MASK] -> car). Later the generator is thrown away

30x training efficiency over BERT
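A minimal sketch of how the binary training targets could be built, assuming a generator that proposes a token for each masked position. The `corrupt` helper and the lookup-table `toy_generator` are hypothetical stand-ins, not the paper's code.

```python
def corrupt(tokens, mask_positions, sample_replacement):
    """Replace masked positions with generator samples and build binary labels.

    Label 1 means the token was replaced, 0 means it is still the original.
    """
    corrupted = list(tokens)
    for i in mask_positions:
        corrupted[i] = sample_replacement(tokens, i)
    # If the generator happens to sample the original token, it counts as original.
    labels = [int(new != old) for new, old in zip(corrupted, tokens)]
    return corrupted, labels


def toy_generator(tokens, position):
    """Hypothetical generator: a fixed lookup instead of a small masked LM."""
    return {0: "the", 4: "car"}[position]


tokens = ["the", "artist", "sold", "the", "painting"]
corrupted, labels = corrupt(tokens, mask_positions=[0, 4], sample_replacement=toy_generator)
print(corrupted)
print(labels)
```

The discriminator gets a training signal from every position, not just the masked ones, which is the source of the efficiency gain.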

Constructing Large Scale Biomedical Knowledge Bases from Scratch with Rapid Annotation of Interpretable Patterns

How do you populate your knowledge base without predefined rules or training data? Pick a predefined binary relationship type, do an 80/20 simplification of the sentence, and find the simplest good patterns through a bit of expert quality control.

Combining Neural and Pattern-Based Similarity Search

Domain experts often face the need to efficiently search for specific kinds of information in a large collection of documents.

In this work, we propose a method to do so that combines exact-match search over symbolic structures, with the ability of modern neural models to provide rich semantic representations which generalize over surface forms


Generalizing the NLP pipeline through Span-relation Representation

Humans can analyze language in a single format, so machines might as well

SpanRel: A large number of [[NLP]] [[subtask]]s ([[NER]], [[relation extraction]], [[Semantic Role Labeling | SRL]], sentiment analysis) can be represented in a single format: spans, and relations between spans
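A minimal sketch of what such a shared format could look like, assuming token-indexed spans. The class names and labels are my own illustration, not the paper's data model.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Span:
    start: int   # inclusive token index
    end: int     # exclusive token index
    label: str


@dataclass(frozen=True)
class Relation:
    head: Span
    tail: Span
    label: str


@dataclass
class Annotation:
    tokens: list
    spans: list = field(default_factory=list)
    relations: list = field(default_factory=list)

    def text(self, span):
        return " ".join(self.tokens[span.start:span.end])


tokens = ["Marie", "Curie", "discovered", "polonium"]
person = Span(0, 2, "PERSON")            # NER is just labeled spans
element = Span(3, 4, "CHEMICAL")
predicate = Span(2, 3, "PREDICATE")      # SRL adds a predicate span

ann = Annotation(
    tokens=tokens,
    spans=[person, element, predicate],
    relations=[
        Relation(person, element, "discovered"),   # relation extraction
        Relation(predicate, person, "ARG0"),       # SRL: agent
        Relation(predicate, element, "ARG1"),      # SRL: patient
    ],
)
print(ann.text(person), "|", ann.text(element))
```

Three different tasks end up in the same two lists, so one model architecture can in principle be trained on all of them.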

DARE: Data augmented Relation Extraction

Uses GPT-2 to generate examples for specific relation types

Combine generated data with gold data and train BERT relation extraction classifiers on them

TeKnow: Better Search for Computer Science

Aspect-based retrieval. Search with a task in mind.

Fine-grained pre-requisites for knowledge acquisition: showing the [[skill tree]] in [[computer science]]


Top2Vec: Distributed Representations of Topics CODE

Top2Vec is an algorithm for [[topic modeling]] and [[semantic search]]. It automatically detects topics present in text and generates jointly embedded topic, document and word vectors.

Crackpot idea: maybe it is a good idea to embed topics, documents, words, entity pairs, entities and dependency paths (aka Improving Biomedical Analogical Retrieval with Embedding of Structural) in one [[vector space]]?

"More informative topics than probabilistic generative models"


A Novel Cascade Binary Tagging Framework for Relational Triple Extraction CODE

Addresses the overlapping triple problem [relations sharing the same entities], i.e. EntityPairOverlap (EPO) and SingleEntityOverlap (SEO).

Method: Instead of discrete labels, **relations are reframed as functions** that map subjects to objects in a sentence, which naturally handles the overlapping problem.

This binary tagging model can sit on top of a Transformer encoder.
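A minimal sketch of the "relations as functions" framing, assuming subjects have already been tagged. Each relation is a function from a subject to the objects it holds with in the sentence; the lookup tables below are hypothetical stand-ins for the learned relation-specific taggers.

```python
def extract_triples(subjects, relation_taggers):
    """Apply every relation-specific tagger to every detected subject."""
    triples = []
    for subj in subjects:
        for relation, tag_objects in relation_taggers.items():
            for obj in tag_objects(subj):
                triples.append((subj, relation, obj))
    return triples


# Toy sentence: "Quentin Tarantino directed and wrote Pulp Fiction, starring Uma Thurman."
subjects = ["Quentin Tarantino", "Pulp Fiction"]

# Each tagger returns an empty list when the relation does not apply to the subject.
relation_taggers = {
    "directed": lambda s: {"Quentin Tarantino": ["Pulp Fiction"]}.get(s, []),
    "wrote": lambda s: {"Quentin Tarantino": ["Pulp Fiction"]}.get(s, []),
    "starring": lambda s: {"Pulp Fiction": ["Uma Thurman"]}.get(s, []),
}

for triple in extract_triples(subjects, relation_taggers):
    print(triple)
```

Because every relation is queried independently, the same entity pair can appear under two relations (EPO) and one entity can take part in several triples (SEO) without any conflict between labels.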



AutoMATES: Automated Model Assembly from Text, Equations, and Software

Besides natural language and equations, models must be implemented in code.

DEMO needs some UX work

A Unified Representation for Code, Text, Equations

Concept Parsing


Architecture Diagram:


Relation of the Relations: A New Paradigm of the Relation Extraction Problem

Only skimmed the paper, but the idea of modeling the **interdependency** among multiple relations in the same discourse felt right. Similar to the [[Shortest Dependency Path]] [[clustering]] approach in A global network of biomedical relationships derived from text

"Accordingly, we develop a data-driven approach that does not require hand-crafted rules but learns by itself the RoR, using Graph Neural Networks and a relation matrix transformer"



Includes: [[dataset]] for discourse and claims prediction, which I understand as a [[span prediction]] task, along with an [[annotation tool]]

Fun [[statistic]]: 55% of their annotated claims were in the last sentence of the abstract [biomedicine]

Use Cases: Find similar claims using different methods (literature review); find related claims (-> claim reliability for [[peer review]])


Translating Literature Into Causal Graphs

See also: INDRA

We annotate results in literature, which we convert into logical constraints on causal structure; with these constraints, we find consistent causal graphs using a state-of-the-art, causal discovery algorithm based on [[answer set programming]].

**Experiment Selection:** Because these causal graphs show which relations are underdetermined, biologists can use this pipeline to select their next experiment.


Controlled Crowdsourcing for High-Quality QA-SRL

I mentioned QA-SRL in Hacking on Scientific Text at Scale

"Applying our proposed controlled crowdsourcing protocol to QA-SRL successfully attains truly scalable high-quality annotation by laymen"


TransOMCS: From Linguistic Graphs to Commonsense Knowledge

[[dataset]] at 100x the size of ConceptNet

Method: Use linguistic knowledge ([[Dependency Tree]]) to extract [[common sense]] knowledge


Language (Re)modelling: Towards Embodied Language Understanding

Out of my depth here. Embodied Cognitive Linguistics (ECL) [[semantic parsing]] seems similar to [[force dynamics]] in Extraction of causal structure from procedural text for discourse representations. Suggests symbiosis of cognitive science, embodied AI, and [[computational linguistics]].

Claim: Natural language is inherently executable (like programming languages), driven by mental simulation and metaphoric mappings over hierarchical compositions of structures and schemata learned through embodied interaction.


HDAG-Explorer: A System for Hierarchical DAG Summarization and Exploration

[[interface]] **innovation**: Summarize a Hierarchical Directed Acyclic Graph [H-DAG] along highly important and diverse

Features: summarized visualization, interactive exploration, and structural statistics report.


Discrete Word Embedding for Logical Natural Language Understanding

A discrete, [[interpretable]] binary word [[embedding]] that preserves vector arithmetic and can handle ambiguity like its continuous sibling.

"Our model represents each word as a [[symbolic]] action that modifies the binary (ie. propositional) recurrent states through effects." Our Framework can extract the effects applied by each word as explicit logical formulae [??]

New: a) Robust to redundancy [the red red apple - would skew a continuous word vector] b) [[interoperability]] with classical planners


Training Classifiers with Natural Language Explanations

High Bandwidth Supervision: The annotator already spent the effort of reading and understanding the example, so let's get more information - increase [[annotation bandwidth]]

BabbleLabble: In addition to the binary label, annotators provide explanations, which are parsed ([[semantic parsing]]) into logical forms that act as labeling functions. #[[data programming]]
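A minimal sketch of what the parsed explanations end up as, assuming a spouse-extraction task. The labeling functions here are hand-written and combined by a simple majority vote; the real systems produce them with a semantic parser and combine the votes with a learned label model, so treat every name below as hypothetical.

```python
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1


def lf_married(sentence):
    # Explanation: "True, because the word 'married' appears in the sentence."
    return POSITIVE if "married" in sentence.lower() else ABSTAIN


def lf_honeymoon(sentence):
    # Explanation: "True, because couples go on honeymoons."
    return POSITIVE if "honeymoon" in sentence.lower() else ABSTAIN


def lf_sibling(sentence):
    # Explanation: "False, because they are described as siblings."
    lowered = sentence.lower()
    return NEGATIVE if any(w in lowered for w in ("brother", "sister")) else ABSTAIN


def majority_vote(sentence, labeling_functions):
    """Combine non-abstaining votes; abstain on ties or when nobody votes."""
    votes = [lf(sentence) for lf in labeling_functions]
    pos, neg = votes.count(POSITIVE), votes.count(NEGATIVE)
    if pos == neg:
        return ABSTAIN
    return POSITIVE if pos > neg else NEGATIVE


lfs = [lf_married, lf_honeymoon, lf_sibling]
print(majority_vote("Tom and Ann married in 2010 and went on a honeymoon.", lfs))
print(majority_vote("Tom visited his sister Ann.", lfs))
print(majority_vote("Tom met Ann.", lfs))
```

One explanation labels every unlabeled sentence it matches, which is the bandwidth gain over a single binary label.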


Towards Universal Semantic Representation

Designing better semantic representations

Existing approaches such as semantic role labeling (SRL) and abstract meaning representation (AMR) still have features related to the peculiarities of the particular language


Improving Biomedical Analogical Retrieval with Embedding of Structural

Problem: Neural embeddings are trained on co-occurrence events without making use of [[syntax]]. But relational structure from sentence dependencies is essential for analogical reasoning (cognitive theory).

[ESD: Encoding Structural Dependencies] Entities, Pairs of Entities and [[dependency]] Paths are all represented in the same [[vector space]]

The well-known [King - Man + Woman = Queen] arithmetic only seems to work because the target terms are similar in the vector space.


In Layman's Terms: Semi-Open Relation Extraction from Scientific Texts

Create FOBIE dataset [distant supervision with denoising] and use SORE (semi-open relation extraction) to get [[trade-off]] relationships from biological text

Trade-off is one of many complex relationship types.


REALM: Retrieval-Augmented Language Model Pre-Training

Idea: combine Masked [[language model]] with [[differentiable]] retriever

retrieve-then-predict: a retrieval that improves [[language model]] [[perplexity]] is rewarded
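As far as I understand it, retrieve-then-predict treats the retrieved document $z$ as a latent variable and marginalizes over it. The symbols below are my shorthand: $x$ is the masked input, $y$ the masked-out tokens, $\mathcal{Z}$ the corpus, and $f(x, z)$ a relevance score such as an inner product of the two embeddings.

```latex
p(y \mid x) = \sum_{z \in \mathcal{Z}} p(y \mid z, x)\, p(z \mid x),
\qquad
p(z \mid x) = \frac{\exp f(x, z)}{\sum_{z' \in \mathcal{Z}} \exp f(x, z')}
```

Because $p(z \mid x)$ is a softmax over scores, the retriever receives gradients: a document that makes the correct prediction more likely than the average retrieved document gets its score pushed up.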

Advance in Indexing an Entire Corpus

Don't understand the rest of the paper


Patching as Translation: the Data and the Metaphor

The NMT (neural machine translation) metaphor is not adequate to describe patching buggy code into working code.


Expanding a Database-derived Biomedical Knowledge Graph via Multi-relation Extraction from Biomedical Abstracts

[[data programming]] is cool but writing [[labeling function]]s is not easy. Can we reuse them for other edges or nodes?


Extraction of causal structure from procedural text for discourse representations

Surface forms link to force dynamics [linguistic theory]. Use that insight to extract causal chains and process graphs from text

RuleTaker: Transformers as Soft Reasoners over Language

DEMO Can transformers be trained to reason (or emulate reasoning) over rules expressed in language?

RuleTakers are trained on datasets of synthetic rule bases plus derived conclusions, provided here. The resulting models provide the first demonstration that this kind of soft reasoning over language is indeed learnable.

Metals conduct electricity.
Insulators do not conduct electricity.
If something is made of iron then it is metal.
Nails are made of iron.
**RuleTaker will (correctly) predict:**
Nails conduct electricity? TRUE
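For comparison, this is the symbolic derivation the transformer has to emulate. A minimal forward-chaining sketch over a hand-encoded version of the rule base above; it is not RuleTaker itself, and the predicate names and `forward_chain` helper are my own.

```python
def forward_chain(facts, rules):
    """Apply single-premise rules until no new facts can be derived.

    facts: set of (predicate, entity) pairs.
    rules: list of (premise_predicate, conclusion_predicate) pairs.
    """
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            for predicate, entity in list(derived):
                if predicate == premise and (conclusion, entity) not in derived:
                    derived.add((conclusion, entity))
                    changed = True
    return derived


facts = {("made_of_iron", "nails")}          # Nails are made of iron.
rules = [
    ("made_of_iron", "metal"),               # If something is made of iron then it is metal.
    ("metal", "conducts_electricity"),       # Metals conduct electricity.
]
# The insulator rule is left out: it involves negation and is not needed for this query.

derived = forward_chain(facts, rules)
print(("conducts_electricity", "nails") in derived)  # True, via a two-step chain
```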

Train and You’ll Miss It: Interactive Model Iteration with Weak Supervision and Pre-Trained Embeddings

REPL-based ML Development

With Pretrained-Embeddings You Get REPL Speed

Hierarchical sequence labeling for extracting BEL statements from biomedical literature

We first map the sentence-level BEL statements in the BC-V training corpus to the corresponding text segments, thus generating hierarchically tagged training instances.

F-measure improvement of 31.6%

Unlike the previous relation extraction task, where a relationship is purely between two entities, the BEL task aims to discover the hierarchical relations between biomedical entities, meaning that the relationship (increases or decreases) can hold among multiple entities and complex biomedical functions (such as complex() or tloc()) can also be involved.

Fig. 4

Snorkel: rapid training data creation with weak supervision

Three principles that have shaped Snorkel’s design:

NERO: A Neural Rule Grounding Framework for Label-Efficient Relation Extraction

They perform exact string matching on the unlabeled dataset, and a sentence is either matched or not matched by a rule.

Soft Matching (Based On Similarity)

**We first extract the frequent patterns from large raw corpora, then ask human annotators to assign labels to the patterns.**


TODO: amazing paper with many new ideas on [[program synthesis]]

On Reducing the Program Search Space

Will This Idea Spread Beyond Academia? Understanding Knowledge Transfer of Scientific Concepts across Text Corpora

The paper follows the trajectory of 450,000 new concepts extracted from ~40M papers (Web of Science) to find out why only some end up in inventions or drug trials

Knowledge transfer defined as concepts that end up in patents (~3%) or clinical trials (~11%)

Concepts: curricula, tools, programs, ideas, theories, substances, methods, processes and proposition -- the basic units of scientific discovery and advance

Their models, based on these features, have high accuracy; a good starting point for understanding translational research

Literature using Natural Language Processing](https://arxiv.org/pdf/2101.01508.pdf)

Overall, the framework presented here can be a generic and