Table: Scientific NLP and Reasoning Systems Readings
A sporadically updated table of papers I come across.
Similar: Hacking on Scientific Text at Scale and Readings in Comp. Linguistics
Shoutout to my friend Ric da Silva for creating ivywrite.io, which can compile messy Roam Research pages into something publishable like this.
paper | excerpt | released |
---|---|---|
MedType: Improving Medical Entity Linking with Semantic Type Prediction | Code and [[dataset]]. By typing entities into a semantic type you can reduce candidate labels by 10x, which improves [[entity linking]]. Reported 6-10 point AUC increase for [[entity disambiguation]]. UMLS has 127 types; they grouped them into 24 semantic groups (see Table 2). Authors create WikiMed (650k mentions normalized to UMLS) and the PubMedDS [[dataset]] and pre-train MedType with them before fine-tuning on medical data (type-based candidate pruning sketched after the table) | 05/20 |
High-Precision Extraction of Emerging Concepts from Scientific Literature | Heuristic algorithm. Intuition: new terms (e.g. Self-Taught Hashing, gradient penalty) are introduced by a single paper that gets cited a lot later. Problem before this paper: extracted phrases were often too general or too narrow. AllenAI's ForeCite achieves a double-digit precision improvement on extracting new concepts from papers (toy scoring sketch after the table) | |
ExpBERT: Representation Engineering with Natural Language Explanations | Suppose we want to specify the inductive bias that married couples typically go on honeymoons for the task of extracting pairs of spouses from text. "In this paper, we allow model developers to specify these types of inductive biases as natural language explanations. Across three relation extraction tasks, our method, ExpBERT, matches a BERT baseline but with 3-20x less labeled data and improves on the baseline by 3-10 F1 points with the same amount of labeled data" | 05/20 |
| | "In this paper, we propose a novel Neural Execution Tree (NExT) framework to augment training data for text classification using NL explanations. After transforming NL explanations into executable logical forms by [[semantic parsing]], NExT generalizes different types of actions specified by the logical forms for labeling data instances, which substantially increases the coverage of each NL explanation" | 11/19 |
| | Converting a compound structure into chemical information applicable for machine learning tasks requires multilayer computational processing, from chemical graph retrieval and descriptor generation to fingerprint construction and similarity analysis | xx/18 |
| | A dependency forest encodes all valid dependency trees of a sentence into a 3D space so that syntax parsing is differentiable. This allows a parser to be merged into a relation extraction model, so the parser can be jointly updated based on end-task loss | |
| | TL;DR: more efficient text encoders emerge from an attempt (failed?) to do "[[GAN]]s for text". Going binary improves sample efficiency: instead of predicting a [MASKED] token, the discriminator only has to predict whether a viable swap was sampled from a generator (painting -> mask -> car). The generator is thrown away afterwards. 30x training efficiency over BERT (replaced-token detection sketched after the table) | |
| | How do you populate your knowledge base without predefined rules or training data? Pick a predefined binary relationship type, do an 80/20 simplification of the sentence, and find the simplest good patterns through a bit of expert quality control | |
| | Domain experts often face the need to efficiently search for a specific kind of information in a large collection of documents. "In this work, we propose a method that combines exact-match search over symbolic structures with the ability of modern neural models to provide rich semantic representations which generalize over surface forms" | 07/20 |
Generalizing the NLP pipeline through Span-relation Representation | Humans can analyze language in a single format, so machines might as well. SpanRel: a large number of [[NLP]] [[subtask]]s ([[NER]], [[relation extraction]], [[Semantic Role Labeling | SRL]], sentiment analysis) can be represented in a single format: spans, and relations between spans (data-structure sketch after the table) | |
DARE: Data Augmented Relation Extraction | Uses GPT-2 to generate examples for specific relation types. Combines the generated data with gold data and trains BERT relation extraction classifiers on them | |
TeKnow: Better Search for Computer Science | Aspect-based retrieval: search with a task in mind. Fine-grained pre-requisites for knowledge acquisition: showing the [[skill tree]] in [[computer science]] | 07/20 |
| | Top2Vec is an algorithm for [[topic modeling]] and [[semantic search]]. It automatically detects topics present in text and generates jointly embedded topic, document and word vectors. Crackpot idea: maybe it would be good to embed topics, documents, words, entity pairs, entities and dependency paths together (aka Improving Biomedical Analogical Retrieval with Embedding of Structural Dependencies). "More informative topics than probabilistic generative models" | 08/20 |
A Novel Cascade Binary Tagging Framework for Relational Triple Extraction CODE | Addresses the overlapping triple problem [relations sharing the same entities], i.e. EntityPairOverlap (EPO) and SingleEntityOverlap (SEO). Method: instead of discrete labels, relations are reframed as functions that map subjects to objects in a sentence, which naturally handles the overlapping problem. This binary tagging model can sit on top of a Transformer encoder (decoding sketch after the table) | 06/20 |
AutoMATES: Automated Model Assembly from Text, Equations, | Besides natural language and equations, models must be implemented in code. DEMO needs some UX work. A unified representation for code, text, equations. Concept parsing challenge. Architecture diagram | 01/20 |
Relation of the Relations: A New Paradigm of the Relation Extraction Problem | Only skimmed the paper, but the idea of modeling the interdependency among multiple relations in the same discourse felt right. Similar to the [[Shortest Dependency Path]] [[clustering]] approach in A global network of biomedical relationships derived from text. "Accordingly, we develop a data-driven approach that does not require hand-crafted rules but learns by itself the RoR, using Graph Neural Networks and a relation matrix transformer" | 06/20 |
Claim Extraction in Biomedical Publications Using Deep Discourse Model and Transfer Learning CODE | Includes a [[dataset]] for discourse and claims prediction, which I understand as a [[span prediction]] task, along with an [[annotation tool]]. Fun [[statistic]]: 55% of their annotated claims were in the last sentence of the abstract [biomedicine]. Use cases: find similar claims using different methods (literature review); find related claims (-> claim reliability for [[peer review]]) | 07/19 |
| | See also: INDRA. "We annotate results in literature, which we convert into logical constraints on causal structure; with these constraints, we find consistent causal graphs using a state-of-the-art causal discovery algorithm based on [[answer set programming]]." Experiment selection: because these causal graphs show which relations are underdetermined, biologists can use this pipeline to select their next experiment | xx/17 |
| | I mentioned QA-SRL in Hacking on Scientific Text at Scale. "Applying our proposed controlled crowdsourcing protocol to QA-SRL successfully attains truly scalable high-quality annotation by laymen" | 11/19 |
| | [[dataset]] at 100x the size of ConceptNet. Method: use linguistic knowledge ([[Dependency Tree]]s) to extract [[common sense]] knowledge | 05/20 |
Language (Re)modelling: Towards Embodied Language Understanding | Claim: natural language is inherently executable (like programming languages), driven by mental simulation and metaphoric mappings over hierarchical compositions of structures and schemata learned through embodied interaction | 05/20 |
HDAG-Explorer: A System for Hierarchical DAG | [[interface]] innovation: summarize a Hierarchical Directed Acyclic Graph [H-DAG] along highly important and diverse features. Includes summarized visualization and interactive exploration | xx/20 |
Discrete Word Embedding for Logical Natural Language Understanding | A discrete, [[interpretable]] binary word [[embedding]] that preserves vector arithmetic and can handle ambiguity like its continuous sibling. "Our model represents each word as a [[symbolic]] action that modifies the binary (i.e. propositional) recurrent states through effects." The framework can extract the effects applied by each word as explicit logical formulae [??]. New: a) robust to redundancy [the red red apple - would skew a continuous word vector]; b) [[interoperability]] with classical planners | 08/20 |
| | High-bandwidth supervision: the annotator already spent the effort of reading and understanding the example, so let's get more information out of it - increase [[annotation bandwidth]]. BabbleLabble: in addition to the binary label, annotators provide explanations, which are parsed ([[semantic parsing]]) into logical forms that act as labeling functions. #[[data programming]] (labeling-function sketch after the table) | 05/18 |
| | Designing better semantic representations. Existing approaches such as semantic role labeling (SRL) and abstract meaning representation (AMR) still have features related to the peculiarities of the particular language | 08/19 |
Improving Biomedical Analogical Retrieval with Embedding of Structural Dependencies | From the authors of A global network of biomedical relationships derived from text. Problem: neural embeddings are trained on co-occurrence events without making use of [[syntax]], but relational structure from sentence dependencies is essential for analogical reasoning (cognitive theory). [ESD: Encoding Structural Dependencies] Entities, pairs of entities and [[dependency]] paths are all represented in the same [[vector space]]. The well-known [King - Man + Woman = Queen] arithmetic only seems to work because the target terms are similar in the vector space (toy analogy-arithmetic sketch after the table) | 07/20 |
In Layman's Terms: Semi-Open Relation Extraction from Scientific Texts | Trade-off is one of many complex relationship types | 05/20 |
| | Idea: combine a masked [[language model]] with a [[differentiable]] retriever. Retrieve-then-predict: if a retrieval improves [[language model]] [[perplexity]], then reward it. Advance in indexing an entire corpus. Don't understand the rest of the paper | 02/20 |
| | The NMT (neural machine translation) metaphor is not adequate to describe patching buggy code into working code | 09/20 |
| | [[data programming]] is cool but writing [[labeling function]]s is not easy. Can we reuse them for other edges or nodes? | |
Extraction of causal structure from procedural text for discourse representations | Surface forms link to force dynamics [linguistic theory]. Use that insight to extract causal chains and process graphs from text | |
| | DEMO. Can transformers be trained to reason (or emulate reasoning) over rules expressed in language (e.g. "Metals conduct electricity.")? RuleTakers are trained on datasets of synthetic rule bases plus derived conclusions, provided here. The resulting models provide the first demonstration that this kind of soft reasoning over language is indeed learnable | |
| | REPL-based ML development: with pretrained embeddings you get REPL speed | |
Hierarchical sequence labeling for extracting BEL statements from biomedical literature | We first map the sentence-level BEL statements in the BC-V training corpus to the corresponding text segments, thus generating hierarchically tagged training instances. F-measure improvement of 31.6%. Unlike the previous relation extraction task, where a relationship is purely between two entities, the BEL task aims to discover hierarchical relations between biomedical entities: the relationship (increases or decreases) can hold among multiple entities, and complex biomedical functions (such as complex() or tloc()) can also be involved | |
| | Three principles that have shaped Snorkel's design: (1) bring all sources of supervision to bear, (2) training data as the interface to ML, (3) supervision as interactive programming | |
NERO: A Neural Rule Grounding Framework for Label-Efficient Relation Extraction | Hard matching: they perform exact string matching on the unlabeled dataset, and a sentence is either matched or not matched by a rule. Soft matching (based on similarity) extends this. "We first extract the frequent patterns from large raw corpora, then ask human annotators to assign labels to the patterns." (soft-matching sketch after the table) | |
| | TODO: amazing paper with many new ideas on [[program synthesis]]. On Reducing the Program Search Space | |
| | Paper follows the trajectory of 450,000 new concepts extracted from ~40M papers (Web of Science) to find why only some end up in inventions or drug trials. Knowledge transfer is defined as concepts that end up in patents (~3%) or clinical trials (~11%). Concepts: curricula, tools, programs, ideas, theories, substances, methods, processes and propositions -- the basic units of scientific discovery and advance. Their models, based on these features, have high accuracy; a good starting point for understanding translational research | |
[... Literature using Natural Language Processing](https://arxiv.org/pdf/2101.01508.pdf) | "Overall, the framework presented here can be a generic and ..." | |
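
The sketches referenced in the rows above follow. All are toy Python illustrations of the papers' core tricks, with made-up data and hard-coded stand-ins for the learned components, not the authors' implementations. First, the MedType row: pruning entity-linking candidates by a predicted semantic type. The concepts, CUIs and groups below are assumptions for illustration; the real system predicts one of 24 UMLS semantic groups with a neural model.

```python
# Toy illustration of MedType-style candidate pruning: predict a coarse
# semantic type for a mention, then keep only candidates of that type.
# All concepts, IDs and groups below are made up for illustration.

CANDIDATES = {
    "cold": [
        ("C0009443", "Common Cold",      "Disorder"),
        ("C0009264", "Cold Temperature", "Phenomenon"),
        ("C0234192", "Cold Sensation",   "Finding"),
    ],
}

def predict_semantic_group(mention: str, context: str) -> str:
    """Stand-in for MedType's neural type predictor."""
    # Hypothetical rule: clinical contexts suggest a Disorder reading.
    return "Disorder" if "diagnosed" in context else "Phenomenon"

def prune_candidates(mention: str, context: str):
    group = predict_semantic_group(mention, context)
    return [c for c in CANDIDATES[mention] if c[2] == group]

print(prune_candidates("cold", "the patient was diagnosed with a cold"))
# -> [('C0009443', 'Common Cold', 'Disorder')]  (3 candidates reduced to 1)
```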
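
For the ForeCite row, the intuition (a term is likely a new concept if one early paper introduces it and later papers using the term cite that paper) can be caricatured in a few lines. The scoring function is my own toy approximation, not the paper's exact formula.

```python
# Toy ForeCite-style scoring: a term scores high when a single early paper
# introduces it and most later papers using the term cite that paper.
from collections import namedtuple

Paper = namedtuple("Paper", "pid year terms cites")

papers = [
    Paper("p1", 2010, {"self-taught hashing"}, set()),
    Paper("p2", 2012, {"self-taught hashing"}, {"p1"}),
    Paper("p3", 2014, {"self-taught hashing"}, {"p1"}),
    Paper("p4", 2014, {"deep learning"}, set()),  # generic term, no single origin
]

def forecite_score(term, papers):
    using = sorted((p for p in papers if term in p.terms), key=lambda p: p.year)
    origin, later = using[0], using[1:]
    if not later:
        return 0.0
    # Fraction of later term-using papers that cite the candidate origin paper.
    return sum(origin.pid in p.cites for p in later) / len(later)

print(forecite_score("self-taught hashing", papers))  # 1.0 -> likely a new concept
```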
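
For the ELECTRA row, a sketch of how replaced-token-detection training targets could be constructed; the uniform random choice below is a crude stand-in for ELECTRA's small generator network, and the vocabulary is made up.

```python
# Build binary targets for a replaced-token-detection discriminator: corrupt
# some tokens with sampled replacements, label each position 0 (original)
# or 1 (replaced). Random sampling stands in for the learned generator.
import random

VOCAB = ["painting", "car", "house", "sold", "bought", "the", "was"]

def corrupt(tokens, mask_prob=0.15, rng=random.Random(0)):
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            corrupted.append(rng.choice([w for w in VOCAB if w != tok]))
            labels.append(1)   # replaced: discriminator should flag it
        else:
            corrupted.append(tok)
            labels.append(0)   # original token
    return corrupted, labels

print(corrupt("the painting was sold".split()))
# Every position gets a 0/1 target, so all tokens carry learning signal,
# not just the ~15% that were masked -- hence the sample-efficiency gain.
```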
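
The SpanRel claim is easiest to see as a data structure: labeled spans over the token sequence plus labeled relations between spans. The labels and example below are hypothetical.

```python
# Minimal span-relation representation: NER, RE and SRL instances
# can all be encoded as labeled spans plus labeled span pairs.
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    start: int   # token index, inclusive
    end: int     # token index, exclusive
    label: str

@dataclass(frozen=True)
class Relation:
    head: Span
    tail: Span
    label: str

tokens = ["Marie", "Curie", "discovered", "polonium"]
marie = Span(0, 2, "PERSON")                         # NER: labeled spans
polonium = Span(3, 4, "CHEMICAL")
rel = Relation(marie, polonium, "discovered_by")     # RE: a labeled span pair
pred = Span(2, 3, "PREDICATE")                       # SRL: predicate span ...
arg0 = Relation(pred, marie, "ARG0")                 # ... plus role-labeled arguments
print(rel, arg0, sep="\n")
```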
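
For the cascade binary tagging row, a toy decode: a subject tagger proposes subjects, then a per-relation object tagger maps each subject to its objects, so one entity can appear in several triples. Both taggers here are hard-coded lookups standing in for the learned binary classifiers.

```python
# Cascade binary tagging, caricatured: first tag subjects, then for each
# (subject, relation) run an object tagger. Because each relation acts as a
# *function* from subjects to objects, overlapping triples fall out naturally.

SENTENCE = "Jackie Chan was born in Hong Kong"

def tag_subjects(sentence):
    return ["Jackie Chan"]          # stand-in for the learned subject tagger

def tag_objects(sentence, subject, relation):
    table = {
        ("Jackie Chan", "born_in"):  ["Hong Kong"],
        ("Jackie Chan", "works_in"): ["Hong Kong"],  # same pair, second relation
    }
    return table.get((subject, relation), [])

triples = [
    (subj, rel, obj)
    for subj in tag_subjects(SENTENCE)
    for rel in ["born_in", "works_in"]
    for obj in tag_objects(SENTENCE, subj, rel)
]
print(triples)  # both EntityPairOverlap triples recovered from one subject
```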
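
For the BabbleLabble row, what "parsing an explanation into a labeling function" looks like at its simplest. Real BabbleLabble runs a semantic parser over a grammar of predicates; this sketch hand-compiles one explanation, and the example is made up.

```python
# One hand-compiled example of BabbleLabble's pipeline: the annotator's
# natural-language explanation becomes an executable labeling function.

# Annotator says: "True, because the word 'wife' appears between person 1
# and person 2."
def lf_wife_between(example) -> int:
    x, y, sentence = example
    if x not in sentence or y not in sentence:
        return 0
    between = sentence.split(x)[1].split(y)[0]
    return 1 if "wife" in between else 0   # 1 = spouses, 0 = abstain

example = ("Barack", "Michelle", "Barack and his wife Michelle travelled home")
print(lf_wife_between(example))  # 1 -- a noisy label that feeds data programming
```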
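
For the ESD row, the caveat about [King - Man + Woman = Queen] can be checked directly with toy vectors; everything here is hand-made, not the paper's embeddings.

```python
# Analogy arithmetic with tiny hand-made vectors: compute
# v = king - man + woman and look up its nearest neighbour by cosine.
import math

vecs = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
}

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: math.sqrt(sum(x * x for x in v))
    return dot / (norm(a) * norm(b))

target = [k - m + w for k, m, w in zip(vecs["king"], vecs["man"], vecs["woman"])]
# Standard trick: exclude the three query words from the candidates. Without
# this exclusion "king" itself often wins -- the caveat the row alludes to:
# the arithmetic mostly works when the target is already nearby.
best = max((w for w in vecs if w not in {"king", "man", "woman"}),
           key=lambda w: cos(target, vecs[w]))
print(best)  # queen
```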
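
Finally, for the NERO row, a crude version of the hard-vs-soft matching contrast. NERO learns a neural matching module; here Jaccard token overlap stands in for it, just to show where soft matching gains coverage. The rule and sentences are invented.

```python
# Hard matching applies a rule only on exact string match; soft matching
# scores similarity and applies the rule above a threshold.

RULE = ("was born in", "born_in")   # (surface pattern, relation label)

def hard_match(sentence, pattern):
    return pattern in sentence

def soft_match(sentence, pattern, threshold=0.3):
    s, p = set(sentence.split()), set(pattern.split())
    return len(s & p) / len(s | p) >= threshold   # Jaccard overlap as similarity

pattern, label = RULE
sent = "Chan born in Hong Kong"
print(hard_match(sent, pattern))   # False: no exact "was born in" substring
print(soft_match(sent, pattern))   # True: enough overlap to fire the rule anyway
```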