Count occurrences of given character per cell

Google Sheets

Google Sheets Problem Overview


Question

For example if I wanted to count the number of Ns in a column of strings how can I do this in Google Spreadsheets at a per cell basis (i.e. a formula that points at one cell at a time that I can drag down)?

String/Count table

Background

I'm having to decide a threshold -min-overlap <integer> for a program called TOMTOM** which compares similarity between PWMs*** of small DNA motifs****, N is a regular expression for any linear combination of the letters A, C, G and T. It would be nice if I could get an idea of the distribution of non-N lengths of my DNA motifs to help inform me of a proper -min-overlap <integer> value for TOMTOM.

And here are some real examples:

enter image description here

** TOMTOM is a tool for comparing a DNA motif to a database of known motifs. See here for more info.

*** PWM stands for Position Weight Matrix:

  • According to Wiki: A position weight matrix (PWM), also known as a position-specific weight matrix (PSWM) or position-specific scoring matrix (PSSM), is a commonly used representation of motifs (patterns) in biological sequences.
  • According to this paper, it could be defined as:

> Position weight matrix (PWM) or PWM‐like models are widely used to > represent DNA‐binding preferences of proteins (Stormo, 2000). In these > models, a matrix is used to represent the TF‐binding site (TFBS), with > each element representing the contribution to the overall binding > affinity from a nucleotide at the corresponding position. An inherent > assumption of traditional PWM models is position independence; that > is, the contribution of different nucleotide positions within a TFBS > to the overall binding affinity is assumed to be additive. Although > this approximation is broadly valid, nevertheless, it does not hold > for several proteins (Man & Stormo, 2001; Bulyk et al, 2002). To > improve quantitative modeling, PWM models have been extended to > include additional parameters, such as k‐mer features, to account for > position dependencies within TFBSs (Zhao et al, 2012; Mathelier & > Wasserman, 2013; Mordelet et al, 2013; Weirauch et al, 2013; Riley et > al, 2015). Interdependencies between nucleotide positions have a > structural origin. For example, stacking interactions between adjacent > base pairs form the local three‐dimensional DNA structure. TFs have > preferences for sequence‐dependent DNA conformation, which we call DNA > shape readout (Rohs et al, 2009, 2010).

OR, more contemporarily:

> Based on this rationale, an alternative approach to augment > traditional PWM models is the inclusion of DNA structural features. > Models of TF–DNA binding specificity incorporating these DNA shape > features achieved comparable performance levels to models > incorporating higher‐order k‐mer features, while requiring a much > smaller number of parameters (Zhou et al, 2015). We previously > revealed the importance of DNA shape readout for members of the basic > helix‐loop‐helix (bHLH) and homeodomain TF families (Dror et al, 2014; > Yang et al, 2014; Zhou et al, 2015). We were also able, for Hox TFs, > to identify which regions in the TFBSs used DNA shape readout, > demonstrating the power of the approach to reveal mechanistic insights > into TF–DNA recognition (Abe et al, 2015). This capability was > extensively shown for only two protein families, due to the lack of > large‐scale high‐quality TF–DNA binding data. With the recent > abundance of high‐throughput measurements of protein–DNA binding, it > is now possible to dissect the role of DNA shape readout for many TF > families.

**** DNA motif: wiki: In genetics, a sequence motif is a nucleotide or amino-acid sequence pattern that is widespread and has, or is conjectured to have, a biological significance. For proteins, a sequence motif is distinguished from a structural motif, a motif formed by the three-dimensional arrangement of amino acids, which may not be adjacent.

Google Sheets Solutions


Solution 1 - Google Sheets

An alternative for one cell at a time (formula to be copied down):

=len(A2)-len(SUBSTITUTE(A2,"N",""))

Solution 2 - Google Sheets

I don't know if this is gonna help but let's say you have those strings in range A2:A6 and you enter

=ArrayFormula(LEN(REGEXREPLACE(A2:A6, "[^N]", "")))

in B2, that should output the N count for the whole range.

Solution 3 - Google Sheets

=len(A2)-len(SUBSTITUTE(A2,"N",""))

this works, but if you want to find all numbers matching a specific pattern, say, 3. Then:

=len(A2)-len(SUBSTITUTE(A2,"3",""))

Is what you need.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
Questionhello_there_andyView Question on Stackoverflow
Solution 1 - Google SheetspnutsView Answer on Stackoverflow
Solution 2 - Google SheetsJPVView Answer on Stackoverflow
Solution 3 - Google SheetsX.MaView Answer on Stackoverflow